Germline-aware deep learning models and benchmarks for predicting antibody VH–VL pairing

Sara Joubbi; Enrico D’Arco; Giuseppe Maccari; Paolo Milazzo; Alessio Micheli

doi:10.1080/19420862.2025.2570749

. 2025 Oct 17;17(1):2570749. doi: 10.1080/19420862.2025.2570749

Germline-aware deep learning models and benchmarks for predicting antibody VH–VL pairing

Sara Joubbi ^a,^b,^✉,^*, Enrico D’Arco ^a,^*, Giuseppe Maccari ^b, Paolo Milazzo ^a, Alessio Micheli ^a

PMCID: PMC12536629 PMID: 41104651

ABSTRACT

Variable heavy (VH) and variable light (VL) chain pairing is a critical determinant of antibody diversity, stability, and antigen-binding specificity. Identifying productive VH – VL combinations experimentally is labor-intensive and costly, motivating the development of computational methods that can more efficiently predict compatible heavy – light chain pairs. In this work, we present a comprehensive framework that includes a new benchmark dataset and three deep learning models, each trained with a different negative sampling strategy: random pairing, V-gene mismatching, and full V(D)J germline mismatching. Our dataset includes natural pairs and these three types of synthetic negatives to simulate increasingly realistic biological constraints. Furthermore, we present a lightweight yet highly effective BERT-based model that achieves over 90% accuracy in discriminating natural from synthetic VH – VL pairs. Through extensive evaluation, we demonstrate that V(D)J-informed negative sampling significantly improves model generalization and biological interpretability. By providing reproducible baselines and a biologically grounded benchmark, this work lays the foundation for future development of efficient computational tools in antibody engineering.

KEYWORDS: Antibody language models, antibody pairing, benchmark, deep learning, germline

Introduction

Antibodies, or immunoglobulins, are Y-shaped proteins produced by B cells that serve as key effectors of the adaptive immune response. By binding specifically and tightly to antigens, such as pathogens or foreign molecules, they facilitate immune recognition and clearance. This high specificity is conferred by the antibody’s variable regions, which are capable of recognizing an immense diversity of molecular targets.¹

The molecular basis for this diversity lies in V(D)J recombination, a somatic DNA rearrangement process that occurs during B cell development.^2–5 In this process, variable (V), diversity (D), and joining (J) gene segments, encoded in the germline genome, are assembled to form functional immunoglobulin genes, as shown in Figure 1. The light chain variable region is generated by recombining one V and one J segment, while the heavy chain involves the sequential joining of V, D, and J segments.⁴ Further diversity is introduced at the junctions of these segments through nucleotide addition and deletion mediated by terminal deoxynucleotidyl transferase and exonuclease activity.^6,7 Together, these mechanisms generate a vast and diverse antibody repertoire, equipping the immune system with the ability to recognize a wide array of antigens.

Figure 1. — V(D)J recombination joins V, (D), and J gene segments to assemble the variable (Fv) regions, comprising VH and VL, that confer antigen specificity, while downstream constant (C) segments encode the Fc framework.

Functional antibodies are formed through the non-covalent pairing of variable heavy (VH) and variable light (VL) chains, which together make up the antigen-binding fragment variable (Fv) region.⁸ The structure and function of this region, critical for antigen recognition, are shaped by inter-chain interactions that influence binding specificity, affinity, and stability.⁹ While some VH – VL pairings are more frequently seen in natural repertoires, the conventional view is that heavy and light chains pair largely at random to form functional antibodies.^10–12 However, recent structural and computational studies challenge this view, showing that VH – VL interface geometry is influenced by the choice of germline V and J gene segments.^13–15 Moreover, experimental data demonstrate that specific V(D)J combinations are critical for productive antibody assembly,¹⁶ and that non-native pairings can disrupt structural compatibility and antigen binding.^17,18

To exploit this pairing diversity for therapeutic engineering, experimental strategies such as VL-shuffling recombine heavy chains with alternative light chains to identify high-performing variants.⁹ While effective, these methods require extensive cloning, expression, and screening, making them labor-intensive and time-consuming. Computational approaches offer a promising alternative by predicting VH – VL pairing compatibility directly from sequence. A major challenge in this setting is the lack of publicly available datasets containing confirmed non-pairing VH/VL sequences, which makes supervised model training and evaluation difficult.

Recent advances in deep learning (DL) have transformed protein bioinformatics, exemplified by the success of AlphaFold in structure prediction,^19,20 and by transformer-based language models such as ESM,²¹ ProtTrans,²² and ProteinBERT,²³ which have achieved state-of-the-art performance in diverse sequence analysis tasks.²⁴ Inspired by these developments, antibody-specific language models have emerged, including AntiBERTa,²⁵ AbLang,²⁶ BALM,²⁷ IgBERT,²⁸ and PARA.²⁹ Some of these models were trained or fine-tuned on paired VH – VL sequences to better capture inter-chain dependencies. For example, BALM-paired³⁰ was trained exclusively on naturally paired antibodies, while IgBERT and IgT5²⁸ were initially trained on unpaired sequences and subsequently fine-tuned using paired data from the Observed Antibody Space (OAS).^31,32

Although these models have shown strong performance in tasks such as antibody sequence recovery, structure prediction, and expression level estimation, only a few have directly tackled VH – VL pairing as a predictive task. One such model is PARA,²⁹ which approached pairing classification by contrasting native pairs against mismatches generated via similarity-based shuffling, achieving high AUC-ROC scores. Another relevant model is p-IgGen,³³ a generative VH – VL model trained on paired antibodies. While p-IgGen was not explicitly designed for classification, its ability to assign higher likelihoods to true VH – VL pairs, compared to randomly paired alternatives, indicates that it captures signals relevant to pairing compatibility. In parallel, alternative strategies have been proposed outside the transformer paradigm. HuMatch³⁴ uses a convolutional neural network (CNN) trained on human antibody sequences annotated with V-germline labels. Although developed primarily for antibody humanization, HuMatch includes a pairing classification component. ImmunoMatch³⁵ was built upon the AntiBERTa2 language model, which is fine-tuned using VH – VL pairs derived from single human B cells to discriminate between naturally cognate and randomly mismatched heavy – light chain combinations. SynPair³⁶ is a new model based on contrastive learning and treats the pairing problem as dense – retrieval problem. The model achieves state-of-the-art prediction for VH-VL pairing, outperforming ImmunoMatch.

Despite growing interest in VH – VL pairing prediction, there is currently no standardized dataset or evaluation framework to guide model development or assess performance in a consistent and biologically meaningful way. Existing methods often rely on ad hoc mismatching strategies and lack rigorous comparisons, making it difficult to interpret results across studies. Recent work has shown that dataset composition – particularly the definition of negative examples – plays a critical role in shaping what models learn and how well they generalize. In the context of antibody – antigen binding, Ursu et al.³⁷ demonstrated that negative sample selection not only influences predictive accuracy but also determines whether models recover biologically meaningful rules. These insights suggest that careful design of negative datasets is equally important for VH – VL pairing prediction, where ad hoc strategies risk introducing biases and limiting interpretability.

To address this gap, we introduce a dedicated benchmark dataset designed to test and compare deep learning models on VH – VL pairing prediction. We formulate the problem as a binary classification task and construct synthetic negative examples using three biologically motivated strategies: random recombination, V-gene mismatch, and full V(D)J germline mismatch. This approach yields structured and interpretable mismatched datasets that more closely reflect the biological constraints on antibody pairing.

Using this dataset, we train three deep learning models, each corresponding to a different negative sampling strategy, to serve as baselines for future benchmarking. These models are based on a simple yet effective architecture that combines IgBERT-derived embeddings with a multi-layer perceptron (MLP) classifier. By evaluating performance across diverse test splits (random, v-gene, and germlines), we provide a clear and reproducible framework for comparing pairing models under controlled conditions.

Our key contributions are as follows:

Benchmark Dataset: We release a benchmark dataset for VH – VL pairing classification, including positive examples from naturally paired antibodies and negative examples constructed via three complementary sampling strategies. Code for generating new mismatched samples is also provided.
Reference Models: We train and evaluate three IgBERT-based classifiers, each using a distinct negative sampling method, to serve as standardized baselines for future comparison.
Model Development: We present a lightweight yet effective DL framework that achieves $>$ 90% accuracy in distinguishing natural from synthetic VH – VL pairs. Notably, we tested the inter-chain predicted TM-score (ipTM) of AlphaFold3, a metric commonly used to assess interface quality, in this classification task, indicating that a dedicated method is needed.
Biological Insights: We assess how germline identity influences model performance, confirming that full V(D)J-based mismatching yields the most biologically discriminative features for accurate pairing prediction.
Developability Correlations: We compared different VH – VL pairing models on developability properties and observed preliminary correlations with experimental thermostability, suggesting that pairing predictions may hold potential utility in downstream antibody optimization and early-stage developability assessment.

These resources establish a much-needed foundation for systematic evaluation of antibody pairing models, enabling reproducibility, biological interpretability, and fair comparisons between emerging deep learning approaches.

Results

Building a large-scale paired antibody dataset with negative sampling

We created a large-scale dataset of antibody variable region pairs from the Observed Antibody Space (OAS) database, starting with 1,954,079 VH/VL pairs. After germline annotation (VJ for light chains and VDJ for heavy chains), pre-processing, and sequence clustering, we retained 1,357,155 high-quality native VH/VL pairs, consisting of 1,348,625 unique heavy chains and 595,539 unique light chains. In total, we identified 11,976 unique germline combinations. To facilitate evaluation under realistic biological scenarios, where test antibodies may originate from rare or unseen germline configurations, we partitioned the data based on germline origin into 716,325 training, 271,233 validation, and 369,597 test pairs.

To train models for predicting VH-VL compatibility, we augmented the dataset with mismatched (i.e., non-native) VH/VL pairs using two negative sampling strategies. In the first strategy, called the randomly paired dataset, we shuffled VH and VL sequences so that no synthetic pair matched any naturally observed one. The second strategy was inspired by the hypothesis of Jayaram et al.¹³ that germline origin influences VH – VL pairing. In this approach, called the germline-paired dataset, we generated synthetic pairs from germline combinations that were statistically unlikely based on the observed data. Specifically, we sampled absent germline pairs from the dataset, and for each selected combination, we independently sampled a VH and a VL sequence from the respective germline pools. To ensure diversity and avoid repeated or overrepresented pairs, we defined a probability distribution over germline pairs proportional to the product of available sequences for each germline. A smoothing parameter was applied to this distribution to reduce the skew toward high-frequency pairs.

We considered this procedure under two germline encoding schemes:

V-germline: using only the V-segment (e.g., H: VH1, L: KV1),
Full germline: using the full set of V, D, and J segments (e.g., H: VH1-VD2-VJ3, L: KV1-KJ2).

Figure 2 shows the distribution of germline co-occurrences in the training set and the corresponding synthetic pair distributions for the full germline and V-germline settings, respectively.

In all negative sampling strategies, the number of synthetic pairs was matched to the number of native (positive) pairs to maintain dataset balance. For further details on the negative sampling algorithm and dataset composition, see Materials and Methods.

Latent space visualization reveals separation in germline-based pairings

Our model is based on IgBert embeddings and a classification head. For each of the three datasets (randomly paired sequences, germline paired sequences, and germline paired using V-only germline sequences), a sample of 4,096 elements containing both positive (paired) and negative (synthetically paired) instances is drawn. The encoder, IgBert, processes these pairs to generate VH/VL pair embeddings of size 1,024, which are subsequently reduced to two-dimensional vectors using the t-SNE algorithm (Figure 3a–c), and in parallel with UMAP (Figure 3d–f). The t-SNE scatter plots reveal that randomly paired sequences significantly overlap with naive paired ones, indicating that models trained on the randomly paired dataset may struggle to discern meaningful patterns. In contrast, the germline paired datasets exhibit distinct clusters of paired and mismatched VH/VL pairs in the latent space projection, particularly in the V-only germline paired dataset, where the classes are well separated. UMAP further accentuates this trend, producing near-complete separation in the V-only case, while also revealing that full germline and random mismatching remain the most challenging settings, with substantial overlap between paired and mismatched sequences.

Figure 3. — Top row: t-SNE embeddings of IgBERT features reduced to 2D. Bottom row: UMAP embeddings of the same features. Light blue denotes paired sequences; magenta denotes mismatched sequences. Columns: (a,d) randomly paired dataset; (b,e) germline paired (V-only) dataset; (c,f) germline paired dataset.

Pairwise sequence similarity reveals divergence of synthetic pairings

To assess sequence-level coherence across different VH/VL pairing strategies, we analyzed four data sets: one consisting of native VH/VL pairs (referred to as naive), and three containing mismatched pairs generated using different synthetic strategies (random, germline, and germline-V). We define intra-similarity as the similarity among pairs within the same dataset, and inter-similarity as the similarity between mismatched VH/VL pairs and their corresponding naive VH/VL pair sharing the same heavy chain.

A common set of 1,000 heavy chains was sampled such that each heavy chain was present in all four datasets. From these, we extracted the paired VH and VL sequences and concatenated each pair into a single string. This yielded four sets of concatenated VH/VL sequences.

Intra-similarity was computed for each dataset by evaluating all pairwise combinations of sequences within the same set using the normalized Levenshtein similarity score:

sim (VL 1, VL 2) = 1 - \frac{dist (VL 1, VL 2)}{| VL 1 | + | VL 2 |}

(1)

The results of the intra-similarity analysis are shown in Table 1. Notably, the germline-V dataset exhibits the highest mean similarity, whereas germline displays the lowest, indicating differences in sequence homogeneity introduced by the pairing strategy.

Table 1.

Intra-similarity scores computed within each dataset. STD = standard deviation.

Dataset	Mean	STD	Min	Max
naive	0.62	0.08	0.42	0.98
random	0.62	0.08	0.43	0.97
germline-V	0.67	0.12	0.40	0.99
germline	0.59	0.09	0.40	0.99

Open in a new tab

Next, we calculated inter-similarity by comparing each mismatched pair to the naive pair sharing the same VH sequence, again using Eq. 1. Table 2 presents these results, showing the degree to which synthetic pairings diverge from native configurations. The germline-V dataset shows the greatest divergence from the naive sequences.

Table 2.

Inter-similarity scores between naive and mismatched VH/VL pairs. STD = standard deviation.

Dataset	Mean	STD	Min	Max
random	0.62	0.08	0.41	0.99
germline-V	0.53	0.06	0.39	0.83
germline	0.59	0.07	0.41	0.99

Open in a new tab

Germline overlap in randomly mismatched VH/VL pairs limits class separability

In this section, we investigate the VDJ/VJ germline combinations underlying the VH/VL sequence pairs in the randomly mismatched set derived from the training split of naive pairs. Specifically, we aim to quantify the probability that a given randomly paired VH/VL sequence originates from a VDJ/VJ germline combination that was never observed among the germline combinations of the original naive set.

To this end, we sample 10,000 VH/VL pairs from the randomly mismatched training set. For each VH and VL sequence, we extract the corresponding VDJ and VJ germlines, respectively, thereby obtaining a set of paired VDJ/VJ germline combinations associated with these mismatched VH/VL pairs. We then compare this set to the germline combinations observed in the naive training pairs. Following the approach described in Materials and Methods, we first compile all VDJ and VJ germline combinations used to generate the naive training set. For each observed VDJ/VJ pair, we count the number of naive VH/VL sequences derived from it. Finally, we assess the proportion of germline combinations in the randomly mismatched sample that were not present in the naive training set. We find that only 0.3% of the VDJ/VJ pairs in the mismatched sample correspond to entirely novel germline combinations (i.e., combinations that were never observed among the naive pairs).

This result indicates that the vast majority of germline combinations in the sampled mismatched pairs are already present in the training set. This likely explains the overlap observed in the embedding space shown in Figure 3a, where the random pairing class cannot be separated from the native pairs. In contrast, the other negative pairing strategies are explicitly constructed using germline information, leading to more distinct distributions. Altogether, these findings underscore the pivotal role of germline identity in determining VH/VL pair compatibility and emphasize its significance in devising effective negative pairing strategies.

Model evaluation under diverse negative pairing strategies

Our model architecture consists of an IgBERT encoder followed by a multi-layer perceptron classification head. We trained three variants of this model, each using a different negative sampling strategy: random, germline, and germline-V. We evaluated each model across three hold-out dataset splits (random, v-gene, and germlines) using three classification metrics: Accuracy, F1 Score, and AUC-ROC (Figure 4). For full metric details – containing precision, Recall, Accuracy, F1, Matthews Correlation Coefficient (MCC), AUC-ROC, and AUC-PR – refer to Supplementary Table 1.

Figure 4. — Heatmaps showing the performance of three models (random, V germline, and VDJ germline) across three dataset splits (random, v-gene, and germlines). Each cell displays the score achieved by a given model on a given split, with color intensity reflecting relative performance.

The VDJ germline model consistently achieved high performance across all datasets and metrics, with values exceeding 0.9 for Accuracy, F1, and AUC score on the v-gene and germlines splits. This reflects strong generalization and robust predictive capacity when leveraging full V(D)J germline information. However, the model performs less effectively on the random dataset, which can be attributed to the lack of unseen VDJ/VJ combinations in the mismatched pairs.

The V germline model also performed well, particularly on the v-gene split, achieving near-perfect classification scores (F1 = 0.98 and AUC-ROC = 1.0). However, its performance decreased significantly on the random and germlines splits, suggesting reduced generalization when only partial germline information is used during training.

The Random model, which uses no germline-aware partitioning, exhibited the weakest performance overall. Interestingly, its performance improves on the VDJ and V germline datasets compared to the random split, likely due to the better separability of the two problems.

These results underscore the importance of germline-aware dataset construction. Models trained and evaluated using V(D)J-consistent partitions yield more biologically grounded and generalizable predictions. The performance drop of the Random model on controlled splits further emphasizes the risks of data leakage and inflated performance in germline-agnostic benchmarks.

VDJ influence on the final output

To assess whether the VDJ model outputs depend on the heavy-chain D gene, we analyzed the full dataset of naturally paired antibodies, where D identity is biologically meaningful and not confounded by heavy – light mismatches. For each sequence, we defined VJ = HV – HJ and constructed a VJ×D matrix. For every (HD, VJ) cell, we computed (i) the number of sequences and (ii) the mean model probability (Figure 5). To account for imbalance, counts are min – max normalized to [0,1] in the top panel. The middle panel reports mean probabilities on their original scale, while the bottom panel applies a within-VJ (column-wise) min – max normalization, rescaling each VJ column to [0,1] to highlight the relative contribution of D in that specific VJ context.

Figure 5. — Top: counts per (HD, VJ) cell (min–max normalized). Middle: mean predicted probability per (HD, VJ); vertical banding highlights a dominant VJ effect. Bottom: within-VJ normalized mean probability (0–1 per column), emphasizing the relative impact of D for each VJ.

Because D segments are short and heavily influenced by junctional diversity, D-gene assignments are inherently less reliable than V and J. As expected, the figure (mid panel) shows a dominant V – J effect (vertical banding), with subtle, context-dependent D contributions supported by the data (bottom panel). Overall, the model remains robust – it captures D-specific nuances without becoming overly sensitive to noisy D calls.

Establishing a benchmark framework for antibody pairing prediction

To enable fair and interpretable comparisons between models for antibody chain-pairing prediction, we established a benchmark comprising three dataset splits (random, v-gene, and germlines) and defined two reference points: Topline, representing the best performance for each dataset, and Bottomline, representing the weakest. These references act as empirical performance bounds, offering a practical framework for evaluating new models in a controlled and biologically meaningful context. Any method falling below the Bottomline may be considered ineffective for this task, while approaches nearing or matching the Topline reflect optimal use of the available signal.

Using this framework, we evaluated three state-of-the-art models: p-IgGen, Humatch, and ImmunoMatch. SynPair was not analyzed as its code is currently unavailable. The results are shown in Figure 6.

Figure 6. — Benchmarking antibody pairing models across dataset splits.

p-IgGen performs relatively well on the random split, nearly matching Humatch in F1 score. Still, its performance declines on the more challenging v-gene and germlines splits, especially in terms of AUC-ROC and Accuracy. Humatch and ImmunoMatch, on the other hand, perform consistently across splits, achieving strong AUC-ROC scores, particularly on v-gene, but fall short of Topline performance in F1 Score and Accuracy, highlighting limitations in binary classification calibration.

Importantly, the Topline consistently dominates across all metrics and splits, underscoring the headroom available for future models. Meanwhile, Bottomline performs competitively in specific cases, especially on v-gene, reflecting the simplicity of some pairing signals in that split. This benchmark and evaluation framework supports robust, interpretable, and reproducible comparisons, providing clear targets for improvement in antibody pairing prediction. For detailed metrics, see Supplementary Table 1.

Assessing generalization and competitiveness on the PARA task

Our evaluation procedure involves three model variants, each tested across three distinct datasets using binary classification. However, due to the lack of a standardized benchmark for this task, fair comparison with existing state-of-the-art methods is challenging. To address this, we additionally implemented a separate evaluation inspired by the PARA framework, which frames VH – VL pairing as a ranking task based on sequence similarity. Specifically, we constructed test triplets $(VH, V L_{1}, V L_{2})$ , where $V L_{1}$ is the known binding partner and $VL 2$ is a mismatched chain with low sequence similarity, as described in Materials and Methods. The model must assign a higher pairing score to $(VH, V L_{1})$ than to $(VH, V L_{2})$ . While this comparison is inherently indirect, as the dataset and classification weights used by PARA are not publicly available, we align with its evaluation design to provide a reasonable point of reference.

Figure 7 shows model performance across Accuracy, F1 Score, and AUC-ROC (see Supplementary Table 1 for detailed metrics). Our VDJ germline model achieves AUC-ROC scores that closely approach the reference value for PARA (0.82), indicating competitive performance in external benchmarks. As the Random model also performs well, particularly in Accuracy and F1 Score, it suggests that PARA’s negative construction may align more closely with random mismatching than with biologically informed strategies. This hypothesis is further supported by the strong performance of ImmunoMatch, which is trained on randomly mismatched VH – VL pairs. Although the PARA benchmark does not constitute a definitive ground truth, these results reinforce the robustness of our method and highlight the value of germline-aware training for generalizable VH – VL pairing prediction.

Germline-based models achieve strong performance on a 7.2M-sequence dataset

Dudzic et al. recently presented PairedAbNGS,¹⁵ a comprehensive dataset of natively paired heavy – light antibody chains compiled from 58 single-cell studies ( $\sim 14.4 M$ productive chains; $\sim 7.2 M$ paired sequences; $\sim 3.7 M$ unique amino-acid pairs). Alongside the resource, the authors analyzed germline pairing patterns and conserved inter-chain contacts. PairedAbNGS complements OAS by expanding the diversity of available paired data, enriching germline coverage and sequence variability, and providing a valuable resource for antibody engineering and machine learning.

Since all models were trained OAS corpus, we further assessed their performance on PairedAbNGS, used as an external benchmark. To ensure a fair evaluation, we removed all sequences overlapping with our training, validation, or test splits. The results, summarized in Figure 8, highlight clear performance differences across approaches. In particular, germline-based models (V and VDJ) consistently achieved the highest accuracy, closely followed by Humatch. These findings confirm the robustness of germline information as a key determinant of pairing compatibility, even when evaluated on a large, independent dataset. However, with this dataset, we are testing the model to determine if the sequences are paired, but we cannot test the data on the negative (experimentally validated unpaired) sequences.

Figure 8. — Performance on the PairedAbNGS dataset. Horizontal bar plot showing the accuracy of the different models. Germline-based models outperform state-of-the-art and random models. Results are presented in terms of accuracy, given that the dataset exclusively comprises sequences from the paired class.

AlphaFold3 ipTM does not distinguish correct from incorrect VH/VL pairings

AlphaFold3’s ipTM score has previously been shown to correlate with the probability of binding,³⁸ supporting its potential utility in interface evaluation.³⁹ Based on this premise, we investigated whether ipTM could be used to distinguish correct from incorrect VH/VL pairings, as the ipTM metric reflects the predicted interface quality between two protein chains. Due to the limited number of submissions on the AF3 server, we evaluated 180 randomly selected antibody sequences under three conditions: correctly paired, random synthetic pairs, and germline-based synthetic pairs. In this analysis, we used a single random seed for each antibody pairing, since ipTM scores in VH – VL modeling are generally stable for the different sequences (see Supplementary Figure S1). The mean ipTM values observed were $0.89 \pm 0.02$ , $0.90 \pm 0.01$ , and $0.89 \pm 0.03$ for original, random, and germline, respectively. Statistical comparison using the Mann-Whitney U test revealed no significant differences between the original and control groups. This outcome may be due to the antibodies in the Protein Data Bank that were used for AlphaFold3’s training, which are mostly engineered or affinity-matured,³⁸ which do not represent the entire antibody space. Moreover, germline pairing biases are connected to transcriptional and genomic factors^13–15 that cannot be fully captured by interface quality metrics alone. These findings highlight the need for dedicated models that incorporate antibody-specific pairing determinants beyond general interface quality.

Early signals of thermostability in VH/VL pairing prediction models

Developability refers to a broad spectrum of biophysical and biochemical properties that critically influence an antibody’s manufacturability, safety, and clinical viability. Key attributes include aggregation propensity, solubility, viscosity, thermal and chemical stability, immunogenicity risk, expression yield, and pharmacokinetic behavior.^40–42 Balancing these factors is crucial to ensure that an antibody candidate can be produced at scale, stored under stable conditions, and administered safely. Early identification and optimization of these properties during the discovery phase can help mitigate late-stage attrition, streamline development pipelines, and reduce overall costs. Computational approaches offer promising tools for the early assessment of developability-relevant features.^43–50 For example, IgBERT demonstrated the ability to predict binding and expression properties.

To investigate whether the VH – VL pairing models capture signals related to antibody developability, we examined their relationship with two key properties: expression and thermostability. We used data from Jain et al.,⁴⁰ which includes 137 antibodies with Fab melting temperatures measured by differential scanning fluorimetry and HEK expression titers (mg/L). For each antibody, we computed Pearson’s $r$ and Spearman’s $ρ$ between model-derived pairing scores and the experimental measurements. The analysis was performed across different negative sampling strategies (random, V germline, and VDJ germline) as well as competing models (p-IgGen, Humatch, and ImmunoMatch). The resulting correlations are summarized in Table 3.

Table 3.

Correlation between model-derived VH – VL pairing scores and experimental developability. Reported are Pearson’s $r$ and Spearman’s $ρ$ values for expression (left) and thermostability (right) across different negative sampling strategies and competing models. Bold values indicate correlations with $p < 0.05$ .

	Expression		Thermostability
	Person	Spearman	Person	Spearman
Random	−0.06	−0.13	0.06	0.04
V germline	−0.11	−0.14	−0.15	0.06
VDJ germline	0.14	0.11	0.16	0.18
p-IgGen	0.17	0.15	0.02	0.04
Humatch	−0.15	0.01	0.25	0.24
ImmunoMatch	−0.11	−0.12	−0.03	−0.02

Open in a new tab

The observed correlations with developability properties are modest. Humatch shows the strongest association with thermostability, while p-IgGen displays the clearest trend with expression. However, the effect sizes remain small, indicating that current AI models struggle to capture developability signals from VH – VL pairing alone. These findings should be interpreted as preliminary, as the analysis was conducted on a limited dataset of 137 antibodies.

Discussion and conclusion

Antibody diversity arises from the stochastic processes of V(D)J recombination and somatic hypermutation, generating a vast array of variable heavy (VH) and light (VL) chains. Functional antibodies depend on the non-covalent pairing of VH and VL, which forms the antigen-binding site and influences key properties such as stability, expression, and specificity. Despite its significance in natural immunity and therapeutic design, predicting VH – VL pairing remains underexplored, with no widely accepted benchmarks or large-scale datasets that include experimentally confirmed non-pairings.

A key challenge in this area is the lack of biologically realistic negative examples, leading many prior studies to adopt ad hoc or implausible mismatching strategies. This undermines model generalization and interpretability, limiting the ability to compare methods fairly and assess their utility in real-world antibody engineering. Recent work on antibody-antigen binding has shown that the choice of negative examples is a key determinant of model generalization and biological interpretability.³⁷ This highlights the need for biologically grounded negative sampling in VH – VL pairing prediction to build robust and interpretable models.

To address these limitations, we present a structured and reproducible benchmark for VH – VL pairing prediction, based on large-scale data from naturally paired human antibodies. We define three biologically motivated negative sampling strategies: random pairing, V-gene mismatching, and full V(D)J mismatching, each representing different levels of pairing difficulty. Through comprehensive evaluation, we demonstrate that full V(D)J mismatching provides the most informative negative set, enabling robust classification without introducing excessive noise or trivial separability.

We contribute three IgBERT-based reference models, each trained on a different negative sampling strategy, serving as baselines for future studies. Our benchmark enables reproducible comparisons against recent methods like PARA, p-IgGen, and ImmunoMatch, revealing that performance can vary significantly depending on the pairing challenge and data split used. This highlights the need for germline-aware model evaluation, especially when generalization to unseen germline combinations is crucial. Germline-based models show strong performance on the large PairedAbNGS dataset ( $\sim 7 M$ pairs).

Beyond benchmarking, our framework has broader implications for generative antibody design. The ability to construct realistic, non-pairable examples lays the groundwork for developing or refining generative loss functions that penalize biologically implausible VH – VL combinations. Additionally, the VDJ-based model shows an early, statistically significant signal of association with thermostability ( $ρ \approx 0.18$ ), suggesting a potential link to developability. While preliminary, this finding highlights a promising direction for future work, where VH – VL pairing models could contribute to early-stage developability screening.

Finally, while we focus on human antibodies, the germline-aware data partitioning strategy and flexible sampling procedures can be easily extended to other species, including murine or bovine repertoires. We believe this work facilitates the development of new models for accurate, efficient, and biologically grounded predictions of VH – VL pairings.

Materials and Methods

Dataset

Paired human antibody sequences were downloaded from the OAS database, producing an initial collection of 1,954,079 pairs. We removed truncated sequences. Each sequence was annotated with its corresponding germline segments, three for heavy chains (V, D, and J) and two for light chains (V and J), using the IMGT nomenclature. After filtering, 1,622,802 sequence pairs with complete germline information were retained, corresponding to 1,622,674 unique pairs (99.99% of the total), 1,604,717 unique heavy sequences, and 699,889 unique light sequences. Regarding germline diversity, we identified 7 unique heavy V segments, 7 heavy D segments, and 6 heavy J segments, resulting in 294 observed VDJ heavy germline combinations. In addition, 18 unique light V segments and 12 light J segments produced 76 unique VJ light germline combinations. Out of all possible heavy and light germline combinations ( $294 \times 76 = 22344$ ), only 12,416 pairs were observed. A summary of the dataset is provided in Table 4.

Table 4.

Summary of the dataset.

Statistic	Value
Rows	1 622 802
Unique sequences	1 622 674
Unique heavy sequences	1 604 717
Unique light sequences	699 889
Heavy germlines	294
Light germlines	76
Germline pairs	12 416
V heavy germlines	7
D heavy germlines	7
J heavy germlines	6
V light germlines	18
J light germlines	12

Open in a new tab

To reduce data set redundancy, we use Linclust⁵¹ by specifying a minimal sequence identity of $0.80$ . The algorithm extracted $1, 357, 063$ groups, with a mean size of $1.196$ . Almost all clusters are singletons ( $1, 290, 392$ ), and the largest one has size $3, 940$ . We then selected the representative sequence for each cluster. The resulting dataset consists of $1, 357, 155$ pairs with $1, 348, 625$ unique heavy sequences and $595, 539$ unique light sequences. The number of germline combinations observed has also been reduced to $11, 976$ .

Germline-aware split

The dataset was partitioned into three folds for training, validation, and testing using a germline – aware splitting strategy rather than pure random assignment. Negative samples (“mismatched” sequence pairs) were generated based on germline origin to ensure that, in real – world applications, where test pairs may derive from germlines unseen during training, the model still correctly discriminates true from spurious antibody pairings. This approach enables us to extend our model to nonhuman species, for example, mice, where the repertoire of V(D)J gene segments is substantially larger than in humans.

Let $D$ be a dataset of paired VH/VL sequences. Denote by $G H_{D}$ and $G L_{D}$ the sets of heavy (VDJ) and light (VJ) germline combinations that give rise to the heavy and light sequences in $D$ , respectively. For each $GH \in G H_{D}$ (respectively, $GL \in G L_{D}$ ), define $S_{GH}$ (respectively, $S_{GL}$ ) as the subset of $D$ consisting of VH/VL pairs whose heavy (light) chain originates from $GH$ ( $GL$ ). The objective is to partition $D$ into two subsets, $D_{1}$ (training) and $D_{2}$ (testing), such that:

G H_{D_{2}} ∖G H_{D_{1}} \neq \emptyset and G L_{D_{2}} ∖G L_{D_{1}} \neq \emptyset .

This ensures that some VH/VL pairs in $D_{2}$ include heavy and light sequences with germline combinations not seen in $D_{1}$ .

The germline-aware splitting algorithm proceeds as follows (see Algorithm 1). It begins by sampling a subset $T_{1} \subset G H_{D}$ of heavy germline combinations and adding to an initially empty test set $D_{2}$ all VH/VL pairs from $⋃_{GH \in T_{1}} S_{GH}$ . A similar step is then performed for the light germlines: a subset $T_{2} \subset G L_{D}$ is sampled, and the corresponding VH/VL pairs $⋃_{GL \in T_{2}} S_{GL}$ are added to $D_{2}$ .

Once these pairs are removed from the original dataset $D$ , a subset $D_{1}$ is sampled from the remaining data. The rest of the remaining VH/VL pairs, i.e., $D ∖ D_{1}$ , are then added to $D_{2}$ . The resulting pair of disjoint sets $(D_{1}, D_{2})$ forms the final partition.

Algorithm 1. Partitioning algorithm

Open in a new tab

The sampling of $T_{1}$ and $T_{2}$ can influence the number of pairs included in $D_{2}$ . If in $T_{1}$ is included an heavy germline $GH$ that is associated with a large number of $VH$ s, $S_{GL}$ can cover a large portion of the sequences in $D$ , creating an unbalanced division. On the other hand, if $T_{1}$ contains only heavy germlines such that for each $GH$ , $S_{GH}$ contains only few pairs, their exclusive inclusion in $D_{2}$ does not contribute enough to an evaluation of the performance the model has when presented with never-seen germlines. Similar reasoning also applies to light germlines.

In light of these observations, for each VDJ heavy germline combination in the dataset, the number of heavy sequences originating from them is counted. Then, $10 %$ of the combinations with a sequence count between the 80th and 20th quantiles are sampled and included in $T_{1}$ . The same approach is applied to the light germline combinations. A graphical representation is provided in Figure 9. From the remaining sequences in $D$ , $75 %$ are sampled and added to $D_{1}$ , while the rest are placed in $D_{2}$ . The dataset split composition is shown in Table 5.

Figure 9. — (a) Distribution of heavy V(D)J germline combinations on a log–scale. (b) Distribution of light VJ germline combinations on a log–scale. Orange bars indicate germline combinations reserved exclusively for the test split.

Table 5.

Summary of training, validation, and test splits.

Training	Training	Validation	Test
Paired sequences	$716325$	$271233$	$369597$
Unique heavy	$713129$	$270406$	$368303$
Unique light	$325309$	$134013$	$176290$
Unique pairs	$716295$	$271230$	$369586$
Heavy germlines	$261$	$270$	$291$
Light germlines	$68$	$68$	$73$
Germlines pair	$8628$	$8125$	$9575$
V heavy germlines	$7$	$7$	$7$
D heavy germlines	$7$	$7$	$7$
J heavy germlines	$6$	$6$	$6$
V light germlines	$18$	$17$	$17$
J light germlines	$12$	$11$	$11$

Open in a new tab

Negative dataset preparation

Random pairing

The random pairing procedure constructs synthetic VH – VL combinations by repeatedly sampling one heavy chain and one light chain until a new pair (previously unobserved) is obtained, this is considered as the baseline of our work. Formally, let $V H$ and $V L$ denote the sets of all unique heavy and light sequences, and let $P \subset V H \times V L$ be the set of real (positive) pairs. At each iteration, a candidate pair $(vh, vl)$ is drawn uniformly from $V H \times V L$ ; if $(vh, vl) \in P$ , it is discarded and a new draw is performed. This sampling continues until $N$ distinct non-positive pairs have been collected, typically with $N = | P |$ to ensure that positives and negatives are balanced. Finally, the synthetic negatives are merged with $P$ to form the randomly paired dataset. Algorithm 2 outlines this process.

Algorithm 2. Random pairing algorithm

Open in a new tab

The previous algorithm is repeated three times, where in each execution, $P$ is set to be equal to the training set, validation set, or test set, thus obtaining three randomly paired datasets. Results are shown in Table 6.

Table 6.

Summary of the splits for the random dataset.

Training	Training	Validation	Test
Generated pairs	716325	271233	369597
Unique heavy	451185	171061	232860
Unique light	212387	87773	115172
Unique pairs	716097	271142	369481
Total size	1432650	542466	739194

Open in a new tab

Germline pairing

Following the hypothesis in,¹³ we investigated how germline origin constrains VH – VL pairing. Let $P \subset V H \times V L$ denote a set of naive heavy – light sequence pairs, where each heavy chain $VH$ derives from germline(s) $GH$ , and each light chain $VL$ from $GL$ . We evaluated two encoding schemes for $GH$ and $GL$ : V-germline and VDJ germlines.

Let $G H$ be the set of all heavy germlines that generated the heavy sequences in $P$ and $G L$ be the same but for the light sequences. The set of all possible germline pair combinations is $A = G H \times G L$ , where $\times$ indicates the cartesian product. For a $< GH, GL >\in A$ , denote $S (< GH, GL >)$ to be the subset of $P$ containing all the $VH$ and $VL$ pairs such that $VH$ is generated by $GH$ and $VL$ is generated by $GL$ . Similarly, denote as $T (GH)$ the set of heavy sequences from $P$ having germline $GH$ , and similarly for $GL$ . The following procedure works as long as there exists some $< GH, GL >$ such that $| S (< GH, GL >) |$ is zero. In the following, $A_{0}$ refers to the subset of $A$ such that for each $< GH, GL >\in A_{0}$ , $| S (< GH, GL >) |$ is zero.

Let $N$ be the number of synthetic pairs to generate. The algorithm starts by sampling $N$ germline pairs from $A_{0}$ . Then, for each sampled $< GH, GL >$ , a random heavy sequence is sampled from $T (GH)$ and a random light one is sampled from $T (GL)$ . This process generates $N$ random sequence pairs whose germlines are never observed together in the dataset. This approach is showed in Algorithm 3.

Algorithm 3. Germline pairing algorithm.

Open in a new tab

The last gap to fill is to specify a probability distribution for the elements of $A_{0}$ . The most trivial approach is to sample elements in $A_{0}$ using a uniform distribution, which will result in that, if $| A_{0} | = m$ , on average a germline pair is picked $N / m$ times. However, the number of sequence a $GH$ and a $GL$ are associated to. For instance, consider the extreme case where the heavy germline $GH$ and the light germline $GL$ appear just once in the dataset, with their corresponding $VH$ and $VL$ sequences never observed paired in the dataset. This means that, on average, $< GH, GL >$ is sampled $N / m$ times from $A_{0}$ , implying that the VH/VL pair formed by $VH$ and $VL$ will be repeated the same number of times.

This example motivated the use of the maximal number of possible combinations. Let $p$ be a germline pair $< GH, GL >$ from $A_{0}$ , the maximal number of possible combinations is given by $n_{p} = | T (GH) | \times | T (GL) |$ . Taking all the $n_{p}$ and dividing them by $N_{P} = \sum_{p \in A_{0}} n_{p}$ , the obtained quantities characterize a probability distribution over $A_{0}$ .

While this approach solves the problem of repeated sequences, to avoid a skewed distribution, we added smoothing: for a $p \in A_{0}$ , its probability is given by $(n_{p} + α) / (N_{P} + αm)$ where $α$ is a positive constant.

As in the random pairing case, the generated germline paired sequences are combined with the naturally paired ones to create a single dataset. In addition, it is necessary to specify whether the germline-paired dataset was generated with only the V germline or all germlines.

For each of the training, validation, and test splits of the dataset, a germline-paired dataset using only the V germline and one using all the germlines is generated with an alpha equal to 1000. Summaries are given in Tables 7 and 8.

Table 7.

Summary of the synthetic pairs generated by the germline pairing strategy using all the germlines.

Training	Training	Validation	Test
Generated pairs	716325	271233	369597
Unique heavy	172284	79278	109264
Unique light	102598	50571	62194
Unique pairs	702818	268702	366160
Total size	1432650	542466	739194

Open in a new tab

Table 8.

Summary of the synthetic pairs generated by the germline pairing strategy using only the V germlines.

Training	Training	Validation	Test
Generated pairs	716325	271233	369597
Unique heavy	259312	68424	78250
Unique light	150	73	71
Unique pairs	511619	154522	208502
Total size	1432650	542466	739194

Open in a new tab

Model

Architecture

Our classifier builds on the IgBERT model,²⁸ which we augment with a lightweight feedforward classification head to distinguish naturally paired VH – VL sequences (from OAS) from synthetically generated mismatches. An overview of the model is shown in Figure 10. Each VH and VL sequence is tokenized at the residue level, concatenated with a special [SEP] token to denote the chain boundary, and passed through the (frozen) IgBERT encoder. The encoder outputs a 1,024-dimensional contextual embedding per residue, which we average across the sequence to obtain a single representation for the pair. This embedding is then passed through a two-layer classifier that outputs logits for binary classification (real vs. synthetic), which are converted into probabilities via a softmax layer. During training, only the classification head is optimized using Adam and cross-entropy loss, while IgBERT’s parameters remain fixed.

Figure 10. — Architecture of the model. The heavy (in red) and the light (in blue) sequences are tokenized by assigning a token for each residue.

Although alternative models such as PARA were considered, we selected IgBERT due to its deeper architecture (400 M parameters vs. 45 M in PARA) and stronger representational capacity. Given our study’s focus on analyzing the effect of different negative sampling strategies, we opted for a more complex encoder to isolate the impact of training data design rather than model capacity.

Training

The model was trained for one epoch using the Adam optimizer with 1,432,650 samples in the training set and 739,194 samples in the validation set, performing a total of 44,771 updates. The loss function used was cross-entropy loss, and the batch size was set to 32. The hyperparameters employed for training each of the three models are summarized in Table 9.

Table 9.

Optimal hyperparameters for the models trained on the respective datasets.

Hyperparameter	Random pairing	Germline pairing, V-only	Germline pairing
Number of layers	3	3	5
Size of the layers	1024	2048	2048
Learning rate (LR)	$1 \times 10^{- 4}$	$1 \times 10^{- 4}$	$1 \times 10^{- 4}$
Weight decay	$1 \times 10^{- 4}$	$1 \times 10^{- 4}$	$1 \times 10^{- 3}$
LR scheduler	linear	None	linear
End factor	0.05	None	0.1
Steps	1000	None	1000
Adam $β_{1}$	0.85	0.90	0.95
Adam $β_{2}$	0.960	0.999	0.960

Open in a new tab

The experiments were carried out using Python 3.11.9 and PyTorch 2.5.0.dev20240909+cu124, on a system equipped with an ARM Neoverse N1 251 CPU and an NVIDIA A100 GPU (CUDA 12.5).

Test procedure

To benchmark our models in a manner analogous to PARA, we construct evaluation triplets $(VH, V L_{1}, V L_{2})$ under the constraint that $VH$ is known to bind $V L_{1}$ , while $V L_{2}$ is chosen to low similarity to $V L_{1}$ . An overview of the test procedure is given in Figure 11. Each pair $(VH, V L_{i})$ is fed independently through the model, yielding confidence scores $s_{i}$ for the “real”class. By labeling $(VH, V L_{1})$ as positive and $(VH, V L_{2})$ as negative, we accumulate $2 N$ (score, label) observations over $N$ triplets. This procedure mirrors the PARA challenge of distinguishing true binders from close decoys, ensuring that performance reflects model robustness to varying degrees of chain similarity. Final metrics are then computed on this pooled set of $2 N$ predictions. In the original paper, the author did not publish the fine-tuned model or the dataset used; therefore, we compared our model with the results presented in the original paper.

Figure 11. — Overview of the pipeline to get the test measures. A $(VH, VL 1, VL 2)$ triplet is split into two pairs, which the model independently processes. The scores regarding the confidence the model has in the pairing between sequences are stored and tagged as $0$ , if the sequences are known to bind, $1$ otherwise.

To build the set of triples, a sample of $10, 000$ light sequences is first collected. Then, the similarity defined in Equation 1 is computed for each possible pair of these sampled light sequences, and the mean ( $μ$ ) and standard deviation ( $σ$ ) are derived. The $dist$ function in Equation 1 represents the Levenshtein distance, defined as the minimum number of insertions, deletions, and substitutions needed to transform one string into another.

Given a dataset of paired VH/VL sequences, for each pair $(VH, VL 1)$ , a light sequence $VL 2$ is searched such that $sim (VL 1, VL 2)$ is less than a threshold defined by $μ - σ$ . Two datasets are generated: one from the validation set and the other from the test set.

Supplementary Material

Supplemental Material

KMAB_A_2570749_SM2690.pdf^{(127.3KB, pdf)}

Acknowledgments

We would like to thank the University of Pisa Data Center for providing the necessary hardware resources for this project. We acknowledge that grammar checks were conducted using EditGPT.

Funding Statement

This work was supported by the European Union – Next Generation EU, under the Italian National Recovery and Resilience Plan (PNRR), Mission 4 Component 2 Investment 1.5, project ECS00000017 Tuscany Health Ecosystem – Spoke 6 [CUP I53C22000780001] and by the project “Hub multidisciplinare e interregionale di ricerca e sperimentazione clinica per il contrasto alle pandemie e all’antibioticoresistenza (PAN-HUB)” funded by the Italian Ministry of Health (POS 2014-2020), project ID: [T4-AN-07], [CUP: I53C22001300001]. Funding for the publication of this work was generously provided by Fondazione Toscana Life Sciences. G.M. was employed by TLS during the course of this work, and S.J. is a PhD student partially funded by TLS. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Nomenclature/Notation

The following abbreviations are used in this manuscript:
AF3: AlphaFold 3
AUC-ROC: Area Under the Receiver Operating Characteristic Curve
CDR: Complementarity-Determining Region
DL: Deep Learning
MCC: Matthews Correlation Coefficient
Fc: Constant Fragment
Fv: Variable Fragment
ipTM: Interface predicted template modelling
MLP: MultiLayer Perceptron
OAS: Observed Antibody Space
STD: Standard Deviation
t-SNE: t-distributed Stochastic Neighbor Embedding
UMAP: Uniform Manifold Approximation and Projection
VDJ/V(D)J: Variable, Diversity, and Joining gene recombination process
VH: Variable Heavy
VL: Variable Light

Disclosure statement

No potential conflict of interest was reported by the author(s).

Author contributions

Conceptualization, G.M., and S.J.; methodology, E.D. and S.J.; software, E.D.; validation, E.D., and S.J.; formal analysis, E.D. and S.J; investigation, E.D. and S.J.; resources, A.M. and P.M.; data curation, E.D.; writing – original draft preparation, S.J. and E.D; writing – review and editing, A.M., P.M., G.M., E.D. and S.J.; visualization, S.J.; supervision, A.M., G.M., P.M. and S.J.; project administration, A.M.; funding acquisition, G.M. All authors have read and agreed to the published version of the manuscript.

Data availability statement

Data and code are available at https://github.com/darcoenr/thesis-2.

Supplementary material

Supplemental data for this article can be accessed online at https://doi.org/10.1080/19420862.2025.2570749

References

1.Joubbi S, Micheli A, Milazzo P, Maccari G, Ciano G, Cardamone D, Medini D.. Antibody design using deep learning: from sequence and structure design to affinity maturation. Briefings Bioinf. 2024;25(4). doi: 10.1093/bib/bbae307. [DOI] [PMC free article] [PubMed] [Google Scholar]
2.Tonegawa S. Somatic generation of antibody diversity. Nature. 1983;302(5909):575–21. doi: 10.1038/302575a0. [DOI] [PubMed] [Google Scholar]
3.Alt FW, Oltz EM, Young F, Gorman J, Taccioli G, Chen J. Vdj recombination. Immunol Today. 1992;13(8):306–314. doi: 10.1016/0167-5699(92)90043-7. [DOI] [PubMed] [Google Scholar]
4.Jung D, Giallourakis C, Mostoslavsky R, Alt FW. Mechanism and control of V (D) J recombination at the immunoglobulin heavy chain locus. Annu Rev Immunol. 2006;24(1):541–s570. doi: 10.1146/annurev.immunol.23.021704.115830. [DOI] [PubMed] [Google Scholar]
5.Schatz DG, Ji Y. Recombination centres and the orchestration of V (D) J recombination. Nat Rev Immunol. 2011;11(4):251–263. doi: 10.1038/nri2941. [DOI] [PubMed] [Google Scholar]
6.Schatz DG, Swanson PC. V (d) j recombination: mechanisms of initiation. Annu Rev Genet. 2011;45(1):167–202. doi: 10.1146/annurev-genet-110410-132552. [DOI] [PubMed] [Google Scholar]
7.Jung D, Alt FW. Unraveling V (D) J recombination: insights into gene regulation. Cell. 2004;116(2):299–311. doi: 10.1016/S0092-8674(04)00039-X. [DOI] [PubMed] [Google Scholar]
8.Guloglu B, Deane CM. Specific attributes of the VL domain influence both the structure and structural variability of CDR-H3 through steric effects. Front Immunol. 2023;14:1223802. doi: 10.3389/fimmu.2023.1223802. [DOI] [PMC free article] [PubMed] [Google Scholar]
9.Christensen PA, Danielczyk A, Ravn P, Larsen M, Stahn R, Karsten U, Goletz S. Modifying antibody specificity by chain shuffling of vh/vl between antibodies with related specificities. Scand J Immunol. 2009;69(1):1–10. doi: 10.1111/j.1365-3083.2008.02164.x. [DOI] [PubMed] [Google Scholar]
10.De Wildt RM, Hoet RM, van Venrooij WJ, Tomlinson IM, Winter G. Analysis of heavy and light chain pairings indicates that receptor editing shapes the human antibody repertoire. J Mol Biol. 1999;285(3):895–901. doi: 10.1006/jmbi.1998.2396. [DOI] [PubMed] [Google Scholar]
11.Brezinschek H-P, Foster SJ, Dorner T, Brezinschek RI, Lipsky PE. Pairing of variable heavy and variable chains in individual naive and memory B cells. J Immunol. 1998;160(10):4762–4767. [PubMed] [Google Scholar]
12.Parren PW, Lugovskoy AA. Therapeutic antibody engineering: current and future advances driving the strongest growth area in the pharmaceutical industry. MAbs-Austin. 2013;5(2):175–177. doi: 10.4161/mabs.23654. [DOI] [Google Scholar]
13.Jayaram N, Bhowmick P, Martin AC. Germline VH/VL pairing in antibodies. Protein Eng Des Sel. 2012;25(10):523–530. doi: 10.1093/protein/gzs043. [DOI] [PubMed] [Google Scholar]
14.Madsen AV, Mejias-Gomez O, Pedersen LE, Preben Morth J, Kristensen P, Jenkins TP, Goletz S. Structural trends in antibody-antigen binding interfaces: a computational analysis of 1833 experimentally determined 3D structures. Comput Struct Biotechnol J. 2024;23:199–211. doi: 10.1016/j.csbj.2023.11.056. [DOI] [PMC free article] [PubMed] [Google Scholar]
15.Dudzic P, Chomicz D, Bielska W, Jaszczyszyn I, Zieliski M, Janusz B, Wróbel S, Le Pannérer M-M, Philips A, Ponraj P, et al. Conserved heavy/light contacts and germline preferences revealed by a large-scale analysis of natively paired human antibody sequences and structural data. Commun Biol. 2025;8(1):1110. [DOI] [PMC free article] [PubMed] [Google Scholar]
16.Seidler CA, Spanke VA, Gamper J, Bujotzek A, Georges G, Liedl KR. Data-driven analyses of human antibody variable domain germlines: pairings, sequences and structural features. MAbs-Austin. 2025;17(1):2507950. doi: 10.1080/19420862.2025.2507950. [DOI] [PMC free article] [PubMed] [Google Scholar]
17.Robert R, Lefranc M-P, Ghochikyan A, Agadjanyan MG, Cribbs DH, Van Nostrand WE, Wark KL, Dolezal O. Restricted V gene usage and VH/VL pairing of mouse humoral response against the N-terminal immunodominant epitope of the amyloid peptide. Mol Immunol. 2010;48(1–3):59–72. doi: 10.1016/j.molimm.2010.09.012. [DOI] [PMC free article] [PubMed] [Google Scholar]
18.Tanno H, McDaniel JR, Stevens CA, Voss WN, Li J, Durrett R, Lee J, Gollihar J, Tanno Y, Delidakis G, et al. A facile technology for the high-throughput sequencing of the paired VH: VL and TCR: TCR repertoires. Sci Adv. 2020;6(17):eaay9093. [DOI] [PMC free article] [PubMed] [Google Scholar]
19.Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Ždek A, Potapenko A, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596(7873):583–589. [DOI] [PMC free article] [PubMed] [Google Scholar]
20.Senior AW, Evans R, Jumper J, Kirkpatrick J, Sifre L, Green T, Qin C, Ždek A, Nelson AW, Bridgland A. et al. Improved protein structure prediction using potentials from deep learning. Nature. 2020;577(7792):706–710. [DOI] [PubMed] [Google Scholar]
21.Rives A, Meier J, Sercu T, Goyal S, Lin Z, Liu J, Guo D, Myle Ott CLZ, Ma J, Fergus R. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci India Sect B Biol Sci. 2021;118(15):e2016239118. [DOI] [PMC free article] [PubMed] [Google Scholar]
22.Elnaggar A, Heinzinger M, Dallago C, Rehawi G, Wang Y, Jones L, Gibbs T, Feher T, Angerer C, Steinegger M, et al. Prottrans: towards cracking the language of life’s code through self-supervised learning. Ieee T Pattern Anal. 2021;44(10):7112–7127. doi: 10.1109/TPAMI.2021.3095381. [DOI] [PubMed] [Google Scholar]
23.Brandes N, Ofer D, Peleg Y, Rappoport N, Linial M. ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics. 2022;38(8):2102–2110. doi: 10.1093/bioinformatics/btac020. [DOI] [PMC free article] [PubMed] [Google Scholar]
24.Wang L, Li X, Zhang H, Wang J, Jiang D, Xue Z, Wang Y. A comprehensive review of protein language models. 2025.
25.Leem J, Mitchell LS, Farmery JH, Barton J, Galson JD. Deciphering the language of antibodies using self-supervised learning. Patterns. 2022;3(7):100513. doi: 10.1016/j.patter.2022.100513. [DOI] [PMC free article] [PubMed] [Google Scholar]
26.Olsen TH, Moal IH, Deane CM. Ablang: an antibody language model for completing antibody sequences. Bioinf Adv. 2022;2(1):vbac046. doi: 10.1093/bioadv/vbac046. [DOI] [PMC free article] [PubMed] [Google Scholar]
27.Jing H, Gao Z, Xu S, Shen T, Peng Z, He S, You T, Ye S, Lin W, Sun S. Accurate prediction of antibody function and structure using bio-inspired antibody language model. Briefings Bioinf. 2024;25(4):bbae245. doi: 10.1093/bib/bbae245. [DOI] [PMC free article] [PubMed] [Google Scholar]
28.Kenlay H, Dreyer FA, Kovaltsuk A, Miketa D, Pires D, Deane CM. Large scale paired antibody language models. arXiv preprint arXiv:2403.17889. 2024. [DOI] [PMC free article] [PubMed]
29.Gao X, Cao C, He C, Lai L. Pre-training with a rational approach for antibody sequence representation. Front Immunol. 2024;15:1468599. doi: 10.3389/fimmu.2024.1468599. [DOI] [PMC free article] [PubMed] [Google Scholar]
30.Burbach SM, Briney B. Improving antibody language models with native pairing. Patterns. 2024;5(5):100967. doi: 10.1016/j.patter.2024.100967. [DOI] [PMC free article] [PubMed] [Google Scholar]
31.Olsen TH, Boyles F, Deane CM. Observed antibody space: a diverse database of cleaned, annotated, and translated unpaired and paired antibody sequences. Protein Sci. 2022;31(1):141–146. doi: 10.1002/pro.4205. [DOI] [PMC free article] [PubMed] [Google Scholar]
32.Kovaltsuk A, Leem J, Kelm S, Snowden J, Deane CM, Krawczyk K. Observed antibody space: a resource for data mining next-generation sequencing of antibody repertoires. J Immunol. 2018;201(8):2502–2509. doi: 10.4049/jimmunol.1800708. [DOI] [PubMed] [Google Scholar]
33.Turnbull OM, Oglic D, Croasdale-Wood R, Deane CM. P-iggen: a paired antibody generative language model. Bioinformatics. 2024;40(11):btae659. doi: 10.1093/bioinformatics/btae659. [DOI] [PMC free article] [PubMed] [Google Scholar]
34.Chinery L, Jeliazkov JR, Deane CM. Humatch-fast, gene-specific joint humanisation of antibody heavy and light chains. MAbs-Austin. 2024;16(1):2434121. doi: 10.1080/19420862.2024.2434121. [DOI] [PMC free article] [PubMed] [Google Scholar]
35.Guo D, Dunn-Walters DK, Fraternali F, Ng JC. Immunomatch learns and predicts cognate pairing of heavy and light immunoglobulin chains. bioRxiv. 2025; 2025–02.
36.Turnbull OM, Deane C. Synpair: pairing unpaired antibody chains at billion-sequence scale with contrastive learning. ICML 2025 Generative AI and Biology (GenBio) Workshop; Vancouver, BC, Canada. [Google Scholar]
37.Ursu E, Minnegalieva A, Rawat P, Chernigovskaya M, Tacutu R, Kjetil Sandve G, Robert PA, Greiff V. Training data composition determines machine learning generalization and biological rule discovery. Nat Mach Intell. 2025;7(8):1206–1219. doi: 10.1038/s42256-025-01089-5. [DOI] [Google Scholar]
38.Bennett NR, Watson JL, Ragotte RJ, Borst AJ, See DL, Weidle C, Biswas R, Yu Y, Shrock EL, Ault R, et al. Atomically accurate de novo design of antibodies with RFdiffusion. bioRxiv. 2025; 2024–03.
39.Abramson J, Adler J, Dunger J, Evans R, Green T, Pritzel A, Ronneberger O, Willmore L, Ballard AJ, Bambrick J, et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature. 2024;630(8016):493–500. doi: 10.1038/s41586-024-07487-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
40.Jain T, Sun T, Durand S, Hall A, Rewa Houston N, Nett JH, Sharkey B, Bobrowicz B, Caffry I, Yu Y. et al. Biophysical properties of the clinical-stage antibody landscape. Proc Natl Acad Sci India Sect B Biol Sci. 2017;114(5):944–949. [DOI] [PMC free article] [PubMed] [Google Scholar]
41.Waight AB, Prihoda D, Shrestha R, Metcalf K, Bailly M, Ancona M, Widatalla T, Rollins Z, Cheng AC, Bitton DA, et al. A machine learning strategy for the identification of key in silico descriptors and prediction models for IgG monoclonal antibody developability properties. MAbs-austin. 2023;15(1):2248671. doi: 10.1080/19420862.2023.2248671. [DOI] [PMC free article] [PubMed] [Google Scholar]
42.Raybould MI, Marks C, Krawczyk K, Taddese B, Nowak J, Lewis AP, Bujotzek A, Shi J, Deane CM. Five computational developability guidelines for therapeutic antibody profiling. Proc Natl Acad Sci India Sect B Biol Sci. 2019;116(10):4025–4030. doi: 10.1073/pnas.1810576116. [DOI] [PMC free article] [PubMed] [Google Scholar]
43.Prihoda D, Maamary J, Waight A, Juan V, Fayadat-Dilman L, Svozil D, Bitton DA. Biophi: a platform for antibody design, humanization, and humanness evaluation based on natural antibody repertoires and deep learning. MAbs-Austin. 2022;14(1):2020203. doi: 10.1080/19420862.2021.2020203. [DOI] [PMC free article] [PubMed] [Google Scholar]
44.Hon J, Marusiak M, Martinek T, Kunka A, Zendulka J, Bednar D, Damborsky J. Soluprot: prediction of soluble protein expression in Escherichia coli. Bioinformatics. 2021;37(1):23–28. doi: 10.1093/bioinformatics/btaa1102. [DOI] [PMC free article] [PubMed] [Google Scholar]
45.Mitternacht S. Freesasa: an open source C library for solvent accessible surface area calculations. F1000Research. 2016;5:189. doi: 10.12688/f1000research.7931.1. [DOI] [PMC free article] [PubMed] [Google Scholar]
46.Søndergaard CR, Olsson MH, Rostkowski M, Jensen JH. Improved treatment of ligands and coupling effects in empirical calculation and rationalization of pka values. J Chem Theory Comput. 2011;7(7):2284–2295. doi: 10.1021/ct200133y. [DOI] [PubMed] [Google Scholar]
47.Olsson MH. Protein electrostatics and pKa blind predictions; contribution from empirical predictions of internal ionizable residues. Proteins: Struct Func Bioinf. 2011;79(12):3333–3345. doi: 10.1002/prot.23113. [DOI] [PubMed] [Google Scholar]
48.Akbar R, Bashour H, Rawat P, Robert PA, Smorodina E, Cotet T-S, Flem-Karlsen K, Frank R, Bhushan Mehta B, Vu MH, et al. Progress and challenges for the machine learning-based design of fit-for-purpose monoclonal antibodies. MAbs-Austin. 2022;14(1):2008790. doi: 10.1080/19420862.2021.2008790. [DOI] [PMC free article] [PubMed] [Google Scholar]
49.Reynisson B, Barra C, Kaabinejadian S, Hildebrand WH, Peters B, Nielsen M. Improved prediction of MHC II antigen presentation through integration and motif deconvolution of mass spectrometry MHC eluted ligand data. J Proteome Res. 2020;19(6):2304–2315. doi: 10.1021/acs.jproteome.9b00874. [DOI] [PubMed] [Google Scholar]
50.Zhou Y, Xie S, Yang Y, Jiang L, Liu S, Li W, Bukari Abagna H, Ning L, Huang J. Ssh2.0: a better tool for predicting the hydrophobic interaction risk of monoclonal antibody. Front Genet. 2022;13:842127. doi: 10.3389/fgene.2022.842127. [DOI] [PMC free article] [PubMed] [Google Scholar]
51.Steinegger M, Söding J. Clustering huge protein sequence sets in linear time. Nat Commun. 2018;9(1). doi: 10.1038/s41467-018-04964-5. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplemental Material

KMAB_A_2570749_SM2690.pdf^{(127.3KB, pdf)}

Data Availability Statement

Data and code are available at https://github.com/darcoenr/thesis-2.

[cit0001] 1.Joubbi S, Micheli A, Milazzo P, Maccari G, Ciano G, Cardamone D, Medini D.. Antibody design using deep learning: from sequence and structure design to affinity maturation. Briefings Bioinf. 2024;25(4). doi: 10.1093/bib/bbae307. [DOI] [PMC free article] [PubMed] [Google Scholar]

[cit0002] 2.Tonegawa S. Somatic generation of antibody diversity. Nature. 1983;302(5909):575–21. doi: 10.1038/302575a0. [DOI] [PubMed] [Google Scholar]

[cit0003] 3.Alt FW, Oltz EM, Young F, Gorman J, Taccioli G, Chen J. Vdj recombination. Immunol Today. 1992;13(8):306–314. doi: 10.1016/0167-5699(92)90043-7. [DOI] [PubMed] [Google Scholar]

[cit0004] 4.Jung D, Giallourakis C, Mostoslavsky R, Alt FW. Mechanism and control of V (D) J recombination at the immunoglobulin heavy chain locus. Annu Rev Immunol. 2006;24(1):541–s570. doi: 10.1146/annurev.immunol.23.021704.115830. [DOI] [PubMed] [Google Scholar]

[cit0005] 5.Schatz DG, Ji Y. Recombination centres and the orchestration of V (D) J recombination. Nat Rev Immunol. 2011;11(4):251–263. doi: 10.1038/nri2941. [DOI] [PubMed] [Google Scholar]

[cit0006] 6.Schatz DG, Swanson PC. V (d) j recombination: mechanisms of initiation. Annu Rev Genet. 2011;45(1):167–202. doi: 10.1146/annurev-genet-110410-132552. [DOI] [PubMed] [Google Scholar]

[cit0007] 7.Jung D, Alt FW. Unraveling V (D) J recombination: insights into gene regulation. Cell. 2004;116(2):299–311. doi: 10.1016/S0092-8674(04)00039-X. [DOI] [PubMed] [Google Scholar]

[cit0008] 8.Guloglu B, Deane CM. Specific attributes of the VL domain influence both the structure and structural variability of CDR-H3 through steric effects. Front Immunol. 2023;14:1223802. doi: 10.3389/fimmu.2023.1223802. [DOI] [PMC free article] [PubMed] [Google Scholar]

[cit0009] 9.Christensen PA, Danielczyk A, Ravn P, Larsen M, Stahn R, Karsten U, Goletz S. Modifying antibody specificity by chain shuffling of vh/vl between antibodies with related specificities. Scand J Immunol. 2009;69(1):1–10. doi: 10.1111/j.1365-3083.2008.02164.x. [DOI] [PubMed] [Google Scholar]

[cit0010] 10.De Wildt RM, Hoet RM, van Venrooij WJ, Tomlinson IM, Winter G. Analysis of heavy and light chain pairings indicates that receptor editing shapes the human antibody repertoire. J Mol Biol. 1999;285(3):895–901. doi: 10.1006/jmbi.1998.2396. [DOI] [PubMed] [Google Scholar]

[cit0011] 11.Brezinschek H-P, Foster SJ, Dorner T, Brezinschek RI, Lipsky PE. Pairing of variable heavy and variable chains in individual naive and memory B cells. J Immunol. 1998;160(10):4762–4767. [PubMed] [Google Scholar]

[cit0012] 12.Parren PW, Lugovskoy AA. Therapeutic antibody engineering: current and future advances driving the strongest growth area in the pharmaceutical industry. MAbs-Austin. 2013;5(2):175–177. doi: 10.4161/mabs.23654. [DOI] [Google Scholar]

[cit0013] 13.Jayaram N, Bhowmick P, Martin AC. Germline VH/VL pairing in antibodies. Protein Eng Des Sel. 2012;25(10):523–530. doi: 10.1093/protein/gzs043. [DOI] [PubMed] [Google Scholar]

[cit0014] 14.Madsen AV, Mejias-Gomez O, Pedersen LE, Preben Morth J, Kristensen P, Jenkins TP, Goletz S. Structural trends in antibody-antigen binding interfaces: a computational analysis of 1833 experimentally determined 3D structures. Comput Struct Biotechnol J. 2024;23:199–211. doi: 10.1016/j.csbj.2023.11.056. [DOI] [PMC free article] [PubMed] [Google Scholar]

[cit0015] 15.Dudzic P, Chomicz D, Bielska W, Jaszczyszyn I, Zieliski M, Janusz B, Wróbel S, Le Pannérer M-M, Philips A, Ponraj P, et al. Conserved heavy/light contacts and germline preferences revealed by a large-scale analysis of natively paired human antibody sequences and structural data. Commun Biol. 2025;8(1):1110. [DOI] [PMC free article] [PubMed] [Google Scholar]

[cit0016] 16.Seidler CA, Spanke VA, Gamper J, Bujotzek A, Georges G, Liedl KR. Data-driven analyses of human antibody variable domain germlines: pairings, sequences and structural features. MAbs-Austin. 2025;17(1):2507950. doi: 10.1080/19420862.2025.2507950. [DOI] [PMC free article] [PubMed] [Google Scholar]

[cit0017] 17.Robert R, Lefranc M-P, Ghochikyan A, Agadjanyan MG, Cribbs DH, Van Nostrand WE, Wark KL, Dolezal O. Restricted V gene usage and VH/VL pairing of mouse humoral response against the N-terminal immunodominant epitope of the amyloid peptide. Mol Immunol. 2010;48(1–3):59–72. doi: 10.1016/j.molimm.2010.09.012. [DOI] [PMC free article] [PubMed] [Google Scholar]

[cit0018] 18.Tanno H, McDaniel JR, Stevens CA, Voss WN, Li J, Durrett R, Lee J, Gollihar J, Tanno Y, Delidakis G, et al. A facile technology for the high-throughput sequencing of the paired VH: VL and TCR: TCR repertoires. Sci Adv. 2020;6(17):eaay9093. [DOI] [PMC free article] [PubMed] [Google Scholar]

[cit0019] 19.Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Ždek A, Potapenko A, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596(7873):583–589. [DOI] [PMC free article] [PubMed] [Google Scholar]

[cit0020] 20.Senior AW, Evans R, Jumper J, Kirkpatrick J, Sifre L, Green T, Qin C, Ždek A, Nelson AW, Bridgland A. et al. Improved protein structure prediction using potentials from deep learning. Nature. 2020;577(7792):706–710. [DOI] [PubMed] [Google Scholar]

[cit0021] 21.Rives A, Meier J, Sercu T, Goyal S, Lin Z, Liu J, Guo D, Myle Ott CLZ, Ma J, Fergus R. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci India Sect B Biol Sci. 2021;118(15):e2016239118. [DOI] [PMC free article] [PubMed] [Google Scholar]

[cit0022] 22.Elnaggar A, Heinzinger M, Dallago C, Rehawi G, Wang Y, Jones L, Gibbs T, Feher T, Angerer C, Steinegger M, et al. Prottrans: towards cracking the language of life’s code through self-supervised learning. Ieee T Pattern Anal. 2021;44(10):7112–7127. doi: 10.1109/TPAMI.2021.3095381. [DOI] [PubMed] [Google Scholar]

[cit0023] 23.Brandes N, Ofer D, Peleg Y, Rappoport N, Linial M. ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics. 2022;38(8):2102–2110. doi: 10.1093/bioinformatics/btac020. [DOI] [PMC free article] [PubMed] [Google Scholar]

[cit0024] 24.Wang L, Li X, Zhang H, Wang J, Jiang D, Xue Z, Wang Y. A comprehensive review of protein language models. 2025.

[cit0025] 25.Leem J, Mitchell LS, Farmery JH, Barton J, Galson JD. Deciphering the language of antibodies using self-supervised learning. Patterns. 2022;3(7):100513. doi: 10.1016/j.patter.2022.100513. [DOI] [PMC free article] [PubMed] [Google Scholar]

[cit0026] 26.Olsen TH, Moal IH, Deane CM. Ablang: an antibody language model for completing antibody sequences. Bioinf Adv. 2022;2(1):vbac046. doi: 10.1093/bioadv/vbac046. [DOI] [PMC free article] [PubMed] [Google Scholar]

[cit0027] 27.Jing H, Gao Z, Xu S, Shen T, Peng Z, He S, You T, Ye S, Lin W, Sun S. Accurate prediction of antibody function and structure using bio-inspired antibody language model. Briefings Bioinf. 2024;25(4):bbae245. doi: 10.1093/bib/bbae245. [DOI] [PMC free article] [PubMed] [Google Scholar]

[cit0028] 28.Kenlay H, Dreyer FA, Kovaltsuk A, Miketa D, Pires D, Deane CM. Large scale paired antibody language models. arXiv preprint arXiv:2403.17889. 2024. [DOI] [PMC free article] [PubMed]

[cit0029] 29.Gao X, Cao C, He C, Lai L. Pre-training with a rational approach for antibody sequence representation. Front Immunol. 2024;15:1468599. doi: 10.3389/fimmu.2024.1468599. [DOI] [PMC free article] [PubMed] [Google Scholar]

[cit0030] 30.Burbach SM, Briney B. Improving antibody language models with native pairing. Patterns. 2024;5(5):100967. doi: 10.1016/j.patter.2024.100967. [DOI] [PMC free article] [PubMed] [Google Scholar]

[cit0031] 31.Olsen TH, Boyles F, Deane CM. Observed antibody space: a diverse database of cleaned, annotated, and translated unpaired and paired antibody sequences. Protein Sci. 2022;31(1):141–146. doi: 10.1002/pro.4205. [DOI] [PMC free article] [PubMed] [Google Scholar]

[cit0032] 32.Kovaltsuk A, Leem J, Kelm S, Snowden J, Deane CM, Krawczyk K. Observed antibody space: a resource for data mining next-generation sequencing of antibody repertoires. J Immunol. 2018;201(8):2502–2509. doi: 10.4049/jimmunol.1800708. [DOI] [PubMed] [Google Scholar]

[cit0033] 33.Turnbull OM, Oglic D, Croasdale-Wood R, Deane CM. P-iggen: a paired antibody generative language model. Bioinformatics. 2024;40(11):btae659. doi: 10.1093/bioinformatics/btae659. [DOI] [PMC free article] [PubMed] [Google Scholar]

[cit0034] 34.Chinery L, Jeliazkov JR, Deane CM. Humatch-fast, gene-specific joint humanisation of antibody heavy and light chains. MAbs-Austin. 2024;16(1):2434121. doi: 10.1080/19420862.2024.2434121. [DOI] [PMC free article] [PubMed] [Google Scholar]

[cit0035] 35.Guo D, Dunn-Walters DK, Fraternali F, Ng JC. Immunomatch learns and predicts cognate pairing of heavy and light immunoglobulin chains. bioRxiv. 2025; 2025–02.

[cit0036] 36.Turnbull OM, Deane C. Synpair: pairing unpaired antibody chains at billion-sequence scale with contrastive learning. ICML 2025 Generative AI and Biology (GenBio) Workshop; Vancouver, BC, Canada. [Google Scholar]

[cit0037] 37.Ursu E, Minnegalieva A, Rawat P, Chernigovskaya M, Tacutu R, Kjetil Sandve G, Robert PA, Greiff V. Training data composition determines machine learning generalization and biological rule discovery. Nat Mach Intell. 2025;7(8):1206–1219. doi: 10.1038/s42256-025-01089-5. [DOI] [Google Scholar]

[cit0038] 38.Bennett NR, Watson JL, Ragotte RJ, Borst AJ, See DL, Weidle C, Biswas R, Yu Y, Shrock EL, Ault R, et al. Atomically accurate de novo design of antibodies with RFdiffusion. bioRxiv. 2025; 2024–03.

[cit0039] 39.Abramson J, Adler J, Dunger J, Evans R, Green T, Pritzel A, Ronneberger O, Willmore L, Ballard AJ, Bambrick J, et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature. 2024;630(8016):493–500. doi: 10.1038/s41586-024-07487-w. [DOI] [PMC free article] [PubMed] [Google Scholar]

[cit0040] 40.Jain T, Sun T, Durand S, Hall A, Rewa Houston N, Nett JH, Sharkey B, Bobrowicz B, Caffry I, Yu Y. et al. Biophysical properties of the clinical-stage antibody landscape. Proc Natl Acad Sci India Sect B Biol Sci. 2017;114(5):944–949. [DOI] [PMC free article] [PubMed] [Google Scholar]

[cit0041] 41.Waight AB, Prihoda D, Shrestha R, Metcalf K, Bailly M, Ancona M, Widatalla T, Rollins Z, Cheng AC, Bitton DA, et al. A machine learning strategy for the identification of key in silico descriptors and prediction models for IgG monoclonal antibody developability properties. MAbs-austin. 2023;15(1):2248671. doi: 10.1080/19420862.2023.2248671. [DOI] [PMC free article] [PubMed] [Google Scholar]

[cit0042] 42.Raybould MI, Marks C, Krawczyk K, Taddese B, Nowak J, Lewis AP, Bujotzek A, Shi J, Deane CM. Five computational developability guidelines for therapeutic antibody profiling. Proc Natl Acad Sci India Sect B Biol Sci. 2019;116(10):4025–4030. doi: 10.1073/pnas.1810576116. [DOI] [PMC free article] [PubMed] [Google Scholar]

[cit0043] 43.Prihoda D, Maamary J, Waight A, Juan V, Fayadat-Dilman L, Svozil D, Bitton DA. Biophi: a platform for antibody design, humanization, and humanness evaluation based on natural antibody repertoires and deep learning. MAbs-Austin. 2022;14(1):2020203. doi: 10.1080/19420862.2021.2020203. [DOI] [PMC free article] [PubMed] [Google Scholar]

[cit0044] 44.Hon J, Marusiak M, Martinek T, Kunka A, Zendulka J, Bednar D, Damborsky J. Soluprot: prediction of soluble protein expression in Escherichia coli. Bioinformatics. 2021;37(1):23–28. doi: 10.1093/bioinformatics/btaa1102. [DOI] [PMC free article] [PubMed] [Google Scholar]

[cit0045] 45.Mitternacht S. Freesasa: an open source C library for solvent accessible surface area calculations. F1000Research. 2016;5:189. doi: 10.12688/f1000research.7931.1. [DOI] [PMC free article] [PubMed] [Google Scholar]

[cit0046] 46.Søndergaard CR, Olsson MH, Rostkowski M, Jensen JH. Improved treatment of ligands and coupling effects in empirical calculation and rationalization of pka values. J Chem Theory Comput. 2011;7(7):2284–2295. doi: 10.1021/ct200133y. [DOI] [PubMed] [Google Scholar]

[cit0047] 47.Olsson MH. Protein electrostatics and pKa blind predictions; contribution from empirical predictions of internal ionizable residues. Proteins: Struct Func Bioinf. 2011;79(12):3333–3345. doi: 10.1002/prot.23113. [DOI] [PubMed] [Google Scholar]

[cit0048] 48.Akbar R, Bashour H, Rawat P, Robert PA, Smorodina E, Cotet T-S, Flem-Karlsen K, Frank R, Bhushan Mehta B, Vu MH, et al. Progress and challenges for the machine learning-based design of fit-for-purpose monoclonal antibodies. MAbs-Austin. 2022;14(1):2008790. doi: 10.1080/19420862.2021.2008790. [DOI] [PMC free article] [PubMed] [Google Scholar]

[cit0049] 49.Reynisson B, Barra C, Kaabinejadian S, Hildebrand WH, Peters B, Nielsen M. Improved prediction of MHC II antigen presentation through integration and motif deconvolution of mass spectrometry MHC eluted ligand data. J Proteome Res. 2020;19(6):2304–2315. doi: 10.1021/acs.jproteome.9b00874. [DOI] [PubMed] [Google Scholar]

[cit0050] 50.Zhou Y, Xie S, Yang Y, Jiang L, Liu S, Li W, Bukari Abagna H, Ning L, Huang J. Ssh2.0: a better tool for predicting the hydrophobic interaction risk of monoclonal antibody. Front Genet. 2022;13:842127. doi: 10.3389/fgene.2022.842127. [DOI] [PMC free article] [PubMed] [Google Scholar]

[cit0051] 51.Steinegger M, Söding J. Clustering huge protein sequence sets in linear time. Nat Commun. 2018;9(1). doi: 10.1038/s41467-018-04964-5. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

Germline-aware deep learning models and benchmarks for predicting antibody VH–VL pairing

Sara Joubbi

Enrico D’Arco

Giuseppe Maccari

Paolo Milazzo

Alessio Micheli

ABSTRACT

Introduction

Figure 1.

Results

Building a large-scale paired antibody dataset with negative sampling

Figure 2.

Latent space visualization reveals separation in germline-based pairings

Figure 3.

Pairwise sequence similarity reveals divergence of synthetic pairings

Table 1.

Table 2.

Germline overlap in randomly mismatched VH/VL pairs limits class separability

Model evaluation under diverse negative pairing strategies

Figure 4.

VDJ influence on the final output

Figure 5.

Establishing a benchmark framework for antibody pairing prediction

Figure 6.

Assessing generalization and competitiveness on the PARA task

Figure 7.

Germline-based models achieve strong performance on a 7.2M-sequence dataset

Figure 8.

AlphaFold3 ipTM does not distinguish correct from incorrect VH/VL pairings

Early signals of thermostability in VH/VL pairing prediction models

Table 3.

Discussion and conclusion

Materials and Methods

Dataset

Table 4.

Germline-aware split

Figure 9.

Table 5.

Negative dataset preparation

Random pairing

Table 6.

Germline pairing

Table 7.

Table 8.

Model

Architecture

Figure 10.

Training

Table 9.

Test procedure

Figure 11.

Supplementary Material

Acknowledgments

Funding Statement

Nomenclature/Notation

Disclosure statement

Author contributions

Data availability statement

Supplementary material

References

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases