Abstract
Vaccines trigger an immune response that results in a population of memory cells that can quickly respond to subsequent antigen re-encounters. Most vaccines are designed to induce memory B cells with vaccine-specific B cell receptors (BCRs). Post-vaccination, clonal expansion of B cells results in measurably expanded vaccine-specific BCR clonotypes. We set out to determine to what extent it is predictable which specific BCR clonotypes are vaccine-induced in an individual. We sequenced the BCR heavy chain repertoire in a cohort of 19 individuals prior to and 7 days after Tdap booster vaccination. We tested two modalities to predict which clonotypes were expanded post-vaccination. First, we utilized a small database of monoclonal antibodies with known specificity to Tdap vaccine antigens and tested various sequence look-up methods, identifying clonal look-up as the best method. Second, we utilized a leave-one-out approach in which expanded clonotypes in one individual were predicted using data from other members of the cohort. The second approach significantly outperformed the first, indicating that BCR clonotype expansion can be learned across subjects. These results support the utility of systematically collecting BCR specificity data through efforts like the Immune Epitope Database and highlight the limitations on general prediction approaches resulting from relatively small dataset sizes for BCRs with known specificities. Additionally, our study provides 1) a comparison of several BCR specificity prediction methods, 2) a dataset that can be used for benchmarking of subsequent methods, and 3) a methodological framework for comparing BCR repertoires pre- and post-vaccination.
2. Introduction
B cell receptors (BCRs) are the membrane-bound form of antibodies, heterodimeric (two different chains, heavy and light) immune proteins that can bind to other molecules with high specificity and affinity. The totality of BCRs produced by an individual is referred to as their BCR repertoire. BCRs are produced by the recombination of highly polymorphic genes across three loci in humans (IGH, IGK and IGL). During this process, additional diversity is introduced through the insertion and deletion of junctional nucleotides, and subsequently, naïve BCRs undergo additional diversification through somatic hypermutation. Recombination can, in theory, produce more than 10^13 unique receptor sequences. In practice, for most diseases for which large cohorts have been studied, the repertoires include “public” BCR clonotypes found in multiple individuals, which suggests that the generation of BCRs is a directed process (1). Public clonotypes have been described in the “naïve” repertoire, as well as in the context of viruses such as Ebola virus (EBOV), Dengue, Influenza, SARS-CoV-2 and HIV, and for bacteria, e.g., for Haemophilus influenzae (2–6).
BCRs are related to one another in the language of clones or clonotypes, which have various literature definitions (7). The essence of these definitions is shared encoding genes and identity over the most variable complementarity determining regions (CDR), CDR3. This may be calculated for the variable VH or both VH and VL, where these are available. However, one might imagine that there are shared specificity-determining sequence motifs that might be common across multiple IGV genes, for example, or that for a particular binding mode, the CDRH3 may not contribute to specificity(8). Learning the rules of antigen- or epitope-specificity beyond the clonotype definition may significantly improve the sensitivity with which we can identify responding BCRs.
Several methods have been developed to predict antigen-specificity of antibodies. Most commonly, such models are only applied to their test data as proof-of-principle and it is not always clear that they provide a significant advantage over simple sequence identity - especially with training datasets which tend to be very limited in size in comparison to model complexity(9–11). Public training datasets are limited due to the time- and resource-intensiveness of generating new data points, meaning that they only approach reasonable sizes for a limited number of well-studied antigens, such as from SARS and Influenza viruses(12). Historically, most public data has been produced via X-ray crystallography of antibody-antigen complexes, and recently, cryogenic electron microscopy (cryoEM) has significantly accelerated the deposition of antibody-antigen complex structures in public databases from which epitope-level resolution is provided(3). The Immune Epitope Database (IEDB) has the advantage of curating additional receptor-epitope data generated through lower resolution methods like mutagenesis and HDX footprinting(13). Other databases, like CoV-AbDab (Coronaviruses) or CATNAP (HIV)(14),(15), report receptor-antigen level data. However, for many pathogens, dedicated public antibody databases do not exist.
Bordetella pertussis is an example of a pathogen for which there are limited publicly available sequences of antigen-specific antibodies(16). B. pertussis causes whooping cough, a respiratory infection responsible for an estimated 160,700 deaths annually in children younger than 5 worldwide (CDC). The most common B. pertussis vaccines in the 21st century are the pediatric and adult vaccines, DTaP and Tdap, which consist of several Pertussis antigens (Pertussis toxoid, filamentous hemagglutinin (FHA), pertactin and fimbriae proteins) in formulation with Diphtheria and Tetanus toxoids; these are referred to as “acellular Pertussis” or “aP” vaccines, and replaced earlier reactogenic whole-cell Pertussis or “wP” vaccine formulations. In the United States, DTaP is administered multiple times during the childhood vaccination series, and adults are recommended to receive Tdap (Tetanus-diphtheria-acellular Pertussis) booster vaccinations every 5–10 years as well as during pregnancy. Resurgence of Pertussis following the switch from wP to aP has prompted studies of waning immunity post aP vaccination, and differences in the immune system of individuals primed with aP or wP have been described, such as T cell polarization(17,18).
Here, we used samples from these studies to quantify how Tdap booster vaccination impacted the BCR repertoire and evaluated different methods for predicting clonotypes expanded following vaccination. We first devised a novel format for predicting expanded VH clonotypes sequenced by bulk heavy chain sequencing. We demonstrate the high specificity of a clonotype-based sequence identity method for identifying responding clonotypes following vaccination using publicly available monoclonal antibody data. We attempted to extend these predictions using more sophisticated models trained to classify vaccine-specific monoclonal antibodies; these models outperform simple baselines on a small database with limited diversity but provide limited additional value when applied to vaccinees’ repertoires. We subsequently trained leave-one-out (LOO) models on our cohort, which significantly outperformed sequence baselines on held-out individuals. Finally, we demonstrate that our process can distinguish Tdap-expanded vs. EBOV vaccination expanded clonotypes. We demonstrate that BCR clonotype expansion can be predicted, which opens the door to more sophisticated models that can drive insights into vaccine responses.
3. Methods
3.1. BCR sequencing of Tdap vaccinees
3.1.1. Individuals and blood samples
Nineteen individuals who received Tdap booster vaccination as part of our broader CMI-PB challenge were selected(17,19). Peripheral blood was taken on the day of vaccination (D0) and seven days post-vaccination (D7). Peripheral blood mononuclear cells (PBMCs) were isolated via Ficoll-Paque density centrifugation as described in Willemsen et al, 2025(17,19). This study was performed with approvals from the institutional review board at the La Jolla Institute for Immunology (protocol number VD-101). All participants provided written informed consent before donation.
3.1.2. Bulk heavy chain sequencing and bioinformatics
RNA was extracted with an RNeasy kit (QIAGEN). Sequencing libraries were prepared using Takara Bio’s human SMART-Seq kit with UMIs. Libraries were sequenced on a MiSeq. Immcantation’s SMART-seq presets were used for processing (20) and their default pipeline was used with the exception that Immcantation’s default V, D and J gene libraries (which are by default from IMGT) were replaced with the AIRR-C Human Reference Set, to reduce alignment biases caused by non-truncated entries, as well as downstream errors caused by duplicates and erroneously reported alleles (21). SMART-Seq minimizes misattribution of reads to samples (and therefore inflation of estimates of publicness) by unique dual indexing. To further minimize this effect, only UMIs supported by at least two reads (consensus count ≥ 2) were retained.
3.1.3. Clonotyping
To identify clonally-related BCRs, Immcantation’s DefineClones was used with the parameters: --act set (default), --mode gene (default), --sf cdr3 (non-default), --link single (default), --model aa (non-default) and --dist 0.9 (non-default) (20). This amounts to single-linkage clustering with a length-matched 90% CDRH3 identity threshold. While there are methods to learn appropriate CDRH3 identity thresholds on a per-sample basis, e.g. nearest neighbor distributions, we wanted to select a threshold that was constant across all samples and subjects. We selected 90% as intermediate in the typical range of 80% to 100%(7). Clonotyping was performed both on a per-individual level (pooling D0 and D7, to identify overlap) and pooling all samples of all individuals, to identify public clonotypes.
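The clustering rule described above can be illustrated with a minimal sketch. This is not DefineClones itself: it assumes that sequences are first grouped by V gene, J gene and CDRH3 length, ignores ambiguous gene calls, and uses a hypothetical function name and toy sequences.

```python
from collections import defaultdict


def cdr3_identity(a, b):
    """Fraction of matching positions between two equal-length CDRH3s."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def single_linkage_clonotypes(records, min_identity=0.9):
    """records: list of (v_gene, j_gene, cdrh3_aa). Returns one cluster id per record."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Assumption: only sequences sharing V gene, J gene and CDRH3 length are compared.
    groups = defaultdict(list)
    for idx, (v, j, cdr3) in enumerate(records):
        groups[(v, j, len(cdr3))].append(idx)

    # Single linkage: one sufficiently similar pair is enough to merge two clusters.
    for members in groups.values():
        for pos, i in enumerate(members):
            for k in members[pos + 1:]:
                if cdr3_identity(records[i][2], records[k][2]) >= min_identity:
                    parent[find(i)] = find(k)

    return [find(i) for i in range(len(records))]


toy = [
    ("IGHV4-39", "IGHJ5", "ARDYYGMDVW"),
    ("IGHV4-39", "IGHJ5", "ARDYYGLDVW"),  # 90% identical to the first
    ("IGHV4-39", "IGHJ5", "ARDFYGLDVW"),  # 90% to the second, 80% to the first
    ("IGHV1-69", "IGHJ5", "ARDYYGMDVW"),  # same CDRH3, different V gene
]
print(single_linkage_clonotypes(toy))
```

In the toy data the third sequence joins the clonotype through the second one even though it is only 80% identical to the first, which is the chaining behaviour expected of single linkage.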
3.1.4. Germline reversion
Germline reversion refers to reversion of somatically hypermutated residues to an estimate of the corresponding germline-encoded residue. The germline-encoded residue is most often predicted as the aligned residue in the sequence of the assigned IGV or IGJ gene. Our purpose was to match our receptor data to a monoclonal antibody database for which J gene inferences were not available. Germline-reverted sequences were therefore constructed as the concatenation of the amino acid sequence of the *01 allele from IMGT (IMGT positions 1–104) and the CDRH3. The resulting sequence is missing IMGT positions 118–128 (FWR4)(22).
3.2. Clonotype expansion prediction
3.2.1. Tdap vaccinee database (tdap-vaccinee-db)
We devised a novel prediction set-up, in which we aim to predict for a given subject which clonotypes are expanded post-vaccination. In practice, this amounts to prediction of whether a clonotype was observed at baseline (assigned a label of 0) or was expanded post-boost (1).
We define clonotype size as the sum of UMIs of a given clonotype (clonotype $i$ of $C$, $C_i$) at D0 or D7 ($C_{i,0}$, $C_{i,7}$). Expansion is defined with respect to the clonotype size distribution of a given subject at D7 (Cx,7,x): we define a clonotype as “expanded” if it is within the largest 10% of clonotypes by size at D7, is at least five-fold greater at D7 than D0, and is not in the top 10% of clonotypes at D0. This is the positive class ($Y_i = 1$). Smaller clonotypes containing D7 sequences are discarded from the prediction pipeline. All D0 clonotypes, excepting those later observed at D7, are labelled negative ($Y_i = 0$). This annotated data set is called tdap-vaccinee-db.
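The labelling rule can be sketched as follows. This is an illustrative reading of the definition, not the pipeline used in the study: it assumes sizes are normalized to repertoire frequencies, that the "top 10%" is taken by rank with ties included, and that clonotypes absent at D0 pass the fold-change test; all names and toy counts are hypothetical.

```python
def top_cutoff(values, top_percent=10):
    """Smallest value that still falls within the largest `top_percent` of values."""
    ranked = sorted(values, reverse=True)
    k = max(1, -(-len(ranked) * top_percent // 100))  # ceiling division
    return ranked[k - 1]


def label_clonotypes(umis, top_percent=10, min_fold=5.0):
    """umis: {clonotype_id: (d0_umis, d7_umis)} for one subject.

    Returns {clonotype_id: 1 (expanded), 0 (baseline) or None (discarded)}.
    Assumes the subject has at least one UMI at each time point.
    """
    total0 = sum(d0 for d0, _ in umis.values())
    total7 = sum(d7 for _, d7 in umis.values())
    freq0 = {c: d0 / total0 for c, (d0, _) in umis.items()}
    freq7 = {c: d7 / total7 for c, (_, d7) in umis.items()}
    cut0 = top_cutoff([f for f in freq0.values() if f > 0], top_percent)
    cut7 = top_cutoff([f for f in freq7.values() if f > 0], top_percent)

    labels = {}
    for c in umis:
        if freq7[c] == 0:
            labels[c] = 0  # observed only at D0: negative class
            continue
        # Assumption: a clonotype unseen at D0 has an unbounded fold change.
        fold = freq7[c] / freq0[c] if freq0[c] > 0 else float("inf")
        expanded = freq7[c] >= cut7 and fold >= min_fold and freq0[c] < cut0
        labels[c] = 1 if expanded else None  # None: has D7 reads but is not expanded
    return labels


toy = {"new": (0, 60), "grew": (1, 30), "stable": (40, 45), "minor": (0, 2)}
toy.update({f"tiny{i}": (0, 1) for i in range(26)})
toy.update({f"base{i}": (2, 0) for i in range(18)})
labels = label_clonotypes(toy)
print({c: labels[c] for c in ("new", "grew", "stable", "minor", "base0")})
```

Here "stable" is large at D7 but was already large at D0, so it is discarded rather than labelled positive.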
Each clonotype, $i$, contains varying numbers of unique V(D)J sequences, $x_{n,i}$. In practice, a model could either take a consensus of these sequences or operate on unique sequences and pool predictions. We trained models that take as input some representation of a unique sequence ($x_{n,i}$), and either learn some function to map this to the label of the parent clonotype, $Y_i$, or use a function from another task as a zero-shot predictor of $Y_i$. The final prediction for $C_i$ is the average of predictions for the amino acid sequence representatives. We selected the average as this is unbiased with respect to the label. Pooling operations like maximum pooling favor random methods, as the expected maximum value from $n$ samples of $U(0,1)$ is $n/(n+1)$ and is therefore directly related to the size of the clonotype.
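The pooling argument can be checked numerically. The following is a small simulation sketch (function names are illustrative) in which a random predictor assigns each sequence an independent U(0,1) score; it compares the average pooled score under maximum and mean pooling with the analytical value n/(n+1).

```python
import random


def mean_pool(scores):
    return sum(scores) / len(scores)


def expected_pooled_score(n_seqs, pool, trials=20000, seed=0):
    """Monte Carlo estimate of the pooled score of a random U(0,1) predictor."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += pool([rng.random() for _ in range(n_seqs)])
    return total / trials


for n in (1, 5, 20):
    print(
        f"n={n:2d}",
        f"max={expected_pooled_score(n, max):.3f}",
        f"mean={expected_pooled_score(n, mean_pool):.3f}",
        f"n/(n+1)={n / (n + 1):.3f}",
    )
```

Under maximum pooling the random score grows with clonotype size, so larger clonotypes are ranked higher regardless of sequence content; under mean pooling the expectation stays at 0.5 for every size.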
We describe performance using the ROC curve via the ROC-AUC, with particular emphasis on the early retrieval region, with a maximum FPR of 0.1 (ROC-AUC_0.1)(23).
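As an illustration of the early-retrieval metric, the sketch below computes a partial ROC-AUC up to a maximum FPR and rescales it so that a random predictor scores 0.5. The rescaling is an assumption on our part (the McClish standardization, as applied by scikit-learn's `roc_auc_score` when `max_fpr` is set); the function names are hypothetical and both classes are assumed to be present.

```python
def roc_points(labels, scores):
    """ROC curve as (fpr, tpr) points, one per distinct score threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only after the last item of each group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


def partial_auc(labels, scores, max_fpr=0.1):
    """Trapezoidal area under the ROC curve for FPR in [0, max_fpr]."""
    area = 0.0
    pts = roc_points(labels, scores)
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:  # clip the segment at max_fpr by linear interpolation
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2
    return area


def standardized_partial_auc(labels, scores, max_fpr=0.1):
    """Assumed McClish rescaling: 0.5 for a random predictor, 1.0 for a perfect one."""
    lo, hi = 0.5 * max_fpr ** 2, max_fpr
    return 0.5 * (1 + (partial_auc(labels, scores, max_fpr) - lo) / (hi - lo))


labels = [1, 1, 0, 0, 0, 0]
print(standardized_partial_auc(labels, [0.9, 0.8, 0.3, 0.2, 0.1, 0.0]))  # perfect ranking
print(standardized_partial_auc(labels, [0.5] * 6))                        # uninformative
```

Under this rescaling, a predictor that ranks negatives first in the early-retrieval region scores below 0.5, which is how a value slightly under 0.5 can arise.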
3.2.2. Baselines
Well-motivated baselines are important for prediction, both to establish that a model has additional value over simpler predictors and for interpreting the more complex model. Sequence identity to positive training instances is one baseline, motivated by the hypothesis that highly sequence-similar BCRs should share specificity.
There is no single established sequence identity baseline that is best for antigen-specificity prediction. The simplest is sequence identity over equivalent IMGT positions in VHs (VH identity), CDRs (CDR identity) or CDRH3s (CDRH3 identity). In addition, we used a sequence-identity method called CloneSearch(3), which is equivalent to a CDRH3 identity calculation but only to positive instances with the same IGHV. This has previously been shown to identify multiple expanded clonotypes following vaccination(3).
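A clonal look-up of this kind can be sketched in a few lines. This is not the CloneSearch tool; it is a simplified stand-in that assumes the score is the maximum length-matched CDRH3 amino acid identity to any database entry with the same IGHV gene, with hypothetical names and toy sequences.

```python
def cdrh3_identity(a, b):
    """Length-matched identity: 0 if lengths differ, else the fraction of equal positions."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def clone_lookup_score(query, database):
    """query: (ighv, cdrh3); database: iterable of (ighv, cdrh3) positives.

    Returns the best CDRH3 identity among entries sharing the query's IGHV gene,
    or 0.0 if no entry shares it.
    """
    ighv, cdrh3 = query
    return max(
        (cdrh3_identity(cdrh3, ref_cdrh3)
         for ref_ighv, ref_cdrh3 in database if ref_ighv == ighv),
        default=0.0,
    )


def is_hit(query, database, threshold=0.9):
    """Binary label at a chosen identity threshold."""
    return clone_lookup_score(query, database) >= threshold


db = [("IGHV4-39", "ARDYYGMDVW"), ("IGHV1-69", "ARGGYSSGWY")]
print(clone_lookup_score(("IGHV4-39", "ARDYYGLDVW"), db))  # one mismatch in ten positions
print(is_hit(("IGHV3-23", "ARDYYGMDVW"), db))              # identical CDRH3, wrong IGHV
```

Restricting the comparison to entries with the same IGHV gene is what distinguishes this look-up from a plain CDRH3 identity baseline.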
Many BCRs expanded at day 7 following vaccination are expected to belong to plasmablasts derived from memory B cells, which are expected to have more mutated BCRs. Mutation from the IGHV germline is therefore a potentially powerful predictor. To quantify this, we use the fractional identity over the IGHV gene alignment, ‘v identity’, as calculated by IgBLAST/Immcantation.
Another biologically-motivated baseline is IGHV representation. Specific IGHVs are overrepresented in positive vs. negative clonotypes. We call this the IGHV ratio, which is calculated on training instances as the average frequency of the IGHV in positive-labelled instances divided by the average frequency of the IGHV in negative-labelled instances. This is motivated by the literature suggestion of responding IGHVs, e.g., IGHV1–69(24).
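A minimal sketch of such a ratio is shown below. It simplifies the description by pooling all training instances rather than averaging per subject, and it adds a pseudocount so that genes absent from one class do not produce a division by zero; both choices, and the function name, are assumptions made for illustration.

```python
from collections import Counter


def ighv_ratio(ighv_genes, labels, pseudocount=1.0):
    """Per-gene ratio of frequency among positives to frequency among negatives.

    ighv_genes and labels are parallel lists; labels are 1 (expanded) or 0 (baseline).
    The pseudocount is an illustrative smoothing choice, not taken from the study.
    """
    pos = Counter(g for g, y in zip(ighv_genes, labels) if y == 1)
    neg = Counter(g for g, y in zip(ighv_genes, labels) if y == 0)
    genes = set(pos) | set(neg)
    n_pos = sum(pos.values()) + pseudocount * len(genes)
    n_neg = sum(neg.values()) + pseudocount * len(genes)
    return {
        g: ((pos[g] + pseudocount) / n_pos) / ((neg[g] + pseudocount) / n_neg)
        for g in genes
    }


genes = ["IGHV1-69"] * 3 + ["IGHV3-23"] + ["IGHV1-69"] + ["IGHV3-23"] * 3
labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(ighv_ratio(genes, labels))
```

A ratio above 1 marks a gene that is enriched among expanded clonotypes in the training data; the score for a new clonotype is then simply the ratio of its assigned IGHV gene.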
We combined all baselines into a single “Biological Baseline” (B), which combines the CloneSearch score (C) with the Mutation (M) and IGHV ratio (I) scores and whose only hyperparameter is the CloneSearch score threshold, t (Equation 1). The threshold t was set to 0.7, selected by cross-validation to maximize ROC-AUC_0.1.
3.2.3. Vaccine monoclonal database (tdap-monoclonal-db)
We curated a database of monoclonal antibodies that are identified in the literature as binding to a subset of the vaccine antigens, Tetanus toxoid (TT) and Pertussis toxoid (PT). Most TT antibody sequences were provided by J. Galson as curated from seven studies (25–32). Most PT sequences were derived from PTx vaccination of the Kymouse (16). These were the “Positives” in our vaccine database. We used the labelled human subset of the Coronavirus antibody database, CoV-AbDab, as a “background” in initial analyses (14). This is referred to as the CoV database. The CoV database is an order of magnitude larger than the combined PT and TT database.
To train antigen-specificity classifiers on the PT- and TT databases, we used antibodies from CoV-AbDab as negatives, and supplemented the negative class with miscellaneous human antibodies from the Immune Epitope Database (IEDB) as well as Ebolavirus antibodies (3,13). Unfortunately, most TT antibodies only had IGHV and CDRH3 information reported in their source publications; to enable the use of extended featurizations, we reconstructed all sequences by concatenating the sequence of the *01 allele (as reported by IMGT, due to its historical use) and the CDRH3 (Section 3.1.4). IGHJ assignment is inaccurate on amino acid sequences (data not shown), so we did not attempt to reconstruct FWR4. As a result, we do not have information about mutations in these antibodies, and all antibodies in the database therefore have germline-encoded IGHV regions.
We made this database non-redundant at the 90% CDR identity level and verified that no sequences with a CloneSearch score ≥ 0.8 were shared between any train, validation and test folds. We term this dataset tdap-monoclonal-db. To train models, we held out 20% of the data; with the remaining 80%, we performed five-fold cross-validation. For each cross-validation fold, we trained five models on different splits of the training data; the final test prediction was the mean of these predictions. The cross-validation was used to select hyperparameters by maximizing validation ROC-AUC_0.1, and then final models were trained on the full cross-validation data and tested on the hold-out 20%.
3.2.4. Leave-one-out models trained directly on BCR sequencing data
For results reported in Section 4.4 and Section 4.5, we performed all hyperparameter selection via leave-one-out cross-validation on 14 subjects (again with hyperparameter selection maximizing ROC-AUC_0.1 in validation), holding out 5 subjects for testing. The final test model was trained as a LOO-averaged ensemble model (i.e. 14 different models, each trained on 13/14 subjects), and performance metrics were calculated for each of the five test subjects.
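The leave-one-out ensemble scheme can be sketched as below. The training function is a deliberately trivial, hypothetical stand-in (it predicts the pooled positive rate of its training subjects), and the subject identifiers and labels are toy values; only the splitting and averaging logic is the point.

```python
def loo_folds(subjects):
    """Yield (training_subjects, held_out_subject) pairs, one per subject."""
    for held_out in subjects:
        yield [s for s in subjects if s != held_out], held_out


def train_toy_model(train_subjects, labels_by_subject):
    """Hypothetical stand-in for real training: predicts the pooled positive rate."""
    labels = [y for s in train_subjects for y in labels_by_subject[s]]
    rate = sum(labels) / len(labels)
    return lambda sequence: rate


def ensemble_predict(models, sequence):
    """Average the predictions of all fold models."""
    return sum(m(sequence) for m in models) / len(models)


subjects = [f"S{i:02d}" for i in range(1, 20)]           # 19 subjects
dev, test = subjects[:14], subjects[14:]                 # 14 for LOO, 5 held out
labels_by_subject = {s: [1, 0, 0, 0] for s in subjects}  # toy labels

models = [train_toy_model(train, labels_by_subject) for train, _ in loo_folds(dev)]
print(len(models), ensemble_predict(models, "ARDYYGMDVW"))
```

Each of the 14 fold models sees 13 development subjects, and the five test subjects are never used for training or model selection.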
To identify models that could outperform the trivial sequence homology baseline, we removed sequences that had more than 90% CDR identity or a CloneSearch score of more than 0.8 between any pair of training, validation, or test splits.
3.2.5. Model architectures
We explored a non-exhaustive variety of architectures inspired by the literature, using different parts of the VH sequence (CDRs, CDRH3, VH), annotations like the IGHV genes, and different encodings including one-hot, BLOSUM and protein language model (PLM) embeddings.
All models were reimplemented in PyTorch according to literature descriptions and trained with the same class-weighted binary cross-entropy loss, with early stopping according to validation ROC-AUC_0.1 and a patience of 50 epochs. For protein language models, all sequences were encoded as the full VH, and final layer representations were average pooled across the relevant residues (e.g., CDR, CDRH3 or VH residues).
3.2.6. Non-protein language models
For non-PLMs, we used three models inspired by the literature of classification on receptor sequences: NetTCR, TCRAI and a classifier using three separate encoder-only transformers (one per CDR).
NetTCR is a CNN-based family of models with original application in TCR-pMHC specificity prediction; here, the input is CDRH1, CDRH2 and CDRH3. These are BLOSUM50-encoded and passed through a series of convolutional layers, whose outputs are concatenated and passed through a final classification layer(s) (33). Here, we refer to this method as “NetBCR”.
TCRAI is also CNN-based, operating on the CDRH3, which is one-hot encoded; this is combined with an encoding layer to encode the IGHV gene label. Again, the original application is in TCR specificity prediction; here we refer to it as BCR-AI (34).
The final model architecture was based on an encoder-only transformer model originally purposed for multi-label BCR epitope classification on SARS-CoV-2 antibodies(10). The model tokenizes CDRs; each CDR is embedded, encoded by a transformer and the resultant representations are concatenated and passed through fully-connected layers for subsequent classification. We refer to this model as “CDRs_Transformer”. Models are summarized in Table 1.
Table 1: Summary of models used for classification in both tdap-vaccinee-db and tdap-monoclonal-db.
| Model name | Description | Reference |
|---|---|---|
| AbLang [VH, CDRH3, CDRs, V_CDRH3] | VH encoded using AbLang2; either the full VH, CDRH3 or CDR tokens are averaged to produce an L x 480 embedding. For the V_CDRH3 models, there is an additional encoding layer for the IGHV gene name label. Passed to 1–2 FCs. | Olsen et al, 2024 (35) |
| ESM [VH, CDRH3, CDRs, V_CDRH3] | VH encoded using ESM2-t33–650M; either the full VH, CDRH3 or CDR tokens are averaged to produce an L x 1280 embedding. For the V_CDRH3 models, there is an additional encoding layer for the IGHV gene name label. Passed to 1–2 FCs. | Lin et al, 2023 (36) |
| Net“BCR” | CDRs are BLOSUM50 encoded and passed through a series of CNNs with varying kernel size. These are concatenated across individual CDRs and passed to a FC layer. | Jensen & Nielsen 2024 (33) |
| “BCR”AI | IGHV gene labels are passed to an encoding layer. CDRH3 is one-hot encoded and passed through CNN. IGHV and CDRH3 representations are concatenated prior to FC layer classification. | Zhang et al, 2023 (37) |
| CDR Transformer | Each CDR type is passed through a separate basic transformer encoder with sinusoidal positional embedding. The outputs are concatenated prior to FC layer classification. | Wang et al, 2022 (9,10) |
3.2.7. Protein language models
We used an antibody-specific language model, AbLang2, as well as the general protein language model ESM2 (35,36). For CDR, CDRH3 and VH models, final layer representations are averaged across the relevant residues, and this average-pooled representation is passed to fully connected classification layer(s). We also created a V_CDRH3 model, which benefits from an additional IGHV encoding layer that is added to the representation.
In addition to the classifiers, we also explored a simple cosine look-up in the average-pooled representation, and a K-nearest neighbors (KNN) classifier based on these cosine similarities. Through cross-validation, we selected the KNN hyperparameters as 100 neighbors with distance-weighted cosine similarity. This was implemented via scikit-learn’s KNeighborsClassifier(38).
4. Results
4.1. Seven days post-Tdap booster, BCR repertoires are characterized by somatic hypermutation, class-switching, and increases in clonality
We sequenced bulk VH sequences from 19 subjects at day 0 (D0) and seven days following booster vaccination with Tdap (D7). We retrieved on average 32,059±4,002 total unique VH sequences at D0 and 56,433±10,704 total unique sequences at D7 per sample, which we clonotyped into 2,491±85 and 2,049±84 clonotypes, respectively. Sequence and clonotype numbers by isotype can be found in Table S1. This is the first comprehensive dataset on antibody sequences pre- and post-Tdap booster vaccination.
At D7, cell frequency data indicate that there is an influx of plasmablasts; this is reflected in the BCR sequencing data in a significant increase in clonality (mean Gini index increases from 0.77±0.01 to 0.87±0.02, p = 3.8e-6 via Wilcoxon signed-rank test; Figure 1A) and class-switching from the baseline repertoire (0.70±0.10 to 0.92±0.03, p = 3.8e-6; Figure 1B). As this is a booster vaccination, this constitutes recall of the memory response, and therefore we also note a significant reduction in the average IGHV identity in class-switched BCRs (0.93±0.004 vs. 0.92±0.004; Figure 1C) and a significant increase in average CDRH3 length from D0 to D7 repertoires (14.9±0.2 vs. 15.3±0.2; Figure 1D). Notably, these repertoire-wide features can be used to separate D0 and D7 repertoires with AUC ranging from 0.75 (using average CDRH3 length) to a near-perfect 0.98 (Gini index) (Figure 1E), illustrating that there are clear signatures separating baseline from D7 repertoires. We also noted significant changes in the frequency of nine IGHV genes (Figure S1).
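For readers unfamiliar with the clonality measure, the sketch below computes a Gini index over clonotype sizes. The exact formula used in the study is not stated here, so this rank-weighted form is an assumption; it returns 0 when all clonotypes have equal size and approaches 1 when a single clonotype dominates.

```python
def gini_index(sizes):
    """Gini index of clonotype sizes (e.g. UMI counts per clonotype)."""
    xs = sorted(sizes)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one clonotype with non-zero size")
    # Rank-weighted form: sum_i (2i - n - 1) * x_i / (n * sum(x)), with i starting at 1.
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


print(gini_index([5, 5, 5, 5]))      # even repertoire
print(gini_index([1, 1, 1, 97]))     # one dominant clonotype
```

An expanded plasmablast response concentrates reads in a few clonotypes, which pushes the index upwards.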
Figure 1: The B cell repertoire showed a significant increase in clonality (p = 3.8e-6) (A) and was dominated by class-switched (p = 3.8e-6) (B), somatically mutated (p = 5.3e-5) (C) BCRs following booster vaccination, with a significant increase in CDRH3 length (p = 2e-3) (D). These repertoire statistics can be used to separate D0 and D7 repertoires with an AUC of at least 0.75, demonstrating that D0 and D7 repertoires have clear signatures (E). (N.B. p-values equal to 3.8e-6 correspond to the default p-value returned for Wilcoxon statistic = 0 for 19 paired samples in scipy.stats.wilcoxon.)
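The value 3.8e-6 follows from the exact null distribution of the signed-rank statistic: with 19 untied, non-zero paired differences there are 2^19 equally likely sign assignments, and a statistic of 0 occurs only when all differences share one sign, giving a two-sided p-value of 2/2^19. The sketch below reproduces this arithmetic without calling SciPy; the function name is illustrative.

```python
def signed_rank_exact_p(n, w_plus):
    """Two-sided exact p-value of the Wilcoxon signed-rank test.

    n: number of untied, non-zero paired differences.
    w_plus: sum of the ranks of the positive differences.
    """
    max_sum = n * (n + 1) // 2
    # counts[s] = number of sign assignments whose positive ranks sum to s
    counts = [0] * (max_sum + 1)
    counts[0] = 1
    for rank in range(1, n + 1):
        for s in range(max_sum, rank - 1, -1):
            counts[s] += counts[s - rank]
    stat = min(w_plus, max_sum - w_plus)
    tail = sum(counts[: stat + 1])
    return min(1.0, 2 * tail / 2 ** n)


p = signed_rank_exact_p(19, 0)
print(f"{p:.1e}")
```

This is the smallest two-sided p-value attainable with 19 pairs, so it indicates that every subject changed in the same direction rather than conveying the size of the change.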
We noted no significant difference in any of these repertoire descriptors between aP- and wP-primed individuals. aP-primed individuals had a higher median proportion of class-switched reads, but this was non-significant (p = 0.26), and while two genes showed significantly higher fold-changes in wP vs. aP (IGHV3–20) and vice-versa (IGHV4–4) pre-correction, these were not significant after correction (Figure S2).
Interestingly, we noted a significant difference in the fold-change in the proportion of IgE reads following booster vaccination, with aP individuals having a significantly higher fold change (6.96±4.98 vs. 0.85±0.64, p = 0.006) (Figure S3). Thus, while we found no differences in the use of variable receptor regions between BCRs of aP and wP primed individuals, we did find a difference in isotype usage.
Overall, our analyses indicated that simple metrics like class-switching, mutation, IGHV usage, and CDRH3 length can distinguish D7 post-boost repertoires from baseline repertoires across all individuals, and that an increase in IgE production is associated with aP priming.
4.2. Defining vaccine-expanded clonotypes
Our definition of a vaccine-expanded clonotype is intended to generalize across datasets:
(1) Expansion is defined by clonotype size (number of UMIs). Different samples will have different total numbers of UMIs (here, 3,230 to 85,449; 28,067±5,505). To standardize across samples of different depth, we define size as proportional to the entire repertoire and consider the percentile in the clonotype size distribution.
(2) Clonotypes that are observed at D0 at comparable frequency to post-vaccination cannot be considered vaccine-expanded.
We subset the data to class-switched (IgG, IgA, IgE) reads. Normalized clonotype size follows a power law with respect to percentile in clonotype size distribution at both D0 and D7 (Figure 2A). One method for selecting a threshold to define expansion is to use the point at which this curve flattens into the tail; by inspection, we estimated a threshold of 10% as appropriate. The first part of the definition (1) was therefore to label clonotypes as expanded if they were within the 10% largest clonotypes at D7. We subsequently looked at the fold-change distribution from D0 to D7 in this subset and selected a fold-change threshold of 5 to satisfy the second part of the definition. Finally, we removed the small number of clonotypes that were in the top 10% of clonotypes at both time points.
Figure 2: Expanded clonotypes were defined as clonotypes within the largest 10% of the clonotype size distribution at D7, with at least a 5-fold increase in frequency over D0. Clonotypes that were also within the top 10% of clonotypes at D0 were excluded. The 10% threshold corresponds to the point at which the power curve relating percentile in the clone size distribution to proportion of reads flattens; the by-eye selection of 10% was validated by identifying the average percentile at which the derivative of this curve approaches zero to 2 decimal places (9±1%) (A). Clonotypes that meet the D7 frequency and fold-change cut-offs are labelled as positives (green dashed lines); clonotypes with D7 sequences that do not meet these criteria are excluded, and D0-only clonotypes are labelled as negatives (B) to create our labelled dataset, tdap-vaccinee-db.
Based on these definitions, we created a dataset of 84,076 clonotypes in our 19 subjects; each has a label of 0 (D0 clonotype, not observed at D7) or a label of 1 (within the top 10% of clonotypes at D7; if observed at D0, occurs at a frequency at least 5x greater at D7 than at D0, and not within the top 10% of clonotypes at D0) (Figure 2B). Clonotypes that satisfy neither condition were removed. There are 75,793 total negative clonotypes (mean 3,999; 1,462 – 6,526 across subjects, labelled as 0) and 8,103 total positive clonotypes (mean 427; 281 – 630, labelled as 1). This dataset is referred to as tdap-vaccinee-db. We subsequently explore multiple approaches for distinguishing vaccine-expanded clonotypes from baseline clonotypes with maximum sensitivity and specificity.
As described in Methods, we make predictions on a per-clonotype basis where the clonotype score is the average of the scores for the constituent V(D)J amino acid sequences. Given that we noted that post-boost repertoires had higher average mutation scores (p = 3.8e-7) and longer CDRH3 lengths (p = 0.002), we examined the predictive ability of these features. We demonstrated that mutation is an important feature for discriminating the D7 expanded vs baseline clonotypes, with a total ROC-AUC of 0.65; however, mutation has sub-random performance in the early-retrieval part of the ROC curve with an AUC of 0.495. CDRH3 length has a total ROC-AUC of only 0.51 and ROC-AUC_0.1 of 0.503 (Figure S4). This revealed mutation but not CDRH3 length to be an important feature, informing our construction of a baseline in Section 4.4.
4.3. Using small monoclonal antibody databases accurately identifies a small minority of vaccine-induced clonotypes
The most widely used method for comparing BCR sequences uses the clonal definition (shared encoding gene(s) and amino acid identity in the CDR3). To apply it, we used CloneSearch(3), which identifies BCRs that have the same IGV (and optionally IGJ) and calculates a length-matched amino acid identity in the CDRH3; this can either provide a binary label by supplying a threshold (usually 70–100%) or one can use the mean or maximum CloneSearch score to the predictor database. We previously used it to identify vaccine-specific clonotypes following EBOV vaccination, using a database of Ebolavirus antibodies compiled from the literature(3). Here, we compiled a database of mAbs specific to TT or PT (tdap-monoclonal-db; Methods 3.2.3) and used it as the predictor database against our repertoires. Up to 7.8% of expanded clonotypes were hits to this predictor database via CloneSearch using the laxest 70% CDRH3 identity threshold. At the more conservative 90% threshold, the maximum value dropped to 1.8% of clonotypes (Figure 3A). On average, CloneSearch annotated 2.3%, 1.1%, 0.3% and 0.1% of clonotypes at the 70%, 80%, 90% and 100% thresholds, respectively.
Figure 3: Up to 7.8% of expanded clonotypes were hits to tdap-monoclonal-db via CloneSearch using standard thresholds (A). At 90%, each subject had at least one “hit” clonotype (marked as a white box for presence, black for absence), predicted by one of 27 mAbs, encoded by eight IGHV genes. IGHV4–39 mAbs stood out for their publicness; we identify these mAbs as the “TT-01 clonotype” (B). We calculated corresponding FPRs for each threshold in the range 50–100%, establishing that there is no significant FPR for CloneSearch above the 90% CDRH3 threshold (C). We contextualize performance using tdap-monoclonal-db (blue) against a CoV database (orange), plotting the ROC curves for each subject (D, E). E is a zoomed-in version of D focused on the early retrieval area. We note above-random performance of tdap-monoclonal-db in contrast to CoV, most notably in the very early retrieval area, the range of which corresponds well to the standard thresholds used (grey dashed line; calculated as average FPR at 70%) (D, E). Despite excellent specificity in this range, the very limited sensitivity of the method results in only slightly better than random ROC-AUC and ROC-AUC_0.1 (F). However, this is significantly better than random (p = 6.45e-4 and 3.81e-6 for AUC and AUC_0.1 respectively), and significantly better than the CoV database (p = 1.62e-3 and 3.78e-7 respectively).
Hits to the vaccine database at the 90% threshold derived from 8 IGHV genes. The most public CloneSearch hit, an IGHV4–39/IGHJ5 anti-TT mAb was identified among the expanded D7 clonotypes in 16/19 subjects (Figure 3B). This IGHV4–39/IGHJ5 clonotype of mAbs has been discovered in two separate studies (27,30) as well as appearing in single-cell sequencing data of a single Tdap vaccinee by Khatri and colleagues(39). We refer to this clonotype as the “TT-01 clonotype”.
Most hits were to the TT part of tdap-monoclonal-db (Figure S5), and this part of the database provided the best classification of D7 vs D0 repertoires, improving upon the repertoire classification based on other repertoire metrics (Figure S6). The greater hit rate to the TT than PT database holds despite comparable numbers of unique mAb sequences (306 unique IGHV + CDRH3 combinations for PT, vs. 311 for TT). This may reflect either a higher frequency of TT-specific than PT-specific plasma cells, or a feature of database quality. The PT database is largely derived from a human transgenic mouse model; IGHV gene usage differences are expected, and the average CDRH3 length is significantly shorter in the PT database than either the TT database or the D7 expanded clonotypes (average 14.7 aa vs. 15.5 aa vs. 15.6 aa respectively). The lower hit rate to the PT database may reflect these differences.
To describe how well this method performed at distinguishing vaccine-expanded from baseline clonotypes at the common thresholds used for this method, we calculated the FPR in addition to the TPR at 70, 80, 90 and 100% (Figure 3C). The FPR in all subjects was 0.0 at a threshold of 100%, and zero in 18/19 subjects at 90%, i.e. up until 90%, there are effectively no false positives and up to 1.8% of true positives are captured. This reveals that using CloneSearch with a stringent threshold (90–100%) is a highly specific method for identifying vaccine-expanded clonotypes, with limited sensitivity.
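The threshold-wise evaluation can be sketched as follows. This is a minimal illustration under the assumption that each clonotype carries a look-up score in [0, 1] and a binary label (1 for an expanded D7 clonotype, 0 for a baseline clonotype); the function name and the toy numbers are hypothetical and are not taken from the study data.

```python
from typing import Sequence, Tuple


def rates_at_threshold(scores: Sequence[float], labels: Sequence[int],
                       threshold: float) -> Tuple[float, float]:
    """Return (TPR, FPR) when every clonotype with score >= threshold is called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    tpr = tp / n_pos if n_pos else 0.0
    fpr = fp / n_neg if n_neg else 0.0
    return tpr, fpr


# Toy scores: four expanded (label 1) and four baseline (label 0) clonotypes.
scores = [1.0, 0.9, 0.75, 0.4, 0.8, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
for threshold in (0.7, 0.8, 0.9, 1.0):
    tpr, fpr = rates_at_threshold(scores, labels, threshold)
    print(f"threshold={threshold:.1f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

In this toy example the stricter thresholds remove the single false positive at the cost of sensitivity, which mirrors the trade-off reported for the 90–100% range.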
To compare the predictive ability of this method across the full range of thresholds, we calculated the ROC curve and corresponding AUCs and compared this with a human Coronavirus antibody database (derived from CoV-AbDab(14), referred to as CoV) ROC curve for each subject (Figure 3D, 3E, 3F). The ROC curve demonstrates that CloneSearch using tdap-monoclonal-db has above-random performance but with limited sensitivity which results in a median ROC-AUC of just 0.52 and ROC-AUC_0.1 of just 0.51.
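For readers unfamiliar with the early-retrieval metric, the sketch below computes a ROC curve, its full AUC, and a partial AUC restricted to FPR ≤ 0.1. It assumes that ROC-AUC_0.1 denotes the standardized partial AUC (McClish correction), for which a random predictor scores 0.5; this is consistent with the random values quoted in the text, but the exact definition is given in the Methods and is not shown here. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (FPR, TPR)


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Point]:
    """ROC points from high to low score; tied scores are added as one step.
    Both classes must be present in `labels`."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            tp += pairs[j][1]
            fp += 1 - pairs[j][1]
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


def auc(points: Sequence[Point], max_fpr: float = 1.0) -> float:
    """Trapezoidal area under the ROC curve up to max_fpr (linear interpolation)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2
    return area


def standardized_partial_auc(points: Sequence[Point], max_fpr: float = 0.1) -> float:
    """McClish-standardized partial AUC: 0.5 for a random predictor, 1.0 for a perfect one."""
    raw = auc(points, max_fpr)
    random_area = 0.5 * max_fpr ** 2
    perfect_area = max_fpr
    return 0.5 * (1 + (raw - random_area) / (perfect_area - random_area))


pts = roc_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
print(auc(pts), standardized_partial_auc(pts))
```

A look-up method with very few, very specific hits can score well on the partial metric while its full ROC-AUC stays close to 0.5, which is the pattern described for CloneSearch.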
In addition to expansion prediction, we could correlate CloneSearch predictions with clonotype publicness (Figure S7). As noted, the TT-01 clonotype was observed within the expanded D7 clonotypes of 16/19 subjects (Figure 3B). Considering all D7 clonotypes (not just the expanded ones), the TT-01 clonotype was observed in a further subject; in all cases the clonotype was a 100% CDRH3 hit to an entry in the TT database. This clonotype was the second most public observed; the most public clonotype was not a hit to either TT or PT components of tdap-monoclonal-db. Again, average TT/PT-CloneSearch had non-random ROC-AUCs for the prediction of publicness.
Using our existing antigen-specificity predictor (CloneSearch) with common thresholds (70–100%) is therefore a highly specific (low FPR in operational range) but low-sensitivity tool for predicting clonotype expansion following vaccination. We would expect sensitivity to increase with the size and diversity of the predictor database; given limited data, we then asked whether we could improve this via more sophisticated look-up methods or training classifiers on tdap-monoclonal-db itself to learn generalizable features of vaccine-specific antibodies.
4.4. Deep models trained on small monoclonal antibody databases provided limited additional value over sequence identity
Following initial promising results with CloneSearch, we aimed to produce predictors with improved sensitivity and without a significant cost to specificity. We first considered excluding the IGHV matching in CloneSearch; however, we found that this significantly reduced performance (Figure S8A). Relaxing IGHV matching by using allele-similarity clusters (ASCs) or allele-similarity cluster families (ASCFs), which are clusters of highly-similar IGHV alleles (40), likewise did not improve performance (Figure S8B).
We subsequently trained several models (Table 1) on tdap-monoclonal-db (our database of monoclonal antibody sequences with reported specificity to PT or TT, supplemented with CoV and miscellaneous antibody sequences) to predict vaccine-specificity (TT and PT mAbs as Positive; the remainder as Negative). We held out 20% of tdap-monoclonal-db as a test set, with no more than 90% CDR identity or 80% CloneSearch score to the remainder; the remainder was used for hyperparameter selection and training. We trained each model on five different folds of the training data and made predictions as an ensemble by taking the average of these predictions. On the hold-out test set of tdap-monoclonal-db, we found that for most models, we could obtain an improvement in ROC-AUC, ROC-AUC_0.1, and PR-AUC over the best sequence identity method, which in this instance was CloneSearch (excluding IGHV matching) (Figure S10). Deep learning models trained as ensembles identified held-out vaccine mAbs (the tdap-monoclonal-db test set) with a ROC-AUC of 0.80, ROC-AUC_0.1 of 0.65 and PR-AUC of 0.24 for the top model (vs. random PR-AUC of 0.03). The top model is referred to as CDRH3_V_AbLang and used an IGHV encoding layer and a fully connected layer on AbLang2 embeddings average-pooled across the CDRH3 tokens.
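Two of the steps above, average-pooling per-token embeddings over the CDRH3 and averaging the predictions of the fold models, can be sketched in a few lines. This is an illustration only: the embeddings are plain lists of floats rather than real AbLang2 or ESM outputs, the fold "models" are stand-in callables, and the names `mean_pool_cdrh3` and `ensemble_predict` are hypothetical.

```python
from statistics import fmean
from typing import Callable, List, Sequence

Vector = List[float]


def mean_pool_cdrh3(token_embeddings: Sequence[Vector], start: int, end: int) -> Vector:
    """Average the per-token vectors of the CDRH3 span [start, end) into one vector."""
    span = token_embeddings[start:end]
    if not span:
        raise ValueError("empty CDRH3 span")
    dim = len(span[0])
    return [fmean(vec[d] for vec in span) for d in range(dim)]


def ensemble_predict(models: Sequence[Callable[[Vector], float]], features: Vector) -> float:
    """Ensemble score: the mean of the scores from the models trained on each fold."""
    return fmean(model(features) for model in models)


# Toy example: three tokens with 2-dimensional embeddings; the CDRH3 covers tokens 1-2.
embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = mean_pool_cdrh3(embeddings, 1, 3)
print(pooled)  # [4.0, 5.0]

# Stand-ins for five fold models; real models would be trained classifiers.
fold_models = [lambda x, b=b: min(1.0, b * sum(x)) for b in (0.01, 0.02, 0.03, 0.04, 0.05)]
print(ensemble_predict(fold_models, pooled))
```

Pooling over the CDRH3 tokens only, rather than the whole heavy chain, keeps the representation focused on the region that the clonal definition also relies on.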
We subsequently applied these classifiers to tdap-vaccinee-db, with the hope of improving upon simple CloneSearch look-up (Figure 4A). We first noted a performance shift, which arose because our tdap-monoclonal-db training data carried no information about mutations: for most positive instances, only IGHV and CDRH3 information was publicly available. This contrasts with tdap-vaccinee-db, where mutation is an important feature separating D7 vs. D0 clonotypes (Figure S4). The effect was most dramatic for the language model AbLang2 (Figure S11). To mitigate this effect, we germline-reverted tdap-vaccinee-db (see Methods 3.1.4.), which significantly improved performance of the models trained on tdap-monoclonal-db. The top performing models in tdap-vaccinee-db were CDRH3_ESM and CDRH3_V_ESM, outperforming the top models in the tdap-monoclonal-db test set (CDRH3_V_AbLang and CDRH3_AbLang); these classifiers enabled a small but significant improvement in total ROC-AUC but were unable to outperform CloneSearch in the region of most interest, the early retrieval area (Figure 4B).
Figure 4:

We trained several model architectures as ensembles for the prediction of vaccine antigen (PT or TT) specificity vs. CoV, EBOV and miscellaneous antibodies (Figure S10 for performance on the tdap-monoclonal-db test set), and then applied these classifiers to tdap-vaccinee-db to see if they improved identification of vaccine-expanded clonotypes (A). The best performing used averages over the length dimension in ESM or AbLang representations, subset to the CDRH3, with or without an additional IGHV gene encoding layer (“_V_”), where classifiers enabled small but significant improvements in the total ROC-AUC (B, C). However, inspection of the early-retrieval area reveals that no classifier was able to improve upon the performance of CloneSearch in the highest scoring regions; we can, however, combine CloneSearch with the classifier to create the convex hull of the two ROC curves (CloneSearch+CDRH3_V_ESM), enabling significant improvement in the ROC-AUC_0.1 and PR-AUC. The ROC curve shows TPR and FPR calculated on the pooled data (i.e. across all subjects to provide a single estimate). Boxplots are sorted on the y-axis by median ROC-AUC_0.1 and show the performance estimate for each of the 19 subjects for the CDRH3 and CDRH3_V models; p-values are from the Wilcoxon test of each method vs. CloneSearch.
As can be seen by inspection of the ROC curve in Figure 4B, no model is able to achieve significantly superior sensitivity to CloneSearch in the early retrieval region; we therefore combined CloneSearch with the top classifier to create “CloneSearch+CDRH3_V_ESM”. This resulted in a significant improvement in ROC-AUC_0.1 over CloneSearch (p = 0.032, Wilcoxon test); however, the median ROC-AUC is still barely above random at 0.52. We consider this to be the upper performance limit when the only training data available is a small monoclonal antibody database.
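The idea of combining two predictors through the convex hull of their ROC curves can be illustrated as below. This is a generic sketch of the ROC convex hull, not the authors' procedure for building CloneSearch+CDRH3_V_ESM, which is described in the Methods; the function names and the two toy curves are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (FPR, TPR)


def _cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of (a - o) x (b - o); negative for a clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def roc_convex_hull(*curves: Sequence[Point]) -> List[Point]:
    """Upper convex hull of all operating points from the given ROC curves,
    anchored at (0, 0) and (1, 1)."""
    points = sorted({(0.0, 0.0), (1.0, 1.0), *(p for curve in curves for p in curve)})
    hull: List[Point] = []
    for p in points:
        # Drop the previous point while it lies on or below the chord to p.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


def trapezoid_auc(points: Sequence[Point]) -> float:
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy curves: a look-up with a few highly specific hits, and a smoother classifier.
lookup = [(0.0, 0.0), (0.0, 0.1), (1.0, 1.0)]
classifier = [(0.0, 0.0), (0.2, 0.5), (0.5, 0.8), (1.0, 1.0)]
hull = roc_convex_hull(lookup, classifier)
print(hull)
print(trapezoid_auc(lookup), trapezoid_auc(classifier), trapezoid_auc(hull))
```

The hull keeps the look-up's zero-FPR hits in the early-retrieval region and the classifier's operating points elsewhere, so its area is at least that of either input curve.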
4.5. Learning from other individuals: Clonal overlap, IGHV frequency and mutation can be used to predict clonotype expansion in held-out subjects
The performance of the models we trained on our monoclonal database was limited by the scarcity of monoclonal antibody sequences of vaccine antigens, as well as the lack of mutation in the data (see section 3.2.4). While these classifiers still had predictive power, we hit a low performance limit that increasingly complex model architectures could not overcome. The performance shift from mutated to germline-reverted data highlighted the importance of mutation in this task; mutation alone has a median ROC-AUC of 0.66 (with a ROC-AUC_0.1 of 0.50). We therefore asked to what extent expansion could be learned between subjects.
The first question was: if a clonotype is expanded in one subject and is observed in another, is it likely to be also expanded in that second subject? To avoid data leakage, we held out 5 test subjects. With the remaining 14 subjects, we considered all subject pairs. Using the CloneSearch score as a prediction produces an average ROC-AUC of 0.55 (0.48 – 0.68) and average ROC-AUC_0.1 of 0.53 (0.50 – 0.60) between a given pair of subjects (Figure S12A). To contextualize these metrics within the standard clonal thresholds (70%, 80%, 90% and 100% CDRH3 identity) we highlight them on the ROC curve calculated pooling all pairs of subjects (Figure S12B) and plot the corresponding TPR and FPR at these thresholds (Figure S12C).
To discover how best to combine clonal information where this is available for multiple subjects, we constructed a LOO set-up in which the prediction of a single clonotype is made using data from multiple subjects. These predictions can be pooled via a sum or maximum; we found that maximum pooling (i.e. taking the maximum score observed in any subject) enabled the most significant improvement in ROC-AUC_0.1 with increasing cohort size (Figure S12D).
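A minimal sketch of the leave-one-out pooling step follows. It assumes each subject contributes a list of expanded clonotypes, that a test clonotype is scored separately against each training subject, and that those per-subject scores are then pooled by maximum or sum. The data layout, function names and sequences are hypothetical.

```python
from typing import Dict, List, Tuple

Clonotype = Tuple[str, str]  # (IGHV gene, CDRH3 amino acids)


def clonal_score(query: Clonotype, positives: List[Clonotype]) -> float:
    """Best length-matched CDRH3 identity to a same-IGHV positive clonotype (0.0 if none)."""
    q_v, q_cdrh3 = query
    return max(
        (sum(a == b for a, b in zip(q_cdrh3, cdrh3)) / len(q_cdrh3)
         for v, cdrh3 in positives
         if v == q_v and len(cdrh3) == len(q_cdrh3)),
        default=0.0,
    )


def loo_predict(cohort: Dict[str, Dict[str, List[Clonotype]]],
                held_out: str, how: str = "max") -> Dict[Clonotype, float]:
    """Score every clonotype of the held-out subject against the expanded clonotypes
    of each other subject, then pool the per-subject scores.
    Requires at least one other subject in the cohort."""
    pool = {"max": max, "sum": sum}[how]
    predictions = {}
    for clonotype in cohort[held_out]["all"]:
        per_subject = [clonal_score(clonotype, data["expanded"])
                       for subject, data in cohort.items() if subject != held_out]
        predictions[clonotype] = pool(per_subject)
    return predictions


# Hypothetical three-subject cohort.
cohort = {
    "S1": {"expanded": [("IGHV4-39", "ARGYSSGWYF")], "all": []},
    "S2": {"expanded": [("IGHV4-39", "ARGYSSGWYL")], "all": []},
    "S3": {"expanded": [], "all": [("IGHV4-39", "ARGYSSGWYF"), ("IGHV1-2", "ARDLGY")]},
}
print(loo_predict(cohort, "S3", how="max"))
print(loo_predict(cohort, "S3", how="sum"))
```

Maximum pooling keeps scores on the same 0 to 1 scale as a single look-up, whereas sum pooling grows with the number of training subjects that carry a similar clonotype.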
Aside from clonal information, we noted the significance of mutation in this task producing a median ROC-AUC of 0.66 (with random performance in the ER region; ROC-AUC_0.1 of 0.50). We also demonstrated that the overrepresentation of specific IGHV genes in positive vs. negative clonotypes in held-out training subjects was predictive in test subjects, with a median ROC-AUC of 0.58 and ROC-AUC_0.1 of 0.51 (Figure S13). To combine clonal information with these features, we developed a novel “Biological Baseline” which is effectively the convex hull of the respective ROC curves (Methods 2.4). In cross-validation (in 14/19 subjects, with redundancy filtering between train-test folds to remove instances with CloneSearch scores >0.8 and CDR identity >0.9), this led to a significantly improved ROC-AUC over CloneSearch alone from median 0.55 to 0.66 (p = 1.2e-4, Wilcoxon test), and a small improvement in the ROC-AUC_0.1 from 0.53 to 0.54 (p = 0.002, Wilcoxon test; Figure S13).
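The IGHV over-representation feature can be sketched as a frequency ratio learned on training subjects and looked up for test clonotypes. The smoothing with a pseudocount and the neutral score of 1.0 for unseen genes are assumptions made for this illustration, as are the function names; the study's exact formulation is in the Methods.

```python
from collections import Counter
from typing import Dict, Sequence


def ighv_ratios(positive_genes: Sequence[str], negative_genes: Sequence[str],
                pseudocount: float = 1.0) -> Dict[str, float]:
    """Ratio of each IGHV gene's smoothed frequency among positive (expanded)
    vs. negative (baseline) training clonotypes."""
    pos, neg = Counter(positive_genes), Counter(negative_genes)
    genes = set(pos) | set(neg)
    pos_total = len(positive_genes) + pseudocount * len(genes)
    neg_total = len(negative_genes) + pseudocount * len(genes)
    return {
        gene: ((pos[gene] + pseudocount) / pos_total) / ((neg[gene] + pseudocount) / neg_total)
        for gene in genes
    }


def score_clonotype(ighv_gene: str, ratios: Dict[str, float]) -> float:
    """Score a test clonotype by its gene's training ratio; unseen genes get a neutral 1.0."""
    return ratios.get(ighv_gene, 1.0)


# Toy training data: IGHV4-39 is enriched among expanded clonotypes.
train_pos = ["IGHV4-39", "IGHV4-39", "IGHV4-39", "IGHV1-2"]
train_neg = ["IGHV4-39", "IGHV1-2", "IGHV1-2", "IGHV1-2"]
ratios = ighv_ratios(train_pos, train_neg)
print(score_clonotype("IGHV4-39", ratios))  # 2.0
print(score_clonotype("IGHV1-2", ratios))   # 0.5
print(score_clonotype("IGHV3-23", ratios))  # 1.0 (not seen in training)
```

Because the ratios are computed only on training subjects, the feature can be applied to a held-out subject without leaking that subject's labels.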
These analyses first demonstrated that vaccine-expanded clonotypes in one subject can be used to predict vaccine-expanded clonotypes in a separate subject using CloneSearch. Subsequently, we demonstrated how best to use data from multiple subjects to maximize classification performance. Finally, we developed a method to combine the clonotype-based predictor with other biologically motivated features (IGHV frequency in training data, and mutation) to produce a superior classifier.
4.6. Exploring deep methods for improved prediction of clonotype expansion
We subsequently sought to improve upon the simple BiologicalBaseline in a LOO set-up with stringent sequence identity cut-offs to examine predictive performance in instances with no direct similarity to training data. To improve upon our baseline, we first explored other methods for sequence look-up. One look-up method is similarity in a pooled pLM embedding representation. Using the same annotation approach as our clonal method, whereby a test instance is annotated with its maximal cosine similarity to a positive training instance, neither pLM embedding could improve on CloneSearch in the ROC-AUC or ROC-AUC_0.1 (Figure S14). Performance could be improved by training a KNN classifier, which could weigh similarity to multiple training instances (with 100 neighbors, using distance-weighted cosine similarity). With this, we achieved a small improvement in ROC-AUC over the Biological Baseline for all test subjects, with a median test value of 0.73 vs. 0.69; however, this KNN did not improve upon the ROC-AUC_0.1.
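A distance-weighted KNN over embedding vectors can be sketched as follows. This assumes cosine distance (one minus cosine similarity) and inverse-distance weights; the small `eps` guarding against division by zero, the function names and the 2-dimensional toy vectors are choices made for this illustration rather than details from the study.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_score(query: Sequence[float], train_vectors: List[Sequence[float]],
              train_labels: List[int], k: int = 100, eps: float = 1e-9) -> float:
    """Weighted fraction of positive labels among the k most similar training
    instances, with weights 1 / (cosine distance + eps)."""
    neighbors = sorted(
        ((cosine_similarity(query, vec), label)
         for vec, label in zip(train_vectors, train_labels)),
        key=lambda pair: -pair[0],
    )[:k]
    weights = [1.0 / (1.0 - sim + eps) for sim, _ in neighbors]
    weighted_positive = sum(w * label for w, (_, label) in zip(weights, neighbors))
    return weighted_positive / sum(weights)


# Toy embeddings; real inputs would be pooled pLM representations.
train_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
train_labels = [1, 0, 0]
print(knn_score([1.0, 0.1], train_vectors, train_labels, k=2))
```

Unlike the maximum-similarity look-up, the score here depends on the labels of several neighbors, so a single similar negative can temper the contribution of a similar positive.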
While KNN classifiers using pLM embedding space distances were able to secure a small performance advantage, we questioned whether performance could be improved by training fully-connected layer(s) or indeed via supervised non-pLM methods. We therefore investigated the use of machine learning models introduced earlier for the tdap-monoclonal-db classification task trained directly on tdap-vaccinee-db. Here, we found that all models except for BCR-AI improved upon the BiologicalBaseline (Figure S15); the top performing model via ROC-AUC_0.1 was CDRH3_AbLang with a median ROC-AUC of 0.74, ROC-AUC_0.1 of 0.57, and PR-AUC of 0.26 (Figure 5). This contrasts with the baseline with a median ROC-AUC of 0.69, ROC-AUC_0.1 of 0.54 and PR-AUC of 0.21. All metrics were improved for all five test subjects.
Figure 5:

We trained ensemble models to predict clonotype expansion in five held-out test subjects within tdap-vaccinee-db, via ensembles trained in a leave-one-out (LOO) fashion (with no clonal homology above 80% or CDR identity exceeding 90% between train-validation-test) (A). Among several model architectures (Figure SX), our top performing models used pre-trained pLM featurizations, with the top model being CDRH3_AbLang; non-pretrained models such as NetBCR and a CDR transformer model were able to perform competitively with ESM-based models and were able to improve ROC-AUC, ROC-AUC_0.1 and PR-AUC for all five test subjects over the Biological Baseline or CloneSearch alone (B). Each point in the boxplot reflects the performance estimate in one of the five test subjects. The superior performance of CDRH3_AbLang across the full operating range can be seen in the ROC and PR curves (C).
We noted correlations between CDRH3_AbLang and the constituent features of our biological baseline: CloneSearch score, mutation and IGHV ratio (Figure S16). In addition, we noted correlation with CDRH3 length which by itself had limited predictive power (Figure S4). To delve further into CDRH3 length, we looked at score by J gene and found higher predictions for IGHJ6, which tends to produce longer CDRH3s due to the polytyrosine motif (Figure S17).
4.7. Application of a Tdap-specific model to EBOV vaccinee repertoires demonstrates Tdap-specificity
As a final test of our top model, CDRH3_AbLang, we applied the model trained on Tdap vaccinees (Tdap-CDRH3_AbLang) to the IgG BCR repertoires of 40 Ad26.ZEBOV/MVA-BN-Filo (EBOV vaccine) vaccinees, who were sequenced at baseline and 7 days following booster vaccination. We defined positives as clonotypes expanded at the post-boost time point and negatives as clonotypes at baseline, as with the Tdap set-up. If our model was Tdap-specific we would expect i) random classification performance of baseline vs. EBOV post-boost clonotypes and ii) positive classification performance of Tdap vs. EBOV post-boost clonotypes.
We found near-random classification performance of baseline vs. post-boost clonotypes in EBOV vaccination (median ROC-AUC of 0.47, ROC-AUC_0.1 of 0.50) (Figure 6A). We then examined Tdap-CDRH3_AbLang’s classification performance on post-boost Tdap clonotypes vs. post-boost EBOV vaccine clonotypes; Tdap-CDRH3_AbLang classifies clonotypes as Tdap-specific with a ROC-AUC of 0.75, ROC-AUC_0.1 of 0.58 and PR-AUC of 0.47 (random PR-AUC = 0.23). This is a large improvement over the CloneSearch baseline, which has a ROC-AUC of 0.55, ROC-AUC_0.1 of 0.51 and PR-AUC of 0.21 (Figure 6B and C), and a smaller improvement on the Biological Baseline (ROC-AUC of 0.73, ROC-AUC_0.1 of 0.54 and PR-AUC of 0.37). Notably, the performance difference between Tdap-CDRH3_AbLang and the BiologicalBaseline arises due to superiority in the early retrieval area (i.e. superior sensitivity at the highest specificity). We also noted that using average Tdap-CDRH3_AbLang scores per repertoire provides perfect classification of Tdap vs. EBOV post-boost repertoires (vs. an AUC of 0.88 using CloneSearch and 0.99 using the BiologicalBaseline), meaning that clonotype-based classifiers can also be used for repertoire classification on unseen datasets (Figure 6D).
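The repertoire-level use of clonotype scores can be sketched by averaging the scores within each repertoire and then asking how well the averages separate the two cohorts. The rank-based AUC below is the standard Mann-Whitney formulation; the repertoire identifiers, scores and function names are hypothetical and do not come from the study data.

```python
from statistics import fmean
from typing import Dict, List, Sequence


def repertoire_means(clonotype_scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Collapse each repertoire's clonotype-level scores into a single mean score."""
    return {rep_id: fmean(scores) for rep_id, scores in clonotype_scores.items()}


def rank_auc(positive_scores: Sequence[float], negative_scores: Sequence[float]) -> float:
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in positive_scores for n in negative_scores)
    return wins / (len(positive_scores) * len(negative_scores))


# Toy clonotype scores per repertoire.
tdap = repertoire_means({"T1": [0.8, 0.6], "T2": [0.7, 0.5]})
ebov = repertoire_means({"E1": [0.3, 0.5], "E2": [0.2, 0.4]})
print(rank_auc(list(tdap.values()), list(ebov.values())))  # 1.0: perfectly separated
```

Averaging smooths out clonotype-level noise, which is why a modest clonotype-level ROC-AUC can still translate into a clean separation of whole repertoires.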
Figure 6:

We applied the model trained on Tdap vaccinees to 40 unseen EBOV vaccinee repertoires to find evidence that the Tdap model (Tdap-CDRH3_AbLang) is specific to Tdap booster vaccination; in Test 1, the Tdap-CDRH3_AbLang model trained on Tdap vaccinees produces Tdap-specific predictions, i.e. it does not predict a generic post-booster response. This is evident from random baseline vs. post-boost EBOV vaccine clonotype classification (A) and better than random classification of Tdap-expanded vs. EBOV-expanded clonotypes (ROC-AUC = 0.75, ROC-AUC_0.1 = 0.58 and PR-AUC = 0.47) (B), where clonal homology alone (CloneSearch) is poor (C). The BiologicalBaseline also produces a reasonable ROC-AUC (0.73), demonstrating the value of mutation and IGHV enrichment as features. However, in the early-retrieval area, Tdap-CDRH3_AbLang outperforms this baseline, resulting in an improvement in PR-AUC from 0.37 to 0.47. Using a repertoire-average score across clonotypes results in perfect classification of Tdap vaccinee vs. EBOV vaccinee repertoires (D).
In summary, we demonstrated that our top model trained to distinguish Tdap-expanded vs. baseline clonotypes was able to distinguish Tdap-expanded clonotypes in unseen subjects from EBOV vaccine-expanded clonotypes in an independent cohort, and that this clonotype-level classification can be used for successful repertoire classification.
5. Discussion
Prediction of vaccine responses is a growing field, due to both the methodological challenge it presents - predictions using noisy, high-dimensional, and multimodal data - and its prospective utility in vaccine design (19,41). In a recent series of vaccine response prediction contests (Computational Models of Immunity-Pertussis Boost, “CMI-PB”), contestants were tasked with predicting vaccine response read-outs from participants post-vaccination given baseline measurements, and training data from different cohorts of Tdap vaccinees. Models of varying complexity were successful, but for certain read-outs, the simplest models are not yet beatable(19). CMI-PB challenges have not yet included information about TCR and BCR repertoires: bulk BCR repertoires of vaccinees have been studied extensively for certain vaccines, such as SARS-CoV-2, HIV, influenza, or EBOV vaccines; this is the first study for Tdap vaccination with targeted sequencing of BCR heavy chains, with the only previous study being single-cell BCR sequencing in a single donor (39).
Bulk BCR sequencing revealed significant changes in class-switching, clonality, somatic hypermutation, CDRH3 length and IGHV gene usage, which are all consistent with the context of booster vaccination. A more specific question in the literature regarding Tdap vaccination is the difference between the newer aP formulation and the original whole-cell Pertussis (wP) formulation; the literature on the difference in B cell response to restimulation has focused on the elevation of antigen-specific IgE responses in aP individuals (42–46). Indeed, the only significant difference in the BCR repertoires of aP and wP primed individuals was a significantly higher fold-change in IgE reads from D0 to D7 in aP-vaccinated individuals. This provides further evidence for the IgE difference observed. A potential confounding factor is the role of allergic status; allergic status was not reported as part of this study. However, the literature supports that the IgE effect is antigen-specific (46) and that antigen-specific IgE does not differ by allergic status (44).
In this specific work, we questioned whether the clonotypes found expanded seven days after Tdap booster vaccination could be predicted. As described, we observed a significant increase in class-switched, mutated BCRs, with a decreasingly even distribution of reads among clonotypes at D7. These features were able to classify repertoires as D7 vs. D0 with up to a 0.98 ROC-AUC. Likewise, when classifying expanded D7 vs. D0 clonotypes, mutation from the IGHV germline had a ROC-AUC of 0.65, which is superior to any classifier we were able to train based solely on publicly available sequences of vaccine-specific antibodies.
As in our previous predictive vaccinology work, it is clear that simple, biologically motivated prediction baselines should not be ignored. However, we wanted to build more complex models of BCR clonotype expansion following vaccination, that would be specific to the vaccine in question: mutation and class-switching are not specific to Tdap vaccination, but rather to the phenomenon of clonal expansion of memory B cells stimulated by vaccination. Models that can identify repertoires as belonging to a person recently vaccinated or infected by agent X vs. agent Y, or further, identify BCR clonotypes as responding to agent X vs. agent Y, are both more challenging and more useful. We attempted this via sequence-based models, under the hypothesis that expanded sequences appear due to their specificity to the immunogen.
The simplest sequence-based models use either sequence identity or “shared clonality”, which here is a shared IGHV gene assignment and sequence identity in the CDRH3. One might expect sequence identity to be a superior metric, as closely related IGHV genes might be “missed out” by a strict clonotype definition. In all tasks, we found CloneSearch to be superior to sequence identity when it is extended to the CDRs or the full VH. For the expansion task, this is most likely because the probability of sharing sequence identity with an independent sequence set will decline with mutation, which was itself significantly positively associated with the positive label; indeed, the maximum VH or CDR identity of a test instance to a positive training instance, was inversely correlated with mutation of that test instance. Our interpretation of this is that when sequence identity is numerically dominated by IGHV-encoded regions, signal about shared IGHV is undermined by somatic hypermutation. Methods that focus on shared somatic hypermutations given an IGHV gene would likely be an improvement. We found that extending the CloneSearch method to closely related IGHVs via allele-similarity clusters (ASCs) and ASC families did not significantly improve performance in any task; this is because it largely modified our predictions by “scooping up” clonotypes that had low CDRH3 identity and thus low CloneSearch scores. Finally, we found that for 2/3 tasks (using tdap-monoclonal-db to predict expansion in tdap-vaccinee-db; LOO set-up for prediction in tdap-vaccinee-db), including IGHV matching in CloneSearch improved performance over simply CDRH3 identity matching. These results demonstrate that clonotype searching can be a powerful approach, and that using IGHV genes is both necessary and sufficient for full utility of these methods. 
Another important feature in the success of CloneSearch was the database: we found that the TT component of our vaccine mAb database had significantly greater predictive power than the PT component. While this makes sense given serological data and the non-human origin of most of the PT database, it is an important caveat that the success of the method will depend on the representativeness of the mAb database, be that by antigen or epitope.
To attempt to improve upon our baselines, we first explored simple look-up methods via pooled pLM representations. We found that there was not a performance advantage in doing so, and performance was improved only when using large numbers of neighbors in a KNN set-up. We subsequently trained deep learning models, either with pLM representations or non-pretrained models. It is hoped that using featurizations derived from pLMs can improve prediction performance through their extensive pretraining on large corpuses of BCR sequencing data. However, there have generally been minor performance advantages in “downstream tasks” over general protein language models, such as ESM. We found here that i) generally the antibody-language model featurization we used (AbLang2) was superior to ESM, and that ii) the language models outperformed non-pretrained deep architectures, such as CNN-based NetBCR, BCR-AI and our CDR transformer model. Most clearly, we noted an effect of data scale and diversity on model performance, i.e., training classifiers on a small, sparse monoclonal database vs. a sequencing dataset. For the monoclonal database classification task, we found no deep classifier could outperform sequence identity in the task of expansion prediction in ROC-AUC_0.1 or the PR-AUC - despite outperforming sequence identity baselines in the held-out test set of tdap-monoclonal-db. When training directly on BCR sequencing data, we found that the deep models excelled and convincingly outperformed our baselines. While this trend is expected given wisdom about model complexity and training data scale, it is an important reminder to researchers who wish to use these methods as research tools. We continue to recommend clonotype-based sequence identity metrics for interpreting BCR sequencing data using small monoclonal antibody databases such as the IEDB.
Deep models could learn features of vaccine-specific clonotypes that extend beyond the clonotype definition, i.e., could identify vaccine-specific expanded clonotypes in the absence of conventional clonal homology to known expanded clonotypes. As well as discriminating Tdap-expanded clonotypes from D0 clonotypes from the same subjects, the model discriminated Tdap-expanded clonotypes from EBOV vaccine-expanded clonotypes in a separate cohort. Our top performing model, CDRH3_AbLang, produced predictions correlated with our biological baseline (IGHV ratio, mutation and clonal homology) as well as with an additional feature, CDRH3 length, which was not predictive by itself. That expansion is learnable between subjects beyond clonality, mutation and IGHV indicates that there are additional sequence features discriminating Tdap-specific clones that can be learned by deep models.
In summary, we have outlined here a novel prediction format that defines clone expansion in a generalizable way, established a biologically-motivated baseline, and demonstrated current prediction limits using standard models in the field. The key implications of our work for the field more generally are as follows. Firstly, simple baselines should not be ignored, both from the perspective of their ability to enhance prediction and from their interaction with more complex predictors. For example, we noted that pLM cosine similarity look-up was not useful in this task for either pLM explored due to a negative correlation with mutation; this should be considered in future applications of pLM representations for antibodies (such as using them in lieu of clonal clustering). Secondly, IGHV usage and mutation from germline emerged as key features, emphasizing the importance of accurate IGV reference sets. Finally, there are learnable features about clonotype expansion between subjects that extend beyond our clonotype definition; here, we demonstrated correlation with mutation, overrepresentation of IGHV, and CDRH3 length.
Supplementary Material
9. Acknowledgements
The authors are grateful to all donors who participated in the study and the La Jolla Institute for Immunology’s Next Generation Sequencing core, particularly Dr. Suzie Alarcón. We thank Dr. Jake Galson for sharing his literature-curated TT monoclonal antibody database.
8. Funding sources
This project has been funded in whole or in part with federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services under contract no. 75N93019C00001 and award no U01-AI150753.
7. Data and code availability
BCR sequencing data is available on the SRA under BioProject PRJNA1298155 with accessions SAMN50255538 – SAMN50255575. tdap-vaccinee-db and tdap-monoclonal-db are available at Zenodo at https://zenodo.org/records/16410082. Training code is available at https://github.com/erichardson97/tdap_expansion_pred.
Bibliography
- 1.Rao VN, Coelho CH. Public antibodies: convergent signatures in human humoral immunity against pathogens. mBio. 2025. Apr 16;0(0):e02247–24. [Google Scholar]
- 2.Natali EN, Horst A, Meier P, Greiff V, Nuvolone M, Babrak LM, et al. The dengue-specific immune response and antibody identification with machine learning. Npj Vaccines. 2024. Jan 20;9(1):1–15. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Richardson E, Bibi S, McLean F, Schimanski L, Rijal P, Ghraichy M, et al. Computational mining of B cell receptor repertoires reveals antigen-specific and convergent responses to Ebola vaccination. Front Immunol [Internet]. 2024. Jul 8 [cited 2024 Oct 4];15. Available from: https://www.frontiersin.org/journals/immunology/articles/10.3389/fimmu.2024.1383753/full [Google Scholar]
- 4.Chen EC, Gilchuk P, Zost SJ, Suryadevara N, Winkler ES, Cabel CR, et al. Convergent antibody responses to the SARS-CoV-2 spike protein in convalescent and vaccinated individuals. Cell Rep [Internet]. 2021. Aug 24 [cited 2024 Oct 4];36(8). Available from: https://www.cell.com/cell-reports/abstract/S2211-1247(21)01042-1 [Google Scholar]
- 5.Setliff I, McDonnell WJ, Raju N, Bombardi RG, Murji AA, Scheepers C, et al. Multi-Donor Longitudinal Antibody Repertoire Sequencing Reveals the Existence of Public Antibody Clonotypes in HIV-1 Infection. Cell Host Microbe. 2018. Jun 13;23(6):845–854.e6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Trück J, Ramasamy MN, Galson JD, Rance R, Parkhill J, Lunter G, et al. Identification of Antigen-Specific B Cell Receptor Sequences Using Public Repertoire Analysis. J Immunol. 2015. Jan 1;194(1):252–61. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Yaari G, Kleinstein SH. Practical guidelines for B-cell receptor repertoire sequencing analysis. Genome Med. 2015. Nov 20;7(1):121. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Shrock EL, Timms RT, Kula T, Mena EL, West AP, Guo R, et al. Germline-encoded amino acid–binding motifs drive immunodominant public antibody responses. Science. 2023. Apr 7;380(6640):eadc9498. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Wang Y, Lv H, Teo QW, Lei R, Gopal AB, Ouyang WO, et al. An explainable language model for antibody specificity prediction using curated influenza hemagglutinin antibodies. Immunity. 2024. Oct 8;57(10):2453–2465.e7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Wang Y, Yuan M, Lv H, Peng J, Wilson IA, Wu NC. A large-scale systematic survey reveals recurring molecular features of public antibody responses to SARS-CoV-2. Immunity. 2022. Jun 14;55(6):1105–1117.e4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Burbach SM, Briney B. Improving antibody language models with native pairing. Patterns [Internet]. 2024. May 10 [cited 2024 Oct 15];5(5). Available from: https://www.cell.com/patterns/abstract/S2666-3899(24)00075-8 [Google Scholar]
- 12. Mason DM, Reddy ST. Predicting adaptive immune receptor specificities by machine learning is a data generation problem. Cell Syst. 2024. Dec 18;15(12):1190–7.
- 13. Vita R, Mahajan S, Overton JA, Dhanda SK, Martini S, Cantrell JR, et al. The Immune Epitope Database (IEDB): 2018 update. Nucleic Acids Res. 2019. Jan 8;47(D1):D339–43.
- 14. Raybould MIJ, Kovaltsuk A, Marks C, Deane CM. CoV-AbDab: the coronavirus antibody database. Bioinformatics. 2021. Mar 1;37(5):734–5.
- 15. Yoon H, Macke J, West AP Jr, Foley B, Bjorkman PJ, Korber B, et al. CATNAP: a tool to compile, analyze and tally neutralizing antibody panels. Nucleic Acids Res. 2015. Jul 1;43(W1):W213–9.
- 16. Richardson E, Galson JD, Kellam P, Kelly DF, Smith SE, Palser A, et al. A computational method for immune repertoire mining that identifies novel binders from different clonotypes, demonstrated by identifying anti-pertussis toxoid antibodies. mAbs. 2021. Jan 1;13(1):1869406.
- 17. Willemsen L, Lee J, Shinde P, Soldevila F, Aoki M, Orfield S, et al. Th1 polarization in Bordetella pertussis vaccine responses is maintained through a positive feedback loop. Nat Commun. 2025. Apr 1;16(1):1–11.
- 18. da Silva Antunes R, Babor M, Carpenter C, Khalil N, Cortese M, Mentzer AJ, et al. Th1/Th17 polarization persists following whole-cell pertussis vaccination despite repeated acellular boosters. J Clin Invest. 2018. Aug 31;128(9):3853–65.
- 19. Shinde P, Willemsen L, Anderson M, Aoki M, Basu S, Burel JG, et al. Putting computational models of immunity to the test - an invited challenge to predict B. pertussis vaccination outcomes [Internet]. bioRxiv; 2024. [cited 2024 Oct 15]. p. 2024.09.04.611290. Available from: https://www.biorxiv.org/content/10.1101/2024.09.04.611290v1
- 20. Gabernet G, Marquez S, Bjornson R, Peltzer A, Meng H, Aron E, et al. nf-core/airrflow: An adaptive immune receptor repertoire analysis workflow employing the Immcantation framework. PLOS Comput Biol. 2024. Jul 26;20(7):e1012265.
- 21. Collins AM, Ohlin M, Corcoran M, Heather JM, Ralph D, Law M, et al. AIRR-C IG Reference Sets: curated sets of immunoglobulin heavy and light chain germline genes. Front Immunol [Internet]. 2024. Feb 9 [cited 2024 Sep 4];14. Available from: https://www.frontiersin.org/journals/immunology/articles/10.3389/fimmu.2023.1330153/full
- 22. Manso T, Folch G, Giudicelli V, Jabado-Michaloud J, Kushwaha A, Nguefack Ngoune V, et al. IMGT® databases, related tools and web resources through three main axes of research and development. Nucleic Acids Res. 2022. Jan 7;50(D1):D1262–72.
- 23. Richardson E, Trevizani R, Greenbaum JA, Carter H, Nielsen M, Peters B. The receiver operating characteristic curve accurately assesses imbalanced datasets. Patterns. 2024. Jun 14;5(6):100994.
- 24. Chen F, Tzarum N, Wilson IA, Law M. VH1–69 antiviral broadly neutralizing antibodies: genetics, structures, and relevance to rational vaccine design. Curr Opin Virol. 2019. Feb 1;34:149–59.
- 25. DeKosky BJ, Ippolito GC, Deschner RP, Lavinder JJ, Wine Y, Rawlings BM, et al. High-throughput sequencing of the paired human immunoglobulin heavy and light chain repertoire. Nat Biotechnol. 2013. Feb;31(2):166–9.
- 26. Poulsen TR, Meijer PJ, Jensen A, Nielsen LS, Andersen PS. Kinetic, Affinity, and Diversity Limits of Human Polyclonal Antibody Responses against Tetanus Toxoid. J Immunol. 2007. Sep 15;179(6):3841–50.
- 27. Frölich D, Giesecke C, Mei HE, Reiter K, Daridon C, Lipsky PE, et al. Secondary Immunization Generates Clonally Related Antigen-Specific Plasma Cells and Memory B Cells. J Immunol. 2010. Sep 1;185(5):3103–10.
- 28. Yousefi M, Khosravi-Eghbal R, Reza Mahmoudi A, Jeddi-Tehrani M, Rabbani H, Shokri F. Comparative in vitro and in vivo assessment of toxin neutralization by anti-tetanus toxin monoclonal antibodies. Hum Vaccines Immunother. 2014. Feb 1;10(2):344–51.
- 29. Lavinder JJ, Wine Y, Giesecke C, Ippolito GC, Horton AP, Lungu OI, et al. Identification and characterization of the constituent human serum antibodies elicited by vaccination. Proc Natl Acad Sci. 2014. Feb 11;111(6):2259–64.
- 30. Poulsen TR, Jensen A, Haurum JS, Andersen PS. Limits for Antibody Affinity Maturation and Repertoire Diversification in Hypervaccinated Humans. J Immunol. 2011. Oct 15;187(8):4229–35.
- 31. Persson MA, Caothien RH, Burton DR. Generation of diverse high-affinity human monoclonal antibodies by repertoire cloning. Proc Natl Acad Sci. 1991. Mar 15;88(6):2432–6.
- 32. Faber C, Shan L, chang Fan Z, Guddat LW, Furebring C, Ohlin M, et al. Three-dimensional structure of a human Fab with high affinity for tetanus toxoid. Immunotechnology. 1998. Jan 1;3(4):253–70.
- 33. Jensen MF, Nielsen M. NetTCR 2.2 - Improved TCR specificity predictions by combining pan- and peptide-specific training strategies, loss-scaling and integration of sequence similarity. eLife [Internet]. 2024. Feb 2 [cited 2024 Oct 15];12. Available from: https://elifesciences.org/reviewed-preprints/93934
- 34. Zhang W, Hawkins PG, He J, Gupta NT, Liu J, Choonoo G, et al. A framework for highly multiplexed dextramer mapping and prediction of T cell receptor sequences to antigen specificity. Sci Adv. 2021. May 14;7(20):eabf5835.
- 35. Olsen TH, Moal IH, Deane CM. Addressing the antibody germline bias and its effect on language models for improved antibody design [Internet]. bioRxiv; 2024. [cited 2024 Oct 15]. p. 2024.02.02.578678. Available from: https://www.biorxiv.org/content/10.1101/2024.02.02.578678v1
- 36. Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023. Mar 17;379(6637):1123–30.
- 37. Zhang J, Ma W, Yao H. Accurate TCR-pMHC interaction prediction using a BERT-based transfer learning method. Brief Bioinform. 2024. Jan 1;25(1):bbad436.
- 38. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. J Mach Learn Res. 2011;12(85):2825–30.
- 39. Khatri I, Diks AM, van den Akker EB, Oosten LEM, Zwaginga JJ, Reinders MJT, et al. Longitudinal Dynamics of Human B-Cell Response at the Single-Cell Level in Response to Tdap Vaccination. Vaccines. 2021. Nov;9(11):1352.
- 40. Peres A, Lees WD, Rodriguez OL, Lee NY, Polak P, Hope R, et al. IGHV allele similarity clustering improves genotype inference from adaptive immune receptor repertoire sequencing data. Nucleic Acids Res. 2023. Sep 8;51(16):e86.
- 41. Soldevila FC, Shinde P, Kojima M, Overton JA, Ha B, Greenbaum J, et al. Computational models of Immunity – Pertussis Boost (CMI-PB): Engaging the broader scientific community to develop predictive models of Tdap booster vaccination. J Immunol. 2021. May 1;206(1_Supplement):59.22.
- 42. Aalberse RC, Grüber C, Ljungman M, Kakat S, Wahn U, Niggemann B, et al. Further investigations of the IgE response to tetanus and diphtheria following covaccination with acellular rather than cellular Bordetella pertussis. Pediatr Allergy Immunol. 2019;30(8):841–7.
- 43. Blauvelt A, Simpson EL, Tyring SK, Purcell LA, Shumel B, Petro CD, et al. Dupilumab does not affect correlates of vaccine-induced immunity: A randomized, placebo-controlled trial in adults with moderate-to-severe atopic dermatitis. J Am Acad Dermatol. 2019. Jan 1;80(1):158–167.e1.
- 44. Edelman K, Malmström K, He Q, Savolainen J, Terho EO, Mertsola J. Local reactions and IgE antibodies to pertussis toxin after acellular diphtheria-tetanus-pertussis immunization. Eur J Pediatr. 1999. Nov 1;158(12):989–94.
- 45. Holt PG, Snelling T, White OJ, Sly PD, DeKlerk N, Carapetis J, et al. Transiently increased IgE responses in infants and pre-schoolers receiving only acellular Diphtheria–Pertussis–Tetanus (DTaP) vaccines compared to those initially receiving at least one dose of cellular vaccine (DTwP) – Immunological curiosity or canary in the mine? Vaccine. 2016. Jul 29;34(35):4257–62.
- 46. da Silva Antunes R, Soldevila F, Pomaznoy M, Babor M, Bennett J, Tian Y, et al. A system-view of Bordetella pertussis booster vaccine responses in adults primed with whole-cell versus acellular vaccine in infancy. JCI Insight. 6(7):e141023.
Data Availability Statement
BCR sequencing data are available on the SRA under BioProject PRJNA1298155, with accessions SAMN50255538 – SAMN50255575. The tdap-vaccinee-db and tdap-monoclonal-db datasets are available on Zenodo at https://zenodo.org/records/16410082. Training code is available at https://github.com/erichardson97/tdap_expansion_pred.
