Abstract
Sequence alignment of immunoglobulin (Ig) sequences is central to the computational analysis of adaptive immune receptor repertoire sequencing (AIRR-seq) data, impacting adaptive immunity research and antibody engineering. Traditional Ig sequence aligners often struggle to handle the complexities of V(D)J recombination and somatic hypermutation (SHM), resulting in suboptimal allele assignment accuracy and sequence segmentation. We introduce AlignAIR, a novel deep learning-based aligner that leverages advanced simulation approaches and a multi-task learning framework. AlignAIR sets new state-of-the-art results in allele assignment accuracy, productivity assessments, sequence segmentation, and speed. The model’s latent space captures SHM characteristics, offering more profound insights into sequence variability. AlignAIR is designed for seamless integration with existing AIRR-seq pipelines and includes a user-friendly web interface and a container image for efficient local processing of millions of sequences. AlignAIR represents a significant advancement in immunogenetics research and antibody engineering, providing a critical resource for analyzing adaptive immune receptor repertoires.
Graphical Abstract
Introduction
Adaptive immune receptor repertoire sequencing (AIRR-seq) has revolutionized our ability to investigate the vast diversity of B- and T-cell receptors (BCRs and TCRs), providing a detailed view of immune responses to pathogens, vaccines, and cancer, as well as insights into autoimmunity. Receptor diversity is generated via a stochastic process known as V(D)J recombination, where variable (V), diversity (D), and joining (J) gene segments are recombined during lymphocyte development. In B cells, this is followed by an affinity maturation process involving somatic hypermutation (SHM) and affinity-dependent selection, further enhancing receptor variability [1].
This complex process and the resulting diversity present significant challenges in AIRR-seq data analysis and call for specialized bioinformatic approaches [2], including clonotype clustering to group similar immune receptors likely originating from the same ancestor cell [3], lineage tree construction to map the evolutionary pathways of receptors [4], mutation analysis to track and study SHM [5], and antigen-specificity prediction to determine the targets of immune receptors [6, 7]. All these analyses critically depend on the accurate alignment of sequences to their respective V(D)J germline allele sequences [8]. In the context of AIRR-seq, sequence alignment refers to the process of matching the sequenced immune receptor reads to known germline V, D, and J gene segments to accurately identify their origin alleles and characterize sequence variations. There are two main approaches for the alignment of BCR and TCR sequences. The first utilizes string distance-based algorithms, e.g. IgBLAST [9], IMGT/HighV-QUEST [10], and MiXCR [11]. The second approach employs hidden Markov models (HMMs) to parameterize the recombination process, including tools such as Partis [12] and IHMMuneAlign [13]. While HMM-based methods can effectively model typical rearrangement events, their performance may degrade in the presence of extensive insertions, deletions, high mutation rates, or sequencing errors. This reduction in accuracy often stems from the probabilistic nature and the training data of HMMs, which may not adequately capture rare or extreme mutation scenarios. Consequently, HMM-based aligners can show suboptimal accuracy in germline allele assignment and sequence segmentation [14], especially under conditions that deviate significantly from those represented in their training sets.
Enhanced accuracy in allele detection could have important applications, particularly in fields like vaccine development, where precise immune repertoire profiling can inform the design of effective immunogens and predict immune responses [15].
Recent advances in machine learning have transformed how complex scientific problems are approached [16]. Its enormous potential in computational biology [17] is evident across various fields, including protein structure prediction (exemplified by AlphaFold’s groundbreaking achievements, which were recognized by the 2024 Nobel Prize in Chemistry [18]), the development of specialized large language models for protein sequence generation and analysis [19], and other bioinformatics domains such as omics, biomedical imaging, and biomedical signal processing [20].
This technological leap has also extended to the analysis of AIRR-seq data. Machine learning has driven significant advances in modeling the underlying mechanisms and dynamics of AIRRs [21, 22], and in classifying different pathologies by analyzing immune receptor patterns [23–27]. Furthermore, machine learning techniques have enhanced the inference of per-sequence generation probabilities, allowing for more precise predictions on how immune receptors are generated [21, 28], identifying B-cell epitopes [29], and detecting disease-specific signatures [30, 31].
Building on the foundation of traditional machine learning approaches, deep learning has expanded these capabilities by applying advanced techniques to tasks such as receptor specificity prediction [32, 33], sequence determinant characterization [34], clonotype inference [35], synthetic sequence generation [36], SHM modeling [37], and public TCR classification [38]. The ability of deep networks to generate embedded latent spaces has proven instrumental in advancing these tasks [39–41].
Several attempts have been made to apply deep learning to DNA sequence alignment, most notably through deep reinforcement learning [42], convolutional neural networks [43], and large language models [44]. These approaches have outperformed classical algorithms like MAFFT [45] and ClustalW [46]. However, due to the unique complexities of adaptive immune receptor sequences—ranging from the stochastic nature of VDJ recombination to SHM—these methods are not directly applicable to AIRR-seq alignment. Existing AIRR-seq alignment tools often assume a uniform mutation model and produce deterministic results without providing interpretable likelihood measures. This approach oversimplifies the inherently stochastic nature of receptor generation and limits the tools’ ability to resolve sequence ambiguities [14].
Only a few alignment tools provide estimates of alignment confidence, expressed as probabilities that reflect sequence uncertainty. These methods are traditionally trained on empirical data that can capture some of these stochastic events [12, 13, 47]. However, these tools are also prone to dataset-specific biases that can affect performance. Using data with a known ground truth enables objective validation of alignment method quality, providing a standard against which performance can be evaluated. Ambiguities that arise from sequence alterations during antibody formation obscure the identification of true germline alleles, particularly when allele sequences are highly similar [48]. This challenge is especially pronounced for the short D alleles, where trimming and mutations leave so few nucleotides that distinguishing alleles becomes nearly impossible [14, 49].
Given these challenges, it is evident that a new approach is needed. The stochastic nature of V(D)J recombination and SHM requires an adequate scoring approach. Considering the various sources of ambiguity inherent in AIRR-seq data—ranging from mutations and trimming events to sequencing errors—a more advanced model is needed to effectively capture and learn the underlying distributions of these stochastic processes. Leveraging these distributions can effectively reduce ambiguities in allele detection. Likelihood-based scoring offers a powerful solution by providing a probabilistic evaluation that aligns with the stochastic nature of the data, delivering more accurate and detailed interpretations of these nondeterministic processes.
Here, we introduce AlignAIR, a novel deep learning-based alignment tool designed to learn and interpret the complexities of adaptive immune sequences. By leveraging advanced simulation techniques and a multi-task learning framework, AlignAIR achieves unprecedented accuracy in allele assignment, productivity assessments, and sequence segmentation without relying on traditional alignment methods. Crucially, AlignAIR integrates domain-specific knowledge of adaptive immune sequence formation with output formats compatible with existing AIRR-seq analysis tools, ensuring optimal performance and seamless integration.
Material and methods
Full algorithm details
Extensive explanations of the components and their underlying motivations are provided in the Supplementary Methods. Figure 1B shows diagrams of the neural network architecture and its components, and the complete pipeline and workflow appear in Supplementary Fig. S1. Supplementary Fig. S2 and Supplementary Section S1.2 detail the structure of the input data and the method used to encode DNA sequences. Supplementary Figs S3–S5 illustrate the architectural designs of the feature extraction, segmentation, and classification modules. Training and inference procedures are described in Supplementary Sections S1.4 and S1.5.
Figure 1.
(A) The AlignAIR suite comprises four main stages. The first phase involves data simulation using the GenAIRR package [14], which generates an exhaustive and unbiased dataset for both training and testing, featuring reliable ground truths and a variety of sequence conditions. The second phase details the training procedure of the AlignAIR deep learning architecture, including innovative components and a dynamic loss function. The third phase illustrates the inference process performed by the trained AlignAIR model, complemented by a suite of tools and post-processing algorithms that ensure compatibility with the AIRR schema for seamless integration with existing AIRR-seq analysis tools. The fourth and final phase highlights the available interfaces: an easy-to-use web interface for quick, insightful sequence analysis suitable for smaller datasets, and a Docker image for efficient local processing of millions of sequences. (B) An architecture diagram depicting the core components and structure of the AlignAIR model. The workflow includes specialized feature extraction blocks for each segment (V, D, and J), allowing for precise segmentation and positional embedding within input sequences. Each block is equipped with dedicated layers to enhance the recognition of allele likelihoods. The internal structure of a feature extraction block is shown as an abstraction in the bottom right corner. Created in BioRender. Peres, A. (2025) https://BioRender.com/q08a035.
Inputs and data sources
The inputs to the network are DNA sequences encoded into vectors of tokens based on a mapping from the accepted grammar to integers. The accepted grammar consists of the nucleotides A, T, C, and G, the “N” base, and a padding token (see additional details in Supplementary Section S1.1.2). To eliminate biases intrinsic to real sequence data, such as batch effects and sequencing errors, and to avoid optimizing the model to the results of any existing alignment algorithm, we generated the data using the GenAIRR simulation suite [14]. Four datasets were generated using this framework, corresponding to different mutation models: Uniform, S5F [5], S5F Opposite, and S5F 60.
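The encoding described above can be sketched in a few lines of Python. This is a minimal illustration, not the AlignAIR implementation: the integer assigned to each symbol, the function name `tokenize`, the right-side padding, and the choice to map unknown symbols to N are all assumptions; the actual mapping is defined in Supplementary Section S1.1.2.

```python
# Hypothetical token mapping; the integer values are illustrative assumptions.
PAD = 0
NUCLEOTIDE_TOKENS = {"A": 1, "T": 2, "C": 3, "G": 4, "N": 5}
MAX_LEN = 576  # maximum sequence length stated in the text


def tokenize(seq: str, max_len: int = MAX_LEN) -> list[int]:
    """Encode a DNA sequence as integer tokens, padded on the right (assumed)."""
    seq = seq.upper()
    if len(seq) > max_len:
        raise ValueError("sequence exceeds max_len; select a sub-region first")
    # Symbols outside the grammar are treated as the ambiguous base N (assumed).
    tokens = [NUCLEOTIDE_TOKENS.get(base, NUCLEOTIDE_TOKENS["N"]) for base in seq]
    return tokens + [PAD] * (max_len - len(tokens))
```

Every encoded sequence has the same length, which is what allows sequences to be stacked into fixed-size batches.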
Each mutation model represents a distinct approach to simulate SHM patterns. The Uniform model assumes random mutations uniformly spanning all positions, while the S5F model incorporates empirical mutation frequencies based on 5-mer motifs. The S5F Opposite model uses the complementary probabilities of the S5F, and the S5F 60 model shifts the mutation probabilities of the original S5F by 60% to introduce inference inaccuracies. These models are further detailed in Supplementary Section S2. Each dataset consists of 15 million sequences for training and 6 million sequences for testing. Of the testing sequences, 3 million are productive sequences with only mutations applied, and 3 million are nonproductive sequences with both indels and mutations. For the training datasets, the sequences are evenly split, with 7.5 million productive and 7.5 million nonproductive sequences. These datasets cover a broad spectrum of alignment challenges, from straightforward to extreme edge cases. The dataset generation parameters are defined in GenAIRR’s [14] Supplementary Table S3.
Training scheme
For training, the metadata produced by GenAIRR are leveraged (Fig. 1A.1), including the start and end positions of each allele (V, D, and J), productivity status of the sequence, the number of indels in the sequence, and the mutation rate, which is defined using the following formula:
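$$\text{mutation rate} = \frac{\text{number of mutated nucleotides}}{\text{total number of nucleotides in the sequence}}$$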
Here, “mutation rate” refers to the fraction of nucleotides in a given sequence that differ from the naive germline. For example, a 10% mutation rate means that 10% of all nucleotides in the input sequence have changed from the germline state. This same logic applies to V, D, and J mutation rates, except the denominator is the total number of nucleotides in the respective gene segment rather than the entire sequence. All V, D, and J alleles corresponding to each input sequence are included, where multiple assignments in the ground truth indicate an undecidable ambiguity between multiple alleles, which is attributed to sequence trimming rather than mutations.
The model is preset to accept a maximum sequence length of 576 nucleotides but is structurally adaptable to any size at the cost of increased memory and time complexity. The encoded input sequences are grouped into batches of 512. The model is trained on the 15 million training samples using a learning rate scheduler (see Supplementary Section S1.4.2 for the exact configuration) until convergence, typically around 500 epochs, with 150 000 sequences processed per epoch. On an NVIDIA Titan RTX, these 500 epochs take ∼15 h.
The embedding layer of the model combines two standard embedding layers. One embeds the integer-encoded nucleotides of the input sequence into a higher-dimensional space, while the other embeds the positions of these nucleotides (e.g. integers 1–576) into the same space. These embeddings share the same shape and are summed element-wise, so the resulting embedding at position k is the sum of the nucleotide embedding and the position embedding (see Supplementary Section S1.2.1 for a formal definition).
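The mutation rate defined at the start of this subsection can be illustrated with a short Python sketch. It assumes the observed and germline sequences have equal length and contain no indels, so that positions can be compared one to one; the function name `mutation_rate` is hypothetical.

```python
def mutation_rate(observed: str, germline: str) -> float:
    """Fraction of positions where the observed sequence differs from germline.

    Assumes both strings are already aligned position by position (no indels).
    """
    if len(observed) != len(germline):
        raise ValueError("sequences must have equal length")
    if not observed:
        return 0.0
    mismatches = sum(o != g for o, g in zip(observed, germline))
    return mismatches / len(observed)
```

Passing only the V, D, or J portion of both sequences gives the segment-specific rate, since the denominator then becomes the segment length.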
The model is trained in a supervised fashion using a custom loss function incorporating multiple auxiliary losses with dynamic scaling (see Supplementary Section S1.3 for the exact definition of the loss function and dynamic scaling). The first auxiliary task involves predicting the locations of the V, D, and J alleles within the input sequence, which generates binary masks applied to the positional embeddings using the Hadamard product (Fig. 1B). The three resulting masked positional embeddings are then passed through residual convolutional feature extraction blocks (see Supplementary Section S1.2.2 for the exact definitions). Additionally, an extra feature extraction block is connected to the positional embeddings to create a latent space for meta-task predictions, including regression for the number of indels, mutation rate regression, and sequence productivity status classification. Finally, the latent spaces created by the individual feature extraction blocks for each V, D, and J allele are passed through a shallow, dense network to produce a likelihood vector over all the options in the germline reference used per V, D, and J (Fig. 1B).
A step-by-step overview of the AlignAIR ML pipeline
This section provides a concrete, step-by-step walkthrough of how AlignAIR trains and infers the alignment of V, D, and J segments for a given input sequence. By illustrating each stage, we aim to clarify whether V, D, and J computations occur in parallel or sequentially, how the model segments at the single-position level, and how the alignment process unfolds. Supplementary Fig. S1 offers a high-level schematic illustration. Here, we elaborate on the individual steps in both the training and the inference modes.
Input and tokenization
Training
Each raw DNA sequence is first tokenized into integer values (see Supplementary Section S1.1.2). Sequences exceeding 576 nucleotides undergo optimal region selection (Supplementary Section S1.1.1). For each tokenized sequence, we have ground-truth labels indicating:
the start and end positions of the V, D, and J segments (or no segment present);
the true alleles for each segment (multi-hot if multiple alleles can fit);
productivity status, mutation rate, and indel count.
All of these are compiled into batches for efficient training.
Inference
The same tokenization process applies. However, no ground-truth labels are given, only the raw sequences (optionally with known chain type).
Embedding
Both the training and the inference proceed by embedding each token (nucleotide) via (i) a token embedding matrix and (ii) a positional embedding layer (Supplementary Section S1.2.1). The result is a matrix (sequence length × embedding dimension) capturing both the identity and the position of each nucleotide.
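The summed token and positional embedding can be sketched as follows. The tables here are filled with random numbers as stand-ins for learned weights, the embedding dimension of 4 is purely illustrative, and positions are 0-indexed for convenience; none of these choices are taken from the AlignAIR code.

```python
import random

VOCAB_SIZE = 6   # A, T, C, G, N, and padding
MAX_LEN = 576    # maximum sequence length stated in the text
EMBED_DIM = 4    # illustrative assumption; the real dimension is not given here

rng = random.Random(0)


def random_table(rows: int, cols: int) -> list[list[float]]:
    """Stand-in for a learned embedding matrix (random values, illustration only)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


token_table = random_table(VOCAB_SIZE, EMBED_DIM)
position_table = random_table(MAX_LEN, EMBED_DIM)


def embed(tokens: list[int]) -> list[list[float]]:
    """Return one row per position: token embedding plus position embedding."""
    return [
        [t + p for t, p in zip(token_table[tok], position_table[pos])]
        for pos, tok in enumerate(tokens)
    ]
```

The output has shape (sequence length × embedding dimension), so the same nucleotide receives a different vector at each position.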
Initial segmentation prediction
Parallel feature extraction
AlignAIR uses three parallel convolutional residual blocks (one each for V, D, and J) to process the embeddings. Each block produces features specialized to detect potential start/end boundaries for its respective segment.
$$F^{\text{seg}}_{s} = \mathrm{ResBlock}^{\text{seg}}_{s}(E), \quad s \in \{V, D, J\},$$
where E is the embedded input.
Position-level outputs
Dense layers on top of each block output start and end coordinates for the corresponding segment. These are real-valued predictions (e.g. 123.7). We round them to integer indices to create an initial boundary estimate for each segment:
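$$\text{start}_s = \operatorname{round}\left(\widehat{\text{start}}_s\right), \quad \text{end}_s = \operatorname{round}\left(\widehat{\text{end}}_s\right), \quad s \in \{V, D, J\}.$$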
Thus, segmentation occurs in parallel for V, D, and J.
During training, the model is heavily penalized whenever there is an overlap between predicted indices, and in practice no such overlaps are observed. As an additional safeguard, if the predicted end of one segment exceeds the predicted start of the next segment, we cap the former by the latter to avoid any overlap.
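The rounding step and the overlap safeguard can be sketched as below. This is an illustration under stated assumptions: segments are taken to be ordered V, D, J along the sequence, the function name `finalize_boundaries` is hypothetical, and Python's built-in `round` (which rounds exact halves to the nearest even integer) stands in for whatever rounding the model uses.

```python
def finalize_boundaries(
    raw: dict[str, tuple[float, float]],
) -> dict[str, tuple[int, int]]:
    """Round real-valued (start, end) predictions and cap overlapping ends."""
    # Light chains have no D segment, so only keep the segments that are present.
    order = [s for s in ("V", "D", "J") if s in raw]
    bounds = {s: (round(raw[s][0]), round(raw[s][1])) for s in order}
    for left, right in zip(order, order[1:]):
        start, end = bounds[left]
        next_start = bounds[right][0]
        if end > next_start:
            # Cap the end of the earlier segment at the start of the next one.
            bounds[left] = (start, next_start)
    return bounds
```

For example, a predicted V end of 295.6 followed by a predicted D start of 293.9 yields a V end capped at 294.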
Masking and segmented embeddings
For each segment, we create a binary mask spanning the predicted boundaries. We then apply element-wise multiplication (Hadamard product) to isolate each segment’s portion of the embedded sequence:
$$E_s = E \odot M_s, \quad s \in \{V, D, J\},$$
where $M_s$ is the binary mask of segment $s$.
As a result, each segment-specific embedding is “zeroed out” except for the region predicted to belong to that segment.
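A minimal sketch of the masking step is given below. It assumes half-open boundaries (start inclusive, end exclusive) and broadcasts the per-position mask across the embedding dimension; both helper names are hypothetical.

```python
def segment_mask(length: int, start: int, end: int) -> list[int]:
    """Binary mask with 1 on positions start..end-1 (half-open, an assumption)."""
    return [1 if start <= i < end else 0 for i in range(length)]


def apply_mask(embedding: list[list[float]], mask: list[int]) -> list[list[float]]:
    """Hadamard product of the embedding with the mask, broadcast over columns."""
    return [[value * m for value in row] for row, m in zip(embedding, mask)]
```

Rows outside the predicted segment become all zeros, while the output keeps the shape of the input, so each downstream block sees a full-length input restricted to its own segment.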
Segment classification
Parallel feature extraction for allele calls
We feed $E_V$, $E_D$, and $E_J$ into additional convolutional residual blocks (distinct from the initial boundary-detection blocks). These produce segment-specific feature vectors, from which we perform classification:
$$F_V = \mathrm{ResBlock}^{\text{cls}}_{V}(E_V)$$
$$F_D = \mathrm{ResBlock}^{\text{cls}}_{D}(E_D)$$
$$F_J = \mathrm{ResBlock}^{\text{cls}}_{J}(E_J)$$
Allele likelihoods
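For each segment, a dense layer with sigmoid activation, applied to the segment's feature vector ($F_V$, $F_D$, or $F_J$), outputs a vector of length $K_V$, $K_D$, or $K_J$, representing the reference’s known V, D, or J alleles. Hence,
$$p_V = \sigma\left(\mathrm{Dense}_V(F_V)\right) \in [0, 1]^{K_V},$$
where each $p_V[i]$ is the likelihood that allele $i$ was used, and similarly for D and J.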
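The allele likelihood head can be illustrated with a plain-Python dense layer followed by an element-wise sigmoid. The weights, bias, and function names are placeholders for illustration and do not come from the trained model.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def allele_likelihoods(
    features: list[float], weights: list[list[float]], bias: list[float]
) -> list[float]:
    """Dense layer plus sigmoid: one independent likelihood per reference allele."""
    return [
        sigmoid(sum(w * f for w, f in zip(row, features)) + b)
        for row, b in zip(weights, bias)
    ]
```

Because a sigmoid is applied per allele rather than a softmax over all alleles, the outputs need not sum to one, so several indistinguishable alleles can each receive a high likelihood, which matches the multi-hot ground-truth labels.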
Meta-tasks (mutation rate, indels, productivity)
Separate feature extraction
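In parallel to steps 3–5, a dedicated path processes the unmasked embeddings $E$ to predict meta attributes. A convolutional residual block yields $F_{\text{meta}}$, which is fed into dense layers for regression (mutation rate, number of indels) and binary classification (productivity):
$$F_{\text{meta}} = \mathrm{ResBlock}_{\text{meta}}(E), \quad \hat{y}_{\text{mut}} = \mathrm{Dense}_{\text{mut}}(F_{\text{meta}}), \quad \hat{y}_{\text{indel}} = \mathrm{Dense}_{\text{indel}}(F_{\text{meta}}), \quad \hat{y}_{\text{prod}} = \sigma\left(\mathrm{Dense}_{\text{prod}}(F_{\text{meta}})\right)$$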
These tasks run independently and do not alter the segmentation or allele classification computations.
Training process
During training, AlignAIR
minimizes a multi-task loss that sums segmentation, classification, and meta-task terms (Supplementary Section S1.3);
receives ground-truth boundaries (start/end), ground-truth allele multi-hot vectors, and meta-task labels;
iteratively updates all network parameters so that segmented embeddings, allele calls, and meta attributes converge to their correct targets.
The parallel architecture means that the model infers V, D, and J boundaries simultaneously, not in a cascade.
Inference and post-processing
Once trained, the model receives an input sequence, tokenizes/embeds it, and produces:
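segment boundaries (predicted start and end positions for V, D, and J);
allele likelihood vectors ($p_V$, $p_D$, $p_J$);
meta-task outputs (mutation rate, indel count, and productivity status).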
Optionally, we apply post-processing rules (Supplementary Section S1.5) to filter or threshold multiple allele calls and to align the predicted segments with the germline reference for final reporting.
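To make the idea of thresholding allele likelihoods concrete, the sketch below converts a likelihood vector into a list of allele calls. The fixed cutoff of 0.5, the fallback to the single most likely allele, and the function name `call_alleles` are illustrative assumptions and are not AlignAIR's actual rule, which is defined in Supplementary Section S1.5.

```python
def call_alleles(likelihoods: dict[str, float], threshold: float = 0.5) -> list[str]:
    """Return alleles with likelihood >= threshold, most likely first.

    Falls back to the single top-scoring allele when none passes the cutoff.
    """
    ranked = sorted(likelihoods.items(), key=lambda item: item[1], reverse=True)
    calls = [name for name, p in ranked if p >= threshold]
    if not calls and ranked:
        calls = [ranked[0][0]]
    return calls
```

With this rule, two nearly indistinguishable alleles that both score highly are reported together as an ambiguous call rather than forcing a single choice.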
Single-position alignment clarification
AlignAIR does not perform single-nucleotide alignments in a classical sense (like Smith–Waterman). Instead, it predicts continuous boundaries for each segment, and then masks the embedding inside those boundaries. Thus, it learns an implicit “alignment” at the segment level rather than iterating over each position. If a user requires single-nucleotide alignment post hoc, the model’s predicted boundaries can be refined using a standard local or global alignment approach (see Supplementary Section S1.5.3).
By detailing each step in the pipeline—input tokenization, embedding, parallel boundary detection, segmentation, allele classification, and meta-task predictions—we illustrate how AlignAIR handles V, D, and J computations concurrently and how it performs segmentation without classical single-nucleotide alignment. This approach provides a robust framework for deep learning-based Ig alignment and classification.
Inference scheme
The inference of the AlignAIR model consists of two stages. In the first stage, the model generates raw predictions, which primarily include metadata such as indel count, mutation rate, and productivity status, as well as allele likelihoods and initial segmentation estimates. These predictions require adjustments to account for preprocessing steps, such as input sequence padding, which can impact the segmentation estimates (Fig. 1A.3). The second stage involves extensive post-processing designed to transform the raw model predictions into a tabular format that aligns with the AIRR schema [50]. The post-processing primarily focuses on converting the predicted likelihoods into assignment labels per allele, determining meaningful start and end positions for each allele, as well as the germline start and end positions. Additionally, it filters out redundant or inaccurate estimates and can optionally derive a personalized genotype (detailed post-processing definition in Supplementary Section S1.5).
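As a rough illustration of the conversion to an AIRR-style table, the sketch below writes a few standard AIRR Rearrangement fields to tab-separated text. It covers only a small subset of the schema and is not a complete or validated AIRR record; the helper names and the assumption that the model's boundaries are 0-based and half-open are hypothetical. The conventions used (comma-separated ambiguous calls, `T`/`F` booleans, 1-based closed coordinates) follow the AIRR TSV format.

```python
import csv
import io

# Subset of AIRR Rearrangement field names used for this illustration.
FIELDS = ["sequence_id", "productive", "v_call", "d_call", "j_call",
          "v_sequence_start", "v_sequence_end"]


def to_airr_row(seq_id: str, productive: bool,
                calls: dict[str, list[str]], v_bounds: tuple[int, int]) -> dict:
    """Build one table row from predictions (0-based half-open bounds assumed)."""
    v_start, v_end = v_bounds
    return {
        "sequence_id": seq_id,
        "productive": "T" if productive else "F",
        "v_call": ",".join(calls.get("V", [])),
        "d_call": ",".join(calls.get("D", [])),
        "j_call": ",".join(calls.get("J", [])),
        "v_sequence_start": v_start + 1,  # convert to 1-based
        "v_sequence_end": v_end,          # closed interval end
    }


def write_tsv(rows: list[dict]) -> str:
    """Serialize rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```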
Metrics
The allele assignment is evaluated using the “agreement” convention presented in GenAIRR [14]. According to this convention, a sequence assignment is considered correct if the intersection between the predicted alleles and the ground truth alleles is not empty.
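The agreement convention amounts to a set-intersection test per sequence, as in this minimal sketch (the function name `agreement` and the use of Python sets are choices made for illustration):

```python
def agreement(predicted: list[set[str]], truth: list[set[str]]) -> float:
    """Fraction of sequences whose predicted and true allele sets intersect."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    if not truth:
        raise ValueError("at least one sequence is required")
    hits = sum(1 for p, t in zip(predicted, truth) if p & t)
    return hits / len(truth)
```

Under this convention a prediction listing several alleles counts as correct as long as one of them is among the true alleles.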
Segmentation quality is assessed using the root mean squared error (RMSE) metric. RMSE is calculated by comparing the absolute start and end positions of each of the V, D, and J alleles in the ground truth segments to the corresponding predicted positions.
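For reference, the RMSE over a set of boundary positions can be computed as follows; the function name `rmse` is hypothetical, and the inputs are assumed to be matched lists of predicted and true positions for one boundary type (for example, all V 3′ ends).

```python
import math


def rmse(predicted: list[float], truth: list[float]) -> float:
    """Root mean squared error between predicted and true positions."""
    if len(predicted) != len(truth) or not truth:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - t) ** 2 for p, t in zip(predicted, truth))
    return math.sqrt(squared / len(truth))
```

Because errors are squared before averaging, a few boundaries that are far off raise the RMSE more than many small deviations do.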
The accuracy of the productivity status classification is evaluated using the ratio of correctly classified instances. In our case, the GenAIRR framework ensures a balanced representation of both productive and nonproductive sequences, allowing for a straightforward accuracy measurement.
Benchmarking setup
We designed two benchmarking setups, one based on simulated data and the other on real-world sequencing data. For the simulation-based evaluation, we followed the procedure outlined in GenAIRR [14], generating four datasets, two for the heavy chain and two for the light chain, each consisting of 6 million sequences. Mutations were introduced using the S5F model [5] to replicate the statistical features observed in experimental AIRR-seq data. To ensure broad allele representation and eliminate biases, alleles were selected uniformly from the AIRR-C reference set [51]. For each chain, one dataset contained only productive sequences, representing standard AIRR-seq conditions, while the other was unfiltered, including mostly nonproductive sequences to evaluate alignment performance under more challenging conditions.
For real-world data validation, we analyzed a dataset containing both genomic annotations and IgG AIRR-seq repertoires from a public study (PRJNA555323 [52]). Four samples were selected with sufficient genomic coverage and repertoire depth, also examined in a recent study [48]. Genomic data were obtained from VDJbase [53], and AIRR-seq reads were processed as described previously [52]. To establish a genotype for each sample, we filtered out non-functional alleles and retained only those with valid recombination signal sequences, ensuring that only well-supported alleles were included. Novel alleles inferred from genomic data were excluded to maintain consistency across aligners. Each tool was then applied to annotate the IgG repertoire using the AIRR-C reference set [51], and its performance was evaluated based on agreement with the inferred genomic genotype.
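One plausible way to quantify agreement with a genomic genotype is sketched below, using two measures: the fraction of genotype alleles detected in at least one sequence, and the fraction of sequences whose calls match no genotype allele. The function name and the treatment of multi-allele calls (a sequence counts as matching if any of its called alleles is in the genotype) are assumptions for illustration.

```python
def genotype_metrics(calls: list[set[str]], genotype: set[str]) -> tuple[float, float]:
    """Return (allele retrieval rate, fraction of off-genotype calls)."""
    if not calls or not genotype:
        raise ValueError("calls and genotype must be non-empty")
    # Genotype alleles that appear in at least one sequence's call set.
    detected = set().union(*calls) & genotype
    retrieval = len(detected) / len(genotype)
    # Sequences whose call set shares no allele with the genotype.
    off_genotype = sum(1 for c in calls if not (c & genotype)) / len(calls)
    return retrieval, off_genotype
```

A high retrieval rate together with a low off-genotype fraction indicates that an aligner finds the alleles an individual carries without assigning sequences to alleles the individual lacks.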
Results
AlignAIR outperforms other aligners in allele classification, segmentation, and productivity estimation
As illustrated in Fig. 1A, the AlignAIR framework is built around four core components that enable its high accuracy and performance. Training an aligner based on experimental data is hindered by the lack of verified ground truth. For this reason, the initial component of AlignAIR is the GenAIRR [14] simulation suite, which generates comprehensive and unbiased datasets for both training and testing (for a concise overview of GenAIRR, see Supplementary Section S4). These datasets provide reliable ground truths across a variety of conditions. For unbiased training, a dataset with an equal representation of V(D)J alleles is utilized. For the heavy chain, V-D, D-J, and V-D-J allele combinations were equally represented, while for the light chain, only V-J allele combinations were equally represented. The second component involves the training of the deep architecture, which integrates innovative convolutional blocks and a novel multi-task loss function (Fig. 1B), specifically designed to capture the complex ambiguities present in AIRR-seq data, such as sequencing errors, sequence corruptions, indels, and SHM. This structure enables the model to accurately segment the V(D)J alleles, assign likelihoods to each allele, and predict other meta attributes such as productivity status and mutation rate. The third component, the inference phase, involves AlignAIR post-processing its predictions using a suite of tools and algorithms that ensure compatibility with the AIRR schema [50], facilitating seamless integration with existing AIRR-seq analysis pipelines [54]. The fourth and final component, the implementation phase, offers a choice between a user-friendly web interface for quick analysis of smaller datasets and a scalable local installation for efficient processing of millions of sequences. To ensure platform independence, a Docker image is provided (see the “Docker image and model prediction parameters” section).
This versatility makes AlignAIR accessible to a wide range of use cases, from individual researchers interested in a handful of accurate Ig alignments to large-scale genomic studies.
To comprehensively evaluate AlignAIR, we employed the procedure outlined in GenAIRR [14], utilizing four independently generated datasets: two for the heavy chain and two for the light chain. Each dataset comprised 6 million sequences, with mutations introduced via the S5F model to replicate the statistical features observed in experimental data (see the “Inputs and data sources” section). For each chain, one dataset was filtered to include only productive sequences, while the other was left unfiltered, comprising mostly nonproductive sequences. Both sets provide a uniform representation of all V(D)J alleles in the germline reference [55]. The datasets were aligned using AlignAIR, and the results were compared with those of IgBLAST and Partis, both of which have demonstrated high accuracy in prior benchmark studies [14, 56]. AlignAIR demonstrated superior performance compared to IgBLAST and Partis in accurately assigning V, D, and J alleles across various mutation rates (Fig. 2A, B, C, and E, and Table 1).
Figure 2.
Benchmarking performance of AlignAIR, IgBLAST, and Partis on immunoglobulin sequence analysis. (A) V allele agreement of alignment tools (AlignAIR, IgBLAST, and Partis) on productive sequences across varying V mutation rates. (B) J allele agreement on productive sequences across varying J mutation rates. (C) D allele agreement on sequences with mutations across varying D segment lengths. (D) Segment start and end regression RMSE for productive sequences at different positions. (E) D allele agreement on non-mutated sequences across different D segment lengths. (F) Retrieval rate of V calls in IgG AIRR-seq data, representing the proportion of sequences for which each aligner returned results across different samples. (G) Retrieval rate of V calls that match the genomic genotype, indicating the proportion of expressed IGHV alleles correctly identified by each aligner. (H) Fraction of V calls that do not match any allele in the genomic genotype, reflecting potential misassignments. (I) Retrieval rate of D calls in IgG AIRR-seq data, similar to panel (F), but for IGHD alleles. (J) Retrieval rate of D calls matching the genomic genotype, analogous to panel (G) but for IGHD. (K) Fraction of D calls that do not match any allele in the genomic genotype, analogous to panel (H) but for IGHD. In panels (F)–(K), the x-axis represents the aligners, and the colored points correspond to different samples, as indicated in the legend.
Table 1.
Comparison of various Ig sequence aligners based on key criteria, including custom reference support, operational efficiency, and accuracy across different mutation rates (MR)
| Criteria | AlignAIR (Prod.) | AlignAIR (Non) | IgBLAST (Prod.) | IgBLAST (Non) | Partis (Prod.) | Partis (Non) |
|---|---|---|---|---|---|---|
| Accepts custom reference | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| 6M seqs runtime (min; one value per tool) | 48 | | 2200 | | 5400 | |
| V accuracy (MR <10%) | 98.36 | 97.95 | 97.68 | 97.13 | 96.26 | 94.23 |
| V accuracy (MR >10%) | 94.58 | 91.90 | 90.32 | 89.05 | 85.76 | 84.48 |
| J accuracy (MR <10%) | 99.92 | 99.87 | 99.68 | 99.43 | 99.72 | 99.23 |
| J accuracy (MR >10%) | 98.64 | 97.62 | 95.16 | 93.91 | 97.77 | 97.22 |
| D accuracy (length <10) | 56.78 | 55.67 | 39.16 | 38.38 | 36.74 | 35.98 |
| D accuracy (length >10) | 84.44 | 83.60 | 69.42 | 69.17 | 73.75 | 73.19 |
| V 3′ RMSE | 2.79 | 5.10 | 1.88 | 4.04 | 1.57 | 5.38 |
| D 5′ RMSE | 3.84 | 4.71 | 7.43 | 8.16 | 5.69 | 7.01 |
| D 3′ RMSE | 3.97 | 4.75 | 6.75 | 7.44 | 4.97 | 8.42 |
| J 5′ RMSE | 2.31 | 3.10 | 6.56 | 7.55 | 2.01 | 6.44 |
| Productivity accuracy | 99.21 | 99.38 | 99.70 | 96.99 | 89.30 | 98.74 |
Accuracy metrics are reported separately for productive (left) and nonproductive (right) sequences, with the top results highlighted in bold. RMSE measures the precision of segment start and end position predictions; lower values indicate higher precision.
For V segments, AlignAIR maintained a higher agreement rate with the ground truth compared to IgBLAST and Partis, especially as the mutation rate increased beyond 10% (Fig. 2A). A similar trend was observed for J alleles, where AlignAIR’s accuracy remained stable, even when mutation rates exceeded 15% (Fig. 2B). Assigning D alleles to mutated Ig sequences with trimmed D segments is the most challenging task for an aligner. This is because of the shorter length of D segments compared to V and J segments, their frequent trimming during recombination, and the high degree of similarity among D alleles (Supplementary Fig. S8). AlignAIR excels in this task in both mutated (Fig. 2C) and unmutated (Fig. 2E) sequences. Proper sequence segmentation is crucial for accurate attribution of SHM events to specific segments and for reliable assessment of sequence productivity. Based on our performance evaluation, we observed that AlignAIR obtained a significantly lower RMSE, indicating that its predicted start and end positions were closer to the ground truth. However, for the V-3′ region, IgBLAST presented a slightly lower RMSE (Fig. 2D). In terms of accuracy in predicting sequence productivity status, AlignAIR performed slightly better than Partis. Both AlignAIR and Partis classified sequences by productivity status with near-perfect accuracy, while IgBLAST demonstrated a slightly higher rate of false positives (see Supplementary Fig. S10D–F). Analogous results for nonproductive sequences regarding allele assignment, segmentation, and productivity classification are presented in Supplementary Fig. S10A–F.
To comprehensively evaluate aligners on real-world datasets, we utilized a unique dataset containing both genomic and matching IgG-specific AIRR-seq data from the same individuals [52]. This dataset enables the assessment of alignment accuracy on expressed AIRR-seq data while leveraging high-quality genomic annotations as the ground truth for alignment calls. The analysis focused on four samples with sufficient genomic coverage and repertoire depth, which were also used in a recent study [48]. First, we assessed the proportion of actual IgG sequences for which each aligner successfully returned results across the four samples. AlignAIR provided annotations for all queried sequences, with IgBLAST slightly trailing behind, whereas Partis failed to return results for ∼3% of the sequences, depending on the sample (Fig. 2F). Next, we examined the fraction of IGHV alleles present in the personalized genotype that were detected in at least one sequence of the expressed repertoire. AlignAIR and IgBLAST achieved comparable retrieval rates, detecting nearly all alleles with a mean retrieval of ≈90%, whereas Partis exhibited significantly lower coverage (Fig. 2G). To further evaluate the precision of allele assignment, we quantified the fraction of sequences with alignment calls that did not match any allele in the personalized genotype. AlignAIR demonstrated the lowest rate of such mismatches, outperforming both IgBLAST and Partis (Fig. 2H), indicating improved precision in allele assignment. An analogous analysis was performed for the IGHD alleles (Fig. 2I–K). All aligners successfully retrieved 100% of the alleles present in the personalized genotype (Fig. 2J). However, AlignAIR outperformed the other tools by minimizing the proportion of sequences assigned to alleles absent from the genomic reference (Fig. 2K).
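The three genotype-based quantities described above can be sketched as follows. This is a minimal illustration under stated assumptions: each sequence carries a set of allele calls (an empty set meaning that the aligner returned no result), and the mismatch fraction is taken over annotated sequences only. The function name and the toy calls are hypothetical.

```python
def genotype_metrics(calls, genotype):
    """Summarize aligner calls against a personalized genotype.

    calls: list of sets of allele names, one set per sequence
           (an empty set means the aligner returned no result).
    genotype: set of allele names found in the individual's genome.
    """
    annotated = [c for c in calls if c]
    returned_fraction = len(annotated) / len(calls)
    # Genotype alleles seen in at least one sequence of the repertoire
    detected = set().union(*annotated) & genotype
    retrieval_rate = len(detected) / len(genotype)
    # Sequences whose calls share no allele with the genotype
    off_genotype = sum(1 for c in annotated if not (c & genotype))
    mismatch_fraction = off_genotype / len(annotated) if annotated else 0.0
    return returned_fraction, retrieval_rate, mismatch_fraction

# Toy data (hypothetical calls, for illustration only)
genotype = {"IGHV1-2*02", "IGHV3-23*01", "IGHV4-34*01", "IGHV1-69*01"}
calls = [
    {"IGHV1-2*02"},
    {"IGHV3-23*01", "IGHV3-23*04"},  # multiple calls, one in the genotype
    {"IGHV1-2*04"},                  # call absent from the genotype
    set(),                           # no result returned
    {"IGHV4-34*01"},
]
returned, retrieval, mismatch = genotype_metrics(calls, genotype)
print(returned, retrieval, mismatch)
```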
Taken together, these results underline AlignAIR’s robustness, especially in the context of high mutation rates and corrupted sequences, setting a new state of the art in correctly assigning and properly segmenting V(D)J alleles.
AlignAIR’s likelihoods capture alignment uncertainty
The availability of a meaningful likelihood distribution across the reference alleles offers significant advantages in inference over traditional alignment scores. These likelihoods encapsulate a wide range of parameters influencing the model’s decisions, including the mutation model, mutation rate, indels, trimming, corruptions, and reference sparsity, which refers to the similarity distribution among alleles in the reference. To demonstrate how the likelihoods produced by AlignAIR capture the uncertainty in the reference assignments, we aggregated the likelihoods for each allele in the reference across all V, D, and J genes (see Fig. 3A). The likelihoods were grouped into intervals of 0.1, ranging from 0 to 1. The average percentage of agreement under each bin was calculated (see Fig. 3B) and compared with the likelihoods generated by the closest alternative, Partis (see Fig. 3C).
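The binning procedure described above can be illustrated with a short sketch. It assumes that each allele likelihood is paired with a flag stating whether that allele matches the ground truth; the function name and the toy values are hypothetical.

```python
def calibration_table(likelihoods, correct, n_bins=10):
    """Group likelihoods into equal-width bins on [0, 1] and report,
    for each bin, the fraction of entries that agree with the ground truth."""
    totals = [0] * n_bins
    hits = [0] * n_bins
    for p, ok in zip(likelihoods, correct):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes to the last bin
        totals[idx] += 1
        hits[idx] += bool(ok)
    return [
        (i / n_bins, (i + 1) / n_bins, hits[i] / totals[i] if totals[i] else None)
        for i in range(n_bins)
    ]

# Toy data (hypothetical, for illustration only)
likelihoods = [0.75, 0.72, 0.78, 0.71, 0.15, 0.12, 0.18, 0.11, 0.14]
correct = [1, 1, 1, 0, 0, 0, 1, 0, 0]
table = calibration_table(likelihoods, correct)
for low, high, agreement in table:
    if agreement is not None:
        print(f"[{low:.1f}, {high:.1f}): {agreement:.2f}")
```

For a well-calibrated model, the agreement in each bin falls close to the bin's likelihood range, which is the relationship examined in Fig. 3B and C.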
Figure 3.
(A) A schematic summary of how AlignAIR likelihoods are produced. An Ig sequence is fed into the AlignAIR model, which generates a likelihood vector for each gene type, assigning a likelihood value for each allele in the reference set. These likelihoods are then normalized by the total likelihood sum. The final assignment for the given Ig sequence is chosen using a dynamic threshold approach, capped at a maximum number of alleles. (B) Average normalized likelihoods produced by AlignAIR, illustrating how the model’s likelihoods relate to the assignment confidence. Examples of 20% and 70% likelihoods are emphasized with dots to show the agreement between the likelihoods and the observed agreement. (C) Similar to panel (B), but using the likelihoods produced by Partis instead of AlignAIR. (D–F) Agreement rates (on V, D, and J alleles, respectively) for AlignAIR alone, plotted against the average mutation rate. Curves compare three prediction strategies: Top 1 (highest-likelihood single assignment), Top 3, and the dynamic threshold approach. (G–I) Agreement rates for Partis (Top 1, Top 3) compared against AlignAIR’s dynamic threshold approach, plotted against the average mutation rate for V, D, and J alleles. The solid line denotes the dynamic threshold approach in all panels, demonstrating improved consistency and accuracy relative to single-likelihood (Top 1) predictions. Created in BioRender. Peres, A. (2025) https://BioRender.com/t68bjb2.
Figure 3B demonstrates strong agreement between AlignAIR’s likelihoods and the empirical allele assignment accuracies. For example, as indicated with dashed lines in Fig. 3B, when the likelihood assigned to an allele in any of the V, D, and J genes is at least 70%, there is a corresponding 70% chance of a correct assignment. Conversely, when the likelihood value falls below 20%, the confidence in the assignment similarly drops to below 20%. This linear relationship highlights the robustness of the likelihood values as indicators of assignment accuracy.
Figure 3C presents the same graph using likelihoods produced by Partis, offering a comparative perspective. This comparison demonstrates that AlignAIR’s likelihood estimates more effectively reflect alignment confidence and accuracy.
To capture the probabilistic nature of sequence alignment results and maintain compatibility with established tools like IgBLAST and HighV-QUEST, AlignAIR outputs a set of possible allele assignments. The set size is capped at three and determined by a dynamic threshold approach (see Supplementary Section S1.5.2), which adjusts the number of assignments based on the model’s confidence scores. This method ensures that the selections are optimized for agreement with the naive strategies of returning the Top 1, Top 2, or Top 3 likelihoods while maintaining a consistent format with traditional tools.
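To make the idea of a capped, confidence-dependent assignment set concrete, the sketch below normalizes a likelihood vector and keeps every allele whose likelihood is within a fixed fraction of the top one, up to three alleles. This selection rule is an assumption chosen for illustration only; the actual dynamic threshold is defined in Supplementary Section S1.5.2 and may differ. The function name, the `fraction` parameter, and the allele labels are hypothetical.

```python
def select_alleles(likelihoods, fraction=0.5, cap=3):
    """Return up to `cap` (allele, normalized likelihood) pairs.

    ASSUMPTION: an allele is kept when its normalized likelihood is at
    least `fraction` times the highest one. This is an illustrative
    rule, not the rule used by AlignAIR.
    """
    total = sum(likelihoods.values())
    if total <= 0:
        return []
    normalized = {allele: value / total for allele, value in likelihoods.items()}
    ranked = sorted(normalized.items(), key=lambda item: item[1], reverse=True)
    cutoff = ranked[0][1] * fraction
    return [(allele, p) for allele, p in ranked if p >= cutoff][:cap]

# Toy likelihood vectors (hypothetical allele labels)
confident = {"allele_1": 0.90, "allele_2": 0.05, "allele_3": 0.05}
ambiguous = {"allele_1": 0.30, "allele_2": 0.28, "allele_3": 0.25,
             "allele_4": 0.20, "allele_5": 0.02}

print([a for a, _ in select_alleles(confident)])
print([a for a, _ in select_alleles(ambiguous)])
```

Under this rule, a confident prediction yields a single call, whereas an ambiguous one yields several calls and is truncated at the cap.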
In Fig. 3D–I, we plot the agreement rates against the average mutation rates for V, D, and J alleles for different prediction strategies (Top 1, dynamic threshold, Top 3) for both AlignAIR and Partis. These plots indicate that AlignAIR consistently outperforms Partis in V and J allele assignments across all mutation rates. For D alleles, this advantage becomes pronounced as mutation rates increase. As mentioned, the number of returned results is capped at a maximum of 3, even with the dynamic threshold approach, to maintain precision. This limitation is demonstrated in Supplementary Fig. S7, which shows the average number of calls as a function of mutation rate.
Supplementary Fig. S7 further supports these findings by showing the distribution of likelihood sums and differences across V, D, and J alleles in AlignAIR’s predictions. For V and J alleles, the sum of likelihoods correlates consistently with the number of ground truth alleles, while D alleles exhibit higher uncertainty, as expected due to their shorter length and short inter-allele discriminative regions [14]. The probability distributions of the difference between the maximum likelihood and the fourth-highest likelihood for each allele type reveal that AlignAIR maintains high confidence in its predictions for V and J alleles, with clear distinctions between top likelihoods. In contrast, D alleles show greater uncertainty, indicated by smaller differences between likelihoods, reflecting the inherent challenge in predicting these alleles.
Overall, these results demonstrate that the likelihoods produced by AlignAIR are reliable indicators of alignment confidence and are also crucial for maintaining high accuracy in allele assignments, effectively accounting for the inherent uncertainties in alignment tasks.
AlignAIR effectively learns mutation models
To evaluate AlignAIR’s ability to learn and differentiate between mutation models, we trained AlignAIR using four mutation models (see Supplementary Section S2) and tested its prediction accuracy across these models. Figure 4A illustrates the average agreement scores for AlignAIR’s V allele predictions when trained and tested on different mutation models, including three S5F variants (S5F Classic, S5F Opposite, and S5F 60) and the uniform mutation model, all with mutation rates below 10%. The results demonstrate that AlignAIR maintains high agreement scores across all models, ranging from 99.14% to 99.40%. This consistency underscores the model’s robustness, demonstrating its ability to effectively generalize across different mutation models without significant loss in accuracy at low mutation rates.
Figure 4.
The average agreement score comparison of AlignAIR’s prediction of V calls when trained and tested on different mutation models with a mutation rate below 10% (A) and between 10% and 25% (B). The Euclidean distance between the V germline alleles with a given number of mutations (x-axis) and the unmutated germline alleles in the AlignAIR V latent space, trained on data using the S5F mutation model (C) and uniform model (D). The blue line represents sequences where the mutations follow the S5F model, while the orange line represents sequences with the same number of mutations applied using the uniform mutation model. (E) Illustration of the experimental protocol used to demonstrate that AlignAIR’s likelihoods successfully capture the effect of the mutation model embedded in the data. (1) Two alleles differing by exactly one base are selected. (2) A synthetic sequence with a third nucleotide at the position of difference is created. Given a nonuniform mutation model, the likelihood that the selected alleles will mutate at this position to transform into the synthetic sequence is not necessarily equal. (3) The mutation model’s mutability likelihood and AlignAIR likelihood for both alleles given the synthetic sequence are derived. (4) This protocol is repeated for all allele pairs differing by exactly one position, recording the likelihood differences. (5) This is done twice: once with a nonuniform mutation model (S5F) and once with a uniform model. (F) Scatter plot of the differences between the likelihoods and the mutation model mutability differences for all allele pairs that differ by a single nucleotide for the uniform (orange) and S5F (blue) mutation models. Each quadrant also displays the percentage of allele pairs sharing the same relationship. Created in BioRender. Peres, A. (2025) https://BioRender.com/s0qxnfu.
Figure 4B presents the average agreement scores for V allele predictions at higher mutation rates (10%–25%). While we observed greater variability, particularly when using the S5F Opposite model, AlignAIR still achieves high accuracy, with agreement scores ranging from 93.71% to 97.09%. The diagonal in Fig. 4B emphasizes AlignAIR’s reliance on the mutation model used in its training; the best performance is seen in cases where the training and testing data share the same mutation model.
This variation in performance highlights AlignAIR’s adaptability to mutation models that share structural patterns with its training data. Specifically, when AlignAIR is trained on the S5F model, it captures the nuanced mutational dynamics and context-specific nucleotide propensities unique to this model. The S5F 60 model, although it has a shifted mutational probability, still aligns with the core mutation trends defined by the S5F Classic model, maintaining some consistency in the mutation hotspots and patterns. This overlap allows AlignAIR to transfer the learned mutation structure from S5F Classic to S5F 60, resulting in better generalization and alignment accuracy.
In contrast, when AlignAIR is trained on the S5F Opposite model, the mutation landscape is inherently inverted, leading AlignAIR to learn a distinct pattern of nucleotide changes that diverges from the trends seen in both S5F Classic and S5F 60. This inversion limits the model’s adaptability when tested on S5F 60, as seen in the figure, where the performance drops to 94.13%. Since S5F Opposite’s mutation probabilities are reversed, AlignAIR’s latent space representation becomes highly specialized to this opposite structure, which hampers its ability to generalize to S5F 60’s modified, but non-inverted, mutation patterns.
These findings underscore the importance of training on mutation models that align closely with expected data characteristics to optimize model performance, particularly at high mutation rates.
Figures 4C and D explore how well AlignAIR captures the underlying mutation dynamics in its latent space. In Fig. 4C, we visualize the Euclidean distance between the latent projection of unmutated germline V alleles and germline alleles with varying mutation numbers (x-axis) when AlignAIR is trained on data with mutations following the S5F model. The increasing Euclidean distance as the number of mutations rises highlights how sequences progressively diverge in the latent space as mutations accumulate. Specifically, when AlignAIR is trained on S5F-mutated sequences and tested on sequences mutated using the same model, the divergence from unmutated germlines is more controlled than for sequences in which mutations were applied uniformly. This suggests that using a biologically relevant, nonrandom mutation model enables leveraging additional structural information inherent in real sequencing data, enhancing the model’s ability to recognize observed sequence patterns.
In contrast, Fig. 4D displays the same Euclidean distance metric when AlignAIR is trained on data generated using a uniform mutation model. The results show that the model’s latent space is less robust to context-specific patterns when trained on uniform mutations. As a result, sequences with mutations following the S5F model diverge more quickly from the unmutated germlines than those with uniformly applied mutations, highlighting the impact of the training data’s structure on model robustness.
Both panels (4C and D) illustrate how the underlying latent space of the V allele classification in AlignAIR is influenced by the mutation rate and the type of mutation model applied.
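The distance measurement underlying these panels can be sketched as follows. This is a minimal illustration, not the analysis code: the one-hot encoder is a stand-in for AlignAIR’s latent projection, only the uniform mutation model is implemented (S5F targeting is omitted), and all function names and the toy germline are hypothetical.

```python
import math
import random

BASES = "ACGT"

def mutate_uniform(seq, n_mut, rng):
    """Substitute n_mut distinct positions, each with a different base chosen uniformly."""
    out = list(seq)
    for pos in rng.sample(range(len(seq)), n_mut):
        out[pos] = rng.choice([b for b in BASES if b != out[pos]])
    return "".join(out)

def one_hot_encode(seq):
    """Stand-in encoder (NOT AlignAIR's latent space): flattened one-hot vector."""
    vec = []
    for base in seq:
        vec.extend(1.0 if base == b else 0.0 for b in BASES)
    return vec

def mean_distance_by_mutations(germline, encode, counts, n_rep=20, seed=0):
    """Mean Euclidean distance between the encoded germline and its mutated copies."""
    rng = random.Random(seed)
    origin = encode(germline)
    curve = {}
    for n_mut in counts:
        dists = [
            math.dist(encode(mutate_uniform(germline, n_mut, rng)), origin)
            for _ in range(n_rep)
        ]
        curve[n_mut] = sum(dists) / n_rep
    return curve

germline = "ACGT" * 10  # toy sequence, not a real germline allele
curve = mean_distance_by_mutations(germline, one_hot_encode, counts=[0, 1, 5, 10])
print(curve)
```

With the one-hot stand-in, the distance depends only on the number of substitutions. A learned encoder can instead place sequences mutated under different models at different distances, which is the effect examined in Fig. 4C and D.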
We further explored the model’s latent space by tracking how each germline allele shifted as mutations accumulated. By simulating and projecting these movements with Principal Component Analysis (PCA), we could see how the alleles “drift” in this space. As the mutation level increased, the alleles moved closer to a common region where we had also projected random DNA sequences. This convergence highlights that with more mutations, the germline alleles become less distinct and resemble random sequences, as shown in Supplementary Fig. S9.
Beyond tracing allele movements in latent space, another crucial aspect of understanding mutation model learnability lies in examining the likelihood scores AlignAIR assigns to different allele classifications. These likelihoods offer a window into the model’s sensitivity to mutation contexts, especially when those mutations are driven by specific nucleotide arrangements, as in the S5F model. We can assess how well the model reflects context-dependent mutational trends by comparing AlignAIR’s likelihood predictions against the empirical mutability scores. This comparison provides further insight into how AlignAIR has internalized and learned the distinct mutational patterns defined by S5F.
To investigate how well AlignAIR captures mutational dynamics, we compared the likelihood scores generated by AlignAIR when predicting allele assignments to the S5F mutability score. Given that AlignAIR was trained on the S5F mutation model, we aimed to test whether its likelihoods are sensitive to context-dependent mutations as described by this model and whether these likelihoods align with the biological mutability scores of the S5F model. The goal was to evaluate whether AlignAIR could correctly reflect the varying mutability of alleles, particularly where mutation probability is influenced by the nucleotide context. By comparing these likelihoods, we gain insight into how effectively AlignAIR has learned the nuances of the S5F-driven mutational patterns. We followed a specific protocol (Fig. 4E): (i) We selected a pair of alleles (in this example, allele 05 and allele 06) from the reference that differ at a single position (Fig. 4E.1). For simplicity, we refer to allele 05 as “A” and allele 06 as “B” in the following text. (ii) We then created a third synthetic sequence, where the differing position contains a nucleotide that does not match either A or B (Fig. 4E.2). In cases with nonuniform mutations, the likelihood of the differing position mutating from A to the synthetic sequence is not equal to the likelihood that it was mutated from B, as the five-nucleotide context of these alleles is different in this position. (iii) Next, we extracted the AlignAIR likelihoods and the S5F mutability likelihoods for these sequences (Fig. 4E.3 and E.4). We compared these likelihoods to assess whether AlignAIR’s likelihoods agree with the S5F mutability scores when the model is trained on S5F-mutated sequences.
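Steps (i) and (ii) of this protocol can be sketched as below. The sketch assumes that allele sequences are given as equal-length, position-aligned strings, and it returns both nucleotides that match neither allele, since the text does not specify which of the two is used. The function names and the toy reference are hypothetical.

```python
from itertools import combinations

def single_mismatch_pairs(reference):
    """Yield (name_a, name_b, position) for equal-length alleles
    that differ at exactly one position."""
    for (name_a, seq_a), (name_b, seq_b) in combinations(reference.items(), 2):
        if len(seq_a) != len(seq_b):
            continue  # assumption: only position-aligned, equal-length alleles are compared
        diffs = [i for i, (x, y) in enumerate(zip(seq_a, seq_b)) if x != y]
        if len(diffs) == 1:
            yield name_a, name_b, diffs[0]

def synthetic_sequences(seq_a, seq_b, pos):
    """Sequences carrying, at `pos`, a nucleotide that matches neither allele."""
    others = [b for b in "ACGT" if b not in (seq_a[pos], seq_b[pos])]
    return [seq_a[:pos] + b + seq_a[pos + 1:] for b in others]

# Toy reference (hypothetical sequences, far shorter than real alleles)
reference = {
    "toy*01": "ACGTAC",
    "toy*02": "ACGTTC",  # differs from toy*01 at position 4 only
    "toy*03": "TTGTAC",  # differs from toy*01 at two positions
}
pairs = list(single_mismatch_pairs(reference))
for name_a, name_b, pos in pairs:
    print(name_a, name_b, pos, synthetic_sequences(reference[name_a], reference[name_b], pos))
```

Each synthetic sequence would then be scored both by the aligner and by the mutation model to obtain the likelihood differences compared in Fig. 4F.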
This experiment was repeated for all pairs of alleles differing at a single position. We plotted the difference in likelihoods assigned by AlignAIR for alleles A and B (y-axis) against the difference in S5F mutability scores for the same alleles (x-axis, Fig. 4F). A strong correlation was observed for the S5F-trained model (R² = 0.778) compared to the uniform-trained model (R² = 0.001), indicating that AlignAIR effectively learns the S5F mutation model. For 95% of the allele pairs in which allele A had a higher likelihood than allele B of transitioning to the synthetic sequence according to the S5F mutability scores, AlignAIR predicted the same trend (Fig. 4F, Q1). Conversely, for pairs in which allele A had a lower likelihood than allele B, AlignAIR predicted the same trend in 74% of cases (Fig. 4F, Q3). This agreement is significantly higher than that observed when using AlignAIR trained on data with uniform mutations, which shows near-random performance (Fig. 4F, orange).
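The summary statistics of this comparison can be illustrated with the sketch below. It assumes that R² denotes the squared Pearson correlation between the two sets of differences, which the text does not state explicitly, and it uses hypothetical function names and toy values rather than the data behind Fig. 4F.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

def sign_agreement(mutability_diff, likelihood_diff):
    """Fraction of pairs whose likelihood difference has the same sign as the
    mutability difference, reported separately for positive and negative
    mutability differences."""
    pairs = list(zip(mutability_diff, likelihood_diff))
    positive = [l > 0 for m, l in pairs if m > 0]
    negative = [l < 0 for m, l in pairs if m < 0]
    frac_pos = sum(positive) / len(positive) if positive else None
    frac_neg = sum(negative) / len(negative) if negative else None
    return frac_pos, frac_neg

# Toy differences (hypothetical, for illustration only)
mutability_diff = [0.2, 0.1, 0.3, -0.1, -0.2, -0.3, 0.4, -0.4]
likelihood_diff = [0.15, 0.05, 0.2, -0.05, 0.1, -0.2, 0.3, -0.3]

r2 = r_squared(mutability_diff, likelihood_diff)
agreement = sign_agreement(mutability_diff, likelihood_diff)
print(f"R^2 = {r2:.3f}, same-sign fractions = {agreement}")
```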
Implementation
Docker image and model prediction parameters
AlignAIR can be easily deployed using a Docker image, ensuring a consistent environment across different computational platforms. The Docker setup simplifies the installation process, requiring only Docker to be installed on the host machine. The Dockerfile configures the necessary environment, installs dependencies, and sets up the AlignAIR model for prediction.
When predicting with the model via Docker, several parameters can be adjusted to control the prediction process, offering users a high level of customization. Users can specify paths for the saved model weights, the output directory for alignment results, and the input sequences file. The chain type (e.g. heavy or light) can also be set, along with configuration files for different chain data. Other adjustable parameters include the maximum input size for the model, batch size for processing, and thresholds for predicting V, D, and J alleles. Additionally, options are available to translate names to ASCs [48] from IUIS—necessary because the model is trained on ASC alleles rather than IUIS names—fix DNA orientation, and use custom orientation models. These parameters ensure that the prediction process is highly controllable, allowing users to fine-tune the model to their specific needs. The use of Docker further enhances reproducibility and ease of use, making AlignAIR accessible to a wide range of researchers.
Web interface at AlignAIR.ai
The AlignAIR suite is also accessible through a web interface hosted at https://alignair.ai/. This platform offers a user-friendly experience, enabling researchers to perform sequence alignment without needing extensive computational resources or technical expertise in deep learning. The web interface provides several key benefits. It is designed to be intuitive, allowing users to upload sequences and configure alignment settings with minimal effort. The platform includes comprehensive visualization tools for aligned data, helping users to interpret and analyze results easily. Additionally, users can export the aligned sequences and accompanying reports directly from the platform, facilitating further analysis and documentation.
Supplementary Fig. S11 shows examples of the web interface, highlighting its features and usability. This interface ensures that AlignAIR’s powerful alignment capabilities are accessible to a broader audience, promoting its adoption in the AIRR research community.
Discussion
The results presented in this study underscore the significant advancements introduced by AlignAIR in immunogenetics, particularly in enhancing Ig sequence alignment. By leveraging a multi-task deep learning framework and simulation-based training, AlignAIR addresses the complexities of V(D)J recombination and SHM. It surpasses existing tools in allele classification accuracy and introduces new capabilities in sequence segmentation, metadata estimation, and productivity classification, enabling a more comprehensive analysis of immune receptor repertoires. Primarily trained on B-cell light and heavy chains, AlignAIR’s versatile framework is also designed to support T-cell receptor alpha and beta chains, with model training and validation pending, thereby broadening its applications across immune receptors.
A key component of AlignAIR’s success is its integration of domain-specific knowledge with advanced simulation techniques, particularly through the use of simulated sequences generated by the GenAIRR simulation suite [14]. Experimental datasets often lack ground truth, making alignment validation highly subjective [57]. Additionally, relying on pre-aligned sequences would inherently limit AlignAIR’s performance to the accuracy of the pre-alignment tool. Simulated datasets, however, provide known ground truths and allow for thorough evaluation across diverse conditions, including high mutation rates, indels, sequencing errors, and data corruptions [58]. This simulation-based approach enables the modeling of complex biological processes such as stochastic recombination, SHM patterns, and sequence trimming, all of which pose challenges in real-world data. AlignAIR’s domain-specific multi-task learning framework further addresses these complexities, allowing it to maintain state-of-the-art performance in allele assignment, segmentation, and productivity classification, even in the most challenging cases.
Simulated sequences also offer an unbiased representation of all V(D)J alleles, avoiding the biases of empirical datasets, such as uneven allele distributions and platform-specific sequencing errors [14]. This unbiased distribution enables AlignAIR to capture the complexity of AIRR-seq data, allowing it to generalize across various repertoires and perform well under diverse conditions [59].
While simulations offer a controlled and unbiased benchmarking framework, real-world validation is essential to ensure AlignAIR’s robustness. Unlike previous alignment methods that rely solely on simulated or pre-aligned experimental data [9–13], we expanded our evaluation by incorporating a dataset that includes both genomic and matching AIRR-seq data from the same individuals. This approach allows us to assess AlignAIR’s performance in a more realistic setting, bridging the gap between theoretical benchmarking and practical application. It enables validation of AlignAIR’s performance using high-confidence genomic annotations as the ground truth, where it successfully recovered the majority of expressed alleles while minimizing incorrect assignments. Although this approach provides strong repertoire-level validation, it does not establish a direct sequence-level ground truth, as genomic data reflect the overall genotype rather than the precise origin of each AIRR-seq read. Future advances in experimental methodologies will enable more granular validation at the sequence level, further refining benchmarking approaches for AIRR-seq aligners.
For mutation modeling, we adopted the S5F framework [5] due to its well-characterized mutability profile across 5-mer motifs. We acknowledge, however, that S5F does not capture all SHM dynamics, particularly in cases with context-specific mutational tendencies [60]. Testing a modified S5F model with a 60% shift in mutational probabilities confirms that moderate changes in the mutation spectrum do not significantly affect AlignAIR’s performance. Future improvements in SHM modeling or more comprehensive mutation models could further refine our simulations and enhance AlignAIR’s robustness.
AlignAIR’s model size—only 12 million trainable weights—makes it resource-efficient and faster to train and deploy. This compact architecture contrasts with large language models and transformer architectures, which can contain hundreds of millions to billions of parameters [61]. This efficiency makes AlignAIR accessible to labs and researchers with limited computational resources while still maintaining high performance in AIRR-seq alignment tasks without the overhead associated with larger models (model prediction runtime benchmarking can be seen in Supplementary Fig. S6).
While AlignAIR addresses many sequence complexities, ongoing advancements in immunogenetic understanding offer new insights into receptor formation that could be integrated into future simulations. For instance, incorporating double-D recombination events [62] and refining junctional diversity models for indel and mutation nuances [63, 64] will enhance AlignAIR’s relevance as sequencing technologies and immunogenetic knowledge continue to evolve. Likewise, the current indel model, while effective for common patterns, is preliminary. Future development will expand it to handle complex indel events, including those in specific contexts, such as junctional diversity or platform-specific errors [65].
Our results show that the latent space learned by AlignAIR retains substantial immunogenetic information, providing a strong foundation for downstream tasks like antigen specificity prediction, clonotype clustering, and lineage tracing. Future extensions to the model, such as amino acid support, would enable the capture of biologically relevant features like physicochemical properties influencing binding affinity and receptor specificity, thereby broadening AlignAIR’s utility for therapeutic antibody discovery and vaccine development.
The latent space also enables mapping allele variability within and across species, supporting inference of novel alleles in minimally characterized repertoires [66, 67]. This feature is advantageous for species with incomplete allele references, such as the rhesus macaque [68], allowing AlignAIR to support comparative immunogenetic studies and insights into cross-species immune variability. Informed by this latent space, AlignAIR’s scoring system offers a probabilistic view of allele relationships, thereby capturing the complexities and nuances of immune receptor diversity.
This likelihood-based scoring offers a notable advancement by accounting for the stochastic nature of V(D)J recombination and SHM. Unlike traditional aligners relying on maximum likelihood estimates, AlignAIR’s probabilistic output considers multiple possible alleles, each with a likelihood score. This probabilistic model aligns with the randomness of receptor generation processes, providing nuanced interpretations of sequence ambiguity.
Likelihood scores further capture uncertainties in alignment, especially when mutations, insertions, or deletions obscure the true allele. Our results demonstrate a strong correlation between likelihood scores and prediction accuracy, underscoring their reliability. For example, when an allele’s likelihood score reaches 70%, there is an ∼70% chance of correct assignment. This probabilistic approach informs more accurate receptor alignment and supports insights into immunogenetic diversity and receptor evolution, underscoring the value of probabilistic over deterministic models in managing AIRR-seq data variability [69].
While specific germline and SHM parameters, such as the S5F mutation model, limit the model’s generalizability across distinct mutational landscapes, expanding AlignAIR’s training to include additional biological contexts and platform-specific biases could enhance its applicability. The methodologies developed for AlignAIR, particularly the multi-task framework and simulation-based training, have broader applications in bioinformatics [17, 70]. These approaches could be adapted to alignment and classification tasks in fields such as genomics and proteomics [71, 72], where complex sequence variations pose similar challenges.
AlignAIR represents a significant leap in AIRR alignment and analysis, integrating deep learning with simulation-based training to deliver improved accuracy, deeper insights into sequence variability, and enhanced usability. While limitations remain in handling indels and certain biological processes like double-D recombination [62], future work will address these. The model’s versatility, with planned expansions for TCRs, amino acid support, non-human species, and allele mapping capabilities, makes AlignAIR a powerful tool in the immunogenetics toolkit.
Supplementary Material
Acknowledgements
Schematic figures were created in https://BioRender.com.
Author contributions: Conceptualization, G.Y. and O.L. Data curation, T.K. and A.P. Formal analysis, T.K., A.P., and G.Y. Funding acquisition, G.Y. and O.L. Investigation, T.K., A.P., R.E., O.L., and G.Y. Methodology, T.K., A.P., R.E., O.L., and G.Y. Project administration, P.P., O.L., and G.Y. Software, T.K. and A.P. Supervision, O.L. and G.Y. Validation, A.P. and R.E. Visualization, T.K. and A.P. Writing – original draft, T.K., A.P., P.P., and G.Y. Writing – review and editing, all authors.
Contributor Information
Thomas Konstantinovsky, Department of Bioengineering, Faculty of Engineering, Bar Ilan University, 5290002 Ramat Gan, Israel; Bar Ilan Institute of Nanotechnology and Advanced Materials, Bar Ilan University, 5290002 Ramat Gan, Israel.
Ayelet Peres, Department of Bioengineering, Faculty of Engineering, Bar Ilan University, 5290002 Ramat Gan, Israel; Bar Ilan Institute of Nanotechnology and Advanced Materials, Bar Ilan University, 5290002 Ramat Gan, Israel; Department of Pathology, Yale School of Medicine, New Haven, CT 06510, United States.
Ran Eisenberg, Department of Information Processing and Data Science, Faculty of Engineering, Bar Ilan University, 5290002 Ramat Gan, Israel.
Pazit Polak, Department of Bioengineering, Faculty of Engineering, Bar Ilan University, 5290002 Ramat Gan, Israel; Bar Ilan Institute of Nanotechnology and Advanced Materials, Bar Ilan University, 5290002 Ramat Gan, Israel.
Ofir Lindenbaum, Department of Information Processing and Data Science, Faculty of Engineering, Bar Ilan University, 5290002 Ramat Gan, Israel.
Gur Yaari, Department of Bioengineering, Faculty of Engineering, Bar Ilan University, 5290002 Ramat Gan, Israel; Bar Ilan Institute of Nanotechnology and Advanced Materials, Bar Ilan University, 5290002 Ramat Gan, Israel; Department of Pathology, Yale School of Medicine, New Haven, CT 06510, United States.
Supplementary data
Supplementary data is available at NAR online.
Conflict of interest
None declared.
Funding
This study was partially supported by grants from the Israel Science Foundation (ISF) (2940/21), VATAT, NIAID (U24AI177622), and Ministry of Innovation, Science & Technology (1001576181/0004941).
Data availability
The datasets generated and analyzed during the current study are available on Zenodo. The training datasets and evaluation datasets can be accessed via the following URLs: Training datasets: https://doi.org/10.5281/zenodo.15687765, Evaluation datasets: https://doi.org/10.5281/zenodo.15687841. For any additional data requests or inquiries, please contact the corresponding author. The AlignAIR suite is accessible through multiple platforms to accommodate different user preferences and computational resources. The full codebase, including training scripts, inference tools, and a Docker image for seamless deployment, is archived on Zenodo (https://doi.org/10.5281/zenodo.15687939). A continuously maintained and regularly updated version is available on GitHub at https://github.com/MuteJester/AlignAIR. For users who prefer not to manage local installations, AlignAIR can also be accessed via a user-friendly web interface hosted at https://alignair.ai/, enabling quick and efficient sequence analysis directly through a browser.
References
- 1. Murphy K, Weaver C Janeway’s immunobiology. 2016; New York: Garland Science.
- 2. Yaari G, Kleinstein SH Practical guidelines for B-cell receptor repertoire sequencing analysis. Genome Med. 2015; 7:1–14.
- 3. Gupta NT, Vander Heiden JA, Uduman M et al. Change-O: a toolkit for analyzing large-scale B cell immunoglobulin repertoire sequencing data. Bioinformatics. 2015; 31:3356–8.
- 4. Hoehn KB, Kleinstein SH B cell phylogenetics in the single cell era. Trends Immunol. 2024; 45:62–74. 10.1016/j.it.2023.11.004.
- 5. Yaari G, Vander Heiden J, Uduman M et al. Models of somatic hypermutation targeting and substitution based on synonymous mutations from high-throughput immunoglobulin sequencing data. Front Immunol. 2013; 4:358.
- 6. Carter JA, Preall JB, Grigaityte K et al. Single T cell sequencing demonstrates the functional role of αβ TCR pairing in cell lineage and antigen specificity. Front Immunol. 2019; 10:1516.
- 7. Snir O, Mesin L, Gidoni M et al. Analysis of celiac disease autoreactive gut plasma cells and their corresponding memory compartment in peripheral blood using high-throughput sequencing. J Immunol. 2015; 194:5703–12.
- 8. Mhanna V, Bashour H, Lê Quý K et al. Adaptive immune receptor repertoire analysis. Nat Rev Methods Primers. 2024; 4:6.
- 9. Ye J, Ma N, Madden TL et al. IgBLAST: an immunoglobulin variable domain sequence analysis tool. Nucleic Acids Res. 2013; 41:W34–40. 10.1093/nar/gkt382.
- 10. Brochet X, Lefranc MP, Giudicelli V IMGT/V-QUEST: the highly customized and integrated system for IG and TR standardized V-J and V-D-J sequence analysis. Nucleic Acids Res. 2008; 36:W503–8. 10.1093/nar/gkn316.
- 11. Bolotin DA, Poslavsky S, Mitrophanov I et al. MiXCR: software for comprehensive adaptive immunity profiling. Nat Methods. 2015; 12:380–1. 10.1038/nmeth.3364.
- 12. Ralph DK, Matsen FA Consistency of VDJ rearrangement and substitution parameters enables accurate B cell receptor sequence annotation. PLoS Comput Biol. 2016; 12:e1004409. 10.1371/journal.pcbi.1004409.
- 13. Gaëta BA, Malming HR, Jackson KJL et al. iHMMune-align: hidden Markov model-based alignment and identification of germline genes in rearranged immunoglobulin gene sequences. Bioinformatics. 2007; 23:1580–7. 10.1093/bioinformatics/btm147.
- 14. Konstantinovsky T, Peres A, Polak P et al. An unbiased comparison of immunoglobulin sequence aligners. Brief Bioinform. 2024; 25:bbae556. 10.1093/bib/bbae556.
- 15. Avnir Y, Watson CT, Glanville J et al. IGHV1-69 polymorphism modulates anti-influenza antibody repertoires, correlates with IGHV utilization shifts and varies by ethnicity. Sci Rep. 2016; 6:20842.
- 16. Wang H, Fu T, Du Y et al. Scientific discovery in the age of artificial intelligence. Nature. 2023; 620:47–60. 10.1038/s41586-023-06221-2.
- 17. Mahmud M, Kaiser MS, Hussain A et al. Applications of deep learning and reinforcement learning to biological data. IEEE Trans Neural Netw Learn Syst. 2018; 29:2063–79. 10.1109/tnnls.2018.2790388.
- 18. Jumper J, Evans R, Pritzel A et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021; 596:583–89. 10.1038/s41586-021-03819-2.
- 19. Unsal S, Atas H, Albayrak M et al. Learning functional properties of proteins with language models. Nat Mach Intell. 2022; 4:227–45. 10.1038/s42256-022-00457-9.
- 20. Zemouri R, Zerhouni N, Racoceanu D Deep learning in the biomedical applications: recent and future status. Appl Sci. 2019; 9:1526. 10.3390/app9081526.
- 21. Marcou Q, Mora T, Walczak AM High-throughput immune repertoire analysis with IGoR. Nat Commun. 2018; 9:561. 10.1038/s41467-018-02832-w.
- 22. Konstantinovsky T, Yaari G A novel approach to T-cell receptor beta chain (TCRB) repertoire encoding using lossless string compression. Bioinformatics. 2023; 39:btad426. 10.1093/bioinformatics/btad426.
- 23. Emerson RO, DeWitt WS, Vignali M et al. Immunosequencing identifies signatures of cytomegalovirus exposure history and HLA-mediated effects on the T cell repertoire. Nat Genet. 2017; 49:659–65. 10.1038/ng.3822.
- 24. Safra M, Werner L, Peres A et al. A somatic hypermutation–based machine learning model stratifies individuals with Crohn’s disease and controls. Genome Res. 2022; 33:71–79. 10.1101/gr.276683.122.
- 25. Shemesh O, Polak P, Lundin KEA et al. Machine learning analysis of Naïve B-Cell receptor repertoires stratifies celiac disease patients and controls. Front Immunol. 2021; 12:627813. 10.3389/fimmu.2021.627813.
- 26. Safra M, Tamari Z, Polak P et al. Altered somatic hypermutation patterns in COVID-19 patients classifies disease severity. Front Immunol. 2023; 14:1031914. 10.3389/fimmu.2023.1031914.
- 27. Zaslavsky ME, Craig E, Michuda JK et al. Disease diagnostics using machine learning of B cell and T cell receptor sequences. Science. 2025; 387:eadp2407.
- 28. Sethna Z, Elhanati Y, Callan CG et al. OLGA: fast computation of generation probabilities of B- and T-cell receptor amino acid sequences and motifs. Bioinformatics. 2019; 35:2974–81. 10.1093/bioinformatics/btz035.
- 29. Bukhari SNH, Jain A, Haq E et al. Machine learning techniques for the prediction of B-cell and T-cell epitopes as potential vaccine targets with a specific focus on SARS-CoV-2 pathogen: a review. Pathogens. 2022; 11:146. 10.3390/pathogens11020146.
- 30. Schmidt-Barbo P, Kalweit G, Naouar M et al. Detection of disease-specific signatures in B cell repertoires of lymphomas using machine learning. PLoS Comput Biol. 2024; 20:e1011570. 10.1371/journal.pcbi.1011570.
- 31. Greiff V, Yaari G, Cowell LG Mining adaptive immune receptor repertoires for biological and clinical information using machine learning. Curr Opin Syst Biol. 2020; 24:109–19. 10.1016/j.coisb.2020.10.010.
- 32. Zhao Y, He B, Xu F et al. DeepAIR: A deep learning framework for effective integration of sequence and 3D structure to enable adaptive immune receptor analysis. Sci Adv. 2023; 9:eabo5128. 10.1126/sciadv.abo5128.
- 33. Sidhom JW, Larman HB, Pardoll DM et al. DeepTCR is a deep learning framework for revealing sequence concepts within T-cell repertoires. Nat Commun. 2021; 12:1605. 10.1038/s41467-021-21879-w.
- 34. Isacchini G, Walczak AM, Mora T et al. Deep generative selection models of T and B cell receptor repertoires with soNNia. Proc Natl Acad Sci. 2021; 118:e2023141118. 10.1073/pnas.2023141118.
- 35. Lindenbaum O, Nouri N, Kluger Y et al. Alignment free identification of clones in B cell receptor repertoires. Nucleic Acids Res. 2021; 49:e21.
- 36. Davidsen K, Olson BJ, DeWitt WS et al. Deep generative models for T cell receptor protein sequences. eLife. 2019; 8:e46935. 10.7554/elife.46935.
- 37. Tang C, Krantsevich A, MacCarthy T Deep learning model of somatic hypermutation reveals importance of sequence context beyond hotspot targeting. iScience. 2022; 25:103668. 10.1016/j.isci.2021.103668.
- 38. Goldner Kabeli R, Zevin S, Abargel A et al. Self-supervised learning of T cell receptor sequences exposes core properties for T cell membership. Sci Adv. 2024; 10:eadk4670.
- 39. Ostrovsky-Berman M, Frankel B, Polak P et al. Immune2vec: embedding B/T cell receptor sequences in ℝ^N using natural language processing. Front Immunol. 2021; 12:680687. 10.3389/fimmu.2021.680687.
- 40. Dvorkin S, Levi R, Louzoun Y Autoencoder based local T cell repertoire density can be used to classify samples and T cell receptors. PLoS Comput Biol. 2021; 17:e1009225.
- 41. Wang M, Patsenker J, Li H et al. Language model-based B cell receptor sequence embeddings can effectively encode receptor specificity. Nucleic Acids Res. 2023; 52:548–57. 10.1093/nar/gkad1128.
- 42. Lajevardy SA, Kargari M Developing new genetic algorithm based on integer programming for multiple sequence alignment. Soft Comput. 2022; 26:3863–70. 10.1007/s00500-022-06790-w.
- 43. Gunasekaran H, Ramalakshmi K, Rex Macedo Arokiaraj A et al. Analysis of DNA sequence classification using CNN and hybrid models. Comput Math Methods Med. 2021; 2021:1–12. 10.1155/2021/1835056.
- 44. Dotan E, Wygoda E, Ecker N et al. BetaAlign: a deep learning approach for multiple sequence alignment. Bioinformatics. 2025; 41:btaf009.
- 45. Katoh K, Standley DM MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 2013; 30:772–80. 10.1093/molbev/mst010.
- 46. Thompson JD, Higgins DG, Gibson TJ CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994; 22:4673–80. 10.1093/nar/22.22.4673.
- 47. Glanville J, Zhai W, Berka J et al. Precise determination of the diversity of a combinatorial antibody library gives insight into the human immunoglobulin repertoire. Proc Natl Acad Sci USA. 2009; 106:20216–21. 10.1073/pnas.0909775106.
- 48. Peres A, Lees WD, Rodriguez OL et al. IGHV allele similarity clustering improves genotype inference from adaptive immune receptor repertoire sequencing data. Nucleic Acids Res. 2023; 51:gkad603. 10.1093/nar/gkad603.
- 49. Omer A, Peres A, Rodriguez OL et al. T cell receptor beta germline variability is revealed by inference from repertoire data. Genome Med. 2022; 14:1–19.
- 50. Rubelt F, Busse CE, Bukhari SAC et al. Adaptive Immune Receptor Repertoire Community recommendations for sharing immune-repertoire sequencing data. Nat Immunol. 2017; 18:1274–8. 10.1038/ni.3873.
- 51. Collins AM, Ohlin M, Corcoran M et al. AIRR-C IG Reference Sets: curated sets of immunoglobulin heavy and light chain germline genes. Front Immunol. 2024; 14:1330153.
- 52. Rodriguez OL, Safonova Y, Silver CA et al. Genetic variation in the immunoglobulin heavy chain locus shapes the human antibody repertoire. Nat Commun. 2023; 14:4419.
- 53. Omer A, Shemesh O, Peres A et al. VDJbase: an adaptive immune receptor genotype and haplotype database. Nucleic Acids Res. 2020; 48:D1051–6.
- 54. Peres A, Klein V, Frankel B et al. Guidelines for reproducible analysis of adaptive immune receptor repertoire sequencing data. Brief Bioinform. 2024; 25:bbae221.
- 55. Lees W, Busse CE, Corcoran M et al. OGRDB: a reference database of inferred immune receptor genes. Nucleic Acids Res. 2020; 48:D964–70.
- 56. Smakaj E, Babrak L, Ohlin M et al. Benchmarking immunoinformatic tools for the analysis of antibody repertoire sequences. Bioinformatics. 2019; 36:1731–9. 10.1093/bioinformatics/btz845.
- 57. Sandve GK, Greiff V Access to ground truth at unconstrained size makes simulated data as indispensable as experimental data for bioinformatics methods development and benchmarking. Bioinformatics. 2022; 38:4994–6. 10.1093/bioinformatics/btac612.
- 58. Lupo C, Spisak N, Walczak AM et al. Learning the statistics and landscape of somatic mutation-induced insertions and deletions in antibodies. PLoS Comput Biol. 2022; 18:e1010167. 10.1371/journal.pcbi.1010167.
- 59. Bogatinovski J, Todorovski L, Džeroski S et al. Comprehensive comparative study of multi-label classification methods. Expert Syst Appl. 2022; 203:117215. 10.1016/j.eswa.2022.117215.
- 60. Spisak N, Walczak AM, Mora T Learning the heterogeneous hypermutation landscape of immunoglobulins from high-throughput repertoire data. Nucleic Acids Res. 2020; 48:10702–12. 10.1093/nar/gkaa825.
- 61. Zhang B, Liu Z, Cherry C et al. When scaling meets LLM finetuning: the effect of data, model and finetuning method. arXiv, 27 February 2024, preprint: not peer reviewed. 10.48550/arXiv.2402.17193.
- 62. Safonova Y, Pevzner PA De novo inference of diversity genes and analysis of non-canonical V(DD)J recombination in immunoglobulins. Front Immunol. 2019; 10:987. 10.3389/fimmu.2019.00987.
- 63. Imkeller K, Wardemann H Assessing human B cell repertoire diversity and convergence. Immunol Rev. 2018; 284:51–66. 10.1111/imr.12670.
- 64. Gu H, Förster I, Rajewsky K Sequence homologies, N sequence insertion and JH gene utilization in VHDJH joining: implications for the joining mechanism and the ontogenetic timing of Ly1 B cell and B-CLL progenitor generation. EMBO J. 1990; 9:2133–40. 10.1002/j.1460-2075.1990.tb07382.x.
- 65. Hou XL, Wang L, Ding YL et al. Current status and recent advances of next generation sequencing techniques in immunological repertoire. Genes Immun. 2016; 17:153–64. 10.1038/gene.2016.9.
- 66. Crowley G, Tabula Sapiens Consortium, Quake SR Benchmarking cell type annotation by large language models with anndictionary. bioRxiv, 13 October 2024, preprint: not peer reviewed. 10.1101/2024.10.10.617605.
- 67. Cvijović I, Jerison ER, Quake SR Reference-free germline immunoglobulin allele discovery from B cell receptor sequencing data. bioRxiv, 26 November 2023, preprint: not peer reviewed. 10.1101/2023.11.25.568681.
- 68. Peres A, Upadhyay AA, Klein V et al. A broad survey and functional analysis of immunoglobulin loci variation in rhesus macaques. bioRxiv, 10 January 2025, preprint: not peer reviewed. 10.1101/2025.01.07.631319.
- 69. Olson BJ, Matsen FA The Bayesian optimist’s guide to adaptive immune receptor repertoire analysis. Immunol Rev. 2018; 284:148–66. 10.1111/imr.12664.
- 70. Xu Q, Yang Q A survey of transfer and multitask learning in bioinformatics. J Comput Sci Eng. 2011; 5:257–68.
- 71. Weinberger E, Beebe-Wang N, Lee SI Moment matching deep contrastive latent variable models. International Conference on Artificial Intelligence and Statistics. 2022.
- 72. Elnaggar A, Heinzinger M, Dallago C et al. End-to-end multitask learning, from protein language to protein features without alignments. bioRxiv, 24 January 2020, preprint: not peer reviewed. 10.1101/864405.