Abstract
A continued challenge in the analysis of adaptive immune receptor repertoires (AIRRs) is the prediction of antigen reactivity from primary sequence data. Many algorithms infer antigen-specific responses by measuring sequence similarity between receptors. Similarity is often scored using tools for protein alignment such as the BLOSUM matrices. However, these metrics were designed to identify homology in genomic proteins, not VDJ-recombined immune receptors. Comparison of these metrics against other approaches is underexplored. We used matrix factorization to make physiochemical-based alternatives which may improve performance. We evaluated these metrics by clustering 383 simulated and biological repertoires using traditional and physiochemical-based scoring matrices. While physiochemical and traditional scoring had similar efficacy, the membership of antigen-specific clusters varied. Lastly, we inferred antigen-specific immune responses in pancreatic cancer and rheumatoid arthritis. Results varied depending on the matrix used, emphasizing a poor consensus among methods. Despite equivocal performance, physiochemical factors can increase the interpretability of clustered repertoires. These results suggest that analysts must carefully consider characteristics of sequence similarity measures to apply the most appropriate methods to their data. We facilitate further exploration of scoring metrics by centralizing AIRR clustering and physiochemical sequence characterization in a software tool called Homolig (Homol-Ig) associated with our analyses.
Graphical Abstract
Introduction
The analysis of adaptive immune receptor repertoires (AIRRs) has been instrumental in advancing our understanding of many immune contexts over the last two decades including cancer, autoimmunity, and infection [1, 2]. Of particular interest to AIRR analysis is the inference of antigen-specific or disease-specific immune receptors based on sequence similarity. Multiple software tools have emerged to cluster sequences in the last decade. Some tools in this space, such as GLIPH, employ K-mer matching to find identical subsequences or sequence motifs in immune receptors [3]. Other tools, including TCRDist and GIANA, perform sequence alignment using a weighted scoring matrix to generate pairwise sequence distances followed by unsupervised clustering [4, 5]. While methods employing string distance or K-mer matching can be effective, they are limited by relying on unweighted scoring based on binary match/mismatch determinations and do not account for amino acid similarity. As an improvement over unweighted string distances, substitution matrices were developed decades ago for studies of protein alignment to penalize highly dissimilar amino acid mismatches.
The BLOSUM (BLOcks SUbstitution Matrix) matrices were developed in 1992 to assess evolutionary divergence between proteins [6]. Matrices were generated by screening the BLOCKS database of highly conserved protein regions, including nonhuman sequences, catalogued in the database Swiss-Prot [6, 7]. The most utilized BLOSUM matrix is the BLOSUM62, derived from the log-odds ratio of residue substitutions in protein blocks with at least 62% sequence identity. This substitution matrix is the standard for many algorithms to identify proteins homologous to a given query sequence, and has been used in some form to predict the function of millions of uncharacterized protein sequences [8–11]. While BLOSUM matrices are the gold standard in protein homology, they incorporate at least two germline protein biases intrinsic to their design. First, amino acid frequencies in the proteins used to derive BLOSUM scores may differ from the composition of immune receptors, and therefore influence the inferred probability of specific amino acid mismatches. Second, as a predictable consequence of the genetic code, some transitions between amino acid residues are more likely than others; the BLOSUM and similar tools incorporate these biases through empirical training on genomic protein sequences. Possible biases in the BLOSUM have recently been noted by others, potentially posing limits to its applicability to immune receptor clustering. tcrBLOSUM, a similarity measure generated using antigen-specific receptor training data, exemplifies one possible solution to this issue [12].
The prediction of immune receptor reactivity is ultimately an inference of structural similarity and antigen binding, rather than ancestral homology. Recent work has highlighted this critical link between immune receptor structure and function by demonstrating a positive correlation between CDR3 hydrophobicity and the promotion and maintenance of a regulatory T cell phenotype [13]. Thus, tools which aim to characterize physiochemical sequence similarity may also be appropriate for the inference of immune receptor reactivity. A standardized evaluation of the efficacy of various scoring measures has been lacking, with attempts to do so only recently being considered [12].
The Kidera factors (KFs) and Atchley factors (AFs) are two landmark attempts from prior decades to summarize amino acids in physiochemical space [14–16]. Summaries of physiochemical characteristics including the KFs have been recently incorporated into AIRR processing tools such as MiXCR [17] but are not directly used to cluster sequences. The recent, innovative algorithms Trex and Ibex offer both Kidera and Atchley factorizations in a CDR3 autoencoder to cluster immune receptors [18, 19]. However, neither factorization has been transformed to create similarity scores analogous to the BLOSUM and PAM matrices, nor does either leverage the complete set of amino acid properties currently available. Regardless of the basis of the metric for clustering, tools for immune receptor alignment often do not address root causes of AIRR similarity within learned clusters. Similar to the desire for explainable AI in machine learning [20, 21], a natural interest of researchers performing immune receptor clustering is to interpret the features which define a group of immune receptors. While certain tools such as GLIPH provide sequence motifs to describe clusters, further work is needed to define and interpret the physiochemical impact of these motifs on receptor function.
We sought to assess the performance of clustering analysis of AIRR sequencing through a thorough comparison of sequence similarity measures. This comparison required a unified software tool able to perform concurrent clustering and characterization of AIRR data based on user-specified similarity measures. Therefore, to enable this analysis we developed the software tool Homolig (Homol-Ig) to characterize sequence properties in absolute terms using user-specified matrix factors. Using this software, we compared the performance of distance metrics based on sequence alone, various transformations of physiochemical metrics, and traditional substitution matrices (e.g. the BLOSUM62) for the clustering of adaptive immune receptor data, including both T cell receptor (TCR) and B cell receptor (BCR) sequencing data. Our approach provides a standardized pipeline through which to compare the results of various similarity metrics under identical pre- and post-processing conditions while additionally describing the physiochemical properties of immune receptor sequences in absolute terms.
Materials and methods
Quantification of AIRR amino acid composition
To determine the applicability of genome-derived metrics for AIRR clustering, we sought to assess the degree of difference between genomic residue frequency and that of AIRR data. TCR beta-chain (TCRB) amino acid frequency was estimated using the junctions of 133 healthy controls from two previously published studies [22, 23] (median number of TCRs = 9.4e4, total number of TCRs = 1.2e7), while IgH amino acid frequencies were estimated using the junctions of 86 healthy controls (median number of BCRs = 9613, total number of BCRs = 1.13e6) compiled from various studies in the cAb-Rep BCR database [24]. Residue frequencies in each subject were averaged to produce confidence intervals for every residue. These data were compared with amino acid residue frequency in the human proteome as computed from the Complete Human Genebank [25], and with the BLOCKS protein database [6], from which the BLOSUM62 is derived.
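As a minimal sketch of the frequency comparison described above, the following Python code tabulates residue frequencies per subject, averages them across subjects, and computes a Pearson correlation between two frequency vectors. The function names and the toy junctions are illustrative assumptions and are not taken from the published pipeline.

```python
from collections import Counter
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def residue_frequencies(junctions):
    """Relative frequency of each standard amino acid across one subject's junctions."""
    counts = Counter(aa for seq in junctions for aa in seq if aa in AMINO_ACIDS)
    total = sum(counts.values())
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}

def mean_frequencies(per_subject):
    """Average per-subject frequency tables residue by residue."""
    n = len(per_subject)
    return {aa: sum(freqs[aa] for freqs in per_subject) / n for aa in AMINO_ACIDS}

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy junctions standing in for two subjects (hypothetical data).
subject_a = ["CASSLGQGAYEQYF", "CASSPDRGNTEAFF"]
subject_b = ["CASSQETQYF", "CSARDGTGNGYTF"]
airr_freqs = mean_frequencies(
    [residue_frequencies(subject_a), residue_frequencies(subject_b)]
)
```

In practice the averaged AIRR frequencies would be passed to `pearson_r` together with a reference composition (for example, proteome-wide frequencies) ordered by the same 20 residues.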
Generation of novel physiochemical matrix factorizations
To create novel sequence metrics solely based upon physiochemical characteristics of the amino acids, we used 548 properties catalogued in AAIndex, version 9.2. Properties with missing values for any amino acid were removed from consideration, leaving 534 properties. One property with binary values was also removed. Remaining properties were normalized through the formula
. Three tests were performed on adjusted values to confirm a normal distribution: (1) the Lilliefors test must not reject the null hypothesis at alpha = 0.05, (2) the third moment must lie between −0.77 and 0.77, and (3) the fourth moment must be >0.95 and <4.05. These criteria are identical to those used in creating the KFs [14]. Features which failed any one of these criteria underwent a Box-Cox transformation; those with <5 unique values following transformation were removed. Remaining features were re-introduced into analysis if Box-Cox transformed values passed the Lilliefors test. Overall, 443 properties met normality criteria and were incorporated into matrix factorization. Independent Component Analysis (ICA) using the infomax algorithm as implemented in the R package ica version 1.0-3 and Bayesian non-negative matrix factorization (NMF) using CoGAPS version 3.19.1 were performed on normalized data to generate between 2 and 15 factors each. Six was chosen as the ideal number of factors for both methods as a compromise between minimizing the sum of squared residuals and maximizing human interpretability.
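The moment-based part of the normality screen can be sketched as follows. This is an illustration only: it assumes the "third moment" and "fourth moment" are standardized moments computed with the population standard deviation, and it omits the Lilliefors test and the Box-Cox step. The function names are hypothetical.

```python
from math import sqrt

def standardized_moment(values, k):
    """k-th standardized moment (assumption: population standard deviation)."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / sd) ** k for v in values) / n

def passes_moment_screen(values):
    """Apply criteria (2) and (3) from the text to one property vector."""
    third = standardized_moment(values, 3)
    fourth = standardized_moment(values, 4)
    return -0.77 <= third <= 0.77 and 0.95 < fourth < 4.05

symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]   # toy property values
skewed = [0.0] * 9 + [100.0]              # one extreme value
```

A property that fails this screen would, per the text, be Box-Cox transformed and re-tested rather than discarded outright.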
Correlative analysis of matrix factor systems
We compared the ICA- and CoGAPS-derived matrix factorizations described above to two previously published physiochemical matrix factors, the KFs [14] and AFs [15]. Each factor was cross-correlated with all other factors using Pearson’s R to assess similarities between methods. Lastly, matrix factors were hierarchically clustered using the complete linkage method as implemented in the R function hclust and divided into six clusters using the cutree function. Six hierarchical clusters were generated as a balance between human interpretability and optimization of the statistics Average Proportion of Non-Overlap, Average Distance between Means, Average Distance, and Figure of Merit as computed in the R package clValid [26]. The mean of factors within each cluster was computed to generate an ensemble factor representative of the entire cluster.
Generation of similarity matrices for immune receptor clustering
Immune receptor clustering, and more generally any sequence distance computation, uses a scoring matrix to assign weighted values to all possible pairwise alignments of amino acids. Traditional scoring tools including the PAM250 and BLOSUM62 exist in this form. To use our physiochemical matrix factors in an analogous manner, we transformed the pattern matrices of ICA and CoGAPS-derived matrix factors into similarity matrices. Pattern matrices were normalized such that each pattern had a mean of zero and standard deviation of one. Normalized pattern matrices were then made into symmetric similarity matrices using the transformation
, where D represents the Euclidean distance matrix of the normalized pattern matrix for either ICA or CoGAPS patterns (CPs), respectively. In an identical manner, we generated similarity matrices from two existing physiochemical matrix factorization systems, the KFs and AFs, for a total of four new scoring matrices based on physiochemical characteristics. All generated similarity matrices have a maximum of one (most similar) and a minimum value of zero (least similar), with diagonal values equal to one.
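One transformation consistent with the properties stated above (symmetric, diagonal equal to one, minimum of zero) is to rescale the Euclidean distance matrix by its maximum and subtract from one. The sketch below assumes that form, S = 1 − D / max(D), and assumes z-scoring with the population standard deviation; neither detail is confirmed by the text, and the function names and toy values are hypothetical.

```python
from math import sqrt

def zscore_rows(patterns):
    """Scale each pattern (one row per pattern) to mean 0 and standard deviation 1."""
    scaled = []
    for row in patterns:
        n = len(row)
        mean = sum(row) / n
        sd = sqrt(sum((v - mean) ** 2 for v in row) / n)
        scaled.append([(v - mean) / sd for v in row])
    return scaled

def similarity_matrix(patterns):
    """Residue-by-residue similarity from a patterns x residues matrix.

    Assumed transformation: S = 1 - D / max(D), with D the Euclidean
    distance between residues in normalized pattern space.
    """
    scaled = zscore_rows(patterns)
    n_res = len(scaled[0])
    # Each residue becomes a point whose coordinates are its pattern scores.
    coords = [[row[i] for row in scaled] for i in range(n_res)]
    dist = [
        [sqrt(sum((a - b) ** 2 for a, b in zip(coords[i], coords[j])))
         for j in range(n_res)]
        for i in range(n_res)
    ]
    d_max = max(max(r) for r in dist)
    return [[1 - d / d_max for d in r] for r in dist]

# Two toy patterns over four residues (hypothetical values).
toy_patterns = [[0.1, 0.9, 0.4, 0.7], [0.8, 0.2, 0.5, 0.3]]
toy_similarity = similarity_matrix(toy_patterns)
```

With six patterns over the 20 amino acids, the same function would return a 20 × 20 matrix usable in place of a traditional substitution matrix.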
The Homolig software to enable flexible clustering of immune receptors using custom alignment metrics
We created new immune receptor clustering software, Homolig, for the flexible clustering of either TCR or BCR immune receptors using any user-specified scoring matrix. Pairwise distances and clustering are evaluated in Python 3.10, with sequence alignment performed in C++ for improved efficiency. IMGT data for TRBV and IGHV genes is available for 30 species, permitting species-specific AIRR analysis in humans, mice and other model organisms. Sequence distances may be computed for single or paired-chain inputs with each chain weighted equally. In addition to a sequence clustering module, we generated a physiochemical characterization module in R 4.3.1 which allows for the absolute scoring of sequences based on user-specified matrix factors. The combination of these two modules allows for both relative clustering of immune receptors and an absolute descriptive analysis of the properties responsible for the co-clustering of specific sequences. By transforming matrix factors into similarity matrices, we are able to execute both modules using the same scoring system, a capability that is, to our knowledge, unavailable in current tools.
Homolig package: pairwise clustering module
The pairwise clustering module of Homolig is designed to use the physiochemical substitution matrix described above but can be configured for any substitution matrix. Immune receptor data should have a variable gene name and a Complementarity Determining Region (CDR) 3 amino acid sequence. In the absence of variable gene information, CDR3 sequence alone may be provided. Homolig first parses variable gene germline sequences to extract CDR1, CDR2, and CDR2.5 amino acid sequences using reference FASTA files obtained from IMGT [27, 28]. CDR alignment scores are calculated according to the substitution matrix supplied. For CDRs with different lengths, Homolig performs a sliding window comparison based on the length of the shorter sequence; the highest scoring alignment window is used. This alignment method is constant regardless of the substitution matrix supplied. The total distance between two immune receptor chains is the weighted sum of CDR alignments: CDR3 comprises 50% of the final alignment score, while CDRs 1, 2, and 2.5 comprise the other 50% of the total. In cases of paired chain input (alpha/beta TCR or heavy/light BCR), each chain is weighted equally. These weights are identical to TCRDist [4].
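The sliding-window comparison and CDR weighting described above can be sketched in Python as below. This is not Homolig's C++ implementation. It assumes ungapped windows, raw (not length-normalized) summed scores, and an equal split of the non-CDR3 weight across CDR1, CDR2, and CDR2.5; the identity scoring matrix and the sequences are toy values.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def window_score(a, b, sub):
    """Best ungapped score when the shorter CDR is slid along the longer one."""
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    best = None
    for start in range(len(long_) - len(short) + 1):
        window = long_[start:start + len(short)]
        score = sum(sub[(x, y)] for x, y in zip(short, window))
        if best is None or score > best:
            best = score
    return best

def chain_score(cdrs_a, cdrs_b, sub, weights=None):
    """Weighted sum of per-CDR scores; CDR3 carries half of the total weight."""
    if weights is None:
        # Assumption: the remaining 50% is split equally among the other CDRs.
        weights = {"cdr3": 0.5, "cdr1": 0.5 / 3, "cdr2": 0.5 / 3, "cdr2.5": 0.5 / 3}
    return sum(w * window_score(cdrs_a[k], cdrs_b[k], sub) for k, w in weights.items())

# Toy identity matrix: 1 for a match, 0 otherwise (stand-in for any scoring matrix).
identity = {(x, y): 1.0 if x == y else 0.0 for x in AMINO_ACIDS for y in AMINO_ACIDS}

chain_1 = {"cdr1": "SGHTA", "cdr2": "FQGTGA", "cdr2.5": "PEGS", "cdr3": "CASSLGF"}
chain_2 = {"cdr1": "SGHTA", "cdr2": "FQGTGA", "cdr2.5": "PEGS", "cdr3": "CASSLGQGF"}
score_same = chain_score(chain_1, chain_1, identity)
score_diff = chain_score(chain_1, chain_2, identity)
```

Because the scoring matrix is passed in as a plain lookup table, the same functions work with a physiochemical similarity matrix or a traditional substitution matrix.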
After calculation of pairwise sequence distance for all immune receptors supplied, Homolig stores the input file and output matrix in AnnData format. Additional functions support k-means clustering analysis using the scanpy framework [29] and uniform manifold approximation and projection (UMAP) visualization [30]. K-means clustering may be performed to generate any number of clusters and can be done on either the raw similarity score matrix or its principal components reduction. While we focus the majority of our analyses on human sequences, we note that Homolig supports the analysis of TCRs, BCRs, or arbitrary amino acid sequences and can accept V gene inputs from any of the 30 species present in IMGT.
Homolig package: descriptive feature module
In addition to generating sequence distances between immune receptors, the descriptive module of Homolig can describe physiochemical properties of sequences in the same matrix factorization feature space used for clustering. For any arbitrary group of sequences, the average pattern score is computed and compared with sequences drawn from a uniform random distribution of the amino acids using a Student’s t-test and false-discovery rate correction using the Benjamini–Yekutieli step-up procedure [31]. Pattern scores are compared with any other group of sequences using either a Kruskal–Wallis test or, in the case of two groups, a Wilcoxon test, followed by false discovery rate (FDR) correction using the Benjamini–Yekutieli procedure.
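For readers unfamiliar with the Benjamini–Yekutieli correction, a minimal pure-Python sketch of the adjusted p-value calculation follows. It is a generic implementation of the step-up procedure, not code from the Homolig R module, and the function name and example p-values are illustrative.

```python
def benjamini_yekutieli(pvalues):
    """Benjamini-Yekutieli adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Harmonic-sum penalty that makes the procedure valid under dependence.
    c_m = sum(1.0 / k for k in range(1, m + 1))
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity and a cap at 1.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m * c_m / rank)
        adjusted[idx] = running_min
    return adjusted

raw_p = [0.01, 0.04, 0.03, 0.005]   # toy p-values, one per pattern
adj_p = benjamini_yekutieli(raw_p)
```

The adjusted values are always at least as large as the corresponding Benjamini–Hochberg values, since the only difference is the extra factor `c_m`.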
Simulated data for the evaluation of cluster scoring matrices
Simulated motifs
In each repertoire simulation, we implanted semi-random motifs into immune receptors from published healthy control repertoire data [23]. We sought to implant motifs with conserved physiochemical characteristics while also generating a diverse set of test cases. Six properties were chosen from AAIndex with the following accession numbers: pK-a (Fauchere et al., 1988, FAUJ880113), α-helical index (Geisow-Roberts, 1980, GEIM800101), β-strand index (Geisow-Roberts, 1980, GEIM800105), Polarity (Grantham, 1974, GRAR740102), Volume (Grantham, 1974, GRAR740103), and Hydropathy Index (Kyte-Doolittle, 1982, KYTJ820101). For each property, the top five ranked amino acids were selected for semi-random implantation into immune receptors. For pK-a (Fauchere et al., 1988, FAUJ880113) and Volume (Grantham, 1974, GRAR740103), the bottom five residues were also selected to generate eight total test cases.
In each modified immune receptor, an eight-residue semi-random sequence was generated with 60% probability of selecting one of the five amino acids chosen for that motif and 40% probability of preserving the existing residue. This eight-residue sequence was randomly placed anywhere in the CDR3 excluding the first four and last three residues, which are highly conserved and biased towards germline V and J residues. Motif implantation rates of 1%, 2%, 5%, and 10% were used. The procedure for a 2% implantation rate is outlined below.
Simulated TCR repertoires
TCRB repertoires from 20 healthy controls were randomly selected from a publicly available dataset sequenced using the Adaptive ImmunoSeq platform [23]. Each repertoire was filtered to remove nonproductive sequences, sequences without an annotated TRBV gene, or sequences without a fully resolved CDR3 amino acid sequence. Following filtering, 9.8e3 sequences were randomly selected. Two hundred sequences with a TRBV gene in the TRBV7 family were then selected, exclusive of the 9.8e3 subsetted sequences. This degree of V gene restriction was chosen to emulate the limited TRBV breadth of an antigen-specific population. Each of these 200 sequences was manipulated in 8 separate simulations to contain the motif of interest and added to the previously selected 9.8e3 sequences, to generate 10 final repertoires of 1e4 total sequences per subject, each with a different population of 200 manipulated sequences of interest.
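The motif implantation procedure described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the handling of CDR3s too short to hold an eight-residue motif (here, the motif is truncated to the available core) is a guess, and the function names, seed, and example residues are hypothetical.

```python
import random

def implant_motif(cdr3, motif_residues, rng, motif_len=8, p_swap=0.6,
                  skip_start=4, skip_end=3):
    """Semi-randomly overwrite a window of the CDR3 core with motif residues."""
    core_len = len(cdr3) - skip_start - skip_end
    span = min(motif_len, core_len)  # assumption: truncate for short CDR3s
    if span <= 0:
        return cdr3
    # Place the window anywhere that avoids the conserved first 4 / last 3 residues.
    start = skip_start + rng.randrange(core_len - span + 1)
    residues = list(cdr3)
    for pos in range(start, start + span):
        # 60% chance of drawing a motif residue, 40% chance of keeping the original.
        if rng.random() < p_swap:
            residues[pos] = rng.choice(motif_residues)
    return "".join(residues)

def build_repertoire(background, candidates, motif_residues, seed=0):
    """Combine untouched background sequences with motif-implanted candidates."""
    rng = random.Random(seed)
    return background + [implant_motif(s, motif_residues, rng) for s in candidates]

example = "CASSLGQGAYNEQFF"
implanted = implant_motif(example, "ILVFC", random.Random(0))
fully_swapped = implant_motif(example, "ILVFC", random.Random(1), p_swap=1.0)
```

For a 2% implantation rate, `background` would hold 9.8e3 sequences and `candidates` the 200 V-gene-restricted sequences, giving 1e4 sequences in total.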
Simulated BCR repertoires
Immunoglobulin heavy-chain (IGH) repertoires from three healthy controls were selected from a publicly available dataset of memory-sorted BCR repertoires [32]. Memory B-cell data were selected to best model antigen-specific populations, which following initial immune responses are affinity matured and in the memory compartment. For each healthy control, 7 random subsets of 9.8e3 filtered BCRs were selected and combined with 200 implanted sequences in a manner similar to TCR repertoire generation, resulting in a total of 21 repertoires of 1e4 sequences for each of 12 implanted motifs. All implanted sequences used an IGHV gene in the IGHV3 family to emulate the limited V gene breadth of an antigen-specific population.
Viral infection simulations
Two populations of epitope-HLA restricted TCRs were obtained from VDJdb [33] accessed 10 May 2023. The first population comprised 238 TCRB sequences specific to HIV gag protein, epitope KAFSPEVIPMF, HLA-B*57:01; the second population totaled 271 TCRB sequences reactive to YFV NS4B protein, epitope LLWNGPMAV, HLA-A*02:01. In each of 20 healthy control repertoires of 9.8e3 unique sequences, 200 HIV-specific TCRs were randomly selected for implantation into the repertoire. The process was repeated with YFV-specific sequences. Final simulations therefore contained 1e4 sequences total, with 2% of sequences reactive to a specific viral epitope-HLA complex.
Curated antigen-specific datasets
Validated, antigen-specific TCR sequences were obtained from VDJdb [33] and published literature [34–37]. Data were filtered to include only paired-chain TCR sequences with annotated TRAV and TRBV genes as well as productive CDR3a and CDR3b amino acid junctions. Sequences were further filtered to omit epitopes with <10 reactive sequences and divided into two datasets: 4583 antiviral TCRs reactive to 63 viral epitopes, and 350 antitumor TCRs reactive to 9 mutation-associated neoantigens (MANAs). Data were clustered using one of five portions of the TCR sequence: CDR3b alone, CDR3a alone, CDR3b + TRBV, CDR3a + TRAV, or full paired-chain information.
mKRAS-VAX pancreatic cancer analysis
A cohort of pancreatic ductal adenocarcinoma patients was administered a synthetic peptide vaccine targeting common KRAS mutations. In vitro T-cell expansion [38] was performed on Peripheral Blood Mononuclear Cells (PBMCs) following vaccine administration to identify T cell clones specific to mKRAS peptides, a positive control (CEF peptide pool), or a negative control (FLC peptide). Separately, T cell reactivity to mKRAS peptides was assessed using the ELISPOT assay. Following expansion, TCRB sequencing was performed using the Adaptive Biotechnologies Immunoseq platform [39, 40]. One patient from this cohort was selected for analysis using Homolig. Each of 291 expansion-validated TCRs against mKRAS peptides was compared against 83 310 TCRs sequenced without expansion from the same individual. TCRs were labeled “high similarity” to a known mKRAS TCR (hsTCR) if they had a similarity score ≥99.99th percentile of all similarity scores in healthy control repertoires (Supplementary methods). Comparisons were performed using each of nine similarity metrics.
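The hsTCR labeling rule can be illustrated with a short sketch. The exact procedure is given in the Supplementary methods and is not reproduced here; this version assumes a nearest-rank percentile and assumes a query TCR is flagged when its best score against any validated TCR meets the threshold. All names and values are hypothetical.

```python
from math import ceil

def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile (assumed definition) of a list of scores."""
    ordered = sorted(values)
    rank = max(1, ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def flag_high_similarity(scores_to_known, threshold):
    """Return query TCR ids whose best score against known TCRs meets the threshold.

    scores_to_known maps a query TCR id to its similarity scores against
    each validated antigen-specific TCR.
    """
    return {tcr for tcr, scores in scores_to_known.items() if max(scores) >= threshold}

# Toy null distribution standing in for healthy-control similarity scores.
null_scores = [0.10, 0.20, 0.30, 0.40, 0.50]
toy_threshold = percentile_nearest_rank(null_scores, 50)

toy_queries = {"tcr_a": [0.10, 0.95], "tcr_b": [0.20, 0.30]}
toy_hits = flag_high_similarity(toy_queries, threshold=0.9)
```

In the analysis described above, the percentile argument would be 99.99 and the null distribution would come from healthy control repertoires scored with the same similarity metric.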
Rheumatoid arthritis discovery cohort
Genomic DNA was isolated per manufacturer’s instructions from whole blood collected into PAXgene Blood DNA tubes from 22 individuals with rheumatoid arthritis (RA). Patients were selected from an observational cohort of adult RA patients followed at Johns Hopkins Arthritis Center, as previously described [41]. All patients were seropositive for anticitrullinated protein antibodies and 14 were also seropositive for antipeptidylarginine deiminase 4 (PAD4) autoantibodies. The isolated DNA samples underwent TCRB sequencing on the Adaptive Biotechnologies Immunoseq platform [39, 40, 42]. IGH sequencing was additionally performed on 14 of the same samples. Repertoires from all patients were combined and pairwise clustering performed in Homolig. Clustering was performed at multiple resolutions to determine ideal cluster number (Supplementary methods). TCRB clusters associated with anti-PAD4+ individuals were identified using a one-tailed Fisher’s exact test with Benjamini–Yekutieli FDR correction at alpha = 0.01. Additionally, TCRB clusters had to contain clones from 2 or more anti-PAD4+ patients and zero clones from anti-PAD4− patients. IGH clusters were permitted to contain clones from anti-PAD4− patients to create a less stringent definition; no significant clusters were identified without relaxing this constraint. For each significant cluster, consensus TRBV genes were identified if at least 50% of sequences used one TRBV gene; consensus sequence motifs were identified using ClustalW as implemented in the R package msa.
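A minimal sketch of the cluster-level enrichment test follows. The layout of the 2 × 2 table (clones inside versus outside a cluster, from anti-PAD4+ versus anti-PAD4− patients) is an assumption, as the text does not specify it; the function names and example counts are hypothetical.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-tailed (greater) Fisher's exact p-value for the table [[a, b], [c, d]].

    Assumed layout: a = in-cluster clones from anti-PAD4+ patients,
    b = in-cluster clones from anti-PAD4- patients, and c, d the
    corresponding counts outside the cluster.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    k_max = min(row1, col1)
    # Sum hypergeometric probabilities for tables at least as extreme as observed.
    numerator = sum(comb(col1, k) * comb(n - col1, row1 - k)
                    for k in range(a, k_max + 1))
    return numerator / comb(n, row1)

def is_candidate_tcrb_cluster(pos_patient_ids, neg_clone_count, adjusted_p, alpha=0.01):
    """Apply the additional TCRB criteria from the text to one cluster."""
    return (len(set(pos_patient_ids)) >= 2
            and neg_clone_count == 0
            and adjusted_p < alpha)

p_example = fisher_exact_greater(3, 1, 1, 3)
```

The resulting p-values would be FDR-corrected across all clusters before the patient-count criteria are applied.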
Clustering performance measures
To benchmark clustering performance, we used previously published clustering measures. These measures or their variations have been used in GIANA [5], GLIPH [3], iSMART [43], and ClusTCR [44].
Purity
Purity is defined as the proportion of sequences belonging to the largest class for a given cluster. For example, if a cluster of 20 sequences contains 12 sequences belonging to group A with the remainder belonging to a plurality of other identities, that cluster has 60% purity. The possible range of values is 0 ≤ purity ≤ 1. High purity suggests that clusters have successfully segregated sequences of different groups and is analogous to precision in the context of machine learning.
Consistency
For each true group of sequences (e.g. antigen specificity), the cluster containing the most sequences belonging to the group is identified (the ‘True’ cluster of that group). Consistency is defined as the number of group-specific sequences in the ‘True’ cluster divided by the total number of group sequences. The possible range of values is 0 ≤ consistency ≤ 1. High consistency means many sequences belonging to a given class were grouped into the same cluster. This is analogous to recall in machine learning classification.
Retention
Retention is the proportion of sequences successfully placed into a cluster. In the context of simulated repertoires, antigen-specific sequences were only considered retained if they were placed in a cluster with at least 10% of the total antigen-specific population. The possible range of values is 0 ≤ retention ≤ 1, with one indicating that all sequences were retained.
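The three measures defined so far can be sketched in a few lines of Python. The data structures (a mapping from sequence to cluster id, with None for unclustered sequences) and the function names are illustrative assumptions, not the interfaces of the benchmarked tools.

```python
from collections import Counter

def purity(cluster_classes):
    """Fraction of a cluster's sequences that belong to its largest class."""
    counts = Counter(cluster_classes)
    return max(counts.values()) / len(cluster_classes)

def _cluster_counts(assignments, members):
    """Count how many group members fall into each cluster (None = unclustered)."""
    return Counter(assignments[s] for s in members if assignments.get(s) is not None)

def consistency(assignments, members):
    """Share of a true group captured by its best ('True') cluster."""
    counts = _cluster_counts(assignments, members)
    return max(counts.values(), default=0) / len(members)

def retention(assignments, members, min_fraction=0.10):
    """Share of a true group placed in clusters holding >= min_fraction of the group."""
    counts = _cluster_counts(assignments, members)
    kept = sum(n for n in counts.values() if n >= min_fraction * len(members))
    return kept / len(members)

# Toy group of 20 antigen-specific sequences: 12 in c1, 5 in c2, 1 in c3, 2 unclustered.
group = [f"s{i}" for i in range(20)]
clusters = {
    s: ("c1" if i < 12 else "c2" if i < 17 else "c3" if i < 18 else None)
    for i, s in enumerate(group)
}
toy_purity = purity(["A"] * 12 + ["B"] * 5 + ["C"] * 3)
```

In this toy case the single sequence in `c3` does not count as retained, because `c3` holds fewer than 10% of the group.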
Silhouette score
A measure of how well an observation is sorted into a cluster, i.e. how far the observation is removed from the nearest adjacent cluster. The possible range of silhouette scores is −1 ≤ silhouette ≤ 1, with larger values indicating high separation from adjacent clusters. Elaboration on the rationale of silhouette scores can be found elsewhere [45]. Silhouette scores were calculated using cluster labels and the unique N × N matrix of sequence alignment scores using the function silhouette_samples in the Python package scikit-learn (sklearn).
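To make the definition concrete, the following pure-Python sketch computes per-observation silhouette scores from a precomputed distance matrix. It is a hand-written illustration of the standard formula, not the scikit-learn routine used in the analysis, and it assumes that alignment scores have already been converted to distances (for example, as one minus similarity).

```python
def silhouette_from_distances(dist, labels):
    """Per-observation silhouette scores from an N x N distance matrix.

    Requires at least two clusters; singleton clusters receive a score of 0.
    """
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the observation's own cluster.
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        scores.append((b - a) / max(a, b))
    return scores

# Four toy points on a line, forming two well-separated clusters.
points = [0.0, 1.0, 10.0, 11.0]
toy_dist = [[abs(p - q) for q in points] for p in points]
toy_scores = silhouette_from_distances(toy_dist, [0, 0, 1, 1])
```

Scores near one, as in this toy case, indicate that each observation sits much closer to its own cluster than to the nearest neighboring cluster.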
Overlap index
Overlap between two sets of sequences. Overlap between set A and set B is defined as:
. The possible range of values is 0 ≤ overlap ≤ 1, with one indicating two identical sets of overlapping sequences, and zero indicating totally distinct sets of sequences.
If a scoring method clusters sequences perfectly then purity, consistency, retention, and silhouette scores all equal one. If two scoring methods identify the exact same sequences as antigen-specific, overlap index will also be equal to one.
Results
AIRR residue frequency deviates substantially from the human proteome
Most immune receptor clustering tools calculate similarity (or distance) between immune receptors. Similarity calculations are heavily influenced by the weighted scoring matrix used in sequence alignment (Fig. 1A). Common scoring matrices in protein sequence alignment, including the BLOSUM and PAM matrices, are influenced by the native frequency of amino acids in their respective source data. An implicit assumption of this design is that query sequences aligned using these metrics share similar residue frequencies to training data. To gauge the applicability of this assumption to AIRR data, we sought to quantify the amino acid frequency of AIRR CDR3 regions. Our test data comprised 133 TCR repertoires and 86 BCR repertoires from previously published healthy control data. Residue frequency in TCR CDR3 regions had a Pearson correlation of 0.41 with the human genome, suggestive of a modest positive correlation (p = 0.073). IgH junctions had an amino acid frequency correlation coefficient of −0.12 with the human genome (p = ns; Fig. 1B). Similar biases were observed in comparison to residue frequency in the BLOCKS database, upon which the BLOSUM matrices are based (Fig. 1C). The amino acids C, F, and Y had the greatest over-representation in TCRB junctions relative to the human genome with >1-fold increase each, while W, C, and Y were most over-represented in IgH junctions with fold-changes of 3.1, 2, and 1.8, respectively. These biases are consistent with the consensus sequences of V and J genes flanking the CDR3: C and F are conserved residues at the ends of each TRB junction while C and W are conserved ends of IGH junctions. While the repertoire data used in these residue frequency calculations are limited in scope, these findings suggest limitations in the applicability of genomically-derived scoring matrices for the analysis of AIRR sequences.
As receptor-ligand interactions are ultimately molecular interactions based upon physiochemical characteristics, we hypothesized that antigen specificity should be inferable based upon these properties. Though the physiochemical characteristics of a 3D protein are inherently more complex than inference based upon primary sequence data, the potential utility of such measures to infer antigen binding has been noted by others [12, 17, 46].
Figure 1.
Biases in the amino acid frequency of immune repertoires. (A) Many tools for immune receptor clustering involve an algorithm to score similarity between any immune receptor sequences. A critical parameter in scoring is the weights matrix used to assign similarity between residues. The most common weights matrix is the BLOSUM62, derived from the BLOCKS database for use in genomic protein alignment. (B) Frequency of the 20 amino acids in the human genome plotted against residue frequency in TCRB and IGH repertoires of healthy controls. (C) Amino acid frequency within the BLOCKS database plotted against residue frequency in TCRB and IGH repertoires of healthy controls. Fold change of all residues in IGH repertoires compared to the human genome. Mean ± standard error shown where applicable.
Generation of novel matrix factors for physiochemical characterization of protein sequences
To generate substitution matrices which described physiochemical characteristics while ignoring genomic biases, we applied matrix factorization approaches to describe 443 physiochemical properties from AAIndex, a database of values curated from the literature [47]. This number of features greatly exceeds those considered in the generation of the KFs (188 properties) and the AFs (54 properties). Matrix factorization refers to an array of methods which reduce high-dimensional data to a lower number of patterns, and can be applied to interpret the dominant features in a high-dimensional dataset (Fig. 2A) [48]. We applied two different matrix factorization methods, ICA and NMF with our Bayesian method CoGAPS, to determine the low-dimensional set of factors that describe the features in AAIndex (Supplementary Fig. S1). For each approach, factor sets of between 2 and 15 factors were generated. The final factor set was determined through assessment of sum of squared error and the human interpretability of resultant factors (Supplementary Fig. S2). ICA generated a set of six factors to describe source data. Exclusive markers for each independent component (IC) were computed by measuring the association of each feature loading in the amplitude matrix with each IC using the patternMarkers statistic published previously [49]. Three of six ICs were interpretable, corresponding to hydrophobicity, solvent accessibility, and beta-sheet propensity, while three ICs were combinations of properties (Supplementary Table S1).
Figure 2.
Physiochemical matrix factorizations to describe physiochemical properties of amino acids. (A) Schematic of how matrix factorization may be applied to properties of the AAIndex database to generate low-dimensional patterns which represent the larger dataset. We performed matrix factorization using two separate methods: the InfoMax algorithm for ICA, and CoGAPS. We additionally studied two existing factor sets, the Atchley and KFs, for a total of four physiochemical factor sets. The resultant pattern matrices from these methods were each transformed into a 20 × 20 similarity matrix for use in similarity scoring. (B) Pearson correlation coefficient between seven similarity matrices including four physiochemical measures and three traditional measures. Correlation times 100 shown in boxes. (C) Magnitude of the largest changes in similarity matrix indices between the BLOSUM62 and four physiochemical similarity matrices based on the KFs, Atchley, ICA, or CoGAPS factors, respectively. Change quantified in standard deviations from the mean. (D) Hierarchical clustering of all four matrix factorizations led to a new ensemble set of six physiochemical patterns describing the totality of amino acid properties. (E) Correlation of Ensemble similarity matrix with all other similarity matrices (left) and the largest element-wise changes in ensemble similarity matrix relative to the BLOSUM62 (right). Figure components created using BioRender (biorender.com/n78k103).
CoGAPS NMF is an alternative matrix factorization method to ICA which models feature contributions as purely additive and maximizes parsimony through a sparsity constraint embedded in the prior distribution, while relaxing constraints on the independence of matrix factors [48, 50, 51]. We applied CoGAPS to AAIndex in the same manner as ICA in an attempt to generate more biologically interpretable patterns. Four of six total CPs were interpretable, corresponding to hydrophobicity, α-helix propensity, β-sheet propensity, and accessibility (Supplementary Table S2). These proportions are improved over the 6/10 mixed-property factors developed by Kidera and comparable to 2/5 mixed AFs (Supplementary Tables S3 and S4) [14, 15]. Consistent with the tendency of the CoGAPS prior towards increased parsimony, the amplitude matrix of this factorization had fewer high-amplitude features per pattern relative to ICA (Supplementary Fig. S1). Overall, the ICA factorization explained 81% of variance while CoGAPS factors captured 66% of total variance (Supplementary Fig. S2). Nonetheless, the marker features identified for the CoGAPS factorization by the patternMarkers statistic suggest increased interpretability compared to those obtained from the ICA factorization. In summary, we generated two new distinct physiochemical matrix factorizations for potential use in characterizing immune receptor sequences.
Correlation analysis among matrix factorization systems
With knowledge that our ICA and CoGAPS reductions each provided distinct low-dimensional summaries of physicochemical properties, we compared our learned features with the previously published KF and AF sets. Pearson’s R was calculated among the four sets of physiochemical patterns: ICs, CPs, KFs, and AFs. Both IC and KF sets had minimal autocorrelation. Factors describing bulk were highly correlated between CoGAPS and ICA factorizations, with a correlation coefficient of 0.84. AF3 and AF5 had a correlation coefficient of 0.83, with the remainder having low autocorrelation. CPs had low to moderate correlations with one another, ranging from −0.52 to 0.39 (average 0.17). While our ideal matrix factorization would comprise orthogonal factors, some autocorrelation may improve biological interpretability. Between factorizations, no two sets of factors were redundant, but correlations between individual patterns were observed. AF1, KF4, and CP1 all demonstrated high correlations (>0.85), consistent with their shared interpretation as polarity-related factors. Secondary structure factor KF1 is nearly identical to AF2 (R = 0.96), while beta-sheet factor KF3 was similar to IC6 and CP4 (R = 0.81). High correlations were not observed between mixed-property factors with unclear biological interpretability.
Comparison of similarity matrices
Existing metrics for protein homology such as the PAM and BLOSUM systems are 20 × 20 pairwise similarity matrices which compare the similarity of every amino acid to every other amino acid. To use our own physiochemical metrics in an analogous manner, we converted our ICA and NMF pattern matrices into similarity matrices through the transformation
, where D is the Euclidean distance matrix of the pattern matrix. We additionally transformed the KF and AF matrices for analogous comparisons. In total, we compared our four transformed factorizations (Kidera, Atchley, ICA, and CoGAPS) to three common similarity matrices from the literature (PAM250, BLOSUM62, and Gonnet) [6, 52, 53]. All seven matrices were at least moderately correlated, with a minimum pairwise correlation of 0.41 (Fig. 2B). The three existing matrices (BLOSUM62, Gonnet, and PAM250) were the most related, with all pairwise correlations ≥0.87. In contrast, most pairwise correlations between physiochemical metrics were below R = 0.8. The maximum correlation between any of the four physicochemical matrices was between CoGAPS and ICA, with R = 0.86. The Atchley similarity matrix had the least similarity to any of the other matrices, with a maximum correlation of 0.57 with the Kidera matrix and a minimum R of 0.41 with the PAM250. As the BLOSUM62 is most commonly used in AIRR clustering algorithms, we compared individual values in our four physiochemical similarity matrices to those in the BLOSUM62 (Fig. 2C). Unsurprisingly, W-W scoring was the most overvalued match in the BLOSUM62 compared with most physiochemical reductions, consistent with germline-encoded frequency biases discussed previously. Matches undervalued in the BLOSUM62 relative to our physiochemical metrics varied depending on the metric used.
Collectively, our correlation analysis demonstrates the nonredundancy of each physiochemical metric. To describe the aggregate result of all physiochemical factorizations into one model, we performed hierarchical clustering of all physiochemical matrix factors. Hierarchical clustering generated six clusters of factors, four of which had clear biological interpretability (Fig. 2D and Supplementary Table S5). For example, Ensemble cluster 4 comprised AF 1, KF 4, and CoGAPS factor 1. Each of these factors was independently interpreted as corresponding to Polarity. We averaged individual factors within each hierarchical cluster to create a final set of six ensemble matrix factors which consider the results of all four physiochemical matrix factorizations. The correlation of each ensemble factor with individual properties from AAIndex was consistent with the interpretation of each hierarchical cluster. We lastly generated a similarity transform of our Ensemble matrix factorization and compared the resultant similarity matrix to all other tools. Our ensemble similarity matrix was most similar to the CoGAPS-generated metric (R = 0.87) while least comparable to the Atchley similarity matrix (R = 0.58, Fig. 2E, left). Again, our ensemble similarity matrix was most divergent from the BLOSUM62 in W-W scoring (Fig. 2E, right).
Homolig as a framework for comparison of clustering metrics including physiochemical clustering
We next sought to compare the ability of each similarity matrix to identify rare immune receptor populations of interest. This analysis was enabled through the development of our new immune receptor clustering software Homolig (Homol-Ig) for the flexible clustering of either TCR or BCR immune receptors using any user-specified scoring matrix. The sequence clustering module is implemented in Python 3.10; sequence distances may be computed for single- or paired-chain inputs, with each chain weighted equally. In addition to the sequence clustering module, we generated a physiochemical characterization module in R 4.3.1 which allows for the absolute scoring of sequences based on user-specified matrix factors (Supplementary Fig. S3). The combination of these two modules allows for both relative clustering of immune receptors and an absolute descriptive analysis of the properties responsible for the co-clustering of specific sequences. By transforming matrix factors into similarity matrices, our software can uniquely execute both modules using the same scoring system.
The goal of our new software was to centralize clustering metrics of AIRR sequences for cross-comparison. Therefore, we first used it to evaluate the impact of the sequence scoring for the similarity matrices in simulated data with a known ground truth. Specifically, we simulated a dataset by spiking artificial motifs into repertoire data from previously published healthy controls [23, 32]. We used six specific parameters from AAIndex as the basis for motif generation to create motifs of physiochemically diverse sequences. These parameters included measures of polarity, volume, and other key residue characteristics. For each of the six parameters, we created eight-residue motifs which were biased in composition towards the five amino acids with the greatest values of the given parameter. For two parameters, p-Ka and volume, we also generated motifs biased towards the bottom five residues for a total of eight physiochemical motifs. We independently inserted each of these motifs into 2% of immune receptors in 20 TCR repertoires from 20 healthy controls, and 21 memory B-cell BCR repertoires subsampled from 3 healthy controls [23, 32]. Following this process, we obtained eight versions of each healthy control repertoire, each with 2% of sequences containing a motif of interest.
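The motif simulation described above can be sketched in a few lines. This is a hedged illustration only: the function names, the `bias` parameter, the choice to overwrite the central residues of the CDR3, and the toy property values are all assumptions made for the example, since the text does not specify how motifs were composed or positioned.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def make_motif(property_values, length=8, top_n=5, bias=0.8, rng=random):
    """Build a motif biased towards the top_n residues for one property.

    property_values: dict residue -> numeric value (e.g. one AAIndex entry).
    bias: probability of drawing each position from the favoured residues
          (hypothetical parameter).
    """
    ranked = sorted(property_values, key=property_values.get, reverse=True)
    favoured, others = ranked[:top_n], ranked[top_n:]
    return "".join(
        rng.choice(favoured) if rng.random() < bias else rng.choice(others)
        for _ in range(length)
    )


def implant_motif(cdr3s, motif, rate=0.02, rng=random):
    """Overwrite the centre of a random fraction of CDR3s with the motif.

    Returns the modified list and the set of altered indices. Sequences too
    short to keep a flanking residue on each side are left untouched
    (an assumption).
    """
    eligible = [i for i, s in enumerate(cdr3s) if len(s) >= len(motif) + 2]
    n_implant = min(len(eligible), round(rate * len(cdr3s)))
    chosen = set(rng.sample(eligible, n_implant))
    out = []
    for i, seq in enumerate(cdr3s):
        if i in chosen:
            start = (len(seq) - len(motif)) // 2
            seq = seq[:start] + motif + seq[start + len(motif):]
        out.append(seq)
    return out, chosen
```

Passing a seeded `random.Random` instance as `rng` makes a simulation reproducible; motifs biased towards the bottom five residues can be produced by negating the property values.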
Each simulated repertoire underwent clustering using multiple scoring matrices: the identity matrix (which computes Hamming distance), the BLOSUM62 and the related PAM250 and Gonnet matrices, reductions based on the KFs and AFs, our ICA and CoGAPS factors, and our physiochemical ensemble metric. We also included the tcrBLOSUM, PhysChemSim, and TopoSim matrices recently created by Postovskaya et al. for a total of twelve similarity metrics to cluster each repertoire [12]. In each simulation, we first identified ‘Motif Clusters’ as clusters which contain at least 10% of motif sequences. We then evaluated several measures of clustering performance within these clusters similarly to previously published analyses [4, 5]. Cluster purity was defined as the proportion of motifs within each Motif Cluster (range 0–1), consistency as the proportion of all motifs within the most populated cluster (range 0–1), and retention as the proportion of motifs successfully grouped into any Motif Cluster (range 0–1). An ideal method would group all motif-containing sequences into one cluster, with purity of one (all cluster members have the motif), consistency of one (all motif sequences are members of the cluster), and retention of one (all motif sequences were assigned to a motif cluster). Therefore, optimal performance minimizes the number of motif clusters while maximizing purity, consistency, and retention. Lastly, we computed the silhouette score of all members of Motif Clusters to measure goodness-of-fit to their respective cluster. Unless specified, we report the median value among all simulations for each similarity metric.
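The three cluster statistics defined above can be computed directly from cluster assignments. The sketch below is illustrative rather than the code used in the study: the function and argument names are hypothetical, "most populated cluster" is read as the Motif Cluster holding the most motif sequences, and unclustered sequences are assumed to be absent from the assignment mapping.

```python
from collections import Counter


def motif_cluster_metrics(cluster_of, motif_ids, min_fraction=0.10):
    """Purity, consistency, and retention for one simulated repertoire.

    cluster_of: dict sequence id -> cluster label (unclustered ids omitted).
    motif_ids:  set of sequence ids carrying the implanted motif.
    """
    sizes = Counter(cluster_of.values())
    motif_counts = Counter(cluster_of[s] for s in motif_ids if s in cluster_of)
    n_motif = len(motif_ids)

    # A Motif Cluster contains at least min_fraction of all motif sequences.
    motif_clusters = [c for c, k in motif_counts.items() if k / n_motif >= min_fraction]
    if not motif_clusters:
        return {"motif_clusters": [], "purity": {}, "consistency": 0.0, "retention": 0.0}

    purity = {c: motif_counts[c] / sizes[c] for c in motif_clusters}
    consistency = max(motif_counts[c] for c in motif_clusters) / n_motif
    retention = sum(motif_counts[c] for c in motif_clusters) / n_motif
    return {
        "motif_clusters": motif_clusters,
        "purity": purity,
        "consistency": consistency,
        "retention": retention,
    }
```

For example, if six of ten motif sequences fall in a cluster of eight members, three fall in a cluster of four, and one is unclustered, both clusters have purity 0.75, consistency is 0.6, and retention is 0.9.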
In TCR simulations, purity ranged between 0.20 and 0.32 for all matrices, indicating poor specificity in all conditions. Consistency ranged between 0.30 and 0.40, with all five physiochemical metrics (Ensemble, ICA, CoGAPS, Atchley, and Kidera) returning lower consistency than the PAM250, Gonnet, or BLOSUM matrices (Fig. 3C and D). All similarity metrics demonstrated strong retention of motif TCRs (retention 0.91–0.99) (Fig. 3C and D). Despite lower consistency, physiochemical metrics trended towards higher silhouette scores (0.23–0.40) than PAM250, Gonnet, BLOSUM62, or TCR-BLOSUM (Silhouette 0.14–0.21, Fig. 3C and D). The greatest silhouette score was achieved by the PhysChemSim matrix (0.40, Fig. 3C and D). All matrices performed worse in BCR repertoires relative to TCR simulations. Median purity ranged from 0.06 to 0.08, indicating a false positive rate of ≥92% in all conditions (Fig. 3F and G). Consistency was maximized using the PhysChemSim matrix (0.40) and minimized in all physiochemical reductions (0.11–0.13) (Fig. 3F and G). Cluster retention was consistently lower for most physiochemical metrics (range 0.13–0.23) relative to PAM250, Gonnet, or BLOSUM (median 0.28–0.34), with the exception of the PhysChemSim matrix (retention 0.66). Similarly to TCR simulations, silhouette scores were greater for our physiochemical metrics (median 0.13–0.18) and greatest for the PhysChemSim matrix (0.26, Fig. 3F and G). When metric performance was assessed separately across all eight simulated motifs, some variability was observed. However, performance by motif was largely similar to aggregated results.
Figure 3.
Clustering performance in synthetic immune repertoire simulations. (A) Experimental design. A total of 20 HC TCRB repertoires and 21 HC IGH repertoires were implanted eight times with a specific motif at 2% implantation rate. Following motif implantation each repertoire was clustered using nine similarity metrics. (B) For each of eight motifs, altered sequences were enriched for one of five amino acids with a shared physiochemical property. (C) Aggregate statistics of clustering efficacy in TCRB repertoires across all eight motifs. (D) Statistics of clustering efficacy in TCRB chains displayed by motif. (E) Overlap index, the proportion of cluster members which are identical between conditions, displayed by motif. Panels (F–H) identical to panels (C–E) for IGH repertoire simulations. Panel (A) created using BioRender (biorender.com/i56u515).
Lastly, we examined the degree of overlap between solutions. Metrics with a high overlap tend to identify the same sequences as motif-specific, while those with a low overlap arrive at divergent solutions. We found high overlap between all metrics in TCR simulations, with most values ≥0.9 (median 0.95, interquartile range (IQR) 0.93–0.96, Fig. 3E); physiochemical metrics trended towards higher pairwise overlap than traditional metrics. In BCR simulations, lower overlap was observed (median 0.37, IQR 0.31–0.44) (Fig. 3H). Some variation in these patterns was seen among individual motifs.
In summary, clustering of eight motif simulations in both TCR and BCR repertoires demonstrated heterogeneity in performance depending on the similarity metric used. Repeating these simulations using different frequencies of motif insertion yielded similar results (Supplementary Fig. S4). While cluster purity and consistency were higher with traditional metrics in most simulations, physiochemical metrics tended to display higher silhouette scores. Cluster retention was greater with physiochemical metrics in TCR simulations, but lower in BCR simulations. We speculate this decreased performance may be related to the greater range in lengths of BCR CDR3 regions relative to TCRs. However, our simulations are highly contrived approximations of antigen-specific populations, with limited generalizability to other applications.
Performance of similarity matrices in simulations of human immunodeficiency virus and yellow fever infection
To explore immune clustering in additional contexts, we simulated viral infection in the same 20 healthy control TCR repertoires used previously. Rather than insert artificial motifs into TCRs, we identified two populations of HLA-restricted, epitope-specific CD8 TCRB sequences against antigens in human immunodeficiency virus (HIV) and yellow fever virus (YFV) catalogued in VDJdb [33]. These two TCR groups are ideal positive controls for clustering as each group binds one validated HLA-peptide complex. Additionally, our cohort of healthy controls was recruited in the USA and is unlikely to have been exposed to either infection, with HIV prevalence at 0.4% in the USA [54] and YFV endemic to sub-Saharan Africa. Each simulation replaced 2% of healthy control TCRs with TCRB chains specific to either HIV Gag protein (peptide KAFSPEVIPMF, HLA-B*57:01) or YFV NS4B protein (peptide LLWNGPMAV, HLA-A*02:01), and was evaluated for the same cluster performance statistics described previously. Similarly to artificial motif simulations, we considered any cluster containing 10% or more of all antigen-specific sequences to be antigen-specific. In HIV clustering, Hamming distance performed poorly, with the lowest overall purity, consistency, and silhouette score among all nine metrics (Fig. 4B–E). PAM250 had the greatest purity (0.4), consistency (0.28), and silhouette score (0.39) while the BLOSUM62 had the greatest retention (0.44). All metrics performed exceptionally poorly in clustering YFV TCRB sequences, failing to generate any antigen-specific clusters in the majority of samples (Fig. 4B–E). The PhysChemSim matrix yielded a higher range of cluster purities for HIV simulations, and higher silhouette scores in YFV simulations compared to other metrics (Fig. 4B and E). Among the minority of repertoires where antigen-specific clusters were generated, our Ensemble metric had the greatest median purity (0.11) and consistency (0.12).
Our ICA metric had the greatest mean retention (0.05) while Hamming distance had the highest median silhouette score (0.32). Overlap between HIV-specific clusters was high between all metrics, ranging between 0.76 and 0.92, while YFV overlap ranged between 0 and 0.16 (Fig. 4F). The greatest overlap between HIV clusters was observed between nonphysiochemical metrics, while in YFV the greatest overlap occurred between Ensemble, CoGAPS, and ICA metrics. In HIV simulations, the tcrBLOSUM had moderate overlap with traditional similarity metrics Gonnet, PAM250, and BLOSUM62 (overlap range 0.89–0.90), but this was not observed in YFV simulations (overlap range 0.00–0.08, Fig. 4F).
Figure 4.
Clustering of viral TCRB repertoire simulations. (A) Experimental design. A total of 20 Healthy Control TCR repertoires were implanted with TCRB sequences validated against HIV Gag protein or YFV NS4B protein at a 2% implantation rate. (B–E) Purity, Consistency, Retention, and Silhouette scores, respectively, for HIV and YFV simulations. (F) Overlap Index for HIV and YFV simulations. (G) Sequence motif (left) and physiochemical characterization (right) of HIV and YFV sequences. Despite both TCR groups lacking strong consensus sequences, anti-HIV TCRs appear to demonstrate a more conserved physiochemical profile than YFV sequences, possibly explaining improved clustering. Panel (A) created using BioRender (biorender.com/v08x040).
We next sought to understand why all similarity metrics clustered HIV-specific sequences more effectively than YFV sequences. We aligned the CDR3s of each antigen-specific group using ClustalW to identify possible motifs, but failed to identify strong sequence conservation in either HIV or YFV, with the exception of conserved V and J positions at each end of the CDR3 (Fig. 4G). We then applied our Ensemble factors to characterize physiochemical properties, computing the difference in factor scores between antigen-specific CDR3s and healthy control TCRB sequences. We found the HIV sequences to have a more distinct physiochemical profile than YFV sequences, with significantly larger magnitude scores for four of six Ensemble factors (padj < 0.001, Fig. 4G). Our Beta-sheet Ensemble factor most strongly distinguished HIV sequences from YFV (0.21 versus 0.05, padj < 1e-13). The physiochemical conservation of HIV TCRB properties, but not sequence motifs, suggests that physiochemical properties may be responsible for the improved ability of similarity metrics to cluster HIV sequences relative to YFV sequences.
Performance of similarity matrices in multiple group classification
Another context of immune receptor clustering may be the distinct separation of multiple groups, rather than isolation of one antigen-specific population against the remainder of a repertoire. We assembled two biologically relevant datasets of paired-chain TCRs curated from VDJdb [33] and additional studies [34–37, 55]: 350 TCRs validated against 9 epitopes from MANAs and 4583 TCRs validated against 63 viral epitopes (Fig. 5A). Clustering was performed under each similarity metric using alpha and beta chains, weighted equally. For each scoring metric, we measured purity and consistency across all epitopes. Under this testing framework, the best performing similarity metrics should yield higher consistency and purity measures across epitope-specific sequences.
Figure 5.
Clustering of curated antigen-specific TCRs. (A) Experimental design. Two datasets of paired-chain TCR sequences from VDJdb and published literature reactive to 9 tumor neoantigens (dataset 1) or 63 viral epitopes (dataset 2) were clustered. Clustering performance was evaluated across similarity measures. (B) Epitope purity by metric for anti-MANA TCRs. (C) Epitope consistency by metric for anti-MANA TCRs. (D) Epitope similarity ratio (ESR) by metric for anti-MANA TCRs. ESR is defined as the median similarity score among members of epitope-specific TCRs divided by the median similarity score of those TCRs against nonepitope-specific sequences. (E) Overlap index among similarity measures averaged across all epitope-reactive anti-MANA TCR groups. Panels (F–I) identical to (B–E) for antiviral repertoire simulations. Individual viral epitopes not labeled in panels (F) and (G) for readability. Panel (A) created using BioRender (biorender.com/k10c986).
We failed to identify a consistently superior scoring metric in either MANA or Viral datasets. Different similarity metrics were most effective in classifying certain epitope-specific sequences, but not others (Fig. 5B and C). For example, in MANA data the BLOSUM62 had the greatest purity (0.90) in clustering epitope FLASKIGRLV (Protein iPLA2-beta, multitumor antigen), but the lowest purity (0.20) in clustering epitope EAAGIGILTV (Protein Melan-A, melanoma antigen). The PhysChemSim matrix exhibited a greater ESR than any other similarity metric, including other physiochemical-based metrics (Fig. 5E and I). In contrast, the largest changes in cluster purity and consistency were observed based on epitope rather than similarity metric (Supplementary Fig. S5). This may relate to inherent differences in the stringency underlying different antigen-receptor interactions, or variability in the quality of source data.
We then repeated clustering using variable amounts of sequence information, using either alpha or beta chains alone, and including or omitting V gene information. Using paired chain information improved performance relative to single-chain clustering. Trends observed between similarity metrics did not substantially change based on the amount of sequencing information, although we note variation of the collective performance of all metrics (Supplementary Fig. S6). Interestingly, the use of CDR3α sequence alone was more effective in clustering anti-MANA TCRs than using all TCRα chain information for the majority of similarity metrics; this was not true in clustering antiviral TCRs.
While often useful, clustering has limitations and may yield unstable solutions [56, 57]. Identifying the optimal clustering algorithm and resolution for each test case is beyond the scope of this manuscript. To overcome possible limitations of clustering, we sought to gauge the efficacy of our similarity metrics using continuous measures. We defined an epitope similarity ratio (ESR) as the median similarity between TCRs with the same epitope specificity divided by the median similarity between TCRs with different epitope specificities. Under this framework, higher ESRs suggest improved performance with better separation of epitope-specific sequences from background. In both MANA and viral datasets, similarity metrics yielded comparable outcomes (Fig. 5E and I). A comparable range (MANA: 0.86, 3.22; viral: 0.95, 2.23) and median (MANA: 1.07; viral: 1.05) ESR was observed between datasets. The greatest variability in ESR was observed based on epitope specificity rather than the similarity metric used to score sequences (Supplementary Fig. S5). Overlap between solutions was greatest among physiochemical matrices in MANA data, but traditional metrics in viral data (Fig. 5D and H). In summary, we clustered two curated datasets of TCRs validated as specific to a variety of epitopes. No single similarity metric was more effective in sorting epitope-specific sequences, with different metrics yielding the best performance depending on the epitope. The greatest determinant of efficacy was the TCR epitope in question, rather than the similarity metric used to group sequences. Following the above examples of simulated data, we next sought to perform repertoire analysis in two translationally relevant sets of patient data: (i) the inference of tumor-reactive TCRs in pancreatic cancer, and (ii) inference of PAD4-reactive TCR and BCRs in rheumatoid arthritis.
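The ESR defined above reduces to a ratio of two medians, sketched below. The function names are hypothetical, and `toy_similarity` (fraction of identical positions) is a stand-in for whichever scoring matrix is under evaluation; it is not one of the metrics compared in this study.

```python
from itertools import combinations
from statistics import median


def toy_similarity(a, b):
    """Fraction of identical positions (illustrative stand-in scorer)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def epitope_similarity_ratio(epitope_of, epitope, similarity=toy_similarity):
    """Median within-epitope similarity over median between-epitope similarity.

    epitope_of: dict TCR sequence -> epitope label.
    epitope:    the epitope whose TCRs are being evaluated.
    Assumes at least two TCRs for the epitope and a nonzero between-epitope
    median.
    """
    members = [t for t, e in epitope_of.items() if e == epitope]
    others = [t for t, e in epitope_of.items() if e != epitope]
    within = [similarity(a, b) for a, b in combinations(members, 2)]
    between = [similarity(a, b) for a in members for b in others]
    return median(within) / median(between)


# Toy CDR3s with invented epitope labels.
toy_tcrs = {
    "CASSLGF": "E1",
    "CASSLAF": "E1",
    "CASSQAF": "E1",
    "CATTPRF": "E2",
    "CAWWPRF": "E2",
}
esr_e1 = epitope_similarity_ratio(toy_tcrs, "E1")
```

An ESR near 1 means that TCRs sharing an epitope are no more similar to one another than to unrelated TCRs under the chosen metric.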
Inference of anti-mKRAS TCRs in peripheral blood repertoires of a pancreatic cancer patient
In a separate study, a cohort of pancreatic ductal adenocarcinoma (PDAC) patients were administered a synthetic peptide vaccine targeting common KRAS mutations. Following vaccination, TCRB sequences reactive against mKRAS peptides were identified using in vitro T-cell expansion [38]. One patient from this cohort was selected for analysis using Homolig. Each of 291 expansion-validated TCRs against mKRAS peptides or the CEF peptide pool were compared against 83 310 TCRs sequenced without expansion from peripheral blood of the same individual (Fig. 6A). We sought to identify TCRs which have high similarity against validated anti-mKRAS TCRs (hsTCRs) to infer the frequency of antigen-specific T cells in peripheral blood. High Similarity was defined as a similarity score ≥99.99th percentile of all similarity scores in healthy control repertoires.
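The high-similarity call described above can be sketched as a percentile threshold derived from reference scores. This is an assumption-laden illustration: the function names are hypothetical, the percentile uses linear interpolation between ranks, a peripheral TCR is scored by its best match to any validated TCR, and `toy_similarity` is a placeholder scorer rather than one of the study's metrics.

```python
def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between ranks."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def toy_similarity(a, b):
    """Fraction of identical positions (illustrative stand-in scorer)."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


def high_similarity_tcrs(peripheral, validated, reference_scores,
                         similarity=toy_similarity, q=99.99):
    """Return peripheral TCRs whose best score against any validated TCR
    meets or exceeds the q-th percentile of the reference scores.

    reference_scores: similarity scores drawn from healthy control
    repertoires; validated must be non-empty.
    """
    threshold = percentile(reference_scores, q)
    return {
        t for t in peripheral
        if max(similarity(t, v) for v in validated) >= threshold
    }
```

Because the threshold is fixed by the control distribution, the fraction of hsTCRs in a sample can be compared across antigens scored with the same metric.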
Figure 6.
Analysis of TCRs from an individual with resected pancreatic cancer who received an mKRAS vaccine. (A) Experimental design. Anti-mKRAS TCRs identified through T-cell expansion assays were scored against nonexpanded peripheral TCRs from the same individual to infer the native prevalence of circulating antigen-specific T cells. (B) Proportion of high-similarity TCRs (hsTCRs) identified against TCRs validated to expand against one of six common KRAS mutations or CEF viral peptide pool. (C) Correlation of the proportion of hsTCRs by mKRAS antigen against ELISPOT data. Greater correlations suggest a more accurate inference of mKRAS reactivity. (D) Overlap indices for the hsTCRs identified by each similarity measure against G12A, G13D, and CEF, respectively. The degree of overlap between methods varied considerably depending on the antigen studied, with CEF displaying very high conservation of responses among all measures except the Gonnet, PAM250, and TCR-BLOSUM matrices. Panel (A) created using BioRender (biorender.com/fmida78).
Most similarity metrics were comparable in their ability to identify hsTCRs against mKRAS peptides, ranging from 0.001% to 0.085% of all TCRs (Fig. 6B and Supplementary Fig. S9A). The greatest variation was observed based on mKRAS mutation, rather than the metric used to compute similarity. On average, only 0.003% of TCRs were hsTCRs against G12D while 0.05% of TCRs were classified as hsTCRs against G13D. We note that the majority of PDAC tumors are G12D mutated, and the consistent underrepresentation of TCRs associated with this antigen in our PBMC samples may be associated with a loss in the periphery from successful trafficking to the tumor.
The greatest proportion of hsTCRs were identified against the CEF peptide pool, consistent with the expectation that antiviral responses against common pathogens are greater in frequency than a de novo antitumor response. We next correlated the proportion of mKRAS hsTCRs against in vitro ELISPOT data. While most metrics demonstrated moderate correlation with ELISPOT, physiochemical metrics outperformed traditional ones with Atchley, CoGAPS, and Kidera matrices exhibiting the greatest values of Pearson’s R (Fig. 6C). The Gonnet and PAM matrices each performed poorly with Pearson’s R <0.4 with ELISPOT data. For the remaining metrics, hsTCR frequencies had a greater correlation with ELISPOT data than the frequency of exact matches to known anti-mKRAS TCRs, supporting the general validity of inferring mKRAS reactivity based on sequence similarity (Supplementary Fig. S7). While hsTCRs identified against CEF were highly conserved across metrics (median overlap 0.98), conservation varied for different mKRAS peptides (Fig. 6D). hsTCRs against G12A were strongly conserved among metrics (median overlap 0.77), while G13D hsTCRs had low overlap (median overlap 0.38). Physiochemical metrics trended towards greater overlap than nonphysiochemical metrics. The PAM250, Gonnet, and tcrBLOSUM were markedly different from other metrics, with a lower overlap index in most instances (Fig. 6D).
Inference of anti-InfluenzaA TCRs in murine splenic repertoires
The Homolig software package can cluster immune repertoires in multiple species. To extend the previous analysis strategy beyond human repertoires, we used Homolig to compare published TCRB sequences from 333 validated mouse anti-InfluenzaA TCRs against splenic repertoires from two C57/B6 and two BALB-C mice ranging in size from 57 691 to 72 295 productive clones (Supplementary Fig. S8A). Splenic TCRs were considered High Similarity to anti-InfluenzaA TCRs if their similarity score exceeded the 99.99th percentile of all scores between splenic sequences, analogous to the pancreatic cancer/mKRAS analyses described previously (Supplementary Methods). The median percentage of hsTCRs identified across repertoires ranged from 0.11% to 0.48% depending on the similarity metric used, with 9 similarity metrics between 0.23% and 0.27%. The PAM250 and Gonnet matrices identified few sequences as hsTCRs (0.11% each), while the tcrBLOSUM labeled at least 0.43% of sequences as hsTCRs in all mice (Supplementary Fig. S8B). Although the ground truth (the true frequency of anti-InfluenzaA TCRs in mouse spleens) is unknown, the overlap index between sets of hsTCRs varied widely (median 0.16, IQR 0.10–0.30), suggesting divergent solutions which vary based on the similarity metric used. Relative overlap between similarity metrics also varied from mouse to mouse (Supplementary Fig. S8C).
Characterization of Immune Receptors associated with anti-PAD4 autoantibodies in a sample of RA subjects
Lastly, we sought to understand the practical impact of scoring metrics in a setting distinct from tumor immunity or infection. Rheumatoid arthritis (RA) is a chronic autoimmune disease in which self-reactive antibodies cause inflammation in joints and other tissues [58, 59]. While self-reactivity against certain groups of antigens such as rheumatoid factor or citrullinated proteins is common among many individuals, a subset of RA patients with more severe disease additionally have reactivity against PAD4, a citrullinating enzyme implicated in the generation of citrullinated autoantigens [60]. To infer features of anti-PAD4 immune receptors, we sought to compare AIRR data of RA subjects with and without serological reactivity to PAD4. We performed TCRB sequencing on peripheral blood of 22 subjects with RA, including 14 with PAD4 reactivity (anti-PAD4+) and 8 with no PAD4 reactivity (anti-PAD4−). BCR heavy chain sequencing was additionally performed on 14 of the same individuals. We applied Homolig to cluster data from all TCR or BCR repertoires respectively, followed by enrichment testing to identify clusters specific to anti-PAD4+ subjects (Fig. 7A). For TCR analyses, enriched clusters were defined by a significant Fisher’s exact test following FDR correction (alpha = 0.01), cluster members belonging to at least two anti-PAD4+ subjects, and zero cluster members belonging to anti-PAD4− subjects. BCR clustering followed a similar procedure; however, as BCR clustering did not produce any PAD4+ exclusive clusters, enriched clusters were allowed to contain sequences belonging to anti-PAD4− subjects. By comparing the results of this pipeline using each of the similarity metrics previously discussed, we sought to assess their degree of consensus in identifying likely anti-PAD4 immune receptors.
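The three TCR enrichment criteria above can be sketched as follows. Several details are assumptions made for illustration: the 2 × 2 table (sequences inside versus outside a cluster, from anti-PAD4+ versus anti-PAD4− subjects), the one-sided test, the use of Benjamini-Hochberg as the FDR procedure, and all function names. The text specifies only a Fisher's exact test with FDR correction at alpha = 0.01.

```python
from math import comb


def fisher_enrichment_p(pos_in, neg_in, pos_total, neg_total):
    """One-sided Fisher's exact p-value for enrichment of anti-PAD4+
    sequences in a cluster (hypergeometric upper tail)."""
    n = pos_total + neg_total
    k = pos_in + neg_in  # cluster size
    upper = min(k, pos_total)
    tail = sum(comb(pos_total, x) * comb(neg_total, k - x)
               for x in range(pos_in, upper + 1))
    return tail / comb(n, k)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def enriched_clusters(clusters, pos_total, neg_total, alpha=0.01, min_subjects=2):
    """clusters: dict label -> list of (subject_id, is_pad4_positive),
    one tuple per member sequence. Returns labels meeting all criteria."""
    labels = list(clusters)
    pvals = []
    for label in labels:
        members = clusters[label]
        pos_in = sum(1 for _, is_pos in members if is_pos)
        pvals.append(fisher_enrichment_p(pos_in, len(members) - pos_in,
                                         pos_total, neg_total))
    hits = []
    for label, q in zip(labels, benjamini_hochberg(pvals)):
        members = clusters[label]
        pos_subjects = {s for s, is_pos in members if is_pos}
        has_negative = any(not is_pos for _, is_pos in members)
        if q < alpha and len(pos_subjects) >= min_subjects and not has_negative:
            hits.append(label)
    return hits
```

For the BCR analysis described above, the `has_negative` exclusion would be dropped while the significance and subject-count criteria are kept.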
Figure 7.
Clustering of repertoires from anti-PAD4+ and anti-PAD4− RA patients. (A) Experimental design. Pooled TCRB sequences from 22 subjects with or without serologic reactivity against PAD4 were clustered using Homolig. Enrichment testing identified clusters enriched for TCRs from anti-PAD4+ subjects, implying possible PAD4+ reactivity of those clonotypes. (B) Prevalence of putative anti-PAD4 sequences identified using each similarity measure, calculated as the percentage of sequences from anti-PAD4+ subjects which are specific to anti-PAD4+ clusters. (C) Overlap index among the members of anti-PAD4+ clusters identified using each similarity measure. (D) Consensus TCRB CDR3 motifs derived from the largest anti-PAD4+ cluster identified by each similarity measure. (E) Ensemble factor scores for the members of the largest anti-PAD4+ cluster identified by each similarity measure. Panel (A) created using BioRender (biorender.com/b13t188).
In both TCR and BCR clustering, the total number of significantly enriched putative PAD4-reactive receptors varied depending on the similarity metric used. Hamming distances identified 2.5% of all TCRs from anti-PAD4+ subjects as exclusive to the PAD4+ cohort, while the Atchley matrix classified only 0.8% of TCRs as PAD4+ exclusive (Fig. 7B). The total number of enriched TCRs for other metrics fell within this range, with no clear distinction between physiochemical and traditional metrics. For each pair of metrics, we calculated the overlap of sequences grouped into enriched clusters. Overlap was variable, ranging from 0.002 to 0.65 with a median value of 0.02 (Fig. 7C). We next characterized the largest enriched cluster identified by each similarity metric using ClustalW. While overlap between similarity metrics varied, all matrices identified a conserved glycine residue in the largest significant cluster (Fig. 7D). Lastly, we attempted to infer physiochemical properties of purported PAD4-reactive sequences by using Ensemble factor scores on the PAD4+-specific clusters identified using each similarity metric. Properties varied depending on the similarity metric used. For example, inferred PAD4-reactive sequences identified by all metrics appear to have positive Beta chain factor scores, while only four of nine metrics identify negative polarity factor scores (Fig. 7E). Similar variability was observed in BCR clustering; six metrics identified zero putative PAD4-reactive BCRs while the PAM250, Gonnet, and BLOSUM matrices demonstrated moderate convergence in identifying inferred PAD4-reactive BCRs (Supplementary Fig. S9).
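As an illustration of how a pairwise overlap between metric-specific solutions might be computed, the sketch below uses the Szymkiewicz-Simpson overlap coefficient, |A ∩ B| / min(|A|, |B|). Whether this matches the overlap index reported in Fig. 7C is an assumption; the function names and the toy CDR3 strings are hypothetical.

```python
from itertools import combinations


def overlap_coefficient(a, b):
    """Szymkiewicz-Simpson overlap between two collections of sequences."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0  # convention chosen here for empty solutions
    return len(a & b) / min(len(a), len(b))


def pairwise_overlap(members_by_metric):
    """Overlap for every unordered pair of metrics.

    members_by_metric: dict mapping metric name -> iterable of sequences
    placed in enriched clusters by that metric.
    """
    names = sorted(members_by_metric)
    return {
        (m1, m2): overlap_coefficient(members_by_metric[m1], members_by_metric[m2])
        for m1, m2 in combinations(names, 2)
    }


# Invented example sequences, for illustration only.
solutions = {
    "Hamming": {"CASSLGQETQYF", "CASSPGTEAFF", "CASSLGGNTEAFF"},
    "BLOSUM62": {"CASSPGTEAFF", "CASSLGGNTEAFF", "CASSQDRGYEQYF", "CASSLAGGYNEQFF"},
    "Atchley": set(),
}
for pair, value in pairwise_overlap(solutions).items():
    print(pair, round(value, 3))
```

Because the coefficient is normalized by the smaller set, a small solution fully contained in a larger one scores 1.0; a Jaccard index would penalize the size difference instead.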
Discussion
In analyzing residue frequency in the TCR and BCR junctions of healthy controls, we found that the amino acid composition of immune receptors is markedly different from that in the human genome. While our sample size is exceedingly small given the breadth of immune receptor repertoires, our findings were consistent with those of a prior analysis [61] and call into question the applicability of tools designed for germline sequence homology in the context of AIRRs. To generate a scoring system purely based on physiochemical characteristics, we generated two novel matrix factorizations which describe the majority of variation in properties present in the AAIndex database. Our factorizations considered a much greater amount of laboratory data than prior matrix factorizations of similar design, the KFs and AFs. In contrast to those systems, we processed property data with minimal manual supervision to minimize the degree of human bias in factor generation. Our ICA and CoGAPS reductions were nonredundant with variable degrees of autocorrelation and correlation with other factor systems. Some factors, such as those describing polarity and β-structure, were highly correlated among systems while other mixed-property factors had moderate to low cross-correlations. The nonredundancy of these systems points to the importance of both feature selection and factorization algorithm in generating matrix factors. While the Kidera and ICA factorizations were specifically engineered for minimal correlation between patterns, the Atchley and CoGAPS reductions were not. The latter approaches had a higher proportion of interpretable factors, perhaps reflecting an inherent reality that conceptually distinct physiochemical properties may not be independent (for example, one may anticipate residue bulk/volume being negatively associated with the formation of helical secondary structure and turn propensity).
To consider the respective differences of all four physiochemical factorizations, we lastly generated an Ensemble factorization derived from hierarchical clustering which culminated in six physiochemical factors with biological interpretability.
To query whether any of the described physiochemical factorizations might be superior in the context of immune receptor clustering, we transformed each factorization into a similarity matrix to directly compare these metrics with standards used in sequence alignment: the PAM250, Gonnet, BLOSUM62, and simple string distance (Hamming). Our physiochemical metrics had variable correlations to other similarity matrices. Relative to all physiochemical-based metrics, the BLOSUM62 greatly overvalued W-W and other tryptophan pairings, consistent with biases originating from the low frequency of tryptophan in germline-encoded proteins. This finding supports our concern that assumptions in the BLOSUM62, developed for germline-encoded protein homology, may not be appropriate for the unique circumstances of VDJ recombination and immune repertoire generation.
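One generic way to turn a set of per-residue factor scores into a residue-by-residue similarity matrix is sketched below. The specific transform (z-scoring each factor, Euclidean distance, rescaling to the range 0 to 1) is an assumption chosen for illustration and is not taken from the Homolig source; the factor values are arbitrary placeholders, not real Kidera, Atchley, ICA, or CoGAPS scores.

```python
import numpy as np


def factors_to_similarity(factors):
    """Convert an (n_residues, n_factors) score table to an (n, n) similarity matrix.

    Similarity is 1 for identical factor profiles and 0 for the most
    dissimilar pair of residues in the table.
    """
    f = np.asarray(factors, dtype=float)
    # Standardize each factor so that no single factor dominates the distance.
    f = (f - f.mean(axis=0)) / f.std(axis=0)
    # Pairwise Euclidean distances via broadcasting.
    diff = f[:, None, :] - f[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return 1.0 - dist / dist.max()


# Placeholder scores for four residues on two hypothetical factors.
residues = ["A", "G", "W", "K"]
toy_factors = [
    [0.1, -0.5],
    [0.0, -0.9],
    [1.2, 0.8],
    [-0.7, 0.4],
]
similarity = factors_to_similarity(toy_factors)
for name, row in zip(residues, similarity):
    print(name, np.round(row, 2))
```

Under a transform of this kind, every residue is maximally similar to itself, so identical pairings such as W-W cannot be scored above other exact matches, unlike in log-odds matrices derived from observed substitution frequencies.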
Despite considerable variation in the scoring metrics described, their impact on sequence clustering remains unclear. In TCRB and IGH simulations of artificial motifs inserted into healthy control repertoires, traditional metrics had modestly improved purity and consistency relative to physiochemical metrics, while physiochemical metrics trended towards higher silhouette scores. This finding was recapitulated in TCR repertoire simulations of HIV infection, but not of YFV infection, in which all metrics were unable to effectively cluster antigen-specific TCRs. Using our Ensemble matrix factors, we demonstrated increased physiochemical conservation of HIV TCRs relative to YFV TCRs, offering a possible explanation for the poor efficiency of YFV clustering. When our similarity metrics were used to cluster collections of antigen-specific TCRs, variability was observed in the ability of similarity metrics to distinctly cluster each epitope-specific group of sequences: some groups of TCRs were consistently identified with similar purity or consistency by all metrics, while other groups of TCRs had variability in clustering performance depending on the similarity matrix used. No single similarity metric outperformed others; rather, specific metrics proved more effective in clustering certain epitope-reactive sequences.
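For readers unfamiliar with cluster purity, a minimal sketch of one common definition follows: the fraction of a cluster's members that carry its most frequent label. This is assumed to be comparable to, but is not confirmed to be identical with, the purity measure used in our evaluations; the function names and labels are illustrative.

```python
from collections import Counter


def cluster_purity(labels):
    """Fraction of cluster members that share the most common label."""
    if not labels:
        raise ValueError("cluster must contain at least one member")
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def mean_purity(clusters):
    """Size-weighted mean purity over a dict of cluster id -> member labels."""
    total = sum(len(members) for members in clusters.values())
    return sum(cluster_purity(m) * len(m) for m in clusters.values()) / total


# Invented epitope labels for two clusters.
example = {
    "cluster_1": ["HIV", "HIV", "HIV", "YFV"],
    "cluster_2": ["YFV", "YFV"],
}
print({cid: cluster_purity(m) for cid, m in example.items()})
print(mean_purity(example))
```

Purity alone rewards many small clusters, which is one reason to report it alongside a complementary measure such as consistency or silhouette score.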
Our final set of analyses proceeded to determine whether clusters of TCRs defined based on AIRR distance metrics could be queried against known antigen-specific TCRs to define hsTCRs in new patient cohorts to infer a known immune response. For this analysis, we applied immune receptor clustering in two novel patient datasets: (i) to assess the ability of similarity metrics to infer mKRAS-specific TCRs in pancreatic cancer, and (ii) to identify features suggestive of anti-PAD4 immune receptors in RA. In both contexts, we observed variable degrees of overlap between the solutions of different similarity metrics. All similarity metrics, however, were able to broadly satisfy the purpose of each TCRB analysis (inferring anti-mKRAS and anti-PAD4 immune receptors, respectively). IGH clustering of RA repertoires appears less successful than in the TCRB case, with only four of eleven metrics identifying any enriched clusters (Supplementary Fig. S9). This parallels the poor performance of all metrics in simulated IGH repertoires (Fig. 3). Longer junction length and somatic hypermutation may explain this discrepancy, which deserves further study.
Overall, our findings also held true when comparing our metrics to the novel tcrBLOSUM, developed to avoid physiochemical biases in TCR clustering. Solutions using the tcrBLOSUM sometimes had low overlap with antigen-specific sequences generated by other metrics, for example in identifying hsTCRs to mKRAS (Fig. 6). The detection of distinct hsTCR sequences may reflect unique properties of the tcrBLOSUM’s epitope-specific training data, which come with advantages and potential limitations highlighted by the authors [12]. The Physiochemical similarity matrix developed in the same manuscript exhibited high purities when clustering HIV repertoire simulations and MANA-specific TCRs (Figs 4 and 5), revealing important differences in performance even among physiochemical-based metrics.
This study has several limitations. First, existing tools for AIRR clustering, including GIANA, Trex/Ibex, and others, each have distinct pre-processing steps, clustering algorithms, and alignment methods to generate sequence distances independent of the similarity metric used. We generated a new software pipeline, Homolig, with its own pre-processing and alignment methods. The relative impact of similarity metric on results may change depending on modifications to these additional stages. A cohesive analysis of our similarity metrics under every combination of pre- and post-processing steps is impractical and beyond the scope of this study, particularly as some tools do not have modifiable source code through which to substitute similarity metrics, though this is changing [62]. Different statistical approaches to evaluating performance may also alter results; a comparison of statistical analysis pipelines is beyond the scope of the current study. Second, the physiochemical metrics that we generated (ICA, CoGAPS, and Ensemble) all involve manual selection of the number of factors per set. Though we used several heuristics as a guide for pattern number selection, different decisions when creating our physiochemical measures may influence their relative performance in this study. Finally, this study relies on local FDR adjustments for all statistical tests and predominantly relies on Fisher’s tests to compare results between groups. In addition to the clustering metrics compared in this study, future work evaluating the statistical tests used in repertoire analysis, including hierarchical or mixed model FDR strategies, is an important area of research for AIRR analyses.
Additionally, our test cases do not capture the breadth of situations in which AIRR clustering may be performed or sensitivity to the context in which they are measured. For example, we note a consistent underrepresentation of TCR sequences associated with KRAS G12D mutations in the periphery of PDAC patients treated with a KRAS vaccine. We hypothesize that this specific underrepresentation in the periphery may represent successful trafficking of these TCRs into the tumor. However, further studies performing AIRR profiling of the tumor are necessary to fully delineate this response. Similarly, we did not assess differences in clustering naïve versus memory repertoires, or receptors from specific tissues such as germinal centers or tumors, which have distinct resident lymphocytes and proliferative populations. Even in clustering curated antigen-specific sequences such as our MANA and antiviral datasets, the quality of annotated sequences may vary based on the methods and validation used in their source publications. This may explain why some antigen-specific TCRs were effectively clustered but not others. Any immune repertoire simulation has inherent limitations, and an exhaustive exploration of all possibilities is infeasible. As access to high-quality sets of antigen-specific sequences continues to grow, so will the opportunities to improve test cases for all software tools in the field.
Lastly, two-dimensional sequence clustering is far from the only method of repertoire analysis. Increasingly, machine learning tools such as Immune-ML [63] are being designed to classify AIRR data. These algorithms may incorporate sequence distances into their model subject to the same limitations encountered in our study, but also tend to include additional parameters which are independent of similarity metrics [63, 64]. Three-dimensional predictions of protein structure and binding affinity are also growing in popularity and may play a larger role in AIRR analysis as both computational feasibility and access to training data improve [65, 66]. Ultimately, clustering based on sequence distance is only one way to capture features of a repertoire.
In conclusion, we have demonstrated potential vulnerabilities in the applicability of the BLOSUM62 and other traditional scoring metrics to the clustering of immune receptor repertoires, and developed alternative metrics based on physiochemical properties to address the problem. We developed a novel software tool, Homolig, to deploy these metrics for the simultaneous clustering and characterization of immune repertoires. Though we developed several new sets of physiochemical factors, assessments of clustering performance in both simulated and actual immune receptor data fail to demonstrate the superiority of these metrics. Rather, most metrics appeared to perform either well or poorly based on the simulation at hand. We additionally note that different scoring metrics demonstrated a low degree of convergence upon similar solutions as measured by overlap index, depending on the test case. This highlights the significant impact of scoring metrics on immune receptor clustering and suggests that a multi-pronged analysis using multiple methods may yield the most robust results. These findings are concordant with other recent studies on substitution matrices [12, 62].
Despite our inability to demonstrate superiority of physiochemical metrics in the cases tested, using physiochemical factors does allow for increased biological interpretation of sequence properties. Using matrix factors to characterize sequences in conjunction with factor-derived similarity metrics permits absolute and relative characterization of sequences in the same feature space, to our knowledge a new development in the field not possible with traditional methods. By providing the opportunity to describe sequence motifs and physiochemical motifs using any similarity metric, Homolig also offers complementary methods to describe clustered sequences of interest. Regardless of the method chosen, developers and users of AIRR clustering tools must carefully consider the impact of similarity metrics on their analyses, be mindful of their limitations, and assess multiple complementary approaches to achieve the most robust results.
Acknowledgements
The authors gratefully acknowledge use of the facilities at the Joint High Performance Computing Exchange (JHPCE) in the Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health that have contributed to the research results reported within this paper. We also thank Joana Carneiro Da Silva at the University of Maryland, Institute for Genome Sciences for her review and input on this manuscript. Illustrations were created in part using biorender.com. Structural protein illustrations, including BCR and TCR icons, obtained under a CC-BY-4.0 license from Molecule of the Month articles produced by the Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB-PDB) and the following authors: David Goodsell, Candice Craig, Samantha Eng, Jenna Manzo, Andrew Tkacenko, Stephen K. Burley, and Janet Iwasa.
Author contributions: Alexander A. Girgis (Conceptualization, Data curation, Formal Analysis, Investigation, Methodology, Software, Validation, Writing – original draft, Writing – review & editing), Amanda L. Huff (Conceptualization, Data curation, Investigation, Methodology, Supervision, Writing – review & editing), Emily Davis-Marcisak (Conceptualization, Investigation, Methodology, Software, Writing – review & editing), Theron Palmer (Conceptualization, Software, Writing – review & editing), Hanzhi Wang (Investigation, Formal Analysis, Writing – review & editing), Luciane T. Kagohara (Conceptualization, Supervision, Writing – review & editing), Janelle M. Montagne (Conceptualization, Supervision, Writing – review & editing), Dmitrijs Lvovs (Validation, Software, Writing – review & editing), Ludmila Danilova (Conceptualization, Supervision, Formal Analysis, Writing – review & editing), Alexander V. Favorov (Conceptualization, Supervision, Writing – review & editing), Jonathan Schneck (Data curation, Supervision, Writing – review & editing), Clifton O. Bingham III (Data curation, Funding acquisition, Writing – review & editing), Erika Darrah (Conceptualization, Data curation, Funding acquisition, Supervision, Writing – review & editing), Elizabeth M. Jaffee (Conceptualization, Supervision, Funding acquisition, Writing – review & editing), Neeha Zaidi (Conceptualization, Supervision, Methodology, Writing – review & editing), Bahman Afsari (Conceptualization, Formal Analysis, Investigation, Methodology, Supervision, Writing – original draft, Writing – review & editing), Elana J. Fertig (Conceptualization, Data curation, Funding acquisition, Investigation, Methodology, Supervision, Writing – original draft, Writing – review & editing).
Contributor Information
Alexander A Girgis, Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, 21218, United States; Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, 21231, United States; Division of Rheumatology, Department of Medicine, The Johns Hopkins University School of Medicine, Baltimore, MD, 21202, United States.
Amanda L Huff, Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, 21231, United States.
Emily Davis-Marcisak, Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, 21218, United States; Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, 21231, United States.
Theron Palmer, Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, 21218, United States; Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, 21231, United States.
Hanzhi Wang, Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, 21218, United States.
Luciane T Kagohara, Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, 21231, United States.
Janelle M Montagne, Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, 21231, United States.
Dmitrijs Lvovs, Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, 21201, United States.
Ludmila Danilova, Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, 21231, United States.
Alexander V Favorov, Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, 21231, United States.
Jonathan Schneck, Department of Pathology, Johns Hopkins University, Baltimore, MD, 21287, United States; Department of Medicine, Johns Hopkins University, Baltimore, MD, 21231, United States.
Clifton O Bingham III, Division of Rheumatology, Department of Medicine, The Johns Hopkins University School of Medicine, Baltimore, MD, 21202, United States.
Erika Darrah, Division of Rheumatology, Department of Medicine, The Johns Hopkins University School of Medicine, Baltimore, MD, 21202, United States.
Elizabeth M Jaffee, Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, 21231, United States.
Neeha Zaidi, Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, 21231, United States.
Bahman Afsari, Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, 21231, United States; HIV and AIDS Malignancy Branch, National Cancer Institute, Bethesda, MD, 20852, United States.
Elana J Fertig, Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, 21218, United States; Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, 21231, United States; Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, 21201, United States; Greenebaum Comprehensive Cancer Center, University of Maryland School of Medicine, Baltimore, MD, 21201, United States; Institute for Health Computing, University of Maryland, North Bethesda, MD, 20852, United States.
Supplementary data
Supplementary data is available at NAR online.
Funding
Funding was provided by the National Institutes of Health / National Cancer Institute U01CA253403, P01CA247886, P50CA062924, P30CA006973, F31CA250135 (to E.D.M.), F32CA271470-01 (to A.L.H.), R50CA243627 (to L.D.), the National Institutes of Health / National Institute of Arthritis and Musculoskeletal and Skin Diseases P30AR053503 and P30AR070254, the Break Through Cancer Foundation (to E.J.F. and A.V.F.), the Lustgarten Foundation (to E.M.J.), Stand Up To Cancer (to E.M.J.), the Jerome L. Greene Foundation (to E.D.), the Arthritis Discovery Fund (to C.O.B.), the Camille Julia Morgan Arthritis Research and Education Fund (to C.O.B.), and a Maryland Cancer Moonshot Research Grant to the Johns Hopkins Medical Institutions (FY24; to E.J.F.). B.A. was supported by the Intramural Research Program of the NIH.
Data availability
The Homolig software tool is available for download on Zenodo (10.5281/zenodo.14025889). Public-domain repertoires used in analyses, along with selected code to replicate analyses in this manuscript, are separately available on Zenodo (10.5281/zenodo.13984172). Pancreatic cancer patient repertoire data is available through dbGaP, study accession number phs003425.v1.p1. Rheumatoid arthritis patient data is available on the Adaptive Biotechnologies ImmuneAccess database (10.21417/ED2025S).
References
- 1. Katayama Y, Yokota R, Akiyama T et al. Machine learning approaches to TCR repertoire analysis. Front Immunol. 2022;13:858057. 10.3389/fimmu.2022.858057.
- 2. Mhanna V, Bashour H, Lê Quý K et al. Adaptive immune receptor repertoire analysis. Nat Rev Methods Primers. 2024;4:6. 10.1038/s43586-023-00284-1.
- 3. Huang H, Wang C, Rubelt F et al. Analyzing the Mycobacterium tuberculosis immune response by T-cell receptor clustering with GLIPH2 and genome-wide antigen screening. Nat Biotechnol. 2020;38:1194–202. 10.1038/s41587-020-0505-4.
- 4. Mayer-Blackwell K, Schattgen S, Cohen-Lavi L et al. TCR meta-clonotypes for biomarker discovery with tcrdist3 enabled identification of public, HLA-restricted clusters of SARS-CoV-2 TCRs. eLife. 2021;10:e68605. 10.7554/eLife.68605.
- 5. Zhang H, Zhan X, Li B. GIANA allows computationally-efficient TCR clustering and multi-disease repertoire classification by isometric transformation. Nat Commun. 2021;12:4699. 10.1038/s41467-021-25006-7.
- 6. Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA. 1992;89:10915–9. 10.1073/pnas.89.22.10915.
- 7. Bateman A, Martin M-J, Orchard S et al. UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Res. 2023;51:D523–31.
- 8. Trivedi R, Nagarajaram HA. Substitution scoring matrices for proteins – an overview. Protein Sci. 2020;29:2150–63. 10.1002/pro.3954.
- 9. Hess M, Keul F, Goesele M et al. Addressing inaccuracies in BLOSUM computation improves homology search performance. BMC Bioinformatics. 2016;17:189. 10.1186/s12859-016-1060-3.
- 10. Styczynski MP, Jensen KL, Rigoutsos I et al. BLOSUM62 miscalculations improve search performance. Nat Biotechnol. 2008;26:274–5. 10.1038/nbt0308-274.
- 11. Song D, Chen J, Chen G et al. Parameterized BLOSUM matrices for protein alignment. IEEE/ACM Trans Comput Biol Bioinf. 2015;12:686–94. 10.1109/TCBB.2014.2366126.
- 12. Postovskaya A, Vercauteren K, Meysman P et al. tcrBLOSUM: an amino acid substitution matrix for sensitive alignment of distant epitope-specific TCRs. Briefings Bioinf. 2024;26:bbae602. 10.1093/bib/bbae602.
- 13. Lagattuta KA, Kang JB, Nathan A et al. Repertoire analyses reveal T cell antigen receptor sequence features that influence T cell fate. Nat Immunol. 2022;23:446–57. 10.1038/s41590-022-01129-x.
- 14. Kidera A, Konishi Y, Oka M et al. Statistical analysis of the physical properties of the 20 naturally occurring amino acids. J Protein Chem. 1985;4:23–55. 10.1007/BF01025492.
- 15. Atchley WR, Zhao J, Fernandes AD et al. Solving the protein sequence metric problem. Proc Natl Acad Sci USA. 2005;102:6395–400. 10.1073/pnas.0408677102.
- 16. Kidera A, Konishi Y, Ooi T et al. Relation between sequence similarity and structural similarity in proteins. Role of important properties of amino acids. J Protein Chem. 1985;4:265–97. 10.1007/BF01025494.
- 17. Bolotin DA, Poslavsky S, Mitrophanov I et al. MiXCR: software for comprehensive adaptive immunity profiling. Nat Methods. 2015;12:380–1. 10.1038/nmeth.3364.
- 18. Mudd P, Borcherding N, Kim W et al. Antigen-specific CD4+ T cells exhibit distinct transcriptional phenotypes in the lymph node and blood following vaccination in humans. Nat Immunol. 2024. 10.1038/s41590-024-01888-9.
- 19. Borcherding N, Sun B, DeNardo D et al. Ibex: variational autoencoder for single-cell BCR sequencing. bioRxiv, 10.1101/2022.11.09.515787, 10 November 2022, preprint: not peer reviewed.
- 20. Hoffman RR, Mueller ST, Klein G et al. Metrics for explainable AI: challenges and prospects. 2018. 10.48550/ARXIV.1812.04608.
- 21. Xu F, Uszkoreit H, Du Y et al. Explainable AI: a brief survey on history, research areas, approaches and challenges. In: Tang J, Kan M-Y, Zhao D, Li S, Zan H (eds.), Natural Language Processing and Chinese Computing, Lecture Notes in Computer Science. Vol. 11839. Cham: Springer International Publishing, 2019, 563–74.
- 22. Greissl J, Pesesky M, Dalai SC et al. Immunosequencing of the T-cell receptor repertoire reveals signatures specific for diagnosis and characterization of early Lyme disease. 2021.
- 23. Hamm D. immunoSEQ hsTCRB-V4 control data. 10.21417/ADPT2020V4CD.
- 24. Guo Y, Chen K, Kwong PD et al. cAb-Rep: a database of curated antibody repertoires for exploring antibody diversity and predicting antibody prevalence. Front Immunol. 2019;10:2365. 10.3389/fimmu.2019.02365.
- 25. Shen S, Kai B, Ruan J et al. Probabilistic analysis of the frequencies of amino acid pairs within characterized protein sequences. Physica A. 2006;370:651–62. 10.1016/j.physa.2006.03.004.
- 26. Brock G, Pihur V, Datta S et al. clValid: an R package for cluster validation. J Stat Softw. 2008;25:1–22. 10.18637/jss.v025.i04.
- 27. Lefranc M-P. IMGT, the International ImMunoGeneTics Information System. Cold Spring Harb Protoc. 2011;2011:pdb.top115. 10.1101/pdb.top115.
- 28. Manso T, Folch G, Giudicelli V et al. IMGT® databases, related tools and web resources through three main axes of research and development. Nucleic Acids Res. 2022;50:D1262–72. 10.1093/nar/gkab1136.
- 29. Wolf FA, Angerer P, Theis FJ. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 2018;19:15. 10.1186/s13059-017-1382-0.
- 30. McInnes L, Healy J, Melville J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv, https://arxiv.org/abs/1802.03426, 18 September 2020, preprint: not peer reviewed.
- 31. Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple testing under dependency. Ann Statist. 2001;29:1165–88. 10.1214/aos/1013699998.
- 32. DeWitt WS, Lindau P, Snyder TM et al. A public database of memory and naive B-cell receptor sequences. PLoS One. 2016;11:e0160853. 10.1371/journal.pone.0160853.
- 33. Goncharov M, Bagaev D, Shcherbinin D et al. VDJdb in the pandemic era: a compendium of T cell receptors specific for SARS-CoV-2. Nat Methods. 2022;19:1017–9. 10.1038/s41592-022-01578-0.
- 34. Wang QJ, Yu Z, Griffith K et al. Identification of T-cell receptors targeting KRAS-mutated human tumors. Cancer Immunol Res. 2016;4:204–14. 10.1158/2326-6066.CIR-15-0188.
- 35. Tran E, Robbins PF, Lu Y-C et al. T-cell transfer therapy targeting mutant KRAS in cancer. N Engl J Med. 2016;375:2255–62. 10.1056/NEJMoa1609279.
- 36. Veatch JR, Jesernig BL, Kargl J et al. Endogenous CD4+ T cells recognize neoantigens in lung cancer patients, including recurrent oncogenic KRAS and ERBB2 (Her2) driver mutations. Cancer Immunol Res. 2019;7:910–22. 10.1158/2326-6066.CIR-18-0402.
- 37. Bear AS, Blanchard T, Cesare J et al. Biochemical and functional characterization of mutant KRAS epitopes validates this oncoprotein for immunological targeting. Nat Commun. 2021;12:4365. 10.1038/s41467-021-24562-2.
- 38. Danilova L, Anagnostou V, Caushi JX et al. The mutation-associated neoantigen functional expansion of specific T cells (MANAFEST) assay: a sensitive platform for monitoring antitumor immunity. Cancer Immunol Res. 2018;6:888–99. 10.1158/2326-6066.CIR-18-0129.
- 39. Robins HS, Campregher PV, Srivastava SK et al. Comprehensive assessment of T-cell receptor β-chain diversity in αβ T cells. Blood. 2009;114:4099–107. 10.1182/blood-2009-04-217604.
- 40. Carlson CS, Emerson RO, Sherwood AM et al. Using synthetic templates to design an unbiased multiplex PCR assay. Nat Commun. 2013;4:2680. 10.1038/ncomms3680.
- 41. Cappelli LC, Konig MF, Gelber AC et al. Smoking is not linked to the development of anti-peptidylarginine deiminase 4 autoantibodies in rheumatoid arthritis. Arthritis Res Ther. 2018;20:59. 10.1186/s13075-018-1533-z.
- 42. Robins H, Desmarais C, Matthis J et al. Ultra-sensitive detection of rare T cell clones. J Immunol Methods. 2012;375:14–9. 10.1016/j.jim.2011.09.001.
- 43. Zhang H, Liu L, Zhang J et al. Investigation of antigen-specific T-cell receptor clusters in human cancers. Clin Cancer Res. 2020;26:1359–71. 10.1158/1078-0432.CCR-19-3249.
- 44. Valkiers S, Van Houcke M, Laukens K et al. ClusTCR: a python interface for rapid clustering of large sets of CDR3 sequences with unknown antigen specificity. Bioinformatics. 2021;37:4865–7. 10.1093/bioinformatics/btab446.
- 45. Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math. 1987;20:53–65. 10.1016/0377-0427(87)90125-7.
- 46. Teraguchi S, Saputri DS, Llamas-Covarrubias MA et al. Methods for sequence and structural analysis of B and T cell receptor repertoires. Comput Struct Biotechnol J. 2020;18:2000–11. 10.1016/j.csbj.2020.07.008.
- 47. Kawashima S. AAindex: amino acid index database. Nucleic Acids Res. 2000;28:374. 10.1093/nar/28.1.374.
- 48. Stein-O’Brien G, Arora R, Culhane A et al. Enter the matrix: factorization uncovers knowledge from omics. Trends Genet. 2018;34:790–805. 10.1016/j.tig.2018.07.003.
- 49. Stein-O’Brien GL, Carey JL, Lee WS et al. PatternMarkers & GWCoGAPS for novel data-driven biomarkers via whole transcriptome NMF. Bioinformatics. 2017;33:1892–4. 10.1093/bioinformatics/btx058.
- 50. Sherman TD, Gao T, Fertig EJ. CoGAPS 3: bayesian non-negative matrix factorization for single-cell analysis with asynchronous updates and sparse data structures. BMC Bioinformatics. 2020;21:453. 10.1186/s12859-020-03796-9.
- 51. Fertig EJ, Ding J, Favorov AV et al. CoGAPS: an R/C++ package to identify patterns and biological process activity in transcriptomic data. Bioinformatics. 2010;26:2792–3. 10.1093/bioinformatics/btq503.
- 52. Dayhoff MO, Schwartz RM, Orcutt BC. A model of evolutionary change in proteins. In: Atlas of Protein Sequence and Structure. Vol. 5. Washington, D.C., USA: National Biomedical Research Foundation, 1978, 345–52.
- 53. Gonnet GH, Cohen MA, Benner SA. Exhaustive matching of the entire protein sequence database. Science. 1992;256:1443–5. 10.1126/science.1604319.
- 54. UNAIDS. HIV estimates from 1990 to present. UNAIDS DATA 2023. 2023. https://www.unaids.org/en/resources/documents/2023/HIV_estimates_with_uncertainty_bounds_1990-present Accessed 20 June 2024.
- 55. Cafri G, Gartner JJ, Zaks T et al. mRNA vaccine-induced neoantigen-specific T cell immunity in patients with gastrointestinal cancer. J Clin Invest. 2020;130:5976–88. 10.1172/JCI134915.
- 56. Ben-David S, Von Luxburg U, Pál D. A sober look at clustering stability. In: Lugosi G, Simon HU (eds.), Learning Theory, Lecture Notes in Computer Science. Vol. 4005. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006, 5–19.
- 57. von Luxburg U. Clustering stability: an overview. Found Trends Mach Learn. 2009;2:235–74.
- 58. DynaMedex . DynaMedex. EBSCO Information Services, Ipswich, MA USA. 2024. [Google Scholar]
- 59. Van Delft MAM, Huizinga TWJ. An overview of autoantibodies in rheumatoid arthritis. J Autoimmun. 2020;110:102392. 10.1016/j.jaut.2019.102392. [DOI] [PubMed] [Google Scholar]
- 60. Curran AM, Naik P, Giles JT et al. PAD enzymes in rheumatoid arthritis: pathogenic effectors and autoimmune targets. Nat Rev Rheumatol. 2020;16:301–15. 10.1038/s41584-020-0409-1. [DOI] [PubMed] [Google Scholar]
- 61. Hou X, Wang M, Lu C et al. Analysis of the repertoire features of TCR beta chain CDR3 in human by high-throughput sequencing. Cell Physiol Biochem. 2016;39:651–67. 10.1159/000445656. [DOI] [PubMed] [Google Scholar]
- 62. Hoffstedt M, Wätzig H, Baumann K. Comparison of different substitution matrices for distance based T-cell receptor epitope predictions using tcrdist3. ImmunoInformatics. 2025;19:100051. 10.1016/j.immuno.2025.100051. [DOI] [Google Scholar]
- 63. Pavlović M, Scheffer L, Motwani K et al. The immuneML ecosystem for machine learning analysis of adaptive immune receptor repertoires. Nat Mach Intell. 2021;3:936–44. 10.1038/s42256-021-00413-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 64. Zaslavsky ME, Craig E, Michuda JK et al. Disease diagnostics using machine learning of B cell and T cell receptor sequences. Science. 2025;387:eadp2407. 10.1126/science.adp2407. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 65. Abramson J, Adler J, Dunger J et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature. 2024;630:493–500. 10.1038/s41586-024-07487-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 66. Bertoline LMF, Lima AN, Krieger JE et al. Before and after AlphaFold2: an overview of protein structure prediction. Front Bioinform. 2023;3:1120370. 10.3389/fbinf.2023.1120370. [DOI] [PMC free article] [PubMed] [Google Scholar]
Data Availability Statement
The Homolig software tool is available for download on Zenodo (DOI: 10.5281/zenodo.14025889). Public-domain repertoires used in the analyses, along with selected code to replicate the analyses in this manuscript, are available separately on Zenodo (DOI: 10.5281/zenodo.13984172). Pancreatic cancer patient repertoire data are available through dbGaP under study accession number phs003425.v1.p1. Rheumatoid arthritis patient data are available on the Adaptive Biotechnologies ImmuneAccess database (DOI: 10.21417/ED2025S).








