Abstract
The T cell receptor (TCR) repertoire is an extraordinarily diverse collection of TCRs essential for maintaining the body’s homeostasis and response to threats. In this study, we compiled an extensive dataset of more than 4200 bulk TCR repertoire samples, encompassing 221,176,713 sequences, alongside 6,159,652 single-cell TCR sequences from over 400 samples. From this dataset, we then selected a representative subset of 5 million bulk sequences and 4.2 million single-cell sequences to train two specialized Transformer-based language models for bulk (CVC) and single-cell (scCVC) TCR repertoires, respectively. We show that these models successfully capture TCR core qualities, such as sharing, gene composition, and single-cell properties. These qualities are emergent in the encoded TCR latent space and enable classification into TCR-based qualities such as public sequences. These models demonstrate the potential of Transformer-based language models in TCR downstream applications.
Language models translate T cell receptors into meaningful representations, facilitating diverse applications.
INTRODUCTION
Many of the tasks of the immune system involve T cells (1, 2). T cells kill infected host cells, detect foreign proteins, activate other immune cells, and regulate immunity. The required specific interaction with a wide variety of antigens creates a need for a large number of T cells, each with its own means of pattern recognition (3, 4). This pattern recognition is mediated through the T cell receptor (TCR). The TCR is made of amino acids, and the collection of TCRs makes up the T cell repertoire (1, 5). Most TCRs consist of α and β chains. Each TCR is antigen relevant, and the interaction is dominated by the third complementarity-determining region (CDR3) of the α and β chains. The CDR3 sequence itself, averaging 16 amino acids in length, is generated by the extensively studied V(D)J recombination, involving a semi-random rearrangement of multiple V, D (in β), and J gene segments (1, 5–7). The studied sequences are obtained from either RNA or DNA using either bulk sequencing or, more recently, single-cell sequencing technologies. While bulk RNA sequencing currently allows for the processing of a larger population of cells, single-cell technologies offer higher resolution and the tandem exploration of α and β, enabling the exploration of cell-specific characteristics (8).
CDR3 sequences were long thought to be unique to each individual, referred to as “private” sequences, but over the past two decades, and especially since high-throughput sequencing has become available, it has been shown that many CDR3 sequences are shared between individuals. These sequences are called “public” sequences (9–11). The vast potential diversity of CDR3 sequences, estimated at approximately 10^18 unique combinations (12), might intuitively suggest that the occurrence of public sequences would be statistically rare. However, closer examination reveals that such sequences are actually a predictable consequence of the mechanisms driving TCR diversity (13). The identification of public and private sequences within the CDR3 region may offer insights into the molecular underpinnings of TCR usage patterns and their distribution across individuals.
Progress in computing power and computational tools greatly improved the ability to analyze and research sequential data. This is evident in natural language processing (NLP) in general and, within the scope of this work, in the use of language models (specifically Transformers), to study sequential data such as DNA and proteins, leading to promising results (3, 14–18). The language used to study these types of sequences is that of either nucleotides, with their 4-letter representation, or amino acids and their 20-letter representation. In this context, two types of Transformers—encoders and decoders—are of interest. Encoder-based models aim at producing meaningful embeddings out of their inputs, while decoder-based models are used mainly for generation. BERT is an encoder-based Transformer that has been shown to be effective with sequential data, such as DNA (17) and proteins (19). Regardless of the task it is used for, BERT trains unsupervised to learn the grammatical structure of large, unlabeled datasets.
Since CDR3 sequences are assembled from amino acids, with their function highly dependent on the specific order of these acids (1), we posit that a language model—a sequential model—might yield meaningful embeddings to analyze CDR3 sequence features (20). This study reveals that the prevalence of a sequence as public or private can be discerned through embeddings encoded by the Transformer, reflecting intrinsic sequence information. In addition, these embeddings facilitate the investigation of “sister” TCRβ sequences that couple with identical TCRα in single-cell data.
Language models have previously predicted TCR specificity (21), and various methods have classified private and public sequences (13, 22), with tools grouping sequences by edit distance (23, 24). Our encoder-based Transformer, CountVonCount (CVC), is trained on 5 million unique CDR3 TCRβ amino acid sequences, half public and half private. CVC stands out for its robust embeddings that enable unsupervised clustering, phenotypic feature delineation, and diverse classification tasks. Notably, when benchmarked against TCR-BERT and ESM-2, CVC’s embeddings show superior performance in clustering and classification, highlighting its potential for advanced TCR sequence research.
While the CVC model provides unique insights into public versus private status, across thousands of samples, it lacks awareness of TCR α-β pairing at the single-cell level. To address this limitation, we leveraged a large dataset of over 2 million single T cells, which together provided a total of 4.2 million TCRα and TCRβ sequences. Using these rich single-cell data, we developed scCVC—a model that enables inspection of TCR features within individual T cells. The scCVC Transformer models the co-occurrence patterns of TCRα and TCRβ chains, providing a more nuanced understanding of TCR presentation. Moreover, the model provides insight into mucosal-associated invariant T (MAIT) cell status and the role of CDR3 sequences in encoding cell type information.
RESULTS
CVC and scCVC are based on the BERT architecture. CVC was trained by processing CDR3 TCRβ amino acid sequences as input, while scCVC was trained on the combined CDR3 TCRα and TCRβ sequences, according to their linked single-cell association. scCVC’s input is in the form of single cells, represented by their TCR (α and β) joined by a separator token, enabling a more comprehensive analysis of TCR behavior and features. Each amino acid plays the role of a word in the original BERT architecture, while each CDR3 sequence plays the role of a sentence. The model processes each input and outputs its embedding: a 768-dimensional (768D) numerical vector. The data collection used for training CVC includes 1590 TCRβ samples that translate to 91,758,698 unique CDR3 sequences (see Materials and Methods). Of these, 5 million CDR3 TCRβ sequences were randomly selected for CVC’s unsupervised training, with a subdivision of 2.5 million private and 2.5 million public sequences, to avoid bias. As for scCVC, a collection of single-cell data was used, including 2,120,565 single cells that total 4,200,335 TCR sequences.
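As an illustration of the input format described above, the following minimal sketch treats each amino acid as one token and joins a cell's TCRα and TCRβ CDR3 sequences with a separator. The helper names, the α-before-β ordering, and the TCRβ example sequence are assumptions made for illustration only; the actual tokenizer is described in Materials and Methods.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
SEPARATOR = "|"  # separator token between chains, as in the Fig. 1 caption


def tokenize_cdr3(seq):
    """Split a CDR3 amino acid sequence into one token per residue."""
    unknown = set(seq) - set(AMINO_ACIDS)
    if unknown:
        raise ValueError(f"unexpected residues: {sorted(unknown)}")
    return list(seq)


def single_cell_tokens(tra, trb):
    """Hypothetical helper: join a cell's TRA and TRB tokens (order assumed)."""
    return tokenize_cdr3(tra) + [SEPARATOR] + tokenize_cdr3(trb)


# The TRA sequence appears in Fig. 5; the TRB sequence is a made-up example.
tokens = single_cell_tokens("CAVMDSNYQLIW", "CASSLGQAYEQYF")
print(tokens)
```

In this sketch a bulk (CVC-style) input would simply be the output of tokenize_cdr3 for a single TCRβ sequence.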
An unsupervised language model is trained by masking a certain percentage of the input, and it learns by predicting these masked items. In our case, 15% of each sequence’s amino acids were masked, and the model predicted the missing information, with feedback. Once the Transformer is trained, we produce TCR embeddings for further analysis. The pipeline visualizations in Fig. 1 (A and B) illustrate how these embeddings are used. The trained model receives amino acid CDR3 sequences to create their embeddings. We visualized the embedding space in 2D using Uniform Manifold Approximation and Projection (UMAP) (25). Each point in CVC represents a sequence, while each point in scCVC represents a cell. In the different visualizations, point color is used for the specific feature analyzed.
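The masking step can be sketched as follows. This is a simplified illustration, not the training code: it masks a fixed 15% of positions and always substitutes a mask token, whereas the original BERT recipe sometimes substitutes a random token or leaves the position unchanged; which variant was used here is not stated in this section. The token name and function name are assumptions.

```python
import random

MASK = "[MASK]"  # placeholder mask token name (assumption)


def mask_sequence(seq, mask_rate=0.15, rng=None):
    """Mask roughly mask_rate of the residues; return tokens and targets."""
    rng = rng or random.Random()
    n_mask = max(1, round(len(seq) * mask_rate))
    positions = rng.sample(range(len(seq)), n_mask)
    tokens = list(seq)
    targets = {}  # position -> residue the model has to predict
    for pos in positions:
        targets[pos] = tokens[pos]
        tokens[pos] = MASK
    return tokens, targets


# Made-up CDR3 sequence; 13 residues at 15% gives 2 masked positions.
masked, targets = mask_sequence("CASSLGQAYEQYF", rng=random.Random(0))
print(masked, targets)
```

During training, the loss would be computed only on the positions stored in targets.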
Fig. 1. Using CVC and scCVC to embed TCR sequences.
A schematic representation of the processes used for constructing and applying the Transformers for bulk and single-cell TCR repertoires. (A) The bulk (CVC) model uses a representative subset of 5 million TCRβ sequences out of over 92 million available for self-supervised learning in the BERT framework, resulting in a 768D embedding for each sequence. (B) The single-cell (scCVC) model uses data from over 2 million single T cells, encompassing a curated subset of 4.2 million TCR sequences, with the higher sequence count reflecting instances of cells expressing multiple variants of TCRα and TCRβ chains. Sequences from the same cell are concatenated using a separator token “|,” facilitating the Transformer to learn a joint representation, and subsequently producing a 768D embedding for each joint sequence (refer to Materials and Methods for details).
CVC identifies public sequences in an unsupervised manner
To evaluate whether CVC encodes meaningful, latent information about a sequence’s biology in its embeddings, we fed the Transformer with 1,000,000 randomly sampled sequences to obtain their embeddings. Among the 1,000,000 sequences, 15% were public and 85% were private, keeping the original distribution of these labels across the entire dataset. We then visualized the 150,000 public and 850,000 private sequences using UMAP. The results are shown in Fig. 2A, where each sequence (each point) is colored according to its public/private label. A sequence is labeled as public when it appears in more than one sample in the original database. Otherwise, it is labeled private. From the visualized embedding space, it is apparent that sequences are clustered, without supervision, into roughly a dozen groups, with public sequences clustered at the tips of each group. We discuss these groups in a later section.
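The labeling rule stated above (public when a sequence appears in more than one sample) can be sketched as below. The record layout, function name, and example sequences are assumptions for illustration; the min_samples parameter is included only to show how alternative publicity thresholds could be tried.

```python
from collections import defaultdict


def label_publicity(records, min_samples=2):
    """Label each CDR3 as public if seen in at least min_samples samples."""
    samples_per_seq = defaultdict(set)
    for sample_id, cdr3 in records:
        samples_per_seq[cdr3].add(sample_id)
    return {
        seq: "public" if len(samples) >= min_samples else "private"
        for seq, samples in samples_per_seq.items()
    }


# Toy (sample_id, CDR3) records with made-up sequences.
records = [
    ("s1", "CASSLGQAYEQYF"),
    ("s2", "CASSLGQAYEQYF"),
    ("s1", "CASSPDRGNTEAFF"),
    ("s1", "CASSPDRGNTEAFF"),  # repeated within one sample: still private
]
labels = label_publicity(records)
print(labels)
```

Counting distinct samples rather than raw occurrences keeps a sequence that is merely abundant within a single sample from being labeled public.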
Fig. 2. TCR publicity in CVC space and its association with sequence length and convergent recombination.
For all panels in this figure, CVC was used to create the embeddings of CDR3 TCRβ sequences, followed by dimensionality reduction for visualization (using UMAP). (A) UMAP of the embeddings of 1,000,000 TCRβ sequences colored according to their public/private label. Yellow points represent private sequences, while blue points represent public sequences. (B) Public appearance distribution of the sequences in the dataset, colored according to sequence length percentiles, displayed in the upper right corner. The percentiles are 10, 25, 50, 75, and 90%, corresponding to lengths 13, 14, 15, 16, and 18. (C) Sequence length distribution of 1,050,000 TCRβ sequences colored by sequence length percentiles: 10, 25, 50, 75, and 90%, which corresponded to amino acid lengths of 13, 14, 15, 16, and 18, respectively. For (D and E), we created embeddings for the sequences used to generate (C). Both UMAP representations display the same latent space for the embeddings, colored initially (D) according to the sequence length percentiles and then (E) according to the private/public label of each sequence, showing the association between sequence length and sequences’ sharing status. For (F and G), we created embeddings for 536,932 TCRβ sequences. Both UMAP representations display the same latent space for the embeddings, colored initially (F) according to public/private status and then (G) according to their convergent recombination ranges. We show five convergent recombination ranges. From each range, we included a set of sequences according to their distribution in the dataset: 0 to 100 with 500,000 sequences, 100 to 200 with 30,799 sequences, 200 to 300 with 4574 sequences, 300 to 400 with 1132 sequences, and 400 and above with 427 sequences. The Transformer captures publicity and convergent recombination simultaneously in latent space.
We examined the distinct behavior of public sequences by evaluating various thresholds used to tag a sequence as public or private. Our analysis with different criteria for classifying public sequences, based on their frequency across samples, provided consistent results, confirming the robustness of public sequence identification. Independently of the chosen threshold, we sought to determine whether the characteristic of publicity—the extent to which a sequence is common in the population of samples—is inherently captured by the Transformer’s embeddings. We quantified the appearances of each sequence and analyzed the correlation between publicity and sequence length. Figure 2B (top right inset) illustrates that sequence length distribution aligns with being bell-shaped. We segmented these into percentiles (10, 25, 50, 75, and 90%), correlating to sequence lengths of 13, 14, 15, 16, and 18. Figure 2B displays the publicity distribution, colored by sequence length percentiles, with the x axis indicating the count of public appearances and the y axis showing the sequence count on a logarithmic scale. Consistent with previous findings (26), our findings indicate that sequences frequently found in public repertoires are generally shorter and have distinct characteristics that are less commonly observed in private sequences.
Using information from the distribution, we divided publicity values into 24 bins of different sizes. To demonstrate how the different sequences are encoded by the Transformer, we sampled sequences from each bin, maintaining the ratio of the complete dataset, leading to 1,037,748 sequences. CVC was used to create embeddings from the sequences, exclusively from the sequences, without considering samples or other features. In fig. S1, a UMAP of the embeddings is displayed using a color code showing the size-bin affiliation. The figure shows that the spectrum of publicity is associated with directionality in the embedded space. The more public a sequence is, the further it is from the private ones. Furthermore, in our analysis, we observe approximately a dozen prominent clusters, akin to those identified in previous observations. The clusters are not identical each time, which can be attributed to variations in the sampling process. Each iteration of sampling can introduce slight differences, leading to observable but not exact replications of cluster formations. As an interim summary, we showed that the embeddings created by CVC capture, in an unsupervised manner, biological features that are integral to the CDR3 sequence itself.
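The bin-proportional sampling described in this section can be sketched as follows. The bin edges, the sampling fraction, and the toy data are hypothetical; the 24 bins actually used are not listed here, so three bins stand in for them.

```python
import random
from bisect import bisect_right
from collections import defaultdict

BIN_EDGES = [2, 10]  # hypothetical edges: <2, 2 to 9, >=10 public appearances


def proportional_sample(items, bin_of, fraction, rng):
    """Sample the same fraction from every bin, preserving bin ratios."""
    by_bin = defaultdict(list)
    for item in items:
        by_bin[bin_of(item)].append(item)
    sample = []
    for key in sorted(by_bin):
        members = by_bin[key]
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


def publicity_bin(item):
    """Map an (id, appearance count) pair to its bin index."""
    return bisect_right(BIN_EDGES, item[1])


# Toy data: (sequence id, number of samples the sequence appears in).
items = [(i, 1) for i in range(100)]
items += [(100 + i, 5) for i in range(40)]
items += [(140 + i, 50) for i in range(10)]
sample = proportional_sample(items, publicity_bin, 0.1, random.Random(0))
print(len(sample))
```

Taking at least one item per bin is a design choice of this sketch so that very rare, highly public bins remain visible in the projection.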
Sequence length, convergent recombination, and publicity
As previously shown, different CDR3 sequence lengths display varying degrees of publicity. Our analysis aimed to ascertain whether this variation in publicity is captured within the transformed embeddings’ latent space. Figure 2C demonstrates the bell-shaped distribution of sequence lengths, which corresponds to the full dataset distribution. From this, we sampled 1,050,000 sequences, maintaining the proportion across different length percentiles for the embedding process using CVC. Panels D and E of Fig. 2 respectively depict sequence length percentiles and public/private status in the same UMAP space. These two figures show that the embeddings form roughly a dozen clusters, each containing sequences from all percentiles, suggesting a gradient from larger to smaller percentiles as noted in Fig. 2B. Further, when we compare this gradient with the public/private status in Fig. 2E, we find that public sequences predominantly reside within the lower to mid percentiles, whereas private sequences are more common in the higher percentiles. This pattern aligns with the correlation between sequence length and publicity observed in fig. S1, indicating the CVC-created embeddings’ sensitivity to sequence length variations.
Beyond sequence length, public sequences frequently exhibit convergent recombination (CR), where diverse nucleotide sequences encode identical amino acid sequences (11, 27), suggesting functional convergence among public sequences across individuals. We categorized sequences into five CR frequency groups and visualized a subsample of 536,932 sequences, revealing a clear correlation: Sequences with higher CR levels are predominantly public (Fig. 2, F and G). This finding supports our embeddings’ ability to capture not only sequences’ identity but also their immunological relevance. To further explore the relationship between CR and the TCR’s structural components, we analyzed CR patterns across different J genes. Figure S5 (A and B) shows the percentage of sequences with CR levels above certain thresholds for each J gene, highlighting that some J genes may be associated with higher rates of CR. This subset correlates with those J genes documented as more prevalent in the general population (27). The implication of this correlation may suggest a selective advantage for these J genes in the immune repertoire, contributing to their higher representation and potential public nature in T cell responses.
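One way to quantify convergent recombination is to count how many distinct nucleotide sequences encode each amino acid sequence. The sketch below assumes that definition of the CR level, which is our reading of the description above rather than a statement of the exact procedure; the short fragments are toy examples, not full CDR3 sequences.

```python
from collections import defaultdict


def convergent_recombination(pairs):
    """Count distinct nucleotide sequences per amino acid sequence."""
    nucleotide_variants = defaultdict(set)
    for nucleotide_seq, amino_acid_seq in pairs:
        nucleotide_variants[amino_acid_seq].add(nucleotide_seq)
    return {aa: len(nts) for aa, nts in nucleotide_variants.items()}


# Toy (nucleotide, amino acid) pairs; TGT/TGC encode C, AGC/AGT encode S.
pairs = [
    ("TGTGCCAGC", "CAS"),
    ("TGCGCCAGC", "CAS"),
    ("TGTGCCAGT", "CAS"),
    ("TGTGCCAGC", "CAS"),  # exact duplicate, counted once
    ("TGTGCC", "CA"),
]
cr_levels = convergent_recombination(pairs)
print(cr_levels)
```

The resulting counts could then be grouped into ranges such as those shown in Fig. 2G.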
The CVC-produced embedding space stratifies by J gene affiliation
2D dimensionality reduction of the embedded representation shows an intriguing partition into 12 to 13 large clusters. As a reminder, the embeddings were created unsupervised; that is, CDR3s were not tagged with any labels during self-supervision and were therefore not associated with their origin J gene.
As Fig. 3 (A and B) shows, the J gene region of the TCR gene lies within the CDR3 region and is of 13 types: J1:1 to J1:6 and J2:1 to J2:7 (28). To show a substantial amount of J gene tags on a UMAP, we used the ImmuneCODE database (29), which includes millions of TCR sequences from more than 1400 individuals, with high-quality information about the V and J gene sources of each CDR3 sequence. We randomly selected 7 million sequences. The distribution of the J genes is shown in Fig. 3C, with TCRBJ02-04 and TCRBJ02-06 showing the lowest frequency in the dataset, while the rest of the J genes differ slightly in their frequency. To level the representation, we downsampled to 9% of the sequences from each of the J genes except for TCRBJ02-04 and TCRBJ02-06, for which all available sequences were used.
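The leveling step described above (keep every sequence of the two rarest J genes, sample 9% of the rest) can be sketched as below. The function name, record layout, and toy data are assumptions for illustration.

```python
import random
from collections import defaultdict

RARE_J_GENES = {"TCRBJ02-04", "TCRBJ02-06"}  # kept in full, per the text


def downsample_by_j_gene(records, fraction=0.09, keep_all=RARE_J_GENES, seed=0):
    """Keep all sequences of rare J genes and a fraction of every other gene."""
    rng = random.Random(seed)
    by_gene = defaultdict(list)
    for cdr3, j_gene in records:
        by_gene[j_gene].append((cdr3, j_gene))
    selected = []
    for gene in sorted(by_gene):
        members = by_gene[gene]
        if gene in keep_all:
            selected.extend(members)
        else:
            selected.extend(rng.sample(members, round(len(members) * fraction)))
    return selected


# Toy records: placeholder sequence ids stand in for CDR3 strings.
records = [(f"seq{i}", "TCRBJ01-01") for i in range(100)]
records += [(f"rare{i}", "TCRBJ02-04") for i in range(5)]
selected = downsample_by_j_gene(records)
print(len(selected))
```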
Fig. 3. J gene clustering in embedding space.
(A) The structure of the CDR3 region of the RNA transcript of a TCRβ chain. (B) Structure of the DNA used to produce TCRβ chains before recombination, consisting of the variable (V), joining (J), constant (C), and diversity (D) regions. Segment from each region, together with deletion/addition/replacement of nucleotides, generates the TCR through the process of VDJ recombination. The red marked areas are the J genes, J1:1-6 and J2:1-7. (C) Bar plot representation of the number of CDR3 sequences, in our dataset, according to their use of J genes. All the sequences of TCRBJ02-04 and TCRBJ02-06 were taken and 9% of sequences from each of the remaining J gene types were randomly selected to create the represented embedding space and to provide meaningful representations for the visualization of all J genes. We colored the embedding space by the corresponding public/private label of each sequence (D) and by the different J gene types (E). We can see a near-perfect segmentation of the latent space according to J gene association.
Given that J gene–associated clustering has been observed in the embedding space, we aimed to evaluate the reproducibility of this phenomenon using an additional dataset. We used the aforementioned dataset, which encompasses data not previously analyzed in our work. The UMAP visualizations, annotated by public/private labels, support our initial findings as demonstrated in Fig. 3D. This consistency validates the patterns we observed with our baseline dataset. To further explore whether the spatial stratification in the embedding space is related to specific J genes, we applied the CVC model to the sequences and reduced their dimensionality using UMAP, coloring each point to correspond with its J gene. Results are shown in Fig. 3E. The apparent color coding of the different clusters reveals that the embedding space stratifies CDR3 sequences according to their J genes. This influence likely stems from the fact that the J segment constitutes a substantial portion of the sequence, which could explain its notable presence within the clusters.
When contrasting with other related language models like TCR-BERT and ESM-2, distinct clustering patterns emerge. As demonstrated in fig. S7, the J gene–driven stratification is pronounced in CVC’s visualization but is less discernible with TCR-BERT and ESM-2 embeddings. This distinction suggests that task-specific Transformer models like CVC are adept at capturing biologically pertinent features, potentially overshadowed in more generalized models. A comprehensive comparison is elaborated in the Supplementary Materials (fig. S7, A to F), reinforcing the specialized capabilities of CVC in TCR sequence analysis.
The fraction of the J segment within each cluster, indicating the extent to which the J segment is represented in the sequences, may teach us more about this behavior. Our fraction plot (fig. S6A) reveals consistent J segment proportions across different J gene types, suggesting uniformity in sequence length (fig. S6B). Supporting this are sequence logos (fig. S6, C to O) that visualize the prevalence of specific motifs at the CDR3 J segment junctions.
The clear importance of J genes in embeddings space led us to query the role of V genes. To do this, we again used the ImmuneCODE dataset, this time focusing on V gene available information. A total of 65 V genes, from TCRBV1 to TCRBV30, were represented in the data. Roughly 2% of sequences from each type were used; their embeddings were calculated and charted in fig. S2A. We created fig. S2B to see whether the V genes are associated with the public status of sequences. The red line in the figure is at the 50% mark, meaning that any bars over that threshold are for V genes with a greater than 50% chance of being public.
On the basis of the V genes of those bars, we generated fig. S2 (C and D), which displays the embedding space with the corresponding V gene and public/private labels. In fig. S2C, every cluster contains all V gene types, and within each cluster the sequences group by V gene type. Regarding the publicity of these genes, we see the same behavior (fig. S2D), but with a larger presence of public sequences. Together, these results show that the embeddings also reflect similarities between sequences that share a V gene, as has been demonstrated in related research (30).
Supervised classification using CVC
With clinical applications aiming to control specific TCR sequences in patients, the use of embeddings to expose sequence-based information that associates a TCR with its population-level quantities may greatly benefit clinical TCR uses. To determine whether these embeddings could be used to tag sequences as public or private, we randomly selected 200,000 sequences, 100,000 from each type (public/private), and produced embedding vectors (768D) through CVC. We then used these data (tabular, 200,000 × 768, label 0/1) for supervised binary classification. We tried multiple classification algorithms and eventually focused on three: LDA, xgBoost, and a deep neural network (DNN) (see details in Materials and Methods, in table S1—DOME Report, and in Fig. 4A for the DNN architecture), which showed areas under the curve (AUCs) (over test set) of 0.89, 0.89, and 0.9, respectively. The models provided an accuracy of 81.5, 80.635, and 81.7%.
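For readers less familiar with the AUC values reported above, the sketch below computes the area under the ROC curve directly from its rank interpretation: the probability that a randomly chosen positive (here, public) example receives a higher score than a randomly chosen negative (private) one. It is a plain-Python illustration on made-up scores, not the evaluation code used in this study.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of positive/negative pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Made-up classifier scores: 1 = public, 0 = private.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

The pairwise loop is quadratic in the number of examples; for 200,000 sequences a sort-based implementation from a standard library would be the practical choice.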
Fig. 4. CVC embeddings for supervised classification tasks.
(A) We used DNNs, xgBoost, and LDA for the task of binary classification of sequences for their public/private status, and DNN alone for the task of multi-class classification of the J gene of each sequence. In all cases, input is the embeddings of each sequence, produced by CVC. (B) ROC of the LDA, xgBoost, and DNN classifiers trained over the task of binary classification of public and private sequences. Each algorithm was applied twice, using the embeddings created by CVC and using one-hot encoding. As shown in the figure, classifiers over embeddings achieved higher scores compared to the one-hot representation: AUC of 0.89, 0.89, and 0.9 compared to 0.76, 0.81, and 0.8, respectively. (C) Multi-class classification results of J gene type prediction using DNN on both the embeddings and one-hot vector representation of the sequences. The network was applied three times, and average result accuracies were 98.57% on the embeddings and 90.44% using one-hot encoding. All results are for the test set (previously unseen data); see code for details.
To learn about the added information content provided by the Transformer model, we used machine learning over a one-hot representation of the CDR3 sequences. In this approach, we represent each amino acid using a 20D binary vector. Each vector contains 19 zeros and a single one placed at the index of the specific amino acid. To maintain an equal length for all sequences in the dataset, we set all one-hot transformations to be the length of the longest sequence (LS), while shorter sequences were padded with zeros. This led to a 200,000 × LS × 20 table as the algorithm’s input. Using these data, LDA, xgBoost, and the DNN achieved AUCs of 0.76, 0.81, and 0.8, respectively. The accuracies of the models were 69.98, 73.75, and 72.7%. xgBoost did better here, but only by a small margin. The receiver operating characteristic (ROC) curves are shown in Fig. 4B. To place our model within the landscape of existing Transformer-related architectures, we performed the same binary classification task using embeddings from the models TCR-BERT and ESM-2. As fig. S8A indicates, CVC showed superior results compared with these two Transformers. These differences demonstrate the importance of the latent space for classifying the sequences as public or private, with a substantial increase in AUC and accuracy when using CVC.
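The one-hot baseline described above can be sketched as follows: each residue becomes a 20D binary vector, and shorter sequences are padded with all-zero rows up to the length of the longest sequence. The alphabetical residue ordering and the example sequences are assumptions for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # index order is an assumption
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(seq, length):
    """Encode seq as length rows of 20 values, zero-padded at the end."""
    rows = []
    for aa in seq:
        row = [0] * len(AMINO_ACIDS)
        row[INDEX[aa]] = 1  # single one at the residue's index
        rows.append(row)
    rows.extend([0] * len(AMINO_ACIDS) for _ in range(length - len(seq)))
    return rows


# Made-up sequences; the longest one sets the padded length (LS).
sequences = ["CASSF", "CASSLGF"]
longest = max(len(s) for s in sequences)
encoded = [one_hot(s, longest) for s in sequences]
print(len(encoded), len(encoded[0]), len(encoded[0][0]))
```

Flattening each LS × 20 matrix into a single row would give the tabular input expected by classifiers such as LDA or xgBoost.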
To see whether the embeddings created by CVC could be used to classify a sequence’s J gene without previous knowledge of the composition of the TCR sequences, and only the CDR3 representation in embedding space, we used the same set of algorithms used before: xgBoost, LDA, and a modified DNN (Fig. 4A), both on the embeddings and on the one-hot representation of the sequences. Figure 4C displays the accuracies for the DNN, while the other methods appear in fig. S3. All methods did well in predicting the J gene of a sequence when it is represented by the embeddings, but also quite well when the sequences are represented by one-hot encoding. In a similar manner to the comparison previously done with other Transformer models, CVC achieved the highest accuracy, as shown in fig. S8B, surpassing the results of TCR-BERT and ESM-2.
Co-occurrence of TCRα and TCRβ and publicity in single-cell data
Single-cell immune profiling provides us with the knowledge of which TCRα and TCRβ chains are expressed in the same cell, allowing exploration of their co-occurrence and possible functional implications. To investigate this, we analyzed two distinct examples: (i) MAIT cells and (ii) TRB sister sequences. MAIT cells are a unique type of T cell identifiable by their α chain’s specific V and J genes: TRAV1-2 joined with TRAJ33/20/12. Using single-cell data, we tagged MAIT cells with this V/J information (available at the data source). Figure 5 (A and B) shows that MAIT cells cluster neither in the single-cell embedding space (scCVC) nor in TCRβ space (CVC). This behavior indicates that unique transcriptional and functional characteristics of MAIT cells are driven primarily by their TCRα. To investigate the publicity of MAIT cells, we used the TCRβ embeddings at our disposal to classify MAIT cells as public or private according to their TCRβ sequences. We used a DNN classifier like the one described earlier, and as can be seen in Fig. 5C, roughly 60% of the MAIT cells were classified as public. Given the demonstrated success in classifying public sequences with CVC embeddings, and the fact that many MAIT cells were public, we explored whether MAIT cells could be classified as such, using only their TCRβ CDR3 (using CVC embeddings) or only their TCRα CDR3 (scCVC embeddings), without any information about V or J genes. As Fig. 5 (D and E) shows, we were able to achieve an AUC of 0.71 for β-based classification and an AUC of 0.83 for α-based classification. These results demonstrate that information about the cell type is strongly encoded in the CDR3 sequence, and by translating this sequence into the Transformer-based embeddings, without any gene information, we can effectively classify MAIT cells. The difference in AUC between the α-based and the β-based classifications is expected, as the tagging itself is α-based. It is unexpected, however, to find that β sequences hold relevant information about the MAIT status of the cell.
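The MAIT tagging rule used for the labels (TRAV1-2 joined with TRAJ33, TRAJ20, or TRAJ12) can be sketched as below. The record layout, barcodes, and function name are assumptions; the sketch also assumes gene names are already normalized, with no allele suffixes.

```python
MAIT_V_GENE = "TRAV1-2"
MAIT_J_GENES = {"TRAJ33", "TRAJ20", "TRAJ12"}


def is_mait(v_gene, j_gene):
    """Tag a cell as MAIT from its TCR alpha V and J gene annotations."""
    return v_gene == MAIT_V_GENE and j_gene in MAIT_J_GENES


# Toy per-cell alpha-chain annotations keyed by made-up barcodes.
cells = {
    "cell-1": ("TRAV1-2", "TRAJ33"),
    "cell-2": ("TRAV1-2", "TRAJ5"),
    "cell-3": ("TRAV12-1", "TRAJ33"),
    "cell-4": ("TRAV1-2", "TRAJ12"),
}
mait_barcodes = sorted(b for b, (v, j) in cells.items() if is_mait(v, j))
print(mait_barcodes)
```

Because the tag is attached to the cell barcode, the same label can be carried over to the TRB sequence of that cell, which is what makes the β-based classification possible.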
Fig. 5. MAIT cells and TCRβ sister sequences in CVC and scCVC embedding space.
The 10x Genomics single-cell lung cancer dataset was used to examine the distribution of MAIT cells in the embedding space. MAIT cell barcodes were labeled according to their TRA J and V genes: TRAV1-2 combined with TRAJ33/20/12, enabling the labeling of corresponding TRB sequences by MAIT barcodes. (A) UMAP visualization of MAIT and non-MAIT single-cell embeddings generated using scCVC. (B) UMAP visualization of MAIT and non-MAIT TRB sequences produced by CVC. In both cases, we see that the MAIT cells did not cluster together. Eight 10x Genomics single-cell datasets were combined for a more comprehensive analysis. (C) Publicity distribution for 2508 MAIT and 2508 non-MAIT cells, revealing that over 60% of MAIT cells are public. (D) DNN architecture used for binary classification of MAIT cells, with embeddings as input. (E) Results were evaluated using three types of embeddings: TRA only, TRB only, and TRA combined with TRB. ROC AUC values were 0.83, 0.71, and 0.76, respectively. (F) Using 100,000 single-cell sequences from the single-cell database (see Data and materials availability), TRB sequences coexpressed with the same TRA sequence, i.e., TRB sister sequences (see text), were grouped together. UMAP visualization of the embedding space for these cells highlights TRB sister sequences belonging to TRA CAVMDSNYQLIW, CAVSGSQGNLIF, and CALNPRGNKLTF (fig. S4, A to C, respectively), showing that they do not cluster together. The mean distance between them was calculated and compared to the distance between them and the rest of the (random) sequences, revealing no difference in mean distance.
In addition to MAIT cells, we used single-cell data to analyze single T cells to identify tenets of co-occurrences between different TCRβ chains and the same TCRα chain in different cells. That is, we studied TCRβ sequences appearing in different cells that share the same TCRα sequences. We refer to these β sequences as TCRβ sisters. Using single-cell data (see Materials and Methods and Data and materials availability), TCRβ sisters were analyzed, and their embeddings were generated using CVC. To see whether these TCRβ sisters occupy a contained area in embedding space, we measured distances between sister TCRβs and compared these distances with the measured distances between sister TCRβs and random TCRβ sequences. We also projected the TCRβ sequences onto a 2D UMAP plot. As Fig. 5F indicates, the distances within and without the TRB sisters showed no substantial difference. The same phenomenon can be seen in fig. S4 (A to C), which shows TCRβ sisters scattered throughout the embedding space. These results indicate the diversity within sister TCRβ sequences.
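The distance comparison for TCRβ sisters can be sketched as follows: compute the mean pairwise distance within a group of sister embeddings and compare it with the mean distance from the sisters to other sequences. Euclidean distance and the 2D toy vectors are assumptions made for illustration; the real embeddings are 768D and the metric used in the study is not specified in this section.

```python
import math
from itertools import combinations, product


def mean_within(vectors):
    """Mean pairwise distance among the vectors of one group."""
    distances = [math.dist(a, b) for a, b in combinations(vectors, 2)]
    return sum(distances) / len(distances)


def mean_between(group, others):
    """Mean distance from each vector in group to each vector in others."""
    distances = [math.dist(a, b) for a, b in product(group, others)]
    return sum(distances) / len(distances)


# Toy 2D "embeddings" standing in for sister and random TCR beta vectors.
sisters = [(0.0, 0.0), (3.0, 4.0)]
random_seqs = [(0.0, 4.0), (3.0, 0.0)]
within = mean_within(sisters)
between = mean_between(sisters, random_seqs)
print(within, between)
```

If sisters occupied a contained region of the embedding space, the within-group mean would be clearly smaller than the between-group mean; the analysis above reports no such difference.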
DISCUSSION
Our understanding of the immune system’s versatility and its ability to perform its multitude of responsibilities is intricately linked to the specificity of TCRs, particularly within the CDR3 region. While TCRBuilder (31) and other structural NLP-based modeling tools (21, 32, 33) clarify parts of TCR functionality, translating the sequence of amino acids directly into functional insights remains challenging. Here, we introduced Transformer language models—CVC, trained on bulk TCRβ sequences, and scCVC, trained on single-cell TCRα and TCRβ presentation in isolated T cells. These models unveil underlying patterns in CDR3 sequences, informing of previously hidden associations.
These models detect spatial separations in latent Transformer space between public and private TCR sequences and manifest self-organized clusters indicative of J gene usage. Intriguingly, we observed certain J genes exhibiting higher CR, hinting at selective pressures in the immune system’s evolution that merit further investigation. Moreover, our model is able to classify public and private TCR sequences and to multi-class label J gene types. The utility of such classification tasks is also demonstrated in their ability to identify specialized T cell types, including MAIT cells, showcasing the potential of these tools to parse the complex parts of the T cell phenotype domain with precision.
An additional layer of complexity was addressed by analyzing TCRβ sister sequences using our scCVC model. By examining the co-occurrence of different TCRβ chains within the context of shared TCRα sequences across single T cells, this study follows the diversity and the potential functional interplay between sister chains. Despite the common TCRα linkage, TCRβ sisters demonstrate a high degree of diversity, occupying varied regions within the embedding space. This finding may assist in tracing the recombination mechanisms at the core of these unique cell subsets.
Other language models have been used to explore the TCR. Comparing CVC against one such model, TCR-BERT, which is similarly trained on CDR3 sequences, and against ESM-2, which is a more general protein model, we see that our task-specific Transformer models demonstrated superior capabilities in clustering and in classifying. This shows the contrast between models fine-tuned for specific biological tasks and general models. The clear distinction in performance is a testament to the benefit of CVC’s architecture, which is adept at capturing the subtle complexities of TCR specificity.
Future work and future uses build on our ability to scale these models. We believe that such work could increase both parameter numbers and the number of sequences the model would be trained on (34). Improvement in single-cell technology may provide computational tools with the data needed to not only better our understanding of immune cell biology but also catalyze the development of innovative T cell–based therapies (35–37). Insights into public and private TCR distinctions may pave innovative pathways for cancer immunotherapies by identifying public TCRs that target common tumor antigens across patients. This transition from measured biology to latent mathematical space, and back into clinical implementation, has the potential to improve health.
MATERIALS AND METHODS
Bulk sequencing data
The dataset we collected for model training (CVC) includes information from 34 published papers. All included papers report T cell repertoire sequencing from bulk RNA (in contrast with single-cell data). Multiple methods were used for both library preparation and sequencing. All samples are human samples. While some of these papers reported α-chain as well as β-chain sequencing, we included only β-chain sequences in the training dataset. Further, as we were interested only in the sequences themselves, we did not use any metadata (such as tissue type) in our analyses. These metadata may be the subject of future work.
From each paper, the tables, FASTQ files, or any other form of data collection were stripped down to two items: TCR sequence and sample identification. These were aggregated into a larger table. The final collection included 4217 samples holding 221,176,713 sequences.
Single-cell sequencing data
The single-cell dataset we collected for model training (scCVC) and analysis includes information from 31 published experiments. All included papers report T cell repertoires from single-cell RNA sequencing. Multiple methods were used for both library preparation and sequencing. All samples are human samples. As we were mainly interested in the sequences themselves, of both α- and β-chain types, we did not use additional metadata in our analyses.
From each experiment, the tables, FASTQ files, or any other form of data collection were stripped down to three items: TCR sequence, sample identification, and unique cell identification. These were aggregated into a larger table. The final collection included 458 samples holding 6,159,652 sequences.
Language models: CVC and scCVC
Both CVC and scCVC are language models based on the BERT architecture (38), which has been shown to achieve state-of-the-art results on a range of NLP tasks. They were implemented in Python using PyTorch (39) and the Transformers library (40). The models use a mechanism called attention to learn complex interactions within the input sequence, in our case the interactions and correlations between amino acids. This allows the models, after training, to learn the grammar of the amino acid language in an unsupervised manner.
The difference between the two models is mainly in the number of training samples each model was trained on, the input sequences themselves, and how they were presented during training:
1) CVC was trained on 5 million CDR3 TCRβ sequences, with an internal split of 2.5 million private and 2.5 million public sequences. The input was individual CDR3 TCRβ sequences taken from the bulk-sequencing data mentioned above. Training used the masking technique: 15% of each sequence’s amino acids were masked, and the model had to predict them.
2) scCVC was trained on 2,120,565 single cells (consisting of 4,200,335 TCRα and TCRβ sequences) from the previously mentioned single-cell sequencing data. Each input was a single cell, represented by the concatenation of the CDR3 sequences that belong to it, joined by a separator token. Training first generated a random permutation of the sequences that constitute each single cell and then applied the masking technique: 15% of each sequence’s amino acids were masked, and the model had to predict them. The sequence order was randomized to ensure that the model did not assign any importance to a particular order.
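The two input-preparation steps in the list above (masking, and permuting then concatenating the CDR3 sequences of a cell) can be sketched as follows. This is an illustration under assumptions, not the training code: the mask and separator symbols, the function names, and the example β-chain sequence are hypothetical, and every selected residue is simply replaced by the mask symbol, whereas standard BERT training also substitutes random or unchanged tokens for some of them.

```python
import random

MASK = "."  # hypothetical mask symbol (not an amino acid letter)
SEP = "|"   # hypothetical separator token between the chains of one cell


def mask_sequence(seq, rate=0.15, rng=random):
    """Mask about `rate` of the residues; return (tokens, {position: original residue})."""
    tokens = list(seq)
    n_mask = max(1, round(rate * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    targets = {i: tokens[i] for i in positions}
    for i in positions:
        tokens[i] = MASK
    return tokens, targets


def build_cell_tokens(cdr3s, rate=0.15, rng=random):
    """Shuffle the chains of one cell, mask each chain, and join them with SEP."""
    order = list(cdr3s)
    rng.shuffle(order)  # random permutation, so no chain order is privileged
    tokens, targets = [], {}
    for k, seq in enumerate(order):
        if k > 0:
            tokens.append(SEP)
        masked, seq_targets = mask_sequence(seq, rate, rng)
        offset = len(tokens)
        tokens.extend(masked)
        targets.update({offset + i: aa for i, aa in seq_targets.items()})
    return tokens, targets


rng = random.Random(0)

# Bulk-style input: one CDR3 sequence (illustrative example sequence).
beta_tokens, beta_targets = mask_sequence("CASSLGQAYEQYF", rng=rng)
print("".join(beta_tokens), beta_targets)

# Single-cell-style input: the chains of one cell, joined by the separator.
cell = ["CAVMDSNYQLIW", "CASSLGQAYEQYF"]
cell_tokens, cell_targets = build_cell_tokens(cell, rng=rng)
print("".join(cell_tokens), cell_targets)
```

Masking each chain separately keeps the separator token out of the prediction targets.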
The following hyperparameters were used, most of them kept at the default BERT values: hidden representation dimensionality: 768; intermediate representation dimensionality: 1536; number of attention heads: 12; number of Transformer layers: 12; batch size: 1024; training epochs: 50; learning rate: 5 × 10⁻⁵; maximum positional embedding: 64; optimizer: Adam; loss: NLL (negative log likelihood). The resulting model has approximately 86 million parameters.
Because of the large computational needs, the models were trained (separately) on the Google Cloud Platform with the NVIDIA Tesla A100 GPU and 120 GB of memory. With this hardware, it took about 6 days to train. Adding parallelization of eight GPUs decreased the training time to about 2 days.
Once training was complete, each model could be used to create embeddings. The inputs of CVC were CDR3 TCRβ sequences, and the inputs of scCVC were either individual CDR3 sequences of both chain types or single cells, in the format explained above. Input lengths (L) varied. Each sequence was padded with a prefix token, C, and a suffix token, S. The padded input was passed to an embedding layer that mapped each amino acid token to a 768D vector. Along with position embeddings, the embedded tokens were passed through a set of 12 layers that produced the whole-sequence embedding matrix with dimensions of (L + 2) × 768. This matrix was then reduced to dimension 1 × 768 by calculating the mean of its embeddings. The reduction method could be changed, but the mean was set as the default. This final embedding representation could then be used in various downstream tasks like the ones we present below.
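The final reduction step, from an (L + 2) × 768 matrix to a single 1 × 768 vector, can be illustrated with a small sketch. It is not the models' code: the per-token vectors here come from a random lookup table rather than 12 Transformer layers, the dimension is 4 instead of 768 to keep the output readable, and the token names and function names are hypothetical.

```python
import random

PREFIX, SUFFIX = "<prefix>", "<suffix>"  # hypothetical names for the two padding tokens
DIM = 4  # toy dimensionality; the models described in the text use 768


def toy_token_embeddings(seq, dim=DIM, seed=0):
    """Stand-in for the model: one random vector per token, prefix and suffix included."""
    rng = random.Random(seed)
    table = {}
    rows = []
    for token in [PREFIX, *seq, SUFFIX]:
        if token not in table:
            table[token] = [rng.gauss(0, 1) for _ in range(dim)]
        rows.append(table[token])
    return rows  # shape: (L + 2) x dim


def mean_pool(token_embeddings):
    """Average an (L + 2) x D matrix over its rows, giving one D-dimensional vector."""
    n_tokens = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(row[j] for row in token_embeddings) / n_tokens for j in range(dim)]


seq = "CAVMDSNYQLIW"  # L = 12
matrix = toy_token_embeddings(seq)
pooled = mean_pool(matrix)
print(len(matrix), "x", len(matrix[0]), "->", len(pooled))
```

Because the mean is taken over all rows, sequences of different lengths map to vectors of the same size, which is what lets one downstream classifier handle every sequence.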
Benchmarking—implementation details of protein language models
For the benchmarking process, we used two prominent protein language models: TCR-BERT and ESM-2. The TCR-BERT model is tailored for TCR sequences, while ESM-2 is a generic protein model that has demonstrated broad capabilities in sequence representation.
TCR-BERT: The TCR-BERT model, derived from the original BERT architecture, is designed specifically for TCR sequences. We used the “wukevin/tcr-bert-mlm-only” version through the HuggingFace Transformers library. It has a base architecture of 12 Transformer layers, each with 12 self-attention heads, and produces 768D embeddings. The model was trained for 50 epochs with a learning rate of 5 × 10⁻⁵ and comprises approximately 58 million parameters. Its focus on TCRβ sequences makes it a relevant comparison for our CVC model.
ESM-2: We chose the 150 million parameter variant of ESM-2 because its parameter count and embedding dimensions are comparable in scale to those of the other models. The “facebook/esm2_t30_150M_UR50D” model, also used through the HuggingFace Transformers library, features 30 Transformer layers and 20 attention heads, and creates 640D embeddings. This structure provides a balance between efficiency and the ability to capture complex sequence information. It was trained over 500,000 epochs with a learning rate of 4 × 10⁻⁴ and a weight decay of 0.01.
A benchmarking set comprising 400,000 TCRβ sequences was extracted from our extensive database to evaluate the performance of these models. The sequences were processed through TCR-BERT and ESM-2 to generate embeddings, which were then benchmarked against those produced by our CVC model.
The benchmarking analysis was conducted under uniform computational settings to ensure equitable conditions, and both models followed identical preprocessing protocols to maintain consistency in the evaluation. Results are presented in figs. S7 and S8.
Downstream clustering (UMAP)
CVC outputs embeddings of dimension 768. To view these high-dimensional embeddings on a 2D plot, dimensionality reduction was required. UMAP (25) was used, after the application of principal components analysis (41). The scanpy package (42) was used to apply this technique, yielding an AnnData object that contains the embeddings and the reduced-dimension coordinates. We also tried t-distributed Stochastic Neighbor Embedding (t-SNE), but UMAP was faster and gave clearer results.
Classification models
According to the guidelines of the recent DOME standard (43) for reporting results of supervised machine learning, we included, in addition to the description here in Materials and Methods, a supplementary table that carefully follows the DOME standard. This can be found in table S1—DOME Report.
Input data presentation
For each of the models described below, the input was either the embeddings created by CVC or the one-hot encoding representation of the sequences. The one-hot encoding representation transformed each amino acid to a 1 × 20D one-hot vector.
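The one-hot representation can be sketched as below. This is a minimal illustration, assuming the 20 standard amino acids in alphabetical order of their one-letter codes. The ordering and the function name are choices made here, and the text does not say how sequences of different lengths were brought to a common size before classification, so that step is left out.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues; this ordering is an assumption
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(seq):
    """Encode a CDR3 sequence as a list of 1 x 20 one-hot rows, one row per residue."""
    rows = []
    for aa in seq:
        row = [0] * len(AMINO_ACIDS)
        row[AA_INDEX[aa]] = 1  # raises KeyError for non-standard residues
        rows.append(row)
    return rows


encoded = one_hot("CAVMDSNYQLIW")
print(len(encoded), "residues x", len(encoded[0]), "columns")
print(encoded[0])  # row for the leading cysteine
```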
LDA
The LDA algorithm is a supervised dimensionality reduction technique that was used here to classify both the public/private label and the J gene of a given sequence. It was applied using the Python package sklearn (44) with its default hyperparameters. Hyperparameter tuning did not improve results.
xgBoost
The xgBoost algorithm is a well-known classification algorithm that gives high-accuracy results when applied on tabular data. Here, we used it in a supervised manner to classify both the public/private label and the J gene of a given sequence. The sklearn package was again used for this algorithm with default hyperparameters. Changing the parameters did not give better results.
Deep neural network
The DNN was applied to three classification tasks: predicting the public/private label, the MAIT cell label, and the J gene of a given sequence.
Predicting public/private label and MAIT cells
For this task, the best results were achieved by using a simple three-layer network with dimensions of 128, 32, and 1. The activation function was ReLU for the first two layers and sigmoid for the last. Training used a learning rate of 1 × 10⁻⁵, the Adam optimizer, and binary cross-entropy loss. For public/private classification, a batch size of 1024 and 150 epochs were used, as opposed to a batch size of 256 and 80 epochs for classifying MAIT cells.
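The shape of this binary classifier can be made concrete with a dependency-free sketch of its forward pass and loss. It is an illustration only: the weights are random and untrained, the weight initialization is an arbitrary choice, the 768D input is assumed to be a CVC-style embedding, the function names are hypothetical, and the Adam training loop is omitted.

```python
import math
import random


def dense(x, weights, biases):
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(v):
    return [max(0.0, a) for a in v]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def init_layer(n_in, n_out, rng):
    """Random weights in a small uniform range (arbitrary choice) and zero biases."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(x, layers):
    """ReLU after the first two layers, sigmoid after the last (single output)."""
    h = x
    for weights, biases in layers[:-1]:
        h = relu(dense(h, weights, biases))
    weights, biases = layers[-1]
    return sigmoid(dense(h, weights, biases)[0])


def binary_cross_entropy(p, label, eps=1e-12):
    p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


rng = random.Random(0)
sizes = [768, 128, 32, 1]  # input embedding, then the three layers from the text
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]

embedding = [rng.gauss(0, 1) for _ in range(768)]  # stand-in for one sequence embedding
prob = forward(embedding, layers)
loss = binary_cross_entropy(prob, label=1)
print(f"predicted probability: {prob:.3f}  loss for a positive label: {loss:.3f}")
```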
Predicting J gene
For this task, the best results were achieved by using a simple three-layer network with dimensions of 64, 32, and 13 (13 types of J genes). The activation function was ReLU. Training used a learning rate of 1 × 10⁻⁵, the Adam optimizer, a batch size of 1024, cross-entropy loss, and 80 epochs. Adding dropout and batch normalization did not improve the results.
Acknowledgments
We thank T. Mora and G. Yaari for extremely useful comments.
Funding: This work was supported by an ISF funded project 682/19, an ISF-SFC project 3382/20, an ICRF project 829965, a BSF project 20199090, and by the VATAT Data Science Program. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of ISF, ICRF, BSF, and VATAT.
Author contributions: Conceptualization: R.G.K. and S.E. Methodology: R.G.K. and S.E. Investigation: R.G.K., A.A., S.Z., and A.Z. Visualization: R.G.K. and S.E. Funding acquisition: A.Z. and S.E. Project administration: R.G.K. Supervision: S.E. Writing – original draft: R.G.K. Writing – review and editing: R.G.K. and S.E.
Competing interests: A.Z. and S.E. are founders of Clonal Company. S.E. and R.G.K. are inventors on patent application no. IL2023/050758 submitted by Bar-Ilan University that covers the method for TCR sequence identification and classification. The other authors declare that they have no competing interests.
Data and materials availability: All data needed to evaluate the conclusions in the paper are present in the paper and/or the Supplementary Materials. Bulk sequencing database: This database was created in our laboratory. It is a collection of public data of 4219 samples that correspond to 221,176,713 rows. The list of the PMIDs of the samples that make up this database can be found in table S2. For this project, we used only the data of TCRβ sequences, which translate to 91,758,697 unique sequences. Single-cell sequencing database: This database was created in our laboratory. It is a collection of public data of 458 samples that correspond to 6,159,652 rows. The list of the PMIDs of the samples that make up this database can be found in table S3. For this project, we filtered out duplicate and low-quality sequences, which left us with 4,200,335 TCRα and TCRβ sequences. These translate to 2,120,565 cells. ImmuneCODE database: This database includes millions of TCR sequences that come from patients who were exposed to or infected with SARS-CoV-2. It includes over 1400 different subjects. In this research, it was specifically used to distinguish the embedding space with the V and J genes. To do so, 17 million sequences were randomly extracted from it and used for both tasks. This database is freely available (29) and is planned to be used in further research. 10x Genomics dataset: 10x Genomics offers many different single-cell datasets that can be used for different research investigations. Overall, we used six datasets in this research: NSCLC tumor dataset, 20,000 bone marrow mononuclear cells, PBMCs of a healthy donor, 10,000 human PBMCs (https://www.10xgenomics.com/resources/datasets/10-k-human-pbm-cs-5-v-2-0-chromium-x-2-standard-6-1-0), CD8+ T cells of healthy donor 1, and CD8+ of healthy donor 2. These were chosen on the basis of the number of cells they contained and not for any specific reason. 
The NSCLC tumor dataset, which was used for immune profiling, consists of about 3643 cells. This dataset and the remaining ones (20,000 bone marrow mononuclear cells, PBMCs, and CD8+ T cells) were used for MAIT cell classification. The remaining datasets contain 19,737, 6037, 14,632, 123,862, and 191,643 cells, respectively. More information and the datasets themselves can be found on the 10x Genomics website.
Supplementary Materials
This PDF file includes:
Figs. S1 to S8
Tables S1 to S3
References
REFERENCES AND NOTES
- 1. Wang C.-Y., Fang Y.-X., Chen G.-H., Jia H.-J., Zeng S., He X.-B., Feng Y., Li S.-J., Jin Q.-W., Cheng W.-Y., Jing Z.-Z., Analysis of the CDR3 length repertoire and the diversity of T cell receptor α and β chains in swine CD4+ and CD8+ T lymphocytes. Mol. Med. Rep. 16, 75–86 (2017).
- 2. Papadopoulou I., Nguyen A.-P., Weber A., Martínez M. R., DECODE: A computational pipeline to discover T cell receptor binding rules. Bioinformatics 38, i246–i254 (2022).
- 3. Jumper J., Evans R., Pritzel A., Green T., Figurnov M., Ronneberger O., Tunyasuvunakool K., Bates R., Žídek A., Potapenko A., Bridgland A., Meyer C., Kohl S. A. A., Ballard A. J., Cowie A., Romera-Paredes B., Nikolov S., Jain R., Adler J., Back T., Petersen S., Reiman D., Clancy E., Zielinski M., Steinegger M., Pacholska M., Berghammer T., Bodenstein S., Silver D., Vinyals O., Senior A. W., Kavukcuoglu K., Kohli P., Hassabis D., Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
- 4. Reading J., Foster K., Joshi K., Chain B., Tracking down tumor-specific T cells. Cancer Cell 40, 351–353 (2022).
- 5. Ou M., Zheng F., Zhang X., Liu S., Tang D., Zhu P., Qiu J., Dai Y., Integrated analysis of B-cell and T-cell receptors by high-throughput sequencing reveals conserved repertoires in IgA nephropathy. Mol. Med. Rep. 17, 7027–7036 (2018).
- 6. Hou X., Wang M., Lu C., Xie Q., Cui G., Chen J., Du Y., Dai Y., Diao H., Analysis of the repertoire features of TCR beta chain CDR3 in human by high-throughput sequencing. Cell. Physiol. Biochem. 39, 651–667 (2016).
- 7. Arnaout R., Lee W., Cahill P., Honan T., Sparrow T., Weiand M., Nusbaum C., Rajewsky K., Koralov S. B., High-resolution description of antibody heavy-chain repertoires in humans. PLOS ONE 6, e22365 (2011).
- 8. Bacher R., Kendziorski C., Design and computational analysis of single-cell RNA-sequencing experiments. Genome Biol. 17, 63 (2016).
- 9. Serana F., Sottini A., Caimi L., Palermo B., Natali P. G., Nisticò P., Imberti L., Identification of a public CDR3 motif and a biased utilization of T-cell receptor V beta and J beta chains in HLA-A2/Melan-A-specific T-cell clonotypes of melanoma patients. J. Transl. Med. 7, 21 (2009).
- 10. Huisman W., Hageman L., Leboux D. A. T., Khmelevskaya A., Efimov G. A., Roex M. C. J., Amsen D., Falkenburg J. H. F., Jedema I., Public T-cell receptors (TCRs) revisited by analysis of the magnitude of identical and highly-similar TCRs in virus-specific T-cell repertoires of healthy individuals. Front. Immunol. 13, 851868 (2022).
- 11. Venturi V., Price D. A., Douek D. C., Davenport M. P., The molecular basis for public T-cell responses? Nat. Rev. Immunol. 8, 231–238 (2008).
- 12. Greiff V., Bhat P., Cook S. C., Menzel U., Kang W., Reddy S. T., A bioinformatic framework for immune repertoire diversity profiling enables detection of immunological status. Genome Med. 7, 49 (2015).
- 13. Elhanati Y., Sethna Z., Callan C. G. Jr., Mora T., Walczak A. M., Predicting the spectrum of TCR repertoire sharing with a data-driven model of recombination. Immunol. Rev. 284, 167–179 (2018).
- 14. Lin Z., Akin H., Rao R., Hie B., Zhu Z., Lu W., Smetanin N., Verkuil R., Kabeli O., Shmueli Y., dos Santos Costa A., Fazel-Zarandi M., Sercu T., Candido S., Rives A., Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).
- 15. Lennox M., Robertson N., Devereux B., Deep learning proteins using a triplet-BERT network. Annu. Int. Conf. IEEE Eng. Med. Biol. Soc. 2021, 4341–4347 (2021).
- 16. M. H. Vu, R. Akbar, P. A. Robert, B. Swiatczak, V. Greiff, G. K. Sandve, D. T. T. Haug, Advancing protein language models with linguistics: A roadmap for improved interpretability. arXiv:2207.00982 [q-bio.QM] (2022).
- 17. Ji Y., Zhou Z., Liu H., Davuluri R. V., DNABERT: Pre-trained bidirectional encoder representations from transformers model for DNA-language in genome. Bioinformatics 37, 2112–2120 (2021).
- 18. A. Weber, A. Pélissier, M. R. Martínez, T cell receptor binding prediction: A machine learning revolution. arXiv:2312.16594 [q-bio.QM] (2023).
- 19. Unsal S., Atas H., Albayrak M., Turhan K., Acar A. C., Doğan T., Learning functional properties of proteins with language models. Nat. Mach. Intell. 4, 227–245 (2022).
- 20. Davidsen K., Olson B. J., DeWitt W. S. III, Feng J., Harkins E., Bradley P., Matsen F. A. IV, Deep generative models for T cell receptor protein sequences. eLife 8, e46935 (2019).
- 21. K. Wu, K. E. Yost, B. Daniel, J. A. Belk, Y. Xia, T. Egawa, A. Satpathy, H. Y. Chang, J. Zou, TCR-BERT: Learning the grammar of T-cell receptors for flexible antigen-binding analyses. bioRxiv 2021.11.18.469186 [Preprint] (2021). 10.1101/2021.11.18.469186.
- 22. Greiff V., Weber C. R., Palme J., Bodenhofer U., Miho E., Menzel U., Reddy S. T., Learning the high-dimensional immunogenomic features that predict public and private antibody repertoires. J. Immunol. 199, 2985–2997 (2017).
- 23. Valkiers S., Houcke M., Laukens K., Meysman P., ClusTCR: A python interface for rapid clustering of large sets of CDR3 sequences with unknown antigen specificity. Bioinformatics 37, 4865–4867 (2021).
- 24. DeWitt W. S., Smith A., Schoch G., Hansen J. A., Matsen F. A. IV, Bradley P., Human T cell receptor occurrence patterns encode immune history, genetic background, and receptor specificity. eLife 7, e38358 (2018).
- 25. L. McInnes, J. Healy, J. Melville, UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv:1802.03426 [stat.ML] (2018).
- 26. Hou X., Zeng P., Zhang X., Chen J., Liang Y., Yang J., Yang Y., Liu X., Diao H., Shorter TCR β-chains are highly enriched during thymic selection and antigen-driven selection. Front. Immunol. 10, 299 (2019).
- 27. Freeman J. D., Warren R. L., Webb J. R., Nelson B. H., Holt R. A., Profiling the T-cell receptor beta-chain repertoire by massively parallel sequencing. Genome Res. 19, 1817–1824 (2009).
- 28. L. M.-P. Lefranc, G. Lefranc, The T cell receptor FactsBook (Academic Press, London, 2001), pp. 398, IMGT/LIGMDB: IMGT000021 (582960 bp), human (Homo sapiens) TRB locus.
- 29. Nolan S., Vignali M., Klinger M., Dines J. N., Kaplan I. M., Svejnoha E., Craft T., Boland K., Pesesky M., Gittelman R. M., Snyder T. M., Gooley C. J., Semprini S., Cerchione C., Mazza M., Delmonte O. M., Dobbs K., Carreño-Tarragona G., Barrio S., Sambri V., Martinelli G., Goldman J. D., Heath J. R., Notarangelo L. D., Carlson J. M., Martinez-Lopez J., Robins H. S., A large-scale database of T-cell receptor beta (TCRβ) sequences and binding associations from natural and synthetic exposure to SARS-CoV-2. Res. Sq. (2020).
- 30. N. Deutchmann, A. Pelissier, A. Weber, S. Gao, J. Bogojeska, M. R. Martínez, Do domain-specific protein language models outperform general models on immunology-related tasks? bioRxiv 2023.10.17.562795 [Preprint] (2023). 10.1101/2023.10.17.562795.
- 31. Wong W. K., Marks C., Leem J., Lewis A. P., Shi J., Deane C. M., TCRBuilder: Multi-state T-cell receptor structure prediction. Bioinformatics 36, 3580–3581 (2020).
- 32. Sidhom J.-W., Larman H. B., Pardoll D. M., Baras A. S., DeepTCR is a deep learning framework for revealing sequence concepts within T-cell repertoires. Nat. Commun. 12, 1605 (2021).
- 33. Ostrovsky-Berman M., Frankel B., Polak P., Yaari G., Immune2vec: Embedding B/T cell receptor sequences in ℝN using natural language processing. Front. Immunol. 12, 680687 (2021).
- 34. J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, J. W. Rae, O. Vinyals, L. Sifre, Training compute-optimal large language models. arXiv:2203.15556 [cs.CL] (2022).
- 35. Raffin C., Vo L. T., Bluestone J. A., Treg cell-based therapies: Challenges and perspectives. Nat. Rev. Immunol. 20, 158–172 (2020).
- 36. Romano M., Fanelli G., Albany C. J., Giganti G., Lombardi G., Past, present, and future of regulatory T cell therapy in transplantation and autoimmunity. Front. Immunol. 10, 43 (2019).
- 37. Loretelli C., Assi E., Seelam A. J., Nasr M. B., Fiorina P., Cell therapy for type 1 diabetes. Expert Opin. Biol. Ther. 20, 887–897 (2020).
- 38. J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805 [cs.CL] (2018).
- 39. A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, PyTorch: An imperative style, high-performance deep learning library, in Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. Buc, E. Fox, R. Garnett, Eds. (Curran Associates Inc., 2019), pp. 8024–8035.
- 40. T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, A. Rush, Transformers: State-of-the-art natural language processing, in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (Association for Computational Linguistics, Online); https://aclanthology.org/2020.emnlp-demos.6, pp. 38–45.
- 41. Jolliffe I. T., Cadima J., Principal component analysis: A review and recent developments. Philos. Trans. A Math. Phys. Eng. Sci. 374, 20150202 (2016).
- 42. Wolf F. A., Angerer P., Theis F. J., SCANPY: Large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
- 43. Walsh I., Fishman D., Garcia-Gasulla D., Titma T., Pollastri G.; ELIXIR Machine Learning Focus Group, Harrow J., Psomopoulos F. E., Tosatto S. C. E., DOME: Recommendations for supervised machine learning validation in biology. Nat. Methods 18, 1122–1127 (2021).
- 44. Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., Vanderplas J., Passos A., Cournapeau D., Brucher M., Perrot M., Duchesnay E., Scikit-learn: Machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
- 45. Jia Q., Wu W., Wang Y., Alexander P. B., Sun C., Gong Z., Cheng J.-N., Sun H., Guan Y., Xia X., Yang L., Yi X., Wan Y. Y., Wang H., He J., Futreal P. A., Li Q.-J., Zhu B., Local mutational diversity drives intratumoral immune heterogeneity in non-small cell lung cancer. Nat. Commun. 9, 5361 (2018).
- 46. Napolitani G., Kurupati P., Teng K. W. W., Gibani M. M., Rei M., Aulicino A., Preciado-Llanes L., Wong M. T., Becht E., Howson L., de Haas P., Salio M., Blohmke C. J., Olsen L. R., Pinto D. M. S., Scifo L., Jones C., Dobinson H., Campbell D., Juel H. B., Thomaides-Brears H., Pickard D., Bumann D., Baker S., Dougan G., Simmons A., Gordon M. A., Newell E. W., Pollard A. J., Cerundolo V., Clonal analysis of Salmonella-specific effector T cells reveals serovar-specific and cross-reactive T cell responses. Nat. Immunol. 19, 742–754 (2018).
- 47. Giudice V., Feng X., Lin Z., Hu W., Zhang F., Qiao W., Del Pilar Fernandez Ibanez M., Rios O., Young N. S., Deep sequencing and flow cytometric characterization of expanded effector memory CD8+CD57+ T cells frequently reveals T-cell receptor Vβ oligoclonality and CDR3 homology in acquired aplastic anemia. Haematologica 103, 759–769 (2018).
- 48. Seet C. S., He C., Bethune M. T., Li S., Chick B., Gschweng E. H., Zhu Y., Kim K., Kohn D. B., Baltimore D., Crooks G. M., Montel-Hagen A., Generation of mature T cells from human hematopoietic stem and progenitor cells in artificial thymic organoids. Nat. Methods 14, 521–530 (2017).
- 49. Sims J. S., Grinshpun B., Feng Y., Ung T. H., Neira J. A., Samanamud J. L., Canoll P., Shen Y., Sims P. A., Bruce J. N., Diversity and divergence of the glioma-infiltrating T-cell receptor repertoire. Proc. Natl. Acad. Sci. U.S.A. 113, E3529–E3537 (2016).
- 50. Genolet R., Stevenson B. J., Farinelli L., Osterås M., Luescher I. F., Highly diverse TCRα chain repertoire of pre-immune CD8+ T cells reveals new insights in gene recombination. EMBO J. 31, 4247–4248 (2012).
- 51. Neal J. T., Li X., Zhu J., Giangarra V., Grzeskowiak C. L., Ju J., Liu I. H., Chiou S.-H., Salahudeen A. A., Smith A. R., Deutsch B. C., Liao L., Zemek A. J., Zhao F., Karlsson K., Schultz L. M., Metzner T. J., Nadauld L. D., Tseng Y.-Y., Alkhairy S., Oh C., Keskula P., Mendoza-Villanueva D., Vega F. M., Kunz P. L., Liao J. C., Leppert J. T., Sunwoo J. B., Sabatti C., Boehm J. S., Hahn W. C., Zheng G. X. Y., Davis M. M., Kuo C. J., Organoid modeling of the tumor immune microenvironment. Cell 175, 1972–1988.e16 (2018).
- 52. Azizi E., Carr A. J., Plitas G., Cornish A. E., Konopacki C., Prabhakaran S., Nainys J., Wu K., Kiseliovas V., Setty M., Choi K., Fromme R. M., Dao P., McKenney P. T., Wasti R. C., Kadaveru K., Mazutis L., Rudensky A. Y., Pe’er D., Single-cell map of diverse immune phenotypes in the breast tumor microenvironment. Cell 174, 1293–1308.e36 (2018).
- 53. van den Heuvel H., Heutinck K. M., van der Meer-Prins E. M. W., Yong S. L., van Miert P. P. M. C., Anholts J. D. H., Dijk M. E. I. F., Zhang X. Q., Roelen D. L., Berge R. J. M. T., Claas F. H. J., Allo-HLA cross-reactivities of Cytomegalovirus-, influenza-, and varicella zoster virus-specific memory T cells are shared by different healthy individuals. Am. J. Transplant. 17, 2033–2044 (2017).
- 54. Béziat V., Li J., Lin J.-X., Ma C. S., Li P., Bousfiha A., Pellier I., Zoghi S., Baris S., Keles S., Gray P., Du N., Wang Y., Zerbib Y., Lévy R., Leclercq T., About F., Lim A. I., Rao G., Payne K., Pelham S. J., Avery D. T., Deenick E. K., Pillay B., Chou J., Guery R., Belkadi A., Guérin A., Migaud M., Rattina V., Ailal F., Benhsaien I., Bouaziz M., Habib T., Chaussabel D., Marr N., El-Benna J., Grimbacher B., Wargon O., Bustamante J., Boisson B., Müller-Fleckenstein I., Fleckenstein B., Chandesris M.-O., Titeux M., Fraitag S., Alyanakian M.-A., Leruez-Ville M., Picard C., Meyts I., Santo J. P. D., Hovnanian A., Somer A., Ozen A., Rezaei N., Chatila T. A., Abel L., Leonard W. J., Tangye S. G., Puel A., Casanova J.-L., A recessive form of hyper-IgE syndrome by disruption of ZNF341-dependent STAT3 transcription and activity. Sci. Immunol. 3, eaat4956 (2018).
- 55. Mimitou E. P., Cheng A., Montalbano A., Hao S., Stoeckius M., Legut M., Roush T., Herrera A., Papalexi E., Ouyang Z., Satija R., Sanjana N. E., Koralov S. B., Smibert P., Multiplexed detection of proteins, transcriptomes, clonotypes and CRISPR perturbations in single cells. Nat. Methods 16, 409–412 (2019).
- 56. Buggert M., Nguyen S., de Oca G. S.-M., Bengsch B., Darko S., Ransier A., Roberts E. R., Alcazar D. D., Brody I. B., Vella L. A., Beura L., Wijeyesinghe S., Herati R. S., Estrada P. M. D. R., Ablanedo-Terrazas Y., Kuri-Cervantes L., Japp A. S., Manne S., Vartanian S., Huffman A., Sandberg J. K., Gostick E., Nadolski G., Silvestri G., Canaday D. H., Price D. A., Petrovas C., Su L. F., Vahedi G., Dori Y., Frank I., Itkin M. G., Wherry E. J., Deeks S. G., Naji A., Reyes-Terán G., Masopust D., Douek D. C., Betts M. R., Identification and characterization of HIV-specific resident memory CD8+ T cells in human lymphoid tissue. Sci. Immunol. 3, eaar4526 (2018).
- 57. de Paula Alves Sousa A., Johnson K. R., Ohayon J., Zhu J., Muraro P. A., Jacobson S., Comprehensive analysis of TCR-β repertoire in patients with neurological immune-mediated disorders. Sci. Rep. 9, 344 (2019).
- 58. Cloughesy T. F., Mochizuki A. Y., Orpilla J. R., Hugo W., Lee A. H., Davidson T. B., Wang A. C., Ellingson B. M., Rytlewski J. A., Sanders C. M., Kawaguchi E. S., Du L., Li G., Yong W. H., Gaffey S. C., Cohen A. L., Mellinghoff I. K., Lee E. Q., Reardon D. A., O’Brien B. J., Butowski N. A., Nghiemphu P. L., Clarke J. L., Arrillaga-Romany I. C., Colman H., Kaley T. J., de Groot J. F., Liau L. M., Wen P. Y., Prins R. M., Neoadjuvant anti-PD-1 immunotherapy promotes a survival benefit with intratumoral and systemic immune responses in recurrent glioblastoma. Nat. Med. 25, 477–486 (2019).
- 59. Martino D., Neeland M., Dang T., Cobb J., Ellis J., Barnett A., Tang M., Vuillermin P., Allen K., Saffery R., Epigenetic dysregulation of naive CD4+ T-cell activation genes in childhood food allergy. Nat. Commun. 9, 3308 (2018).
- 60.Kagoya Y., Nakatsugawa M., Saso K., Guo T., Anczurowski M., Wang C.-H., Butler M. O., Arrowsmith C. H., Hirano N., DOT1L inhibition attenuates graft-versus-host disease by allogeneic T cells in adoptive immunotherapy models. Nat. Commun. 9, 1915 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 61. Wu J., Jia S., Wang C., Zhang W., Liu S., Zeng X., Mai H., Yuan X., Du Y., Wang X., Hong X., Li X., Wen F., Xu X., Pan J., Li C., Liu X., Minimal residual disease detection and evolved IGH clones analysis in acute B lymphoblastic leukemia using IGH deep sequencing. Front. Immunol. 7, 403 (2016).
- 62. Zhang W., Du Y., Su Z., Wang C., Zeng X., Zhang R., Hong X., Nie C., Wu J., Cao H., Xu X., Liu X., IMonitor: A robust pipeline for TCR and BCR repertoire analysis. Genetics 201, 459–472 (2015).
- 63. Yost K. E., Satpathy A. T., Wells D. K., Qi Y., Wang C., Kageyama R., McNamara K. L., Granja J. M., Sarin K. Y., Brown R. A., Gupta R. K., Curtis C., Bucktrout S. L., Davis M. M., Chang A. L. S., Chang H. Y., Clonal replacement of tumor-specific T cells following PD-1 blockade. Nat. Med. 25, 1251–1259 (2019).
- 64. Song I., Gil A., Mishra R., Ghersi D., Selin L. K., Stern L. J., Broad TCR repertoire and diverse structural solutions for recognition of an immunodominant CD8+ T cell epitope. Nat. Struct. Mol. Biol. 24, 395–406 (2017).
- 65. Stromnes I. M., Hulbert A., Pierce R. H., Greenberg P. D., Hingorani S. R., T-cell localization, activation, and clonal expansion in human pancreatic ductal adenocarcinoma. Cancer Immunol. Res. 5, 978–991 (2017).
- 66. Spreafico R., Rossetti M., van Loosdregt J., Wallace C. A., Massa M., Magni-Manzoni S., Gattorno M., Martini A., Lovell D. J., Albani S., A circulating reservoir of pathogenic-like CD4+ T cells shares a genetic and phenotypic signature with the inflamed synovial micro-environment. Ann. Rheum. Dis. 75, 459–465 (2016).
- 67. Carey A. J., Hope J. L., Mueller Y. M., Fike A. J., Kumova O. K., van Zessen D. B. H., Steegers E. A. P., van der Burg M., Katsikis P. D., Public clonotypes and convergent recombination characterize the naïve CD8+ T-cell receptor repertoire of extremely preterm neonates. Front. Immunol. 8, 1859 (2017).
- 68. Abdel-Hakeem M. S., Boisvert M., Bruneau J., Soudeyns H., Shoukry N. H., Selective expansion of high functional avidity memory CD8 T cell clonotypes during hepatitis C virus reinfection and clearance. PLOS Pathog. 13, e1006191 (2017).
- 69. Rossetti M., Spreafico R., Consolaro A., Leong J. Y., Chua C., Massa M., Saidin S., Magni-Manzoni S., Arkachaisri T., Wallace C. A., Gattorno M., Martini A., Lovell D. J., Albani S., TCR repertoire sequencing identifies synovial Treg cell clonotypes in the bloodstream during active inflammation in human arthritis. Ann. Rheum. Dis. 76, 435–441 (2017).
- 70. Suessmuth Y., Mukherjee R., Watkins B., Koura D. T., Finstermeier K., Desmarais C., Stempora L., Horan J. T., Langston A., Qayed M., Khoury H. J., Grizzle A., Cheeseman J. A., Conger J. A., Robertson J., Garrett A., Kirk A. D., Waller E. K., Blazar B. R., Mehta A. K., Robins H. S., Kean L. S., CMV reactivation drives posttransplant T-cell reconstitution and results in defects in the underlying TCRβ repertoire. Blood 125, 3835–3850 (2015).
- 71. Hsu M., Sedighim S., Wang T., Antonios J. P., Everson R. G., Tucker A. M., Du L., Emerson R., Yusko E., Sanders C., Robins H. S., Yong W. H., Davidson T. B., Li G., Liau L. M., Prins R. M., TCR sequencing can identify and track glioma-infiltrating T cells after DC vaccination. Cancer Immunol. Res. 4, 412–418 (2016).
- 72. Beausang J. F., Wheeler A. J., Chan N. H., Hanft V. R., Dirbas F. M., Jeffrey S. S., Quake S. R., T cell receptor sequencing of early-stage breast cancer tumors identifies altered clonal structure of the T cell repertoire. Proc. Natl. Acad. Sci. U.S.A. 114, E10409–E10417 (2017).
- 73. Gomez-Tourino I., Kamra Y., Baptista R., Lorenc A., Peakman M., T cell receptor β-chains display abnormal shortening and repertoire sharing in type 1 diabetes. Nat. Commun. 8, 1792 (2017).
- 74. Keane C., Gould C., Jones K., Hamm D., Talaulikar D., Ellis J., Vari F., Birch S., Han E., Wood P., Le-Cao K.-A., Green M. R., Crooks P., Jain S., Tobin J., Steptoe R. J., Gandhi M. K., The T-cell receptor repertoire influences the tumor microenvironment and is associated with survival in aggressive B-cell lymphoma. Clin. Cancer Res. 23, 1820–1828 (2017).
- 75. Page D. B., Yuan J., Redmond D., Wen Y. H., Durack J. C., Emerson R., Solomon S., Dong Z., Wong P., Comstock C., Diab A., Sung J., Maybody M., Morris E., Brogi E., Morrow M., Sacchini V., Elemento O., Robins H., Patil S., Allison J. P., Wolchok J. D., Hudis C., Norton L., McArthur H. L., Deep sequencing of T-cell receptor DNA as a biomarker of clonally expanded TILs in breast cancer after immunotherapy. Cancer Immunol. Res. 4, 835–844 (2016).
- 76. Wu D., Sherwood A., Fromm J. R., Winter S. S., Dunsmore K. P., Loh M. L., Greisman H. A., Sabath D. E., Wood B. L., Robins H., High-throughput sequencing detects minimal residual disease in acute T lymphoblastic leukemia. Sci. Transl. Med. 4, 134ra63 (2012).
- 77. Seay H. R., Yusko E., Rothweiler S. J., Zhang L., Posgai A. L., Campbell-Thompson M., Vignali M., Emerson R. O., Kaddis J. S., Ko D., Nakayama M., Smith M. J., Cambier J. C., Pugliese A., Atkinson M. A., Robins H. S., Brusko T. M., Tissue distribution and clonal diversity of the T and B cell repertoire in type 1 diabetes. JCI Insight 1, e88242 (2016).
- 78. Emerson R. O., DeWitt W. S., Vignali M., Gravley J., Hu J. K., Osborne E. J., Desmarais C., Klinger M., Carlson C. S., Hansen J. A., Rieder M., Robins H. S., Immunosequencing identifies signatures of cytomegalovirus exposure history and HLA-mediated effects on the T cell repertoire. Nat. Genet. 49, 659–665 (2017).
- 79. Leader A. M., Grout J. A., Maier B. B., Nabet B. Y., Park M. D., Tabachnikova A., Chang C., Walker L., Lansky A., Berichel J. L., Troncoso L., Malissen N., Davila M., Martin J. C., Magri G., Tuballes K., Zhao Z., Petralia F., Samstein R., D’Amore N. R., Thurston G., Kamphorst A. O., Wolf A., Flores R., Wang P., Müller S., Mellman I., Beasley M. B., Salmon H., Rahman A. H., Marron T. U., Kenigsberg E., Merad M., Single-cell analysis of human non-small cell lung cancer lesions refines tumor classification and patient stratification. Cancer Cell 39, 1594–1609.e12 (2021).
- 80. Zhao Y., Kilian C., Turner J.-E., Bosurgi L., Roedl K., Bartsch P., Gnirck A.-C., Cortesi F., Schultheiß C., Hellmig M., Enk L. U. B., Hausmann F., Borchers A., Wong M. N., Paust H.-J., Siracusa F., Scheibel N., Herrmann M., Rosati E., Bacher P., Kylies D., Jarczak D., Lütgehetmann M., Pfefferle S., Steurer S., Zur-Wiesch J. S., Puelles V. G., Sperhake J.-P., Addo M. M., Lohse A. W., Binder M., Huber S., Huber T. B., Kluge S., Bonn S., Panzer U., Gagliani N., Krebs C. F., Clonal expansion and activation of tissue-resident memory-like TH17 cells expressing GM-CSF in the lungs of severe COVID-19 patients. Sci. Immunol. 6, eabf6692 (2021).
- 81. Tang Y., Kwiatkowski D. J., Henske E. P., Midkine expression by stem-like tumor cells drives persistence to mTOR inhibition and an immune-suppressive microenvironment. Nat. Commun. 13, 5018 (2022).
- 82. Banta K. L., Xu X., Chitre A. S., Au-Yeung A., Takahashi C., O’Gorman W. E., Wu T. D., Mittman S., Cubas R., Comps-Agrar L., Fulzele A., Bennett E. J., Grogan J. L., Hui E., Chiang E. Y., Mellman I., Mechanistic convergence of the TIGIT and PD-1 inhibitory pathways necessitates co-blockade to optimize anti-tumor CD8+ T cell responses. Immunity 55, 512–526.e9 (2022).
- 83. Mahuron K. M., Moreau J. M., Glasgow J. E., Boda D. P., Pauli M. L., Gouirand V., Panjabi L., Grewal R., Luber J. M., Mathur A. N., Feldman R. M., Shifrut E., Mehta P., Lowe M. M., Alvarado M. D., Marson A., Singer M., Wells J., Jupp R., Daud A. I., Rosenblum M. D., Layilin augments integrin activation to promote antitumor immunity. J. Exp. Med. 217, e20192080 (2020).
- 84. Liao M., Liu Y., Yuan J., Wen Y., Xu G., Zhao J., Cheng L., Li J., Wang X., Wang F., Liu L., Amit I., Zhang S., Zhang Z., Single-cell landscape of bronchoalveolar immune cells in patients with COVID-19. Nat. Med. 26, 842–844 (2020).
- 85. Biermann J., Melms J. C., Amin A. D., Wang Y., Caprio L. A., Karz A., Tagore S., Barrera I., Ibarra-Arellano M. A., Andreatta M., Fullerton B. T., Gretarsson K. H., Sahu V., Mangipudy V. S., Nguyen T. T. T., Nair A., Rogava M., Ho P., Koch P. D., Banu M., Humala N., Mahajan A., Walsh Z. H., Shah S. B., Vaccaro D. H., Caldwell B., Mu M., Wünnemann F., Chazotte M., Berhe S., Luoma A. M., Driver J., Ingham M., Khan S. A., Rapisuwon S., Slingluff C. L., Eigentler T., Röcken M., Carvajal R., Atkins M. B., Davies M. A., Agustinus A., Bakhoum S. F., Azizi E., Siegelin M., Lu C., Carmona S. J., Hibshoosh H., Ribas A., Canoll P., Bruce J. N., Bi W. L., Agrawal P., Schapiro D., Hernando E., Macosko E. Z., Chen F., Schwartz G. K., Izar B., Dissecting the treatment-naive ecosystem of human melanoma brain metastasis. Cell 185, 2591–2608.e30 (2022).
- 86. Caushi J. X., Zhang J., Ji Z., Vaghasia A., Zhang B., Hsiue E. H.-C., Mog B. J., Hou W., Justesen S., Blosser R., Tam A., Anagnostou V., Cottrell T. R., Guo H., Chan H. Y., Singh D., Thapa S., Dykema A. G., Burman P., Choudhury B., Aparicio L., Cheung L. S., Lanis M., Belcaid Z., Asmar M. E., Illei P. B., Wang R., Meyers J., Schuebel K., Gupta A., Skaist A., Wheelan S., Naidoo J., Marrone K. A., Brock M., Ha J., Bush E. L., Park B. J., Bott M., Jones D. R., Reuss J. E., Velculescu V. E., Chaft J. E., Kinzler K. W., Zhou S., Vogelstein B., Taube J. M., Hellmann M. D., Brahmer J. R., Merghoub T., Forde P. M., Yegnasubramanian S., Ji H., Pardoll D. M., Smith K. N., Transcriptional programs of neoantigen-specific TIL in anti-PD-1-treated lung cancers. Nature 596, 126–132 (2021).
- 87. Chandran S. S., Ma J., Klatt M. G., Dündar F., Bandlamudi C., Razavi P., Wen H. Y., Weigelt B., Zumbo P., Fu S. N., Banks L. B., Yi F., Vercher E., Etxeberria I., Bestman W. D., Paula A. D. C., Aricescu I. S., Drilon A., Betel D., Scheinberg D. A., Baker B. M., Klebanoff C. A., Immunogenicity and therapeutic targeting of a public neoantigen derived from mutated PIK3CA. Nat. Med. 28, 946–957 (2022).
- 88. Luoma A. M., Suo S., Wang Y., Gunasti L., Porter C. B. M., Nabilsi N., Tadros J., Ferretti A. P., Liao S., Gurer C., Chen Y.-H., Criscitiello S., Ricker C. A., Dionne D., Rozenblatt-Rosen O., Uppaluri R., Haddad R. I., Ashenberg O., Regev A., Allen E. M., MacBeath G., Schoenfeld J. D., Wucherpfennig K. W., Tissue-resident memory and circulating T cells are early responders to pre-surgical cancer immunotherapy. Cell 185, 2918–2935.e29 (2022).
- 89. Zheng Y., Chen Z., Han Y., Han L., Zou X., Zhou B., Hu R., Hao J., Bai S., Xiao H., Li W. V., Bueker A., Ma Y., Xie G., Yang J., Chen S., Li H., Cao J., Shen L., Immune suppressive landscape in the human esophageal squamous cell carcinoma microenvironment. Nat. Commun. 11, 6268 (2020).
- 90. Han L., Chen S., Chen Z., Zhou B., Zheng Y., Shen L., Interleukin 32 promotes Foxp3+ Treg cell development and CD8+ T cell function in human esophageal squamous cell carcinoma microenvironment. Front. Cell Dev. Biol. 9, 704853 (2021).
- 91. Anadon C. M., Yu X., Hänggi K., Biswas S., Chaurio R. A., Martin A., Payne K. K., Mandal G., Innamarato P., Harro C. M., Mine J. A., Sprenger K. B., Cortina C., Powers J. J., Costich T. L., Perez B. A., Gatenbee C. D., Prabhakaran S., Marchion D., Heemskerk M. H. M., Curiel T. J., Anderson A. R., Wenham R. M., Rodriguez P. C., Conejo-Garcia J. R., Ovarian cancer immunogenicity is governed by a narrow subset of progenitor tissue-resident memory T cells. Cancer Cell 40, 545–557.e13 (2022).
- 92. Anadon C. M., Zhang C., Wang X., Cen L., Conejo-Garcia J. R., Yu X., Protocol for the isolation of CD8+ tumor-infiltrating lymphocytes from human tumors and their characterization by single-cell immune profiling and multiome. STAR Protoc. 3, 101649 (2022).
- 93. Heming M., Li X., Räuber S., Mausberg A. K., Börsch A.-L., Hartlehnert M., Singhal A., Lu I.-N., Fleischer M., Szepanowski F., Witzke O., Brenner T., Dittmer U., Yosef N., Kleinschnitz C., Wiendl H., Stettner M., Zu Hörste G. M., Neurological manifestations of COVID-19 feature T cell exhaustion and dedifferentiated monocytes in cerebrospinal fluid. Immunity 54, 164–175.e6 (2021).
- 94. Gueguen P., Metoikidou C., Dupic T., Lawand M., Goudot C., Baulande S., Lameiras S., Lantz O., Girard N., Seguin-Givelet A., Lefevre M., Mora T., Walczak A. M., Waterfall J. J., Amigorena S., Contribution of resident and circulating precursors to tumor-infiltrating CD8+ T cell populations in lung cancer. Sci. Immunol. 6, eabd5778 (2021).
- 95. Kourtis N., Wang Q., Wang B., Oswald E., Adler C., Cherravuru S., Malahias E., Zhang L., Golubov J., Wei Q., Lemus S., Ni M., Ding Y., Wei Y., Atwal G. S., Thurston G., Macdonald L. E., Murphy A. J., Dhanik A., Sleeman M. A., Tykodi S. S., Skokos D., A single-cell map of dynamic chromatin landscapes of immune cells in renal cell carcinoma. Nat. Cancer 3, 885–898 (2022).
- 96. Wang Z., Xie L., Ding G., Song S., Chen L., Li G., Xia M., Han D., Zheng Y., Liu J., Xiao T., Zhang H., Huang Y., Li Y., Huang M., Single-cell RNA sequencing of peripheral blood mononuclear cells from acute Kawasaki disease patients. Nat. Commun. 12, 5444 (2021).
- 97. Shi X., Li Z., Yao R., Cheng Q., Li W., Wu R., Xie Z., Zhu Y., Qiu X., Yang S., Zhou T., Hu J., Zhang Y., Wu T., Zhao Y., Zhang Y., Wu J., Wang H., Jiang X., Chen L., Single-cell atlas of diverse immune populations in the advanced biliary tract cancer microenvironment. npj Precis. Oncol. 6, 58 (2022).
- 98. Ferreira-Gomes M., Kruglov A., Durek P., Heinrich F., Tizian C., Heinz G. A., Pascual-Reguant A., Du W., Mothes R., Fan C., Frischbutter S., Habenicht K., Budzinski L., Ninnemann J., Jani P. K., Guerra G. M., Lehmann K., Matz M., Ostendorf L., Heiberger L., Chang H.-D., Bauherr S., Maurer M., Schönrich G., Raftery M., Kallinich T., Mall M. A., Angermair S., Treskatsch S., Dörner T., Corman V. M., Diefenbach A., Volk H.-D., Elezkurtaj S., Winkler T. H., Dong J., Hauser A. E., Radbruch H., Witkowski M., Melchers F., Radbruch A., Mashreghi M.-F., SARS-CoV-2 in severe COVID-19 induces a TGF-β-dominated chronic immune response that does not target itself. Nat. Commun. 12, 1961 (2021).
- 99. Ramaswamy A., Brodsky N. N., Sumida T. S., Comi M., Asashima H., Hoehn K. B., Li N., Liu Y., Shah A., Ravindra N. G., Bishai J., Khan A., Lau W., Sellers B., Bansal N., Guerrerio P., Unterman A., Habet V., Rice A. J., Catanzaro J., Chandnani H., Lopez M., Kaminski N., Cruz C. S. D., Tsang J. S., Wang Z., Yan X., Kleinstein S. H., van Dijk D., Pierce R. W., Hafler D. A., Lucas C. L., Immune dysregulation and autoreactivity correlate with disease severity in SARS-CoV-2-associated multisystem inflammatory syndrome in children. Immunity 54, 1083–1095.e7 (2021).
- 100. Gaydosik A. M., Stonesifer C. J., Khaleel A. E., Geskin L. J., Fuschiotti P., Single-cell RNA sequencing unveils the clonal and transcriptional landscape of cutaneous T-cell lymphomas. Clin. Cancer Res. 28, 2610–2622 (2022).
- 101. Eberhardt C. S., Kissick H. T., Patel M. R., Cardenas M. A., Prokhnevska N., Obeng R. C., Nasti T. H., Griffith C. C., Im S. J., Wang X., Shin D. M., Carrington M., Chen Z. G., Sidney J., Sette A., Saba N. F., Wieland A., Ahmed R., Functional HPV-specific PD-1+ stem-like CD8 T cells in head and neck cancer. Nature 597, 279–284 (2021).
- 102. Corridoni D., Antanaviciute A., Gupta T., Fawkner-Corbett D., Aulicino A., Jagielowicz M., Parikh K., Repapi E., Taylor S., Ishikawa D., Hatano R., Yamada T., Xin W., Slawinski H., Bowden R., Napolitani G., Brain O., Morimoto C., Koohy H., Simmons A., Single-cell atlas of colonic CD8+ T cells in ulcerative colitis. Nat. Med. 26, 1480–1490 (2020).
- 103. Gao S., Wu Z., Arnold B., Diamond C., Batchu S., Giudice V., Alemu L., Raffo D. Q., Feng X., Kajigaya S., Barrett J., Ito S., Young N. S., Single-cell RNA sequencing coupled to TCR profiling of large granular lymphocyte leukemia T cells. Nat. Commun. 13, 1982 (2022).
- 104. Saluzzo S., Pandey R. V., Gail L. M., Dingelmaier-Hovorka R., Kleissl L., Shaw L., Reininger B., Atzmüller D., Strobl J., Touzeau-Römer V., Beer A., Staud C., Rieger A., Farlik M., Weninger W., Stingl G., Stary G., Delayed antiretroviral therapy in HIV-infected individuals leads to irreversible depletion of skin- and mucosa-resident memory T cells. Immunity 54, 2842–2858.e5 (2021).
- 105. Hu Y., Cao G., Chen X., Huang X., Asby N., Ankenbruck N., Rahman A., Thusu A., He Y., Riedell P. A., Bishop M. R., Schreiber H., Kline J. P., Huang J., Antigen multimers: Specific, sensitive, precise, and multifunctional high-avidity CAR-staining reagents. Matter 4, 3917–3940 (2021).
- 106. Borcherding N., Vishwakarma A., Voigt A. P., Bellizzi A., Kaplan J., Nepple K., Salem A. K., Jenkins R. W., Zakharia Y., Zhang W., Mapping the immune environment in clear cell renal carcinoma by single-cell genomics. Commun. Biol. 4, 122 (2021).
- 107. Cheon I. S., Li C., Son Y. M., Goplen N. P., Wu Y., Cassmann T., Wang Z., Wei X., Tang J., Li Y., Marlow H., Hughes S., Hammel L., Cox T. M., Goddery E., Ayasoufi K., Weiskopf D., Boonyaratanakornkit J., Dong H., Li H., Chakraborty R., Johnson A. J., Edell E., Taylor J. J., Kaplan M. H., Sette A., Bartholmai B. J., Kern R., Vassallo R., Sun J., Immune signatures underlying post-acute COVID-19 lung sequelae. Sci. Immunol. 6, eabk1741 (2021).
Supplementary Materials
Figs. S1 to S8
Tables S1 to S3
References