Detection of Enriched T Cell Epitope Specificity in Full T Cell Receptor Sequence Repertoires

Sofie Gielis; Pieter Moris; Wout Bittremieux; Nicolas De Neuter; Benson Ogunjimi; Kris Laukens; Pieter Meysman

doi:10.3389/fimmu.2019.02820

. 2019 Nov 29;10:2820. doi: 10.3389/fimmu.2019.02820

Detection of Enriched T Cell Epitope Specificity in Full T Cell Receptor Sequence Repertoires

Sofie Gielis ^1,^2,³, Pieter Moris ^1,^3,^†, Wout Bittremieux ^1,^3,^4,^†, Nicolas De Neuter ^1,^2,³, Benson Ogunjimi ^2,^5,^6,⁷, Kris Laukens ^1,^2,^3,^‡, Pieter Meysman ^1,^2,^3,^*,^‡

PMCID: PMC6896208 PMID: 31849987

Abstract

High-throughput T cell receptor (TCR) sequencing allows the characterization of an individual's TCR repertoire and directly queries their immune state. However, it remains a non-trivial task to couple these sequenced TCRs to their antigenic targets. In this paper, we present a novel strategy to annotate full TCR sequence repertoires with their epitope specificities. The strategy is based on a machine learning algorithm to learn the TCR patterns common to the recognition of a specific epitope. These results are then combined with a statistical analysis to evaluate the occurrence of specific epitope-reactive TCR sequences per epitope in repertoire data. In this manner, we can directly study the capacity of full TCR repertoires to target specific epitopes of the relevant vaccines or pathogens. We demonstrate the usability of this approach on three independent datasets related to vaccine monitoring and infectious disease diagnostics by independently identifying the epitopes that are targeted by the TCR repertoire. The developed method is freely available as a web tool for academic use at tcrex.biodatamining.be.

Keywords: TCR repertoire analysis, epitope specificity, enrichment analysis, immunoinformatics, vaccines, infectious disease, cytomegalovirus (CMV), yellow fever virus (YFV)

Introduction

T cells form an important part of the adaptive immune system as they can recognize potentially pathogen-derived or aberrant peptides (epitopes), which are presented by major histocompatibility complex (MHC) molecules on the cell surface of nucleated host cells or professional antigen-presenting cells, and induce an immune response. The T cell receptor (TCR) molecule is responsible for the recognition of the MHC presented epitope. Each TCR protein is encoded by a genomic region that undergoes non-homologous recombination during T cell maturation, a process termed VDJ recombination. The randomness of the recombination process ensures the production of many different TCR proteins, with each T cell clone expressing a particular TCR, and allows the recognition of different epitopes. Epitope binding by the TCR is a critical step for the activation of targeted immune responses.

As high-throughput sequencing of TCR repertoires becomes more common, the need for more advanced analysis methods is increasing. A major challenge involves the functional identification of activated T cells from TCR repertoire data. To date, considerable efforts have been made to address this problem either by the detection of proliferated T cell clones, through the comparison of the abundance of TCR clonotypes between different time points (1, 2), or by the identification of enriched groups of similar T cell clones using VDJ recombination probabilities (3). Along with these identification methods, several strategies have been used to compare the TCR repertoires of pathology-associated tissues with healthy tissues or peripheral blood, for example in cancer (4, 5). However, the current identification methods lack the ability to discriminate disease-reactive T cells from other T cells that play a role in concurrent, non-related immune responses, thus requiring additional experiments. Different experimental strategies are currently available to determine epitope-specific TCRs, i.e., TCRs that recognize the epitope of interest, such as epitope-MHC multimer assays and peptide stimulation experiments. However, these techniques are not error-free as they can miss important epitope-specific TCRs (6, 7) or falsely report non-binding TCRs to be epitope specific (8). Nevertheless, these methods have been extensively used to generate a large amount of epitope-specific TCR data which has been collected in public databases such as the VDJdb (9) and McPAS-TCR (10).

Here, we present a new strategy to study T cell responses based on the direct identification of epitope-specific TCRs from TCR repertoire data. To determine the epitope specificity of TCRs, machine learning models have been developed to analyze TCR sequences and predict the probability that they recognize and bind specific epitopes. These methods are based on the principle that similar TCR sequences often target the same epitope (11) and that machine learning techniques can be used to learn the molecular underpinnings that are shared by these epitope-specific TCR sequences (12–14). While these methods have been shown to be performant on small targeted datasets, their application on full repertoire datasets remains challenging.

In this paper, we specifically address two issues that currently limit the use of prediction methods on full TCR repertoires. The first is the uncertainty on where to place the cut-off for a correct prediction. In other words, how confident must the prediction be before we accept it as the truth. If the cut-off is chosen too strictly, we may miss many epitope-specific T cells. Conversely, if it is chosen too low, we increase the number of TCRs that are mistakenly identified as epitope-binding. To address this first issue, the strategy proposed here includes the estimation of a model-specific baseline prediction rate, which can be used to set the prediction cut-off in an informed manner. The second issue is that the presence of epitope-specific TCRs is not restricted to TCR repertoires that have been stimulated by an epitope, since they can also occur in small numbers in naïve, healthy TCR repertoires. Therefore, the identification of single epitope-specific TCR sequences by itself does not provide sufficient information on the capacity of the full TCR repertoire to target specific epitopes. Current prediction strategies do not evaluate whether a TCR repertoire as a whole contains more TCRs against a specific epitope than expected in a healthy TCR repertoire that is not enriched for these TCRs. To address this second issue, we introduce an enrichment analysis strategy to identify those specific epitopes that are being targeted within a given TCR repertoire.

We built upon the epitope-specific TCR classifier described in De Neuter et al. (12) and developed an approach designed for the identification of various epitope-specific TCRs in full repertoire datasets. We have applied this new approach to three independent datasets (1, 2, 15), which have previously been analyzed with the traditional TCR repertoire data analyses described above. In this manner, we show that by adding this layer of epitope information, we are able to link T cell data to broader immune repertoire targets. The described method is available as a web tool at tcrex.biodatamining.be.

Materials and Methods

Collection of Training Data and Assembly of the Background TCR Dataset

A positive training dataset was constructed containing human epitope-specific TCR beta chain sequences collected from the manually curated catalog of pathology-associated T cell receptor sequences [McPAS-TCR (10)] (11 339 TCR-pathology combinations) and the VDJ database [VDJdb (9)] (17 792 TCR-epitope combinations) on November 18, 2018. To train the prediction models, both the CDR3 beta amino acid sequence and the V/J genes were gathered. The following quality filtering steps were applied for the McPAS-TCR dataset: (1) retaining epitope-specific TCRs determined with peptide-MHC tetramers or peptide stimulation and (2) removal of TCRs with missing information (i.e., CDR3 beta sequence, V/J genes or the specific epitope), reconstructed J genes, V/J genes with special characters that could not be matched to known V/J genes, CDR3 beta sequences with lower case amino acids or non-amino acid characters, TCRs with additional quality remarks and TCRs from studies using mouse strains. The dataset was further extended with 550 TCR-epitope combinations found in the published literature (13, 16–18). Standard TCRex filtering steps were carried out (Supplemental Material S1). Furthermore, entries with multiple V genes were split into separate entries, each having one of the V genes. For the retained TCR beta sequences, we checked whether their V/J genes were present in the IMGT format as formulated by the IMGT database (19). The latter was necessary due to possible inconsistencies in the way V/J genes were reported across studies. All V/J genes not reported in the IMGT format were corrected using our in-house IMGT parser. Finally, duplicate TCR-epitope combinations across different sources were removed. Only beta chain TCR sequence data for epitopes having at least 30 unique epitope-specific TCRs was retained in order to ensure the presence of an adequate number of TCRs to train and validate the classifiers. The final dataset contained 18 679 unique TCR-epitope combinations.

The control training dataset was designed to contain non-binding TCRs for each epitope. As one epitope can be presented by more than one MHC-molecule, and one TCR sequence might interact with different MHC molecules (20), we did not take the epitopes' MHC background into account. Instead, we assembled one control dataset for all epitope-specific models by collecting TCR beta chain sequences from healthy TCR repertoires from the study by Dean et al. (21). When considering a certain epitope, the vast majority of TCRs in healthy TCR repertoires will not target it. However, it is not unlikely that these repertoires contain a few TCRs that do target this specific epitope. The presence of these TCRs in our control dataset and their effect on the performance of the prediction models is expected to be negligible due to the low probability of their inclusion. In total, bulk TCR beta chain repertoire data from 587 healthy individuals was available. We randomly selected 1500 TCR beta sequences from the repertoire of each of these individuals. Subsequently standard TCRex filtering steps (Supplemental Material S1) were applied and the V/J genes of the remaining TCR beta chain sequences were transformed to the IMGT format. From this final dataset, 250 000 unique TCR beta chain sequences were randomly selected to make up the negative training dataset and 100 000 unique TCR beta chain sequences were randomly selected to represent the background dataset needed for the calculation of the baseline prediction rate. We ensured that the negative training and background dataset contained TCRs with different CDR3 beta sequences.

Collection of TCR Datasets From Infectious-Disease Studies

To evaluate the use of our epitope-specific models for the functional annotation of TCR repertoires, we collected TCR data from two independent yellow fever virus (YFV) vaccination studies (1, 2) and one cytomegalovirus (CMV) study (15). From the first YFV study (1), pre- (day 0) and post-vaccination (day 14) bulk TCR beta repertoire data (i.e., derived from peripheral blood mononuclear cells or PBMC) was collected from nine volunteers in the immunoSEQ Analyzer format. This study also reported TCRs from activated T cells in the post-vaccination repertoires for each of the volunteers [“CD3⁺CD8⁺CD14⁻ CD19⁻CD38⁺HLA-DR⁺ Ag-experienced, activated effector T cells” (1)]. From the second YFV study (2), pre- (day 0) and post-vaccination (day 15) bulk TCR repertoire data (i.e., derived from PBMC) was collected from three pairs of twins¹ In case biological replicates were present for a volunteer at a given time point, these were merged. As the TCR repertoires from this study were reported in a MiXCR file format that did not follow the standard TCRex input requirements (i.e., two additional columns reporting the best V gene and the best J gene, respectively, for each TCR sequence), we parsed these files and selected the most likely V/J gene for every TCR sequence based on the provided alignment score. For those TCRs with several V/J genes tied for the highest score, only the first one was retained. The second YFV study (2) also reported a list of vaccine-associated TCR beta sequences that were identified in the volunteers using a statistical analysis based on clonal expansion. This list was used to compare the number of identified TCRs between TCRex and the YFV study. From the CMV study (15), we gathered a dataset of 164 TCRs that were reported to be CMV-associated following a statistical analysis of a large group (>600) of healthy CMV+ and CMV- volunteers.

Model Training and Performance Evaluation

For each epitope present in the collected training dataset, we trained a random forest model to identify epitope-binding TCRs in a TCR repertoire dataset. These machine learning models have the ability to identify TCRs from multiple epitope-specific clusters (Supplemental Material S2). Training of epitope-specific prediction models was based on the method established in De Neuter et al. (12). This method was shown to perform comparably to other state-of-the-art classifiers in a recent independent study (22). In brief, the amino acid sequences of the CDR3 regions of the TCR beta chains were converted into physicochemical features and the beta chain's V/J genes and families were one-hot encoded. For each epitope in the collected training data, a random forest classifier was trained with 100 individual decision trees using the scikit-learn library (23). We sampled 10 times more negative training data than positive training data from the negative control set of 250 000 sequences, in order to mimic the natural imbalance of epitope-specific T cells in full TCR repertoires. TCR beta chain sequences that occurred in both the positive training and negative training dataset were removed from the negative training dataset to avoid ambiguously labeled TCR beta chain sequences. Given a TCR beta chain sequence, each model returns the confidence value of binding a specific epitope. All models were evaluated using a stratified 5-fold cross-validation strategy. With this validation strategy we obtained the receiver operating characteristic (ROC) curve, precision-recall (PR) curve, the balanced accuracy, the area under the ROC curve (AUC) value and the average precision value for each classifier as implemented in the scikit-learn library (23). Models that had poor AUC (<0.7) or poor average precision (<0.35) values were excluded from the model collection.

Controlling the Baseline Prediction Rate

The class probability score for a TCR beta chain sequence is specific for each epitope model and does not take false positive predictions into account. To control the number of false positives, we provide each TCR sequence with a baseline prediction rate (BPR) value along with its class probability score. This BPR value represents the fraction of epitope-specific TCRs that can be detected in a background dataset (i.e., a control dataset representing a normal, healthy TCR repertoire) at a certain confidence cut-off. Since background datasets are expected to contain only a limited number of epitope-specific TCRs, this BPR can also be used as an estimate for the false positive rate (FPR). For every TCR sequence, its class probability score is compared to the class probability scores of the 100 000 unique background TCR beta chain sequences. A BPR value is then calculated for each TCR sequence as the fraction of randomly selected background sequences with a class probability score greater than or equal to the score of the TCR sequence. Thus, for a desired BPR threshold, a confidence cut-off c can be calculated so that:

arg \underset{0 \leq c \leq 1}{m i n} \frac{| p (B) \geq c |}{| B |} \leq B P R

Where p is the prediction function and B is the background dataset. This allows us to filter the prediction results with a custom threshold across different epitope models, by using a calculated cut-off c that is specific for each model. Furthermore, this allows a direct interpretation of the cut-off with regards to the expected number of false positive hits for a typical TCR sequence dataset.

Epitope Specificity Enrichment Analysis

In addition to the detection of epitope-specific TCR beta chain sequences, we implemented a method to identify those epitopes for which we find a significantly higher number of unique TCR sequences in a target repertoire than expected in a representative control TCR repertoire. This analysis is performed by comparing the abundance of the epitope-specific TCR sequences in the TCR repertoire to a hypothetical dataset of randomly selected TCRs, i.e., the background dataset. TCRex does not use any information on the frequency of the TCR clonotypes in this comparison: only the number of unique predicted epitope-specific TCRs are considered. This enrichment analysis strategy uses a BPR-based enrichment threshold that represents the percentage of unique epitope-specific TCRs identified in the background repertoire at the chosen BPR threshold. For each epitope of interest, a one-sided exact binomial test is used to compare the number of TCR sequences with a BPR score smaller or equal to the specified enrichment threshold. More specifically, we test whether we can reject the null hypothesis that the observed number of identified TCRs follows the binomial distribution X ~ B(n, p) and thus accept the alternative hypothesis that the abundance of TCRs specific for the epitope in the studied TCR repertoire exceeds the chosen enrichment threshold. Here, X is the number of unique identified TCRs specific for the studied epitope at a chosen enrichment threshold, n the size of the TCR repertoire and p the chosen enrichment threshold. The latter represents the percentage of epitope-specific TCRs that can be identified in a hypothetical background dataset at a specific BPR threshold. Since one false positive hit (i.e., a TCR sequence with a BPR score below the BPR threshold that does not recognize the epitope) is possible at every BPR threshold, we restrict the statistical analysis to those epitopes for which at least two epitope-specific TCR sequences are found.

Web Tool

The trained prediction models were collected and made available as a web application, called TCRex, which is accessible at tcrex.biodatamining.be. This web tool was built using the Django web framework (24) and makes use of several Python libraries including scikit-learn (23), SciPy (25), Pyteomics (26), Altair (27), NumPy (28), Pandas (29), and Dask (30). It provides a user-friendly web interface to predict epitope binding for human TCR beta chain sequences and perform epitope specificity enrichment analyses for various viral and cancer epitopes. It is compatible with several common TCR sequence data formats, namely the immunoSEQ Analyzer formats² and MiXCR (31) format, and the TCRex tab-delimited format which contains the V/J genes and CDR3 beta amino acid sequences for each TCR. More information concerning the use of the web tool is provided in Supplemental Material S3. In this paper, the TCRex web tool was used with the default BPR threshold of 0.01% for all analyses. This value is based on the assumption that roughly 1–100 epitope-specific TCRs can be found for every million TCRs in a healthy CD8⁺ TCR repertoire (32, 33).

Results

Overview of the Epitope-Specific Models

The following performance metrics were calculated for each epitope-specific model using a 5-fold stratified cross-validation approach: the balanced accuracy, the area under the ROC curve (AUC) value and the average precision. These are shown in Supplemental Material S4. Only models with an AUC of at least 0.7 and an average precision of at least 0.35 were included in the final model collection, and thus in the current version (version 0.3.0) of the TCRex web tool. Forty nine out of 54 prediction models passed these inclusion criteria. Out of these 49 models, 44 were trained for viral epitopes and 5 for cancer epitopes (Figure 1). The MHC background for the epitopes is shown in Supplemental Material S5. It is worth to note that the detection of epitope-specific TCRs in this paper solely relies on these prediction models without any additional experimental validation. Therefore, all identified epitope-specific TCRs should be interpreted as TCRs predicted to bind a specific epitope. However, some of these TCRs might also be present in the training dataset, which is a collection of experimentally identified TCRs.

Overview of the number of trained prediction models for each virus and cancer type (TCRex version 0.3.0). The bars show the number of epitopes for which a prediction model with sufficient performance (i.e., AUC ≥ 0.7 and average precision ≥0.35) was trained.

Evaluation of the Prediction Models Using a Leave-One-Study-Out Validation Approach

To evaluate how our epitope-specific models handle data from new studies, we designed a validation approach wherein all data of one study was left out during the training of the models and served as a holdout validation set. We selected viral epitopes with a large set of epitope-specific training data (at least 1,000 unique TCRs) from various studies. For each of these epitopes, one study was removed from the training dataset (Supplemental Material S6). We ensured that this holdout dataset did not contain any of the training TCRs (i.e., TCRs with the same V/J genes and CDR3 beta sequence as training TCRs). After training the epitope-specific models, predictions on the holdout dataset were made using a strict BPR threshold of 0.01%. For each epitope-specific model, we expected a maximum of one false positive when analyzing these small datasets with a 0.01% BPR threshold, as these datasets varied in size from 66 to 124 TCRs. To verify this assumption, the same models were used on a negative control dataset, i.e., a dataset lacking TCRs specific for the epitope of interest. For this purpose, we randomly selected TCRs from the training dataset of the heteroclitic cancer epitope ELAGIGILTV up to the size of the holdout validation datasets. We also ensured that this negative test set did not contain any training TCRs (i.e., TCRs sharing both their V/J genes and CDR3 beta sequence with a TCR from the training dataset), as this might be the case for putative cross-reactive TCRs. Standard performance metrics including accuracy, sensitivity, specificity and precision were calculated for each of the models (Supplemental Material S7). A summary of the results is shown in Table 1. The majority of the identified TCRs did not match any of the training CDR3 beta sequences, which demonstrates the ability to find TCRs with unseen CDR3 beta sequences. Because of the strict BPR threshold, we achieved a very high specificity and precision, but a limited sensitivity. For both the GILGFVFTL and GLCTLVAML models, the specificity and precision reached the maximum score of 1 which corresponds to zero false positives. For the NLVPMVATV model, two false positives were detected, which does not align with the chosen BPR threshold. One of these TCRs, however, shares its CDR3 beta sequence and J gene with a NLVPMVATV-specific training TCR. This may indicate the presence of cross-reactive TCR recognition patterns for both the cancer and the CMV epitope. If this is the case, the TCR could be considered a true positive. The presence of epitope-specific TCRs in background datasets is a potential issue, as epitope-specificity studies mainly report TCRs binding to the epitope of interest and not the non-binding TCRs. Since there is no public information available on non-binding TCRs, our control datasets might contain a small number of epitope-specific TCRs. This could have led to an overestimation of the number of false positives in this analysis. Overall, the models performed well and did not seem to be subject to any study specific biases.

Table 1.

Validation of three epitope-specific models with the leave-one-study-out validation approach.

Epitope	Size training dataset	Size test datasets	Identified TCRs in epitope-specific test dataset	Identified TCRs containing training CDR3 beta sequences	Identified TCRs in cancer dataset	Performance metrics
						Accuracy	Sensitivity	Specificity	Precision
NLVPMVATV	4,723	89	10	3	2	0.54	0.11	0.98	0.83
GILGFVFTL	2,983	124	3	0	0	0.51	0.02	1	1
GLCTLVAML	1,142	66	12	3	0	0.59	0.18	1	1

Open in a new tab

For each epitope, the size of the training dataset, the size of the test datasets (i.e. the holdout validation dataset and the additional cancer dataset) and the number of identified epitope-specific TCRs in the test datasets is given together with the associated performance metrics for a BPR threshold of 0.01%. In addition, the number of identified TCRs in the holdout validation set having a training CDR3 beta sequence is shown. They, however, do not share their V and/or J genes with the training TCRs that match the CDR3 beta sequences, as these were filtered out the holdout dataset.

The Number of Unique Epitope-Specific TCRs Increases Post-vaccination

Previous studies have already demonstrated the feasibility of identifying T cell clones that have proliferated between time points in TCR repertoire data (1, 2). However, such methods do not provide insight into the targeted epitopes. Considering the fact that T cells proliferate upon binding with vaccine-associated epitopes, we hypothesized that post-vaccination repertoires contain an elevated level of unique epitope-specific TCRs compared to pre-vaccination repertoires. To evaluate the difference in the number of unique epitope-specific TCRs (i.e., each TCR clonotype is considered once) captured within the sequenced repertoire before and after vaccination, we reanalyzed published TCR repertoires from two independent YFV vaccination studies. In the first study, nine healthy volunteers with unknown HLA types received the YF-17D vaccine (1). In the second study, the YF-17D vaccine was used to vaccinate three pairs of HLA-A*02:01 positive twins (2). For each of the participants, TCR beta repertoire data from PBMC samples taken before and 2 weeks after vaccination were available. At the time of analysis, TCRex featured one YFV epitope prediction model: the immunodominant LLWNGPMAV peptide. We used this model to identify unique LLWNGPMAV-specific TCRs in both pre- and post-vaccination repertoires. To control the number of false positives, a strict BPR threshold of 0.01% was applied. As can be seen in Figure 2, the percentage of unique predicted LLWNGPMAV-specific TCR sequences increased post vaccination for the majority of the volunteers [one-sided Wilcoxon signed-rank test, p = 0.03711 for (1) and p = 0.01563 for (2)].

Percentage of unique identified LLWNGPMAV-specific TCRs pre- and post-vaccination. The boxplots show the proportion of unique LLWNGPMAV-specific T cells in pre- and post-vaccination PBMC samples for the nine volunteers from Dewitt et al. (1) **(A)** and the six volunteers from Pogorelyy et al. (2) **(B)**. Epitope-specific T cells were identified with TCRex using a 0.01% BPR threshold. An increase in the number of unique epitope-specific cells was found for both studies.

Comparison of the Identification Level With Techniques Relying on Clonal Expansion

The detection of vaccine-associated T cells with our new method is restricted to the recognition of epitope-specific T cells for which trained prediction models are available. Because only one YFV epitope model passed the quality filters at the time of analysis, we expected to detect a considerably smaller number of vaccine-associated TCRs in comparison with a differential analysis between time points. To evaluate the identification level of our new strategy, we compared our results from the post-vaccination repertoires to the outcome of the original YFV vaccine study from Pogorelyy et al. (2) which reported a list of the significantly expanded TCR clones 2 weeks after vaccination. Two important findings resulted from this analysis. First, considerably more YFV-associated TCRs were picked up by the method from study (2) than by TCRex. Second, the overlap between the epitope-specific TCRs identified by TCRex and the proliferated TCRs identified by Pogorelyy et al. (2) is rather limited, with an overlap ranging from 2 to 19 TCRs (Table 2). These results could indicate that our approach might be able to detect epitope-specific TCRs that might be missed by other techniques. However, it is important to point out that the list of epitope-specific TCRs reported by TCRex may contain a small number of non-specific TCRs due to the presence of false positive predictions. The lower identification rate of TCRex can be explained by its limited sensitivity at strict BPR thresholds, as demonstrated by the leave-one-study-out validation, and its restriction to one YFV epitope. Thus, any T cells reacting with other vaccine-related epitopes that were detected as expanded TCR clones in Pogorelyy et al. (2), would be missed by TCRex. On the other hand, as clonal expansion techniques do not take into account the epitope specificity of the TCRs, they might also falsely report bystander TCRs or TCRs associated with other non-vaccine related immune responses. Indeed, in the original YFV vaccine study three different experimental assays (i.e., a multimer assay using the LLWNGPMAV epitope, evaluation of the interferon gamma production following vaccine-stimulation and identification of activated CD8⁺CD38⁺HLA–DR⁺ T cells) were carried out to assess whether the identified expanded T cell clones for one of the volunteers (S1) were vaccine-associated (2). About half of the expanded 773 TCR clones were experimentally validated, of which 68 clonotypes were found to be LLWNGPMAV-specific. This is less than the 170 epitope-specific TCRs identified with TCRex. Although tetramer assays are known to miss epitope-specific TCRs (6, 7), this low identification rate might also be a consequence of the fact that the multimer assay was performed 2 years after vaccination.

Table 2.

Comparison of the epitope-specific TCR sequences identified with TCRex to the results from Pogorelyy et al. (2).

Volunteer	Vaccine-associated TCR clones reported by Pogorelyy et al. (2)	Unique canonical vaccine-associated TCRs identified by Pogorelyy et al. (2)	Epitope-specific TCRs identified by TCRex	Overlapping TCRs
P1	1,151	1090	92	2
P2	800	763	110	3
Q1	576	547	79	7
Q2	1,685	1589	92	8
S1	773	747	170	19
S2	983	948	273	11

Open in a new tab

For each volunteer from Pogorelyy et al. (2) are given: the number of YFV-specific clones (DNA level) reported by the YFV vaccine study, the number of unique canonical YFV-specific TCRs (i.e., TCRs starting with cysteine and ending with phenylalanine) (protein level with CDR3 beta sequences not containing any non-amino acid characters) identified in the YFV vaccine study, the number of epitope-specific sequences in the post-vaccination repertoire identified by TCRex and the number of TCRs identified by both the YFV vaccine study and TCRex.

The Majority of the Identified LLWNGPMAV-Specific T Cells Are Not Present in Public Databases

Since the detection of epitope-specific T cells relies on a prediction model trained on previously identified TCRs recognizing the epitope, we assessed to what extent our models are able to identify new epitope-specific TCRs. For this, we compared the number of TCRex-identified LLWNGPMAV-specific T cells in the post-vaccination repertoires from the two YFV vaccine studies (1, 2) with the number of public TCRs (i.e., known LLWNGPMAV-specific TCR sequences present in the training data) in these repertoires. For both YFV studies, TCRex achieved a higher identification rate in comparison with the database search technique (Figure 3). The number of TCRs identified by both techniques is rather limited (Supplemental Material S8). TCRex does find some training TCRs in the repertoires, however, we can deduce that the bulk of the TCRs identified by TCRex were not present in our training data and thus not publicly known.

Public vs. TCRex-identified unique LLWNGPMAV-specific TCRs in the post-vaccination samples of Dewitt et al. (1) and Pogorelyy et al. (2). **(A)** Overview of the number of unique LLWNGPMAV-specific TCRs that were identified with TCRex in the post-vaccination PBMC samples for all nine volunteers (V1–V9) of Dewitt et al. (1) (dark blue) and the total number of public LLWNGPMAV-specific TCRs that were present in these repertoires (light blue). **(B)** Overview of the number of unique LLWNGPMAV-specific TCRs that were identified with TCRex in the post-vaccination PBMC samples for all six volunteers (P1–S2) of Pogorelyy et al. (2) (dark orange) and the total number of public LLWNGPMAV-specific TCRs that were present in these repertoires (light orange).

Identification of Enriched Vaccine-Associated TCR Repertoire Targets

In addition to the identification of epitope-specific TCRs in whole TCR repertoires, our newly developed strategy can also be used to detect enriched TCR repertoire targets. In essence, we detect epitopes for which more specific TCRs are present than expected based on the epitope-specific TCR occurrence in a healthy, non-vaccinated TCR repertoire. Here, we demonstrate the feasibility of this enrichment analysis by comparing the levels of LLWNGPMAV-specific TCRs before and after vaccination for the volunteers from the two YFV studies. For each of the volunteers, we evaluated whether their pre- and post-vaccination TCR repertoires contained more LLWNGPMAV-specific TCRs than expected in a background TCR repertoire, i.e., a TCR dataset representing a healthy, non-vaccinated TCR repertoire. As all the LLWNGPMAV-specific prediction analyses were performed with a 0.01% BPR threshold, we expect that the abundance of LLWNGPMAV-specific TCRs in a background repertoire will not exceed 0.01%. Therefore, we assessed for each of the volunteers' TCR repertoires whether the percentage of epitope-specific TCRs in the TCR repertoires was significantly higher than 0.01% and thus were enriched in LLWNGPMAV-specific TCRs. The number of identified TCRs and the results of the epitope-specificity enrichment analyses for all volunteers from Pogorelyy et al. (2) are shown in Table 3. The majority of the post-vaccination TCR repertoires (4 out of 6) are enriched in LLWNGPMAV-specific TCRs, as opposed to the pre-vaccination TCR repertoires (2 out of 6). The enrichment in pre-vaccination repertoires might be explained by the use of an unrepresentative background dataset, which is a common problem addressed in the next section. A similar trend can be seen for the volunteers from Dewitt et al. (1). Here, the post-vaccination TCR repertoires from 3 of the 9 volunteers demonstrated an elevated level of LLWNGPMAV-specific TCRs while no volunteer showed an enrichment in their pre-vaccination TCR repertoire (Table 4).

Table 3.

LLWNGPMAV-specific enrichment analysis results for all volunteers from Pogorelyy et al. (2).

Volunteer	Identification level pre-vaccination PBMC			Identification level post-vaccination PBMC
	Identified TCRs	% epitope-specific TCRs	Adjusted p-value enrichment analysis vs. background repertoire	Identified TCRs	% epitope-specific TCRs	Adjusted p-value enrichment analysis vs. background repertoire	Adjusted p-value enrichment analysis vs. pre-vaccination repertoire
P1	77	0.007272	0.999	92	0.010400	0.441	8.17e-04
P2	95	0.008119	0.999	110	0.009614	0.673	0.045
Q1	61	0.008615	0.999	79	0.017915	2.80e-06	1.24e-08
Q2	109	0.008440	0.999	92	0.015455	8.59e-05	1.64e-07
S1	138	0.014311	2.40e-04	170	0.025194	6.06e-25	5.69e-11
S2	180	0.013314	4.09e-04	273	0.017962	2.10e-18	2.03e-06

Open in a new tab

The table represents the adjusted p-values (Benjamini-Hochberg adjusted p-values for 6 tests) associated with the LLWNGPMAV-specific enrichment analyses for both full (i.e., derived from PBMC) pre- and post-vaccination TCR repertoires. In addition, the number of identified specific TCRs and the percentage of identified specific TCRs are given.

Table 4.

LLWNGPMAV-specific enrichment analysis results for all volunteers from Dewitt et al. (1).

Volunteer	Pre-vaccination full TCR repertoire			Post-vaccination full TCR repertoire				Post-vaccination activated TCR repertoire
	Identified TCRs	% epitope- specific TCRs	Adjusted p-value enrichment analysis vs. background repertoire	Identified TCRs	% epitope- specific TCRs	Adjusted p-value enrichment analysis vs. background repertoire	Adjusted p-value enrichment analysis vs. pre-vaccination repertoire	Identified TCRs	% epitope- specific TCRs	Adjusted p-value enrichment analysis vs. background repertoire	Adjusted p-value enrichment analysis vs. pre-vaccination repertoire
1	35	0.010813	0.621	32	0.009094	0.935	0.857	4	0.017189	0.265	0.316
2	48	0.012999	0.384	65	0.015565	1.56e-03	0.212	5	0.028991	0.056	0.116
3	41	0.012539	0.384	87	0.025967	9.29e-14	7.86e-09	181	0.550922	1.86e-239	5.00e-222
4	23	0.009778	0.856	34	0.012084	0.299	0.231	21	0.103128	2.56e-14	1.67e-14
5	8	0.012759	0.621	63	0.020710	7.22e-07	9.93e-04	71	0.423527	7.50e-87	1.56e-79
6	41	0.012111	0.384	32	0.012065	0.299	0.684	8	0.040422	2.31e-03	7.45e-03
7	11	0.006771	0.964	11	0.005344	0.992	0.857	2	0.011544	0.517	0.369
8	22	0.008229	0.964	27	0.010866	0.539	0.212	2	0.011869	0.517	0.404
9	13	0.006451	0.964	18	0.008163	0.938	0.284	4	0.021606	0.176	0.060

Open in a new tab

The table represents the adjusted p-values (Benjamini-Hochberg adjusted p-values for 9 tests) associated with the LLWNGPMAV-specific enrichment analyses for both full (i.e., derived from PBMC) pre- and post-vaccination TCR repertoires and the post-vaccination activated TCR repertoire. In addition, the number of identified specific TCRs and the percentage of identified specific TCRs are given.

For the YFV study from Dewitt et al. (1) we additionally studied the abundance of LLWNGPMAV-specific TCRs in experimentally enriched TCR repertoires as Dewitt et al. reported the activated subset [i.e., “CD3⁺CD8⁺CD14⁻ CD19⁻CD38⁺HLA-DR⁺ Ag-experienced, activated effector T cells” (1)] of the TCR repertoire for each of the nine volunteers 14 days after vaccination with the YFV-17D vaccine. For each of these activated TCR repertoires, LLWNGPMAV-specific TCRs were identified using a BPR threshold of 0.01% and compared with the abundance of specific TCRs in the unsorted PBMC samples at the same time point. The number of LLWNGPMAV-specific TCRs in the activated TCR dataset increased drastically in comparison to their unsorted PBMC samples for the majority of the volunteers (one-sided Wilcoxon signed-rank test, p = 0.001953) (Figure 4). The same trend can be seen in the results of the enrichment analysis, presented in Table 4, with 4 out of 9 of the activated TCR repertoires demonstrating an enrichment in LLWNGPMAV-specific TCRs whereas this is limited to 3 out of 9 for the unsorted post-vaccination repertoires.

Percentage of unique identified LLWNGPMAV-specific TCRs in the post-vaccination repertoires from Dewitt et al. (1). The box plots show the log-scaled proportion of unique LLWNGPMAV-specific T cells in post-vaccination PBMC samples and activated TCR repertoires [i.e., “CD3⁺ CD8⁺CD14⁻CD19⁻CD38⁺HLA-DR⁺ Ag-experienced, activated effector T cells” (1)] for the nine volunteers from Dewitt et al. (1). Epitope-specific TCRs were identified with TCRex using a 0.01% BPR threshold. An increase in the number of unique LLWNGPMAV-specific cells was found in the activated dataset.

The Importance of the Background Dataset in Enrichment Analyses

In the previous section, we showed how an enrichment in epitope-specificity in TCR repertoires can be detected by comparing the abundances of epitope-specific TCRs to a general background dataset with a predefined enrichment threshold. Here, we demonstrate how the selection of a representative background dataset can improve these enrichment analyses. For each of the volunteers from the two YFV studies (1, 2), we compared the abundance of LLWNGPMAV-specific TCRs in their post-vaccination repertoire to the abundance in their pre-vaccination repertoire. For this, we slightly adapted our general epitope specificity enrichment analysis: (1) the number of successes in the one-sided exact binomial test was set equal to the number of epitope-specific TCRs in the post-vaccination repertoire that was identified with a 0.01% BPR threshold and (2) the enrichment threshold was set equal to the fraction of the number of identified TCRs in their pre-vaccination repertoire. A comparison between the two different enrichment analysis strategies demonstrates the importance of the use of a representative background dataset (Tables 3, 4). When the percentage of epitope-specific TCRs in the pre-vaccination repertoires is much lower than we generally expect in healthy TCR repertoires, the use of a standard background dataset might not discover the increase of epitope-specific TCRs in post-vaccination repertoires. This can be seen for volunteer P1 and P2 from study (2) (Table 3). Conversely, when the percentage of epitope-specific TCRs in the pre-vaccination repertoires exceeds our expectations, the use of a background dataset could falsely indicate a significant increase of epitope-specific TCRs in post-vaccination repertoires. This can be seen for volunteer 2 from study (1) (Table 4). In conclusion, the selection of the background dataset is an important step when performing enrichment analyses and depends on the research question and experimental set-up of the study.

Identification of Enriched Disease-Associated TCR Repertoire Targets

Previously, Emerson et al. introduced a new statistical method to discover disease-associated TCRs from TCR repertoire data (15). Briefly, they sequenced the TCR beta repertoires of a large group (more than 600) healthy CMV⁺ and CMV⁻ volunteers and set up a statistical test to identify public TCRs that were enriched in the healthy CMV⁺ volunteers. In total, 164 TCRs were found to be associated with a positive cytomegalovirus serostatus of which three were matched with a CMV epitope through an additional validation experiment. Here, we use our enrichment analysis strategy to identify the CMV epitopes that are targeted by the TCRs in this dataset. In short, we searched for epitopes for which an elevated level of specific TCRs are present in this dataset, i.e., more than expected in a background TCR repertoire. To see whether we only find an enrichment for CMV-specific TCR sequences, and not for non-CMV epitopes, we performed binding predictions for all epitopes present in TCRex with a strict BPR threshold of 0.01%. In total, 8 out of 164 different TCRs were predicted to bind to one of the CMV epitopes (Table 5). Two epitopes, both derived from CMV, were found to be significantly enriched in the dataset: NLVPMVATV (adjusted p = 7.1e-7, Benjamini-Hochberg adjusted p-value for two tests) and TPRVTGGGAM (adjusted p = 5.7e-09, Benjamini-Hochberg adjusted p-value for two tests), compared to a TCR background dataset where we would expect to identify 0.01% specific TCRs for each epitope with our selected BPR threshold. Additionally, we also found one YFV-specific TCR in the list of 164 CMV associated TCR sequences. This might be a false positive identification, where the LLWNGPMAV-specific model wrongly classified the TCR as epitope-binding. However, this TCR sequence shared an identical combination of a CDR3 beta sequence and J gene with TCR sequences linked with the YFV epitope in public databases and thus the model training set. It is therefore possible that this TCR is cross-reactive and recognizes both the immunodominant YFV epitope and a CMV epitope. It could also be that the identified TCR is not CMV-reactive as the dataset reported by the original study represents TCR sequences that were statistically associated with the CMV seropositive volunteers which does not directly entail CMV-reactivity. In summary, this analysis indicates the relevance of using prediction models to identify epitope-specific T cells and search for enriched T cell epitopes in a TCR beta dataset.

Table 5.

Epitope-specific TCRs identified in the enriched CMV-related TCR dataset of Emerson et al. (15) (0.01% BPR threshold).

TRBV gene	CDR3 beta sequence	TRBJ gene	Epitope	Virus	Class probability score	BPR score
TRBV07-06	CASSLAPGATNEKLFF	TRBJ01-04	NLVPMVATV	CMV	0.99	0.0
TRBV30	CAWSVSDLAKNIQYF	TRBJ02-04	NLVPMVATV	CMV	0.91	0.0
TRBV09	CASSALGGAGTGELFF	TRBJ02-02	NLVPMVATV	CMV	0.84	0.0
TRBV07-03	CASSRLAGGTDTQYF	TRBJ02-03	QIKVRVKMV	CMV	0.54	2e-05
TRBV07-09	CASSLIGVSSYNEQFF	TRBJ02-01	TPRVTGGGAM	CMV	0.97	0.0
TRBV04-03	CASSPSRNTEAFF	TRBJ01-01	TPRVTGGGAM	CMV	0.92	0.0
TRBV04-03	CASSPQRNTEAFF	TRBJ01-01	TPRVTGGGAM	CMV	0.85	0.0
TRBV04-03	CASSPHRNTEAFF	TRBJ01-01	TPRVTGGGAM	CMV	0.85	0.0
TRBV09	CASSGQGAYEQYF	TRBJ02-07	LLWNGPMAV	YFV	0.98	0.0

Open in a new tab

Evaluation of the Impact of MHC Bias in Epitope-Specific Prediction Models

Considering each epitope-specific model is based on a positive and negative training dataset, with, respectively, a very constrained and broad MHC background, the performance of the prediction models could be hampered by an inherent MHC bias. More specifically, the models are at risk of being MHC-specific, and thus predict the MHC binding property of a TCR sequence in addition to its epitope specificity. This could obstruct the identification of epitope-specific TCRs that bind MHC proteins other than those included in the training dataset. To assess the MHC-specificity of our prediction models, we studied the influence of the MHC bias on the prediction results. To this end, we collected epitope-specific TCR sequences from epitopes that are presented by more than one MHCI molecule. Out of the 49 epitopes available in TCRex, 8 were shown to be presented by different MHCI proteins (Supplemental Material S5). We selected the epitopes that were associated with two different HLA allele groups, i.e., their HLA allele names differ in the first two digits (Table 6): TPGPGVRYPL and TPQDLNTML. For each of these epitopes, a model was trained on a dataset of epitope-specific TCRs that recognizes only one of the MHCI proteins. Hereafter, predictions (0.01% BPR) were made on the remaining epitope-specific TCRs, which were associated with the other MHCI protein. TCRs with a different MHC background than the training dataset were identified for both epitopes (Table 6). This indicates that these prediction models are MHC-agnostic.

Table 6.

Evaluation of the MHC-specificity of epitope-specific models.

Epitope	Training data		Test data
	MHC background	Size	MHC background	Size	Number of identified TCRs
TPGPGVRYPL	HLA-B*42:01	63	HLA-B*07:02	23	1
TPQDLNTML	HLA-B*42:01	114	HLA-B*81:01	40	7

Open in a new tab

The table lists the MHC background and size of the training and test datasets for the study on MHC bias and the number of identified TCRs in the test dataset (0.01% BPR threshold).

To support this result, we performed an additional analysis where the HLA-B*42:01 restricted models for the TPGPGVRYPL and TPQDLNTML epitopes were used to predict the specificity of TCRs with the same HLA-B*42:01 MHC background but different epitope specificity. In total, TCRex contained 4 epitopes associated with HLA-B*42:01: LPPIVAKEI (62 TCRs), IIKDYGKQM (54 TCRs), HPKVSSEVHI (75 TCRs), and FPRPWLHGL (120 TCRs). Using the HLA-B*42:01 restricted models (0.01% BPR), none of the TCRs associated with these 4 epitopes were predicted to bind TPGPGVRYPL or TPQDLNTML, which shows that the MHC binding property of a TCR was not learned by these prediction models. Taken together, the results of this study suggest that the MHC-bias in the prediction models does not necessarily have a major impact on the prediction results.

Discussion

Traditional methods to identify epitope-specific T cells require targeted experiments, which are known to fail to identify a considerable amount of reactive T cells (6, 7). Approaches to solve this problem computationally have been proposed, but they are hard to translate to full repertoires due to their inherent prediction error. To overcome these limitations, we propose a novel approach built on the identification of epitope-specific T cells. Our method uses epitope-specific prediction models trained on publicly available TCR beta sequence data to assess the probability of binding the epitope of interest and corrects for prediction errors by using a stringent BPR threshold. As such, it allows for the easy identification of epitope-specific T cells in full TCR repertoires and consequently the detection of enriched TCR repertoire targets. In the current implementation, it is based on a random forest classifier trained on common physicochemical properties. The implemented method currently only uses TCR beta chain information, as this is the bulk of the data that is currently available. As linked beta-alpha TCR data becomes more common (34, 35), the approach can be easily extended to include alpha chain sequence data.

This novel approach was validated on several independent datasets to address different research questions. Firstly, we examined whether we could detect a difference in the presence of epitope-specific TCRs in pre- and post-vaccination repertoires. Considering that vaccines are expected to activate T cells upon binding with vaccine-associated epitopes, leading to a targeted proliferation of these T cells, we expected to find more unique epitope-specific T cells after vaccination. The analysis of the two YFV studies (1, 2) seems to support this hypothesis as an increase in predicted LLWNGPMAV-specific T cells is found in the post-vaccination repertoires from both studies.

One limitation of our method is the restriction on the number of epitopes that can be studied as the models rely on the presence of sufficient epitope-specific TCR data from targeted experiments. Consequently, it is only possible to inspect the specificity of TCR repertoires for epitopes which have already been studied. However, considering our analysis of the two YFV studies was restricted to one epitope, we achieved a relatively high identification rate in comparison with the method of Pogorelyy et al. (2). This might be a result of the immunodominance of the studied YFV epitope. The majority of the TCRs identified with TCRex, however, did not overlap with the results in Pogorelyy et al. (2). This could indicate that we might be able to find epitope-specific TCRs that are missed by other methods. For example, it is possible that some epitope-specific TCRs expanded less in comparison to others and were therefore not detected using the technique relying on clonal expansion. Though more extensive experimental validation would be required, TCRex has the potential to be used as a complementary technique to existing TCR data analysis strategies by annotating single TCRs with their epitope target and/or by discovering epitope response trends in (activated) TCR repertoires.

Finally, we demonstrated how our approach enables the search for enriched epitope specificity in TCR repertoire datasets using our built-in enrichment analysis test. In short, we used an enrichment test to check whether a TCR repertoire contains significantly more TCRs recognizing a specific epitope than a random background TCR dataset. As a proof of concept, we compared the elevated level of LLWNGPMAV-specific TCRs in pre- and post-vaccination TCR repertoires and analyzed a dataset enriched in CMV-associated TCRs. The former analysis shows the importance of the selection of the right background dataset when performing enrichment analyses, which we discuss in more detail in the next section. For the latter, our enrichment analysis method easily detected a significant increase in epitope-specific TCRs for two CMV epitopes. No enrichment was found for other viral epitopes, which illustrates the robustness of our approach.

While this novel approach allows a full repertoire exploration, several potential sources of bias may still impact the results. A first problem lies in the diversity of epitope-specific repertoires: various groups of T cells can be found in these repertoires, each carrying a specific pattern (11, 13, 14). Patterns that are present in more T cell clonotypes, will be picked up more efficiently in TCR repertoire studies. This pattern bias will therefore be present in the training dataset. Consequently, T cells carrying common patterns might be detected more easily by the prediction models while T cells with rare patterns are more likely to be missed. Although it is unlikely that all specificity patterns can be discovered with a prediction model, it could still be possible to detect the general trends following a disease or treatment, as these might influence the frequency of both TCRs with common and rare patterns.

Another major concern was the effect of the MHC bias in the training data on the prediction models. This bias could lead to an underestimation of the epitope specificity of TCR repertoires from individuals whose HLA types do not match the MHC background of the prediction models. In addition, it could cause an increase of false positive predictions for individuals with matching HLAs. Our analysis on the impact of the MHC bias demonstrated that the MHC background of epitope-specific TCRs is not likely to hamper the prediction results. This indicates that the epitope-specific TCR patterns might transcend the MHC background of epitopes, which is in line with (11). However, caution needs to be taken when extending this conclusion to all other prediction models as the relative importance between the epitope-specific and MHC-specific TCR patterns could vary between the different models.

Still, even if the epitope-specific models are not MHC-agnostic, potential biases in the enrichment results can be avoided by comparing the abundance of epitope-specific TCRs in a TCR repertoire to a background dataset with a matching MHC background. In general, we recommend the use of a representative background dataset (e.g., background data from the same subject using the same TCR sequencing and analysis set up) when available as any other biases between the TCR repertoire and the background dataset might affect the enrichment results. This is demonstrated in the enrichment analyses of the YFV studies, where the replacement of the standard background dataset by the pre-vaccination repertoires gave better results. Nevertheless, if no suitable background dataset is available, the enrichment analysis, as implemented in the TCRex web tool, still provides an easy and fast exploration of a TCR dataset, since it gives an initial overview of the possible TCR targets that may be enriched in this dataset. As bulk sequenced TCR repertoires are commonly not annotated with their epitope specificities and require laborious additional experiments to discover these targets, this enrichment analysis can inform subsequent experimental efforts.

Taken together, we have developed a novel strategy to study the epitope-specificity of full TCR repertoires. Our approach relies on the identification of epitope-specific T cell receptor sequences using epitope-specific prediction models. This enables the prediction of human epitope-specific TCR sequences and the identification of enriched TCR repertoire targets. We have embedded our method into a user-friendly web tool, called TCRex, which is available at tcrex.biodatamining.be. This tool will facilitate the study of the epitope specificity of TCR sequences that underly immune responses in different disease-associated studies.

Data Availability Statement

The training dataset of the TCRex web tool will be made available by the authors upon request. The TCR repertoire datasets analyzed in this study were obtained from three different published studies: Pogorelyy et al. (2) (https://github.com/mptouzel/pogorelyy_et_al_2018, https://www.pnas.org/content/115/50/12704/tab-figures-data), DeWitt et al. (1) (https://clients.adaptivebiotech.com/pub/dewitt-2015-jvi), and Emerson et al. (15) (https://www.nature.com/articles/ng.3822).

Author Contributions

BO, KL, and PMe conceived the study. SG, PMo, WB, and ND implemented the TCRex web tool. SG performed the analyses. All authors interpreted the results and contributed to the manuscript.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

¹https://github.com/mptouzel/pogorelyy_et_al_2018

²https://www.adaptivebiotech.com/immunoseq

Funding. This work was supported by the University of Antwerp [BOF GOA: PIMS, IOF SBO to PMe]; the Research Foundation Flanders [1S29816N to ND, 1S48819N to SG, 1141217N to PMo, 12W0418N to WB, 1861219N to BO, G067118N]; and by the Belgian American Educational Foundation [postdoctoral grant to WB].

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fimmu.2019.02820/full#supplementary-material

Click here for additional data file.^{(412.9KB, pdf)}

References

1.Dewitt WS, Emerson RO, Lindau P, Vignali M, Snyder TM, Desmarais C, et al. Dynamics of the cytotoxic T cell response to a model of acute viral infection. J Virol. (2015) 89:4517–26. 10.1128/JVI.03474-14 [DOI] [PMC free article] [PubMed] [Google Scholar]
2.Pogorelyy MV, Minervina AA, Touzel MP, Sycheva AL, Komech EA, Kovalenko EI, et al. Precise tracking of vaccine-responding T cell clones reveals convergent and personalized response in identical twins. PNAS. (2018) 115:12704–9. 10.1073/pnas.1809642115 [DOI] [PMC free article] [PubMed] [Google Scholar]
3.Pogorelyy MV, Minervina AA, Shugay M, Chudakov DM, Lebedev YB, Mora T, et al. Detecting T-cell receptors involved in immune responses from single repertoire snapshots. PLoS Biol. (2019) 17:e3000314. 10.1371/journal.pbio.3000314 [DOI] [PMC free article] [PubMed] [Google Scholar]
4.Sherwood A, Emerson R, Scherer D, Habermann N, Buck K, Staffa J, et al. Tumor-infiltrating lymphocytes in colorectal tumors display a diversity of T cell receptor sequences that differ from the T cells in adjacent mucosal tissue. Cancer Immunol Immunother. (2013) 62:1453–61. 10.1007/s00262-013-1446-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
5.Ahmadzadeh M, Pasetto A, Jia L, Deniger DC, Stevanović S, Robbins PF, et al. Tumor-infiltrating human CD4 + regulatory T cells display a distinct TCR repertoire and exhibit tumor and neoantigen reactivity. Sci Immunol. (2019) 4:eaao4310. 10.1126/sciimmunol.aao4310 [DOI] [PMC free article] [PubMed] [Google Scholar]
6.Rius C, Attaf M, Tungatt K, Bianchi V, Legut M, Bovay A, et al. Peptide–MHC class I tetramers can fail to detect relevant functional T cell clonotypes and underestimate antigen-reactive T cell populations. J Immunol. (2018) 200:2263–79. 10.4049/jimmunol.1700242 [DOI] [PMC free article] [PubMed] [Google Scholar]
7.Dolton G, Zervoudi E, Rius C, Wall A, Thomas HL, Fuller A, et al. Optimized peptide – MHC multimer protocols for detection and isolation of autoimmune T-cells. Front Immunol. (2018) 9:1378 10.3389/fimmu.2018.01378 [DOI] [PMC free article] [PubMed] [Google Scholar]
8.Martin MD, Jensen IJ, Ishizuka AS, Lefebvre M, Shan Q, Xue H-H, et al. Bystander responses impact accurate detection of murine and human antigen-specific CD8 T cells. J Clin Invest. (2019) 130:3894–908. 10.1172/JCI124443 [DOI] [PMC free article] [PubMed] [Google Scholar]
9.Shugay M, Bagaev DV, Zvyagin IV, Vroomans RM, Crawford JC, Dolton G, et al. VDJdb: a curated database of T-cell receptor sequences with known antigen specificity. Nucleic Acids Res. (2018) 46:D419–27. 10.1093/nar/gkx760 [DOI] [PMC free article] [PubMed] [Google Scholar]
10.Tickotsky N, Sagiv T, Prilusky J, Shifrut E, Friedman N. McPAS-TCR: a manually curated catalogue of pathology-associated T cell receptor sequences. Bioinformatics. (2017) 33:2924–9. 10.1093/bioinformatics/btx286 [DOI] [PubMed] [Google Scholar]
11.Meysman P, De Neuter N, Gielis S, Bui Thi D, Ogunjimi B, Laukens K. On the viability of unsupervised T-cell receptor sequence clustering for epitope preference. Bioinformatics. (2019) 35:1461–8. 10.1093/bioinformatics/bty821 [DOI] [PubMed] [Google Scholar]
12.De Neuter N, Bittremieux W, Beirnaert C, Cuypers B, Mrzic A, Moris P, et al. On the feasibility of mining CD8+ T cell receptor patterns underlying immunogenic peptide recognition. Immunogenetics. (2018) 70:159–68. 10.1007/s00251-017-1023-5 [DOI] [PubMed] [Google Scholar]
13.Dash P, Fiore-Gartland AJ, Hertz T, Wang GC, Sharma S, Souquette A, et al. Quantifiable predictive features define epitope-specific T cell receptor repertoires. Nature. (2017) 547:89–93. 10.1038/nature22383 [DOI] [PMC free article] [PubMed] [Google Scholar]
14.Glanville J, Huang H, Nau A, Hatton O, Wagar LE, Rubelt F, et al. Identifying specificity groups in the T cell receptor repertoire. Nature. (2017) 547:94–8. 10.1038/nature22976 [DOI] [PMC free article] [PubMed] [Google Scholar]
15.Emerson RO, DeWitt WS, Vignali M, Gravley J, Hu JK, Osborne EJ, et al. Immunosequencing identifies signatures of cytomegalovirus exposure history and HLA-mediated effects on the T cell repertoire. Nat Genet. (2017) 49:659–65. 10.1038/ng.3822 [DOI] [PubMed] [Google Scholar]
16.Costa AI, Koning D, Ladell K, Mclaren JE, Grady BPX, Schellens IMM, et al. Complex T-cell receptor repertoire dynamics underlie the CD8 + T-cell response to HIV-1. J Virol. (2015) 89:110–19. 10.1128/JVI.01765-14 [DOI] [PMC free article] [PubMed] [Google Scholar]
17.Miconnet I, Marrau A, Farina A, Taffé P, Vigano S, Harari A, et al. Large TCR Diversity of Virus-Specific CD8 T Cells Provides the Mechanistic Basis for Massive TCR Renewal after Antigen Exposure. J Immunol. (2011) 186:7039–7049. 10.4049/jimmunol.1003309 [DOI] [PubMed] [Google Scholar]
18.Venturi V, Chin HY, Asher TE, Ladell K, Scheinberg P, Bornstein E, et al. TCR β-chain sharing in human CD8+ T cell responses to cytomegalovirus and EBV. J Immunol. (2008) 181:7853–62. 10.4049/jimmunol.181.11.7853 [DOI] [PubMed] [Google Scholar]
19.Lefranc MP, Giudicelli V, Duroux P, Jabado-Michaloud J, Folch G, Aouinti S, et al. IMGT, the international ImMunoGeneTics information system 25 years on. Nucleic Acids Res. (2015) 43:D413–22. 10.1093/nar/gku1056 [DOI] [PMC free article] [PubMed] [Google Scholar]
20.Robins HS, Srivastava SK, Campregher PV, Turtle CJ, Andriesen J, Riddell SR, et al. Overlap and effective size of the human CD8+ T-cell receptor repertoire. Sci Transl Med. (2010) 2:47ra64. 10.1126/scitranslmed.3001442 [DOI] [PMC free article] [PubMed] [Google Scholar]
21.Dean J, Emerson RO, Vignali M, Sherwood AM, Rieder MJ, Carlson CS, et al. Annotation of pseudogenic gene segments by massively parallel sequencing of rearranged lymphocyte receptor loci. Genome Med. (2015) 7:123. 10.1186/s13073-015-0238-z [DOI] [PMC free article] [PubMed] [Google Scholar]
22.Jokinen E, Heinonen M, Huuhtanen J, Mustjoki S, Lähdesmäki H. TCRGP: determining epitope specificity of T cell receptors. bioRxiv. (2019) 10.1101/542332 [DOI] [Google Scholar]
23.Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. (2011) 12:2825–30. [Google Scholar]
24.Django Software Foundation Django (Version 2.0.7) (2018). Available online at: https://djangoproject.com (accessed August 2, 2019).
25.Jones E, Oliphant T, Peterson P. SciPy: Open Source Scientific Tools for Python. Available online at: http://www.scipy.org/ (accessed August 2, 2019).
26.Goloborodko AA, Levitsky LI, Ivanov MV, Gorshkov MV. Pyteomics — a Python framework for exploratory data analysis and rapid software prototyping in proteomics. J Am Soc Mass Spectrom. (2013) 24:301–4. 10.1007/s13361-012-0516-6 [DOI] [PubMed] [Google Scholar]
27.VanderPlas J, Granger BE, Heer J, Moritz D, Wongsuphasawat K, Satyanarayan A, et al. Altair: interactive statistical visualizations for Python. J Open Source Softw. (2018) 3:1057 10.21105/joss.01057 [DOI] [Google Scholar]
28.Stéfan van der Walt, Chris Colbert S, Varoquaux G. The NumPy array: a structure for efficient numerical computation. Comput. Sci. Eng. (2011) 13:22–30. 10.1109/MCSE.2011.37 [DOI] [Google Scholar]
29.McKinney W. Data structures for statistical computing in Python. In: Proceedings of the 9th Python in Science Conference. (Austin, TX: ), 51–56. [Google Scholar]
30.Dask Development Team Dask: Library for Dynamic Task Scheduling. (2016). Available online at: https://dask.org (accessed August 2, 2019).
31.Bolotin DA, Poslavsky S, Mitrophanov I, Shugay M, Mamedov IZ, Putintseva EV, et al. MiXCR: software for comprehensive adaptive immunity profiling. Nat Methods. (2015) 12:380–1. 10.1038/nmeth.3364 [DOI] [PubMed] [Google Scholar]
32.Legoux F, Debeaupuis E, Echasserieau K, De La Salle H, Saulquin X, Bonneville M. Impact of TCR reactivity and HLA phenotype on naive CD8 T cell frequency in humans. J Immunol. (2010) 184:6731–8. 10.4049/jimmunol.1000295 [DOI] [PubMed] [Google Scholar]
33.Alanio C, Lemaitre F, Law HKW, Hasan M, Albert ML. Enumeration of human antigen – specific naive CD8+ T cells reveals conserved precursor frequencies. Blood. (2010) 115:3718–25. 10.1182/blood-2009-10-251124 [DOI] [PubMed] [Google Scholar]
34.Redmond D, Poran A, Elemento O. Single-cell TCRseq : paired recovery of entire T-cell alpha and beta chain transcripts in T-cell receptors from single-cell RNAseq. Genome Med. (2016) 8:80. 10.1186/s13073-016-0335-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
35.Lee ES, Thomas PG, Mold JE, Yates AJ. Identifying T cell receptors from high- throughput sequencing : dealing with promiscuity in TCR α and TCR β pairing. PLoS Comput Biol. (2017) 13:e1005313 10.1371/journal.pcbi.1005313 [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Click here for additional data file.^{(412.9KB, pdf)}

Data Availability Statement

[B1] 1.Dewitt WS, Emerson RO, Lindau P, Vignali M, Snyder TM, Desmarais C, et al. Dynamics of the cytotoxic T cell response to a model of acute viral infection. J Virol. (2015) 89:4517–26. 10.1128/JVI.03474-14 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B2] 2.Pogorelyy MV, Minervina AA, Touzel MP, Sycheva AL, Komech EA, Kovalenko EI, et al. Precise tracking of vaccine-responding T cell clones reveals convergent and personalized response in identical twins. PNAS. (2018) 115:12704–9. 10.1073/pnas.1809642115 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B3] 3.Pogorelyy MV, Minervina AA, Shugay M, Chudakov DM, Lebedev YB, Mora T, et al. Detecting T-cell receptors involved in immune responses from single repertoire snapshots. PLoS Biol. (2019) 17:e3000314. 10.1371/journal.pbio.3000314 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B4] 4.Sherwood A, Emerson R, Scherer D, Habermann N, Buck K, Staffa J, et al. Tumor-infiltrating lymphocytes in colorectal tumors display a diversity of T cell receptor sequences that differ from the T cells in adjacent mucosal tissue. Cancer Immunol Immunother. (2013) 62:1453–61. 10.1007/s00262-013-1446-2 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B5] 5.Ahmadzadeh M, Pasetto A, Jia L, Deniger DC, Stevanović S, Robbins PF, et al. Tumor-infiltrating human CD4 + regulatory T cells display a distinct TCR repertoire and exhibit tumor and neoantigen reactivity. Sci Immunol. (2019) 4:eaao4310. 10.1126/sciimmunol.aao4310 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B6] 6.Rius C, Attaf M, Tungatt K, Bianchi V, Legut M, Bovay A, et al. Peptide–MHC class I tetramers can fail to detect relevant functional T cell clonotypes and underestimate antigen-reactive T cell populations. J Immunol. (2018) 200:2263–79. 10.4049/jimmunol.1700242 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B7] 7.Dolton G, Zervoudi E, Rius C, Wall A, Thomas HL, Fuller A, et al. Optimized peptide – MHC multimer protocols for detection and isolation of autoimmune T-cells. Front Immunol. (2018) 9:1378 10.3389/fimmu.2018.01378 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B8] 8.Martin MD, Jensen IJ, Ishizuka AS, Lefebvre M, Shan Q, Xue H-H, et al. Bystander responses impact accurate detection of murine and human antigen-specific CD8 T cells. J Clin Invest. (2019) 130:3894–908. 10.1172/JCI124443 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B9] 9.Shugay M, Bagaev DV, Zvyagin IV, Vroomans RM, Crawford JC, Dolton G, et al. VDJdb: a curated database of T-cell receptor sequences with known antigen specificity. Nucleic Acids Res. (2018) 46:D419–27. 10.1093/nar/gkx760 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B10] 10.Tickotsky N, Sagiv T, Prilusky J, Shifrut E, Friedman N. McPAS-TCR: a manually curated catalogue of pathology-associated T cell receptor sequences. Bioinformatics. (2017) 33:2924–9. 10.1093/bioinformatics/btx286 [DOI] [PubMed] [Google Scholar]

[B11] 11.Meysman P, De Neuter N, Gielis S, Bui Thi D, Ogunjimi B, Laukens K. On the viability of unsupervised T-cell receptor sequence clustering for epitope preference. Bioinformatics. (2019) 35:1461–8. 10.1093/bioinformatics/bty821 [DOI] [PubMed] [Google Scholar]

[B12] 12.De Neuter N, Bittremieux W, Beirnaert C, Cuypers B, Mrzic A, Moris P, et al. On the feasibility of mining CD8+ T cell receptor patterns underlying immunogenic peptide recognition. Immunogenetics. (2018) 70:159–68. 10.1007/s00251-017-1023-5 [DOI] [PubMed] [Google Scholar]

[B13] 13.Dash P, Fiore-Gartland AJ, Hertz T, Wang GC, Sharma S, Souquette A, et al. Quantifiable predictive features define epitope-specific T cell receptor repertoires. Nature. (2017) 547:89–93. 10.1038/nature22383 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B14] 14.Glanville J, Huang H, Nau A, Hatton O, Wagar LE, Rubelt F, et al. Identifying specificity groups in the T cell receptor repertoire. Nature. (2017) 547:94–8. 10.1038/nature22976 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B15] 15.Emerson RO, DeWitt WS, Vignali M, Gravley J, Hu JK, Osborne EJ, et al. Immunosequencing identifies signatures of cytomegalovirus exposure history and HLA-mediated effects on the T cell repertoire. Nat Genet. (2017) 49:659–65. 10.1038/ng.3822 [DOI] [PubMed] [Google Scholar]

[B16] 16.Costa AI, Koning D, Ladell K, Mclaren JE, Grady BPX, Schellens IMM, et al. Complex T-cell receptor repertoire dynamics underlie the CD8 + T-cell response to HIV-1. J Virol. (2015) 89:110–19. 10.1128/JVI.01765-14 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B17] 17.Miconnet I, Marrau A, Farina A, Taffé P, Vigano S, Harari A, et al. Large TCR Diversity of Virus-Specific CD8 T Cells Provides the Mechanistic Basis for Massive TCR Renewal after Antigen Exposure. J Immunol. (2011) 186:7039–7049. 10.4049/jimmunol.1003309 [DOI] [PubMed] [Google Scholar]

[B18] 18.Venturi V, Chin HY, Asher TE, Ladell K, Scheinberg P, Bornstein E, et al. TCR β-chain sharing in human CD8+ T cell responses to cytomegalovirus and EBV. J Immunol. (2008) 181:7853–62. 10.4049/jimmunol.181.11.7853 [DOI] [PubMed] [Google Scholar]

[B19] 19.Lefranc MP, Giudicelli V, Duroux P, Jabado-Michaloud J, Folch G, Aouinti S, et al. IMGT, the international ImMunoGeneTics information system 25 years on. Nucleic Acids Res. (2015) 43:D413–22. 10.1093/nar/gku1056 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B20] 20.Robins HS, Srivastava SK, Campregher PV, Turtle CJ, Andriesen J, Riddell SR, et al. Overlap and effective size of the human CD8+ T-cell receptor repertoire. Sci Transl Med. (2010) 2:47ra64. 10.1126/scitranslmed.3001442 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B21] 21.Dean J, Emerson RO, Vignali M, Sherwood AM, Rieder MJ, Carlson CS, et al. Annotation of pseudogenic gene segments by massively parallel sequencing of rearranged lymphocyte receptor loci. Genome Med. (2015) 7:123. 10.1186/s13073-015-0238-z [DOI] [PMC free article] [PubMed] [Google Scholar]

[B22] 22.Jokinen E, Heinonen M, Huuhtanen J, Mustjoki S, Lähdesmäki H. TCRGP: determining epitope specificity of T cell receptors. bioRxiv. (2019) 10.1101/542332 [DOI] [Google Scholar]

[B23] 23.Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. (2011) 12:2825–30. [Google Scholar]

[B24] 24.Django Software Foundation Django (Version 2.0.7) (2018). Available online at: https://djangoproject.com (accessed August 2, 2019).

[B25] 25.Jones E, Oliphant T, Peterson P. SciPy: Open Source Scientific Tools for Python. Available online at: http://www.scipy.org/ (accessed August 2, 2019).

[B26] 26.Goloborodko AA, Levitsky LI, Ivanov MV, Gorshkov MV. Pyteomics — a Python framework for exploratory data analysis and rapid software prototyping in proteomics. J Am Soc Mass Spectrom. (2013) 24:301–4. 10.1007/s13361-012-0516-6 [DOI] [PubMed] [Google Scholar]

[B27] 27.VanderPlas J, Granger BE, Heer J, Moritz D, Wongsuphasawat K, Satyanarayan A, et al. Altair: interactive statistical visualizations for Python. J Open Source Softw. (2018) 3:1057 10.21105/joss.01057 [DOI] [Google Scholar]

[B28] 28.Stéfan van der Walt, Chris Colbert S, Varoquaux G. The NumPy array: a structure for efficient numerical computation. Comput. Sci. Eng. (2011) 13:22–30. 10.1109/MCSE.2011.37 [DOI] [Google Scholar]

[B29] 29.McKinney W. Data structures for statistical computing in Python. In: Proceedings of the 9th Python in Science Conference. (Austin, TX: ), 51–56. [Google Scholar]

[B30] 30.Dask Development Team Dask: Library for Dynamic Task Scheduling. (2016). Available online at: https://dask.org (accessed August 2, 2019).

[B31] 31.Bolotin DA, Poslavsky S, Mitrophanov I, Shugay M, Mamedov IZ, Putintseva EV, et al. MiXCR: software for comprehensive adaptive immunity profiling. Nat Methods. (2015) 12:380–1. 10.1038/nmeth.3364 [DOI] [PubMed] [Google Scholar]

[B32] 32.Legoux F, Debeaupuis E, Echasserieau K, De La Salle H, Saulquin X, Bonneville M. Impact of TCR reactivity and HLA phenotype on naive CD8 T cell frequency in humans. J Immunol. (2010) 184:6731–8. 10.4049/jimmunol.1000295 [DOI] [PubMed] [Google Scholar]

[B33] 33.Alanio C, Lemaitre F, Law HKW, Hasan M, Albert ML. Enumeration of human antigen – specific naive CD8+ T cells reveals conserved precursor frequencies. Blood. (2010) 115:3718–25. 10.1182/blood-2009-10-251124 [DOI] [PubMed] [Google Scholar]

[B34] 34.Redmond D, Poran A, Elemento O. Single-cell TCRseq : paired recovery of entire T-cell alpha and beta chain transcripts in T-cell receptors from single-cell RNAseq. Genome Med. (2016) 8:80. 10.1186/s13073-016-0335-7 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B35] 35.Lee ES, Thomas PG, Mold JE, Yates AJ. Identifying T cell receptors from high- throughput sequencing : dealing with promiscuity in TCR α and TCR β pairing. PLoS Comput Biol. (2017) 13:e1005313 10.1371/journal.pcbi.1005313 [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

Detection of Enriched T Cell Epitope Specificity in Full T Cell Receptor Sequence Repertoires

Sofie Gielis

Pieter Moris

Wout Bittremieux

Nicolas De Neuter

Benson Ogunjimi

Kris Laukens

Pieter Meysman

Abstract

Introduction

Materials and Methods

Collection of Training Data and Assembly of the Background TCR Dataset

Collection of TCR Datasets From Infectious-Disease Studies

Model Training and Performance Evaluation

Controlling the Baseline Prediction Rate

Epitope Specificity Enrichment Analysis

Web Tool

Results

Overview of the Epitope-Specific Models

Figure 1.

Evaluation of the Prediction Models Using a Leave-One-Study-Out Validation Approach

Table 1.

The Number of Unique Epitope-Specific TCRs Increases Post-vaccination

Figure 2.

Comparison of the Identification Level With Techniques Relying on Clonal Expansion

Table 2.

The Majority of the Identified LLWNGPMAV-Specific T Cells Are Not Present in Public Databases

Figure 3.

Identification of Enriched Vaccine-Associated TCR Repertoire Targets

Table 3.

Table 4.

Figure 4.

The Importance of the Background Dataset in Enrichment Analyses

Identification of Enriched Disease-Associated TCR Repertoire Targets

Table 5.

Evaluation of the Impact of MHC Bias in Epitope-Specific Prediction Models

Table 6.

Discussion

Data Availability Statement

Author Contributions

Conflict of Interest

Supplementary Material

References

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases