Integrating T cell receptor sequences and transcriptional profiles by clonotype neighbor graph analysis (CoNGA)

Stefan A Schattgen; Kate Guion; Jeremy Chase Crawford; Aisha Souquette; Alvaro Martinez Barrio; Michael JT Stubbington; Paul G Thomas; Philip Bradley

doi:10.1038/s41587-021-00989-2

Author manuscript; available in PMC: 2022 Jul 1.

Published in final edited form as: Nat Biotechnol. 2021 Aug 23;40(1):54–63. doi: 10.1038/s41587-021-00989-2

PMCID: PMC8832949 NIHMSID: NIHMS1771204 PMID: 34426704

Summary

Links between T cell clonotypes, as defined by T cell receptor (TCR) sequences, and phenotype, as reflected in gene expression profiles, surface protein expression, and peptide:MHC binding, can reveal functional relationships beyond the features shared by clonally related cells. We present clonotype neighbor-graph analysis (CoNGA), a graph-theoretic approach that identifies unbiased correlations between gene expression profile and TCR sequence through statistical analysis of gene expression and TCR similarity graphs. Using CoNGA, we uncovered associations between TCR sequence and gene expression profiles that include a previously undescribed ‘natural lymphocyte’ population of human circulating CD8+ T cells and a set of TCR sequence determinants of differentiation in thymocytes. These examples show that CoNGA may help elucidate complex relationships between TCR sequence and T cell phenotype in large, heterogeneous single-cell datasets.

Main

Previous studies pairing gene expression and TCR sequence have focused on the TCR sequence as a unique ‘barcode’ by which to identify clonally related cells. This approach produced insights into the development and interrelatedness of different T cell subsets within the context of cancer^1–8, infectious disease⁹, and homeostasis¹⁰. This body of work demonstrates that T cell clones derived from a common clonal ancestor tend to express similar transcriptional profiles. However, the availability of large single-cell sequencing datasets provides a rich pool of data to uncover relationships between TCR sequence similarity and cellular phenotype. Researchers have mapped the TCR sequence properties of previously identified T cell subsets^11–13, but systematic approaches that can identify previously unknown populations or subpopulations by correlating gene expression and TCR sequence have not been reported. Also lacking are methods for identifying correlations between TCR sequence and gene expression that do not extend to global similarity or associate with a defined cell population (e.g., correlations between specific TCR sequence properties and expressed genes that might span multiple cell subsets).

In parallel to the developments in single-cell profiling, methods for quantifying TCR repertoire features and identifying patterns within them have matured, helping extend our understanding of T cell biology. Previously, we introduced TCRdist, a measure for assessing inter-TCR similarity capable of identifying closely related clonotypes based on shared sequence features¹⁴. Based on this work and others^15,16, it is clear that T cells targeting the same pathogen-derived epitope utilize T cell receptors that share consistent, definable amino acid motifs. In addition to these conventional T cell responses, certain unconventional T cell populations, such as mucosal-associated invariant T (MAIT) cells and invariant natural killer T (iNKT) cells, are characterized by conserved TCR sequence features and gene expression profiles^11,12. Although the repertoires for a number of distinct T cell subsets with suitable markers for their enrichment have been described, it is likely that other subsets linked by TCR sequence and gene expression (GEX) remain undiscovered. We hypothesized that by identifying correlations between “TCR neighborhoods”, defined by shared sequence features, and gene expression, we could move beyond simply measuring gene expression variation within individual clonal families and potentially identify associations between T cell antigen-specificities and phenotypes.

To this end, we developed a graph-theoretic approach for clonotype neighbor-graph analysis, CoNGA, that identifies correlations between gene expression profile and TCR sequence features through analysis of similarity graphs defined on the set of T cell clonotypes. Application of CoNGA to publicly available T cell datasets identified multiple examples of GEX/TCR correlation including MAIT, iNKT, and epitope-specific T cell populations; TCR sequence determinants of T cell fate during thymic development; a previously undescribed ZNF683+/IKZF2+ (aka HOBIT+/HELIOS+) population of CD8+ T cells with long and biased CDR3 regions; and a striking correlation between expression of the gene EPHB6 and usage of a specific human TCR V gene segment, TRBV30. Applying CoNGA to four datasets that included pMHC binding profiles derived from sequencing of cell-surface bound, DNA-barcoded pMHC multimers revealed strong correlations between pMHC binding and both TCR sequence and gene expression. Systematic approaches such as CoNGA will play a key role in deconvoluting multi-modal single-cell datasets as they continue to grow in size and complexity.

Results

CoNGA graph-vs-graph analysis

In graph-vs-graph correlation analysis (Fig. 1a), CoNGA identifies statistically significant overlap between a gene expression similarity graph and a TCR sequence similarity graph. CoNGA similarity graphs are defined at the level of clonotypes rather than individual cells, since cells within the same clonotype (cells inferred to be descended from a common clonal ancestor) will all share the same TCR sequence and tend to have similar gene expression profiles (Fig. E1)^10,17,18. The goal is to identify T cell clonotypes whose neighbors in gene expression space overlap significantly with their neighbors in TCR sequence space. Here, we model the concept of a clonotype’s neighbors in gene expression or TCR space using the mathematical concept of a graph neighborhood, defined as the set of vertices directly connected to that clonotype’s vertex in the corresponding similarity graph. Briefly, CoNGA considers each clonotype in turn, counts how many other clonotypes are connected to it by both a TCR-similarity edge and a gene expression-similarity edge, and assigns a significance score (the CoNGA score). The CoNGA score is the probability of observing an equal or larger overlap by chance, multiplied by the total number of clonotypes to limit the false discovery rate from multiple comparisons. CoNGA scores range from 0 to the number of clonotypes; scores close to 0 are significant, scores around 1 are borderline, and scores above 1 are expected to occur by chance (see Methods for additional detail). T cell clonotypes with CoNGA scores below a significance threshold (henceforth referred to as “CoNGA hits”) are grouped into “CoNGA clusters” defined by shared gene expression and TCR cluster assignments (Fig. 1a, right panel). CoNGA clusters of sufficient size are analyzed to identify shared features including differentially expressed genes (DEGs) and TCR sequence motifs.
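
As an illustration only, the per-clonotype scoring described above could be sketched as follows. The sketch assumes a hypergeometric null model for the neighbor overlap (the statistical details are given in Methods), and every function and variable name here is hypothetical rather than part of the CoNGA software.

```python
from math import comb


def hypergeom_sf(overlap, n_total, n_gex, n_tcr):
    """P(X >= overlap) when n_tcr TCR neighbors are drawn without replacement
    from n_total candidate clonotypes, n_gex of which are GEX neighbors.
    Assumed null model for this sketch."""
    denom = comb(n_total, n_tcr)
    upper = min(n_gex, n_tcr)
    tail = sum(
        comb(n_gex, k) * comb(n_total - n_gex, n_tcr - k)
        for k in range(overlap, upper + 1)
    )
    return tail / denom


def conga_scores(gex_nbrs, tcr_nbrs):
    """gex_nbrs / tcr_nbrs: dict mapping each clonotype to the set of its
    neighbors in the GEX / TCR similarity graph (hypothetical input format)."""
    n = len(gex_nbrs)
    scores = {}
    for clone in gex_nbrs:
        g, t = gex_nbrs[clone], tcr_nbrs[clone]
        overlap = len(g & t)  # clonotypes linked by both kinds of edge
        p = hypergeom_sf(overlap, n - 1, len(g), len(t))
        scores[clone] = p * n  # multiply by the number of clonotypes
    return scores
```

With this convention a clonotype whose two neighbor sets never intersect receives a probability of 1 and therefore a score equal to the number of clonotypes, matching the stated upper end of the score range.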

Figure 1. — **(a)** CoNGA identifies correlation between T cell gene expression (GEX) and TCR sequence by constructing a gene expression similarity graph and a TCR sequence similarity graph and looking for statistically significant overlap between them. Overlap is assessed on a per-clonotype basis by counting the number of edges that originate at each clonotype and are shared between the two graphs, or equivalently by measuring the overlap between each clonotype’s GEX graph neighbors and its TCR graph neighbors and assigning a score that reflects the likelihood of seeing equal or greater overlap by chance (the CoNGA score). Clonotypes with CoNGA scores below a threshold are grouped based on shared GEX and TCR cluster assignments into CoNGA clusters. Clonotypes within each CoNGA cluster carry their initial GEX and TCR cluster identities which are combined together and used as a group ID for the CoNGA cluster. **(b-c)** Application of CoNGA on a dataset of human CD8+ T cells (*10x_200k_donor2a*). **(b)** 2D UMAP projections of clonotypes in the dataset based on GEX similarity (left three panels) and TCR similarity (right three panels), colored from left to right by (I) GEX cluster assignment; (II) CoNGA score; (III) joint GEX:TCR cluster assignment for clonotypes with significant CoNGA scores, using a bicolored disk whose left half indicates GEX cluster and whose right half indicates TCR cluster; (IV) TCR cluster; (V) CoNGA score; (VI) GEX:TCR cluster assignments for CoNGA hits, as in (III). **(c)** GEX and TCR sequence features of CoNGA hits in clusters with 5 or more hits are summarized by a series of logo-style visualizations, from left to right: cluster dendrogram based on graph connections; differentially expressed genes (DEGs), TCR sequence logos showing V and J gene usage and CDR3 sequences¹⁴; biased TCR sequence scores (Table S4), with red indicating elevated scores and blue indicating decreased scores. 
DEG and TCRseq sequence logos are scaled by the adjusted P value of the associations, with full logo height requiring a top adjusted P value below 10⁻⁶. DEGs with fold-change less than 2 are shown in gray.

We applied CoNGA to a collection of publicly available T cell datasets (Table S1) with single-cell gene expression profile and paired TCRαβ sequencing in an unbiased search for T cell populations defined by covariation between TCR sequence and gene expression profile. Figure 1b–c illustrates the CoNGA graph-vs-graph analysis workflow applied to a dataset of human CD8+ T cells sorted from peripheral blood (10x_200k_donor2a; Table S1). First, the UMAP algorithm¹⁹ is applied to the gene expression and TCR distance matrices of each dataset to generate two-dimensional projections of the gene expression and TCR landscapes (Fig. 1b, panels I-III and IV-VI, respectively). Next, a graph-based clustering algorithm^20,21 is applied to the gene expression matrix to partition the dataset into clusters of clonotypes with similar transcriptional profiles and to the TCR distance matrix to produce clusters of clonotypes with similar TCR sequences (Fig. 1b, panels I and IV). To visualize the relative location of the top-scoring clonotypes in both the gene expression and TCR UMAP space, these projections are also colored by CoNGA score (Fig. 1b, panels II and V). Lastly, the gene expression and TCR cluster assignments of CoNGA hits are shown in the 2D projections using bicolored disks whose left and right halves correspond to the gene expression and TCR cluster assignments, respectively (Fig. 1a; Fig. 1b, panel III for the gene expression landscape and panel VI for the TCR landscape; Fig. S1). These cluster assignments provide useful handles for identifying CoNGA hits because they contain information on both gene expression and TCR, allowing us to map between the distinct two-dimensional landscapes. 
For example, in Figure 1b at the top of the gene expression landscape is a group of CoNGA hits that all belong to gene expression (GEX) cluster 4 (light brown on the left half of the disk) and TCR cluster 5 (purple on the right half of the disk), or equivalently, (GEX:TCR) cluster pair (4:5); based on consistent GEX:TCR disk coloring, we can see that these correspond to the group of clonotypes in the TCR landscape also located near the top of the plot and that they are likely TRAV1 (from the TCR cluster identifier in Fig. 1b panel IV). Each (GEX:TCR) cluster pair containing a minimum number of CoNGA hits (here 5) is characterized by a row of CDR3 sequence-logo²² style visualizations that identify the distinguishing features of those CoNGA hits (Fig. 1c).
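
The grouping of CoNGA hits into (GEX:TCR) cluster pairs can be sketched in a few lines. This is a minimal illustration under stated assumptions: the score threshold of 1.0 is a placeholder chosen for the example (the text only says that scores around 1 are borderline), the minimum size of 5 follows the value used for this dataset, and the function name and input dictionaries are hypothetical.

```python
from collections import defaultdict


def conga_clusters(scores, gex_cluster, tcr_cluster, threshold=1.0, min_size=5):
    """Group clonotypes with scores below `threshold` by their
    (GEX cluster, TCR cluster) pair and keep pairs with >= min_size hits.
    All three inputs are dicts keyed by clonotype (assumed format)."""
    groups = defaultdict(list)
    for clone, score in scores.items():
        if score < threshold:  # a "CoNGA hit" in this sketch
            pair = (gex_cluster[clone], tcr_cluster[clone])
            groups[pair].append(clone)
    return {
        pair: sorted(members)
        for pair, members in groups.items()
        if len(members) >= min_size
    }
```

Keying each group by the pair of cluster labels mirrors the bicolored disks in Figure 1b, where the same pair identifies a group of hits in both the gene expression and TCR landscapes.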

Four CoNGA clusters of ≥5 clonotypes were identified in this dataset of human CD8 T cells (Fig. 1c; see Supplementary Data 1 for TCR sequence information on all CoNGA clusters). The two largest (GEX:TCR) clusters—(4:11) and (4:5)—show the invariant TCR chains and distinctive gene expression profiles of MAIT cells²³. Cluster (2:12) is characterized by a strong TCRβ sequence motif and high expression of cytotoxicity/activation markers including GNLY and CCL5. The TCR sequence motif for this cluster matches the consensus for the response to the immunodominant A*02:01-restricted Influenza M1₅₈ epitope¹⁴ (GILGFVFTL). Further confirming this, the top DEG for this cluster (‘A02_GILG9’) is actually the read count for a DNA-barcoded A*02:01-M1₅₈ multimer that was included in the experiment. Application of CoNGA to three additional human and mouse PBMC datasets identified MAIT and iNKT cell clusters as well as CD8+ T cell clusters with a naive phenotype and TCR sequence features that appear to bias thymic development toward the CD8 compartment (Figs. E2 and E3; Supplementary Note 1).

CoNGA defines a HOBIT+/HELIOS+ T cell population

We next applied CoNGA to four large datasets of peripheral blood CD8+ T cells that were sorted for positive binding to at least one of 50 DNA-barcoded pMHC multimers (10x_200k_donor1–4²⁴). Our analysis of TCR:pMHC binding described below identifies a number of strong epitope-specific responses for many of the pMHC multimers in the panel. However, for several of the multimers we observed significant levels of non-specific binding (Fig. E4), for example to MAIT cells; as a result, these datasets also include diverse T cells whose specificity extends beyond the pMHC multimer panel (Supplementary Note 2). CoNGA detected a large number of significant GEX/TCR correlations across these datasets, identifying 62 CoNGA clusters containing ≥5 clonotypes (Figs. S2–5) and 42 clusters using the more stringent size threshold of 0.1% of the dataset. Figure 2a provides an overview of the largest CoNGA clusters in the 10x_200k_donor1 dataset. Further examination allowed categorization of the CoNGA clusters depicted in Figure 2a into three groups: (1) Flu M1₅₈-responding clonotypes; (2) MAIT cells; and (3) a population of clonotypes with a shared gene expression profile (GEX cluster 2), diverse TCR gene usage, and rather long CDR3 regions. These CoNGA clusters in GEX cluster 2 showed high expression of the transcription factors ZNF683 (aka HOBIT) and IKZF2 (aka HELIOS), along with a number of other NK cell-associated receptors including KLRC2, KLRC3, several KIR genes (e.g., KIR2DL3), and Natural Cytotoxicity Triggering Receptor 3 (NCR3; Fig. 2b). Notably, several of their DEGs matched those in the HLA interactor gene set (described in Fig. S6 and Supplementary Note 2), suggesting that the clonotypes contained in these CoNGA clusters were enriched through non-specific pMHC binding. 
Analysis of the features distinguishing the HOBIT+ population in 10x_200k_donor1 suggested that they were likely CD8⁺CD45RA⁺CD45RO^dim/− based on surface protein labeling, negative for CCR7 expression, and positive for KLRC2 and a number of KIR2 genes (Fig. 2b). Using flow cytometry, we were able to confirm the presence of CD8⁺CD45RA⁺CD45RO⁻CCR7⁻ T cells expressing different combinations of KIR2 and KLRC2 in human PBMC samples (0.1–8.5% of CD8 T cells, n=12 donors), and found the KLRC2⁺KIR2D^mix and KLRC2⁻KIR2D⁺ subsets had a higher frequency of HELIOS⁺ cells compared to KLRC2⁻KIR2D⁻ CD8 T cells (Fig. 2c–f, gating strategy in Fig. E5, Supplementary Note 3).

Figure 2. — **(a)** CoNGA analysis of *10x_200k_donor1*. Only CoNGA clusters containing at least 40 hits are shown. **(b)** 2D GEX projection of the *10x_200k_donor1* dataset colored by ‘*is_hobit*’ (an indicator variable for the *HOBIT*+ CoNGA population), iMHC score, CD45RA, CD45RO, CD8α surface protein, *CCR7*, *ZNF683*, *IKZF2*, *KLRC2*, *KLRC3*, *KIR2DL3*, and *NCR3* expression, all averaged over GEX graph neighborhoods (with neighborhood size equal to 0.1% of the dataset). The *is_hobit* variable is 1 for all CoNGA hits in GEX cluster 2 and 0 otherwise. **(c)** Detection of KLRC2⁺KIR2D^mix and KLRC2⁻KIR2D⁺ CD8 T cells in human PBMCs by flow cytometry. Gated on CD3⁺CD8⁺CD45RA⁺CCR7⁻ cells (full gating strategy in Fig. E5). **(d)** Frequency of KLRC2⁺KIR2D^mix and KLRC2⁻KIR2D⁺ CD8 T cells in human PBMCs (n=12). Line indicates paired observations for each sample. **(e)** Representative histogram of intracellular HELIOS staining in KLRC2⁺KIR2D^mix, KLRC2⁻KIR2D⁺, and KLRC2⁻KIR2D⁻ CD8 T cells (full gating strategy in Fig. E5). **(f)** Frequency of HELIOS⁺ KLRC2⁺KIR2D^mix, KLRC2⁻KIR2D⁺, and KLRC2⁻KIR2D⁻ CD8 T cells in human PBMCs (n=12). P values calculated by 1-sided t-test. For the box-and-whisker plots in (d and f) the lower limit of the box corresponds to the 1st quartile, center line the median, and upper limit the 3rd quartile; whiskers extend up to 1.5 * IQR.

We identified significant sequence biases in the CDR3 loops of these HOBIT-expressing clonotypes (Table S2). Compared to the remainder of the dataset, they are significantly longer (P<10⁻³⁰⁰), more positively charged (P<10⁻⁴⁰), higher in aromatic, hydrophobic, and bulky residues, particularly tryptophan (P<10⁻⁶⁰), and higher in cysteine (>100-fold enriched in the CDR3β, P<10⁻⁵⁰). These sequence characteristics are strikingly similar to features identified in a comparison of MHC-independent versus MHC-restricted TCR sequences from an experimental study of TCR repertoires in MHC-knockout mice²⁵. Similar trends were also seen in comparisons of simulated and observed TCR sequences from pre- versus post-selection repertoires^26–28, and in CD8αα+ intraepithelial lymphocytes and their thymic precursors^29,30. Based on these trends, we hypothesize that this CoNGA-identified population represents a noncanonical, self-specific or MHC-independent T cell population. We developed a numerical score, the iMHC score (for ‘independent of pMHC’), that captures the defining CDR3 sequence features of this putative MHC-independent T cell repertoire (see Methods and Table S3).

CoNGA identifies GEX/TCR correlation in thymic T cells

We next applied CoNGA to a recently published single-cell atlas of human thymic T cells³¹. This dataset combines thymic tissue from embryonic and fetal stages and postnatal thymi from children and adults, totaling over 9,400 clonotypes with paired TCRα and TCRβ sequences. CoNGA identified a large number of significant hits in this dataset, primarily within the DP (double-positive), CD8 single positive (SP), CD4 SP, Treg, and CD8αα⁺ thymic populations (Fig. 3). In TCR sequence space, we see a concentration of hits in the TRAV41 cluster (this TRAV gene is enriched in DP cells), the TRAV1 and TRAV12 clusters (enriched in CD8 cells), and in the TRAV14 cluster (enriched in CD8αα cells) (Fig. 3). The CD8+ clusters identified by CoNGA also showed high CD8 sequence scores and high scores for a measure (‘alphadist’) of the genomic distance between the TRAV and TRAJ gene segments incorporated in a clonotype’s TCRα chain. The DP CoNGA clusters show low alphadist scores, preference for TRAV41 and other TRAV genes at the 3’ end of the locus, longer CDR3 loops (CDR3 length has been shown to decrease during thymic selection²⁶), and higher scores for the ‘rim’ and ‘disorder’ amino acid properties (and lower scores for ‘strength’), which may suggest more polar, less bulky, and less strongly interacting CDR3 regions with lower overall affinity for pMHC. Consistent with the findings of Park et al.³¹, both CD8αα clusters show low alphadist scores; however, CoNGA further identified high iMHC scores and longer CDR3 loops as TCR features of these clusters. Interestingly, the CD8αα(II) cluster (see green/orange disc in Fig. 3 lower left) expressed both ZNF683 and IKZF2, which together with TCR features similar to those of the HOBIT+ T cells in the blood identified above suggests a possible precursor relationship between these two populations that warrants further investigation.

Figure 3. — Same arrangement of plots as in Figure 1, with the following additions: two rows of GEX landscape plots colored by (I) expression of selected marker genes, (II) Z-score normalized and GEX-neighborhood averaged expression of the same marker genes, and (III) TCR sequence features (see main text and Table S4 for TCR feature descriptions); (IV) GEX ‘logos’ for each CoNGA cluster consisting of a panel of marker genes shown with red disks colored by mean expression and sized according to the fraction of cells expressing the gene (gene names are given above). Only CoNGA clusters containing at least 9 CoNGA hits (0.1% of the dataset) are shown. The five colored lines/labels group related CoNGA clusters as annotated by the text labels.

CoNGA graph-vs-feature analysis

In CoNGA graph-vs-feature analysis (Fig. 4a), numerical features calculated on the basis of one cellular property, GEX or TCR sequence, are mapped onto a similarity graph defined by the other property, and the feature score distributions for each of the neighborhoods in the graph are compared to their background distributions to identify neighborhoods with skewed scores (as above, a graph neighborhood consists of a single central vertex together with all of its directly connected neighbors). As GEX features, we consider the expression levels of individual genes, and for TCR sequence features, we use a set of CDR3 amino acid property values as well as a handful of additional, sequence-based scores (Table S4 and Fig. S7). We first used graph-vs-feature analysis to identify additional members of the HOBIT+/HELIOS+ unconventional T cell subset by looking for GEX graph neighborhoods with elevated iMHC scores. Although per-clonotype iMHC scores are highly variable (Fig. 4b), by computing averages over GEX graph neighborhoods we can identify a subregion of GEX space with enhanced scores (Fig. 4c), whose significance can be assessed using standard statistical tests (Fig. 4d). Three of the four 10x_200k donors show populations of clonotypes with significantly enhanced iMHC scores (Fig. E6a–d) whose DEGs correlate well with one another and with the key marker genes (ZNF683, CD7, CD99, DUSP1/2) for the original HOBIT+ CoNGA clusters (Fig. E6e–g).
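
The neighborhood comparison described above can be illustrated with a short sketch. It assumes a 1-sided Mann-Whitney-Wilcoxon test (the test named in the Figure 4 legend), implemented here with a plain normal approximation and no tie or continuity correction, so the P values are approximate. The function names and the dictionary-based inputs are hypothetical.

```python
from math import erfc, sqrt


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def mwu_greater(fg, bg):
    """Approximate 1-sided P value that fg scores tend to exceed bg scores."""
    n1, n2 = len(fg), len(bg)
    if n1 == 0 or n2 == 0:
        return 1.0
    ranks = average_ranks(list(fg) + list(bg))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    z = (u1 - n1 * n2 / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return 0.5 * erfc(z / sqrt(2))  # upper tail of the standard normal


def neighborhood_enrichment(nbrs, feature):
    """For each center clonotype, return (neighborhood mean, P value) comparing
    the feature in its neighborhood (center included) with all other clonotypes."""
    results = {}
    for center, neighbors in nbrs.items():
        hood = {center} | set(neighbors)
        fg = [feature[c] for c in hood]
        bg = [v for c, v in feature.items() if c not in hood]
        results[center] = (sum(fg) / len(fg), mwu_greater(fg, bg))
    return results
```

Averaging over the neighborhood smooths noisy per-clonotype values such as the iMHC score, while the rank-based test asks whether the neighborhood as a whole is shifted relative to the rest of the dataset.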

Figure 4. — **(a)** In graph-vs-feature analysis, a numerical feature defined by one property (here gene expression) is mapped onto a similarity graph defined by the other property (TCR sequence), and graph neighborhoods with skewed score distributions are identified using statistical tests that compare the scores for each neighborhood (including the center clonotype) with the scores of the remaining clonotypes (left). For example, the gene *KLRB1* (*CD161*) shows a non-uniform distribution over the TCR sequence landscape—discrete regions of higher expression (red) against a background of lower expression (blue)—suggesting that a group of homologous clonotypes belongs to a T cell subtype characterized by *KLRB1* expression. This is quantified for a single clonotype (green outline) and its TCR sequence neighbors (black outlines) in the violin plot (right), which shows the *KLRB1* expression level for the clonotype and its neighbors on the right and for the remainder of the dataset on the left (boxes show quartiles with whiskers extending to 1.5*IQR). The 1-sided Mann-Whitney-Wilcoxon P value for this expression difference is 1.5e-46 (n=2,427 clonotypes). **(b)** 2D GEX projection of the *10x_200k_donor1* dataset colored by iMHC score (standardized to have mean 0 and standard deviation 1). **(c)** Same projection as (b) but each clonotype is colored by the average iMHC score in its GEX graph neighborhood. **(d)** The same projection as in (b and c) but colored by P values for iMHC enrichment in each clonotype’s graph neighborhood (the set of iMHC scores in each clonotype’s neighborhood is compared to the remainder of the iMHC scores using an unpaired, 1-sided Mann-Whitney-Wilcoxon test). **(e)** Graph-vs-feature correlation analysis highlights TCR:GEX covariation in Flu-specific T cells. 
Correlation between a score derived from the TCR sequence (left panel), here defined by the surface counts for the multimerized A*02:M1₅₈ pMHC, and 2 scores derived from the GEX profile (right panels, *ITGB1* and *KLRC1*) is illustrated by mapping the scores onto the 2D TCR landscape for the *10x_200k_donor2* dataset (after Z-score normalization and averaging over graph neighborhoods).

We next applied graph-vs-feature analysis in the reverse direction to identify genes that are differentially expressed in specific TCR graph neighborhoods. Table S5 provides the top hits from this analysis for the datasets analyzed in this study (the top significant gene for each GEX:TCR cluster pair and a maximum of 10 genes per dataset are shown). Notable features include MAIT-associated genes such as KLRB1 (Fig. 4a, middle panel) and SLC4A10; genes associated with the HOBIT+ population such as ZNF683 and KLRC3 (Fig. 2b; Fig. E6h); and genes upregulated in the Flu M1₅₈ response including ITGB1 and KLRC1 (in donor 2; Fig. 4e). We also observed TCR neighborhoods with elevated levels of CD8A and CD8B, which appear to overlap with the populations identified in the earlier graph-vs-graph correlation analysis and suggest the presence of TCR sequence features that bias toward the CD8+ compartment. Such TCR sequence biases have been previously reported in analyses of bulk repertoire^32–35.

A recurring feature identified by CoNGA graph-vs-feature analysis was a positive correlation between expression of the gene EPHB6 and usage of the TRBV30 gene segment in humans (Fig. 5a), and analogously Ephb6 and Trbv31 in mouse (Fig. 5b, Tables S5–6). The TRBV30 segment is unique among TRBV genes in being located alone downstream of the TRBJ and TRBC genes at the end of the TRB locus³⁶. Providing a potential clue into the mechanism underlying this covariation, EPHB6 is located adjacent to TRBV30 on Chromosome 7, ~40 kb downstream from the TRB locus (Fig. 5c). A focused search for covariation between TCR gene segment usage and DEGs on ten separate human datasets confirmed higher EPHB6 expression in clonotypes that incorporate the TRBV30 gene segment (Fig. 5d), and the analogous search in mouse confirmed higher Ephb6 expression in Trbv31+ clonotypes (Fig. 5e, n=4). Flow cytometric analysis confirmed that these trends extend to cell-surface levels of EPHB6 protein (Fig. 5f; gating strategy in Fig. E7). Given that EPHB6 has been shown to play a role in T cell activation^37,38, TRBV30+ clonotypes may have distinctive functional properties due to their elevated EPHB6 surface expression.

Figure 5. — **(a)** 2D projections based on TCR sequence of a human dataset colored by TCR neighborhood-averaged *TRBV30* (left) and *EPHB6* (right) expression. **(b)** 2D projections based on TCR sequence of a mouse dataset colored by neighborhood-averaged *Trbv31* (left) and *Ephb6* (right) expression. **(c)** Locus view of human *TRBV30* and *EPHB6*. **(d)** Average *EPHB6* expression for *TRBV30*- and *TRBV30*+ clonotypes in 10 human datasets. **(e)** Average *Ephb6* expression for *Trbv31*- and *Trbv31*+ clonotypes in 4 mouse datasets. **(f)** Comparison of cell surface EPHB6 protein levels between TRBV30+ and TRBV30- CD4+ and CD8+ human peripheral blood T cells (n=12) (full gating strategy in Fig. E7). Geometric mean fluorescence intensity (MFI) shown on the y-axis. P values calculated by 1-sided t-test. The lower limit of the box corresponds to the 1st quartile, center line the median, and upper limit the 3rd quartile; whiskers extend up to 1.5 * IQR.

TCR and GEX similarity among epitope-specific clonotypes

The use of pMHC-multimers conjugated to DNA barcodes as cell labeling reagents enables high-throughput interrogation of pMHC binding in parallel with other single-cell analyses. We applied CoNGA to investigate correlation between gene expression profiles, TCR sequences, and pMHC:TCR interactions in a large dataset of human T cells sorted for pMHC-multimer binding (10x_200k_donor1–4). To do this, we used the pMHC-binding information, stringently filtered and condensed to the level of clonotypes (see Methods), to define a neighbor graph structure in which edges connect clonotypes that bind to the same pMHC (Fig. E8, Supplementary Data 2). We then applied CoNGA graph-vs-graph analysis to look for statistically significant overlap between this pMHC-binding graph and the GEX and TCR similarity graphs defined above. We measured graph overlap, on a per-pMHC basis, as the enrichment of GEX (or TCR) similarity graph edges within the pMHC positive clonotypes (Fig. 6a–b, Table S7). From this analysis we can see, as expected, that nearly all the pMHC-positive clonotype subsets show greater than expected TCR sequence similarity. Interestingly, we also see that all pMHC-positive populations show greater than expected GEX similarity, with highly significant P values and large fold-enrichments for most pMHCs with a sufficient number of analyzed clonotypes. These results suggest that clonotypes positive for the same pMHC have more similar gene expression profiles than would be expected by chance.
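
To make the per-pMHC overlap measure concrete, the following sketch computes a fold-enrichment of similarity-graph edges among the clonotypes positive for one pMHC. It is an illustration under assumptions: random expectation is taken to be the number of intra-subset edges expected if the subset were placed uniformly at random on the graph, which may differ from the procedure in Methods, and no P value is computed. All names are hypothetical.

```python
def edge_enrichment(edges, positive, n_clonotypes):
    """edges: iterable of undirected (u, v) pairs from a GEX or TCR similarity graph.
    positive: clonotypes that bind a given pMHC.
    Returns (observed, expected, fold) for edges with both ends in `positive`."""
    positive = set(positive)
    # Deduplicate undirected edges and drop self-loops.
    unique_edges = {frozenset(e) for e in edges if e[0] != e[1]}
    observed = sum(1 for e in unique_edges if e <= positive)
    m = len(positive)
    # Chance that both ends of a random edge fall inside a random subset of size m.
    frac = m * (m - 1) / (n_clonotypes * (n_clonotypes - 1))
    expected = len(unique_edges) * frac
    fold = observed / expected if expected > 0 else float("nan")
    return observed, expected, fold
```

A fold value above 1 corresponds to more intra-subset edges than expected at random, and a value below 1 to fewer, which is the quantity plotted on the x-axis of Figure 6a–b.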

Figure 6. — **(a-b)** Each marker represents a population of pMHC-positive clonotypes in one of the four *10x_200k* donors. Markers are labeled with the two-digit HLA allele and the first three amino acids of the peptide for the given pMHC (see Table S7 for details); colors indicate the source donor and symbols are sized based on the number of pMHC+ clonotypes found as indicated in the legend. Markers are positioned based on the rate of intra-subset GEX **(a)** or TCR **(b)** graph edges relative to random expectation (x-axis; >1 indicates enrichment while <1 indicates depletion) and corresponding 2-sided P value (y-axis). **(c)** Heatmap of scaled DEGs and surface-protein features across different pMHC-positive populations.

We performed all-against-all differential expression analyses to identify upregulated genes within each pMHC-positive subset (Fig. 6c). Examination of the expression patterns in Figure 6c reveals a number of trends: the naive MART1 responses cluster together at the right and show higher levels of CD45RA and lower levels of PD-1 and CD45RO; Flu M1₅₈ (A02_GIL_MP) responses cluster together based on shared expression of specific markers including GNLY, ITGB1, and IFITM2; EBV-specific responses show what may be a partitioning based on whether the antigens are ‘early’ or ‘latent’ genes, with the latent-gene responses showing higher GZMK, JUNB, CD45RO, and lower CD45RA compared to the ‘early’-gene responses (Fig. E9). Application of gene set variation analysis (GSVA) to better characterize the pMHC phenotypes showed an enrichment of genes associated with naive T cells for some epitopes (e.g., MART1 and B08_RAK in the B*08-negative donor 1) while others (e.g., BMLF1 and BZLF1 in Donor 2) had clear signatures of activation/memory (Fig. E9).

Discussion

In this study, we have introduced and applied an analytical tool, clonotype neighbor graph analysis or “CoNGA”, which we demonstrate to be capable of uncovering T cell populations defined by shared TCR sequence and gene expression features within large single-cell datasets. Application of CoNGA’s graph-vs-graph analysis on a diverse collection of datasets identified distinct GEX profiles of epitope-specific T cells; bias in the repertoire selection of naive CD8+ and CD4+ T cell populations; multiple populations of thymic T cells with biased TCR repertoires; and a putative MHC-independent, HOBIT/HELIOS-expressing CD8+ T cell subset detected both in the thymus and peripheral blood with distinctive CDR3 sequence features. CoNGA analysis applied to a graph defined by single-cell pMHC-binding data determined that T cell populations specific for different pMHCs show distinctive GEX profiles.

Further, while the identification of marker genes associated with cells clustered in GEX space is a routine part of single-cell analysis, there are currently no available methods for systematically identifying genes associated with TCR clusters or TCR sequence biases that define GEX clusters. CoNGA addresses this gap with its graph-vs-feature analysis, in which TCR-derived properties such as CDR3 amino acid composition or V gene usage are mapped onto the GEX landscape to detect neighborhoods with biased feature distributions; GEX-derived properties such as the expression levels of individual genes are similarly analyzed to detect biased regions of the TCR landscape. Applying this analysis revealed the long CDR3s of the HOBIT+ population enriched for hydrophobic residues, and a previously uncharacterized and highly significant correlation between expression of the EPHB6 gene and usage of the TRBV30 gene segment. This analysis mode is not limited solely to TCR features but can also leverage any other labelled feature (e.g., pMHC, cell surface markers) that has been linked, quantified, and integrated into the dataset.

An important next step will be to validate these findings by applying CoNGA to other datasets with GEX and TCR (and perhaps pMHC binding) information, as they become available. It will also be relevant to experimentally characterize the T cell populations identified by CoNGA, which should be possible using flow cytometry and the marker genes highlighted by CoNGA clustering. Furthermore, matching CoNGA-identified TCR sequences against bulk TCR sequence datasets may provide additional clues to their functionality while also shedding light on the matched repertoire sequences (Fig. E3).

Our analyses have a number of limitations that could be addressed in future work. First, a consequence of operating at the level of clonotypes rather than individual cells is that variation among the cells belonging to expanded clonotypes becomes obscured. It is also important to keep in mind that the results from CoNGA will depend critically on the distance measures used to define clonotype similarity and the frameworks chosen for detecting GEX/TCR correlation (Fig. S9). In our experience, successful application of CoNGA requires a relatively large number of unique clonotypes (at least several hundred), which, depending on the degree of clonal expansion, may require a substantially larger number of individual cells. Lastly, the generality of the biological observations we report here should be weighed against the small number of donors examined. Future studies on larger cohorts will be necessary to definitively assess some of our observations.

To our knowledge, no previous algorithm enables systematic detection of GEX:TCR correlation. There are many possible extensions of CoNGA to explore in future work. CoNGA is agnostic to the source of the clonotype graphs, and hence could be applied to graphs defined by new similarity measures (based on surface protein expression, for example), new T cell clustering approaches³⁹, epigenetic rather than gene expression profiles, or new immunological and clinical phenotypes. CoNGA could also be applied to B cell clonotypes by incorporating a BCR sequence similarity score analogous to TCRdist. It may also be worthwhile to explore the use of more sophisticated graph-correlation algorithms developed in the computer science and machine learning communities as alternatives to the neighborhood-overlap and neighborhood-score enrichment that we have applied here.

Our analyses have a number of broader biological implications that warrant further consideration. First, the observation of a diversity of gene expression profiles across the different epitope-specific T cell populations argues for a broad continuum of memory T cell phenotypes⁴⁰ rather than a small number of discrete subsets. Indeed, the definition of memory phenotypes would seem to be significantly determined by the eliciting pathogen. This diversity also suggests that improved prediction of target pMHC epitopes for T cells might be possible by combining TCR sequence with information on GEX profile⁴¹. The putative MHC-independent and naive T cell populations identified by CoNGA hint at developmental influences of TCR sequence on T cell fate that go beyond the well-characterized role of invariant and semi-invariant TCRs⁴². We are optimistic that analytical approaches combined with high-throughput single-cell experiments will continue to illuminate aspects of adaptive immunology for years to come.

Methods

CoNGA algorithm

CoNGA was developed to identify correlations between gene expression profile and TCR sequence in diverse T cell populations without prior knowledge of the precise nature of these correlations. We envisioned two broad categories of correlation: one based on similarity, in which cells similar with respect to gene expression profile are also similar with respect to TCR sequence, and one based on features, in which specific aspects of gene expression and of TCR sequence are correlated without global similarity of both properties. CoNGA graph-vs-graph correlation was developed to detect the first category of correlation, using the mathematical concept of graph neighborhoods to formalize our intuitive notion of global similarity. In contrast, de novo discovery of feature-based correlations, without prior knowledge of the correlated features, is more challenging, as it requires enumeration and testing of all possible feature pairs. CoNGA graph-vs-feature analysis represents a compromise approach in which we assume that, at least on one side of the correlation, some degree of global similarity is present (this is the “graph-” side); we then enumerate possible features defined by the other property and test for graph neighborhoods with biased feature distributions. CoNGA similarity graphs are defined at the level of clonotypes rather than individual cells. In the TCR similarity graph, each clonotype is connected by edges to its K nearest-neighbor (KNN) clonotypes based on TCR similarity as assessed by the TCRdist measure¹⁴, which scores sequence similarity in the pMHC-contacting CDR loops of the TCRα and TCRβ chains (here, K is an adjustable parameter specified as a fraction of the total number of clonotypes). In the gene expression similarity graph, each clonotype is connected by edges to its KNN clonotypes based on similarity in gene expression profile. 
Expanded clonotypes are represented by the gene expression profile of a single representative cell with the smallest average gene expression distance to the rest of the clonal family.
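As a rough illustration of the graph construction described above, the following sketch builds K-nearest-neighbor sets from a precomputed clonotype distance matrix, with K given as a fraction of the number of clonotypes. The function name, the rounding rule for K, and the toy distances are assumptions made for illustration; they are not taken from the conga package.

```python
def knn_neighbor_sets(distances, nbr_frac):
    """Return, for each clonotype, the index set of its K nearest neighbors.

    distances: square list-of-lists of pairwise distances (TCR or GEX).
    nbr_frac:  K expressed as a fraction of the number of clonotypes.
    """
    n = len(distances)
    # Assumption: K is rounded to the nearest integer and is at least 1.
    k = max(1, int(round(nbr_frac * n)))
    neighbor_sets = []
    for i in range(n):
        # Rank all other clonotypes by distance, excluding the clonotype itself.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: distances[i][j])
        neighbor_sets.append(set(others[:k]))
    return neighbor_sets


# Toy example: four clonotypes placed on a line at positions 0, 1, 2 and 10.
positions = [0.0, 1.0, 2.0, 10.0]
toy_distances = [[abs(a - b) for b in positions] for a in positions]
toy_neighbors = knn_neighbor_sets(toy_distances, nbr_frac=0.5)
```

The same routine could be applied to either distance matrix, which yields the two neighbor sets per clonotype that graph-vs-graph analysis compares.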

CoNGA software package

An open-source python3 package implementing CoNGA graph-vs-graph and graph-vs-feature analysis is available from the software repository GitHub (https://github.com/phbradley/conga). The conga package is built on the scanpy⁴³ python package (https://github.com/theislab/scanpy) for single-cell analysis and makes heavy use of scanpy’s AnnData object to store integrated gene expression and TCR sequence data. We are grateful to the authors of scanpy for creating such a robust and useful package. CoNGA includes an implementation of the TCRdist¹⁴ distance calculation and TCR logo construction routines. Finally, CoNGA depends on the standard python data science tools numpy, scipy, matplotlib, pandas, and scikit-learn for visualization, data manipulation, and statistical calculations.

TCR analysis

VDJ sequence data in the filtered_contig_annotations.csv output file generated by 10X Genomics cellranger vdj are first parsed into paired clonotypes using the conga.tcrdist.make_10x_clones_file function. Here, by default, the 10X cellranger clonotype definitions are filtered to remove spurious chain sharing and merge split clonotypes (e.g., due to partial recovery of a second TCRα transcript). Next, to quantify and assess the similarity between TCR sequences in the dataset, a matrix of pairwise TCRdist distances between each unique paired TCR from this cleaned clonotype table is computed. Kernel principal components analysis (kPCA) as implemented in scikit-learn’s KernelPCA class is then used to extract the top 50 components of variation from this distance matrix. While the raw TCRdist values can be used directly in dimensionality reduction and clustering (available as an option in the pipeline), the kernel PCs are used by default as a more memory-efficient alternative as they can be directly incorporated into the standard single-cell workflows in place of the principal components extracted from the gene expression counts matrix. For generation of 2D landscape projections, CoNGA uses the UMAP algorithm for dimensionality reduction¹⁹ as implemented in scanpy.tl.umap. Clusters of clonotypes with similar T cell receptor sequences are identified with the Louvain^20,21 graph-based clustering algorithm (scanpy.tl.louvain). Both UMAP projection and clustering rely on a nearest neighbors calculation conducted with the scanpy.pp.neighbors routine with 10 neighbors and 50 principal components (the 50 kernel PCs computed from the distance matrix). 
To annotate the Louvain clusters in CoNGA visualizations, the most frequent V segment in each cluster is identified and appended to the cluster name if it is present in at least 50% of the clustered TCRs, uppercased if present in at least 75% of the TCRs (clusters are initially named with consecutive integers, starting at 0 with the largest cluster).
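The cluster-naming rule above can be sketched as follows. This is a minimal illustration, not the conga implementation: the function name, the underscore separator, and the use of lowercase for the 50–75% case are assumptions.

```python
from collections import Counter


def annotate_cluster_name(cluster_index, v_genes):
    """Append the dominant V segment to a cluster name using the 50%/75% rule."""
    name = str(cluster_index)
    if not v_genes:
        return name
    top_gene, top_count = Counter(v_genes).most_common(1)[0]
    fraction = top_count / len(v_genes)
    if fraction >= 0.75:
        # Strongly dominant V segment: shown in uppercase.
        return f"{name}_{top_gene.upper()}"
    if fraction >= 0.5:
        # Assumption: the less dominant case is shown in lowercase.
        return f"{name}_{top_gene.lower()}"
    return name


# Toy cluster in which 8 of 10 TCRs use the same V segment.
example_name = annotate_cluster_name(0, ["TRBV30"] * 8 + ["TRBV19"] * 2)
```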

TCR sequence features

For each clonotype, CoNGA calculates a set of TCR sequence-based scores for use in graph-vs-feature analysis and for annotating graph-vs-graph clusters (Table S4). First, a set of 28 different amino acid properties (Fig. S7, Table S4) is averaged over the central amino acids in the alpha and beta chain CDR3 loops (excluding the first 4 and last 4 residues of each CDR3, where the full CDR3 sequence is defined as beginning with the conserved cysteine and ending with, and inclusive of, the phenylalanine immediately before the GXG motif in the J region). These scores include a set compiled from original sources^44–49 by the authors of the VDJtools package⁵⁰ as well as the five Atchley factors⁵¹. Seven additional sequence-based scores are calculated: ‘alphadist’, which measures the ordinal distance between the TRAV and TRAJ genes when the full set of gene segments is ordered by genomic position; ‘imhc’, the iMHC score (detailed below); ‘cd8’, a simple CD8-versus-CD4 preference score calculated from the TCR V and J gene usage, CDR3 length, and CDR3 amino acid composition, based on frequency differences between flow-sorted CD8+ and CD4+ TCR sequence repertoires; ‘cdr3len’, total CDR3 length; ‘mait’, which assigns a score of 1 to TCRs with an alpha chain using the TRAV1–2 and TRAJ33/TRAJ20/TRAJ12 segments (TRAV1 and TRAJ33 in mouse) and a CDR3α length of 12, and 0 to all other TCRs; ‘inkt’, which assigns a score of 1 to TCRs with the TRAV10/TRAJ18/TRBV25 gene combination and a CDR3α length of 14, 15, or 16 (TRAV11/TRAJ18 and length 15 for mouse); and ‘nndists_tcr’, which measures the density of TCR sequences near the scored clonotype by calculating the average TCR distance to the nearest 1% of clonotypes. The iMHC (for ‘independent of pMHC’) score is a weighted linear combination of TCR sequence features (Table S4; Fig. S8). 
The parameters were fit by using L1-regularized logistic regression to discriminate the TCR sequences of HOBIT+ CoNGA hits (CoNGA score<0.2) in GEX cluster 2 of dataset 10x_200k_donor1 (Fig. 2) from the TCRs of the clonotypes in the other GEX clusters. We chose to draw the background clonotypes exclusively from the other GEX clusters to avoid inclusion of genuine HOBIT+ TCR sequences in our negative set.
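The two rule-based scores (‘mait’ and ‘inkt’) lend themselves to a short sketch for the human gene definitions given above. The function names, the handling of allele suffixes, the prefix match on TRBV25, and the placeholder CDR3 strings are assumptions for illustration only.

```python
def _gene(name):
    """Strip an allele suffix such as '*01' from a gene segment name."""
    return name.split("*")[0]


def mait_score(trav, traj, cdr3a):
    """Return 1 for a human MAIT-like alpha chain as defined in the text, else 0."""
    is_mait = (_gene(trav) == "TRAV1-2"
               and _gene(traj) in {"TRAJ33", "TRAJ20", "TRAJ12"}
               and len(cdr3a) == 12)
    return int(is_mait)


def inkt_score(trav, traj, trbv, cdr3a):
    """Return 1 for a human iNKT-like receptor as defined in the text, else 0."""
    is_inkt = (_gene(trav) == "TRAV10"
               and _gene(traj) == "TRAJ18"
               # Assumption: any TRBV25 family member (e.g. TRBV25-1) qualifies.
               and _gene(trbv).startswith("TRBV25")
               and len(cdr3a) in {14, 15, 16})
    return int(is_inkt)


# Placeholder CDR3 strings: only their lengths matter for these two scores.
cdr3_len12 = "C" + "A" * 10 + "F"
cdr3_len15 = "C" + "A" * 13 + "F"
```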

Gene expression analysis

Gene expression data in the form of read count matrices are processed according to standard workflows implemented in scanpy to eliminate cells and genes with low counts, high mitochondrial content, etc. Variable genes are identified, and principal components analysis (PCA) is used to project the high-dimensional gene expression data down to a smaller set of components per cell (the default is 40 components). These gene expression PCs are used to select a single representative cell for each clonotype by taking the cell with the smallest average Euclidean distance in PC space to the other cells in the clonotype. Alternatively, the PC vectors of all the cells in each clonotype can be averaged to generate a single pseudo-cell GEX profile (accessible with the --average_clone_gex command line option). Once the dataset has been reduced to a single cell per clonotype, the UMAP and Louvain clustering tools are applied to the PCA matrix to produce a gene expression landscape and a set of gene expression clonotype clusters. DEGs in clonotype groupings (for example the set of CoNGA hits in a cluster) are identified using the sc.tl.rank_genes_groups routine with the ‘wilcoxon’ method.
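The reduction of each clonotype to a single GEX profile can be illustrated with a small sketch. Both options described above are shown: picking the cell with the smallest average Euclidean distance to its clonal relatives, and averaging the PC vectors into a pseudo-cell. Function names and the toy coordinates are hypothetical, and plain Python lists stand in for the PCA matrix.

```python
import math


def pick_representative_cell(pc_vectors):
    """Return the index of the cell with the smallest average Euclidean
    distance (in PC space) to the other cells of the same clonotype."""
    if len(pc_vectors) == 1:
        return 0
    best_index, best_avg = None, math.inf
    for i, vec_i in enumerate(pc_vectors):
        total = sum(math.dist(vec_i, vec_j)
                    for j, vec_j in enumerate(pc_vectors) if j != i)
        avg = total / (len(pc_vectors) - 1)
        if avg < best_avg:
            best_index, best_avg = i, avg
    return best_index


def average_clone_profile(pc_vectors):
    """Alternative: average the PC vectors of all cells into one pseudo-cell."""
    n = len(pc_vectors)
    return [sum(component) / n for component in zip(*pc_vectors)]


# Toy clonotype with three cells described by two PCs each.
toy_clone = [[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]]
representative = pick_representative_cell(toy_clone)
pseudo_cell = average_clone_profile(toy_clone)
```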

The large thymus atlas T cell dataset³¹ combined a heterogeneous set of donors and samples; merging these data to generate integrated projections and clusters required the original authors to perform an iterative batch correction scheme. As it was not immediately obvious how to recover the processed gene expression components from the publicly available data, and as a test of CoNGA’s robustness to alternative neighbor graphs, we elected to use the provided 3D UMAP coordinates in lieu of gene expression PCs for the CoNGA GEX neighbor calculations described below. We also directly borrowed the GEX clusters from the original publication rather than reclustering the dataset.

Graph-vs-graph correlation analysis

In CoNGA graph-vs-graph correlation analysis, similarity graphs defined by gene expression and by TCR sequence are compared to identify vertices (clonotypes) whose neighbor sets in the two graphs overlap significantly. The CoNGA score assigned to a clonotype equals the probability of seeing an equal or larger overlap between its GEX and TCR neighborhoods by chance, multiplied by the total number of clonotypes to correct for multiple testing. The hypergeometric distribution, as implemented in the scipy.stats module, is used to estimate this 1-sided probability; this probability distribution models the overlap observed when selecting two subsets of specified size independently and at random from a set of interchangeable items. Two types of similarity graphs can be used in CoNGA: K nearest neighbor (KNN) graphs, in which each clonotype is connected to its K nearest neighbors in gene expression or TCR space (Fig. 1a); and cluster graphs, in which each clonotype is connected to all the clonotypes in the same (GEX or TCR) cluster. The neighbor number K for constructing KNN graphs is specified as a fraction of the total number of clonotypes; for the calculations reported here, neighbor fractions of 0.01 and 0.1 were used. The CoNGA score assigned to a clonotype is the minimum score over all graph comparisons, of which there were 6 combinations in the calculations reported here (GEX_KNN vs TCR_KNN, GEX_KNN vs TCR_cluster, and GEX_cluster vs TCR_KNN, for both the 0.01 and 0.1 KNN neighbor fractions). Since these neighbor graphs are correlated (for example, the neighborhoods in the 0.01 KNN graph are contained in the neighborhoods in the 0.1 KNN graph), estimating the multiple testing burden associated with using multiple graphs is not completely straightforward. Instead, we turned to shuffling experiments to estimate false discovery rates associated with our procedure of selecting CoNGA clusters using CoNGA score and cluster size thresholds. 
We randomly permuted the TCR sequence assignments relative to the GEX information for each of our 9 datasets and ran the CoNGA graph-vs-graph analysis, tallying the number of CoNGA hits at a score threshold of 1.0 and the number of CoNGA clusters of size exceeding our default threshold (5 or 0.001*num_clonotypes, whichever is larger). This procedure was repeated 5 times for each dataset, yielding 45 shuffled outcomes (Table S8), across which we observed a total of 3 CoNGA clusters for a background rate of 3/45 = 0.067 per shuffled run.
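The per-clonotype score for a single graph comparison can be sketched as a hypergeometric tail probability multiplied by the number of clonotypes. The version below uses math.comb so that it has no dependencies, whereas the text states that the scipy.stats implementation is used in practice. Function names are hypothetical, and the choice of N - 1 as the pool size (a clonotype is not counted as its own neighbor) is an assumption.

```python
from math import comb


def hypergeom_tail(overlap, population, size_a, size_b):
    """P(X >= overlap) when two subsets of sizes size_a and size_b are drawn
    independently at random from `population` interchangeable items."""
    total = comb(population, size_b)
    upper = min(size_a, size_b)
    tail = sum(comb(size_a, x) * comb(population - size_a, size_b - x)
               for x in range(overlap, upper + 1))
    return tail / total


def conga_score(gex_nbrs, tcr_nbrs, num_clonotypes):
    """Overlap P value times the number of clonotypes (multiple-testing correction)."""
    overlap = len(gex_nbrs & tcr_nbrs)
    # Assumption: a clonotype is not its own neighbor, so the pool has N - 1 items.
    population = num_clonotypes - 1
    p_value = hypergeom_tail(overlap, population, len(gex_nbrs), len(tcr_nbrs))
    return p_value * num_clonotypes


# Toy example with 5 clonotypes and 2 neighbors in each graph.
score_full_overlap = conga_score({1, 2}, {1, 2}, num_clonotypes=5)
score_no_overlap = conga_score({1, 2}, {3, 4}, num_clonotypes=5)
```

Under the procedure described above, the reported score for a clonotype would be the minimum of such values over the graph comparisons that were run.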

To assess the sensitivity of CoNGA graph-vs-graph analysis, we performed sub-sampling experiments in which we varied the frequency of clonotypes belonging to a known “true-positive” population (MAIT cells in the human datasets and iNKT cells in the mouse dataset) and recorded the fraction reported as CoNGA hits as a function of the subsampled frequency (Fig. E10). This analysis revealed that the recovery rate depended more strongly on the absolute number of subsampled true-positive clonotypes than on the fraction within the dataset: there is better alignment between the recovery curves plotted as a function of subsampled count (Fig. E10b) than as a function of subsampled fraction (Fig. E10a). We see relatively high rates of recovery down to a population size of ~20 true-positive clonotypes.

For annotation purposes, TCRβ sequences in all CoNGA clusters and in the pMHC-positive repertoires in the 10x_200k dataset were matched to a set of bulk TCRβ repertoires (Figs. E3 and E8). Exact matching at the amino acid level was first used to assign a ‘publicity’ score to each TCRβ chain equal to the fraction of repertoires containing that chain in a large (N=666) dataset of relatively deep (~200,000 median clonotypes) repertoires⁵². The probability of generation (P_gen) was calculated for each chain using the model proposed by Murugan et al⁵³. To quantify overlap between the set of TCR sequences in a CoNGA cluster or pMHC-positive subset and the set of sequences in a repertoire, we developed a modified version of the Morisita-Horn overlap measure⁵⁴ that accounts for sequence similarity (rather than exact identity) using a Gaussian kernel:

\mathrm{MH}_{\mathrm{TCRdist}}(R_{1}, R_{2}) = \frac{2 \cdot \mathrm{Match}(R_{1}, R_{2})}{\mathrm{Match}(R_{1}, R_{1}) + \mathrm{Match}(R_{2}, R_{2})}, where

\mathrm{Match}(R_{1}, R_{2}) = \frac{\sum_{t_{1} \in R_{1}, t_{2} \in R_{2}} e^{-\left(\frac{\mathrm{TCRdist}(t_{1}, t_{2})}{\sigma}\right)^{2}}}{N_{1} \cdot N_{2}}, N_{i} = \#(R_{i}), and \sigma = 24.

In our calculations we ignored clonotype sizes (i.e., the number of cells in each clonotype), but these could be included as multiplicative prefactors of the exponential term in the Match score above, replacing N_i with the sum of the clonotype sizes in repertoire R_i. For matching paired repertoires (Fig. E8b), we used a larger value of 96 for the Gaussian standard deviation term σ. The Morisita-Horn overlaps for the N=666 repertoire dataset were used to calculate an age correlation for each CoNGA cluster equal to the linear correlation coefficient between its MH overlap scores and the ages of the sample donors (Fig. E3a,d; Fig. E8a). A second dataset³⁴ of TCRβ repertoires from flow-sorted CD4+ and CD8+ samples (N=84) was used to compute a CD4/CD8 repertoire bias score equal to the t-statistic for the comparison of the MH scores for the CD4 repertoires to the MH scores for the CD8 repertoires (Fig. E3b,d; Fig. E8a). A subset of these samples (N=34) were additionally sorted into memory (CD45RA-/CD45RO+) and naive (CD45RA+/CD62L+) subsets; these were used to compute an analogous memory/naive repertoire bias score (Fig. E3c, d; Fig. E8a).
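A minimal sketch of the kernel-smoothed Morisita-Horn overlap follows, assuming the kernel exponent is the squared ratio of distance to σ, as expected for a Gaussian kernel. Clonotype sizes are ignored, as in the calculations described above. The distance function is passed in as an argument; toy_dist is a hypothetical stand-in used only to make the example runnable and is not TCRdist.

```python
import math


def match(rep1, rep2, dist, sigma=24.0):
    """Mean Gaussian-kernel similarity over all pairs (t1 in rep1, t2 in rep2)."""
    total = sum(math.exp(-((dist(t1, t2) / sigma) ** 2))
                for t1 in rep1 for t2 in rep2)
    return total / (len(rep1) * len(rep2))


def mh_overlap(rep1, rep2, dist, sigma=24.0):
    """Kernel-smoothed Morisita-Horn overlap between two repertoires."""
    return (2.0 * match(rep1, rep2, dist, sigma)
            / (match(rep1, rep1, dist, sigma) + match(rep2, rep2, dist, sigma)))


def toy_dist(a, b):
    """Stand-in distance (NOT TCRdist): 12 per mismatch or length difference."""
    mismatches = sum(x != y for x, y in zip(a, b))
    return 12.0 * (mismatches + abs(len(a) - len(b)))


# Toy repertoires of made-up sequence strings.
rep_a = ["CASSL", "CASSF"]
rep_b = ["CAWWW"]
self_overlap = mh_overlap(rep_a, rep_a, toy_dist)
cross_overlap = mh_overlap(rep_a, rep_b, toy_dist)
```

By construction the overlap of a repertoire with itself is 1, and the measure is symmetric in its two arguments.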

Graph-vs-feature correlation analysis

In CoNGA graph-vs-feature correlation analysis, numerical features defined on the basis of one property (GEX or TCR) are mapped onto similarity graphs defined by the other property, and graph neighborhoods with biased score distributions are identified. As GEX properties we consider the expression levels of all the individual genes as well as a feature (‘nndists_gex’) that captures the density of nearby clonotypes by calculating the average distance in GEX space to the nearest 1% of the clonotypes. The TCR features were described in an earlier section. As this analysis involves a large number of differential expression calculations (roughly the number of clonotypes times the number of different similarity graphs times the number of features), we use a two-step procedure that combines a pre-filter with the t-test followed by the more time-intensive Mann-Whitney-Wilcoxon (MWW) calculation for the top 100 hits per clonotype and graph that pass a t-test significance threshold ten times higher than the target threshold. The final significance score assigned to a detected association equals the raw MWW P value multiplied by the product of the number of clonotypes and the number of features, to correct for multiple testing.
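The two-step procedure can be sketched for a single feature and a single neighborhood as shown below. This is a dependency-free simplification under several assumptions: both tests use two-sided normal approximations (Welch t statistic; MWW without tie correction), the pre-filter is applied to the corrected t-test P value, and the restriction to the top 100 hits per clonotype and graph is omitted. All function names are hypothetical, and each group needs at least two values.

```python
import math
from statistics import NormalDist, mean, variance

_NORMAL = NormalDist()


def _two_sided_p(z):
    return 2.0 * (1.0 - _NORMAL.cdf(abs(z)))


def t_test_p(inside, outside):
    """Welch t statistic with a normal approximation for the P value."""
    se = math.sqrt(variance(inside) / len(inside)
                   + variance(outside) / len(outside))
    diff = mean(inside) - mean(outside)
    if se == 0.0:
        return 1.0 if diff == 0.0 else 0.0
    return _two_sided_p(diff / se)


def mww_p(inside, outside):
    """Mann-Whitney-Wilcoxon P value (normal approximation, no tie correction)."""
    pooled = sorted(inside + outside)
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # Tied values at sorted positions i..j-1 share the average 1-based rank.
        rank_of[pooled[i]] = (i + 1 + j) / 2.0
        i = j
    n1, n2 = len(inside), len(outside)
    u1 = sum(rank_of[x] for x in inside) - n1 * (n1 + 1) / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return _two_sided_p((u1 - n1 * n2 / 2.0) / sd)


def graph_vs_feature_score(inside, outside, num_clonotypes, num_features,
                           target=1.0):
    """Return the corrected MWW score, or None if the t-test pre-filter fails."""
    correction = num_clonotypes * num_features
    if t_test_p(inside, outside) * correction > 10.0 * target:
        return None  # cheap pre-filter rejects this feature/neighborhood pair
    return mww_p(inside, outside) * correction


# Toy feature values inside and outside one graph neighborhood.
biased = graph_vs_feature_score([10.0, 11.0, 12.0, 13.0, 14.0, 15.0],
                                [float(v) for v in range(10)],
                                num_clonotypes=10, num_features=5)
unbiased = graph_vs_feature_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0, 2.0],
                                  num_clonotypes=10, num_features=5)
```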

Analysis of pMHC binding

In the 10x_200k experiment, T cells were stained with a panel of 50 DNA-barcoded pMHC multimer reagents. Sequence reads for each of the pMHC barcodes were counted along with the reads for intracellular transcripts and included in the raw count matrix provided by 10x Genomics. The first step in our analysis was to assign individual T cells and T cell clonotypes as positive for binding to specific pMHC multimers based on the observed read counts for the pMHC DNA barcodes. A cell was called positive for the pMHC multimer with the highest barcode count if the natural logarithm of that pMHC’s barcode count exceeded the next highest log-count by at least 2.0 (corresponding to a fold-difference in barcode counts of roughly 7.5; all counts were augmented by 1 prior to taking logarithms). To assign clonotypes to pMHCs, we averaged the log-counts for each pMHC over all the cells in the clonotype and again applied a threshold of 2.0 between the top and second-highest averaged log-counts. The results of this pMHC-binding analysis are summarized in Table S7 for all pMHCs with at least 5 positive clonotypes in one of the four samples. We can see that, with the exception of the ‘sticky’ pMHC A03_KLG, the majority of pMHC-positive cells belong to clonotypes that are also called positive (see the ‘Clone fraction’ column in Table S7). Each donor dataset in the 10x_200k experiment consisted of multiple batches that were combined using the scanpy AnnData.concatenate function. For donors 1 and 2, the individual batches were divided into two experimental meta-batches. We observed a modest enhancement (3–5%) of neighbor connections between cells in the same meta-batch. Additionally, for two of the analyzed pMHCs (A02_GILG9 and A02_ELAG10) we detected a relative bias of pMHC-positive cells toward the first meta-batch. 
Since the second meta-batch was larger, this actually had the effect of balancing the batch composition for these pMHC-positive subsets and hence, given the slight depletion of inter-batch neighbor connections, making our conclusions regarding GEX similarity of pMHC-positive subsets more conservative.
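The pMHC calling rule described in this section can be sketched as follows: counts are augmented by 1, natural logarithms are taken, and a call is made only when the top log-count exceeds the runner-up by at least 2.0. Function names and the toy barcode counts are hypothetical; the pMHC labels are taken from the datasets discussed in the text.

```python
import math


def call_top_pmhc(log_counts, min_gap=2.0):
    """Return the top pMHC if its log-count beats the runner-up by >= min_gap."""
    ranked = sorted(log_counts.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) < 2:
        return None
    (top, top_val), (_, second_val) = ranked[0], ranked[1]
    return top if top_val - second_val >= min_gap else None


def call_cell(barcode_counts):
    """Call a single cell from raw pMHC barcode counts (pseudocount of 1)."""
    return call_top_pmhc({p: math.log(c + 1) for p, c in barcode_counts.items()})


def call_clonotype(cells):
    """Call a clonotype by averaging the per-cell log-counts for each pMHC."""
    pmhcs = cells[0].keys()
    averaged = {p: sum(math.log(cell[p] + 1) for cell in cells) / len(cells)
                for p in pmhcs}
    return call_top_pmhc(averaged)


# Toy barcode counts for two cells of one clonotype.
cell_1 = {"A02_GIL": 100, "A03_KLG": 5}
cell_2 = {"A02_GIL": 50, "A03_KLG": 2}
clonotype_call = call_clonotype([cell_1, cell_2])
```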

Flow Cytometry Analysis

Flow cytometric analysis was performed using PBMCs collected from apheresis rings of 12 healthy blood donors. Samples were first stained with Ghost510 viability dye (Tonbo Bioscience) and then with TruStain FcX (BioLegend) to block non-specific staining prior to antibody labeling. For the HOBIT/HELIOS experiments, samples were first stained for CD3 (APC-Fire750, SK7), CD4 (PE-Cy7, OKT4), CCR7 (BrilliantViolet 785, G043H7), CD45RA (BrilliantViolet 421, HI100), CD45RO (BrilliantViolet 605 or AlexaFluor 700, UCHL1) (BioLegend), CD14 (biotin, 61D3), CD19 (biotin, SJ25C1) (Tonbo Bioscience), KIR2D (FITC, NKVFS1), KLRC2 (PE-Vio615, REA205) (Miltenyi), CD8B (PerCP-eFluor710, SIDI8BEE, Invitrogen), and CD248 (AlexaFluor 647, B1/35, BD Biosciences) for 30 min at RT in PBS containing 2% FCS and 1 mM EDTA prior to secondary staining with streptavidin-BrilliantViolet 510 (BioLegend) for 15 min on ice. In the experiment with MR1 and CD1d tetramer labelling, 5-OP-RU loaded MR1 (BrilliantViolet 421) and PBS-57 loaded CD1d (AlexaFluor 647) tetramers (NIH Tetramer Core Facility) were included during surface staining. For the HELIOS experiment, surface-stained cells were fixed and permeabilized using the Transcription Factor Staining Buffer Set (ThermoFisher) prior to staining for HELIOS (PE, D8W4X, Cell Signaling). For the EPHB6/TRBV30 experiments, samples were stained for CD3 (APC-Fire750, SK7), CD4 (PE-Cy7, OKT4), CD8B (PerCP-eFluor710, SIDI8BEE), EPHB6 (AlexaFluor 488, 465327, R&D Systems), and TRBV30 (Vβ20) (PE, ELL1.4, Beckman Coulter). Samples were analyzed using an Aurora spectral analyzer (Cytek), and data were analyzed using FlowJo software (BD Biosciences).

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Extended Data

Extended Data Fig. 3 — TCRβ sequences from human CoNGA clusters were matched to bulk TCRβ repertoires using TCRdist. To score the overlap between the set of TCR sequences in a CoNGA cluster and the set of sequences in a bulk repertoire, we developed a variant of the Morisita-Horn (MH) overlap index that accounts for sequence similarity in addition to exact identity (see Methods for further details). **(a)** The MH overlaps (y-axis) are plotted against subject age (x-axis) for the two CoNGA clusters indicated in the panel titles. The first cluster (a MAIT cluster) appears to decline with subject age, while the second one (a HOBIT cluster) appears to increase (R value and 2-sided P value in legend). **(b)** The distribution of MH overlaps for a set of CD4+ repertoires is compared with the distribution of MH overlaps for a set of CD8+ repertoires for two different clusters from the *thymus_atlas* dataset. **(c)** The distribution of MH overlaps for a set of memory repertoires is compared with the distribution of MH overlaps for a set of naive repertoires for the two clusters indicated in the panel titles. Boxes in panels b and c show quartiles with whiskers extending to 1.5*IQR. **(d)** All-vs-all scatter plots (with kernel density estimates along the diagonal) for the following CoNGA cluster features (see Methods for feature calculation details): log10_Pgen, the average log10 generation probability of the cluster TCRβ chains; log10_publicity, the average log10 rate of occurrence in a large (N=666) dataset of PBMC repertoires; age_correlation, the linear correlation coefficient between MH overlap and subject age (see panel (a)); CD8_vs_CD4, t-statistic comparing MH overlaps for CD8 and CD4 repertoires (higher indicates greater preference for CD8 repertoires; see panel (b)); memory_vs_naive, t-statistic comparing MH overlaps for memory and naive repertoires (higher indicates greater preference for memory repertoires; see panel (c)). 
The CoNGA clusters are grouped according to the discussion in the main text; ‘pre_hobit’ refers to the two clusters in the *thymus_atlas* dataset that may be precursors of the HOBIT+ population, *(CD8αα(I):2)* and *(CD8αα(II):2)*.

Extended Data Fig. 4 — Comparison of binding data for four ‘specific’ pMHC multimers (A02_GIL, A02_ELA, B08_RAK, A02_GLC) and four ‘sticky’ pMHC multimers (A03_KLG, A03_RLR, A03_RIA, A11_AVF) in the *10x_200k_donor2* dataset. **(a)** GEX landscapes colored by pMHC binding signal (log(1 + UMI read count)). **(b)** TCR landscapes colored by pMHC binding signal. The ‘specific’ pMHCs show binding that is focused in certain areas of the landscapes, whereas the binding of the putative ‘sticky’ pMHCs is dispersed across the landscapes. **(c)** The Pearson correlation between binding profiles for different pMHCs is shown in matrix form according to the indicated color mapping. The specific pMHCs show little correlation whereas the sticky pMHCs are significantly correlated in their binding, suggesting that a shared cellular property (TCR or CD8 surface expression, expression of other HLA-interacting molecules, general level of activation) is jointly influencing their binding. Note that A11_AVF (and A11_IVT) show additional specific binding in donor 1, who is A*11:01 positive; the A*03:01 pMHC multimers appear non-specific regardless of donor HLA type.

Extended Data Fig. 5 — **(a)** Gating strategy for KLRC2+KIR2Dmix and KLRC2-KIR2D+ CD8 T cells in panels (b + c). After gating on single lymphocytes the gating is Ghost510-CD14-CD19-CD3+CD8B+CCR7-CD45RA+. **(b)** Representative example of CD1d:PBS-57 and MR1:5-OP-RU tetramer labeling of KLRC2+KIR2Dmix, KLRC2-KIR2D+, and CCR7-CD45RO+ CD8 T cells. **(c)** Frequency of CD1d and MR1-labelled KLRC2+KIR2Dmix, KLRC2-KIR2D+, and CCR7-CD45RO+ CD8 T cells (n = 12; Supplementary Note 3). P values calculated by 1-sided t-test. The lower limit of the box corresponds to the 1st quartile, center line the median, and upper limit the 3rd quartile. **(d)** Gating strategy for HELIOS intracellular staining of KLRC2+KIR2Dmix and KLRC2-KIR2D+ CD8 T cells in panels. Single lymphocytes were gated on Ghost510-CD14-CD19-CD3+CD8B+CD248-CCR7-CD45RO-CD45RA+.

Extended Data Fig. 6 — 2D GEX projection of the *10x_200k_donor1* **(a)**, *10x_200k_donor2* **(b)**, *10x_200k_donor3* **(c)**, and *10x_200k_donor4* **(d)** datasets colored by P values for iMHC enrichment in each clonotype’s graph neighborhood (the set of iMHC scores in each clonotype’s neighborhood is compared to the remainder of the iMHC scores using an unpaired, 1-sided Mann-Whitney-Wilcoxon test). **(e)** Top 10 DEGs for the clonotypes with significant iMHC enrichment in the *10x_200k_donor1* dataset. **(f)** Top 10 DEGs for the clonotypes with significant iMHC enrichment in the *10x_200k_donor3* dataset. **(g)** Top 10 DEGs for the clonotypes with significant iMHC enrichment in the *10x_200k_donor4* dataset. (There were too few clonotypes with significant iMHC enrichment in the *10x_200k_donor2* dataset to identify differentially expressed genes.) **(h)** Graph-vs-feature correlation between a TCR feature, iMHC score (left panel), and 2 scores derived from the GEX profile (right panels, *ZNF683* and *KLRC3* expression) is illustrated by mapping the scores onto the 2D UMAP GEX landscape for the *10x_200k_donor1* dataset (after Z-score normalization and averaging over graph neighborhoods).

Extended Data Fig. 8 — **(a)** TCRβ sequences from the pMHC-positive clonotypes in the *10x_200k* dataset were matched to bulk TCRβ repertoires using TCRdist. To score the overlap between the set of TCR sequences in a pMHC-positive repertoire and the set of sequences in a bulk repertoire, we developed a variant of the Morisita-Horn (MH) overlap index that accounts for sequence similarity in addition to exact identity (see Methods for further details). All-vs-all scatter plots (with kernel density estimates along the diagonal) are shown for the following pMHC-positive repertoire features (see Methods for feature calculation details): log10_Pgen, the average log₁₀ generation probability of the repertoire TCRβ chains; log10_publicity, the average log₁₀ rate of occurrence in a large (N=666) dataset of PBMC repertoires; age_correlation, the linear correlation coefficient between MH overlap and subject age in the N=666 PBMC repertoire dataset (see Fig. E3a); CD8_vs_CD4, t-statistic comparing MH overlaps for CD8 and CD4 repertoires (higher indicates greater preference for CD8 repertoires; see Fig. E3b); memory_vs_naive, t-statistic comparing MH overlaps for memory and naive repertoires (higher indicates greater preference for memory repertoires; see Fig. E3c). **(b)** The pMHC-positive repertoires were matched against one another and against a set of literature-derived TCR sequences taken primarily from the VDJdb⁵⁵ and McPAS⁵⁶ databases (excluding those TCRs in the VDJdb that were themselves derived from the *10x_200k* dataset). The heatmap shows MH overlaps calculated using paired-chain TCRdist distances. Reasonable concordance between repertoires positive for the same pMHC from different donors and between pMHC-positive and literature-derived repertoires can be seen.

Extended Data Fig. 9 — **(a)** Log-transformed read counts for DNA-barcoded anti-CD45RA (x-axis) and anti-CD45RO (y-axis) antibodies, averaged over pMHC+ clonotypes, are plotted for the pMHCs shown in Figure 6. In the panel on the left, clonotypes are weighted equally, while in the panel on the right, larger clonotypes are given more weight (proportional to the logarithm of the clone size) to better reflect the underlying distribution of cells (particularly for the d1_A11 pMHCs, both of which have a relatively large number of positive cells distributed unevenly among a small number of clonotypes). **(b)** Heatmap of gene set variation analysis (GSVA) scores for pMHC-specific clonotypes by donor. Significant hits (P values < 0.05 after multiple hypothesis correction using the Benjamini-Hochberg method) from the MSigDB (https://www.gsea-msigdb.org/gsea/msigdb) C7 collection^57,58 are shown. Analysis was performed using the Seurat⁵⁹, GSVA⁶⁰, and Cerebro⁶¹ R packages.

Extended Data Fig. 10 — To assess the sensitivity of CoNGA’s graph-vs-graph algorithm in detecting a known GEX/TCR correlation, we created artificial datasets by subsampling the MAIT cell clonotypes (iNKT cell clonotypes in mouse) down to specified levels within the context of five datasets in which those clonotypes could be clearly identified both as a distinct GEX cluster and by virtue of their invariant TCR sequences. **(a)** The fraction of MAIT or iNKT clonotypes recovered as CoNGA hits (y-axis) is plotted against the frequency to which these clonotypes were downsampled in the dataset. **(b)** The fraction of recovered clonotypes is plotted against the absolute number of downsampled clonotypes present in the dataset. Recovery rate appears to depend more strongly on the number of downsampled clonotypes than their fraction in the total dataset.

Supplementary Material

1771204_Sup_info

NIHMS1771204-supplement-1771204_Sup_info.pdf (21.8 MB, pdf)

1771204_Reportingsummary

NIHMS1771204-supplement-1771204_Reportingsummary.pdf (1.8 MB, pdf)

1771204_SD_3

NIHMS1771204-supplement-1771204_SD_3.txt (1.9 MB, txt)

1771204_SD_2

NIHMS1771204-supplement-1771204_SD_2.xlsx (96 KB, xlsx)

1771204_SD_1

NIHMS1771204-supplement-1771204_SD_1.tsv (926.6 KB, tsv)

1771204_SD_4

NIHMS1771204-supplement-1771204_SD_4.txt (2.6 MB, txt)

1771204_SD_5

NIHMS1771204-supplement-1771204_SD_5.txt (2.3 MB, txt)

1771204_SD_Fig_1

NIHMS1771204-supplement-1771204_SD_Fig_1.txt (531.5 KB, txt)

1771204_SD_Fig_2

NIHMS1771204-supplement-1771204_SD_Fig_2.xlsx (2.7 MB, xlsx)

1771204_SD_Fig_3

NIHMS1771204-supplement-1771204_SD_Fig_3.txt (2 MB, txt)

1771204_SD_Fig_5

NIHMS1771204-supplement-1771204_SD_Fig_5.xlsx (9.1 KB, xlsx)

1771204_SD_ED_Fig_2

NIHMS1771204-supplement-1771204_SD_ED_Fig_2.xlsx (791.9 KB, xlsx)

1771204_SD_ED_Fig_3

NIHMS1771204-supplement-1771204_SD_ED_Fig_3.txt (9.1 KB, txt)

1771204_SD_ED_Fig_8

NIHMS1771204-supplement-1771204_SD_ED_Fig_8.txt (2 KB, txt)

1771204_SD_ED_Fig_9

NIHMS1771204-supplement-1771204_SD_ED_Fig_9.txt (1.3 KB, txt)

Acknowledgments

The authors would like to thank Jongeun Park and Sarah Teichmann for assistance with the thymus atlas T cell dataset, Erick Matsen for comments and suggestions on an earlier version of this manuscript, Evan Newell and Timothy Bi for helpful discussions, and Nicholas Bradley for suggesting the use of kernel principal components analysis. We would also like to thank the developers of the scanpy single-cell analysis package, which provides the framework on which the CoNGA software is built. This research was supported by NIH grant R01 AI136514 to PT, NIH ORIP S10OD028685 to support high-performance computing at the Fred Hutch, the St. Jude Neoma Boadway Postdoctoral Fellowship to SS, and ALSAC (PT).

Footnotes

Code Availability

The CoNGA software repository is available on GitHub (https://github.com/phbradley/conga).

Conflict of Interest Statement

MJTS is employed by 10x Genomics. MJTS and AMB are option or shareholders of 10x Genomics. PB, PGT, and JCC served as unpaid consultants for 10x Genomics on the initial data analysis of the 10x_200k dataset. PGT has filed patents related to the cloning, expression, and characterization of T cell receptors. PGT has received travel or speaking expenses from 10x Genomics, Illumina, and PACT Pharma.

Data Availability

All datasets analyzed here are openly available and are accessible at https://www.10xgenomics.com/resources/datasets/ and https://developmentcellatlas.ncl.ac.uk/ (human thymic T cell data) (see Table S1 for details).

References

1. Yost KE et al. Clonal replacement of tumor-specific T cells following PD-1 blockade. Nat. Med 25, 1251–1259 (2019).
2. Wu TD et al. Peripheral T cell expansion predicts tumour infiltration and clinical response. Nature (2020) doi: 10.1038/s41586-020-2056-8.
3. Guo X et al. Global characterization of T cells in non-small-cell lung cancer by single-cell sequencing. Nat. Med 24, 978–985 (2018).
4. Jokinen E, Huuhtanen J, Mustjoki S, Heinonen M & Lähdesmäki H Determining epitope specificity of T cell receptors with TCRGP. bioRxiv 542332 (2019) doi: 10.1101/542332.
5. Zheng C et al. Landscape of Infiltrating T Cells in Liver Cancer Revealed by Single-Cell Sequencing. Cell 169, 1342–1356.e16 (2017).
6. Zhang L et al. Lineage tracking reveals dynamic relationships of T cells in colorectal cancer. Nature 564, 268–272 (2018).
7. Gueguen P et al. Contribution of resident and circulating precursors to tumor-infiltrating CD8+ T cell populations in lung cancer. Sci Immunol 6, (2021).
8. Azizi E et al. Single-Cell Map of Diverse Immune Phenotypes in the Breast Tumor Microenvironment. Cell 174, 1293–1308.e36 (2018).
9. Minervina AA et al. Comprehensive analysis of antiviral adaptive immunity formation and reactivation down to single-cell level. bioRxiv 820134 (2019) doi: 10.1101/820134.
10. Zemmour D et al. Single-cell gene expression reveals a landscape of regulatory T cell phenotypes shaped by the TCR. Nat. Immunol 19, 291–301 (2018).
11. Godfrey DI, Stankovic S & Baxter AG Raising the NKT cell family. Nat. Immunol 11, 197–206 (2010).
12. Toubal A, Nel I, Lotersztajn S & Lehuen A Mucosal-associated invariant T cells and disease. Nat. Rev. Immunol 19, 643–657 (2019).
13. Schattgen SA & Thomas PG Bohemian T cell receptors: sketching the repertoires of unconventional lymphocytes. Immunol. Rev 284, 79–90 (2018).
14. Dash P et al. Quantifiable predictive features define epitope-specific T cell receptor repertoires. Nature 547, 89–93 (2017).
15. Glanville J et al. Identifying specificity groups in the T cell receptor repertoire. Nature 547, 94–98 (2017).
16. Zhang H et al. Investigation of Antigen-Specific T-Cell Receptor Clusters in Human Cancers. Clin. Cancer Res 26, 1359–1371 (2020).
17. Tubo NJ et al. Single naive CD4+ T cells from a diverse repertoire produce different effector cell types during infection. Cell 153, 785–796 (2013).
18. Khatun A et al. Single-cell lineage mapping of a diverse virus-specific naive CD4 T cell repertoire. J. Exp. Med 218, (2021).
19. McInnes L, Healy J, Saul N & Großberger L UMAP: Uniform Manifold Approximation and Projection. Journal of Open Source Software vol. 3 861 (2018).
20. Blondel VD, Guillaume J-L, Lambiotte R & Lefebvre E Fast unfolding of communities in large networks. J. Stat. Mech 2008, P10008 (2008).
21. Traag V louvain-igraph: v0.5.3 (2015). doi: 10.5281/zenodo.35117.
22. Schneider TD & Stephens RM Sequence logos: a new way to display consensus sequences. Nucleic Acids Res 18, 6097–6100 (1990).
23. Godfrey DI, Koay H-F, McCluskey J & Gherardin NA The biology and functional importance of MAIT cells. Nat. Immunol 20, 1110–1128 (2019).
24. 10x_Genomics. A new way of exploring immunity: linking highly multiplexed antigen recognition to immune repertoire and phenotype (Application Note LIT000047 Rev C) Retrieved from the 10X Genomics website: https://pages.10xgenomics.com/rs/446-PBO704/images/10x_AN047_IP_A_New_Way_of_Exploring_Immunity_Digital.pdf (2020).
25. Lu J et al. Molecular constraints on CDR3 for thymic selection of MHC-restricted TCRs from a random pre-selection repertoire. Nat. Commun 10, 1019 (2019).
26. Elhanati Y, Murugan A, Callan CG Jr, Mora T & Walczak AM Quantifying selection in immune receptor repertoires. Proc. Natl. Acad. Sci. U. S. A 111, 9875–9880 (2014).
27. Krovi SH, Kappler JW, Marrack P & Gapin L Inherent reactivity of unselected TCR repertoires to peptide-MHC molecules. Proc. Natl. Acad. Sci. U. S. A 116, 22252–22261 (2019).
28. Stadinski BD et al. Hydrophobic CDR3 residues promote the development of self-reactive T cells. Nat. Immunol 17, 946–955 (2016).
29. Wirasinha RC et al. αβ T-cell receptors with a central CDR3 cysteine are enriched in CD8αα intraepithelial lymphocytes and their thymic precursors. Immunol. Cell Biol 96, 553–561 (2018).
30. Schattgen SA et al. Intestinal Intraepithelial Lymphocyte Repertoires are Imprinted Clonal Structures Selected for MHC Reactivity (2019) doi: 10.2139/ssrn.3467160.
31. Park J-E et al. A cell atlas of human thymic development defines T cell repertoire formation. Science 367, (2020).
32. Carter JA et al. Single T Cell Sequencing Demonstrates the Functional Role of αβ TCR Pairing in Cell Lineage and Antigen Specificity. Front. Immunol 10, 1516 (2019).
33. Klarenbeek PL et al. Somatic Variation of T-Cell Receptor Genes Strongly Associate with HLA Class Restriction. PLoS One 10, e0140815 (2015).
34. Emerson R et al. Estimating the ratio of CD4+ to CD8+ T cells using high-throughput sequence data. J. Immunol. Methods 391, 14–21 (2013).
35. Li HM et al. TCRβ repertoire of CD4+ and CD8+ T cells is distinct in richness, distribution, and CDR3 amino acid composition. J. Leukoc. Biol 99, 505–513 (2016).
36. Majumder K, Bassing CH & Oltz EM Regulation of Tcrb Gene Assembly by Genetic, Epigenetic, and Topological Mechanisms. Adv. Immunol 128, 273–306 (2015).
37. Luo H, Yu G, Wu Y & Wu J EphB6 crosslinking results in costimulation of T cells. J. Clin. Invest 110, 1141–1150 (2002).
38. Luo H, Yu G, Tremblay J & Wu J EphB6-null mutation results in compromised T cell function. J. Clin. Invest 114, 1762–1773 (2004).
39. Huang H, Wang C, Rubelt F, Scriba TJ & Davis MM Analyzing the Mycobacterium tuberculosis immune response by T-cell receptor clustering with GLIPH2 and genome-wide antigen screening. Nat. Biotechnol (2020) doi: 10.1038/s41587-020-0505-4.
40. Jameson SC & Masopust D Understanding Subset Diversity in T Cell Memory. Immunity 48, 214–226 (2018).
41. Fischer DS, Wu Y, Schubert B & Theis FJ Predicting antigen-specificity of single T-cells based on TCR CDR3 regions. bioRxiv 734053 (2019) doi: 10.1101/734053.
42. Thomas PG & Crawford JC Selected before selection: A case for inherent antigen bias in the T-cell receptor repertoire. Current Opinion in Systems Biology 18, 36–43 (2019).
43. Wolf FA, Angerer P & Theis FJ SCANPY: large-scale single-cell gene expression data analysis. Genome Biol 19, 15 (2018).
44. Berg JM, Tymoczko JL & Stryer L Biochemistry. (W H Freeman, 2002).
45. Miyazawa S & Jernigan RL Residue-residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading. J. Mol. Biol 256, 623–644 (1996).
46. Kosmrlj A, Jha AK, Huseby ES, Kardar M & Chakraborty AK How the thymus designs antigen-specific and self-tolerant T cell receptor sequences. Proc. Natl. Acad. Sci. U. S. A 105, 16671–16676 (2008).
47. Martin J & Lavery R Arbitrary protein−protein docking targets biologically relevant interfaces. BMC Biophysics vol. 5 (2012).
48. Dunker AK et al. Intrinsically disordered protein. J. Mol. Graph. Model 19, 26–59 (2001).
49. Kidera A, Konishi Y, Oka M, Ooi T & Scheraga HA Statistical analysis of the physical properties of the 20 naturally occurring amino acids. J. Protein Chem 4, 23–55 (1985).
50. Shugay M et al. VDJtools: Unifying Post-analysis of T Cell Receptor Repertoires. PLoS Comput. Biol 11, e1004503 (2015).
51. Atchley WR, Zhao J, Fernandes AD & Drüke T Solving the protein sequence metric problem. Proc. Natl. Acad. Sci. U. S. A 102, 6395–6400 (2005).
52. Emerson RO et al. Immunosequencing identifies signatures of cytomegalovirus exposure history and HLA-mediated effects on the T cell repertoire. Nat. Genet 49, 659–665 (2017).
53. Murugan A, Mora T, Walczak AM & Callan CG Statistical inference of the generation probability of T-cell receptors from sequence repertoires. Proceedings of the National Academy of Sciences 109, 16161–16166 (2012).
54. Horn HS Measurement of ‘Overlap’ in Comparative Ecological Studies. The American Naturalist vol. 100 419–424 (1966).
55. Shugay M et al. VDJdb: a curated database of T-cell receptor sequences with known antigen specificity. Nucleic Acids Res 46, D419–D427 (2018).
56. Tickotsky N, Sagiv T, Prilusky J, Shifrut E & Friedman N McPAS-TCR: a manually curated catalogue of pathology-associated T cell receptor sequences. Bioinformatics 33, 2924–2929 (2017).
57. Subramanian A et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U. S. A 102, 15545–15550 (2005).
58. Godec J et al. Compendium of Immune Signatures Identifies Conserved and Species-Specific Biology in Response to Inflammation. Immunity 44, 194–206 (2016).
59. Stuart T et al. Comprehensive Integration of Single-Cell Data. Cell 177, 1888–1902.e21 (2019).
60. Hänzelmann S, Castelo R & Guinney J GSVA: gene set variation analysis for microarray and RNA-seq data. BMC Bioinformatics 14, 7 (2013).
61. Hillje R, Pelicci PG & Luzi L Cerebro: interactive visualization of scRNA-seq data. Bioinformatics 36, 2311–2313 (2020).

