Author manuscript; available in PMC: 2023 Mar 23.
Published in final edited form as: Methods Mol Biol. 2022;2574:291–307. doi: 10.1007/978-1-0716-2712-9_15

Grouping T-Cell Antigen Receptors by Specificity

Chunlin Wang 1,*, Huang Huang 1, Mark M Davis 1,2,3
PMCID: PMC10035763  NIHMSID: NIHMS1875601  PMID: 36087209

Abstract

Grouping TCRs on the similarity of CDR3 sequences could effectively cluster them by specificity. Three versions of the GLIPH algorithm are described briefly here, with instructions to use GLIPH algorithms to cluster TCRs by specificity.

Keywords: TCR, CDR3, GLIPH, clustering, specificity

1. Introduction

T cells play a central role in adaptive immunity. One defining characteristic of adaptive immunity is the highly diverse repertoire of T cell receptors (TCRs) in each individual, generated through the V(D)J somatic recombination process. A given T cell expresses one of two TCR types: TCRαβ or TCRγδ. In the rest of this chapter, we shall use the term T cell to mean αβ T cell, and TCR to mean TCRαβ, except where specified otherwise. T cells can selectively recognize and respond to epitopes presented by major histocompatibility complex (MHC) molecules through TCRs. Earlier studies have shown that TCRs recognizing the same epitope-MHC ligand share similarities in the CDR3 regions of the α chain, the β chain, or both[1–8]. In some cases, CDR3β, CDR3α, or both are nearly identical in TCRs of the same specificity. In other cases, linear sequence similarity (motifs) of 3–4 amino acids in CDR3β, CDR3α, or both can be found in CDR3 sequences recognizing the same epitope-MHC ligand[9–11]. Therefore, grouping TCRs on the similarity of CDR3 sequences could effectively cluster them by specificity.

Tools that try to group TCRs by specificity based on CDR3 sequence similarity include the TCRdist algorithms[12, 13], iSMART[14], ALICE[15], and the GLIPH algorithms[1, 16]. TCRdist computes distances among CDR3 sequences and then clusters them hierarchically. iSMART performs pairwise local alignment on T cell receptor CDR3 sequences to group them into antigen-specific clusters[14]. ALICE groups similar CDR3 sequences from the same sample into clusters and reports those clusters with more member sequences than expected by chance. Based on our earlier studies, version 2 of the GLIPH algorithm outperforms the other tools, with better specificity and faster speed[16]. In the following sections, we will focus on the GLIPH algorithms.

2. GLIPH algorithm and analysis procedure

2.1. GLIPH version 1

This algorithm, “Grouping Lymphocyte Interactions by Paratope Hotspots” or GLIPH [1] (referred to as GLIPH1 in this manuscript), searches for global sequence similarity and local sequence similarity (motifs), and automatically clusters TCR sequences into distinct groups according to their likely specificity. The algorithm runs in three stages: discovering global and local similarity signatures, constructing clusters of TCRs with identified similarities, and evaluating the enrichment of features for each cluster. The algorithm is briefly described in the following sections.

2.1.1. Pre-processing input data

GLIPH1 works on the non-redundant CDR3 amino acid sequences of both the input sample set (the collection of TCRs under evaluation) and the reference set (a large database of TCR sequences that are not expected to be enriched for specificities found in the sample set). GLIPH1 ignores the first three and last three residues of all CDR3 sequences when computing both global and local sequence similarity.

2.1.2. Discovering local similarity signatures

GLIPH1 scans all possible 3-mer, 4-mer and 5-mer motifs for their frequency in the sample set. To evaluate whether these motifs are specifically enriched by antigens, these frequencies are compared with those obtained by repeated random sampling of the non-redundant reference set at the same depth as the non-redundant sample set. For a particular motif in the sample set, GLIPH1 computes the observed vs expected (OVE) ratio, where the observed value is the frequency of the motif in the non-redundant sample set and the expected value is the average frequency of the same motif across the repeated samplings of the non-redundant reference set. Additionally, GLIPH1 computes an empirical p-value for a motif as n/N, where n is the number of resampled data sets in which the motif is found more frequently than in the sample set and N is the number of resamplings. If the frequency of a motif is greater than the pre-set --lcmindepth parameter, its OVE is greater than the pre-set --lcminove parameter, and its p-value is less than the pre-set --lcminip parameter, the motif is considered enriched in the sample set. This procedure is repeated for all motifs to collect all enriched motifs.
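The resampling scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not the GLIPH1 Perl code: the function names, the choice to count a motif once per unique CDR3, and the default of 1000 resamplings are assumptions made for the example.

```python
import random
from collections import Counter


def count_motifs(cdr3s, k_sizes=(3, 4, 5), trim=3):
    """Count how many unique CDR3s contain each k-mer, ignoring `trim` residues at each end."""
    counts = Counter()
    for seq in set(cdr3s):
        core = seq[trim:len(seq) - trim]
        motifs = {core[i:i + k] for k in k_sizes for i in range(len(core) - k + 1)}
        counts.update(motifs)
    return counts


def motif_enrichment(sample, reference, n_resample=1000, seed=0):
    """Return {motif: (observed, OVE, empirical p)} for every motif seen in the sample.

    Assumes the non-redundant reference is at least as large as the sample.
    """
    rng = random.Random(seed)
    sample = sorted(set(sample))
    reference = sorted(set(reference))
    observed = count_motifs(sample)
    expected_sum = Counter()
    more_frequent = Counter()
    for _ in range(n_resample):
        # Draw a random reference subset at the same depth as the sample.
        drawn = count_motifs(rng.sample(reference, len(sample)))
        for motif, obs in observed.items():
            c = drawn.get(motif, 0)
            expected_sum[motif] += c
            if c > obs:  # motif strictly more frequent in the resampled set
                more_frequent[motif] += 1
    results = {}
    for motif, obs in observed.items():
        expected = expected_sum[motif] / n_resample
        ove = obs / expected if expected > 0 else float("inf")
        results[motif] = (obs, ove, more_frequent[motif] / n_resample)
    return results
```

A motif would then be kept when its observed count, OVE and p-value pass the three cutoffs named in the text.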

2.1.3. Discovering global similarity signatures

GLIPH1 counts the number of differing positions between any pair of CDR3s of the same length. Two CDR3s are considered globally similar if the number of differing positions is less than the --gccutoff parameter. If a user does not provide the --gccutoff parameter, it is set automatically according to the sample depth, that is, the number of unique CDR3s in the sample set.

2.1.4. Constructing clusters of TCRs with identified similarities

GLIPH1 groups TCRs with identified similarities into a single cluster. A graph is used to model the relationships between CDR3s in the sample set: CDR3s are represented as nodes, and if two CDR3s share a similarity signature, their nodes are connected with an edge. A cluster is then a connected component, in which any two CDR3 nodes are connected to each other by a path and no node is connected to any node in the rest of the graph.
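As a rough sketch of this graph construction, the following Python snippet links CDR3s that are globally similar (same length, fewer than `gccutoff` differing positions) or that share one of a given list of enriched motifs, and then reports the connected components using a union-find structure. The function names, the default `gccutoff=2`, and the toy sequences are illustrative assumptions, not part of GLIPH1.

```python
from itertools import combinations


def hamming(a, b):
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def cluster_cdr3s(cdr3s, motifs=(), gccutoff=2, trim=3):
    """Group CDR3s into connected components of the similarity graph."""
    seqs = sorted(set(cdr3s))
    parent = {s: s for s in seqs}

    def find(s):
        # Follow parent links to the component root (with path halving).
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for a, b in combinations(seqs, 2):
        core_a = a[trim:len(a) - trim]
        core_b = b[trim:len(b) - trim]
        globally_similar = len(a) == len(b) and hamming(core_a, core_b) < gccutoff
        locally_similar = any(m in core_a and m in core_b for m in motifs)
        if globally_similar or locally_similar:
            parent[find(a)] = find(b)  # add an edge by merging components

    clusters = {}
    for s in seqs:
        clusters.setdefault(find(s), []).append(s)
    return sorted(clusters.values())


example = ["CASSLGQGAYEQYF", "CASSLGQGSYEQYF", "CASSPDRGNTIYF", "CASRDRGETQYF"]
print(cluster_cdr3s(example, motifs=("DRG",)))
```

Because membership is transitive, two CDR3s can end up in the same cluster without sharing any signature directly, which is the behaviour GLIPH2 later restricts.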

2.1.5. Calculating the likelihood (p) of a cluster of a given size forming by random chance in a reference set

To compute the likelihood (p) of a cluster of a given size (x) forming by random chance, GLIPH1 first repeatedly (N times) samples the non-redundant reference set at a depth equal to the number of unique CDR3s in the sample set. GLIPH1 then clusters each of those randomly sampled data sets and counts the number of times (n) that clusters of size x occur in them. The likelihood (p) of a cluster of size x is then computed as n/N. To speed up the calculation, GLIPH1 uses a lookup table built at different sample depths, each simulated 100,000 times.

2.1.6. Calculating the likelihood (p) of enrichment of common V-gene in clusters

To evaluate the likelihood (p) of enrichment of a V-gene in clusters, users need to provide a file containing the frequency distribution of V-genes found in an unselected reference data set. GLIPH1 calculates the Simpson diversity index for the V-genes of the members of a cluster and calculates the probability that random sampling of V-genes from the user-provided V-gene usage file would generate a Simpson score equal to or higher than the observed score.

2.1.7. Calculating the likelihood (p) of enrichment of common CDR3 length in clusters

To evaluate the likelihood (p) of enrichment of a CDR3 length in clusters, users need to provide a file containing the distribution of CDR3 lengths found in an unselected reference data set. GLIPH1 calculates the Simpson diversity index for the CDR3 lengths of the members of a cluster and calculates the probability that random sampling of lengths from the user-provided CDR3 length file would generate a Simpson score equal to or higher than the observed score.
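The V-gene and CDR3 length tests follow the same recipe, so one sketch covers both. The snippet below is an illustration under stated assumptions: it uses the "probability that two members share a label" form of the Simpson index (the exact form used by GLIPH1 is not given here), and the function names and the default of 10,000 resamplings are hypothetical.

```python
import random
from collections import Counter


def simpson_index(labels):
    """Probability that two members drawn without replacement carry the same label."""
    n = len(labels)
    if n < 2:
        return 1.0
    counts = Counter(labels)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


def enrichment_pvalue(cluster_labels, background_freq, n_resample=10000, seed=0):
    """Fraction of random draws from the background whose Simpson score
    is equal to or higher than the score observed in the cluster."""
    rng = random.Random(seed)
    observed = simpson_index(cluster_labels)
    labels = list(background_freq)
    weights = [background_freq[label] for label in labels]
    hits = 0
    for _ in range(n_resample):
        drawn = rng.choices(labels, weights=weights, k=len(cluster_labels))
        if simpson_index(drawn) >= observed:
            hits += 1
    return hits / n_resample
```

Here `cluster_labels` would hold either the V-genes or the CDR3 lengths of the cluster members, and `background_freq` the corresponding distribution read from the user-provided file.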

2.1.8. Calculating the likelihood (p) of enrichment of clonal expansion within clusters

GLIPH1 calculates the expansion coefficient e as the average frequency of the members of a candidate cluster. GLIPH1 then randomly chooses n CDR3s (n being the size of the candidate cluster) from the sample set and calculates the average frequency of this random selection. The process is repeated a pre-set number of times (N) to establish a distribution. The probability of the observed e for a candidate cluster is obtained as the one-tailed probability of observing a score at least that high in the distribution of e scores from randomly sampled clusters of the same size n.
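A minimal sketch of this one-tailed resampling test follows. The function name and defaults are assumptions for illustration; frequencies are passed in as plain lists of numbers.

```python
import random


def expansion_pvalue(cluster_freqs, sample_freqs, n_resample=1000, seed=0):
    """One-tailed probability that n randomly chosen CDR3s from the sample set
    have an average frequency at least as high as the candidate cluster."""
    rng = random.Random(seed)
    n = len(cluster_freqs)
    observed = sum(cluster_freqs) / n  # expansion coefficient e
    hits = 0
    for _ in range(n_resample):
        drawn = rng.sample(sample_freqs, n)
        if sum(drawn) / n >= observed:
            hits += 1
    return hits / n_resample
```

For example, a two-member cluster with clone counts 50 and 60 in a sample where nearly every other clone has count 1 would receive a p-value close to 0.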

2.1.9. Calculating the likelihood of enrichment of common HLA alleles associated with clusters

GLIPH1 uses a sampling approach to estimate the probability that a given HLA allele is present by chance.

2.1.10. Calculating the overall score for clusters

To evaluate the overall significance of a given cluster, GLIPH1 multiplies all feature scores together; among the HLA allele association scores, only the lowest one is included in the calculation.
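Read literally, the combination step is a product of per-feature probabilities. The short sketch below assumes each feature score is a probability in [0, 1]; the function name and argument layout are hypothetical.

```python
import math


def overall_score(feature_scores, hla_scores=()):
    """Multiply the feature scores; of the HLA association scores,
    only the lowest one enters the product."""
    scores = list(feature_scores)
    if hla_scores:
        scores.append(min(hla_scores))
    return math.prod(scores)
```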

GLIPH1 was implemented in Perl (https://www.perl.org/) as two scripts, gliph-group-discovery.pl and gliph-group-scoring.pl. The following are example commands for running these two scripts.

perl gliph-group-discovery.pl --tcr=out_prefix.txt --refdb=refer_file

perl gliph-group-scoring.pl --convergence_file out_prefix-convergence-groups.txt --clone_annotations=out_prefix.txt --hla_file=hla_file --motif_pval_file=out_prefix-kmer_resample_1000_minp0.001_ove10.txt --background_L_file=cdr3_length_freq_file --background_V_file=v_usage_freq_file > out_prefix.out

All parameters of these two scripts are filenames: refer_file, v_usage_freq_file and cdr3_length_freq_file provide background information; out_prefix.txt and hla_file provide the input data; and out_prefix-convergence-groups.txt, out_prefix-kmer_resample_1000_minp0.001_ove10.txt, and out_prefix.out are output files.

2.2. GLIPH version 2

GLIPH1 works well on small and clean data sets. However, as data sets become larger and noisier, the algorithm tends to generate large clusters of mixed specificities. GLIPH version 2 (referred to as GLIPH2) was developed to address these issues [16]. The differences between GLIPH2 and GLIPH1 are as follows.

2.2.1 Member CDR3s in a GLIPH1 cluster can be related by different similarity signatures, whereas member CDR3s in a GLIPH2 cluster must be related by the same similarity signature.

2.2.2 Member CDR3s in a GLIPH2 cluster based on a global similarity signature must differ at the same position. GLIPH2 labels a global similarity signature as a pattern [AC-IK-NP-TV-Y]*%[AC-IK-NP-TV-Y]*, where [AC-IK-NP-TV-Y] denotes any one amino acid, the symbol ‘*’ means 0 or more amino acids, and the symbol ‘%’ marks the position with varying amino acids. CDR3s related by the same signature can be grouped into the same cluster, as shown in Figure 2. This restriction is required when the size of the data under evaluation gets large or the density of data points gets high. In the extreme example in Figure 3, without the restriction on the position of varying amino acids, CASAAAQFF and CASGGGGQFF could be grouped into the same cluster although they differ in 4 positions at the center of the CDR3 sequences.
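One way to picture the position-specific signature is to mask each position of the trimmed CDR3 with ‘%’ in turn and group sequences that yield the same masked string. The sketch below does exactly that; it is an illustration only, and the function name, the fixed `trim=3`, and the toy sequences are assumptions (the trimming used to produce the signatures in the figures may differ).

```python
from collections import defaultdict


def global_signatures(cdr3s, trim=3):
    """Map each '%'-masked pattern to the CDR3s that match it.

    Only patterns shared by at least two distinct CDR3s are returned.
    """
    groups = defaultdict(set)
    for seq in set(cdr3s):
        core = seq[trim:len(seq) - trim]
        for i in range(len(core)):
            # Replace position i with '%' to mark it as the varying position.
            groups[core[:i] + "%" + core[i + 1:]].add(seq)
    return {sig: sorted(members) for sig, members in groups.items() if len(members) > 1}


example = ["CASSLGQGAYEQYF", "CASSLGQGSYEQYF", "CASSLAQGAYEQYF"]
for signature, members in sorted(global_signatures(example).items()):
    print(signature, members)
```

In this toy input the second and third sequences each differ from the first at one position, but at different positions, so they fall under two separate signatures rather than one chained cluster.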

Figure 2.

Position-specific global similarity signatures in the GLIPH2 algorithm. Panel A shows a CDR3 cluster produced by GLIPH1 based on global similarity signatures. Panels B, C and D show three CDR3 clusters produced by GLIPH2 based on the global similarity signatures SLGQG%Y, SL%QGAY and SL%QGAY, respectively. Note that the data set in panel A is the same as the combined data in panels B, C and D, except that CASSLGQGAYEQYF is duplicated three times. The position that differs within a global similarity signature is highlighted in red font. The symbol ‘%’ denotes the varying residue in a global similarity signature. CDR3 nodes sharing similarity signatures are connected with lines.

Figure 3.

Extreme example showing the necessity to group CDR3s with different similarity signatures into different clusters. Neighboring CDR3 nodes connected by an edge differ in a single position. However, through the network, CASAAAQFF and CASGGGGQFF, which differ in 4 positions, could be grouped into the same cluster. CDR3 nodes sharing similarity signatures are connected with lines.

2.2.3 If the parameter all_aa_interchangeable is set to 0, only amino acids with non-negative scores in the BLOSUM-62 matrix are considered interchangeable, and CDR3s with interchangeable amino acids at the varying position belong to the same global similarity signature (Figure 4). This restriction is removed if the parameter all_aa_interchangeable is set to 1.

Figure 4.

Panel A shows an example cluster based on global similarity signature by GLIPH1, and panel B shows clusters based on global similarity signature from the same data set as in panel A by GLIPH2. Residues with non-negative scores in BLOSUM-62 matrix are interchangeable when computing global similarity by GLIPH2. CDR3 nodes sharing similarity signatures are connected with lines.

2.2.4 Member CDR3s in a GLIPH2 cluster based on a local similarity signature have an identical motif, and the difference in the position of the motif within the CDR3s is restricted to three amino acids (Figure 5).
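The positional restriction can be expressed as a simple pairwise check. The sketch below assumes that "position" means the start index of the first occurrence of the motif in the trimmed CDR3 and that a difference of up to `motif_distance_cutoff` residues still counts as the same signature; both the helper names and these readings are assumptions.

```python
def motif_start(cdr3, motif, trim=3):
    """Start index of the motif within the CDR3 (ends ignored), or None if absent."""
    core = cdr3[trim:len(cdr3) - trim]
    idx = core.find(motif)
    return idx + trim if idx >= 0 else None


def same_local_signature(a, b, motif, motif_distance_cutoff=3):
    """True if both CDR3s carry the motif at positions no more than the cutoff apart."""
    pos_a, pos_b = motif_start(a, motif), motif_start(b, motif)
    if pos_a is None or pos_b is None:
        return False
    return abs(pos_a - pos_b) <= motif_distance_cutoff
```

For instance, two CDR3s carrying GTE at positions 4 and 6 would share a signature, while a third carrying GTE at position 10 would not.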

Figure 5.

Panel A shows an example cluster based on local similarity signature by GLIPH1, and panel B shows clusters based on local similarity signature (motif) from the same data set as in panel A by GLIPH2. GLIPH2 considers two identical motifs more than three residues away within CDR3 sequences as different local similarity signatures. CDR3 nodes sharing similarity signatures are connected with lines.

2.2.5 GLIPH2 uses a Fisher exact test to assess the significance of a cluster. If a cluster is based on a local similarity signature and the signature is partially encoded by non-template nucleotides added during somatic recombination, the cluster is scored with a lower p-value. In other words, such a cluster is considered statistically more significant.

2.2.6 CDR3s that do not share either local or global similarity signatures but are found in multiple samples are grouped into singlet clusters in GLIPH2. Those CDR3s are ignored in GLIPH1.

GLIPH2 was implemented in C for speed. GLIPH2 replaces the resampling approach with a Fisher exact test when searching for enriched local similarity motifs. This change removes the requirement for a large reference data set, which resampling needs because the reference data set must be much larger than the sample data set under evaluation. In addition, GLIPH2 replaces the resampling approach with a hypergeometric test when evaluating the association of HLA alleles. These updates make GLIPH2 run about 1000 times faster than GLIPH1.
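To show why an exact test removes the need for resampling, here is a one-sided Fisher exact test written with the Python standard library only. It is a sketch under assumptions: the function name and the 2×2 layout (motif present or absent, sample versus reference) are chosen for illustration and are not taken from the GLIPH2 source.

```python
from math import comb


def fisher_enrichment_pvalue(sample_with, sample_total, ref_with, ref_total):
    """One-sided Fisher exact p-value that a motif is enriched in the sample.

    Equivalent to the upper tail of a hypergeometric distribution: the chance
    of seeing at least `sample_with` motif-bearing CDR3s in the sample when
    sample and reference are pooled and the sample is drawn at random.
    """
    pooled = sample_total + ref_total
    carriers = sample_with + ref_with
    upper = min(carriers, sample_total)
    tail = sum(
        comb(carriers, k) * comb(pooled - carriers, sample_total - k)
        for k in range(sample_with, upper + 1)
    )
    return tail / comb(pooled, sample_total)


# 3 of 3 sample CDR3s carry the motif, 0 of 3 reference CDR3s do.
print(fisher_enrichment_pvalue(3, 3, 0, 3))
```

The same tail probability serves as the hypergeometric test for HLA association if "carriers" are read as subjects with a given allele.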

GLIPH2 uses a configuration file to supply parameters.

Parameter                                     Comment
out_prefix = test                             Prefix to files generated by GLIPH2
cdr3_file = test_CDR3.txt                     Input CDR3 file
hla_file = test_hla.txt                       Optional HLA allele file
number_of_hla_field = 1                       Number of fields in HLA allele used to compute association
hla_association_cutoff = 0.1                  Hypergeometric test p-value cutoff to output HLA allele
refer_file = reference_CDR3.txt               Reference CDR3 file
v_usage_freq_file = reference_Vgene.txt       Background V-gene distribution file
cdr3_length_freq_file = reference_Length.txt  Background CDR3 length distribution file
local_min_pvalue = 0.001                      Motif enrichment p-value cutoff. Set to 0.001 by default
p_depth = 10000                               Simulated resampling depth for non-parametric significance tests. Set to 10,000 by default
kmer_min_depth = 3                            Motif count cutoff. Set to 3 by default
local_min_OVE = 10                            Motif observed vs expected ratio cutoff. Set to 10 by default
motif_distance_cutoff = 3                     Position difference for motifs to be considered the same local similarity signature
ignored_end_length = 3                        Number of amino acids to ignore at both CDR3 ends
kmer_sizes = 2,3,4                            Sizes of motifs
all_aa_interchangeable = 1                    If set to 1 (default), all amino acids are interchangeable in global similarity signatures; if set to 0, only amino acids with non-negative scores in the BLOSUM-62 matrix are interchangeable

TCR information is provided in the input CDR3 file with tab-delimited fields.

Column  Field              Comment
1       CDR3               CDR3 sequence under evaluation. The field cannot be empty and cannot be “NA”.
2       V                  V gene for the above CDR3. The field cannot be empty and cannot be “NA”.
3       J                  J gene for the above CDR3. The field cannot be empty but can be “NA”.
4       CDR3c              Same CDR3 sequence as column 1, with amino acids encoded by non-template nucleotides in lower case. Optional.
5       CDR3p              Sequence of the pairing CDR3. If column 1 is CDR3β, it is CDR3α. If column 1 is CDR3δ, it is CDR3γ. The field cannot be empty but can be “NA”.
6       Subject:condition  Subject and condition are delimited with “:”. Condition can be anything, such as tissue type, cell subset or treatment. The subject part cannot be empty and must match the subject field in the input HLA file. The condition part and “:” can be omitted.
7       Frequency          The frequency or count of this TCR.

HLA allele information is provided in the input HLA file with tab-delimited fields: subject, allele1, allele2, and so on, where the subject in the HLA file needs to match the subject in the CDR3 file.

The GLIPH2 output is a comma-delimited file, out_prefix_cluster.csv, with the following fields.

Column  Field               Comment
1       index               A number unique to each cluster
2       pattern             Motif pattern (3–5 amino acids), or global pattern with the symbol ‘%’, or singlet pattern, in which identical members are found in more than one sample but do not contain any motif pattern or global pattern
3       Fisher_score        Fisher exact test score for the cluster
4       number_subject      The number of unique samples from which member CDR3s are derived
5       number_unique_cdr3  The number of unique CDR3s in the cluster
6       final_score         The aggregate score for the cluster
7       hla_score           Lowest hypergeometric test score between the cluster and HLA alleles
8       vb_score            Enrichment of V-gene within the cluster
9       expansion_score     The likelihood of enrichment of clonal expansion within the cluster
10      length_score        Enrichment of CDR3 length within the cluster
11      cluster_size_score  The likelihood of a cluster of that size forming by random chance in a reference set
12      type                A global pattern contains ‘%’, which indicates the position allowing variants; a local pattern starts with ‘motif-’; a singlet pattern looks like a global pattern without the ‘%’ symbol
13      ulTcRb              Same as the CDR3c field in the input CDR3 file; this column is present only if CDR3c is provided in the input CDR3 file
14      TcRb                Same as the CDR3 field in the input CDR3 file
15      V                   Same as the V-gene field in the input CDR3 file
16      J                   Same as the J-gene field in the input CDR3 file
17      TcRa                Same as the CDR3p field in the input CDR3 file
18      Sample              Same as the subject:condition field in the input CDR3 file
19      Freq                Same as the Frequency field in the input CDR3 file
20–30   HLA genes           Lists HLA alleles for each gene respectively; alleles with statistically significant scores are highlighted with the symbol ‘!’ next to the allele name

In addition, if HLA information is provided, GLIPH2 outputs an out_prefix_hla.csv file with the following fields.

Column  Field    Comment
1       index    A number unique to each cluster, see cluster output file
2       pattern  Motif pattern (3–5 amino acids), or global pattern with the symbol ‘%’, or singlet pattern, in which identical members are found in more than one sample but do not contain any motif pattern or global pattern
3       allele   HLA allele associated with this cluster
4       pvalue   Fisher exact test score
5       number of subjects in this cluster with this allele
6       number of subjects in this cluster with HLA
7       number of subjects with this allele in total
8       number of subjects with HLA in total

2.3. GLIPH version 3

The TCR is a heterodimeric molecule with α and β chains, which collectively form a site that binds to the cognate epitope-MHC. Earlier studies have shown that both α and β CDR3 sequences are similar in TCRs with the same specificity. To harness the pairing information between α and β CDR3 sequences, and to exploit discontinuous motifs, a major update of the GLIPH algorithm was developed (referred to as GLIPH3).

In the GLIPH3 algorithm, local similarity signatures are discontinuous or continuous motifs of a few amino acids, and a motif must occur at exactly the same position within CDR3s of the same length. In addition, GLIPH3 computes a similarity score of CDR3α sequences to assess the significance of clusters of paired CDR3β sequences, or vice versa. GLIPH3 also runs hierarchical clustering on the CDR3α sequences when it clusters the paired CDR3β sequences, or vice versa.

GLIPH3 uses the entropy fraction of 3-mers in CDR3 sequences to assess the overall similarity of a group of CDR3 sequences. To extract 3-mers, GLIPH3 walks over each CDR3 sequence with a sliding window, skipping the ignored ends specified by the corresponding parameters. Assume that p_1, p_2, …, p_i, …, p_n are the frequencies of the n unique 3-mers and that the total number of 3-mers is T; the entropy fraction is then given by

$$f = \frac{\sum_{i=1}^{n} p_i \log_2 p_i}{\log_2 \frac{1}{T}}$$

The entropy fraction is between 0 and 1. The value of the entropy fraction for a completely random set of CDR3 sequences is 1. The lower the entropy fraction, the more similar the CDR3 sequences are. GLIPH3 outputs clusters with entropy fraction values lower than the parameter entropy_fraction_cutoff.
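The entropy fraction can be computed directly from 3-mer counts. The sketch below treats each p_i as the relative frequency (count divided by T) of a unique 3-mer, which is the reading under which the value lies between 0 and 1; the function names and the default of three ignored residues at each end are assumptions for illustration.

```python
from collections import Counter
from math import log2


def three_mers(cdr3, v_end=3, j_end=3):
    """All 3-mers from a sliding window over the CDR3, skipping the ignored ends."""
    core = cdr3[v_end:len(cdr3) - j_end]
    return [core[i:i + 3] for i in range(len(core) - 2)]


def entropy_fraction(cdr3s, v_end=3, j_end=3):
    """Shannon entropy of the 3-mer distribution divided by log2(T)."""
    kmers = [k for seq in cdr3s for k in three_mers(seq, v_end, j_end)]
    total = len(kmers)
    if total < 2:
        return 0.0  # the ratio is undefined for fewer than two 3-mers
    counts = Counter(kmers)
    entropy = -sum((c / total) * log2(c / total) for c in counts.values())
    return entropy / log2(total)
```

A group in which every 3-mer is distinct scores 1, and a group that only ever repeats one 3-mer scores 0.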

With the GLIPH3 algorithm, a CDR3 sequence can be assigned to more than one cluster. When the parameter purge_cluster is set to 1, GLIPH3 purges clusters to reduce such redundancy. The purging procedure is as follows:

  1. Computes the entropy fraction value (x) for a cluster.

  2. Computes an entropy fraction value for the cluster with each member removed in turn, and finds the minimum value (y) among them.

  3. If y < purge_fraction * x, removes the member whose removal gives the entropy fraction value y and repeats steps 1–3. Otherwise, the purging procedure exits this loop.

  4. If all members of cluster A exist in cluster B, removes cluster A. If the two clusters have identical members (all members of cluster A exist in cluster B and all members of cluster B exist in cluster A), removes cluster B when the number of amino acids in pattern A is greater than that in pattern B; otherwise, removes cluster A.
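Steps 1–3 of the purging procedure amount to a greedy leave-one-out loop, sketched below. This is an illustration rather than the GLIPH3 implementation: the entropy fraction helper uses relative 3-mer frequencies with three ignored residues at each end, the loop stops once a single member remains, and all names are hypothetical. Step 4 (removing clusters contained in other clusters) is not covered.

```python
from collections import Counter
from math import log2


def entropy_fraction(cdr3s, trim=3):
    """Entropy of the 3-mer distribution divided by log2 of the 3-mer total."""
    kmers = []
    for seq in cdr3s:
        core = seq[trim:len(seq) - trim]
        kmers.extend(core[i:i + 3] for i in range(len(core) - 2))
    total = len(kmers)
    if total < 2:
        return 0.0
    counts = Counter(kmers)
    return -sum((c / total) * log2(c / total) for c in counts.values()) / log2(total)


def purge_cluster_members(members, purge_fraction=0.95):
    """Greedily drop the member whose removal lowers the entropy fraction most."""
    members = list(members)
    while len(members) > 1:
        x = entropy_fraction(members)  # step 1
        # Step 2: entropy fraction of the cluster minus each member in turn.
        candidates = [
            (entropy_fraction(members[:i] + members[i + 1:]), i)
            for i in range(len(members))
        ]
        y, idx = min(candidates)
        if y < purge_fraction * x:  # step 3: removal helps enough, so purge
            del members[idx]
        else:
            break
    return members


print(purge_cluster_members(["CASAAAAAQFF", "CASSLGQGAYEQYF", "CASAAAAAAQFF"]))
```

In the toy call, the middle sequence shares no 3-mers with the other two, so removing it drops the entropy fraction to 0 and it is purged.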

GLIPH3 uses a configuration file to supply parameters.

Parameter                               Comment
out_prefix = <required>                 Prefix to files generated by GLIPH3
cdr3_file = <required>                  Input CDR3 file
hla_file = [optional]                   Input HLA file
ag_refer_file = [optional]              Reference CDR3 file for CDR3 α or γ
bd_refer_file = [optional]              Reference CDR3 file for CDR3 β or δ
number_of_hla_field = 1                 Number of fields in HLA allele used to compute association
hla_association_pvalue_cutoff = 0.001   Hypergeometric test p-value cutoff to output HLA allele
ag_ignored_v_end = 3                    Number of amino acids from α or γ V-gene end to ignore
ag_ignored_j_end = 3                    Number of amino acids from α or γ J-gene end to ignore
bd_ignored_v_end = 3                    Number of amino acids from β or δ V-gene end to ignore
bd_ignored_j_end = 3                    Number of amino acids from β or δ J-gene end to ignore
pattern_unique_sample_cutoff = 2        Number of unique samples cutoff. Set to 2 by default
min_motif_length = 3                    Minimum number of amino acids in motif pattern
max_motif_length = 5                    Maximum number of amino acids in motif pattern
max_diff_position = 2                   Maximum number of different positions in global pattern
pattern_pvalue_cutoff = 0.0001          Pattern enrichment p-value cutoff. Set to 0.001 by default
pattern_ove_cutoff = 10                 Pattern observed vs expected ratio cutoff. Set to 10 by default
same_v = 1                              Restrict pattern to the same V-gene. Set to true by default
purge_cluster = 1                       Purge clusters. Set to true by default
continuous_motif = 1                    Examine continuous motif patterns only. Set to true by default
entropy_fraction_cutoff = 0.9           Entropy fraction cutoff to cluster. Set to 0.9 by default
purge_fraction = 0.95                   Purge fraction factor. Set to 0.95 by default
overlap_cutoff = 60                     Percentage of overlap between clusters to be listed together. Set to 60 (%) by default
ag_cdr3_length_min_cutoff = 8           Minimum length cutoff for CDR3 α or γ. Set to 8 by default
ag_cdr3_length_max_cutoff = 30          Maximum length cutoff for CDR3 α or γ. Set to 30 by default
bd_cdr3_length_min_cutoff = 8           Minimum length cutoff for CDR3 β or δ. Set to 8 by default
bd_cdr3_length_max_cutoff = 30          Maximum length cutoff for CDR3 β or δ. Set to 30 by default

The input CDR3 file is a .csv file whose first line is a header (column names). GLIPH3 uses the following columns: cdr3a, va, ja, cdr3b, vb, jb, sid, condition and frequency. The columns can be in any order, and extra columns are ignored. The column sid and at least one of cdr3a and cdr3b are required. Other columns can be either missing or “NA”. The reference file is comma-delimited with three fields in the order cdr3, v, j.
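Before running GLIPH3, it can be convenient to check an input file against the column rules above. The reader below is a small sketch using Python's csv module; the function names are hypothetical, and treating empty cells the same as “NA” is an assumption.

```python
import csv
import io

COLUMNS = ("cdr3a", "va", "ja", "cdr3b", "vb", "jb", "sid", "condition", "frequency")


def _value(row, column):
    """Return the cell value, or "NA" when the column is missing or empty."""
    value = (row.get(column) or "").strip()
    return value if value else "NA"


def read_gliph3_input(handle):
    """Read rows by column name, ignore extra columns, and enforce required fields."""
    rows = []
    for line_no, row in enumerate(csv.DictReader(handle), start=2):
        record = {column: _value(row, column) for column in COLUMNS}
        if record["sid"] == "NA":
            raise ValueError(f"line {line_no}: sid is required")
        if record["cdr3a"] == "NA" and record["cdr3b"] == "NA":
            raise ValueError(f"line {line_no}: cdr3a or cdr3b is required")
        rows.append(record)
    return rows


text = "sid,cdr3b,vb,extra\nS1,CASSLGQGAYEQYF,TRBV12-4,x\n"
print(read_gliph3_input(io.StringIO(text)))
```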

GLIPH3 outputs one cluster file for each chain, named <out_prefix>_<chain>_cluster.csv, where chain is either TRB or TRA. If HLA information is provided, it also outputs an <out_prefix>_HLA.csv file. For instance, if information for CDR3α, CDR3β and HLA is provided and out_prefix is set to “test”, there will be three output files: test_TRA_cluster.csv, test_TRB_cluster.csv, and test_HLA.csv. The format of the cluster file is shown in Figure 6.

Figure 6.

An example cluster of paired α and β CDR3 sequences. Data are grouped together based on the presence of the CDR3β pattern TRBV12-4:...S..GTE...., where dots indicate any amino acid and the pattern ...S..GTE.... is restricted to TRBV12-4. Rows are ordered according to hierarchical clustering of the CDR3α sequences.

2.4. Searching for condition-specific clusters with both single-cell sequencing data and bulk sequencing data

Single-cell sequencing can generate paired α and β CDR3 sequences for a limited number of cells. Bulk sequencing, on the other hand, can generate CDR3 sequences of either chain for a much larger number of cells. Single-cell sequencing is much more expensive than bulk sequencing and is normally used to generate data for selected cell populations, while bulk sequencing is usually used to generate data for peripheral blood mononuclear cells (PBMC). Bulk sequencing data lack information about the pairing of the α and β chains. Two approaches are proposed here to exploit the advantages of both types of data in the search for condition-specific clusters.

In the first approach (Figure 7, panel A), GLIPH analysis is performed on single-cell data from one condition against the combined data from the other condition(s). The identified clusters are then examined for CDR3s matching the clusters’ patterns in bulk data using a Fisher exact test or Mann-Whitney U test. Those clusters supported by bulk data are condition-specific clusters. The approach can be repeated to find clusters specific to the other conditions.

Figure 7.

Panel A proposes one strategy to use both single-cell sequencing data and bulk sequencing data with two different conditions. In this strategy, GLIPH analysis is performed with single-cell data on condition A against combined data on condition B to generate condition A specific clusters. Statistical test, such as Fisher Exact test for presence of CDR3s matching cluster patterns or Mann-Whitney U test for frequencies of CDR3s matching cluster patterns, is then carried out to pick up condition A specific clusters supported by bulk sequencing data. Condition B specific clusters can be generated likewise. Panel B proposes a second strategy. In this strategy, GLIPH analysis is performed with single-cell data on condition A against a reference database to generate condition A specific clusters. Statistical tests are carried out on bulk data to pick up condition A specific clusters. Condition B specific clusters can be generated likewise. The common clusters present in these two conditions are removed. * denotes condition-specific clusters. Overlapping clusters found in both conditions are shadowed by lines.

In the second approach (Figure 7, panel B), GLIPH analysis is performed on single-cell data from one condition against common reference data. The identified clusters are then checked against bulk data using a Fisher exact test or Mann-Whitney U test. After clusters found in more than one condition are removed, those clusters supported by bulk data are condition-specific clusters.

3. Notes

Sequence similarity searching is a very informative step in the analysis of newly determined sequences[17]. This is because when two sequences share more similarity than would be expected by chance, the simplest scientific explanation is that the two sequences arose from a common ancestor and are most likely to have similar functions. The presumption that statistically significant similarity implies common ancestry works well for protein or nucleotide sequences in general, but not for the CDR3 region of TCRs. The CDR3 region of TCRs is generated through the somatic recombination process, so identical CDR3s from different individuals arise from completely independent recombination events. Nevertheless, various studies[1, 3, 12, 16] have shown that TCRs that bind to the same MHC:peptide complex share extensive similarity in the CDR3 region of TCR α, TCR β, or both. Although the same MHC:peptide complex can be recognized by many distinct TCRs with different structural solutions, TCRs taking the same structural solution usually share distinguishable similarity signatures[18].

CDR3 sequences are more like enzyme active sites or transcription factor binding sites, in that TCR binding to the MHC:peptide complex is sensitive to mutations in CDR3s[19]. An early study [1] has shown that the impact of mutations in the CDR3 region on TCR binding to the MHC:peptide complex is position-specific. Therefore, the regular scoring methods widely used in conventional sequence similarity analysis might fail to capture the positional impact of mutations in the CDR3 region on TCR specificity. This is the reasoning behind the restriction on the varying position in the global similarity signature in the GLIPH2 algorithm and the restriction on the motif position in the local similarity signature in the GLIPH3 algorithm. Without such restrictions, unrelated CDR3s could be grouped into the same clusters as data density increases.

Three GLIPH algorithms are described in this chapter. GLIPH1 introduced local and global similarity signatures in the CDR3 sequences of TCRs binding to the same epitope:MHC complex. GLIPH1 works well on clean, scattered data sets. However, as the data under study become larger and noisier, GLIPH1 tends to generate large clusters of mixed specificities. Several restrictions were introduced in GLIPH2 to keep output clusters meaningful. With the availability of paired α and β CDR3 sequences, GLIPH3 was developed as a major update to the GLIPH2 algorithm to harness the pairing information between α and β CDR3 sequences to assess the quality of clusters. In addition, GLIPH3 extends the local similarity signature to discontinuous motifs and allows more than one varying position in the global similarity signature. In the implementation, the local, global and singlet patterns of GLIPH2 are consolidated into patterns with a varying number of defined amino acids. This change greatly simplifies the implementation and speeds up the GLIPH3 algorithm.

In addition to clustering TCRs by specificity, GLIPH algorithms can be used to search for condition-specific clusters. A combination of single-cell sequencing data and bulk sequencing data, or the pairing information between α and β CDR3 sequences, can be used to boost the confidence of clusters identified by GLIPH algorithms.

Figure 1.

The major difference between GLIPH1 and GLIPH2. Panel A shows a CDR3 cluster produced by GLIPH1. CDR3s sharing global similarity are connected with dashed lines and CDR3s sharing local similarity are connected with solid lines. Panel B shows CDR3 clusters produced by GLIPH2 from the same data set as in panel A, except that CASSFSKNTEAFF is duplicated in panel B; the position that differs within a global similarity signature is highlighted in bold, italic and red font, and common motifs are highlighted in bold and italic font. CDR3 nodes sharing similarity signatures are connected with lines.

References

  • 1. Glanville J, Huang H, Nau A, et al. (2017) Identifying specificity groups in the T cell receptor repertoire. Nature 547:94–98
  • 2. Thomas N, Best K, Cinelli M, Reich-Zeliger S, Gal H, Shifrut E, Madi A, Friedman N, Shawe-Taylor J, Chain B (2014) Tracking global changes induced in the CD4 T-cell receptor repertoire by immunization with a complex antigen using short stretches of CDR3 protein sequence. Bioinforma Oxf Engl 30:3181–3188
  • 3. Farina C, van der Bruggen P, Boël P, Parmiani G, Sensi M, Moretta L (1996) Conserved TCR usage by HLA-Cw * 1601-restricted T cell clones recognizing melanoma antigens. Int Immunol 8:1463–1466
  • 4. Miles JJ, Bulek AM, Cole DK, et al. (2010) Genetic and Structural Basis for Selection of a Ubiquitous T Cell Receptor Deployed in Epstein-Barr Virus Infection. PLoS Pathog 6:e1001198.
  • 5. Lim A, Trautmann L, Peyrat M-A, Couedel C, Davodeau F, Romagné F, Kourilsky P, Bonneville M (2000) Frequent Contribution of T Cell Clonotypes with Public TCR Features to the Chronic Response Against a Dominant EBV-Derived Epitope: Application to Direct Detection of Their Molecular Imprint on the Human Peripheral T Cell Repertoire. J Immunol 165:2001–2011
  • 6. Grant EJ, Josephs TM, Valkenburg SA, Wooldridge L, Hellard M, Rossjohn J, Bharadwaj M, Kedzierska K, Gras S (2016) Lack of Heterologous Cross-reactivity toward HLA-A*02:01 Restricted Viral Epitopes Is Underpinned by Distinct αβT Cell Receptor Signatures. J Biol Chem 291:24335–24351
  • 7. Motozono C, Kuse N, Sun X, Rizkallah PJ, Fuller A, Oka S, Cole DK, Sewell AK, Takiguchi M (2014) Molecular Basis of a Dominant T Cell Response to an HIV Reverse Transcriptase 8-mer Epitope Presented by the Protective Allele HLA-B*51:01. J Immunol 192:3428–3434
  • 8. Brennan RM, Miles JJ, Silins SL, Bell MJ, Burrows JM, Burrows SR (2007) Predictable αβ T-Cell Receptor Selection toward an HLA-B*3501-Restricted Human Cytomegalovirus Epitope. J Virol 81:7269–7273
  • 9. Ishihara Y, Tanaka Y, Kobayashi S, et al. (2017) A Unique T-Cell Receptor Amino Acid Sequence Selected by Human T-Cell Lymphotropic Virus Type 1 Tax 301–309-Specific Cytotoxic T Cells in HLA-A24:02-Positive Asymptomatic Carriers and Adult T-Cell Leukemia/Lymphoma Patients. J Virol. 10.1128/JVI.00974-17
  • 10. Miles JJ, Elhassen D, Borg NA, et al. (2005) CTL Recognition of a Bulged Viral Peptide Involves Biased TCR Selection. J Immunol 175:3826–3834
  • 11. Huth A, Liang X, Krebs S, Blum H, Moosmann A (2019) Antigen-Specific TCR Signatures of Cytomegalovirus Infection. J Immunol 202:979–990
  • 12. Dash P, Fiore-Gartland AJ, Hertz T, et al. (2017) Quantifiable predictive features define epitope-specific T cell receptor repertoires. Nature 547:89–93
  • 13. Mayer-Blackwell K, Schattgen S, Cohen-Lavi L, Crawford JC, Souquette A, Gaevert JA, Hertz T, Thomas PG, Bradley P, Fiore-Gartland A (2021) TCR meta-clonotypes for biomarker discovery with tcrdist3: identification of public, HLA-restricted SARS-CoV-2 associated TCR features. bioRxiv 2020.12.24.424260
  • 14. Zhang H, Liu L, Zhang J, et al. (2020) Investigation of Antigen-Specific T-Cell Receptor Clusters in Human Cancers. Clin Cancer Res 26:1359–1371
  • 15. Pogorelyy MV, Minervina AA, Shugay M, Chudakov DM, Lebedev YB, Mora T, Walczak AM (2019) Detecting T cell receptors involved in immune responses from single repertoire snapshots. PLoS Biol 17:e3000314.
  • 16. Huang H, Wang C, Rubelt F, Scriba TJ, Davis MM (2020) Analyzing the Mycobacterium tuberculosis immune response by T-cell receptor clustering with GLIPH2 and genome-wide antigen screening. Nat Biotechnol 38:1194–1202
  • 17. Pearson WR (2013) An Introduction to Sequence Similarity (“Homology”) Searching. Curr Protoc Bioinforma Ed Board Andreas Baxevanis Al 0 3: 10.1002/0471250953.bi0301s42
  • 18. Song I, Gil A, Mishra R, Ghersi D, Selin LK, Stern LJ (2017) Broad TCR repertoire and diverse structural solutions for recognition of an immunodominant CD8+ T cell epitope. Nat Struct Mol Biol 24:395–406
  • 19. Madura F, Rizkallah PJ, Miles KM, et al. (2013) T-cell Receptor Specificity Maintained by Altered Thermodynamics. J Biol Chem 288:18766–18775
