Abstract
Grouping TCRs on the similarity of CDR3 sequences could effectively cluster them by specificity. Three versions of the GLIPH algorithm are described briefly here, with instructions to use GLIPH algorithms to cluster TCRs by specificity.
Keywords: TCR, CDR3, GLIPH, clustering, specificity
1. Introduction
T cells play a central role in adaptive immunity. One defining characteristic of adaptive immunity is the highly diverse repertoire of T cell receptors (TCRs) in each individual, generated through the V(D)J somatic recombination process. A given T cell expresses one of two TCR types: TCRαβ or TCRγδ. In the rest of this chapter, we shall use the term T cell to mean αβ T cell, and TCR to mean TCRαβ, except where specified otherwise. T cells selectively recognize and respond to epitopes presented by major histocompatibility complex (MHC) molecules through their TCRs. Earlier studies have shown that TCRs recognizing the same epitope-MHC ligand share similarities in the CDR3 regions of the α chain, the β chain, or both[1–8]. In some cases, CDR3β, CDR3α, or both are nearly identical in TCRs of the same specificity. In other cases, linear sequence similarities (motifs) of 3–4 amino acids can be found in CDR3β, CDR3α, or both among CDR3 sequences recognizing the same epitope-MHC ligand[9–11]. Therefore, grouping TCRs by the similarity of their CDR3 sequences could effectively cluster them by specificity.
Tools that try to group TCRs by specificity based on CDR3 sequence similarity include the TCRdist algorithms[12, 13], iSMART[14], ALICE[15], and the GLIPH algorithms[1, 16]. TCRdist computes distances among CDR3 sequences and then clusters them hierarchically. iSMART performs pairwise local alignment on T cell receptor CDR3 sequences to group them into antigen-specific clusters[14]. ALICE groups similar CDR3 sequences from the same sample into clusters and reports those clusters with more member sequences than expected by chance. Based on our earlier studies, version 2 of the GLIPH algorithm outperforms the other tools, with better specificity and faster speed[16]. In the following sections, we will focus on the GLIPH algorithms.
2. GLIPH algorithm and analysis procedure
2.1. GLIPH version 1
This algorithm, “Grouping Lymphocyte Interactions by Paratope Hotspots” or GLIPH [1] (referred to as GLIPH1 in this manuscript), searches for global sequence similarity and local sequence similarity (motifs), and automatically clusters TCR sequences into distinct groups according to their likely specificity. The algorithm runs in three stages: discovering global and local similarity signatures, constructing clusters of TCRs with identified similarities, and evaluating the enrichment of features for each cluster. The algorithm is briefly described in the following sections.
2.1.1. Pre-processing input data
GLIPH1 works on the non-redundant CDR3 amino acid sequences of both the input sample set (the collection of TCRs under evaluation) and the reference set (a large database of TCR sequences that are not expected to be enriched for specificities found in the sample set). GLIPH1 ignores the first three and last three residues of all CDR3 sequences when computing both global and local sequence similarity.
2.1.2. Discovering local similarity signatures
GLIPH1 scans all possible 3mer, 4mer and 5mer motifs for their frequency in the sample set. To evaluate whether these motifs are specifically enriched by antigens, these frequencies are compared with those from repeated random sampling of the non-redundant reference set at the same depth as the non-redundant sample set. For a particular motif in the sample set, GLIPH1 computes the observed vs expected (OVE) ratio, where the observed value is the frequency of the motif in the non-redundant sample set, and the expected value is the average frequency of the same motif across the repeated samplings of the non-redundant reference set. Additionally, GLIPH1 computes an empirical p-value for a motif as n/N, where n is the number of random samplings in which the motif is found more frequently than in the sample set and N is the total number of samplings. If the frequency of a motif is greater than the pre-set --lcmindepth parameter, its OVE is greater than the pre-set --lcminove parameter, and its p-value is less than the pre-set --lcminip parameter, the motif is considered enriched in the sample set. This procedure is repeated for all motifs to collect all enriched motifs.
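The resampling test above can be sketched in a few lines of Python. This is a minimal illustration, not the GLIPH1 Perl code: the function names, the choice to count each motif once per CDR3, the inclusive handling of the depth and OVE cutoffs, and the use of "at least as frequent" for the empirical p-value are all assumptions made for the example.

```python
import random
from collections import Counter

def count_motifs(cdr3s, k_sizes=(3, 4, 5), trim=3):
    """Count, for each k-mer, the number of unique CDR3s containing it.
    The first and last `trim` residues are ignored."""
    counts = Counter()
    for seq in set(cdr3s):
        core = seq[trim:len(seq) - trim]
        motifs = {core[i:i + k] for k in k_sizes for i in range(len(core) - k + 1)}
        counts.update(motifs)
    return counts

def enriched_motifs(sample, reference, n_resample=100, min_depth=3,
                    min_ove=10.0, max_p=0.001, seed=0):
    """Return {motif: (observed, OVE, empirical p)} for motifs passing all cutoffs."""
    rng = random.Random(seed)
    sample = sorted(set(sample))
    reference = sorted(set(reference))
    observed = count_motifs(sample)
    expected_sum = Counter()   # summed counts over all resamplings
    at_least = Counter()       # resamplings where count >= observed (assumed reading)
    for _ in range(n_resample):
        draw = count_motifs(rng.sample(reference, len(sample)))
        for motif, obs in observed.items():
            c = draw.get(motif, 0)
            expected_sum[motif] += c
            if c >= obs:
                at_least[motif] += 1
    result = {}
    for motif, obs in observed.items():
        expected = expected_sum[motif] / n_resample
        ove = obs / expected if expected > 0 else float("inf")
        p = at_least[motif] / n_resample
        # Boundary handling (>= for depth and OVE) is an assumption of this sketch.
        if obs >= min_depth and ove >= min_ove and p < max_p:
            result[motif] = (obs, ove, p)
    return result
```

The reference set must contain at least as many unique CDR3s as the sample set, since each resampling draws without replacement at the sample depth.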
2.1.3. Discovering global similarity signatures
GLIPH1 counts the number of differing positions between any pair of CDR3s of the same length. Two CDR3s are considered globally similar if the number of differing positions is less than the --gccutoff parameter. If a user does not provide the --gccutoff parameter, it is set automatically according to the sample depth, that is, the number of unique CDR3s in the sample set.
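The pairwise comparison described above amounts to a Hamming distance between same-length CDR3s. The following sketch assumes the comparison is made on the trimmed core (first and last three residues ignored) and uses hypothetical function and argument names.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def globally_similar_pairs(cdr3s, gccutoff=2, trim=3):
    """Pairs of unique same-length CDR3s whose trimmed cores differ
    at fewer than `gccutoff` positions."""
    unique = sorted(set(cdr3s))
    pairs = []
    for a, b in combinations(unique, 2):
        if len(a) != len(b):
            continue  # only CDR3s of the same length are compared
        if hamming(a[trim:len(a) - trim], b[trim:len(b) - trim]) < gccutoff:
            pairs.append((a, b))
    return pairs
```

The all-against-all loop is quadratic in the number of unique CDR3s, which is adequate for illustration but not for large repertoires.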
2.1.4. Constructing clusters of TCRs with identified similarities
GLIPH1 groups TCRs with identified similarities into a single cluster. A graph is used to model the relationships between CDR3s in the sample set, with each CDR3 represented as a node. If two CDR3s share a similarity signature, their nodes are connected with an edge. A cluster is then a connected component, in which any two CDR3 nodes are connected to each other by paths but are not connected to any other nodes in the rest of the graph.
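Finding connected components is a standard graph traversal. The sketch below uses a depth-first search over an adjacency map; the function name and the decision to report unconnected CDR3s as clusters of size one are choices made for this example only.

```python
from collections import defaultdict

def connected_components(nodes, edges):
    """Group nodes into clusters, where each cluster is a connected component."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen = set()
    clusters = []
    for start in sorted(nodes):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            current = stack.pop()
            component.append(current)
            for neighbour in adjacency[current]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        clusters.append(sorted(component))
    return clusters
```

Because membership is transitive, two CDR3s can end up in the same cluster without sharing any signature directly, as long as a path of shared signatures links them.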
2.1.5. Calculating the likelihood (p) of a cluster of that size forming by random chances in a reference set
To compute the likelihood (p) of a cluster of a given size (x) forming by random chance, GLIPH1 first repeatedly (for instance, N times) samples the non-redundant reference set at a depth equal to the number of unique CDR3s in the sample set. GLIPH1 then clusters each of those random samples and counts the number of times (for instance, n) that clusters of size x occur in them. The likelihood (p) of a cluster of size x is then computed as n/N. To speed up the calculation, GLIPH1 creates a lookup table for different sample depths, each based on 100,000 samplings.
2.1.6. Calculating the likelihood (p) of enrichment of common V-gene in clusters
To evaluate the likelihood (p) of enrichment of a V-gene in clusters, users need to provide a file containing the frequency distribution of V-genes found in an unselected reference data set. GLIPH1 calculates the Simpson diversity index for the V-genes of the members of a cluster, and then calculates the probability that a random sampling of V-genes from the user-provided V-gene usage file would generate a Simpson score equal to or greater than the observed score.
2.1.7. Calculating the likelihood (p) of enrichment of common CDR3 length in clusters
To evaluate the likelihood (p) of enrichment of a CDR3 length in clusters, users need to provide a file containing the distribution of CDR3 lengths found in an unselected reference data set. GLIPH1 calculates the Simpson diversity index for the CDR3 lengths of the members of a cluster, and then calculates the probability that a random sampling of lengths from the user-provided CDR3 length file would generate a Simpson score equal to or greater than the observed score.
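The V-gene and CDR3-length tests share one pattern, which can be sketched as follows. The sketch assumes the Simpson index is the sum of squared proportions (so a higher score means a more concentrated cluster) and that the background file has been read into a dictionary of frequencies; the names and the number of resamplings are hypothetical.

```python
import random
from collections import Counter

def simpson(items):
    """Simpson index as the sum of squared proportions (assumed definition)."""
    n = len(items)
    return sum((count / n) ** 2 for count in Counter(items).values())

def enrichment_p(cluster_values, background_freq, n_resample=1000, seed=0):
    """Fraction of random draws from the background whose Simpson score is
    equal to or greater than the score observed in the cluster."""
    rng = random.Random(seed)
    categories = list(background_freq)
    weights = [background_freq[c] for c in categories]
    observed = simpson(cluster_values)
    hits = 0
    for _ in range(n_resample):
        draw = rng.choices(categories, weights=weights, k=len(cluster_values))
        if simpson(draw) >= observed:
            hits += 1
    return hits / n_resample
```

The same function applies to V-genes (categories are gene names) and to CDR3 lengths (categories are integers), given the matching background distribution.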
2.1.8. Calculating the likelihood (p) of enrichment of clonal expansion within clusters
GLIPH1 calculates the expansion coefficient e as the average frequency of the members of a candidate cluster. GLIPH1 then randomly chooses n CDR3s (n being the size of the candidate cluster) from the sample set and calculates the average frequency of this random set. The process is repeated a pre-set N times to establish a distribution. The probability of the observed e for a candidate cluster is obtained as the one-tailed probability of observing a score at least that high in the distribution of e scores from randomly sampled clusters of the same size n.
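A minimal sketch of the expansion test described above, assuming clone frequencies are available as plain lists of numbers and that random clusters are drawn without replacement; the function name and defaults are illustrative.

```python
import random

def expansion_p(cluster_freqs, sample_freqs, n_resample=1000, seed=0):
    """One-tailed empirical probability that a random cluster of the same size
    has an average clone frequency at least as high as the observed one."""
    rng = random.Random(seed)
    n = len(cluster_freqs)
    observed = sum(cluster_freqs) / n        # expansion coefficient e
    hits = 0
    for _ in range(n_resample):
        draw = rng.sample(sample_freqs, n)   # n CDR3s chosen at random
        if sum(draw) / n >= observed:
            hits += 1
    return hits / n_resample
```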
2.1.9. Calculating the likelihood of enrichment of common HLA alleles associated with clusters
GLIPH1 uses a sampling approach to estimate the probability that a given HLA allele is present by chance.
2.1.10. Calculating the overall score for clusters
To evaluate the overall significance of a given cluster, GLIPH1 multiplies all feature scores together; of the HLA allele association scores, only the lowest is included in the calculation.
GLIPH1 was implemented in Perl (https://www.perl.org/) as two scripts, gliph-group-discovery.pl and gliph-group-scoring.pl. The following example commands show how to use these two scripts.
perl gliph-group-discovery.pl --tcr=out_prefix.txt --refdb=refer_file
perl gliph-group-scoring.pl --convergence_file out_prefix-convergence-groups.txt --clone_annotations=out_prefix.txt --hla_file=hla_file --motif_pval_file=out_prefix-kmer_resample_1000_minp0.001_ove10.txt --background_L_file=cdr3_length_freq_file --background_V_file=v_usage_freq_file > out_prefix.out
All parameters of these two scripts are filenames, where refer_file, v_usage_freq_file and cdr3_length_freq_file provide background information; out_prefix.txt, hla_file provide input data information; out_prefix-convergence-groups.txt, out_prefix-kmer_resample_1000_minp0.001_ove10.txt, and out_prefix.out are output files.
2.2. GLIPH version 2
GLIPH1 works well on small and clean data sets. However, as data sets become larger and noisier, the algorithm tends to generate large clusters of mixed specificities. GLIPH version 2 (referred to as GLIPH2) was developed to address these issues [16]. The differences between GLIPH2 and GLIPH1 are as follows.
2.2.1 Member CDR3s in a GLIPH1 cluster could be related by different similarity signatures, while member CDR3s in a GLIPH2 cluster must be related by same similarity signatures.
2.2.2 Member CDR3s in a GLIPH2 cluster based on a global similarity signature must differ at the same position. GLIPH2 labels a global similarity signature as a pattern [AC-IK-NP-TV-Y]*%[AC-IK-NP-TV-Y]*, where [AC-IK-NP-TV-Y] denotes any one amino acid, the symbol ‘*’ means 0 or more amino acids, and the symbol ‘%’ marks the position with varying amino acids. CDR3s related by the same signature can be grouped into the same cluster, as shown in Figure 2. This restriction is required when the size of the data under evaluation gets large or the density of data points gets high. In the extreme example in Figure 3, without a restriction on the position of the varying amino acid, CASAAAAQFF and CASGGGGQFF could be grouped into the same cluster although they differ in 4 positions at the center of the CDR3 sequences.
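One way to picture position-specific signatures is to replace each core position of a CDR3 with ‘%’ and group CDR3s that produce the same string. The sketch below does this on the full sequence while leaving three residues at each end untouched; how GLIPH2 itself trims and writes the pattern is not reproduced here, and the function name is hypothetical.

```python
from collections import defaultdict

def global_signatures(cdr3s, trim=3):
    """Map each '%' signature to the CDR3s that share it.
    Only signatures shared by at least two CDR3s are returned."""
    groups = defaultdict(set)
    for seq in set(cdr3s):
        for i in range(trim, len(seq) - trim):
            groups[seq[:i] + "%" + seq[i + 1:]].add(seq)
    return {sig: sorted(members) for sig, members in groups.items() if len(members) > 1}

# A chain of hypothetical 10-residue CDR3s, each differing from its
# neighbour at one position, in the spirit of the extreme example.
chain = ["CASAAAAQFF", "CASGAAAQFF", "CASGGAAQFF", "CASGGGAQFF", "CASGGGGQFF"]
clusters = global_signatures(chain)
```

With this grouping, each neighbouring pair in the chain forms its own cluster, and the two ends of the chain never share one, in contrast to the connected-component approach.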
Figure 2.
Position-specific global similarity signatures in the GLIPH2 algorithm. Panel A shows a CDR3 cluster generated by GLIPH1 based on global similarity signatures. Panels B, C and D show three CDR3 clusters generated by GLIPH2 based on global similarity signatures SLGQG%Y, SL%QGAY and SL%QGAY respectively. Note that the data set in panel A is the same as the combined data in panels B, C and D, except that CASSLGQGAYEQYF is shown three times. The differing position in each global similarity signature is highlighted in red. The symbol ‘%’ denotes the varying residue in a global similarity signature. CDR3 nodes sharing similarity signatures are connected with lines.
Figure 3.
Extreme example showing the necessity of grouping CDR3s with different similarity signatures into different clusters. Neighboring CDR3 nodes connected by an edge differ at a single position. However, through the network, CASAAAAQFF and CASGGGGQFF, which differ in 4 positions, could be grouped into the same cluster. CDR3 nodes sharing similarity signatures are connected with lines.
2.2.3 If the parameter all_aa_interchangeable is set to 0, only amino acids with non-negative scores in the BLOSUM-62 matrix are considered interchangeable, and CDR3s with interchangeable amino acids at the varying position share the same global similarity signature (Figure 4). This restriction is removed if the parameter all_aa_interchangeable is set to 1.
Figure 4.
Panel A shows an example cluster based on global similarity signature by GLIPH1, and panel B shows clusters based on global similarity signature from the same data set as in panel A by GLIPH2. Residues with non-negative scores in BLOSUM-62 matrix are interchangeable when computing global similarity by GLIPH2. CDR3 nodes sharing similarity signatures are connected with lines.
2.2.4 Member CDR3s in a GLIPH2 cluster based on local similarity signature have an identical motif and the position difference of the motif within CDR3s is restricted within three amino acids (Figure 5).
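The position restriction on motifs can be illustrated with a simple grouping rule: CDR3s containing the motif are sorted by the motif's start position, and a new group is opened whenever the position lies more than the cutoff away from the first member of the current group. This rule, the use of the first occurrence only, and the function name are assumptions for illustration; the actual GLIPH2 grouping logic may differ.

```python
def split_by_motif_position(cdr3s, motif, cutoff=3):
    """Split CDR3s containing `motif` into groups whose motif start
    positions lie within `cutoff` residues of the group's first member."""
    hits = sorted((seq.find(motif), seq) for seq in set(cdr3s) if motif in seq)
    groups = []
    for position, seq in hits:
        if groups and position - groups[-1][0][0] <= cutoff:
            groups[-1].append((position, seq))
        else:
            groups.append([(position, seq)])
    return [[seq for _, seq in group] for group in groups]
```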
Figure 5.
Panel A shows an example cluster based on local similarity signature by GLIPH1, and panel B shows clusters based on local similarity signature (motif) from the same data set as in panel A by GLIPH2. GLIPH2 considers two identical motifs more than three residues away within CDR3 sequences as different local similarity signatures. CDR3 nodes sharing similarity signatures are connected with lines.
2.2.5 GLIPH2 uses a Fisher Exact test to assess the significance of a cluster. If a cluster is based on a local similarity signature and the signature is partially encoded by non-template nucleotides during somatic recombination, the cluster will be scored with a lower p-value. In other words, this cluster is considered statistically more significant.
2.2.6 CDR3s, which do not share either local or global similarity signatures but are found in multiple samples, are grouped into singlet clusters in GLIPH2. Those CDR3s are ignored in GLIPH1.
GLIPH2 was implemented in C for speed. GLIPH2 replaces the resampling approach with a Fisher exact test when searching for enriched local similarity motifs. This change removes the requirement for a large reference data set, which resampling needs because the reference data set must be much larger than the sample data set under evaluation. In addition, GLIPH2 replaces the resampling approach with a hypergeometric test when evaluating the association of HLA alleles. These updates make GLIPH2 run about 1000 times faster than GLIPH1.
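A one-sided Fisher exact test is the upper tail of a hypergeometric distribution, so it can be computed directly from binomial coefficients. The sketch below assumes a 2×2 table of CDR3s with and without a motif in the sample and reference sets; the exact table GLIPH2 builds is not specified here, and the function name is hypothetical.

```python
from math import comb

def fisher_enrichment_p(k_sample, n_sample, k_ref, n_ref):
    """One-sided Fisher exact p-value: probability of seeing at least
    `k_sample` motif-bearing CDR3s in the sample set, given the margins.

    k_sample, n_sample: motif-bearing and total CDR3s in the sample set
    k_ref, n_ref:       motif-bearing and total CDR3s in the reference set
    """
    total = n_sample + n_ref
    with_motif = k_sample + k_ref
    upper = min(n_sample, with_motif)
    tail = sum(
        comb(with_motif, k) * comb(total - with_motif, n_sample - k)
        for k in range(k_sample, upper + 1)
    )
    return tail / comb(total, n_sample)
```

The same tail probability serves as a hypergeometric test for HLA association if the counts are subjects carrying an allele inside and outside a cluster.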
GLIPH2 uses a configuration file to supply parameters.
Parameter | Comment |
---|---|
out_prefix = test | Prefix to files generated by GLIPH2 |
cdr3_file = test_CDR3.txt | Input CDR3 file |
hla_file = test_hla.txt | Optional HLA allele file |
number_of_hla_field = 1 | Number of fields in HLA allele used to compute association |
hla_association_cutoff = 0.1 | Hypergeometric test p-value cutoff to output HLA allele |
refer_file = reference_CDR3.txt | Reference CDR3 file |
v_usage_freq_file = reference_Vgene.txt | Background V-gene distribution file |
cdr3_length_freq_file = reference_Length.txt | Background CDR3 length distribution |
local_min_pvalue = 0.001 | Motif enrichment p-value cutoff. Set to 0.001 by default |
p_depth = 10000 | Simulated resampling depth for non-parametric significance tests. Set to 10,000 by default |
kmer_min_depth = 3 | Motif count cutoff. Set to 3 by default |
local_min_OVE = 10 | Motif observed vs expected ratio cutoff. Set to 10 by default |
motif_distance_cutoff=3 | Position difference for motif to be considered to be the same local similarity signature |
ignored_end_length=3 | The number of amino acids from both CDR3 ends to ignore |
kmer_sizes=2,3,4 | Specify the size of motifs |
all_aa_interchangeable = 1 | If set to 0, only amino acids with non-negative scores in the BLOSUM-62 matrix are interchangeable at the varying position of a global similarity signature. If set to 1 (default), all amino acids are interchangeable |
TCR information is provided in the input CDR3 file with tab-delimited fields.
Column | Field | Comment |
---|---|---|
1 | CDR3 | CDR3 sequences under evaluation. The field cannot be empty, cannot be “NA”. |
2 | V | V gene for above CDR3. The field cannot be empty, cannot be “NA”. |
3 | J | J gene for above CDR3. The field cannot be empty, can be “NA”. |
4 | CDR3c | Same CDR3 sequence as column 1, with non-template nucleotide encoded amino acid in lower case. Optional |
5 | CDR3p | Sequence for pairing CDR3. If column 1 is CDR3β, it is CDR3α. If column 1 is CDR3δ, it is CDR3γ. The field cannot be empty, can be “NA”. |
6 | Subject:condition | Subject and condition are delimited with “:”. Condition can be anything, such as tissue type, cell subset or treatment. The subject part cannot be empty and must match the subject field in the input HLA file. The condition part and “:” can be omitted. |
7 | Frequency | The frequency or count of this TCR. |
HLA allele information is provided in the input HLA file with tab-delimited fields: subject, allele1, allele2, and so on, where the subject in the HLA file needs to match the subject in the CDR3 file.
The GLIPH2 output is a comma-delimited out_prefix_cluster.csv file with the following fields.
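Before running GLIPH2, it can help to check each input record against the rules in the table above. The sketch below works on a dictionary keyed by the field names from that table; the function name, the dictionary layout, and the message wording are assumptions for illustration, and nothing is assumed about headers or optional columns in the file itself.

```python
def check_record(record, hla_subjects=None):
    """Return a list of problems found in one CDR3 record (empty if none)."""
    problems = []
    for field in ("CDR3", "V"):
        if record.get(field, "") in ("", "NA"):
            problems.append(field + " cannot be empty or NA")
    for field in ("J", "CDR3p"):
        if record.get(field, "") == "":
            problems.append(field + " cannot be empty (NA is allowed)")
    # The subject is the part before the optional ':condition' suffix.
    subject = record.get("Subject:condition", "").split(":", 1)[0]
    if not subject:
        problems.append("subject cannot be empty")
    elif hla_subjects is not None and subject not in hla_subjects:
        problems.append("subject " + subject + " not found in HLA file")
    return problems
```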
Column | Field | Comment |
---|---|---|
1 | index | A number unique to each cluster |
2 | pattern | Motif pattern (3–5 amino acids), or global pattern with symbol ‘%’, or singlet pattern, in which identical members are found in more than one sample but do not contain any motif pattern or any global pattern |
3 | Fisher_score | Fisher exact test score for the cluster |
4 | number_subject | The number of unique samples from which the member CDR3s are derived |
5 | number_unique_cdr3 | The number of unique CDR3s in the cluster |
6 | final_score | The aggregative score for a cluster |
7 | hla_score | Lowest hypergeometric test score between a cluster and HLA alleles |
8 | vb_score | Enrichment of V-gene within the cluster |
9 | expansion_score | The likelihood of enrichment of clonal expansion within the cluster |
10 | length_score | Enrichment of CDR3 length within the cluster |
11 | cluster_size_score | The likelihood of a cluster of that size forming by random chance in a reference set |
12 | type | A global pattern contains ‘%’, which indicates the position allowing variants; a local pattern starts with ‘motif-’; a singlet pattern looks like a global pattern without the ‘%’ symbol |
13 | ulTcRb | Same as CDR3c field in input CDR3 file, this column is present only if CDR3c is provided in input CDR3 file |
14 | TcRb | Same as CDR3 field in input CDR3 file |
15 | V | Same as V-gene field in input CDR3 file |
16 | J | Same as J-gene field in input CDR3 file |
17 | TcRa | Same as CDR3p field in input CDR3 file |
18 | Sample | Same as subject:condition field in input CDR3 file |
19 | Freq | Same as Frequency field in input CDR3 file |
20–30 | HLA genes | List HLA alleles for each gene respectively, alleles with statistically significant scores are highlighted with symbol ‘!’ next to the allele name |
In addition, if HLA information is provided, GLIPH2 outputs an out_prefix_hla.csv file with the following fields.
Column | Field | Comment |
---|---|---|
1 | index | A number unique to each cluster, see cluster output file |
2 | pattern | Motif pattern (3–5 amino acids), or global pattern with symbol ‘%’, or singlet pattern, in which identical members are found in more than one sample but do not contain any motif pattern or any global pattern |
3 | allele | HLA allele associated with this cluster |
4 | pvalue | Fisher exact test score |
5 | number of subjects in this cluster with this allele | |
6 | number of subjects in this cluster with HLA | |
7 | number of subjects with this allele in total | |
8 | number of subjects with HLA in total | |
2.3. GLIPH version 3
The TCR is a heterodimeric molecule with α and β chains, which collectively form a site that binds to the cognate epitope-MHC. Earlier studies have shown that both α and β CDR3 sequences are similar in TCRs with the same specificity. To harness the pairing information between α and β CDR3 sequences, and to exploit discontinuous motifs, a major update of the GLIPH algorithm was developed (referred to as GLIPH3).
In the GLIPH3 algorithm, local similarity signatures are discontinuous or continuous motifs of a few amino acids, and a motif must occupy the same position within CDR3s of the same length. In addition, GLIPH3 computes a similarity score of the CDR3α sequences to assess the significance of clusters of paired CDR3β sequences, or vice versa. GLIPH3 also runs a hierarchical clustering on the CDR3α sequences when it clusters the paired CDR3β sequences, or vice versa.
GLIPH3 uses the entropy fraction of 3-mers in CDR3 sequences to assess the overall similarity of a group of CDR3 sequences. To extract 3-mers, GLIPH3 walks over each CDR3 sequence with a sliding window, skipping the ignored ends specified by the corresponding parameters. Assume that p1, p2, …, pi, …, pn are frequencies of n unique 3-mers and the total number of 3-mers is T, the entropy fraction is given by . The entropy fraction is between 0 and 1. The entropy fraction of a completely random set of CDR3 sequences is 1. The lower the entropy fraction, the more similar the CDR3 sequences are. GLIPH3 outputs clusters with entropy fraction values lower than the parameter entropy_fraction_cutoff.
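As a working assumption for illustration, the entropy fraction can be taken as the Shannon entropy of the 3-mer distribution divided by its largest possible value for T 3-mers, that is, $-\sum_{i=1}^{n} p_i \ln p_i / \ln T$ with $p_i$ the proportion of the i-th unique 3-mer. This reading lies between 0 and 1 and equals 1 when every 3-mer is distinct, which matches the properties stated above, but it is not confirmed as the GLIPH3 definition. The function name and the treatment of groups with fewer than two 3-mers are likewise assumptions.

```python
import math
from collections import Counter

def entropy_fraction(cdr3s, v_trim=3, j_trim=3, k=3):
    """Normalized Shannon entropy of k-mers in the trimmed CDR3 cores
    (assumed definition: entropy divided by ln of the total k-mer count)."""
    counts = Counter()
    for seq in cdr3s:
        core = seq[v_trim:len(seq) - j_trim]
        counts.update(core[i:i + k] for i in range(len(core) - k + 1))
    total = sum(counts.values())
    if total < 2:
        return 1.0  # too few k-mers to judge; treated as uninformative here
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(total)
```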
With the GLIPH3 algorithm, a CDR3 sequence could be assigned to more than one cluster. When the parameter purge_cluster is set to 1, GLIPH3 purges clusters to reduce such redundancy. The purging procedure is as follows:
1. Compute the entropy fraction value (x) for a cluster.
2. Compute an entropy fraction value for the cluster minus each member in turn, and find the minimum value (y) among them.
3. If y < purge_fraction * x, remove the member whose removal yields y and repeat steps 1–3. Otherwise, exit the loop.
4. If all members of cluster A exist in cluster B, remove cluster A. If the two clusters have identical members (all members of A exist in B and all members of B exist in A), remove cluster B when pattern A contains more amino acids than pattern B; otherwise, remove cluster A.
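Steps 1–3 of the purging procedure form a simple loop, sketched below. The entropy fraction used here is an assumed definition (Shannon entropy of 3-mers divided by the logarithm of the total 3-mer count), the stop at two remaining members is a guard added for the example, and step 4 (removing clusters contained in other clusters) is not covered.

```python
import math
from collections import Counter

def entropy_fraction(cdr3s, trim=3, k=3):
    """Assumed entropy fraction: normalized Shannon entropy of core k-mers."""
    counts = Counter()
    for seq in cdr3s:
        core = seq[trim:len(seq) - trim]
        counts.update(core[i:i + k] for i in range(len(core) - k + 1))
    total = sum(counts.values())
    if total < 2:
        return 1.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(total)

def purge_members(members, purge_fraction=0.95):
    """Repeatedly drop the member whose removal lowers the entropy
    fraction the most, as long as y < purge_fraction * x."""
    members = list(members)
    while len(members) > 2:
        x = entropy_fraction(members)                       # step 1
        y, index = min(                                     # step 2
            (entropy_fraction(members[:i] + members[i + 1:]), i)
            for i in range(len(members))
        )
        if y < purge_fraction * x:                          # step 3
            del members[index]
        else:
            break
    return members
```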
GLIPH3 uses a configuration file to supply parameters.
Parameter | Comment |
---|---|
out_prefix = <required> | Prefix to files generated by GLIPH3 |
cdr3_file = <required> | Input CDR3 file |
hla_file = [optional] | Input HLA file |
ag_refer_file = [optional] | Reference CDR3 file for CDR3 α or γ |
bd_refer_file = [optional] | Reference CDR3 file for CDR3 β or δ |
number_of_hla_field = 1 | Number of fields in HLA allele used to compute association |
hla_association_pvalue_cutoff = 0.001 | Hypergeometric test p-value cutoff to output HLA allele |
ag_ignored_v_end = 3 | Number of amino acids from α or γ V-gene end to ignore |
ag_ignored_j_end = 3 | Number of amino acids from α or γ J-gene end to ignore |
bd_ignored_v_end = 3 | Number of amino acids from β or δ V-gene end to ignore |
bd_ignored_j_end = 3 | Number of amino acids from β or δ J-gene end to ignore |
pattern_unique_sample_cutoff = 2 | Number of unique samples cutoff. Set to 2 by default |
min_motif_length = 3 | Minimum number of amino acids in motif pattern |
max_motif_length = 5 | Maximum number of amino acids in motif pattern |
max_diff_position = 2 | Maximum number of different positions in global pattern |
pattern_pvalue_cutoff = 0.0001 | Pattern enrichment p-value cutoff. Set to 0.001 by default |
pattern_ove_cutoff = 10 | Pattern observed vs expected ratio cutoff. Set to 10 by default |
same_v = 1 | Restrict pattern with the same V-gene. Set to true by default. |
purge_cluster = 1 | Purge clusters. Set to be true by default |
continuous_motif = 1 | Examine continuous motif pattern only. Set to be true by default |
entropy_fraction_cutoff = 0.9 | Entropy fraction cutoff to cluster. Set to be 0.9 by default |
purge_fraction = 0.95 | Purge fraction factor. Set to be 0.95 by default |
overlap_cutoff = 60 | Percentage of overlap between clusters to be listed together. Set to 60(%) by default |
ag_cdr3_length_min_cutoff = 8 | Minimum length cutoff for CDR3 α or γ. Set to 8 by default |
ag_cdr3_length_max_cutoff = 30 | Maximum length cutoff for CDR3 α or γ. Set to 30 by default |
bd_cdr3_length_min_cutoff = 8 | Minimum length cutoff for CDR3 β or δ. Set to 8 by default |
bd_cdr3_length_max_cutoff = 30 | Maximum length cutoff for CDR3 β or δ. Set to 30 by default |
The input CDR3 file is a .csv file with the first line as header information (column names). GLIPH3 uses the following columns: cdr3a, va, ja, cdr3b, vb, jb, sid, condition and frequency. The columns can be in any order, and extra columns are ignored. The column sid and at least one of cdr3a and cdr3b are required. Other columns can be either missing or “NA”. The reference file is comma-delimited with three fields in the order cdr3, v, j.
GLIPH3 outputs one cluster file for each chain, named after the pattern <out_prefix>_<chain>_cluster.csv, where chain is either TRB or TRA. If HLA information is provided, it outputs an <out_prefix>_HLA.csv file as well. For instance, if information for cdr3α, cdr3β and HLA is provided, and the out_prefix is set as “test”, there will be three output files with the following names: test_TRA_cluster.csv, test_TRB_cluster.csv, and test_HLA.csv. The format of the cluster file is shown in Figure 6.
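An input file in this layout can be produced with the standard csv module. The sketch below writes the nine column names listed above, fills missing values with “NA”, and rejects records without sid or without any CDR3; the function name and the choice to always write all nine columns are assumptions.

```python
import csv
import io

COLUMNS = ["cdr3a", "va", "ja", "cdr3b", "vb", "jb", "sid", "condition", "frequency"]

def to_gliph3_csv(records):
    """Serialize records (dicts) to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, restval="NA",
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for record in records:
        if not record.get("sid") or not (record.get("cdr3a") or record.get("cdr3b")):
            raise ValueError("sid and at least one of cdr3a/cdr3b are required")
        writer.writerow(record)
    return buffer.getvalue()
```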
Figure 6.
An example cluster of paired α and β CDR3 sequences. Data are grouped together based on the presence of CDR3β pattern: TRBV12-4:…S..GTE…., where dots indicate any amino acids and the pattern …S..GTE…. is restricted by TRBV12-4. Rows are ordered according to hierarchical clustering of cdr3α sequences.
2.4. Searching for condition-specific clusters with both single-cell sequencing data and bulk sequencing data
Single-cell sequencing can generate paired α and β CDR3 sequences for a limited number of cells. Bulk sequencing, on the other hand, can generate CDR3 sequences of either chain for a much larger number of cells. Single-cell sequencing is much more expensive than bulk sequencing and is normally used to generate data for selected cell populations, while bulk sequencing is usually used to generate data for peripheral blood mononuclear cells (PBMC). Bulk sequencing data lack information about the pairing of the α and β chains. Two approaches are proposed here to combine the advantages of both types of data in the search for condition-specific clusters.
In the first approach (Figure 7, panel A), GLIPH analysis is performed on single-cell data from one condition against the combined data from the other condition(s). The identified clusters are then examined for CDR3s matching the clusters’ patterns in bulk data using a Fisher exact test or Mann-Whitney U test. Clusters supported by bulk data are condition-specific clusters. The approach can be repeated to find clusters specific to the other conditions.
Figure 7.
Panel A proposes one strategy to use both single-cell sequencing data and bulk sequencing data with two different conditions. In this strategy, GLIPH analysis is performed with single-cell data on condition A against combined data on condition B to generate condition A specific clusters. Statistical test, such as Fisher Exact test for presence of CDR3s matching cluster patterns or Mann-Whitney U test for frequencies of CDR3s matching cluster patterns, is then carried out to pick up condition A specific clusters supported by bulk sequencing data. Condition B specific clusters can be generated likewise. Panel B proposes a second strategy. In this strategy, GLIPH analysis is performed with single-cell data on condition A against a reference database to generate condition A specific clusters. Statistical tests are carried out on bulk data to pick up condition A specific clusters. Condition B specific clusters can be generated likewise. The common clusters present in these two conditions are removed. * denotes condition-specific clusters. Overlapping clusters found in both conditions are shadowed by lines.
In the second approach (Figure 7, panel B), GLIPH analysis is performed on single-cell data from one condition against common reference data. The identified clusters are then checked against bulk data using a Fisher exact test or Mann-Whitney U test. After clusters found in more than one condition are removed, the clusters supported by bulk data are condition-specific clusters.
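The bulk-data check by presence can be sketched as follows: count the unique bulk CDR3s matching a cluster pattern in each condition, then apply a one-sided Fisher exact test. The sketch treats ‘%’ and ‘.’ in a pattern as a single-residue wildcard and matches the pattern anywhere in the CDR3; both choices, and the function names, are assumptions for illustration.

```python
import re
from math import comb

def pattern_to_regex(pattern):
    """Turn a cluster pattern into a regular expression, with '%' or '.'
    standing for any single residue."""
    return re.compile("".join("[A-Z]" if ch in "%." else re.escape(ch) for ch in pattern))

def bulk_support(pattern, bulk_a, bulk_b):
    """Return (matches in A, matches in B, one-sided Fisher exact p-value)
    for enrichment of pattern-matching CDR3s in condition A."""
    regex = pattern_to_regex(pattern)
    a, b = set(bulk_a), set(bulk_b)
    k_a = sum(bool(regex.search(seq)) for seq in a)
    k_b = sum(bool(regex.search(seq)) for seq in b)
    total, matches = len(a) + len(b), k_a + k_b
    tail = sum(
        comb(matches, k) * comb(total - matches, len(a) - k)
        for k in range(k_a, min(len(a), matches) + 1)
    )
    return k_a, k_b, tail / comb(total, len(a))
```

A frequency-based comparison, such as the Mann-Whitney U test mentioned above, would use the clone frequencies of the matching CDR3s instead of their presence.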
3. Notes
Sequence similarity searching is a very informative step in the analysis of newly determined sequences[17]. This is because when two sequences share more similarity than would be expected by chance, the simplest scientific explanation is that the two sequences arose from a common ancestor and are most likely to have similar functions. The presumption that statistically significant similarity implies common ancestry works well for protein or nucleotide sequences, but not for the CDR3 region of TCRs. The CDR3 region of TCRs is generated through the somatic recombination process, and identical CDR3s from different individuals arise from completely independent recombination events. Nevertheless, various studies[1, 3, 12, 16] have shown that TCRs binding to the same MHC:peptide complex share extensive similarity in the CDR3 region of TCR α, TCR β, or both. Although the same MHC:peptide complex can be recognized by many distinct TCRs with different structural solutions, TCRs taking the same structural solution usually share distinguishable similarity signatures[18].
CDR3 sequences are more like enzyme active sites or transcription factor binding sites, in that TCR binding to the MHC:peptide complex is sensitive to mutations in CDR3s[19]. An early study [1] has shown that the impact of mutations in the CDR3 region on TCR binding to the MHC:peptide complex is position-specific. Therefore, the regular scoring methods widely used in conventional sequence similarity analysis might fail to capture the positional impact of mutations in the CDR3 region on TCR specificity. This is the reasoning behind the restriction on the varying position in the global similarity signature in the GLIPH2 algorithm and the restriction on the motif position in the local similarity signature in the GLIPH3 algorithm. Without such restrictions, unrelated CDR3s could be grouped into the same clusters as data density increases.
Three GLIPH algorithms are described in this chapter. GLIPH1 introduced local and global similarity signatures in the CDR3 sequences of TCRs binding to the same epitope:MHC complex. GLIPH1 works well on clean, scattered data sets. However, as the data under study become larger and noisier, GLIPH1 tends to generate large clusters of mixed specificities. Several restrictions were introduced in GLIPH2 to keep output clusters meaningful. With the availability of paired α and β CDR3 sequences, GLIPH3 was developed as a major update to the GLIPH2 algorithm to harness the pairing information between α and β CDR3 sequences in assessing the quality of clusters. In addition, GLIPH3 extends the local similarity signature to discontinuous motifs and allows more than one varying position in the global similarity signature. In the implementation, the local, global and singlet patterns of GLIPH2 are consolidated into patterns with a varying number of defined amino acids. This change greatly simplifies the implementation and speeds up the GLIPH3 algorithm.
In addition to clustering TCRs by specificity, GLIPH algorithms can be used to search for condition-specific clusters. A combination of single-cell sequencing data and bulk sequencing data, or the pairing information between α and β CDR3 sequences, can be used to boost the confidence of clusters identified by GLIPH algorithms.
Figure 1.
The major difference between GLIPH1 and GLIPH2. Panel A shows a CDR3 cluster generated by GLIPH1. CDR3s sharing global similarity are connected with dashed lines and CDR3s sharing local similarity are connected with solid lines. Panel B shows CDR3 clusters generated by GLIPH2 from the same data set as in panel A, except that CASSFSKNTEAFF is shown twice in panel B. The differing position in each global similarity signature is highlighted in bold, italic and red font, and common motifs are highlighted in bold and italic font. CDR3 nodes sharing similarity signatures are connected with lines.
References
- 1. Glanville J, Huang H, Nau A, et al. (2017) Identifying specificity groups in the T cell receptor repertoire. Nature 547:94–98
- 2. Thomas N, Best K, Cinelli M, Reich-Zeliger S, Gal H, Shifrut E, Madi A, Friedman N, Shawe-Taylor J, Chain B (2014) Tracking global changes induced in the CD4 T-cell receptor repertoire by immunization with a complex antigen using short stretches of CDR3 protein sequence. Bioinforma Oxf Engl 30:3181–3188
- 3. Farina C, van der Bruggen P, Boël P, Parmiani G, Sensi M, Moretta L (1996) Conserved TCR usage by HLA-Cw*1601-restricted T cell clones recognizing melanoma antigens. Int Immunol 8:1463–1466
- 4. Miles JJ, Bulek AM, Cole DK, et al. (2010) Genetic and Structural Basis for Selection of a Ubiquitous T Cell Receptor Deployed in Epstein-Barr Virus Infection. PLoS Pathog 6:e1001198
- 5. Lim A, Trautmann L, Peyrat M-A, Couedel C, Davodeau F, Romagné F, Kourilsky P, Bonneville M (2000) Frequent Contribution of T Cell Clonotypes with Public TCR Features to the Chronic Response Against a Dominant EBV-Derived Epitope: Application to Direct Detection of Their Molecular Imprint on the Human Peripheral T Cell Repertoire. J Immunol 165:2001–2011
- 6. Grant EJ, Josephs TM, Valkenburg SA, Wooldridge L, Hellard M, Rossjohn J, Bharadwaj M, Kedzierska K, Gras S (2016) Lack of Heterologous Cross-reactivity toward HLA-A*02:01 Restricted Viral Epitopes Is Underpinned by Distinct αβT Cell Receptor Signatures. J Biol Chem 291:24335–24351
- 7. Motozono C, Kuse N, Sun X, Rizkallah PJ, Fuller A, Oka S, Cole DK, Sewell AK, Takiguchi M (2014) Molecular Basis of a Dominant T Cell Response to an HIV Reverse Transcriptase 8-mer Epitope Presented by the Protective Allele HLA-B*51:01. J Immunol 192:3428–3434
- 8. Brennan RM, Miles JJ, Silins SL, Bell MJ, Burrows JM, Burrows SR (2007) Predictable αβ T-Cell Receptor Selection toward an HLA-B*3501-Restricted Human Cytomegalovirus Epitope. J Virol 81:7269–7273
- 9. Ishihara Y, Tanaka Y, Kobayashi S, et al. (2017) A Unique T-Cell Receptor Amino Acid Sequence Selected by Human T-Cell Lymphotropic Virus Type 1 Tax 301–309-Specific Cytotoxic T Cells in HLA-A24:02-Positive Asymptomatic Carriers and Adult T-Cell Leukemia/Lymphoma Patients. J Virol. 10.1128/JVI.00974-17
- 10.Miles JJ, Elhassen D, Borg NA, et al. (2005) CTL Recognition of a Bulged Viral Peptide Involves Biased TCR Selection. J Immunol 175:3826–3834 [DOI] [PubMed] [Google Scholar]
- 11.Huth A, Liang X, Krebs S, Blum H, Moosmann A (2019) Antigen-Specific TCR Signatures of Cytomegalovirus Infection. J Immunol 202:979–990 [DOI] [PubMed] [Google Scholar]
- 12.Dash P, Fiore-Gartland AJ, Hertz T, et al. (2017) Quantifiable predictive features define epitope-specific T cell receptor repertoires. Nature 547:89–93 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Mayer-Blackwell K, Schattgen S, Cohen-Lavi L, Crawford JC, Souquette A, Gaevert JA, Hertz T, Thomas PG, Bradley P, Fiore-Gartland A (2021) TCR meta-clonotypes for biomarker discovery with tcrdist3: identification of public, HLA-restricted SARS-CoV-2 associated TCR features. bioRxiv 2020.12.24.424260 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Zhang H, Liu L, Zhang J, et al. (2020) Investigation of Antigen-Specific T-Cell Receptor Clusters in Human Cancers. Clin Cancer Res 26:1359–1371 [DOI] [PubMed] [Google Scholar]
- 15.Pogorelyy MV, Minervina AA, Shugay M, Chudakov DM, Lebedev YB, Mora T, Walczak AM (2019) Detecting T cell receptors involved in immune responses from single repertoire snapshots. PLoS Biol 17:e3000314. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Huang H, Wang C, Rubelt F, Scriba TJ, Davis MM (2020) Analyzing the Mycobacterium tuberculosis immune response by T-cell receptor clustering with GLIPH2 and genome-wide antigen screening. Nat Biotechnol 38:1194–1202 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Pearson WR (2013) An Introduction to Sequence Similarity (“Homology”) Searching. Curr Protoc Bioinforma Ed Board Andreas Baxevanis Al 0 3: 10.1002/0471250953.bi0301s42 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Song I, Gil A, Mishra R, Ghersi D, Selin LK, Stern LJ (2017) Broad TCR repertoire and diverse structural solutions for recognition of an immunodominant CD8+ T cell epitope. Nat Struct Mol Biol 24:395–406 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Madura F, Rizkallah PJ, Miles KM, et al. (2013) T-cell Receptor Specificity Maintained by Altered Thermodynamics. J Biol Chem 288:18766–18775 [DOI] [PMC free article] [PubMed] [Google Scholar]