2020 Oct 22;11:565096. doi: 10.3389/fimmu.2020.565096

Table 2.

Algorithms to predict antigen specificity of TCR repertoire.

References	Data	Distance measure	Clustering algorithm
Thomas et al. (183)	CDR3 sequences of CD4+ T cell repertoire before and after immunization	Replace each CDR3 by all possible n-mer peptides, then convert each n-mer peptide into numeric Atchley vectors	K-means clustering of the Atchley vectors; the number of vectors assigned to each cluster is counted to form a feature vector, which is then classified by hierarchical clustering (unsupervised) or a support vector machine (supervised)
Dash et al. (15)	pMHC-facing loop between CDR2 and CDR3 and trimmed CDR3 sequences from 4,635 paired TCRαβ sequences	Similarity-weighted mismatch distance between the potential pMHC-contacting loops of two TCRs, defined by BLOSUM62 (named TCRdist)	Sampling density near each TCR is estimated by the weighted average distance to its nearest-neighbor receptors in the repertoire (the nearest-neighbor distance, NN-distance; a small value indicates dense sampling). Each TCR repertoire is clustered with a “greedy” fixed-distance-threshold algorithm: at each step, the TCR with the largest number of neighbors within the distance threshold is chosen as a cluster center, and the step is iterated over all TCRs
Glanville et al. (16)	CDR3 from 5,711 TCRβ sequences	Global similarity by CDR3 Hamming distance between two TCRs with the same Vβ segment and same-length CDR3. Local convergence measured as the fold-change enrichment of a motif: its observed frequency over its expected frequency in repeated random sampling from the naïve distribution	Cluster TCRs that either share global similarity below the Hamming distance threshold (differ by <2 amino acids) or share a significant motif (>10-fold enriched and <0.001 probability of occurring at that frequency in the naïve TCR pool)
Cinelli et al. (184)	CDR3 from CD4+ TCRβ sequences before and after immunization	CDR3β sequences deconstructed into k-mers, then motifs ranked according to one-dimensional Bayesian classifier score comparing their frequency in repertoires of two immunization classes	Top ranking motifs selected and used to create feature vectors to train a support vector machine for classifying into distinct clusters
Priel et al. (185)	~360,000 TCRβ sequences from (188)	Levenshtein distance between TCRβ and cluster representative	UClust algorithm (189). Sort sequences by length, then iteratively assign each next sequence to an existing cluster when its Levenshtein distance from that cluster's representative is smaller than a given threshold, generating a “Clone-Attractors” (CAs) network
DeWitt et al. (182)	TCRβ sequences from 666 healthy individuals from (190)	Co-occurrence of global TCRβ (for genetic background) and HLA-restricted TCRβ (for immune history and receptor specificity) by analysis of covariation and hypergeometric distribution to assess significance	DBSCAN algorithm (191) to cluster public TCRβ by occurrence patterns, with (i) predefined similarity/distance threshold and (ii) minimum number of neighbors for a point to be considered as a core
Meysman et al. (186)	Two independent datasets of 412 TCRβ from (15) and 2,835 TCRβ sequences	Investigated length-based distance, GapAlign score, profile score, trimer score, dimer score, Levenshtein distance score, and VJ edit distance	DBSCAN algorithm (191), an unsupervised clustering method that groups TCRs based on a fixed distance defined in advance
Pogorelyy and Shugay (17)	CDR3 from TCRβ sequences from (190)	Hamming distance, allowing a single substitution	Build TCR similarity networks by Hamming distance and identify enriched TCR network hubs by testing neighborhood size (degree) enrichment against a VDJ rearrangement model using the ALICE algorithm (192) or against a control dataset using TCRnet
Thakkar and Bailey-Kellogg (187)	CDR3 sequences, CDR3α and CDR3β analyzed separately	Local alignment using Smith-Waterman (SW) algorithm with BLOSUM45	Hierarchical agglomerative clustering, with CDRdist (a nearest neighbor classifier to predict label of another CDR based on nearby labeled CDRs) as a comparison function. Clusters defined by CDRdist thresholds
Zhang et al. (18)	82,000 CDR3 sequences from 9,700 tumor RNA-Seq samples from TCGA	Pairwise alignment score with BLOSUM62, normalized by the length of longer CDR3 sequence	From the pairwise score matrix, apply a predefined cut-off value (default 3.5) to filter out low-scoring comparisons. A depth-first search (DFS) on the matrix then identifies all connected CDR3 clusters (named iSMART)
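
Several rows of Table 2 share one pattern: compute a pairwise distance between CDR3 sequences, keep only pairs within a fixed threshold, and read clusters off the resulting similarity network. The Python sketch below illustrates that pattern only and does not reproduce any of the published tools. It assumes a Hamming distance with a threshold of one substitution (the criterion listed for the Glanville et al. and Pogorelyy and Shugay rows) and borrows the depth-first search over connected sequences described for the Zhang et al. row. It omits Vβ matching, motif enrichment, and any statistical testing. The function names `hamming` and `cluster_cdr3`, the parameter `max_dist`, and the example sequences are hypothetical.

```python
from collections import defaultdict


def hamming(a: str, b: str) -> int:
    """Count mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def cluster_cdr3(seqs, max_dist=1):
    """Group CDR3 strings into connected components of a similarity network.

    Two sequences are linked when they have the same length and differ by at
    most `max_dist` substitutions (an assumed, illustrative threshold).
    """
    seqs = list(dict.fromkeys(seqs))  # drop duplicates, keep input order

    # Hamming distance is only defined within a length class.
    by_len = defaultdict(list)
    for s in seqs:
        by_len[len(s)].append(s)

    # Build the adjacency list of the similarity network.
    neighbors = {s: [] for s in seqs}
    for group in by_len.values():
        for i, a in enumerate(group):
            for b in group[i + 1:]:
                if hamming(a, b) <= max_dist:
                    neighbors[a].append(b)
                    neighbors[b].append(a)

    # Iterative depth-first search to collect connected components.
    seen, clusters = set(), []
    for s in seqs:
        if s in seen:
            continue
        stack, component = [s], []
        seen.add(s)
        while stack:
            cur = stack.pop()
            component.append(cur)
            for n in neighbors[cur]:
                if n not in seen:
                    seen.add(n)
                    stack.append(n)
        clusters.append(sorted(component))
    return clusters


if __name__ == "__main__":
    # Made-up example sequences, not taken from any dataset in Table 2.
    example = ["CASSLGQYF", "CASSLGAYF", "CASSLGAYT",
               "CASSPDRGYEQYF", "CASSIRSSYEQYF"]
    for cluster in cluster_cdr3(example):
        print(cluster)
```

Because clusters are connected components, two sequences can end up in the same cluster even when they differ from each other by more than `max_dist`, as long as a chain of close neighbors links them. The greedy center-based scheme in the Dash et al. row and the density-based DBSCAN rows handle this chaining differently, so the sketch should not be read as equivalent to those methods.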