TABLE 2.
Method | Output | Key hyperparameters | Advantages | Disadvantages | Ref. |
---|---|---|---|---|---|
Dimensionality reduction | Lower-dimensional representation of original data | | Visualization of high-dimensional data, discovery of subsets of data | Potential information loss | |
Principal Component Analysis (PCA) | Original data on new axes where axes are linear combinations of original dimensions | | Well-established, easy to interpret, fast, consistent results across applications on the same data | Misses nonlinear patterns in data | 44 |
t-distributed Stochastic Neighbor Embedding (t-SNE) | Original data on new axes where axes have no inherent interpretation | Effective number of nearest neighbors (Perplexity); cycles before algorithm is considered done (Iterations) | Discovery of nonlinear patterns | Difficult to interpret axes, slow, repeat applications produce different results, requires downsampling | 29 |
Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP) | Original data on new axes where axes have no inherent interpretation | Minimum distance between neighbors in new space; number of neighbors | Discovery of nonlinear patterns, fast, does not require downsampling | Difficult to interpret, repeat applications produce different results | 30,31 |
Clustering | Algorithmically determined groupings of data points | Distance metric (how to assign distance between two points) | Unbiased discovery of potentially biologically meaningful groups of data points | | |
Hierarchical clustering | Data points organized into a tree structure | How distance between clusters is determined (Linkage) | Easily observe multilevel clustering | Determining where to cut tree to produce clusters can be difficult | |
k-means | k clusters of original data | Number of clusters (k) | Fast, well-established | Need to specify number of clusters beforehand, cannot find clusters that are not simple spheres or ellipses | |
Density-based spatial clustering of applications with noise (DBSCAN) | Clusters of original data | Minimum number of points to call a region dense; radius of a point’s neighborhood (Epsilon) | No need to specify number of clusters, can find | Many data points may be classified as “noise” or one large cluster depending on hyperparameters | 45 |
Repertoire analysis | | | | | 38 |
Diversity | Measure of clonal diversity | Choice of diversity metric (Gini, Entropy, Chao1, Hill, etc.) | Provides a single diversity metric for a sample or population of cells, can be compared across samples and conditions | Can be difficult to interpret intuitively, sensitive to number of samples | |
Sequence distance | Distance between two TCR or BCR sequences | Choice of distance metric (Levenshtein, etc.) | Distances can be used in downstream applications like clustering or dimensionality reduction for visualization | Distances might not be biologically meaningful | |
Motif enrichment | Significant sequence motifs | Choice of algorithm (GLIPH, etc.) | Discovery of motifs that may confer specificity | May miss larger motifs depending on hyperparameter choices | |
Phylogenetics | BCR clonal family trees | Evolutionary model for amino acid mutation | Can infer lineages and branching points during affinity maturation | Can be sensitive to hyperparameters, methods typically optimized for traditional evolutionary models | |
TCR, T cell receptor; BCR, B cell receptor; GLIPH, Grouping of Lymphocyte Interactions by Paratope Hotspots.
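
To make the Diversity and Sequence distance rows concrete, the following is a minimal Python sketch of one diversity metric (Shannon entropy) and one distance metric (Levenshtein edit distance). The function names and the CDR3 sequences are hypothetical and chosen only for illustration; they are not taken from the cited references, and dedicated repertoire analysis tools may define or normalize these quantities differently.

```python
import math
from collections import Counter


def shannon_entropy(clone_counts):
    """Shannon entropy (in nats) of a clone-size distribution."""
    total = sum(clone_counts)
    return -sum((c / total) * math.log(c / total) for c in clone_counts if c > 0)


def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


# Hypothetical CDR3 amino acid sequences, one per cell (illustrative only).
cdr3s = ["CASSLGQAYEQYF", "CASSLGQAYEQYF", "CASSLGQGYEQYF", "CASRPGTEAFF"]

# Clone sizes: how many cells share each distinct sequence.
counts = list(Counter(cdr3s).values())
entropy = shannon_entropy(counts)

# Distance between two clones that differ by a single residue.
dist = levenshtein("CASSLGQAYEQYF", "CASSLGQGYEQYF")
```

Entropy is zero when a single clone dominates the sample and grows as cells are spread more evenly across clones. A matrix of pairwise edit distances can in turn serve as input to the clustering methods listed above, with the caveat from the table that such distances might not be biologically meaningful.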