2022 Oct 26;20:6375–6387. doi: 10.1016/j.csbj.2022.10.029

Table 1.

Algorithms included. Summary of all algorithms applied in this analysis, with the three main clustering types indicated where applicable (G = graph-based, H = hierarchical, K = K-means) and O marking any other approach. The normalization used for each algorithm is also provided, where TPM = transcripts per million and RPKM = reads per kilobase of transcript per million mapped reads.

| Algorithm name (source) | Software | Brief description | Normalization | Type |
| --- | --- | --- | --- | --- |
| AltAnalyze [13] | Python source code | AltAnalyze uses a guide gene selection strategy that iteratively clusters cells with the hierarchical-ordered partitioning and collapsing hybrid (HOPACH) [30] algorithm, and removes genes and clusters with low intra-correlations. The top intra-correlated genes are selected as guide genes, and the final clustering results are obtained by running HOPACH on all the guide genes. | Raw counts | H |
| Ascend [6] | R package | Clustering by Optimal Resolution (CORE) [31] method: Euclidean distance is first calculated based on the first 20 principal components (PCs) from the principal component analysis (PCA) reduced count matrix. Hierarchical clustering is then applied on the distance matrix to obtain the initial clustering. Outlier cells from this first round of clustering are identified and removed. A re-clustering is then performed by a top down split and clusters are merged over multiple iterations. During this process, adjusted Rand index (ARI) is used to compare different clusters and identify the most stable number of clusters. | Raw counts | H |
| bigSCale [14] | MATLAB source code | bigSCale first computes a pairwise cell distance matrix based on the genes with a high degree of variance. Then, Ward's linkage is used on the distance matrix to assign cells into different groups. | Raw counts. Scatter normalization is part of the pipeline | H |
| Cell Ranger [15] | Python/R | Cell Ranger constructs a sparse k-nearest neighbors (kNN) graph where cells are linked if they are among the k nearest Euclidean neighbors. The Louvain modularity optimization algorithm is used to find highly connected modules in the graph. Then, hierarchical clustering of cluster medoids in the PCA space is done and cluster siblings are merged if there are no differentially expressed genes between them. | TPM | G, H |
| CIDR [7] | R package | CIDR first imputes gene expression levels for dropout genes. Then, a dissimilarity matrix is obtained by computing the Euclidean distance between every pair of cells. Finally, PCA is used on the dissimilarity matrix, and a hierarchical clustering is applied to the first few principal components for clustering. | Raw counts | H |
| Monocle [16] | R package | tSNE is first performed to reduce the dimensionality of the dataset. A kNN network is then constructed with k = 20. The Louvain algorithm is used on the kNN network for clustering. | Raw counts | G |
| pcaReduce [17] | R package | PCA is first used on the dataset to reduce its dimensions to q. Then k-means clustering is applied on the q-dimensional matrix and divides cells into (q + 1) clusters. After that, the probability of each pair of clusters being merged is calculated, and the two clusters with the highest probability are merged. This process is repeated until one cluster remains. | Log2 normalization | H, K |
| PhenoGraph [18] | Python source code | A weighted kNN network is first constructed with the weights being the number of shared common nearest neighbors between two connected cells. Then, the Louvain algorithm is used to divide cells in the network into different clusters. | Raw counts | G |
| RaceID [9] | R/C++ | A cell similarity matrix is first constructed by computing the Pearson’s correlation coefficients between all pairs of cells. Then, a distance matrix is obtained by subtracting the similarity matrix from 1. Finally, k-means clustering is used on the distance matrix to group cells into different clusters. | Raw counts. RaceID does an internal normalization based on median transcript across all cells. | K |
| RCA [19] | R package | A projection vector is calculated for each cell based on the Pearson correlation coefficients between the dataset and the two reference bulk transcriptomes. Average-linkage hierarchical clustering is then used on the projection vectors for clustering. | RPKM | H |
| SC3 [10] | R package | SC3 first runs k-means clustering on the dataset with different parameters simultaneously. Then, a consensus matrix is computed by summarizing how often each pair of cells is located in the same cluster. Finally, the result is determined by complete-linkage hierarchical clustering of the consensus matrix. | Raw counts. Scatter normalization is part of the pipeline | H, K |
| Scran [20] | R package | Hierarchical clustering is applied on PCs. Normalization is done by deconvolving size factors from cell pools. | Raw counts | H |
| Seurat [21] | R package | Seurat's default pipeline first finds variable features from the dataset, then applies PCA to get the top 50 PCs. Finally, the Louvain algorithm is used on the 50 PCs for clustering. | Raw counts. Log normalization is part of the Seurat pipeline | G |
| SINCERA [22] | R package | Expression data are first transformed to z-scores. Hierarchical clustering is then used to divide cells into different groups. | z-score scaling is part of the pipeline | H |
| TSCAN [23] | R package | TSCAN first divides genes into different clusters using hierarchical clustering, which reduces the number of features to the number of gene clusters. PCA is then applied to further reduce the dataset dimensionality. Finally, a mixture of multivariate normal distributions is fitted to the data, and cells are assigned to clusters based on their probability of belonging to each cluster. | Raw counts | H, O |
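
The two normalization units named in the caption of Table 1 (TPM and RPKM) can be illustrated with a minimal sketch for a single cell. The function names, the toy counts, and the gene lengths are illustrative assumptions and are not taken from any of the listed tools; the sketch also assumes at least one non-zero count.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of gene length per million mapped reads (one cell)."""
    total_reads = sum(counts)
    per_million = total_reads / 1e6
    return [c / (length / 1e3) / per_million for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million (one cell): length-normalize first, then rescale."""
    rates = [c / (length / 1e3) for c, length in zip(counts, lengths_bp)]
    scale = sum(rates) / 1e6
    return [r / scale for r in rates]


# Hypothetical toy data: three genes whose counts are proportional to length.
counts = [100, 200, 700]
lengths_bp = [1000, 2000, 7000]

rpkm_values = rpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)
```

The main practical difference is the order of operations: TPM values always sum to one million within a cell, whereas RPKM values do not, so TPM is easier to compare across cells.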
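
The Ascend row of Table 1 mentions the adjusted Rand index (ARI) as the measure used to compare clusterings. The following is a minimal sketch of the standard ARI formula for two label vectors, not Ascend's own code; the function name and the convention of returning 1.0 for two trivial partitions are assumptions.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two clusterings of the same cells."""
    n = len(labels_a)
    # Pairs of cells placed together in both clusterings.
    sum_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case (e.g. both clusterings are a single cluster); assumed 1.0.
        return 1.0
    return (sum_both - expected) / (max_index - expected)


# Same partition under different label names, and a maximally discordant one.
ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
ari_diff = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because ARI is invariant to label names, it can compare clusterings produced at different iterations without first matching cluster identifiers.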
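
The PhenoGraph row of Table 1 describes a kNN network whose edge weights are the number of nearest neighbors shared by two connected cells. A minimal sketch of that graph construction step is given below, using Euclidean distance on hypothetical 2-D points; the function names and data are assumptions, and the subsequent Louvain step is not shown.

```python
from math import dist


def knn(points, k):
    """Set of the k nearest neighbors (Euclidean) for each point, excluding itself."""
    neighbors = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )
        neighbors.append(set(others[:k]))
    return neighbors


def shared_neighbor_edges(points, k):
    """Undirected kNN edges weighted by the number of shared nearest neighbors."""
    nbrs = knn(points, k)
    edges = {}
    for i, neighbors_i in enumerate(nbrs):
        for j in neighbors_i:
            edges[(min(i, j), max(i, j))] = len(neighbors_i & nbrs[j])
    return edges


# Hypothetical toy data: two well-separated groups of three cells each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
edges = shared_neighbor_edges(points, k=2)
```

With these toy points, every edge stays inside one of the two groups, which is the kind of structure a community detection method such as Louvain would then pick up.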
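
The RaceID row of Table 1 describes a distance matrix obtained by subtracting the pairwise Pearson correlation matrix from 1. A minimal sketch of that step follows; the function names and expression profiles are hypothetical, each profile is assumed to be non-constant, and the k-means step that would run on this matrix is omitted.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def correlation_distance(cells):
    """Cell-by-cell distance matrix: 1 minus the Pearson correlation."""
    return [[1.0 - pearson(a, b) for b in cells] for a in cells]


# Hypothetical toy data: rows are cells, columns are genes.
cells = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],  # same profile shape as cell 0, scaled
    [4.0, 3.0, 2.0, 1.0],  # reversed profile
]
distance = correlation_distance(cells)
```

This distance ranges from 0 for perfectly correlated cells to 2 for perfectly anti-correlated cells, and it ignores differences in overall scale between profiles.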
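
The SC3 row of Table 1 describes a consensus matrix that summarizes how often each pair of cells is placed in the same cluster across several k-means runs. The sketch below computes such a matrix from precomputed label vectors; the function name and the labels are hypothetical, and the final complete-linkage hierarchical clustering step is not shown.

```python
def consensus_matrix(runs):
    """Fraction of clustering runs in which each pair of cells shares a cluster."""
    n_cells = len(runs[0])
    counts = [[0] * n_cells for _ in range(n_cells)]
    for labels in runs:
        for i in range(n_cells):
            for j in range(n_cells):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(runs) for c in row] for row in counts]


# Hypothetical labels from three clustering runs over four cells.
runs = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
consensus = consensus_matrix(runs)
```

Only co-membership is counted, so runs that use different label names or different numbers of clusters can be combined directly.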