Skip to main content
. 2021 Feb 1;22:39. doi: 10.1186/s12859-021-03957-4

Fig. 1.

Fig. 1

Subsampling-based robustness metrics identify near-optimal clustering parameter values. a Schematic of the clustering parameter selection problem. In the under-clustering case, clearly distinct biological differences remain unresolved. In the optimal clustering case, cells are appropriately grouped, while in the minor over-clustering case, some groups of cells are over-split. Either the optimal clustering or the minor over-clustering parameter regime is appropriate (red arrows), since the latter can be modified through cluster re-merging to obtain the optimal clustering. With severe over-clustering (rightmost panel), groups are over-split, and the dark green cluster shows evidence of “shattering”, in that these cells cannot be re-assigned to the proper clusters even with iterative re-merging of clusters. b Schematic of the chooseR approach. After 100 randomly drawn subsampled sets of cells are clustered, a cell–cell co-clustering frequency matrix is used to calculate mean per-cluster silhouette scores. This process is repeated for each parameter value of interest. c Under-, optimal, and over-clustering illustrated on a single-cell RNAseq dataset containing ~ 11,000 PBMCs. d Silhouette distribution plot for the dataset in c. Each dot represents a cluster at a given clustering parameter value. Medians with 95% CI are shown for each parameter value. The vertical red line marks the optimal resolution (in the Seurat package), and the blue line marks the decision threshold (“Methods” section). e Median silhouette scores for each cluster at the suggested optimal resolution = 2. Cell type families are indicated by colored boxes along the x-axis. This identifies which clusters are likely less robust at the optimal parameter value and should be investigated further in a more focused way