A new graph-based clustering method with application to single-cell RNA-seq data from human pancreatic islets

Hao Wu; Disheng Mao; Yuping Zhang; Zhiyi Chi; Michael Stitzel; Zhengqing Ouyang

doi:10.1093/nargab/lqaa087

NAR Genom Bioinform. 2021 Jan 12;3(1):lqaa087. doi: 10.1093/nargab/lqaa087

PMCID: PMC7803008 PMID: 33575647

Abstract

Traditional bulk RNA-sequencing of human pancreatic islets mainly reflects transcriptional response of major cell types. Single-cell RNA sequencing technology enables transcriptional characterization of individual cells, and thus makes it possible to detect cell types and subtypes. To tackle the heterogeneity of single-cell RNA-seq data, powerful and appropriate clustering is required to facilitate the discovery of cell types. In this paper, we propose a new clustering framework based on a graph-based model with various types of dissimilarity measures. We take the compositional nature of single-cell RNA-seq data into account and employ log-ratio transformations. The practical merit of the proposed method is demonstrated through the application to the centered log-ratio-transformed single-cell RNA-seq data for human pancreatic islets. The practical merit is also demonstrated through comparisons with existing single-cell clustering methods. The R-package for the proposed method can be found at https://github.com/Zhang-Data-Science-Research-Lab/LrSClust.

INTRODUCTION

Background on the biological problem

Human pancreatic islets consist of multiple types of cells, which play important roles in diabetes pathophysiology. Among them, beta (54%) and alpha (35%) cells are dominant. In bulk RNA-seq of human pancreatic islets, gene expression mainly reflects the information of these two cell types. Single-cell RNA-seq technology enables transcriptional characterization of individual cells, and thus facilitates cell-type discovery. In a single-cell experiment, individual cells are isolated, amplified and sequenced. In this process, the information on the identities of cells is commonly missing. Currently, researchers have to use clustering techniques to partition the data into several clusters and then infer the represented cell types from known marker genes. Therefore, in single-cell data analysis, the quality of clustering is crucial. In this paper, motivated by the problem of detecting cell types in human pancreatic islets, we propose a graph-based model to accomplish this clustering task.

Graph-based clustering methods

Graph-based models have been widely used in biological and biomedical research (1–9) to represent the relationships among objects. Graphs can also serve as a tool for single-cell clustering. In this scenario, cell relationships are represented by a similarity graph whose nodes correspond to cells and whose weighted edges reflect similarities among cells. For instance, PhenoGraph (3) takes an N × p single-cell gene expression matrix for N cells and p genes as its input, and utilizes Euclidean distance to find the k nearest neighbors (KNN) of each cell. The weight for an edge connecting two cells is calculated as the proportion of their shared neighbors over the union of all neighbors. Then, PhenoGraph employs the Louvain method to identify the graph communities representing the clusters of cells. SNN-cliq (2) is also a graph-based clustering method proposed for single-cell clustering. It first calculates the pairwise Euclidean distances of cells, connects a pair of cells with an edge if they share at least one common neighbor in KNN, and then defines the weight of the edge as the difference between k and the highest averaged ranking of the common KNN. SNN-cliq then employs a greedy algorithm to find a maximal quasi-clique associated with each node, and finally identifies clusters by iteratively combining significantly overlapping subgraphs starting with the quasi-cliques.

In this paper, we treat RNA-seq counts as compositional data. We first apply a centered log-ratio (clr) transformation. Then, we propose a graph-based clustering method for single-cell data with the following major components: (i) choosing an appropriate type of measure to represent the dissimilarity patterns of cells; (ii) transforming pairwise dissimilarities to similarities as the weights of edges connecting the corresponding cells; (iii) cutting the graph into disjoint sub-graphs representing clusters of cells. Each step can be accomplished by various methods depending on the dataset of interest. In this paper, we employ appropriate methods based on our motivating single-cell RNA-seq data for human pancreatic islets.

The rest of this paper is organized as follows. We first introduce the proposed method. We then apply our method to clr-transformed single-cell RNA-seq data for human pancreatic islets and compare it with other methods. Furthermore, we conduct a simulation study. Finally, we conclude our paper.

MATERIALS AND METHODS

Single-cell RNA-seq and compositional data analysis

RNA-seq data are compositional in nature. For all next-generation sequencing abundance data, a property cannot be ignored: the abundances for each sample are limited by its arbitrary total sum (the library size). Thus, to analyze RNA-seq data, effective library size normalization is usually employed before conventional data analysis. However, the assumptions of normalization methods are often untestable in reality. Compositional data measure each sample as a composition, a vector of non-zero positive values (i.e. components) carrying relative information (10). Treating RNA-seq as compositional data opens a new perspective on data analysis, which avoids normalization. Please refer to (11) for a comprehensive treatment on this subject.

In this paper, we apply the clr-transformation (11) to raw RNA-seq counts. Before applying the clr-transformation, we add one to each raw count, which avoids minus infinity when taking the natural logarithm. Next, for each cell vector, $\mathbf{x}_j = (x_{j1}, \cdots, x_{jp})$, we calculate its geometric mean, denoted by $g(\mathbf{x}_j) = \left(\prod_{l=1}^{p} x_{jl}\right)^{1/p}$. Then we perform the following clr-transformation for each sample j (10,11):

$$\mathrm{clr}(\mathbf{x}_j) = \left[\ln\frac{x_{j1}}{g(\mathbf{x}_j)}, \cdots, \ln\frac{x_{jp}}{g(\mathbf{x}_j)}\right], \tag{1}$$

where p is the total number of features.
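As a minimal sketch of the transformation just described (not the LrSClust implementation), the following Python function adds a pseudocount of one to a cell's raw counts and subtracts the log of the geometric mean. The function name `clr_transform` and the toy count vector are hypothetical and used only for illustration.

```python
import math

def clr_transform(counts, pseudocount=1.0):
    """Centered log-ratio transform of one cell's raw count vector."""
    # Shift the counts so that zeros do not produce log(0).
    shifted = [c + pseudocount for c in counts]
    logs = [math.log(v) for v in shifted]
    # The mean of the logs equals the log of the geometric mean.
    mean_log = sum(logs) / len(logs)
    return [lv - mean_log for lv in logs]

# Hypothetical raw counts for one cell across five genes.
cell = [0, 3, 12, 0, 150]
y = clr_transform(cell)
print([round(v, 3) for v in y])
```

By construction the transformed values of a cell sum to zero, and (without the pseudocount) rescaling all counts of a cell by a constant leaves the result unchanged, which is why no separate library-size normalization is applied.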

We then develop our graph-based clustering method based on the clr-transformed data.

New graph-based clustering framework

The graph-based clustering aims to use graphs to represent the patterns of similarities among cells and to obtain clusters by dropping the weak edges. First, we need to define an appropriate metric to evaluate the dissimilarity between two cells. While Euclidean distance (based on the L₂ norm) is commonly used to measure the dissimilarity between two objects, we found that it is not appropriate for our single-cell RNA-seq data from human pancreatic islets. Although single-cell RNA-seq data are high-dimensional, it is possible that only a small set of genes determines the underlying types of cells. Thus, we investigate more types of distances in our clustering framework, including the Manhattan distance (based on the L₁ norm) and the L_∞ distance. The L_∞ distance is defined as the maximum absolute deviation of two vectors across all coordinates. Namely, for two cells $\mathbf{x}_i$ and $\mathbf{x}_j$, suppose their clr-transformed transcriptomic profiling vectors are $\mathbf{y}_i = (y_{i1}, \cdots, y_{ip})$ and $\mathbf{y}_j = (y_{j1}, \cdots, y_{jp})$; the L_∞ measure is calculated as the maximum of |y_i1 − y_j1|, ⋅⋅⋅, |y_ip − y_jp|. For the convenience of the following theoretical investigation, we define the L_∞ dissimilarity as

$$d_\infty(\mathbf{x}_i, \mathbf{x}_j) = \max_{1 \le l \le p} (y_{il} - y_{jl})^2, \tag{2}$$

the Euclidean or L₂ dissimilarity as

$$d_2(\mathbf{x}_i, \mathbf{x}_j) = \sum_{l=1}^{p} (y_{il} - y_{jl})^2, \tag{3}$$

and the Manhattan or L₁ dissimilarity as

$$d_1(\mathbf{x}_i, \mathbf{x}_j) = \sum_{l=1}^{p} |y_{il} - y_{jl}|. \tag{4}$$
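The three measures can be sketched in a few lines of Python. This is an illustrative sketch only: it uses the plain norms named in the text (maximum absolute deviation, Euclidean and Manhattan distance), whereas the formal definitions used for the theory may differ by a monotone transformation such as squaring, which does not change which pairs of cells are closest. The helper name `pairwise` and the toy vectors are assumptions.

```python
import math

def l_inf(u, v):
    """Maximum absolute deviation across all coordinates."""
    return max(abs(a - b) for a, b in zip(u, v))

def l2(u, v):
    """Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def l1(u, v):
    """Manhattan distance."""
    return sum(abs(a - b) for a, b in zip(u, v))

def pairwise(cells, dist):
    """Symmetric matrix of dissimilarities between all pairs of cells."""
    n = len(cells)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = dist(cells[i], cells[j])
    return d

# One large difference in a single gene versus small differences in every gene.
base = [0.0, 0.0, 0.0, 0.0]
one_gene = [5.0, 0.0, 0.0, 0.0]
all_genes = [1.0, 1.0, 1.0, 1.0]
for name, f in (("L_inf", l_inf), ("L2", l2), ("L1", l1)):
    print(name, f(base, one_gene), f(base, all_genes))
```

In this toy case a change confined to one gene gives the same value (5) under all three measures, while a diffuse change of size 1 in every gene gives 1, 2 and 4 respectively, so L_∞ is the least sensitive to many small coordinate-wise differences.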

We want to quantify the performance of these different measures in clustering tasks. Intuitively, a good dissimilarity measure should be able to distinguish the ‘within-cluster’ dissimilarities and the ‘between-cluster’ dissimilarities. For well-separated clusters, we expect the within-cluster pairwise dissimilarity to be small and the between-cluster dissimilarity to be as large as possible. Based on this rationale, we propose to use the ratio of the average between-cluster dissimilarity to the average within-cluster dissimilarity to quantify the goodness of the corresponding distance measure. More formally, for objects $\mathbf{x}_1, \cdots, \mathbf{x}_n$, we denote $C_i$ as the i^th cluster, $i = 1, \cdots, k$. The dissimilarity score between two cells is denoted by $d(\mathbf{x}_i, \mathbf{x}_j)$, where $d(\mathbf{x}_i, \mathbf{x}_j)$ can be the L_∞ dissimilarity, the L₂ dissimilarity, or the L₁ dissimilarity. Then, we consider the ratio of the expected between and within dissimilarities, denoted by R_d, in the form of

$$R_d = \frac{E\left[d(\mathbf{x}_i, \mathbf{x}_j)\right]}{E\left[d(\mathbf{x}_i, \mathbf{x}_{i'})\right]}, \tag{5}$$

where $\mathbf{x}_i$ and $\mathbf{x}_j$ are from different clusters, and $\mathbf{x}_i$ and $\mathbf{x}_{i'}$ belong to the same cluster.
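An empirical version of this ratio can be sketched as follows. The simulation is a hypothetical toy setting chosen for illustration (two Gaussian clusters of 20 cells, 200 features, a shift of 6 in a single feature, a fixed seed); it is not one of the datasets analyzed in this paper, and the function names are assumptions.

```python
import math
import random

def l_inf(u, v):
    return max(abs(a - b) for a, b in zip(u, v))

def l2(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def l1(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def between_within_ratio(cells, labels, dist):
    """Average between-cluster dissimilarity over average within-cluster one."""
    between, within = [], []
    for i in range(len(cells)):
        for j in range(i + 1, len(cells)):
            d = dist(cells[i], cells[j])
            (within if labels[i] == labels[j] else between).append(d)
    return (sum(between) / len(between)) / (sum(within) / len(within))

def simulate(m=20, p=200, shift=6.0, seed=0):
    """Two Gaussian clusters that differ only in the first feature."""
    rng = random.Random(seed)
    cells, labels = [], []
    for label in (0, 1):
        for _ in range(m):
            row = [rng.gauss(0.0, 1.0) for _ in range(p)]
            if label == 0:
                row[0] += shift  # the single signal feature
            cells.append(row)
            labels.append(label)
    return cells, labels

cells, labels = simulate()
ratios = {name: between_within_ratio(cells, labels, f)
          for name, f in (("L_inf", l_inf), ("L2", l2), ("L1", l1))}
for name, value in ratios.items():
    print(name, round(value, 3))
```

When the signal sits in very few features among many noise features, as in this sketch, the L_∞ ratio is expected to be clearly above 1 while the L₂ and L₁ ratios stay close to 1, because the noise coordinates dominate the sums.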

We first provide some intuitive comparisons of the clustering effects when we choose $d(\mathbf{x}_i, \mathbf{x}_j)$ to be $d_2(\mathbf{x}_i, \mathbf{x}_j)$, $d_1(\mathbf{x}_i, \mathbf{x}_j)$ or $d_\infty(\mathbf{x}_i, \mathbf{x}_j)$ through a simple example. Assume there are two clusters, $C_1$ and $C_2$. We first consider $\mathbf{y}_i$ (i ∈ {1, ⋅⋅⋅, m}) and $\mathbf{y}_j$ (j ∈ {m + 1, ⋅⋅⋅, n}), which independently follow multivariate normal distributions with means $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$, and a common p × p covariance matrix $\Sigma$, where p is the number of features. We further assume that a subset of features separates the clusters. Without loss of generality, let $\boldsymbol{\mu}_1 = (a_1, \cdots, a_s, 0, \cdots, 0)^T$, where a_i ≠ 0, and let $\boldsymbol{\mu}_2$ be a p-dimensional zero vector. Naturally, the within-cluster difference, $\mathbf{y}_i - \mathbf{y}_{i'}$ (i′ ≠ i, i, i′ ∈ {1, ⋅⋅⋅, m}) or $\mathbf{y}_j - \mathbf{y}_{j'}$ (j′ ≠ j, j, j′ ∈ {m + 1, ⋅⋅⋅, n}), follows the multivariate normal distribution $N(\mathbf{0}, 2\Sigma)$. For the between-cluster difference, $\mathbf{y}_i - \mathbf{y}_j \sim N(\boldsymbol{\mu}_1, 2\Sigma)$.

When $d(\mathbf{x}_i, \mathbf{x}_j) = d_2(\mathbf{x}_i, \mathbf{x}_j)$ and $\Sigma = I_p$, R_d is denoted by $R_2$. If the total number of genes p is of the same order as $\sum_{i=1}^{s} a_i^2$, $R_2$ is of order 1. The reason is as follows. The numerator of $R_2$ can be decomposed into the variations coming from signals and noises. The first s coordinates of the between-cluster difference correspond to the signal source, having expectation $\sum_{i=1}^{s} a_i^2 + 2s$, and the rest of the coordinates are from the noise source with expectation 2(p − s). Therefore, the numerator of $R_2$ is $\sum_{i=1}^{s} a_i^2 + 2p$. The denominator of $R_2$ is 2p. The ratio is $1 + \sum_{i=1}^{s} a_i^2 / (2p)$. Thus, if the total number of genes p is of the same order as $\sum_{i=1}^{s} a_i^2$, then $R_2$ is of order 1, which means it is hard to separate the clusters. Under this setting, the critical size for $R_2$ is of the same order as $\sum_{i=1}^{s} a_i^2$. Similarly, the critical size of $R_1$ is of the same order as , i.e. .

When $d(\mathbf{x}_i, \mathbf{x}_j) = d_\infty(\mathbf{x}_i, \mathbf{x}_j)$ and , R_d is denoted by $R_\infty$. Let $z_l$ be a random sample of within-cluster difference determined by feature l, then $z_l \sim N(0, 1)$ (following the assumption of the simple example) and , where l = 1, ⋅⋅⋅, s. We have:

Let T_p be the maximum value of p $\chi^2_1$ random variables, then (T_p − d_p)/2 → G, where G is a Gumbel random variable, and d_p = 2(ln p − (1/2) ln(ln p) − ln Γ(1/2)) (12). Thus, the denominator of $R_\infty$ is of the same order as ln p. When p is of the same order as , $R_\infty$ is of order 1. In this setting, the critical size of $R_\infty$ is of the same order as . When the number of noise features increases, $R_\infty$ can have a lower signal contamination rate than $R_2$ and $R_1$.

Furthermore, dropping the aforementioned Gaussian assumption on $z_l$, let us consider a scenario where $z_l$ follows a distribution with sub-Gaussian tail. We have the following theorem:

Theorem 1. For i.i.d. random variables W₁, ⋅⋅⋅, W_p which have sub-Gaussian tail, i.e. P(|W_i| ≥ ) = O(exp ( − β^α)), where α > 0 and β > 0, then as p → ∞, .

Thus, the critical size for $R_\infty$ under this setting is of the same order as . This can be much larger than the critical sizes of $R_2$ and $R_1$, which are of the same order as and respectively.

In summary, the rationale indicates that the L_∞ measure can be a better choice compared to the Manhattan measure and the Euclidean measure in certain scenarios.

Given a set of cells $\mathbf{x}_1, \cdots, \mathbf{x}_n$, each represented by its clr-transformed transcriptomic values for p genes, and an appropriate metric to evaluate the pairwise dissimilarity of cells, we can build a graph, denoted by $G = (V, E)$, to represent the similarities among these cells, where $V$ is the set of cells $\{\mathbf{x}_1, \cdots, \mathbf{x}_n\}$ and $E$ is the set of edges with weight e_ij for the edge connecting $\mathbf{x}_i$ and $\mathbf{x}_j$ (i, j ∈ {1, ⋅⋅⋅, n}, i ≠ j). We determine the weights of edges using an entropy equalizer similarity measure (13). Specifically, if there is no edge between $\mathbf{x}_i$ and $\mathbf{x}_j$, we set the weight to be zero. We define the similarity between $\mathbf{x}_i$ and $\mathbf{x}_j$ (i ≠ j) as the normalized conditional probability p_j|i,

$$p_{j|i} = \frac{\exp\left(-d_{ij}^2 / (2\sigma_i^2)\right)}{\sum_{k \neq i} \exp\left(-d_{ik}^2 / (2\sigma_i^2)\right)}, \tag{6}$$

where d_ij is the dissimilarity between i and j, and $\sigma_i^2$ is the variance parameter for the Gaussian kernel. Please refer to Supplementary Algorithm 1 for the calculation of σ_i. In addition, we define p_i|i = 0 for i ∈ {1, ⋅⋅⋅, n}. Then, for any cell $\mathbf{x}_i$, the similarity measures between $\mathbf{x}_i$ and all other cells induce a probability distribution, i.e.

$$P_i = (p_{1|i}, \cdots, p_{n|i}), \qquad \sum_{j=1}^{n} p_{j|i} = 1. \tag{7}$$

The corresponding entropy has the form

$$H(P_i) = -\sum_{j=1}^{n} p_{j|i} \log_2 p_{j|i}. \tag{8}$$

The perplexity is defined as $2^{H(P_i)}$, which is a tuning parameter affecting cluster assignments. Intuitively, the perplexity can be interpreted as a smooth measure of the effective number of neighbors. A smaller perplexity encourages forming clusters with small sizes, while a larger perplexity yields larger clusters. Furthermore, the similarity defined in Equation (6) may be asymmetric: p_j|i is not necessarily equal to p_i|j. Thus, we set $e_{ij} = (p_{i|j} + p_{j|i})/2$.
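The construction of the edge weights can be sketched as below. This is a sketch under stated assumptions, not Supplementary Algorithm 1: it assumes a Gaussian kernel in the dissimilarity (as in stochastic neighbor embedding), finds each σ_i by bisection on a log scale so that the perplexity of cell i matches the target, and then symmetrizes. The function names, the search interval for σ_i and the one-dimensional toy points are hypothetical.

```python
import math

def conditional_probs(d_row, i, sigma):
    """p_{j|i} for all j, given cell i's dissimilarities and kernel width sigma."""
    d_min = min(d for j, d in enumerate(d_row) if j != i)
    # Subtracting d_min^2 leaves the normalized values unchanged but avoids
    # every weight underflowing to zero when sigma is very small.
    weights = [0.0 if j == i
               else math.exp(-(d * d - d_min * d_min) / (2.0 * sigma * sigma))
               for j, d in enumerate(d_row)]
    total = sum(weights)
    return [w / total for w in weights]

def perplexity(probs):
    """2 raised to the Shannon entropy (in bits) of the distribution."""
    entropy = -sum(q * math.log2(q) for q in probs if q > 0.0)
    return 2.0 ** entropy

def calibrate_sigma(d_row, i, target, lo=1e-3, hi=1e3, iters=60):
    """Bisection on a log scale; perplexity grows with sigma."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if perplexity(conditional_probs(d_row, i, mid)) > target:
            hi = mid
        else:
            lo = mid
    return math.sqrt(lo * hi)

def similarity_graph(dmat, target):
    """Symmetric edge weights e_ij = (p_{i|j} + p_{j|i}) / 2."""
    n = len(dmat)
    cond = [conditional_probs(dmat[i], i, calibrate_sigma(dmat[i], i, target))
            for i in range(n)]
    return [[(cond[i][j] + cond[j][i]) / 2.0 for j in range(n)] for i in range(n)]

# Toy data: two groups of points on a line.
points = [0.0, 1.0, 2.5, 10.0, 11.0, 12.5]
dmat = [[abs(a - b) for b in points] for a in points]
e = similarity_graph(dmat, target=2.0)
print([round(w, 4) for w in e[0]])
```

The target perplexity must lie strictly between 1 and n − 1 for the search to have a solution; in this toy example a target of 2 puts almost all of each cell's weight on neighbors within its own group.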

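With the weighted graph, we then employ an appropriate graph-cutting procedure to obtain clusters (sub-graphs). Specifically, we cut the graph $G$ into several sub-graphs, such that cells within the same sub-graph share more similarity with each other than with the other cells. We define a cluster as a subset of cells, $C_l \subset V$. All clusters form a partition of the whole set, $V = C_1 \cup \cdots \cup C_k$ with $C_l \cap C_m = \emptyset$ for $l \neq m$. In addition, a ‘cut’ means that for all clusters $C_l$, there is no edge between $C_l$ and its complement $\bar{C}_l$. Due to cutting, the ‘loss of similarity’ between two clusters is the summation of all pairwise original weights of the edges between these two clusters, denoted by

$$\mathrm{cut}(C_l, C_m) = \sum_{\mathbf{x}_i \in C_l,\ \mathbf{x}_j \in C_m} e_{ij}. \tag{9}$$

For our real data, we expect to see some relatively large and biologically meaningful clusters for future study. Thus, we adopt the RatioCut approach (14) defined as below

$$\mathrm{RatioCut}(C_1, \cdots, C_k) = \sum_{l=1}^{k} \frac{\mathrm{cut}(C_l, \bar{C}_l)}{|C_l|}, \tag{10}$$

where $|C_l|$ denotes the number of cells in a cluster, and $\bar{C}_l$ is the complement of $C_l$. The RatioCut optimization problem is equivalent to the following (15),

$$\min_{C_1, \cdots, C_k} \mathrm{Tr}\left(H^T (D - W) H\right) \quad \text{subject to } H^T H = I, \tag{11}$$

where $W = (e_{ij})$ is the matrix of edge weights, $D$ is a diagonal matrix, and its diagonal elements are $d_{ii} = \sum_{j=1}^{n} e_{ij}$, $i = 1, \cdots, n$, $H = (h_{ij}) \in \mathbb{R}^{n \times k}$, and

$$h_{ij} = \begin{cases} 1/\sqrt{|C_j|}, & \text{if } \mathbf{x}_i \in C_j, \\ 0, & \text{otherwise.} \end{cases} \tag{12}$$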
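The equivalence between the RatioCut value and the trace form can be checked numerically on a small graph. The sketch below is illustrative only: it computes the RatioCut as the sum over clusters of the cut weight divided by the cluster size (some references put a constant factor such as 1/2 in front, which does not change the minimizer), and the five-node graph and the function names are hypothetical.

```python
import math

def cut_weight(e, a, b):
    """Total edge weight between node sets a and b."""
    return sum(e[i][j] for i in a for j in b)

def ratio_cut(e, clusters):
    """Sum over clusters of (weight to the complement) / (cluster size)."""
    n = len(e)
    total = 0.0
    for c in clusters:
        inside = set(c)
        complement = [j for j in range(n) if j not in inside]
        total += cut_weight(e, c, complement) / len(c)
    return total

def trace_objective(e, clusters):
    """Tr(H^T L H) with L = D - W and H built from scaled cluster indicators."""
    n = len(e)
    lap = [[(sum(e[i]) if i == j else 0.0) - e[i][j] for j in range(n)]
           for i in range(n)]
    total = 0.0
    for c in clusters:
        h = [0.0] * n
        for i in c:
            h[i] = 1.0 / math.sqrt(len(c))
        total += sum(h[i] * lap[i][j] * h[j] for i in range(n) for j in range(n))
    return total

def toy_graph():
    """Five nodes: {0,1,2} and {3,4} are tight groups joined by weak edges."""
    groups = [0, 0, 0, 1, 1]
    return [[0.0 if i == j else (1.0 if groups[i] == groups[j] else 0.1)
             for j in range(5)] for i in range(5)]

e = toy_graph()
good = [[0, 1, 2], [3, 4]]
bad = [[0, 1], [2, 3, 4]]
print(ratio_cut(e, good), trace_objective(e, good))
print(ratio_cut(e, bad), trace_objective(e, bad))
```

Dividing by the cluster size penalizes partitions that split off very small clusters, which is consistent with the preference for relatively large clusters stated above.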

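This optimization problem is NP-hard. In practice, we employ a spectral clustering method to solve the relaxed optimization problem through a spectral decomposition. Denote $\mathbf{h}_j = (h_{1j}, \cdots, h_{nj})^T$ to be the indicator vector for cells belonging to $C_j$, $j = 1, \cdots, k$. Based on the construction of h_ij, $H = (\mathbf{h}_1, \cdots, \mathbf{h}_k)$ has orthonormal columns, $H^T H = I$. Instead of finding a cut, we can optimize the objective function by searching among all matrices with orthonormal columns. The relaxed problem becomes

$$\min_{H \in \mathbb{R}^{n \times k}} \mathrm{Tr}\left(H^T L H\right) \quad \text{subject to } H^T H = I. \tag{13}$$

By the Poincaré separation theorem,

$$\min_{H^T H = I} \mathrm{Tr}\left(H^T L H\right) = \sum_{i=1}^{k} \lambda_i, \tag{14}$$

where λ₁ ≤ ⋅⋅⋅ ≤ λ_n are the eigenvalues of the Laplacian matrix $L = D - W$, with $W = (e_{ij})$ the matrix of edge weights and $D$ the diagonal matrix of its row sums. Thus, the minimizer $H$ is the matrix which contains the first k eigenvectors as its columns.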
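For two clusters, the relaxed solution can be sketched without any linear-algebra library. The sketch below is a simplification, not the procedure of this paper: it approximates the eigenvector of the second-smallest Laplacian eigenvalue by power iteration on cI − L (with the constant vector projected out) and splits the cells by sign, whereas the general case uses the first k eigenvectors followed by a clustering of their rows. The toy graph, the iteration count and the function names are hypothetical.

```python
import math
import random

def fiedler_vector(e, iters=500, seed=0):
    """Approximate eigenvector of the second-smallest eigenvalue of L = D - W."""
    n = len(e)
    deg = [sum(row) for row in e]
    c = 2.0 * max(deg)  # Gershgorin bound on the largest eigenvalue of L
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        mean = sum(v) / n
        v = [x - mean for x in v]  # remove the constant eigenvector
        # Multiply by (c I - L): its dominant remaining eigenvector is the target.
        w = [(c - deg[i]) * v[i] + sum(e[i][j] * v[j] for j in range(n))
             for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def two_way_split(e):
    """Assign each node to one of two clusters by the sign of its entry."""
    return [0 if x < 0 else 1 for x in fiedler_vector(e)]

def toy_graph():
    """Five nodes: {0,1,2} and {3,4} are tight groups joined by weak edges."""
    groups = [0, 0, 0, 1, 1]
    return [[0.0 if i == j else (1.0 if groups[i] == groups[j] else 0.1)
             for j in range(5)] for i in range(5)]

labels = two_way_split(toy_graph())
print(labels)
```

Which of the two groups receives label 0 depends on the arbitrary sign of the eigenvector, so only the grouping itself is meaningful.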

To select the optimal number of clusters and perplexity, we maximize a Gap-statistic (16) type of objective function. Here, the total within-cluster dissimilarity for all clusters $C_1, \cdots, C_k$ under perplexity p is

$$W_k(p) = \sum_{r=1}^{k} \frac{1}{2 n_r} \sum_{\mathbf{x}_i, \mathbf{x}_j \in C_r} d_{ij}, \tag{15}$$

where n_r is the number of cells in the rth cluster. As the number of clusters increases, the total within-cluster dissimilarity tends to decrease. In the extreme case where all cells form their own clusters, the total within-cluster dissimilarity is zero. We extend the Gap statistic (16) to select the tuning parameters. More formally, the Gap-statistic type of criterion under perplexity p and the number of clusters k is defined as

$$\mathrm{Gap}(p, k) = E^{*}\left[\log W_k(p)\right] - \log W_k(p), \tag{16}$$

where $E^{*}[\log W_k(p)]$ is estimated from bootstrap samples, which are uniformly drawn from the range of the values for that feature (16). Given the number of replicates B in simulation and the standard error of the bootstrap replicates sd(p, k), the standard error of the Gap statistic can be computed as

$$s(p, k) = \mathrm{sd}(p, k)\sqrt{1 + 1/B}. \tag{17}$$

We then use the 1-standard-error rule to select the smallest number of clusters k and the largest perplexity p whose Gap-statistic value is no less than the largest Gap-statistic minus its one standard error.
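The selection rule can be sketched as follows. The order of preference (smallest k first, then the largest perplexity among candidates with that k) is one possible reading of the rule and is an assumption, as are the function names and the Gap values in the example, which are made up for illustration.

```python
import math

def gap_standard_error(sd, n_replicates):
    """Standard error of the Gap statistic from the sd of B reference replicates."""
    return sd * math.sqrt(1.0 + 1.0 / n_replicates)

def select_by_one_se(results):
    """results maps (perplexity, k) to (gap, standard_error)."""
    best_key = max(results, key=lambda key: results[key][0])
    best_gap, best_se = results[best_key]
    threshold = best_gap - best_se
    eligible = [key for key, (gap, _) in results.items() if gap >= threshold]
    # Prefer the smallest k; among those, prefer the largest perplexity.
    return min(eligible, key=lambda key: (key[1], -key[0]))

# Hypothetical Gap values on a small grid of (perplexity, k).
results = {
    (10, 2): (0.30, 0.05),
    (10, 3): (0.52, 0.04),
    (10, 4): (0.55, 0.04),
    (20, 3): (0.53, 0.05),
    (20, 4): (0.54, 0.05),
}
print(select_by_one_se(results))
```

In this example the largest Gap value is 0.55 with standard error 0.04, so every setting with a Gap of at least 0.51 is eligible, and the rule returns perplexity 20 with k = 3.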

RESULTS

Application to single-cell RNA-seq data from human pancreatic islets

We used existing single-cell RNA-seq data for 638 cells from nondiabetic (ND) and type 2 diabetes (T2D) human islet samples (17). Lawlor et al. (17) employed a Gaussian mixture model to classify the cells based on some known biomarkers, and only reported cell types for 617 (out of 638 in total) cells from T2D and ND islets. Specifically, these cell types include (the corresponding marker gene is shown in parentheses): beta cell (INS), alpha cell (GCG), delta cell (SST), PP/gamma cell (PPY), acinar cell (PRSS1), stellate cell (COL1A1) and ductal cell (KRT19). The remaining 21 (638–617) cells were not assigned to any cell type in (17).

In this paper, we applied the proposed clustering method to single-cell RNA-seq data from all 638 cells without knowing the marker genes. We downloaded the raw single-cell RNA-seq data from GEO (www.ncbi.nlm.nih.gov/geo/) under accession number GSE86469. To analyze this single-cell RNA-seq dataset, we first added one count to each entry of the data matrix to avoid zeros, then applied the clr-transformation. Then, we applied the proposed graph-based clustering framework (with L_∞, L₁ and L₂ norms as the corresponding dissimilarities, denoted by Linf-SClust, L1-SClust and L2-SClust, respectively) to the clr-transformed single-cell RNA-seq data with a total of 26,616 genes and 638 cells from human pancreatic islets. For comparison, we also applied the existing graph-based clustering methods for single-cell RNA-seq data, i.e. PhenoGraph and SNN-cliq (with default parameter settings), to this dataset. We also applied CIDR and SC3, which can report the final optimized number of clusters and performed well in the investigation in (18). Tables 1 and 2 summarize the final clustering results of Linf-SClust, L1-SClust, L2-SClust, SNN-cliq, PhenoGraph, CIDR and SC3. We also visualized their results using uniform manifold approximation and projection (UMAP) in Figure 1 and Supplementary Figure S1. For the Linf-SClust results, most cells in cluster 3 are stellate cells. Similarly, clusters 6, 7 and 8 represent acinar, PP/gamma and delta cells, respectively. Clusters 1 and 5 primarily consist of ductal and alpha cells, respectively. Both clusters 2 and 4 mainly contain beta cells. We found that 71.22% of the cells in cluster 2 are non-diabetic cells, while 93.54% of the cells in cluster 4 are T2D cells. Therefore, cluster 2 may represent the non-diabetic beta cell group, and cluster 4 may represent the T2D beta cell group. For the L1-SClust, L2-SClust, SNN-cliq, PhenoGraph, CIDR and SC3 results, the clusters are harder to interpret biologically.

Table 1.

Comparison of the inferred cluster assignments for the whole 638 cells in the human pancreatic islets dataset by Linf-SClust, L1-SClust, L2-SClust, SNN-cliq and Pheno-Graph, as well as the cluster configuration for the 617 cells based on the known gene markers reported in (17)

Cluster	Acinar	Alpha	Beta	Delta	Ductal	PP/Gamma	Stellate	Other
Linf-SClust
1	1	0	0	1	26	0	0	6
2	0	3	233	1	2	0	2	6
3	0	0	0	0	0	0	16	0
4	0	0	31	0	0	0	1	0
5	0	236	0	0	0	0	0	5
6	23	0	0	0	0	0	0	0
7	0	0	0	0	0	18	0	0
8	0	0	0	23	0	0	0	0
L1-SClust
1	6	58	51	6	7	2	2	12
2	0	0	65	2	1	11	0	0
3	0	55	32	4	0	2	0	2
4	4	108	61	13	7	8	7	6
5	14	0	0	0	10	0	9	1
6	0	13	17	0	3	1	1	0
7	0	1	4	2	0	0	0	0
L2-SClust
1	2	1	2	3	6	0	3	9
2	0	5	185	15	1	9	0	2
3	0	0	0	0	0	0	14	0
4	2	136	15	7	1	8	2	8
5	8	0	0	0	0	0	0	0
6	0	0	0	0	20	0	0	2
7	0	95	0	0	0	1	0	0
8	0	0	62	0	0	0	0	0
9	0	2	0	0	0	0	0	0
10	12	0	0	0	0	0	0	0
SNN-cliq
1	0	0	0	1	21	0	0	2
2	21	0	0	0	0	0	0	0
3	3	239	264	24	7	18	19	19
Pheno-Graph
1	0	0	147	0	0	3	0	1
2	0	1	101	1	2	12	1	5
3	0	83	0	0	0	0	0	0
4	0	73	0	0	0	0	0	0
5	3	2	0	22	3	1	17	8
6	2	2	16	2	6	2	1	7
7	0	31	0	0	0	0	0	0
8	0	27	0	0	0	0	0	0
9	0	20	0	0	0	0	0	0
10	18	0	0	0	0	0	0	0
11	1	0	0	0	17	0	0	0

‘Other’ indicates the 21 (638–617) cells that were not assigned to any cell type in (17).

Table 2.

Comparison of the inferred cluster assignments for the whole 638 cells in the human pancreatic islets dataset by CIDR and SC3, as well as the cluster configuration for the 617 cells based on the known gene markers reported in (17)

Cluster	Acinar	Alpha	Beta	Delta	Ductal	PP/Gamma	Stellate	Other
CIDR
1	2	2	1	2	5	0	0	9
2	0	0	99	1	0	1	0	1
3	0	2	148	3	0	0	0	0
4	0	0	0	1	0	0	18	5
5	0	84	13	5	0	6	0	0
6	1	151	3	12	2	11	0	3
7	21	0	0	1	21	0	1	3
SC3
1	0	213	0	0	0	0	0	1
2	0	2	0	0	0	0	0	3
3	0	10	0	0	0	0	0	0
4	0	2	156	0	0	0	0	1
5	0	0	5	0	1	0	0	0
6	0	0	16	0	0	0	0	0
7	0	0	78	0	0	0	0	0
8	0	0	0	1	25	0	0	5
9	22	3	3	1	0	0	19	4
10	0	0	0	22	1	17	0	0
11	2	9	6	1	1	1	0	7


‘Other’ indicates the 21 (638–617) cells that were not assigned to any cell type in (17).

The original clustering result reported in (17) is visualized in the bottom panel in Figure 1. To quantify the difference between the result from each method and the original result in (17), we computed a purity measure. Specifically, we first identified the most frequent class in each cluster reported by each compared method based on the original cell type assignment in (17). Then, we counted the number of consistently assigned cells by each method compared with the original cell type assignment result in (17). Then, we calculated the purity by dividing this count by the total number of cells (638). The purity for each compared method is shown in Table 4. One can see that Linf-SClust is the most consistent with the original cell type assignment result in (17), which was based on one known biomarker for each cell type. This finding is consistent with the methodological nature of Linf-SClust. We further illustrated this point in Table 3 by computing the between/within-cluster dissimilarity ratios calculated based on L_∞, L₁ and L₂ norms for the cells assigned to the seven types by the corresponding seven known biomarkers in (17). The L_∞ norm resulted in the largest average ratio (around 1.26) compared to the ratios based on the L₁ norm (around 1.06) and the L₂ norm (around 1.05). This is consistent with the performance of Linf-SClust, L1-SClust and L2-SClust applied to this real dataset.
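The purity computation described above can be sketched directly from a cluster-by-cell-type table. The example below uses the SNN-cliq rows of Table 1 (columns: acinar, alpha, beta, delta, ductal, PP/gamma, stellate, other) and treats ‘Other’ as one more class; the function name `purity` is a hypothetical helper.

```python
def purity(contingency):
    """Sum of each cluster's most frequent class count, divided by all cells."""
    matched = sum(max(row) for row in contingency)
    total = sum(sum(row) for row in contingency)
    return matched / total

# SNN-cliq rows of Table 1.
snn_cliq = [
    [0, 0, 0, 1, 21, 0, 0, 2],
    [21, 0, 0, 0, 0, 0, 0, 0],
    [3, 239, 264, 24, 7, 18, 19, 19],
]
print(round(purity(snn_cliq), 4))  # 0.4796
```

Here the most frequent classes contribute 21 + 21 + 264 = 306 of the 638 cells, which reproduces the SNN-cliq purity reported in Table 4.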

Table 4.

Purity of seven methods on the human pancreatic islets data containing eight cell types: beta cell (INS), alpha cell (GCG), delta cell (SST), PP/gamma cell (PPY), acinar cell (PRSS1), stellate cell (COL1A1), ductal cell (KRT19) and other cell

Method	Linf-SClust	L1-SClust	L2-SClust	Pheno-Graph
Purity	0.9467	0.5627	0.8511	0.8699
Method	SNN-cliq	CIDR	SC3
Purity	0.4796	0.8307	0.8856


Table 3.

Comparison of between/within-cluster dissimilarity ratios for the seven cell types in the human pancreatic islets dataset (17)

Genes	PRSS1	GCG	INS	SST	KRT19	PPY	COL1A1
Cells	acinar	alpha	beta	delta	ductal	gamma	stellate
L _∞	1.23	1.31	1.26	1.32	1.18	1.33	1.17
L ₁	1.10	1.06	1.02	1.05	1.04	1.04	1.09
L ₂	1.08	1.06	1.02	1.04	1.06	1.03	1.06


Biologically, we further investigated the genes that contributed to the clustering in Linf-SClust. Specifically, we calculated pairwise distances for all 638 cells based on the L_∞ norm. In total, we obtained $\binom{638}{2} = 203{,}203$ dissimilarities. For each obtained dissimilarity, we identified the gene that attained the maximum in L_∞. We summarized the frequencies of these ‘gene contributors’ and ranked them in Table 5. One can see that GCG, INS, SST and PPY are the top four ‘gene contributors’, PRSS1 is the eighth and COL1A1 is the 180th. The existing study in (17) only assigned 617 cells to certain known cell types based on these seven marker genes. There were 21 (638–617) cells without cell type information. Our Linf-SClust method and these analyses provide a potential direction to improve cell-type discovery for human pancreatic islets based on single-cell RNA-seq data. Furthermore, regarding running time, it took about 1.2 min to run the Linf-SClust method with a specified perplexity value and number of clusters on the single-cell RNA-seq dataset with 26,616 genes and 638 cells using a computer with one Intel64 processor.
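The ‘gene contributor’ tally can be sketched as below: for every pair of cells, find the coordinate that attains the L_∞ distance and count how often each gene does so. This is an illustrative sketch on made-up data; the gene labels, the toy profiles and the function name are hypothetical, and ties are resolved in favor of the first gene, which is an assumption.

```python
from collections import Counter
from itertools import combinations

def linf_contributors(cells, gene_names):
    """Count, over all cell pairs, which gene attains the maximum deviation."""
    counts = Counter()
    for u, v in combinations(cells, 2):
        diffs = [abs(a - b) for a, b in zip(u, v)]
        counts[gene_names[diffs.index(max(diffs))]] += 1
    return counts

# Hypothetical clr-transformed profiles for four cells and three genes.
genes = ["GENE_A", "GENE_B", "GENE_C"]
cells = [
    [5.0, 0.0, 0.0],
    [4.5, 0.2, 0.0],
    [0.0, 0.1, 3.0],
    [0.0, 0.0, 2.8],
]
counts = linf_contributors(cells, genes)
for gene, freq in counts.most_common():
    print(gene, freq, f"{100.0 * freq / sum(counts.values()):.2f}%")
```

The frequencies always sum to the number of cell pairs, n(n − 1)/2, which is 203,203 for the 638 cells analyzed here.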

Table 5.

Selected frequencies for known marker genes by Linf-SClust in the clustering of the total 638 cells, which include 617 cells based on the known gene markers reported in (17) and other 21 (638–617) cells that were not assigned to any cell type in (17)

	GeneIndex	GeneName	CellType	Frequency	Percent	Order
1	ENSG00000115263	GCG	Alpha	40340	19.85%	1
2	ENSG00000254647	INS	Beta	34360	16.91%	2
3	ENSG00000157005	SST	Delta	9469	4.66%	3
4	ENSG00000108849	PPY	PP/Gamma	9161	4.51%	4
5	ENSG00000204983	PRSS1	Acinar	5283	2.60%	8
6	ENSG00000108821	COL1A1	Stellate	92	0.05%	180


Simulations

We further investigated Linf-SClust, L1-SClust, L2-SClust and 13 other methods including CIDR, FlowSOM, monocle, PCAHC, PCAKmeans, pcaReduce, RaceID2, RtsneKmeans, SAFE, SC3, SC3svm, Seurat and TSCAN using 15 simulated single-cell RNA-seq datasets in (18). Table 6 provides an overview of these simulated datasets. These simulated datasets were generated based on certain real datasets using different methods (named as HVG10, Expr10 and M3Drop10), and provided true cell labels (18). We used four evaluation criteria, i.e. ARI (Adjusted Rand Index) (19), NMI (Normalized Mutual Information) (20), purity and classification error rate, to investigate the performances of the 16 methods applied to the 15 simulated datasets.
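Of the four criteria, the adjusted Rand index is the least obvious to compute by hand, so a minimal sketch is given here using the standard pair-counting formula. The function name and the toy labelings are hypothetical, and the degenerate case in which both labelings consist of a single cluster (zero denominator) is not handled.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(truth, pred):
    """Pair-counting ARI between a true and an inferred labeling."""
    n = len(truth)
    joint = Counter(zip(truth, pred))
    sum_joint = sum(comb(c, 2) for c in joint.values())
    sum_truth = sum(comb(c, 2) for c in Counter(truth).values())
    sum_pred = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_truth * sum_pred / comb(n, 2)
    maximum = (sum_truth + sum_pred) / 2.0
    return (sum_joint - expected) / (maximum - expected)

truth = [0, 0, 1, 1]
print(adjusted_rand_index(truth, [1, 1, 0, 0]))  # same partition, renamed labels
print(adjusted_rand_index(truth, [0, 1, 0, 1]))  # partition that splits every true pair
```

The index equals 1 when the two partitions agree up to a renaming of the cluster labels and is close to 0, or negative, when the agreement is no better than expected by chance.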

Table 6.

A simple introduction to 15 simulation datasets

Dataset	Cells	Genes	TrueClass
Koh_HVG10	531	4898	9
Koh_Expr10	531	4898	9
Koh_M3Drop10	531	4898	9
Kumar_HVG10	246	4515	3
Kumar_Expr10	246	4515	3
Kumar_M3Drop10	246	4515	3
Zhengmix4eq_HVG10	3300	1557	4
Zhengmix4eq_Expr10	3555	1556	4
Zhengmix4eq_M3Drop10	3430	1557	4
Zhengmix4uneq_HVG10	5079	1644	4
Zhengmix4uneq_Expr10	6414	1644	4
Zhengmix4uneq_M3Drop10	3830	1644	4
Zhengmix8eq_HVG10	3798	1572	8
Zhengmix8eq_Expr10	3971	1571	8
Zhengmix8eq_M3Drop10	2662	1572	8


Koh, Kumar, Zhengmix4eq, Zhengmix4uneq and Zhengmix8eq are the names of five real datasets, and HVG10, Expr10 and M3Drop10 are three methods for filtering genes from the real datasets (18). ‘TrueClass’ is the number of true clusters in each simulated dataset.

We first investigated the effects of perplexity on the performances of Linf-SClust, L1-SClust and L2-SClust applied to the simulated datasets. Supplementary Figures S2–4 show the effects of perplexity on the four evaluation criteria when applying Linf-SClust, L1-SClust and L2-SClust to the 15 simulated datasets with various perplexity values. One can see that Linf-SClust has the best overall performance on datasets Zhengmix4eq, Zhengmix4uneq and Zhengmix8eq with various perplexity values. L2-SClust has the best overall performance on dataset Koh. L1-SClust and L2-SClust have good overall performances on dataset Kumar. These findings suggest that single-cell RNA-seq clustering methods should offer options for different types of distances to facilitate applications to diverse biological contexts.

We then compared Linf-SClust, L1-SClust and L2-SClust with the 13 clustering methods including CIDR, FlowSOM, monocle, PCAHC, PCAKmeans, pcaReduce, RaceID2, RtsneKmeans, SAFE, SC3, SC3svm, Seurat and TSCAN using the simulated datasets. We used the perplexity values that yielded the best performances for Linf-SClust, L1-SClust and L2-SClust in this study. Figures 2–4 show the comparison results for the 16 clustering methods. To summarize the comparisons, at least one of the proposed methods (i.e. Linf-SClust, L1-SClust and L2-SClust) was among the top five methods with good performances. In particular, L1-SClust ranks first on dataset Kumar_HVG10, Linf-SClust ranks second on dataset Zhengmix8eq_Expr10, and L2-SClust ranks second on dataset Kumar_M3Drop10. These findings suggest that graph-based spectral clustering techniques can be helpful for single-cell RNA-seq clustering problems.

Figure 2. — Method comparisons on five simulation datasets obtained via gene-filtering method ‘HVG10’.

Figure 3. — Method comparisons on five simulation datasets obtained via gene-filtering method ‘Expr10’.

Figure 4. — Method comparisons on five simulation datasets obtained via gene-filtering method ‘M3Drop10’.

CONCLUSION

We developed a new graph-based single-cell clustering framework. Under this framework, we investigated the choices on different measures (i.e. L_∞, L₁ and L₂) used for dissimilarity characterization on clr-transformed single-cell RNA-seq data. We theoretically investigated the effects of L_∞, L₁ and L₂ measures used for dissimilarity calculations on clustering. We applied the proposed methods to the clr-transformed single-cell RNA-seq data from human pancreatic islets. We found that the Linf-SClust method is suitable for this dataset, which provides biologically meaningful insights. We also compared the proposed methods with existing single-cell clustering methods through real data application and simulations. These analyses suggest the proposed methods are valuable additions to single-cell clustering methods.

ACKNOWLEDGEMENTS

The authors acknowledge the comments and suggestions from anonymous reviewers.

Contributor Information

Hao Wu, Department of Statistics, University of Connecticut, 215 Glenbrook Rd., Storrs, CT 06269, USA.

Disheng Mao, Department of Statistics, University of Connecticut, 215 Glenbrook Rd., Storrs, CT 06269, USA.

Yuping Zhang, Department of Statistics, University of Connecticut, 215 Glenbrook Rd., Storrs, CT 06269, USA.

Zhiyi Chi, Department of Statistics, University of Connecticut, 215 Glenbrook Rd., Storrs, CT 06269, USA.

Michael Stitzel, The Jackson Laboratory for Genomic Medicine, 10 Discovery Drive, Farmington, CT 06032, USA.

Zhengqing Ouyang, Department of Biostatistics and Epidemiology, School of Public Health and Health Sciences, University of Massachusetts, 715 North Pleasant Street, Amherst, MA 01003, USA.

SUPPLEMENTARY DATA

Supplementary Data are available at NARGAB Online.

FUNDING

Faculty Research Excellence Program Award from the University of Connecticut (to Y.Z.).

Conflict of interest statement. None declared.

REFERENCES

1. Macosko E.Z., Basu A., Satija R., Nemesh J., Shekhar K., Goldman M., Tirosh I., Bialas A.R., Kamitaki N., Martersteck E.M. et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell. 2015; 161:1202–1214.
2. Xu C., Su Z. Identification of cell types from single-cell transcriptomes using a novel clustering method. Bioinformatics. 2015; 31:1974–1980.
3. Levine J.H., Simonds E.F., Bendall S.C., Davis K.L., Amir E.A.D., Tadmor M.D., Litvin O., Fienberg H.G., Jager A., Zunder E.R. et al. Data-driven phenotypic dissection of AML reveals progenitor-like cells that correlate with prognosis. Cell. 2015; 162:184–197.
4. Esteva A., Kuprel B., Novoa R.A., Ko J., Swetter S.M., Blau H.M., Thrun S. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017; 542:115–118.
5. Zhang Y., Ouyang Z., Zhao H. A statistical framework for data integration through graphical models with application to cancer genomics. Ann. Appl. Stat. 2017; 11:161–184.
6. Liu Q., Zhang Y. Joint estimation of heterogeneous exponential Markov Random Fields through an approximate likelihood inference. J. Stat. Plan. Infer. 2020; 209:252–266.
7. Linder H., Zhang Y. Iterative integrated imputation for missing data and pathway models with applications to breast cancer subtypes. Commun. Stat. Appl. Methods. 2019; 26:411–430.
8. Linder H., Zhang Y. A pan-cancer integrative pathway analysis of multi-omics data. Quant. Biol. 2020; 8:1–13.
9. Zhang Y., Qian M., Ouyang Q., Deng M., Li F., Tang C. Stochastic model of yeast cell-cycle network. Physica D. 2006; 219:35–39.
10. Aitchison J. The statistical analysis of compositional data. J. R. Stat. Soc. B. 1982; 44:139–160.
11. Quinn T.P., Erb I., Richardson M.F., Crowley T.M. Understanding sequencing data as compositions: an outlook and review. Bioinformatics. 2018; 34:2870–2878.
12. Embrechts P., Klüppelberg C., Mikosch T. Modelling Extremal Events: for Insurance and Finance. 2013; Vol. 33, Heidelberg, Germany: Springer-Verlag.
13. Hinton G.E., Roweis S.T. Stochastic neighbor embedding. In: Becker S., Thrun S., Obermayer K. (eds). Advances in Neural Information Processing Systems. 2003; Cambridge, MA, USA: MIT Press; 857–864.
14. Hagen L., Kahng A.B. New spectral methods for ratio cut partitioning and clustering. IEEE T. Comput. Aid. D. 1992; 11:1074–1085.
15. Von Luxburg U. A tutorial on spectral clustering. Stat. Comput. 2007; 17:395–416.
16. Tibshirani R., Walther G., Hastie T. Estimating the number of clusters in a data set via the gap statistic. J. R. Stat. Soc. B Stat. Methodol. 2001; 63:411–423.
17. Lawlor N., George J., Bolisetty M., Kursawe R., Sun L., Sivakamasundari V., Kycia I., Robson P., Stitzel M.L. Single-cell transcriptomes identify human islet cell signatures and reveal cell-type–specific expression changes in type 2 diabetes. Genome Res. 2017; 27:208–222.
18. Duò A., Robinson M.D., Soneson C. A systematic performance evaluation of clustering methods for single-cell RNA-seq data. F1000Res. 2018; 1141–1163.
19. Yeung K.Y., Ruzzo W.L. Details of the adjusted Rand index and clustering algorithms, supplement to the paper an empirical study on principal component analysis for clustering gene expression data. Bioinformatics. 2001; 17:763–774.
20. Knops Z.F., Maintz J.A., Viergever M.A., Pluim J.P. Normalized mutual information based registration using k-means clustering and shading correction. Med. Image. Anal. 2006; 10:432–439.


