Skip to main content
Bioinformatics logoLink to Bioinformatics
. 2017 Aug 2;34(1):139–146. doi: 10.1093/bioinformatics/btx490

DIMM-SC: a Dirichlet mixture model for clustering droplet-based single cell transcriptomic data

Zhe Sun 1, Ting Wang 2, Ke Deng 3, Xiao-Feng Wang 4, Robert Lafyatis 5, Ying Ding 1, Ming Hu 4,, Wei Chen 1,2,
Editor: Cenk Sahinalp
PMCID: PMC6454475  PMID: 29036318

Abstract

Motivation

Single cell transcriptome sequencing (scRNA-Seq) has become a revolutionary tool to study cellular and molecular processes at single cell resolution. Among existing technologies, the recently developed droplet-based platform enables efficient parallel processing of thousands of single cells with direct counting of transcript copies using Unique Molecular Identifier (UMI). Despite the technology advances, statistical methods and computational tools are still lacking for analyzing droplet-based scRNA-Seq data. Particularly, model-based approaches for clustering large-scale single cell transcriptomic data are still under-explored.

Results

We developed DIMM-SC, a Dirichlet Mixture Model for clustering droplet-based Single Cell transcriptomic data. This approach explicitly models UMI count data from scRNA-Seq experiments and characterizes variations across different cell clusters via a Dirichlet mixture prior. We performed comprehensive simulations to evaluate DIMM-SC and compared it with existing clustering methods such as K-means, CellTree and Seurat. In addition, we analyzed public scRNA-Seq datasets with known cluster labels and in-house scRNA-Seq datasets from a study of systemic sclerosis with prior biological knowledge to benchmark and validate DIMM-SC. Both simulation studies and real data applications demonstrated that overall, DIMM-SC achieves substantially improved clustering accuracy and much lower clustering variability compared to other existing clustering methods. More importantly, as a model-based approach, DIMM-SC is able to quantify the clustering uncertainty for each single cell, facilitating rigorous statistical inference and biological interpretations, which are typically unavailable from existing clustering methods.

Availability and implementation

DIMM-SC has been implemented in a user-friendly R package with a detailed tutorial available on www.pitt.edu/∼wec47/singlecell.html.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Single cell RNA sequencing (scRNA-Seq) technologies have advanced rapidly in recent years (Gawad et al., 2016). Among them, the newly developed droplet-based technologies have generated great interests (Macosko et al., 2015; Zheng et al., 2017). They are able to measure the transcriptome of thousands of single cells simultaneously in a short time period and at a relatively low cost (Macosko et al., 2015; Zheng et al., 2017). More attractively, droplet-based technologies utilize Unique Molecular Identifier (UMI) to annotate the 3′ end of each transcript in order to reduce PCR amplification bias, increase transcript capture efficiency and substantially minimize batch effect (Islam et al., 2014). More recently, 10× Genomics has released a commercialized droplet-based Chromium system, which is efficient and cost-effective in isolating thousands of single cells with average running time of ten minutes based on the Gel bead in Emulsion (GEM) technology. They used this platform to comprehensively characterize and profile peripheral blood mononuclear cells (PBMC) (Zheng et al., 2017). Harnessing the power of these exciting new technologies, droplet-based scRNA-Seq has brought revolutionary insights to understand cellular and molecular processes at single cell resolution.

One important question in the analysis of scRNA-Seq data is to identify and characterize cell subtypes from heterogeneous tissues, which is essential to fully understand cell identity and cell function. Clustering methods have been extensively studied for many areas in the past decades. For example, unsupervised clustering methods such as K-means clustering, hierarchical clustering, and Adaptive Density Peak (ADP) clustering (Rodriguez and Laio, 2014; Wang and Xu, 2015), can be applied to droplet-based scRNA-Seq data after certain data transformation. In addition, tailored methods such as CellTree and Seurat have been proposed to analyze scRNA-Seq data with the motivation from early generation platforms (duVerle et al., 2016; Jaitin et al., 2014). However, clustering methods tailored to droplet-based scRNA-Seq data are largely lagging behind. Although existing clustering methods can be adapted, there are at least three key limitations of using those methods to cluster droplet-based scRNA-Seq data. First of all, most existing methods are developed for continuous data [e.g. Fragments Per Kilobase of transcript per Million (FPKM) or log-transformed count data] while droplet-based scRNA-Seq data consist of the discrete count of the unique UMIs, which are direct measurements of transcript copies from each gene. Converting UMI counts into continuous measure will alter the straightforward interpretation of UMI, thus it is more appealing and reasonable to directly model the count data. Second, most existing methods are designed for the early generation of scRNA-Seq technologies that measure transcriptome across a relatively small number of single cells. It is unclear how these methods can be scaled up to cluster droplet-based scRNA-Seq data, which usually contain thousands of single cells. Last but not the least, most existing methods only provide a ‘hard’ cluster membership for each cell without statistical uncertainty quantification. In order to conduct rigorous statistical inference and achieve reliable data interpretation, different sources of uncertainties in droplet-based scRNA-Seq data need to be explicitly taken into consideration in the clustering analysis.

To fill in these gaps, we proposed DIMM-SC, a Dirichlet mixture model for clustering droplet-based scRNA-Seq data. DIMM-SC explicitly models both the within-cluster and between-cluster variability of the UMI count data, leading to rigorous quantification of clustering uncertainty for each single cell. We also implemented an efficient expectation-maximization (E-M) algorithm (Dempster et al., 1977) for fast convergence. Furthermore, we proposed different strategies for initial value selection to ensure algorithm robustness. In the following sections, we first introduce the unique features of droplet-based scRNA-Seq data, as well as the details of the DIMM-SC method. Next, we compare the performance of DIMM-SC with three popular clustering methods, including K-means clustering, CellTree and Seurat, in both simulation studies and real data applications. K-means clustering is one of the most popular clustering methods and has been used in the first 10× Genomics publication (Zheng et al., 2017). CellTree has been recently developed to cluster scRNA-Seq data based on Latent Dirichlet Allocation (LDA) (duVerle et al., 2016). Seurat is a deterministic approach which relies on a graph-based clustering approach (Satija et al., 2015).

2 Materials and methods

2.1 Data description

The droplet-based scRNA-Seq data can be summarized into a UMI count matrix (Table 1), in which each row represents one gene and each column represents one single cell. Each entry in the UMI count matrix is the number of transcripts (unique UMIs) for one gene in one single cell. Compared to the data generated from early generation of scRNA-Seq technologies, droplet-based scRNA-Seq data have three important features (Gawad et al., 2016; Stegle et al., 2015; Zheng et al., 2017). First, each experiment can generate thousands of cells, which dramatically increase the data dimension and computational burden. Second, the use of UMI can reduce PCR amplification bias and quantify the copies of captured molecules. Droplet-based sequencing protocol amplifies the 3′ end of the transcript, so the number of UMI is independent of the total transcript length. The normalization method used in RPKM and FPKM, which adjusts for total transcript length, is invalid for analyzing droplet-based scRNA-Seq data. Therefore, the raw count data should be directly modeled to retain their biological interpretations. Third, the UMI count matrix is extremely sparse, and thus violates the statistical assumption of many existing clustering methods. Supplementary Figure S1 lists the empirical distribution of the UMI counts for a few representative genes, demonstrating the non-ignorable proportion of zeroes for different levels of expression. Pre-selection of informative single cells and informative genes are necessary before the downstream clustering analysis. After clustering analysis, the results are usually visualized by a t-distributed stochastic neighbor embedding (t-SNE) approach (van der Maaten and Hinton, 2008), which embeds high-dimensional transcriptome data into a two-dimensional scatter plot. Note that t-SNE is a visualization tool, and it is not intended to be used for clustering scRNA-Seq data.

Table 1.

An example of the raw UMI count table from droplet-based scRNA-Seq data

Cell 1 Cell 2 Cell 3 Cell 2000
Gene1 0 0 0 0
Gene2 1 0 1 0
Gene3 23 12 9 3
Gene 10 000 22 6 7 9 3

2.2 Statistical model

We start with a matrix X, of which the element Xij represents the number of unique UMIs for gene i in cell j where i runs from 1 to the total number of genes G, and j runs from 1 to the total number of cells C (as showed in Table 1). Xij is the count for the absolute number of transcripts. We denote the j th column of this matrix, which gives the number of unique UMIs in the j th single cell, by a vector xj=x1j,x2j,, xGj, where j=1,,C. We assume that xj is generated from a multinomial distribution with parameter vector pj=(p1j,p2j,,pGj). The element of pj, pij, is the probability that a unique UMI count taken from cell j belongs to gene i. This gives a likelihood for each cell:

Pxj|pj=Tj!i=1Gxij!p1jx1jp2jx2jpGjxGj,

where Tj=ixij is the total number of unique UMIs for the j th cell. The joint likelihood of all C cells is the product of the likelihood for each cell: j=1CPxj|pj.

2.2.1 Dirichlet mixture priors

In a Bayesian framework, we need to define a prior distribution for the multinomial parameter probability vector pj. For multinomial distribution, a commonly used conjugate prior is the Dirichlet distribution. Specifically, we assume that the proportion pj=(p1j,p2j,,pGj) follows a Dirichlet prior distribution Dirα=Dir(α1,α2,,αG):

Ppj|α=1B(α)p1jα1-1p2jα2-1pGjαG-1,

where B(α) is Beta function with parameter α=(α1,α2,,αG). All the elements in α are strictly positive (αi>0). The mean and variance of pij are αi/α and αiα-αi/(α2α+1), respectively, where α=α1+α2++αG. A large α gives small variance about the proportions pj, while a small α leads to widely spread pj’s. When the cell population is homogeneous, we assume that pj’s all follow the same prior distribution Dir(α), and the full likelihood function is as follows:

Pxjα)=Pxj|pjPpj|αdpj=Tj!i=1Gxij!i=1GΓxij+αiΓαiΓαΓTj+α.

We then assume that the cell population consists of K distinct cell types, where K can be pre-defined according to prior biological knowledge or can be estimated through model fitting. To provide a more flexible modeling framework and allow for unsupervised clustering, we extend the aforementioned single Dirichlet prior to a mixture of K Dirichlet distributions, indexed with k=1,,K, each with parameter α(k). If cell j belongs to the k th cell type, its gene expression profile pj follows a cell-type-specific prior distribution Dir(α(k)). The full likelihood function is then obtained by multiplying the Dirichlet mixture prior by the multinomial likelihood.

2.2.2 E-M algorithm for fitting the mixture of Dirichlet prior

We now use a latent variable vector Z with elements zj to represent the cell type label for the cell j. This allows us to maximize the log posterior distribution using the E-M algorithm (Dempster, 1977). We have

Pxjzj=k,α(k))i=1GΓxij+αikΓαikΓαkΓTj+αk,

and Pzj=k=πk, where πk is the proportion of the k th cell type among all cells. We can treat zj as missing data, and use the E-M algorithm to estimate α1k,α2k,,αGk and πk. The complete log likelihood is

logj=1CP(xj,zj=k)=j=1CI(zj=k)logi=1GΓxij+αikΓαikΓ(|α(k)|)Γ(Tj+|α(k)|)}.

The formula for updating α1k,α2k,,αGk is derived from the Minka’s fixed-point iteration for the leaving-one-out likelihood (https://tminka.github.io/papers/dirichlet/minka-dirichlet.pdf):

α^ik(t+1)=αik(t)j=1Cδjk{xij/(xij-1+αik(t))}j=1Cδjk{Tj/(Tj-1+|αk(t)|)}.

We repeat the above steps until the convergence of log likelihood or a maximum number of iterations is reached (see detailed algorithm in supplemental materials).

2.2.3 Selection of the number of clusters and initial values

To implement DIMM-SC, it is critical to select the total number of clusters and the initial values for the E-M algorithm. Specially, the number of clusters K can be defined with prior knowledge or can be selected from model selection criteria such as AIC or BIC (Akaike, 1974; Schwarz, 1978). Meanwhile, there are many methods to determine the initial values of α1,α2,,αG in the E-M algorithm for fitting the Dirichlet mixture model. For example, Ronning (1989) suggests to estimate i=1Gαi by

logi=1Gαi=1G-1i=1G-1logEpi1-EpiVar pi-1,

where E(pi) can be approximated by (j=1Cxij/Tj)(C) (Ronning, 1989). An alternative approach is to estimate the initial values using a method of moment estimates proposed by Weir and Hill (2002). In this paper, we applied the K-means clustering to obtain the initial clustering results, and then used either the Ronning’s method or the Weir and Hill’s method to obtain the initial estimates of Dirichlet parameter α.

2.3 Simulation studies

We performed comprehensive simulation studies to compare DIMM-SC with three existing clustering methods, including K-means clustering, Seurat and CellTree. The first two are deterministic approaches and the third one is a probabilistic approach.

In the simulation set-up, the UMI count matrix was sampled from the proposed Dirichlet mixture model. Specially, for a fixed total number of cell clusters K, we first pre-defined the values of α(k)=α1(k),α2(k),,αG(k) for the k th cell cluster, and then sampled the proportion pj=(p1j,p2j,,pGj) from a Dirichlet distribution Dirα(k). Next, we sampled the UMI count vector xj for the j th cell from the multinomial distribution Multinomial(Tj,pj). We fixed Tj as a constant across all cells.

In the simulation studies, we considered the following seven clustering methods. (i) DIMM-SC + K-means + Ronning (hereafter referred as DIMM-SC-KR), in which we used the K-means clustering to obtain the initial values of clustering labels and then used the Ronning’s method to estimate initial values of α; (ii) DIMM-SC + K-means + Weir (hereafter referred as DIMM-SC-KW), in which we used the K-means clustering to obtain the initial values of clustering labels and used the Weir and Hill’s method to estimate initial values of α; (iii) DIMM-SC + random + Ronning (hereafter referred as DIMM-SC-RR), in which we randomly selected the initial values of clustering labels and used the Ronning’s method to estimate initial values of α; (iv) DIMM-SC + random + Weir (hereafter referred as DIMM-SC-RW), in which we randomly selected the initial values of clustering labels and used the Weir and Hill’s method to estimate initial values of α; (v) K-means clustering; (vi) CellTree, a LDA-based approach to cluster scRNA-Seq data; and (vii) Seurat. To perform the simulation analysis using Seurat, we followed the tutorial instructions from the Seurat website (http://satijalab.org/seurat/) and used all genes as input to perform Principal Component Analysis (PCA). After that, we followed the ‘jackstraw’ procedure implemented in Seurat, and identified first ten PCs for their downstream algorithm. We fixed the number of PCs in all the simulation runs under each scenario. Since Seurat requires users to self-specify a resolution parameter with increased values leading to a greater number of clusters, the clustering results are very sensitive to this resolution parameter. Seurat suggests that setting this resolution parameter between 0.6 and 1.2 typically returns good results for datasets of around 3000 cells, so we ran Seurat using resolution parameter with 0.6, 0.8, 1.0 and 1.2, and chose the one with the highest adjusted rand index (ARI) value in each simulation setting. Note that ARI is a commonly used metric of the similarity between the estimated clustering labels and the true clustering labels (Rand, 1971).

We used the signal-to-noise ratio (SNR) to measure the magnitude of difference among different cell clusters. When K=2, SNR is defined as:

SNR=α(1)-α(2)1GVarα(1)+Var(α(2)),

where |.|1 is the L1 norm of a vector. We performed comprehensive simulations to investigate how different SNRs, different sequencing depths, different total numbers of cells/genes/clusters, and different proportions of noisy genes affect the clustering results. To evaluate the performance of DIMM-SC and other competing clustering methods, we used the following two metrics: (i) Clustering accuracy measured by ARI and (ii) Stability (the standard deviation of ARI). We expect a good clustering method should achieve both high accuracy and high stability.

3 Results

3.1 Simulation studies

Figure 1A shows the boxplots of ARI for seven clustering methods across 100 simulations at different SNRs. Four DIMM-SC based methods (KR, KW, RR, RW) achieved comparable performance, which produced higher accuracy and lower variability than K-means clustering, Seurat and CellTree. When SNR is high (i.e. substantial differences among cell clusters), all seven methods performed well. However, when SNR is low (i.e. different cell clusters are similar), K-means clustering, Seurat and CellTree produced less accurate and more variable clustering results, while four DIMM-SC based methods still performed well.

Fig. 1.

Fig. 1.

Boxplots of ARI for seven clustering methods across 100 simulations, investigating how different SNRs (A), number of clusters (B), number of genes (C), number of cells (D), sequencing depth (E) and the number of informative genes (F) affect clustering results

Figure 1B shows the boxplots of ARI for seven clustering methods across 100 simulations, when the total number of clusters is 2, 3, 4, 5 and 8, respectively. The four DIMM-SC based methods, especially the two methods with randomly selected initial cluster labels (RR and RW), achieved better clustering accuracy (i.e. higher ARI) with more number of clusters. K-means clustering has high variability for more clusters, since it is a deterministic procedure and is more likely to end at a local optimum when the total number of clusters increases. CellTree performed worse for more clusters, partially due to the over-parameterized LDA model and lack of fit to highly heterogeneous data. Seurat was run under different default recommended parameters and the performance varies with different parameters.

Figure 1C–F list the boxplots of ARI for seven clustering methods across 100 simulations, for different number of genes (Fig. 1C), different number of cells (Fig. 1D), different sequencing depths (Fig. 1E) and different number of informative genes (Fig. 1F) (i.e. differentially expressed genes among clusters), respectively. Consistent across all these four scenarios, more information (i.e. more genes, more cells, higher sequencing depths and more informative genes) lead to higher clustering accuracy and lower clustering variability. Four DIMM-SC based clustering methods consistently outperformed K-means clustering, Seurat and CellTree in all these simulation settings, suggesting the advantage of DIMM-SC.

3.2 Real data analysis: the publicly available 10× Genomics scRNA-seq data

3.2.1 In silicon studies based on purified cell types from published scRNA-seq data

To illustrate the application of DIMM-SC to real datasets, we first benchmarked our method against pre-defined measures in capturing true cell-to-cell similarities on published single-cell datasets. 10× Genomics has made eleven datasets from purified cell types available to public (Zheng et al., 2017). Among which, over 10 000 cells were detected in most experiments. Here we considered two scenarios: (i) a simple case with cells from three highly distinct cell types (CD56+ NK cells, CD19+ B cells and CD4+/CD25+ regulatory T cells); (ii) a challenging case with cells from three similar cell types (CD8+/CD45RA+ naive cytotoxic T cells, CD4+/CD25+ regulatory T cells and CD4+/CD45RA+/CD25- naive T cells) (Table 2). For visualization, we used the t-SNE algorithm to project the data into a two-dimensional space so that certain hidden structures in the data can be depicted intuitively (see the t-SNE visualization in Supplementary Figs S2 and S3).

Table 2.

Total number of cells, genes and validated populations for two scenarios, for in silicon studies based on purified cell types from 10× Genomics

Scenario #Genes #Cell Cell type
Simple 32 738 8385 CD56+ NK cells
10 085 CD19+ B cells
10 283 CD4+/CD25+ regulatory T cells
Challenging 32 738 11 953 CD8+/CD45RA+ naive cytotoxic T cells
10 263 CD4+/CD25+ regulatory T cells
10 479 CD4+/CD45RA+/CD25- naive T cells

We ran DIMM-SC, K-means clustering, CellTree and Seurat 50 times for both two scenarios. In the simple case, at each time, we randomly selected 1000 CD56+ NK cells, 2000 CD19+ B cells and 3000 CD4+/CD25+ regulatory T cells from the 10× Genomics datasets, and combined them together. Thus the total number of cells for clustering is 6000. Similarly, in the challenging case, 1000 CD8+/CD45RA+ naive cytotoxic T cells, 2000 CD4+/CD25+ regulatory T cells and 3000 CD4+/CD45RA+/CD25- naive T cells were randomly selected at each time.

Cell types in each dataset were known as a priori and were further validated in the respective follow-up studies, providing a reliable gold standard to benchmark the clustering performance for each method. We compared the performance of four DIMM-SC methods with K-means clustering, CellTree and Seurat, in terms of clustering accuracy and stability.

In the simple case, we applied all these seven clustering methods on the top 100 variable genes ranked by their standard deviation among all cells. Table 3 shows that all methods provided good clustering results. Two DIMM-SC methods with randomly selected initial cluster labels (RR and RW) slightly outperformed K-means clustering in terms of accuracy and variability. For the challenging case, unlike what we did in the simple case, we chose different numbers of top variable genes. Table 3 and Supplementary Figure S4 show that the ARIs of CellTree and Seurat were lower than other methods when the total number of genes used for clustering was greater than 200. DIMM-SC slightly outperformed K-means clustering in terms of accuracy. K-means clustering made a great leap forward when the total number of genes increased to 300. However, there is no further improvement of ARI with K-means clustering when top 500 or more variable genes were used. Since in the challenging case, CD4+/CD25+ regulatory T cells and CD4+/CD45RA+/CD25- naive T cells were similar to each other, more and more noisy genes were included in the analysis when we increased the total number of genes, which undermined the performance of K-means clustering. Note that K-means clustering and Seurat were only able to provide a deterministic clustering label, while DIMM-SC and CellTree can additionally provide the probability that each cell belongs to each cluster.

Table 3.

Performance of clustering in the simple case and the challenging case

#Genes DIMM-SC-KR DIMM-SC-KW DIMM-SC-RR DIMM-SC-RW K-means clustering CellTree Seurat
The simple case
100 0.952 (0.114) 0.951 (0.118) 0.982 (0.052) 0.990 (0.002) 0.951 (0.129) 0.983 (0.002) 0.983 (0.003)
The challenging case
100 0.351 (0.140) 0.357 (0.140) 0.368 (0.140) 0.408 (0.128) 0.182 (0.012) 0.278 (0.018) 0.395 (0.027)
200 0.558 (0.014) 0.559 (0.014) 0.558 (0.014) 0.559 (0.013) 0.283 (0.050) 0.389 (0.022) 0.410 (0.017)
300 0.563 (0.013) 0.564 (0.013) 0.563 (0.013) 0.563 (0.013) 0.526 (0.063) 0.419 (0.023) 0.413 (0.022)
400 0.571 (0.014) 0.571 (0.014) 0.566 (0.040) 0.571 (0.014) 0.554 (0.014) 0.404 (0.050) 0.429 (0.012)
500 0.572 (0.015) 0.572 (0.015) 0.572 (0.015) 0.572 (0.015) 0.559 (0.014) 0.397 (0.067) 0.435 (0.011)
800 0.562 (0.041) 0.562 (0.041) 0.557 (0.057) 0.556 (0.056) 0.557 (0.041) 0.365 (0.078) 0.445 (0.011)

Note: In each test case, the method with the best performance is highlighted in bold font. The number within parenthesis is standard error.

3.2.2 Real data analysis on the PMBC 68 K dataset

To examine how DIMM-SC is applicable to large-scale dataset, we applied DIMM-SC-KR on the PBMC 68 K dataset, which consists of >68 000 single cells. Among all 32 738 genes, we selected the top 1000 genes with the highest variations. Figure 2A shows a clear separation of cell types as we expected. 11 purified sub-populations of PBMCs were used as the reference to identify the cell type of each single cell from the PBMC 68 K dataset. We used the labels from cell classification analysis as the approximated truth. In this analysis, each cell was assigned to the purified population which has the highest correlation with its gene expression profile. We calculated ARIs between the true labels and inferred ones obtained from K-means clustering, CellTree, Seurat and DIMM-SC. The ARIs of K-means clustering, CellTree and DIMM-SC are 0.32, 0.28 and 0.41, respectively. To perform the analysis using Seurat, we used the default setting of Seurat to select the top 1657 variable genes, and picked the first 22 PCs for the clustering analysis. The ARI of Seurat is 0.31, suggesting that DIMM-SC performed the best in the PMBC 68 K dataset. Additionally, we highlighted vague cells in the t-SNE projection (Supplementary Fig. S2B), where vague cells are defined as cells with the largest posterior cluster-specific probability <0.95. As shown in Figure 2B, most of vague cells are located at the boundary of different clusters, which reassuring the validity of the clustering results.

Fig. 2.

Fig. 2.

The t-SNE projection of 68K PBMCs, colored by the DIMM-SC clustering assignment (A) and the illustration of vague cells with the largest posterior probability <0.95 (B)

3.2.3 Analysis of the in-house scRNA-seq data from systemic sclerosis study

Collaborating with investigators at the University of Pittsburgh, we are in the first place to use the 10× Chromium system to generate scRNA-Seq data in order to study systemic sclerosis. We applied DIMM-SC to the scRNA-Seq data of skin tissue collected from a systemic sclerosis patient. Starting from a UMI count matrix for 1180 cells generated by the 10× Genomics Cellranger pipeline, we first removed cells that had less than 300 genes expressed and filtered noisy genes that were expressed in less than five cells, then we extracted the top 1000 highly variable genes based on their standard deviations. We set the total number of clusters to be six based on our prior knowledge and utilized the KR method to generate the initial cluster labels and the initial values for the parameter α. The six cell clusters from DIMM-SC included 92, 89, 45, 156, 469 and 271 cells, respectively. Figure 3 shows the t-SNE projection of the skin cells, colored by cluster labels inferred by DIMM-SC, and the dashed circles represent potential subtypes of skin cells according to the expressions of cell type specific markers. It is interesting that fibroblast cells exhibit two clusters, suggesting possible subtypes. For each cell cluster, we identified top marker genes that were differentially expressed between the specified cluster and all the other clusters. We recognized some subtypes of skin cells for the identified clusters based on the biological knowledge of cell specific markers, such as pericyte cells specifically expressed gene RGS5, T cells specifically expressed gene IL32, endothelial cells specifically expressed gene VWF, fibroblast cells specifically expressed gene COL1A1, basal keratinocyte cells specifically expressed gene KRT14 and gene KRT5, and suprabasal keratinocyte cells specifically expressed gene KRT1 and gene KRT10.

Fig. 3.

Fig. 3.

The t-SNE projection of cells from systemic sclerosis skin tissue, colored by the DIMM-SC clustering assignment

3.3 Model fitting diagnosis

An important step in applying model-based approach is to examine whether the proposed statistical model fits the real data well. In Dirichlet distribution, the marginal distribution of p is a Beta distribution. In addition, the mean of pi, αik/α(k), is approximately proportional to its variation αikα(k)-αik/(α(k)2α(k)+1). After applying DIMM-SC to the PBMC 68 K scRNA-Seq dataset, we performed the following two analyses to evaluate the goodness of fit of the model. We first collected cells that belong to the same cell type using datasets of purified sub-populations of PBMCs from 10× Genomics, and then plotted the empirical marginal distribution of proportion pi for top variable genes. We compared such empirical distribution with the marginal distribution Betaαik,αk-αik at α(k)=α^(k), where α^(k) was estimated from the real scRNA-Seq data. Supplementary Figure S5A shows that the fitted distributions for top variable genes aligned very well with the empirical distributions, suggesting that DIMM-SC achieved good fit in real scRNA-Seq data.

Moreover, we explored the relationship between the mean and variance of pi’s, as commonly used in count data analysis, to evaluate whether any over-dispersion pattern exists. Similar to the previous analysis, we also collected cells from the same cell type, and calculated the mean and variation of pi for each gene. The scatter plot of the log mean of pi versus the log variance of pi (Supplementary Fig. S5B) shows a clear linear relationship between mean and variance. Derived from Dirichlet distribution, the expected intercept and slope can be approximated by 1 and log(α), respectively, where log(α^) was estimated from the real scRNA-Seq data. In CD56+ Natural Killer cells and CD19+ B cells, log(α^) equals to 6.60 and 6.67, respectively. As shown in Supplementary Figure S5B, the intercept and slope of the fitted line (red line) are close to the expected values, indicating a good model fitting in this real scRNA-Seq data. We noticed that, due to both technical and biological uncertainties, a few genes exhibit extra variation, which cannot be fully explained by the mean-variation relationship posited by the Dirichlet distribution. We will pursue to extend DIMM-SC to account for such additional variation in the near future.

4 Software availability

We have implemented DIMM-SC into a user-friendly R package, which is freely available on http://www.pitt.edu/∼wec47/singlecell.html. This software program can take a full matrix file that is compiled by users or can directly take the sparse UMI count matrix file from the 10× Genomics Cellranger pipeline. The output includes clustering results, the probability matrix for all cells, the probability vector for each gene, and the t-SNE projection visualization.

5 Discussion

Compared with the early generation scRNA-Seq technologies, the intrinsic characteristics of droplet-based scRNA-Seq data, including a much larger number of cells and direct counting of molecule copies using UMI, pose great challenges on statistical analysis and require new methodological development. In this study, we developed a model-based clustering method DIMM-SC for analyzing droplet-based scRNA-Seq data. DIMM-SC directly models UMI counts from scRNA-Seq data using a multinomial distribution with Dirichlet mixture priors. We demonstrated that DIMM-SC has achieved substantial improvements in clustering accuracy and stability compared to existing clustering methods such as K-means clustering, Seurat and CellTree. More importantly, our probabilistic model provides clustering uncertainty for each cell (how likely each cell belongs to each cluster), thus can benefit rigorous statistical inference and straightforward biological interpretations. In addition, DIMM-SC can be used to detect differentially expressed gene markers among different cell types, which is under our further investigation.

Our probabilistic model coupled with a computationally efficient E-M algorithm is able to cluster large-scale droplet-based scRNA-Seq data. For example, it takes around 3 h to cluster 68 000 cells using top 1000 highly variable genes. In the analysis of scRNA-Seq data, both gene level filtering and cell level filtering are critical for clustering regardless of which clustering method to use. We recommend to rank genes by their variations among all cells and choose top 500–1000 highly variable genes. In addition, we also recommend to run DIMM-SC 5–10 times, each with different random seeds, and choose the one with the largest likelihood as the final results. For the number of clusters, we can pre-define it based on prior knowledge on the tissue or determine it using some model checking criteria such as AIC or BIC (Akaike, 1974; Schwarz, 1978). As shown in Supplementary Figure S6, AIC and BIC work well in the analysis of simulated datasets, the performance in real data needs further exploration. Alternatively, it can be determined using the procedure described in ADPclust (Wang and Xu, 2015) or the Dirichlet process (Teh, 2011). DIMM-SC is currently implemented in R/Rcpp with satisfactory computing efficiency for most needs so far. Further improvement (e.g. parallel computing) can be made to accommodate larger-scaled data.

There are several noticeable limitations of our method. First, DIMM-SC only models variations among different cells from one single individual. To jointly model scRNA-Seq data from multiple individuals, a hierarchical structure can be posed in the current method to account for the individual level heterogeneity, but a more sophisticated numerical algorithm will be needed to reduce the computational cost. Second, DIMM-SC is an unsupervised method that infers structures from all data. Prior knowledge on cell-type-specific biomarkers may further improve the clustering accuracy. To use such prior information, a semi-supervised approach is needed to guide cluster inference. Furthermore, existing scRNA-Seq data from purified cells (e.g. via flow cytometry) can serve as external reference panels or training datasets to reduce experimental biases, remove outliers and improve clustering reliability. Last but not least, our DIMM-SC model ignores the measurement errors and uncertainties buried in the UMI count matrix. Multiple factors such as drop-out event, mapping percentage, sequencing depth and PCR efficiency are not considered in the current model. These limitations can be largely overcome by extending our method. We will explore these directions in the near future.

We noticed that similar models have been proposed in the field of text-mining (Yamamoto and Sadamitsu, 2005) and microbiome (Holmes et al., 2012), where word, article and topic or taxa, individual, and meta-community are studied. However, in those applications, the clusters are not well defined and require a careful interpretation. On the contrary, scRNA-Seq data usually consist of a set of known cell types from prior knowledge and have a much larger signal-to-noise ratio for the clustering analysis. Although sharing the common types of data structure, these fields have different fundamental questions, so existing methods proposed from other fields need to be tailored or extended to incorporate intrinsic characteristics of scRNA-Seq data. For example, CellTree adapts the LDA approach from the text-mining field. Although LDA is more flexible and more widely used in text-mining field than the Dirichlet mixture model based methods, we have showed that DIMM-SC is more accurate, stable and efficient than CellTree in both simulation studies and real data applications in the context of scRNA-Seq clustering analysis.

In summary, we provide a novel statistical method and an efficient computational tool DIMM-SC for clustering droplet-based single cell transcriptomic data, which facilitates rigorous statistical inference of cell population heterogeneity. We are confident that DIMM-SC will be highly useful for the fast-growing community of large-scale single cell transcriptome analysis.

Supplementary Material

Supplementary Figures

Acknowledgements

The authors would like to thank Dr. Zhiguang Huo and Dr. Kong Chen for providing valuable insights during the initial stage of this project.

Funding

This work is supported by grants R01HG007358 (W.C.), U54DK107977 (M.H.) from National Institute of Health, Children’s Hospital of Pittsburgh RAC fellowship (Z.S.), and the National Science Foundation of China Grant 11401338 (K.D.).

Conflict of Interest: none declared.

References

  1. Akaike H. (1974) New Look at Statistical-Model Identification. IEEE Trans. Automat. Contr., 19, 716–723. [Google Scholar]
  2. Dempster A.P. et al. (1977) Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B (Methodological), 39(1), 1–38. [Google Scholar]
  3. duVerle D.A. et al. (2016) CellTree: an R/bioconductor package to infer the hierarchical structure of cell populations from single-cell RNA-seq data. BMC Bioinformatics, 17, 363. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Gawad C. et al. (2016) Single-cell genome sequencing: current state of the science. Nat. Rev. Genet., 17, 175–188. [DOI] [PubMed] [Google Scholar]
  5. Holmes I. et al. (2012) Dirichlet multinomial mixtures: generative models for microbial metagenomics. PLoS One, 7, e30126.. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Islam S. et al. (2014) Quantitative single-cell RNA-seq with unique molecular identifiers. Nat. Methods, 11, 163–166. [DOI] [PubMed] [Google Scholar]
  7. Jaitin D.A. et al. (2014) Massively parallel single-cell RNA-seq for marker-free decomposition of tissues into cell types. Science, 343, 776–779. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Kivioja T. et al. (2012) Counting absolute numbers of molecules using unique molecular identifiers. Nat. Methods, 9, 72–74. [DOI] [PubMed] [Google Scholar]
  9. Macosko E.Z. et al. (2015) Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell, 161, 1202–1214. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Rand W.M. (1971) Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc., 66, 846–850. [Google Scholar]
  11. Rodriguez A., Laio A. (2014) Machine learning. Clustering by fast search and find of density peaks. Science, 344, 1492–1496. [DOI] [PubMed] [Google Scholar]
  12. Ronning G. (1989) Maximum-likelihood estimation of dirichlet distributions. J. Stat. Comput. Simul., 32, 215–221. [Google Scholar]
  13. Satija R. et al. (2015) Spatial reconstruction of single-cell gene expression data. Nat. Biotechnol., 33, 495–502. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Schwarz G. (1978) Estimating the dimension of a model. Ann. Stat., 6, 461–464. [Google Scholar]
  15. Stegle O. et al. (2015) Computational and analytical challenges in single-cell transcriptomics. Nat. Rev. Genet., 16, 133–145. [DOI] [PubMed] [Google Scholar]
  16. Teh Y.W. (2011) Dirichlet process. Encyclopedia of Machine Learning. Springer, US, pp. 280–287. [Google Scholar]
  17. van der Maaten L., Hinton G. (2008) Visualizing data using t-SNE. J. Mach. Learn. Res., 9, 2579–2605. [Google Scholar]
  18. Wang X.F., Xu Y. (2015) Fast clustering using adaptive density peak detection. Stat. Methods Med. Res., doi:10.1177/0962280215609948. [DOI] [PubMed] [Google Scholar]
  19. Weir B.S., Hill W.G. (2002) Estimating F-statistics. Annu. Rev. Genet., 36, 721–750. [DOI] [PubMed] [Google Scholar]
  20. Yamamoto M., Sadamitsu K. Dirichlet mixtures in text modeling. CS Technical report CS-TR-05-1, University of Tsukuba 2005.
  21. Zheng G.X. et al. (2017) Massively parallel digital transcriptional profiling of single cells. Nat. Commun., 8, 14049. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Figures

Articles from Bioinformatics are provided here courtesy of Oxford University Press

RESOURCES