Skip to main content
Nucleic Acids Research logoLink to Nucleic Acids Research
. 2020 May 7;48(11):5814–5824. doi: 10.1093/nar/gkaa314

BREM-SC: a bayesian random effects mixture model for joint clustering single cell multi-omics data

Xinjun Wang 1,2, Zhe Sun 1,2, Yanfu Zhang 2, Zhongli Xu 3,4, Hongyi Xin 3, Heng Huang 2, Richard H Duerr 5, Kong Chen 5, Ying Ding 1, Wei Chen 1,3,
PMCID: PMC7293045  PMID: 32379315

Abstract

Droplet-based single cell transcriptome sequencing (scRNA-seq) technology, largely represented by the 10× Genomics Chromium system, is able to measure the gene expression from tens of thousands of single cells simultaneously. More recently, coupled with the cutting-edge Cellular Indexing of Transcriptomes and Epitopes by Sequencing (CITE-seq), the droplet-based system has allowed for immunophenotyping of single cells based on cell surface expression of specific proteins together with simultaneous transcriptome profiling in the same cell. Despite the rapid advances in technologies, novel statistical methods and computational tools for analyzing multi-modal CITE-Seq data are lacking. In this study, we developed BREM-SC, a novel Bayesian Random Effects Mixture model that jointly clusters paired single cell transcriptomic and proteomic data. Through simulation studies and analysis of public and in-house real data sets, we successfully demonstrated the validity and advantages of this method in fully utilizing both types of data to accurately identify cell clusters. In addition, as a probabilistic model-based approach, BREM-SC is able to quantify the clustering uncertainty for each single cell. This new method will greatly facilitate researchers to jointly study transcriptome and surface proteins at the single cell level to make new biological discoveries, particularly in the area of immunology.

INTRODUCTION

Revolutionary tools such as Cellular Indexing of Transcriptomes and Epitopes by Sequencing (CITE-Seq) and RNA expression and protein sequencing assay (REAP-seq) have been recently developed for measuring single cell surface protein and mRNA expression level simultaneously in the same cell (1–3). Oligonucleotide-labeled antibodies are used to integrate cellular protein and transcriptome measurements. It combines highly multiplexed protein marker detection with transcriptome profiling for thousands of single cells. CITE-Seq allows for immunophenotyping of cells using existing single cell sequencing approaches (3), and it is fully compatible with droplet-based single cell RNA sequencing (scRNA-Seq) technology (e.g. 10× Genomics Chromium system (4)) and utilizes the discrete count of Antibody-Derived Tags (ADT) as the direct measurement of cell surface protein abundance. This promising and popular technology provides an unprecedent opportunity for jointly analyzing transcriptome and surface proteins at the single cell level in a cost-effective way.

In CITE-Seq experiment, the abundance of RNA and surface marker is quantified by Unique Molecular Index (UMI) and Antibody-Derived Tags (ADT) respectively, for a common set of cells at the single cell resolution. These two data sources represent different but highly related and complementary biological components. Classic cell type identification relies on cell surface protein abundance, which can be measured individually with flow cytometry. Recently, scRNA-Seq data are also used to classify cell types, based on differentially expressed genes among different cell types. In fact, both data sources have their unique characteristics and can provide complementary information. For example, the use of cell surface proteins for cell gating is advantageous in identifying common cell types but may not successfully identify some rare cell types due to its low dimensionality. On the other hand, although cell clustering based on scRNA-Seq could identify more cell types because of its higher dimensionality, it is less capable to distinguish highly similar cell types, such as CD4+ T cells and CD8+ T cells, due to a poor observed correlation between a mRNA and its translated protein expression in single cell (3,5,6).

Despite the promise of this new technology, current statistical methods for jointly analyzing data from scRNA-Seq and CITE-Seq are still unavailable or immature. A novel joint clustering approach that fully utilizes the advantages and unique features of these single cell multi-omics data will lead to a more powerful tool in identifying rare cell types or reduce false positives such as doublets. Many statistical methods have been proposed for clustering scRNA-Seq data only, such as single cell interpretation via multi-kernel learning (SIMLR) (7), CellTree (duVerle et al., 2016), Seurat (8), SC3 (9), DIMM-SC (10) and BAMM-SC (11), which are either from different clustering approach categories or recommended by recent reviewers (12,13). In contrast, to our best knowledge, there is no published method tailored for joint clustering multi-omics data from CITE-Seq. A naïve approach is to do separate analysis on each data source, which is straightforward but suffering from various of issues such as lack of power and failing to capture the associations between transcriptome and expression of surface proteins. Multimodal data analysis, on the other hand, is supposed to achieve a more detailed characterization of cellular phenotypes than using transcriptome measurements alone.

In this study, we propose BREM-SC, a Bayesian Random Effects Mixture Model for joint clustering scRNA-Seq and CITE-Seq data. Because there is no existing method tailored for clustering single cell multi-omics data jointly, we compare the performance of BREM-SC with three popular single source clustering methods, including K-means clustering, SC3 (9) and TSCAN (14), and two commonly used multi-source clustering methods in the engineering field, including Multi-View Non-negative Matrix Factorization equipped with capped norm (MV-NMF) (15) and Pair-wised Co-regularized Multi-modal Spectral Clustering (PC-MSC) (16), in our simulation studies. K-means is one of the most popular clustering methods and has been used in the first 10× Genomics publication (4). SC3 and TSCAN have also been proposed for clustering scRNA-Seq data, but they fall in different clustering categories. For example, SC3 is a single cell consensus clustering method, where a consensus matrix is calculated using the Cluster-based Similarity Partitioning Algorithm (CSPA). Unlike SC3, TSCAN performs model-based clustering on the transformed expression values. We also compared BREM-SC with two existing methods MV-NMF and PC-MSC, which were developed for analyzing multi-source data. MV-NMF considers the non-negative entry constraints in dimension reduction while preserving the cross-modal consistency for reduced features, after which a standard K-means is used to finalize the clustering. To make the model robust to outliers, a capped-norm objective is also utilized. Alternatively, PC-MSC introduces a co-regularization on the spectral clustering. Neither MV-NMF nor PC-MSC has been used in single cell multi-omics analysis before, but they can be directly applied to our CITE-seq data. In our real data applications, we also include Seurat (17,18), one of the most popular tools for single cell analysis, and DIMM-SC (10), the single source version of BREM-SC, to benchmark the performance of single source clustering compared to joint clustering.

MATERIALS AND METHODS

Statistical model and estimation

The data illustration and general framework of BREM-SC modeling are shown in Figure 1. Although data from both sources are count data, there are several major differences between them. Firstly, the drop-out events are very common in the transcriptomic data, which are in fact much less frequent in the proteomic data. Therefore, the data matrix for RNA source is relatively sparse and people usually screen out genes with low variability in expression before performing analysis. Secondly, the overall scale of two data sets are significantly different, where proteomic data have larger values due to higher abundance of proteins in a cell. Based on the facts, we propose separate parametric model for each data source.

Figure 1.

Figure 1.

Core logic of BREM-SC method for joint clustering RNA and ADT single cell data.

Suppose there are Inline graphic cells generated from CITE-Seq, denote by the transcriptomic data a matrix Inline graphic and its ADT levels (measurement of surface protein) a matrix Inline graphic. We use a latent variable vector Inline graphic with elements Inline graphic to represent the cell type label for cell Inline graphic, where Inline graphic.

For transcriptomic data, each element Inline graphic in the raw count matrix represents the number of unique UMIs for gene Inline graphic in cell Inline graphic, where Inline graphic runs from 1 to the total number of genes Inline graphic, and Inline graphic runs from 1 to the total number of cells Inline graphic. We then denote the number of unique UMIs in the Inline graphic th single cell by a vector Inline graphic We assume that Inline graphic follows a multinomial distribution with parameter vector Inline graphic For this multinomial distribution, we further assume that the proportion Inline graphic follows a Dirichlet distribution Inline graphic, with all the elements in Inline graphic being strictly positive. Next, we assume that the cell population consists of Inline graphic distinct cell types. To provide a more flexible modeling framework and allow for unsupervised clustering, we extend the aforementioned single Dirichlet prior to a mixture of Inline graphic Dirichlet distributions, indexed by Inline graphic and each with parameter Inline graphic. For instance, if cell Inline graphic belongs to the Inline graphicth cell type, its gene expression profile Inline graphic follows a cell-type-specific prior distribution Inline graphic. The Dirichlet multinomial density for cell Inline graphic, as illustrated in DIMM-SC (10), can be obtained by multiplying the Dirichlet mixture prior by the multinomial density and then integrating out Inline graphic, which is derived as Inline graphic, where Inline graphic is the total number of unique UMIs for Inline graphic th cell.

Similarly, we also use the Dirichlet multinomial distribution to model surface protein (ADT) data. Suppose there are in total Inline graphic number of ADT markers, then similarly, the density of Dirichlet multinomial is derived as Inline graphic, where Inline graphic indicates the index for protein marker and Inline graphic is the total ADT counts for Inline graphic th cell.

Without considering the correlation between two datasets, the joint log-likelihood for all cells, assuming full independency between two sources, can then be derived as

graphic file with name M42.gif (1)

To further model the correlation between different data sources, we introduce cell-specific random effects into our framework. Since we assume that cells belonging to same cell type share common Dirichlet parameters, we model cell heterogeneity by directly multiplying cluster-specific Dirichlet parameters with the random effects, i.e. Inline graphic and Inline graphic, where Inline graphic and Inline graphic are the Dirichlet parameters of cell type Inline graphic for RNA and protein data, respectively, and Inline graphic is the random effects for the Inline graphicth cell. Based on the fact that Dirichlet parameters are strictly positive numbers, we assume that this cell-specific random effects Inline graphic, i.e., Inline graphic. Finally, the complete log likelihood can be derived as

graphic file with name M52.gif (2)

We use Gibbs sampler to updateInline graphic, and use random walk Metropolis within Gibbs sampler to iteratively update Inline graphic, Inline graphic and Inline graphic in (Equation 2) (details can be found in Supplementary Methods).

Selection of the number of clusters and initial values

To implement BREM-SC, it is critical to select the total number of clusters and the initial values for MCMC. Specifically, the number of clusters Inline graphic can be defined either with prior knowledge or standard model checking criterion such as Akaike's Information Criteria (AIC) or Bayesian Information Criteria (BIC). Meanwhile, there are many methods to determine the initial values of Inline graphic. For BREM-SC we applied K-means clustering to get a preliminary clustering result for each data source separately, followed by using Ronning's method (19) to estimate initial values of Inline graphic, which is similar to the estimation procedure implemented in DIMM-SC (10).

Data generation algorithm for simulation studies under BREM-SC model

Based on different parameter settings in our Dirichlet multinomial models, we simulated RNA expression and ADT measurements for each single cell. In the simulation set-up, the two count matrices were sampled from the proposed Dirichlet mixture models. Specifically, for a fixed number of cell clusters Inline graphic, we first pre-defined the values of Inline graphic and Inline graphic for the Inline graphicth cell cluster. The random effects Inline graphic are generated from a log-normal distribution with pre-specified value Inline graphic. We then computed the transcriptomic profile Inline graphic for each single cell by multiplying Inline graphic by Inline graphic. Similarly, for cellular protein expression profile, we multiplied Inline graphic by Inline graphic to compute Inline graphic. For the next step, we sampled the cluster proportion Inline graphic (or Inline graphic) from a Dirichlet distribution Inline graphic (or Inline graphic). Lastly, we sampled the UMI count vector Inline graphic for Inline graphic th cell from multinomial distribution Inline graphic and sampled the ADT count vector Inline graphic from Inline graphic, where Inline graphic and Inline graphic are each simulated from truncated normal distribution with parameters estimated from real data.

Data generation algorithm for simulation studies under model misspecification

We used R package Splatter (20) to simulate data to assess robustness of BREM-SC under model misspecification. In Splatter, the final data matrix is a synthetic dataset consisting of counts from a Gamma-Poisson (or negative-binomial) distribution. Since there is no existing method for generating single cell surface protein expression levels from CITE-Seq, we also used Splatter to generate ADT count for the proteomic data. To make our simulated gene expression data a good approximation to the real data, our model parameters (in Splatter) were estimated from the real data downloaded from 10X Genomics website (https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/b_cells). Specifically, for ADT data, we modified the Splatter parameters such as dropout rate, library size, expression outlier, and dispersion across features to make the simulated data more similar to real observed ADT data regarding the scale. We assumed all cell types are shared between gene expression and ADT data, and further specified differential expression parameters to generate scenarios with different magnitude of cell type differences.

Setup of BREM-SC and competing methods used in this paper

As a Bayesian method, to increase the stability of BREM-SC and avoid the extreme case of bad initialization, in practice we recommend running the algorithm with three to five chains simultaneously (parallel computing implemented within R package) and then choose the chain with maximum likelihood. In this study, we applied BREM-SC using three chains in simulation and real data analyses, and set the number of MCMCs to be 500. On the other hand, all clustering methods to which BREM-SC compared were performed under their default settings. Single-source clustering methods including K-means, SC3 and TSCAN were applied to the pooled data from RNA expression and ADT with centered log-ratio (clr)-transformation (8) for normalization when applicable while ignoring source specificity.

Metrics for clustering performance

We used adjusted rand index (ARI) (21) and adjusted mutual information (AMI) (22) as the metrics for clustering performance in our analysis. Both of them are commonly used metrics to indicate similarity between two clustering results. In general, ARI is based on pair-counting while AMI is based on Shannon information theory. ARI ranges from –1 to 1, where an ARI of value 1 indicates the clustering result is identical to the underlying truth, 0 indicates a clustering result by random clustering, and negative number indicates an even worse result than random clustering. Similar to ARI, an AMI of value 1 indicates a perfect clustering result while 0 indicates a random clustering result. Although ARI and AMI often agree in most situations, a previous study found that ARI is preferred when the reference clustering has relatively equal sized-clusters, while AMI is recommended when the reference clustering is unbalanced and thus small clusters could exist (23). Therefore, we used ARI as the clustering metric in our simulation studies since we simulated balanced clusters, and we used both ARI and AMI in our real data applications since the approximated ground truth cell types are not balanced.

Public human peripheral blood mononuclear cells (PBMC) CITE-Seq dataset

To assess the performance of BREM-SC, we used a published human PBMC CITE-Seq dataset downloaded from 10X Genomics website (https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_10k_protein_v3). A total of 7,865 cells from a healthy donor were stained with 14 TotalSeq-B antibodies, including CD3, CD4, CD8a, CD14, CD15, CD16, CD19, CD25, CD45RA, CD45RO, CD56, CD127, PD-1 and TIGIT. Cell-matched scRNA-Seq data are available.

In-house human PBMC CITE-Seq dataset

To further evaluate the validity of our method. We generated an in-house CITE-seq dataset of human PBMC from a healthy donor under IRB approval from the University of Pittsburgh. 1372 cells were stained with Totalseq-A from BioLegend and are prepared using the 10x Genomics platform with Gel Bead Kit V2. The prepared assay is subsequently sequenced on an Illumina Hiseq with a depth of 50K reads per cell. Cells in this dataset are measured for their surface marker abundance through CITE-seq (3). Ten surface markers are measured for every cell: CD3, CD4, CD8a, CD11c, CD14, CD16, CD19, CD56, CD127 and CD154. Cell Ranger 3.0 was used to process the data and generate UMI matrix for the downstream analysis.

RESULTS

To assess the performance of BREM-SC, we performed comprehensive simulation studies to compare BREM-SC with five existing clustering methods, including K-means, SC3, TSCAN, MV-NMF and PC-MSC, all of which are hard clustering approaches and thus will assign each cell to an exclusive cluster. We simulated scRNA-Seq and CITE-Seq data under both our proposed model and model misspecification (using R package Splatter (20)) to assess the validity and robustness of BREM-SC. For each simulation scenario, we simulated 100 datasets to assess the variability of clustering results for each method. We also applied BREM-SC on two human peripheral blood mononuclear cells (PBMC) CITE-Seq datasets to assess the usefulness of our method in real application.

BREM-SC outperforms other methods in simulation studies under our proposed model

We designed different simulation scenarios to assess the performance of BREM-SC by varying number of cells in each cluster, number of clusters, magnitude of cell type differences (i.e. the magnitude of difference among different clusters), and among-cell variabilities (indicated by the magnitude of random effects).

The results of simulation studies under our proposed model are shown in Figure 2. In general, the performance of all clustering approaches decreases as the among-cell variability, indicated by Inline graphic, increases, number of clusters increases, and number of cells decreases. As shown in Figure 2A, BREM-SC outperformed the other five competing methods by achieving the highest average ARI among 100 simulations across different levels of among-cell variabilities.

Figure 2.

Figure 2.

Boxplots of ARIs for six clustering methods across 100 simulations, investigating how various among-cell variabilities (A), magnitude of cell type differences (B), (W refers to weak signals between clusters; S refers to strong signals), number of cells in each cluster (C) and number of clusters (D) affect the clustering results.

Figure 2BD list the boxplots of ARI by varying magnitude of cell type differences, number of cells in each cluster, and number of clusters, respectively. In Figure 2B, we considered four scenarios in terms of signal strength from two data sources. To illustrate the advantage of joint clustering, we also applied K-means, SC3, TSCAN and DIMM-SC on ADT data alone. When the clustering signal is strong (i.e. difference among cell clusters is large) in both RNA expression and ADT data (referred to Both S column), both BREM-SC and SC3 performed extremely well while other methods show fair clustering results. However, when cell clusters are similar in either proteomics (referred to ADT W/ RNA S column) or transcriptomics data (referred to ADT S/ RNA W column), K-means and TSCAN produced less accurate clustering results, while BREM-SC and SC3 still performed well. As expected, strong clustering signal leads to higher clustering accuracy and lower clustering variability. If both RNA expression and ADT data are alike across different cell types (referred to Both W column), ARIs of all methods decreased but BREM-SC still performed the best. This simulation scenario clearly demonstrated that as long as either of the data source contains strong clustering signals, our BREM-SC takes full advantage of that and achieves highly satisfactory performance. In Figure 2C, we observed that more cells can provide more accurate and robust clustering results, which is as expected. In Figure 2D, we found that larger number of clusters is associated with worse clustering performance. Again, in both scenarios, BREM-SC outperformed the other five methods across all various settings.

Consistently across all four scenarios shown in Figure 2, when data are generated from our proposed true model, BREM-SC outperformed K-means clustering, SC3, TSCAN, MV-NMF and PC-MSC, suggesting its advantages in fully utilizing both types of data simultaneously.

BREM-SC outperforms other methods in simulation studies under model misspecification

To evaluate the robustness of BREM-SC when the data generation model is mis-specified, we simulated additional datasets using R package Splatter (20), a commonly used tool to simulate scRNA-Seq data whose underlying models are completely different from ours. Figure 3 shows the performance of BREM-SC and other competing methods under model misspecification. In Figure 3A, we simulated different signal strength for each data source using Splatter. Methods such as K-means and MV-NMF produced poor results across all scenarios even when signal strengths are large in both data sources. On the other hand, BREM-SC outperformed all other methods and showed good clustering performance even when signal strengths are relatively small in both data sources. In Figure 3B, we assessed the clustering performance across different probabilities that a gene is differentially expressed (DE). In general, higher DE probability indicates easier cell clustering. Again, we found that BREM-SC outperformed all other competing methods in terms of clustering accuracy. Therefore, we demonstrated the strong robustness of BREM-SC under model misspecification.

Figure 3.

Figure 3.

Boxplots of ARIs for six clustering methods across 100 simulations using Splatter. The performance under various magnitude of cell type differences were investigated. In (A), mean parameters of three cell types were set as (0.15, 0.151, 0.152) and (0.15, 0.2, 0.25) to represent two levels (S standards small and L standards for large) of cell type difference. In (B), different probabilities that a gene would be selected to be differentially expressed (DE) were set.

Analysis of a public human PBMC CITE-Seq dataset

To evaluate the clustering performance of BREM-SC on real data, we first used a published human peripheral blood mononuclear cells (PBMC) CITE-Seq dataset downloaded from 10× Genomics website. A total of 7865 cells and 14 surface protein markers are included in this dataset in addition to matched scRNA-Seq data. As a standard approach of analyzing human PBMC dataset, we extracted the top 1000 highly variable genes using the algorithm implemented in Seurat that accounts for mean-variance relationship (18). We identified seven cell types based on the biological knowledge of both protein and gene markers as the approximate truth, which is illustrated in Supplementary Figure S1. Examples of such cell type identification procedure are shown in Supplementary Figure S2. Taken together, >80% of single cells can be assigned to a specific cell type. Cells with uncertain cell types (not identified in the ground truth) were removed from computing ARIs and AMIs.

We applied eight clustering methods (K-means clustering, TSCAN, SC3, Seurat, DIMM-SC, MV-NMF, PC-MSC and BREM-SC) on this dataset and performed both single source clustering (using ADT or RNA data only) and multi-source joint clustering, respectively. We ran each method ten times to evaluate the stability of clustering performance. Note that since TSCAN is a deterministic clustering method and generates identical results, stability of performance cannot be assessed from analyzing a single dataset for this method. Similar feature is for Seurat with a fixed resolution parameter. As shown in Table 1, BREM-SC outperformed other methods for multi-source joint clustering in terms of both ARI and AMI. Also, we observed that the clustering results based on protein only are better than based on RNA only for all methods. For example, Seurat reached an ARI of 0.726 and AMI of 0.757 in protein only analysis, but only with ARI of 0.564 and AMI of 0.660 in RNA only analysis. We can explain this observation by the facts that we used much more protein information to build the approximated ground truth, and on the other hand, the transcriptome and proteome information are not as highly correlated as people may expect at single cell resolution, which has also been observed in other studies. Further, we used Uniform Manifold Approximation and Projection (UMAP) plot (24) to visualize the clustering results from BREM-SC. UMAP is a recently developed non-linear dimension reduction tool for visualization, but quickly gains popularity among single cell researches because of its outstanding performance. Figure 4 shows the UMAP plots with each cell colored by their ground truth label (Figure 4A) and cluster labels inferred by BREM-SC (Figure 4B), respectively. In general, the two plots are highly similar regarding to the distribution of different clusters (ARI = 0.840), indicating the outstanding performance of BREM-SC.

Table 1.

Performance (ARI and AMI) of different clustering methods from ten times analyses of the public human PBMC real dataset

ARI AMI
Mean (SD), [range] Mean (SD), [range]
Single source clustering - RNA only
K-means 0.524 (0.054), [0.508, 0.562] 0.592 (0.022), [0.579, 0.632]
TSCAN 0.563 (N/A), [N/A] 0.591 (N/A), [N/A]
SC3 0.596 (0.082), [0.456, 0.709] 0.629 (0.060), [0.513, 0.706]
DIMM-SC 0.629 (0.030), [0.590, 0.676] 0.666 (0.018), [0.653, 0.700]
Seurat 0.564 (N/A), [N/A] 0.660 (N/A), [N/A]
Single source clustering - protein only
K-means 0.653 (0.093), [0.436, 0.760] 0.732 (0.048), [0.632, 0.808]
TSCAN 0.691 (N/A), [N/A] 0.717 (N/A), [N/A]
SC3 0.649 (0.045), [0.568, 0.692] 0.722 (0.035), [0.670, 0.777]
DIMM-SC 0.698 (0.046), [0.619, 0.741] 0.744 (0.025) [0.701, 0.766]
Seurat 0.726 (N/A), [N/A] 0.757 (N/A), [N/A]
Multi-source joint clustering
K-means 0.661 (0.107), [0.471, 0.811] 0.723 (0.059), [0.577, 0.781]
TSCAN 0.586 (N/A), [N/A] 0.635 (N/A), [N/A]
SC3 0.679 (0.162), [0.430, 0.911] 0.723 (0.096), [0.595, 0.870]
MV-NMF 0.701 (0.154), [0.380, 0.856] 0.679 (0.098), [0.473, 0.779]
PC-MSC 0.641 (0.029), [0.621, 0.723] 0.679 (0.018), [0.672, 0.729]
BREM-SC 0.728 (0.091), [0.585, 0.840] 0.737 (0.048), [0.674, 0.800]

Figure 4.

Figure 4.

The performance of BREM-SC with the public human PBMC CITE-Seq dataset (from 10X Genomics). The UMAP projection of cells are colored by the approximate ground truth (A) and BREM-SC clustering results (B).

‘Soft-clustering’ property of BREM-SC

We further used this public human PBMC CITE-Seq dataset to illustrate the property of ‘soft-clustering’, that BREM-SC can provide the posterior probability that a given cell belongs to a specific cluster in addition to cell labels. We highlighted the ‘vague’ cells identified by BREM-SC on UMAP plots. Here ‘vague’ cells refer to those cells with largest posterior probability smaller than a pre-specified threshold. By setting up different probability threshold, the number of ‘vague’ cells can be controlled. In Figure 5, we showed the distribution of ‘vague’ cells, which are defined as the lowest 5% cells ranked by the largest cluster posterior probability, on UMAP plot for the public PBMC dataset. By comparing with Figure 4A, we found many of such ‘vague’ cells coincide with either the unknown cells that we fail to identify based on current biological knowledge or the cells on the boundary between two cell types that are closely attached to each other on UMAP plot. In addition, we observed that a subset of CD8+ T cells are labeled as ‘vague’ cells, which we suspect to be memory CD8+ T cells by differential expression analysis with RNA data. We also plotted the distribution of the top 1%, 2%, 3% and 4% ‘vague’ cells with the lowest certainty in Supplementary Figure S4.

Figure 5.

Figure 5.

Illustration of ‘soft clustering’ property by highlighting ‘vague’ cells (about lowest 5% ranked by the largest cluster posterior probability) from BREM-SC.

Analysis of an in-house human PBMC CITE-Seq dataset

Because there are limited public datasets of CITE-seq, to further evaluate our method, we performed a CITE-seq experiment on human PBMCs from a healthy donor with informed consent and University IRB approval. 10× Genomics Chromium system was used to generate CITE-Seq data with 10 cell surface markers designed in the experiment, which yielded a total of 1388 cells. Similarly, we extracted the top 1000 highly variable genes as we did for the public human PBMC dataset. For this dataset, we identified six subtypes of PBMCs based solely on the biological knowledge of cell-type-specific protein markers (illustrated in Supplementary Figure S1), and >85% of single cells were assigned to a specific cell type using these markers. Examples of such cell type identification procedure are shown in Supplementary Figure S3. We then used these cell labels as the approximated ground truth to benchmark the clustering performance for different clustering methods. Cells with uncertain cell types were removed when calculating ARIs and AMIs.

Similar to the analysis of public human PBMC dataset, we applied eight clustering methods on this in-house PBMC dataset and repeated each method ten times to evaluate the stability of its performance. As shown in Figure 6, BREM-SC performed very well in the human PBMC samples (ARI = 0.985), since the UMAP plot with each cell colored by their cell-type label based on protein markers (Figure 6A) is highly similar to the plot generated from the clustering result of BREM-SC (Figure 6B). Clustering results for all methods are summarized in Table 2. Again, BREM-SC outperformed all other competing methods for multi-source joint clustering based on the average ARI and AMI. However, it is observed that in general single source clustering with ADT count data performed better than multi-source joint clustering under this circumstance. The reason for this result is that the approximated ground truth is built solely based on protein markers, and thus will favor clustering analyses only using the protein data. Methods such as K-means and TSCAN performed much worse compared to their corresponding single source analysis results, indicating that they failed to properly incorporate RNA information and thus the joint clustering results were highly compromised by the inclusion of RNA ‘noises’. On the other hand, BREM-SC, although performed slightly worse than DIMM-SC with protein only clustering, still gives extremely satisfactory results (mean ARI = 0.966), indicating that BREM-SC can handle mRNA–protein association very well in real data and is robust to ‘noises’ introduced from one data source.

Figure 6.

Figure 6.

The performance of BREM-SC for in-house human PBMC CITE-Seq dataset. The UMAP projection of cells are colored by the ground truth (A) and BREM-SC clustering results (B).

Table 2.

Performance (ARI and AMI) of different clustering methods from ten times analyses of the in-house human PBMC real dataset

ARI AMI
Mean (SD), [range] Mean (SD), [range]
Single source clustering - RNA only
K-means 0.547 (0.073), [0.429, 0.681] 0.613 (0.062), [0.474, 0.707]
TSCAN 0.476 (N/A), [N/A] 0.574 (N/A), [N/A]
SC3 0.561 (0.017), [0.540, 0.589] 0.606 (0.022), [0.560, 0.644]
DIMM-SC 0.563 (0.101), [0.366, 0.682] 0.598 (0.069), [0.488, 0.707]
Seurat 0.675 (N/A), [N/A] 0.732 (N/A), [N/A]
Single source clustering - protein only (complete list)
K-means 0.930 (0.120), [0.702, 0.989] 0.931 (0.072), [0.794, 0.968]
TSCAN 0.929 (N/A), [N/A] 0.881 (N/A), [N/A]
SC3 0.921 (0.039), [0.816, 0.951] 0.891 (0.048), [0.760, 0.930]
DIMM-SC 0.988 (0.005), [0.975, 0.993] 0.971 (0.014), [0.935, 0.983]
Seurat 0.996 (N/A), [N/A] 0.976 (N/A), [N/A]
Multi-source joint clustering
K-means 0.718 (0.102), [0.562, 0.805] 0.755 (0.057), [0.659, 0.803]
TSCAN 0.563 (N/A), [N/A] 0.619 (N/A), [N/A]
SC3 0.798 (0.108), [0.620, 0.938] 0.788 (0.068), [0.690, 0.881]
MV-NMF 0.861 (0.115), [0.710, 0.989] 0.871 (0.086), [0.780, 0.969]
PC-MSC 0.703 (0.094), [0.573, 0.950] 0.798 (0.048), [0.705, 0.909]
BREM-SC 0.966 (0.025), [0.917, 0.985] 0.944 (0.039), [0.869, 0.970]

Robust performance with incomplete ADT data

In practice, it is common that researchers may have limited biological information of a specific cell type (especially rare cell type). As a result, they may fail to include some protein markers that are specifically differentially expressed in that rare cell type when they design their CITE-Seq experiments. In this case, researchers are not able to identify all the cell types with surface protein data only, and joint clustering methods will be preferred if they can appropriately incorporate RNA data to compensate the ‘loss’ of surface markers. To mimic the situation where the pre-designed protein markers cannot capture the characteristics of all cell types, we removed three protein markers (CD8A, CD16, CD127) in this human PBMC dataset and re-analyzed the remaining data. The results of this realistic simulation are summarized in Table 3. In general, the results of single source clustering analyses for all methods except for Seurat are much worse when some important protein markers are missing. BREM-SC, on the other hand, achieves the highest mean ARI and mean AMI compared with all other clustering methods (both single source and multi-source). In addition, the performance of BREM-SC is only slightly influenced with missing a few protein markers, which demonstrates the robustness of BREM-SC and its capability of properly handling the mRNA–protein association.

Table 3.

Performance (ARI and AMI) of clustering analysis for the subset of in-house human PBMC real dataset with the removal of three protein markers (CD8A, CD16, CD127)

ARI AMI
Mean (SD), [range] Mean (SD), [range]
Single source clustering - protein only
K-means (ADT) 0.783 (0.144), [0.671, 0.991] 0.839 (0.091), [0.766, 0.972]
TSCAN (ADT) 0.863 (N/A), [N/A] 0.811 (N/A), [N/A]
SC3 (ADT) 0.757 (0.106), [0.552, 0.840] 0.698 (0.074), [0.524, 0.752]
DIMM-SC (ADT) 0.820 (0.114), [0.681, 0.988] 0.842 (0.067), [0.774, 0.960]
Seurat (ADT) 0.929 (N/A), [N/A] 0.896 (N/A), [N/A]
Multi-source joint clustering
K-means 0.691 (0.131), [0.533, 0.800] 0.728 (0.073), [0.634, 0.789]
TSCAN 0.451 (N/A), [N/A] 0.578 (N/A), [N/A]
SC3 0.771 (0.103), [0.655, 0.948] 0.781 (0.070), [0.675, 0.899]
MV-NMF 0.517 (0.032), [0.485, 0.600] 0.644 (0.024), [0.628, 0.707]
PC-MSC 0.698 (0.092), [0.613, 0.946] 0.786 (0.046), [0.724, 0.893]
BREM-SC 0.925 (0.057), [0.816, 0.975] 0.913 (0.052), [0.803, 0.962]

DISCUSSION

When analyzing data from different data sources, ensemble clustering may be considered to integrate the separate clusters and determine an overall partition of cells that agrees the most with the source-specific cluster. However, most of ensemble clustering methods assume that the separate clusters are known in advance and do not inherently model the uncertainty (25). At the other extreme, a joint analysis that ignores the heterogeneity of the data may not capture important features that are specific to each data source. A fully integrative clustering approach is necessary to effectively combine the discriminatory power from transcriptome and protein measurements. Our probabilistic model provides a unified framework to jointly analyze multi-source data and take into account between-source correlation.

There are several noticeable limitations of this method. First, BREM-SC uses a computationally intensive MCMC algorithm, which is roughly linear with the number of cells and the number of genes used. In practice, it may result in a high computational cost when applying to large datasets (e.g. >10 k cells). To increase speed, BREM-SC has been implemented efficiently through vectorization to accommodate data in relatively large scale. We benchmarked the computational speed as well as memory consumption of BREM-SC in Supplementary Table S1. Specifically, we varied the number of cells, genes and MCMC iterations in the benchmark experiment to provide users a general guidance about computing time of BREM-SC. We also examined the convergence of MCMC in two real data analyses, and we found that the log-likelihood converges quite well within 500 interactions (shown in Supplementary Figure S5). We further developed an alternative approach to reduce running time by removing the random effects and assume independency between two data sources (illustrated in Equation 1). By assuming each of the two data follows separate multinomial distribution with Dirichlet prior, we derive the joint likelihood of two data and use E–M algorithm (26) to update parameters. We named this approach jointDIMM-SC, as it can be considered as an extension of DIMM-SC for joint analysis. This method is useful when there are in fact relatively low among-cell variabilities. We compared the performance of BREM-SC and jointDIMM-SC on the two real PBMC datasets. The estimated standard deviation of random effects for the public PBMC data is 0.7, and for the in-house data is 1.2. Supplementary Table S2 summarizes the ARI and AMI of each method performed on each dataset, by which we observed that the performance of jointDIMM-SC is always worse than BREM-SC, and such a difference is larger when the estimated variance of random effects is larger. Supplementary Figures S6 and S7 show the clustering results of jointDIMM-SC on public and in-house PBMC datasets, respectively, on UMAP plots. Although jointDIMM-SC performs worse than BREM-SC, still it could beat most of the competing methods currently available. JointDIMM-SC has also been implemented in our method as an optional choice. In addition to these, potential speed-up methods could include using graphics processing unit.

Another limitation is that BREM-SC model ignores the measurement errors and uncertainties buried in count matrices. Multiple factors such as drop-out event, mapping percentage and PCR efficiency are not considered in the current model. These limitations can be largely overcome by extending the method. We will explore these directions in the near future. Finally, although BREM-SC can allow some noise between the two data sources, still a key assumption that the two sources are consistent in some level in terms of cell type classification. Traditionally, biologists define cell type through the abundance of cell surface markers, so ADT data are preferred if available. They are in general consistent with corresponding transcriptomic data, which may provide additional information on defining sub-cell types. If the inconsistency is so large as to overwhelm the cluster-specific signals, then joint clustering methods such as BREM-SC could fail, and source specific grouping methods such as Bayesian consensus clustering (27) can be developed in future studies.

In summary, we provide a novel statistical method, BREM-SC, for joint clustering scRNA-Seq and CITE-Seq data, which facilitates rigorous statistical inference of cell population heterogeneity. Our model-based joint clustering method can be readily extended to accommodate more than two data sources or multi-source data from other fields. In addition, our well-designed in-house CITE-seq dataset will be valuable to the bioinformatics field for further method development. We are confident that BREM-SC will be highly useful for the fast-growing community of large-scale single cell analysis.

SOFTWARE AVAILABILITY

An R package of BREM-SC with a detailed tutorial is available on GitHub (https://github.com/tarot0410/BREMSC).

DATA AVAILABILITY

The published human PBMC CITE-Seq dataset that supports the finding of this study can be downloaded from 10× Genomics website (https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_10k_protein_v3). The in-house CITE-seq data is available in GEO (https://www.ncbi.nlm.nih.gov/geo/, GSE148665).

Supplementary Material

gkaa314_Supplemental_File

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

National Institutes of Health [R01HL137709 to W.C., K.C., P01AI106684 to W.C., U01DK062420 to W.C., R.D., UL1TR001857 to W.C., X.W.]. Funding for open access charge: National Institutes of Health [U01DK062420].

Conflict of interest statement. None declared.

REFERENCES

  • 1. Mimitou E.P., Cheng A., Montalbano A., Hao S., Stoeckius M., Legut M., Roush T., Herrera A., Papalexi E., Ouyang Z.. Multiplexed detection of proteins, transcriptomes, clonotypes and CRISPR perturbations in single cells. Nat. Methods. 2019; 16:409. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2. Peterson V.M., Zhang K.X., Kumar N., Wong J., Li L., Wilson D.C., Moore R., McClanahan T.K., Sadekova S., Klappenbach J.A.. Multiplexed quantification of proteins and transcripts in single cells. Nat. Biotechnol. 2017; 35:936. [DOI] [PubMed] [Google Scholar]
  • 3. Stoeckius M., Hafemeister C., Stephenson W., Houck-Loomis B., Chattopadhyay P.K., Swerdlow H., Satija R., Smibert P.. Simultaneous epitope and transcriptome measurement in single cells. Nat Methods. 2017; 14:865. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4. Zheng G.X., Terry J.M., Belgrader P., Ryvkin P., Bent Z.W., Wilson R., Ziraldo S.B., Wheeler T.D., McDermott G.P., Zhu J. et al.. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 2017; 8:14049. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5. Haider S., Pal R.. Integrated analysis of transcriptomic and proteomic data. Curr. Genomics. 2013; 14:91–110. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Chen G., Gharib T.G., Huang C.C., Taylor J.M., Misek D.E., Kardia S.L., Giordano T.J., Iannettoni M.D., Orringer M.B., Hanash S.M. et al.. Discordant protein and mRNA expression in lung adenocarcinomas. Mol. Cell. Proteomics. 2002; 1:304–313. [DOI] [PubMed] [Google Scholar]
  • 7. Wang B., Ramazzotti D., De Sano L., Zhu J., Pierson E., Batzoglou S.. SIMLR: a tool for large-scale genomic analyses by multi-kernel learning. Proteomics. 2018; 18:1700232. [DOI] [PubMed] [Google Scholar]
  • 8. Satija R., Farrell J.A., Gennert D., Schier A.F., Regev A.. Spatial reconstruction of single-cell gene expression data. Nat. Biotechnol. 2015; 33:495–502. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. Kiselev V.Y., Kirschner K., Schaub M.T., Andrews T., Yiu A., Chandra T., Natarajan K.N., Reik W., Barahona M., Green A.R. et al.. SC3: consensus clustering of single-cell RNA-seq data. Nat. Methods. 2017; 14:483–486. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. Sun Z., Wang T., Deng K., Wang X.F., Lafyatis R., Ding Y., Hu M., Chen W.. DIMM-SC: a Dirichlet mixture model for clustering droplet-based single cell transcriptomic data. Bioinformatics. 2018; 34:139–146. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11. Sun Z., Chen L., Xin H., Jiang Y., Huang Q., Cillo A.R., Tabib T., Kolls J.K., Bruno T.C., Lafyatis R.. A Bayesian mixture model for clustering droplet-based single-cell transcriptomic data from population studies. Nat. Commun. 2019; 10:1649. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Freytag S., Tian L., Lönnstedt I., Ng M., Bahlo M.. Comparison of clustering tools in R for medium-sized 10x Genomics single-cell RNA-sequencing data [version 2; peer review: 3 approved]. F1000Research. 2018; 7:1297. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13. Duò A., Robinson M.D., Soneson C.. A systematic performance evaluation of clustering methods for single-cell RNA-seq data [version 2; peer review: 2 approved]. F1000Research. 2018; 7:1141. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Ji Z., Ji H.. TSCAN: Pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis. Nucleic Acids Res. 2016; 44:e117. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15. Liu J., Wang C., Gao J., Han J.. Proceedings of the 2013 SIAM International Conference on Data Mining. 2013; SIAM; 252–260. [Google Scholar]
  • 16. Kumar A., Rai P., Daume H.. Advances in Neural Information Processing Systems. 2011; 1413–1421. [Google Scholar]
  • 17. Butler A., Hoffman P., Smibert P., Papalexi E., Satija R.. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 2018; 36:411–420. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18. Stuart T., Butler A., Hoffman P., Hafemeister C., Papalexi E., Mauck W.M. III, Hao Y., Stoeckius M., Smibert P., Satija R.. Comprehensive integration of single-cell data. Cell. 2019; 177:1888–1902. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19. Ronning G. Maximum-likelihood estimation of dirichlet distributions. J. Statist. Comput. Simulation. 1989; 32:215–221. [Google Scholar]
  • 20. Zappia L., Phipson B., Oshlack A.. Splatter: simulation of single-cell RNA sequencing data. Genome Biol. 2017; 18:174. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21. Rand W.M. Objective criteria for the evaluation of clustering methods. J. Am. Statist. Assoc. 1971; 66:846–850. [Google Scholar]
  • 22. Vinh N.X., Epps J., Bailey J.. Proceedings of the 26th Annual International Conference on Machine Learning. 2009; 1073–1080. [Google Scholar]
  • 23. Romano S., Vinh N.X., Bailey J., Verspoor K.. Adjusting for chance clustering comparison measures. J. Mach. Learn. Res. 2016; 17:4635–4666. [Google Scholar]
  • 24. McInnes L., Healy J., Melville J.. Umap: uniform manifold approximation and projection for dimension reduction. 2018;
  • 25. Wang H., Shan H, Banerjee A. Bayesian cluster ensembles. Stat. Anal. Data Mining. 2011; 4:54–70. [Google Scholar]
  • 26. Dempster A.P., Laird N.M., Rubin D.B.. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc.: Series B (Methodological). 1977; 39:1–38. [Google Scholar]
  • 27. Lock E.F., Dunson D.B.. Bayesian consensus clustering. Bioinformatics. 2013; 29:2610–2616. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

gkaa314_Supplemental_File

Data Availability Statement

The published human PBMC CITE-Seq dataset that supports the finding of this study can be downloaded from 10× Genomics website (https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_10k_protein_v3). The in-house CITE-seq data is available in GEO (https://www.ncbi.nlm.nih.gov/geo/, GSE148665).


Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press

RESOURCES