PeerJ. 2018 Sep 3;6:e5578. doi: 10.7717/peerj.5578

Estimating the frequency of multiplets in single-cell RNA sequencing from cell-mixing experiments

Jesse D Bloom
Editor: Claus Wilke
PMCID: PMC6126471  PMID: 30202659

Abstract

In single-cell RNA-sequencing, it is important to know the frequency at which the sequenced transcriptomes actually derive from multiple cells. A common method to estimate this multiplet frequency is to mix two different types of cells (e.g., human and mouse), and then determine how often the transcriptomes contain transcripts from both cell types. When the two cell types are mixed in equal proportion, the calculation of the multiplet frequency from the frequency of mixed transcriptomes is straightforward. But surprisingly, there are no published descriptions of how to calculate the multiplet frequency in the general case when the cell types are mixed unequally. Here, I derive equations to analytically calculate the multiplet frequency from the numbers of observed pure and mixed transcriptomes when two cell types are mixed in arbitrary proportions, under the assumption that the loading of cells into droplets or wells is Poisson.

Keywords: Single-cell RNA-seq, Multiplet, Doublet, 10× Chromium, scRNA-seq

Introduction

Many methods for single-cell RNA sequencing involve partitioning cells into barcoded droplets (Klein et al., 2015; Macosko et al., 2015; Zheng et al., 2017), wells (Gierahn et al., 2017), or combinations of wells (Cao et al., 2017). As long as the number of possible partitions exceeds the number of cells, then most partitions will contain at most one cell. However, some fraction of the non-empty partitions will contain multiple cells, and estimating this multiplet frequency is an important aspect of experimental quality control.

The most common method to determine the multiplet frequency is to mix two types of cells (e.g., human and mouse). During the analysis of the sequencing results, each non-empty partition can be identified as containing transcripts from one or both of the two cell types. Partitions that contain a substantial number of transcripts from both cell types must be multiplets. If the two cell types are mixed equally and the average number of cells per partition is low (so that most multiplets are doublets), then the multiplet frequency can be estimated as simply twice the fraction of non-empty partitions that contain a mix of cell types. The logic is that all the multiplets are doublets, and only half the doublets will have cells of both types (the others will have two cells of the same type). This approach has been used to estimate the multiplet frequency during the prototyping of most single-cell RNA sequencing methods (Klein et al., 2015; Macosko et al., 2015; Zheng et al., 2017; Gierahn et al., 2017; Cao et al., 2017).

However, in some cases the two cell types may be mixed in unequal proportions. Unequal mixing could arise simply from error during cell counting, or it could be an intentional aspect of experimental design (Rosenberg et al., 2018). For instance, if the researcher is actually interested in the human cells and simply wants to include an internal control to estimate the multiplet frequency during each new experiment, then (s)he may want to add fewer mouse cells so that most of the resulting data is for the human cells. In addition, when analyzing naturally occurring mixtures of cells of multiple types, the different cell types will usually be present in unequal proportions. But when the cells are mixed unequally, it is no longer valid to estimate the multiplet frequency as simply twice the fraction of non-empty partitions that contain a mix of both cell types. Surprisingly, I could find no published descriptions of how to calculate the multiplet frequency from unequal mixes of two cell types. Here, I remedy this gap in the literature by deriving the equations to compute the multiplet frequency when the cells are mixed in arbitrary proportions under the assumption that the number of cells per partition is Poisson distributed. This Poisson assumption is accurate when cells are loaded randomly and independently into partitions.
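A short worked equation shows why the factor of two is specific to equal mixing. As a simplifying illustration (not the derivation used in the Results), consider only doublets and assume that each cell is independently of type 1 with probability $p$. The probability that a doublet contains both cell types is then

$$\Pr(\text{mixed} \mid \text{doublet}) = 2p(1-p).$$

For $p = 1/2$ this equals $1/2$, so doubling the mixed fraction recovers the doublet fraction. For $p = 0.9$ it equals $0.18$, so doubling the mixed fraction would account for only 36% of the doublets. This sketch ignores multiplets with more than two cells, which the Poisson-based derivation handles.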

Methods

The LaTeX source for this paper, the Jupyter notebooks that implement the calculations, and all materials associated with the writing and review of the paper are publicly available in a GitHub repository at https://github.com/jbloomlab/multiplet_freq. The Jupyter notebooks are also available in Files S1 and S3, and HTML renderings of the notebooks are in Files S2 and S4.

Results

Derivation of multiplet frequency from observed numbers of pure and mixed-cell droplets

Consider the case in which cells of two types (e.g., human and mouse) are distributed into individual barcoded droplets, although the same logic applies if the cells are distributed into barcoded wells or combinations of wells. Assume the sequencing data have been analyzed so that each non-empty droplet can be classified as containing at least one cell of type 1, at least one cell of type 2, or cells of both types. I will refer to the number of droplets in each of these three groupings as N1, N2, and N1,2, respectively. Note that these groupings overlap: a droplet containing both cell types is counted in N1, in N2, and in N1,2, so the number of non-empty droplets is N1 + N2 − N1,2. For instance, the 10× cellranger pipeline (version 2.1.1) returns these numbers as the “Estimated Number of Cell Partitions.”

The only assumption of the derivation is that the number of cells per droplet is Poisson distributed. Let μ1 be the average number of cells of type 1 per droplet, and μ2 be the average number of cells of type 2 per droplet. The average number of cells of any type per droplet is then μ1 + μ2. Therefore, the probability that a droplet contains at least one cell of any type is

$$\Pr(c \geq 1) = 1 - \Pr(c = 0) = 1 - e^{-\mu_1 - \mu_2}. \tag{1}$$

Likewise, the probability that a droplet contains multiple cells of any type (i.e., is a multiplet) is

$$\Pr(c \geq 2) = 1 - \Pr(c = 0) - \Pr(c = 1) = 1 - e^{-\mu_1 - \mu_2} - (\mu_1 + \mu_2)\, e^{-\mu_1 - \mu_2}. \tag{2}$$

The multiplet frequency M is simply the probability that a droplet with at least one cell actually contains multiple cells, which is

$$M = \frac{\Pr(c \geq 2)}{\Pr(c \geq 1)} = \frac{1 - (1 + \mu_1 + \mu_2)\, e^{-\mu_1 - \mu_2}}{1 - e^{-\mu_1 - \mu_2}}. \tag{3}$$
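As a minimal sketch, Eq. (3) can be evaluated directly once the two means are known. The function name `multiplet_freq_from_means` is hypothetical and is not taken from the paper's notebooks.

```python
import math


def multiplet_freq_from_means(mu1: float, mu2: float) -> float:
    """Evaluate Eq. (3): fraction of non-empty droplets with two or more cells.

    mu1 and mu2 are the mean numbers of type-1 and type-2 cells per droplet.
    """
    mu = mu1 + mu2  # mean number of cells of any type per droplet
    if mu <= 0:
        raise ValueError("the mean number of cells per droplet must be positive")
    # Pr(c >= 1) = 1 - exp(-mu); expm1 keeps precision when mu is small.
    p_nonempty = -math.expm1(-mu)
    # Pr(c >= 2) = 1 - exp(-mu) - mu * exp(-mu)
    p_multiplet = p_nonempty - mu * math.exp(-mu)
    return p_multiplet / p_nonempty


# At low loading the multiplet frequency is close to mu / 2.
print(multiplet_freq_from_means(0.005, 0.005))
```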

However, evaluating this expression for M requires the values of μ1 and μ2.

We can write down equations for μ1 and μ2 by again using the fact that the number of cells per droplet is Poisson distributed. Specifically, if N is the total number of droplets (empty and non-empty) and $c_1$ is the number of cells of type 1 in a droplet, then the expected number of droplets that have at least one cell of type 1 is $N \times \Pr(c_1 \geq 1) = N(1 - e^{-\mu_1})$. The observed number of droplets with at least one cell of type 1 is N1, so setting the observed number equal to the expected number gives us an equation for μ1,

$$N_1 = N\left(1 - e^{-\mu_1}\right). \tag{4}$$

This equation is easily solved for μ1 to yield

$$\mu_1 = -\ln\left(\frac{N - N_1}{N}\right), \tag{5}$$

and likewise for μ2,

$$\mu_2 = -\ln\left(\frac{N - N_2}{N}\right). \tag{6}$$

Equations (5) and (6) give us a way to determine the values (μ1 and μ2) needed to calculate the multiplet frequency (Eq. (3)) in terms of the experimental observables N1 and N2. Unfortunately, these two equations also require knowledge of the total (empty and non-empty) number of droplets N, which is not directly observable from the sequencing data.
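Equations (5) and (6) share the same form, so a single helper covers both. The sketch below uses the hypothetical name `mean_cells_per_droplet` and assumes the total droplet count N is already known.

```python
import math


def mean_cells_per_droplet(n_with_type: float, n_total: float) -> float:
    """Eqs. (5) and (6): mu_i = -ln((N - N_i) / N).

    n_with_type is the number of droplets with at least one cell of the type
    (N_1 or N_2); n_total is the total number of droplets N, empty or not.
    """
    if not 0 <= n_with_type < n_total:
        raise ValueError("need 0 <= N_i < N")
    # log1p(-x) computes ln(1 - x) accurately when x = N_i / N is small.
    return -math.log1p(-n_with_type / n_total)


# Example: 2,500 of 6,250 droplets contain the cell type, so mu = -ln(0.6).
print(mean_cells_per_droplet(2500, 6250))
```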

However, we can take advantage of another relationship to calculate N. The fraction of all (empty and non-empty) droplets that contain cells of both types is $N_{1,2}/N$. Because the two cell types are loaded independently, this fraction is simply the product of the probability that a droplet contains at least one cell of type 1 and the probability that it contains at least one cell of type 2, which in mathematical terms can be stated as $\Pr(c_1 \geq 1 \wedge c_2 \geq 1) = \Pr(c_1 \geq 1) \times \Pr(c_2 \geq 1)$. Therefore,

$$\frac{N_{1,2}}{N} = \frac{N_1}{N} \times \frac{N_2}{N}. \tag{7}$$

This equation can be solved to give

$$N = \frac{N_1 N_2}{N_{1,2}}, \tag{8}$$

which can be completely evaluated in terms of the experimental observables. Equations (5), (6), and (8) can be used to calculate μ1 and μ2 in terms of the experimental observables, and those results used to calculate the multiplet frequency via Eq. (3). This provides an analytic solution for the multiplet frequency in terms of the three experimental observables.
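The full chain from observables to multiplet frequency can be sketched as follows. This is an illustrative reimplementation of Eqs. (3), (5), (6), and (8), not the code from the paper's notebooks; the function name `multiplet_freq` and the input checks are assumptions.

```python
import math


def multiplet_freq(n1: int, n2: int, n12: int) -> float:
    """Multiplet frequency from the three observable droplet counts.

    n1:  droplets with at least one type-1 cell (mixed droplets included)
    n2:  droplets with at least one type-2 cell (mixed droplets included)
    n12: droplets with cells of both types
    """
    if not 0 < n12 < min(n1, n2):
        raise ValueError("need 0 < n12 < min(n1, n2)")
    n_total = n1 * n2 / n12                  # Eq. (8): total droplets N
    mu1 = -math.log1p(-n1 / n_total)         # Eq. (5)
    mu2 = -math.log1p(-n2 / n_total)         # Eq. (6)
    mu = mu1 + mu2
    p_nonempty = -math.expm1(-mu)            # Eq. (1)
    p_multiplet = p_nonempty - mu * math.exp(-mu)  # Eq. (2)
    return p_multiplet / p_nonempty          # Eq. (3)


print(round(multiplet_freq(3950, 150, 100), 3))
```

With the counts from the final row of Table 2 (3,950 human droplets, 150 mouse droplets, 100 mixed droplets), this gives a multiplet frequency of about 0.459.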

Implementation and example calculations

A simple function to perform the calculations described in the previous subsection is implemented in Python in the Jupyter notebook found at https://github.com/jbloomlab/multiplet_freq/blob/master/calcmultiplet.ipynb, and in R in the Jupyter notebook found at https://github.com/jbloomlab/multiplet_freq/blob/master/calcmultiplet_R.ipynb (see also Files S1–S4). To illustrate the calculations, I used this function to calculate the multiplet frequency for hypothetical data.

First, consider hypothetical data in which the two types of cells are mixed in equal proportions. Prior papers have approximated the multiplet frequency from such experiments as simply twice the fraction of non-empty droplets that contain cells of both types (Klein et al., 2015; Macosko et al., 2015; Zheng et al., 2017; Cao et al., 2017), which is $\frac{2 N_{1,2}}{N_1 + N_2 - N_{1,2}}$ in the notation defined in the previous subsection. Table 1 shows that the exact equation derived in the previous subsection gives very similar results to this approximate method as long as the multiplet frequency is low. When the multiplet frequency becomes high, the approximate method starts to overestimate the true multiplet frequency, since it fails to account for the fact that some multiplets will contain more than two cells.

Table 1. Multiplet frequencies for three hypothetical experiments in which human and mouse cells are mixed equally.

| Experiment | Human droplets | Mouse droplets | Non-empty droplets | Human and mouse droplets | Multiplet freq | Twice cross celltype freq |
|---|---|---|---|---|---|---|
| 1 | 2,005 | 2,005 | 4,000 | 10 | 0.005 | 0.005 |
| 2 | 2,050 | 2,050 | 4,000 | 100 | 0.049 | 0.050 |
| 3 | 2,500 | 2,500 | 4,000 | 1,000 | 0.425 | 0.500 |

Notes:

The multiplet frequencies calculated using the exact method described here (column multiplet freq) are very similar to those obtained simply by multiplying by two the fraction of non-empty droplets that contain cells of both types (column twice cross celltype freq). However, the two methods are slightly different at higher multiplet frequencies, since the latter method fails to account for multiplets that have more than two cells.
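The approximate column of Table 1 is simple to reproduce. The sketch below computes twice the mixed fraction of non-empty droplets for the three hypothetical experiments; the function name `twice_mixed_fraction` is illustrative only.

```python
def twice_mixed_fraction(n1: int, n2: int, n12: int) -> float:
    """Equal-mixing approximation: 2 * N_{1,2} / (N_1 + N_2 - N_{1,2})."""
    # Mixed droplets are counted in both n1 and n2, so subtract them once.
    n_nonempty = n1 + n2 - n12
    return 2 * n12 / n_nonempty


# (human droplets, mouse droplets, human-and-mouse droplets) from Table 1
table1 = [(2005, 2005, 10), (2050, 2050, 100), (2500, 2500, 1000)]
approx = [twice_mixed_fraction(*row) for row in table1]
for row, value in zip(table1, approx):
    print(row, round(value, 3))
```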

Next, consider hypothetical data in which the two types of cells are mixed in unequal proportions. Table 2 shows the multiplet frequencies for several such experiments. An interesting aspect of the results is that at high multiplet frequencies and very unequal cell proportions, the multiplet frequency is substantially lower than the fraction of droplets containing the rarer cell type that contain a mix of both cell types. The reason is that multiplets (particularly higher-order ones) become more and more likely to contain at least one cell of the rarer type relative to droplets that contain only one cell. For instance, in the final experiment in Table 2, two-thirds of the droplets containing mouse cells have a mix of both cell types, yet less than half the non-empty droplets are multiplets (the multiplet frequency is 0.459). This somewhat non-intuitive result illustrates the importance of using the correct mathematical relationship to calculate the multiplet frequency when cell types are mixed unequally.

Table 2. Multiplet frequencies for five hypothetical experiments in which human and mouse cells are mixed unequally.

| Experiment | Human droplets | Mouse droplets | Non-empty droplets | Human and mouse droplets | Multiplet freq |
|---|---|---|---|---|---|
| 1 | 2,050 | 2,050 | 4,000 | 100 | 0.049 |
| 2 | 3,050 | 1,050 | 4,000 | 100 | 0.065 |
| 3 | 3,550 | 550 | 4,000 | 100 | 0.110 |
| 4 | 3,850 | 250 | 4,000 | 100 | 0.245 |
| 5 | 3,950 | 150 | 4,000 | 100 | 0.459 |
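The contrast discussed for Table 2 can be checked with a compact sketch that prints, for each hypothetical experiment, the fraction of mouse-containing droplets that are mixed alongside the multiplet frequency. It uses the identity $e^{-\mu_1-\mu_2} = (1 - N_1/N)(1 - N_2/N)$, which follows from Eqs. (5) and (6); the function name is an assumption and the code is not the paper's notebook implementation.

```python
import math


def multiplet_freq(n1: int, n2: int, n12: int) -> float:
    """Multiplet frequency via Eqs. (3) and (8), written in terms of Pr(c = 0)."""
    n_total = n1 * n2 / n12                               # Eq. (8)
    p_empty = (1 - n1 / n_total) * (1 - n2 / n_total)     # exp(-mu1 - mu2)
    mu = -math.log(p_empty)                               # mu1 + mu2
    return (1 - (1 + mu) * p_empty) / (1 - p_empty)       # Eq. (3)


# (human droplets, mouse droplets, human-and-mouse droplets) from Table 2
table2 = [(2050, 2050, 100), (3050, 1050, 100), (3550, 550, 100),
          (3850, 250, 100), (3950, 150, 100)]
results = [multiplet_freq(*row) for row in table2]
for (n1, n2, n12), freq in zip(table2, results):
    mixed_among_mouse = n12 / n2  # fraction of mouse droplets that are mixed
    print(f"mixed among mouse: {mixed_among_mouse:.3f}  multiplet freq: {freq:.3f}")
```

In the last row, two-thirds of the mouse-containing droplets are mixed while the multiplet frequency stays below one half.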

Conclusions

I have described how to calculate the multiplet frequency in single-cell RNA sequencing experiments in which two cell types are mixed in arbitrary proportions. It is important to note that this calculation requires that the sequencing data have already been analyzed to determine whether each partition contains a non-negligible number of transcripts from each cell type, but many common analysis programs (such as the 10× cellranger pipeline) already do this.

The calculation also assumes that the number of cells per droplet follows a Poisson distribution. While many single-cell RNA sequencing methods are designed to partition cells in a way that is consistent with this assumption (Klein et al., 2015; Macosko et al., 2015; Zheng et al., 2017; Gierahn et al., 2017; Cao et al., 2017), it is possible that cell clumping or other factors could bias certain partitions to contain more cells than expected under a Poisson distribution. In such a scenario, the calculations in this paper would overestimate the true multiplet frequency if the clumping is equally likely across cell types, but could underestimate the true multiplet frequency if intra-cell-type clumping is more likely than inter-cell-type clumping.
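One way to sanity-check the estimate when the Poisson assumption does hold is a small simulation. The sketch below is an illustration and not part of the paper's notebooks: it loads droplets with independent Poisson numbers of each cell type, then compares the true multiplet frequency with the value estimated from the three observable counts. The means (0.3 and 0.05), the droplet count, the seed, and all function names are arbitrary assumptions, and the sketch does not model clumping.

```python
import math
import random


def sample_poisson(rng: random.Random, lam: float) -> int:
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def simulate(mu1: float, mu2: float, n_droplets: int, seed: int = 0):
    """Return (true multiplet frequency, estimate from observable counts)."""
    rng = random.Random(seed)
    n1 = n2 = n12 = nonempty = multiplets = 0
    for _ in range(n_droplets):
        c1 = sample_poisson(rng, mu1)  # type-1 cells in this droplet
        c2 = sample_poisson(rng, mu2)  # type-2 cells in this droplet
        n1 += c1 > 0
        n2 += c2 > 0
        n12 += c1 > 0 and c2 > 0
        nonempty += c1 + c2 > 0
        multiplets += c1 + c2 >= 2
    true_freq = multiplets / nonempty
    n_total = n1 * n2 / n12                             # Eq. (8)
    p_empty = (1 - n1 / n_total) * (1 - n2 / n_total)   # exp(-mu1 - mu2)
    mu = -math.log(p_empty)
    estimate = (1 - (1 + mu) * p_empty) / (1 - p_empty)  # Eq. (3)
    return true_freq, estimate


true_freq, estimate = simulate(0.3, 0.05, 200_000)
print(f"true: {true_freq:.3f}  estimated: {estimate:.3f}")
```

Under these assumptions the two values should agree up to sampling noise. Probing the clumping scenarios described above would require replacing the independent Poisson draws with a different loading model.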

Finally, the approach in this paper only calculates the multiplet frequency; it does not actually identify the multiplets so that they can be removed from downstream analyses. For that purpose, other more sophisticated approaches have been developed (Ilicic et al., 2016; Stoeckius et al., 2017; Kang et al., 2018; Wolock, Lopez & Klein, 2018; DePasquale et al., 2018). Nonetheless, simply calculating the multiplet frequency from the data returned by standard pipelines such as 10× cellranger is important for many purposes, and the results here enable that to be done regardless of the proportions at which the cell types are mixed.

Supplemental Information

Supplemental Information 1. Supplemental file 1.

A Jupyter notebook that implements the calculations in Python, and does the calculations for the examples shown in the tables in this paper.

DOI: 10.7717/peerj.5578/supp-1
Supplemental Information 2. Supplemental file 2.

This file contains an HTML rendering of the Jupyter notebook in Supplemental file 1.

DOI: 10.7717/peerj.5578/supp-2
Supplemental Information 3. Supplemental file 3.

A Jupyter notebook that implements the calculations in R, and does the calculations for the examples shown in the tables in this paper.

DOI: 10.7717/peerj.5578/supp-3
Supplemental Information 4. Supplemental file 4.

This file contains an HTML rendering of the Jupyter notebook in Supplemental file 3.

DOI: 10.7717/peerj.5578/supp-4

Funding Statement

This work was supported by grants R01 GM102198 and R01 AI127893 from the National Institutes of Health. The work of the author is also supported in part by a Faculty Scholars grant from HHMI and the Simons Foundation. There was no additional external funding received for this study. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Additional Information and Declarations

Competing Interests

The author declares that he has no competing interests.

Author Contributions

Jesse D. Bloom conceived and designed the experiments, analyzed the data, contributed reagents/materials/analysis tools, prepared figures and/or tables, authored or reviewed drafts of the paper, approved the final draft.

Data Availability

The following information was supplied regarding data availability:

GitHub: https://github.com/jbloomlab/multiplet_freq.

References

  • Cao J, Packer JS, Ramani V, Cusanovich DA, Huynh C, Daza R, Qiu X, Lee C, Furlan SN, Steemers FJ, Adey A, Waterston RH, Trapnell C, Shendure J. Comprehensive single-cell transcriptional profiling of a multicellular organism. Science. 2017;357(6352):661–667. doi: 10.1126/science.aam8940.
  • DePasquale EAK, Schnell DJ, Valiente I, Blaxall BC, Grimes HL, Singh H, Salomonis N. DoubletDecon: cell-state aware removal of single-cell RNA-seq doublets. bioRxiv preprint. 2018:364810. doi: 10.1101/364810.
  • Gierahn TM, Wadsworth MH II, Hughes TK, Bryson BD, Butler A, Satija R, Fortune S, Love JC, Shalek AK. Seq-Well: portable, low-cost RNA sequencing of single cells at high throughput. Nature Methods. 2017;14(4):395–398. doi: 10.1038/nmeth.4179.
  • Ilicic T, Kim JK, Kolodziejczyk AA, Bagger FO, McCarthy DJ, Marioni JC, Teichmann SA. Classification of low quality cells from single-cell RNA-seq data. Genome Biology. 2016;17(1):29. doi: 10.1186/s13059-016-0888-1.
  • Kang HM, Subramaniam M, Targ S, Nguyen M, Maliskova L, McCarthy E, Wan E, Wong S, Byrnes L, Lanata CM, Gate RE, Mostafavi S, Marson A, Zaitlin N, Criswell LA, Ye CJ. Multiplexed droplet single-cell RNA-sequencing using natural genetic variation. Nature Biotechnology. 2018;36(1):89–94. doi: 10.1038/nbt.4042.
  • Klein AM, Mazutis L, Akartuna I, Tallapragada N, Veres A, Li V, Peshkin L, Weitz DA, Kirschner MW. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell. 2015;161(5):1187–1201. doi: 10.1016/j.cell.2015.04.044.
  • Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, Tirosh I, Bialas AR, Kamitaki N, Martersteck EM, Trombetta JJ, Weitz DA, Sanes JR, Shalek AK, Regev A, McCarroll SA. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell. 2015;161(5):1202–1214. doi: 10.1016/j.cell.2015.05.002.
  • Rosenberg AB, Roco CM, Muscat RA, Kuchina A, Sample P, Yao Z, Graybuck LT, Peeler DJ, Mukherjee S, Chen W, Pun SH, Sellers DL, Tasic B, Seelig G. Single-cell profiling of the developing mouse brain and spinal cord with split-pool barcoding. Science. 2018;360(6385):176–182. doi: 10.1126/science.aam8999.
  • Stoeckius M, Zheng S, Houck-Loomis B, Hao S, Yeung B, Smibert P, Satija R. Cell “hashing” with barcoded antibodies enables multiplexing and doublet detection for single cell genomics. bioRxiv preprint. 2017:237693. doi: 10.1101/237693.
  • Wolock SL, Lopez R, Klein AM. Scrublet: computational identification of cell doublets in single-cell transcriptomic data. bioRxiv preprint. 2018:357368. doi: 10.1101/357368.
  • Zheng GXY, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, Ziraldo SB, Wheeler TD, McDermott GP, Zhu J, Gregory MT, Shuga J, Montesclaros L, Underwood JG, Masquelier DA, Nishimura SY, Schnall-Levin M, Wyatt PW, Hindson CM, Bharadwaj R, Wong A, Ness KD, Beppu LW, Deeg HJ, McFarland C, Loeb KR, Valente WJ, Ericson NG, Stevens EA, Radich JP, Mikkelsen TS, Hindson BJ, Bielas JH. Massively parallel digital transcriptional profiling of single cells. Nature Communications. 2017;8:14049. doi: 10.1038/ncomms14049.

Articles from PeerJ are provided here courtesy of PeerJ, Inc
