Abstract
An important challenge in pre-processing data from droplet-based single-cell RNA sequencing protocols is distinguishing barcodes associated with real cells from those binding background reads. Existing methods test barcodes individually and consequently do not leverage the strong cell-to-cell correlation present in most datasets. To improve cell detection, we introduce CB2, a cluster-based approach for distinguishing real cells from background barcodes. As demonstrated in simulated and case study datasets, CB2 has increased power for identifying real cells which allows for the identification of novel subpopulations and improves the precision of downstream analyses.
Keywords: Single-cell RNA-seq, Droplet-based protocols, Cell detection
Background
Droplet-based single-cell RNA sequencing (scRNA-seq) [1] is a powerful and widely used approach for profiling genome-wide gene expression in individual cells. Current commercial droplet-based technologies utilize gel beads [2], each containing oligonucleotide indexes made up of bead-specific barcodes combined with unique molecular identifiers (UMIs) [3] and oligo-dT tags to prime polyadenylated RNA. Single cells of interest are combined with reagents in one channel of a microfluidic chip, and gel beads in another, to form gel beads in emulsion, or GEMs. Oligonucleotide indexes bind polyadenylated RNA within each GEM reaction vesicle before gel beads are dissolved releasing the bound oligos into solution for reverse transcription. By design, each resulting cDNA molecule contains a UMI and a GEM-specific barcode. Indexed cDNA is pooled for PCR amplification and sequencing resulting in a data matrix of UMI counts for each barcode (Additional file 1: Figure S1).
Ideally, each barcode will tag mRNA from an individual cell, but this is often not the case in practice. In most datasets, more than 90% of GEMs do not contain viable cells, but rather contain ambient RNA excreted by cells in solution or as a product of cell lysis [2]. As a result, an important challenge in pre-processing droplet-based scRNA-seq data is distinguishing those barcodes corresponding to real cells from those binding ambient, or background, RNA.
Early methods to address this challenge defined real cells as those barcodes with total read counts exceeding some threshold [1, 2]. Such methods are suboptimal as they discard small cells as well as those expressing relatively few genes. To address this, Lun et al. [4] developed EmptyDrops (ED), an approach to identify individual barcodes with distributions varying from a background distribution. Similar to previous approaches [1, 2], ED identifies an upper threshold and defines real cells as those barcodes with counts above the threshold. As a second step, ED uses all barcodes with counts below a lower threshold to estimate a background distribution of ambient RNA against which remaining barcodes are tested. Those having expression profiles significantly different from the background distribution are deemed real cells. The ED approach is current state-of-the-art in the field. However, given that ED performs tests for each barcode individually, it does not leverage the strong correlation observed between cells and, consequently, compromises power for identifying cells in many datasets.
To increase the power for identifying real cells, we propose CB2, a cluster-based approach for distinguishing real cells from background barcodes in droplet-based scRNA-seq experiments. CB2 extends the ED framework by introducing a clustering step that groups similar barcodes, then conducts a statistical test to identify groups with expression distributions that vary from the background (Fig. 1, Additional file 1: Figure S2). CB2 is implemented in the R package scCB2.
Results
CB2 was evaluated and compared with ED on simulated and case study data. In SIM IA, counts are generated as in Lun et al. [4]. Briefly, given an input dataset, an inflection point dividing low from high-count barcodes is determined. Low count barcodes are pooled to estimate the background distribution. Background barcodes are sampled from this distribution to match the total number and size of barcodes below the inflection point in the input dataset. Six thousand real cells are then generated as follows. First, 2000 barcodes are randomly sampled from the high-count barcodes (referred to as G1 cells [4]); a second set of 2000 high-count barcodes is sampled and then downsampled by 90% to give G2 cells; the third set (G1.5) is obtained by sampling 2000 barcodes from the high-count range and downsampling by 50%. We note that in Lun et al. [4], only G1 and G2 cells were considered. Here, G1.5 cells were added to better reflect real data. Additional file 1: Figure S3 shows increased power of CB2 with well controlled false discovery rate (FDR) for the 6 datasets considered in Lun et al. [4] as well as 4 additional datasets. SIM IB, also considered by Lun et al. [4], is similar to SIM IA, but in SIM IB 10% of the genes in the real cells are shuffled making the real cells more different from the background and therefore easier to identify (Additional file 1: Figure S4). Additional file 1: Figure S5 shows the increased power of CB2 is maintained.
To further evaluate CB2, we applied CB2 and ED to the ten case study datasets used to generate the simulated data as well as one additional dataset considered in the ED case study and compared the number of cells identified in common as well as those uniquely identified by each approach. Additional file 2: Table S1 shows that CB2 finds 24% more cells on average (range 4–81%). Of the extra cells identified, 88% on average (range 44–100%) add to existing subpopulations. The remaining 12% (range 0–56%) make up novel subpopulations.
As an example, Fig. 2 and Additional file 1: Figure S6 show results from the Alzheimer data [5] where CB2 identifies 18% more cells. A detailed look at the unique CB2 identifications suggests that the extra cells identified are not false positives, but rather they add to existing excitatory neuron and inhibitory neuron subpopulations, and also reveal a novel subpopulation consisting of 209 cells. Specifically, Fig. 2 b and c show distribution plots and an expression heatmap of the 100 genes having the highest average expression in Subpop1 (the largest subpopulation) for cells identified by both CB2 and ED as well as those identified uniquely by CB2. As shown, cells uniquely identified by CB2 have a distribution similar to other cells, and they differ from the background. Using the marker genes from Mathys et al. [5], Fig. 2d and Additional file 1: Figure S6(b) suggest that cells identified uniquely by CB2 in Subpops 1–4 are neurons, as they show relatively high expression of neuron marker genes SYT1, SNAP25, and GRIN1. More specifically, the CB2 cells in Subpops 1–2 exhibit high expression of excitatory neuronal markers whereas the cells in Subpops 3–4 appear to be inhibitory neurons (Additional file 1: Figure S6(c) and (d)). The novel subpopulation (Subpop5) uniquely shows high expression of both oligodendrocyte and astrocyte marker genes, suggesting that this group may be mixed phenotype glial cells [6] (Additional file 1: Figure S6(e) and (f)).
By increasing the number of real cells identified, CB2 also improves the power to differentiate Alzheimer’s patients from controls. Specifically, Mathys et al. [5] profiled expression from the prefrontal cortex of 24 AD-pathology patients as well as 24 age-matched controls, and they validated differentially expressed genes in different cell types, including 9 genes in excitatory neurons and 9 in inhibitory neurons. Additional file 1: Figure S7 shows that by identifying additional cells, CB2 improves downstream differential expression analysis by resulting in more significant p values and stronger fold changes.
In a second case study (PBMC8K), CB2 increases the number of cells identified across six subpopulations by over 80% (Additional file 2: Table S1). Results are shown in Fig. 3 and Additional file 1: Figure S8. Similar to the Alzheimer’s data analysis, Additional file 1: Figure S8(b) and (c) show that cells identified uniquely by CB2 in Subpop1 have an expression profile that is similar to other cells and differs from the background. Figure 3 provides a detailed look at marker gene expression for the well-characterized PBMC8K cells using markers considered in Zheng et al. [2]. As shown in Fig. 3b, CB2 identifies additional CD14+ monocytes, T cells, B cells, and megakaryocytes. Results from two additional datasets are shown in Additional file 1: Figure S9–S10.
Discussion
Taken together, the results presented here demonstrate that CB2 provides a powerful approach for distinguishing real cells from background barcodes which will increase the number of cells identified in existing cell subpopulations in most datasets and may facilitate the identification of novel subpopulations. While advantages are expected in many settings, users will benefit from the following considerations. CB2 does not test for doublets or multiplets, and consequently, some of the high-count identifications may consist of two or more cells. Methods for identifying multiplets such as Scrublet [7], DoubletDecon [8], or DoubletFinder [9] may prove useful after applying CB2. A second important post-processing step is filtering based on mitochondrial expression. As noted in Lun et al. [4], any method for distinguishing cells from background barcodes is technically correct in identifying low-quality cells given that damaged cells exhibit expression profiles that differ from the background. Specifically, mitochondrial gene expression is often high in damaged cells; an example is shown in Subpopulation 5 of the PBMC8K data (Fig. 3b). Such cells are typically not of interest in downstream analysis and should therefore be removed. The GetCellMat function in R/scCB2 may be used toward this end.
Conclusions
Droplet-based scRNA-seq technologies provide unprecedented opportunity to address biological questions, but efficient pre-processing is required to maximize the information obtained in an experiment. CB2 allows investigators to maximize the number of cells retained and consequently to increase the power and precision of downstream analysis.
Methods
Versions
For cell identification with R/scCB2 0.99.12 and R/DropletUtils 1.5.4 [4, 10], the latest version of R [11] was used: 3.7-devel (2019-07-17 r76847). Other packages are not yet compatible or not stable with the R developers version, and so for scran 1.12.1 [12], Seurat 3.1.0 [13, 14], and ggplot2 3.2.1 [15], R 3.6.0 (2019-04-24 r76423) was used.
CB2
As CB2 relies on ED, we briefly review the ED approach before detailing the clustering test introduced in CB2. ED expects as input a G × B feature-by-barcode matrix with G features (for simplicity, we refer to features as genes) and B barcodes. Barcodes having zero counts for all genes are filtered out, and the remaining barcodes are divided into three groups based on the sum of gene expression (UMI) counts within a barcode. The background group, B0, contains all barcodes with counts less than or equal to a pre-defined lower threshold (defaults to 100); the high-count barcodes, B2, contain barcodes with counts exceeding an upper threshold (defaults to knee point); the remaining barcodes (B1) are tested (Additional file 1: Figure S2).
ED assumes that counts from a background barcode are distributed as Dirichlet-Multinomial with probability vector estimated by averaging the counts in B0 and applying the Good-Turing algorithm [16] to ensure that all probabilities are non-zero, denoted as . For a barcode b ∈ B1, ED tests against the alternative using the log-likelihood under as the test statistic. A Monte-Carlo p value is calculated via simulating Dirichlet-Multinomial barcodes of size ∣b∣ under and calculating the proportion of simulated barcodes having a test statistic more extreme than (or equal to) b’s. The false discovery rate is controlled using the Benjamini-Hochberg procedure [17].
CB2 follows ED by filtering out genes with zero counts and dividing the remaining barcodes into three groups. However, instead of testing all barcodes from B1 individually, CB2 first clusters barcodes and then tests tight clusters to identify those that differ from the background. As in methods for genome-wide association studies (Mieth et al. 2016 [18]), gene co-expression network analysis (Botía et al., 2017 [19]), and de novo transcriptome analysis (Malik et al., 2018 [20]), clustering prior to testing increases power by reducing the total number of tests and increasing the signal to noise ratio. CB2 proceeds as follows:
-
Barcodes grouped by size. CB2 orders barcodes in B1 by total countswhere Xb denotes the count vector of barcode b, ∣Xb∣ denotes the total UMI count of barcode b, and |B1| denotes the number of barcodes in B1. Groups of size S (defaults to 1000 in R/scCB2) are constructed consisting of barcodes ranging in size from smallest to largest:
where is rounded up if not an integer. If , barcodes in B1K are merged with those in B1(K − 1). Sorting barcodes by size reduces bias in the clustering and testing steps that follow.
-
2.
Barcodes clustered within group: Barcodes within each group B1j are clustered using hierarchical clustering with pairwise Pearson correlation as the similarity metric. A cluster is considered tight if the average within-cluster pairwise Pearson correlation exceeds a data-driven threshold. Tight clusters are retained for further analysis as described in step 3, below. To determine thresholds, ten tight clusters of varying size are simulated by generating 100 samples from a multinomial distribution with parameters (N, p) where N ranges from 100 to 1000 in increments of size 100. This range is chosen as we found little variation in thresholds for barcode sizes exceeding 1000; p is set to either or , whichever has larger Shannon entropy [21] as the distribution with larger entropy is less affected by outliers. For each simulated cluster C, the threshold κC is defined by its average pairwise Pearson correlation. A cluster is considered tight if the average within-cluster pairwise Pearson correlation exceeds κC for the simulated cluster of closest size.
-
3.Tight clusters tested: For each tight cluster C, we conduct a Monte-Carlo test to assess dissimilarity from the background. Pairwise Pearson correlations are calculated between every barcode in C and ; the test statistic for cluster C, Tc, is defined to be the median of these correlations. Similar to ED, to simulate background barcodes, we sample barcodes from a multinomial (N; ) where N is the size of the barcode giving Tc. The Monte-Carlo p value is:
where is the Pearson correlation between and (M defaults to 1000 in R/scCB2). Monte-Carlo p values are calculated for each cluster followed by Benjamini-Hochberg [17] to control the FDR. All barcodes within a significant cluster are identified as real cells.
-
4.
Individual barcodes tested: Barcodes that were not included in a tight cluster in Step 2 as well as those in a tight cluster that were not found to be significant in Step 3 are tested individually using ED. It is important to note that some of the barcodes identified in this step do not overlap with identifications made when ED is applied to the full set of barcodes given differences in the rates of real cells to background barcodes and differences in error rate control.
Simulations
Counts are generated as in Lun et al. [4]. As detailed there, each simulation requires an input dataset. We constructed simulations from 10 datasets: Alzheimer [5], PBMC8K, PBMC33K, mbrain1K, mbrain9K, PanT4K, MALT, PBMC4K, jurkat, and T293 (Additional file 2: Table S2). For each input dataset, the inflection point of the UMI count by sorted barcode plot is used to divide lower count from higher count barcodes. The barcodes in the lower count range are considered background. In SIM IA, two sets of 2000 barcodes randomly sampled from the higher count range are considered real cells. The first set of 2000 is referred to as large ( G1) cells; the second set is downsampled by 90% to give small (G2) cells. We added a third set of medium (G1.5) cells by sampling 2000 cells from the higher count range and downsampling by 50%. The process for simulating data in SIM IB is identical to SIM IA except that in SIM IB, 10% of the genes in each simulated real cell are shuffled making the real cells more different from the background barcodes and, consequently, making real cells easier to identify. SIM IA is a more realistic simulation (Additional file 1: Figure S4).
Case studies
We evaluated the 10 datasets used in the simulation and also the placenta data evaluated in Lun et al. [4]. These datasets vary in sequencing depth as well as in the extent of differences between the real cell and background distributions (Additional file 1: Figure S4). CB2 and ED were applied to each dataset using default settings. For plots that compare identifications between CB2 and ED, cells identified by either approach (or both) were combined and UMI counts were normalized via scran. The Seurat pipeline was used to generate t-SNE plots from the top 4000 most highly variable genes and top 50 principal components. Expression heatmaps show log transformed raw UMI counts. For heatmaps and distribution plots, mitochondrial and ribosomal genes were removed.
Differential expression analysis in Alzheimer data
Cells identified by CB2, ED, or both were combined into a single matrix and filtered similar to Mathys et al. [5]. Specifically, cells with mitochondrial gene expression making up 40% or more of the total UMI counts were removed; genes detected in fewer than two cells were also excluded giving a matrix of 28,208 genes and 74,579 barcodes. Normalization was performed using scran. Cell types were annotated using marker genes as in Mathys et al. [5] Differential expression (DE) tests between cells from Alzheimer’s cases and controls were conducted using Wilcoxon rank-sum tests as in Mathys et al. [5]. Results were compared for known DE genes extracted from Mathys et al. [5].
Implementation of CB2 and ED
For all simulation and case study analyses, CB2 and ED were implemented using default parameters. A target FDR was set at 1%.
Existing subpopulations vs. novel subpopulations
The FindNeighbors and FindClusters functions in Seurat were used with default settings to assign each cell to a cluster, referred to here as a subpopulation. For each subpopulation, we calculated the percentage of cells identified by both CB2 and ED as well as those identified uniquely by CB2. Subpopulations for which over 80% of the cells are uniquely identified by CB2 are referred to as novel subpopulations (Additional file 2: Table S3 shows the number of novel subpopulations identified using 70%, 80%, or 90% as thresholds).
Supplementary information
Acknowledgements
The authors thank Matt Bernstein, Chitrasen Mohanty, and Ziyue Wang for comments that improved the manuscript.
Review history
The review history is available as Additional file 3.
Peer review information
Barbara Cheifet was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Authors’ contributions
Z.N. and C.K. designed the research, developed the method, and wrote the first version of the manuscript. Z.N. and S.C. conducted simulations and quality control evaluations. Z.N., S.C., and C.K. built and tested the R package. Z.N., J.B., and C.K. analyzed results from early versions of the method which helped during method refinement. All authors contributed to writing the manuscript. The authors read and approved the final manuscript.
Funding
This work was supported by NIH GM102756.
Availability of data and materials
The Alzheimer case study dataset was downloaded from https://www.synapse.org/#!Synapse:syn16780177 [5]. The placenta dataset [22] is available at https://jmlab-gitlab.cruk.cam.ac.uk/publications/EmptyDrops2017-DataFiles. All other datasets in this study are available at the 10x Genomics website (https://support.10xgenomics.com/single-cell-gene-expression/datasets) (Additional file 2: Table S2). The R package R/scCB2 is available at https://github.com/zijianni/scCB2 [23] under the General Public License version 3, and will be submitted to Bioconductor [24]. All simulation codes and case study data analysis scripts are available at https://github.com/zijianni/codes-for-CB2-paper and Zenodo (doi: 10.5281/zenodo.3829649) [25].
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Footnotes
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Zijian Ni, Email: zni25@wisc.edu.
Shuyang Chen, Email: schen662@wisc.edu.
Jared Brown, Email: brown46@wisc.edu.
Christina Kendziorski, Email: c.kendziorski@gmail.com.
Supplementary information
Supplementary information accompanies this paper at 10.1186/s13059-020-02054-8.
References
- 1.Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell. 2015;161:1202–1214. doi: 10.1016/j.cell.2015.05.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Zheng GXY, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun. 2017;8:14049. doi: 10.1038/ncomms14049. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Islam S, Zeisel A, Joost S, La Manno G, Zajac P, Kasper M, et al. Quantitative single-cell RNA-seq with unique molecular identifiers. Nat Methods. 2014;11:163–166. doi: 10.1038/nmeth.2772. [DOI] [PubMed] [Google Scholar]
- 4.Lun ATL, Riesenfeld S, Andrews T, Dao TP, Gomes T, Marioni JC. EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data. Genome Biol. 2019;20:63. doi: 10.1186/s13059-019-1662-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Mathys H, Davila-Velderrain J, Peng Z, Gao F, Mohammadi S, Young JZ, et al. Single-cell transcriptomic analysis of Alzheimer’s disease. Nature. 2019;570:332–337. doi: 10.1038/s41586-019-1195-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Dyer CA, Kendler A, Jean-Guillaume D, Awatramani R, Lee A, Mason LM, et al. GFAP-positive and myelin marker-positive glia in normal and pathologic environments. J Neurosci Res. 2000;60:412–426. doi: 10.1002/(SICI)1097-4547(20000501)60:3<412::AID-JNR16>3.0.CO;2-E. [DOI] [PubMed] [Google Scholar]
- 7.Wolock SL, Lopez R, Klein AM. Scrublet: computational identification of cell doublets in single-cell transcriptomic data. Cell Syst. 2019;8:281–291.e9. doi: 10.1016/j.cels.2018.11.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.DePasquale EAK, Schnell DJ, Van Camp PJ, Valiente-Alandí Í, Blaxall BC, Grimes HL, et al. DoubletDecon: deconvoluting doublets from single-cell RNA-sequencing data. Cell Rep. 2019;29:1718–1727.e8. doi: 10.1016/j.celrep.2019.09.082. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.McGinnis CS, Murrow LM, Gartner ZJ. DoubletFinder: doublet detection in single-cell rna sequencing data using artificial nearest neighbors. Cell Syst. 2019;8:329–337.e4. doi: 10.1016/j.cels.2019.03.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Griffiths JA, Richard AC, Bach K, Lun ATL, Marioni JC. Detection and removal of barcode swapping in single-cell RNA-seq data. Nat Commun. 2018;9:2667. doi: 10.1038/s41467-018-05083-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.R Core Team . R: A Language and Environment for Statistical Computing. 2019. [Google Scholar]
- 12.Lun ATL, McCarthy DJ, Marioni JC. A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Res. 2016;5:2122. doi: 10.12688/f1000research.9501.1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Butler A, Hoffman P, Smibert P, Papalexi E, Satija R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol. 2018;36:411–420. doi: 10.1038/nbt.4096. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck WM, et al. Comprehensive integration of single-cell data. Cell. 2019;177:1888–1902.e21. doi: 10.1016/j.cell.2019.05.031. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Wickham H. ggplot2: Elegant Graphics for Data Analysis. 2016. [Google Scholar]
- 16.Gale WA, Sampson G. Good-turing frequency estimation without tears*. J Quant Linguist. 1995;2:217–237. doi: 10.1080/09296179508590051. [DOI] [Google Scholar]
- 17.Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B. 1995;57:289–300. [Google Scholar]
- 18.Mieth B, Kloft M, Rodríguez JA, Sonnenburg S, Vobruba R, Morcillo-Suárez C, et al. Combining multiple hypothesis testing with machine learning increases the statistical power of genome-wide association studies. Sci Rep. 2016;6:1–14. doi: 10.1038/srep36671. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Botía JA, Vandrovcova J, Forabosco P, Guelfi S, D’Sa K, Hardy J, et al. An additional k-means clustering step improves the biological features of WGCNA gene co-expression networks. BMC Syst Biol. 2017;11:47. doi: 10.1186/s12918-017-0420-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Malik L, Almodaresi F, Patro R. Grouper: graph-based clustering and annotation for improved de novo transcriptome analysis. Bioinformatics. 2018;34:3265–3272. doi: 10.1093/bioinformatics/bty378. [DOI] [PubMed] [Google Scholar]
- 21.Shannon CE. A mathematical theory of communication. Bell Syst Tech J. 1948;27:379–423. doi: 10.1002/j.1538-7305.1948.tb01338.x. [DOI] [Google Scholar]
- 22.Vento-Tormo R, Efremova M, Botting RA, Turco MY, Vento-Tormo M, Meyer KB, et al. Single-cell reconstruction of the early maternal–fetal interface in humans. Nature. 2018;563:347–353. doi: 10.1038/s41586-018-0698-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Ni Z, Chen S, Brown J, Kendziorski C. CB2 improves power of cell detection in droplet-based single-cell RNA sequencing data. GitHub. 2020. https://github.com/zijianni/scCB2. Accessed 15 May 2020. [DOI] [PMC free article] [PubMed]
- 24.Huber W, Carey VJ, Gentleman R, Anders S, Carlson M, Carvalho BS, et al. Orchestrating high-throughput genomic analysis with bioconductor. Nat Methods. 2015;12:115–121. doi: 10.1038/nmeth.3252. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Ni Z, Chen S, Brown J, Kendziorski C. CB2 improves power of cell detection in droplet-based single-cell RNA sequencing data. Zenodo. 2020; 10.5281/zenodo.3829649. [DOI] [PMC free article] [PubMed]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
The Alzheimer case study dataset was downloaded from https://www.synapse.org/#!Synapse:syn16780177 [5]. The placenta dataset [22] is available at https://jmlab-gitlab.cruk.cam.ac.uk/publications/EmptyDrops2017-DataFiles. All other datasets in this study are available at the 10x Genomics website (https://support.10xgenomics.com/single-cell-gene-expression/datasets) (Additional file 2: Table S2). The R package R/scCB2 is available at https://github.com/zijianni/scCB2 [23] under the General Public License version 3, and will be submitted to Bioconductor [24]. All simulation codes and case study data analysis scripts are available at https://github.com/zijianni/codes-for-CB2-paper and Zenodo (doi: 10.5281/zenodo.3829649) [25].