SiftCell: A robust framework to detect and isolate cell-containing droplets from single-cell RNA sequence reads

Jingyue Xi; Sung Rye Park; Jun Hee Lee; Hyun Min Kang

doi:10.1016/j.cels.2023.06.002

Author manuscript; available in PMC: 2024 Jul 19.

Published in final edited form as: Cell Syst. 2023 Jul 19;14(7):620–628.e3. doi: 10.1016/j.cels.2023.06.002


PMCID: PMC10411962 NIHMSID: NIHMS1919287 PMID: 37473732

Summary

Single-cell RNA sequencing (scRNA-seq) massively profiles transcriptomes of individual cells encapsulated in barcoded droplets in parallel. However, in real-world scRNA-seq data, many barcoded droplets do not contain cells, but instead they capture a fraction of ambient RNAs released from damaged or lysed cells. A typical first step to analyze scRNA-seq data is to filter out cell-free droplets and isolate cell-containing droplets, but distinguishing them is often challenging; incorrect filtering may mislead the downstream analysis substantially. We propose SiftCell, a suite of software tools to identify and visualize cell-containing and cell-free droplets in manifold space via randomization (SiftCell-Shuffle), to classify between the two types of droplets (SiftCell-Boost), and to quantify the contribution of ambient RNAs for each droplet (SiftCell-Mix). By applying our method to datasets obtained by various single cell platforms, we show that SiftCell provides a streamlined way to perform upstream quality control of scRNA-seq, which is more comprehensive and accurate than existing methods. A record of this paper’s Transparent Peer Review process is included in the Supplemental Information.


Electronic table of contents (eTOC) blurb

SiftCell is a comprehensive suite of software tools for robust filtering of cell-containing droplets in single-cell RNA-sequencing. It provides tools for manifold visualization (SiftCell-Shuffle), droplet classification (SiftCell-Boost), and ambient RNA quantification (SiftCell-Mix) across various single-cell RNA-seq platforms by leveraging randomization.

Introduction

The rapid development of single-cell genomic technologies has revolutionized our ability to understand the dynamics of individual cells. Single-cell RNA sequencing (scRNA-seq) technologies allow us to simultaneously profile the transcriptomes of thousands of individual cells, enabling us to understand the regulatory impact of genetic, developmental, environmental, and clinical determinants at single-cell resolution^1,2. Single-nucleus RNA sequencing (snRNA-seq), single-cell/nucleus ATAC sequencing (scATAC/snATAC-seq), and other high-throughput single-cell genomic profiling technologies^3–5 can also scale to thousands of cells or nuclei, providing a comprehensive epigenomic landscape of individual cells or cell types. This has given us unprecedented opportunities to characterize cellular heterogeneity, which is essential for understanding and treating human diseases. Many analytic methods and software tools have been developed to determine cell types^6–9 or developmental trajectories^10–13, to account for systematic differences between experimental batches or technologies, to identify genes differentially regulated by clinical variables or genotypes^14,15, or to enable efficient single-cell experiments via multiplexing¹⁶. However, relatively few methods and tools have been developed to address issues in the upstream quality control steps of single-cell genomic experiments. Such quality control is essential to ensure that downstream analysis is not misled by technical artifacts, such as those arising from sequence alignment or digital expression matrix generation.

A common feature shared across recent high-throughput single-cell genomic technologies^1,2,17–19 is that the reads from individual cells or nuclei are distinctly barcoded so that thousands or millions of cells or nuclei can be sequenced simultaneously in a single library. The sequenced reads can be grouped by barcode into their originating cells or nuclei, and the grouped information is used for downstream single-cell analysis. However, an observed barcode may not correspond to a single cell or nucleus, for several reasons. First, sequencing errors may assign a read to the wrong cell or nucleus. Second, two or more cells or nuclei can be encapsulated within the same barcoded droplet, forming ‘multiplets’, either stochastically or due to imperfect dissociation of tissues^1,2,20. Third, a droplet may fail to encapsulate an entire single cell and instead capture “cell debris” or “ambient mRNAs” released from damaged and lysed cells. Barcodes derived from such droplets may be mistaken for single-cell transcriptomes. We refer to barcoded droplets enriched for ambient RNAs as “cell-free droplets”. scRNA-seq data from solid tissues are reported to be more enriched for ambient mRNAs, particularly when the tissue is incubated at high temperature²¹, which makes cells more vulnerable and therefore causes more cell death and lysis. Technologies also differ in how susceptible they are to contamination by cell-free droplets. For example, snRNA-seq technologies inherently produce more ambient mRNAs and are therefore more likely to generate cell-free droplets than conventional scRNA-seq²². Because cell-free droplets do not represent single cells, failing to filter them out leads to misleading interpretations in downstream analysis. Therefore, filtering out cell-free droplets is an essential quality control procedure to ensure that scRNA-seq and snRNA-seq analyses produce biologically relevant information.

The simplest and most widely used strategy to filter out cell-free droplets is to remove droplets with very low numbers of unique reads or UMI (Unique Molecular Identifier) counts. This strategy is based on the simple fact that, compared to cell-containing droplets, cell-free droplets contain fewer mRNAs because ambient mRNAs are heavily diluted in the media surrounding the cells, leading to low UMI counts. Earlier versions of the cellRanger and DropseqTools software filter out droplets below a UMI cutoff determined, using knee plots, from the distribution of UMI counts across the barcoded droplets and from a user-specified expected number of cells^2,19. While this strategy works quite effectively in practice, it relies on the simplistic assumption that all droplets containing individual cells will have higher UMI counts than other barcoded droplets. Because UMI counts result from a stochastic procedure involving multiple factors, this assumption does not always hold. For example, as UMI counts and cell sizes are known to be positively correlated²³, smaller cells may be filtered out more aggressively by a fixed UMI count threshold, losing their representation in the filtered scRNA-seq dataset.
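To make the limitation of a fixed cutoff concrete, the following is a minimal sketch of a UMI-count filter. It does not reproduce the cutoff rule of cellRanger or DropseqTools: the function name, the `quantile` and `ratio` parameters, and the toy counts are all hypothetical and chosen only for illustration.

```python
def filter_by_umi_cutoff(umi_counts, expected_cells, quantile=0.99, ratio=0.1):
    """Keep barcodes whose total UMI count passes a single cutoff.

    umi_counts: dict mapping barcode -> total UMI count.
    expected_cells: user-specified expected number of cells.
    quantile, ratio: hypothetical tuning knobs (assumptions, not from the paper).
    """
    ranked = sorted(umi_counts.values(), reverse=True)
    top = ranked[:expected_cells]
    # Reference depth: a high quantile among the top `expected_cells` barcodes.
    idx = int(round((1.0 - quantile) * (len(top) - 1)))
    reference = top[idx]
    # Every barcode below a fixed fraction of the reference depth is dropped.
    cutoff = reference * ratio
    kept = {bc for bc, n in umi_counts.items() if n >= cutoff}
    return kept, cutoff


# Toy data (hypothetical): two large cells, one small cell, two empty droplets.
toy_counts = {"big1": 10000, "big2": 9000, "small": 600, "empty1": 50, "empty2": 30}
kept, cutoff = filter_by_umi_cutoff(toy_counts, expected_cells=3)
```

In this toy example the small cell falls below the cutoff together with the empty droplets, which is the failure mode described above for cell types with low RNA content.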

Recently, alternative approaches have been developed to identify and filter out cell-free droplets using more sophisticated statistical models. For example, the EmptyDrops method²⁴, which is adopted in the newer version of cellRanger (v3), first determines a UMI cutoff from the knee plot to identify cell-containing droplets, and then attempts to rescue droplets below the UMI cutoff using a statistical test. The assumption is that the expression profile of cell-free droplets is homogeneous and can be modeled with a Dirichlet-multinomial distribution. If the likelihood of the observed read counts from a barcoded droplet is significantly lower than that of simulated reads at a given threshold, EmptyDrops identifies the droplet as cell-containing. EmptyDrops is useful only for rescuing cell-containing droplets with lower UMI counts and cannot filter out cell-free droplets with high UMI counts. DecontX and SoupX^25,26, on the other hand, assume that every droplet contains a certain fraction of ambient RNAs, attempt to estimate the proportion of ambient RNA contamination, and call a droplet cell-free if the estimated proportion is above a specific threshold (e.g. 10%). A recently developed method, DIEM²², uses an expectation-maximization (EM) algorithm²⁷ to cluster barcoded droplets into cell types while modeling cell-free droplets as a separate cluster using a Dirichlet-multinomial distribution. CellBender uses a deep generative model implemented by neural auto-encoders to model scRNA-seq data and applies a variational mix to evaluate the posterior probability that a droplet is cell-free. Finally, DropletQC²⁸ estimates the proportion of intronic reads from sequence reads and uses this information to separate droplets containing damaged cells or ambient RNAs. While these methods have demonstrated their utility in some real datasets, it is often unclear what objective criteria should be used to evaluate their performance in distinguishing cell-containing droplets from cell-free droplets.

Here, we present SiftCell, a suite of three software tools that address the challenges posed by cell-free droplets in conceptually unique ways. The first tool, SiftCell-Shuffle, allows us to visually distinguish cell-free and cell-containing droplets in sc/snRNA-seq experiments to help filter out cell-free droplets. SiftCell-Shuffle takes an arbitrary digital expression matrix and uses randomization to visualize the distribution of potentially cell-free barcoded droplets in a manifold space. The second tool, SiftCell-Boost, distinguishes cell-containing droplets from cell-free droplets using a supervised learning algorithm guided by the labels generated with SiftCell-Shuffle. The third tool, SiftCell-Mix, is a model-based method that estimates the proportion of “ambient RNAs” in each barcoded droplet. We also provide a comprehensive evaluation of existing methods to identify cell-free droplets. Therefore, in addition to providing an intuitive way of eliminating cell-free droplets and selecting cell-containing droplets, our method can evaluate and visualize the strengths of previously available methods, guiding users toward best practice for handling cell-free droplets.

Results

SiftCell-Shuffle visually distinguishes cell-free droplets from cell-containing ones

Even though there are multiple methods to determine cell-containing droplets from sc/snRNA-seq data, there is currently no systematic way to evaluate, with real data, whether one method distinguishes cell-containing and cell-free droplets more robustly than another. Previous studies used indirect measurements, such as the fraction of mitochondrial RNAs (mtRNAs)²² or UMI counts²⁴, but these can be confounded by cell types (e.g. some cell types may contain more mtRNAs and fewer UMIs) or technical factors (e.g. some scRNA-seq preps contain high amounts of ambient mRNAs). Other studies compared manifold plots (such as UMAP or t-SNE) after applying different filtering methods and argued for one over another based on the visual patterns of clustered cell types²⁹, but such interpretations can easily become subjective.

We developed SiftCell-Shuffle, a randomization-based scRNA-seq visualization tool focusing on the distinction between cell-containing and cell-free droplets. SiftCell-Shuffle assumes that ambient RNAs are distributed as a pseudo-bulk (i.e. as a single distribution across the whole dataset) while RNAs in cell-containing droplets are distributed in a cell-type-specific manner. Based on this assumption, SiftCell-Shuffle creates a digital expression matrix that mimics the ‘bulk’ distribution by randomizing the droplet barcode assignments across the UMIs. After randomization, the original and randomized digital expression matrices are jointly analyzed using a standard scRNA-seq workflow (e.g. Seurat), and individual droplets are visualized in a t-SNE and/or UMAP manifold space (Figure 1; see STAR Methods for further details). For cell-containing droplets, the original and randomized data should have very different transcriptomic profiles and will be located far from each other in the manifold space. For cell-free droplets containing mostly ambient RNAs, the original and randomized data are more likely to be located in close proximity, so the cluster of cell-free droplets can be clearly visualized.
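The randomization step can be illustrated with a short sketch. This is not the SiftCell implementation (see STAR Methods for that); it assumes that "randomizing the droplet barcode assignments across the UMIs" means permuting barcode labels over individual UMIs, and the function names and the sparse dictionary representation are hypothetical.

```python
import random
from collections import Counter


def shuffle_barcodes(counts, seed=0):
    """Permute droplet barcodes across individual UMIs.

    counts: dict mapping (barcode, gene) -> UMI count (a sparse expression matrix).
    Returns a randomized matrix in the same form. Each droplet keeps its total
    UMI count and each gene keeps its total count, but the gene composition of
    every droplet is drawn from the pooled, bulk-like profile.
    """
    rng = random.Random(seed)
    barcodes, genes = [], []
    for (barcode, gene), n in counts.items():
        # Expand every matrix entry into n individual UMIs.
        barcodes.extend([barcode] * n)
        genes.extend([gene] * n)
    # Shuffle only the barcode labels; the gene of each UMI is left untouched.
    rng.shuffle(barcodes)
    return dict(Counter(zip(barcodes, genes)))


def marginal(counts, axis):
    """Total UMIs per barcode (axis=0) or per gene (axis=1)."""
    totals = Counter()
    for key, n in counts.items():
        totals[key[axis]] += n
    return dict(totals)


# Toy matrix (hypothetical): three droplets, two genes.
original = {("bc1", "GeneA"): 5, ("bc1", "GeneB"): 1, ("bc2", "GeneB"): 4,
            ("bc3", "GeneA"): 1, ("bc3", "GeneB"): 1}
randomized = shuffle_barcodes(original, seed=1)
```

Because only the barcode labels are permuted, the randomized matrix has the same UMI depth per droplet as the original, so the two can be analyzed jointly without a depth mismatch between them.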

Figure 1. — Overview of *SiftCell* Framework

The *SiftCell* software package includes three tools for visualizing and filtering barcoded droplets from scRNA- or snRNA-seq experiments: *SiftCell-Shuffle* visualizes original barcoded droplets together with randomized droplets in a manifold space to visually distinguish “cell-containing” and “cell-free” droplets; *SiftCell-Boost* classifies cell-containing and cell-free droplets by applying gradient boosting to the results from *SiftCell-Shuffle*; *SiftCell-Mix* estimates the proportion of ambient RNAs in each barcoded droplet.

We first assessed the performance of SiftCell-Shuffle on three experimental scRNA-seq or snRNA-seq datasets. The first is scRNA-seq of ~10,000 PBMCs using 10X Chromium v3 chemistry. The second is snRNA-seq of ~1,000 E18 mouse nuclei using 10X Chromium, available at https://www.10xgenomics.com/resources/datasets. The third is scRNA-seq of ~1,000 cultured colon cancer cells pooled across 3 cell lines (RKO, HCT116, SW480), profiled using the Drop-Seq technique³⁰. We expect cell-containing droplets to be easier to distinguish from cell-free droplets in the PBMC dataset than in the other two datasets, because snRNA-seq and Drop-Seq are known to be more enriched for ambient RNAs.

When we applied SiftCell-Shuffle to the unfiltered PBMC dataset together with unsupervised clustering produced by Seurat³¹, we observed a clear separation between the “original” (clusters 2, 3, 5, 6, 7, 8) and “randomized” (clusters 1, 4) droplets in both t-SNE and UMAP manifolds (Figure 2A, Figure S1A), except for cluster 0, which had much lower UMI counts than the other clusters (Figure 2B, 2C, S1B, S1C). The original droplets that belong to cluster 0 showed larger dispersion of UMIs across genes (Figure 2C, Figure S1C, E) and were also enriched for mtRNAs (18.5% of UMIs compared to 10.1% in other clusters; Figure 2D; Figure S1D). Altogether, these observations strongly suggest that cluster 0 represents cell-free droplets enriched for ambient mRNAs. On the other hand, the remaining clusters containing original droplets (clusters 2, 3, 5, 6, 7, and 8) contained very few randomized droplets (0 – 0.5%), suggesting that they likely represent cell-containing droplets of different cell types. Using known marker genes specific to immune cell types, we demonstrated that each of these clusters indeed shows enrichment for a specific immune cell type, while cluster 0 shows non-specific expression across most of these genes (Figure S2). These results suggest that, by visualizing both original and randomized droplets together in a single manifold space, SiftCell-Shuffle distinguishes clusters of cell-containing droplets from cell-free droplets in a straightforward and visually interpretable way.

Figure 2. — Visualization of cell-containing and cell-free droplets from PBMC scRNA-seq dataset with *SiftCell-Shuffle*

The four panels visualize original and randomized droplets from the result of *SiftCell-Shuffle* for the PBMC dataset in the t-SNE manifold space generated using Seurat v3 (Butler and Satija, 2017). The t-SNE manifolds were colored by (A) original (blue) vs. randomized (grey) droplets, (B) clusters produced by Seurat with the FindNeighbors and FindClusters functions, (C) the total number of UMIs in the droplet across all genes, in logarithmic scale, and (D) the fraction of mitochondrial RNAs in logarithmic scale. In (A), we see clear separation between the original (blue) and randomized (grey) droplets except for the cluster in the lower-right quadrant (cluster 0 in (B)), which we believe to be enriched for cell-free droplets. This cluster tends to have lower UMI counts in (C) and contains droplets with a higher proportion of mitochondrial RNAs in (D). However, it is important to note that not all randomized droplets are clustered together in (A). Randomized droplets with higher UMI counts tend to form their own clusters (clusters 1 and 4 in (B)). This is because randomized droplets with higher UMI counts do not necessarily resemble cell-free droplets, as UMI count acts as a confounding variable (see Discussion for more details).

We made similar observations when applying SiftCell-Shuffle to the other two datasets. For example, among the 5 clusters of brain nuclei, original droplets (clusters 0, 2, 4) and randomized droplets (clusters 0, 1, 3) were well-separated except for cluster 0, suggesting that it represents cell-free droplets (Figure S3A, B, E, F). Cluster 0 also tends to have lower UMI counts and high mtRNA overall. Interestingly, in cluster 2, we also observed a substantial fraction of droplets with high mtRNAs (Figure S3D, H). These droplets may represent nuclei undergoing necrosis²⁶ or “nucleus-containing” droplets that also contain a substantial amount of ambient RNAs. When applying SiftCell-Shuffle to the Drop-Seq dataset of the colon cell line mixture, we observed four distinct clusters: three representing the three cell lines, and the largest representing cell-free droplets with much lower UMI counts (Figure S4). Clusters representing each cell line were enriched for genes specific to that cell line³⁰, while the cell-free cluster tends to express most of these genes at lower levels, suggesting that it contains ambient mRNAs from a mixture of multiple cell lines (Figure S5).

Across the three datasets, visualizing the original and randomized droplets in a low-dimensional manifold space provided a straightforward way to distinguish cell-free and cell-containing droplets. Unsupervised clustering based on shared nearest neighbors (SNN)³² was effective in distinguishing clusters of cell-free droplets from cell-containing droplets in PBMC (Figure 2) and brain nuclei (Figure S3). However, in the colon cell line mixture, one of the clusters (cluster 1) largely contained both cell-containing (mostly HCT116) and randomized droplets (Figure S4A, B, E, F), suggesting that unsupervised clustering does not always separate clusters of cell-free droplets automatically. Moreover, distinguishing cell-containing droplets from cell-free droplets by visual inspection of SiftCell-Shuffle output without additional “gold-standard” labels involves subjective decisions by users and may be hard to automate in a software tool. Therefore, while SiftCell-Shuffle is an intuitive and human-interpretable approach to identify clusters of cell-free droplets, it does not completely replace existing methods that filter cell-containing droplets in a more systematic fashion.

Evaluating the performance of droplet filtering using SiftCell-Shuffle

Our SiftCell-Shuffle framework can also be used to evaluate different approaches to filtering a digital expression matrix so that downstream analysis can focus on cell-containing droplets. While this approach is not as accurate as an evaluation based on “gold-standard” labels, in the absence of knowledge of the true cell-free and cell-containing droplets, SiftCell-Shuffle can provide a quasi-ground truth as a silver standard. We applied four existing filtering methods and visualized the filtering results in the manifold space produced by SiftCell-Shuffle. Contrasting the distributions of original and randomized droplets in the manifold space clearly shows that, in the PBMC dataset, cellRanger2 (filtering by UMI cutoff) selected cell-containing droplets more specifically than the other methods (Figure 3A–D). In the brain nuclei dataset, CellRanger/UMI-cutoff and EmptyDrops filtered cell-containing droplets much more stringently than DIEM and CellBender (Figure 3F–I). In the mixture of three colon cancer cell lines, each of the four methods filtered the cell-containing droplets either too stringently (CellRanger/UMI-cutoff) or too leniently (EmptyDrops, DIEM, CellBender) (Figure 3K–N).

Figure 3. — Evaluation of droplet filtering methods with *SiftCell-Shuffle*

Each panel visualizes the results of droplet filtering methods in the same manifold spaces described in Figures 2, S3, and S4. Each colored point represents a predicted cell-containing droplet (red), a predicted cell-free droplet (cyan), or a randomized droplet (grey). The rows correspond to the PBMC (A-E), brain nuclei (F-J), and colon cell line mixture (K-O) datasets, respectively. The columns show the results from different droplet filtering methods: *CellRanger/UMI-cutoff* (A,F,K), *EmptyDrops* (B,G,L), *DIEM* (C,H,M), *CellBender* (D,I,N), and *SiftCell-Boost* (E,J,O).

Besides visual inspection, we can also quantitatively evaluate droplet filtering methods using SiftCell-Shuffle. For each filtered droplet, we can quantify how often its nearest neighbor is an original droplet as opposed to a randomized droplet, a metric we call %NN-concordance. A high %NN-concordance suggests that the filtered droplets are well-separated from randomized droplets (Figure S6). A typical way to determine the number of cell-containing droplets is the knee plot (Figure S7). However, our %NN-concordance plot is more informative for pinpointing where ambient RNAs start to increase. Each filtering method can also be placed on this operating characteristic curve for evaluation. For example, in the PBMC dataset, it is clear that EmptyDrops is worse than the other methods in terms of %NN-concordance (Figure S6A). Among the other 3 methods, DIEM appears to retain too few droplets (n=9,112 droplets) even though %NN-concordance remained high even beyond 10,000 droplets. In brain nuclei and colon cell line mixture, we observed that CellRanger/UMI-cutoff appears to filter stringently while the others filter leniently (Figure 3F,K).
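The metric can be sketched in a few lines. The following is an illustrative brute-force version, assuming Euclidean distance in whatever low-dimensional embedding is used for the joint analysis; the function name, the choice of distance, and the toy coordinates are assumptions rather than details taken from the SiftCell implementation.

```python
import math


def nn_concordance(coords, is_original, selected):
    """Percentage of selected droplets whose nearest neighbor is an original droplet.

    coords: list of coordinate tuples for all droplets (original and randomized).
    is_original: list of bools, True for original and False for randomized droplets.
    selected: indices of the droplets kept by a filtering method.
    """
    hits = 0
    for i in selected:
        best_j, best_d = None, math.inf
        for j, point in enumerate(coords):
            if j == i:
                continue  # a droplet is not its own neighbor
            d = math.dist(coords[i], point)
            if d < best_d:
                best_j, best_d = j, d
        # Count the droplet as concordant if its nearest neighbor is original.
        hits += is_original[best_j]
    return 100.0 * hits / len(selected)


# Toy embedding (hypothetical): droplets 0-2 are original, 3-4 are randomized.
coords = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (10.0, 10.5)]
is_original = [True, True, True, False, False]
strict = nn_concordance(coords, is_original, selected=[0, 1])
lenient = nn_concordance(coords, is_original, selected=[0, 1, 2])
```

Here the lenient selection also keeps droplet 2, which sits among the randomized droplets, so its concordance drops below that of the strict selection. Evaluating the metric over the top-k droplets ranked by UMI count for increasing k would trace out a curve of the kind described above.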

SiftCell-Boost robustly filters cell-containing and healthy droplets

Because none of the existing droplet filtering methods always provided satisfactory performance across all datasets in our evaluation, we next attempted to develop a method to filter cell-containing droplets by leveraging results from SiftCell-Shuffle. Our approach applies a gradient boosting classification algorithm XGBoost³³ by assigning randomized droplets as negative labels (representing ambient RNAs) and droplets confidently predicted to contain cells as positive labels using an overdispersion test (see STAR Methods for details). SiftCell-Boost assumes that the positively or negatively labeled droplets are confident cell-containing or cell-free droplets, respectively, and focuses on classifying the unlabeled droplets (10% in PBMC, 71% in brain nuclei, and 66% in colon cell line mixture) into either cell-containing or cell-free droplets.

We used the top 1,000 highly variable genes identified from the same overdispersion test as features to train the classification model with XGBoost. This algorithm robustly classified droplets into cell-containing and cell-free droplets, visually better than some of the existing methods (Figure 3, S8), but we observed that droplets with a high % of mtRNAs were sometimes classified as cell-containing droplets. This is because our method assumes that ambient RNAs are random samples from the existing reads, whereas in fact they tend to be enriched for mtRNAs due to necrosis. To address this challenge, we marked droplets with an excessive % of mtRNAs as additional negative labels (see STAR Methods). In addition, for PBMCs, to avoid including unintended cell types (i.e., platelets), we also marked droplets with an excessive % of PPBP as negative labels. With these additional negative labels, SiftCell-Boost clearly outperformed existing methods on PBMC and colon cell line mixture and was comparable with other methods for brain nuclei (Figure 3, S9). We also evaluated the concordance of droplet classification among the five evaluated methods. We counted how often a specific method classified a droplet discordantly from all the other methods (Figure S10, Table S1). For example, we found that cell-containing droplets identified by CellRanger/UMI-cutoff were always consistent with at least one of the other methods. However, 288 and 455 cell-free droplets determined by CellRanger/UMI-cutoff were discordant with all the other methods for the brain nuclei and colon cell line mixture data, respectively, suggesting that the method has high specificity but poor sensitivity. By this criterion, all four methods other than SiftCell-Boost had two or more datasets where >200 droplets were classified discordantly from all the other methods. In contrast, SiftCell-Boost had 12 or fewer droplets classified discordantly from all other methods, suggesting that its classification is more consistent with the consensus among all methods. We also evaluated the accuracy of SiftCell-Boost using 5-fold cross-validation and obtained an average accuracy of 99.92% for PBMC data, 99.83% for brain nuclei data, and 98.96% for colon cell line mixture (Supplementary Table S2).
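The discordance count used in this comparison can be written down compactly. The sketch below assumes each method's output is available as a mapping from barcode to a boolean call; the function name, the data layout, and the method labels "A", "B", and "C" are hypothetical.

```python
def count_exclusive_discordance(calls):
    """Count droplets on which one method disagrees with every other method.

    calls: dict mapping method name -> dict of barcode -> bool
           (True = called cell-containing, False = called cell-free).
    Returns, per method, how many of its "cell" and "cell_free" calls are
    contradicted by all other methods.
    """
    methods = list(calls)
    # Only droplets classified by every method can be compared.
    shared = set.intersection(*(set(calls[m]) for m in methods))
    result = {m: {"cell": 0, "cell_free": 0} for m in methods}
    for barcode in shared:
        for m in methods:
            mine = calls[m][barcode]
            others = [calls[o][barcode] for o in methods if o != m]
            if all(call != mine for call in others):
                result[m]["cell" if mine else "cell_free"] += 1
    return result


# Toy calls (hypothetical) from three methods on three droplets.
toy_calls = {
    "A": {"d1": True, "d2": False, "d3": True},
    "B": {"d1": True, "d2": True, "d3": True},
    "C": {"d1": True, "d2": True, "d3": False},
}
discordance = count_exclusive_discordance(toy_calls)
```

A large "cell_free" count for a method indicates that it discards droplets every other method keeps (overly stringent), while a large "cell" count indicates the opposite.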

SiftCell-Mix estimates the contribution from ambient RNAs in each droplet

Even though classifying each droplet into two categories is practically useful for selecting droplets for downstream analysis, it is reasonable to assume that each cell-containing droplet also contains some reads from ambient RNAs, considering the overall procedure of droplet-based scRNA-seq experiments^25,26,34. While SiftCell-Boost accurately classifies cell-containing and cell-free droplets, it is also important to estimate the proportion of ambient RNAs to inform downstream analysis. Once cell-containing droplets are clustered into cell types by users, SiftCell-Mix models the distribution of UMIs as a multinomial mixture of a single cell type and ambient RNAs, and quantifies the contribution of ambient RNAs using maximum likelihood estimates (MLE). Across the three datasets – PBMC, brain nuclei, and colon cell line mixture – SiftCell-Mix corroborates the results from SiftCell-Boost, in the sense that the cell-containing droplets identified by SiftCell-Boost are estimated to have a very small contribution from ambient RNAs, except for brain nuclei snRNA-seq, which is expected to have contamination from ambient RNAs even in cell-containing droplets (Figure 4, Figure S11). Compared to DecontX, SiftCell-Mix provides more consistent estimates of the % contribution from ambient RNAs across the 3 datasets. While DecontX performed robustly for PBMC, in the other two datasets it provided almost uniform estimates of % ambient RNAs across all droplets and failed to distinguish cell-containing and cell-free droplets. In SiftCell-Mix, it should be noted that not all cell-free droplets had high estimates of % ambient RNAs. We suspect that this results from multiple factors, such as non-random contribution of individual cell types to the ambient RNAs in specific droplets, systematic differences in mitochondrial RNAs between droplets (particularly for snRNA-seq of brain nuclei), and estimation errors due to low UMI counts in certain droplets.
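As a worked illustration of the two-component multinomial mixture, suppose a droplet has UMI counts x_g, the ambient profile is a_g, and the profile of its assigned cell type is c_g (both summing to 1 across genes). The mixing proportion α then maximizes Σ_g x_g log(α a_g + (1 − α) c_g). The sketch below maximizes this likelihood with a standard EM update for a mixture weight with fixed components. Whether SiftCell-Mix uses EM or another optimizer is not stated here, so the optimizer, the function name, and the toy profiles should all be read as assumptions.

```python
def estimate_ambient_fraction(counts, ambient, cell_type, n_iter=500, tol=1e-10):
    """MLE of the ambient fraction alpha in a two-component multinomial mixture.

    counts: per-gene UMI counts for one droplet.
    ambient, cell_type: per-gene probabilities (each summing to 1), treated as known.
    Mixture profile: p_g = alpha * ambient_g + (1 - alpha) * cell_type_g.
    """
    total = sum(counts)
    alpha = 0.5  # neutral starting value (an arbitrary choice)
    for _ in range(n_iter):
        expected_ambient = 0.0
        for x, a, c in zip(counts, ambient, cell_type):
            if x == 0:
                continue
            mix = alpha * a + (1.0 - alpha) * c
            if mix <= 0.0:
                raise ValueError("observed gene has zero probability under the mixture")
            # E-step: expected number of this gene's UMIs that came from ambient RNA.
            expected_ambient += x * alpha * a / mix
        # M-step: the new alpha is the expected ambient share of all UMIs.
        new_alpha = expected_ambient / total
        converged = abs(new_alpha - alpha) < tol
        alpha = new_alpha
        if converged:
            break
    return alpha


# Toy droplet (hypothetical): counts match a 25% ambient / 75% cell-type mixture,
# since 0.25 * [0.5, 0.5] + 0.75 * [0.9, 0.1] = [0.8, 0.2].
alpha_hat = estimate_ambient_fraction([80, 20], ambient=[0.5, 0.5], cell_type=[0.9, 0.1])
```

The log-likelihood is concave in α, so the iteration has a single maximum to find, but EM can converge slowly when the optimum lies at α = 0 or α = 1; a bounded one-dimensional optimizer would be a reasonable alternative in that regime.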

Figure 4. — Visualization of contribution of ambient RNAs from scRNA-seq and snRNA-seq datasets

The six panels visualize the estimates of ambient RNA contamination, in linear scale, among droplets in PBMC (A,D), brain nuclei (B,E), and colon cell line mixture (C,F) by *DecontX* (A-C) and *SiftCell-Mix* (D-F) in the t-SNE manifold space, excluding randomized droplets. Panels (A) and (D) show that the performance of *DecontX* and *SiftCell-Mix* is comparable in the PBMC dataset. In (B), *DecontX* suggests that there is very little contamination of ambient RNAs in the brain nuclei data, which is inconsistent with the expectation for typical snRNA-seq. *DecontX* estimated that 0.2% of cell-free droplets (inferred by *SiftCell-Boost*) have >10% of ambient RNAs present. On the other hand, in (E), *SiftCell-Mix* suggests a large amount of ambient RNA contamination in the same data. *SiftCell-Mix* estimates that 81.8% of cell-free droplets have >10% ambient RNAs present. In the colon cell line mixture, we do not expect a large contamination from ambient RNAs. However, in (C), *DecontX* estimated that 56.8% of cell-containing droplets (inferred by *SiftCell-Boost*) have >10% of ambient RNAs present while, in (F), the estimate from *SiftCell-Mix* is only 9.9%.

Evaluation of computational cost

We evaluated the computational cost, in terms of wall time (i.e. elapsed time) and peak memory usage, for SiftCell and the other methods evaluated above (Table S3). Both computational time and memory usage increased with the number of droplets across all methods evaluated. Each of the SiftCell methods typically finished the analysis within minutes. For the largest dataset (PBMC), SiftCell could take up to 15 minutes and consume up to 4.5GB of memory. Among the other methods, EmptyDrops and DecontX used less memory and computational time than SiftCell. DIEM was slower than SiftCell by a factor of 2–5. CellBender was evaluated in a GPU-enabled environment; nevertheless, its computational cost was the largest among all methods evaluated.

Discussion

In this paper, we describe the SiftCell framework, a suite of software tools comprising SiftCell-Shuffle, SiftCell-Boost, and SiftCell-Mix, which focuses on the challenges of contamination from ambient RNAs in single-cell and single-nucleus RNA-seq experiments. SiftCell-Shuffle works with a digital expression matrix and helps investigators visually distinguish cell-free and cell-containing droplets by contrasting it with a randomized digital expression matrix. SiftCell-Boost takes the output of SiftCell-Shuffle as input and applies a machine learning method to classify cell-containing and cell-free droplets. SiftCell-Mix is a model-based tool that allows quantitative estimation of the contribution of “ambient RNAs” in each droplet.

Various approaches have been developed to identify and filter out cell-free droplets from scRNA-seq data. Existing methods, such as CellRanger/UMI-cutoff (https://github.com/10XGenomics/cellranger), EmptyDrops (Lun et al., 2019), DIEM (Alvarez et al., 2020), CellBender (Fleming et al., 2019), and DecontX (Yang et al., 2020), focus on directly identifying cell-free droplets or quantifying contamination from ambient RNAs. However, unless the distinction between cell-containing and cell-free droplets is crystal clear, we believe that it is important to visualize all barcoded droplets to understand the overall quality of each scRNA-seq experiment. We noticed that existing tools do not provide effective visualization for understanding the quality of scRNA-seq data in terms of ambient RNA contamination, and we believe that SiftCell-Shuffle is a unique tool that allows users to visually interpret the spectrum of all barcoded droplets. It should be noted, however, that SiftCell-Shuffle offers only a quasi-ground truth under the assumption that randomized droplets are good representatives of cell-free droplets.

Compared to existing methods, SiftCell-Boost and SiftCell-Mix performed consistently better than or comparably with the best-performing methods across the three datasets, which pose different types of challenges. Most methods performed well for the PBMC dataset, which is the most widely recognized single-cell dataset, but many methods struggled with single-nucleus RNA-seq (brain nuclei) or scRNA-seq generated with Drop-Seq (colon cell line mixture). We believe that our method is robust across various data types and provides a means to visualize performance through SiftCell-Shuffle.

Both SiftCell-Boost and SiftCell-Mix benefit from the ability of SiftCell-Shuffle to robustly filter cell-containing droplets. Compared to existing methods, the biggest advantage of these methods is their integration with SiftCell-Shuffle. When the positive and negative labels can be visualized alongside randomized droplets using SiftCell-Shuffle, users can have much more confidence that the trained models reflect the actual structure of cell-type heterogeneity and the distribution of ambient RNAs. Even though the algorithms in SiftCell-Boost and SiftCell-Mix are fully automated, users can still provide a manually curated version of the training labels by leveraging visualizations from SiftCell-Shuffle.

One interesting observation from SiftCell-Shuffle is that the randomized droplets appear to be grouped into multiple clusters instead of a single cluster (Figure 2B, S3B, S4B). More surprisingly, even when we used only randomized droplets, we still observed distinct clusters and non-uniformity in the manifold space (Figure S12). Because randomized droplets should have a uniform distribution, they should in principle form a single cluster. However, when applying the standard clustering and manifold methods implemented in Seurat, we noticed that the UMI count acts as a confounding factor, creating spurious clusters and manifold structure that are largely explained by the UMI count per droplet. This suggests that many current single-cell analysis methods may be confounded by the number of UMIs per droplet regardless of cell type, and may be further improved.

The SiftCell framework can be easily adopted by other quality control methods for single-cell genomic data. Existing methods for identifying cell-containing droplets may be improved by incorporating outcomes from SiftCell-Shuffle. The key idea underlying SiftCell-Shuffle, SiftCell-Boost, and SiftCell-Mix is not necessarily limited to scRNA-seq or snRNA-seq, so it should be possible to apply the same principle to scATAC-seq data or single-cell multiome datasets, even though some tweaks may be necessary to optimize performance.

There is room for further improvement in SiftCell-Shuffle. For example, it may be better to assume that ambient RNAs are not a totally random sample of the pseudo-bulk scRNA-seq reads. In fact, many studies have demonstrated that ambient RNAs are enriched for specific features, such as mitochondrial genes or necrosis marker genes²⁸. Our current approach randomly shuffles barcoded droplets in SiftCell-Shuffle, but provides an option to remove specific genes that are determined to be enriched or depleted in cell-free droplets. Our method can be further extended to a non-random permutation or bootstrapping, and how to define a better ‘null’ distribution of ambient RNAs is a subject of further research.

Binary classification of droplets into cell-containing and cell-free droplets with SiftCell-Boost may make the downstream analysis simpler, but a more sophisticated procedure is needed to handle datasets with heavy contamination from ambient RNAs. In such cases, estimates from SiftCell-Mix can inform the quality of classification results. For example, in the brain nuclei dataset, SiftCell-Mix estimated that 27.0% of cell-containing droplets (inferred by SiftCell-Boost) have > 10% of ambient RNAs present. This is substantially larger than 4.7% for PBMC and 9.9% for the colon cell line mixture, suggesting the importance of accounting for ambient RNAs when analyzing single-nucleus RNA-seq data. In the colon cell line mixture dataset, we observed that droplets containing multiple cells (multiplets) tend to be classified as cell-free droplets more often than true single cells, because a mixture of multiple cell types tends to be more similar to ambient RNAs.

Even though SiftCell-Mix provides quantitative estimates of the contribution from ambient RNAs, we noticed that the estimates can be quite unstable under our maximum-likelihood framework when the number of reads per cell is limited. Imposing a stronger prior under a Bayesian framework may make the estimation more stable for sparse data.

In summary, we introduced SiftCell, a suite of tools that helps investigators perform quality control of single-cell transcriptomic datasets with visual cues, focusing on determining cell-containing droplets and understanding the degree of contamination from ambient RNAs in the sequenced library of scRNA-seq and snRNA-seq. We believe that SiftCell will facilitate a more holistic understanding of scRNA-seq data from the upstream stage and reduce the chance that upstream technical issues such as ambient RNA contamination obscure novel scientific discovery.

STAR★Methods

RESOURCE AVAILABILITY

Lead contact

Further requests for resources or information should be directed to the lead contact, Hyun Min Kang (hmkang@umich.edu).

Materials availability

This study did not generate new materials.

Data and code availability

This paper analyzes existing, publicly available data. Accession numbers for the datasets are listed in the key resources table. All code has been deposited in a GitHub repository. The DOI for this archive (or direct URL if DOI is not available) is listed in the key resources table. Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

KEY RESOURCE TABLE

REAGENT or RESOURCE	SOURCE	IDENTIFIER
Data
PBMC	10X Genomics	https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_10k_v3
Brain nuclei	10X Genomics	https://support.10xgenomics.com/single-cell-gene-expression/datasets/2.1.0/nuclei_900
Colon cell line mixture	Park et al., 2020	GSE149224
Software and algorithms
SiftCell	This paper	https://github.com/jyxi7676/SiftCell/ https://doi.org/10.5281/zenodo.7738864
DropletUtils (version 1.18.0)	Lun et al., 2019	https://doi.org/10.18129/B9.bioc.DropletUtils
DIEM (version 2.2.0)	Alvarez et al., 2020	https://github.com/marcalva/diem
CellBender (version 0.1.0)	Fleming et al., 2019	https://github.com/broadinstitute/CellBender
Celda (version 1.4.7)	Yang et al., 2020	https://bioconductor.org/packages/release/bioc/html/celda.html
Seurat (version 4.0.2)	Hao et al., 2021	https://doi.org/10.1016/j.cell.2021.04.048


Methods Details

SiftCell-Shuffle : Visualizing of cell-free and cell-containing droplets in a manifold space

Our methods assume that we have a raw digital expression (DGE) matrix X ∈ {0, 1, 2, ⋯}^{B×G}, where B is the total number of unique barcodes representing individual droplets, and G is the total number of genes or features. Then $X_{b \cdot} = \sum_{g = 1}^{G} X_{b g}$ is the total UMI count of each barcoded droplet, and $X_{\cdot g} = \sum_{b = 1}^{B} X_{b g}$ is the total number of reads covering each gene. The SiftCell-Shuffle algorithm takes an original DGE matrix X and outputs a permuted DGE matrix X^(S) while preserving X_b∙ and X_∙g, simulating cell-free droplets that contain only pseudo-bulk RNAs as an approximation of ambient RNAs. Specifically, let $U = \sum_{g = 1}^{G} \sum_{b = 1}^{B} X_{b g}$ be the total number of UMIs, and let J_u ∈ {1, ⋯, B} × {1, ⋯, G}, u ∈ {1, …, U}, represent the (barcode, gene) pair each UMI belongs to, so that $X_{b g} = \sum_{u = 1}^{U} I (J_{u} = (b, g))$ always holds.

The SiftCell-Shuffle algorithm simply permutes the barcodes and genes in J_u independently (i.e. randomizes the relationship between barcodes and genes) across all UMIs to produce $J_{u}^{(S)}$; the DGE matrix after SiftCell-Shuffle then becomes $X_{b g}^{(S)} = \sum_{u = 1}^{U} I (J_{u}^{(S)} = (b, g))$. As a result, the total number of UMIs for each barcode and each gene remains unchanged, because $X_{b \cdot}^{(S)} = X_{b \cdot}$ and $X_{\cdot g}^{(S)} = X_{\cdot g}$ hold as long as $J_{u}^{(S)}$ is a permutation of J_u. The main idea of this procedure is that the distribution of UMI counts for each barcode in $X_{b g}^{(S)}$ is uniform, as if the droplet barcodes were randomly assigned to reads from a bulk RNA-seq experiment (i.e. the aggregate of all reads ignoring barcode assignment), which we assume to represent the distribution of ambient RNAs.
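The shuffling step described above can be illustrated with a minimal NumPy sketch. It is not the SiftCell implementation; the function name `siftcell_shuffle`, the dense-matrix input, and the `seed` argument are assumptions made for illustration. The sketch expands the DGE matrix into one (barcode, gene) record per UMI, permutes the two label vectors independently, and rebuilds the matrix.

```python
import numpy as np

def siftcell_shuffle(X, seed=0):
    """Illustrative shuffle: permute UMI-level barcode and gene labels independently."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=np.int64)
    B, G = X.shape

    # Expand X into one (barcode, gene) record per UMI, i.e. the pairs J_u.
    b_idx, g_idx = np.nonzero(X)
    counts = X[b_idx, g_idx]
    barcodes = np.repeat(b_idx, counts)
    genes = np.repeat(g_idx, counts)

    # Permute the two label vectors independently to break the barcode-gene link.
    barcodes_s = rng.permutation(barcodes)
    genes_s = rng.permutation(genes)

    # Rebuild the shuffled DGE matrix X^(S) by counting the permuted pairs.
    X_s = np.zeros((B, G), dtype=np.int64)
    np.add.at(X_s, (barcodes_s, genes_s), 1)
    return X_s

X = np.array([[5, 0, 1],
              [0, 3, 0],
              [2, 2, 7]])
X_s = siftcell_shuffle(X, seed=1)
# Row sums (UMIs per barcode) and column sums (UMIs per gene) are unchanged.
print(X_s.sum(axis=1), X_s.sum(axis=0))
```

Because each label vector is only reordered, every barcode and every gene keeps exactly the number of UMIs it had in X. A real dataset would need a sparse representation rather than a dense matrix.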

To visualize whether each barcoded droplet likely contains only ambient RNAs or not, we construct low-dimensional manifold plots, such as UMAP or t-SNE, after combining X_bg and $X_{b g}^{(S)}$ into one DGE matrix. SiftCell-Shuffle runs Seurat with default parameters, except for no minimum UMI count and 10 PCs, on the merged DGE matrix to generate and visualize t-SNE and UMAP manifolds. The visualized manifold distinguishes the barcodes from X_bg and $X_{b g}^{(S)}$ in different colors (Figure 2A, Figure S1A). If a barcode contains ambient RNAs only, we expect it to appear proximal to $X_{b g}^{(S)}$ in the manifold space. For barcodes representing cell-containing droplets, we expect them to be located in a separate cluster from $X_{b g}^{(S)}$ (Figure 2B, Figure S1B). These plots allow us to quickly visualize how many cell-containing and cell-free droplets exist in a scRNA-seq or snRNA-seq dataset. When the randomized and original droplets are clustered together with Seurat³¹, the putative cell-free droplets tend to be assigned the same cluster label as the shuffled droplets (Figure 2B, S1B). Visualizing the total UMI counts (Figure 2C, S1C) or the proportion of mitochondrial reads (Figure 2D, S1D) also illustrates that the cluster of shuffled droplets is enriched for lower total UMIs and a high % of mitochondrial reads. This visualization was used to visually evaluate how well a specific quality control method classifies cell-containing and cell-free droplets across all datasets.

Evaluation of existing methods for filtering cell-containing droplets with SiftCell-Shuffle

We evaluated existing methods for distinguishing cell-free droplets from cell-containing droplets by visualizing the results from each method in the t-SNE manifold plots generated by SiftCell-Shuffle. We used t-SNE instead of UMAP because it distributes the cell-free droplets more widely in the manifold space, which suits the purpose of our evaluation. Four methods were used for evaluation: (1) the CellRanger/UMI-cutoff method, which determines cell-containing droplets based on a UMI count threshold determined from the knee plot (and a few other criteria), as implemented in CellRanger 2 (https://github.com/10XGenomics/cellranger); (2) EmptyDrops²⁴, implemented in the DropletUtils R package³⁵, which uses a likelihood-based permutation test to determine cell-containing droplets; (3) DIEM²², which uses Expectation Maximization with a Dirichlet distribution to identify droplets contaminated by ambient RNAs or extranuclear RNAs; and (4) CellBender²⁹, which uses a generative model based on a deep neural network to identify cell-free and cell-containing droplets.

For the UMI cutoff method, we used the default output from CellRanger 2 for PBMC and brain nuclei, as they were generated with 10x Chromium. For the colon cell line mixture, which was produced with the DropSeq platform, we determined the UMI cutoff from the knee plot (UMI≥5440, Figure S7). For EmptyDrops, which is expected to be similar to CellRanger 3, we used the default parameters for PBMC and brain nuclei, which treat droplets with UMI≤100 as representing ambient RNAs. For the colon cell line mixture, because the UMI cutoff was high, we used UMI≤200 to determine ambient RNAs. For DIEM and CellBender, we used the default parameters across all three datasets.

To illustrate the performance of each method with SiftCell-Shuffle, we visualized original and shuffled droplets in t-SNE space with three categories: (1) shuffled droplets, (2) original droplets classified as cell-containing, and (3) original droplets classified as cell-free by each algorithm (Figure 2, 3). This illustration is used to visually evaluate how well each algorithm filters droplets.

We also developed a metric, “%NN-concordance”, as an alternative to the knee plot to estimate the number of cell-containing droplets by leveraging shuffled droplets. For each original droplet that is classified as cell-containing, the nearest droplet (in terms of Euclidean distance over the top 100 PCs of highly variable genes) in the combined original + shuffled dataset is selected. The %NN-concordance metric quantifies, across all filtered droplets, how often their nearest droplet is an original droplet as opposed to a shuffled droplet (Figure S6). This can be done for an arbitrary subset of droplets. This metric is intended to quantify how well the filtered droplets are separated from the shuffled droplets in a high-dimensional space.
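As an illustration of the metric described above, the following NumPy sketch computes a %NN-concordance value from precomputed principal-component coordinates. The function name `nn_concordance` and its arguments are hypothetical, the brute-force distance computation is chosen for clarity only, and excluding a droplet from being its own neighbor is an assumption about how "nearest droplet" is defined.

```python
import numpy as np

def nn_concordance(pcs_original, pcs_shuffled, selected):
    """Percent of selected original droplets whose nearest droplet is original.

    pcs_original : (n_orig, n_pcs) PC coordinates of original droplets
    pcs_shuffled : (n_shuf, n_pcs) PC coordinates of shuffled droplets
    selected     : indices (into pcs_original) of droplets called cell-containing
    """
    pcs_original = np.asarray(pcs_original, dtype=float)
    pcs_shuffled = np.asarray(pcs_shuffled, dtype=float)
    selected = np.asarray(selected, dtype=int)

    pool = np.vstack([pcs_original, pcs_shuffled])  # original rows come first
    n_orig = pcs_original.shape[0]
    query = pcs_original[selected]

    # Squared Euclidean distance from each selected droplet to every pooled droplet.
    d2 = ((query[:, None, :] - pool[None, :, :]) ** 2).sum(axis=2)
    # Assumption: a droplet cannot be its own nearest neighbor.
    d2[np.arange(len(selected)), selected] = np.inf

    nearest = d2.argmin(axis=1)
    return 100.0 * np.mean(nearest < n_orig)

# Toy example in 2 dimensions: three "cells" far from the shuffled droplets,
# plus one original droplet sitting next to them.
orig = np.array([[10.0, 10.0], [10.1, 10.0], [10.0, 10.1], [0.0, 0.1]])
shuf = np.array([[0.0, 0.0], [0.1, 0.0]])
print(nn_concordance(orig, shuf, [0, 1, 2]))     # 100.0
print(nn_concordance(orig, shuf, [0, 1, 2, 3]))  # 75.0
```

For datasets with many droplets, the dense distance matrix would be replaced by a tree-based or approximate nearest-neighbor search.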

SiftCell-Boost : Automated machine learning method to identify cell-containing droplets.

SiftCell-Boost uses a classification method (XGBoost) to classify each barcoded droplet as a cell-containing (positive label) or cell-free (negative label) droplet, using a training set consisting of permuted droplets from SiftCell-Shuffle and a subset of original droplets that are likely cell-containing (Figure 1). Negative labels include all permuted droplets from SiftCell-Shuffle. Except for a few specified examples, we also include additional negative labels from original droplets based on the % of reads from unwanted genes (3 standard deviations above the median). The unwanted genes include all mitochondrial genes across the 3 datasets. For PBMC, we also included PPBP, a marker gene for the platelet cell type. This is to avoid classifying platelets, which are not supposed to be part of the PBMC cell types, as cell-containing droplets.

To generate positive labels, we estimated how likely each original droplet is to contain a cell using SQuAT (sparse quantile aggregation test, https://github.com/hyunminkang/squat), applied non-parametric ranking to the UMI counts and the SQuAT z-scores, and selected the top N droplets, where N is the expected number of cells provided by the user. In our experiments, we used N=10,000 for PBMC, N=1,000 for brain nuclei, and N=800 for the colon cell line mixture, as suggested by CellRanger 2 or the published data.

To summarize, the training dataset comes from three sources: (1) negative labels are obtained from randomized droplets, (2) positive labels are obtained from confident cell-containing droplets estimated from SQuAT, and (3) additional negative labels are obtained from the original droplets based on excessive contribution from mtRNAs and/or enrichment of marker genes representing unwanted cells (e.g. platelets in PBMC datasets). The test data are the remaining unlabeled droplets, which are not part of the training data. SiftCell-Boost uses extreme gradient boosting (XGBoost) to train the classification model with the positively and negatively labeled droplets. To generate the features for XGBoost training, we use the top 100 principal components from the log-normalized digital expression matrix, focusing on the 1,000 most variable genes identified by SQuAT.
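The three label sources summarized above can be sketched as follows. This is an illustrative outline only: the function name `assemble_labels`, the label coding (1, 0, -1), and the rule that an excess of unwanted-gene reads overrides a positive label are assumptions not stated in the text, and the SQuAT ranking and the XGBoost training step are not reproduced here.

```python
import numpy as np

def assemble_labels(unwanted_frac, confident_idx, n_shuffled, n_sd=3.0):
    """Build training labels: 1 = cell-containing, 0 = cell-free, -1 = unlabeled.

    unwanted_frac : per-droplet fraction of reads from unwanted genes (original droplets)
    confident_idx : indices of original droplets taken as confident cells
                    (e.g. the top-N droplets from an upstream ranking)
    n_shuffled    : number of shuffled droplets, all labeled negative
    """
    unwanted_frac = np.asarray(unwanted_frac, dtype=float)
    labels = np.full(unwanted_frac.shape[0], -1, dtype=int)

    # Source (2): confident cell-containing droplets become positives.
    labels[np.asarray(confident_idx, dtype=int)] = 1

    # Source (3): droplets more than n_sd standard deviations above the median
    # fraction of unwanted-gene reads become negatives (assumed to take precedence).
    cutoff = np.median(unwanted_frac) + n_sd * unwanted_frac.std()
    labels[unwanted_frac > cutoff] = 0

    # Source (1): every shuffled droplet is a negative.
    shuffled_labels = np.zeros(n_shuffled, dtype=int)
    return labels, shuffled_labels

frac = np.array([0.02, 0.03] * 5 + [0.02, 0.90])
labels, shuffled_labels = assemble_labels(frac, confident_idx=[0, 1], n_shuffled=5)
print(labels)           # droplets still labeled -1 form the test set
print(shuffled_labels)
```

Droplets that remain at -1 correspond to the unlabeled test data; the labeled droplets, together with their principal-component features, would then be passed to a gradient-boosting classifier.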

Using SiftCell-Boost, we classified original droplets into cell-containing and cell-free droplets and compared its performance with other methods using the t-SNE visualization from SiftCell-Shuffle (Figure 2) as well as the %NN-concordance metric (Figure S6). The %NN-concordance metric was evaluated for arbitrary UMI cutoffs as well as for the 5 methods used to filter cell-containing droplets.

SiftCell-Mix: Model-based approach for inferring the fraction of ambient RNAs in each droplet

It is possible that some of the reads in a droplet are “contaminated” by ambient RNAs floating outside individual cells²⁵,³⁴. For cell-free droplets, most reads are contributed by ambient RNAs, but even for cell-containing droplets, ambient RNAs may be present among the barcoded reads assigned to the droplet. We assume that the read counts for a droplet follow a mixture of multinomial distributions, one for each cell type and an additional one representing ambient RNAs. Prior to SiftCell-Mix analysis, we assume that a large fraction of cell-containing droplets have been assigned to specific cell types using other software tools (e.g. Seurat or scanpy) or by a domain expert, so that the distribution of each cell type can be modeled reliably.

Let n₁, n₂, …, n_K be the number of droplets assigned to each of the K cell types, and let n₀ = N be the total number of droplets in the single-cell RNA-seq dataset, including cell-containing and cell-free droplets. Let D₀, D₁, …, D_K denote the sets of droplets corresponding to n₀, n₁, …, n_K, respectively. For a given droplet i ∈ {1, 2, ⋯, N}, the read count across the G genes (with nonzero read count) is a vector: $x_{i} = (x_{i}^{1}, x_{i}^{2}, \dots, x_{i}^{G})$. Let $π_{k} = (π_{k}^{1}, π_{k}^{2}, \dots, π_{k}^{G})$ for k ∈ {0, 1, ⋯, K} be the multinomial probabilities representing the distribution of each cell type (or ambient RNAs for k = 0). We model x_i as a multinomial mixture between ambient RNAs (π₀) and one of the cell types (π_k, k > 0), as we will describe later. For k > 0, we define π_k as the arithmetic mean of the proportion of reads of each gene across the droplets of cell type k:

π_{k}^{j} = \frac{1}{n_{k}} \sum_{i \in D_{k}} [\frac{x_{i}^{j}}{\sum_{g = 1}^{G} x_{i}^{g}}], j \in {1, 2, \dots, G}

We define π₀ in a similar way, but across all droplets regardless of their cell types, slightly upweighting droplets with high total UMI count, but ensuring minimum weight λ (=100 in our experiments) for droplets with low UMI counts according to $w_{i} = min (λ, \sum_{g = 1}^{G} x_{i}^{g})$. We weight π₀ based on log w_i to better represent the distribution of ambient RNAs enriched for low-UMI count droplets.

π_{0}^{j} = \frac{\sum_{i \in D_{0}} [\frac{x_{i}^{j}}{\sum_{g = 1}^{G} x_{i}^{g}} log w_{i}]}{\sum_{i \in D_{0}} log w_{i}}, j \in {1, 2, \dots, G}

To avoid the corner case that $π_{k}^{j} = 0$ for some (j, k), we adjust $π_{k}^{j}$ to contain a small fraction (α) of ambient RNAs as $(1 - α) π_{k}^{j} + α π_{0}^{j}$, and used α = 0.01 in our experiments. In summary, π₁, π₂, …, π_K are defined as the arithmetic mean of read proportions within each cell type, and π₀ is defined as a logarithmically weighted mean of read proportions across all droplets, with a threshold on the weights.
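The profile definitions above can be written out in a short NumPy sketch. The function names (`cell_type_profile`, `ambient_profile`, `smooth_profile`) are hypothetical, the weight w_i is implemented exactly as printed in the text, and the sketch assumes every droplet has at least one read and at least one droplet has more than one read so that the weights do not all vanish.

```python
import numpy as np

def cell_type_profile(X_k):
    """pi_k: arithmetic mean of per-droplet read proportions within one cell type."""
    X_k = np.asarray(X_k, dtype=float)
    props = X_k / X_k.sum(axis=1, keepdims=True)
    return props.mean(axis=0)

def ambient_profile(X_all, lam=100.0):
    """pi_0: log(w_i)-weighted mean of read proportions across all droplets."""
    X_all = np.asarray(X_all, dtype=float)
    totals = X_all.sum(axis=1)
    w = np.minimum(lam, totals)        # w_i as printed in the text
    log_w = np.log(w)
    props = X_all / totals[:, None]
    return (props * log_w[:, None]).sum(axis=0) / log_w.sum()

def smooth_profile(pi_k, pi_0, alpha=0.01):
    """Mix a small fraction alpha of the ambient profile into pi_k to avoid zeros."""
    return (1.0 - alpha) * np.asarray(pi_k) + alpha * np.asarray(pi_0)

# Toy data: 4 droplets x 3 genes; the first two droplets form one cell type.
X_all = np.array([[2, 0, 0],
                  [4, 0, 0],
                  [1, 1, 2],
                  [50, 50, 100]])
pi_0 = ambient_profile(X_all)
pi_1 = cell_type_profile(X_all[:2])      # equals [1, 0, 0] before smoothing
pi_1_adj = smooth_profile(pi_1, pi_0)    # strictly positive after smoothing
print(pi_0, pi_1, pi_1_adj)
```

Because both π₀ and π_k sum to one, the adjusted profile also sums to one, and every gene with a nonzero ambient probability receives a nonzero probability in each cell type.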

We model the likelihood of the read counts in a droplet as a mixture of multinomial distributions of ambient RNAs and one of the cell types. Let γ_ik be the fraction of contributions from the kth category to the ith droplet, where k = 0, 1, …, K and k = 0 denotes ambient RNAs. The objective function is the negative log likelihood of the reads coming from the different cell types and ambient RNAs, which can be formulated as:

f_{i} = - log [\sum_{k = 0}^{K} γ_{i k} exp (\sum_{j = 1}^{G} x_{i}^{j} log π_{k}^{j})]

Subject to :

\sum_{k = 0}^{K} γ_{i k} = 1

0 \leq γ_{i k} \leq 1, \forall k \in {0, 1, \dots, K} and \forall i \in {1, 2, \dots, N}

The nonlinear optimization problem is solved using augmented Lagrange multiplier method with a sequential quadratic programming interior algorithm as implemented in Rsolnp package (v1.16) available in CRAN.
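Evaluating the objective f_i directly can underflow, because the inner term exp(Σ_j x_i^j log π_k^j) becomes extremely small for droplets with many reads. The sketch below evaluates the objective as printed above in log space using the log-sum-exp trick. The function name `neg_log_lik` is hypothetical, the profiles are assumed to be strictly positive (e.g. after the α adjustment), and the constrained optimization itself, which the paper solves with Rsolnp, is not reproduced.

```python
import numpy as np

def neg_log_lik(x_i, pi, gamma_i):
    """f_i = -log sum_k gamma_ik * exp(sum_j x_i^j * log pi_k^j), in log space.

    x_i     : (G,) read counts of droplet i
    pi      : (K+1, G) strictly positive profiles, row 0 = ambient RNAs
    gamma_i : (K+1,) mixture fractions, non-negative and summing to 1
    """
    x_i = np.asarray(x_i, dtype=float)
    pi = np.asarray(pi, dtype=float)
    gamma_i = np.asarray(gamma_i, dtype=float)
    if not (np.all(gamma_i >= 0) and np.isclose(gamma_i.sum(), 1.0)):
        raise ValueError("gamma_i must lie on the probability simplex")

    # Per-category log term: sum_j x_i^j * log pi_k^j.
    loglik_k = np.log(pi) @ x_i

    # Log-sum-exp over categories with nonzero weight.
    keep = gamma_i > 0
    a = np.log(gamma_i[keep]) + loglik_k[keep]
    m = a.max()
    return -(m + np.log(np.exp(a - m).sum()))

pi = np.array([[0.5, 0.5],    # ambient profile (k = 0)
               [0.9, 0.1]])   # one cell type (k = 1)
gamma = np.array([0.2, 0.8])
print(neg_log_lik([3, 1], pi, gamma))        # small droplet
print(neg_log_lik([3000, 1000], pi, gamma))  # stays finite for a large droplet
```

A function of this form could serve as the objective passed to a constrained solver, with the simplex constraints on γ_ik handled by that solver.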

Supplementary Material

NIHMS1919287-supplement-1.pdf (732.6 KB, pdf)

NIHMS1919287-supplement-2.pdf (26 MB, pdf)

Highlights.

SiftCell is a suite of software tools for filtering of droplets in single-cell RNA-seq.
SiftCell-Shuffle visually distinguishes cell-free droplets in manifold space.
SiftCell-Boost filters cell-containing droplets; SiftCell-Mix quantifies ambient RNAs.
SiftCell is evaluated across three independent datasets, comparing with other methods.

Acknowledgements

JHL, SRP, and JX are supported by NIH grants CA091, DK114131 and DK102850, and by ADVANCE and MTRAC awards funded by the Michigan Economic Development Corporation. HMK is supported by NIH grants HL137182, HG011031, and HHSN268201800002I. JHL and HMK are supported by the Taubman Institute.

Footnotes

Declaration of Interests

HMK owns stock in Regeneron Pharmaceuticals. JHL is an inventor on pending patent applications related to Seq-Scope.

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

References

1. Klein AM, Mazutis L, Akartuna I, Tallapragada N, Veres A, Li V, Peshkin L, Weitz DA, and Kirschner MW (2015). Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 161, 1187–1201. 10.1016/j.cell.2015.04.044.
2. Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, Tirosh I, Bialas AR, Kamitaki N, Martersteck EM, et al. (2015). Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets. Cell 161, 1202–1214. 10.1016/j.cell.2015.05.002.
3. Preissl S, Fang R, Huang H, Zhao Y, Raviram R, Gorkin DU, Zhang Y, Sos BC, Afzal V, Dickel DE, et al. (2018). Single-nucleus analysis of accessible chromatin in developing mouse forebrain reveals cell-type-specific transcriptional regulation. Nat Neurosci 21, 432–439. 10.1038/s41593-018-0079-3.
4. Buenrostro JD, Wu B, Litzenburger UM, Ruff D, Gonzales ML, Snyder MP, Chang HY, and Greenleaf WJ (2015). Single-cell chromatin accessibility reveals principles of regulatory variation. Nature 523, 486–490. 10.1038/nature14590.
5. Habib N, Avraham-Davidi I, Basu A, Burks T, Shekhar K, Hofree M, Choudhury SR, Aguet F, Gelfand E, Ardlie K, et al. (2017). Massively parallel single-nucleus RNA-seq with DroNc-seq. Nat Methods 14, 955–958. 10.1038/nmeth.4407.
6. Tsoucas D, and Yuan GC (2018). GiniClust2: a cluster-aware, weighted ensemble clustering method for cell-type detection. Genome Biol 19, 58. 10.1186/s13059-018-1431-3.
7. Kiselev VY, Kirschner K, Schaub MT, Andrews T, Yiu A, Chandra T, Natarajan KN, Reik W, Barahona M, Green AR, and Hemberg M (2017). SC3: consensus clustering of single-cell RNA-seq data. Nat Methods 14, 483–486. 10.1038/nmeth.4236.
8. Jiang L, Chen H, Pinello L, and Yuan GC (2016). GiniClust: detecting rare cell types from single-cell gene expression data with Gini index. Genome Biol 17, 144. 10.1186/s13059-016-1010-4.
9. Grun D, Lyubimova A, Kester L, Wiebrands K, Basak O, Sasaki N, Clevers H, and van Oudenaarden A (2015). Single-cell messenger RNA sequencing reveals rare intestinal cell types. Nature 525, 251–255. 10.1038/nature14966.
10. Welch JD, Hartemink AJ, and Prins JF (2016). SLICER: inferring branched, nonlinear cellular trajectories from single cell RNA-seq data. Genome Biol 17, 106. 10.1186/s13059-016-0975-3.
11. Setty M, Tadmor MD, Reich-Zeliger S, Angel O, Salame TM, Kathail P, Choi K, Bendall S, Friedman N, and Pe’er D (2016). Wishbone identifies bifurcating developmental trajectories from single-cell data. Nature biotechnology 34, 637–645. 10.1038/nbt.3569.
12. Schiebinger G, Shu J, Tabaka M, Cleary B, Subramanian V, Solomon A, Gould J, Liu S, Lin S, Berube P, et al. (2019). Optimal-Transport Analysis of Single-Cell Gene Expression Identifies Developmental Trajectories in Reprogramming. Cell 176, 1517. 10.1016/j.cell.2019.02.026.
13. Qiu X, Mao Q, Tang Y, Wang L, Chawla R, Pliner HA, and Trapnell C (2017). Reversed graph embedding resolves complex single-cell trajectories. Nat Methods 14, 979–982. 10.1038/nmeth.4402.
14. Soneson C, and Robinson MD (2018). Bias, robustness and scalability in single-cell differential expression analysis. Nat Methods 15, 255–261. 10.1038/nmeth.4612.
15. Ntranos V, Yi L, Melsted P, and Pachter L (2019). A discriminative learning approach to differential expression analysis for single-cell RNA-seq. Nat Methods 16, 163–166. 10.1038/s41592-018-0303-9.
16. Stoeckius M, Zheng S, Houck-Loomis B, Hao S, Yeung BZ, Mauck WM 3rd, Smibert P, and Satija R (2018). Cell Hashing with barcoded antibodies enables multiplexing and doublet detection for single cell genomics. Genome Biol 19, 224. 10.1186/s13059-018-1603-1.
17. Gierahn TM, Wadsworth MH 2nd, Hughes TK, Bryson BD, Butler A, Satija R, Fortune S, Love JC, and Shalek AK (2017). Seq-Well: portable, low-cost RNA sequencing of single cells at high throughput. Nat Methods 14, 395–398. 10.1038/nmeth.4179.
18. Rosenberg AB, Roco CM, Muscat RA, Kuchina A, Sample P, Yao Z, Graybuck LT, Peeler DJ, Mukherjee S, Chen W, et al. (2018). Single-cell profiling of the developing mouse brain and spinal cord with split-pool barcoding. Science 360, 176–182. 10.1126/science.aam8999.
19. Zheng GX, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, Ziraldo SB, Wheeler TD, McDermott GP, Zhu J, et al. (2017). Massively parallel digital transcriptional profiling of single cells. Nat Commun 8, 14049. 10.1038/ncomms14049.
20. Cao J, Packer JS, Ramani V, Cusanovich DA, Huynh C, Daza R, Qiu X, Lee C, Furlan SN, Steemers FJ, et al. (2017). Comprehensive single-cell transcriptional profiling of a multicellular organism. Science 357, 661–667. 10.1126/science.aam8940.
21. O’Flanagan CH, Campbell KR, Zhang AW, Kabeer F, Lim JLP, Biele J, Eirew P, Lai D, McPherson A, Kong E, et al. (2019). Dissociation of solid tumor tissues with cold active protease for single-cell RNA-seq minimizes conserved collagenase-associated stress responses. Genome Biol 20, 210. 10.1186/s13059-019-1830-0.
22. Alvarez M, Rahmani E, Jew B, Garske KM, Miao Z, Benhammou JN, Ye CJ, Pisegna JR, Pietilainen KH, Halperin E, and Pajukanta P (2020). Enhancing droplet-based single-nucleus RNA-seq resolution using the semi-supervised machine learning classifier DIEM. Sci Rep 10, 11019. 10.1038/s41598-020-67513-5.
23. Nadal-Ribelles M, Islam S, Wei W, Latorre P, Nguyen M, de Nadal E, Posas F, and Steinmetz LM (2019). Sensitive high-throughput single-cell RNA-seq reveals within-clonal transcript correlations in yeast populations. Nat Microbiol 4, 683–692. 10.1038/s41564-018-0346-9.
24. Lun ATL, Riesenfeld S, Andrews T, Dao TP, Gomes T, participants in the 1st Human Cell Atlas, J., and Marioni JC (2019). EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data. Genome Biol 20, 63. 10.1186/s13059-019-1662-y.
25. Yang S, Corbett SE, Koga Y, Wang Z, Johnson WE, Yajima M, and Campbell JD (2020). Decontamination of ambient RNA in single-cell RNA-seq with DecontX. Genome Biol 21, 57. 10.1186/s13059-020-1950-6.
26. Young MD, and Behjati S (2020). SoupX removes ambient RNA contamination from droplet-based single-cell RNA sequencing data. Gigascience 9. 10.1093/gigascience/giaa151.
27. Dempster AP, Laird NM, and Rubin DB (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological) 39, 1–22.
28. Muskovic W, and Powell JE (2021). DropletQC: improved identification of empty droplets and damaged cells in single-cell RNA-seq data. Genome Biol 22, 329. 10.1186/s13059-021-02547-0.
29. Fleming SJ, Marioni JC, and Babadi M (2019). CellBender remove-background: a deep generative model for unsupervised removal of background noise from scRNA-seq datasets. bioRxiv, 791699. 10.1101/791699.
30. Park SR, Namkoong S, Friesen L, Cho CS, Zhang ZZ, Chen YC, Yoon E, Kim CH, Kwak H, Kang HM, and Lee JH (2020). Single-Cell Transcriptome Analysis of Colon Cancer Cell Response to 5-Fluorouracil-Induced DNA Damage. Cell Rep 32, 108077. 10.1016/j.celrep.2020.108077.
31. Butler A, and Satija R (2017). Integrated analysis of single cell transcriptomic data across conditions, technologies, and species. bioRxiv. 10.1101/164889.
32. Waltman L, and Van Eck NJ (2013). A smart local moving algorithm for large-scale modularity-based community detection. The European physical journal B 86, 1–14.
33. Chen T, and Guestrin C (2016). Xgboost: A scalable tree boosting system. pp. 785–794.
34. Heaton H, Talman AM, Knights A, Imaz M, Gaffney DJ, Durbin R, Hemberg M, and Lawniczak MKN (2020). Souporcell: robust clustering of single-cell RNA-seq data by genotype without reference genotypes. Nat Methods 17, 615–620. 10.1038/s41592-020-0820-1.
35. Griffiths JA, Richard AC, Bach K, Lun ATL, and Marioni JC (2018). Detection and removal of barcode swapping in single-cell RNA-seq data. Nat Commun 9, 2667. 10.1038/s41467-018-05083-x.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

NIHMS1919287-supplement-1.pdf^{(732.6KB, pdf)}

NIHMS1919287-supplement-2.pdf^{(26MB, pdf)}

Data Availability Statement

KEY RESOURCE TABLE

REAGENT or RESOURCE	SOURCE	IDENTIFIER
Data
PBMC	10X Genomics	https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_10k_v3
Brain nuclei	10X Genomics	https://support.10xgenomics.com/single-cell-gene-expression/datasets/2.1.0/nuclei_900
Colon cell line mixture	Park et al., 2020	GSE149224
Software and algorithms
SiftCell	This paper	https://github.com/jyxi7676/SiftCell/ https://doi.org/10.5281/zenodo.7738864
DropletUtils(version 1.18.0)	Lun et al., 2019	https://doi.org/10.18129/B9.bioc.DropletUtils.
DIEM (version 2.2.0)	Alvarez et al., 2020	https://github.com/marcalva/diem
CellBender (version 0.1.0)	Fleming et al., 2019	https://github.com/broadinstitute/CellBender
Celda (version 1.4.7)	Yang et al., 2020	https://bioconductor.org/packages/release/bioc/html/celda.html
Seurat (version 4.0.2)	Hao et al.. 2021	https://doi.org/10.1016/j.cell.2021.04.048

Open in a new tab

1. Klein AM, Mazutis L, Akartuna I, Tallapragada N, Veres A, Li V, Peshkin L, Weitz DA, and Kirschner MW (2015). Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 161, 1187–1201. 10.1016/j.cell.2015.04.044.

2. Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, Tirosh I, Bialas AR, Kamitaki N, Martersteck EM, et al. (2015). Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets. Cell 161, 1202–1214. 10.1016/j.cell.2015.05.002.

3. Preissl S, Fang R, Huang H, Zhao Y, Raviram R, Gorkin DU, Zhang Y, Sos BC, Afzal V, Dickel DE, et al. (2018). Single-nucleus analysis of accessible chromatin in developing mouse forebrain reveals cell-type-specific transcriptional regulation. Nat Neurosci 21, 432–439. 10.1038/s41593-018-0079-3.

4. Buenrostro JD, Wu B, Litzenburger UM, Ruff D, Gonzales ML, Snyder MP, Chang HY, and Greenleaf WJ (2015). Single-cell chromatin accessibility reveals principles of regulatory variation. Nature 523, 486–490. 10.1038/nature14590.

5. Habib N, Avraham-Davidi I, Basu A, Burks T, Shekhar K, Hofree M, Choudhury SR, Aguet F, Gelfand E, Ardlie K, et al. (2017). Massively parallel single-nucleus RNA-seq with DroNc-seq. Nat Methods 14, 955–958. 10.1038/nmeth.4407.

6. Tsoucas D, and Yuan GC (2018). GiniClust2: a cluster-aware, weighted ensemble clustering method for cell-type detection. Genome Biol 19, 58. 10.1186/s13059-018-1431-3.

7. Kiselev VY, Kirschner K, Schaub MT, Andrews T, Yiu A, Chandra T, Natarajan KN, Reik W, Barahona M, Green AR, and Hemberg M (2017). SC3: consensus clustering of single-cell RNA-seq data. Nat Methods 14, 483–486. 10.1038/nmeth.4236.

8. Jiang L, Chen H, Pinello L, and Yuan GC (2016). GiniClust: detecting rare cell types from single-cell gene expression data with Gini index. Genome Biol 17, 144. 10.1186/s13059-016-1010-4.

9. Grün D, Lyubimova A, Kester L, Wiebrands K, Basak O, Sasaki N, Clevers H, and van Oudenaarden A (2015). Single-cell messenger RNA sequencing reveals rare intestinal cell types. Nature 525, 251–255. 10.1038/nature14966.

10. Welch JD, Hartemink AJ, and Prins JF (2016). SLICER: inferring branched, nonlinear cellular trajectories from single cell RNA-seq data. Genome Biol 17, 106. 10.1186/s13059-016-0975-3.

11. Setty M, Tadmor MD, Reich-Zeliger S, Angel O, Salame TM, Kathail P, Choi K, Bendall S, Friedman N, and Pe’er D (2016). Wishbone identifies bifurcating developmental trajectories from single-cell data. Nat Biotechnol 34, 637–645. 10.1038/nbt.3569.

12. Schiebinger G, Shu J, Tabaka M, Cleary B, Subramanian V, Solomon A, Gould J, Liu S, Lin S, Berube P, et al. (2019). Optimal-Transport Analysis of Single-Cell Gene Expression Identifies Developmental Trajectories in Reprogramming. Cell 176, 1517. 10.1016/j.cell.2019.02.026.

13. Qiu X, Mao Q, Tang Y, Wang L, Chawla R, Pliner HA, and Trapnell C (2017). Reversed graph embedding resolves complex single-cell trajectories. Nat Methods 14, 979–982. 10.1038/nmeth.4402.

14. Soneson C, and Robinson MD (2018). Bias, robustness and scalability in single-cell differential expression analysis. Nat Methods 15, 255–261. 10.1038/nmeth.4612.

15. Ntranos V, Yi L, Melsted P, and Pachter L (2019). A discriminative learning approach to differential expression analysis for single-cell RNA-seq. Nat Methods 16, 163–166. 10.1038/s41592-018-0303-9.

16. Stoeckius M, Zheng S, Houck-Loomis B, Hao S, Yeung BZ, Mauck WM 3rd, Smibert P, and Satija R (2018). Cell Hashing with barcoded antibodies enables multiplexing and doublet detection for single cell genomics. Genome Biol 19, 224. 10.1186/s13059-018-1603-1.

17. Gierahn TM, Wadsworth MH 2nd, Hughes TK, Bryson BD, Butler A, Satija R, Fortune S, Love JC, and Shalek AK (2017). Seq-Well: portable, low-cost RNA sequencing of single cells at high throughput. Nat Methods 14, 395–398. 10.1038/nmeth.4179.

18. Rosenberg AB, Roco CM, Muscat RA, Kuchina A, Sample P, Yao Z, Graybuck LT, Peeler DJ, Mukherjee S, Chen W, et al. (2018). Single-cell profiling of the developing mouse brain and spinal cord with split-pool barcoding. Science 360, 176–182. 10.1126/science.aam8999.

19. Zheng GX, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, Ziraldo SB, Wheeler TD, McDermott GP, Zhu J, et al. (2017). Massively parallel digital transcriptional profiling of single cells. Nat Commun 8, 14049. 10.1038/ncomms14049.

20. Cao J, Packer JS, Ramani V, Cusanovich DA, Huynh C, Daza R, Qiu X, Lee C, Furlan SN, Steemers FJ, et al. (2017). Comprehensive single-cell transcriptional profiling of a multicellular organism. Science 357, 661–667. 10.1126/science.aam8940.

21. O’Flanagan CH, Campbell KR, Zhang AW, Kabeer F, Lim JLP, Biele J, Eirew P, Lai D, McPherson A, Kong E, et al. (2019). Dissociation of solid tumor tissues with cold active protease for single-cell RNA-seq minimizes conserved collagenase-associated stress responses. Genome Biol 20, 210. 10.1186/s13059-019-1830-0.

22. Alvarez M, Rahmani E, Jew B, Garske KM, Miao Z, Benhammou JN, Ye CJ, Pisegna JR, Pietilainen KH, Halperin E, and Pajukanta P (2020). Enhancing droplet-based single-nucleus RNA-seq resolution using the semi-supervised machine learning classifier DIEM. Sci Rep 10, 11019. 10.1038/s41598-020-67513-5.

23. Nadal-Ribelles M, Islam S, Wei W, Latorre P, Nguyen M, de Nadal E, Posas F, and Steinmetz LM (2019). Sensitive high-throughput single-cell RNA-seq reveals within-clonal transcript correlations in yeast populations. Nat Microbiol 4, 683–692. 10.1038/s41564-018-0346-9.

24. Lun ATL, Riesenfeld S, Andrews T, Dao TP, Gomes T, participants in the 1st Human Cell Atlas Jamboree, and Marioni JC (2019). EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data. Genome Biol 20, 63. 10.1186/s13059-019-1662-y.

25. Yang S, Corbett SE, Koga Y, Wang Z, Johnson WE, Yajima M, and Campbell JD (2020). Decontamination of ambient RNA in single-cell RNA-seq with DecontX. Genome Biol 21, 57. 10.1186/s13059-020-1950-6.

26. Young MD, and Behjati S (2020). SoupX removes ambient RNA contamination from droplet-based single-cell RNA sequencing data. GigaScience 9, giaa151. 10.1093/gigascience/giaa151.

27. Dempster AP, Laird NM, and Rubin DB (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological) 39, 1–22.

28. Muskovic W, and Powell JE (2021). DropletQC: improved identification of empty droplets and damaged cells in single-cell RNA-seq data. Genome Biol 22, 329. 10.1186/s13059-021-02547-0.

29. Fleming SJ, Marioni JC, and Babadi M (2019). CellBender remove-background: a deep generative model for unsupervised removal of background noise from scRNA-seq datasets. bioRxiv, 791699. 10.1101/791699.

30. Park SR, Namkoong S, Friesen L, Cho CS, Zhang ZZ, Chen YC, Yoon E, Kim CH, Kwak H, Kang HM, and Lee JH (2020). Single-Cell Transcriptome Analysis of Colon Cancer Cell Response to 5-Fluorouracil-Induced DNA Damage. Cell Rep 32, 108077. 10.1016/j.celrep.2020.108077.

31. Butler A, and Satija R (2017). Integrated analysis of single cell transcriptomic data across conditions, technologies, and species. bioRxiv. 10.1101/164889.

32. Waltman L, and Van Eck NJ (2013). A smart local moving algorithm for large-scale modularity-based community detection. The European Physical Journal B 86, 1–14.

33. Chen T, and Guestrin C (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794.

34. Heaton H, Talman AM, Knights A, Imaz M, Gaffney DJ, Durbin R, Hemberg M, and Lawniczak MKN (2020). Souporcell: robust clustering of single-cell RNA-seq data by genotype without reference genotypes. Nat Methods 17, 615–620. 10.1038/s41592-020-0820-1.

35. Griffiths JA, Richard AC, Bach K, Lun ATL, and Marioni JC (2018). Detection and removal of barcode swapping in single-cell RNA-seq data. Nat Commun 9, 2667. 10.1038/s41467-018-05083-x.
