Summary
Single-cell RNA sequencing (scRNA-seq) datasets contain true single cells, or singlets, in addition to cells that coalesce during the protocol, or doublets. Identifying singlets with high fidelity in scRNA-seq is necessary to avoid false negative and false positive discoveries. Although several methodologies have been proposed, they are typically tested on highly heterogeneous datasets and lack a priori knowledge of true singlets. Here, we leveraged datasets with synthetically introduced DNA barcodes for a hitherto unexplored application: to extract ground-truth singlets. We demonstrated the feasibility of our framework, “singletCode,” to evaluate existing doublet detection methods across a range of contexts. We also leveraged our ground-truth singlets to train a proof-of-concept machine learning classifier, which outperformed other doublet detection algorithms. Our integrative framework can identify ground-truth singlets and enable robust doublet detection in non-barcoded datasets.
Keywords: single-cell genomics, lineage tracing, barcoding, singlets, doublet detection, machine learning, benchmarking, scRNA-seq, singletCode
Highlights
• Synthetic DNA barcodes can identify ground-truth singlets in scRNA-seq datasets
• singletCode can be leveraged to benchmark existing doublet detection tools
• Performance of such tools depends on sample heterogeneity and doublet creation method
• True singlets from singletCode can be used to train classifiers for doublet detection
Zhang et al. present singletCode, a framework that uses DNA barcodes to identify true single cells (“singlets”) in single-cell RNA sequencing datasets, which are typically contaminated when ≥2 cells coalesce (“doublets”). The ground-truth singlets potentiate the development of accurate machine-learning-based doublet detection algorithms.
Introduction
Rapid advances in single-cell RNA sequencing (scRNA-seq) technologies have enabled the characterization of cellular gene expression at an unprecedented resolution and scale. Such technologies have revealed extensive functional diversity of cell states across biological contexts, including cancer, evolution, and development. Briefly, scRNA-seq technologies rely on distributing individual cells from a suspension into individual reactions, each labeled with a unique ID, usually in the form of a reaction-specific sequence barcode. Despite numerous technical optimizations, multiple cells can occasionally be randomly encapsulated in a single reaction, resulting in doublets or multiplets, where two or more cells are assigned the same reaction ID (Figure 1A). In some cases, cellular physiology and the nature of experimental assays can promote cells to clump, leading to an enhanced doublet fraction.1,2 The percentage of doublets in a given experiment depends on several factors, including the features of the sample and throughput, and can be as high as 40%.3,4 In turn, such artifacts affect the downstream analyses.5 Indeed, a central challenge in the burgeoning scRNA-seq field is to identify true single cells (“singlets” from here on) and ensure that the resultant datasets accurately reflect individual cells’ transcriptomes.
Figure 1.
Establishing and testing the singletCode framework
(A) Schematic of how cells with lineage barcodes appear in the single-cell sequencer where droplets add a unique cell ID.
(B) Singlets and doublets are classified as shown for the purposes of our analyses. Singlets: (1) 1 droplet with 1 barcode and 1 cell (or 1 droplet with 1 dominant barcode and 1 cell); (2) the same barcode combination found in multiple cell IDs within a sample, suggesting the 2 cells are progeny of a single cell; (3) the same barcode found in cells across multiple “twin” samples, suggesting the 2 cells are also progeny of a single cell. Doublets: 2 cells with 2 unique barcodes in 1 droplet with 1 cell ID.
(C) Workflow schematic of how data are processed to generate true singlets and simulate doublets for benchmarking doublet detection methods.
(D) Stacked bar chart of cell types in all datasets. Each color corresponds to 1 study, with the proportions of different cell types arranged from most to least prevalent (deepest to lightest shade).
(E) Dot plot of true singlets recovered with singletCode for a given number of barcoded cells. Both barcoded and true singlet numbers are after quality control. Color corresponds to the study of origin, and each dot represents 1 sample (n = 94 samples).
(F) Stacked bar chart showing types of singlets in each dataset (across all samples for a dataset, see Tables S2 and S3). Each color represents 1 type of barcode composition that constitutes a singlet (STAR Methods).
(G) Scatterplot of comparison between the previously reported number of recovered singlets and the number of singlets recovered by singletCode, both after quality control (n = 41 samples). Color corresponds to study of origin, and each dot represents 1 sample. Averages are denoted by black dots connected by a black line. p value calculated using a 2-sided Wilcoxon signed-rank test.
(H) (Top) A schematic to demonstrate how different sets of numbers (braces) influence the values of Gini coefficient. A set of numbers with perfect equality results in a value of zero, while a set of numbers with perfect inequality will lead to a value of 1. (Bottom) The distribution of simulated Gini coefficients from randomly sampling 3 different distributions: uniform (mean 0.305), exponential (mean 0.529), and power (mean 0.829). Insets: histogram of values of the distribution being used for Gini calculations. Dotted lines: mean value.
(I) Histogram of distribution of Gini coefficient of proportion of singlets across all clusters for 3 different cluster resolutions (see STAR Methods). Resolution of 0.4 had a mean Gini coefficient across samples of 0.159, with a range of 0.015–0.459; resolution of 0.8 had a mean Gini coefficient of 0.145, with a range of 0.014–0.371; resolution of 1.2 had a mean Gini coefficient of 0.148, with a range of 0.019–0.363. Dotted line: mean value.
See also Figure S4.
Several computational frameworks have been developed to identify singlets in scRNA-seq datasets.3,4,6,7,8,9,10,11,12 Although each framework deploys its own algorithm, such methods typically rely on gene expression differences between individual cells to remove cells with a putative mixture of different expression profiles. As such, the algorithms necessitate datasets consisting of vastly different cell types or species, where doublets are collectively referred to as heterotypic doublets. Therefore, such methods do not work as well with “transcriptionally similar” cells,4 where doublets are referred to as homotypic doublets. Between these two extremes is a scenario more representative of many experimental designs—either cells are not vastly different within a population, or continuums of cell states with functional consequences exist. However, the performance of doublet detection methods has not been systematically investigated on such datasets. Furthermore, such algorithms take inferred singlets or all cells as inputs and do not have “ground-truth” singlets, posing further challenges in identifying bona fide singlets. Although certain experimental techniques such as cell hashing13 or lipid tagging11 can help increase the confidence in the doublets identified, they do not necessarily provide unique identifiers at single-cell resolution.
Recent developments in DNA barcoding methodologies have added another dimension to scRNA-seq datasets, revealing unique transcriptional signatures and clonal dynamics.14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29 Here, we describe “singletCode,” a DNA barcode analysis approach for a new application: identifying “true” singlets in scRNA-seq datasets. We posited that since DNA barcoding allows for individual cells to have a unique identifier prior to scRNA-seq protocols, these barcodes could help identify “true” singlets. The singlet population could then be used to simulate doublets. This strategy could, in principle, enable the re-analysis of large amounts of existing barcoded datasets and present orthogonal ways to evaluate and compare the performance of the doublet detection algorithms. Our proof-of-concept analysis was implemented on 94 barcoded scRNA-seq samples, covering a total of 564,579 cells. Of the 338,948 barcoded cells, we extracted 293,618 singlets across several cell types, barcoding technologies, and experimental designs. We found that existing state-of-the-art doublet detection methods show lower than reported sensitivity to doublets simulated with ground-truth singlets. Since the repertoire of such datasets is increasing rapidly and DNA barcoding is becoming commonplace, our framework provides rational guidance for identifying singlets and choosing appropriate doublet detection algorithms.
Design
Since various studies use different heuristics and thresholds for barcode assignment,28,30,31,32,33 we first developed a standardized pipeline to reliably extract true singlets based on the synthetically introduced lineage barcodes (STAR Methods). We focused on major barcoding frameworks reported in the literature, including FateMap,28,31,34 ClonMapper,17 SPLINTR,26 LARRY,35 CellTag-multi,36 Watermelon,18 and TREX.37,38 Each barcoding technology has its own specifications that are accounted for by our framework when extracting singlets. Details on features of various single-cell barcoding technologies are provided in Box 1 and Table S1. Separately, we performed new experiments using Watermelon18 technology to create new barcoded scRNA-seq and single-cell Multiome (RNA and assay for transposase accessible chromatin [RNA + ATAC]) datasets.
Box 1. Generating and analyzing barcoded scRNA-seq data.
Barcode properties: lineage barcodes are unique DNA sequences that are used to track single cells over generations. These random or semi-random DNA sequences are located downstream of a fluorescent protein used as a marker for a cell being barcoded. To barcode cells, the plasmid carrying the barcode is lentivirally integrated stably into the genome and constitutively transcribed.
Sequencing: the barcode is a part of the transcript containing the poly(A) tail. Therefore, the barcode is captured by single-cell sequencing techniques along with all other mRNA transcripts and reverse transcribed into cDNA. For more robust capture and assignment, barcodes can also be separately amplified.
Processing: before downstream data analysis, low-quality barcode sequences are filtered out using a combination of the following criteria. A barcode is removed if it does not align to a reference sequence, falls below a PHRED score cutoff, falls below a minimum UMI cutoff, contains unusually many repeated nucleotides, or does not follow the expected nucleotide identity pattern.
To account for PCR amplification and sequencing errors, barcodes within a similarity threshold determined by either Levenshtein distance using STARCODE39 (FateMap, SPLINTR, Watermelon, CellTag-multi, and ClonMapper) or Hamming distance (TREX, LARRY) are merged.
Common strategies to account for doublets include removing all cells having more than one barcode, choosing the most dominant UMI barcode in a cell, or using mutually exclusive marker genes.
Above is a schematic of the structure of different lineage-tracing barcodes, including their lengths, diversity, and fluorescent colors. Each barcode construct contains an expressed fluorescent protein, a unique sequence, known as the barcode, and a poly(A) tail for capture. The lengths of barcodes in each method are shown as scale bars, and library diversities are shown as circles, sized proportionally to their diversity and colored according to fluorescent proteins used.
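The error-correction step described in Box 1 can be illustrated with a minimal sketch that folds low-count barcodes into a more abundant barcode within a Hamming-distance threshold. The function names, the greedy merging order, and the threshold of 1 are illustrative assumptions; this is not the algorithm implemented by STARCODE or by any of the cited pipelines.

```python
from collections import Counter

def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))

def merge_barcodes(counts, max_dist=1):
    """Greedily fold low-count barcodes into a more abundant neighbor.

    Barcodes are visited from most to least abundant. Each one joins the first
    already-accepted barcode within `max_dist` (an assumed threshold), or
    becomes a new representative if none is close enough.
    """
    merged = Counter()
    for seq, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        parent = next(
            (rep for rep in merged
             if len(rep) == len(seq) and hamming(rep, seq) <= max_dist),
            None,
        )
        merged[parent if parent is not None else seq] += n
    return dict(merged)

# Hypothetical UMI counts per observed barcode sequence.
observed = {"ACGTACGT": 120, "ACGTACGA": 3, "TTTTCCCC": 45, "TTTTCCCG": 1}
corrected = merge_barcodes(observed)
print(corrected)
```

In this toy input, the two single-mismatch variants are absorbed by their abundant neighbors, leaving two barcodes. Levenshtein distance would additionally tolerate insertions and deletions, which this sketch does not handle.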
Most barcode libraries have a library complexity of >1 million, ensuring that multiple packaging copies of the same barcode are rarely present in an experimental design by chance. For all datasets across barcoding technologies, we labeled singlets as those cells that, as a result of multiplicity of infection, satisfied one of four conditions: (1) a single barcode identified per cell ID; (2) multiple barcodes identified per cell, with one barcode having a significantly higher unique molecular identifier (UMI) count than other barcodes within the same cell; (3) multiple barcodes identified per cell, but the same barcode combination was found in other cells in the same sample; and (4) multiple barcodes identified per cell, but the same barcode combination was found in other cells across samples within the same experimental design (common for barcoding studies17,28,30,31,35) (Figures 1A and 1B; STAR Methods). We refer to our framework as singletCode (Figure 1C; Data S1). To streamline the process and for wider accessibility of singletCode, we packaged the scripts implementing these singlet recovery criteria across barcoding technologies into a command-line tool in Python. We also have created a package called singletCode, available on PyPI, to readily adapt singletCode to any dataset. Additionally, we organized all major results, information about datasets and barcoding technologies used in the analysis, steps to implement singletCode, and other resources on a publicly available website (https://goyallab.github.io/SingletCodeWebsite/).
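The four singlet conditions can be sketched as a short decision procedure. This is a minimal illustration, not the singletCode implementation: the input layout, the function name `classify_cells`, the label strings, and the `dominance_ratio` of 3 are all assumptions made for the example, and the thresholds actually used are those given in the STAR Methods.

```python
from collections import Counter

def classify_cells(cells, dominance_ratio=3.0):
    """Label cells using the four barcode-based singlet conditions.

    cells: list of (sample, cell_id, {barcode: umi_count}) tuples.
    dominance_ratio: assumed cutoff for "significantly higher" UMI count.
    """
    combos = [(sample, frozenset(bc)) for sample, _, bc in cells]
    within_sample = Counter(combos)        # occurrences of a combination per sample
    samples_with = {}                      # samples in which a combination appears
    for sample, combo in combos:
        samples_with.setdefault(combo, set()).add(sample)

    labels = {}
    for sample, cell_id, bc in cells:
        combo = frozenset(bc)
        ranked = sorted(bc.values(), reverse=True)
        if not bc:
            label = "unbarcoded"
        elif len(bc) == 1:
            label = "singlet: single barcode"                      # condition 1
        elif ranked[0] >= dominance_ratio * ranked[1]:
            label = "singlet: dominant barcode"                    # condition 2
        elif within_sample[(sample, combo)] > 1:
            label = "singlet: combination shared within sample"    # condition 3
        elif len(samples_with[combo]) > 1:
            label = "singlet: combination shared across samples"   # condition 4
        else:
            label = "doublet or undetermined"
        labels[(sample, cell_id)] = label
    return labels

# Hypothetical cells from two "twin" samples.
cells = [
    ("s1", "c1", {"A": 10}),
    ("s1", "c2", {"A": 20, "B": 2}),
    ("s1", "c3", {"C": 5, "D": 4}),
    ("s1", "c4", {"C": 6, "D": 6}),
    ("s1", "c5", {"E": 5, "F": 5}),
    ("s2", "c6", {"E": 4, "F": 3}),
    ("s1", "c7", {"G": 5, "H": 4}),
]
labels = classify_cells(cells)
for key, label in labels.items():
    print(key, label)
```

The conditions are checked in order, so a cell that satisfies several of them is reported under the first one that applies.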
In summary, our pipeline identified singlets from 10 different publications and 6 original experimental scRNA-seq samples generated in this study, encompassing 7 different barcoding technologies, 94 scRNA-seq samples, 3 unique sequencing technologies, and 564,579 cells total. These datasets cover various cell types (two patient-derived melanoma cell lines, one triple-negative breast cancer cell line, stem cells, a fibroblast-like cell line, an induced pluripotent stem cell line, primary melanocytes, neuroepithelial progenitor cells, induced endoderm progenitors, and myeloid cells), biological processes (drug resistance, differentiation, reprogramming), and technical sequencing specifications (Figures 1D and S1–S3; Tables S2, S3, S4, and S5).28,30,31 Previous studies have established that the introduction of synthetic barcodes does not affect the ensemble of cell types in a population.28,31 Collectively across datasets, there were 338,948 barcoded cells, 293,618 (87%) of which were identified as true singlets by singletCode after quality control (Figure 1E; STAR Methods). These singlets spanned all four categories of singlets captured by the singletCode framework mentioned above, with variability in the proportion of categories across datasets (Figure 1F).
We next asked how the number of singlets extracted with singletCode compared with those originally reported in the respective studies. The original studies implemented various approaches to detect doublets, including doublet detection methods (Scrublet for LARRY35), mutually exclusive marker genes (TREX37), and barcode-based methods (SPLINTR26 and FateMap28,31,34), while some others had no clear filtering steps. For a fair comparison, we focused on studies that used barcode-based filtering since singletCode filtering is based on barcodes. We were able to recover, on average, 29% more singlets using singletCode (Figure 1G).
We next asked whether barcoded singlets exhibited a preference for specific cell states, as a bias would limit the usability of our approach. We analyzed singlets in datasets with FateMap barcodes using the Gini coefficient, a metric used to measure inequality in populations (0 and 1 imply perfect equality and inequality, respectively; 0.33 for uniform distribution) (Figure 1H).40,41 We calculated the Gini coefficient of the proportion of singlets in each uniform manifold approximation and projection (UMAP) cluster across 27 scRNA-seq samples with FateMap barcodes to be 0.159 on average (cluster resolution: 0.4). Our results were robust across a wide range of shared nearest-neighbor resolutions (0.4–1.2) (Figures 1I and S4). In sum, we found no clear bias of singlets in the transcriptional space for these datasets. We reached similar conclusions by separately performing nearest neighbor analysis on all FateMap datasets28,31,34 (Figure S5; Table S6; STAR Methods). We also projected singlets from datasets using other barcoding technologies onto the UMAPs and qualitatively found no systematic bias (Figure S6A). In a small subset of cases, we observed the depletion of barcoded singlets from certain clusters, which could be attributed to the cells being of low quality (i.e., low UMI counts) (Figure S6B). Collectively, singlets exhibit no systematic bias toward particular cell types or clusters across the varied datasets analyzed here.
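For reference, the Gini coefficient of a set of non-negative values can be computed from the sorted values with a standard closed-form expression. The sketch below uses only the Python standard library; the function name and the example inputs are illustrative, and the per-cluster proportions are hypothetical rather than taken from any dataset in this study.

```python
import random

def gini(values):
    """Gini coefficient of non-negative values.

    Uses G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n for values sorted
    in ascending order with ranks i = 1..n. For a finite sample the maximum
    is (n - 1) / n, which approaches 1 as n grows.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one value and a non-zero sum")
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

# Hypothetical proportion of singlets in each of five clusters.
cluster_proportions = [0.21, 0.19, 0.20, 0.22, 0.18]
print(gini(cluster_proportions))

# Draws from a uniform distribution give a value close to 1/3.
random.seed(0)
uniform_draws = [random.random() for _ in range(100_000)]
print(gini(uniform_draws))
```

Nearly equal proportions across clusters give a value close to 0, which is the pattern expected when barcoded singlets are not biased toward particular clusters.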
Results
Evaluating performance of doublet detection methods on barcoded datasets
To create datasets for benchmarking doublet detection methods, we simulated doublets by averaging6,8 the transcriptomes of two true singlets identified with singletCode (STAR Methods). We selected four methods with varying performance from a recent benchmarking study: DoubletFinder, scDblFinder, hybrid, and Scrublet.4,6,7,10,42,43 DoubletFinder and scDblFinder were selected because they exhibited high accuracy.4,42 Hybrid (an ensemble of two methods, bcds [“binary classification-based doublet scoring”] and cxds [“co-expression-based doublet scoring”]) and Scrublet were selected because they showed strong performance despite using distinct underlying algorithms for predicting doublets.4,42 The benchmarking study primarily used area under the precision-recall curve (AUPRC), area under the receiver operating characteristic curve (AUROC), and true negative rate (TNR) as its performance criteria. Note that AUROC is not necessarily representative of the performance of a doublet detection model in real datasets, as doublets often account for only a small percentage of the entire dataset.44 AUPRC is a more robust metric for evaluating doublet detection methods given the inherent imbalance of singlet and doublet labels. An AUPRC of 1 means the method can consistently identify doublets without mislabeling any singlets as doublets.
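The difference between the two ranking metrics can be made concrete with a small worked example. The sketch below estimates AUPRC as average precision and AUROC as the fraction of doublet-singlet pairs ranked correctly; both are standard estimators, but the function names and the toy scores are assumptions for illustration and do not reproduce the evaluation code used in this study. Tied scores are ordered arbitrarily in the average-precision calculation.

```python
def average_precision(labels, scores):
    """AUPRC estimated as average precision; labels are 1 (doublet) or 0 (singlet)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank      # precision at the rank of each doublet
    if hits == 0:
        raise ValueError("need at least one doublet")
    return total / hits

def auroc(labels, scores):
    """AUROC as the probability that a doublet outscores a singlet (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need both doublets and singlets")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 2 doublets among 10 cells; one singlet outranks one doublet.
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
ap = average_precision(labels, scores)
roc = auroc(labels, scores)
print(ap, roc)
```

Here a single misranked singlet lowers average precision to 5/6 while AUROC stays at 15/16, because AUROC averages over all singlet comparisons. The gap widens as singlets increasingly outnumber doublets, since a random ranking has an expected precision equal to the doublet fraction but an AUROC of about 0.5.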
We evaluated the AUPRC, AUROC, TNR, and doublet scores and calls of the four methods on each of the barcoded datasets. Although we tested a range of doublet rates, we present results for a reasonably practical true doublet rate of 8% (10X Genomics protocol predicts ∼8% doublet rate for maximum cell loading). For each of the AUPRC, AUROC, and TNR, the performance of the four algorithms varied substantially depending on the dataset. For example, the TREX37 dataset (mouse brains) had the highest AUPRC, while Jain et al.34 (fibroblasts, stem cell reprogramming) had consistently low performance. The variability in performance could be a result of several factors, including the experimental steps leading to scRNA-seq library generation, depth of sequencing, and cell type composition, among others. The average AUPRCs for all methods were consistently low, ranging from 0.13 to 0.34, with scDblFinder performing considerably lower than the other methods (Figure 2A). The AUROC for all methods followed a similar trend (averages 0.53–0.85; Figure 2B). The TNRs for all four methods were relatively high (range 0.93–0.95), attributable to the low doublet rate chosen (Figure 2C).
Figure 2.
Benchmarking doublet detection tools with singletCode
(A–C) Color-coded boxplots of AUPRC (A), AUROC (B), and TNR (C) values for all 4 doublet detection methods. Each boxplot is calculated from the value after running the respective method on each sample of the dataset, with actual doublet rate of 0.08 and expected doublet rate set to 0.05, 0.08, 0.1, 0.15, 0.2, and 0.25 where applicable. n = 87 samples across all datasets (see Tables S2 and S3 and STAR Methods for details). Values are averaged across expected doublet rates for AUPRCs and AUROCs, but not TNRs. Dots represent the mean value for a detection method grouped across all samples within a dataset. Lines represent SD of the values for a detection method across all samples within a dataset. Boxes span the first and third quartiles. Outliers are not depicted.
(D–F) Color-coded boxplots of AUPRC (D), AUROC (E), and TNR (F) values for 4 doublet detection methods on barcoded and non-barcoded datasets with 8% doublet rates (n = 87 for barcoded, n = 12 for non-barcoded samples). Dots represent the mean and lines represent SD. Boxes span the first and third quartiles. Outliers are not depicted. p values calculated using Wilcoxon rank-sum test.
We further evaluated each method’s binary (singlet/doublet) calls by determining the percentage of cells labeled as doublets by the method at an 8% expected and actual doublet rate (Figure S7A). DoubletFinder consistently made the correct number of doublet calls because it uses the expected doublet rate to determine the number of cells labeled as doublets. Hybrid tended to overestimate the number of doublets. Besides DoubletFinder, scDblFinder had the most accurate number of doublet calls, yet had a low AUPRC, demonstrating these doublet calls are largely incorrect. In contrast, Scrublet consistently labeled a low percentage of cells as doublets, but its comparatively high AUPRC suggests that the doublet scores it is predicting are relatively accurate. Together, this analysis reveals the unique underlying characteristics of each method in labeling doublets.
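The distinction between calling the right number of doublets and calling the right cells can be illustrated with a rate-based threshold. The sketch below labels the top-scoring fraction of cells as doublets and then computes the TNR; it is a generic illustration with hypothetical names and scores, not the calling procedure of DoubletFinder or any other benchmarked tool.

```python
def call_top_fraction(scores, expected_rate):
    """Label the highest-scoring cells as doublets, sized by the expected rate."""
    n_calls = round(len(scores) * expected_rate)
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    called = set(order[:n_calls])
    return [i in called for i in range(len(scores))]

def true_negative_rate(is_doublet, called_doublet):
    """Fraction of true singlets that are also called singlets."""
    tn = sum((not t) and (not c) for t, c in zip(is_doublet, called_doublet))
    fp = sum((not t) and c for t, c in zip(is_doublet, called_doublet))
    if tn + fp == 0:
        raise ValueError("need at least one true singlet")
    return tn / (tn + fp)

# Toy dataset: 25 cells with 2 true doublets (8%); one singlet scores highly.
is_doublet = [True, True] + [False] * 23
scores = [0.9, 0.2, 0.8] + [0.01 * i for i in range(22)]
calls = call_top_fraction(scores, expected_rate=0.08)
tnr = true_negative_rate(is_doublet, calls)
print(sum(calls) / len(calls), tnr)
```

In this toy case the fraction of cells called doublets matches the true rate exactly, yet one of the two calls is a singlet. The TNR remains high (22/23) because singlets dominate the dataset, which mirrors why TNR alone is a weak indicator at low doublet rates.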
We questioned whether the AUPRCs and AUROCs for different methods were sensitive to the doublet rate. All tools achieved consistently higher AUPRCs with increasing actual doublet percentage. This is expected because as the number of true positives (doublets) increases, it also improves the chance of doublets correctly being ranked with a higher doublet score (Figure S7B). In addition, except for hybrid, which does not offer an option to specify the expected doublet rate, AUROCs for all methods were invariant of expected or actual doublet rate (Figure S7C).
Next, to directly compare the performance of doublet detection methods on barcoded and non-barcoded datasets, we selected 12 out of the 16 non-barcoded datasets4,45,46 used in previous benchmarking comparisons (the remaining 4 datasets were not chosen due to the potential concerns regarding the presence of homotypic doublets, which are difficult to annotate experimentally4). Doublets can be experimentally annotated either by using species-mix datasets such that a doublet contains cells from two different species or in patient samples with frameworks like demuxlet,46 based on SNPs. We first replicated findings from the benchmarking study4,42 on the same non-barcoded datasets in their published form, where scDblFinder exhibited the best performance (Figure S8). Next, for a fairer comparison to the barcoded datasets, we subsampled the datasets to maintain an 8% doublet rate throughout.
The AUPRCs for barcoded and non-barcoded datasets were similar using all methods except scDblFinder (Figure 2D). The AUROCs for the barcoded datasets were modestly higher than for the non-barcoded datasets using all methods except for scDblFinder (Figure 2E). The TNRs for all methods except scDblFinder showed no difference between barcoded and non-barcoded datasets (Figure 2F). Notably, scDblFinder performed at par with or better than other methods for non-barcoded datasets (Figures 2D–2F), which may be attributed to the study’s stated purpose of identifying heterotypic doublets.43
Together, barring exceptions (e.g., TREX37), our results on datasets with ground-truth singlets identified via singletCode highlight a relatively low performance of doublet detection methods.
Dataset heterogeneity as a contributor to doublet detection method performance
What could be the basis of the difference in performance between the barcoded and non-barcoded datasets? The datasets used for benchmarking tend to contain relatively more heterotypic doublets, necessitated by annotation strategies on the available datasets. In contrast, the barcoded scRNA-seq datasets largely contain cells from a single system, and, therefore, they often are not as dramatically heterogeneous.
We first examined whether there were any global correlations between heterogeneity and each tool’s performance (Figures 3A and S9). We used five common metrics to quantify heterogeneity in scRNA-seq datasets: phenotypic volume,26,28 Shannon diversity,28,31 E-distance,47 and differential expression fold change range and proportion of up- and downregulated genes.4 We found no correlation between any of the metrics and tools’ performances. The absence of correlation could be attributed to multiple confounding factors arising from inherent differences in samples, laboratory-specific protocols, experimental designs, technical specifications across datasets, and bioinformatic pipelines. We reasoned that it would be more accurate to compare the impact of heterogeneity by manipulating it within each sample, thus minimizing confounding factors. We simulated 10% doublet rate datasets and sampled cells from “less” and “more” heterogeneous transcriptional states as determined by the Euclidean distances between cells in the principal-component (PC) space (Figure 3B; STAR Methods). The AUPRC of both scDblFinder and DoubletFinder was higher for more heterogeneous subsets (Figure 3C). The AUROC of scDblFinder was similarly higher for more heterogeneous subsets (Figure 3D). The AUROC for DoubletFinder did not increase significantly with increasing heterogeneity (Figure 3D), likely because of its already high baseline value. Our results demonstrate that the higher the heterogeneity, the higher the likelihood of accurate doublet detection.
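One simple way to quantify how heterogeneous a subset of cells is in PC space is the mean pairwise Euclidean distance between its cells. The sketch below shows only that distance calculation on hypothetical two-dimensional coordinates; the function name is illustrative, and the actual subsampling procedure is the one described in the STAR Methods, which this example does not attempt to reproduce.

```python
import math
from itertools import combinations

def mean_pairwise_distance(points):
    """Average Euclidean distance over all pairs of cells in PC space."""
    pairs = list(combinations(points, 2))
    if not pairs:
        raise ValueError("need at least two cells")
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

# Hypothetical PC coordinates: a tight subset and a spread-out subset.
less_heterogeneous = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
more_heterogeneous = [(0.0, 0.0), (0.0, 10.0), (10.0, 0.0), (10.0, 10.0)]
print(mean_pairwise_distance(less_heterogeneous))
print(mean_pairwise_distance(more_heterogeneous))
```

A subset drawn from many clusters would score higher on this measure than a subset drawn from neighboring cells, which is the contrast depicted in Figure 3B.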
Figure 3.
Dataset heterogeneity contributes to doublet detection tool performance
(A) Quantification of how doublet detection method performance (AUPRC) was impacted by heterogeneity across samples using 5 metrics (STAR Methods). Each dot represents 1 sample, colored by doublet detection method. Samples (left to right, n = 64, 28, 28, 28, 28) were ranked by both the heterogeneity metric and AUPRC score for each of the 4 doublet detection methods and plotted according to rank. Lines of best fit along with R2 values are depicted and colored by doublet detection method.
(B) Schematic of how transcriptionally similar and dissimilar cells are subsampled for subsequent within-sample heterogeneity determinations. Transcriptionally dissimilar (more heterogeneous) cells are identified from all clusters within the dataset and have a high average Euclidean distance in PC space.
(C and D) AUPRC (C) and AUROC (D) of scDblFinder and DoubletFinder on transcriptionally more or less heterogeneous datasets (n = 14 datasets). Each small dot represents the performance of a tool on a dataset of 450 singlets and 50 doublets generated by subsampling as described in (B) (STAR Methods). The large dot represents the mean performance across datasets. p values calculated using a paired Wilcoxon signed-rank test.
(E and F) Color-coded boxplots of AUPRC (E) and AUROC (F) for 4 doublet detection methods on all benchmarking datasets (n = 87 samples) using average or summation doublets. Black dots represent mean values of detection methods. p values calculated using a paired Wilcoxon signed-rank test.
Comparison of doublet assignments across strategies, technologies, and modalities
Besides averaging6,8 singlets to simulate doublets (as we did above), previous approaches have also summed7,10 singlets’ transcriptomic profiles to create doublets. We created summed doublets (STAR Methods) and compared the performance of doublet detection methods for the two doublet creation strategies (Figures S10A–S10C). We found several differences in the performance of doublet detection methods for summed versus averaged doublets. First, the performance of both scDblFinder and hybrid increased substantially across all datasets for summed doublets (Figures 3E, 3F, and S10). Second, the increase in performance of scDblFinder was much higher than that of hybrid. The dramatic increase in the performance of scDblFinder may be explained by scDblFinder’s algorithm, which generates 75% of doublets using a summing strategy and uses library size as a predictor in their doublet classifier.43 Third, although Scrublet and DoubletFinder did not perform as well as scDblFinder and hybrid, their performance modestly improved for summed doublets. Overall, the cumulative performance of doublet detection methods was higher on datasets with summed doublets than on average doublets (AUPRC 0.498 vs. 0.287, AUROC 0.874 vs. 0.749, TNR 0.957 vs. 0.941) (Figures 3E, 3F, and S11). This may be due to an increased contribution of total UMI for summed doublets, a feature more easily captured by dimensionality reduction steps used in various methods. Our results demonstrate the nuances associated with doublet detection methodologies and highlight the need to investigate what constitutes a doublet: averaging or summing two singlets, or a mixture model of the two.
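The two doublet creation strategies differ only in whether the combined counts are halved, which has a direct effect on library size. The following minimal sketch operates on hypothetical per-gene count vectors; the function name is illustrative, and details such as rounding or normalization of averaged counts are not specified here and are left out.

```python
def simulate_doublet(cell_a, cell_b, strategy="average"):
    """Combine two singlet count vectors into one simulated doublet."""
    if len(cell_a) != len(cell_b):
        raise ValueError("both cells must cover the same genes")
    summed = [a + b for a, b in zip(cell_a, cell_b)]
    if strategy == "sum":
        return summed
    if strategy == "average":
        return [s / 2 for s in summed]
    raise ValueError("strategy must be 'average' or 'sum'")

# Hypothetical UMI counts for four genes in two singlets.
cell_a = [4, 0, 10, 2]   # library size 16
cell_b = [0, 6, 2, 2]    # library size 10
averaged = simulate_doublet(cell_a, cell_b, "average")
summed = simulate_doublet(cell_a, cell_b, "sum")
print(averaged, sum(averaged))
print(summed, sum(summed))
```

The averaged doublet has a library size between those of its two parent singlets (13), whereas the summed doublet exceeds both (26). This is consistent with the interpretation that total UMI count makes summed doublets easier to separate from singlets.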
Since doublet detection methodologies vary in their performances, we asked whether the methods exhibited any patterns of doublet labeling. To answer this question, we created a new metric, similarity score, which calculates the fraction of doublets on which two methods make the same call (STAR Methods). A similarity score of 1 would imply that every single-cell call made on doublets by two methods is identical, while a score of 0 would imply none are. Across all datasets and doublet detection methods, the average similarity score was 0.66 (Figure 4A), suggesting some degree of inconsistency of doublet labeling. When we stratified the similarity score based on methods, some patterns emerged. For example, hybrid had an above-average similarity score with both bcds and cxds (0.81 and 0.72, respectively), which is expected because hybrid is composed of bcds and cxds (Figure 4A). Furthermore, cxds had the overall lowest similarity scores compared to other methods (73.5% of the top 200 lowest similarity scores belonged to cxds), potentially due to its unique reliance on gene co-expression rather than the typical doublet simulation and high-dimensional clustering (Figure 4B). We observed similar patterns for similarity scores of individual samples as well (Figure S12), highlighting that various tools were capturing different doublet features.
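As described, the similarity score reduces to the fraction of true doublets on which two methods give the same binary call. The sketch below is one reading of that definition, in which agreement counts whether both methods call a doublet or both call a singlet; the function name and example calls are hypothetical, and the exact formulation is the one given in the STAR Methods.

```python
def similarity_score(calls_a, calls_b, is_doublet):
    """Fraction of true doublets on which two methods make the same call."""
    pairs = [(a, b) for a, b, d in zip(calls_a, calls_b, is_doublet) if d]
    if not pairs:
        raise ValueError("need at least one true doublet")
    return sum(a == b for a, b in pairs) / len(pairs)

# Hypothetical calls (True = doublet) for six cells, four of them true doublets.
is_doublet = [True, True, True, True, False, False]
method_a = [True, True, False, False, False, True]
method_b = [True, False, False, True, True, True]
score = similarity_score(method_a, method_b, is_doublet)
print(score)
```

Only the four true doublets enter the calculation, and the two methods agree on two of them, giving a score of 0.5. Calls on singlets do not affect the score.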
Figure 4.
Comparing the performance of doublet detection tools across scRNA-seq technologies and multiple modalities with singletCode
(A) Pairwise similarity score between all benchmarked methods (mean = 0.66, n = 91 samples; see Tables S2 and S3 and STAR Methods).
(B) Per-sample similarity score for each pairwise method comparison ranked from low to high (n = 91 samples). Scores corresponding to cxds are labeled in black and other tools are labeled in gray. cxds shows high enrichment in low similarity scores. The rest of the tools exhibit no enrichment.
(C and D) Color-coded boxplots of AUPRC (C) and AUROC (D) for all tools leveraging ensemble strategies and their components (n = 12 samples). Boxes span the first and third quartiles, with the bold line at the median. The whiskers demarcate the 1.5 interquartile range, with data beyond the length of the whiskers being outliers. Lines below the x axis dataset labels illustrate the component methods that make up the hybrid and Chord ensembles. p values calculated using a Wilcoxon rank-sum test. Non-significant (>0.05) p values are not depicted.
(E and F) Color-coded boxplots of AUPRC (E) and AUROC (F) value for all tools on 2 datasets with the same cell type (same organ) and experimental design (TREX37 barcodes), but sequenced with different technologies (n = 3 for 10X Genomics, n = 2 for Smart-seq3). Smart-seq3 data corresponds to read count data. Dots represent mean. Lines represent SD. Boxes span the first and third quartiles. p values calculated using Wilcoxon rank-sum test. Non-significant (>0.05) p values are not depicted.
(G) Schematic for 10X Genomics Multiome dataset workflow resulting in TNR calculation (STAR Methods).
(H) TNR as a result of workflow in (G). Each dot represents a subsampled dataset (n = 6; STAR Methods).
We evaluated the performance of ensemble doublet detection strategies hybrid10 and Chord,48 which combine multiple methodologies, against their individual components. We found that the ensemble methods did not always outperform their components (Figures 4C and 4D), inviting further investigation into whether this trend is consistent across different datasets.
Wide-ranging single-cell sequencing platforms now exist, including Drop-seq,49 inDrop,50 particle-templated instant partition sequencing,51 and Smart-seq.52 We asked whether doublet detection methods differ in their performance on samples from the same cell type and experimental design but sequenced on different platforms. We leveraged TREX-barcoded datasets37,38 of the same sample type, sequenced using either 10X Genomics, a droplet-based method, or Smart-seq3, a well-based method for which we used both reads (Figures 4E and 4F) and UMIs (Figures S13A and S13B) for reasons listed in the STAR Methods.37,38,52 DoubletFinder and Scrublet both resulted in relatively higher AUPRCs for 10X Genomics,37 while hybrid resulted in higher AUPRCs for the Smart-seq338 samples (Figure 4E). The AUROC was relatively less variable (Figure 4F). Given the variable performance of the methods across datasets in general (Figures 2A–2C) and the limited datasets from technologies other than 10X Genomics, we cannot conclude whether these differences are a result of the sequencing technology.
Next, we asked whether our approach could be harnessed for evaluating the performance of doublet detection methods43,53,54,55 on scATAC-seq datasets. To answer this question, we selected AMULET53 for our proof-of-concept analysis because of its scATAC-seq-specific read-count-based method and relatively high reported performance. We generated new 10X Genomics Multiome datasets (RNA + ATAC) using the Watermelon18 barcoding technology. We identified singlets in parallel from both the scATAC-seq modality using AMULET and from the scRNA-seq modality using singletCode (Figure 4G). By comparing labels of true negatives and false positives across modalities, we estimated the average TNR for AMULET to be 0.924 (Figure 4H). In summary, our approach demonstrated the potential of singletCode to systematically benchmark doublet detection in scATAC-seq (and other multi-omics) datasets.
Effects of doublets on downstream functional outcome interpretation
Doublet detection methods are implemented primarily because doublets can confound downstream analysis. While the impact of doublets on synthetic datasets has been systematically explored,4 the scope of their effect on true-singlet datasets remains unclear. We used our barcoded, singlet-only data to simulate datasets with varying doublet rates (STAR Methods) and performed commonly implemented downstream functional analyses used to interpret scRNA-seq datasets (differential expression, cell-cell communication, clustering stability, and cell trajectory inference) to compare the clean (singlet-only) and doublet-contaminated datasets (Figure 5A).
Figure 5.
Doublets confound downstream biological function analyses
(A) Schematic of datasets generated for all functional analyses. A total of 14 datasets for each percentage doublet rate for all biological function analyses (STAR Methods).
(B) Differential expression analysis results for all datasets using MAST (purple, left) and Wilcoxon (blue, right) on 2 randomly chosen clusters at various doublet rates. Line color shade represents the percentage doublet rate, with a more saturated color indicative of a higher doublet rate. Precision, recall, and TNR are calculated against ground-truth differentially expressed genes in the corresponding dataset without doublets. p values calculated using a paired t test.
(C) Cell-cell communication analysis among all clusters in a dataset using CellChat. Precision (left) and recall (right) are calculated against interactions identified in the 0% dataset. Line color shade represents the percentage doublet rate, with a more saturated color indicative of a higher doublet rate. p values calculated using a paired t test.
(D) Clustering stability analysis results where the heatmap (left) indicates by color whether the correct or incorrect number of clusters were generated at the doublet rate indicated by the x axis at a clustering resolution of 0.6. The “correct” number of clusters means the same number of clusters as the 0% doublet dataset. Histogram (right) shows total number of datasets with the same number of clusters identified at different doublet percentages.
(E) Representative cell trajectories (lines) in PC space for Goyal et al.28 4 datasets at (left to right) 0%, 10%, 20%, and 40% doublet rates. Singlets are gray and doublets are red.
(F) Example visualization of trajectory inferred from singlets-only compared to the 40% doublet dataset. Each dot is representative of a pseudotime point. The errors in trajectory are represented by black lines.
(G) Line plot (left) of the error in trajectory at variable doublet rates. Each gray line represents a dataset error in trajectory normalized to 1. The average error in trajectory across all datasets is highlighted in black. The overall comparison of the errors in trajectory for datasets with 0% doublets (“singlets only”) and datasets with varying doublets (“doublets included”) is summarized in a boxplot (right). Dots represent a dataset. The box spans the first and third quartiles, with the heavy line being the average value across all datasets, and the whiskers being the range. p value calculated using Wilcoxon rank-sum test.
First, we assessed how doublets impact differentially expressed genes using two methods, MAST56 and Wilcoxon rank-sum.57 We found that increasing the doublet rate decreases the precision, recall, and TNR of identified differentially expressed genes; a higher doublet rate leads to worse performance (Figure 5B). Second, we used CellChat58,59 and CellPhoneDB60 to compare the inferred cell-cell communication pathways with and without doublets (STAR Methods). Increasing the doublet rate decreased the precision (and recall, for CellChat) for communication pathways identified (Figures 5C and S14A). Third, we examined how doublets affect clustering by independently clustering each dataset (Louvain61) and counting the number of clusters as a proxy for “clustering stability” used in the previous benchmarking study (Figures 5D and S14B). Increasing the doublet rate decreased the probability of the correct number of cell clusters (Figures 5D and S14B). Finally, we used Slingshot62 to infer cell trajectory (changes over time) at increasing doublet rates and found that qualitatively, increasing doublet rate leads to an increasingly different cell trajectory (Figures 5E and S15). For quantitative comparisons, we created a new metric that matches trajectories for each doublet rate with one from the clean dataset and calculates the Euclidean distances in the PC space between these matched trajectories (Figures 5E–5G; STAR Methods). The greater the distance between matched trajectories, the larger the error in the inferred trajectory. Increasing doublet rate increased the error in the inferred cell trajectory (Figure 5G). Note that the magnitude of decrease in performance of detection methods with increasing doublet percentage was differentially sensitive to the choice of the functional assay and performance metric (Figures 5, S14, and S15).
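The trajectory-error idea described above (match each trajectory from a doublet-containing dataset to one from the clean dataset, then measure Euclidean distances in PC space) can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the procedure from the STAR Methods: it assumes every trajectory is sampled at the same number of pseudotime points, that there are no more contaminated trajectories than clean ones, and that the matching minimizes the summed distance by brute force. All function names and the toy coordinates are hypothetical.

```python
from itertools import permutations
from math import dist


def curve_distance(curve_a, curve_b):
    """Mean pointwise Euclidean distance between two equally sampled curves."""
    return sum(dist(p, q) for p, q in zip(curve_a, curve_b)) / len(curve_a)


def trajectory_error(clean, contaminated):
    """Average distance between matched trajectories (assumed matching rule).

    Each contaminated trajectory is paired with a distinct clean trajectory
    so that the summed curve distance is minimal (brute-force search, which
    is only practical for a handful of trajectories).
    """
    if not contaminated or len(contaminated) > len(clean):
        raise ValueError("need 1..len(clean) contaminated trajectories")
    best = None
    for order in permutations(range(len(clean)), len(contaminated)):
        total = sum(curve_distance(clean[i], c)
                    for i, c in zip(order, contaminated))
        if best is None or total < best:
            best = total
    return best / len(contaminated)


# Toy example in a 2D "PC space" (hypothetical coordinates).
clean_curves = [[(0, 0), (1, 0)], [(0, 5), (1, 5)]]
doublet_curves = [[(0, 5), (1, 6)], [(0, 1), (1, 0)]]
error = trajectory_error(clean_curves, doublet_curves)
```

Under these assumptions, a larger returned value corresponds to a larger deviation of the inferred trajectories from those of the singlet-only dataset.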
Doublet detection in non-barcoded datasets with classifiers trained on barcoded data
We asked whether the ground-truth singlets could be leveraged to reliably identify singlets in other datasets. To test this possibility, we reasoned that the true singlets identified with singletCode could be used to train a machine learning model to identify doublets. We trained an XGBoost63 classifier on samples with 10% simulated doublets (Figures 6A and S16; STAR Methods). We split the classification datasets such that 60% of each dataset was used for training at both the parameter optimization and final training steps, 20% was used for validation during the optimization stage, and 20% was used for testing during the final evaluation stage. Despite the high class imbalance (singlets:doublets = 9:1), the average AUPRC and AUROC of the classifier were 0.97 and 0.99, respectively, across all samples (Figures 6B–6E). We used a variety of negative controls, including a scrambled matrix, a matrix of only singlets with some cells falsely labeled as doublets, and a matrix of doublets with some cells falsely labeled as singlets, which yielded average AUPRCs of 0.11 (random equivalent), 0.13, and 0.17, respectively, and AUROCs of 0.50, 0.51, and 0.52, respectively (Figures 6B–6D). Note that the classifier outperforming the doublet detection algorithms is not entirely unexpected because the classifier is trained on the labeled datasets themselves. We also calculated the performance of the classifier and the doublet detection methods upon changing the doublet rate. As expected, the AUPRC increases with doublet rate for existing methods due to the higher probability of correct identification of the positive class with increased doublet fraction. The AUPRC and AUROC for the classifier had already peaked at a low (5%) doublet fraction and were therefore unaffected by the doublet rate (Figure 6F).
Figure 6.
Machine learning doublet classifiers trained with singletCode singlets exhibit high performance for doublet detection
(A) Schematic of classifier optimization and training (STAR Methods).
(B and D) Results of classifier doublet detection compared to other doublet detection methods as measured by AUPRC (B) and AUROC (D). Each dot represents the average respective score of a doublet detection method (color) on a given dataset (x axis). n = 10 classifiers per sample. The ribbon has the width of the SD of the respective score for each doublet detection method.
(C and E) Summary of the performance of the classifier on the negative control, the performance of the benchmarked doublet detection methods, and the performance of the classifier as measured by AUPRC (C) and AUROC (E). Each small colored dot represents a sample (15 samples, n = 10 classifiers per sample; other methods have all expected doublet rates plotted). The large black dot represents the average value for a detection method, and the lines extending from the black dot represent the SD. p values calculated using an unpaired Wilcoxon rank-sum test.
(F) Results of classifier doublet detection compared to other doublet detection methods across doublet rates ranging from 0.05–0.25 as measured by AUPRC (left) and AUROC (right). n = 10 classifiers per doublet rate. Solid lines, colored by doublet detection method, represent the mean, and the ribbon represents the SD.
(G) Schematic of training a doublet classifier on barcoded data from 1 experiment and using that classifier to identify doublets in a biological replicate, experiment 2 (STAR Methods).
(H and I) AUPRC (H) and AUROC (I) scores of classifier doublet detection when the classifier is trained on a biological replicate compared to other doublet detection methods (n = 200 classifiers). p values calculated using a Wilcoxon rank-sum test.
We wondered whether the classifier could be successfully implemented on independently performed experiments with no barcode information for the same cell type and design conditions. To test this possibility, we first applied the classifier trained on sample A from the Goyal et al.28 1 dataset to predict doublets in sample B, an independently performed experiment on the same cell type (melanoma), and separately built a classifier on sample B to predict doublets in sample A (Figure 6G). The average AUPRC and AUROC for these inter-sample classifiers were significantly higher (0.69, 0.92 vs. 0.36, 0.78) than those obtained from other methods on the same two samples (Figures 6H and 6I). (The small reduction in AUPRC and AUROC, as compared to the intra-sample values in Figures 6B–6D, could be a result of sample-to-sample variation; Figures S17A and S17B.) We obtained qualitatively similar results with inter-sample classifiers from datasets consisting of other cell types (leukemia, mouse brain, bone marrow/leukemia), barcoding technologies (ClonMapper,17 TREX,37 SPLINTR26), and across species (mouse and human) (Figures S17C–S17F). The sample-to-sample variation is minimized as more samples are used to build the classifier, stratified by mouse and human datasets (Figures S17G–S17J). Our primary goal was to demonstrate that, in principle, doublets can be accurately identified by training on datasets that contain singletCode-labeled ground-truth singlets. Therefore, beyond the performance comparisons with the existing methods, our proof-of-concept analyses demonstrate the feasibility of this approach. Furthermore, given the high performance within and across experiments, we expect overfitting to be minimal. Which genes contribute to the classifier’s doublet detection efficacy can be explored systematically in future studies.
Discussion
Here, we developed and implemented singletCode, a framework that leverages synthetic DNA barcoding prior to single-cell library preparation reaction separation to extract ground-truth singlets in scRNA-seq datasets. We tested the feasibility of our approach on a range of scRNA-seq datasets and used the extracted singlets to evaluate the performance of doublet detection algorithms. We found comparatively low performance from different methods for barcoded scRNA-seq datasets, which do not require different species or patient samples to annotate doublets unlike previous benchmarking datasets. The performance of each method was differentially sensitive to how the doublets were simulated, the single-cell library generation technology, and the datasets themselves. Our work underscores the importance of incorporating synthetic barcoding in experimental designs, especially when rare cell types are of particular interest.
For barcoded datasets, singletCode provides a framework to identify ground-truth singlets for downstream analysis. Alternatively, singletCode itself can be leveraged to systematically test the performance of different doublet detection methods in scRNA-seq and other modalities, such as scATAC-seq. In particular, we expect singletCode to be helpful for datasets consisting of rare cell types or continuums of cell types (e.g., partial reprogramming) and those that contain transcriptionally similar cells. Another application of singletCode could be to detect spatially adjacent or adhesive/sticky cells if the doublet barcoded cells tend to cluster within certain regions of the high-dimensional transcriptional space.
Given the rapidly increasing repertoire of published barcoded scRNA-seq datasets from different cell types and contexts, our proof-of-concept machine learning classifier (Figure 6) can be extended to train deep learning models for broader querying of doublets on similar non-barcoded atlases,64 including potential applications to spatial transcriptomics. Many spatial transcriptomics technologies, such as 10X Visium and NanoString GeoMx, lack single-cell resolution, a limitation addressed by deconvolution using scRNA-seq datasets with computational methods such as MUSE and RCTD.65,66 However, doublets, particularly those created non-randomly because of cellular physiology and the nature of experimental assays,1,2 can confound such analysis. Barcoded true singlets can help with relatively more accurate deconvolution of spatial-omics datasets.
In summary, singletCode provides a generalizable and formalizable framework to identify true singlets, illustrates the shortcomings of doublet detection methods, and presents opportunities to harness machine learning on barcoded datasets for improved doublet detection.
Limitations
One limitation of our study is that we focused on synthetically introduced barcodes; future studies can extend singletCode to native cellular identifiers (e.g., mitochondrial mutations67 or TCR sequences12), thereby potentially extending doublet detection to clinical samples. A second limitation is that simulating doublets, even with ground-truth singlets, still involves a discretionary decision about how to imitate real doublets’ transcriptomes. Characterizing what constitutes a realistic doublet remains a challenge and can potentially be addressed by developing accurate mixture models of summing and averaging singlets. Another set of limitations is that singletCode cannot identify ground-truth singlets when a single barcode clone dominates the dataset, such that two cells with the same barcode can form an experimental doublet, or when index hopping occurs. While such events are rare, they are a limitation in principle and should be noted. Lastly, single-cell DNA barcoding remains technically sophisticated, and the experimental/computational protocols are not standardized yet. These discrepancies may create complications in preprocessing data and identifying singlets for previously unreported technologies.
STAR★Methods
Key resources table
| REAGENT or RESOURCE | SOURCE | IDENTIFIER |
|---|---|---|
| Deposited data | ||
| FateMap scRNA-seq data | Goyal et al.28 | GEO: GSE233766 |
| scRNA-seq hiPS cell-derived cardiomyocytes and hiPS cells | Jiang et al.31 | GEO: GSE198729 |
| scRNA-seq hiF-T cells | Jain et al.34 | GEO: GSE227151 |
| CellTag-multi scRNA-seq data | Jindal et al.36 | GEO: GSE216521 |
| ClonMapper scRNA-seq data | Gutierrez et al.17 | GEO: GSE151431 |
| LARRY scRNA-seq data | Weinreb et al.35 | GEO:GSE140802 |
| SPLINTR scRNA-seq data | Fennell et al.26 | GEO: GSE161676 |
| TREX scRNA-seq data | Ratz et al.37 | GEO: GSE153424 |
| Watermelon sequencing data | This paper | https://doi.org/10.6084/m9.figshare.25478680 |
| FateMap barcoding datasets | Goyal et al.28 | https://doi.org/10.6084/m9.figshare.22798952; https://doi.org/10.6084/m9.figshare.22802888 |
| hiPS cell-derived cardiomyocytes and hiPS cells barcoding datasets | Jiang et al.31 | https://doi.org/10.6084/m9.figshare.19126985 |
| hiF-T cells barcoding datasets | Jain et al.34 | https://doi.org/10.6084/m9.figshare.22251223.v1, https://doi.org/10.6084/m9.figshare.22236949.v1, https://doi.org/10.6084/m9.figshare.22236955.v1 |
| Barcode sheets (cellID-barcode-sample files for each dataset) and plot data | This paper | https://doi.org/10.6084/m9.figshare.25478680 |
| Experimental models: Cell lines | ||
| MCF7 cell line | Gift from Dr. Joan Brugge (Harvard Medical School) | N/A |
| T47D cell line | Gift from Dr. Joan Brugge (Harvard Medical School) | N/A |
| Software and algorithms | ||
| singletCode PyPI package | This paper | https://pypi.org/project/singletCode/ |
| singletCode Command Line Interface | This paper | https://github.com/GoyalLab/singletCodeTools/ |
| scDblFinder 1.12.0 | Germain et al.43 | https://bioconductor.org/packages/release/bioc/vignettes/scDblFinder/inst/doc/scDblFinder.html |
| DoubletFinder 2.0.4 | McGinnis et al.6 | https://github.com/chris-mcginnis-ucsf/DoubletFinder |
| Scrublet 0.2.3 | Wolock et al.7 | https://github.com/AllonKleinLab/scrublet/blob/master/examples/scrublet_basics.ipynb |
| hybrid (scds 1.13.1) | Bais and Kostka10 | https://github.com/kostkalab/scds |
| Chord 2.0.1 | Xiong et al.48 | https://github.com/13308204545/Chord |
| Seurat 5.0 | Hao et al.68 | https://satijalab.org/seurat/ |
| Python | Python Software Foundation | https://www.python.org |
| Numpy 1.25.2 | Harris et al.69 | https://numpy.org/ |
| Pandas 2.1.0 | The pandas development team | https://pandas.pydata.org/ |
| SingleCellExperiment 1.20.1 | Amezquita et al.70 | https://bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html |
| PRROC 1.3.1 | Grau et al.71 | https://cran.r-project.org/web/packages/PRROC/index.html |
| ggplot2 3.4.2 | Wickham72 | https://cran.r-project.org/web/packages/ggplot2/index.html |
| DropletUtils 1.18.1 | Griffiths et al.73 | https://bioconductor.org/packages/release/bioc/html/DropletUtils.html |
| Data.Table 1.14.10 | Barrett et al.74 | https://github.com/Rdatatable/data.table |
| Matrix 1.6.5 | N/A | https://cran.r-project.org/web/packages/Matrix/index.html |
| scikit-learn 1.3.2 | Pedregosa et al.75 | https://scikit-learn.org/stable/whats_new/v1.3.html |
| Scipy | Virtanen et al.76 | https://scipy.org/ |
| hyperopt 0.2.7 | Bergstra et al.77 | https://hyperopt.github.io/hyperopt/ |
| XGBoost 2.0.2 | Chen et al.63 | https://github.com/dmlc/xgboost |
| clue 0.3–65 | N/A | https://cran.r-project.org/web/packages/clue/index.html |
| R | The R Project for Statistical Computing | https://cran.r-project.org/ |
| CellChat | Jin et al.58 | https://github.com/sqjin/CellChat |
| CellPhoneDB | Garcia-Alonso et al.60 | https://github.com/ventolab/CellphoneDB |
| Slingshot | Street et al.62 | https://bioconductor.org/packages/devel/bioc/vignettes/slingshot/inst/doc/vignette.html |
Resource availability
Lead contact
Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Yogesh Goyal (yogesh.goyal@northwestern.edu).
Materials availability
This study did not generate new unique reagents.
Data and code availability
-
•
This paper analyzes existing, publicly available scRNA-seq and barcoding data. The scRNA-seq datasets are available at Gene Expression Omnibus with accession numbers (GSE233766, GSE198729, GSE227151, GSE216521, GSE151431, GSE140802, GSE161676, GSE153424) and barcoding datasets at Figshare (https://doi.org/10.6084/m9.figshare.22798952; https://doi.org/10.6084/m9.figshare.22802888; https://doi.org/10.6084/m9.figshare.19126985; https://doi.org/10.6084/m9.figshare.22251223.v1, https://doi.org/10.6084/m9.figshare.22236949.v1, https://doi.org/10.6084/m9.figshare.22236955.v1).
-
•
Barcode sheets (cellID-barcode-sample files for each dataset), data to remake each plot, and new Watermelon-barcoded scRNA-seq and Multiome data are available on Figshare: https://doi.org/10.6084/m9.figshare.25478680
-
•
All code for the analyses in this manuscript has been deposited at: https://github.com/GoyalLab/fatemap_multiplet_public; https://doi.org/10.5281/zenodo.11438747
-
•
All code for the command line interface and package has been deposited at: https://github.com/GoyalLab/singletCodeTools; https://doi.org/10.5281/zenodo.11452137
-
•
All code for the singletCode website has been deposited at: https://github.com/GoyalLab/SingletCodeWebsite; https://doi.org/10.5281/zenodo.11455739
-
•
Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.
Experimental model and study participant details
Cell line models
Human breast carcinoma cell line MCF7 (procured from Dr. Joan Brugge, Harvard Medical School) was cultured in DMEM supplemented with 10% FBS and 1% penicillin-streptomycin. Human breast carcinoma cell line T47D (procured from Dr. Joan Brugge, Harvard Medical School) was cultured in RPMI supplemented with 10% FBS and 1% penicillin-streptomycin. Both cell lines were treated with 1 μM palbociclib (CDK4/6i) and 1 nM fulvestrant (SERD). No human subjects were used in this study.
Method details
Gini coefficient analysis for barcode uniformity and bias quantification
To perform the Gini analysis for the single-cell datasets, we first simulated what the distribution of Gini coefficients would be from uniform, exponential, and power distributions. We selected 20 random numbers from a given distribution and then calculated the Gini coefficient for those numbers. This process was repeated approximately 20 times for each distribution we simulated. For the power distribution, we used an alpha value of 1.5. Next, we preprocessed our single-cell data by removing cells with feature counts below the 10th percentile for that sample and cells with mitochondrial counts above the 60th percentile for that sample. Variable features were identified using a variance stabilizing transform for the top 2000 most variable genes. Neighbors were found using the top 50 dimensions, clusters were found using three resolutions (0.4, 0.8, 1.2), and UMAP projection was calculated using the top 50 dimensions.
For each of the three resolutions for a given sample, we calculated the Gini coefficient of the proportion of cells identified as singlets per cluster. We did this across multiple resolutions because they yield different numbers of clusters, and we wanted to make sure the clustering of the data had no effect on the distribution of singlets.
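The Gini calculation described above can be sketched as follows. This is a minimal illustration, not the analysis code: the sorted-rank formula is one standard way to compute the Gini coefficient, the use of a Pareto draw for the "power" distribution with alpha = 1.5 is an assumption (the text does not specify the parameterization), and the per-cluster singlet fractions are hypothetical values.

```python
import random


def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly even)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive value")
    # Rank-weighted sum over the sorted values (ranks start at 1).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def simulate_gini(draw, n_values=20, n_repeats=20, seed=0):
    """Gini coefficients of repeated draws of n_values random numbers."""
    rng = random.Random(seed)
    return [gini([draw(rng) for _ in range(n_values)])
            for _ in range(n_repeats)]


# Reference distributions; the Pareto choice for "power" is an assumption.
simulated = {
    "uniform": simulate_gini(lambda rng: rng.random()),
    "exponential": simulate_gini(lambda rng: rng.expovariate(1.0)),
    "power": simulate_gini(lambda rng: rng.paretovariate(1.5)),
}

# Hypothetical fraction of cells called singlets in each of four clusters.
singlet_fraction_per_cluster = [0.42, 0.40, 0.45, 0.41]
cluster_gini = gini(singlet_fraction_per_cluster)
```

A per-cluster Gini coefficient close to zero, as in this toy example, indicates that singlets are spread evenly across clusters; the simulated distributions give a reference for how large the coefficient becomes under increasingly skewed inputs.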
Nearest neighbor analysis for barcode uniformity and bias quantification
For the nearest neighbor analysis of the single-cell FateMap-barcoded datasets,28,31,34 we first performed quality-control steps, including thresholds for RNA counts, mitochondrial reads, and number of cells. Since the datasets covered several different cell types, the threshold values varied depending on the dataset, and we have provided them in Table S6. To perform the nearest neighbor analysis, we extracted the top five nearest cell neighbors and the associated distances for each cell in the principal component space, and calculated the average neighbor distance. We performed this analysis on datasets composed either exclusively of singlets identified from the DNA lineage barcodes, or those that we randomly subsampled from the entire single-cell dataset consisting of singlets and multiplets. We performed random sampling and nearest neighbor calculation three times, ensuring the total subsampled cells to be the same number as the singlets for appropriate comparisons. The singlets' average neighbor distances were normalized by subtracting the average neighbor distance of the subsampled cells. A mean value closer to zero indicates no preference for barcoding specific cells in high-dimensional principal component spaces. A nonzero mean value suggests a preference for specific cells in high-dimensional spaces or manifolds to be barcoded. Datasets displaying high standard deviations were affected by reduced cell counts attributed to high percentages of mitochondrial reads and a likely increase in cell death.
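The normalized nearest neighbor distance described above can be sketched as follows. This is a minimal, brute-force illustration under stated assumptions: cells are given as coordinate tuples in principal component space, neighbors are searched within each subset separately, and the function names and default arguments are hypothetical rather than taken from the analysis code.

```python
import random
from math import dist


def mean_knn_distance(points, k=5):
    """Average, over all cells, of the mean distance to the k nearest cells."""
    if len(points) <= k:
        raise ValueError("need more than k points")
    per_cell = []
    for i, p in enumerate(points):
        # Distances from this cell to every other cell, smallest first.
        dists = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        per_cell.append(sum(dists[:k]) / k)
    return sum(per_cell) / len(per_cell)


def normalized_singlet_distance(all_cells, singlet_idx, k=5,
                                n_repeats=3, seed=0):
    """Singlet mean neighbor distance minus that of size-matched subsamples."""
    rng = random.Random(seed)
    singlets = [all_cells[i] for i in singlet_idx]
    singlet_mean = mean_knn_distance(singlets, k)
    subsample_means = []
    for _ in range(n_repeats):
        # Random subsample of all cells with the same size as the singlets.
        subsample = rng.sample(all_cells, len(singlets))
        subsample_means.append(mean_knn_distance(subsample, k))
    return singlet_mean - sum(subsample_means) / n_repeats


# Toy check: three cells on a line, one neighbor each.
line_mean = mean_knn_distance([(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)], k=1)
```

A value near zero from `normalized_singlet_distance` would correspond to singlets that are no more clustered in principal component space than randomly chosen cells.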
To account for what a “positive control” would look like for our nearest neighbor analysis, we artificially biased a control dataset by subsampling regions of principal component space from one of the samples (FM0-1_sample3, Figure S4B). We subsampled three restricted regions of the PCA plot by thresholding based on the PC1 vs. PC2 plot (n = 1586 cells) to mimic highly clustered cells, and calculated the average nearest neighbor distance to compare them with the real experimental datasets.
We reasoned that if the average neighbor distances of singlets were indistinguishable from all cells, the singlets uniformly spanned the entire scRNA-seq dataset. Indeed, singlet nearest neighbor distances were indistinguishable from the entire scRNA-seq dataset (representative plot in Figure S4C for Goyal et al.28; sample 2) and significantly different from the biased control dataset (Figures S4B and S4D). Note that this analysis was independent of the cluster designation in the Uniform Manifold Approximation and Projection (UMAP) space.
Generation of scRNA-seq datasets, singlets, and doublets
In order to simulate doublets to most effectively benchmark the detection method performances, we identified true singlets by examining the mapping of the barcode to cell ID. Because barcodes are added to cells before the oil droplet encapsulates the cell, any cell ID that has only one associated barcode is a true singlet (1:1:1 mapping of a cell to a barcode to a cell ID). To account for the scenario where multiple barcodes are added to each cell, we further characterized all cell IDs that are associated with the same combination of barcodes within the same sample or across twin samples as singlets. Lastly, it is possible that some cells receive ambient barcodes. In this case, one barcode should have a significantly higher UMI count than the others, since all other barcodes for that cell are ambient. We therefore also considered these cells as singlets. There are two cases where cells might be classified as such a singlet. First, the cell had only one barcode with at least 10 associated UMIs. Alternatively, it could have multiple barcodes with at least 10 associated UMIs, but one of these barcodes has at least 50 more UMIs than the median of all UMI counts for that cell. We further identified cells that have the same combination of barcodes only when examining across multiple samples. Singlets identified from this procedure are documented but not included in the simulation process for benchmarking purposes. We used these three strategies to label true singlets in the dataset.
To simulate the doublets, we randomly selected the count data from two cells we identified as true singlets. Since previous approaches to simulate doublets performed either averaging6,8 or summing7,10 two cells, we also averaged or summed the counts from these two cells to generate simulated doublets, accounting for this change in the number of cells when evaluating the performance of doublet detection algorithms. For a desired doublet percentage, we generated an initial number of doublets based on the initial assumption that doublets are generated from singlets without replacement such that, for each doublet generated, the total number of cells will decrease by 1. It then follows that number of doublets = (doublet percentage) × (number of cells), where number of cells = (number of initial singlets) − (number of doublets). We generated the number of doublets determined by this formula. Next, we removed the singlet cells used to create doublets and corrected for the fact that we generated doublets from singlets with replacement to ensure the desired doublet percentage. We solved for the final number of singlets using number of singlets = (number of doublets) × (1 − doublet percentage)/(doublet percentage) and subsampled that number of singlets to be included in the final matrix. This approach minimized the number of cells that needed to be trimmed in the doublet generation process to maintain desired doublet percentages. In the rare cases where there were not enough singlets remaining after those used to create doublets were removed, the number of doublets was instead calculated according to number of doublets = (doublet percentage) × (number of singlets)/(1 − doublet percentage), doublets were created, the singlets used to create those doublets were removed, and the doublets were subsampled to ensure the correct final doublet percentage. The final scRNA-seq datasets were generated by adding such simulated doublets into the datasets at different percentages (5–25%).
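The singlet-calling rules above can be sketched as follows. This is an illustrative sketch rather than the singletCode package's implementation: the input layout (a dictionary of per-cell barcode UMI counts), the function name, and the label strings are hypothetical, twin-sample and cross-sample handling is omitted, and "one of these barcodes" is interpreted as exactly one barcode, which is an assumption.

```python
from collections import Counter
from statistics import median

MIN_UMI = 10           # UMI threshold stated in the text
DOMINANCE_MARGIN = 50  # required excess over the per-cell median UMI count


def call_singlets(barcode_umis):
    """Map cell ID -> reason it is called a singlet (uncalled cells omitted).

    barcode_umis: dict of cell ID -> dict of barcode -> UMI count.
    """
    # How often each multi-barcode combination occurs across cell IDs.
    combos = Counter(frozenset(b) for b in barcode_umis.values() if len(b) > 1)
    calls = {}
    for cell, umis in barcode_umis.items():
        if not umis:
            continue  # no barcode detected: cannot be classified
        if len(umis) == 1:
            calls[cell] = "single_barcode"
        elif combos[frozenset(umis)] > 1:
            calls[cell] = "shared_combination"
        else:
            strong = [b for b, n in umis.items() if n >= MIN_UMI]
            cutoff = median(umis.values()) + DOMINANCE_MARGIN
            dominant = [b for b in strong if umis[b] >= cutoff]
            # Either one barcode passes the UMI threshold, or several do
            # and exactly one of them clears the median-based cutoff.
            if len(strong) == 1 or (len(strong) > 1 and len(dominant) == 1):
                calls[cell] = "dominant_barcode"
    return calls


# Toy input with hypothetical cell IDs, barcodes, and UMI counts.
cells = {
    "cell1": {"bcA": 40},
    "cell2": {"bcB": 30, "bcC": 25},
    "cell3": {"bcB": 12, "bcC": 18},
    "cell4": {"bcD": 120, "bcE": 2, "bcF": 1},
    "cell5": {"bcG": 45, "bcH": 50},
    "cell6": {"bcI": 200, "bcJ": 12, "bcK": 11},
}
calls = call_singlets(cells)
```

In this toy input, `cell5` carries two comparably abundant barcodes that no other cell shares, so it is left uncalled and would be treated as a potential doublet.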
For this study, except for specific analyses where summation doublets are explicitly used, all doublets are created by averaging.
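The doublet-count arithmetic and the averaging or summing of two singlets can be sketched as follows. Solving the two relations in the text for the number of doublets gives doublets = p × (initial singlets)/(1 + p) for a doublet fraction p, which is what `plan_counts` uses. The sketch is illustrative only: function names and the toy expression vectors are hypothetical, rounding to whole cells is an assumption, and the fallback for the rare case of too few remaining singlets is not implemented.

```python
import random


def plan_counts(n_singlets, rate):
    """Number of doublets to simulate and singlets to keep for a target rate."""
    # doublets = rate * (n_singlets - doublets), solved for doublets.
    n_doublets = round(rate * n_singlets / (1 + rate))
    # singlets kept so that doublets / (doublets + singlets) equals rate.
    n_keep = round(n_doublets * (1 - rate) / rate)
    return n_doublets, n_keep


def make_doublet(counts_a, counts_b, mode="average"):
    """Combine two singlet count vectors by averaging or summing."""
    if mode == "average":
        return [(a + b) / 2 for a, b in zip(counts_a, counts_b)]
    if mode == "sum":
        return [a + b for a, b in zip(counts_a, counts_b)]
    raise ValueError("mode must be 'average' or 'sum'")


def simulate(singlets, rate, mode="average", seed=0):
    """Return (kept singlets, simulated doublets) at the requested rate."""
    rng = random.Random(seed)
    n_doublets, n_keep = plan_counts(len(singlets), rate)
    used, doublets = set(), []
    for _ in range(n_doublets):
        # Two distinct cells per doublet; a cell may recur across doublets.
        i, j = rng.sample(range(len(singlets)), 2)
        used.update((i, j))
        doublets.append(make_doublet(singlets[i], singlets[j], mode))
    remaining = [s for k, s in enumerate(singlets) if k not in used]
    if len(remaining) < n_keep:
        raise ValueError("too few singlets left; fallback not implemented")
    return rng.sample(remaining, n_keep), doublets


# Toy data: 110 hypothetical singlets with three genes each.
toy_singlets = [[i, 2 * i, 1] for i in range(110)]
kept, doublets = simulate(toy_singlets, rate=0.1)
```

With 110 starting singlets and a 10% target, this yields 10 doublets and 90 retained singlets, so doublets make up 10 of the 100 cells in the final matrix.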
Benchmark environment and parameter settings
We identified barcoded cells in each scRNA-seq dataset using the methods published with that dataset and then annotated singlets with singletCode. We ran each doublet detection method on each scRNA-seq sample with the expected doublet rate (where such a parameter is available) and the actual doublet rate each set to 0.05, 0.08, 0.1, 0.15, 0.2, and 0.25. Note that hybrid does not support specifying an expected doublet rate. Also, while DoubletFinder supports such a parameter, it does not take it into account when calculating doublet scores; it uses it only to determine the cutoff score above which cells are labeled as doublets. This means each method was run 36 times in total per sample (6 expected × 6 actual doublet rates). Each sample was loaded with Seurat and converted to SingleCellExperiment in R70 if necessary. For Scrublet, the count matrices and labels were extracted and loaded into Python. All algorithms were run with the recommended settings following their official tutorials. We then used the doublet scores returned by each method to calculate AUROC and AUPRC scores. As in previous benchmarking studies,4,42 TNR was calculated by first labeling cells with the top doublet scores from each method as doublets according to the expected doublet rate. Next, we defined true negatives as cases in which both the doublet detection method and singletCode identified a cell as a singlet, and false positives as cases in which singletCode identified a cell as a singlet but the doublet detection method did not. Using these definitions, we calculated TNR for each of the four methods.
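The TNR calculation described above can be sketched as follows. This is a minimal illustration, assuming that the number of cells labeled as doublets is the expected doublet rate times the number of cells (rounded) and that ties in doublet score are broken by input order; the function name and the toy scores are hypothetical.

```python
def true_negative_rate(scores, is_true_singlet, expected_rate):
    """TNR = TN / (TN + FP), evaluated over ground-truth singlets.

    scores: per-cell doublet scores from one method (higher = more doublet-like).
    is_true_singlet: per-cell booleans from the barcode-based ground truth.
    expected_rate: fraction of cells to label as doublets by top score.
    """
    n_called = round(expected_rate * len(scores))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    called_doublet = set(ranked[:n_called])
    # True negatives: true singlets that the method also leaves as singlets.
    tn = sum(1 for i, s in enumerate(is_true_singlet)
             if s and i not in called_doublet)
    # False positives: true singlets that the method labels as doublets.
    fp = sum(1 for i, s in enumerate(is_true_singlet)
             if s and i in called_doublet)
    if tn + fp == 0:
        raise ValueError("no ground-truth singlets")
    return tn / (tn + fp)


# Toy example: ten cells, of which only the first is a true doublet.
toy_scores = [0.9, 0.8, 0.1, 0.2, 0.3, 0.15, 0.05, 0.25, 0.12, 0.4]
toy_truth = [False] + [True] * 9
tnr = true_negative_rate(toy_scores, toy_truth, expected_rate=0.2)
```

Here the two top-scoring cells are labeled as doublets; one of them is a true singlet, so 8 of the 9 true singlets are correctly retained and the TNR is 8/9.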
When comparing performance of each method on barcoded and non-barcoded datasets, we ran the doublet detection methods on the non-barcoded datasets in their published form, but also in a subsampled form where we subset each dataset to an 8% doublet rate. We did this in order to compare the performance of each method on both 8% doublet barcoded and non-barcoded datasets. We ran the doublet detection methods once on each non-barcoded dataset, and we compared that to the performance of the methods on all of the barcoded datasets.
scDblFinder: This method was executed in R following the instructions at https://bioconductor.org/packages/release/bioc/vignettes/scDblFinder/inst/doc/scDblFinder.html.
DoubletFinder: This method was executed in R following the instructions at https://github.com/chris-mcginnis-ucsf/DoubletFinder.
Scrublet: This method was executed in Python following the instructions at https://github.com/AllonKleinLab/scrublet/blob/master/examples/scrublet_basics.ipynb.
hybrid: This method was executed in R following the instructions at https://github.com/kostkalab/scds.
Chord: This method was executed in R following the instructions available at the following GitHub link, with parameter overkill set to True: https://github.com/13308204545/Chord.
We tested Chord on 3 datasets (Goyal et al.28 1, Goyal et al.28 2, ClonMapper17) for a total of 12 samples.
The following samples were omitted for all benchmarking comparisons: SPLINTR “inVitro_KRAS” and “retransplant”, and TREX “brain1”.
Relevant software versions are as follows: scDblFinder43: 1.12.0, DoubletFinder6: 2.0.4, scds10: 1.13.1, Scrublet7: 0.2.3, Chord48: 2.0.1, Python: 3.7, R: 4.2.3, Seurat68: 5.0.0, SingleCellExperiment70: 1.20.1, PRROC71: 1.3.1, ggplot272: 3.4.2.
Creation of Watermelon Multiome datasets
Cells were transduced with the Watermelon library as previously described.18 The samples were prepared for scRNA-seq and scATAC-seq using the standard 10X Genomics Multiome protocol. The same cells in the Multiome dataset undergo scATAC-seq, scRNA-seq, and barcode sequencing. To increase lineage barcode capture, targeted sequencing of the barcode area from the RNA was performed using the whole transcriptome amplification product generated as a part of the v2 protocol as a PCR template. Targeted RNA libraries were gel purified and sequenced with a MiSeq (Illumina).
Incorporating wide-ranging barcoding technologies within singletCode
The singletCode framework extracts information about barcodes associated with a cell and its UMI counts to identify singlets. Different single-cell barcoding techniques and analysis pipelines produce files with their own technique-specific data structures and preprocessing steps. Furthermore, terminology for what is a barcode, cell ID, and UMI is not standardized. Therefore, to harmonize across technologies, the data from each barcoding method was preprocessed as closely as possible to the original study. We defined “barcode” as the transcribed DNA barcode that labels individual cells for lineage tracing experiments and is unique to an individual cell, “cell ID” as the bead-based identifier in single cell sequencing reactions that is unique to a droplet or well, and “UMI” as the unique molecular identifying sequence unique to each mRNA molecule. We processed each dataset (at least one per barcoding technology) such that the resulting output is a CSV file with columns for cell ID, barcode, and sample. The UMI count for a barcode-cell ID-sample combination is reflected by the number of duplicate rows for that combination. Specific details for each technology are provided below:
CellTag-multi36: The input RNA sequencing files were matrix.h5 for all the samples (available at GSE216521) and barcode matrix files containing information about the lineage barcode (CellTag) associated with each cell ID (provided by Jindal et al.36 upon request). More information about this file (the general naming format is “{0}_allow_ctmat.mtx”) can be found in the script in the Morris Lab GitHub repository (https://github.com/morris-lab/newCloneCalling/blob/main/cloneCalling_scripts/celltag_analysis_single_assay.ipynb). The UMI counts for all barcodes present in a cell were extracted from the barcode matrix. Cells containing more than 25 or fewer than 1 unique barcode were filtered out in accordance with the authors’ published thresholds. The following languages and packages were used for this analysis: Python 3.11.5, Numpy69 1.25.2, Scipy76 1.11.2 and Pandas78 2.1.0.
ClonMapper17: We obtained an output of the pycashier pipeline for 4 samples from the ClonMapper study17 (untreated: TP0, treated: TP0.5, TP1-1, TP1-7), which contained Illumina read information, UMI, cell ID, and barcode and were named “{sample}.cell_record_labeled.barcode.csv”. More information regarding how this intermediate file is generated can be found on the Brock Lab pycashier GitHub repository (https://github.com/brocklab/pycashier). Each file was filtered such that only 20 bp barcodes were kept. Sequencing data for these samples were found under accession code GSE151431.17
LARRY35: The method for processing the transcriptomic data (GSE140802 for Weinreb et al.35) to obtain the data formatted for doublet detection was directly adapted from the original paper. Briefly, quality control was performed on the cells based on the UMI cutoff for the cell—the exact cutoff for each sample (sample names LK1 corresponds to the data starting with “d_”, LK2 corresponds to LK and LSK corresponds to LSK) was taken from Table S1 of the original paper35—and the mitochondrial gene percentage (20%). The samples from the LK1 experiment were merged and normalized. Separately the samples from LSK and LK2 were merged and normalized together. Then, we did barcode matching using the input file LARRY_sorted_and_filtered_barcodes.fastq.gz (provided by the authors of the study). This file contains the sample name, cell ID, lineage barcode, and UMI sequence. More information about this file can be found on the LARRY35 GitHub repository https://github.com/AllonKleinLab/LARRY/tree/master. Combinations of cell ID, lineage barcodes and UMIs were first filtered to contain only those containing >10 reads. Next, all the barcodes within a Hamming distance of 3 were combined into a barcode with the highest UMI count among them, and the UMI counts were added up. Finally, we matched the cell IDs from the transcriptomic data to corresponding lineage barcodes and UMI counts and used this data for next steps. The following languages and packages were used for this analysis: Python 3.11.5, Numpy 1.25.2, Pandas 2.1.0, and Scipy 1.11.2.
SPLINTR26: We obtained files which contained read information, cell ID, UMI, and both barcode sequence and barcode reference ID (ending in _map.csv) from Fennell et al.26 upon request. These files corresponded to samples from chemotherapy treatment experiments (KRAS_T0_1, KRAS_T0_2, pool 1, and pool 2), in vitro clonal competition experiments (FLT3_T0, KRAS_T0_2, MLL_T0), and the bone marrow cells from retransplant experiments (KF10_BM) in Fennell et al.26 From these files, we extracted the cell ID, UMI count, and barcode reference ID for singlet counting.
Sequencing data for all samples was obtained from Fennell et al.26 GEO series GSE161676. The hashtagged and pooled samples in chemotherapy treatment experiments were demultiplexed using the HTODemux() function according to the methods described in Fennell et al.26 Pool 1 corresponds to vehicle and day 2 chemotherapy samples, and pool 2 corresponds to day 5 and day 7 chemotherapy samples. Sequencing data for these pooled samples were subset according to the results of the demultiplexing. Cells found as negatives or doublets according to the HTO classifications were removed during this process. The two in vitro MLL matrices were treated as separate samples for our analyses.
TREX37: An output of the TREX pipeline, umi_count_matrix.csv, was provided by Ratz et al.37 upon request. This file contains cell ID, barcode, and UMI count information which we formatted according to our needs. Descriptions of this file and other TREX outputs can be found on the Frisen lab GitHub (https://github.com/frisen-lab/TREX). The sequencing data (in the form of filtered_gene_bc_matrices) for each sample were obtained from the Ratz et al.,37 2022 GEO series (GSE153424) and merged based on the following combinations: brain 1 (GSM4644058, GSM4644059, GSM4644060), brain 2 (GSM4644061, GSM4644062, GSM4644063), brain 3 (GSM4644064, GSM4644065, GSM4644066), brain 4 (GSM4644067, GSM4644068, GSM4644069).
TREX with Smart-seq338: We additionally analyzed TREX-barcoded data that have been sequenced by Smart-seq3.38 With Smart-seq3, single cells are sorted into wells rather than using microfluidic droplets like 10X or InDrops sequencing. We obtained an output of the TREX37 pipeline, lineageBC_results_UMIonly.txt (provided by Mold et al.38), which contained cell ID, lineage barcode, and UMI count information. Since sequencing with Smart-seq3 contains both internal and external reads, such that not every barcode will have an associated UMI, we also analyzed lineageBC_results.txt (provided by authors), which had read count instead of UMI count. We used both of these count types independently for doublet detection, and refer to each as “smartseq3_reads” and “smartseq3_umis”, respectively. The lineageBC_results files were subset based on the cells belonging to each sample, using cell IDs from an output of TREX, read_count_matrix.csv (provided by authors) specific to each sample. The RNA counts matrix for downstream analyses were extracted from an RDS object (provided by the authors) with umi counts for both introns and exons from both samples (brain 1 and brain 2). This matrix was also subset based on the read_count_matrix.csv files for each respective sample.
Watermelon18: The input files for processing were an RDS file containing Seurat objects with the RNA count data for all the samples and fastq files for each sample containing the barcode information obtained by amplified sequencing using Illumina MiSeq. These datasets were generated de novo for this study. The preprocessing of this data was developed by adapting the script in the GitHub repository for Watermelon https://github.com/yaaraore/Watermelon/blob/master/code/10X_WTA_dialOut.Rmd. Briefly, cell ID, lineage barcode, and UMI sequence were extracted from the fastq files using the ShortRead (1.56.1) package. The barcodes were filtered such that their length was exactly 30, which corresponds to the length of the Watermelon18 barcode. The UMI count for each cell ID and barcode pair was calculated by counting the number of unique UMI sequences associated with the pair. Next, all the barcodes were merged to the barcode with the highest UMI count within a Levenshtein distance of 3, and the UMI counts of these merged barcodes were then added to the counts of the highest-UMI barcode. If the UMI count of any barcode of a cell was less than one-third of the UMI count of the highest-UMI barcode associated with the cell, that barcode was removed in accordance with the authors’ published thresholds. The script was prepared in R 4.2.3, and the software used for this analysis is Seurat68 5.0.1, SingleCellExperiment70 1.20.1, DropletUtils73 1.18.1, Data.table74 1.14.10 and Matrix 1.6.5 (https://cran.r-project.org/web/packages/Matrix/).
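As an illustration of the harmonized barcode table described at the start of this section (one row per observed UMI, so that duplicate rows encode the UMI count), the following sketch tallies UMI counts and applies a one-third filter like the one described for Watermelon. The column names `cellID`, `barcode`, and `sample` and both function names are assumptions made for this example; the actual file headers may differ.

```python
import csv
import io
from collections import Counter


def umi_counts(csv_text):
    """Count duplicate (cell ID, barcode, sample) rows; each row is one UMI."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter((row["cellID"], row["barcode"], row["sample"]) for row in reader)


def drop_minor_barcodes(counts):
    """Drop barcodes with fewer than one-third of the UMIs of the cell's top barcode."""
    top = {}
    for (cell, _barcode, sample), n in counts.items():
        top[(cell, sample)] = max(top.get((cell, sample), 0), n)
    # Integer comparison (3 * n >= top) avoids floating-point rounding at the cutoff.
    return {key: n for key, n in counts.items() if 3 * n >= top[(key[0], key[2])]}


rows = ["cellID,barcode,sample"] + ["c1,AAA,s1"] * 6 + ["c1,CCC,s1"] + ["c2,GGG,s1"] * 2
counts = umi_counts("\n".join(rows))
print(drop_minor_barcodes(counts))
```

Here cell c1 keeps only its dominant barcode, because the second barcode has 1 UMI against 6 for the top barcode.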
Pairwise comparison of calls made on simulated doublets across doublet detection methods
For samples where the expected and actual doublet rate were both 0.1, the calls (“singlet” or “doublet”) and scores made by each doublet detection method were extracted and calls for each cell were compared using what we refer to as a similarity score. The similarity score is a pairwise calculation between two doublet detection methods and was calculated by counting the number of times the same call is made on the same doublet by both methods, divided by the total number of doublets in the sample.
The aim of this metric was to assess the concordance between methods in identifying specific features that they designate as characterizing a cell as a doublet. The maximum similarity score is 1, where the same call is made by both methods for every single doublet within a sample. Similarity scores were averaged across samples for each dataset and visualized as a heatmap.
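A minimal sketch of the similarity score for one pair of methods is given below. The function name and the string labels are illustrative assumptions; only the definition (agreeing calls on ground-truth doublets divided by the number of doublets) comes from the text.

```python
def similarity_score(calls_a, calls_b, is_doublet):
    """Fraction of true doublets on which two methods make the same call."""
    doublet_idx = [i for i, flag in enumerate(is_doublet) if flag]
    agree = sum(1 for i in doublet_idx if calls_a[i] == calls_b[i])
    return agree / len(doublet_idx)


is_doublet = [True, True, True, True, False, False]
method_a = ["doublet", "doublet", "singlet", "singlet", "singlet", "doublet"]
method_b = ["doublet", "singlet", "singlet", "doublet", "doublet", "doublet"]
print(similarity_score(method_a, method_b, is_doublet))
```

Note that two methods agreeing that a doublet is a "singlet" still counts toward the score, since the metric measures concordance rather than correctness.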
Cell type annotation
For each dataset, preprocessing was done according to the parameters described in the respective study. Cell type markers were sourced from the original study, if reported, or from a published reference79 for those that did not specify cell type information (Watermelon,18 CellTag-multi36 iEP sample and ClonMapper17). We integrated two additional samples for CellTag-multi that were used for the CellTag-multi HSC cell type annotation only. We used these markers with scType80 to assign cell type labels for all of the datasets. Markers for all cell types and a table for exact preprocessing parameters are outlined in Tables S4 and S5.
Training a doublet classifier using “ground-truth” singlets identified with singletCode
Datasets were created with a 10% doublet rate with the same method used to simulate doublets for benchmarking. The counts matrix, cell IDs, and gene names were then loaded into Python and an AnnData object was created. The labels of each cell as a singlet or doublet were also loaded and encoded such that doublets were the positive class. The data was then split into a 60% training, 20% validation, and 20% testing set with stratification so the 10% doublet rate was maintained across each set. The training and validation set were used for hyperparameter optimization while the training and testing set were used for final model fitting and testing. The model chosen was extreme gradient boosting, or XGBoost.63 Gradient boosting uses a series of decision trees to minimize the residuals of the prediction made by the previous tree. We chose XGBoost because of its out-of-the-box performance on tabular data, speed, and extensive documentation.
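The stratified 60/20/20 split can be sketched with the standard library alone. This is a stand-in written for illustration and is not claimed to be the routine used in the study; the function name, the seed, and the rounding of per-class split sizes are assumptions.

```python
import random


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return (train, validation, test) index lists, split within each class."""
    rng = random.Random(seed)
    train, validation, test = [], [], []
    for cls in sorted(set(labels)):
        idx = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(idx)
        n_train = round(fractions[0] * len(idx))
        n_val = round(fractions[1] * len(idx))
        train.extend(idx[:n_train])
        validation.extend(idx[n_train:n_train + n_val])
        test.extend(idx[n_train + n_val:])
    return train, validation, test


labels = [1] * 100 + [0] * 900  # 1 = doublet (positive class), 0 = singlet
train, validation, test = stratified_split(labels)
print(len(train), len(validation), len(test))
```

Because the split is made separately within the doublet and singlet classes, each of the three sets keeps the 10% doublet rate of the full dataset.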
To achieve peak performance, we used Bayesian optimization through the hyperopt library77 to tune the hyperparameters of our model. The hyperparameter search space was defined as follows: ‘n_estimators’ ranging from 1 to 100, ‘max_depth’ ranging from 2 to 20, ‘learning_rate’ uniformly distributed between 0.01 and 1, ‘min_child_weight’ ranging from 1 to 10, ‘gamma’ uniformly distributed between 0.1 and 1.0, ‘subsample’ uniformly distributed between 0.5 and 1, ‘colsample_bytree’ uniformly distributed between 0.5 and 1, ‘reg_alpha’ (L1 regularization term on weights) uniformly distributed between 0 and 1, ‘reg_lambda’ (L2 regularization term on weights) uniformly distributed between 1 and 3, ‘scale_pos_weight’ uniformly distributed between 1 and 100. The objective was set to ‘binary:logistic’ for binary classification tasks, and the booster method was fixed to ‘gbtree’ for a gradient-boosted tree model. The objective function, or function which is optimized during the training process, aimed to maximize the AUPRC value returned by the classifier after each optimization (loss function = -AUPRC), and also reported the accuracy and AUROC values as secondary metrics. Hyperparameter optimization was done using the fmin() function from the hyperopt library. It was conducted over 10 iterations using the Tree-structured Parzen Estimator (TPE) algorithm.
Following hyperparameter optimization, the classifier was retrained on the training set using the best-performing parameters. The model’s efficacy was evaluated on the test set, focusing on AUROC, AUPRC, and accuracy metrics. The model’s predictions, alongside actual labels and the probability scores for the test set, were documented and saved for subsequent evaluation. A comprehensive summary, including the dataset and condition identifiers, performance metrics, and optimal hyperparameters, was compiled for each experiment.
Four different controls were done with each dataset. The first involved selecting all singlets in a dataset and falsely labeling 10% of them as doublets. The second involved selecting all doublets in a dataset and falsely labeling 90% of them as singlets. The third involved scrambling the counts values within the gene expression matrix without changing any cell or singlet/doublet labels. The fourth involved shuffling the features of the matrix in order to detect whether the structure of the matrix affected the model performance.
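The two label-based controls can be sketched as below. The function name, the string labels, and the fixed random seed are hypothetical choices for this example; the count-scrambling and feature-shuffling controls are not shown.

```python
import random


def control_labels(n_cells, true_label, flip_fraction, seed=0):
    """Assign every cell `true_label`, then flip a random fraction to the other class."""
    other = "doublet" if true_label == "singlet" else "singlet"
    rng = random.Random(seed)
    flipped = set(rng.sample(range(n_cells), round(flip_fraction * n_cells)))
    return [other if i in flipped else true_label for i in range(n_cells)]


# Control 1: all cells are singlets, 10% falsely labeled as doublets.
control_1 = control_labels(1000, "singlet", 0.10)
# Control 2: all cells are doublets, 90% falsely labeled as singlets.
control_2 = control_labels(1000, "doublet", 0.90)
print(control_1.count("doublet"), control_2.count("doublet"))
```

Both controls present the classifier with a nominal 10% "doublet" class in which the labels carry no real information about doublet status.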
When training the inter-sample classifiers, samples were normalized with scTransform() and integrated by SelectIntegrationFeatures(), FindIntegrationAnchors(), and IntegrateData() from Seurat. After integration, separate matrices were saved for each sample, classifiers were built on each individual sample, then used to detect doublets in the opposite sample. For a given dataset, the following samples were used for each inter-sample classifier: Goyal et al.28 1 (sample 1, sample 2), ClonMapper17 (FM1, FM7), TREX37 (brain 2, brain 3), and SPLINTR26 chemotherapy vehicle (chemoVehicle_1, chemoVehicle_2) and treated (chemoDay2_1, chemoDay2_2). Only the TREX37 dataset did not show an increase in performance compared to other doublet detection methods, and, upon further inspection, only one of its two sample classifiers showed a decrease, which could be due to sample-to-sample variation.
When training the integrated human and mouse data classifiers, samples were normalized and integrated as described above. For the integrated human data classifier, the Goyal et al.28 1 and ClonMapper17 samples were used. For the integrated mouse data classifier, the TREX37 and SPLINTR26 chemotherapy vehicle and treated samples were used. After integration, separate matrices were saved for each sample. Separate matrices were also saved based on splitting the data in half by sample pair. For the human data, Goyal et al.28 1 sample 1 and ClonMapper17 FM1 samples were saved in the same matrix as pair A, and Goyal et al.28 1 sample 2 and ClonMapper17 FM7 samples were saved in the same matrix as pair B. For the mouse data, TREX37 brain 2, SPLINTR26 chemoVehicle_1, and SPLINTR26 chemoDay2_1 were saved in the same matrix as pair A and TREX37 brain 3, SPLINTR26 chemoVehicle_2, and SPLINTR26 chemoDay2_2 were saved in the same matrix as pair B. Each pair was used to build a classifier, which was used to detect doublets in each individual sample that makes up the opposite pair. For example, the human pair A classifier was used to predict doublets in both the Goyal et al.28 1 sample 2 and ClonMapper17 FM7, individually. When comparing these classifiers to doublet detection methods, the results of the doublet detection method were used for the same sample to which the classifier was applied.
Each dataset was randomly split 10 independent times, and a separate classifier was trained on each split, giving 10 classifiers per dataset. Relevant packages are: Python 3.10.13, R 4.2.2, scikit-learn75 1.3.2, scipy76 1.11.3, XGBoost63 2.0.2, hyperopt77 0.2.7, Seurat68 5.0.1.
Evaluation of varying doublets percentages on functional analysis results
We curated each dataset from across various technologies into a clean dataset with no doublets using the barcode information. We then created three new contaminated datasets with 10%, 20%, and 40% doublet rates. This was achieved by first generating the 40% doublet rate dataset, and subsequently performing the standard Seurat (v5) log normalization, scaling, variable features identification, principal component analysis and CCA integration pipeline to identify clusters with Leiden clustering.68,81 Doublets were then taken out to achieve 20%, 10% and 0% doublet rates while maintaining the exact same cluster assignment across the datasets. The 0%, 10%, 20% and 40% doublet rate datasets were generated for each dataset used in benchmarking except for Smart-seq3. These curated datasets were used in all functional analyses, except that datasets were re-clustered for the clustering stability analysis.
Differential expression: We randomly selected two clusters and used the MAST and rank-sum test with the FindMarkers function in Seurat to identify differentially expressed genes (DEs) at these contamination rates. The criteria for DE is having an adjusted p-value smaller than 0.01 and an average log2-fold change greater than 0.585 (i.e., 1.5 fold-change). We defined true positives to be genes identified as DE by both clean and contaminated dataset and true negative as genes not identified as DE by both datasets. We further defined false negatives as DEs present in clean but not in contaminated dataset, and false positives as DEs present in contaminated dataset but not in clean dataset. These metrics were then used to calculate precision, recall and TNR.
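The confusion-matrix definitions above translate directly into set operations. The sketch below assumes each DE result has already been reduced to a set of gene names; the function names and the gene identifiers are hypothetical.

```python
def is_de(adjusted_p, avg_log2fc):
    """DE criterion stated in the text: adjusted p < 0.01 and log2FC > 0.585."""
    return adjusted_p < 0.01 and avg_log2fc > 0.585


def de_metrics(clean_de, contaminated_de, all_genes):
    """Precision, recall, and TNR, treating the clean dataset as the reference."""
    clean_de, contaminated_de = set(clean_de), set(contaminated_de)
    tp = len(clean_de & contaminated_de)   # DE in both datasets
    fn = len(clean_de - contaminated_de)   # DE in clean only
    fp = len(contaminated_de - clean_de)   # DE in contaminated only
    tn = len(set(all_genes) - clean_de - contaminated_de)  # DE in neither
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "tnr": tn / (tn + fp),
    }


genes = [f"g{i}" for i in range(1, 11)]
print(de_metrics({"g1", "g2", "g3", "g4"}, {"g3", "g4", "g5"}, genes))
```

In this toy case, two of the four reference DE genes are recovered and one spurious gene appears in the contaminated dataset.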
Cell-cell communication: We used CellChat (v2)58,59 to identify all pairwise inter-cluster communications in the “Secreted Signaling” annotation of the database CellChat curated. Each unique communication from a source cluster to a target cluster was considered as a unique combination of intercellular communication. The inferred communications were subset to the ligand/receptor level with p-value <0.05, as per the standard guidelines outlined in the CellChat study. We then compared the significant intercellular communication identified in the clean dataset with those identified in the contaminated datasets to calculate precision and recall. Similarly, we used CellPhoneDB60 to infer inter-cluster communications with default settings. No subsetting of the database was performed, per the official tutorial of CellPhoneDB. This means more interactions could be identified with CellPhoneDB, which could potentially inflate the recall value. Only inter-cluster communications with p-value <0.05 were used to calculate precision and recall. We did not calculate TNR in either case due to the vast number of possible permutations of intercellular communications.
Clustering stability: The datasets corresponding to each doublet rate underwent reprocessing and reclustering to ensure that cluster assignments were not preserved across different rates, unlike for differential expression and cell-cell communication analyses. This approach allowed cluster allocations to be determined based on the unique composition of singlets and doublets in each dataset. Clusters were found by the Louvain algorithm at four different resolutions: 0.2, 0.4, 0.6, and 0.8. The number of clusters as well as the number of singlets, number of doublets, and doublet percentage for each cluster was recorded for each dataset at each resolution. The doublet-contaminated dataset was classified as having the “correct number of clusters” if it had the same number of clusters as the 0% doublet rate dataset.
Cell trajectory: We used Slingshot62 to identify lineage trajectories in the curated datasets with 10%, 20% and 40% doublets and the clean dataset with no doublets. Cells present across datasets maintained their cluster assignment, as mentioned above. To infer lineages, we used the getLineages() function from Slingshot with 10 PCs, and the output was a cluster-based minimum spanning tree for each lineage inferred. With this as input, we used the getCurves() function to create principal curves fitted to 150 points for each of the lineages. To compare the trajectories between the contaminated datasets and the clean dataset, we used both the cluster-based lineage information and the principal curves, which contained points in PC space associated with 150 pseudotime points.
To compare lineages across varying doublet-percentage datasets, we first matched the most similar lineages across datasets since each dataset can have more than one inferred lineage with various degrees of deviation from the clean dataset. We used a modified Hungarian matching algorithm to make optimal pairs using Levenshtein distance between minimum spanning trees as the distance metric. This was implemented using the solve_LSAP() function from the R package clue (https://CRAN.R-project.org/package=clue). The algorithm determined optimal pairs such that there was a 1:1 matching between lineages in a contaminated dataset and the clean dataset, and optimally-paired lineages were compared going forward. There were some cases where the number of lineages in the doublet-contaminated dataset did not match that of the clean dataset. In those cases, each lineage from the dataset with fewer lineages was matched to its optimal pair from the other dataset, such that unmatched lineages from the latter were excluded from the comparison.
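The lineage-matching step can be illustrated with a small sketch. Two simplifications are assumed here and should be read as such: each lineage is represented as an ordered list of cluster labels, and the optimal assignment is found by brute force over permutations rather than by the Hungarian algorithm (solve_LSAP() in the text). For the small numbers of lineages involved, both approaches return a minimum-cost 1:1 matching. All function names are hypothetical.

```python
from itertools import permutations


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def match_lineages(clean, contaminated):
    """Return ([(clean_idx, contaminated_idx), ...], total_cost) for the best 1:1 pairing."""
    swap = len(clean) > len(contaminated)
    small, large = (contaminated, clean) if swap else (clean, contaminated)
    best_cost, best_perm = None, None
    # Every lineage in the smaller set is paired; extras in the larger set stay unmatched.
    for perm in permutations(range(len(large)), len(small)):
        cost = sum(levenshtein(small[i], large[j]) for i, j in enumerate(perm))
        if best_cost is None or cost < best_cost:
            best_cost, best_perm = cost, perm
    pairs = [(j, i) if swap else (i, j) for i, j in enumerate(best_perm)]
    return pairs, best_cost


clean = [[1, 2, 3], [1, 4, 5]]
contaminated = [[1, 4, 6], [1, 2, 3], [1, 7]]
print(match_lineages(clean, contaminated))
```

In this example the contaminated dataset has one extra lineage, which is left out of the comparison, mirroring the handling of unequal lineage counts described above.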
Distance between matched lineages was calculated by summing the Euclidean distances between corresponding points across the 150 pseudotime points. When the number of lineages differed between datasets, a penalty was added to this distance, since a difference in the number of inferred lineages implies a greater difference between the datasets. The penalized distance depends on d, the distance without penalty, and on the difference in the number of inferred lineages between the two datasets.
Each distance was normalized by dividing by the maximum distance calculated across the 0–40% doublet datasets for global comparison.
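The unpenalized distance and its normalization can be sketched as follows. The function names are hypothetical, curves are assumed to be equal-length lists of points in PC space, and the penalty for differing lineage counts is deliberately left out of the sketch.

```python
import math


def curve_distance(curve_a, curve_b):
    """Sum of Euclidean distances between corresponding pseudotime points."""
    return sum(math.dist(p, q) for p, q in zip(curve_a, curve_b))


def normalize(distances):
    """Scale distances by the maximum observed value for global comparison."""
    top = max(distances)
    return [d / top for d in distances]


clean_curve = [(0, 0), (1, 0), (2, 0)]
contaminated_curve = [(0, 3), (1, 4), (2, 0)]
d = curve_distance(clean_curve, contaminated_curve)
print(d, normalize([0.0, 3.5, d]))
```

A real comparison would use 150 points per curve; three points are used here only to keep the example readable.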
Low vs. high heterogeneity analysis
We used the 10% dataset and randomly sampled 450 singlets and 50 doublets three times each from the Louvain cluster with the most cells and from all the Louvain clusters within the principal component space. We quantified the Euclidean distance on the top 30 PCs and categorized the cells from the same cluster as less heterogeneous (more transcriptionally similar) “adjacent,” and cells from all clusters as more heterogeneous (less transcriptionally similar) “distant”. We then used scDblFinder and DoubletFinder to identify doublets and calculate the respective AUROC and AUPRC values.
Sample heterogeneity analyses
To perform heterogeneity analyses, samples were subset to contain only singletCode-determined singlets which were further subsampled to contain only 1000 singlets. Samples without at least 1000 singlets were not considered for this analysis. All datasets contained at least 1 sample with at least 1000 singlets except for Jain et al.,34 and, to ensure its representation, the samples 1_DMSO_A and 2_DMSO_B were scTransformed together and then 1000 singlets were subset. Samples were normalized, scaled, the variable features were identified, and dimension reduction with PCA was performed. Datasets were clustered using FindClusters() in Seurat, which uses SNN-based clustering, at 4 resolutions: 0.2, 0.4, 0.6, and 0.8. The subsampled, clustered objects were used as the input for all of the following heterogeneity analyses.
Phenotypic Volume: Phenotypic volume was calculated as previously described26 using the 1000 most highly variable genes in a sample. As a control to mimic a more transcriptionally similar group of cells within the same sample, phenotypic volume was calculated for a randomly chosen cell and its 100 nearest neighbors for comparison.
E-Distance: The pairwise E-distance between each cluster in each sample was calculated using edist() from scperturbR as previously described.28,47 As a control, the E-distance was also calculated for the same sample with cluster labels shuffled. To compare across samples, the number of clusters that was most common across all samples, regardless of resolution, was determined (7 clusters), and all E-distances for those samples which have 7 clusters were compared.
Shannon Diversity: The Shannon Diversity value was calculated as previously described28 using the diversity() function in R. As a control, an artificially uniform cluster distribution was used to mimic maximum equitability.
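For illustration, the Shannon diversity of a cluster assignment and its uniform control can be computed as below. The natural logarithm is assumed, and the function name is hypothetical; this is a sketch of the index itself, not of the R function used in the study.

```python
import math
from collections import Counter


def shannon_diversity(cluster_labels):
    """Shannon index H = -sum(p_i * ln p_i) over cluster proportions."""
    counts = Counter(cluster_labels)
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())


observed = ["c0"] * 700 + ["c1"] * 200 + ["c2"] * 50 + ["c3"] * 50
uniform = ["c0"] * 250 + ["c1"] * 250 + ["c2"] * 250 + ["c3"] * 250
print(shannon_diversity(observed), shannon_diversity(uniform), math.log(4))
```

The uniform control reaches the maximum possible value for four clusters, ln(4), so any real cluster distribution with the same number of clusters scores at or below it.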
Differential Expression: Differentially-expressed genes were determined by the FindMarkers() function using the Wilcoxon rank-sum test. Significant genes were identified by an adjusted p-value <0.05, and the proportion of upregulated (pUp) and downregulated (pDown) genes were determined along with the upper (fU) and lower (fL) bounds of expression level fold changes. This analysis mimics the heterogeneity parameters which can be specified by scDesign.4 The two heterogeneity metrics used were the sum of pUp and pDown, or total proportion of up and downregulated genes out of all marker genes, and the difference between fU and fL, or the total range between the upper and lower bound.
For relating the heterogeneity values of a sample, as determined by the above 5 metrics, to doublet detection method performance, the samples were ranked according to their heterogeneity metric values and according to their AUPRC values for each doublet detection method. Results are represented on a scatterplot of AUPRC rank vs. heterogeneity metric rank and an R2 value for the trend in doublet detection method performance given heterogeneity was calculated.
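The rank-based comparison can be sketched as follows. Two assumptions are made: ties are broken by input order rather than averaged, and R2 is taken to be the squared Pearson correlation between the two rank vectors, which equals the R2 of a simple linear fit. The function names and the example values are hypothetical.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); ties broken by input order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, 1):
        result[i] = rank
    return result


def r_squared(x, y):
    """Squared Pearson correlation, equal to R^2 of a simple linear regression."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


heterogeneity = [0.2, 0.9, 0.5, 0.7]  # one metric value per sample
auprc = [0.8, 0.3, 0.6, 0.5]          # one AUPRC per sample for a given method
print(r_squared(ranks(heterogeneity), ranks(auprc)))
```

Because R2 discards the sign of the trend, the direction of the relationship has to be read from the scatterplot itself.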
ATAC analysis
We used the fragments file from the Watermelon18-barcoded 10X Genomics Multiome dataset and used the AMULET function with default settings from scDblFinder package to identify doublets. The threshold for defining doublets is q-value <0.1. We then used singletCode on the barcode data extracted from the matching RNA library to identify ground-truth singlets. Due to the varying doublet percentage within samples, we randomly sampled 1000 cells six times from the dataset and calculated the TNR for AMULET doublet detection based on singletCode ground-truth singlets. Only cells present in singletCode singlets are included when calculating TNR value.
Recovering number of singlets using the paper’s methods and singletCode
To compare the number of singlets recovered in the original studies and by singletCode, we replicated the methods described in each study for doublet detection and quality control. Since singletCode uses only the cells that were successfully barcoded, we subsetted the cells from the original study to only those containing barcodes. Also, to ensure that the cells are truly barcoded, we enforced the same UMI cutoff used in singletCode singlet counting, determined by $3 \times 10^{-5} \times N_{\text{total cells}}$, and only cells which have barcodes above this threshold were considered as barcoded cells. The general workflow included identifying the barcoded cells, applying the doublet detection method described in the study (if applicable), and performing quality control to get the final number of cells which were used for downstream analysis in the original study. The doublet detection methods and quality-control metrics were obtained directly from the paper unless specified otherwise. Quality control was performed on the singletCode recovered singlets using the same metrics as above to calculate the total number of quality-controlled singlets.
Dataset preprocessing
Goyal et al.28 1: Quality control was done on the data by filtering out cells with a mitochondrial gene percentage greater than 26 and the number of genes measured either less than 200 or greater than 7200, as reported. A more stringent UMI cutoff was applied to reflect the threshold published in Goyal et al.28 As in Goyal et al.,28 doublet filtering was done by removing any cell containing more than one barcode.
Goyal et al.28 2, 3: Quality control was done on the data by filtering out cells with a mitochondrial gene percentage greater than 30 and the number of genes measured either less than 200 or greater than 7200, as reported. As in Goyal et al.,28 doublet filtering was done by removing any cell that had more than one barcode.
Goyal et al.28 4: Quality control was done on the data by filtering out cells with a mitochondrial gene percentage greater than 18 and the number of genes measured either less than 200 or greater than 7000, as reported. As in Goyal et al.,28 doublet filtering was done by removing any cell that had more than one barcode.
Goyal et al.28 5: Quality control was done on the data by filtering out cells with mitochondrial gene percentages greater than 26 and the number of genes measured either less than 200 or greater than 7200, as reported. Doublet filtering was done by removing any cell that had more than one barcode.
Goyal et al.28 6: We determined the quality-control metrics based on the distribution of the number of genes, number of counts, and the mitochondrial gene percentage in all the cells belonging to a sample. Cells with a mitochondrial gene percentage greater than 21 and the number of genes detected either less than 200 or more than 5000 were filtered out. The total number of quality-controlled cells exactly matched the number of cells reported in the paper for this dataset. Doublet filtering was done by removing any cell that had more than one barcode.
Goyal et al.28 8: We determined the quality-control metrics based on the distribution of the number of genes, number of counts, and mitochondrial gene percentage in all the cells belonging to a sample. Cells with mitochondrial gene percentages greater than 26 and the number of genes detected either less than 200 or more than 7200 were filtered out. The total number of quality-controlled cells closely matched the number of cells reported in the paper for this sample. Doublet filtering was done by removing any cell that had more than one barcode.
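The per-sample thresholds listed above for the Goyal et al. samples can be collected into a small lookup table and applied uniformly. The sketch below is illustrative only: the class, function, and field names are hypothetical, cells are represented as plain dictionaries, and the additional, more stringent UMI cutoff mentioned for sample 1 is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QCThresholds:
    max_pct_mito: float  # cells above this mitochondrial percentage are removed
    min_genes: int       # cells with fewer detected genes are removed
    max_genes: int       # cells with more detected genes are removed


# Thresholds as stated in the text for each Goyal et al. sample.
GOYAL_QC = {
    "1": QCThresholds(26, 200, 7200),
    "2": QCThresholds(30, 200, 7200),
    "3": QCThresholds(30, 200, 7200),
    "4": QCThresholds(18, 200, 7000),
    "5": QCThresholds(26, 200, 7200),
    "6": QCThresholds(21, 200, 5000),
    "8": QCThresholds(26, 200, 7200),
}


def passes_qc(cell, thr):
    """Keep a cell unless it exceeds the mito cap or falls outside the gene range."""
    return (
        cell["pct_mito"] <= thr.max_pct_mito
        and thr.min_genes <= cell["n_genes"] <= thr.max_genes
    )


def qc_singlets(cells, thr):
    """Apply QC, then drop any cell carrying more than one barcode."""
    return [c["id"] for c in cells if passes_qc(c, thr) and c["n_barcodes"] <= 1]


toy_cells = [
    {"id": "a", "pct_mito": 5.0, "n_genes": 3000, "n_barcodes": 1},
    {"id": "b", "pct_mito": 28.0, "n_genes": 3000, "n_barcodes": 1},  # fails sample 1 only
    {"id": "c", "pct_mito": 5.0, "n_genes": 150, "n_barcodes": 1},    # too few genes
    {"id": "d", "pct_mito": 5.0, "n_genes": 3000, "n_barcodes": 2},   # multiple barcodes
]
kept_sample_1 = qc_singlets(toy_cells, GOYAL_QC["1"])
kept_sample_2 = qc_singlets(toy_cells, GOYAL_QC["2"])
```

Keeping the thresholds in one table makes the differences between samples (for example, the 18% mitochondrial cap for sample 4) easy to audit against the text.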
ClonMapper17: Quality control was done by filtering out cells with either fewer than 750 genes or greater than 4000 genes, or the percentage of mitochondrial genes greater than 8%, as reported. We did not perform doublet filtering for this analysis since the paper mentioned no explicit doublet removal method.
CellTag-multi36: We determined the quality-control metrics based on the distribution of the number of genes, number of counts, and the mitochondrial gene percentage in all the cells belonging to a sample. For the B4D21-RNA-r2-1 sample, cells with more than 60000 counts, less than 1000 genes, greater than 6000 genes, or a mitochondrial gene content greater than 10% were filtered out. For the d2-RNA-5 sample, cells with more than 30000 counts, less than 2000 genes, greater than 6000 genes, or a mitochondrial gene content greater than 10% were filtered out. In accordance with the guideline in the study, we filtered out cells that contained less than two barcodes or more than 25 to remove potential doublets.
Jain et al.34: Quality control was performed by filtering out cells with fewer than 200 genes or mitochondrial gene content greater than 30%, as reported. Doublet filtering involved three steps: matching the barcode sequence to the expected pattern for 10X barcodes; removing cells whose barcode’s UMI percentage was less than 40% of all the barcode UMIs associated with a cell ID; and, finally, removing all cells that still had more than one barcode associated with them.
Jiang et al.31: Quality control was performed by filtering out cells with more than 10,000 genes, as reported. Doublet filtering involved three steps: matching the barcode sequence to the expected pattern for 10X barcodes; removing cells whose barcode’s UMI percentage was less than 30% of all the barcode UMIs associated with a cell ID; and, finally, removing all cells that still had more than one barcode associated with them.
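The UMI-fraction filter used for the Jain et al. and Jiang et al. datasets can be sketched as below. This reflects one reading of the description, in which barcodes holding less than the stated fraction of a cell's barcode UMIs are discarded and a cell is kept only if exactly one barcode survives. The function name and input layout are hypothetical, and the preceding step of matching barcode sequences to the expected pattern is omitted.

```python
def dominant_barcode_singlets(barcode_umis, min_fraction):
    """Map each retained cell to its single surviving barcode.

    barcode_umis: hypothetical layout {cell_id: {barcode: umi_count}}.
    min_fraction: 0.40 for Jain et al., 0.30 for Jiang et al. (from the text).
    """
    singlets = {}
    for cell_id, counts in barcode_umis.items():
        total = sum(counts.values())
        if total == 0:
            continue  # no barcode UMIs, so the cell is not barcoded
        # Keep barcodes holding at least `min_fraction` of the cell's barcode UMIs.
        kept = [bc for bc, umis in counts.items() if umis / total >= min_fraction]
        # Cells with zero or several surviving barcodes are discarded.
        if len(kept) == 1:
            singlets[cell_id] = kept[0]
    return singlets


toy_umis = {
    "cell_1": {"BC_A": 18, "BC_B": 2},   # BC_B holds 10% and is dropped
    "cell_2": {"BC_A": 10, "BC_C": 10},  # two barcodes at 50% each
    "cell_3": {"BC_D": 13, "BC_E": 7},   # BC_E holds 35%
}
jain_like = dominant_barcode_singlets(toy_umis, min_fraction=0.40)
jiang_like = dominant_barcode_singlets(toy_umis, min_fraction=0.30)
```

In this toy input, cell_3 is retained under the 40% threshold but removed under the 30% threshold, because its minor barcode survives the more permissive cutoff and the cell then carries two barcodes.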
LARRY35: Quality control was performed by filtering out cells whose mitochondrial gene content was greater than 20%, as reported. The paper used Scrublet for doublet detection. We used a published list of cell IDs (stateFate_inVitro_metadata.txt.gz on https://github.com/AllonKleinLab/paper-data/tree/master/Lineage_tracing_on_transcriptional_landscapes_links_state_to_fate_during_differentiation), which were used as singlets for all downstream processes, to identify singlets in the list of barcoded cells.
Smart-seq3 reads and UMIs38: Metrics for quality control were determined based on the distribution of the number of genes, number of reads, and percentage of mitochondrial genes present in a cell within the sample. The gene IDs were Ensembl IDs, and 37 mouse mitochondrial genes were used to calculate the mitochondrial gene percentage. For Smart-seq3,38 read-based counts, cells from brain 1 with counts greater than 150000, the number of genes detected either less than 2500 or greater than 11000, or mitochondrial gene percentage greater than 5 were filtered out; cells from brain 2 with counts greater than 130000, the number of genes detected either less than 2500 or greater than 10000, or mitochondrial gene percentage greater than 5 were filtered out. For Smart-seq3,38 UMI-based counts, cells from brain 1 with counts greater than 150000, the number of genes detected either less than 2500 or greater than 11000, or mitochondrial gene percentage greater than 5 were filtered out; cells from brain 2 with counts greater than 130000, the number of genes detected either less than 2500 or greater than 11000, or mitochondrial gene percentage greater than 5 were filtered out.
SPLINTR26: Quality control was performed by filtering out cells that had a total number of reads greater than 100000 or lower than 1000, a number of genes fewer than 100 or greater than 10000, or a mitochondrial gene content of more than 20%, as reported. For doublet filtering, any cell with a single barcode, or with a combination of two barcodes that was present in multiple cells, was classified as a singlet; the remaining cells with two barcodes, and any cell with more than two barcodes, were classified as doublets.
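The SPLINTR rule above depends on how often a given two-barcode combination recurs across cells, which is easiest to see in code. The following is a minimal sketch under assumed names and an assumed input layout (a set of barcodes per cell); it is not taken from the original study's code.

```python
from collections import Counter


def classify_splintr_cells(cell_barcodes):
    """Label barcoded cells as 'singlet' or 'doublet'.

    cell_barcodes: hypothetical layout {cell_id: set of barcodes}.
    Cells with no barcode receive no label.
    """
    # Count how many cells carry each exact two-barcode combination.
    pair_counts = Counter(
        frozenset(bcs) for bcs in cell_barcodes.values() if len(bcs) == 2
    )
    labels = {}
    for cell_id, bcs in cell_barcodes.items():
        if len(bcs) == 1:
            labels[cell_id] = "singlet"
        elif len(bcs) == 2 and pair_counts[frozenset(bcs)] > 1:
            # The same pair recurs in several cells, so it is treated as one clone.
            labels[cell_id] = "singlet"
        elif len(bcs) >= 2:
            labels[cell_id] = "doublet"
    return labels


toy = {
    "cell_1": {"BC_A"},
    "cell_2": {"BC_B", "BC_C"},
    "cell_3": {"BC_C", "BC_B"},          # same pair as cell_2
    "cell_4": {"BC_A", "BC_D"},          # pair seen only once
    "cell_5": {"BC_A", "BC_B", "BC_C"},  # more than two barcodes
}
labels = classify_splintr_cells(toy)
```

Using `frozenset` makes the pair comparison independent of barcode order, so cell_2 and cell_3 are recognized as carrying the same combination.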
TREX37: Quality control was performed by filtering out cells with fewer than 500 or more than 10000 genes, as reported. Doublet filtering was performed based on the argument that any cell could express only one of the mutually exclusive markers (Igf2, Pf4, Hexb, Rsph1, Pdgfra, Bmp4, Mog, Clic6, Rgs5, Cldn5, Reln, Igfbpl1, Slc32a1, Slc17a7, Aldoc): any cell with non-zero expression of more than one of these markers was classified as a doublet.
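The marker-based rule for TREX amounts to counting how many of the listed genes have non-zero expression in a cell. A minimal sketch is shown below; the function name and the per-cell dictionary of counts are assumptions for illustration, while the marker list is copied from the text.

```python
# Mutually exclusive markers as listed in the text.
TREX_MARKERS = (
    "Igf2", "Pf4", "Hexb", "Rsph1", "Pdgfra", "Bmp4", "Mog", "Clic6",
    "Rgs5", "Cldn5", "Reln", "Igfbpl1", "Slc32a1", "Slc17a7", "Aldoc",
)


def is_marker_doublet(expression, markers=TREX_MARKERS):
    """Return True if more than one marker has non-zero expression.

    expression: hypothetical layout {gene: count} for a single cell;
    genes missing from the dictionary are treated as zero.
    """
    n_expressed = sum(1 for gene in markers if expression.get(gene, 0) > 0)
    return n_expressed > 1


toy_singlet = {"Mog": 4, "Actb": 120}            # one marker plus a non-marker gene
toy_doublet = {"Mog": 4, "Pdgfra": 2, "Actb": 90}  # two markers co-expressed
flags = (is_marker_doublet(toy_singlet), is_marker_doublet(toy_doublet))
```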
Watermelon18: Quality-control metrics were determined based on the distribution of the number of genes, number of reads and percentage of mitochondrial genes present in a cell within the sample. For T47D-lag-1, T47D-lag-2, T47D-late-1, and T47D-late-2 samples, cells with total reads greater than 60000, the number of genes expressed greater than 8000 or lower than 2500, or mitochondrial gene content greater than 25% were filtered out. For T47D-naive-1 and T47D-naive-2 samples, cells with total reads greater than 60000, the number of genes expressed greater than 7500 or lower than 2500, or mitochondrial gene content greater than 20% were filtered out.
Tools to implement singletCode on new barcoded datasets
We developed two tools, a command line interface and a Python package, singletCode, that can be used to identify true singlets using the framework developed here. The package can be installed from PyPI. The source code for both the command line interface and the package is deposited in the GitHub repository shared with the paper. Detailed information on their usage can be found on the website.
Quantification and statistical analysis
All boxplots were made with ggplot2 (version 3.5.0) and show the mean, standard deviation, and first and third quartiles. Outliers were not depicted unless explicitly stated. Unpaired Wilcoxon rank-sum tests, paired Wilcoxon signed-rank tests, and t tests were used. N is specified in figure captions. The significance threshold for Wilcoxon tests and t tests was p < 0.05 for all analyses. R² values were used to describe lines of best fit. Wherever we made conclusions based on UMAP clustering (e.g., Gini coefficient and clustering stability analysis), we tested multiple resolutions of clustering. All statistical analyses were performed in R (version 4.2.2).
Acknowledgments
We thank members of the Goyal lab (especially Ian Mellis, Nitu Kumari, Emanuelle Grody, and Aurelia Leona) and Jeff Mold for helpful discussions and comments on the manuscript. We thank Dane Vassiliadis and Mark Dawson (SPLINTR), Amy Brock (ClonMapper), Kunal Jindal and Samantha Morris (CellTag-multi), Caleb Weinreb and Allon Klein (LARRY), and Michael Ratz (TREX) for barcoded datasets. Y.G. acknowledges support from Northwestern University’s startup and Burroughs Wellcome Fund Career Awards at the Scientific Interface. Z.Z. and M.E.M. were supported by funds to Y.G. K.K. acknowledges support from the University of Pennsylvania MSTP. M.E.M. and Y.G. acknowledge support from the National Institute for Theory and Mathematics in Biology through the National Science Foundation (DMS-2235451) and the Simons Foundation (MPTMPS-00005320). Y.O. is supported by the Azrieli Faculty Fellowship. Y.G. is a CZ Biohub Investigator.
Author contributions
Y.G. conceived and designed the study. Z.Z. and M.E.M. designed and performed a majority of the analyses. K.M.A., K.K., and C.-J.E. performed a subset of the analysis. K.M.A. assisted with the singletCode package and website. M.E.M. and Y.G. prepared a majority of the figures and tables, with input from H.S. I.F., S.S., and Y.O. performed the experiments. Y.G., Z.Z., and M.E.M. wrote the manuscript, with input from all authors.
Declaration of interests
Y.G. received consultancy fees from the Schmidt Science Fellows.
Declaration of generative AI in the writing process
The entire first draft was written without the use of any generative AI. To improve the wording of a sentence, the authors used ChatGPT, but such instances were rare. After using ChatGPT, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.
Published: June 25, 2024
Footnotes
Supplemental information can be found online at https://doi.org/10.1016/j.xgen.2024.100592.
Contributor Information
Madeline E. Melzer, Email: madeline.melzer@northwestern.edu.
Yogesh Goyal, Email: yogesh.goyal@northwestern.edu.
Supplemental information
Cell numbers were tracked at each stage of the analysis, from the number of cells in the raw gene count matrix to the final number of quality-controlled singletCode-identified singlets.
Quality-control thresholds include minimum and maximum number of genes and maximum percentage of mitochondrial counts. Quality control might also include doublet filtering for certain datasets. Preprocessing steps include scaling to remove effects of cell cycle, principal component analysis, clustering, dimensional reduction using UMAP, and integration of samples of a dataset.
References
- 1. Cui L.-L., Kinnunen T., Boltze J., Nystedt J., Jolkkonen J. Clumping and Viability of Bone Marrow Derived Mesenchymal Stromal Cells under Different Preparation Procedures: A Flow Cytometry-Based In Vitro Study. Stem Cells Int. 2016;2016 doi: 10.1155/2016/1764938.
- 2. Kuonen F., Touvrey C., Laurent J., Ruegg C. Fc block treatment, dead cells exclusion, and cell aggregates discrimination concur to prevent phenotypical artifacts in the analysis of subpopulations of tumor-infiltrating CD11b(+) myelomonocytic cells. Cytometry A. 2010;77:1082–1090. doi: 10.1002/cyto.a.20969.
- 3. Bernstein N.J., Fong N.L., Lam I., Roy M.A., Hendrickson D.G., Kelley D.R. Solo: Doublet Identification in Single-Cell RNA-Seq via Semi-Supervised Deep Learning. Cell Syst. 2020;11:95–101.e5. doi: 10.1016/j.cels.2020.05.010.
- 4. Xi N.M., Li J.J. Benchmarking Computational Doublet-Detection Methods for Single-Cell RNA Sequencing Data. Cell Syst. 2021;12:176–194.e6. doi: 10.1016/j.cels.2020.11.008.
- 5. Luecken M.D., Theis F.J. Current best practices in single-cell RNA-seq analysis: a tutorial. Mol. Syst. Biol. 2019;15:e8746. doi: 10.15252/msb.20188746.
- 6. McGinnis C.S., Murrow L.M., Gartner Z.J. DoubletFinder: Doublet Detection in Single-Cell RNA Sequencing Data Using Artificial Nearest Neighbors. Cell Syst. 2019;8:329–337.e4. doi: 10.1016/j.cels.2019.03.003.
- 7. Wolock S.L., Lopez R., Klein A.M. Scrublet: Computational Identification of Cell Doublets in Single-Cell Transcriptomic Data. Cell Syst. 2019;8:281–291.e9. doi: 10.1016/j.cels.2018.11.005.
- 8. DePasquale E.A.K., Schnell D.J., Van Camp P.J., Valiente-Alandí Í., Blaxall B.C., Grimes H.L., Singh H., Salomonis N. DoubletDecon: Deconvoluting Doublets from Single-Cell RNA-Sequencing Data. Cell Rep. 2019;29:1718–1727.e8. doi: 10.1016/j.celrep.2019.09.082.
- 9. Lun A.T.L., McCarthy D.J., Marioni J.C. A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Res. 2016;5:2122. doi: 10.12688/f1000research.9501.1.
- 10. Bais A.S., Kostka D. scds: computational annotation of doublets in single-cell RNA sequencing data. Bioinformatics. 2020;36:1150–1158. doi: 10.1093/bioinformatics/btz698.
- 11. McGinnis C.S., Patterson D.M., Winkler J., Conrad D.N., Hein M.Y., Srivastava V., Hu J.L., Murrow L.M., Weissman J.S., Werb Z., et al. MULTI-seq: sample multiplexing for single-cell RNA sequencing using lipid-tagged indices. Nat. Methods. 2019;16:619–626. doi: 10.1038/s41592-019-0433-8.
- 12. Sun B., Bugarin-Estrada E., Overend L.E., Walker C.E., Tucci F.A., Bashford-Rogers R.J.M. Double-jeopardy: scRNA-seq doublet/multiplet detection using multi-omic profiling. Cell Rep. Methods. 2021;1 doi: 10.1016/j.crmeth.2021.100008.
- 13. Stoeckius M., Zheng S., Houck-Loomis B., Hao S., Yeung B.Z., Mauck W.M., 3rd, Smibert P., Satija R. Cell Hashing with barcoded antibodies enables multiplexing and doublet detection for single cell genomics. Genome Biol. 2018;19:224. doi: 10.1186/s13059-018-1603-1.
- 14. Bhang H.-E.C., Ruddy D.A., Krishnamurthy Radhakrishna V., Caushi J.X., Zhao R., Hims M.M., Singh A.P., Kao I., Rakiec D., Shaw P., et al. Studying clonal dynamics in response to cancer therapy using high-complexity barcoding. Nat. Med. 2015;21:440–448. doi: 10.1038/nm.3841.
- 15. Biddy B.A., Kong W., Kamimoto K., Guo C., Waye S.E., Sun T., Morris S.A. Single-cell mapping of lineage and identity in direct reprogramming. Nature. 2018;564:219–224. doi: 10.1038/s41586-018-0744-4.
- 16. Weinreb C., Rodriguez-Fraticelli A., Camargo F., Klein A.M. Lineage tracing on transcriptional landscapes links state to fate during differentiation. Preprint at bioRxiv. 2018. doi: 10.1101/467886.
- 17. Gutierrez C., Al’Khafaji A.M., Brenner E., Johnson K.E., Gohil S.H., Lin Z., Knisbacher B.A., Durrett R.E., Li S., Parvin S., et al. Multifunctional barcoding with ClonMapper enables high-resolution study of clonal dynamics during tumor evolution and treatment. Nat. Cancer. 2021;2:758–772. doi: 10.1038/s43018-021-00222-8.
- 18. Oren Y., Tsabar M., Cuoco M.S., Amir-Zilberstein L., Cabanos H.F., Hütter J.C., Hu B., Thakore P.I., Tabaka M., Fulco C.P., et al. Cycling cancer persister cells arise from lineages with distinct programs. Nature. 2021;596:576–582. doi: 10.1038/s41586-021-03796-6.
- 19. Frieda K.L., Linton J.M., Hormoz S., Choi J., Chow K.H.K., Singer Z.S., Budde M.W., Elowitz M.B., Cai L. Synthetic recording and in situ readout of lineage information in single cells. Nature. 2017;541:107–111. doi: 10.1038/nature20777.
- 20. Umkehrer C., Holstein F., Formenti L., Jude J., Froussios K., Neumann T., Cronin S.M., Haas L., Lipp J.J., Burkard T.R., et al. Isolating live cell clones from barcoded populations using CRISPRa-inducible reporters. Nat. Biotechnol. 2021;39:174–178. doi: 10.1038/s41587-020-0614-0.
- 21. Emert B.L., Cote C.J., Torre E.A., Dardani I.P., Jiang C.L., Jain N., Shaffer S.M., Raj A. Variability within rare cell states enables multiple paths toward drug resistance. Nat. Biotechnol. 2021;39:865–876. doi: 10.1038/s41587-021-00837-3.
- 22. Tian L., Tomei S., Schreuder J., Weber T.S., Amann-Zalcenstein D., Lin D.S., Tran J., Audiger C., Chu M., Jarratt A., et al. Clonal multi-omics reveals Bcor as a negative regulator of emergency dendritic cell development. Immunity. 2021;54:1338–1351.e9. doi: 10.1016/j.immuni.2021.03.012.
- 23. Leighton J., Hu M., Sei E., Meric-Bernstam F., Navin N.E. Reconstructing mutational lineages in breast cancer by multi-patient-targeted single cell DNA sequencing. Preprint at bioRxiv. 2021. doi: 10.1101/2021.11.16.468877.
- 24. Rodriguez-Fraticelli A.E., Weinreb C., Wang S.W., Migueles R.P., Jankovic M., Usart M., Klein A.M., Lowell S., Camargo F.D. Single-cell lineage tracing unveils a role for TCF15 in haematopoiesis. Nature. 2020;583:585–589. doi: 10.1038/s41586-020-2503-6.
- 25. Pillai M., Hojel E., Jolly M.K., Goyal Y. Unraveling non-genetic heterogeneity in cancer with dynamical models and computational tools. Nature Computational Sci. 2023 doi: 10.1038/s43588-023-00427-0.
- 26. Fennell K.A., Vassiliadis D., Lam E.Y.N., Martelotto L.G., Balic J.J., Hollizeck S., Weber T.S., Semple T., Wang Q., Miles D.C., et al. Non-genetic determinants of malignant clonal fitness at single-cell resolution. Nature. 2022;601:125–131. doi: 10.1038/s41586-021-04206-7.
- 27. Sankaran V.G., Weissman J.S., Zon L.I. Cellular barcoding to decipher clonal dynamics in disease. Science. 2022;378 doi: 10.1126/science.abm5874.
- 28. Goyal Y., Busch G.T., Pillai M., Li J., Boe R.H., Grody E.I., Chelvanambi M., Dardani I.P., Emert B., Bodkin N., et al. Diverse clonal fates emerge upon drug treatment of homogeneous cancer cells. Nature. 2023;620:651–659. doi: 10.1038/s41586-023-06342-8.
- 29. Mold J.E., Weissman M.H., Ratz M., Hagemann-Jensen M., Hård J., Eriksson C.J., Toosi H., Berghenstråhle J., von Berlin L., Martin M., et al. Clonally heritable gene expression imparts a layer of diversity within cell types. Preprint at bioRxiv. 2022. doi: 10.1101/2022.02.14.480352.
- 30. Jain N., Goyal Y., Dunagin M.C., Cote C.J., Mellis I.A., Emert B., Jiang C.L., Dardani I.P., Reffsin S., Raj A. Retrospective identification of intrinsic factors that mark pluripotency potential in rare somatic cells. Preprint at bioRxiv. 2023. doi: 10.1101/2023.02.10.527870.
- 31. Jiang C.L., Goyal Y., Jain N., Wang Q., Truitt R.E., Coté A.J., Emert B., Mellis I.A., Kiani K., Yang W., et al. Cell type determination for cardiac differentiation occurs soon after seeding of human-induced pluripotent stem cells. Genome Biol. 2022;23:90. doi: 10.1186/s13059-022-02654-6.
- 32. Reffsin S., Miller J., Ayyanathan K., Dunagin M.C., Jain N., Schultz D.C., Cherry S., Raj A. Single cell susceptibility to SARS-CoV-2 infection is driven by variable cell states. Preprint at bioRxiv. 2023. doi: 10.1101/2023.07.06.547955.
- 33. Holze H., Talarmain L., Fennell K.A., Lam E.Y., Dawson M.A., Vassiliadis D. BARtab & bartools: an integrated Nextflow pipeline and R package for the analysis of synthetic cellular barcodes in the genome and transcriptome. Preprint at bioRxiv. 2023. doi: 10.1101/2023.11.21.568179.
- 34. Jain N., Goyal Y., Dunagin M.C., Cote C.J., Mellis I.A., Emert B., Jiang C.L., Dardani I.P., Reffsin S., Arnett M., et al. Retrospective identification of cell-intrinsic factors that mark pluripotency potential in rare somatic cells. Cell Syst. 2024;15:109–133.e10. doi: 10.1016/j.cels.2024.01.001.
- 35. Weinreb C., Rodriguez-Fraticelli A., Camargo F.D., Klein A.M. Lineage tracing on transcriptional landscapes links state to fate during differentiation. Science. 2020;367 doi: 10.1126/science.aaw3381.
- 36. Jindal K., Adil M.T., Yamaguchi N., Yang X., Wang H.C., Kamimoto K., Rivera-Gonzalez G.C., Morris S.A. Single-cell lineage capture across genomic modalities with CellTag-multi reveals fate-specific gene regulatory changes. Nat. Biotechnol. 2023 doi: 10.1038/s41587-023-01931-4.
- 37. Ratz M., von Berlin L., Larsson L., Martin M., Westholm J.O., La Manno G., Lundeberg J., Frisén J. Clonal relations in the mouse brain revealed by single-cell and spatial transcriptomics. Nat. Neurosci. 2022;25:285–294. doi: 10.1038/s41593-022-01011-x.
- 38. Mold J.E., Weissman M.H., Ratz M., Hagemann-Jensen M., Hård J., Eriksson C.J., Toosi H., Berghenstråhle J., Ziegenhain C., von Berlin L., et al. Clonally heritable gene expression imparts a layer of diversity within cell types. Cell Syst. 2024;15:149–165.e10. doi: 10.1016/j.cels.2024.01.004.
- 39. Zorita E., Cuscó P., Filion G.J. Starcode: sequence clustering based on all-pairs search. Bioinformatics. 2015;31:1913–1919. doi: 10.1093/bioinformatics/btv053.
- 40. Schuh L., Saint-Antoine M., Sanford E.M., Emert B.L., Singh A., Marr C., Raj A., Goyal Y. Gene Networks with Transcriptional Bursting Recapitulate Rare Transient Coordinated High Expression States in Cancer. Cell Syst. 2020;10:363–378.e12. doi: 10.1016/j.cels.2020.03.004.
- 41. Mellis I.A., Bodkin N., Melzer M.E., Goyal Y. Prevalence of and gene regulatory constraints on transcriptional adaptation in single cells. Preprint at bioRxiv. 2023. doi: 10.1101/2023.08.14.553318.
- 42. Xi N.M., Li J.J. Protocol for executing and benchmarking eight computational doublet-detection methods in single-cell RNA sequencing data analysis. STAR Protoc. 2021;2 doi: 10.1016/j.xpro.2021.100699.
- 43. Germain P.-L., Lun A., Garcia Meixide C., Macnair W., Robinson M.D. Doublet identification in single-cell sequencing data using scDblFinder. F1000Res. 2021;10:979. doi: 10.12688/f1000research.73600.1.
- 44. Alexandari A.M., Kundaje A., Shrikumar A. A General Framework for Abstention Under Label Shift. Preprint at arXiv. 2018. doi: 10.48550/arXiv.1802.07024.
- 45. Zheng G.X.Y., Terry J.M., Belgrader P., Ryvkin P., Bent Z.W., Wilson R., Ziraldo S.B., Wheeler T.D., McDermott G.P., Zhu J., et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 2017;8 doi: 10.1038/ncomms14049.
- 46. Kang H.M., Subramaniam M., Targ S., Nguyen M., Maliskova L., McCarthy E., Wan E., Wong S., Byrnes L., Lanata C.M., et al. Multiplexed droplet single-cell RNA-sequencing using natural genetic variation. Nat. Biotechnol. 2018;36:89–94. doi: 10.1038/nbt.4042.
- 47. Peidli S., Green T.D., Shen C., Gross T., Min J., Garda S., Yuan B., Schumacher L.J., Taylor-King J.P., Marks D.S., et al. scPerturb: harmonized single-cell perturbation data. Nat. Methods. 2024;21:531–540. doi: 10.1038/s41592-023-02144-y.
- 48. Xiong K.-X., Zhou H.L., Lin C., Yin J.H., Kristiansen K., Yang H.M., Li G.B. Chord: an ensemble machine learning algorithm to identify doublets in single-cell RNA sequencing data. Commun. Biol. 2022;5:510. doi: 10.1038/s42003-022-03476-9.
- 49. Macosko E.Z., Basu A., Satija R., Nemesh J., Shekhar K., Goldman M., Tirosh I., Bialas A.R., Kamitaki N., Martersteck E.M., et al. Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets. Cell. 2015;161:1202–1214. doi: 10.1016/j.cell.2015.05.002.
- 50. Klein A.M., Mazutis L., Akartuna I., Tallapragada N., Veres A., Li V., Peshkin L., Weitz D.A., Kirschner M.W. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell. 2015;161:1187–1201. doi: 10.1016/j.cell.2015.04.044.
- 51. Clark I.C., Fontanez K.M., Meltzer R.H., Xue Y., Hayford C., May-Zhang A., D’Amato C., Osman A., Zhang J.Q., Hettige P., et al. Microfluidics-free single-cell genomics with templated emulsification. Nat. Biotechnol. 2023;41:1557–1566. doi: 10.1038/s41587-023-01685-z.
- 52. Hagemann-Jensen M., Ziegenhain C., Chen P., Ramsköld D., Hendriks G.J., Larsson A.J.M., Faridani O.R., Sandberg R. Single-cell RNA counting at allele and isoform resolution using Smart-seq3. Nat. Biotechnol. 2020;38:708–714. doi: 10.1038/s41587-020-0497-0.
- 53. Thibodeau A., Eroglu A., McGinnis C.S., Lawlor N., Nehar-Belaid D., Kursawe R., Marches R., Conrad D.N., Kuchel G.A., Gartner Z.J., et al. AMULET: a novel read count-based method for effective multiplet detection from single nucleus ATAC-seq data. Genome Biol. 2021;22:252. doi: 10.1186/s13059-021-02469-x.
- 54. Granja J.M., Corces M.R., Pierce S.E., Bagdatli S.T., Choudhry H., Chang H.Y., Greenleaf W.J. ArchR is a scalable software package for integrative single-cell chromatin accessibility analysis. Nat. Genet. 2021;53:403–411. doi: 10.1038/s41588-021-00790-6.
- 55. Fang R., Preissl S., Li Y., Hou X., Lucero J., Wang X., Motamedi A., Shiau A.K., Zhou X., Xie F., et al. Comprehensive analysis of single cell ATAC-seq data with SnapATAC. Nat. Commun. 2021;12:1337. doi: 10.1038/s41467-021-21583-9.
- 56. Finak G., McDavid A., Yajima M., Deng J., Gersuk V., Shalek A.K., Slichter C.K., Miller H.W., McElrath M.J., Prlic M., et al. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol. 2015;16:278. doi: 10.1186/s13059-015-0844-5.
- 57. Fay M.P., Proschan M.A. Wilcoxon-Mann-Whitney or t-test? On assumptions for hypothesis tests and multiple interpretations of decision rules. Stat. Surv. 2010;4:1–39. doi: 10.1214/09-SS051.
- 58. Jin S., Guerrero-Juarez C.F., Zhang L., Chang I., Ramos R., Kuan C.H., Myung P., Plikus M.V., Nie Q. Inference and analysis of cell-cell communication using CellChat. Nat. Commun. 2021;12:1088. doi: 10.1038/s41467-021-21246-9.
- 59. Jin S., Plikus M.V., Nie Q. CellChat for systematic analysis of cell-cell communication from single-cell and spatially resolved transcriptomics. Preprint at bioRxiv. 2023. doi: 10.1101/2023.11.05.565674.
- 60. Garcia-Alonso L., Lorenzi V., Mazzeo C.I., Alves-Lopes J.P., Roberts K., Sancho-Serra C., Engelbert J., Marečková M., Gruhn W.H., Botting R.A., et al. Single-cell roadmap of human gonadal development. Nature. 2022;607:540–547. doi: 10.1038/s41586-022-04918-4.
- 61.Blondel V.D., Guillaume J.-L., Lambiotte R., Lefebvre E. Fast unfolding of communities in large networks. J. Stat. Mech. 2008;2008 [Google Scholar]
- 62.Street K., Risso D., Fletcher R.B., Das D., Ngai J., Yosef N., Purdom E., Dudoit S. Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genom. 2018;19:477. doi: 10.1186/s12864-018-4772-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 63.Chen T., Guestrin C.X.G.B. A Scalable Tree Boosting System. arXiv. 2016 doi: 10.48550/arXiv.1603.02754. Preprint at. [DOI] [Google Scholar]
- 64.Heimberg G., Kuo T., DePianto D., Heigl T., Diamant N., Salem O., Scalia G., Biancalani T., Turley S., Rock J., et al. Scalable querying of human cell atlases via a foundational model reveals commonalities across fibrosis-associated macrophages. bioRxiv. 2023 doi: 10.1101/2023.07.18.549537. Preprint at. [DOI] [Google Scholar]
- 65.Bao F., Deng Y., Wan S., Shen S.Q., Wang B., Dai Q., Altschuler S.J., Wu L.F. Integrative spatial analysis of cell morphologies and transcriptional states with MUSE. Nat. Biotechnol. 2022;40:1200–1209. doi: 10.1038/s41587-022-01251-z. [DOI] [PubMed] [Google Scholar]
- 66.Cable D.M., Murray E., Zou L.S., Goeva A., Macosko E.Z., Chen F., Irizarry R.A. Robust decomposition of cell type mixtures in spatial transcriptomics. Nat. Biotechnol. 2022;40:517–526. doi: 10.1038/s41587-021-00830-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 67.Miller T.E., Lareau C.A., Verga J.A., DePasquale E.A.K., Liu V., Ssozi D., Sandor K., Yin Y., Ludwig L.S., El Farran C.A., et al. Mitochondrial variant enrichment from high-throughput single-cell RNA sequencing resolves clonal populations. Nat. Biotechnol. 2022;40:1030–1034. doi: 10.1038/s41587-022-01210-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 68.Hao Y., Stuart T., Kowalski M.H., Choudhary S., Hoffman P., Hartman A., Srivastava A., Molla G., Madad S., Fernandez-Granda C., Satija R. Dictionary learning for integrative, multimodal and scalable single-cell analysis. Nat. Biotechnol. 2024;42:293–304. doi: 10.1038/s41587-023-01767-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 69.Harris C.R., Millman K.J., van der Walt S.J., Gommers R., Virtanen P., Cournapeau D., Wieser E., Taylor J., Berg S., Smith N.J., et al. Array programming with NumPy. Nature. 2020;585:357–362. doi: 10.1038/s41586-020-2649-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 70.Amezquita R.A., Lun A.T.L., Becht E., Carey V.J., Carpp L.N., Geistlinger L., Marini F., Rue-Albrecht K., Risso D., Soneson C., et al. Orchestrating single-cell analysis with Bioconductor. Nat. Methods. 2020;17:137–145. doi: 10.1038/s41592-019-0654-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 71.Grau J., Grosse I., Keilwagen J. PRROC: computing and visualizing precision-recall and receiver operating characteristic curves in R. Bioinformatics. 2015;31:2595–2597. doi: 10.1093/bioinformatics/btv153. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 72.Wickham H. ggplot2. Springer International Publishing; 2016. [Google Scholar]
- 73.Griffiths J.A., Richard A.C., Bach K., Lun A.T.L., Marioni J.C. Detection and removal of barcode swapping in single-cell RNA-seq data. Nat. Commun. 2018;9:2667. doi: 10.1038/s41467-018-05083-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 74.Barrett T., Dowle M., Srinivasan A., Gorecki J., Chirico M., Hocking T. 2024. R’s data.table package extends data.frame. [Google Scholar]
- 75.Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., et al. Scikit-learn: Machine Learning in Python. arXiv. 2012 doi: 10.48550/arXiv.1201.0490. Preprint at. [DOI] [Google Scholar]
- 76.Virtanen P., Gommers R., Oliphant T.E., Haberland M., Reddy T., Cournapeau D., Burovski E., Peterson P., Weckesser W., Bright J., et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods. 2020;17:261–272. doi: 10.1038/s41592-019-0686-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 77.Bergstra J., Yamins D., Cox D. Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures. 17--19 Jun 2013;28:115–123. [Google Scholar]
- 78.The pandas development team. 2024. pandas-dev/pandas: Pandas.
- 79.Quan F., Liang X., Cheng M., Yang H., Liu K., He S., Sun S., Deng M., He Y., Liu W., et al. Annotation of cell types (ACT): a convenient web server for cell type annotation. Genome Med. 2023;15:91. doi: 10.1186/s13073-023-01249-5.
- 80.Ianevski A., Giri A.K., Aittokallio T. Fully-automated and ultra-fast cell-type identification using specific marker combinations from single-cell transcriptomic data. Nat. Commun. 2022;13:1246. doi: 10.1038/s41467-022-28803-w.
- 81.Butler A., Hoffman P., Smibert P., Papalexi E., Satija R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 2018;36:411–420. doi: 10.1038/nbt.4096.
Associated Data
Supplementary Materials
Cell numbers were tracked at each stage of the analysis, from the number of cells in the raw gene count matrix to the final number of quality-controlled singletCode-identified singlets.
Quality-control thresholds include the minimum and maximum number of genes and the maximum percentage of mitochondrial counts. For certain datasets, quality control might also include doublet filtering. Preprocessing steps include scaling to remove cell-cycle effects, principal component analysis, clustering, dimensionality reduction using UMAP, and integration of the samples within a dataset.
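As a minimal illustration of the kind of threshold-based quality control and cell-count tracking described above, the following Python sketch filters cells on gene count and mitochondrial percentage. The `CellQC` record, the function names, the cell IDs, and the threshold values (200, 7000, 20.0) are all hypothetical and chosen only for illustration; they are not the values or code used in this study.

```python
from dataclasses import dataclass


@dataclass
class CellQC:
    """Per-cell QC summary (hypothetical structure for illustration)."""
    cell_id: str
    n_genes: int      # number of genes detected in the cell
    pct_mito: float   # percentage of counts from mitochondrial genes


def passes_qc(cell, min_genes, max_genes, max_pct_mito):
    """Return True if the cell lies within all three thresholds."""
    return (min_genes <= cell.n_genes <= max_genes
            and cell.pct_mito <= max_pct_mito)


def filter_cells(cells, min_genes=200, max_genes=7000, max_pct_mito=20.0):
    """Keep cells that pass QC; default thresholds are illustrative only."""
    return [c for c in cells
            if passes_qc(c, min_genes, max_genes, max_pct_mito)]


# Toy input: five cells with made-up QC metrics.
cells = [
    CellQC("AAAC-1", 1500, 4.2),   # within all thresholds
    CellQC("AAAG-1", 120, 3.0),    # too few genes
    CellQC("AACT-1", 9000, 5.5),   # too many genes
    CellQC("AAGT-1", 2500, 35.0),  # mitochondrial percentage too high
    CellQC("ACGT-1", 3100, 12.0),  # within all thresholds
]

kept = filter_cells(cells)

# Track cell numbers before and after the QC stage.
print(f"cells before QC: {len(cells)}; cells after QC: {len(kept)}")
```

In practice, the thresholds would be set per dataset, and the same before/after counts would be recorded at each subsequent stage, for example after restricting to singletCode-identified singlets.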
Data Availability Statement
- This paper analyzes existing, publicly available scRNA-seq and barcoding data. The scRNA-seq datasets are available at the Gene Expression Omnibus under accession numbers GSE233766, GSE198729, GSE227151, GSE216521, GSE151431, GSE140802, GSE161676, and GSE153424. The barcoding datasets are available at Figshare (https://doi.org/10.6084/m9.figshare.22798952; https://doi.org/10.6084/m9.figshare.22802888; https://doi.org/10.6084/m9.figshare.19126985; https://doi.org/10.6084/m9.figshare.22251223.v1; https://doi.org/10.6084/m9.figshare.22236949.v1; https://doi.org/10.6084/m9.figshare.22236955.v1).
- Barcode sheets (cellID-barcode-sample files for each dataset), data to remake each plot, and new Watermelon-barcoded scRNA-seq and Multiome data are available at Figshare: https://doi.org/10.6084/m9.figshare.25478680
- All code for the analyses in this manuscript has been deposited at: https://github.com/GoyalLab/fatemap_multiplet_public; https://doi.org/10.5281/zenodo.11438747
- All code for the command line interface and package has been deposited at: https://github.com/GoyalLab/singletCodeTools; https://doi.org/10.5281/zenodo.11452137
- All code for the singletCode website has been deposited at: https://github.com/GoyalLab/SingletCodeWebsite; https://doi.org/10.5281/zenodo.11455739
- Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.







