Abstract
Single-cell sequencing provides detailed insights into individual cell behaviors within complex systems based on the assumption that each cell is uniquely isolated. However, doublets—where two or more cells are sequenced together—disrupt this assumption and can lead to potential data misinterpretations. Traditional doublet detection methods primarily rely on simulated genomic data, which may be less effective in homogeneous cell populations and can introduce biases from experimental processes. Therefore, we introduce ImageDoubler in this study, an innovative image-based model that identifies doublets and missing samples leveraging the Fluidigm single-cell sequencing image data. Our approach showcases a notable doublet detection efficacy, achieving a rate up to 93.87% and registering a minimum improvement of 33.1% in F1 scores compared to existing genomic-based methods. This advancement highlights the potential of using imaging to glean insight into developing doublet detection algorithms and exposes the limitations inherent in current genomic-based techniques.
Subject terms: Computational models, Image processing
The accuracy of single-cell sequencing can be hampered by the occurrence of doublets, where two or more cells are sequenced together, thus leading to data misinterpretation. Here, the authors present ImageDoubler, an image-based algorithm for detecting doublets and missing samples in single-cell sequencing.
Introduction
Single-cell sequencing technologies have revolutionized biological research, offering unprecedented insights into complex biological systems1–4. Central to the reliability of single-cell sequencing analyses is the assumption that individual cells are isolated and encapsulated into oil-based droplets, ensuring that the genomic profiles obtained uniquely represent single cells. However, doublets prevail across single-cell sequencing platforms, with rates as high as 33% reported in multiple platforms5. Failure to detect doublets and multiplets inevitably leads to erroneous interpretations.
Traditionally, doublet detection has relied on methods utilizing simulated and experimentally mixed single-cell RNA sequencing (scRNA-seq) data, employing clustering or classification algorithms to discern doublets6–10. For example, Scrublet constructs k-nearest neighbor classifiers using the union of observed cells and simulated doublets6; Solo fits a semi-supervised deep neural network to distinguish simulated doublets from the observed data8. Because these methods all rely on simulated data, the legitimacy of the simulation is critical to the validity of the models. These strategies may therefore have limitations, particularly when classifying doublets in datasets where the cell lines are not very heterogeneous or where the cells share identical sample indices. Moreover, such methods evaluate their detection efficiency indirectly on the sequencing results and the downstream analyses, which may include artifacts from laboratory operations and sequencing platforms. Thus, despite exhaustive benchmarking efforts11, conclusively identifying a superior method remains challenging due to these experimental constraints. This underscores the need for a more reliable and direct gold standard and for the development of models grounded in such standards.
Leveraging images as an alternative means to distinguish between singlets and doublets presents a straightforward and dependable approach that aligns with direct visual confirmation. Moreover, recent technological advancements allow such images to be retrieved concurrently with the sequencing processes. For instance, the Fluidigm C1 platform can capture the droplets when they are released in the integrated fluidic circuit (IFC), isolating the single cells into individual reaction chambers. The optically clear IFC allows the captured cells to be stained automatically and examined by microscopy for viability (Fig. 1A, Supplementary Fig. 1)12. After staining, cells are automatically lysed, and the template is quickly prepared for qPCR or sequencing analysis13. Each cell image is identified by a unique block ID consisting of the row and column numbers that are directly visible in the images (Supplementary Fig. 1B). Similar IDs are also assigned to each sequencing result through the Fluidigm C1 demultiplexing script and a map between column numbers and the barcodes. These identifiers ensure that each sequenced cell corresponds accurately to its visual representation in the images and allow the images to serve as a more realistic gold standard than simulations for doublet detection and downstream analyses on the sequencing data.
Fig. 1. Detecting doublets and singlets by Fluidigm C1 images.
A Preprocessing of images. Each microfluidic chip image consists of 800 blocks arranged in 40 rows and 20 columns. Each block contains one cell capturing site and zero to several captured cells. We preprocessed ten microfluidic chip images and split them into 800 equal-sized blocks. For each block, the U-pipe area, where the cells were captured by the cameras, was further cropped out automatically by template matching and by hand. B Examples of missing, singlet, and doublet blocks and of images at different resolutions. The exact image shapes in each image set vary depending on the templates, but they are close to the numbers presented. C Image-based model framework based on the Faster-RCNN. The “RoI-wise convolution layer” is a shared convolutional layer among all RoIs (Regions of Interest). Note that after the RoI pooling, all features within the RoIs are resized to the same shape. The model needs to determine if an object in the RoI is a cell. D Among the ten image sets, we performed two types of cross-validation: leave-one-out cross-validation (LOOCV) and cross-resolution validation. In the cross-resolution validation, models were trained on the high-resolution image sets and tested on the low-resolution ones. We trained five models on different train-validation splits for each test. E To apply the ensemble strategy and generate the final prediction, we converted the number of detected cells to a 3-class final prediction (singlet, doublet, or missing) by majority voting.
Image-based analysis has been widely used in high-throughput experiments to quantify phenotypes of interest for the biologist14. With recent advances in automated microscopy and image analysis tools, images have become a revolutionary resource that provides unbiased quantitative information about cell state in the approaches of image-based profiling or morphological profiling15–17. However, to the best of our knowledge, no existing doublet detection algorithms have yet exploited this advantage.
In response, we developed ImageDoubler, an image-based doublet detection algorithm utilizing the Faster-RCNN framework. In this study, we first collected 11 datasets from the Fluidigm C1 platform experiments. The image data from 10 of them were labeled by two individuals to perform the cross-image-set and cross-labeler validations. Next, in the two datasets with sequencing data available, we further validated our model and compared it with the existing genomic-based methods at the gene expression level. Through these analyses, we demonstrate the algorithm’s ability to automatically identify doublets and missing cells, validated through extensive cross-validation experiments. Our findings reveal not only the high accuracy of our model in detecting doublets and missing samples but also its generalizability and robustness across various image resolutions and independently labeled gold standards. Furthermore, we explore the biological applicability of our model by correlating detected doublets with actual gene expression patterns and evaluating its effects on cell clustering and differential gene expression analyses, highlighting its superiority over traditional genomics-based doublet detection methods.
Results
ImageDoubler is trained and evaluated on 12275 image samples labeled by 2 independent labelers
We collected a total of 11 datasets from Fluidigm C1 platform experiments executed at the University of Michigan Comprehensive Cancer Center’s sequencing core (Table 1). They included two datasets (the datasets 5 and 11) that had the snapshots of the experiments as well as the paired single-cell RNA-sequencing data. The other datasets only contained the snapshots of the Fluidigm experiments for the other labs’ studies, whose single-cell samples were not sequenced in this core; therefore, their sequencing results were not available for this study. Dataset 4 was excluded from model development and evaluation due to the absence of cells in its snapshot.
Table 1.
General information about the data collected from the Fluidigm experiments
| Experiment ID | Information of the tissue or cell | Number of labeled block images (labeler 1) | Number of labeled block images (labeler 2) | Expression data available? |
|---|---|---|---|---|
| 1 | Unknown | 628 | 791 | No |
| 2 | Unknown | 587 | 794 | No |
| 3 | Bone marrow | 421 | 514 | No |
| 4a | Unknown | / | / | No |
| 5 | SUM149 (left), H1975 (right)b | 673 | 798 | Yes |
| 6 | Unknown | 218 | 0c | No |
| 7 | Bone marrow | 597 | 785 | No |
| 8 | Unknown | 566 | 792 | No |
| 9 | Unknown | 513 | 791 | No |
| 10 | Unknown | 718 | 792 | No |
| 11 | SUM149 (left), SUM190 (right)b | 687 | 799 | Yes |
aThis dataset is excluded in model development and evaluation as the snapshot does not contain any cell.
b“Left” refers to the blocks of columns 1–10. “Right” refers to the blocks of columns 11–20.
cLabeler skipped the entire set due to a lack of confidence in labeling most of the block images.
A snapshot of a Fluidigm experiment consists of 800 equally sized blocks arranged in an array of 40 rows and 20 columns18 (Fig. 1A, Supplementary Fig. 1A). Each block contains a U-shaped microfluidic chamber where cells are captured for analysis19. As part of the process, we segmented these snapshots into individual blocks, each of which can be directly linked to the indices used in the scRNA-seq (Supplementary Fig. 1B). The size of each block was determined by dividing the image size by the number of rows and columns (Supplementary Fig. 2). Collectively, the 800 block images from a single Fluidigm experiment form an image set. A total of 10 image sets were retrieved from the experiments. To exclude cases where confounding pixels resemble cells (Supplementary Fig. 4), we further cropped the block images at the U-shaped regions using the template matching algorithm and manual curation (Fig. 1A, B and Supplementary Figs. 1C, S1D, S5). Excluding the blocks that could not be identified (Supplementary Fig. 1E, Supplementary Fig. 1F), 7908 block images in total were collected, including 6310 higher-resolution blocks and 1598 lower-resolution blocks (Fig. 1B, Supplementary Fig. 3A). The lower-resolution blocks were obtained from image sets 5 and 11, which were captured at a lower magnification of 2.52×, compared to the higher magnification of 6.3× used for the other image sets.
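The grid arithmetic described above can be sketched as follows. This is a minimal illustration rather than the authors' preprocessing code: the function name `block_boxes`, the 1-based row-major block IDs, the top-left pixel origin, and the example snapshot size of 2000 × 4000 pixels are all assumptions made for the example.

```python
from typing import Dict, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


def block_boxes(width: int, height: int,
                n_rows: int = 40, n_cols: int = 20) -> Dict[Tuple[int, int], Box]:
    """Map each (row, column) block ID to its pixel box in the snapshot.

    Block size is the image size divided by the number of rows and columns.
    IDs are 1-based and the origin is the top-left corner (both assumed).
    """
    block_w = width / n_cols
    block_h = height / n_rows
    boxes: Dict[Tuple[int, int], Box] = {}
    for r in range(n_rows):
        for c in range(n_cols):
            boxes[(r + 1, c + 1)] = (
                round(c * block_w),
                round(r * block_h),
                round((c + 1) * block_w),
                round((r + 1) * block_h),
            )
    return boxes


# Hypothetical snapshot size, chosen only so that blocks are 100 x 100 pixels.
boxes = block_boxes(width=2000, height=4000)
```

Keying the boxes by (row, column) mirrors the block IDs used to link each image to its sequencing result.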
We implemented our image-based doublet detection model, ImageDoubler, on the Faster-RCNN backbone20, utilizing its high detection accuracy, fast detection speed, and flexibility in adapting to new tasks to address the challenge of distinguishing overlapping cells, a common issue in these single-cell images (Fig. 1C). To prepare the training gold standard, two people independently scanned through all cropped blocks and hand-labeled those whose classes they were confident about. Labelers dragged a bounding box around each point they believed to be a cell. Block images, along with their bounding boxes (positions in [Xmin, Ymin, Xmax, Ymax]) and the object class (cell only), were sent to the model for training, which generated three types of outputs: the objects’ class (cell), the bounding boxes of the objects, and the confidence scores. This resulted in a total of 12275 hand-labeled blocks (9359 high-resolution blocks and 2916 low-resolution blocks, Supplementary Fig. 3B), where the first labeler annotated 5608 blocks and the second labeler annotated 6856. A total of 5298 blocks were labeled by both labelers; 93.64% and 96.37% of them showed agreement on cell numbers and block classes (Missing, Singlet, Doublet), respectively.
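The two agreement figures differ because blocks with different cell counts can still share a class. The sketch below illustrates this with toy data; the function names, the dictionary layout (block ID to labeled cell count), and the mapping of two or more cells to "Doublet" are assumptions for illustration, not the authors' code.

```python
from typing import Dict, Tuple

BlockId = Tuple[int, int]  # (row, column)


def to_class(n_cells: int) -> str:
    """Collapse a cell count into a block class (assumed mapping)."""
    if n_cells == 0:
        return "Missing"
    if n_cells == 1:
        return "Singlet"
    return "Doublet"  # two or more cells


def agreement(labels_1: Dict[BlockId, int],
              labels_2: Dict[BlockId, int]) -> Tuple[int, float, float]:
    """Return (#shared blocks, count agreement, class agreement)."""
    shared = labels_1.keys() & labels_2.keys()
    if not shared:
        raise ValueError("the two labelers share no blocks")
    same_count = sum(1 for b in shared if labels_1[b] == labels_2[b])
    same_class = sum(
        1 for b in shared if to_class(labels_1[b]) == to_class(labels_2[b])
    )
    return len(shared), same_count / len(shared), same_class / len(shared)


# Toy labels (hypothetical): block (1, 3) has 2 vs 3 cells, so the counts
# disagree, yet both labelers place it in the Doublet class.
labeler_1 = {(1, 1): 0, (1, 2): 1, (1, 3): 2, (2, 1): 3, (2, 2): 1}
labeler_2 = {(1, 1): 0, (1, 2): 1, (1, 3): 3, (2, 1): 3, (2, 2): 2, (2, 3): 1}
n_shared, count_agree, class_agree = agreement(labeler_1, labeler_2)
```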
We adopted two cross-validation strategies to validate our model’s efficacy and stability. Among the 10 image sets, we first implemented leave-one-out cross-validation (LOOCV), where each set was held out as the test set in each fold, with the remaining sets used for training and validation. Second, we conducted cross-resolution validation by using the lower-resolution image sets (image sets 5 and 11) as the test sets, and the higher-resolution image sets as the training and validation sets (Fig. 1D). Furthermore, we randomly divided the training dataset into training and validation subsets at the image set level in an 80:20 ratio to prevent overfitting. This split was repeated five times, resulting in five models (Fig. 1D). On average, each split in the two types of cross-validation included 953 (±170) images of background (without cells) and 3145 (±364) images containing cells (Supplementary Fig. 6, Supplementary Data 1). To combine these models into an ensemble, the final class (Missing, Singlet, or Doublet) predicted for each block was produced by the majority vote of the five models (Fig. 1E). The class priority was set as “Doublet > Singlet > Missing” to handle ties in the voting.
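The voting rule above can be sketched in a few lines. This is an illustrative implementation under stated assumptions: the function name `ensemble_vote` and the representation of each model's vote as a class string are hypothetical, and only the priority order comes from the text.

```python
from collections import Counter
from typing import Sequence

# Tie-break priority stated in the text: Doublet > Singlet > Missing.
PRIORITY = {"Missing": 0, "Singlet": 1, "Doublet": 2}


def ensemble_vote(votes: Sequence[str]) -> str:
    """Return the majority class; ties go to the higher-priority class."""
    counts = Counter(votes)
    # Compare by vote count first, then by class priority.
    return max(counts, key=lambda cls: (counts[cls], PRIORITY[cls]))


# Five hypothetical per-model predictions for one block: Singlet and
# Doublet tie at two votes each, so the priority rule selects Doublet.
example = ensemble_vote(["Singlet", "Doublet", "Doublet", "Singlet", "Missing"])
```

Resolving ties toward "Doublet" favors flagging a questionable block over keeping it as a singlet.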
ImageDoubler provides accurate and robust detections of the doublets
To assess ImageDoubler’s performance and establish the most effective confidence threshold for cell detection, we employed four evaluation metrics (accuracy, balanced accuracy, average F1 score, and weighted F1 score) and primarily focused on the balanced accuracy and weighted F1 score across the confidence thresholds from 0.3 to 0.8. Furthermore, to address potential biases of “self-reference” from evaluating models against the standards used for their training, we also examined the performances in a cross-labeler manner, that is, evaluated the model using the test data labeled by another labeler.
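For readers unfamiliar with these metrics, the sketch below computes all four on toy labels. It assumes the common definitions (balanced accuracy as the mean of per-class recalls, average F1 as the unweighted mean of per-class F1 scores, weighted F1 as the support-weighted mean); whether the study used exactly these definitions is not stated here, and the function name `evaluate` is hypothetical.

```python
from typing import Dict, Sequence

CLASSES = ("Missing", "Singlet", "Doublet")


def evaluate(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Compute accuracy, balanced accuracy, average F1, and weighted F1."""
    pairs = list(zip(y_true, y_pred))
    recalls, f1s, supports = [], [], []
    for cls in CLASSES:
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        if tp + fn == 0:
            continue  # class absent from the labels: skipped (an assumption)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn)
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        recalls.append(recall)
        f1s.append(f1)
        supports.append(tp + fn)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "balanced_accuracy": sum(recalls) / len(recalls),
        "average_f1": sum(f1s) / len(f1s),
        "weighted_f1": sum(f * s for f, s in zip(f1s, supports)) / sum(supports),
    }


# Toy example: two of three doublets are misread as singlets.
y_true = ["Missing"] * 2 + ["Singlet"] * 5 + ["Doublet"] * 3
y_pred = ["Missing"] * 2 + ["Singlet"] * 5 + ["Doublet", "Singlet", "Singlet"]
scores = evaluate(y_true, y_pred)
```

In this toy case the plain accuracy is higher than the balanced accuracy because the errors are concentrated in the smallest class, which is one reason to emphasize class-aware metrics when doublets are rare.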
Under the cross-labeler validation design, we established four evaluation scenarios within the LOOCV framework: “labeler-1” and “labeler-2” (where training and testing were conducted on the same labeler’s data), and “labeler-1-to-2” and “labeler-2-to-1” (cross-labeler training and testing). ImageDoubler presented consistent confidence in its predictions among these scenarios. The median confidence scores of the detected bounding boxes were above 0.9 in the doublet samples and 0.98 in the singlet samples, regardless of the applied confidence cutoff (Supplementary Fig. 12A). Notably, labeler-1 demonstrated superior labeling quality, yielding balanced accuracy and weighted F1 scores of 0.923 (±0.039) and 0.95 (±0.033), respectively, compared to labeler-2’s scores of 0.874 (±0.069) and 0.918 (±0.037) (p-value for balanced accuracy <1e−5, p-value for weighted F1 < 5e−7). When labeler-1’s model predicted the data labeled by labeler-2 (labeler-1-to-2), it showed a slight performance dip, with 0.888 (±0.071) in balanced accuracy and 0.921 (±0.041) in weighted F1 score. Even though the evaluation results varied across labelers, the model itself remained stable and robust, learning essential features even from the lower-quality labels. Labeler-2-to-1 achieved a balanced accuracy and weighted F1 of 0.898 (±0.050) and 0.939 (±0.034), which showed smaller differences from labeler-1 (p-value for balanced accuracy = 0.0013, p-value for weighted F1 = 0.0125) and surpassed labeler-2 significantly, especially on weighted F1 (p-value for balanced accuracy = 0.027, p-value for weighted F1 = 0.0008, Fig. 2A, Supplementary Fig. 9A).
Fig. 2. Evaluation results of the cross-validations as well as the prediction visualizations.
A Shows the scores of four metrics (accuracy, balanced accuracy, averaged F1 score, and weighted F1 score) under different confidence thresholds of cell detection in the leave-one-out-cross-validation (LOOCV) phase. B Lists the confusion maps of the ground-truths and model predictions using the threshold of 0.7. C The scores of the four metrics for the cross-resolution validation phase and D the confusion maps under the threshold of 0.7. E Presents several examples in which the hand labels were incorrect but the model classified the images correctly. Incorrect hand labels were identified from images with inconsistent labels between the two labelers. When one label differed from the model’s prediction while the other matched, the differing label was likely incorrect. The three left images show the cases with higher resolution, including a false doublet (where dust outside the region was labeled as a cell), a false missing (where a cell was missed), and a false singlet (where two cells were labeled in one bounding box). In the right two images, which have lower resolution, the labelers missed cells.
Upon analyzing the validation data (Supplementary Fig. 8), we selected a confidence threshold of 0.7. This threshold achieved overall better performance across the four metrics and data from the two labelers compared to other values (Supplementary Fig. 8B). Its reliability was further supported by the consistent results observed in the subsequent evaluation of detected cell numbers in the test data, with more detections of “0” and “1” cells in the “Missing” and “Singlet” blocks, and fewer in the “Doublet” blocks (Supplementary Fig. 12B). At this threshold, our model achieved high prediction accuracy across all scenarios, with an average true positive rate of over 91% and high doublet detection accuracy between 88.51% and 92.91%. These numbers also demonstrated the model’s resilience against labeling inaccuracies (Fig. 2B and Supplementary Fig. 9B).
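To make the role of the confidence threshold concrete, the sketch below filters one block's detections by score and converts the surviving count into a class. The function name `classify_block`, the use of `>=` at the threshold, the count-to-class mapping, and the example scores are assumptions for illustration only.

```python
from typing import Sequence, Tuple


def classify_block(scores: Sequence[float],
                   threshold: float = 0.7) -> Tuple[int, str]:
    """Count detections at or above the threshold and name the block class.

    `scores` holds the confidence of every bounding box the detector
    proposed for one block (hypothetical input format).
    """
    n_cells = sum(1 for s in scores if s >= threshold)
    if n_cells == 0:
        return n_cells, "Missing"
    if n_cells == 1:
        return n_cells, "Singlet"
    return n_cells, "Doublet"


# One confident detection plus two weaker ones (hypothetical scores).
detections = [0.98, 0.65, 0.35]
sweep = {t: classify_block(detections, t) for t in (0.3, 0.5, 0.7, 0.8)}
```

With these scores the block reads as a doublet at thresholds 0.3 and 0.5 but as a singlet at 0.7 and 0.8, which is the kind of trade-off the threshold selection has to balance.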
Additionally, ImageDoubler exhibited robust performance across varying image resolutions. Training on high-resolution images and testing on low-resolution ones, we achieved a balanced accuracy of 0.944 and a weighted F1 of 0.943 in the labeler-1 scenario, and 0.938 and 0.930, respectively, in labeler-1-to-2 (Fig. 2C). Since we would use this cross-resolution model to study the relationship with the gene expression profiles, we only trained on the high-quality labeling dataset to ensure its accuracy. Confusion maps showed average true positive rates of 94.41% and 93.80%, as well as doublet detection rates of 93.24% and 93.87% (Fig. 2D), in these two evaluation scenarios.
Such stability in performance may rely heavily on the ensemble strategy used in ImageDoubler. Each single model within the ensemble was trained on a different subset of the training data. By aggregating the outputs from all single models, the ensemble model maintained relatively high accuracy, in contrast to the higher variance seen in the performances of individual models (Supplementary Fig. 10).
This robustness extended to the model’s ability to correct initial mislabelings upon reviewing the images. We re-examined the initial block images that were labeled differently by the two labelers against the model predictions and found that some of the model’s predictions were visually correct, despite the initial mislabeling by hand (Fig. 2E). This re-examination underscores the advantage of image-based methods, where model predictions can be directly evaluated against a visible gold standard.
Predictions from ImageDoubler are validated through independent expression data
An alternative way to address the “self-reference” issue was using the corresponding expression profile of each block to examine the viability of our labels and model predictions. We collected the sequencing data of image set 5 and image set 11 to create gene expression matrices consisting of 800 blocks (corresponding to the number of cell capture sites) by 29,643 genes (sourced from the Ensembl GRCh38 assembly). Each block contains cells exclusively from a single cell line and was assigned a unique identifier with its row and column number to match the expression profile to its corresponding image. In our data, blocks in columns 1–10 contain cells from the SUM149 cell line, while blocks in columns 11–20 contain cells from H1975 (in image set 5) and SUM190 (in image set 11). To emphasize the ability to generalize across sample variances, we chose the model predictions from the cross-resolution experiments of labeler-1. The ground-truths of the block classes were also from the first labeler.
Our investigation into the “Missing” category (blocks with no cell presence) revealed that these blocks consistently showed gene read counts nearing zero, indicating minimal gene expression activity (Figs. 3 and S13, S14). This trend persisted across both labeled (0.155 compared to 1.710 for the rest of the blocks in image set 5, 0.223 compared to 1.257 in image set 11) and predicted “Missing” blocks (0.174 compared to 1.686 in image set 5, 0.218 compared to 1.229 in image set 11), underscoring the validity of our labeling and prediction criteria (Fig. 3A, B). Furthermore, “Missing” blocks exhibited a significantly reduced number of expressed genes compared to blocks classified as containing cells, further validating the model’s capability to discern between cell presence and absence (Fig. 3C, D). According to the model predictions, the average numbers of non-zero genes in “Missing” blocks were 783.53 (image set 5) and 1141.32 (image set 11), compared to 3818.01 and 3459.18, respectively, in the remaining blocks.
Fig. 3. Relationship between the block image classes and the gene expression profiles.
666 and 671 data points were visualized for image sets 5 and 11, respectively. A, B Average gene expression in raw counts of the blocks with respect to the block classes from the ground truths and model predictions in image sets 5 and 11, whose cell lines were SUM149 + H1975 and SUM149 + SUM190, respectively. The center lines of the box plots show the median values. The lower and upper hinges correspond to the first and third quartiles. The upper and lower whiskers extend from the hinges to the largest and smallest values no further than 1.5×IQR from the hinges (IQR is the distance between the first and third quartiles). Data points outside the whiskers are shown as outliers. C, D Average numbers of non-zero genes across different types of blocks. The error bars show the mean values ± the standard deviations. E, F Expression profiles visualized by first applying principal component analysis (PCA) and then calculating the UMAP on the 50 principal components (PCs). The points are colored according to the block types from the ground truths, model predictions, and the actual cell line inputs from the experiments corresponding to the images. The x-axis and y-axis represent the first and second UMAP components, respectively. In the “Ground-truths” of E and F, the gray points represent blocks without hand labels.
The expression patterns of doublets versus singlets further support the validity of this approach. Doublet blocks, as expected, showed significantly higher gene read counts and a greater number of non-zero gene expressions, reflecting the increased cellular content. Specifically, for image set 11, the average read counts of the predicted singlets and doublets were 1.157 and 1.351 (p-value < 1e−5, Figs. 3B, S13C). The average number of non-zero genes of the predicted singlets and doublets were 3287.21 and 3749.05 (p-value < 1e−5, Figs. 3D, S13D). As for image set 5, due to the limited number of doublets (32, compared to 116 in image set 11), particularly within the H1975 cell line (6, compared to 26 in SUM149), the read counts comparison between singlet and doublet could be ambiguous (Fig. 3A). Despite these limitations, the distinction visibly existed in the number of expressed genes between singlets and doublets (Fig. 3C), especially when only focusing on the SUM149 cell line (Supplementary Fig. 14).
By employing Uniform Manifold Approximation and Projection (UMAP) for clustering gene expression data and comparing these clusters against our labels and predictions, we illustrated the reliability of ImageDoubler and the advantages of utilizing visual properties from image data in doublet detections. First of all, the clear separation between blocks with cells and those labeled as “Missing” highlighted the model’s precision. Moreover, the model’s predictions for unlabeled blocks (the gray points in Figs. 3E, F) closely matched the classes of their adjacent blocks. Since the clustering of blocks was based on gene expression data, this alignment indicated that the model assigned the same classes to blocks with similar expression profiles. It supported that both the hand-labels and the model predictions reflect biologically meaningful classifications.
On the other hand, we observed that singlets were intermixed with blocks labeled as “Missing”, and doublets were dispersed across clusters. This observation suggests that in some cases, genomic-based methods might struggle to efficiently distinguish doublets from singlets, as they are primarily reliant on gene expression data. This limitation contrasts with the image-based approach of ImageDoubler, which is not constrained by such expression-based issues (Fig. 4).
Fig. 4. Benchmark the image-based model against the seven genomic-based doublet detection methods and one empty droplet detection method on image sets 5 and 11.
The confidence score for ImageDoubler was set at 0.7, and the best performance for other methods was reported based on a series of hyperparameters for data pre-processing and post-processing. A Shows F1 scores comparing ImageDoubler and EmptyDrops for detecting missing cells (empty droplets) in image sets 5 and 11, with EmptyDrops performance averaged over ten runs using 20 random missing blocks as references. The points in cyan are the individual results of the ten runs. Error bars indicate standard deviations. B Shows F1 scores for doublet detection, excluding missing cells. The F1 scores presented are calculated for binary classification without applying class balancing as there are only two classes in the experiments of (A, B). C, D Present confusion matrices for image sets 5 and 11, respectively, illustrating model prediction details. E and F are UMAP visualizations of block gene expression profiles concerning ground truths and model predictions for the two datasets without missing samples. The “Cell lines” subplots represent clusters colored by actual input cell lines corresponding to the experiments.
ImageDoubler surpasses the genomics-based algorithms
In the comparative analysis, we benchmarked ImageDoubler against seven leading genomic-based doublet detection algorithms: DoubletDetection21, DoubletFinder7, Solo22, Scrublet6, scds10, scDblFinder9, and SoCube23. These genomic methods were trained and evaluated on the expression data of image sets 5 and 11 under various settings, including cell and gene selections (e.g., selecting the highly variable genes), doublet score thresholds, and method-specific hyperparameters. Since these methods are not designed to detect empty droplets, we applied them to the expression data excluding the “Missing” blocks. Only the best performances across these settings are reported and used for comparisons (Figs. S15 and S16). ImageDoubler’s ability to detect empty samples was benchmarked against EmptyDrops24. Predictions for ImageDoubler were obtained from the cross-resolution experiments using labeler 1’s data, where models were trained on the other image sets and tested on these two sets. The accuracy of classifying singlets and doublets was quantified with the F1 score against the labels from the first labeler, chosen for its higher label quality and larger consensus proportion.
ImageDoubler demonstrated superior performance over the genomic-based methods. In comparison, ImageDoubler achieved an F1 score of 0.985 in detecting missing samples in image set 5 and 0.966 in image set 11, while EmptyDrops achieved an average F1 score of 0.501 (±0.173) and 0.489 (±0.057), respectively, when referring to 20 randomly selected known missing samples for detection (Fig. 4A, Supplementary Table 1). Another approach with EmptyDrops, which involved setting a lower bound on total counts, failed in our dataset with the default value (see Benchmark settings under Methods for more details). The lack of gold standards for adjusting this parameter other than inspecting the distribution of the p-values for the assumed empty droplets25 underscored the limitation of the genomic-based methods in quantifying the viability of their parameter settings.
As for the comparisons for doublet detections (Supplementary Table 2), for image set 11, while our image-based model reached an F1 score of 0.785, the scores for the others from high to low were 0.582 (DoubletDetection), 0.517 (scds), 0.472 (SoCube), 0.373 (Solo), 0.291 (Scrublet), 0.164 (scDblFinder), and 0.154 (DoubletFinder). As for image set 5, which did not have enough samples for constructing valid simulations, all the genomic-based methods had their F1 scores lower than 0.2 when our image-based model achieved a 0.833 F1 score (Fig. 4B). Further analysis of the confusion matrices for both image sets revealed additional insights into the performance disparities. In image set 5, while Solo and Scrublet detected a number of doublets comparable to ImageDoubler, they produced a significantly higher number of false positives. Conversely, scDblFinder exhibited the lowest rate of false positives but was unable to identify the majority of doublets (Fig. 4C). In image set 11, DoubletDetection and scds achieved a better balance between false positives and false negatives, leading to their relatively higher F1 scores among the genomic-based methods (Fig. 4D).
Visualizing the blocks in clusters from expression data with respect to the predictions from the genomic-based methods further indicated their limitations when facing limited sample size and dispersed doublet distribution. While some genomic-based algorithms may perform comparably to or even outperform ImageDoubler in some cases under conditions with ample doublet samples (DoubletDetection and scds in Fig. 4F), they struggled to provide accurate estimations in scenarios where simulations do not accurately reflect the true doublet profiles (Solo and Scrublet in Fig. 4E). Such issues became more significant when including the “Missing” blocks (Supplementary Fig. 17). For example, in image set 5, Solo may treat the “Missing” cases as singlets, thus incorrectly estimating the distribution of the doublets and producing many more false-positive detections.
ImageDoubler is effective in cell clustering and differential expression analysis
To confirm that ImageDoubler is effective for practical downstream analysis, we examined its role in producing clear cell clustering and identifying differentially expressed genes (DEGs). As in the cross-resolution validation, ImageDoubler was trained on the image data excluding image sets 5 and 11, which were used for evaluation in this section, with the hand-labeled classes from the first labeler as the gold standard. The model therefore did not encounter these specific images during training, which maintains the independence of the test data; in addition, these test images were of lower resolution than those used in training.
A major task of removing cells other than singlets is to avoid the misinterpretation of spurious cell clusters26. In our cases, “Missing” blocks were another source of spurious clusters, and some of the gold standard singlet blocks were located far from the main clusters and identified as separate entities (Fig. 5A). Therefore, our main evaluation metric was how accurately blocks could be clustered by comparing the true cell types with the Leiden clusters in the dataset cleaned of missings and doublets, assessed using adjusted mutual information (AMI, Supplementary Data 2). ImageDoubler performed well in cell clustering, generating clusters closer to the gold standard and true cell types, particularly in image set 11 (Figs. 5A, B), surpassing other genomic-based methods (Supplementary Fig. 18). In AMI, ImageDoubler achieved scores of 0.7741 in image set 5 and 0.7656 in image set 11. The genomic-based methods did not surpass the baseline due to the presence of “Missing” blocks (Fig. 5C).
Fig. 5. Performance of ImageDoubler in downstream analysis.
We mainly tested its efficiency in cell clustering and differential expression analysis. A, B Present the UMAP-embedded expression profiles in (A) image set 5 and (B) image set 11 after removing the “Missing” and “Doublet” blocks based on the inferences of the doublet detection methods. “No Removal” keeps all the blocks, and the “Ground-truth” removes the blocks based on hand labels. The upper panels are colored by the true cell types, while the bottom panels are colored by the Leiden algorithm-inferred clusters. C Compares the Leiden clusters and the true cell types and quantifies the accuracies by the adjusted mutual information across all the doublet detection methods. “No removal” and “Ground-truth” serve as the baselines. D Shows the precision, recall, and true negative rates (TNR) of the differentially expressed genes (DEGs) from the doublet-detection algorithms-inferred singlets under different significance levels (a gene is significant when its adjusted p-value is below the “Adjusted p-value” cutoff and its absolute LFC is at least the “LFC” cutoff in the legend). “LFC: 0” indicates no restriction on the gene’s log-2 fold-change values. The p-values come from the two-sided Wald test of DESeq2 and are adjusted by the Benjamini and Hochberg method. The four pairs of p-value and LFC represent four levels of stringency for identifying significant DEGs, ranging from the most restrictive (Adjusted p-value: 0.005; LFC: 0.99) to the least restrictive (Adjusted p-value: 0.05; LFC: 0). DEGs are called from the true cell types, and we use those from “Ground-truth” as the gold standard to calculate the metrics. E Compares the DEGs called from the true cell types and the Leiden clusters, where those from true cell types are gold standards.
In the DE gene analysis, an effective doublet-detection method should enhance the accuracy of detected DE genes. We designed two comparisons to assess ImageDoubler and compare it with other genomic-based methods (Supplementary Data 2). First, we contrasted inferred singlets’ DEGs with those from hand-labeled singlets categorized by true cell types (Fig. 5D). Second, we compared DEGs from Leiden clusters with true cell types (Fig. 5E). Various thresholds for adjusted p-values (0.05 and 0.005) and absolute log2 fold change (LFC, 0 and 0.99) were applied to identify DEGs. Precision, recall, and true-negative rate (TNR) were used to quantify accuracy11. In summary, compared to hand-labelings, ImageDoubler nearly doubled recall (0.9703 for image set 5 and 0.9447 for image set 11) while maintaining high precision (0.9735 and 0.9389) and TNR (0.9969 and 0.9906). When evaluating DEGs from Leiden clusters, ImageDoubler significantly outperformed genomic-based methods, particularly in precision (0.9020 for image set 5 and 0.9594 for set 11) and TNR (0.9888 and 0.9938). These results reinforce ImageDoubler’s efficiency in cell clustering and its ability to identify key genes for downstream analysis.
ImageDoubler also delivered competitive performances in the scenarios without the “Missing” samples, indicating its ability to handle different types of data. In cell clustering, ImageDoubler achieved AMI scores of 0.7862 for image set 5, outperforming most genomic-based methods, and 0.7761 for image set 11 (Supplementary Fig. 19). In identifying DEGs, ImageDoubler maintained the same or superior accuracy levels compared to other methods (Supplementary Fig. 20).
Discussion
In this study, we introduced an image-based model, ImageDoubler, for doublet detection in single-cell sequencing, leveraging the chamber images of the Fluidigm C1 platform and the Faster-RCNN framework. This method stands out for its robustness, accuracy, and ability to generalize, effectively distinguishing between doublets, missing cells, and singlets with notable biological significance.
Our model underwent rigorous evaluation through two types of cross-validation with four metrics to assess its precision and stability. To circumvent potential biases inherent in using self-labeled data (“self-reference” issue), we engaged two independent labelers to define ground truths, further validating our model through cross-labeler comparisons. These comprehensive evaluations consistently yielded accuracy and true positive rates close to or higher than 90%, attesting to the model’s reliability. Moreover, the model’s biological relevance is substantiated through the alignment of expression profiles with the classifications of block types, as well as its demonstrated effectiveness in downstream analyses, including accurate cell clustering and identification of DEGs (Supplementary Data 2), reinforcing the biological implications of our findings. To further explore the efficiency of ImageDoubler on downstream analyses, we also applied Slingshot cell trajectory inference to our datasets27. ImageDoubler’s high accuracy in detecting the doublets and missing samples resulted in trajectories that closely matched the lineage paths from ground-truth singlets (Supplementary Fig. 21). However, the homogeneity of our dataset and the lack of reference developmental information limited the generalizability of these findings, underscoring the need for more diverse test data in future studies.
Current genomics-based doublet identification algorithms are robust and straightforward, as they can be directly applied to the expression matrices. However, they still have substantial room for improvement. First, these methods often lack a direct and reliable gold standard for validating whether the detections from a specific parameter setting are accurate. For instance, EmptyDrops suggests determining the appropriate “lower” parameter by examining the distribution of p-values, while DoubletDetection and Scrublet rely on visualizing expression profiles to set cutoffs. In contrast, the accuracy of detections made by ImageDoubler can be directly verified through the corresponding images, allowing for a clear and quantifiable assessment of its performance under specific confidence scores.
Secondly, failure to detect doublets is prevalent across the genomic-based tools in the Fluidigm C1 platform data, particularly in datasets that may be homogeneous in cell types, such as the H1975 line in dataset 5. These tools perform significantly better in datasets with more heterogeneous cell populations, such as the breast cancer cell lines in dataset 11, which may contain five distinct cell types28, but they still fall short compared to ImageDoubler. These findings highlight the limitations of relying solely on genomic data for accurate doublet detection in the Fluidigm experiments and suggest that image-based methods like ImageDoubler offer a more reliable alternative.
We also collected an additional dataset (named “Extra”) from a macrophage bone marrow sample to evaluate the methods in a more heterogeneous environment. However, problematic expression profiles in the “Missing” samples (Supplementary Fig. 22) and the absence of raw sequencing data for comprehensively diagnosing this issue led us to exclude it from the main results, despite ImageDoubler’s superior performance (Supplementary Table 2).
Despite the advancements presented by our model, there remains room for further exploration. As an image-based method, it is important to acknowledge certain limitations associated with image retrieval and the variability in image quality. Images with clear visualization at the cell releasing areas are important for ImageDoubler to provide accurate detections, and a map of indexes matching the images and cells with their expression data is required for fully incorporating ImageDoubler into a complete single-cell processing pipeline. In our experiments, we observed that when the model, trained on a limited set of low-resolution images, was applied to high-resolution images, its performance dropped significantly. This suggests that the model may struggle when facing high-variance image cohorts with a limited number of images.
Additionally, the test set with available expression data used in our evaluations was relatively small, and the platform type was restricted. Although the model performed well within the scope of our study, its applicability to other common single-cell platforms, such as those from 10x Genomics or Parse and Scale Biosciences, may be limited due to the absence of integrated imaging systems or image-to-cell mapping. A larger and more diverse test set with corresponding expression profiles would be necessary to fully validate the model’s robustness across different conditions and experimental settings.
Future work could explore the integration of image-based and genomic-based methods, leveraging the strengths of both to enhance doublet detection. A multimodal training framework could be developed to integrate information from both image data and expression profiles. The labels for the images, or the predictions from the model like ImageDoubler, can be used to refine genomic-based methods, thereby improving their reliability while maintaining their convenience. While the current application of ImageDoubler is confined to the Fluidigm C1 platform, the methodologies developed here have potential applications as well-trained cell detectors in other workflows where droplet images are available. For instance, Howell et al. incorporated a YOLOv4-tiny model in a droplet microfluidic platform to detect objects within droplets at the time of capture29. In contrast, ImageDoubler operates post-capture. Further developments are needed to enable ImageDoubler to work at the point of capture and fully unlock its potential for higher detection accuracy in applications requiring precise cell identification. Additionally, in extended fluorescence microscopy used in imaging flow cytometry30, image-based detection methods could help precisely capture cells and detect abnormal cell numbers in each image from the high-throughput image flow, thereby improving data quality and enhancing the reliability of downstream analysis.
In conclusion, by integrating the strengths of image-based analyses, ImageDoubler enhances quality control and reliability in single-cell analyses, demonstrating robust and accurate performance, particularly in scenarios where genomic-based methods may struggle. Its versatility extends beyond the Fluidigm C1 platform, with potential applications in other imaging-based workflows, paving the way for more precise and comprehensive single-cell analyses across multiple modalities.
Methods
Retrieve image data and pre-process
We collected 11 microfluidic chip snapshots from the Fluidigm C1 experiments at the Single Cell Analysis Core of Rogel Comprehensive Cancer Center of the University of Michigan. Each snapshot was taken at the droplet-releasing moment and contained 800 blocks of valves and cell-capturing sites, organized into 40 rows and 20 columns. The fourth snapshot was excluded as it failed to capture this moment.
The first step of processing the snapshots was clipping these blocks into separate image files and creating image sets for the experiments. Except for the snapshots of experiments 5, 9, 10, and 11, all snapshots were simply divided evenly into 800 parts, with 40 rows and 20 columns. For the special ones (experiments 5, 9, 10, and 11), we skipped several pixels at the left (w1), right (w2), top (h1), and bottom (h2) of the snapshots before dividing them evenly. These parameters were (h1 = 50, w1 = 300, h2 = 250, w2 = 300) for experiment 5, (h1 = 300, w1 = 100, h2 = 400, w2 = 0) for experiment 9, (h1 = 350, w1 = 100, h2 = 350, w2 = 0) for experiment 10, and (h1 = 100, w1 = 300, h2 = 100, w2 = 300) for experiment 11. These operations ensured that the blocks’ row and column IDs matched their real positions in the snapshots (Supplementary Fig. 1B).
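The clipping step can be sketched as a small helper that computes one crop box per block. This is only an illustration: the function name, the 0-based (row, column) indexing, and the snapshot size in the example are hypothetical, and only the grid shape and the margins for experiment 5 come from the text. The boxes use the (left, upper, right, lower) order that Pillow's `Image.crop` accepts.

```python
def block_boxes(width, height, w1=0, w2=0, h1=0, h2=0, n_rows=40, n_cols=20):
    """Return {(row, col): (left, upper, right, lower)} crop boxes for one snapshot.

    w1, w2, h1, h2 are the pixels skipped at the left, right, top, and bottom.
    """
    inner_w = width - w1 - w2
    inner_h = height - h1 - h2
    boxes = {}
    for r in range(n_rows):
        for c in range(n_cols):
            # Integer arithmetic keeps neighbouring blocks contiguous.
            left = w1 + (c * inner_w) // n_cols
            right = w1 + ((c + 1) * inner_w) // n_cols
            upper = h1 + (r * inner_h) // n_rows
            lower = h1 + ((r + 1) * inner_h) // n_rows
            boxes[(r, c)] = (left, upper, right, lower)
    return boxes


# Margins reported for experiment 5; the 4600 x 8300 snapshot size is made up.
boxes = block_boxes(4600, 8300, w1=300, w2=300, h1=50, h2=250)
```

With these hypothetical dimensions every block is 200 by 200 pixels, and the row and column keys preserve each block's position in the snapshot.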
Next, we further cropped each block to the areas containing the U-shaped tubes utilizing the template matching algorithm with the “TM_SQDIFF” comparison method in the OpenCV Python library. The tubes were the cell-capturing sites. To generate the templates, we selected the first block of each image set and manually cropped it at the tube region (Supplementary Fig. 5).
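To make the matching criterion concrete, the following dependency-free sketch slides a template over a block and keeps the position with the smallest sum of squared differences, which is the quantity the “TM_SQDIFF” mode minimizes. It is not the authors' code: in practice `cv2.matchTemplate(image, template, cv2.TM_SQDIFF)` followed by `cv2.minMaxLoc` would do the same work far faster, and the tiny arrays below are invented for illustration.

```python
def match_template_sqdiff(image, template):
    """Return ((row, col), score) of the best match.

    image and template are 2-D lists of gray values; the score is the sum of
    squared differences, so lower is better and 0 is an exact match.
    """
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    best_pos, best_score = None, float("inf")
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            score = sum(
                (image[r + i][c + j] - template[i][j]) ** 2
                for i in range(th)
                for j in range(tw)
            )
            if score < best_score:
                best_pos, best_score = (r, c), score
    return best_pos, best_score


# Toy block with a U-like pattern; values are illustrative only.
block = [
    [0, 0, 0, 0, 0],
    [0, 9, 1, 9, 0],
    [0, 9, 9, 9, 0],
    [0, 0, 0, 0, 0],
]
tube = [[9, 1, 9], [9, 9, 9]]
position, score = match_template_sqdiff(block, tube)
```

The returned top-left corner, together with the template size, gives the region to crop around the tube.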
The template-matching algorithm might fail to discover the tubes when (1) they were located at the edges of the blocks with incomplete shapes (Supplementary Fig. 1E) or (2) they were partially or fully covered by the legends or noises of the snapshots (Supplementary Fig. 1F). In the first case, we curated these samples by manually cropping out the tubes, while in the second, we simply excluded them as they could not provide valid information.
To label the cells, we used labelImg31 to scan through all the cropped images and drag bounding boxes on the light dots or rings that we believed should be cells. We instructed people to label only the images they were confident in classifying and to skip those that were challenging to interpret. To avoid introducing biases, two people independently performed the labeling without discussing their decisions. The second labeler skipped the entire image set 6, stating that the majority of images in this set were difficult to interpret. To preserve the independence of the labeling process, no curations were made to the labels afterward. The unaligned labels were retained to study the robustness of ImageDoubler when trained with noisy data (i.e., potentially incorrect annotations) and its ability to make accurate inferences on images with such labels (Fig. 2).
Data partitions for cross-validations and gene expression evaluations
To reduce potential batch effects, build generalizable models, and evaluate the artifacts from the labelers, we carried out three cross-validation strategies by separating the block images into training, validation, and test sets at the image set level. First, we used the concept of leave-one-out-cross-validation (LOOCV), where we held out one image set as the test and the remaining as the train and validation set for each fold. Second, we proposed a cross-resolution validation by using the image sets 5 and 11 as the tests, which had lower resolutions than the others. This partition was also used to study the relationship between the model predictions and actual gene expressions. Third, we evaluated the model in a cross-labeler manner, similar to LOOCV, but the held-out image set was from the other independent labeler. As all the training data from the splits contained sufficient samples for the models to distinguish between background and cells (Supplementary Fig. 6, Supplementary Data 1), no specific stratification or balancing strategies were applied in the data partitions.
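A minimal sketch of the leave-one-image-set-out partition is given below. The function name is hypothetical, and the text does not specify how the validation sets were chosen from the remaining image sets, so taking the last remaining set as validation is purely an assumption for illustration.

```python
def leave_one_set_out(image_sets, n_val=1):
    """Yield (train, val, test) lists of image-set IDs, one fold per held-out set."""
    for i, test_set in enumerate(image_sets):
        rest = image_sets[:i] + image_sets[i + 1:]
        # Assumption: the last n_val remaining sets serve as validation data.
        yield rest[:-n_val], rest[-n_val:], [test_set]


# Ten usable image sets: 1 to 11 without the excluded fourth snapshot.
image_sets = [s for s in range(1, 12) if s != 4]
folds = list(leave_one_set_out(image_sets))
```

Splitting at the image-set level, rather than at the block level, keeps all blocks from one chip in the same partition, which is what guards against batch effects leaking into the test set.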
Input data pre-processing and augmentations
Input block images were scaled to 600 by 600 pixels. The Faster-RCNN anchor sizes were determined from the actual bounding box dimensions in 100 randomly selected block images at the scaled resolution, where the average height was 26.91 pixels, and the average width was 31.24 pixels. To ensure the anchors accommodated the range of object sizes, we set the base anchor sizes to 10, 20, and 40 pixels, where 10 was used for capturing the smaller objects of cells, and 40 was used for the larger ones.
Two data augmentations—image resizing and random horizontal flipping—were applied during batch preparation for the models. The image resizing involved randomly altering the width-height ratio from 0.7:1.3 to 1.3:0.7 (the original ratio is 1:1) and then scaling the image by a factor ranging from 0.25 to 2. The resized images were further adjusted by either padding with gray bars or cropping randomly to fit the 600×600 pixel input size. The horizontal flipping was applied post-resizing, with each image having a 50% probability of being flipped.
All these augmentations were implemented using Python’s Pillow (PIL) package. Resized images were interpolated using cubic spline interpolation (PIL.Image.BICUBIC). Incorporating these augmentations significantly improved the model’s performance, particularly in doublet detection accuracy (Supplementary Fig. 11).
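The geometry of the two augmentations can be sketched as a parameter sampler. Several details here are assumptions rather than the authors' implementation: the ratio is drawn as a quotient of two uniform values, the scale factor is applied to the longer side, and the function and key names are hypothetical. The image operations themselves (for example `Image.resize` with `Image.BICUBIC`, pasting onto a gray canvas, and `Image.transpose(Image.FLIP_LEFT_RIGHT)`) are omitted so the sketch stays dependency-free.

```python
import random


def sample_augmentation(rng, input_size=600, jitter=0.3):
    """Sample resize, placement, and flip parameters for one square block image."""
    # Quotient of two draws from [0.7, 1.3] spans 0.7:1.3 to 1.3:0.7
    # (assumed sampling scheme).
    ratio = rng.uniform(1 - jitter, 1 + jitter) / rng.uniform(1 - jitter, 1 + jitter)
    scale = rng.uniform(0.25, 2.0)
    # Assumption: the scale factor sets the longer side of the resized image.
    if ratio >= 1:
        new_w = int(scale * input_size)
        new_h = int(new_w / ratio)
    else:
        new_h = int(scale * input_size)
        new_w = int(new_h * ratio)
    # Top-left paste position on a gray input_size x input_size canvas:
    # a non-negative offset leaves gray bars, a negative one crops the image.
    dx = int(rng.uniform(0, input_size - new_w))
    dy = int(rng.uniform(0, input_size - new_h))
    flip = rng.random() < 0.5  # horizontal flip, applied after resizing
    return {"ratio": ratio, "scale": scale, "size": (new_w, new_h),
            "offset": (dx, dy), "flip": flip}


params = sample_augmentation(random.Random(0))
```

Bounding boxes would need the same scale, offset, and flip applied so that labels stay aligned with the augmented image.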
Network structure and training
The backbone of the network was similar to the structure of classic Faster-RCNN. We selected it for the following reasons. First, Faster-RCNN had a relatively simple structure and benefited from extensive community support, including pre-trained models, making it flexible and easier to transfer to other tasks. Second, Faster-RCNN maintained a good balance between speed and detection accuracy. It outperformed some one-stage models, such as early versions of YOLO32, while demonstrating better or comparable speed compared to other two-stage models like Mask R-CNN33. Faster-RCNN consisted of a ResNet-50 network to extract features from the inputs, which had been pre-trained on the PASCAL Visual Object Classes dataset34, a Region Proposal Network (RPN) to generate the regions that may contain the cells on the extracted feature map, a Region of Interest pooling layer to align the proposed regions, and a classifier head that predicted object class and bounding boxes from these regions.
An Adam optimizer was used to optimize the loss function, which consists of four sections: 2 binary cross-entropy losses to evaluate the class predictions and 2 smooth L1 losses to evaluate the bounding boxes predictions from RPN and the classifier head.
During the 30-epoch training, the first 10 epochs froze the feature extraction network (the ResNet-50 part) and only trained the others with a learning rate of 1e−4 and a batch size of 16, and for the remaining 20 epochs, the entire network was unfrozen and trained with a learning rate of 5e−5 and a batch size of 8. No further tuning of hyperparameters was conducted during training. The initial learning rate of 1e−4 was a common starting point for a deep learning model using the Adam optimizer, and a batch size of 16 was the largest number our NVIDIA TITAN RTX GPU could handle in one batch. In the latter 20 epochs, as more parameters were unfrozen for training, these numbers were halved to allow for more training steps and better-converged losses.
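The two-stage schedule (10 frozen epochs followed by 20 unfrozen ones) can be written down as a small lookup. The function and key names are hypothetical; only the epoch counts, learning rates, and batch sizes are taken from the text.

```python
def epoch_config(epoch):
    """Return the training settings for a 0-based epoch index (30 epochs total)."""
    if not 0 <= epoch < 30:
        raise ValueError("epoch must be in [0, 30)")
    if epoch < 10:
        # Stage 1: ResNet-50 feature extractor frozen.
        return {"freeze_backbone": True, "lr": 1e-4, "batch_size": 16}
    # Stage 2: whole network trainable, learning rate and batch size halved.
    return {"freeze_backbone": False, "lr": 5e-5, "batch_size": 8}


schedule = [epoch_config(e) for e in range(30)]
```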
To prevent overfitting, we set up a callback on the validation data. The training pipeline calculated the validation loss at the end of each epoch and saved the model weights when lower validation losses were detected. The curves relating to the training and validation losses for each fold, each model, and each labeler can be found in Supplementary Fig. 7. Since the early stopping mechanism was not employed in the callback, the training continued until the pre-set number of epochs was reached. While overfitting might appear in the loss curves at later stages, it did not affect model performance since only the weights corresponding to the minimum validation loss were saved.
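The checkpointing behaviour described above, keeping the weights of the best epoch while training runs to completion, might look like the following sketch. The class name and the validation losses are made up for illustration, and a plain dictionary stands in for the model weights.

```python
import copy


class BestCheckpoint:
    """Keep the weights from the epoch with the lowest validation loss.

    There is no early stopping: update() is called every epoch and training
    simply continues until the pre-set number of epochs is reached.
    """

    def __init__(self):
        self.best_loss = float("inf")
        self.best_epoch = None
        self.best_weights = None

    def update(self, epoch, val_loss, weights):
        """Store a copy of the weights if val_loss improves; return True if saved."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.best_weights = copy.deepcopy(weights)
            return True
        return False


tracker = BestCheckpoint()
val_losses = [0.90, 0.62, 0.55, 0.58, 0.61]  # made-up values
for epoch, loss in enumerate(val_losses):
    tracker.update(epoch, loss, weights={"epoch": epoch})
```

Because later epochs with higher validation loss never overwrite the stored weights, any overfitting that appears late in the loss curves does not reach the saved model.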
Computational resources requirements
Other than the powerful Linux workstation, ImageDoubler was also tested on a Windows personal computer with an Intel i7-6700K CPU, 16 GB RAM, and a GTX 1070 Ti 7 G GPU. We re-trained the model and ran the full cross-resolution validation using labeler 1’s data on this machine. The models would take 11–16 min for one-epoch training with a batch size of 4 in the first 10 epochs and 2 in the remaining 20 epochs. The performances were 0.940 in balanced accuracy and 0.948 in weighted F1 scores, comparable to the results obtained from training on the workstation. These findings suggest that most of the recent PCs or laptops with discrete graphics cards are capable of handling the computational demands of ImageDoubler.
Performance evaluations
We evaluated the performance of our model with four metrics: average accuracy (Acc), balanced accuracy (Accbalanced), average F1-score (F1), and weighted F1-scores (F1weighted), and visualized its accuracy in a confusion matrix. Given the numbers of True Doublet (ntd), True Singlet (nts), and True Missing (ntm) detected by the model, and the nd, ns, nm for the total numbers of Doublets, Singlets, Missing images in the ground truth, the accuracies can be written as follows:
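$$\mathrm{Acc} = \frac{n_{td} + n_{ts} + n_{tm}}{n_d + n_s + n_m} \tag{1}$$
$$\mathrm{Acc}_{balanced} = \frac{1}{3}\left(\frac{n_{td}}{n_d} + \frac{n_{ts}}{n_s} + \frac{n_{tm}}{n_m}\right) \tag{2}$$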
As for the multi-class F1-scores, we first counted the true-positive (TP), false-positive (FP), and false negative (FN) for each class, for example, the Doublet, generating the Precision, Recall, and then calculated the F1 for each class (F1d, F1s, F1m). With such notations, our F1-scores metrics were:
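$$F1 = \frac{1}{3}\left(F1_d + F1_s + F1_m\right) \tag{3}$$
$$F1_{weighted} = \frac{n_d \, F1_d + n_s \, F1_s + n_m \, F1_m}{n_d + n_s + n_m} \tag{4}$$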
All of the metrics were calculated using the functions from the Python scikit-learn package.
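For readers who want to check the definitions by hand, the four metrics can be computed without any library as sketched below. The function name and the toy labels are hypothetical. The text does not say which scikit-learn functions were called; `accuracy_score`, `balanced_accuracy_score`, and `f1_score` with `average="macro"` or `average="weighted"` are the usual counterparts.

```python
from collections import Counter


def evaluate(y_true, y_pred, classes=("Doublet", "Singlet", "Missing")):
    """Return accuracy, balanced accuracy, macro F1, and support-weighted F1."""
    n = len(y_true)
    support = Counter(y_true)      # n_d, n_s, n_m
    predicted = Counter(y_pred)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # n_td, n_ts, n_tm
    recalls, f1s = {}, {}
    for c in classes:
        tp = correct[c]
        recall = tp / support[c] if support[c] else 0.0
        precision = tp / predicted[c] if predicted[c] else 0.0
        recalls[c] = recall
        denom = precision + recall
        f1s[c] = 2 * precision * recall / denom if denom else 0.0
    return {
        "acc": sum(correct[c] for c in classes) / n,
        "acc_balanced": sum(recalls.values()) / len(classes),
        "f1": sum(f1s.values()) / len(classes),
        "f1_weighted": sum(support[c] * f1s[c] for c in classes) / n,
    }


# Toy example: 6 singlets, 2 doublets, 2 missing blocks (made-up labels).
y_true = ["Singlet"] * 6 + ["Doublet"] * 2 + ["Missing"] * 2
y_pred = ["Singlet"] * 5 + ["Doublet"] + ["Doublet", "Singlet"] + ["Missing"] * 2
metrics = evaluate(y_true, y_pred)
```

The balanced and macro variants weight each class equally, so a method that misses most doublets is penalized even when singlets dominate the data.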
Selecting and adjusting the confidence thresholds
To determine the appropriate confidence threshold for use in test data and downstream analyses, we analyzed performance on the validation data from the LOOCV experiments to identify the threshold that consistently provided the best results across different metrics and labelers. We conducted inferences on the validation image sets using the best weights saved for the five single models and collected scores for four evaluation metrics (average accuracy, balanced accuracy, average F1-score, weighted average F1-score) for each model in every fold and for both labelers. The thresholds were then ranked in ascending order based on these metric scores, with each threshold assigned a score from 1 to 6 for each metric (with 6 indicating the best performance). The threshold with the highest overall ranking, which was 0.7 in our datasets, was selected as the optimal threshold for the test data (Supplementary Fig. 8).
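The rank-based selection can be sketched as follows. For brevity this version ranks one set of metric scores per threshold, whereas the procedure above pools scores across models, folds, and labelers. The candidate thresholds and all metric values below are invented, and ties are broken by input order, which is an assumption.

```python
def rank_thresholds(scores):
    """scores: {threshold: [metric values]} -> {threshold: summed rank points}.

    For each metric, thresholds are sorted in ascending order of score and
    receive 1 (worst) to len(scores) (best) points.
    """
    thresholds = list(scores)
    n_metrics = len(next(iter(scores.values())))
    points = {t: 0 for t in thresholds}
    for m in range(n_metrics):
        ordered = sorted(thresholds, key=lambda t: scores[t][m])
        for rank, t in enumerate(ordered, start=1):
            points[t] += rank
    return points


# Columns: accuracy, balanced accuracy, average F1, weighted F1 (made-up values).
scores = {
    0.4: [0.90, 0.85, 0.84, 0.90],
    0.5: [0.91, 0.87, 0.86, 0.91],
    0.6: [0.93, 0.89, 0.88, 0.92],
    0.7: [0.94, 0.91, 0.90, 0.94],
    0.8: [0.92, 0.90, 0.89, 0.93],
    0.9: [0.89, 0.86, 0.85, 0.89],
}
points = rank_thresholds(scores)
best_threshold = max(points, key=points.get)
```

Summing ranks rather than raw scores keeps any single metric with a wide numeric range from dominating the choice.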
For users applying the ImageDoubler framework to train a custom model, we recommend training multiple models on a variety of data, then ensembling the models, evaluating the performance of each model at different thresholds, and selecting the one that performs best on validation data. For inference, we suggest starting with our pre-trained models and a confidence score threshold of 0.7. Moreover, because our method is trained with image data, users can verify the inference results directly by visualizing the bounding boxes and adjusting the thresholds. Scripts and guidance for these visualizations are available at https://github.com/GuanLab/ImageDoubler.
Extract expression profile
To evaluate the model predictions with the corresponding gene expressions as well as benchmark against existing genomic-based doublet detection software, we generated the gene expression profiles of image set 5, a mixture of cell lines SUM149 and H1975, and image set 11, a mixture of cell lines SUM149 and SUM190. We collected RNA sequencing data in FASTQ format from C1 mRNA Seq HT IFC. Then, we used the C1™ mRNA Sequencing High Throughput Demultiplexer Perl script v2.0.135 to demultiplex row barcodes of column samples and got the processed FASTQ of individual blocks. We used Kallisto36 with Ensembl GRCh38 as the transcriptome reference to map the expression levels. We obtained 190432 × 800 transcript expression matrices from the above steps and aggregated the transcripts to 29643 genes using the “tximport” package in R and the GTF annotations of GRCh38 assembly release 108.
For the following doublet-detection analyses, we used these raw gene expression matrices and the filtered matrices without the “Missing” samples as the inputs, as required by the various doublet-detection methods. No normalization steps were applied to the raw data before detection, ensuring that the doublet-detection methods operated on the unprocessed data. Further gene and cell filterings would be only applied where necessary, following the specific guidelines of each doublet-detection method.
Benchmark settings
The benchmark experiments of the doublet-detection methods used the gene expression matrix with or without the “Missing” blocks as the inputs. Except for EmptyDrops, these methods did not involve a further split between training and testing data; they were evaluated on the same data used for training. Parameters of the methods were set to their recommended values or default values if no recommendation was available. We also went through a series of hyperparameters for data pre-processing and post-processing and reported the best performance in the F1 score for each method. Detailed configurations of the methods are summarized below:
EmptyDrops (from DropletUtils version 1.22.0)24: We used the function “emptyDrops” to detect the “Missing” samples. One option was to set the “lower” parameter, a numeric scalar specifying the lower bound on the total gene counts, at or below which all barcodes (blocks’ IDs in our data) were assumed to correspond to empty droplets. However, the default value of 100 was lower than the minimum count of the actual data (556 for image set 5 and 819 for image set 11), which meant that, with this value, no blocks qualified as assumed empty droplets for the algorithm to continue the inference, and the function would raise an error and quit.
An appropriate solution for this issue was estimating the “lower” value from a part of the real “Missing” samples. The parameter “known.empty” implemented a similar functionality. Therefore, we randomly chose 20 (10.19% of all missing samples in image set 5, 21.97% of image set 11) missing samples as the known empties and provided their indexes to the parameter “known.empty”. The number of iterations (niters) was 10000, which was large enough for our data to ensure no non-significant barcode was “TRUE” for “Limited”25. The FDR (false discovery rate) threshold for determining whether a sample was missing or not was 0.001. We repeated the detection steps 10 times for each image set.
DoubletDetection (version 4.2)21: The data pre-processing steps included: (1) selecting cells (blocks in our data) that had at least 1 or 10 genes expressed, or simply selecting all cells (blocks); (2) selecting the top 2000 or 5000 highly variable genes, or using all genes. These selections were implemented with the Scanpy module37. The BoostClassifier’s parameters were set as n_iters = 30, clustering_algorithm = “louvain”, standard_scaling = True, and pseudocount = 0.1, with the others left at their defaults. The vote threshold (voter_thresh) and the p-value threshold (p_thresh) for generating the predictions were selected within [0.3, 0.4, 0.5, 0.6] and the range from 0.01 to 0.1, respectively. P-value thresholds around the default value (1e−7) could not detect the doublets (Supplementary Fig. 23).
Scrublet (version 0.2.3)6: Scrublet received the gene expression matrix, setting the expected_doublet_rate as 0.06. The other parameters were set as their default values following the guidance and example script in Scrublet’s GitHub repository, where min_cells = 3 (genes expressed in at least this number of cells), min_counts = 3 (genes had at least this number of counts), n_prin_comps = 30 (number of principal components for embedding), min_gene_variability_pctl = 85 (threshold of v-statistic to select highly variable genes), and sim_doublet_ratio = 2.0 (number of simulated doublets relative to number of cells). The threshold for calling the doublets was selected between 0.06 and 0.22, with a step of 0.02. As the expression profiles in our data did not present significant differences between the singlet and doublet samples, Scrublet might fail to generate workable automatic thresholds for the classifications based on its doublet scores (Supplementary Fig. 24).
Solo (from scvi-tools version 1.0.4)22: The data pre-processing steps were the same as those for DoubletDetection. Following the guidelines to use Solo for training and predicting the doublets, we first created and trained a variational auto-encoder (VAE) model with the function “scvi.model.SCVI” in the scvi-tools package and the processed data and then created and trained the solo model with the function “scvi.external.SOLO.from_scvi_model” using the VAE model as the input. The binary outputs “singlet” and “doublet” could be retrieved by calling the “predict” method of the solo model with the parameter soft = False.
Socube (version 1.1)23: SoCube operated directly on the gene matrix with raw counts without additional filtering or normalization steps before doublet detection. The detection process was initiated using the command “socube -i input_h5ad -o out_dir --gpu-ids 0”, with all other parameters set to their default values. The resulting doublet scores were then converted to binary classifications using a series of score thresholds (0.3, 0.4, 0.5, 0.6, 0.7).
DoubletFinder (version 2.0.4)7: The gene expression matrix was first read in R and converted to a Seurat object. Following the instructions in its GitHub repository, we first processed the data in the pipeline, including normalization (“NormalizeData”), identifying outliers (“FindVariableFeatures”), scaling and centering features (“ScaleData”), and generating the cluster-related information (“RunPCA”, “FindNeighbors” with the dims = 1:20, “FindClusters”, and “RunUMAP” with the dims = 1:20). Next, we identified the PC (principal component) neighborhood sizes (pKs) with no ground-truth using 20 principal components, and estimated the homotypic doublet proportion (nExp) assuming 5% doublet formation rate for image set 5 and 15% for image set 11 based on the true doublet percentages in our data. Finally, we ran the “doubletFinder” function to detect the doublets with pKs, nExp, and the other parameters: PCs = 1:20, pN = 0.25 (artificial doublets, expressed as a proportion of the merged real-artificial data), reuse.pANN = FALSE (pANN denotes the cell’s proportion of artificial nearest neighbors), sct = FALSE (whether SCTransform was used during original Seurat object pre-processing).
scDblFinder (version 1.16)9: The gene expression matrix was first read in R and converted to a SingleCellExperiment object. The top 1000, 2000, 3000, and 5000 genes, or all genes, were used for doublet detection.
scds (version 1.18)10: The gene expression matrix was first read in R and converted to a SingleCellExperiment object. Three functions, cxds, bcds, and cxds_bcds_hybrid, were called to generate the doublet scores. To retrieve the classification results “Singlet” and “Doublet” from these scores, we first normalized each of them within 0–1, and then applied the threshold from 0.3 to 0.7 with the step of 0.1. Blocks with scores higher than the threshold were considered doublets; the rest were singlets. We retrieved the corresponding F1 scores for the thresholds by comparing the classification results with the block labels and selected the threshold with the best F1 score as the optimal one.
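The score-to-label conversion described for scds can be sketched as a min-max normalization followed by a threshold sweep scored by doublet F1. The helper names and the toy scores are hypothetical, and a constant score vector (which would divide by zero) is not handled.

```python
def min_max(scores):
    """Rescale scores linearly to the 0-1 range (assumes they are not all equal)."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def doublet_f1(is_doublet, calls):
    """F1 score of the doublet class from boolean ground truth and predictions."""
    tp = sum(1 for t, c in zip(is_doublet, calls) if t and c)
    fp = sum(1 for t, c in zip(is_doublet, calls) if not t and c)
    fn = sum(1 for t, c in zip(is_doublet, calls) if t and not c)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_threshold(raw_scores, is_doublet, thresholds=(0.3, 0.4, 0.5, 0.6, 0.7)):
    """Return (threshold with the highest F1, {threshold: F1})."""
    scaled = min_max(raw_scores)
    results = {}
    for t in thresholds:
        calls = [s > t for s in scaled]  # scores above the threshold are doublets
        results[t] = doublet_f1(is_doublet, calls)
    return max(results, key=results.get), results


# Made-up doublet scores for six blocks; the last two are true doublets.
raw = [0, 20, 35, 65, 85, 100]
truth = [False, False, False, False, True, True]
chosen, f1_by_threshold = best_threshold(raw, truth)
```

Note that picking the threshold with the best F1 against the block labels gives each genomic-based method its most favourable operating point in the benchmark.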
Visualization by UMAP
We applied UMAP38 to visualize the gene expression profiles of image sets 5 and 11. First, we normalized the raw counts to counts-per-million (CPM) and log-transformed them after adding 1. Then, we selected the top 3000 highly variable genes, scaled them to unit variance and zero mean, ran PCA, and used 50 principal components to generate UMAPs. These PCs explained 20.26% and 17.88% of the variance in the data for image sets 5 and 11, respectively. The low proportions might result from the extensive technical noise in scRNA-seq data, which did not exhibit a correlation structure and, therefore, could not be explained in a low-dimensional space. These steps were mainly implemented with Scanpy. The parameters for generating the UMAPs were all default values. The UMAPs in Fig. 3E, F visualized both the entire data and the data with valid ground-truths.
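The normalization step at the start of this pipeline can be illustrated without any dependencies. The function name is hypothetical and the natural logarithm is an assumption (it is the default in Scanpy's `sc.pp.log1p`); in Scanpy, `sc.pp.normalize_total(adata, target_sum=1e6)` followed by `sc.pp.log1p(adata)` would be the usual equivalent.

```python
import math


def log_cpm(counts):
    """counts: per-block lists of raw gene counts -> log(CPM + 1) values."""
    out = []
    for block in counts:
        total = sum(block)
        # Blocks with no reads are left at zero to avoid dividing by zero.
        out.append([math.log1p(c * 1e6 / total) if total else 0.0 for c in block])
    return out


# Two toy blocks with three genes each (made-up counts).
normalized = log_cpm([[1, 1, 2], [0, 0, 0]])
```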
Cell clustering and differential expression analysis
Cell clusterings on the doublet detection methods’ inferred singlets were implemented using the function scanpy.tl.leiden from Scanpy, setting the resolution to 0.5. Adjusted mutual information was calculated using the scikit-learn function. Differential expression (DE) analyses were implemented with DESeq239. The count matrix was aggregated using the package tximport. We performed the DE experiments on the groups of true cell types, treating the blocks as repetitions in the groups. Different levels of stringency for identifying the significant DE genes were set on the absolute log2 fold changes (higher than 0.99 or no restriction) and the adjusted p-values (lower than 0.05 or 0.005) (Fig. 5).
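The DEG comparison can be sketched as two small functions: one applies the stringency cutoffs to per-gene statistics, and the other scores a DEG set against the gold standard. The function names, gene names, and statistics are all made up; only the cutoff pairs and the three metrics come from the text, and genes with missing adjusted p-values are not handled.

```python
def call_degs(results, padj_cutoff=0.05, lfc_cutoff=0.0):
    """results: {gene: (log2_fold_change, adjusted_p)} -> set of significant genes."""
    return {
        gene
        for gene, (lfc, padj) in results.items()
        if padj < padj_cutoff and abs(lfc) >= lfc_cutoff
    }


def deg_metrics(predicted, gold, all_genes):
    """Precision, recall, and true-negative rate of a DEG set against the gold set."""
    tp = len(predicted & gold)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    tn = len(all_genes) - tp - fp - fn
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "tnr": tn / (tn + fp) if tn + fp else 0.0,
    }


# Made-up statistics: gold standard (hand-labeled singlets) vs. inferred singlets.
gold_stats = {"G1": (2.0, 0.001), "G2": (-1.5, 0.003), "G3": (0.5, 0.01),
              "G4": (0.1, 0.8), "G5": (1.2, 0.2)}
test_stats = {"G1": (1.8, 0.002), "G2": (-0.4, 0.04), "G3": (0.6, 0.02),
              "G4": (1.1, 0.004), "G5": (0.2, 0.6)}
genes = set(gold_stats)

loose = deg_metrics(call_degs(test_stats), call_degs(gold_stats), genes)
strict = deg_metrics(call_degs(test_stats, 0.005, 0.99),
                     call_degs(gold_stats, 0.005, 0.99), genes)
```

Running the same comparison at several cutoff pairs shows whether a method's DEG accuracy holds up as the significance criteria tighten.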
Statistical tests
All significance claims were based on p-values from two-sample Wilcoxon tests computed with the built-in R function wilcox.test. Statistical tests for the differences in balanced accuracy and weighted F1 score used the performance scores from all folds and all confidence thresholds (Fig. 2). Statistical comparisons of gene read counts and numbers of non-zero genes used values from the blocks with available labels or predictions (Fig. 3). No statistical method was used to predetermine sample size, and no data were excluded.
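For readers unfamiliar with the two-sample Wilcoxon (rank-sum) test, the following Python sketch shows the computation using the normal approximation with tie and continuity corrections. It is not the authors' code, which relied on R's wilcox.test. R can also report exact p-values for small samples without ties, so results from this sketch may differ from R's defaults in that case. The function name `rank_sum_test` is hypothetical.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test via the normal approximation.

    Returns (W, p), where W is the rank sum of x minus its minimum possible value.
    """
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values occupy ranks i+1..j and all receive the average rank.
        avg = (i + 1 + j) / 2.0
        for k in range(i, j):
            ranks[k] = avg
        t = j - i
        tie_term += t ** 3 - t
        i = j
    r1 = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    w = r1 - n1 * (n1 + 1) / 2.0
    n = n1 + n2
    mean = n1 * n2 / 2.0
    var = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        return w, 1.0
    # Continuity correction of 0.5, floored at zero.
    z = max((abs(w - mean) - 0.5) / math.sqrt(var), 0.0)
    p = math.erfc(z / math.sqrt(2))
    return w, min(p, 1.0)


# Toy example with made-up values.
toy_w, toy_p = rank_sum_test([1, 2, 3], [4, 5, 6])
same_w, same_p = rank_sum_test([1, 2, 3], [1, 2, 3])
```

With three observations per group, even fully separated samples do not reach significance at the 0.05 level under this approximation, which illustrates why the comparisons pooled scores across folds and thresholds.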
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Acknowledgements
Y.G. is supported by NIH R35-GM133346. L.G. is supported by R01-LM012373 and R01-LM012907 awarded by NLM, and by R01-HD084633 awarded by NICHD. The authors would like to thank Ebrahim Azizi for providing the raw image data and the Rogel Center Single Cell Spatial Analysis Shared Resource for providing the corresponding expression data.
Author contributions
M.Z. and Y.G. designed the projects, collected data, and conducted the preliminary experiments. K.D. and X.X. labeled the images, implemented and finalized the methods. K.D. and M.Z. designed and implemented the benchmark experiments, organized the paper, and made the figures. E.K. and G.S. provided the extra image and sequencing data for analyses. K.D. and X.X. wrote the manuscript. K.D., X.X., M.Z., H.L., Y.G., and L.G. read, edited, and approved the manuscript.
Peer review
Peer review information
Nature Communications thanks Annalisa Occhipinti, who co-reviewed with Le Minh Thao Doan, Feng Zhu, and the other, anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Data availability
The datasets and the weights of the trained models are available in this figshare repository (10.6084/m9.figshare.27606801.v2). The repository includes (1) image data: the snapshots, the split and cropped block images, the templates for the Fluidigm experiments, and the labels for the images; (2) expression data: the processed expression matrices of raw read counts for datasets 5, 11, and “Extra”; and (3) model weights: the weights for the cross-validation folds and for the ensemble models trained with the labels from both labelers. Raw sequencing results for datasets 5 and 11 can be accessed through BioProject ID PRJNA1195642. Data and code for generating the figures are also provided with this paper in the Source Data file. Source data are provided with this paper.
Code availability
The source code is available on GitHub (https://github.com/GuanLab/ImageDoubler) and Zenodo (10.5281/zenodo.14035928).
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
The online version contains supplementary material available at 10.1038/s41467-024-55434-0.
References
- 1. Lawson, D. A., Kessenbrock, K., Davis, R. T., Pervolarakis, N. & Werb, Z. Tumour heterogeneity and metastasis at single-cell resolution. Nat. Cell Biol. 20, 1349–1360 (2018).
- 2. Chen, H., Ye, F. & Guo, G. Revolutionizing immunology with single-cell RNA sequencing. Cell. Mol. Immunol. 16, 242–249 (2019).
- 3. Ofengeim, D., Giagtzoglou, N., Huh, D., Zou, C. & Yuan, J. Single-cell RNA sequencing: Unraveling the brain one cell at a time. Trends Mol. Med. 23, 563–576 (2017).
- 4. Knouse, K. A., Wu, J., Whittaker, C. A. & Amon, A. Single cell sequencing reveals low levels of aneuploidy across mammalian tissues. Proc. Natl. Acad. Sci. USA 111, 13409–13414 (2014).
- 5. Wang, Y. J. et al. Comparative analysis of commercially available single-cell RNA sequencing platforms for their performance in complex human tissues. bioRxiv 10.1101/541433 (2019).
- 6. Wolock, S. L., Lopez, R. & Klein, A. M. Scrublet: computational identification of cell doublets in single-cell transcriptomic data. Cell Syst. 8, 281–291.e9 (2019).
- 7. McGinnis, C. S., Murrow, L. M. & Gartner, Z. J. DoubletFinder: doublet detection in single-cell RNA sequencing data using artificial nearest neighbors. Cell Syst. 8, 329–337.e4 (2019).
- 8. Bernstein, N. J. et al. Solo: doublet identification in single-cell RNA-seq via semi-supervised deep learning. Cell Syst. 11, 95–101.e5 (2020).
- 9. Germain, P.-L., Lun, A., Garcia Meixide, C., Macnair, W. & Robinson, M. D. Doublet identification in single-cell sequencing data using scDblFinder. F1000Res. 10, 979 (2021).
- 10. Bais, A. S. & Kostka, D. scds: computational annotation of doublets in single-cell RNA sequencing data. Bioinformatics 36, 1150–1158 (2020).
- 11. Xi, N. M. & Li, J. J. Benchmarking computational doublet-detection methods for single-cell RNA sequencing data. Cell Syst. 12, 176–194.e6 (2021).
- 12. Fluidigm | Products | C1. https://www.fluidigm.com/products/c1-system.
- 13. Website. From Fluidigm website: https://www.fluidigm.com/products/c1-system.
- 14. Caicedo, J. C. et al. Data-analysis strategies for image-based cell profiling. Nat. Methods 14, 849–863 (2017).
- 15. Pennisi, E. IMAGING. ‘Cell painting’ highlights responses to drugs and toxins. Science 352, 877–878 (2016).
- 16. Rees, P., Summers, H. D., Filby, A., Carpenter, A. E. & Doan, M. Imaging flow cytometry: a primer. Nat. Rev. Methods Primers 2, 86 (2022).
- 17. Lu, A. X. & Moses, A. M. Using dimensionality reduction to visualize phenotypic changes in high-throughput microscopy. Methods Mol. Biol. 2800, 217–229 (2024).
- 18. [No title]. https://www.fluidigm.com/binaries/content/documents/fluidigm/resources/c1-system-for-mrna-seq-medium-cell-ht-pr-101-4964/c1-system-for-mrna-seq-medium-cell-ht-pr-101-4964/fluidigm%3Afile.
- 19. [No title]. http://cn.fluidigm.com/binaries/content/documents/fluidigm/search/hippo%3Aresultset/c1-ht-medium-cell-rea-seq-tn-101-8692/fluidigm%3Afile.
- 20. Ren, S., He, K., Girshick, R. & Sun, J. Faster R-CNN: towards real-time object detection with region proposal networks. arXiv [cs.CV] (2015).
- 21. Shor, C. A. G. JonathanShor/DoubletDetection: Doubletdetection v4.2. 10.5281/zenodo.6349517.
- 22. Bernstein, N. J. et al. Solo: doublet identification in single-cell RNA-seq via semi-supervised deep learning. Cell Syst. 11, 95–101.e5 (2020).
- 23. Zhang, H. et al. SoCube: an innovative end-to-end doublet detection algorithm for analyzing scRNA-seq data. Brief. Bioinform. 24, bbad104 (2023).
- 24. Lun, A. T. L. et al. EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data. Genome Biol. 20, 63 (2019).
- 25. Chapter 7 Droplet processing. https://bioconductor.org/books/3.17/OSCA.advanced/droplet-processing.html.
- 26. Luecken, M. D. & Theis, F. J. Current best practices in single‐cell RNA‐seq analysis: a tutorial. Mol. Syst. Biol. 15, e8746 (2019).
- 27. Street, K. et al. Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genom. 19, 477 (2018).
- 28. Wu, S. et al. Cellular, transcriptomic, and isoform heterogeneity of breast cancer cell line revealed by full-length single-cell RNA sequencing. Comput. Struct. Biotechnol. J. 18, 676–685 (2020).
- 29. Howell, L., Anagnostidis, V. & Gielen, F. Multi‐object detector YOLOv4‐tiny enables high‐throughput combinatorial and spatially‐resolved sorting of cells in microdroplets. Adv. Mater. Technol. 7, 2101053 (2022).
- 30. Chlis, N.-K., Rausch, L., Brocker, T., Kranich, J. & Theis, F. J. Predicting single-cell gene expression profiles of imaging flow cytometry data with machine learning. Nucleic Acids Res. 48, 11335–11346 (2020).
- 31. Tzutalin. LabelImg. Git code. https://github.com/tzutalin/labelImg (2015).
- 32. Cai, L. et al. BigDetection: A large-scale benchmark for improved object detector pre-training. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) 4776–4786 (2022).
- 33. He, K., Gkioxari, G., Dollár, P. & Girshick, R. Mask R-CNN. In 2017 IEEE International Conference on Computer Vision (ICCV) 2980–2988 (2017).
- 34. Everingham, M. et al. The pascal visual object classes challenge: a retrospective. Int. J. Comput. Vis. 111, 98–136 (2015).
- 35. Software. https://www.standardbio.com/products/software.
- 36. Bray, N. L., Pimentel, H., Melsted, P. & Pachter, L. Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 34, 525–527 (2016).
- 37. Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
- 38. McInnes, L., Healy, J., Saul, N. & Großberger, L. UMAP: uniform manifold approximation and projection. J. Open Source Softw. 3, 861 (2018).
- 39. Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).





