Briefings in Bioinformatics. 2025 Oct 13;26(5):bbaf538. doi: 10.1093/bib/bbaf538

OmniDoublet: a method for doublet detection in multimodal single-cell sequencing data

Lian Liu 1,#, Jiayi Ren 2,#, Xiaoxu Zhou 3, Xiang Cheng 4, Xiaoqing Pan 5, Liyuan Zhou 6, Yan Lu 7,8,9, Pengyuan Liu 10,11,12
PMCID: PMC12516964  PMID: 41081728

Abstract

Doublets in single-cell sequencing data, caused by the simultaneous capture of two or more cells within a single reaction volume, introduce biases that compromise downstream analysis. Existing doublet detection methods primarily focus on single-modality data and exhibit limited robustness across datasets. To overcome these limitations, we developed OmniDoublet, a multimodal doublet detection method that integrates transcriptomic and epigenomic data. OmniDoublet leverages the Jaccard similarity coefficient to calculate weights that assess the reliability of neighboring cells across modalities, combining doublet scores from different modalities into a final integrated score. It further employs a Gaussian mixture model (GMM) to establish thresholds, enabling accurate binary classification of cells as singlets or doublets based on the integrated score. OmniDoublet offers a robust framework for detecting doublets across diverse scenarios. Benchmarking against state-of-the-art methods across various datasets demonstrates that OmniDoublet achieves superior accuracy, robustness, and scalability. By harnessing the comprehensive information from multimodal single-cell data, OmniDoublet enhances doublet detection, enabling researchers to gain more accurate and reliable insights into cellular processes.

Keywords: single-cell sequencing, doublet detection, multimodal

Introduction

The rapid advancement of single-cell sequencing technologies has revolutionized our ability to measure genomic, transcriptomic, epigenomic, and proteomic data at single-cell resolution [1–5], providing critical insights into the status, characteristics, and functions of diverse cell types. Technologies such as 10× Multiome and CITE-seq [3] allow for the simultaneous capture of multimodal data from the same cell, including gene expression, chromatin accessibility, and surface protein levels. These innovations have empowered researchers to uncover pivotal biological insights, such as the role of the pentose phosphate pathway in thymic macrophages [6], tumor heterogeneity, and tumor-immune microenvironment interactions in cervical squamous cell carcinoma [7]. These studies underscore the importance of single-cell multi-omics in understanding cellular processes, disease mechanisms, and therapeutic targets, offering novel perspectives for biomedical research.

Current single-cell sequencing platforms primarily employ microfluidics-based approaches to isolate individual cells, encapsulating them into oil droplets via microfluidic chips. However, doublets or multiplets—instances where two or more cells are inadvertently captured together—are common artifacts not limited to droplet-based systems; they also occur in microwell- or plate-based platforms. Such artifacts can severely bias downstream analyses, making accurate doublet detection and removal essential.

Doublet detection methods can be broadly classified into two categories. The first category includes demultiplexing methods, which utilize external barcoding labels (e.g. Cell Hashing [8] or MULTI-seq [9]) or genetic variation between individuals (e.g. demuxlet [10] and Vireo [11]) to identify doublets formed by cells from different donors. While these methods achieve high accuracy, they cannot detect doublets formed by cells from the same individual [12]. The second category includes computational methods that rely solely on single-cell data, such as transcriptional profiles, to detect doublets. Most of these methods are simulation-based, like Scrublet [13] and DoubletFinder [14], which synthesize artificial doublets by summing the expression profiles of two droplets in the raw data and then predict real doublets based on their similarity to the simulated doublets. In contrast, methods like AMULET [15], designed specifically for scATAC-seq data, detect doublets by enumerating genomic regions with more than two uniquely aligned reads. Despite their utility, current doublet detection methods are predominantly tailored for unimodal data and fail to fully leverage the comprehensive information available in single-cell multi-omics datasets. Additionally, performance variability among these methods has been widely documented, with no single approach consistently outperforming others across all evaluations [16].

In this paper, we first evaluated 10 existing doublet detection methods on the peripheral blood mononuclear cell (PBMC) dataset [17]. Our analysis reveals significant inconsistencies in doublet detection results among these methods when applied to the same dataset, as well as the varying performance across different modalities of the same data. To address these limitations and fully harness the potential of single-cell multimodal information, we propose OmniDoublet, a novel doublet detection method. OmniDoublet first generates artificial doublets and calculates doublet scores based on the similarity between real cells and simulated doublets. Next, OmniDoublet integrates multimodal data by computing a Jaccard similarity-based weight to assess neighbor reliability across modalities, combining doublet scores into a final integrated score. OmniDoublet then employs a Gaussian mixture model (GMM) to establish thresholds, enabling the assignment of binary labels (singlet or doublet) based on the integrated score. We benchmarked OmniDoublet against nine existing methods using semisynthetic 10× Multiome datasets, demonstrating its superior accuracy, robustness, and scalability across different doublet types and dataset sizes. To further validate its performance on real datasets, we compared OmniDoublet with COMPOSITE on publicly available cell hashing–annotated DOGMA-seq datasets, consistently observing OmniDoublet’s advantages. Beyond doublet detection, OmniDoublet improved downstream analysis, including cell clustering, differential gene expression, and peak analysis, and showed scalability when applied to CITE-seq data. Finally, we provide a comprehensive evaluation of computational requirements, offering a practical and reliable solution for doublet detection in single-cell multi-omics.

Methods

OmniDoublet

OmniDoublet is a method for doublet detection in single-cell multi-omics data that takes the count matrices from scRNA-seq and scATAC-seq or scADT-seq as inputs. The workflow consists of five key steps: (1) doublet simulation, (2) data preprocessing, (3) unimodal nearest neighbor calculation, (4) multimodal information integration, and (5) threshold calculation and prediction.

Doublet simulation

We simulate doublets by randomly combining single-cell profiles from the original dataset (OD). For type-specific doublet simulation, the single-cell data is preclustered to enable two types of simulations: homogeneous doublets, synthesized by combining cells from the same cluster, and heterogeneous doublets, synthesized by combining cells from different clusters. Additionally, the method includes an option to normalize the synthesized doublets for consistency with the OD.
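As an illustration only, the simulation step can be sketched in plain Python. The function name `simulate_doublets`, the choice to sum the two parent count vectors, and the rejection-sampling loop are assumptions made for this sketch; they are not taken from the OmniDoublet implementation.

```python
import random

def simulate_doublets(counts, clusters, n_doublets, mode="random", seed=0):
    """Form artificial doublets by summing the count vectors of two distinct cells.

    counts   : list of per-cell count vectors (equal-length lists of numbers)
    clusters : precomputed cluster label for each cell
    mode     : "random", "homogeneous" (same cluster) or
               "heterogeneous" (different clusters)
    """
    if mode not in ("random", "homogeneous", "heterogeneous"):
        raise ValueError(f"unknown mode: {mode}")
    if len(counts) < 2:
        raise ValueError("at least two cells are required")
    sizes = {}
    for label in clusters:
        sizes[label] = sizes.get(label, 0) + 1
    if mode == "heterogeneous" and len(sizes) < 2:
        raise ValueError("heterogeneous doublets need at least two clusters")
    if mode == "homogeneous" and max(sizes.values()) < 2:
        raise ValueError("homogeneous doublets need a cluster with two or more cells")

    rng = random.Random(seed)
    doublets, parents = [], []
    while len(doublets) < n_doublets:
        a, b = rng.sample(range(len(counts)), 2)  # two distinct parent cells
        same_cluster = clusters[a] == clusters[b]
        # Rejection sampling: skip pairs that do not match the requested type.
        if mode == "homogeneous" and not same_cluster:
            continue
        if mode == "heterogeneous" and same_cluster:
            continue
        doublets.append([x + y for x, y in zip(counts[a], counts[b])])
        parents.append((a, b))
    return doublets, parents
```

Returning the parent indices alongside the profiles makes it possible to check afterwards which clusters each simulated doublet was drawn from.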

Data preprocessing

After integrating the original data with simulated doublets, a unified preprocessing pipeline is applied to all modalities. For scRNA-seq, the Scanpy pipeline is used, including quality control, normalization, and selection of highly variable genes. Principal Component Analysis (PCA) is then performed, and the resulting reduced features are used as inputs for the RNA modality. For scATAC-seq data, following quality control, the data is transformed using term frequency–inverse document frequency (TF-IDF), followed by dimension reduction via Latent Semantic Indexing (LSI). The reduced features are used as inputs for the ATAC modality. For scADT-seq, the data is processed using the centered log-ratio (CLR) transformation [3], yielding a feature matrix for the ADT modality.
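To make the two less familiar transformations concrete, the sketch below implements one common variant of each in plain Python. The pseudocount of 1 in `clr`, the `log(1 + N / df)` form of the inverse document frequency, and the function names are assumptions for illustration; the exact variants used by OmniDoublet are not specified here.

```python
import math

def clr(cell_counts):
    """Centered log-ratio for one cell: log(x + 1) minus its mean over features."""
    logs = [math.log(x + 1.0) for x in cell_counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def tf_idf(matrix):
    """TF-IDF for a cells-by-peaks count matrix given as a list of rows."""
    n_cells = len(matrix)
    n_peaks = len(matrix[0])
    # Document frequency: number of cells in which each peak is observed.
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_peaks)]
    idf = [math.log(1.0 + n_cells / d) if d > 0 else 0.0 for d in df]
    transformed = []
    for row in matrix:
        total = sum(row)  # per-cell depth used for the term frequency
        transformed.append(
            [(row[j] / total) * idf[j] if total > 0 else 0.0 for j in range(n_peaks)]
        )
    return transformed
```

In practice the TF-IDF matrix would then be passed to a truncated singular value decomposition to obtain the LSI components.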

Unimodal nearest neighbor calculation

For each modality, a K-Nearest Neighbor (KNN) graph is constructed using its respective feature matrix. Taking scRNA-seq and scATAC-seq data as examples, we construct KNN graphs for the RNA and ATAC modalities, respectively. Assuming the data contains $N$ cells and $k$ nearest neighbors, the distance between cell $i$ and its $j$-th nearest neighbor in the RNA modality is denoted as $d^{R}_{ij}$, and the distance between cell $i$ and its $j$-th nearest neighbor in the ATAC modality is denoted as $d^{A}_{ij}$. After computation for all cells, the distance matrices $D^{R}$ and $D^{A}$ are obtained for the RNA and ATAC modalities, respectively.

$$D^{R} = \left[ d^{R}_{ij} \right]_{N \times k}, \quad i = 1, \dots, N, \; j = 1, \dots, k$$
$$D^{A} = \left[ d^{A}_{ij} \right]_{N \times k}, \quad i = 1, \dots, N, \; j = 1, \dots, k$$
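The per-modality neighbor search can be sketched as a brute-force computation over the reduced features. Euclidean distance, the exclusion of each cell from its own neighbor list, and the name `knn_distances` are assumptions of this sketch; a real implementation would typically use an approximate or tree-based search.

```python
import math

def knn_distances(features, k):
    """For each cell, return the indices of and distances to its k nearest other cells.

    features : list of per-cell feature vectors (e.g. PCA or LSI coordinates)
    Returns (indices, distances), two N x k nested lists ordered from
    nearest to farthest neighbor.
    """
    n = len(features)
    indices, distances = [], []
    for i in range(n):
        # Distance from cell i to every other cell, sorted ascending.
        candidates = sorted(
            (math.dist(features[i], features[j]), j) for j in range(n) if j != i
        )
        nearest = candidates[:k]
        indices.append([j for _, j in nearest])
        distances.append([d for d, _ in nearest])
    return indices, distances
```

Running this once per modality yields one N x k distance table and one N x k neighbor index table for each of RNA and ATAC.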

The distance is normalized using the following formula to convert the distance into weights:

graphic file with name DmEquation3.gif

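where $\epsilon$ is a small positive constant. The weight matrices $W^{R}$ and $W^{A}$ for the RNA and ATAC modalities were then obtained:

$$W^{R} = \left[ w^{R}_{ij} \right]_{N \times k}$$
$$W^{A} = \left[ w^{A}_{ij} \right]_{N \times k}$$

where $w^{R}_{ij}$ and $w^{A}_{ij}$ are the weights of the $j$-th nearest neighbor of cell $i$ in the RNA and ATAC modalities, respectively.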

Integration of multimodal information

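By default, cells in the original data are labeled as singlets (label = 0), while simulated doublets are labeled as 1. For each cell $i$, the labels of its neighbors in each modality are used to construct the neighbor label matrices $L^{R}$ and $L^{A}$ for the RNA and ATAC modalities, respectively.

$$L^{R} = \left[ l^{R}_{ij} \right]_{N \times k}, \quad l^{R}_{ij} \in \{0, 1\}$$
$$L^{A} = \left[ l^{A}_{ij} \right]_{N \times k}, \quad l^{A}_{ij} \in \{0, 1\}$$

For cell $i$, the set of its RNA modal KNN neighbors is:

$$\mathcal{N}^{R}_{i} = \left\{ n^{R}_{i1}, n^{R}_{i2}, \dots, n^{R}_{ik} \right\}$$

the set of its ATAC modal KNN neighbors is:

$$\mathcal{N}^{A}_{i} = \left\{ n^{A}_{i1}, n^{A}_{i2}, \dots, n^{A}_{ik} \right\}$$

where $n^{R}_{ij}$ and $n^{A}_{ij}$ denote the index of the $j$-th nearest neighbor of cell $i$ in the respective modality. We define the Jaccard coefficient for the $i$-th cell as:

$$J_{i} = \frac{\left| \mathcal{N}^{R}_{i} \cap \mathcal{N}^{A}_{i} \right|}{\left| \mathcal{N}^{R}_{i} \cup \mathcal{N}^{A}_{i} \right|}$$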
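The Jaccard coefficient of the two neighbor sets of a cell is straightforward to compute. The helper below is a minimal sketch; the name `jaccard` and the convention of returning 0 for two empty sets are assumptions.

```python
def jaccard(neighbors_a, neighbors_b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections of neighbor indices."""
    a, b = set(neighbors_a), set(neighbors_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)
```

For example, a cell whose RNA neighbors are {1, 2, 3} and whose ATAC neighbors are {2, 3, 4} shares two of four distinct neighbors, giving a coefficient of 0.5; identical neighborhoods give 1 and disjoint neighborhoods give 0.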

The final doublet score is calculated as:

graphic file with name DmEquation11.gif

Extension of the formula to more than two modalities

Let $\mathcal{M}$ denote the set of modalities. For each modality $m \in \mathcal{M}$, we can compute the distance matrix $D^{m}$, weight matrix $W^{m}$, neighbor label matrix $L^{m}$, and the neighbor set $\mathcal{N}^{m}_{i}$ for cell $i$. For every modality pair $(m, m')$ with $m \neq m'$, the Jaccard index between their respective neighbor sets is defined as:

$$J^{(m, m')}_{i} = \frac{\left| \mathcal{N}^{m}_{i} \cap \mathcal{N}^{m'}_{i} \right|}{\left| \mathcal{N}^{m}_{i} \cup \mathcal{N}^{m'}_{i} \right|}$$

The average Jaccard coefficient for cell $i$ across all modality pairs is then given by:

$$\bar{J}_{i} = \frac{2}{|\mathcal{M}| \left( |\mathcal{M}| - 1 \right)} \sum_{m < m'} J^{(m, m')}_{i}$$
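With more than two modalities, the per-cell average is taken over all unordered modality pairs. The following sketch assumes the neighbor sets of one cell are held in a dictionary keyed by modality name; the function name `mean_pairwise_jaccard` and this data layout are hypothetical.

```python
from itertools import combinations

def mean_pairwise_jaccard(neighbor_sets):
    """Average Jaccard coefficient over all unordered pairs of modalities for one cell.

    neighbor_sets : dict mapping modality name -> iterable of neighbor indices
    """
    def jaccard(a, b):
        a, b = set(a), set(b)
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    pairs = list(combinations(sorted(neighbor_sets), 2))
    if not pairs:
        raise ValueError("at least two modalities are required")
    return sum(jaccard(neighbor_sets[m], neighbor_sets[n]) for m, n in pairs) / len(pairs)
```

With exactly two modalities there is a single pair, so the average reduces to the two-modality Jaccard coefficient.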

Finally, the multimodal doublet score is defined as:

graphic file with name DmEquation14.gif

Threshold calculation and prediction

A GMM with two components is used to represent the droplet scores. Theoretically, the dataset comprises two distinct categories: singlets and doublets. The droplet scores are modeled by the GMM as:

$$p(x) = \pi_{1} \, \mathcal{N}(x \mid \mu_{1}, \sigma_{1}^{2}) + \pi_{2} \, \mathcal{N}(x \mid \mu_{2}, \sigma_{2}^{2})$$

where $\pi_{1}$ and $\pi_{2}$ are the mixture weights (which sum to 1), $\mathcal{N}(\cdot)$ is the probability density function of a Gaussian distribution, $\mu_{1}$ and $\mu_{2}$ are the means of the two components, and $\sigma_{1}$ and $\sigma_{2}$ are their standard deviations. For each score, the posterior probability can be computed from the fitted GMM as:

$$\gamma_{k}(x) = \frac{\pi_{k} \, \mathcal{N}(x \mid \mu_{k}, \sigma_{k}^{2})}{\pi_{1} \, \mathcal{N}(x \mid \mu_{1}, \sigma_{1}^{2}) + \pi_{2} \, \mathcal{N}(x \mid \mu_{2}, \sigma_{2}^{2})}, \quad k \in \{1, 2\}$$

where $\gamma_{1}(x)$ and $\gamma_{2}(x)$ are the probabilities that $x$ belongs to the first or second Gaussian component. To identify the point where the dominant component changes, we solve for the value of $x$ that satisfies the equation:

$$\gamma_{1}(x) = \gamma_{2}(x)$$

Using the posterior probability formula, it can be written as:

$$\frac{\pi_{1} \, \mathcal{N}(x \mid \mu_{1}, \sigma_{1}^{2})}{\pi_{1} \, \mathcal{N}(x \mid \mu_{1}, \sigma_{1}^{2}) + \pi_{2} \, \mathcal{N}(x \mid \mu_{2}, \sigma_{2}^{2})} = \frac{\pi_{2} \, \mathcal{N}(x \mid \mu_{2}, \sigma_{2}^{2})}{\pi_{1} \, \mathcal{N}(x \mid \mu_{1}, \sigma_{1}^{2}) + \pi_{2} \, \mathcal{N}(x \mid \mu_{2}, \sigma_{2}^{2})}$$

This simplifies to finding $x$ such that:

$$\pi_{1} \, \mathcal{N}(x \mid \mu_{1}, \sigma_{1}^{2}) = \pi_{2} \, \mathcal{N}(x \mid \mu_{2}, \sigma_{2}^{2})$$

where $x$ represents the intersection of the two weighted Gaussian densities and is used as the threshold. Droplets with scores greater than the threshold are classified as doublets, while droplets with scores less than or equal to the threshold are classified as singlets.
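Given fitted mixture parameters, the intersection can be found in closed form: taking logarithms of both weighted densities turns the equality into a quadratic in x. The sketch below assumes the parameters have already been estimated elsewhere; the function names and the rule of keeping the root that lies between the two component means are choices made for this illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a Gaussian with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def gmm_intersection(pi1, mu1, s1, pi2, mu2, s2):
    """Solve pi1 * N(x | mu1, s1^2) = pi2 * N(x | mu2, s2^2) for x between the means.

    Taking logs gives a*x^2 + b*x + c = 0 with the coefficients below.
    Returns None if no root lies between mu1 and mu2.
    """
    a = 1.0 / (2.0 * s2 ** 2) - 1.0 / (2.0 * s1 ** 2)
    b = mu1 / s1 ** 2 - mu2 / s2 ** 2
    c = (mu2 ** 2 / (2.0 * s2 ** 2) - mu1 ** 2 / (2.0 * s1 ** 2)
         + math.log((pi1 * s2) / (pi2 * s1)))
    if abs(a) < 1e-12:
        # Equal variances: the equation is linear.
        roots = [-c / b] if abs(b) > 1e-12 else []
    else:
        disc = b * b - 4.0 * a * c
        if disc < 0:
            roots = []
        else:
            r = math.sqrt(disc)
            roots = [(-b - r) / (2.0 * a), (-b + r) / (2.0 * a)]
    lo, hi = sorted((mu1, mu2))
    between = [x for x in roots if lo <= x <= hi]
    return between[0] if between else None

def classify(scores, threshold):
    """True marks a doublet call: score strictly greater than the threshold."""
    return [s > threshold for s in scores]
```

When the two components have equal weights and variances, the threshold falls at the midpoint of the means; a heavier or narrower singlet component shifts it accordingly.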

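In addition to this binary classification, we compute the posterior probability that a droplet belongs to the doublet component based on the GMM. This continuous doublet score is defined as:

$$P(\text{doublet} \mid x) = \frac{\pi_{d} \, \mathcal{N}(x \mid \mu_{d}, \sigma_{d}^{2})}{\pi_{1} \, \mathcal{N}(x \mid \mu_{1}, \sigma_{1}^{2}) + \pi_{2} \, \mathcal{N}(x \mid \mu_{2}, \sigma_{2}^{2})}$$

where $d \in \{1, 2\}$ indexes the Gaussian component corresponding to doublets, and $\pi_{k}$, $\mu_{k}$, and $\sigma_{k}$ are the mixture weight, mean, and standard deviation of component $k$.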

This score provides a probabilistic confidence estimate for each droplet and is available as part of our method’s output, enabling users to manually select a threshold for classification based on their specific needs.
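A minimal sketch of this posterior follows, assuming the mixture parameters are already fitted and that the caller knows which component represents doublets (for instance, the one with the larger mean, which is an assumption of this example). The function names are hypothetical.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a Gaussian with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def doublet_posterior(x, pi_s, mu_s, sigma_s, pi_d, mu_d, sigma_d):
    """Posterior probability that score x was generated by the doublet component.

    (pi_s, mu_s, sigma_s) describe the singlet component and
    (pi_d, mu_d, sigma_d) the doublet component.
    """
    doublet = pi_d * normal_pdf(x, mu_d, sigma_d)
    singlet = pi_s * normal_pdf(x, mu_s, sigma_s)
    total = doublet + singlet
    return doublet / total if total > 0 else 0.0
```

A user preferring a stricter call could, for example, label a droplet as a doublet only when this probability exceeds 0.9 rather than 0.5.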

Relationship between doublet classification and cell population size

Assuming that the total number of cells in the dataset is $N$ and the number of cells in cell cluster $i$ is $N_{i}$, the total number of possible doublets in the dataset is:

$$D_{\text{total}} = \binom{N}{2} = \frac{N(N-1)}{2}$$

For cluster $i$, a doublet is considered to be associated with it if at least one of its two constituent cells originates from cluster $i$. This can occur in two scenarios: heterogeneous doublets, in which one cell belongs to cluster $i$ and the other comes from a different cluster, and homogeneous doublets, in which both cells are from cluster $i$. The total number of doublets associated with cluster $i$ can be denoted as:

$$D_{i} = N_{i}(N - N_{i}) + \binom{N_{i}}{2} = N_{i}(N - N_{i}) + \frac{N_{i}(N_{i} - 1)}{2}$$

The probability that a doublet is associated with cluster $i$ is the fraction of doublets associated with cluster $i$ relative to the total number of possible doublets:

$$P_{i} = \frac{D_{i}}{D_{\text{total}}} = \frac{N_{i}(N - N_{i}) + \frac{N_{i}(N_{i} - 1)}{2}}{\frac{N(N-1)}{2}}$$

It can be simplified:

$$P_{i} = \frac{2N_{i}(N - N_{i}) + N_{i}(N_{i} - 1)}{N(N-1)}$$

As the total number of cells $N$ is fixed and $N(N-1)$ is also a fixed value, we focus on the numerator, $f(N_{i}) = 2N_{i}(N - N_{i}) + N_{i}(N_{i} - 1)$, and examine its first derivative:

$$f'(N_{i}) = \frac{d}{dN_{i}} \left[ 2N_{i}(N - N_{i}) + N_{i}(N_{i} - 1) \right]$$

Differentiating term by term:

$$f'(N_{i}) = (2N - 4N_{i}) + (2N_{i} - 1) = 2N - 2N_{i} - 1$$

Solving $f'(N_{i}) = 0$, we find the critical point:

$$N_{i} = N - \frac{1}{2}$$

Since both $N$ and $N_{i}$ are positive integers, when the data contains more than one cell cluster we have $N_{i} \le N - 1 < N - \frac{1}{2}$, so $f'(N_{i})$ is always positive and the probability that a doublet is assigned to cluster $i$ increases as the size of cell population $i$ increases. Consistent with our experiments, the number of doublets categorized into a specific cell cluster is positively correlated with the size of that cluster.
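The counting argument can be checked numerically. The sketch below counts heterogeneous and homogeneous pairs directly and compares the result with the simplified closed form; the function name `cluster_doublet_probability` is hypothetical.

```python
def cluster_doublet_probability(n_total, n_i):
    """Probability that a uniformly random pair of cells includes a cell from cluster i."""
    total_pairs = n_total * (n_total - 1) // 2
    heterogeneous = n_i * (n_total - n_i)   # one cell inside, one outside the cluster
    homogeneous = n_i * (n_i - 1) // 2      # both cells inside the cluster
    return (heterogeneous + homogeneous) / total_pairs

# With N = 10 cells and a cluster of 3: 21 heterogeneous + 3 homogeneous
# pairs out of 45 possible pairs.
example = cluster_doublet_probability(10, 3)

# The probability rises strictly with cluster size while N_i <= N - 1.
probabilities = [cluster_doublet_probability(100, n_i) for n_i in range(1, 100)]
is_increasing = all(a < b for a, b in zip(probabilities, probabilities[1:]))
```

Here `example` equals 24/45, matching the closed form 3 × (2 × 10 − 3 − 1) / (10 × 9).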

Results

Performance of existing doublet detection methods on peripheral blood mononuclear cell datasets

To evaluate and compare existing doublet detection methods, we analyzed a multimodal PBMC Multiome dataset obtained from 10× Genomics. We tested 10 different doublet detection methods, categorized based on their compatibility with specific data types. ArchR [18] is exclusively designed for scATAC-seq data, while six methods (DoubletDecon [19], DoubletDetection [20], DoubletFinder [14], scds [21], Vaeda [22], and Solo [23]) are tailored for scRNA-seq data. The remaining three methods (scDblFinder [24], Scrublet [13], and scIBD [25]) can be applied to both scRNA-seq and scATAC-seq data. Collectively, these methods generated 13 sets of doublet predictions.

The proportion of doublets identified by different methods in the PBMC dataset varied widely, ranging from 2% to 11%, with a mean of 7.57%. scDblFinder detected the fewest doublets (2.62%), while DoubletDetection identified the most (10.48%) in scRNA-seq data (Fig. 1a). We quantified the level of agreement between methods by comparing the classification of droplets as doublets or singlets (Fig. 1b). DoubletDecon exhibited the lowest consistency, with 72.54% of its identified doublets not recognized by any other method. Conversely, scDblFinder showed the highest consistency in scATAC-seq data, with only 9% of its identified doublets unconfirmed by other methods.

Figure 1.

Benchmarking of existing methods on the PBMC Multiome dataset. (a) Proportions of singlets and doublets identified by various methods on the PBMC dataset. (b) Agreement proportions for singlet (lower) and doublet (upper) classification across methods. Each bar represents the proportion of cells classified as singlets or doublets by a method, broken down by the number of other methods that agree with the classification, ranging from no agreement (0) to full agreement by all 12 other methods. (c) Venn diagram illustrating the overlap of doublets identified by four detection methods (ArchR, scIBD, Scrublet, and scDblFinder) on scATAC-seq data. (d) Flower diagram representing doublets identified by nine detection methods (DoubletDecon, DoubletDetection, DoubletFinder, scDblFinder, scds, scIBD, Scrublet, Solo, and Vaeda) on scRNA-seq data. The central region indicates the number of doublets identified by all methods (0), while each petal represents doublets detected exclusively by one method. (e–g) Venn diagrams showing the overlap of doublets identified by scDblFinder, scIBD, and Scrublet across ATAC and RNA modalities.

Pairwise comparisons further evaluated prediction consistency between methods. Consistent with our previous findings, DoubletDecon showed the lowest recognition consistency, while Scrublet exhibited the highest consistency and performed relatively stably in scRNA-seq data (Fig. S1). We also assessed the consistency of doublet identification across methods in monomodal data (Fig. 1c and d). In scRNA-seq data, no doublets were commonly detected by all nine methods, whereas in scATAC-seq data, four methods collectively identified 246 overlapping doublets, accounting for 2.36% of all droplets. To further illustrate the overlaps among scRNA-seq doublet predictions, we have included a complementary UpSet plot (Fig. S2). Notably, the same method produced differing doublet predictions across modalities, highlighting potential modality-specific influences on detection outcomes (Fig. 1e–g).

These results revealed poor consistency across methods and significant variation in predictions across modalities of the same dataset, emphasizing the need for multimodal integration to improve doublet detection.

Overview of OmniDoublet

To address these challenges, we propose OmniDoublet, a novel method that integrates information from multimodal data to predict doublets in single-cell datasets. OmniDoublet takes an scRNA-seq gene count matrix and an scATAC-seq peak count matrix as inputs and simulates artificial doublets. For each modality, a KNN graph is constructed based on reduced dimensions, and a doublet score is computed for each cell based on the artificial doublets among its KNNs. To integrate multimodal data, OmniDoublet calculates a weight using the Jaccard similarity coefficient, which assesses the reliability of neighbors across modalities. This weight is then used to combine doublet scores from the two modalities into a final integrated score (Fig. 2a). Finally, OmniDoublet employs a GMM to determine an optimal threshold for binary classification of droplets as singlets or doublets based on the integrated score.

Figure 2.

OmniDoublet overview and benchmarking on PBMC semisynthetic datasets. (a) Schematic representation of the OmniDoublet framework. (b) ROC and AUC curves comparing various methods on a semisynthetic PBMC dataset with a 20% doublet rate. (c) Comparison of AUROC and AUPRC values across methods on semisynthetic PBMC datasets with varying doublet rates (5%–25%). (d) MCC values highlighting method performance on semisynthetic PBMC datasets with varying doublet rates (5%–25%). (e) Additional classification metrics (accuracy, F1 score, precision, and recall) for all methods on semisynthetic PBMC datasets with varying doublet rates (5%–25%).

To evaluate the performance of OmniDoublet, we benchmarked it against existing methods using several semisynthetic datasets derived from different species and tissues with varying doublet rates. We excluded two methods: DoubletDecon, because it provides only binary classifications (doublet or singlet) without doublet scores, and ArchR, because it is the only method that does not take count or peak matrices as inputs. Additionally, we incorporated COMPOSITE [26], a recently published method introduced during our manuscript submission process, in our evaluation and compared its performance with OmniDoublet using experimental cell hashing–annotated datasets.

Benchmark on semisynthetic peripheral blood mononuclear cell datasets

To evaluate the performance of doublet detection methods, we constructed semisynthetic datasets from the OD after screening it with existing doublet detection methods. Droplets were classified by consensus: those identified as singlets by all methods were considered true singlets, whereas droplets classified as doublets by at least one method were labeled as potential doublets. We retained only the consensus singlets, forming reliable singlet datasets. Artificial doublets were then simulated by randomly combining the profiles of two singlets. The singlet datasets and the simulated doublets were merged to create five semisynthetic datasets with predefined doublet rates ranging from 5% to 25% in 5% increments.
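The consensus filter and the sizing of each semisynthetic dataset can be sketched as follows. This assumes that the doublet rate is defined as simulated doublets divided by all droplets in the merged dataset, which is an assumption of this example, and the function names are hypothetical.

```python
def consensus_singlets(calls):
    """Indices of droplets called singlet by every method.

    calls : dict mapping method name -> list of booleans (True = doublet call),
            with one entry per droplet in the same order for all methods
    """
    n_droplets = len(next(iter(calls.values())))
    return [
        i for i in range(n_droplets)
        if not any(method_calls[i] for method_calls in calls.values())
    ]

def doublets_needed(n_singlets, rate):
    """Number of simulated doublets d such that d / (n_singlets + d) is close to rate."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    return round(rate * n_singlets / (1.0 - rate))
```

Under this definition, 8000 consensus singlets would need 2000 simulated doublets to reach a 20% doublet rate.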

We first compared OmniDoublet with nine other doublet detection methods on semisynthetic datasets generated from the 10× PBMC Multiome dataset, resulting in a total of 13 prediction sets. Using the doublet scores produced by each method, we evaluated the performance of doublet detection with receiver operating characteristic (ROC) curves and precision-recall (PR) curves (Fig. 2b). OmniDoublet demonstrated superior performance, achieving the best true positive rate at low false positive rates (FPR < 0.2) while maintaining high precision and high recall. We then compared the area under the ROC curve (AUROC) and the area under the PR curve (AUPRC) across datasets, with OmniDoublet achieving the highest performance in both metrics (Fig. 2c). Vaeda performed well on scRNA-seq data, while scIBD and Scrublet showed strong performance on scATAC-seq data. Notably, as the doublet rate increased, AUPRC values for all methods except scDblFinder improved. For most methods, the doublet rate had minimal impact on AUROC values. These results show that OmniDoublet outperforms other methods across datasets with varying doublet rates, with AUPRC values increasing as the doublet rate rises.
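For readers unfamiliar with the two summary metrics, the sketch below computes AUROC through its rank interpretation and estimates AUPRC with average precision. Average precision is one common estimator of the area under the PR curve and is used here as an assumption; the evaluation in this study may have used a different estimator, and ties in the scores are not specially handled in `average_precision`.

```python
def auroc(labels, scores):
    """Probability that a random doublet (label 1) scores above a random singlet (label 0).

    Ties count as one half.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each true doublet."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    true_positives, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            true_positives += 1
            total += true_positives / rank
    return total / true_positives if true_positives else 0.0
```

Because the precision of a random ranking equals the doublet rate, AUPRC tends to rise with the doublet rate even when the ranking quality is unchanged, whereas AUROC does not depend on class balance in this way.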

In practical applications, relying solely on AUROC and AUPRC values is insufficient; determining specific thresholds is essential for classification tasks. OmniDoublet employs a GMM to determine an optimal threshold for classifying droplets as singlets or doublets. We compared the Matthews correlation coefficient (MCC) of various methods across datasets (Fig. 2d). MCC is a robust metric for evaluating binary classification models on imbalanced datasets; by this metric, OmniDoublet outperformed all other methods, while Solo exhibited the poorest performance across all five datasets. OmniDoublet also achieved the highest accuracy, recall, and F1 score, though not the highest precision (Fig. 2e), confirming its ability to establish a reliable threshold for accurately classifying cells as singlets or doublets.
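The five binary classification metrics follow directly from the confusion matrix. The sketch below treats doublets as the positive class; the function name `binary_metrics` and the convention of returning 0 when a denominator is zero are assumptions.

```python
import math

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1 and MCC with doublets (1) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, len(y_true)),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2.0 * precision * recall, precision + recall),
        "mcc": ratio(tp * tn - fp * fn, denominator),
    }
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so a method that labels every droplet as a singlet scores 0 rather than benefiting from the majority class.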

Impact of doublet types

Doublets can be classified into two types based on their cell-of-origin: homogeneous doublets, synthesized from the same cell type, and heterogeneous doublets, synthesized from different cell types. To investigate the impact of these doublet types on the performance of doublet detection methods, we generated three datasets from the rigorously filtered PBMC dataset: homogeneous (same cell type), heterogeneous (different cell types), and randomized (completely random cell selections). For each dataset, we created semisynthetic datasets with doublet rates ranging from 5% to 25% in 5% increments. We then evaluated the performance of various doublet detection methods across these datasets.

Across datasets with different doublet types, OmniDoublet’s AUPRC improved consistently as the doublet rate increased, in line with the previous results, indicating that its detection ability is enhanced as the proportion of doublets in the dataset grows (Fig. 3a). OmniDoublet achieved higher AUROC and AUPRC values on heterogeneous datasets than on random datasets, and lower values on homogeneous datasets than on random datasets (Fig. 3b). This discrepancy arises because OmniDoublet, like most doublet detection methods, identifies doublets based on the similarity of cell expression or peak profiles. Homogeneous doublets are particularly challenging to detect because their expression or peak profiles closely resemble those of regular singlets. In contrast, heterogeneous doublets are easier to identify, as they often exhibit intermediate profiles and co-expression of incompatible cell type markers at reduced levels.

Figure 3.

OmniDoublet performance across different doublet types and dataset sizes. (a) Bar plots of AUROC and AUPRC values achieved by OmniDoublet on datasets containing different doublet types (homogeneous, heterogeneous, and random) and rates (5%–25%). (b) Mean AUROC and AUPRC values for OmniDoublet on datasets with different doublet types, with error bars indicating variability (standard deviation). (c) Comparison of AUROC and AUPRC values for various methods on downsampled datasets with cell counts of 2000, 4000, and 6000.

We evaluated the performance of various methods and found that OmniDoublet consistently achieved the highest AUPRC and AUROC values across all datasets, with the exception of AUROC values of homogeneous datasets at doublet rates of 5%–15% (Fig. S3). Similar to OmniDoublet, most methods performed best on heterogeneous datasets and worst on homogeneous datasets (Fig. S4). To make the generated doublets more realistic, we generated artificial doublets completely randomly in subsequent experiments.

Robustness to dataset size

To assess the robustness of the method to dataset size, we downsampled the PBMC 10k dataset with a doublet rate of 20%, generating datasets with cell counts of 2000, 4000, and 6000 cells. OmniDoublet consistently outperformed other methods in both AUROC and AUPRC metrics across all three downsampled datasets (Fig. 3c), demonstrating its robustness to varying dataset sizes. Additional metrics (accuracy, F1, MCC, recall, and precision) are presented in Table S1, further illustrating OmniDoublet’s high robustness.

Validation on other datasets

To evaluate the stability and generalizability of OmniDoublet across diverse datasets, we first utilized four single-cell multimodal datasets from different species and tissues: PBMC3K [27], human brain [28], mouse brain [29], and mouse kidney [30], all sourced from the 10× Genomics repository. Each dataset was preprocessed and filtered following the previously described method, and five semisynthetic datasets with varying doublet rates (ranging from 5% to 25% in 5% increments) were generated for each OD, resulting in a total of 20 semisynthetic datasets.

On these 20 semisynthetic datasets, OmniDoublet consistently achieved the highest AUROC and AUPRC values (Fig. 4a), demonstrating its superior performance and robustness across diverse data sources. The Vaeda method also performed well, ranking second in AUROC and AUPRC in most datasets. In contrast, the performance of scIBD_ATAC, which had shown strong results in earlier evaluations, declined across these new datasets. Other methods exhibited substantial variability in their performance across datasets, highlighting their lack of stability compared to OmniDoublet.

Figure 4.

Performance comparison of various methods on other datasets. (a) Dot plots illustrating AUROC and AUPRC values for different methods across four semisynthetic datasets, with doublet rates of 5%–25%. (b) Heatmap of average ranks for each method based on five binary classification metrics: accuracy, F1 score, MCC, precision, and recall. Top three methods for each metric are highlighted with their respective average ranks. (c) Box plots illustrating AUROC and AUPRC values of OmniDoublet and COMPOSITE on real ileum and peripheral blood (PB) datasets. (d) Box plots showing five binary classification metrics for OmniDoublet and COMPOSITE on the same ileum and PB datasets.

To further evaluate their performance on binary classification metrics, we ranked each method on each dataset according to metrics such as F1 score, MCC, and recall. Methods were ranked from best to worst (1, 2, …, 13), and their average ranks across the datasets were calculated and visualized in a heatmap (Fig. 4b and Table S2). For clarity, the average ranks of the top three performing methods for each metric are labeled within the heatmap.
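The ranking procedure can be sketched as below for a single metric. It assumes that higher metric values are better, that every method has a value on every dataset, and that tied methods share the mean of the ranks they occupy; the tie rule and the function name `average_ranks` are assumptions of this example.

```python
def average_ranks(scores_by_dataset):
    """Average rank of each method across datasets (rank 1 = best).

    scores_by_dataset : list of dicts mapping method name -> metric value,
                        one dict per dataset
    """
    totals = {}
    for scores in scores_by_dataset:
        ordered = sorted(scores, key=lambda method: -scores[method])
        pos = 0
        while pos < len(ordered):
            # Extend the block while the next method has the same score.
            end = pos
            while end + 1 < len(ordered) and scores[ordered[end + 1]] == scores[ordered[pos]]:
                end += 1
            shared_rank = (pos + 1 + end + 1) / 2.0  # mean of 1-based ranks in the block
            for method in ordered[pos:end + 1]:
                totals[method] = totals.get(method, 0.0) + shared_rank
            pos = end + 1
    n_datasets = len(scores_by_dataset)
    return {method: total / n_datasets for method, total in totals.items()}
```

Averaging ranks rather than raw metric values keeps datasets with unusually high or low scores from dominating the comparison.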

OmniDoublet consistently demonstrated the best performance in F1 score, MCC, and recall. It achieved the top rank for recall across nearly all datasets while significantly outperforming other methods on F1 score and MCC. Specifically, for F1 score, OmniDoublet’s average rank of 2 was notably better than the second-ranked method, Scrublet_RNA, which had an average rank of 4.1, a difference of 2.1. For MCC, OmniDoublet’s average rank of 2.1 outperformed the second-ranked Scrublet_RNA (3.9) by 1.8. In terms of accuracy, OmniDoublet achieved an average rank of 4.7, tying for second place with scDblFinder_RNA, and closely following the top-ranked Scrublet_RNA, which had an average rank of 3.9. For precision, OmniDoublet ranked third to last (average rank of 8.28), likely due to its tendency to identify a higher proportion of doublets compared to other methods.

We next evaluated OmniDoublet using real datasets from the COMPOSITE study [31], which included 10 peripheral blood T-cell-enriched and seven ileum immune-cell-enriched DOGMA-seq datasets with cell hashing annotations. Although the DOGMA-seq datasets consist of three modalities—RNA, ATAC, and ADT—we restricted our comparison to the RNA and ATAC modalities to ensure consistency with our simulated datasets and the current implementation of OmniDoublet. We compared the performance of COMPOSITE and OmniDoublet across these datasets. As shown in Fig. 4c, OmniDoublet consistently outperformed COMPOSITE in both AUROC and AUPRC, demonstrating superior performance across datasets derived from both tissue sources. Notably, both methods performed better on peripheral blood datasets compared to ileum datasets.

Further evaluation using binary classification metrics revealed that OmniDoublet achieved higher accuracy, F1 score, MCC, and precision than COMPOSITE, while COMPOSITE exhibited a slightly higher recall (Fig. 4d). Consistent with the AUROC and AUPRC results, the peripheral blood datasets yielded stronger overall performance than the ileum datasets across all five metrics.
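For reference, the five binary classification metrics compared here follow from the confusion matrix of predicted versus annotated doublets. The sketch below uses the standard definitions; the label vectors are hypothetical and encode doublets as 1 and singlets as 0.

```python
import math

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1, and MCC for 0/1 labels (1 = doublet)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}

# Hypothetical example: all three true doublets are found,
# but two singlets are also called doublets.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
```

The toy example shows the trade-off discussed in the text: a method that calls more doublets can reach perfect recall while its precision drops.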

Taken together, these results demonstrate that OmniDoublet excels at balancing multiple performance metrics and provides robust thresholds for doublet classification. Its consistently strong performance on real-world datasets further validates its reliability and practical applicability. OmniDoublet generalizes well across diverse species, tissue types, and experimental settings, demonstrating superior stability and accuracy for doublet detection. The distribution of performance across metrics is further illustrated in the boxplots (Fig. S5), reinforcing OmniDoublet’s effectiveness in both semisynthetic benchmarks and real-world datasets.

Impact of doublets on downstream analysis

To assess the impact of doublets and their removal on downstream single-cell data analysis, we utilized the semisynthetic dataset with a 20% doublet rate, generated from the rigorously filtered 10k PBMC dataset. Three datasets were created: the original dataset (OD), the dataset with manually added doublets (DA), and the dataset after doublet removal (DR). We then conducted downstream analysis on each of these datasets.

Clustering and cell type proportions

We performed downstream analysis on both scRNA-seq and scATAC-seq data using the Seurat [32] pipeline, with results visualized via UMAP [33]. By comparing the UMAP results of the OD and DA datasets, we observed that some doublets formed a distinct new cell cluster (cluster 9), while others were embedded within existing clusters or created bridges between clusters (Fig. 5a). After identifying and removing doublets using OmniDoublet, these bridges and the doublet-induced cluster were effectively eliminated, demonstrating OmniDoublet’s capability to accurately identify and remove doublets.

Figure 5.

OmniDoublet’s impact on downstream analysis. (a) UMAP visualizations of the original dataset (OD), the doublet-added dataset (DA), and the doublet-removed dataset (DR) obtained using OmniDoublet. The top row shows clustering annotations based on ground truth labels, while the bottom row shows clustering results obtained with Seurat’s standard workflow. (b) Sankey diagram illustrating cell classification transitions from the OD to the DA and DR datasets. (c) Bar plots showing the overlap of the top 100 marker genes across cell types in OD versus DA datasets and OD versus DR datasets. (d) Venn diagrams of differentially expressed genes (DEGs) for CD14 Mono versus CD16 Mono (left) and CD4 Naïve versus CD4 TCM (right) across the OD, DA, and DR datasets, highlighting DEG changes after doublet removal.

To further explore the impact of doublets on cell clustering, we visualized the relationship between the ground truth cell labels and the predicted classifications in the DA and DR datasets using a Sankey diagram (Fig. 5b). OmniDoublet successfully identified the majority of doublets, and the accuracy of cell annotation improved significantly after doublet removal. Notably, 41% of doublets were identified as CD14 Mono, suggesting that their cell profiles closely resemble those of CD14 Mono cells.

To investigate this further, we examined the relationship between the predicted cell types of doublets and the proportions of those cell types in the original data. We observed a positive correlation between the predicted doublets and cell type proportions (Pearson correlation for DA: r = 0.83, P = .0058; for DR: r = 0.89, P = .0012; Fig. S6). This correlation can be attributed to the randomized nature of the doublet simulation, where larger cell clusters are more likely to contribute to doublet formation.
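The correlation reported above is a standard Pearson coefficient between two per-cell-type vectors. The sketch below shows the calculation only; the proportions are hypothetical placeholders rather than the study's values, and the P value computation is omitted.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values, one entry per cell type:
# share of each cell type in the original data, and
# share of doublets that were assigned to that cell type.
cell_type_proportion = [0.30, 0.25, 0.15, 0.10, 0.08, 0.07, 0.05]
doublet_label_share = [0.41, 0.22, 0.12, 0.09, 0.06, 0.05, 0.05]

r = pearson_r(cell_type_proportion, doublet_label_share)
```

A value of `r` close to 1 would indicate that abundant cell types absorb proportionally more doublet labels, which is the pattern expected when doublet partners are drawn at random.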

In summary, our analysis highlights that doublets can significantly distort cell clustering and cell type identification. However, the use of OmniDoublet to remove these doublets substantially improves the accuracy and clarity of downstream analysis, ensuring more reliable biological interpretations.

Gene expression

We analyzed the marker gene profiles of each cell cluster in the DA dataset. The doublet cluster exhibited a marker gene expression profile resembling that of CD14 Mono and CD4 cells (Fig. S7). In the UMAP plot, this doublet cluster was positioned between the two cell types, suggesting that these cells likely originated from a combination of CD14 Mono and CD4 cells. Notably, these two cell types accounted for the majority of cells in the OD, further confirming the positive correlation between doublet type and cell type proportion.

We compared the top 100 gene markers for each cell type across the three datasets (Fig. 5c). The presence of doublets in the datasets introduced significant variations in differential gene analysis. In contrast, the removal of doublets using OmniDoublet effectively reduced these variations, bringing the results closer to those of the original data and improving the overall quality of the dataset. We then calculated the differentially expressed genes (DEGs) between CD14 Mono and CD16 Mono, as well as CD4 Naïve and CD4 TCM cell types, across the three datasets (Fig. 5d). These results indicate that the DA dataset identifies a higher number of false-positive DEGs compared to the OD dataset. Conversely, the DR dataset identifies more true DEGs that are consistent with the original data, thereby improving the accuracy of DEG identification while preserving the key characteristics of the OD.
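The comparison logic used here, and again for differential peaks, amounts to set operations against the OD result taken as the reference. The sketch below is an illustration under that assumption; the identifiers are placeholders, and the function name `compare_to_reference` is hypothetical.

```python
def compare_to_reference(reference, test):
    """Compare a feature set (marker genes, DEGs, or peaks) with the OD reference.

    shared         -> features recovered in both (consistent with the OD)
    only_test      -> features found only in the test set (putative false positives)
    only_reference -> reference features the test set missed
    """
    reference, test = set(reference), set(test)
    shared = reference & test
    return {
        "shared": shared,
        "only_test": test - reference,
        "only_reference": reference - test,
        "overlap_fraction": len(shared) / len(reference) if reference else 0.0,
    }

# Placeholder identifiers standing in for DEGs of one cell-type contrast.
od_degs = {"g1", "g2", "g3", "g4"}
da_degs = {"g1", "g2", "g7", "g8", "g9"}   # doublets added
dr_degs = {"g1", "g2", "g3", "g7"}         # doublets removed

da_vs_od = compare_to_reference(od_degs, da_degs)
dr_vs_od = compare_to_reference(od_degs, dr_degs)
```

In this toy case the DR set recovers more of the reference and carries fewer extra features than the DA set, mirroring the direction of the effect described in the text.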

Next, we performed enrichment analysis on the DEGs identified in CD4 Naïve and CD4 TCM cell types. This analysis revealed specific gene pathways significantly enriched in the OD and DR datasets, but absent in the DA dataset (Fig. S8). Notably, two of these pathways are involved in the proliferation of CD4+ T-cells and are highly correlated with CD4 TCM cells, which are more responsive to reactivation and proliferation compared to naive T-cells [34]. Another pathway is associated with the regulation of cell adhesion among leukocytes, including T-cells. Adhesion molecules are critical for T-cell homing, interaction with antigen-presenting cells, and tissue residency [35–37]. Importantly, TCM cells may express distinct adhesion molecules compared to naive cells, facilitating migration to secondary lymphoid organs. These pathways reflect functional differences between CD4 Naïve and CD4 TCM cells, underscoring their biological relevance. Together, these findings further confirm that the removal of doublets significantly enhances the accuracy of pathway analysis.

In summary, the removal of doublets using OmniDoublet not only improves the quality and accuracy of differential gene expression and pathway analysis but also ensures a more reliable biological interpretation of cell types and their functional properties.

Peak analysis

For the ATAC modality, differential peaks were detected for CD14 Mono and CD16 Mono, as well as CD4 Naïve and CD4 TCM cell types, across the three datasets (Fig. S9). Similar to the results from the RNA modality, the DR dataset identified more differential peaks consistent with the original data, while the DA dataset tended to identify more false-positive peaks. Specifically, for the comparison between CD4 Naïve and CD4 TCM cell types, the proportion of false positives in the DR dataset was slightly higher (2.4%) compared to the DA dataset (2.2%). However, the DR dataset identified 10.6% more true positives. These findings underscore the importance of doublet removal in enhancing the reliability and accuracy of differential peak analysis.

Extension to CITE-seq datasets

OmniDoublet, a method designed for single-cell multi-omics data, supports the analysis of both scRNA-seq and scATAC-seq data. To evaluate its scalability, we extended its application to CITE-seq data, a multi-omics platform integrating RNA and ADT (antibody-derived tag) data. Two CITE-seq datasets, Bone Marrow [38] and PBMC [39], were collected for benchmarking. Currently, no doublet detection methods are specifically tailored for CITE-seq data. Following a protocol similar to earlier experiments, eight existing doublet detection methods for scRNA-seq were applied to identify and filter potential doublets, after which doublets were simulated at rates ranging from 5% to 25% to benchmark OmniDoublet against the existing methods.
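A common way to build such semisynthetic data is to sum the count vectors of two randomly chosen filtered cells in every modality. The sketch below illustrates that idea only. It assumes the doublet rate is the fraction of doublets in the final dataset and that both modalities of a doublet come from the same pair of parent cells; the function names and toy counts are hypothetical, and the authors' exact simulation procedure may differ.

```python
import random

def n_doublets_for_rate(n_singlets, rate):
    """Number of doublets so that they make up `rate` of the final dataset."""
    return round(n_singlets * rate / (1.0 - rate))

def simulate_doublets(rna, adt, rate, seed=0):
    """Create artificial doublets by summing two parent cells per modality.

    rna, adt: lists of per-cell count vectors in the same cell order.
    Returns tuples (parent_i, parent_j, rna_counts, adt_counts).
    """
    rng = random.Random(seed)
    n_cells = len(rna)
    doublets = []
    for _ in range(n_doublets_for_rate(n_cells, rate)):
        i, j = rng.sample(range(n_cells), 2)  # two distinct parent cells
        rna_sum = [a + b for a, b in zip(rna[i], rna[j])]
        adt_sum = [a + b for a, b in zip(adt[i], adt[j])]
        doublets.append((i, j, rna_sum, adt_sum))
    return doublets

# Toy counts: 80 cells, 5 genes, 3 antibody tags.
toy_rng = random.Random(1)
rna_counts = [[toy_rng.randint(0, 10) for _ in range(5)] for _ in range(80)]
adt_counts = [[toy_rng.randint(0, 50) for _ in range(3)] for _ in range(80)]

doublets_20pct = simulate_doublets(rna_counts, adt_counts, rate=0.20)
```

With 80 singlets and a 20% rate under this definition, 20 doublets are generated, giving 100 cells in total.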

OmniDoublet achieved superior performance in terms of both AUROC and AUPRC values across all 10 CITE-seq semisynthetic datasets (Fig. 6a). Vaeda also performed well, ranking second in AUPRC values on nine of 10 datasets, with the exception of the Bone Marrow dataset at a 25% doublet rate, where its AUPRC was slightly lower than that of DoubletDetection. While DoubletDetection achieved high AUROC values across all the datasets, ranking second on average, its performance on AUPRC was less stable. Across the CITE-seq datasets, we observed improved performance for most methods, except for scDblFinder, as doublet rates increased. This improvement likely reflects the greater prominence of doublet-specific features at higher doublet rates, which aids in distinguishing doublets from singlets. Notably, OmniDoublet consistently outperformed the other methods by achieving the highest accuracy, F1 score, MCC, and recall. Its mean F1 score was 0.755, outperforming Scrublet_RNA by 35.5%, and its mean MCC was 0.722, surpassing DoubletDetection by 25.3% (Fig. 6b). These results highlight OmniDoublet’s robust scalability and stability across diverse single-cell sequencing modalities.

Figure 6.

Benchmarking methods on the CITE-seq datasets. (a) Dot plots displaying AUROC and AUPRC values for 10 doublet detection methods on PBMC and bone marrow datasets, with doublet rates of 5%–25%. (b) Boxplots illustrating the overall performance distributions of the 10 methods on the CITE-seq datasets across five binary classification metrics: accuracy, F1 score, MCC, precision, and recall.

Comparison of computational resources

To complement the performance evaluation, we assessed the computational resources required by each method, including runtime, CPU usage, and maximum memory (Fig. 7a). Processing semisynthetic PBMC data with a doublet rate of 20% revealed that Solo demands significantly more time and CPU resources due to its deep-learning-based framework. In contrast, Scrublet demonstrated the shortest runtime and lowest memory usage, while scds exhibited minimal CPU utilization. Both OmniDoublet and COMPOSITE require relatively higher computational time and memory compared to most other methods, primarily due to their ability to jointly process scRNA-seq and scATAC-seq data. Notably, OmniDoublet achieves improved doublet detection performance while maintaining practical efficiency, with typical analyses completing within a few minutes and requiring only moderate computational resources, comparable to those of single-modality methods.

Figure 7.

Computational resource usage comparison. (a) Bar plots displaying runtime (seconds), CPU usage (%), and memory consumption (GB) for each method on the semisynthetic PBMC dataset with a doublet rate of 20%. (b) Dot plots showing the scalability in terms of runtime, CPU usage, and memory consumption across datasets with 2000, 4000, 6000, 8000, and 10 000 cells.

Further comparisons across datasets with 2000 to 10 000 cells (Fig. 7b) showed that runtime increased linearly for most methods, except for Solo, whose runtime grew exponentially. Similarly, scIBD exhibited higher runtime due to its iterative algorithm. CPU usage remained relatively stable as the cell count increased, except for scIBD_ATAC, COMPOSITE, and OmniDoublet, whose CPU usage decreased with larger datasets, indicating potential algorithmic bottlenecks. Maximum memory usage also rose with the cell count, with Scrublet_RNA maintaining the lowest memory requirement, while scIBD_ATAC exhibited the steepest growth, peaking at 10 000 cells. Compared to OmniDoublet, COMPOSITE requires more time and memory. These findings provide a comprehensive overview of the trade-offs between computational efficiency and performance among doublet detection methods.
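Readers who want to reproduce this kind of scaling comparison for a Python-based method could use a small harness such as the one below. It is a sketch, not the profiling setup used in the study: `toy_workload` is a hypothetical stand-in for a doublet detection call, and `tracemalloc` only tracks memory allocated by the Python interpreter, so native allocations (for example inside compiled libraries) and CPU usage are not captured.

```python
import time
import tracemalloc

def profile(func, *args, **kwargs):
    """Run func once; return (result, elapsed_seconds, peak_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak

def toy_workload(n_cells):
    """Hypothetical stand-in for a method whose memory grows with cell count."""
    per_cell = [i * i for i in range(n_cells)]
    return sum(per_cell)

# Profile the same workload at the cell counts used in the scalability test.
usage = {}
for n_cells in (2000, 4000, 6000, 8000, 10000):
    _, elapsed, peak = profile(toy_workload, n_cells)
    usage[n_cells] = (elapsed, peak)
```

Plotting `usage` against the cell count gives the kind of runtime and memory curves summarized in the text.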

Discussion

While various tools have been developed for doublet detection, most are tailored exclusively for single-modal data and often yield inconsistent results across datasets. To address these challenges and harness the full potential of multimodal information, we developed OmniDoublet, a method that integrates multimodal data using a nearest-neighbor graph and Jaccard coefficients. Through rigorous benchmarking on semisynthetic PBMC datasets, OmniDoublet demonstrated superior performance across multiple metrics, including predictive doublet scores and binary classification accuracy (Figs 2c–e). Additionally, OmniDoublet maintained high performance under varying doublet generation strategies, including homogeneous, heterogeneous, and mixed doublet cases. While homogeneous doublets posed greater detection challenges due to their similarity to singlets, OmniDoublet consistently outperformed other methods in all conditions (Fig. 3b and c). Although our current analyses focused on two-modality datasets, the general framework and formulation of OmniDoublet support seamless extension to higher-order multimodal integration, offering a scalable solution for increasingly complex single-cell datasets.
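To make the role of the Jaccard coefficient concrete, the sketch below computes, for each cell, the overlap between its nearest neighbors in two modalities. It is illustrative only: the embeddings are toy coordinates, the brute-force neighbor search and the function names are assumptions, and the way OmniDoublet converts these coefficients into modality weights is not reproduced here.

```python
import math

def knn_sets(points, k):
    """Brute-force k nearest neighbors (Euclidean) for each point, as index sets."""
    neighbors = []
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbors.append({j for _, j in dists[:k]})
    return neighbors

def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def neighbor_agreement(embedding_1, embedding_2, k):
    """Per-cell Jaccard similarity between neighbor sets in two modalities."""
    nn_1 = knn_sets(embedding_1, k)
    nn_2 = knn_sets(embedding_2, k)
    return [jaccard(a, b) for a, b in zip(nn_1, nn_2)]

# Toy 2D embeddings of four cells in two modalities (hypothetical values).
rna_embedding = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
atac_embedding = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (0.0, 3.0)]

agreement = neighbor_agreement(rna_embedding, atac_embedding, k=1)
```

Cells whose neighborhoods agree across modalities score close to 1, whereas a cell such as the last one, which sits in different neighborhoods in the two embeddings, scores 0.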

Doublets in single-cell data adversely impact downstream analysis, including cell clustering, cell type proportion estimation, differential gene expression, and peak analysis (Figs 5a–d), thereby compromising the accuracy and reliability of biological interpretations. By removing doublets with OmniDoublet, the quality of downstream analysis improved substantially, leading to more accurate clustering, enhanced cell type classification, and more reliable differential analysis of genes and peaks.

A key strength of OmniDoublet is its integration of multimodal data to enhance doublet detection performance, making it a valuable tool for single-cell multi-omics studies. However, OmniDoublet has certain limitations. It is specifically designed for multimodal data and cannot be directly applied to single-modal datasets. Future work could explore adaptations of the method to accommodate single-modality data, e.g. by leveraging reference datasets or modality-independent features.

Additionally, its runtime and memory requirements are higher than those of some single-modal methods, reflecting the computational demands of processing multimodal information. Nevertheless, the resource requirements remain practical for most use cases. Currently, OmniDoublet uses gene and peak count matrices as inputs, which may limit its ability to capture additional biologically relevant features.

Further improvements could focus on incorporating other cell-specific features, such as library size, gene co-expression patterns, or sequencing fragment information, as seen in tools like AMULET. To support broader applications while preserving computational efficiency, upcoming versions of OmniDoublet may also benefit from algorithmic optimizations aimed at reducing runtime and memory consumption. Additionally, OmniDoublet could be expanded to leverage information from auxiliary modalities or reference datasets, thereby improving its applicability to single-modality datasets and enhancing its ability to characterize diverse cell states. These advancements would further strengthen OmniDoublet’s utility in challenging analytical scenarios.

CONCLUSION

OmniDoublet is a robust and scalable doublet detection method tailored for multimodal single-cell sequencing data. By integrating information from multiple data modalities, OmniDoublet consistently outperformed eight existing unimodal doublet detection methods and COMPOSITE across multiple semisynthetic Multiome datasets, demonstrating superior accuracy, robustness, and adaptability to diverse datasets, including those with varying doublet types, doublet rates and downsampled data. The comparison with COMPOSITE on real datasets further highlights OmniDoublet’s practical utility in authentic multimodal single-cell applications.

By effectively identifying and removing doublets, OmniDoublet significantly improves the accuracy of downstream analysis, such as cell clustering, differential gene expression, and differential peak detection. These improvements enable researchers to achieve biologically meaningful results and enhance the reliability of single-cell studies. The scalability of OmniDoublet was further highlighted by its outstanding performance on CITE-seq data, showcasing its adaptability across different single-cell modalities.

In addition to its high performance, our analysis of computational resource requirements provides researchers with practical guidance for selecting the most suitable doublet detection method based on their dataset and computational constraints. Seamlessly integrating into existing single-cell analysis workflows, OmniDoublet represents a powerful and efficient tool for advancing the analysis of single-cell multi-omics data. By enhancing doublet detection and downstream analyses, OmniDoublet empowers researchers to uncover deeper and more reliable insights into cellular heterogeneity and processes.

Key Points

  • OmniDoublet integrates transcriptomic and epigenomic data using a Jaccard similarity coefficient for robust doublet detection, outperforming unimodal methods in accuracy and scalability.

  • OmniDoublet demonstrates higher accuracy in doublet detection, especially for challenging homogeneous doublets, compared to existing methods.

  • By removing doublets, OmniDoublet enhances clustering, cell type classification, and gene/peak analysis, ensuring more reliable biological insights.

  • OmniDoublet excels across diverse datasets, offering robust performance with efficient computational resource usage, even for large or complex datasets.

Supplementary Material

OmniDoublet-Suppl-Data-Final_bbaf538

Acknowledgements

We express our gratitude to Xiaoli Hong and Chao Bi at the Core Facility of Zhejiang University School of Medicine for their technical assistance. Additionally, we would like to acknowledge the anonymous reviewers for their valuable comments and feedback on the manuscript.

Contributor Information

Lian Liu, Department of Respiratory Medicine, Sir Run Run Shaw Hospital and Institute of Translational Medicine, Zhejiang University School of Medicine, 3 Qingchun E Road, Shangcheng District, Hangzhou 310016, China.

Jiayi Ren, Department of Respiratory Medicine, Sir Run Run Shaw Hospital and Institute of Translational Medicine, Zhejiang University School of Medicine, 3 Qingchun E Road, Shangcheng District, Hangzhou 310016, China.

Xiaoxu Zhou, Zhejiang Provincial Key Laboratory of Precision Diagnosis and Therapy for Major Gynecological Diseases, Women’s Hospital and Institute of Translational Medicine, Zhejiang University School of Medicine, 1 Xueshi Road, Shangcheng District, Hangzhou 310006, China.

Xiang Cheng, Zhejiang Provincial Key Laboratory of Precision Diagnosis and Therapy for Major Gynecological Diseases, Women’s Hospital and Institute of Translational Medicine, Zhejiang University School of Medicine, 1 Xueshi Road, Shangcheng District, Hangzhou 310006, China.

Xiaoqing Pan, Department of Mathematics, Shanghai Normal University, 100 Guilin Road, Xuhui District, Shanghai 200234, China.

Liyuan Zhou, Department of Respiratory Medicine, Sir Run Run Shaw Hospital and Institute of Translational Medicine, Zhejiang University School of Medicine, 3 Qingchun E Road, Shangcheng District, Hangzhou 310016, China.

Yan Lu, Zhejiang Provincial Key Laboratory of Precision Diagnosis and Therapy for Major Gynecological Diseases, Women’s Hospital and Institute of Translational Medicine, Zhejiang University School of Medicine, 1 Xueshi Road, Shangcheng District, Hangzhou 310006, China; Cancer Center, Zhejiang University, 866 Yuhangtang Road, Xihu District, Hangzhou 310013, China; Zhejiang Key Laboratory of Frontier Medical Research on Cancer Metabolism, Zhejiang University, 268 Kaixuan Road, Shangcheng District, Hangzhou 310029, China.

Pengyuan Liu, Department of Respiratory Medicine, Sir Run Run Shaw Hospital and Institute of Translational Medicine, Zhejiang University School of Medicine, 3 Qingchun E Road, Shangcheng District, Hangzhou 310016, China; Cancer Center, Zhejiang University, 866 Yuhangtang Road, Xihu District, Hangzhou 310013, China; Department of Physiology, University of Arizona College of Medicine, 1656 E Mabel St, Tucson, AZ 85724, United States.

Author contributions

PL, YL and LZ conceived and designed the study; LL developed the algorithm; LZ, JR, XZ, and XC assisted in single-cell data analysis; LL drafted the manuscript; LZ, XP, YL and PL revised the manuscript. All authors discussed and commented upon the study.

Conflict of interest: None declared.

Funding

This work has been supported in part by the Key R&D Program of Zhejiang Province (2025C02053), the National Natural Science Foundation of China (82472625 and 82372870), and the National Institutes of Health (HL149620).

Data availability

The datasets analyzed in this study were collected from public databases. All 10× Genomics Single Cell Multiome ATAC + Gene Expression and CITE-seq datasets can be downloaded at https://www.10xgenomics.com/datasets/. The datasets generated in the COMPOSITE study can be found in Zenodo at https://zenodo.org/records/11167174.

Code availability

The OmniDoublet source code is available at https://github.com/mmmads/OmniDoublet.

References

  • 1. Picelli S, Faridani OR, Björklund ÅK. et al. Full-length RNA-seq from single cells using Smart-seq2. Nat Protoc 2014;9:171–81. 10.1038/nprot.2014.006.
  • 2. Dey SS, Kester L, Spanjaard B. et al. Integrated genome and transcriptome sequencing of the same cell. Nat Biotechnol 2015;33:285–9. 10.1038/nbt.3129.
  • 3. Stoeckius M, Hafemeister C, Stephenson W. et al. Simultaneous epitope and transcriptome measurement in single cells. Nat Methods 2017;14:865–68. 10.1038/nmeth.4380.
  • 4. Valentina P. Single Cell Methods. Sequencing and Proteomics, Vol. 1979. New York, NY: Springer, 2019.
  • 5. Bartosovic M, Kabbe M, Castelo-Branco G. Single-cell CUT&Tag profiles histone modifications and transcription factors in complex tissues. Nat Biotechnol 2021;39:825–35. 10.1038/s41587-021-00869-9.
  • 6. Tsai T-L, Zhou T-A, Hsieh Y-T. et al. Multiomics reveal the central role of pentose phosphate pathway in resident thymic macrophages to cope with efferocytosis-associated stress. Cell Rep 2022;40:111065. 10.1016/j.celrep.2022.111065.
  • 7. Fan J, Lu F, Qin T. et al. Multiomic analysis of cervical squamous cell carcinoma identifies cellular ecosystems with biological and clinical relevance. Nat Genet 2023;55:2175–88. 10.1038/s41588-023-01570-0.
  • 8. Stoeckius M, Zheng S, Houck-Loomis B. et al. Cell hashing with barcoded antibodies enables multiplexing and doublet detection for single cell genomics. Genome Biol 2018;19:224. 10.1186/s13059-018-1603-1.
  • 9. McGinnis CS, Patterson DM, Winkler J. et al. MULTI-seq: sample multiplexing for single-cell RNA sequencing using lipid-tagged indices. Nat Methods 2019;16:619–26. 10.1038/s41592-019-0433-8.
  • 10. Kang HM, Subramaniam M, Targ S. et al. Multiplexed droplet single-cell RNA-sequencing using natural genetic variation. Nat Biotechnol 2018;36:89–94. 10.1038/nbt.4042.
  • 11. Huang Y, McCarthy DJ, Stegle O. Vireo: Bayesian demultiplexing of pooled single-cell RNA-seq data without genotype reference. Genome Biol 2019;20:273. 10.1186/s13059-019-1865-2.
  • 12. Neavin D, Senabouth A, Arora H. et al. Demuxafy: Improvement in droplet assignment by integrating multiple single-cell demultiplexing and doublet detection methods. Genome Biol 2024;25:94.
  • 13. Wolock SL, Lopez R, Klein AM. Scrublet: computational identification of cell doublets in single-cell transcriptomic data. Cell Systems 2019;8:281–291.e9. 10.1016/j.cels.2018.11.005.
  • 14. McGinnis CS, Murrow LM, Gartner ZJ. DoubletFinder: doublet detection in single-cell RNA sequencing data using artificial nearest neighbors. Cell Systems 2019;8:329–337.e4. 10.1016/j.cels.2019.03.003.
  • 15. Thibodeau A, Eroglu A, McGinnis CS. et al. AMULET: a novel read count-based method for effective multiplet detection from single nucleus ATAC-seq data. Genome Biol 2021;22:252. 10.1186/s13059-021-02469-x.
  • 16. Xi NM, Li JJ. Benchmarking computational doublet-detection methods for single-cell RNA sequencing data. Cell Systems 2021;12:176–194.e6. 10.1016/j.cels.2020.11.008.
  • 17. 10x Genomics. PBMC from a healthy donor—granulocytes removed through cell sorting (10k). (2020).
  • 18. Granja JM, Corces MR, Pierce SE. et al. ArchR is a scalable software package for integrative single-cell chromatin accessibility analysis. Nat Genet 2021;53:403–11. 10.1038/s41588-021-00790-6.
  • 19. DePasquale EAK, Schnell DJ, Van Camp P-J. et al. DoubletDecon: deconvoluting doublets from single-cell RNA-sequencing data. Cell Rep 2019;29:1718–1727.e8. 10.1016/j.celrep.2019.09.082.
  • 20. DoubletDetection (Version v3.0). (2020).
  • 21. Bais AS, Kostka D. scds: computational annotation of doublets in single-cell RNA sequencing data. Bioinformatics 2020;36:1150–58. 10.1093/bioinformatics/btz698.
  • 22. Schriever H, Kostka D. Vaeda computationally annotates doublets in single-cell RNA sequencing data. Bioinformatics 2023;39:btac720. 10.1093/bioinformatics/btac720.
  • 23. Bernstein NJ, Fong NL, Lam I. et al. Solo: doublet identification in single-cell RNA-Seq via semi-supervised deep learning. Cell Systems 2020;11:95–101.e5. 10.1016/j.cels.2020.05.010.
  • 24. Germain P-L, Lun A, Macnair W. et al. Doublet identification in single-cell sequencing data using scDblFinder. F1000Res 2021;10:979. 10.12688/f1000research.73600.1.
  • 25. Zhang W, Jiang R, Chen S. et al. scIBD: a self-supervised iterative-optimizing model for boosting the detection of heterotypic doublets in single-cell chromatin accessibility data. Genome Biol 2023;24:225. 10.1186/s13059-023-03072-y.
  • 26. Hu H, Wang X, Feng S. et al. A unified model-based framework for doublet or multiplet detection in single-cell multiomics data. Nat Commun 2024;15:5562. 10.1038/s41467-024-49448-x.
  • 27. 10x Genomics. PBMC from a Healthy Donor—Granulocytes Removed Through Cell Sorting (3k). (2020).
  • 28. 10x Genomics. Flash-Frozen Human Healthy Brain Tissue. (2021).
  • 29. 10x Genomics. Fresh Embryonic E18 Mouse Brain. (2021).
  • 30. 10x Genomics. Flash-Frozen Mouse Kidney. (2023).
  • 31. Hu H. Data for ‘a unified model-based framework for doublet or multiplet detection in single-cell multiomics data’. Zenodo 2024; v1. 10.5281/ZENODO.11167173.
  • 32. Hao Y, Stuart T, Kowalski M. et al. Dictionary learning for integrative, multimodal and scalable single-cell analysis. Nat Biotechnol 2024;42:293–304. 10.1038/s41587-023-01767-y.
  • 33. McInnes L, Healy J, Melville J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. Preprint at https://doi.org/10.48550/arXiv.1802.03426 (2020).
  • 34. Künzli M, Masopust D. CD4+ T cell memory. Nat Immunol 2023;24:903–14. 10.1038/s41590-023-01510-4.
  • 35. Keller HR, Ligons DL, Li C. et al. The molecular basis and cellular effects of distinct CD103 expression on CD4 and CD8 T cells. Cell Mol Life Sci 2021;78:5789–805. 10.1007/s00018-021-03877-9.
  • 36. Gérard A, Cope AP, Kemper C. et al. LFA-1 in T cell priming, differentiation, and effector functions. Trends Immunol 2021;42:706–22. 10.1016/j.it.2021.06.004.
  • 37. Woodland DL, Kohlmeier JE. Migration, maintenance and recall of memory T cells in peripheral tissues. Nat Rev Immunol 2009;9:153–61. 10.1038/nri2496.
  • 38. Stuart T, Butler A, Hoffman P. et al. Comprehensive integration of single-cell data. Cell 2019;177:1888–1902.e21. 10.1016/j.cell.2019.05.031.
  • 39. 10x Genomics. 5k Peripheral blood mononuclear cells (PBMCs) from a healthy donor with cell surface proteins. (2019).

Articles from Briefings in Bioinformatics are provided here courtesy of Oxford University Press