Abstract
Background
Single-cell multi-omics (scMulti-omics) technologies have revolutionized our understanding of cellular functions and interactions by enabling the simultaneous measurement of diverse cellular modalities. Integrating these heterogeneous data types presents significant challenges due to differences in scale, resolution, and biological variability across the omics layers. Traditional computational methods often fail to reconcile these differences, leading to a loss of critical biological variability and subtle intermolecular interactions.
Methods
To address these challenges, we have developed a single-cell multi-omics deep learning model (scMDCF) based on contrastive learning, tailored for the efficient characterization and integration of scMulti-omics data. scMDCF features a cross-modality contrastive learning module that harmonizes data representations across different omics types, ensuring consistency and preserving data heterogeneity by accommodating information entropy. Furthermore, a cross-modality feature fusion module extracts common low-dimensional latent representations of scMulti-omics data, effectively balancing the diverse characteristics of these data types.
Results
Extensive empirical studies demonstrate that scMDCF outperforms existing state-of-the-art scMulti-omics models across various types of scMulti-omics data. In particular, scMDCF exhibits advanced analytical capabilities in extracting cell-type-specific peak-gene associations and cis-regulatory elements from SNARE-seq data, and in elucidating immune regulation from CITE-seq data. In a post-BNT162b2 mRNA SARS-CoV-2 vaccination dataset, scMDCF successfully annotates specific vaccine-induced B cell subpopulations, uncovering dynamic interactions and regulatory mechanisms within the immune system post-vaccination. Most importantly, using Alzheimer’s disease-specific data, scMDCF identifies computational minority Microglia and Endothelial cell populations, revealing ELF1 as a putative candidate transcription factor biomarker in Microglia, which potentially influences GTPase activity and may suppress Alzheimer’s pathology.
Conclusions
We propose scMDCF, a contrastive learning based framework for single-cell multi-omics integration that harmonizes cross-modality representations while preserving biological heterogeneity. Applications across diverse scMulti-omics datasets demonstrate improved clustering performance, effective batch-effect mitigation, and mechanistic insights into underlying biological processes. Code and reproducible workflows are openly available.
Supplementary Information
The online version contains supplementary material available at 10.1186/s13073-025-01586-7.
Keywords: Single-cell multi-omics, Contrastive learning, ScMulti-omics integration and clustering
Background
Advances in single-cell sequencing technology have greatly expanded our ability to profile individual cells across various modalities. Protocols that simultaneously describe the same cell using multiple modalities provide a comprehensive view of single cells, thereby promoting a fundamental understanding of the molecular hierarchy from the genome to the phenome [1–3]. For example, ATAC-seq has been developed to detect open chromatin regions and assess chromatin accessibility across tissues and cells. Accurately correlating the chromatin accessibility landscape with gene expression profiles enables the identification of cis-regulatory elements, thereby enhancing our understanding of how cellular identity is established and reprogrammed, as well as the mechanisms determining cell fate [4, 5]. Cellular indexing of transcriptomes and epitopes by sequencing (CITE-seq) facilitates simultaneous profiling of RNA gene expression alongside a panel of cell surface proteins. This technique offers direct insights into cell signaling and cell-cell interactions, thereby capturing dynamic biological processes at the cellular level across diverse fields such as oncology, immunology, infectious diseases, and anatomical pathology [6, 7]. However, due to the biological and technical variation between different omics protocols, even when sequencing the same set of cells, the biological insights of these modalities can be inconsistent. Therefore, integrating multi-omics data to eliminate batch effects and preserve biological heterogeneity presents significant challenges, and it is crucial for obtaining biological phenomena that can be jointly interpreted across multiple modalities.
Several computational methods have been proposed for the integrative analysis of single-cell multi-omics datasets. These methods typically fall into two categories: statistical methods and deep learning-based methods. Among the statistical methods, BREM-SC [7] and MOFA+ [8] are notable examples. BREM-SC [7] introduces a Bayesian random effects mixture model tailored for the joint analysis of CITE-seq data. It leverages Dirichlet multinomial distributions to model ADT-seq and RNA-seq data independently, incorporating cell-specific random effects to decipher correlations between the omics. In contrast, MOFA+ [8] adopts a Bayesian group factor analysis framework, implementing stochastic variational inference for large-scale datasets, and employs automatic relevance determination priors to reveal inter-omic structural relationships. Parallel to these statistical approaches, methodologies rooted in the fusion of similarity matrices, such as CiteFuse [6] and Seurat V4 [9], have also gained attention. CiteFuse [6] employs a two-step approach to analyze scMulti-omics data, calculating cell-to-cell similarity matrices for ADT-seq and RNA-seq, and then fusing these matrices using a similarity network fusion algorithm [6]. Seurat V4 [9] integrates scMulti-omics data via weighted nearest-neighbor theory, learning integration weights and generating a similarity graph. However, both statistical methods and similarity matrix fusion approaches have inherent limitations [9]. The former, including BREM-SC [7] and MOFA+ [8], rely on assumptions of positive correlations between modalities and of specific distributional forms, such as Dirichlet or Gaussian priors. In practice, however, different modalities often capture distinct, uncorrelated, or even negatively correlated biological processes. When these assumptions are not met, the integration may yield dataset-specific or inconsistent results by forcing unrelated patterns into shared components.
The latter, exemplified by CiteFuse [6] and Seurat V4 [9], primarily apply linear fusion strategies to integrate similarity matrices derived from each omics layer. However, this approach fails to capture the dynamic, nonlinear manifold structures on which biological signals often reside. In scMulti-omics data, each modality may follow distinct, curved manifolds shaped by complex regulatory mechanisms, and linear fusion can distort these geometries, collapsing subtle but meaningful variations. Moreover, in high-dimensional and sparse settings, this distortion may amplify technical noise and obscure true biological structure, leading to suboptimal integration and clustering results. Therefore, considering the high-dimensional and sparse characteristics of the scMulti-omics datasets, it is essential to infer low-dimensional joint representations from multiple modalities prior to integration.
In recent years, substantial progress has been made in deep learning techniques for modeling and integrating divergent scMulti-omics datasets. GLUE [10] introduces a graph-linked unified embedding that incorporates prior regulatory knowledge in the form of gene–peak interaction graphs to align different modalities. Similarly, scCross [11] is a deep generative framework that harnesses modality-specific variational autoencoders integrated with generative adversarial networks and mutual nearest neighbor (MNN) alignment to jointly embed multimodal single-cell profiles. While these methods have demonstrated effectiveness, they rely heavily on curated or inferred cross-omics priors, which may be incomplete, dataset-specific, or even unavailable in certain biological contexts, limiting their applicability and generalizability. In contrast, models such as Cobolt [12] utilize a hierarchical Bayesian generative-based VAE model to learn a common representation, and totalVI [13], which is specifically designed for CITE-seq datasets, formulates a deep variational autoencoder to acquire a joint probabilistic representation. Further, scMDC [14] adopts a different strategy, concatenating multi-omics data before processing it through a multimodal autoencoder. Along the same line, MultiVI [15] provides a probabilistic framework for integrating scMulti-omics data without requiring prior correspondence. These prior-free methods provide greater flexibility and better scalability across diverse datasets, especially when inter-modality relationships are weak or unknown. However, these multimodal joint analysis models that learn joint probability distributions or directly link raw omics data, may overlook the consistency and specificity among different omics types. Furthermore, similar to many deep learning architectures, these models exhibit a ‘black-box’ nature, which impedes our ability to interpret and understand the significance of the latent variables. 
This limitation makes it challenging to extract meaningful insights or verify their biological relevance.
Here, we propose a single-cell multi-omics deep learning model based on contrastive learning, called scMDCF, designed to extract joint representations from single-cell multi-omics data. scMDCF adopts multi-omics informed encoders to independently learn the processed feature representations of each omics type. Then, its cross-modality contrastive learning module integrates the representations, discerning consistent and inconsistent information across different omics types. scMDCF employs a self-optimizing multimodal deep embedded clustering module using Kullback-Leibler (KL) divergence, to cluster the low-dimensional shared representations. Finally, the common latent representation is decoded through the multi-omics reconstruction decoder back into the feature matrix of each omics. In addition, three training losses including the KL loss, the reconstruction loss, and the contrastive learning loss are optimized to accurately reveal cell clustering labels and uncover common latent factors across different modalities. We evaluate the clustering performance of scMDCF by comparing it with the state-of-the-art scMulti-omics clustering methods on various CITE-seq and single-cell RNA-seq and ATAC-seq datasets, and demonstrate that scMDCF outperforms most of the computational methods, particularly in its robustness against batch effects and its ability to accurately identify marker peak-gene associations and cis-regulatory elements in SNARE-seq data. To further validate the effectiveness of scMDCF, we apply the model to a snMulti-omics DLPFC tissue dataset [16] from individuals diagnosed with Alzheimer’s disease and unaffected control donors. This analysis successfully identified computational minority cell clusters that are often overlooked by other scMulti-omics computational methods. 
Strikingly, we revealed ELF1 as a putative candidate transcription factor biomarker in Microglia, which upregulates regulon genes involved in GTPase activity, potentially modulating cellular function and contributing to the suppression of Alzheimer’s pathology.
Methods
Data processing
Our analysis included comprehensive data processing procedures tailored to each data type. For CITE-seq data, we adopted standard workflows using the SCANPY package [17]. For scRNA-ATAC-seq datasets, we followed the preprocessing pipeline implemented in the episcanpy package [18]. For CITE-seq data processing, genes and ADTs (Antibody-Derived Tags) with no counts were filtered to focus on informative features. We then normalized cell counts using a library size normalization method: each cell’s library size was divided by the median library size across all cells to obtain a size factor, and the cell’s counts were divided by this factor, removing relative bias and allowing comparison between cells. Subsequently, we transformed the counts to logarithmic scale to stabilize variance and scaled the data to achieve zero mean and unit variance. Finally, we selected the top 1,000 highly variable genes, while using the full set of ADTs, to capture the most significant features of the CITE-seq dataset. For scRNA-ATAC-seq, we first performed quality control to filter out low-quality cells and peaks. Next, we applied log normalization to the filtered feature matrix. Finally, we selected the top 2,500 highly variable features, thereby focusing on the most informative features for downstream integrative analysis. These tailored processing steps were essential for reducing technical variability and enhancing biological signals, ensuring that the uniqueness of each dataset was appropriately addressed, thus providing a solid foundation for robust downstream analyses including clustering, trajectory inference, and differential gene expression studies.
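The CITE-seq normalization steps described above can be illustrated with a minimal, dependency-free sketch. It assumes a small cells-by-genes count matrix given as nested lists and that empty cells have already been filtered; the function name `normalize_counts` is hypothetical, and highly variable gene selection is omitted. In practice these steps would be run through SCANPY rather than hand-written code.

```python
import math
import statistics

def normalize_counts(counts):
    """Library-size normalize, log-transform, and z-score a cells x genes matrix."""
    lib_sizes = [sum(row) for row in counts]
    median_size = statistics.median(lib_sizes)
    # Size factor = library size / median library size; counts are divided by it.
    # Assumes every cell has a non-zero library size (empty cells filtered earlier).
    logged = [[math.log1p(v / (size / median_size)) for v in row]
              for row, size in zip(counts, lib_sizes)]
    n_cells = len(logged)
    scaled_columns = []
    for column in zip(*logged):
        mean = sum(column) / n_cells
        std = math.sqrt(sum((v - mean) ** 2 for v in column) / n_cells) or 1.0
        # Scale each gene to zero mean and unit variance across cells.
        scaled_columns.append([(v - mean) / std for v in column])
    return [list(row) for row in zip(*scaled_columns)]

counts = [[4, 0, 6], [1, 3, 1], [10, 5, 5]]  # toy data, 3 cells x 3 genes
scaled = normalize_counts(counts)
```

After this transformation every gene column has mean zero and unit variance, so genes with very different expression levels contribute on a comparable scale to the encoder input.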
Multi-omics informed encoder
In the multi-omics data analysis, the encoder in an autoencoder plays a pivotal role, compressing high-dimensional data into a lower-dimensional space, enabling the extraction of complex features. This functionality is particularly crucial for RNA-seq data, which are often high-dimensional and sparse. On this basis, our approach enhances feature representation across different modalities in multi-omics data, using multi-omics informed encoders tailored for each data type.
For RNA-seq, we take the pre-processed gene expression matrix $X_{RNA}$ as the input of the RNA-specific encoder. The matrix $X_{RNA}$ is of size $n \times m$, where n is the number of cells and m is the number of genes. The RNA-specific encoder performs a transformation through multiple stacked blocks, and each stacked block includes a fully connected layer, a BatchNorm1d layer, and a ReLU activation. The formula of each stacked block is as follows:

$$H_{RNA} = \sigma\left(\mathrm{BN}\left(X_{RNA} W_{RNA} + b_{RNA}\right)\right) \tag{1}$$

where $W_{RNA}$ is a weight matrix of size $m \times d$, with d being the dimension of the lower-dimensional space and m being the number of pre-processed features. $W_{RNA}$ and $b_{RNA}$ are the learnable weight matrix and bias, respectively, crucial for encoding RNA-specific features. $H_{RNA}$ is the latent representation of RNA-seq, of size $n \times d$, where n denotes the number of cells. The function $\sigma(\cdot)$ is a non-linear activation function, typically ReLU. The normalization function $\mathrm{BN}(\cdot)$ represents a normalization operation, such as BatchNorm1d. It is mathematically expressed as follows:

$$\mathrm{BN}(x) = \gamma \cdot \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} + \beta \tag{2}$$

where $\mathrm{E}[x]$ and $\mathrm{Var}[x]$ are the mini-batch mean and variance, $\epsilon$ is a small constant for numerical stability, and $\gamma$ and $\beta$ are the adaptable parameters of the BatchNorm1d layer, contributing to the normalization process. The ablation results for the BatchNorm1d layer are provided in Additional File 1: Fig. S9.
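To make the stacked block concrete, the following is a minimal sketch of one block (fully connected layer, batch normalization with batch statistics, then ReLU) written in plain Python. The function name `encoder_block`, the toy weights, and the row-vector convention (input times an m-by-d weight matrix) are assumptions for illustration; an actual implementation would use a deep learning framework.

```python
import math

def encoder_block(x, weight, bias, gamma, beta, eps=1e-5):
    """Linear -> batch norm (batch statistics) -> ReLU for a batch of row vectors."""
    n, d = len(x), len(bias)
    # Linear layer: each input row (length m) times weight (m x d), plus bias.
    lin = [[sum(xk * w_row[j] for xk, w_row in zip(row, weight)) + bias[j]
            for j in range(d)] for row in x]
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        column = [lin[i][j] for i in range(n)]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n  # biased batch variance
        for i in range(n):
            normed = gamma[j] * (column[i] - mean) / math.sqrt(var + eps) + beta[j]
            out[i][j] = max(0.0, normed)  # ReLU
    return out

x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 cells, 2 features (toy data)
weight = [[1.0, -1.0], [0.5, 0.5]]          # m x d = 2 x 2
out = encoder_block(x, weight, bias=[0.0, 0.0], gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

Because normalization is applied per latent dimension across the batch, each output column is centered before the ReLU, which then zeroes the below-average cells in that dimension.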
To gain broader biological insights, we integrate additional omics modalities. Our model includes a symmetric multi-omics informed encoder designed to distill enhanced feature representations from alternative omic datasets, aligning them to the uniform dimensionality of $d$. This encoder is formulated as:

$$H_{O} = \sigma\left(\mathrm{BN}\left(X_{O} W_{O} + b_{O}\right)\right) \tag{3}$$

where $X_{O}$ denotes the pre-processed expression matrix from alternative omics data, such as ATAC or ADT, $\sigma(\cdot)$ is the ReLU activation, and $\mathrm{BN}(\cdot)$ is BatchNorm1d. The parameters $W_{O}$ and $b_{O}$ are the learnable weights and biases specific to this type of omics data. The latent dimension of $H_{RNA}$ is $n \times d$, and that of $H_{O}$ is $n \times d$. The encoders efficiently capture and represent the unique characteristics of each specific omic dataset, enabling a comprehensive multi-omics analysis.
Cross-modality contrastive learning
In the context of scMulti-omics data integration, a major challenge lies in learning a unified representation that captures both the shared biological signals across omics and the characteristics unique to each omics layer. This naturally motivates the use of contrastive learning, which aims to learn robust and discriminative representations by maximizing agreement between related inputs. Traditional contrastive learning frameworks such as InfoNCE [19] rely on distinguishing positive from negative sample pairs, whereas the modalities of scMulti-omics data are measured from the same biological entities. Applying InfoNCE directly to this scenario may therefore be biologically inappropriate and limit integration performance. To address this, we propose a cross-modality contrastive learning module, which aims to maximize modality consistency while preserving modality-specific feature patterns. This strategy avoids explicit negative sampling and is more biologically interpretable in the scMulti-omics setting. This is achieved by maximizing mutual information across the enhanced features extracted from each omics-specific encoder. The mutual information, symbolized as $I(H_{RNA}; H_{O})$, quantifies the shared information between two distinct omics representations, which is represented as follows:

$$I(H_{RNA}; H_{O}) = \sum_{u=1}^{d}\sum_{v=1}^{d} P_{uv} \ln\frac{P_{uv}}{P_{u\cdot}\,P_{\cdot v}} \tag{4}$$

where $I(\cdot\,;\cdot)$ denotes mutual information. Since a softmax function is applied at the final layer of the multi-omics informed encoders, each element of $H_{RNA}$ and $H_{O}$ represents the over-cluster class probability, and $P \in \mathbb{R}^{d \times d}$ denotes their joint probability distribution, with marginals $P_{u\cdot} = \sum_{v} P_{uv}$ and $P_{\cdot v} = \sum_{u} P_{uv}$, defined as:

$$P = \frac{1}{n}\sum_{i=1}^{n} h_i^{RNA}\left(h_i^{O}\right)^{\top} \tag{5}$$

where $h_i^{RNA}$ and $h_i^{O}$ are the representations of the i-th cell in $H_{RNA}$ and $H_{O}$, written as column vectors.
Through cross-modality contrastive learning, scMDCF can capture consistent patterns across different omics datasets. Additionally, scMDCF leverages information entropy to maintain the unique individual characteristics inherent to each omics type. This dual approach ensures both the integration of shared information and the preservation of distinct biological attributes. The information entropy for each omics type is defined as follows:

$$\mathcal{H}(H_{RNA}) = -\sum_{u=1}^{d} P_{u\cdot}\ln P_{u\cdot} \tag{6}$$

$$\mathcal{H}(H_{O}) = -\sum_{v=1}^{d} P_{\cdot v}\ln P_{\cdot v} \tag{7}$$

Here, $\mathcal{H}(\cdot)$ denotes the information entropy, and $P_{u\cdot}$ and $P_{\cdot v}$ are the marginal distributions of the joint probability distribution over the latent dimensions of the RNA and alternative-omics representations, respectively. From an information theory perspective, a higher entropy value $\mathcal{H}$ indicates a more informative representation. Maximizing $\mathcal{H}(H_{RNA})$ and $\mathcal{H}(H_{O})$ helps avoid trivial clustering solutions where all samples are assigned to the same cluster.
The loss function for contrastive learning, $L_{cl}$, merges mutual information and entropy to optimize the learning process. It is formulated as follows:

$$L_{cl} = -\left(I(H_{RNA}; H_{O}) + \alpha\left(\mathcal{H}(H_{RNA}) + \mathcal{H}(H_{O})\right)\right) \tag{8}$$

where $\alpha$ is a balancing parameter that regulates the impact of entropy on the loss function. This contrastive learning approach yields a comprehensive representation of multi-omics data, capturing both shared and unique aspects of diverse omics datasets.
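As an illustration of how mutual information and entropy combine into a single objective, the sketch below computes a joint distribution from two sets of softmax outputs, then the mutual information, the two marginal entropies, and a loss equal to the negative of their weighted sum. It is a minimal sketch under stated assumptions: the function name `contrastive_loss`, the default `alpha=1.0`, the `eps` smoothing term, and the absence of any symmetrization of the joint distribution are illustrative choices, not details confirmed by the text.

```python
import math

def contrastive_loss(h_rna, h_other, alpha=1.0, eps=1e-12):
    """Return (loss, mutual_information) for row-wise probability vectors."""
    n, d = len(h_rna), len(h_rna[0])
    # Joint distribution: average outer product of the paired per-cell vectors.
    joint = [[sum(h_rna[i][u] * h_other[i][v] for i in range(n)) / n
              for v in range(d)] for u in range(d)]
    p_u = [sum(row) for row in joint]                               # RNA marginal
    p_v = [sum(joint[u][v] for u in range(d)) for v in range(d)]    # other-omics marginal
    mi = sum(joint[u][v] * math.log((joint[u][v] + eps) / (p_u[u] * p_v[v] + eps))
             for u in range(d) for v in range(d))
    ent_u = -sum(p * math.log(p + eps) for p in p_u)
    ent_v = -sum(p * math.log(p + eps) for p in p_v)
    # Minimizing the loss maximizes mutual information and both entropies.
    return -(mi + alpha * (ent_u + ent_v)), mi

# Perfectly aligned modalities versus statistically independent ones (toy data).
aligned = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned, mi_aligned = contrastive_loss(aligned, aligned)
rna = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
other = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
loss_indep, mi_indep = contrastive_loss(rna, other)
```

In this toy case the aligned pair reaches the maximal mutual information for two balanced classes (ln 2) and therefore a lower loss than the independent pair, whose mutual information is zero.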
Cross-modality feature fusion module
In scMDCF, we employ a two-step process to derive a unified representation from multi-omics data. Initially, our cross-modality contrastive learning module extracts both consistent and inconsistent latent features between different omics types. Following this, the cross-modality feature fusion module integrates and refines these features.
The process begins by concatenating the representations $H_{RNA}$ and $H_{O}$ into $H = [H_{RNA}, H_{O}]$. This step fuses different omic features from the same cell into a single representation. Subsequently, we use a nonlinear fusion network to process $H$ to produce a common latent representation $Z$ in a shared omic space, optimized for clustering tasks. The nonlinear fusion network contains one layer block for CITE-seq and two layer blocks for RNA-ATAC-seq data. Each layer block contains a Linear layer, a BatchNorm1d layer, and a ReLU activation layer, and is defined as follows:

$$Z = \sigma\left(\mathrm{BN}\left(H W_{f} + b_{f}\right)\right) \tag{9}$$

where $H$ is the representation of size $n \times 2d$ formed by concatenating $H_{RNA}$ and $H_{O}$ column-wise, which fuses different omics characteristics of the same cell, $\sigma(\cdot)$ is the activation function ReLU() of the nonlinear fusion network, $\mathrm{BN}(\cdot)$ signifies BatchNorm1d, which normalizes the input, and $W_{f}$ and $b_{f}$ represent the learnable weight matrix and bias within the nonlinear fusion network, respectively. This structured process ensures that the resulting common representation $Z$ effectively embodies both the individual characteristics and the shared attributes of the multi-omics data, thereby facilitating robust clustering analysis. The latent dimension of Z is $n \times z$, where z represents the dimension of the common latent representation of the multi-omics dataset.
Multi-omics reconstruction decoder
In our multi-omics analysis framework, each type of omics is first processed through a multi-omics informed encoder to augment feature representation across cells. Following this, a cross-modality contrastive learning module identifies both consistent and unique biological characteristics among modalities. A subsequent feature fusion module then integrates these into a common latent representation, $Z$. This representation is then decoded back into individual omics types using specific reconstruction decoders, structured inversely to their corresponding encoders. The multi-omics reconstruction decoder aims to reconstruct the original omics expression matrices:

$$\hat{X}_{RNA} = f_{dec}(Z), \qquad f_{dec}(Z) = \sigma\left(\mathrm{BN}\left(Z W_{dec} + b_{dec}\right)\right) \tag{10}$$

where $f_{dec}(\cdot)$ represents the function of a single layer block in the decoder. Each layer block consists of a Linear layer, followed by a BatchNorm1d layer and ReLU activation function. The decoder is composed of multiple such layer blocks. $Z$ is the common latent representation. $W_{dec}$ and $b_{dec}$ represent the weight matrix and the bias vectors for the Linear layer, respectively.
The reconstruction loss for each omics type is calculated using the mean squared error (MSE) between the original and reconstructed matrices as follows:

$$L_{rec}^{RNA} = \frac{1}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}\left(X_{RNA}^{(i,j)} - \hat{X}_{RNA}^{(i,j)}\right)^{2} \tag{11}$$

where $X_{RNA}$ represents the preprocessed omics matrix fed into the omics-specific encoder, and $\hat{X}_{RNA}$ is the reconstructed omics matrix output by the omics-specific reconstruction decoder.
Similar to the process for $X_{RNA}$, $X_{O}$ also undergoes a parallel operation in our model, yielding the reconstruction loss $L_{rec}^{O}$. This ensures that each omics type is individually and effectively decoded from the common latent representation, maintaining the integrity and distinct biological information inherent in each data type.
Self-optimizing multimodal deep embedded clustering
In the clustering phase of our model, we implement a self-optimizing multimodal deep embedded clustering approach [20, 21]. The core of this stage is the application of Kullback-Leibler (KL) divergence as the clustering loss function, which can enhance the association between similar cells and prevent clustering centroids from collapsing in the latent space. We provide a step-by-step explanation of the definitions and motivations behind the two core probability distributions, $q_{ij}$ and $p_{ij}$, as well as the KL divergence loss function used in scMDCF. These components are crucial for learning meaningful clusters from high-dimensional single-cell multimodal data.
The soft assignment t-distribution $q_{ij}$ is defined as:

$$q_{ij} = \frac{\left(1 + \lVert z_i - \mu_j \rVert^{2}\right)^{-1}}{\sum_{j'=1}^{k}\left(1 + \lVert z_i - \mu_{j'} \rVert^{2}\right)^{-1}} \tag{12}$$

Here, $z_i$ represents the embedded representation of the i-th cell, which is the common latent representation of the scMulti-omics data after the cross-modality feature fusion module, and $\mu_j$ refers to the initial centroid of cluster j, computed by applying K-means clustering to the embedded representation $Z$ with the cluster number k. This formula models the similarity between each cell $z_i$ and every cluster centroid $\mu_j$ using a Student’s t-distribution, which has the advantage of assigning lower probability to distant clusters while focusing more on nearby clusters. The use of the t-distribution allows scMDCF to better capture the local structure of the data and avoids over-assigning probability to outliers or distant clusters, which helps scMDCF maintain a more meaningful clustering structure. Next, to guide scMDCF to learn a more distinct and compact clustering structure, we introduce an auxiliary target distribution $p_{ij}$ that reflects the ideal clustering structure we aim to achieve. This target distribution is defined as:

$$p_{ij} = \frac{q_{ij}^{2} / \sum_{i'} q_{i'j}}{\sum_{j'=1}^{k}\left(q_{ij'}^{2} / \sum_{i'} q_{i'j'}\right)} \tag{13}$$

The design of $p_{ij}$ emphasizes the cells that are confidently assigned to a cluster (i.e., where $q_{ij}$ is high) while down-weighting ambiguous points. This makes $p_{ij}$ a more “refined” version of the soft assignments, an idealized clustering target that helps refine the clustering optimization process of scMDCF.
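The relationship between the soft assignments and the sharpened target can be seen in a small sketch. It assumes the standard deep embedded clustering formulation with one degree of freedom for the Student's t kernel; the function names `soft_assign` and `target_distribution` and the toy points are hypothetical.

```python
def soft_assign(z, centroids):
    """Student's t soft assignment of each embedded cell to each centroid."""
    q = []
    for zi in z:
        sims = [1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(zi, mu)))
                for mu in centroids]
        total = sum(sims)
        q.append([s / total for s in sims])
    return q

def target_distribution(q):
    """Square the soft assignments, normalize by soft cluster size, renormalize per cell."""
    k = len(q[0])
    cluster_mass = [sum(row[j] for row in q) for j in range(k)]
    p = []
    for row in q:
        weights = [row[j] ** 2 / cluster_mass[j] for j in range(k)]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p

z = [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0]]   # toy embeddings of three cells
centroids = [[0.0, 0.0], [5.0, 5.0]]       # toy cluster centers
q = soft_assign(z, centroids)
p = target_distribution(q)
```

For cells that are already confidently assigned, the target puts even more weight on the dominant cluster than the soft assignment does, which is what drives the self-training signal.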
The number of clusters is determined as follows. For datasets with ground truth labels, the number of cluster centers (K) is set to the actual number of clusters indicated by the labels. For the majority of datasets, which lack ground truth labels, we estimate the optimal K automatically in a data-driven manner: we apply Louvain clustering to the embedding learned by scMDCF, exploring resolution values from 0 to 1.0 in increments of 0.1, and select the resolution corresponding to the maximum silhouette score as the optimal clustering solution.
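The selection step can be sketched as follows. The sketch does not run Louvain clustering; it assumes a hypothetical dictionary `candidates` that maps each resolution to the labels Louvain would have produced, and it uses Euclidean distance for the silhouette score. The helper names are illustrative.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette width with Euclidean distance."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        return -1.0  # undefined for a single cluster; rank such a solution last
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        mean_dist = {}
        for c in clusters:
            members = [q for j, (q, lj) in enumerate(zip(points, labels))
                       if lj == c and j != i]
            if members:
                mean_dist[c] = sum(math.dist(p, q) for q in members) / len(members)
        if label not in mean_dist:   # singleton cluster: silhouette set to 0
            scores.append(0.0)
            continue
        a = mean_dist[label]                                   # intra-cluster distance
        b = min(d for c, d in mean_dist.items() if c != label)  # nearest other cluster
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)

def pick_best_resolution(points, candidates):
    """Return the resolution whose labeling has the highest silhouette score."""
    return max(candidates, key=lambda r: silhouette_score(points, candidates[r]))

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]   # toy embedding
candidates = {0.1: [0, 0, 0, 0], 0.5: [0, 0, 1, 1], 1.0: [0, 1, 2, 3]}
best = pick_best_resolution(points, candidates)
```

Here the intermediate resolution wins because it separates the two well-spaced groups without splitting them into singletons; the number of clusters in the winning labeling then serves as K.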
Finally, to align the learned distribution $q_{ij}$ with the target $p_{ij}$, we use the Kullback-Leibler (KL) divergence as the loss function:

$$L_{KL} = \sum_{i=1}^{n}\sum_{j=1}^{k} p_{ij}\log\frac{p_{ij}}{q_{ij}} \tag{14}$$

The KL divergence measures the difference between the target distribution $p_{ij}$ and the learned distribution $q_{ij}$. By minimizing this loss, we force the model to make the soft assignments $q_{ij}$ approach the idealized target $p_{ij}$, thus improving the clustering structure.
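A short numerical sketch shows the behavior of this loss: it is zero when the soft assignments already match the target and positive otherwise. The function name `kl_clustering_loss` and the `eps` smoothing term are assumptions for illustration.

```python
import math

def kl_clustering_loss(p, q, eps=1e-12):
    """Sum over cells and clusters of p * log(p / q)."""
    return sum(pij * math.log((pij + eps) / (qij + eps))
               for p_row, q_row in zip(p, q)
               for pij, qij in zip(p_row, q_row))

target = [[0.9, 0.1]]      # toy target distribution for one cell
uncertain = [[0.5, 0.5]]   # toy soft assignment that is still ambiguous
loss_mismatch = kl_clustering_loss(target, uncertain)
loss_match = kl_clustering_loss(target, target)
```

Note that the divergence is taken from the target to the learned distribution, so gradients push the soft assignments toward the sharpened target rather than the other way around.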
Joint embedding and clustering
In our study, we optimize the multimodal common latent representation of multi-omics data and the clustering task using an unsupervised clustering algorithm that proceeds in two stages: pre-training and training.
During the pre-training stage, we define the total loss for multi-omics data in our scMDCF model as the sum of the reconstruction losses for each omics type, along with a weighted contrastive loss, which is defined as:

$$L_{pre} = L_{rec}^{RNA} + L_{rec}^{O} + \lambda L_{cl} \tag{15}$$

where $L_{rec}^{RNA}$ and $L_{rec}^{O}$ are the reconstruction losses of the two omics types, $L_{cl}$ is the contrastive loss, and $\lambda$ is a hyperparameter that balances multi-omics data reconstruction and the optimization of the latent representation based on contrastive fusion. The parameter $\lambda$ is fixed at 0.1 in all datasets.
After pre-training, we initialize the cluster centers using K-means clustering within the multi-omics embedded feature space. The total loss in the training stage then combines the reconstruction losses for each omics type, the contrastive loss, and the Kullback-Leibler divergence loss as follows:
![]() |
16 |
where
,
and
are hyperparameters controlling the cluster optimization. In the optimization process, we carefully adjust the influence of different components through hyper-parameters. For consistency across all datasets, these parameters are set to specific values:
,
, and
.
Implementation details
In our study, we designed a model architecture tailored to both CITE-seq and SNARE-seq data, ensuring optimal processing and analysis of these multi-omics datasets. For the analysis of the CITE-seq data, our model features encoder layers set to {1000, 512, 64, 8} for RNA-seq and {the number of ADTs, 512, 64, 8} for ADT-seq. The corresponding decoder layers are structured as {8, 512, the number of ADTs} for ADT-seq and {8, 64, 512, 1000} for RNA-seq. In the case of the SNARE-seq data, the encoder layers for both RNA-seq and ATAC-seq are set to {2500, 512, 64}, while the decoder layers are {8, 64, 512, 2500} for RNA-seq and {8, 512, 2500} for ATAC-seq. The nonlinear fusion network layers for CITE-seq are configured as {16, 8}, and {128, 32, 8} for SNARE-seq data.
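The reported layer sizes can be cross-checked with a small sketch: the fusion network input should equal the sum of the two encoder output widths, and every decoder should start from the fusion output width. The function name `check_dims` is hypothetical, and the ADT panel size of 25 is a placeholder assumption, since the number of ADTs depends on the dataset.

```python
def check_dims(encoder_a, encoder_b, fusion, decoders):
    """Verify that concatenated encoder outputs feed the fusion net and decoders."""
    fused_input_ok = encoder_a[-1] + encoder_b[-1] == fusion[0]
    decoders_ok = all(dec[0] == fusion[-1] for dec in decoders)
    return fused_input_ok and decoders_ok

n_adt = 25  # placeholder assumption; the real value is the dataset's ADT count

cite_seq_ok = check_dims(
    encoder_a=[1000, 512, 64, 8],
    encoder_b=[n_adt, 512, 64, 8],
    fusion=[16, 8],
    decoders=[[8, 64, 512, 1000], [8, 512, n_adt]],
)
snare_seq_ok = check_dims(
    encoder_a=[2500, 512, 64],
    encoder_b=[2500, 512, 64],
    fusion=[128, 32, 8],
    decoders=[[8, 64, 512, 2500], [8, 512, 2500]],
)
```

Both configurations are internally consistent: two 8-dimensional CITE-seq latents concatenate to 16, and two 64-dimensional SNARE-seq latents concatenate to 128, with all decoders starting from the 8-dimensional common representation.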
During pre-training, the Adam optimizer with the AMSGrad variant is employed with a learning rate of 1e-2 to optimize reconstruction and contrastive losses over 200 epochs. Post-pretraining, K cluster centroids are initialized using the K-means algorithm in the model’s latent space. The clustering stage involves optimization of all loss functions, including clustering loss, using the Adadelta optimizer with a learning rate of 1e-3 and a rho of 0.95. This stage spans 200 epochs, with cluster centroid updates every 10 epochs, ensuring an effective and nuanced clustering process for multi-omics data analysis.
Competing methods
Seurat v4 (https://github.com/satijalab/seurat). Seurat v4 [9] is an R toolkit for multi-omics analysis that employs an unsupervised weighted nearest neighbor strategy to learn cell-specific modality “weights”, which facilitates integrative analysis of multiple modalities.
GLUE (https://github.com/gao-lab/GLUE). GLUE [10] is a deep learning framework based on a variational autoencoder, which incorporates a knowledge-based guidance graph linking genomic regions to genes according to their genomic proximity.
MultiVI (https://scvi-tools.org/). MultiVI [15] is a deep generative framework built on probabilistic assumptions specific to various omics modalities, facilitating the joint analysis of multimodal and unimodal single-cell datasets.
FigR (https://github.com/buenrostrolab/FigR). FigR [22] facilitates the integration of scRNA-seq and scATAC-seq datasets by aligning the modalities through a canonical correlation analysis-based method.
scMoMaT (https://github.com/PeterZZQ/scMoMaT). scMoMaT [23] employs a matrix tri-factorization framework that decomposes each count matrix into three components: a cell matrix, a feature matrix, and an association matrix that captures the interaction strengths between cells and features.
MOFA+ (https://github.com/bioFAM/MOFA+). MOFA+ [8] is a statistical framework based on the Bayesian group factor analysis for the integrative analysis of multi-omics data, which can infer a low-dimensional representation for the multi-omics data with a small number of (latent) factors capturing global sources of variability.
scMDC (https://github.com/xianglin226/scMDC). scMDC [14], an end-to-end deep multimodal autoencoder model, employs an encoder for concatenated data from different modalities and two decoders to separately decode the data from each modality. scMDC can distinctly characterize different data sources and jointly learn latent features of deep embedding for clustering analysis.
Cobolt (https://github.com/epurdom/cobolt). Cobolt [12] is a multimodal variational autoencoder based on a hierarchical Bayesian generative model, which can analyze and integrate multi-modality datasets like SNARE-seq with single-modality data.
BREM-SC (https://github.com/tarot0410/BREMSC). BREM-SC [7] is a Bayesian random effects mixture model that combines an MCMC algorithm with Dirichlet distributions to jointly cluster paired single-cell CITE-seq data.
totalVI (https://github.com/YosefLab/totalVI_reproducibility). totalVI [13] is a deep generative model designed for CITE-seq data analysis that learns a joint probabilistic representation that accounts for the unique noise, technical biases of each modality, and batch effects.
scDeepCluster (https://github.com/ttgump/scDeepCluster). scDeepCluster [24] is a clustering method based on a deep zero-inflated negative binomial (ZINB) model autoencoder which is designed for scRNA-seq data and efficiently maps read count matrix to a low-dimensional latent representation.
scziDesk (https://github.com/xuebaliang/scziDesk). scziDesk [25], a deep learning model, utilizes a denoising autoencoder for the characterization of single-cell RNA sequencing (scRNA-seq) data. In addition, scziDesk introduces a soft self-training K-means algorithm, specifically designed to cluster cell populations within the latent space derived from the data.
SCALE (https://github.com/jsxlei/SCALE?tab=readme-ov-file). SCALE [26] integrates a deep generative framework with a probabilistic Gaussian Mixture Model to adeptly learn latent features that accurately characterize single-cell ATAC sequencing (scATAC-seq) data.
Evaluation metrics
To thoroughly evaluate the performance of our clustering against ground truth labels, we utilized a diverse set of metrics, each offering unique insights into different aspects of cluster quality. These metrics include Normalized Mutual Information (NMI), Adjusted Rand Index (ARI), V-measure, Adjusted Mutual Information (AMI), and Fowlkes-Mallows Index (FMI).
- Normalized Mutual Information (NMI): We started with NMI, a symmetric measure that quantifies the mutual statistical information shared between two sets of clusterings. For our dataset with n cells, where X represents the ground truth labels and Y the predicted labels, the NMI is defined as:

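- Adjusted Rand Index (ARI): Next, the ARI was employed to measure the similarity between the predicted and the ground truth labels, corrected for chance agreement, thus reflecting the accuracy of our clustering method. The ARI formula is given by:

$$\mathrm{ARI}=\frac{\sum_{ij}\binom{n_{ij}}{2}-\left[\sum_{i}\binom{a_i}{2}\sum_{j}\binom{b_j}{2}\right]\Big/\binom{n}{2}}{\frac{1}{2}\left[\sum_{i}\binom{a_i}{2}+\sum_{j}\binom{b_j}{2}\right]-\left[\sum_{i}\binom{a_i}{2}\sum_{j}\binom{b_j}{2}\right]\Big/\binom{n}{2}} \tag{18}$$

where $n_{ij}$ is the number of cells with ground truth label i and predicted label j, $a_i=\sum_j n_{ij}$, $b_j=\sum_i n_{ij}$, and n is the total number of cells.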
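- V-measure: We also included the V-measure to simultaneously assess the homogeneity and completeness of our clustering. Among them, the completeness is defined as $\mathrm{Completeness}=1-\frac{H(Y\mid X)}{H(Y)}$, the homogeneity is defined as $\mathrm{Homogeneity}=1-\frac{H(X\mid Y)}{H(X)}$, and the V-measure is defined as:

$$V=\frac{2\times\mathrm{Homogeneity}\times\mathrm{Completeness}}{\mathrm{Homogeneity}+\mathrm{Completeness}} \tag{19}$$

where $H(\cdot\mid\cdot)$ denotes conditional entropy, X the ground truth labels, and Y the predicted labels.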
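- Adjusted Mutual Information (AMI): The AMI was utilized to account for the mutual information shared between the ground truth and the predicted clusters while adjusting for chance. It is defined as:

$$\mathrm{AMI}(X,Y)=\frac{I(X;Y)-\mathbb{E}[I(X;Y)]}{\frac{1}{2}\left[H(X)+H(Y)\right]-\mathbb{E}[I(X;Y)]} \tag{20}$$

where $\mathbb{E}[I(X;Y)]$ is the expected mutual information between random labelings with the same cluster sizes as X and Y.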
- Fowlkes-Mallows Index (FMI): Finally, the FMI evaluates the similarity among clusterings obtained by different algorithms. We applied it to assess the match between our predicted labels and the ground truth, taking into account true positive (TP), false positive (FP), and false negative (FN) counts over pairs of cells:

$$\mathrm{FMI}=\frac{TP}{\sqrt{(TP+FP)(TP+FN)}} \tag{21}$$
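As a concrete check on the pair-counting metrics described above, the following minimal sketch computes ARI and FMI from the contingency counts in plain Python. The helper names (`pair_counts`, `ari`, `fmi`) and the toy labels are illustrative assumptions and are not part of scMDCF; in practice an established implementation would normally be used.

```python
from collections import Counter
from math import comb, sqrt


def pair_counts(truth, pred):
    """Return (TP, FP, FN) counted over all unordered pairs of cells.

    TP: pairs grouped together in both labelings.
    FP: pairs grouped together only in the prediction.
    FN: pairs grouped together only in the ground truth.
    """
    n_ij = Counter(zip(truth, pred))   # contingency table entries
    a = Counter(truth)                 # ground-truth class sizes
    b = Counter(pred)                  # predicted cluster sizes
    tp = sum(comb(v, 2) for v in n_ij.values())
    fp = sum(comb(v, 2) for v in b.values()) - tp
    fn = sum(comb(v, 2) for v in a.values()) - tp
    return tp, fp, fn


def fmi(truth, pred):
    """Fowlkes-Mallows index: TP / sqrt((TP + FP) * (TP + FN))."""
    tp, fp, fn = pair_counts(truth, pred)
    denom = sqrt((tp + fp) * (tp + fn))
    return tp / denom if denom else 0.0


def ari(truth, pred):
    """Adjusted Rand index expressed through the same pair counts."""
    tp, fp, fn = pair_counts(truth, pred)
    total_pairs = comb(len(truth), 2)
    expected = (tp + fp) * (tp + fn) / total_pairs
    max_index = 0.5 * ((tp + fp) + (tp + fn))
    if max_index == expected:          # degenerate case, e.g. a single cluster
        return 1.0
    return (tp - expected) / (max_index - expected)


# Toy example (hypothetical labels): six cells, two true types, three clusters.
truth = [0, 0, 0, 1, 1, 1]
pred = [0, 0, 1, 1, 2, 2]
scores = {"ARI": ari(truth, pred), "FMI": fmi(truth, pred)}
```

Both scores equal 1 when the predicted partition matches the ground truth up to a relabeling of clusters.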
To rigorously evaluate the clustering performance of our method on unlabeled datasets, we employed three metrics: Average Silhouette Width (ASW), Davies-Bouldin Index (DB), and Calinski-Harabasz Index (CH). Each metric contributes a unique perspective on clustering quality, thereby enabling a comprehensive evaluation.
- Average Silhouette Width (ASW): This metric measures the level of similarity within clusters versus between clusters. A high ASW value indicates a clear delineation between clusters, while lower values suggest overlapping or poorly defined clusters. ASW is computed as:

$$\mathrm{ASW}=\frac{1}{n}\sum_{i=1}^{n}\frac{b(i)-a(i)}{\max\{a(i),\,b(i)\}} \tag{22}$$

where n is the number of cells, a(i) is the average intra-cluster distance, and b(i) is the average nearest-cluster distance for each cell i.
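- Davies-Bouldin Index (DB): This index evaluates the compactness and separation of clusters. A lower DB value indicates well-separated clusters with minimal variance within each cluster. It is defined as:

$$\mathrm{DB}=\frac{1}{k}\sum_{i=1}^{k}\max_{j\neq i}\frac{s_i+s_j}{d_{ij}} \tag{23}$$

where k is the number of clusters, $d_{ij}$ is the distance between the cluster centroids $c_i$ and $c_j$, and $s_i$ is the average distance of all cells in cluster i to the centroid of cluster i.
- Calinski-Harabasz Index (CH): The CH index quantifies the ratio of between-cluster dispersion to within-cluster dispersion, offering a measure of cluster validity:

$$\mathrm{CH}=\frac{\mathrm{Tr}(B)/(k-1)}{\mathrm{Tr}(W)/(n-k)} \tag{24}$$

where n is the number of cells, k is the number of clusters, Tr(B) is the trace of the between-cluster dispersion matrix, $\mathrm{Tr}(B)=\sum_{i=1}^{k}n_i\lVert c_i-c\rVert^2$, where $c_i$ is the centroid of cluster i, $n_i$ is the number of cells in cluster i, and c is the centroid of all cells; Tr(W) is the trace of the within-cluster dispersion matrix, $\mathrm{Tr}(W)=\sum_{i=1}^{k}\sum_{x\in C_i}\lVert x-c_i\rVert^2$, $C_i$ is the cluster i, and x is a cell in $C_i$.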
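The three label-free indices can be made concrete with a small sketch. The code below is a plain-Python illustration assuming Euclidean distance and at least two non-empty clusters; the function names and the four toy points are hypothetical, and the singleton-cluster convention (silhouette 0) is an assumption of this sketch rather than something stated in the text.

```python
from math import dist


def group(points, labels):
    """Collect points by cluster label."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    return clusters


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def asw(points, labels):
    """Average silhouette width: mean of (b - a) / max(a, b) over cells."""
    clusters = group(points, labels)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # dist(p, p) = 0, so dividing by len(own) - 1 averages over the others
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def davies_bouldin(points, labels):
    """Mean over clusters of the worst (s_i + s_j) / d_ij ratio."""
    clusters = group(points, labels)
    cents = {k: centroid(v) for k, v in clusters.items()}
    spread = {k: sum(dist(p, cents[k]) for p in v) / len(v)
              for k, v in clusters.items()}
    worst = [max((spread[i] + spread[j]) / dist(cents[i], cents[j])
                 for j in clusters if j != i)
             for i in clusters]
    return sum(worst) / len(worst)


def calinski_harabasz(points, labels):
    """[Tr(B) / (k - 1)] / [Tr(W) / (n - k)]."""
    clusters = group(points, labels)
    n, k = len(points), len(clusters)
    overall = centroid(points)
    tr_b = sum(len(v) * dist(centroid(v), overall) ** 2
               for v in clusters.values())
    tr_w = sum(dist(p, centroid(v)) ** 2
               for v in clusters.values() for p in v)
    return (tr_b / (k - 1)) / (tr_w / (n - k))


# Toy example (hypothetical): two tight clusters five units apart.
points = [(0, 0), (0, 1), (5, 0), (5, 1)]
labels = [0, 0, 1, 1]
```

On well-separated toy data such as this, ASW and CH are high and DB is low, which matches the direction of each index described above.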
Finally, to thoroughly assess our model’s ability to mitigate batch effect in multi-batch datasets, we employed three key metrics: Inverse Simpson’s Index of Integration (iLISI), Conditional Local Inverse Simpson’s Index (cLISI), and batchKL. These metrics were specifically chosen for their effectiveness in benchmarking the performance of the methods in reducing batch effect.
- Inverse Simpson’s Index of Integration (iLISI): This metric evaluates the degree of integration and mixing among different batches post-integration. A higher iLISI score indicates better performance in batch mixing. We calculated iLISI scores for each cell using the ‘compute_lisi’ function from the R package ‘lisi’ [27], and then determined the overall iLISI by averaging these scores.
- Conditional Local Inverse Simpson’s Index (cLISI): cLISI measures the extent of mixing of different cell types within the local neighborhoods of each cell. Scores close to 1 suggest that the clusters are primarily composed of a single cell type, indicating effective separation of cell types despite batch integration [28].

- BatchKL: This metric quantifies how well cells from different batches are mixed within each cluster, using the Kullback-Leibler divergence. Lower BatchKL values suggest better mixing of cells from different batches, which is desirable in batch effect mitigation.

$$\mathrm{BatchKL}=\sum_{b=1}^{B}p_b\log\frac{p_b}{q_b} \tag{26}$$

Here, B represents the total number of batches, $p_b$ is the proportion of cells from batch b in a specific cluster, and $q_b$ is the proportion of batch b cells across all clusters. The parameter settings are the same as in reference [20].
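To illustrate the batch-mixing quantities described above, the sketch below computes a per-cluster KL divergence between batch proportions and an unweighted inverse Simpson's index over a cell's neighborhood. The function names and toy batch labels are hypothetical, and the inverse Simpson's index here deliberately ignores any neighbor weighting that an actual LISI implementation may apply, so it is a simplification rather than a reimplementation of the ‘lisi’ package.

```python
from collections import Counter
from math import log


def proportions(labels, categories):
    """Fraction of `labels` falling in each category, in a fixed order."""
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in categories]


def batch_kl(cluster_batches, all_batches):
    """KL divergence between the batch mix of one cluster (p_b) and the
    batch mix of the whole dataset (q_b). Zero means perfect mixing."""
    categories = sorted(set(all_batches))
    p = proportions(cluster_batches, categories)
    q = proportions(all_batches, categories)
    # Terms with p_b = 0 contribute nothing to the sum.
    return sum(pb * log(pb / qb) for pb, qb in zip(p, q) if pb > 0)


def inverse_simpson(neighbor_labels):
    """Unweighted inverse Simpson's index of the labels in a neighborhood.

    Ranges from 1 (a single label) to the number of labels (even mixture).
    """
    n = len(neighbor_labels)
    counts = Counter(neighbor_labels)
    return 1.0 / sum((c / n) ** 2 for c in counts.values())


# Toy example (hypothetical): two equally sized batches.
all_batches = ["b1", "b1", "b2", "b2"]
mixed_cluster = ["b1", "b2"]     # both batches represented equally
unmixed_cluster = ["b1", "b1"]   # a single batch only
```

Applied to batch labels, a larger inverse Simpson's index indicates better batch mixing; applied to cell-type labels, a value close to 1 indicates that a neighborhood contains a single cell type.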
Identification of marker genes, peaks, and ADTs in cell clusters
Marker genes, peaks, and antibody-derived tags (ADTs) were identified using the ‘FindAllMarkers’ function from the ‘Seurat’ package [9]. To ensure the relevance and specificity of the identified markers, we set only.pos=TRUE to retain only positive markers, logfc.threshold=0.25 to select markers with at least a 0.25 log-fold change, and min.pct=0.25 to include markers present in at least 25% of cells within any cluster. We then selected the top five genes, peaks, and ADTs for each cell cluster based on their statistical significance and effect size, and used the ‘pheatmap’ function to generate heatmaps of the expression levels of these top markers within the respective cell clusters.
CellChat analysis
We used the ‘CellChat’ R package [29] to infer and visualize cell-cell communication networks. We first created a CellChat object with the ‘createCellChat’ function from the gene expression matrix and the cell annotation information, and set ‘CellChatDB.human’, a curated database of human ligand-receptor interactions, as the reference database. Using the ‘subsetData’ function, we then extracted the signaling genes from our dataset, isolating the genes relevant to intercellular communication. To determine the key ligands and receptors within each cell population, we applied the ‘identifyOverExpressedGenes’ and ‘identifyOverExpressedInteractions’ functions, and projected the expression values of the identified ligand-receptor pairs onto the protein-protein interaction (PPI) network using ‘projectData’. We then used ‘computeCommunProb’ to compute the communication probabilities between cells based on the expression and interaction data of the relevant ligand-receptor pairs, and ‘computeCommunProbPathway’ to summarize communication at the signaling pathway level across cell types. Finally, we visualized the inferred communication networks with ‘netVisual_aggregate’, which highlights the key interactions among cell populations.
Peak-gene association
Peak-gene association was performed in two steps. We first applied the ‘RegionStats’ function to calculate key genomic features of the peak regions, including GC content, region length, and dinucleotide base frequencies. We then associated each ATAC peak with potential gene targets using the ‘plotBrowserTrack’ function from the ‘ArchR’ R package [30], which enabled us to identify one or more gene targets for each ATAC peak in our dataset.
Motif analysis
To identify enriched transcription factor (TF) motifs in peak regions, we conducted a motif analysis using the ‘Signac’ package [32]. We first executed the ‘AddMotifs’ function on the ChromatinAssay object, using ‘BSgenome.Hsapiens.UCSC.hg38’ as the reference genome for our human samples and position frequency matrices (PFMs) from the JASPAR2020 database [31]. With the motifs added, we ran the ‘RunChromVAR’ function, and then used the ‘FindMotifs’ function to identify specific TF motifs within our target peak regions.
Functional enrichment
We performed Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses using the ‘clusterProfiler’ R package [33]. Specifically, we selected the top 400 highly variable genes for enrichment analysis. For the GO enrichment analysis, we used the ‘enrichGO’ function configured with the following parameters: keyType=“ENTREZID”, OrgDb=org.Hs.eg.db, ont=“ALL”, pvalueCutoff=0.05, qvalueCutoff=0.05, and readable=TRUE. Similarly, for the KEGG enrichment analysis, we used the ‘enrichKEGG’ function with organism=“hsa”, pvalueCutoff=0.05, and qvalueCutoff=0.05.
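For intuition about what an over-representation test of this kind computes, the following minimal sketch evaluates a one-sided hypergeometric p-value for the overlap between a query gene list and one annotated gene set. It is an illustration under stated assumptions, not the ‘clusterProfiler’ implementation: the function name and all gene counts other than the 400 query genes are hypothetical, and multiple-testing adjustment (as implied by the q-value cutoff) is omitted.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: number of background genes
    K: background genes annotated to the gene set
    n: number of query genes
    k: query genes annotated to the gene set
    """
    upper = min(n, K)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)


# Hypothetical numbers: 400 query genes drawn from a 20,000-gene background,
# 150 of which belong to the term, with 12 of the query genes in the term.
p_value = hypergeom_pvalue(k=12, n=400, K=150, N=20000)
significant = p_value < 0.05  # mirrors pvalueCutoff=0.05, before adjustment
```

With these numbers about three overlapping genes would be expected by chance, so an overlap of twelve yields a small p-value.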
Gene set enrichment analysis
We conducted a Gene Set Enrichment Analysis (GSEA) to investigate the biological significance of marker genes in our dataset. We first identified all marker genes using the ‘FindAllMarkers’ function [9] and ranked them by avg_log2FC. We then supplied the hallmark gene sets as the ‘TERM2GENE’ parameter of the ‘GSEA’ function [33]. This step assessed the enrichment of the hallmark gene sets in our ranked list of marker genes, providing insights into the biological processes and pathways that were significantly represented.
Gene set variation analysis
We carried out a Gene Set Variation Analysis (GSVA) to gain insights into the differential activity of gene sets across cell subpopulations. We first used the ‘AverageExpression’ function [9] to calculate the average expression level of each gene in each cell subpopulation. We then performed the GSVA enrichment analysis using the ‘gsva’ function [34], specifying the hallmark gene sets as the ‘gset.idx.list’ parameter and ‘ssgsea’ as the ‘method’ parameter, in order to identify gene sets whose enrichment profiles vary across the subpopulations.
Cell trajectory
Cell trajectory analysis was conducted using Monocle3 [35]. We first created a Monocle object with the ‘new_cell_data_set’ function and preprocessed the data with the ‘preprocess_cds’ function. We then incorporated the latent embedding obtained from scMDCF into the Monocle object as the outcome of dimensionality reduction, so that the trajectory analysis operated on the representation learned by scMDCF. For trajectory inference, we employed the ‘learn_graph’ function, which infers the developmental paths along which cells progress between states. Finally, we visualized the inferred trajectory using the ‘plot_cells’ function, with the latent embedding from scMDCF as the ‘reduction_method’ parameter and the predicted labels from scMDCF as the ‘color_cells_by’ parameter.
Results
Overview of scMDCF architecture
scMDCF efficiently learns a low-dimensional common representation from scMulti-omics data by comparing consistent and inconsistent features across different omics types, as shown in Fig. 1. The process of learning this common representation in scMDCF involves six main steps. (1) Preprocessing: raw matrices from various omics are normalized and highly variable features are selected. (2) Multi-omics informed encoding: each omics modality is passed through multiple sequentially stacked network units to obtain modality-specific embeddings. (3) Cross-modality contrastive learning: this step maximizes the shared biological patterns across modalities by optimizing mutual information, while preserving modality-specific features through information entropy regularization. (4) Cross-modality feature fusion: the module concatenates both the consistent and divergent latent features from distinct omics types and projects them into a common latent space. (5) Multi-omics reconstruction decoder: the common representation is decoded through the decoders back into individual omics feature expression matrices, and each decoder is structured symmetrically to the corresponding encoder. (6) Self-optimizing deep embedded clustering: this step refines the common latent space using a KL-divergence objective, which is jointly optimized with the reconstruction and contrastive losses.
Fig. 1.
The network architecture of scMDCF. a The overall workflow and functionality of scMDCF. b The detailed pipeline of scMDCF: scMDCF begins by taking preprocessed scMulti-omics datasets as inputs and preserves modality-specific features using Multi-omics informed encoders. Then, scMDCF extracts and integrates inconsistent biological features across various omics types with a cross-modality contrastive learning and feature fusion module. Finally, scMDCF utilizes an embedded self-optimizing clustering module to enhance clustering performance. c-d Analysis of biological significance. scMDCF can detect multimodal cellular interactions and trace B cell differentiation (c) in the CITE-seq dataset following BNT162b2 mRNA vaccination. Furthermore, scMDCF is applicable to the SNARE-seq dataset, identifying significant peak-gene associations and cis-regulatory elements (d)
Specifically, the multi-omics informed encoders take the preprocessed individual omics data matrices as input to learn preliminary compressed representations for each omics type. We first preprocess the raw expression matrices by normalizing the data and selecting highly variable features for further analysis. The encoder module then identifies key feature patterns in each omics type using multiple stacked network blocks, three for CITE-seq and two for scRNA-ATAC-seq. Each stacked network block consists of a fully connected layer, followed by a BatchNorm1d layer and a ReLU activation function. The fully connected layers reduce the dimensionality of the input omics expression matrix and enhance the modality-specific features. The BatchNorm1d layer then normalizes the enhanced features, which standardizes feature distributions across modalities, reduces technical discrepancies between them, and stabilizes the optimization process. Although BatchNorm1d does not directly extract modality-specific features, it facilitates the downstream contrastive learning that captures both shared and unique representations. This step helps scMDCF capture the biologically salient patterns inherent in each omics data type, enabling precise and differentiated omics profiling. The ReLU activation introduces non-linearity into the network and helps mitigate the vanishing gradient problem; applied repeatedly across blocks, it enables the encoder to capture complex, non-linear relationships within omics-specific features. Finally, at the end of the encoder, we apply the Softmax activation to convert the latent representation into probability distributions for the subsequent cross-modality contrastive learning module.
To extract and integrate both the shared patterns across modalities and the modality-specific heterogeneity in the common representation for scMulti-omics data, scMDCF employs the cross-modality contrastive learning module to maximize the mutual information between the enhanced feature spaces of different omics types. This approach advances previous autoencoder-based methods, which often integrate scMulti-omics data by simply concatenating the latent representations of each modality. Meanwhile, to ensure that each omics type retains its unique information, we maximize the information entropy of each omics-specific enhanced latent representation. High information entropy encourages diverse and informative distributions, preventing the collapse of representations into trivial solutions and thus preserving modality-specific patterns. Finally, to fully integrate the shared and modality-specific patterns into a common representation, we employ the cross-modality feature fusion module to integrate and refine these feature patterns. The cross-modality feature fusion module consists of a nonlinear fusion network, which contains several layer blocks. Each layer block includes a Linear layer, a BatchNorm1d layer, and a ReLU activation layer. This structured process ensures that the resulting common representation embodies both the individual characteristics and the shared attributes of the multi-omics data, thereby facilitating robust clustering analysis.
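One common way to turn "maximize mutual information while maximizing entropy" into a single objective is sketched below. This is an assumption-laden illustration, not the loss defined by scMDCF: the function names, the balance parameter `alpha`, and the toy inputs are hypothetical. The sketch treats the Softmax outputs of two modalities as soft assignments, estimates their joint distribution over paired cells, and returns the negative of I(Z1;Z2) + alpha * (H(Z1) + H(Z2)).

```python
from math import log


def joint_distribution(z1, z2):
    """Symmetrized joint distribution P[i][j] estimated from paired cells.

    z1, z2: lists of per-cell probability vectors (each row sums to 1),
    e.g. the Softmax outputs of two modality-specific encoders.
    """
    n, d = len(z1), len(z1[0])
    p = [[sum(z1[c][i] * z2[c][j] for c in range(n)) / n for j in range(d)]
         for i in range(d)]
    return [[(p[i][j] + p[j][i]) / 2 for j in range(d)] for i in range(d)]


def contrastive_loss(z1, z2, alpha=1.0, eps=1e-12):
    """Negative of [mutual information + alpha * (sum of marginal entropies)]."""
    p = joint_distribution(z1, z2)
    d = len(p)
    row = [sum(p[i]) for i in range(d)]                        # marginal of z1
    col = [sum(p[i][j] for i in range(d)) for j in range(d)]   # marginal of z2
    loss = 0.0
    for i in range(d):
        for j in range(d):
            pij = max(p[i][j], eps)   # clamp to avoid log(0)
            loss -= pij * (log(pij)
                           - (alpha + 1) * log(max(row[i], eps))
                           - (alpha + 1) * log(max(col[j], eps)))
    return loss


# Toy inputs (hypothetical): two cells, two latent dimensions.
aligned = [[1.0, 0.0], [0.0, 1.0]]        # modalities agree, both dims used
uninformative = [[0.5, 0.5], [0.5, 0.5]]  # second modality carries no signal
collapsed = [[1.0, 0.0], [1.0, 0.0]]      # every cell mapped to one dim
```

In this toy setting the loss is lowest when the two modalities agree and use all latent dimensions, higher when one modality is uninformative, and highest when both collapse to a single dimension, which is the trivial solution the entropy term is meant to discourage.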
To self-optimize the clustering task, scMDCF incorporates the self-optimizing multimodal deep embedded clustering module. We use the KL divergence to align the soft assignments with the auxiliary target distribution, both derived from the common representation. Here, the assignments act as soft labels that guide the iterative refinement of the clustering assignments (see Self-optimizing multimodal deep embedded clustering section for the formulations). To minimize the risk of deviations in the clustering distribution caused by unreliable assignments, we initialize the clustering centers through pretraining and K-means. Furthermore, we design an automatic strategy to estimate the optimal number of clusters, reducing the reliance on prior domain knowledge. Finally, the common latent representation is decoded back into individual omics feature expression matrices, with multi-omics reconstruction decoders being structured inversely to the corresponding encoders. By jointly optimizing the KL loss, reconstruction losses, and contrastive learning loss, scMDCF refines the common latent space for the scMulti-omics data. This comprehensive approach allows scMDCF to accurately balance clustering and feature learning, ensuring reliable performance in scMulti-omics data analysis. Beyond its remarkable clustering capabilities, the common representation of scMulti-omics data learned by scMDCF offers excellent visualization effects, preserving the associations between different cell subpopulations. This aspect is crucial in terms of application, as it enables us to integrate the clustering outcomes of scMDCF to identify putative candidate biomarkers and cis-regulatory elements for distinct cell types across various modalities. Significantly, scMDCF identified computational minority cell populations within large-scale datasets and revealed ELF1 as a putative candidate transcriptional repressor biomarker in Microglia. ELF1 may influence GTPase activity by upregulating regulon genes, potentially contributing to the suppression of Alzheimer’s pathology.
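The self-optimizing clustering step can be illustrated with the formulation widely used in deep embedded clustering, assumed here because the formulations themselves are given in a section outside this excerpt: a Student's t kernel yields soft assignments Q, a sharpened target distribution P is derived from Q, and the loss is KL(P || Q). Function names and the toy embedding are hypothetical.

```python
from math import log


def soft_assign(z, centers):
    """Soft assignment q_ij of cell i to center j (Student's t kernel, 1 dof)."""
    q = []
    for zi in z:
        w = [1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(zi, mu)))
             for mu in centers]
        total = sum(w)
        q.append([x / total for x in w])
    return q


def target_distribution(q):
    """Auxiliary target p_ij: square q_ij, divide by the soft cluster
    frequency f_j = sum_i q_ij, then renormalize each row."""
    f = [sum(row[j] for row in q) for j in range(len(q[0]))]
    p = []
    for row in q:
        w = [row[j] ** 2 / f[j] for j in range(len(row))]
        total = sum(w)
        p.append([x / total for x in w])
    return p


def kl_clustering_loss(p, q):
    """KL(P || Q) summed over cells and clusters."""
    return sum(pij * log(pij / qij)
               for p_row, q_row in zip(p, q)
               for pij, qij in zip(p_row, q_row) if pij > 0)


# Toy embedding (hypothetical): four cells near two centers, e.g. from K-means.
z = [[0.0, 0.0], [0.2, 0.0], [3.0, 0.0], [3.2, 0.0]]
centers = [[0.0, 0.0], [3.0, 0.0]]
q = soft_assign(z, centers)
p = target_distribution(q)
loss = kl_clustering_loss(p, q)
```

Because P emphasizes high-confidence assignments, minimizing this loss pulls each cell toward its most likely center, which is why reliable initial centers matter.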
scMDCF accurately performs integrative analysis of single-cell transcriptome and chromatin accessibility in multimodal data
To explore the effectiveness of scMDCF in the joint analysis of scMulti-omics data, we evaluated scMDCF on four well-labeled scRNA-ATAC-seq datasets and benchmarked its performance against state-of-the-art single-cell multi-omics analysis models, including scMDC [14], scCross [11], scMoMaT [23], Seurat-V4 WNN [9], MOFA+ [8], Cobolt [12], GLUE [10], MultiVI [15], and FigR [22]. We also included a comparison with two specialized single-cell RNA-seq clustering models, scDeepCluster [24] and scziDesk [25], along with SCALE [26], a dedicated framework for single-cell ATAC-seq analysis. The datasets utilized in this section were obtained from 10X Genomics and a benchmark study [36], and include four labeled scRNA-ATAC-seq datasets (pbmc 10x public, multiome bmmc, human brain 3k, and pbmc 10k). To investigate the performance of the single-cell multi-omics analysis models, we used Normalized Mutual Information (NMI), Adjusted Rand Index (ARI), Adjusted Mutual Information (AMI), V-measure, and Homogeneity (HOM) as the evaluation metrics, each calculated from the ground-truth cell-type labels and the predicted labels. As depicted in Fig. 2a, scMDCF exhibited the highest NMI, AMI, HOM, and V-measure values across all four datasets. Furthermore, scMulti-omics methods generally outperform single-omics methods, demonstrating their effectiveness in integrating diverse biological information from scMulti-omics data, which results in a notable enhancement of clustering performance.
Fig. 2.
Benchmarking of scMDCF in terms of cell clustering on scRNA-seq and scATAC-seq datasets. a Comparative evaluation of cell clustering performance on four labeled datasets using metrics such as NMI, ARI, V-measure, AMI, and Homogeneity. b Comparative evaluation of cell clustering performance on four unlabeled datasets utilizing ASW Score, CH Index, and DB Index. c UMAP visualizations of the cell embeddings of scMDCF and other competing methods on the ‘pbmc10k’ dataset
To explore and compare the embedding representations obtained by scMDCF and the baseline methods, we visualized the low-dimensional embedded representations of scMDCF, the original RNA data, the original ATAC data, and all baseline methods across three datasets. For GLUE [10] and scCross [11], we include the modality-specific embeddings as well as the combined embedding. Specifically, for the ‘pbmc 10k’ dataset, we leveraged the available ground-truth cell-type annotations to color the UMAP plots accordingly (Fig. 2c). For the ‘human brain 3k’ and ‘pbmc 10x public’ datasets, which lack such annotations, we colored each UMAP plot according to the unsupervised cluster IDs generated by each individual method (Additional File 1: Figs. S2–3). The UMAP visualizations on the ‘pbmc 10k’ dataset (Fig. 2c and Additional File 1: Fig. S1) reveal that scMDCF identified the same number of clusters as in the original RNA and ATAC data, whereas scMDC did not (Additional File 1: Fig. S4). Additionally, scMDCF preserved the subpopulation relationships between cell clusters, as seen in the close association between the naive B cell, intermediate B cell, and memory B cell clusters within the B cell population (marked by the red circles), as well as the relationship between the CD14 Mono, CD16 Mono, and cDC clusters. scMDCF also maintained a distinct separation between cells from different populations. In contrast, the other baseline methods did not cluster the B and Mono populations as effectively as scMDCF. This indicates that scMDCF not only achieves remarkable clustering performance but also preserves the inherent biological structure in the scMulti-omics data.
In addition to the well-labeled scMulti-omics datasets, numerous scMulti-omics datasets are unlabeled. To demonstrate the robust clustering capabilities of scMDCF, we applied it to four unlabeled scRNA-seq and ATAC-seq datasets: GSM4949911 (D1), nextgem chromium (D2), mouse brain 5k (D3), and hpap (D4). We quantified clustering performance using the average silhouette width (ASW), Davies-Bouldin (DB), and Calinski-Harabasz (CH) indices. Higher ASW values suggest closer proximity of cells within the same cluster, lower DB values denote better separation between clusters, and higher CH values indicate that the clusters are dense and well separated. So that higher values indicate better performance for every metric in the figure, we visualize the DB index values as their reciprocal. Figure 2b shows that scMDCF achieved the lowest DB values and the highest CH values for three out of four datasets. This observation highlights the capacity of scMDCF to robustly preserve cell-cell consistency. We further benchmarked scMDCF on a human brain single-nucleus multi-omics atlas comprising 105,332 nuclei from the dorsolateral prefrontal cortex (DLPFC) of Alzheimer’s disease (AD) donors and unaffected controls, profiled with snRNA-seq and snATAC-seq [16]. Additional File 1: Fig. S5 shows that scMDCF maintains superior integration and clustering performance at atlas scale. Moreover, scMDCF effectively corrects batch effects across distinct brain regions, achieving coherent cross-region integration while preserving cell-type structure (Additional File 1: Note 1).
To confirm the efficiency and scalability of scMDCF, we conducted a runtime comparison with other deep learning-based scMulti-omics methods. We utilized a large-scale public multiome dataset provided by the Sanger’s website [37], comprising 647,366 cells. We randomly subsampled this dataset at increasing cell counts: 5,000, 10,000, 50,000, 100,000, 150,000, 200,000, 250,000, 300,000, 350,000, 400,000, 450,000, 500,000, 550,000 and 600,000 cells. The experimental results are depicted in Additional File 1: Fig. S6, where the x-axis denotes the log-transformed runtime (in seconds). Compared to other deep learning-based scMulti-omics methods that can be accelerated using the GPU, scMDCF has much faster runtimes at all tested scales. We also observed that some methods encountered scalability limitations when dealing with larger cell counts: scCross [11], scMoMaT [23] and GLUE [10] could not scale beyond 100,000 cells, and scMDC could not complete training beyond 200,000 cells. In contrast, scMDCF is designed with a batch training strategy, enabling stable and efficient performance even when processing over 300,000 cells. These results confirm that scMDCF maintains excellent computational efficiency and scalability when applied to large-scale multi-omics datasets.
scMDCF exhibits outstanding performance in integrated analysis of CITE-seq data
We evaluated the integration capabilities of scMDCF using data from CITE-seq experiments, which pair single-cell RNA sequencing with single-cell surface protein profiling, across datasets that measure distinct sets of proteins. Among the CITE-seq datasets, the ‘inhouse’ dataset consists of 1,372 cells. These cells were processed on the 10X Genomics platform utilizing the Gel Bead Kit V2 and subsequently sequenced on an Illumina HiSeq, resulting in a sequencing depth of 50,000 reads per cell [7]. The ‘GSE128639’ dataset contains 30,672 cells and was derived from human bone marrow. The ‘spleen lymph’ dataset, originating from mouse spleen and lymph nodes, was processed in two separate experimental runs using the 10x Chromium system over two days. It has 16,282 cells distributed across two batches [13], providing a valuable resource to examine scMDCF’s efficacy under varying experimental conditions.
We compared the performance of scMDCF on the three well-annotated CITE-seq datasets with six state-of-the-art scMulti-omics clustering methods and two scRNA-seq-specific clustering methods using five metrics: NMI, ARI, AMI, V-measure, and the Fowlkes-Mallows index. The competing methods included scMDC [14], Seurat-V4 [9], MOFA+ [8], BREMSC [7], totalVI [13], and MultiVI [15] for CITE-seq clustering, and scDeepCluster [24] and scGAE [38] for scRNA-seq clustering. As depicted in Fig. 3a, scMDCF consistently achieved the highest AMI, NMI, and V-measure values across all datasets. Furthermore, scMDCF attained the highest Fowlkes-Mallows index and ARI values on two out of the three datasets. Notably, clustering methods tailored to scRNA-seq consistently had the lowest performance across all datasets. These observations indicate the superiority and necessity of scMulti-omics methods for analyzing CITE-seq datasets. In addition, we found that scMDC was the second-best method across all evaluation metrics, highlighting the effectiveness of deep learning methods that concatenate modality-specific representations to cluster CITE-seq data. Nevertheless, our proposed cross-modality contrastive learning-based model surpassed scMDC, demonstrating its superior clustering capabilities in CITE-seq analysis.
Fig. 3.
Benchmarking of scMDCF in terms of cell clustering on CITE-seq datasets. a Comparative evaluation of clustering performance across three labeled CITE-seq datasets using metrics such as AMI, Fowlkes-Mallows, NMI, ARI, and V-measure. b UMAP visualizations of the cell embeddings of scMDCF and other competing methods on the ‘inhouse’ dataset. c UMAP visualizations of the cell embeddings of scMDCF and other competing methods on the ‘Spleen Lymph’ dataset, with cells colored by predicted labels (left) and batch labels (right)
Figure 3b shows the UMAP plots of the embeddings of scMDCF, MultiVI [15], scMDC [14], MOFA+ [8], Seurat [9], totalVI [13], and BREMSC [7], as well as the original RNA and ADT data and two scRNA-seq-specific deep clustering methods, scDeepCluster and scGAE, on the ‘inhouse’ dataset. Notably, MOFA+ [8] struggled to distinguish ‘unknown’ cell types from others, as highlighted by the surrounding red circle. Similarly, totalVI [13] had difficulty separating NK cells from ‘unknown’ cell types, also marked by a surrounding red circle. Furthermore, Seurat [9] failed to cluster ‘unknown’ cell types effectively, resulting in a striated pattern in the UMAP plot (surrounded by a red circle). In contrast, BREMSC [7] clearly separated different cell clusters, but sacrificed the biological relationships between the cell populations present in the raw RNA and ADT data. Both scMDCF and scMDC effectively separated the specific cell clusters, but scMDCF excelled in preserving the inherent biological relationships of the original omics data. Specifically, scMDCF distinctly clustered NK cells, CD8+ T cells, and B cells, mirroring the patterns observed in the raw ADT and RNA data. Additionally, scMDCF separated ‘unknown’ cells from CD14+ monocytes while maintaining the similarity relationship between CD14+ and CD16+ monocytes. The UMAP plots for the ‘GSE128639’ dataset are shown in Additional File 1: Fig. S7. Overall, scMDCF achieves superior clustering performance and simultaneously preserves the biological relationships within the omics data.
Next, to investigate a potential batch effect in data generated by different sequencing protocols, we present in Fig. 3c the low-dimensional embedded representations of scMDCF, six competing scMulti-omics methods, and the original embeddings of the RNA and ADT modalities from the ‘spleen lymph’ dataset, which contains two batches and thirty-five cell types. We see that scMDCF integrated cells from the two batches in most of its clusters. However, MOFA+ [8], BREMSC [7], and totalVI [13] clustered cells from different batches into separate cell clusters, indicating a limited ability to handle batch effects. In addition, we used Inverse Simpson’s index of integration (iLISI), conditional Local Inverse Simpson’s Index (cLISI), and batchKL to demonstrate the capability of scMDCF to mitigate the batch effect. Additional File 1: Fig. S8a illustrates that scMDCF outperformed the baseline methods, achieving the highest iLISI, 1/cLISI, and 1/batchKL values. While all the baseline methods exhibited similar iLISI and 1/batchKL values, this comparison highlights that only scMDCF achieves superior clustering performance while significantly mitigating batch effects.
Evaluation of hyperparameter selection and ablation study
We conducted a comprehensive assessment of the impact of several factors of our cross-modality feature fusion module on performance. The factors included the number of layers, the dimensions of the latent embeddings, the number of selected genes, the hyperparameter of the cross-modality contrastive learning model, and the different loss weights. We used three RNA+ATAC datasets and three CITE-seq datasets.
Figure 4a and b display the clustering performance of our scMDCF model with different numbers of layers and dimensions of latent embeddings in the cross-modality feature fusion module. Specifically, we used seven metrics (NMI, ARI, AMI, FMI, HOM, COM, and V-measure) for the evaluation on the six RNA+ATAC and CITE-seq datasets and calculated the average performance. The results demonstrated that scMDCF achieved optimal performance with two layers on the RNA+ATAC datasets and one layer on the CITE-seq datasets. Furthermore, the optimal dimensionality of the common latent representation on these datasets was 8. Next, we optimized the loss weights, a pivotal factor in enhancing the performance of the cross-modality feature fusion module. Figure 4c illustrates the clustering performance of scMDCF on the six datasets with different loss weights. The training loss function contained reconstruction losses for the two types of omics data, a contrastive fusion loss, and a clustering loss. We selected 22 distinct loss weighting schemes (Additional File 2: Table S1) based on previous experience [39], each associated with a unique set of loss weights. To evaluate the performance of the various loss weights, we calculated the average NMI values on each dataset and eventually identified the most suitable scheme. Figure 4c presents the bubble chart for all cases. Notably, the 16th scheme (RNA-seq reconstruction loss weight of 0.1, ATAC or ADT-seq reconstruction loss weight of 10, clustering loss weight of 1, and contrastive fusion loss weight of 5) was the most effective across all datasets, especially for the ‘public 10x pbmc’ (scRNA-seq and ATAC-seq) and ‘human brain 3k’ (CITE-seq) datasets.
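To make the role of the loss weights concrete, the following minimal sketch combines the four loss terms named above using the weights reported for the 16th scheme. It assumes the terms are combined as a plain weighted sum, which this section does not state explicitly; the class `LossWeights`, the function `total_loss`, and the unit loss values are hypothetical placeholders rather than the scMDCF implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    rna_recon: float           # RNA-seq reconstruction loss weight
    second_recon: float        # ATAC or ADT reconstruction loss weight
    clustering: float          # clustering loss weight
    contrastive_fusion: float  # contrastive fusion loss weight


# Weights of the 16th scheme as reported in the text.
SCHEME_16 = LossWeights(
    rna_recon=0.1, second_recon=10.0, clustering=1.0, contrastive_fusion=5.0
)


def total_loss(weights, rna_recon, second_recon, clustering, contrastive_fusion):
    """Weighted sum of the four loss terms (assumed additive combination)."""
    return (
        weights.rna_recon * rna_recon
        + weights.second_recon * second_recon
        + weights.clustering * clustering
        + weights.contrastive_fusion * contrastive_fusion
    )


# With every raw loss term equal to 1.0, the total is simply the sum of weights.
example_total = total_loss(SCHEME_16, 1.0, 1.0, 1.0, 1.0)
```

Under this assumption, the scheme down-weights the RNA reconstruction term by two orders of magnitude relative to the second modality.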
Fig. 4.
Hyperparameter tuning and ablation study of scMDCF. a Radar plots illustrating the clustering performance with different layers of the cross-modality feature fusion module for CITE-seq datasets (left) and RNA+ATAC datasets (right). b Radar plots showing clustering performance with varying dimensions of the common representation for CITE-seq datasets (left) and RNA+ATAC datasets (right). c Bubble plot depicting clustering performance (NMI) under various fine-tuning training loss weights for RNA+ATAC datasets and CITE-seq datasets. d Evaluation of clustering performance with different numbers of selected genes for RNA+ATAC datasets (left) and CITE-seq datasets (right). e Analysis of the impact of different hyperparameter values of the cross-modality contrastive learning module on RNA+ATAC datasets (left) and CITE-seq datasets (right). f Results from an ablation study revealing the impact of various types of omics data on clustering performance. g Results from an ablation study highlighting the effect of the cross-modality contrastive learning module on clustering performance
After exploring the model’s architecture, dimensions, and loss weighting schemes, we investigated how different numbers of highly variable genes might affect the performance of scMDCF. We set the numbers of selected genes to {1000, 2000, 2500, 3000, 4000} and evaluated the clustering performance of scMDCF under these conditions. These thresholds were selected considering the trade-off between the capture of relevant biological insights and computational efficiency. As depicted in Fig. 4d, for most RNA+ATAC datasets, the optimal NMI value was achieved with 2,500 genes. Conversely, for the CITE-seq datasets, the optimal NMI value was generally achieved when the gene count was set to 1,000. Notably, we observed distinct sensitivities to gene counts between the CITE-seq and RNA+ATAC datasets, likely due to the substantial disparities in dimensionality between ADT and ATAC data. Typically, ADT data have only a few hundred dimensions, whereas ATAC data contain several hundred thousand dimensions. The hyperparameter of the cross-modality contrastive learning module serves as a regulator of the information entropy. Figure 4e presents the impact of different values of this hyperparameter on the clustering performance of scMDCF. We conducted experiments using values drawn from the set {0.1, 0.5, 1, 5, 10}. For the majority of RNA+ATAC datasets, the highest NMI was attained with the hyperparameter set to 10. This phenomenon can be attributed to the high dimensionality and sparsity of ATAC and RNA-seq data. In contrast, for the CITE-seq datasets, the optimal clustering performance was observed at 0.5, owing to the distinctive characteristics of ADT data. In summary, our comprehensive analysis provided critical insights into the factors influencing the performance of scMDCF.
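The gene-count settings discussed above amount to keeping only the top-ranked highly variable genes before training. The sketch below ranks genes by plain population variance and keeps the first `n_genes`; this is a deliberately simplified assumption, since this section does not specify the selection procedure and common toolkits use dispersion- or variance-stabilised rankings instead. The function name `top_variable_genes` and the toy matrix are hypothetical.

```python
from statistics import pvariance


def top_variable_genes(expression, gene_names, n_genes):
    """Return the names of the `n_genes` genes with the largest variance.

    `expression` is a list of cells, each a list of per-gene values in the
    same order as `gene_names`.
    """
    # Transpose cells x genes into one column of values per gene.
    variances = [pvariance(column) for column in zip(*expression)]
    ranked = sorted(
        range(len(gene_names)), key=lambda g: variances[g], reverse=True
    )
    return [gene_names[g] for g in ranked[:n_genes]]


# Toy matrix: 3 cells x 3 genes (placeholder values, not real data).
toy_expression = [
    [0, 5, 1],
    [0, 0, 1],
    [0, 10, 2],
]
toy_genes = ["gene_a", "gene_b", "gene_c"]

selected = top_variable_genes(toy_expression, toy_genes, n_genes=2)
```

In a sweep such as the one reported here, `n_genes` would take each value in {1000, 2000, 2500, 3000, 4000} in turn, with the clustering metrics recomputed for each setting.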
Finally, we applied scMDCF separately to the RNA and ADT data in the CITE-seq datasets, as well as to the RNA and ATAC data in the RNA+ATAC datasets. The experimental results are summarized in Fig. 4f. scMDCF exhibited the best clustering performance on multi-omic data, indicating that it effectively integrates complementary information from multiple omics datasets. Furthermore, we evaluated scMDCF clustering performance after excluding the cross-modality feature fusion module. As depicted in Fig. 4g, removal of this module decreased clustering performance across all datasets, highlighting the essential role of the cross-modality feature fusion module and its contribution to stability when handling scMulti-omics datasets.
scMDCF can infer biologically meaningful peak-gene associations and cell type specific cis-regulatory elements from single-cell multimodal data
To investigate the application of scMDCF, we applied the model to the SNARE-seq dataset collected with 10X Genomics technology and assessed its ability to find cell-type-specific cis-regulatory elements in multimodal data, with the aim of furthering our understanding of cellular heterogeneity and molecular processes. This dataset includes joint measurements of the transcriptome and chromatin accessibility in single cells or cell nuclei [40].
To comprehensively characterize the cell populations present in multimodal data from Peripheral Blood Mononuclear Cells (PBMCs), we started by pre-processing the data and then clustered it using the scMDCF algorithm. Our model successfully identified eight distinct cell clusters; the UMAP visualization plots of the baseline methods are provided in Additional File 1: Fig. S10. Guided by the distinct cell clusters identified, we detected the top five differentially-expressed genes (Additional File 1: Fig. S11 left) associated with each cluster and then annotated the cell clusters using the known marker genes (Fig. 5b). Finally, we identified eight cell types: Plasma B cells (marker genes: BANK1, IGKC) [41], NK cells (marker gene: GNLY), CD4+ memory T cells (marker gene: IL7R), CD8+ T cells (marker gene: IFITM3) [42], monocytes (marker gene: IRF8) [43], endothelial cells (marker gene: PLXDC2) [44], pDC (marker genes: PLD4, PTGDS) [45, 46], and Naive B cells (marker gene: FCRL1) [47], as shown in Fig. 5a. Further, integrating ATAC-seq data with the cell types and employing the FindAllMarkers function from Seurat [9], we identified the top five marker peaks for each cell type, illustrated in Additional File 1: Fig. S11 right. To explore the peak-gene association from single-cell multimodal data, we employed the addPeak2GeneLinks function [30] to link chromatin accessibility peaks to genes within each identified cell type. Figure 5e provides the peak-gene linkages, and we employed a k-nearest neighbor approach to cluster pseudobulked samples based on the accessibility of the linked cis-regulatory elements (CREs), using k-means clustering to identify co-occurring regulatory modules. Most of the peaks associated with the marker genes were also consistent with the marker peaks of the corresponding cell types (Fig. 5c and d). For example, the gene GZMB is a marker gene for NK cells, and the most highly associated peak ‘chr14-24634018–24634551’ was highly accessible only in NK cells (Fig. 5d). In a parallel example, IFITM3 is a marker for CD4+ memory T cells and correlated with the peak ‘chr5-134086746–134087439’, which similarly showed selectively high accessibility in CD4+ memory T cells (Fig. 5c). These findings were not isolated instances; Fig. 5e provides multiple examples of gene-peak relationships, underscoring the precision of our approach to recognizing cell type-specific cis-regulatory elements.
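Peak-to-gene linkage of the kind described above rests on correlating a peak's accessibility with a gene's expression across aggregated (pseudobulk) samples. The following is a minimal sketch of that idea, assuming plain Pearson correlation over matched aggregates; it is not the addPeak2GeneLinks implementation, and the function names `pearson` and `link_peaks_to_gene`, the 0.5 cutoff, the peak identifiers, and the toy values are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def link_peaks_to_gene(peak_accessibility, gene_expression, min_corr=0.5):
    """Keep peaks whose accessibility correlates with the gene's expression
    across pseudobulk samples at or above `min_corr` (illustrative cutoff)."""
    links = {}
    for peak, accessibility in peak_accessibility.items():
        r = pearson(accessibility, gene_expression)
        if r >= min_corr:
            links[peak] = r
    return links


# Toy pseudobulk profiles over four aggregates (placeholder values).
toy_gene_expression = [1.0, 2.0, 3.0, 4.0]
toy_peaks = {
    "peak_A": [2.0, 4.0, 6.0, 8.0],  # rises with the gene
    "peak_B": [4.0, 3.0, 2.0, 1.0],  # falls as the gene rises
    "peak_C": [1.0, 1.0, 1.0, 1.0],  # constant accessibility
}

toy_links = link_peaks_to_gene(toy_peaks, toy_gene_expression)
```

Only the positively correlated peak survives the cutoff in this toy case, mirroring how a marker gene ends up linked to the peaks that open in the same cell type.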
Fig. 5.
scMDCF efficiently identifies peak-gene associations and cis-regulatory elements of single-cell multimodal data. a UMAP visualization of scMDCF clustering results on the SNARE-seq dataset, with cell clusters annotated using marker peaks and genes. b Feature plots illustrating the expression levels of marker genes for each cell cluster. c Genomic tracks for chromatin accessibility around the IFITM3 locus. Right: integrated IFITM3 expression levels are shown in the violin plot for each cell type. Loops shown below the top panel indicate peak-to-gene linkages identified on the full dataset. Genes show the genomic tracks for accessibility around IFITM3. Bottom: Motif enrichment plot for BHLHE40, correlated with the peak region. d Genomic tracks for chromatin accessibility around the GZMB locus. Right: integrated GZMB expression levels are shown in the violin plot for each cell type. Loops shown below the top panel indicate peak-to-gene linkages identified on the full dataset. Genes show the genomic tracks for accessibility around GZMB. Bottom: Motif enrichment plot for JUNB, correlated with the peak region. e Heatmap showing chromatin accessibility (left) and gene expression (right) for 492,882 peak-to-gene linkages. f TF footprint for the JUNB motif, normalized by subtracting the Tn5 bias. g Heatmap of the GSVA pathway enrichment analysis
To exemplify modality-specific features and cell type-specific cis-regulatory mechanisms between peaks and genes, we first used coverage plots to visualize and further clarify the associations between chromatin accessibility and gene expression across the various cell types, as shown in Fig. 5c and d. We observed marked differences in the chromatin landscape surrounding key genomic loci. Figure 5c presents a comparative view of chromatin accessibility near the IFITM3 gene across several cell types, including NK, endothelial, CD8+ T cells, and others, highlighting distinct peaks corresponding to each cell type’s regulatory profile. Similarly, Fig. 5d focuses on the GZMB gene, where the NK cell-specific peak indicates a potential cis-regulatory element driving gene expression. Together, these results emphasize the intricate link between chromatin accessibility and gene regulation, providing insight into the regulatory structures.
We then conducted a motif enrichment analysis based on the linked peaks using the FindMotifs function of the Signac package [32] to identify transcription factor binding sites associated with the cell type-specific chromatin accessibility peaks. We selected one motif from the top-ranked motifs for visualization (Fig. 5c and d), which highlights the consensus sequences for the transcription factor binding motifs identified within the chromatin accessibility peaks. In particular, the peak ‘chr11-326533–327530’, associated with the IFITM3 gene, pointed to BHLHE40 as a significant transcription factor. This peak and the IFITM3 gene are known CD8+ T cell markers. The transcription factor BHLHE40 is known to be critically important for the development and maintenance of CD8+ T cell functionality, particularly regarding mitochondrial health and epigenetic regulation [48]. In addition, we correlated the motif activity with the expression level of the corresponding TFs. In Fig. 5f, we visualize the TF footprint for the JUNB motif; a JUNB binding site has been identified on the GZMB promoter [49]. These findings confirm the efficacy of scMDCF in identifying cell type-specific cis-regulatory elements and its potential to unravel the intricacies of gene regulation in cellular phenotypes.
To investigate the cell type-specific cis-regulatory elements discovered by our scMDCF model, we used Gene Set Enrichment Analysis (GSEA) and Gene Set Variation Analysis (GSVA) to uncover the regulatory mechanisms behind gene expression and chromatin accessibility variations in the specific cell types. We first selected the top 20 most highly expressed genes from each cell type identified by scMDCF, resulting in a total of 160 highly expressed genes. We noted that the ‘graft rejection’ pathway was highly enriched in the GSEA (Additional File 1: Fig. S12). Stimulatory signaling from dendritic cells (DC), lymphocyte activation, recirculation, and integration within PBMCs play dominant roles in kidney transplant rejection. Subsequently, we applied GSVA to assess the functional states of the cell clusters to better understand the unique gene expression patterns of each cell type. The heatmap of the GSVA pathway (Fig. 5g) revealed a notable enrichment of the allograft rejection signaling pathway, highly associated with NK cells and CD8+ T cells. This finding aligns with the established understanding that antigen-triggered T cell activation, followed by infiltration of activated CD8+ T cells and NK cells, plays a crucial role in acute allograft rejection [50]. Moreover, previous research has shown that a BHLHE40 deficiency significantly reduces allograft rejection [51]. In conjunction with our analysis, which identified an enriched motif within the ‘chr11-326533–327530’ peak specific to CD8+ T cells, these results underscore the ability of scMDCF to uncover cell-type-specific cis-regulatory elements through multimodal data analysis.
scMDCF can accurately identify cell types from joint modalities and infer the intricate cellular interactions involved in immune regulation
To assess the effectiveness of the scMDCF algorithm in CITE-seq data analysis, we applied it to a CITE-seq dataset of human bone marrow mononuclear cells (GSE128639), containing single-cell transcriptome and surface proteome data from 30,672 cells. We first clustered the CITE-seq data using scMDCF and identified 10 distinct cell clusters, as shown in Fig. 6a, highlighting the distinct expression profiles and cellular heterogeneity within the dataset. We then annotated the cell clusters based on known marker genes and surface proteins, because cell surface proteins show greater variation between cell types and are more constant within the same cell type than their corresponding RNA expression. We selected the top five differentially expressed genes and ADTs for each cluster. The heatmap plots in Additional File 1: Fig. S13a and Additional File 1: Fig. S13b illustrate the expression levels of these marker ADTs and genes. Finally, we determined the 10 cell types as NK cells (marker ADT: CD56, marker gene: KLRF1) [52], Precursors (marker ADT: CD34) [53], monocytes (marker genes: FCGR3A, MS4A7) [54], activated B cells (marker gene: IGHM, marker ADT: CD19) [55], CD8+ T cells (marker genes: CD8B, CD8A), Naïve CD4+ T cells (marker gene: FHIT), pDC (marker gene: IRF8, marker ADT: CD123) [56], Tregs (marker gene: TRAC), NKT (marker genes: CCL5, CMC1), and CD14+CD16+ monocytes (marker gene: S100A12) [42].
Fig. 6.
scMDCF can identify heterogeneity and cellular interactions in CITE-seq data of human bone marrow mononuclear cells. a UMAP visualization of scMDCF clustering results on the CITE-seq dataset, with cell clusters annotated using marker ADTs and genes. b Diagram showing the expression distribution of the differential RNA (left) and ADT (right). The chord diagram (centre) illustrates the discrepancies between the cell annotation outcomes from joint omics clustering in CITE-seq and the cell annotation results from single-omics clustering. c-d Track plots representing the top marker ADTs (c) and the genes (d) encoding the specific ADTs of each cell cluster, including protein CD34 for Precursor cells, protein CD19 for Activated B cells, protein CD56 for NK cells, and protein CD28 for Treg cells. e The circle plot of cell-cell communication shows the signaling pathway of ‘IL2’. f The bubble plot depicts a comparative analysis of significant ligand-receptor pairs instrumental in ‘IL2’ signaling from Treg cells to all other cell types. Dot color reflects communication probabilities and dot size represents computed p-values. Empty space means the communication probability is zero. p-values are computed from a one-sided permutation test. g The KEGG enrichment plot for the human bone marrow mononuclear cell dataset. h Heatmap of top 20 marker gene expression levels for each cell cluster. i The DNA binding motifs for some cell types
To illustrate the correlation between genes and their corresponding ADTs, we analyzed the expression distribution and expression levels of the genes and their encoded ADTs in Fig. 6c and d. For example, the Precursor cell cluster exhibited the highest expression level of ADT CD34, correlating with the presence of the gene CD34 (encoding the surface protein CD34) within the cluster. Similarly, CD19, a biomarker for activated B cells [55], is encoded by the CD19 gene, both showing the highest expression levels in activated B cells. In the case of NK cells, the CD56 antibody, a marker for natural killer (NK) cells, is encoded by the NCAM1 gene [57], with both CD56 and NCAM1 showing high expression in NK cells. This indicates the effectiveness of our approach in linking gene expression to protein markers. It was evident that some cell populations remained uncharacterized when using the individual modality analysis.
To show the superiority of multi-modal clustering in CITE-seq compared to single-modality clustering, we conducted separate clustering on the RNA-seq and ADT-seq data of the CITE-seq dataset. In Fig. 6b, we visualized the results of RNA and ADT single-modality clustering. In the individual RNA-seq clustering analysis, Treg and NKT cells were not clearly differentiated. However, the ADT sequencing revealed distinct markers for these cells, such as CD4 and CD25 (Fig. 6b). Additionally, naive CD4+ T and CD8+ T cells, which were not distinct in the scRNA-seq analysis, were effectively differentiated using ADT markers such as CD4 and CD8a (Fig. 6b). This indicates the integral role of ADT sequencing in complementing RNA-seq, especially in cases where RNA-seq falls short in clarity. Similarly, our analysis revealed that relying solely on ADT-seq presents other issues. In the individual clustering of the ADT-seq data, Treg and naive CD4+ T cells overlapped observably at the cluster boundaries, but they were distinguished by RNA markers such as FHIT (Fig. 6b). NKT and Treg cells, which tended to converge at the edge of the cluster in the ADT-seq analysis, were distinctly identified by the RNA marker CD8A (Fig. 6b). This scenario illustrates the necessity of integrating RNA markers in ADT-seq analysis to achieve a nuanced and accurate resolution for cell type identification. Overall, using the scMDCF algorithm in joint modality analysis successfully leverages the strengths of both RNA-seq and ADT-seq.
Finally, we explored cell types that are challenging to distinguish in single-omics clustering, along with their differentially-expressed genes, which often reveal key biological insights. We first selected the top 20 highly expressed genes from each cell type identified by scMDCF on this multi-omics dataset, resulting in a total of 200 highly expressed genes (Fig. 6h). Specifically, we employed RcisTarget [58] to identify transcription factors (TFs) within these genes and determined the DNA binding motif exhibiting the highest enrichment score across all cell clusters (Fig. 6i). Then, we performed KEGG pathway analysis to map these 200 genes to known biological pathways, enabling us to understand their roles in cellular processes and disease mechanisms, as shown in Fig. 6g. This analysis highlighted the ‘Th17 cell differentiation’ signaling pathway. The balance between Th17 cells and Tregs has emerged as a prominent factor in regulating autoimmunity and cancer [59], and CD28 is required for the differentiation of Tregs from naive T cells [60]. We then visualized the differential expression of the CD28 protein in two types of cells (Additional File 1: Fig. S14a and Fig. 6d) and observed that Treg cells show the highest expression of CD28. Our clustering results therefore offer further insight into Treg cell differentiation. We performed cell-cell communication analysis with CellChat [29] and found the IL-2 signaling pathway (Fig. 6e), in line with previous research showing that CD28-mediated Treg differentiation occurs via IL-2 signaling, as IL-2 can block and restore the differentiation between Treg and wild-type T cells [60, 61]. Specific to ‘IL-2’ signaling, CellChat [29] identified the ligand-receptor pair ‘IL-7’ as the most significant signal, contributing to the communication from Treg cells to Naive CD4+ T and CD8+ T cells. This is in agreement with a reported experimental finding [62]. Overall, CellChat [29] analysis of cell-cell communication supports these findings and establishes the importance of the IL-2 signaling pathway.
In addition, we conducted a Gene Set Enrichment Analysis (GSEA) to validate our findings about Treg cell differentiation. This analysis reaffirmed the presence of the ‘IL-2’ signaling pathway (Additional File 1: Fig. S14c), providing another layer of confirmation that our approach can integrate biological insights from multiple modalities of the CITE-seq data to delineate biological processes that cannot be discerned through single-omics analysis. Finally, we utilized Monocle3 [35] to construct the cell trajectory over pseudotime (Additional File 1: Fig. S14b). By selecting Treg cells as the base for our pseudotime trajectory analysis, we were able to infer the developmental pathway of these cells within the dataset. This comprehensive approach allowed us to visualize and understand the progression and differentiation patterns of Treg cells, offering clues to their roles in the immune system. The experimental results indicate that scMDCF analysis of the BMNC CITE-seq dataset effectively recognizes biological information from joint modalities to reflect subtype differentiation signals. In summary, the scMDCF algorithm efficiently integrates data from multiple modalities and provides a detailed view of the intricate cellular interactions and pathways involved in immune regulation.
scMDCF can identify multimodal cellular interactions and trace B cell differentiation across SARS-CoV-2 vaccination
We utilized the scMDCF algorithm to analyze a CITE-seq dataset collected post-BNT162b2 mRNA vaccination, aiming to investigate its efficacy in analyzing subtypes within large-scale multi-omics datasets. The dataset came from six healthy donors with no history of SARS-CoV-2 infection and encompassed circulating PBMC samples collected after a series of BNT162b2 mRNA vaccinations. The samples were acquired at four specific time points: immediately before vaccination (Day 0), shortly after the first vaccination (Days 2 and 10), and a week following the booster vaccination (Day 28) [63]. The dataset contained 113,897 single cells, with simultaneously measured transcriptomes and surface proteins. We first processed and clustered the post-vaccination CITE-seq dataset using scMDCF, then annotated the obtained cell clusters by their respective cell types, as initially characterized in the dataset. This process is represented visually in Fig. 7a, by a UMAP visualization of the latent embeddings with cell type annotations.
Fig. 7.
Single-cell multimodal analysis of B cell differentiation and cellular interactions in response to SARS-CoV-2 vaccination. The CITE-seq dataset was collected from six healthy donors post-BNT162b2 mRNA vaccination and profiled at four distinct time points: Day 0 (immediately before vaccination), Day 2 and Day 10 (post-primary vaccination), and Day 28 (seven days post-boost vaccination). a UMAP visualization of 113,897 single cells profiled with CITE-seq and clustered by scMDCF. b UMAP visualization illustrating the B cell cluster with subpopulation assignments based on both gene expression and surface expression of canonical B cell markers. c Percentages of B cell clusters across the four timepoints. d The heatmap plot for pseudotime analysis of marker genes in B cell subsets. e The dotplots depict the expression levels of genes associated with the “immune response-regulating signaling pathway” from GO enrichment analysis, across the four subtypes of B cells at each of the four timepoints. f Expression levels of the genes across differentiation. Each dot shows the expression in an individual pseudotime-ordered cell, while the line represents the smoothed fit of expression levels. g Number of ligand-receptor interactions of B cells after the BNT162b2 mRNA vaccinations. h Ligand-receptor pairs and their corresponding communication scores within ‘MIF’ signaling across different B cell type interactions. i Relative importance of each B cell type based on the four computed network centrality measures of the MIF network
To explore the responses of B cell subpopulations to the BNT162b2 mRNA vaccine, we extracted the latent embeddings using scMDCF. These embeddings were then subjected to Louvain clustering, which serves as the self-search strategy for the optimal number of clusters and revealed four distinct B cell subpopulations. For each subpopulation, we identified and analyzed the top 10 differentially-expressed genes and ADTs, shown in the heatmap plots of Additional File 1: Fig. S15. These subpopulations were annotated as memory IgM B cells (marker ADTs: CD27, IgM) [64], plasmablast cells (marker ADT: HLA-DR, marker genes: IGHD, IGHM) [65], Naive B cells (marker ADTs: IgD [66], IgM [67]), and Class Switched Memory B cells (marker ADTs: CD27+, IgD-, IgM-, CD86) [68, 69]. Figure 7b displays the annotated UMAP visualization for these distinct subpopulations of B cells.
To investigate how the B cell subpopulations varied over the vaccination course, we analyzed their distribution at the four sampling time points, as depicted in Fig. 7c. We observed an increase in the percentage of plasmablast cells and memory IgM+ B cells following vaccination, consistent with previous studies [70]. We then employed Monocle3 [35] for pseudotime trajectory analysis to explore the differentiation pathways of B cells in response to the BNT162b2 mRNA vaccine, which helped us trace the developmental pathways of the cells, with a particular focus on cell differentiation processes. After choosing naive B cells as the starting point for this analysis, we identified a unique differentiation pathway: the transcriptional trajectory of naive B cells evolving into plasmablasts, then into memory IgM B cells, and finally into class-switched memory B cells, illustrated in Additional File 1: Fig. S16. We conducted differential gene expression analysis along the B cell trajectory, categorizing the genes into four distinct clusters as shown in Fig. 7d. Notably, the genes within clusters 1 and 2 were highly expressed at early pseudotime, while those in cluster 3 were highly expressed at later pseudotime.
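The per-time-point composition analysis summarized in Fig. 7c reduces to counting cells of each subtype within each time point and normalizing by that time point's total. A minimal sketch, assuming one time-point label and one subtype label per cell; the function name `subtype_percentages` and the toy labels are hypothetical and do not reproduce the reported proportions.

```python
from collections import Counter, defaultdict


def subtype_percentages(timepoints, subtypes):
    """Percentage of each subtype among the cells of each time point."""
    counts = defaultdict(Counter)
    for day, subtype in zip(timepoints, subtypes):
        counts[day][subtype] += 1
    percentages = {}
    for day, per_day in counts.items():
        total = sum(per_day.values())
        percentages[day] = {s: 100.0 * c / total for s, c in per_day.items()}
    return percentages


# Toy per-cell labels (placeholders, not the study's data).
toy_timepoints = ["Day0", "Day0", "Day28", "Day28", "Day28", "Day28"]
toy_subtypes = [
    "naive", "naive",
    "naive", "plasmablast", "plasmablast", "memory_IgM",
]

toy_percentages = subtype_percentages(toy_timepoints, toy_subtypes)
```

Normalizing within each time point, rather than across the whole dataset, keeps the comparison fair when different numbers of cells were captured on different days.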
To understand the cellular immunity mechanisms and the regulation of mRNA vaccine-responsive genes and cell surface proteins post-BNT162b2 mRNA vaccination, we performed Gene Ontology (GO) enrichment analysis across the B cell subpopulations. This analysis focused on gene-associated GO terms, and revealed significant enrichment of terms related to “immune response-regulating signaling pathway”, as shown in Additional File 1: Fig. S18a. This indicates the ability of B cells to secrete antibodies and mediate the humoral immune response, evidenced by visualization of genes of this pathway at the four time points in Fig. 7e. We focused on the antibody-related genes, such as IGHM, IGHD, and IGHG2, and plotted the pseudotime expression trajectories of these genes (Fig. 7f and Additional File 1: Fig. S17). Comparing these trajectories with the actual expression patterns in the pathways shown in Fig. 7e, we found that the trends in the expression of these genes in corresponding cell types are consistent. Notably, the memory IgM cells and class-switched memory B cells had a lower proportion of IGHM and a higher proportion of IGHG compared to the other B cell subgroups, aligning with previous research findings [71].
Furthermore, we performed KEGG enrichment analysis on the top 100 most highly expressed genes for each B cell subpopulation, giving 400 genes in total. Additional File 1: Fig. S18b reveals significant enrichment in the “Coronavirus disease - COVID-19” pathway at the Human Diseases level and in pathways like “Intestinal immune network for IgA production” and “B cell receptor signaling pathway” at the Organismal Systems level. These findings corroborate that B cells are capable of producing IgA, which is joined by a J-chain after BNT162b2 mRNA vaccination and has only a short lifespan post-stimulation [72]. This information on the antibody version further validates our previous findings related to the IGHA gene, emphasizing its role in the post-BNT162b2 mRNA vaccination process. In addition, we examined B cell receptor (BCR) expression at the ADT level, focusing particularly on class-switched memory B cells. In Additional File 1: Fig. S19, we observed upregulated markers such as CD11c and CD95, which are associated with class switching recombination and antigen-experienced B cells, emphasizing their importance in B cell differentiation and immune response [71].
Finally, we utilized CellChat [29] to infer cell-cell communication among the B cell subgroups. Initially, we analyzed gene expression profiles of known receptor-ligand pairs to deduce interactions between various pairs of cells (Fig. 7g). This analysis revealed that class-switched memory B cells predominantly express specific ligands, which are recognized by memory IgM B cells and plasmablast cells, and to a lesser extent by naive B cells. We then identified the MIF signaling pathway as the most prominent. This pathway is vital for the survival and maturation of memory B cells post-COVID-19 vaccination, as indicated by previous research [73]. We illustrate the activation of the MIF signaling pathway within these memory B cell subgroups (Additional File 1: Fig. S20), highlighting its significance in antibody maintenance post-vaccination. Specific to ‘MIF’ signaling, we observed dominant cellular communication between class-switched memory B cells and plasmablast cells, where MIF signaling was mainly mediated by CXCR4, CD44, and CD74 (Fig. 7h-i). This observation is consistent with prior research showing that MIF-enhanced B cell migration is regulated by CXCR4 and CD74 [74]. In summary, our comprehensive analysis of B cell subpopulations post-BNT162b2 mRNA vaccination revealed dynamic changes and complex interactions within the immune system. Our findings provide deeper insight into how vaccines affect human immune responses and offer valuable scientific evidence for optimizing future vaccine development and disease treatment strategies.
scMDCF identifies biologically significant minority cell populations, revealing putative novel markers and inferred regulatory mechanisms in Alzheimer’s disease
To further validate that scMDCF can provide biological insights, we applied it to a dataset comprising 105,332 cells from dorsolateral prefrontal cortex (DLPFC) tissues of donors affected by Alzheimer’s disease (AD) and of unaffected donors [16], featuring matched single-nucleus RNA sequencing (snRNA-seq) and single-nucleus ATAC sequencing (snATAC-seq). scMDCF identified eight distinct cellular clusters (Fig. 8a): Astrocytes (Ast, 6%), Excitatory neurons (Exc, 25.5%), Inhibitory neurons (Inh, 15.3%), Microglia (Mic, 3%), Oligodendrocytes (Oli, 33%), Oligodendrocyte precursor cells (OPCs, 12.7%), Pericytes (Per, 4.4%), and Endothelial cells (End, 0.24%). The cell populations were annotated based on the top 10 differentially expressed genes per cluster, as illustrated in Fig. 8b (left). Differentially expressed genes (DEGs) within each cell population were identified using the Wilcoxon rank-sum test implemented in the Seurat package [9], with p-values adjusted by the Benjamini-Hochberg procedure at a significance level of 0.05. scMDCF demonstrates a notable ability to detect computational minority cell clusters, such as the 2,981 Microglia cells and 253 Endothelial cells among the 105,332 total cells in the dataset. This performance highlights scMDCF’s strength in resolving low-abundance populations in large-scale, imbalanced multi-omics data. Most scMulti-omics methods perform better when all clusters are of similar size than when cluster sizes vary widely. The imbalanced distribution of the AD dataset can cause minority cell populations to be absorbed into larger clusters, compromising the accurate resolution of those smaller populations. For comparison, we applied Seurat [9] across clustering resolutions from 0.1 to 0.7 and visualized the resulting UMAP plots, along with heatmaps of marker genes and peaks (Additional File 1: Fig. S21 to Fig. S27). However, even at higher resolutions, many of the resulting clusters lacked distinct differentially expressed markers, making manual annotation difficult and biologically ambiguous. This underscores the advantage of scMDCF in preserving cell type-specific biological signals, which facilitates the analysis of snMulti-omics data.
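As a rough illustration of the statistics quoted above, the following Python sketch recomputes the minority-population fractions from the cell counts given in the text and applies a Benjamini-Hochberg adjustment to a handful of p-values. The p-values are hypothetical placeholders rather than values from the study, and the function is a generic implementation of the procedure, not the Seurat code used by the authors.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Cell counts quoted in the text.
total_cells = 105_332
minority_counts = {"Microglia": 2_981, "Endothelial": 253}
fractions = {name: n / total_cells for name, n in minority_counts.items()}

# Hypothetical raw p-values for five genes (illustrative only).
raw_p = [0.001, 0.008, 0.039, 0.041, 0.20]
adj_p = benjamini_hochberg(raw_p)
significant = [p <= 0.05 for p in adj_p]

for name, frac in fractions.items():
    print(f"{name}: {100 * frac:.2f}% of cells")
print(adj_p, significant)
```

Note that two of the hypothetical genes have raw p-values below 0.05 but are no longer significant after adjustment, which is the reason the adjusted values are used for marker selection.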
Fig. 8.
The cellular diversity results of scMDCF on the DLPFC dataset from Alzheimer’s disease and unaffected donors. a UMAP visualization of the scMDCF clustering performance on the snMulti-omics DLPFC dataset, with colors representing distinct cluster identities (labeled on the bottom). The proportion of Alzheimer’s disease cells within each cluster is displayed in the top right, while the proportion of each cell cluster is shown in the bottom right. b The heatmap illustrates the expression levels of the top 10 highly expressed genes in each cell cluster, along with the corresponding Gene Ontology (GO) enrichment results. c Heatmap plot (left) showing the 1,457 linked marker peaks across all cell clusters, along with a corresponding heatmap of enriched motifs from the peaks (right), with color indicating the column Z score of normalized accessibility. d The circos plot shows the GO enrichment pathways of the ELF1 regulon genes, along with the expression levels of these regulon genes in Microglia Unaffected cells and Microglia AD cells. The bar plot on the right side of the circos plot represents the GO pathways enriched by these genes, with bar height indicating the number of enriched genes. The heatmap on the left side shows the expression levels of these genes in Microglia cells from Alzheimer’s disease (AD) patients and healthy controls. e The density plot visualizes the distribution of ELF1 motif density across various cell types. f The density plot visualizes the distribution differences of regulon genes DOCK4 and ELMO1 between healthy individuals and Alzheimer’s disease patients in Microglia cells. g The line plots display the scaled expression trajectories of regulon genes across pseudotime bins in Microglia cells, highlighting differences between healthy individuals and Alzheimer’s disease patients. h Heatmap displaying the expression of significantly expressed genes in healthy Microglia cells over pseudotime. The color represents z-score–scaled expression levels
To confirm the validity of the identified computational minority cell populations, further investigation is needed to determine whether these findings represent true biological signals or potential false positives. We first performed gene ontology (GO) enrichment analyses on the top 10 differentially expressed genes identified in each cell population to elucidate the functions of these different cell populations (Fig. 8b right). We observed a significant enrichment of small GTPase-mediated signal transduction pathways and GTPase activity regulation pathways in Microglia. Existing literature suggests that small GTPases regulate a variety of functions in Microglia, such as phagocytosis and vesicular transport, and play a key pathological role in Alzheimer’s disease, where disruption of GTPase signalling leads directly to neurodegeneration [75]. In addition, we identified the TRAIL signalling pathway that starts in Endothelial cells and acts on Microglia through the TNFRSF10A gene (Additional File 1: Fig. S28). Previous studies have demonstrated that TRAIL promotes the proliferation and migration of endothelial cells and modulates both the innate and adaptive immune responses in the pathogenesis of various immunological disorders, particularly in AD-related neuroinflammation [76]. These findings indicate that the computational minority cell types identified by scMDCF are not false positives or random occurrences, but are potentially biologically meaningful and play a significant role in key biological processes within the dataset.
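GO over-representation of a gene list is commonly scored with a one-sided hypergeometric test. The sketch below assumes that formulation; the source does not state which test was used for Fig. 8b, and the background size, term size, and overlap are hypothetical numbers chosen only for illustration.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n genes are drawn from N, of which K carry the GO term."""
    upper = min(K, n)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


# Hypothetical setting: 20,000 background genes, 300 annotated to a
# GTPase-related term, and 3 of the top 10 marker genes carrying the term.
p_enrich = hypergeom_sf(k=3, N=20_000, K=300, n=10)
print(f"enrichment p-value: {p_enrich:.2e}")
```

With only 10 genes per cluster, a small overlap can already give a low p-value, so in practice such results are usually interpreted together with multiple-testing correction across all tested terms.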
Building on these foundational analyses, we further explored the regulatory patterns at the multiome level to identify cell type-specific and AD-specific active cis-regulatory elements (CREs) and target genes. Initially, we employed ArchR [30] to map regulatory peaks to their target genes by identifying peak-gene correlations. Subsequently, we performed transcription factor motif enrichment analysis based on the chromatin accessibility of these linked peaks. This analysis led us to infer that the activity of these TFs could regulate the expression of genes involved in AD pathogenesis, highlighting their potential role in disease-specific regulatory mechanisms. Despite the intricate nature of disease-associated regulatory networks, their fundamental structure can be broken down into modular components, such as peak-TF-gene trios [77]. We focused on pinpointing AD-specific regulatory trios using multiomics approaches that leverage scMDCF’s high-resolution single-cell clustering. This integrative analysis enables the precise mapping of transcriptional regulation within the context of AD, providing insights into the cell type-specific mechanisms that contribute to disease pathology. To identify candidate cis-regulatory elements whose accessibility is associated with Alzheimer’s disease, we first used the snMultiome dataset to perform peak-to-gene linkage analysis across the cell types. In total, this analysis yielded 468,896 peak-to-gene linkages (Additional File 1: Fig. S29). From these linked peaks, we further filtered for cell type-specific peaks, as illustrated in Fig. 8c (left). We then performed motif enrichment on these cell type-specific peaks (Fig. 8c right). Notably, our analysis revealed a significant and unique enrichment of the ELF1 motif in Microglia cells (Fig. 8c right and e), suggesting that ELF1 may serve as a novel marker motif specific to this cell type. 
To our knowledge, this is the first identification of ELF1 as a transcription factor marker associated with Alzheimer’s disease in Microglia.
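The idea behind peak-to-gene linkage can be sketched as a correlation between a peak’s accessibility and a gene’s expression across groups of cells. The Python example below is a minimal sketch under that assumption and is not ArchR’s implementation; the function names, the toy profiles, and the correlation cutoff `min_r` are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / sqrt(var_x * var_y)


def link_peaks_to_genes(peak_acc, gene_expr, candidates, min_r=0.45):
    """Keep candidate (peak, gene) pairs whose correlation reaches min_r."""
    links = []
    for peak, gene in candidates:
        r = pearson(peak_acc[peak], gene_expr[gene])
        if r >= min_r:
            links.append((peak, gene, r))
    return links


# Toy profiles over four cell groups (illustrative values only).
peak_acc = {"peak_1": [0, 1, 2, 3], "peak_2": [3, 1, 2, 0]}
gene_expr = {"GENE_A": [0, 2, 4, 6]}
candidates = [("peak_1", "GENE_A"), ("peak_2", "GENE_A")]

links = link_peaks_to_genes(peak_acc, gene_expr, candidates)
print(links)
```

In a real analysis the candidate pairs would typically be restricted to peaks within a genomic window around each gene, and the correlations would be accompanied by a significance estimate.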
To validate our hypothesis and further elucidate the regulatory role of ELF1 in Microglia related to Alzheimer’s disease, we examined both the disease regulatory mechanisms and individual regulon genes. Initially, we constructed a gene regulatory network (GRN) for ELF1, leveraging multimodal single-cell genomic measurements to model gene expression through TF–peak interactions [78]. This GRN included a set of 54 genes that are either positively or negatively regulated by ELF1. Strikingly, the molecular function GTPase regulator activity (GO:0030695) was enriched in the ELF1 regulon gene module (Fig. 8d), consistent with the GTPase-related pathways obtained from GO enrichment analysis of the top 10 highly expressed genes in Microglia (Fig. 8b). This indicates that ELF1 may regulate gene expression to activate GTPase protein activity, influencing lipid transport and Aβ metabolism, thus preventing disrupted GTPase signaling in microglia from directly causing neurodegeneration. To test this hypothesis, we focused on ELF1 regulon genes associated with GTPase regulator activity (AKAP13, ARHGAP24, DOCK10, DOCK4, DOCK5, ELMO1, PLEKHG1, RGS20), which are implicated in various brain disorders, including Alzheimer’s disease. We visualized the density plots of these genes in healthy and AD Microglia cells, respectively (Fig. 8f and Additional File 1: Fig. S30). We observed that the expression of these regulon genes is significantly higher in healthy Microglia cells than in those from Alzheimer’s disease patients. This suggests a potential Alzheimer’s disease-associated expression difference in these ELF1 regulon genes, which are involved in GTPase regulator activity. To further elucidate the dynamic expression of regulon genes during Microglial state transitions and validate ELF1’s regulatory role in Alzheimer’s disease, we constructed a pseudotime trajectory for Microglia. This revealed an upregulation of these genes over pseudotime in healthy Microglia (Fig. 8h), surpassing levels seen in Alzheimer’s Microglia (Fig. 8g). Strikingly, we observed marked expression changes near pseudotime point 3, potentially marking a critical cellular transition. In summary, these integrative snMultiome analyses demonstrate that scMDCF effectively clusters computational minority cell populations from large-scale datasets that are often overlooked by other scMulti-omics methods. Leveraging the high-resolution clustering capabilities of scMDCF, we identified ELF1 as a putative candidate biomarker acting as a transcriptional repressor in Microglia associated with Alzheimer’s disease. This discovery highlights ELF1’s potential role in modulating microglial function and its broader implications for understanding the pathophysiology of Alzheimer’s disease.
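The pseudotime comparison described above can be sketched as averaging a gene’s expression within pseudotime bins and then scaling the binned trajectory. The Python example below is a minimal sketch with made-up values; the bin count, the z-score scaling, and the function names are assumptions for illustration and do not reproduce the authors’ trajectory tool.

```python
from math import sqrt


def bin_means(pseudotime, expression, n_bins):
    """Mean expression per equal-width pseudotime bin (None if a bin is empty)."""
    lo, hi = min(pseudotime), max(pseudotime)
    if hi == lo:
        raise ValueError("pseudotime must span a non-zero range")
    width = (hi - lo) / n_bins
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for t, value in zip(pseudotime, expression):
        b = min(int((t - lo) / width), n_bins - 1)  # clamp the maximum into the last bin
        sums[b] += value
        counts[b] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


def zscore(values):
    """Z-score the non-empty bins, leaving empty bins as None."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    sd = sqrt(sum((v - mean) ** 2 for v in observed) / len(observed))
    if sd == 0:
        return [None if v is None else 0.0 for v in values]
    return [None if v is None else (v - mean) / sd for v in values]


# Toy data: the same pseudotime grid is reused so both groups share bin edges.
pseudotime = [0, 1, 2, 3, 4, 5]
healthy_expr = [1, 1, 2, 2, 6, 6]
ad_expr = [1, 1, 1, 2, 2, 3]

healthy_traj = bin_means(pseudotime, healthy_expr, n_bins=3)
ad_traj = bin_means(pseudotime, ad_expr, n_bins=3)
healthy_scaled = zscore(healthy_traj)
print(healthy_traj, ad_traj, healthy_scaled)
```

When the two groups have different pseudotime values, the bin edges should be computed once from the pooled cells so that the trajectories remain comparable.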
Discussion
Single-cell multi-omics technologies, which simultaneously measure multiple cellular components such as DNA, RNA, and proteins within individual cells, are crucial for uncovering the complex mechanisms underlying cellular processes and the inherent heterogeneity of biological systems. In our study, we introduced a single-cell multi-omics deep learning model, scMDCF, which handles the complexity and noise inherent in multi-omics data and advances the comprehensive analysis of such datasets. Its multi-omics informed encoders and cross-modality contrastive learning module efficiently extract and integrate both consistent and inconsistent biological features across different omics types. The enhanced features are then concatenated and fed into the cross-modality feature fusion module, which further compresses them and mitigates the batch effect between omics to obtain a general latent embedding representation for downstream analysis. In addition, we developed a self-optimizing embedded clustering algorithm that enhances the association between similar cells and prevents clustering centroids from collapsing in the latent space.
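To make the idea of self-optimizing embedded clustering more concrete, the Python sketch below follows the deep embedded clustering formulation of Xie et al. [21]: a Student’s t soft assignment of cells to centroids, a sharpened target distribution, and a Kullback-Leibler objective between the two. Whether scMDCF uses exactly this form is an assumption here, and the embeddings and centroids are toy values.

```python
from math import log


def soft_assign(embeddings, centroids, alpha=1.0):
    """Student's t similarity between each embedding and each centroid, row-normalized."""
    q = []
    for point in embeddings:
        weights = []
        for centroid in centroids:
            d2 = sum((a - b) ** 2 for a, b in zip(point, centroid))
            weights.append((1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(weights)
        q.append([w / total for w in weights])
    return q


def target_distribution(q):
    """Sharpen assignments: square q and divide by the soft cluster size."""
    n_clusters = len(q[0])
    cluster_size = [sum(row[j] for row in q) for j in range(n_clusters)]
    p = []
    for row in q:
        weights = [row[j] ** 2 / cluster_size[j] for j in range(n_clusters)]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p


def kl_divergence(p, q):
    """KL(P || Q) summed over cells and clusters."""
    return sum(
        pij * log(pij / qij)
        for p_row, q_row in zip(p, q)
        for pij, qij in zip(p_row, q_row)
        if pij > 0
    )


# Toy latent embeddings for two cells and two centroids (illustrative only).
embeddings = [[0.0, 0.0], [4.0, 0.0]]
centroids = [[0.0, 0.0], [4.0, 0.0]]

q = soft_assign(embeddings, centroids)
p = target_distribution(q)
loss = kl_divergence(p, q)
print(q, p, loss)
```

Dividing by the soft cluster size in the target distribution keeps large clusters from dominating the objective, which is one plausible way such a scheme could help with imbalanced populations.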
To validate the effectiveness of scMDCF on paired scRNA-seq and scATAC-seq data, such as SNARE-seq, we compared it against ten scMulti-omics clustering methods, along with two scRNA-seq clustering methods and one scATAC-seq clustering method. The experimental results indicate that scMDCF achieved superior clustering performance, outperforming both multi-modal and single-modal clustering methods. In particular, it identified subgroups that are typically indiscernible using other methods, effectively differentiating them while maintaining relational similarities within the subgroups. For CITE-seq data, scMDCF surpassed six other scMulti-omics clustering methods and two scRNA-seq clustering approaches, demonstrating not only higher clustering accuracy but also effective mitigation of batch effects due to tissue type variations and differing experimental conditions. Moreover, scMDCF proved to be computationally efficient, exhibiting the shortest runtime among the compared algorithms.
Beyond clustering capabilities, scMDCF has uncovered numerous cell type-specific biological insights. In our SNARE-seq analysis, we leveraged scMDCF’s clustering results and gene expression data, which enabled the identification of cell type-specific peak-gene associations and led to the discovery of potential ATAC biomarkers specific to individual cell types. Furthermore, by performing enrichment analysis on marker peaks, we identified specific cis-regulatory elements, such as motifs. Additionally, by focusing on highly expressed genes within each cell type, we identified key signaling pathways. Notably, these pathways and cis-regulatory elements exhibited similar regulatory functions across their specific cell types, underscoring the strength of our methodology in uncovering complex cellular dynamics. For CITE-seq, scMDCF’s clustering results enabled the identification of potential biomarker proteins by analyzing marker gene expression and their associated proteins in each cell type, showing that integrated clustering of RNA and ADT provides more precise cell-type identification than single-modality clustering. In addition, we conducted pathway analysis to uncover intercellular immunoregulatory mechanisms based on gene and protein associations. We applied scMDCF to the CITE-seq dataset from individuals vaccinated with the BNT162b2 mRNA vaccine to identify specific vaccine-induced B cell subpopulations and to uncover the dynamic interactions and regulatory mechanisms of the human immune response associated with these B cell subpopulations.
In addition to its robust ability to integrate single-cell multi-omics data, scMDCF can discover computational minority cell populations and putative candidate biomarkers that were not detected by alternative computational tools and were not reported in the original study. In the Alzheimer’s disease case, only scMDCF detected computational minority cell populations such as Microglia and Endothelial cells. Validation analyses were conducted to confirm that the detected computational minority cell populations are supported by biological signals rather than being random false positives. Notably, in the computational minority Microglia population, scMDCF identified a putative candidate transcription factor, ELF1, which may play a pivotal regulatory role in Alzheimer’s pathology. Through this validation, we found that ELF1 is involved in the regulation of GTPase activity, potentially modulating cellular processes relevant to neuroinflammation and disease progression. This finding highlights a previously unrecognized regulatory pathway, in which ELF1 could act as a key mediator in the suppression of pathological mechanisms underlying Alzheimer’s disease, providing novel insights into potential therapeutic targets.
In summary, scMDCF, with its cross-modality contrastive learning module embedded in an autoencoder framework, offers a new approach for analyzing single-cell multi-omics data and is especially suitable for CITE-seq and SNARE-seq data. As a deep learning application in single-cell biology, scMDCF offers fresh perspectives on cell type-specific biomarkers, cis-regulatory elements, and cellular immune responses, integrating multiple modalities at single-cell resolution. Future enhancements of scMDCF are planned to accommodate emerging multi-omics data types, thereby expanding its analytical capabilities.
Conclusions
scMDCF is a novel contrastive learning-based framework for scMulti-omics integration analysis that harmonizes cross-modality representations while preserving biological heterogeneity. We demonstrate that scMDCF improves integrative analysis across multiple datasets in terms of clustering performance, batch-effect mitigation, and visualization quality. All data, code, and reproducible workflows are publicly available to facilitate use and verification. We anticipate that scMDCF will be particularly informative for the analysis of disease-related scMulti-omics datasets, with special relevance to Alzheimer’s disease and immune system studies.
Supplementary information
Additional file 1: Fig S1. UMAP visualizations of the cell embeddings for scMDCF and competing methods on the ‘pbmc 10k’ dataset. Fig S2. UMAP visualizations of the cell embeddings for scMDCF and competing methods on the ‘human brain 3k’ dataset. Fig S3. UMAP visualizations of the cell embeddings for scMDCF and competing methods on the ‘pbmc 10x public’ dataset. Fig S4. Visualization of the number of cell clusters for scMDC on the ‘pbmc 10k’ dataset. Fig S5. Benchmarking of scMDCF in terms of cell clustering, batch effect correction, and visualization quality on large-scale scMulti-omics dataset. Fig S6. Comparative analysis of computational efficiency. Fig S7. UMAP visualizations of the cell embeddings for scMDCF and competing methods on the ‘GSE128639’ dataset. Fig S8. Benchmarking the efficacy in eliminating batch effects and performance in runtime between scMDCF and competing methods. Fig S9. Ablation study about normalization study. Fig S10. UMAP visualizations of the cell embeddings for scMDCF and competing methods on the ‘human pbmc 3k’ dataset. Fig S11. Heatmap of curated marker genes and peaks determining cell clustering and annotation. Fig S12. Gene Set Enrichment Analysis. Fig S13. Heatmap of curated marker proteins and genes determining cell clustering and annotation of scMDCF on the ‘GSE128639’ dataset. Fig S14. Treg cell differentiation analysis by scMDCF. Fig S15. Heatmap of curated marker proteins and genes determining cell clustering and annotation. Fig S16. B cell differentiation analysis during SARS-CoV-2 vaccination. Fig S17. Expression levels of the ‘IGHG2’ gene across the B differentiation. Fig S18. Pathway enrichment analysis of B cell differentiation during SARS-CoV-2 vaccination. Fig S19. Analysis of key proteins differential expression in B cell differentiation during SARS-CoV-2 vaccination. Fig S20. Analysis of cell communication in B cell differentiation during SARS-CoV-2 vaccination. Fig S21 to Fig S27. The UMAP visualization of Seurat with various clustering resolutions. Fig S28. The cell-cell communication signaling pathway TRAIL. Fig S29. Heatmap showing chromatin accessibility and gene expression for 468,896 peak-to-gene linkages. Fig S30. The density plot visualizes the distribution differences of regulon genes.
Additional file 2: Table S1. Details of each loss weight case. Table S2. The information of datasets used in the scMDCF.
Acknowledgements
Not applicable.
Abbreviations
- scMulti-omics: Single-cell multi-omics
- CRE: Cis-regulatory element
- CITE-seq: Cellular indexing of transcriptomes and epitopes by sequencing
- ADTs: Antibody-Derived Tags
- MNN: Mutual nearest neighbor
- KL: Kullback-Leibler
- SNARE-seq: Droplet-based single-nucleus chromatin accessibility and mRNA expression sequencing
- DLPFC: Dorsolateral prefrontal cortex
- MSE: Mean squared error
- NMI: Normalized Mutual Information
- ARI: Adjusted Rand Index
- AMI: Adjusted Mutual Information
- FMI: Fowlkes-Mallows Index
- ASW: Average Silhouette Width
- DB: Davies-Bouldin Index
- CH: Calinski-Harabasz Index
- iLISI: Inverse Simpson’s Index of Integration
- cLISI: Conditional Local Inverse Simpson’s Index
- PPI: Protein-Protein Interaction
- GO: Gene Ontology
- KEGG: Kyoto Encyclopedia of Genes and Genomes
- GSEA: Gene Set Enrichment Analysis
- GSVA: Gene Set Variation Analysis
- UMAP: Uniform Manifold Approximation and Projection
- TF: Transcription factor
- AD: Alzheimer’s disease
- GRN: Gene regulatory network
Authors’ contributions
X.L. conceived the study. Y.C. drafted the manuscript. Y.C., X.C., Y.S., and Y.F. collected and analyzed the single-cell multiomics data. Y.C. and X.L. implemented the algorithm of scMDCF. Y.C. and X.C. developed the package of scMDCF. Y.C., F.W., Y.Y., and K.W. provided important advice on cell-type annotation and cellular communication analysis. All authors read and approved the final manuscript.
Funding
The work described in this paper was substantially supported by the National Natural Science Foundation of China under Grant No. 62472195 (X.L.).
Data availability
We collected nine paired scRNA-seq and ATAC-seq datasets and four CITE-seq datasets. These datasets were gathered from a variety of platforms and cover multiple species, providing a diverse data collection for comprehensive analysis. The ‘pbmc 10k’ dataset [79] and associated cell type labels are available from the GitHub repository at https://github.com/gao-lab/GLUE/tree/master/data. The processed datasets ‘multi bmmc’ [80] and ‘hpap’ [81] are provided by the scMulti-omics benchmark [36] and can be downloaded through the GitHub repository at https://github.com/myylee/benchmark_sc_multiomic_integration/tree/main. The ‘pbmc 10x public’ data [82] and their cell type labels are available from the benchmarking article of scRNA-scATAC-seq analysis [36], with the GitHub repository located at https://github.com/myylee/benchmark_sc_multiomic_integration. The ‘spleen lymph’ dataset [13] is accessible through the accession number GSE150599. The ‘inhouse’ dataset [7] is accessible through the accession number GSE148665. In addition, the following paired scRNA-seq and ATAC-seq datasets were obtained from the 10X Genomics website: ‘human brain 3k’ [83] (https://www.10xgenomics.com/resources/datasets/frozen-human-healthy-brain-tissue-3-k-1-standard-2-0-0), ‘human pbmc 3k’ [84] (https://www.10xgenomics.com/resources/datasets/pbmc-from-a-healthy-donor-granulocytes-removed-through-cell-sorting-3-k-1-standard-2-0-0), ‘mouse brain 5k’ [85] (https://www.10xgenomics.com/resources/datasets/fresh-embryonic-e-18-mouse-brain-5-k-1-standard-2-0-0), ‘nextgem chromium’ [86] (https://www.10xgenomics.com/datasets/10-k-human-pbm-cs-multiome-v-1-0-chromium-controller-1-standard-2-0-0). The ‘GSE128639’ [87] and ‘GSM4949911’ [88] datasets used in our study are accessible in the GEO database under the accession codes GSE128639 and GSM4949911, respectively. 
The SARS-CoV-2 vaccination CITE-seq dataset [63] and corresponding cell type labels used in our study are available in the GEO database under the accession code GSE171964. The Alzheimer’s disease dataset [16] used in our study is available in the GEO database under the accession code GSE214979. Additional File 2: Table S2 provides a comprehensive overview of each dataset, including the type of omics data, the number of cells, species, and sample information. All the datasets can be downloaded from https://zenodo.org/records/11019640 [89]. scMDCF is released as a Python package at: https://pypi.org/project/scMDCF/. The source code [90], together with a usage tutorial, is available on GitHub: https://github.com/DARKpmm/scMDCF. The analysis code of Seurat on the GSE214979 dataset is available at: https://github.com/DARKpmm/scMDCF/blob/main/tutorial/Fig8_Seurat.R. The automatic search tool for the hyperparameters is available at: https://github.com/DARKpmm/scMDCF/blob/main/scMDCF/scMDCF_param.py.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Heumos L, Schaar AC, Lance C, Litinetskaya A, Drost F, Zappia L, et al. Best practices for single-cell analysis across modalities. Nat Rev Genet. 2023;24(8):550–72.
- 2. Rautenstrauch P, Vlot AHC, Saran S, Ohler U. Intricacies of single-cell multi-omics data integration. Trends Genet. 2022;38(2):128–39.
- 3. Baysoy A, Bai Z, Satija R, Fan R. The technological landscape and applications of single-cell multi-omics. Nat Rev Mol Cell Biol. 2023;24(10):695–713.
- 4. Badia-i Mompel P, Wessels L, Müller-Dott S, Trimbour R, Ramirez Flores RO, Argelaguet R, et al. Gene regulatory network inference in the era of single-cell multi-omics. Nat Rev Genet. 2023;24(11):739–54.
- 5. Wang Q, Chen R, Cheng F, Wei Q, Ji Y, Yang H, et al. A Bayesian framework that integrates multi-omics data and gene networks predicts risk genes from schizophrenia GWAS data. Nat Neurosci. 2019;22(5):691–9.
- 6. Kim HJ, Lin Y, Geddes TA, Yang JYH, Yang P. CiteFuse enables multi-modal analysis of CITE-seq data. Bioinformatics. 2020;36(14):4137–43.
- 7. Wang X, Sun Z, Zhang Y, Xu Z, Xin H, Huang H, et al. BREM-SC: a Bayesian random effects mixture model for joint clustering single cell multi-omics data. Nucleic Acids Res. 2020;48(11):5814–24.
- 8. Argelaguet R, Arnol D, Bredikhin D, Deloro Y, Velten B, Marioni JC, et al. MOFA+: a statistical framework for comprehensive integration of multi-modal single-cell data. Genome Biol. 2020;21(1):1–17.
- 9. Hao Y, Hao S, Andersen-Nissen E, Mauck WM, Zheng S, Butler A, et al. Integrated analysis of multimodal single-cell data. Cell. 2021;184(13):3573–87.
- 10. Cao ZJ, Gao G. Multi-omics single-cell data integration and regulatory inference with graph-linked embedding. Nat Biotechnol. 2022;40(10):1458–66.
- 11. Yang X, Mann KK, Wu H, Ding J. Sccross: a deep generative model for unifying single-cell multi-omics with seamless integration, cross-modal generation, and in silico exploration. Genome Biol. 2024;25(1):198.
- 12. Gong B, Zhou Y, Purdom E. Cobolt: integrative analysis of multimodal single-cell sequencing data. Genome Biol. 2021;22(1):1–21.
- 13. Gayoso A, Steier Z, Lopez R, Regier J, Nazor KL, Streets A, et al. Joint probabilistic modeling of single-cell multi-omic data with totalVI. Nat Methods. 2021;18(3):272–82.
- 14. Lin X, Tian T, Wei Z, Hakonarson H. Clustering of single-cell multi-omics data with a multimodal deep learning method. Nat Commun. 2022;13(1):7705.
- 15. Ashuach T, Gabitto MI, Koodli RV, Saldi GA, Jordan MI, Yosef N. MultiVI: deep generative model for the integration of multimodal data. Nat Methods. 2023;20(8):1222–31.
- 16. Anderson AG, Rogers BB, Loupe JM, Rodriguez-Nunez I, Roberts SC, White LM, et al. Single nucleus multiomics identifies ZEB1 and MAFB as candidate regulators of Alzheimer’s disease-specific cis-regulatory elements. Cell Genomics. 2023;3(3): 100263.
- 17. Wolf FA, Angerer P, Theis FJ. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 2018;19:1–5.
- 18. Danese A, Richter ML, Chaichoompu K, Fischer DS, Theis FJ, Colomé-Tatché M. Episcanpy: integrated single-cell epigenomic analysis. Nat Commun. 2021;12(1):5228.
- 19. He K, Fan H, Wu Y, Xie S, Girshick R. Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020. pp. 9729–38.
- 20. Li X, Wang K, Lyu Y, Pan H, Zhang J, Stambolian D, et al. Deep learning enables accurate clustering with batch effect removal in single-cell RNA-seq analysis. Nat Commun. 2020;11(1):2338.
- 21. Xie J, Girshick R, Farhadi A. Unsupervised deep embedding for clustering analysis. In: International conference on machine learning. PMLR; 2016. pp. 478–87.
- 22. Kartha VK, Duarte FM, Hu Y, Ma S, Chew JG, Lareau CA, et al. Functional inference of gene regulation using single-cell multi-omics. Cell Genom. 2022;2(9):100166. 10.1016/j.xgen.2022.100166.
- 23. Zhang Z, Sun H, Mariappan R, Chen X, Chen X, Jain MS, et al. Scmomat jointly performs single cell mosaic integration and multi-modal bio-marker detection. Nat Commun. 2023;14(1):384.
- 24. Tian T, Wan J, Song Q, Wei Z. Clustering single-cell RNA-seq data with a model-based deep learning approach. Nat Mach Intell. 2019;1(4):191–8.
- 25. Chen L, Wang W, Zhai Y, Deng M. Deep soft K-means clustering with self-training for single-cell RNA sequence data. NAR Genom Bioinform. 2020;2(2):lqaa039.
- 26. Xiong L, Xu K, Tian K, Shao Y, Tang L, Gao G, et al. SCALE method for single-cell ATAC-seq analysis via latent feature extraction. Nat Commun. 2019;10(1):4576.
- 27. Korsunsky I, Millard N, Fan J, Slowikowski K, Zhang F, Wei K, et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat Methods. 2019;16(12):1289–96.
- 28. Tran HTN, Ang KS, Chevrier M, Zhang X, Lee NYS, Goh M, et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol. 2020;21:1–32.
- 29. Jin S, Guerrero-Juarez CF, Zhang L, Chang I, Ramos R, Kuan CH, et al. Inference and analysis of cell-cell communication using CellChat. Nat Commun. 2021;12(1):1088.
- 30. Granja JM, Corces MR, Pierce SE, Bagdatli ST, Choudhry H, Chang HY, et al. ArchR is a scalable software package for integrative single-cell chromatin accessibility analysis. Nat Genet. 2021;53(3):403–11.
- 31. Fornes O, Castro-Mondragon JA, Khan A, Van der Lee R, Zhang X, Richmond PA, et al. JASPAR 2020: update of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 2020;48(D1):D87–92.
- 32. Stuart T, Srivastava A, Madad S, Lareau CA, Satija R. Single-cell chromatin state analysis with Signac. Nat Methods. 2021;18(11):1333–41.
- 33. Yu G, Wang LG, Han Y, He QY. ClusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology. 2012;16(5):284–7.
- 34. Hänzelmann S, Castelo R, Guinney J. GSVA: gene set variation analysis for microarray and RNA-seq data. BMC Bioinformatics. 2013;14:1–15.
- 35. Cao J, Spielmann M, Qiu X, Huang X, Ibrahim DM, Hill AJ, et al. The single-cell transcriptional landscape of mammalian organogenesis. Nature. 2019;566(7745):496–502.
- 36. Lee MY, Kaestner KH, Li M. Benchmarking algorithms for joint integration of unpaired and paired single-cell RNA-seq and ATAC-seq data. Genome Biol. 2023;24(1):244.
- 37. Stephenson E, Reynolds G, Botting RA, Calero-Nieto FJ, Morgan MD, Tuong ZK, et al. Single-cell multi-omics analysis of the immune response in COVID-19. Nat Med. 2021;27(5):904–16.
- 38. Luo Z, Xu C, Zhang Z, Jin W. A topology-preserving dimensionality reduction method for single-cell RNA-seq data using graph autoencoder. Sci Rep. 2021;11(1):20028.
- 39. Lin Y, Gou Y, Liu Z, Li B, Lv J, Peng X. Completer: Incomplete multi-view clustering via contrastive prediction. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021. pp. 11174–83.
- 40. Chen S, Lake BB, Zhang K. High-throughput sequencing of the transcriptome and chromatin accessibility in the same cell. Nat Biotechnol. 2019;37(12):1452–7.
- 41. Schmidt M, Hellwig B, Hammad S, Othman A, Lohr M, Chen Z, et al. A comprehensive analysis of human gene expression profiles identifies stromal immunoglobulin C as a compatible prognostic marker in human solid tumors. Clin Cancer Res. 2012;18(9):2695–703.
- 42. Zhang X, Lan Y, Xu J, Quan F, Zhao E, Deng C, et al. CellMarker: a manually curated resource of cell markers in human and mouse. Nucleic Acids Res. 2019;47(D1):D721–8.
- 43. Katz SG, Edappallath S, Xu ML. IRF8 is a reliable monoblast marker for acute monocytic leukemias. Am J Surg Pathol. 2021;45(10):1391–8.
- 44. Cheng G, Zhong M, Kawaguchi R, Kassai M, Al-Ubaidi M, Deng J, et al. Identification of PLXDC1 and PLXDC2 as the transmembrane receptors for the multifunctional factor PEDF. Elife. 2014;3:e05401.
- 45. Yasaka K, Yamazaki T, Sato H, Shirai T, Cho M, Ishida K, et al. Phospholipase D4 as a signature of toll-like receptor 7 or 9 signaling is expressed on blastic T-bet+ B cells in systemic lupus erythematosus. Arthritis Res Ther. 2023;25(1):200.
- 46. Ghanem MH, Shih AJ, Khalili H, Werth EG, Chakrabarty JK, Brown LM, et al. Proteomic and single-cell transcriptomic dissection of human plasmacytoid dendritic cell response to influenza virus. Front Immunol. 2022;13:814627.
- 47. Capone M, Bryant JM, Sutkowski N, Haque A. Fc receptor-like proteins in pathophysiology of B-cell disorder. J Clin Cell Immunol. 2016;7(3):427.
- 48. Li C, Zhu B, Son YM, Wang Z, Jiang L, Xiang M, et al. The transcription factor Bhlhe40 programs mitochondrial regulation of resident CD8+ T cell fitness and functionality. Immunity. 2019;51(3):491–507.
- 49.Wang Y, Chu J, Yi P, Dong W, Saultz J, Wang Y, et al. SMAD4 promotes TGF-β-independent NK cell homeostasis and maturation and antitumor immunity. J Clin Investig. 2018;128(11):5123–36.
- 50.Bueno V, Pestana JOM. The role of CD8+ T cells during allograft rejection. Braz J Med Biol Res. 2002;35:1247–58.
- 51.Zafar A, Ng HP, Kim GD, Chan ER, Mahabeleshwar GH. BHLHE40 promotes macrophage pro-inflammatory gene expression and functions. FASEB J Off Publ Fed Am Soc Exp Biol. 2021;35(10):e21940.
- 52.Mukherjee N, Ji N, Tan X, Chen CL, Noel OD, Rodriguez-Padron M, et al. KLRF1, a novel marker of CD56bright NK cells, predicts improved survival for patients with locally advanced bladder cancer. Cancer Med. 2023;12(7):8970–80.
- 53.Stoeckius M, Hafemeister C, Stephenson W, Houck-Loomis B, Chattopadhyay PK, Swerdlow H, et al. Large-scale simultaneous measurement of epitopes and transcriptomes in single cells. Nat Methods. 2017;14(9):865.
- 54.Derakhshani A, Safarpour H, Abdoli Shadbad M, Hemmat N, Leone P, Asadzadeh Z, et al. The role of hemoglobin subunit delta in the immunopathy of multiple sclerosis: Mitochondria matters. Front Immunol. 2021;12:709173.
- 55.Wang K, Wei G, Liu D. CD19: a biomarker for B cell development, lymphoma diagnosis and therapy. Exp Hematol Oncol. 2012;1(1):1–7.
- 56.Kutzner H, Kerl H, Pfaltz MC, Kempf W. CD123-positive plasmacytoid dendritic cells in primary cutaneous marginal zone B-cell lymphoma: diagnostic and pathogenetic implications. Am J Surg Pathol. 2009;33(9):1307–13.
- 57.Chan JK, Sin V, Wong K, Ng C, Tsang WY, Chan C, et al. Nonnasal lymphoma expressing the natural killer cell marker CD56: a clinicopathologic study of 49 cases of an uncommon aggressive neoplasm. Blood. 1997;89(12):4501–13.
- 58.Verfaillie A, Imrichova H, Janky R, Aerts S. iRegulon and i-cisTarget: reconstructing regulatory networks using motif and track enrichment. Curr Protoc Bioinforma. 2015;52(1):2–16.
- 59.Knochelmann HM, Dwyer CJ, Bailey SR, Amaya SM, Elston DM, Mazza-McCrann JM, et al. When worlds collide: Th17 and Treg cells in cancer and autoimmunity. Cell Mol Immunol. 2018;15(5):458–69.
- 60.Guo F, Iclozan C, Suh WK, Anasetti C, Yu XZ. CD28 controls differentiation of regulatory T cells from naive CD4 T cells. J Immunol. 2008;181(4):2285–91.
- 61.McNally A, Hill GR, Sparwasser T, Thomas R, Steptoe RJ. CD4+ CD25+ regulatory T cells control CD8+ T-cell effector differentiation by modulating IL-2 homeostasis. Proc Natl Acad Sci U S A. 2011;108(18):7529–34.
- 62.Kimura MY, Pobezinsky LA, Guinter TI, Thomas J, Adams A, Park JH, et al. IL-7 signaling must be intermittent, not continuous, during CD8+ T cell homeostasis to promote cell survival instead of cell death. Nat Immunol. 2013;14(2):143–51.
- 63.Zhang B, Upadhyay R, Hao Y, et al. Multimodal single-cell datasets characterize antigen-specific CD8+ T cells across SARS-CoV-2 vaccination and infection. Nat Immunol. 2023;24(10):1725–34.
- 64.Tangye SG, Good KL. Human IgM+ CD27+ B cells: memory B cells or “memory” B cells? J Immunol. 2007;179(1):13–9.
- 65.Lin W, Zhang P, Chen H, Chen Y, Yang H, Zheng W, et al. Circulating plasmablasts/plasma cells: a potential biomarker for IgG4-related disease. Arthritis Res Ther. 2017;19(1):1–14.
- 66.Agematsu K. Invited review: Memory B cells and CD27. Histol Histopathol. 2000;15:573–6.
- 67.Kotagiri P, Mescia F, Rae WM, Bergamaschi L, Tuong ZK, Turner L, et al. B cell receptor repertoire kinetics after SARS-CoV-2 infection and vaccination. Cell Rep. 2022;38(7):110393. 10.1016/j.celrep.2022.110393.
- 68.McHeyzer-Williams LJ, Milpied PJ, Okitsu SL, McHeyzer-Williams MG. Class-switched memory B cells remodel BCRs within secondary germinal centers. Nat Immunol. 2015;16(3):296–305.
- 69.Warnatz K, Denz A, Drager R, Braun M, Groth C, Wolff-Vorbeck G, et al. Severe deficiency of switched memory B cells (CD27+ IgM- IgD-) in subgroups of patients with common variable immunodeficiency: a new approach to classify a heterogeneous disease. Blood. 2002;99(5):1544–51.
- 70.Zhang H, Liu Y, Liu D, Zeng Q, Li L, Zhou Q, et al. Time of day influences immune response to an inactivated vaccine against SARS-CoV-2. Cell Res. 2021;31(11):1215–7.
- 71.He B, Liu S, Wang Y, Xu M, Cai W, Liu J, et al. Rapid isolation and immune profiling of SARS-CoV-2 specific memory B cell in convalescent COVID-19 patients via LIBRA-seq. Signal Transduct Target Ther. 2021;6(1):195.
- 72.Stolovich-Rain M, Kumari S, Friedman A, Kirillov S, Socol Y, Billan M, et al. Intramuscular mRNA BNT162b2 vaccine against SARS-CoV-2 induces neutralizing salivary IgA. Front Immunol. 2023;13:933347.
- 73.Takemoto Y, Tanimine N, Yoshinaka H, Tanaka Y, Takafuta T, Sugiyama A, et al. Multi-phasic gene profiling using candidate gene approach predict the capacity of specific antibody production and maintenance following COVID-19 vaccination in Japanese population. Front Immunol. 2023. 10.3389/fimmu.2023.1217206.
- 74.Klasen C, Ohl K, Sternkopf M, Shachar I, Schmitz C, Heussen N, et al. MIF promotes B cell chemotaxis through the receptors CXCR4 and CD74 and ZAP-70 signaling. J Immunol. 2014;192(11):5273–84.
- 75.Socodato R, Portugal CC, Canedo T, Rodrigues A, Almeida TO, Henriques JF, et al. Microglia dysfunction caused by the loss of RhoA disrupts neuronal physiology and leads to neurodegeneration. Cell Rep. 2020. 10.1016/j.celrep.2020.107796.
- 76.Burgaletto C, Munafò A, Di Benedetto G, De Francisci C, Caraci F, Di Mauro R, et al. The immune system on the TRAIL of Alzheimer’s disease. J Neuroinflammation. 2020;17:1–11.
- 77.Jiang Y, Harigaya Y, Zhang Z, Zhang H, Zang C, Zhang NR. Nonparametric single-cell multiomic characterization of trio relationships between transcription factors, target genes, and cis-regulatory regions. Cell Syst. 2022;13(9):737–51.
- 78.Fleck JS, Jansen SMJ, Wollny D, Zenk F, Seimiya M, Jain A, et al. Inferring and perturbing cell fate regulomes in human brain organoids. Nature. 2023;621(7978):365–72.
- 79.The pbmc 10k dataset. 10X Multiome Protocol. 2020. (Accessed 22 Nov 2025) https://support.10xgenomics.com/single-cell-multiome-atac-gex/datasets/1.0.0/pbmc_granulocyte_sorted_10k
- 80.Luecken MD, Burkhardt DB, Cannoodt R, Lance C, Agrawal A, Aliee H, et al. A sandbox for prediction and integration of DNA, RNA, and proteins in single cells. In: 35th conference on neural information processing systems (NeurIPS 2021) track on datasets and benchmarks. 2021.
- 81.Shapira SN, Naji A, Atkinson MA, Powers AC, Kaestner KH. Understanding islet dysfunction in type 2 diabetes through multidimensional pancreatic phenotyping: the Human Pancreas Analysis Program. Cell Metab. 2022;34(12):1906–13.
- 82.The pbmc 10x public dataset. 10X Multiome Protocol. 2021. (Accessed 2 Dec 2025) https://www.10xgenomics.com/datasets/pbmc-from-a-healthy-donor-granulocytes-removed-through-cell-sorting-10-k-1-standard-2-0-0
- 83.The human brain 3k dataset. 10X Multiome Protocol. 2021. (Accessed 2 Dec 2025) https://www.10xgenomics.com/datasets/frozen-human-healthy-brain-tissue-3-k-1-standard-2-0-0
- 84.The human pbmc 3k dataset. 10X Multiome Protocol. 2021. (Accessed 2 Dec 2025) https://www.10xgenomics.com/datasets/pbmc-from-a-healthy-donor-granulocytes-removed-through-cell-sorting-3-k-1-standard-2-0-0
- 85.The mouse brain 5k dataset. 10X Multiome Protocol. 2021. (Accessed 2 Dec 2025) https://www.10xgenomics.com/datasets/fresh-embryonic-e-18-mouse-brain-5-k-1-standard-2-0-0
- 86.The nextgem chromium dataset. 10X Multiome Protocol. 2021. (Accessed 2 Dec 2025) https://www.10xgenomics.com/datasets/10-k-human-pbm-cs-multiome-v-1-0-chromium-controller-1-standard-2-0-0
- 87.Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck WM, et al. Comprehensive integration of single-cell data. Cell. 2019;177(7):1888–902.
- 88.Swanson E, Lord C, Reading J, Heubeck AT, Genge PC, Thomson Z, et al. Simultaneous trimodal single-cell measurement of transcripts, epitopes, and chromatin accessibility using TEA-seq. Elife. 2021;10:e63632.
- 89.Cheng Y, Su Y, Fan Y, Yang Y, Chen X, Wang F, Wong K, Li X. Aligned Cross-modal Integration and Regulatory Heterogeneity Characterization of Single-Cell Multiomic Data with Deep Contrastive Learning. Zenodo. 2025. (Accessed 2 Dec 2025) https://zenodo.org/records/11019640
- 90.Cheng Y, Su Y, Fan Y, Yang Y, Chen X, Wang F, Wong K, Li X. Aligned Cross-modal Integration and Regulatory Heterogeneity Characterization of Single-Cell Multiomic Data with Deep Contrastive Learning. GitHub. 2025. (Accessed 2 Dec 2025) https://github.com/DARKpmm/scMDCF
Associated Data
Supplementary Materials
Additional file 1: Fig S1. UMAP visualizations of the cell embeddings for scMDCF and competing methods on the ‘pbmc 10k’ dataset. Fig S2. UMAP visualizations of the cell embeddings for scMDCF and competing methods on the ‘human brain 3k’ dataset. Fig S3. UMAP visualizations of the cell embeddings for scMDCF and competing methods on the ‘pbmc 10x public’ dataset. Fig S4. Visualization of the number of cell clusters for scMDC on the ‘pbmc 10k’ dataset. Fig S5. Benchmarking of scMDCF in terms of cell clustering, batch effect correction, and visualization quality on a large-scale scMulti-omics dataset. Fig S6. Comparative analysis of computational efficiency. Fig S7. UMAP visualizations of the cell embeddings for scMDCF and competing methods on the ‘GSE128639’ dataset. Fig S8. Benchmarking the efficacy in eliminating batch effects and the runtime performance of scMDCF and competing methods. Fig S9. Ablation study on normalization. Fig S10. UMAP visualizations of the cell embeddings for scMDCF and competing methods on the ‘human pbmc 3k’ dataset. Fig S11. Heatmap of curated marker genes and peaks determining cell clustering and annotation. Fig S12. Gene Set Enrichment Analysis. Fig S13. Heatmap of curated marker proteins and genes determining cell clustering and annotation of scMDCF on the ‘GSE128639’ dataset. Fig S14. Treg cell differentiation analysis by scMDCF. Fig S15. Heatmap of curated marker proteins and genes determining cell clustering and annotation. Fig S16. B cell differentiation analysis during SARS-CoV-2 vaccination. Fig S17. Expression levels of the ‘IGHG2’ gene across B cell differentiation. Fig S18. Pathway enrichment analysis of B cell differentiation during SARS-CoV-2 vaccination. Fig S19. Analysis of differential expression of key proteins in B cell differentiation during SARS-CoV-2 vaccination. Fig S20. Analysis of cell communication in B cell differentiation during SARS-CoV-2 vaccination. Fig S21 to Fig S27. UMAP visualizations of Seurat with various clustering resolutions. Fig S28. The cell-cell communication signaling pathway TRAIL. Fig S29. Heatmap showing chromatin accessibility and gene expression for 468,896 peak-to-gene linkages. Fig S30. Density plot visualizing the distribution differences of regulon genes.
Additional file 2: Table S1. Details of each loss weight case. Table S2. Information on the datasets used in scMDCF.
Data Availability Statement
We collected nine paired scRNA-seq and ATAC-seq datasets and four CITE-seq datasets. These datasets were gathered from a variety of platforms and cover multiple species, providing a diverse collection for comprehensive analysis. The ‘pbmc 10k’ dataset [79] and associated cell type labels are available from the GitHub repository at https://github.com/gao-lab/GLUE/tree/master/data. The processed datasets ‘multi bmmc’ [80] and ‘hpap’ [81] are provided by the scMulti-omics benchmark [36] and can be downloaded from the GitHub repository at https://github.com/myylee/benchmark_sc_multiomic_integration/tree/main. The ‘pbmc 10x public’ data [82] and their cell type labels are available from the same benchmarking study [36], with the GitHub repository located at https://github.com/myylee/benchmark_sc_multiomic_integration. The ‘spleen lymph’ dataset [13] is accessible through the accession number GSE150599, and the ‘inhouse’ dataset [7] through the accession number GSE148665. The following paired scRNA-seq and ATAC-seq datasets were obtained from the 10X Genomics website: ‘human brain 3k’ [83] (https://www.10xgenomics.com/resources/datasets/frozen-human-healthy-brain-tissue-3-k-1-standard-2-0-0), ‘human pbmc 3k’ [84] (https://www.10xgenomics.com/resources/datasets/pbmc-from-a-healthy-donor-granulocytes-removed-through-cell-sorting-3-k-1-standard-2-0-0), ‘mouse brain 5k’ [85] (https://www.10xgenomics.com/resources/datasets/fresh-embryonic-e-18-mouse-brain-5-k-1-standard-2-0-0), and ‘nextgem chromium’ [86] (https://www.10xgenomics.com/datasets/10-k-human-pbm-cs-multiome-v-1-0-chromium-controller-1-standard-2-0-0). The ‘GSE128639’ [87] and ‘GSM4949911’ [88] datasets used in our study are accessible in the GEO database under the accession codes GSE128639 and GSM4949911, respectively.
The SARS-CoV-2 vaccination CITE-seq dataset [63] and corresponding cell type labels used in our study are available in the GEO database under the accession code GSE171964. The Alzheimer’s disease dataset [16] used in our study is available in the GEO database under the accession code GSE214979. Additional file 2: Table S2 provides an overview of each dataset, including the type of omics data, the number of cells, species, and sample information. All the datasets can be downloaded from https://zenodo.org/records/11019640 [89]. scMDCF is released as a Python package at https://pypi.org/project/scMDCF/. The source code [90] and a usage tutorial are available on GitHub at https://github.com/DARKpmm/scMDCF. The analysis code of Seurat on the GSE214979 dataset is available at https://github.com/DARKpmm/scMDCF/blob/main/tutorial/Fig8_Seurat.R. The automatic search tool for the hyperparameters is available at https://github.com/DARKpmm/scMDCF/blob/main/scMDCF/scMDCF_param.py.
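For readers who want to locate the GEO-hosted datasets programmatically, the accession codes listed above can be collected into a small lookup table. The following is a minimal sketch, not part of the scMDCF package: the names `GEO_ACCESSIONS` and `geo_url` are hypothetical and introduced only for illustration, and the URL pattern is the standard GEO accession query page.

```python
# Illustrative helper (not part of scMDCF): map the dataset labels used in
# this statement to their GEO accession codes and landing pages.

# Standard GEO accession query page; the accession code is appended.
GEO_QUERY = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc={}"

# Accession codes copied from the Data Availability Statement.
# The dictionary keys are hypothetical labels chosen for this sketch.
GEO_ACCESSIONS = {
    "spleen lymph": "GSE150599",
    "inhouse": "GSE148665",
    "GSE128639": "GSE128639",
    "GSM4949911": "GSM4949911",
    "SARS-CoV-2 vaccination CITE-seq": "GSE171964",
    "Alzheimer's disease": "GSE214979",
}


def geo_url(dataset: str) -> str:
    """Return the GEO landing page for a dataset label listed above."""
    try:
        accession = GEO_ACCESSIONS[dataset]
    except KeyError:
        raise KeyError(f"no GEO accession recorded for {dataset!r}") from None
    return GEO_QUERY.format(accession)


if __name__ == "__main__":
    # Print one landing-page URL per GEO-hosted dataset.
    for name in GEO_ACCESSIONS:
        print(f"{name}: {geo_url(name)}")
```

The datasets hosted on GitHub, the 10X Genomics website, and Zenodo are not covered by this helper and should be retrieved from the URLs given in the statement.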