Abstract
The rapid advancement of single-cell omics technologies, such as single-cell RNA sequencing and single-cell assay for transposase-accessible chromatin with high-throughput sequencing, has transformed our understanding of cellular heterogeneity and regulatory mechanisms. However, integrating these data types remains challenging due to distributional discrepancies and distinct feature spaces. To address this, we present a novel single-cell Contrastive INtegration framework (sCIN) that integrates different omics modalities into a shared low-dimensional latent space. sCIN uses modality-specific encoders and contrastive learning to generate latent representations for each modality, aligning cells across modalities and removing technology-specific biases. The framework was designed to rigorously prevent data leakage between training and testing, and was extensively evaluated on three real-world paired datasets, namely SHARE-seq (simultaneous high-throughput ATAC and RNA expression with sequencing), 10x PBMC (10k version), and CITE-seq (cellular indexing of transcriptomes and epitopes), as well as one unpaired dataset of gene expression and chromatin accessibility. Paired datasets refer to multi-omics data generated using technologies capable of capturing different omics features from the same cell population, while unpaired datasets are measured from different cell populations of a tissue. Results on paired and unpaired datasets show that sCIN outperforms state-of-the-art models, including scGLUE, scBridge, sciCAN, Con-AAE, Harmony, and MOFA+, across multiple metrics: average silhouette width for clustering quality, and Recall@k, cell type@k, cell type accuracy, and median rank for integration quality. Moreover, sCIN was evaluated on simulated unpaired datasets derived from paired data, demonstrating its ability to leverage available biological information for effective multimodal integration. In summary, sCIN reliably integrates omics modalities while preserving biological meaning in both paired and unpaired settings.
Keywords: single-cell multi-omics, SHARE-seq, CITE-seq, contrastive learning, neural networks, multimodal learning
Introduction
The advent of single-cell omics technologies has transformed the study of biological systems, offering insights into phenotypes at single-cell resolution across different omics layers and cell types. For instance, single-cell RNA sequencing (scRNA-seq) quantifies gene expression by measuring messenger ribonucleic acid (mRNA) abundance in large numbers of individual cells [1]. Moreover, single-cell assay for transposase-accessible chromatin with high throughput sequencing (scATAC-seq) identifies open chromatin regions, which are essential for gene regulation due to the increased activity of transcription factors and other regulatory elements in these regions [2]. Beyond scATAC-seq technologies, mapping chromatin accessibility [3–5] and DNA methylation patterns [6, 7] further illuminate the molecular mechanisms underlying gene regulation.
Gene expression involves several stages, including transcription, post-transcriptional regulation, translation, and post-translational modifications. To better understand these processes, there is increasing demand for sequencing technologies that can measure multiple molecular features within a single cell. These multi-omics approaches offer a more holistic view of cellular function and help mitigate batch effects from separate experiments. For instance, SHARE-seq [8] and SNARE-seq [9] simultaneously profile the transcriptome and chromatin accessibility of a cell population by integrating DNA fragmentation and mRNA reverse transcription. Cellular Indexing of Transcriptomes and Epitopes (CITE-seq) [10] combines antibody-based tagging with reverse transcription to capture both transcriptome and cell surface protein profiles within the same cells. Analyzing each data modality independently can result in fragmented insights and an incomplete understanding of the biological system. Therefore, computational approaches have been proposed to integrate multiple omics data types for more comprehensive analysis and inference. However, this integration task is challenging due to disparities in the distributions and feature spaces of the different modalities.
Computational methods for unimodal omics datasets, particularly scRNA-seq, have been developed to better understand the heterogeneity of cell types in sparse single-cell data. For instance, scLDS2 was developed to identify rare cell types by learning distinctions between cell type distributions through adversarial learning [11]. scGDC is a deep learning-based subspace clustering model that jointly learns denoised cell features, an affinity graph, and subspace projections through adversarial learning to identify cell types from noisy, high-dimensional scRNA-seq data [12]. In addition, scGMAAE models each cell type in scRNA-seq data with a distinct Gaussian distribution within a scalable adversarial Autoencoder, augmented by Bayesian variational inference, to produce interpretable low-dimensional embeddings that cluster precisely by cell type [13].
Most multi-omics computational approaches aim to learn a low-dimensional joint representation of multiple modalities using dimensionality reduction techniques such as principal component analysis (PCA) [14], canonical correlation analysis (CCA) [15–17], or deep neural networks [18–20]. For instance, Seurat [21] applies PCA to transform data linearly and uses mutual nearest neighbors and CCA to align embeddings in a shared latent space. However, this approach is limited in capturing nonlinear relationships between different omics features. Another approach, MOFA [22], is a probabilistic Bayesian framework that employs matrix factorization to decompose input data matrices from different modalities into two components: (i) a modality-specific weight matrix that captures factor loadings for each feature, and (ii) a shared factor matrix representing the factor values for each sample, which is common to all modalities. Harmony [23], on the other hand, uses fuzzy K-nearest neighbor (KNN) clustering to group cells based on cell types while maximizing batch diversity within each cluster, offering a robust method for batch effect correction and modality alignment.
There are also network-based methods for the multimodal integration of transcriptomic and chromatin accessibility data. sLMIC models scRNA-seq and scATAC-seq data by constructing per-omic graphs and using low-rank self-representation to dissect shared and modality-specific features, precisely defining and separating cell types across multi-omics data [24]. Moreover, a network-based integrative clustering algorithm learns and fuses cell–cell similarity graphs from parallel single-cell transcriptomic and epigenomic profiles, then applies joint non-negative matrix factorization to extract shared features and accurately cluster cell types across multi-omics data [25].
Many multimodal deep learning frameworks have also been developed to map different modalities' high-dimensional, nonlinear feature spaces into a unified, smaller subspace. For instance, scGLUE employs a graph variational autoencoder (VAE) to learn biological feature representations [20]. It enhances the cell embeddings learned by modality-specific VAEs using representations learned by the graph VAE. Additionally, adversarial training is applied to remove technology effects, so that an adversarial classifier cannot distinguish cell embeddings based on omics modalities. The universal framework for single-cell multi-omics data integration with graph convolutional networks (GCN-SC) leverages mutual nearest neighbors to connect cells across datasets and construct a mixed graph. It also adjusts query datasets using a graph convolutional network (GCN), followed by non-negative matrix factorization for dimension reduction and visualization [26]. MultiVI is a VAE-based framework that integrates scRNA-seq and scATAC-seq data by learning modality-specific latent representations using deep neural network encoders. It corrects batch effects and aligns these representations within a shared latent space [27]. Con-AAE uses adversarial Autoencoders to learn latent representations of paired single-cell transcriptome and chromatin accessibility data. By applying contrastive learning and a cycle-consistency loss, it effectively integrates different data types and accounts for technology effects. Specifically, the contrastive learning in this model relies on positive and negative mining, in which the furthest same-label and closest different-label cells form the positive and negative pairs, respectively [28].
Here, we introduce sCIN, a contrastive learning-based neural network framework designed to integrate single-cell multi-omics datasets into a shared low-dimensional latent space. sCIN preserves cell type heterogeneity, eliminates technology effects, and applies to both paired and unpaired datasets. In particular, we adapt the CLIP model [29], which was originally designed to learn joint representations of images and text, to single-cell data. sCIN consists of two modality-specific encoders that learn reduced-dimensional representations for each omics type. The model then minimizes the distance between cells of the same type in the latent space, while maximizing the distance between cells of different types through a contrastive loss function. We evaluated sCIN on four real-world paired and unpaired datasets: Ma-2020 (SHARE-seq) [8], 10x Genomics PBMC (10k version) [30], Luecken-2022 (CITE-seq) [31], and Muto-2021 (scRNA-seq and scATAC-seq of human kidney) [32]. We also simulated realistic, leakage-free unpaired datasets derived from paired datasets to assess sCIN's performance. The results from both settings show that sCIN outperforms six state-of-the-art methods and one baseline method using various integration strategies across different evaluation metrics.
Results
Overview of sCIN
sCIN is a contrastive learning-based framework, inspired by CLIP [29], designed for integrating paired and unpaired single-cell multi-omics datasets. It uses two neural network encoders to map each data modality into a shared lower-dimensional space. For paired data, sCIN aligns embeddings by treating measurements from the same cell as positive pairs. For unpaired data, cells of the same type across modalities are considered positive pairs. Additionally, the model separates embeddings of cells with different cell types (negative pairs), ensuring that clustering reflects biological variability rather than technical biases from single-cell omics technologies. Integration quality is evaluated using metrics such as Recall@k, ASW, cell type accuracy, median rank, and cell type@k (Methods). Recall@k and median rank assess the alignment of the same cells across modalities, while cell type accuracy, cell type@k, and ASW evaluate cell type consistency among nearest neighbors and clustering.
Model evaluations on paired datasets
We benchmarked sCIN against six state-of-the-art methods: scGLUE [20], scBridge [33], sciCAN [34], MOFA+ [35], Con-AAE [28], and Harmony [23], as well as a baseline Autoencoder neural network trained solely with a reconstruction loss on three different paired datasets. All evaluations used 10 train-test splits, resulting in ten hold-out test datasets. For metrics computed on separate embeddings, such as Recall@k, median rank, cell type accuracy, and cell type@k, we evaluated the metrics using embeddings from the hold-out datasets in each of the ten replications, resulting in 10 values per metric for each model. We summarized these distributions either by plotting the mean with error bars (for Recall@k and cell type@k) or by using box plots to illustrate the variation across replications for median rank and cell type accuracy.
For the average silhouette width (ASW), we concatenated the embeddings from the hold-out test datasets along the second dimension (i.e. column-wise for matched cells) and the first dimension (i.e. row-wise for unmatched cells) to compute the ASW on the resulting joint embeddings. As with the other metrics, this was done for each of the 10 replications, yielding 10 ASW values per model, which we visualized using box plots.
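To make the two concatenation schemes concrete, the following is a minimal NumPy sketch of how such joint embeddings can be assembled before computing ASW; the array names and random placeholder data are illustrative assumptions, not the released implementation.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Placeholder hold-out embeddings: (n_cells, d) per modality, one label per cell.
rna_emb = np.random.rand(500, 128)
atac_emb = np.random.rand(500, 128)
labels = np.random.randint(0, 5, size=500)

# Matched (paired) cells: column-wise concatenation, so each row is one cell
# described by both modalities -> shape (n_cells, 2 * d).
joint_matched = np.concatenate([rna_emb, atac_emb], axis=1)
asw_matched = silhouette_score(joint_matched, labels)

# Unmatched (unpaired) cells: row-wise stacking in the shared d-dim space,
# with cell-type labels repeated for each modality -> shape (2 * n_cells, d).
joint_unmatched = np.concatenate([rna_emb, atac_emb], axis=0)
asw_unmatched = silhouette_score(joint_unmatched, np.concatenate([labels, labels]))
```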
Evaluation of sCIN on the SHARE-seq dataset
We evaluated sCIN on the SHARE-seq (simultaneous high-throughput ATAC and RNA expression with sequencing) dataset, which profiles 32 231 mouse skin cells spanning 22 cell types. SHARE-seq enables the simultaneous measurement of gene expression and chromatin accessibility in single cells, offering a scalable and cost-effective approach [8]. The dataset contains 21 478 gene expression features and 340 341 chromatin accessibility features. The data were downloaded as h5ad files provided by Cao et al. [20].
We evaluated the models using the Recall@k metric, which measures the proportion of embeddings in the first modality whose corresponding counterparts from the second modality are among their k-nearest neighbors (Methods). On the SHARE-seq dataset, sCIN significantly outperformed other methods and achieved the highest Recall@k values across various k settings, with performance improving as k increased. Con-AAE also showed strong performance compared to other benchmarks, highlighting the potential of neural network-based frameworks for integration tasks while preserving biological information (Fig. 2a).
Figure 2.
(a) Comparison of Recall@k metrics between different models for k = 10, 20, 30, 40, and 50 for the SHARE-seq dataset. (b) Normalized Median Rank values for different models. (c) Cell type accuracy comparison between models. (d) Comparison of ASW score across models. (e) t-SNE visualization of the original data (scRNA-seq and scATAC-seq). (f) t-SNE visualization of the sCIN’s integrated embeddings.
We used the Median Rank metric to assess data integration quality. For each cell, the distance between its embedding in the first modality and all embeddings in the second modality was calculated. The rank of the true pair's distance among these values was obtained. The overall Median Rank was determined as the median of these ranks across all cells, with normalized values reported (Methods). On the SHARE-seq dataset, sCIN exhibited the closest pairwise distances in its embedding space compared to other models (Fig. 2b). Consistent with the Recall@k results, the second and third best-performing models were Con-AAE and MOFA+, respectively.
We evaluated whether sCIN preserves cell type information post-integration, an aspect of biological interpretability in multi-omics integration. Cell type accuracy, assessed by the alignment of cell types among nearest neighbors in the embedding space (Fig. 2c and d), showed that sCIN achieved the highest accuracy, nearly 0.7, outperforming all other models. Clustering quality, measured using ASW (Methods), ranked sCIN as the second-best model, with a normalized ASW exceeding 0.6, trailing Con-AAE by about 0.07. Supplementary table 3 summarizes the average values of metrics for each model.
To demonstrate that sCIN’s embedding space captures more relevant information than the original data, we visualized t-SNE representations of both the original scRNA-seq and scATAC-seq data and sCIN’s joint embeddings. The t-SNE of the joint embeddings (Fig. 2f), colored by cell types, shows better preservation of cell type information. On the SHARE-seq dataset, sCIN effectively preserved cell type information post-integration, while the original data’s t-SNE visualization captures less cell type information, likely due to noise and inherent sparsity in single-cell multi-omics data. The process of plotting the t-SNE representations of the original data (both modalities) and sCIN’s integrated embeddings is the same for all paired datasets.
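For reference, such a t-SNE plot of joint embeddings colored by cell type can be produced in a few lines; this sketch assumes hypothetical `joint_emb` and `cell_types` arrays and is not tied to the exact plotting code used for the figures.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Hypothetical joint embeddings (cells x dims) and integer-coded cell types.
joint_emb = np.random.rand(1000, 128)
cell_types = np.random.randint(0, 22, size=1000)

# Project to 2D and color each cell by its type label.
coords = TSNE(n_components=2, random_state=0).fit_transform(joint_emb)
plt.scatter(coords[:, 0], coords[:, 1], c=cell_types, s=2, cmap="tab20")
plt.xlabel("t-SNE 1")
plt.ylabel("t-SNE 2")
plt.show()
```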
Evaluation of sCIN on the PBMC dataset
We conducted similar evaluations on the PBMC dataset (10k version), which includes paired profiling of gene expression and chromatin accessibility for peripheral blood mononuclear cells. This dataset was generated using the 10x Genomics sequencing platform and is widely used for benchmarking multi-omics integration methods [30]. The dataset comprised 9631 cells and 19 cell types, with 29 095 genes and 107 194 chromatin regions. The datasets were obtained as h5ad files provided by Cao et al. [20].
sCIN consistently achieved the highest Recall@k values across all k settings (Fig. 3a). It demonstrated a near-zero normalized Median Rank, highlighting the closeness of embeddings of the two modalities in the embedding space (Fig. 3b). The patterns for cell type accuracy and ASW were consistent with those observed in the SHARE-seq dataset (Fig. 3c and d). Average values of metrics for each model are available in Supplementary table 4. Moreover, sCIN effectively retained most cell type information for 19 distinct cell types in the integrated embedding space (Fig. 3e and f).
Figure 3.
(a) The comparison of Recall@k metrics between different models for k = 10, 20, 30, 40, and 50 for the 10x PBMC Multiome dataset (gene expression and chromatin accessibility). (b) Normalized Median Rank values for different models. (c) Cell type accuracy comparison between models. (d) Comparison of ASW score across models. (e) t-SNE embeddings of the original dataset (scRNA-seq and scATAC-seq). (f) t-SNE visualization of the sCIN’s integrated embeddings.
Evaluation of sCIN on the CITE-seq dataset
To show sCIN’s ability to handle diverse omics modalities, we evaluated its performance on the CITE-seq dataset [31], which includes gene expression and cell surface protein features. The CITE-seq dataset contains 90 261 cells with matched scRNA-seq and antibody-derived tag (ADT) profiles [10] of bone marrow mononuclear cells from 12 healthy donors across different sites. The data were generated using the 10x 3' Single-Cell Gene Expression kit with Feature Barcoding and the BioLegend TotalSeq B Universal Human Panel v1.0. It includes measurements of 13 953 genes and 134 cell surface proteins. The datasets were downloaded from GEO accession number GSE194122 as h5ad files.
On this dataset, sCIN outperformed all other models in integration quality and preservation of biological information. Specifically, sCIN’s Recall@k values consistently increased as k increased, while other frameworks plateaued and showed no significant improvements (Fig. 4a). A similar trend was observed in the Median Rank metric (Fig. 4b), where sCIN achieved the closest alignment of cell pairs across modalities. Additionally, sCIN achieved the highest performance in cell type accuracy and ASW metrics, outperforming other methods (Fig. 4c and d). Supplementary table 5 summarizes the average values of metrics for each model. The superiority of sCIN in integrating the CITE-seq modalities can be explained by its emphasis on cell type supervision, which guides the model when information is scarce, particularly in the cell surface protein modality. In contrast, models such as scGLUE or MOFA+ do not explicitly use cell type information. Although Con-AAE does use cell type labels in its training, the high complexity of the model and its hard positive and negative pair mining might prevent it from learning enhanced integrated representations.
Figure 4.
(a) The comparison of Recall@k metric between different models for k = 10, 20, 30, 40, and 50 for the CITE-seq dataset (gene expression and cell surface proteins). (b) Normalized Median Rank metric for different models. (c) Cell type accuracies between models that show the percentage of closest embeddings from different modalities having the same cell types. (d) Comparison of ASW score across models shows the clusters’ quality.
Improved cell type clustering with sCIN’s joint embeddings
To compare sCIN’s latent space with the original and PCA-transformed spaces, we computed ASW for each modality of the preprocessed original data, PCA integrated embeddings, and sCIN’s integrated embeddings. PCA dimensionality matched sCIN’s (256 for all datasets). Results show sCIN’s embeddings achieve superior cell type clustering over PCA integration and original data for each modality. This experiment was performed on unseen test datasets and replicated 10 times (Supplementary Fig. 1).
Model evaluations on simulated unpaired datasets
We assessed the performance of sCIN on unpaired datasets, where omics data from different modalities do not originate from the same set of cells. First, we simulated unpaired datasets from the paired datasets: SHARE-seq, PBMC, and CITE-seq. Specifically, a subset of cells was selected, and only one modality (ATAC for SHARE-seq and PBMC, and ADT for CITE-seq) from this subset was retained. For the remaining cells, only the data for the second modality were used, ensuring no overlap between the cells of the two modalities. To examine the impact of unbalanced cell proportions across modalities, we generated datasets with varying proportions (1%, 5%, 10%, 20%, and 50%) and compared the resulting embeddings on different metrics: Recall@k, ASW, and cell type accuracy. We also included the paired datasets as the optimal case and a random baseline, where cell types were randomly permuted and assigned to cells (Fig. 5). All evaluations were performed on the hold-out test datasets and replicated 10 times. A sketch of this unpairing procedure is shown below.
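The following is a minimal sketch of the unpairing simulation, assuming in-memory arrays for the two modalities and labels; the function name is hypothetical, and the authors' exact sampling (e.g. per-cell-type stratification) may differ.

```python
import numpy as np

def simulate_unpaired(rna, atac, labels, prop=0.10, seed=0):
    """Split one paired dataset into disjoint cell sets per modality.

    A random fraction `prop` of cells keeps only the second modality
    (e.g. ATAC or ADT); the remaining cells keep only the first (e.g. RNA),
    so no cell is measured in both modalities.
    """
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    n = rna.shape[0]
    second_idx = rng.choice(n, size=int(prop * n), replace=False)
    first_idx = np.setdiff1d(np.arange(n), second_idx)
    return (rna[first_idx], labels[first_idx]), (atac[second_idx], labels[second_idx])
```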
Figure 5.
Evaluation of sCIN on unpaired datasets. (a) Schematic showing the simulation of unpaired data from paired data. (b) Recall@k performance on the simulated unpaired datasets. (c) Cell type accuracy performance on the simulated unpaired datasets. (d) ASW based on joint embeddings from the simulated unpaired datasets. (e) Comparison of sCIN and the state-of-the-art models’ performances according to cell type@k metric on Muto et al. dataset. (f) Comparison of sCIN and the state-of-the-art models’ performances according to ASW metric on Muto et al. dataset. (g) t-SNE representations of the embeddings from the hold-out simulated unpaired PBMC dataset colored by cell types.
On the PBMC dataset, Recall@k values significantly improved as the proportion of cells for each cell type increased. Although the results remained lower than the paired case, they were significantly higher than the random baseline (Fig. 5b). In the cell type accuracy analysis, as fewer cells were provided for the second modality, the cell type accuracy decreased. The integrated embeddings from the paired dataset achieved the highest accuracy (0.848 on average). Across proportions from 1% to 50%, accuracy showed an increasing trend, consistently outperforming the random baseline (Fig. 5c).
Additionally, we assessed the contribution of each modality to the ASW score of the embeddings, demonstrating the preservation of cell type information (Fig. 5d). To further assess robustness, we calculated Recall@k and cell type accuracy from gene expression embeddings to chromatin accessibility embeddings (Supplementary Fig. 4ab); the results were similar to those previously reported for the chromatin accessibility-to-gene expression direction. We also visualized t-SNE representations of the hold-out PBMC dataset in the embedding space of sCIN trained on 50% unpaired data (Fig. 5g), which clearly shows the clustering of integrated embeddings by cell type. Similar results were observed for the datasets simulated from SHARE-seq. For instance, based on the Recall@k metric, sCIN performed best on the fully paired dataset (0.172 on average), while performance improved from 0.026 to 0.067 as the percentage of cells in the second modality increased from 1% to 50% (Supplementary Fig. 2a and Supplementary Fig. 4bc). A similar trend was observed in experiments on the CITE-seq dataset (Supplementary Fig. 3 and Supplementary Fig. 4ef).
Model evaluations on Muto-2021 dataset
We also evaluated sCIN’s performance on real-world, unmatched single-nucleus RNA sequencing (snRNA-seq) and single-nucleus ATAC sequencing (snATAC-seq) data from [32]. The datasets were downloaded as h5ad files provided by Cao et al. [20]. The gene expression dataset consists of 19 985 cells and 27 146 genes, while the chromatin accessibility dataset has 24 205 cells and 99 019 open chromatin regions (i.e. peaks). These datasets were obtained from non-tumor kidney cortex samples from patients undergoing partial or radical nephrectomy for renal masses. The snRNA-seq and snATAC-seq libraries were prepared using 10x Genomics Chromium Single Cell 5′ v2 and 10x Genomics Chromium Single Cell ATAC v1, respectively.
Since there is no one-to-one pairing between cells in unpaired datasets, we used metrics that emphasize cell type information retrieval in the embedding space. Specifically, we defined cell type@k to measure cell type consistency in a cell’s neighborhood from a different modality (Methods). According to this metric, sCIN outperformed the methods that handle unpaired datasets, namely scGLUE, sciCAN, and MOFA+, achieving 0.938 on average across different values of k (k = 10, 20, 30, 40, 50) and 10 replications (Fig. 5e). sciCAN and scGLUE were the next best-performing models, achieving 0.79 and 0.564 on average, respectively. The same superiority of sCIN was observed in the ASW evaluation, with an average of 0.761 compared to other models (scGLUE: 0.521, sciCAN: 0.489, MOFA+: 0.21) (Fig. 5f).
Methods
sCIN framework
sCIN is a contrastive learning-based framework designed to learn a shared low-dimensional latent embedding across modalities of single-cell data. sCIN builds on the CLIP architecture developed by OpenAI [29], which aligns image and text embeddings in a shared latent space, and extends this approach to integrate single-cell modalities such as scRNA-seq, scATAC-seq, and single-cell ADT profiles. It supports both paired and unpaired datasets: in the paired case, the input consists of simultaneous measurements of the same single cells across two modalities, whereas in the unpaired case, the two modalities originate from different single cells.
We used two neural network encoders to map the distinct feature spaces of each modality into a shared latent space. The main objective is to align the latent embeddings of each cell across modalities, irrespective of the technology used. sCIN can also handle unpaired datasets, where each cell is measured in only one modality. In this case, sCIN aims to bring embeddings of cells from the same cell type closer together while separating those from different cell types.
Let $x_i^{(1)} \in \mathbb{R}^p$ and $x_i^{(2)} \in \mathbb{R}^q$ for $i = 1, \ldots, N$ denote the feature values for the first and second modalities, where $N$ is the number of cells. Here, $p$ and $q$ represent the number of features for the first and second modalities, respectively. We used two encoders, $f^{(1)}$ and $f^{(2)}$, for the first and second modalities to generate latent embeddings $z_i^{(1)}$ and $z_i^{(2)}$ for both modalities, each with the same dimension size $d$. The embeddings are normalized such that $\lVert z_i^{(1)} \rVert_2 = 1$ and $\lVert z_i^{(2)} \rVert_2 = 1$.

In a contrastive learning framework, positive and negative pairs are defined to guide the algorithm in learning embeddings where positive pairs are closer and negative pairs are further apart. In the paired case, the pair $(z_i^{(1)}, z_i^{(2)})$ is considered a positive pair, while $(z_i^{(1)}, z_j^{(2)})$, where cells $i$ and $j$ are from different types, is considered a negative pair. Note that pairs $(z_i^{(1)}, z_j^{(2)})$, where $i$ and $j$ are distinct cells of the same type, are neither positive nor negative pairs. In the unpaired case, pairs $(z_i^{(1)}, z_j^{(2)})$, where cells $i$ and $j$ share the same type, are treated as positive, while all others are treated as negative pairs. In effect, contrastive learning encourages similarity between positive pairs (matched cells in the paired setting, and cells of the same type across modalities in the unpaired setting) while penalizing similarity between negative pairs, that is, cells that do not meet the criteria for positive pairs.
sCIN constructs $M \times M$ similarity matrices for mini-batches, consisting of similarities between the embeddings of the first and second modalities, where $M$ represents the number of cell types (as displayed in Fig. 1). The diagonal entries represent positive pairs, while off-diagonal entries represent negative pairs. sCIN learns a shared embedding space by jointly training the encoders to maximize the similarity of diagonal entries while minimizing that of off-diagonal entries. The loss function for a given anchor cell $i$ is given as follows:
Figure 1.
sCIN workflow. The framework uses modality-specific encoders to learn latent representations for each modality, using a contrastive training approach to align these representations. In the unpaired setting, cells of the same type are treated as positive pairs and are positioned closer together in the latent space. In contrast, cells of different types are treated as negative pairs and are separated further apart. In the paired setting, true cell pairs across modalities are defined as positive pairs. Performance is assessed using hold-out integrated embeddings, evaluated based on integration quality metrics (Recall@k) and the preservation of biological variability (ASW and cell type accuracy). Some parts of this figure were created using BioRender.com.
$$\mathcal{L}_i = -\log \frac{\exp\left(z_i^{(1)} \cdot z_i^{(2)} / \tau\right)}{\sum_{j=1}^{M} \exp\left(z_i^{(1)} \cdot z_j^{(2)} / \tau\right)}$$

where $\tau$ is a temperature parameter, following the CLIP formulation [29]. The total loss is summed across all anchor cells in the mini-batch. In the unpaired case, the loss function is similar; however, the similarity matrix is constructed using a set of $M$ cells from different types in the first modality and a corresponding set of $M$ cells from the second modality, whereas in the paired case, the same set of cells is used for both modalities.
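To illustrate this objective, below is a minimal PyTorch sketch of a CLIP-style symmetric contrastive loss over a mini-batch of aligned embeddings; the function name, temperature value, and averaging over both anchor directions are illustrative assumptions rather than the exact released implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.07):
    """CLIP-style symmetric InfoNCE loss over a mini-batch of aligned cells.

    z1, z2: (M, d) L2-normalized embeddings from the two encoders; row i of z1
    and row i of z2 form a positive pair (matched cells in the paired setting,
    same-type cells in the unpaired setting).
    """
    logits = (z1 @ z2.t()) / temperature              # (M, M) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    # Diagonal entries are positives; off-diagonal entries serve as negatives.
    loss_12 = F.cross_entropy(logits, targets)        # anchors in modality 1
    loss_21 = F.cross_entropy(logits.t(), targets)    # anchors in modality 2
    return 0.5 * (loss_12 + loss_21)
```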
sCIN was trained for 150 epochs on all datasets, with a custom early-stopping strategy implemented to enhance training efficiency, using a patience of 10 and a minimum-delta threshold. The encoders share the same architecture, consisting of three layers. The first linear layer projects the input data into a hidden dimension (256), followed by a batch normalization layer and a ReLU activation function that introduces nonlinear relationships between features. The final linear layer outputs latent embeddings of 128 dimensions.
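A minimal PyTorch sketch of one modality-specific encoder matching this description follows; the class name and exact layer ordering are illustrative assumptions, and the released code may differ in detail.

```python
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """Sketch of one modality-specific encoder: linear projection to a hidden
    dimension, batch normalization, ReLU, and a final linear layer, with
    L2-normalized outputs."""

    def __init__(self, in_dim, hidden_dim=256, latent_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, latent_dim),
        )

    def forward(self, x):
        # Normalize so embeddings lie on the unit hypersphere (||z||_2 = 1).
        return F.normalize(self.net(x), p=2, dim=1)
```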
Benchmark models
To ensure rigorous evaluation and avoid any train-test data leakage, we implemented a unified modular training and evaluation framework for all models. This was particularly important as we identified instances of such leakage in some published models. Our framework strictly restricted the training function to access only the training data, excluding any information from the test set. The trained model was subsequently used to generate embeddings for the hold-out dataset, enabling fair and consistent evaluations across all models (see more details at https://github.com/AmirTEbi/sCIN). The embeddings generated by all models were evaluated using five metrics: ASW, Recall@k, median rank, cell type accuracy, and cell type@k. Additionally, for each model, a joint embedding was constructed by concatenating embeddings from all modalities. sCIN was implemented and trained using the PyTorch framework in Python on an Nvidia GeForce RTX 4090 graphics processing unit (GPU). Each benchmark experiment was replicated 10 times.
Con-AAE
Con-AAE uses two encoders to map each modality into low-dimensional manifolds, employing an adversarial loss to ensure that an adversarial classifier cannot distinguish between the latent embeddings based on their modalities [28]. To enhance stability, the model incorporates a cycle-consistency loss. Additionally, a contrastive learning loss is applied to the embeddings to increase the similarity between cells of the same type. Finally, a classifier predicts cell type probabilities based on the joint embedding space. The model was trained using default parameters, and the original Con-AAE implementation is available at https://github.com/kakarotcq/Con-AAE/tree/main.
MOFA+
Multi-Omics Factor Analysis v2 (MOFA+) [35] is an unsupervised, probabilistic matrix factorization framework designed for the integration of single-cell multi-omics data. It extends the original MOFA model [22] by introducing a more scalable variational-inference engine, flexible sparsity constraints, and explicit support for modeling multiple sample groups (“multi-group” functionality) while handling missing values across assays. Using the mofapy2 Python package (https://github.com/bioFAM/mofapy2), we trained MOFA+ on the entire dataset, as it lacks modality-specific embeddings and train/test splitting. To evaluate MOFA+’s paired test embeddings based on Recall@k, cell type accuracy, and median rank, we imputed one modality from another by augmenting the dataset with test samples initialized as zero matrices. For the unpaired test datasets, we extracted the learned weights, precisions, and intercepts for each view, and computed the closed-form posterior mean of the factors by centering and precision-weighting the test data matrix, and finally multiplied by the precomputed weight–covariance product. For efficiency, chromatin accessibility was reduced to 100 PCs before training. MOFA+ was trained with 1000 iterations, 30 factors, and “fast” convergence mode.
Harmony
Harmony refines PCA embeddings through two steps: (i) diversity-maximizing clustering and (ii) mixture model-based linear correction [23]. Although originally developed for batch correction, we adapted Harmony for multimodal integration. Datasets were reduced to 100 PCs and then projected into shared space using CCA. Preprocessing followed harmonypy standards as in https://github.com/slowkow/harmonypy. Since harmonypy does not natively support train/test splits, we trained Harmony on the training set (10 epochs). For evaluation, we cloned the model with identical parameters to generate embeddings for the hold-out data in a single epoch.
scGLUE
scGLUE consists of different modules [20]. It has modality-specific VAEs to generate cell embeddings. Moreover, it encodes the relationships between biological features (e.g. genes and chromatin regions) in a bi-directional graph with self-loops. A graph VAE produces embeddings for each node (i.e. feature) in that graph. Since the latent dimensions of the cell and feature embeddings are the same, scGLUE reconstructs the original data using the inner product of cell and feature embeddings. It also employs an adversarial classifier to predict the omics modality from cell embeddings. scGLUE was run using the "scglue" Python package based on the default settings and guidelines (https://scglue.readthedocs.io/en/latest/tutorials.html).
scBridge
scBridge is a semi-supervised method for the integration of scRNA-seq and scATAC-seq data [33]. scBridge uses annotated scRNA-seq data to find representative cell-type features (prototypes). Then, it accepts the scATAC-seq data as gene activities and identifies scATAC-seq cells showing high correlation between chromatin accessibility and gene expression of the prototypes as reliable candidates for integration. scBridge was run using the default settings as stated in https://github.com/XLearning-SCU/scBridge/tree/main.
sciCAN
sciCAN (single-cell chromatin accessibility and gene expression data integration via Cycle-consistent Adversarial Network) is an unsupervised deep learning model that employs a shared encoder to learn integrated representations from both modalities, utilizing noise contrastive estimation to enhance discriminative features [34]. To align the modalities, sciCAN incorporates a cycle-consistent adversarial network, enabling the translation of data between modalities and ensuring consistency through a cycle-consistency loss. This architecture allows sciCAN to integrate data without requiring paired cells or prior annotations. sciCAN was run using the default settings according to https://github.com/rpmccordlab/sciCAN.
Autoencoder
The baseline Autoencoder uses separate encoders for each data modality, mapping inputs to a 64-dimensional embedding via a 128-dimensional hidden layer with batch normalization and ReLU activation. These embeddings are concatenated and fed into a decoder to reconstruct the inputs. For the SHARE-seq and PBMC datasets, the hidden and latent dimensions were set to 256 and 128, respectively, with 2000, 500, and 100 PCs selected for SHARE-seq, PBMC, and CITE-seq. Training was conducted for 150 epochs with a learning rate of 0.01, and a batch size of 64 for all datasets.
Evaluation metrics
ASW
Silhouette width evaluates clustering quality by comparing a cell's within-cluster distance to its distance from the nearest neighboring cluster. The ASW, calculated as the mean silhouette width across all cells, ranges from −1 to 1, with higher values indicating better cluster separation. To assess integration results, we used ASW based on cell type labels. Euclidean distance was employed as the distance metric. Specifically, embeddings generated by each method for different modalities were concatenated in feature space, and ASW was calculated on the integrated embeddings. We normalized ASW to a range between 0 and 1 as follows:
$$\text{ASW}_{\text{norm}} = \frac{\text{ASW} + 1}{2}$$
A higher value for this metric indicates that the model preserves cell-type heterogeneity in the embedding space more effectively.
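A minimal sketch of this computation with scikit-learn, assuming a joint embedding matrix and per-cell type labels:

```python
from sklearn.metrics import silhouette_score

def normalized_asw(joint_embeddings, cell_types):
    """ASW over cell-type labels, rescaled from [-1, 1] to [0, 1]."""
    asw = silhouette_score(joint_embeddings, cell_types, metric="euclidean")
    return (asw + 1.0) / 2.0
```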
Recall@k
The Recall@k metric assesses the closeness of embeddings of the same cells in the latent space, obtained using different modalities. To achieve this, the embedding of cell $i$ from the first modality (e.g. RNA) is compared to all cell embeddings in the second modality (e.g. ATAC), and the $k$-nearest neighbors of cell $i$ are then computed. We denote $\mathcal{N}_k(i)$ as the cell indices of the $k$-nearest neighbors in the second modality. Irrespective of training on paired or simulated unpaired data, since the hold-out data is paired, we can assess whether the matched embedding of the same cell is among the $k$-nearest neighbors. The fraction of such cells is called Recall@k and is computed as:
$$\text{Recall@}k = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[i \in \mathcal{N}_k(i)\right]$$
where $\mathbb{1}[\cdot]$ is an indicator function that equals 1 if cell $i$ is among the $k$-nearest neighbors in the second modality and 0 otherwise, and $N$ is the number of cells. Recall@k ranges from 0 to 1, with higher values indicating better matching of paired cell embeddings across modalities.
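A minimal sketch of Recall@k with scikit-learn's nearest-neighbor search, assuming row i of the two embedding matrices corresponds to the same cell:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def recall_at_k(emb1, emb2, k=10):
    """Fraction of cells whose matched embedding in the second modality is
    among their k nearest neighbors (row i of emb1 and emb2 are paired)."""
    nbrs = NearestNeighbors(n_neighbors=k).fit(emb2)
    _, idx = nbrs.kneighbors(emb1)                        # (N, k) neighbor indices
    hits = (idx == np.arange(emb1.shape[0])[:, None]).any(axis=1)
    return hits.mean()
```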
Cell type@k
This metric computes, for each cell in modality 1, the fraction of its $k$ nearest neighbors (in the learned embedding of modality 2) that share the same cell-type label, and then averages this fraction across all cells:
$$\text{cell type@}k = \frac{1}{N_1} \sum_{i=1}^{N_1} \frac{1}{k} \sum_{j \in \mathcal{N}_k(i)} \mathbb{1}\left[y_j = y_i\right]$$
where $N_1$ is the number of cells in modality 1's test dataset, $\mathcal{N}_k(i)$ is the set of $k$ nearest neighbors of cell $i$ in modality 2, and $y$ is the cell type label.
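A minimal sketch of cell type@k, assuming embedding and label arrays for each modality:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def cell_type_at_k(emb1, emb2, labels1, labels2, k=10):
    """Average fraction of each modality-1 cell's k nearest neighbors in
    modality 2 that share its cell-type label."""
    labels1, labels2 = np.asarray(labels1), np.asarray(labels2)
    nbrs = NearestNeighbors(n_neighbors=k).fit(emb2)
    _, idx = nbrs.kneighbors(emb1)                 # (N1, k) neighbor indices
    same = labels2[idx] == labels1[:, None]        # label-agreement mask
    return same.mean()
```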
Cell type accuracy
The procedure for this metric is similar to Recall@k but instead checks whether the cell type of the query cell matches that of its one-nearest neighbor in the second modality.
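Equivalently, cell type accuracy can be sketched as a one-nearest-neighbor label match:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def cell_type_accuracy(emb1, emb2, labels1, labels2):
    """Fraction of modality-1 cells whose single nearest neighbor in
    modality 2 carries the same cell-type label."""
    labels1, labels2 = np.asarray(labels1), np.asarray(labels2)
    nbrs = NearestNeighbors(n_neighbors=1).fit(emb2)
    _, idx = nbrs.kneighbors(emb1)
    return (labels2[idx[:, 0]] == labels1).mean()
```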
Median rank
For each cell embedding in the first modality, we calculated its Euclidean distances to all embeddings in the second modality and identified the rank of its matched embedding. The median rank across all cells serves as the metric, with lower values indicating better performance in accurately matching different data modalities for the same cells.
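A minimal sketch of the normalized median rank, assuming paired rows across the two embedding matrices; normalizing by the number of candidate cells is an assumption for the reported normalized values.

```python
import numpy as np
from scipy.spatial.distance import cdist

def normalized_median_rank(emb1, emb2):
    """Median rank of each cell's true match among all modality-2 embeddings
    (rows are paired across emb1 and emb2); lower is better."""
    dists = cdist(emb1, emb2)                       # (N, N) Euclidean distances
    ranks = dists.argsort(axis=1).argsort(axis=1)   # rank of each candidate per cell
    true_ranks = np.diag(ranks)                     # rank of the matched pair
    return np.median(true_ranks) / emb2.shape[0]
```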
Data preparation
SHARE-seq, PBMC, and unpaired snRNA-seq and snATAC-seq data (Muto-2021) were downloaded from https://scglue.readthedocs.io/en/latest/data.html [20]. Since these datasets contain raw Unique Molecular Identifier counts, the preprocessing steps follow the pipeline suggested by Scanpy [36]. Specifically, in the raw count matrix of each modality, each cell was normalized by total counts over all genes. The normalized counts were then log-transformed and scaled to have zero mean and unit variance. The CITE-seq data were downloaded from the Gene Expression Omnibus database under accession number GSE194122; this dataset was preprocessed by the provider. For sCIN, PCA was performed for each modality in all datasets to reduce the initial dimension of the data.
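A minimal Scanpy sketch of this preprocessing pipeline; the file path and number of principal components are placeholders (the number of PCs varied by dataset).

```python
import scanpy as sc

# Placeholder path; steps mirror the pipeline described above.
adata = sc.read_h5ad("modality.h5ad")
sc.pp.normalize_total(adata)   # per-cell normalization by total counts
sc.pp.log1p(adata)             # log-transform the normalized counts
sc.pp.scale(adata)             # scale features to zero mean, unit variance
sc.pp.pca(adata, n_comps=100)  # PCA; the number of PCs is illustrative
X = adata.obsm["X_pca"]        # reduced matrix passed to the encoders
```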
Conclusion
Although there are various powerful methods for multi-omics data integration, such as network-based methods [24, 25] and linear methods [21–23], we opted for contrastive learning, which has emerged as a significant representation learning method in machine learning, specifically in image analysis and natural language processing [37, 38]. It has also been applied in single-cell omics research for data analysis and integration [39, 40]. The core idea of contrastive learning is to align semantically similar samples (positive pairs) closely in the embedding space while pushing dissimilar samples (negative pairs) further apart. Positive and negative pairs can come from different modalities, such as natural language and images.
We introduced sCIN, a novel framework designed to effectively integrate paired and unpaired single-cell omics data modalities, including scRNA-seq, scATAC-seq, and scADT. sCIN leverages different omics from the same cells in paired datasets and from cells of the same cell types in unpaired datasets. Our method outperforms alternative approaches, such as k-nearest neighbor clustering, matrix factorization, and neural networks, as well as baseline models (Autoencoder and PCA), in integrating multi-omics datasets while preserving essential biological characteristics, including cell type identity. Although some existing methods such as Con-AAE incorporate contrastive learning in their training, the main difference between sCIN and Con-AAE lies in how contrastive pairs are selected. Con-AAE performs hard mining by selecting the furthest same-label cells as positive pairs and the closest different-label cells as negative pairs. While this approach can enhance contrast, it may lead to overfitting, especially in noisy or heterogeneous datasets.
On the other hand, sCIN uses the true cross-modal correspondences in paired data. It uses all matched cells as positive pairs and treats all other cells from different cell types in the batch as negative pairs, without relying on hard mining. Similarly, for unpaired data, cells annotated with the same type across modalities are used as positive pairs, also avoiding the need for hard mining. This provides a more stable and balanced training signal, which enhances robustness and generalization. Our evaluations show that this design leads to better performance on unseen data. Notably, sCIN achieved higher ASW scores than other methods, reflecting superior clustering quality based on cell types. Additionally, the comparison of sCIN’s ASW scores with those of the original datasets and the PCA baseline underscores its ability to extract meaningful biological insights from sparse single-cell data.
sCIN not only integrates paired multi-omics datasets but also effectively learns representations from unpaired omics data with different cell counts. Across various unpairing strategies, sCIN consistently captured biological information, with acceptable performance with limited sample sizes. Moreover, sCIN was designed to prevent data leakage between training and evaluation. The implementation is publicly available (Code Availability), accompanied by a tutorial for benchmarking other methods.
sCIN's reliance on cell-type labels during training presents several limitations. First, cell type annotation in single-cell data is inherently challenging due to noise and sparsity in omics measurements. Second, in the unpaired setting, sCIN assumes shared cell types across modalities, which may not always hold due to unique cell types in one modality or differing data quality. Exploring less cell-type-dependent approaches—such as unsupervised or self-supervised strategies—could enhance robustness. Additionally, the current framework is limited to integrating two omics modalities. sCIN is also designed for discrete cell types, making it less suitable for continuous processes like development. Extending sCIN to address these scenarios represents an important direction for future work.
Key Points
sCIN is a novel contrastive learning-based framework for integrating paired and unpaired single-cell multi-omics datasets.
sCIN consists of two modality-specific encoders, trained using a contrastive loss function tailored for single-cell integration tasks.
sCIN outperformed alternative models based on matrix factorization, Fuzzy KNN clustering with linear batch correction, and neural network models in a rigorous assessment.
Contributor Information
Amir Ebrahimi, Department of Biotechnology, College of Science, University of Tehran, Ghods 37, Tehran, 1417763135, Iran.
Alireza Fotuhi Siahpirani, Department of Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Ghods 37, Tehran, 1417763135, Iran.
Hesam Montazeri, Department of Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Ghods 37, Tehran, 1417763135, Iran.
Author contributions
A.E. and H.M. conceived the study and designed the experiments. A.E. and H.M. developed the method. A.E. collected and preprocessed the data, implemented the methods in Python, conducted all the experiments, analyzed the results, and wrote the manuscript. A.F.S. and H.M. supervised the project and verified the findings. A.E., H.M., and A.F.S. discussed the results and contributed to the final version of the manuscript. All authors agreed to the final version of the manuscript.
Conflict of interest
The authors declare no conflict of interest.
Funding
None declared.
Data availability
SHARE-seq, 10x Genomics PBMC (10k version), and Muto-2021 datasets are provided by the scGLUE method [20] at https://scglue.readthedocs.io/en/latest/data.html as h5ad files. The preprocessed CITE-seq dataset is publicly available at the Gene Expression Omnibus (GEO) database under accession number GSE194122 as an h5ad file.
Code availability
Code for sCIN implementation and all analyses in this paper is publicly available at https://github.com/AmirTEbi/sCIN.
References
1. Rosenberg AB, Roco CM, Muscat RA. et al. Single-cell profiling of the developing mouse brain and spinal cord with split-pool barcoding. Science 2018;360:176–82. 10.1126/science.aam8999
2. Cusanovich DA, Hill AJ, Aghamirzaie D. et al. A single-cell atlas of in vivo mammalian chromatin accessibility. Cell 2018;174:1309–1324.e18. 10.1016/j.cell.2018.06.052
3. Ramanathan M, Porter DF, Khavari PA. Methods to study RNA-protein interactions. Nat Methods 2019;16:225–34. 10.1038/s41592-019-0330-1
4. Lake BB, Chen S, Sos BC. et al. Integrative single-cell analysis of transcriptional and epigenetic states in the human adult brain. Nat Biotechnol 2018;36:70–80. 10.1038/nbt.4038
5. Ma A, McDermaid A, Xu J. et al. Integrative methods and practical challenges for single-cell multi-omics. Trends Biotechnol 2020;38:1007–22. 10.1016/j.tibtech.2020.02.013
6. Luo C, Keown CL, Kurihara L. et al. Single-cell methylomes identify neuronal subtypes and regulatory elements in mammalian cortex. Science 2017;357:600–4. 10.1126/science.aan3351
7. Mulqueen RM, Pokholok D, Norberg SJ. et al. Highly scalable generation of DNA methylation profiles in single cells. Nat Biotechnol 2018;36:428–31. 10.1038/nbt.4112
8. Ma S, Zhang B, LaFave LM. et al. Chromatin potential identified by shared single-cell profiling of RNA and chromatin. Cell 2020;183:1103–1116.e20. 10.1016/j.cell.2020.09.056
9. Chen S, Lake BB, Zhang K. High-throughput sequencing of the transcriptome and chromatin accessibility in the same cell. Nat Biotechnol 2019;37:1452–7. 10.1038/s41587-019-0290-0
10. Stoeckius M, Hafemeister C, Stephenson W. et al. Simultaneous epitope and transcriptome measurement in single cells. Nat Methods 2017;14:865–8. 10.1038/nmeth.4380
11. Wang H, Ma X. Learning discriminative and structural samples for rare cell types with deep generative model. Brief Bioinform 2022;23:bbac317. 10.1093/bib/bbac317
12. Wang H, Ma X. Learning deep features and topological structure of cells for clustering of scRNA-sequencing data. Brief Bioinform 2022;23:bbac068. 10.1093/bib/bbac068
13. Wang H-Y, Zhao J-P, Zheng C-H. et al. scGMAAE: Gaussian mixture adversarial autoencoders for diversification analysis of scRNA-seq data. Brief Bioinform 2023;24:bbac585. 10.1093/bib/bbac585
14. Bersanelli M, Mosca E, Remondini D. et al. Methods for the integration of multi-omics data: mathematical aspects. BMC Bioinformatics 2016;17:15. 10.1186/s12859-015-0857-9
15. Baysoy A, Bai Z, Satija R. et al. The technological landscape and applications of single-cell multi-omics. Nat Rev Mol Cell Biol 2023;24:695–713. 10.1038/s41580-023-00615-w
16. Andrew G, Arora R, Bilmes J. et al. Deep canonical correlation analysis. In: Dasgupta S, McAllester D (eds.), Proceedings of the 30th International Conference on Machine Learning (ICML), Vol. 28, pp. 1247–55. Atlanta, Georgia, USA: PMLR; 2013.
17. Hotelling H. Relations between two sets of variates. Biometrika 1936;28:321–77. 10.1093/biomet/28.3-4.321
18. Gayoso A, Steier Z, Lopez R. et al. Joint probabilistic modeling of single-cell multi-omic data with totalVI. Nat Methods 2021;18:272–82. 10.1038/s41592-020-01050-x
19. Zhang Z, Yang C, Zhang X. scDART: integrating unmatched scRNA-seq and scATAC-seq data and learning cross-modality relationship simultaneously. Genome Biol 2022;23:139. 10.1186/s13059-022-02706-x
20. Cao Z-J, Gao G. Multi-omics single-cell data integration and regulatory inference with graph-linked embedding. Nat Biotechnol 2022;40:1458–66. 10.1038/s41587-022-01284-4
21. Stuart T, Butler A, Hoffman P. et al. Comprehensive integration of single-cell data. Cell 2019;177:1888–1902.e21. 10.1016/j.cell.2019.05.031
22. Argelaguet R, Velten B, Arnol D. et al. Multi-omics factor analysis-a framework for unsupervised integration of multi-omics data sets. Mol Syst Biol 2018;14:e8124. 10.15252/msb.20178124
23. Korsunsky I, Millard N, Fan J. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat Methods 2019;16:1289–96. 10.1038/s41592-019-0619-0
24. Wang H, Liu Z, Ma X. Learning consistency and specificity of cells from single-cell multi-omic data. IEEE J Biomed Health Inform 2024;28:3134–45. 10.1109/JBHI.2024.3370868
25. Wu W, Zhang W, Ma X. Network-based integrative analysis of single-cell transcriptomic and epigenomic data for cell types. Brief Bioinform 2022;23:bbab546. 10.1093/bib/bbab546
26. Gao H, Zhang B, Liu L. et al. A universal framework for single-cell multi-omics data integration with graph convolutional networks. Brief Bioinform 2023;24:bbad081. 10.1093/bib/bbad081
27. Ashuach T, Gabitto MI, Koodli RV. et al. MultiVI: deep generative model for the integration of multimodal data. Nat Methods 2023;20:1222–31. 10.1038/s41592-023-01909-9
28. Wang X, Hu Z, Yu T. et al. Con-AAE: contrastive cycle adversarial autoencoders for single-cell multi-omics alignment and integration. Bioinformatics 2023;39:btad162. 10.1093/bioinformatics/btad162
29. Radford A, Kim JW, Hallacy C. et al. Learning transferable visual models from natural language supervision. In: Meila M, Zhang T (eds.), Proceedings of the 38th International Conference on Machine Learning, Vol. 139, pp. 8748–63. Proceedings of Machine Learning Research; 2021. https://proceedings.mlr.press/v139/radford21a.html
30. [dataset] PBMC from a healthy donor, single cell multiome ATAC gene expression demonstration data by Cell Ranger ARC 1.0.0. 10x Genomics; 2020. https://support.10xgenomics.com/single-cell-multiome-atac-gex/datasets/1.0.0/pbmc_granulocyte_sorted_10k
31. Luecken MD, Burkhardt DB, Cannoodt R. et al. A sandbox for prediction and integration of DNA, RNA, and proteins in single cells. In: Vanschoren J, Yeung S (eds.), Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1 (NeurIPS Datasets and Benchmarks 2021). San Diego, CA: Neural Information Processing Systems Foundation, Inc.; 2021.
32. Muto Y, Wilson PC, Ledru N. et al. Single cell transcriptional and chromatin accessibility profiling redefine cellular heterogeneity in the adult human kidney. Nat Commun 2021;12:2190. 10.1038/s41467-021-22368-w
33. Li Y, Zhang D, Yang M. et al. scBridge embraces cell heterogeneity in single-cell RNA-seq and ATAC-seq data integration. Nat Commun 2023;14:6045. 10.1038/s41467-023-41795-5
34. Xu Y, Begoli E, McCord RP. sciCAN: single-cell chromatin accessibility and gene expression data integration via cycle-consistent adversarial network. NPJ Syst Biol Appl 2022;8:33. 10.1038/s41540-022-00245-6
35. Argelaguet R, Arnol D, Bredikhin D. et al. MOFA+: a statistical framework for comprehensive integration of multi-modal single-cell data. Genome Biol 2020;21:111. 10.1186/s13059-020-02015-1
36. Wolf FA, Angerer P, Theis FJ. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol 2018;19:15. 10.1186/s13059-017-1382-0
37. Yang J, Ding R, Deng W. et al. RegionPLC: regional point-language contrastive learning for open-world 3D scene understanding. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 19823–32. Seattle, WA, USA: IEEE/CVF; 2024. 10.1109/CVPR52733.2024.01874
38. Li X, Wu Y, Jiang X. et al. Enhancing visual document understanding with contrastive learning in large visual-language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), pp. 15546–55. Seattle, WA, USA: IEEE/CVF; 2024. 10.1109/CVPR52733.2024.01472
39. Kosuru VV, Ramaraju SV, Anatatmula SH. et al. Contrastive representation learning for multimodal single-cell RNA-seq data integration. In: Sivaraju SS, Lorenz P (eds.), 2024 8th International Conference on Electronics, Communication and Aerospace Technology (ICECA), pp. 627–33. Coimbatore, Tamil Nadu, India: IEEE; 2024.
40. Andrekson L, Mercado R. Contrastive learning for robust cell annotation and representation from single-cell transcriptomics. bioRxiv 2024. 10.1101/2024.06.20.599868