Author manuscript; available in PMC: 2026 May 1.
Published in final edited form as: IEEE Trans Comput Biol Bioinform. 2025 May-Jun;22(3):1086–1094. doi: 10.1109/TCBBIO.2025.3553068

scDILT: a Model-based and Constrained Deep learning Framework for Single-cell Data Integration, Label Transferring, and Clustering

Xiang Lin 1,*, Jianlan Ren 1,*, Le Gao 1, Junwen Wang 2,3,+, Zhi Wei 1,+
PMCID: PMC12421913  NIHMSID: NIHMS2088171  PMID: 40811359

Abstract

The scRNA-seq technology enables high-resolution profiling and analysis of individual cells. The increasing availability of datasets and advancements in technology have prompted researchers to integrate existing annotated datasets with newly sequenced datasets for a more comprehensive analysis. It is important to ensure that the integration of new datasets does not alter the cell clusters defined in the old/reference datasets. Although several methods have been developed for scRNA-seq data integration, there is currently a lack of tools that can simultaneously achieve the aforementioned objectives. Therefore, in this study, we have introduced a novel tool called scDILT, which leverages a conditional autoencoder and deep embedding clustering to effectively remove batch effects among different datasets. Moreover, scDILT utilizes homogeneous constraints to preserve the cell-type/clustering patterns observed in the reference datasets, while employing heterogeneous constraints to map cells in the new datasets to the annotated cell clusters in the reference datasets. We have conducted extensive experiments to demonstrate that scDILT outperforms other methods in terms of data integration, as confirmed by evaluations on simulated and real datasets. Furthermore, we have shown that scDILT can be successfully applied to integrate multi-omics single-cell datasets. Based on these findings, we conclude that scDILT holds great promise as a tool for integrating single-cell datasets derived from different batches, experiments, times, or interventions.

I. INTRODUCTION

The utilization of single-cell RNA sequencing (scRNA-seq) technology allows for the comprehensive profiling of various biological activities at the cellular level, including gene expression, protein expression, and chromatin accessibility. With the significant proliferation of available datasets, researchers are increasingly interested in conducting integrated analyses of datasets originating from diverse batches, tissues, and samples. However, batch effects pose a major challenge because they hinder the comparability and compatibility of different datasets[1]. Hicks et al. indicated that batch effects in scRNA-seq experiments occur when cells are cultured, captured, and sequenced separately[2], and two fully identical experimental designs are generally impossible to achieve[3] [4]. Thus, before analyzing different batches of datasets, it is important to correct the batch effects induced by these technical factors[2].

Many methods have been developed for batch effect removal in scRNA-seq data[5] [6] [7] [8] [9] [10]. Most of them correct the batch effect in a low-dimensional representation of original data. The integrated latent space can be used for many downstream analyses, such as clustering analyses and trajectory analyses. Seurat 3.0[9] corrects batch effect by finding anchor pairs between two batches of data. It relies on the mutual nearest neighbor (MNN) approach[11] which can only integrate two batches of data at a time. The performance is influenced by the order in which the input data is processed, and Seurat becomes computationally expensive in terms of both time and space when dealing with a large number of cells. Furthermore, when combining a pre-clustered and annotated dataset with one or more new datasets, Seurat operates in an unsupervised manner, leading to potential perturbations in the pre-defined cell types. This outcome is less desirable for biologists, as the annotation process is labor-intensive and exhaustive. Biologists aim to maintain the integrity of the original clusters after integration and incorporate new cells into the existing clusters without disrupting the established annotations.

Since the annotation process is labor-intensive, many studies have been designed to transfer labels from a reference dataset to the query datasets. Some models employ a supervised approach to predict the cell types of the query datasets[12]. However, when the reference and query datasets are under different conditions (such as pre- and post-treatment), supervised methods can only assign new cells to the existing cell types and cannot discover new cell types in the query datasets. On the other hand, unsupervised clustering methods, such as Harmony[6], scGCN[13], CarDEC[14] and DESC[7], are also unsatisfactory because the prior information (annotated cell types in the reference dataset) cannot be used to guide the analyses of the integrated data, such as the representation learning and clustering process. Seurat V3 also works in an unsupervised manner, although its labels are transferred directly between the anchors in the reference and query datasets.

To tackle the aforementioned challenges, we have developed a semi-supervised model known as Single-Cell Deep Data Integration and Label Transferring (scDILT). This model aims to integrate diverse batches of data while simultaneously transferring labels from a reference dataset to one or more query datasets. The scDILT framework leverages a conditional autoencoder (CAE) to accomplish this task. The CAE receives the concatenated count matrix of multiple datasets, along with a vector indicating the batch IDs. By doing so, the CAE learns a latent representation of the integrated datasets while effectively mitigating the impact of batch effects on the data. Additionally, we have incorporated deep embedding clustering (DEC) to the integrated latent space generated by CAE, aiming to further enhance the removal of batch effects. To use the true label of reference data to guide the latent representation learning and clustering of query data, we build cell-to-cell constraints and implement them on the latent space[15].

In the clustering process, scDILT incorporates two types of constraints to ensure effective integration. Firstly, it utilizes homogeneous constraints, which preserve the pre-defined clustering or cell-typing patterns observed in the reference datasets. This ensures that the integrated results maintain the original cluster structure and cell type annotations from the reference datasets. Secondly, scDILT employs heterogeneous constraints to accurately map the cells from the new datasets to the pre-defined cell types or clusters in the reference datasets. By leveraging these constraints, scDILT assigns the cells from the new datasets to the appropriate cell types or clusters based on the information provided by the reference datasets, that is, the datasets that were previously annotated.

Upon completion, scDILT generates an integrated latent space representing the input datasets along with predicted labels for all cells. Through extensive experimentation, we have demonstrated that scDILT surpasses existing methods in terms of data integration, label transferring, and the discovery of new cell types. Furthermore, we have successfully applied scDILT to integrate scRNA-seq and scATAC-seq data in two Single-cell Multiome ATAC and Gene Expression datasets. Based on the clustering results obtained from scDILT, we conducted differential expression analyses. The outcomes of these analyses illustrate the superior performance of scDILT in integrating multi-omics data. Consequently, we conclude that scDILT holds substantial promise as a tool for jointly analyzing multiple scRNA-seq datasets.

II. METHODS AND MATERIALS

A. Conditional denoising autoencoder

The denoising autoencoder is a neural network for learning a nonlinear representation of data [16]. It receives data corrupted with artificial noise and reconstructs the original data with an encoder and a decoder [17]. It is generally used to learn a robust latent representation for noisy data. Here we apply the denoising autoencoder to scRNA-seq data, which are highly noisy. We denote the log-normalized count data as $X$ and the corrupted data as $X_c$; formally:

$$X_c = X + \sigma s \tag{1}$$

where s is the artificial noise in standard Gaussian distribution (with mean=0 and variance=1), and σ controls the weights of s. We set σ as 2.5.

Next, we use an autoencoder to reduce the dimension of the count data. The encoder ($E$) and decoder ($D$) are both multi-layered fully connected neural networks. Denoting the latent space as $Z$ and the reconstructed count matrix as $X'$, the encoder is $Z = E_w(X_c)$ and the decoder is $X' = D_{w'}(Z)$, where $w$ and $w'$ are the learnable weights of the encoder and decoder, respectively. Conditional autoencoder (CAE) and conditional variational autoencoder have been used to integrate the data from different batches[18]. Based on the traditional autoencoder model, we add a matrix $B$ to the input of the encoder and decoder. $B$ is the one-hot encoding of the batch vector $b$, which holds the batch ID of each cell. If there are $N$ cells and $M$ batches in $b$, the dimension of $B$ is $N \times M$. So, the encoder becomes $Z = E_w(X_c \oplus B)$ and the decoder becomes $X' = D_{w'}(Z \oplus B)$, where $\oplus$ denotes the concatenation of two matrices. Note that $X_c$ and $Z$ are only used for model training. We perform the downstream analyses on the latent space $Z_0$, which is obtained from the trained model without adding artificial noise. The ELU activation function[19] is used for all the hidden layers in the encoder and decoder except the bottleneck layer.
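The input construction described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' PyTorch code: the function names (`one_hot_batches`, `corrupt`, `encoder_input`) are hypothetical, and the encoder and decoder networks themselves are omitted.

```python
import numpy as np

def one_hot_batches(batch_ids):
    """Build the N x M one-hot matrix B from a length-N batch vector b."""
    labels, idx = np.unique(np.asarray(batch_ids), return_inverse=True)
    idx = idx.ravel()
    B = np.zeros((idx.size, labels.size), dtype=np.float32)
    B[np.arange(idx.size), idx] = 1.0
    return B, labels

def corrupt(X, sigma=2.5, rng=None):
    """Equation (1): add standard Gaussian noise scaled by sigma."""
    rng = np.random.default_rng() if rng is None else rng
    return X + sigma * rng.standard_normal(X.shape)

def encoder_input(X, batch_ids, sigma=2.5, rng=None):
    """Concatenate the corrupted counts with B along the feature axis."""
    B, _ = one_hot_batches(batch_ids)
    return np.concatenate([corrupt(X, sigma, rng), B], axis=1)
```

For a matrix with N cells and G genes drawn from M batches, `encoder_input` returns an N x (G + M) array. The decoder input would be formed the same way by concatenating the latent matrix with `B`.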

B. ZINB loss

We employ a zero-inflated negative binomial (ZINB) model in the reconstruction loss function to characterize the zero-inflated and over-dispersed count data [20]. The ZINB-based reconstruction loss of the autoencoder is defined as LZINB, and further details can be found in the Supplementary Note 1. The sizes of layers are set to (256, 128, 64, 32) for the encoder and (32, 64, 128, 256) for the decoder. The overall architecture of the scDILT model is shown in Fig. 1.
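The exact form of the ZINB loss is given in Supplementary Note 1, which is not reproduced here. As a rough illustration, the sketch below evaluates the standard ZINB negative log-likelihood for a single count, assuming the common parameterization by mean `mu`, dispersion `theta`, and zero-inflation probability `pi`. These names and the per-count scalar form are assumptions made for illustration, not the authors' implementation.

```python
import math

def nb_log_prob(x, mu, theta):
    """Log-probability of count x under a negative binomial (mean mu > 0, dispersion theta > 0)."""
    return (math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
            + theta * math.log(theta / (theta + mu))
            + x * math.log(mu / (theta + mu)))

def zinb_neg_log_likelihood(x, mu, theta, pi):
    """Negative log-likelihood of count x under ZINB with dropout probability 0 <= pi < 1."""
    nb = math.exp(nb_log_prob(x, mu, theta))
    if x == 0:
        prob = pi + (1.0 - pi) * nb   # zeros come from dropout or from the NB itself
    else:
        prob = (1.0 - pi) * nb        # non-zero counts can only come from the NB
    return -math.log(prob)
```

In a full model the three parameters would be predicted per gene and per cell by the decoder, and the loss would be averaged over all entries of the count matrix.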

Fig 1.


The model architecture of scDILT. This model employs a conditional autoencoder structure to integrate the data from different batches. A batch vector is one-hot encoded and concatenated to the input of encoder and decoder. If one or more datasets are used as the reference (with annotated cell types), the cell-to-cell constraints will be built based on the labels of these data and implemented on the bottle-neck layer Z of the autoencoder. Then the deep embedding clustering will be performed on Z. scDILT will output the integrated embeddings of the multiple batches of data and the predicted labels of the cells in all the batches.

C. Deep embedded clustering

Our model has two learning stages, a pretraining stage and a clustering stage. In the pretraining stage, we train the autoencoder without considering the clustering loss and the constraint loss. In the clustering stage, we simultaneously optimize the autoencoder and the clustering results under the guidance of constraints. We perform unsupervised clustering on the latent space of the autoencoder[21]. Our autoencoder maps the input matrix to a low-dimensional space $Z$. The clustering loss is defined as the Kullback-Leibler (KL) divergence between the soft label distribution $Q$ and the derived target distribution $P$. Formally, $Q$ is defined as:

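$$q_{ik} = \frac{\left(1 + \lVert z_i - \mu_k \rVert^2\right)^{-1}}{\sum_{k'} \left(1 + \lVert z_i - \mu_{k'} \rVert^2\right)^{-1}} \tag{2}$$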

where $q_{ik}$ measures the similarity between the latent representation $z_i$ of cell $i$ ($z_i \in Z$) and the cluster center $\mu_k$ using Student's t-kernel[22]. The cluster centers $\mu_k$ are initialized by applying K-means to the latent space $Z$ from the pretraining stage and are then updated per batch in the clustering stage. The target distribution $P$, which emphasizes the more certain assignments, is derived from $Q$. Formally, $p_{ik} \in P$ is defined as:

$$p_{ik} = \frac{q_{ik}^2 / \sum_i q_{ik}}{\sum_{k'} \left( q_{ik'}^2 / \sum_i q_{ik'} \right)} \tag{3}$$

During the training process, Q and clustering loss are calculated per batch and P is updated per epoch. Then, the clustering loss is calculated as:

$$L_{Clustering} = KL(P \,\|\, Q) = \sum_i \sum_k p_{ik} \log \frac{p_{ik}}{q_{ik}} \tag{4}$$
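As a compact illustration of the soft assignment, target distribution, and KL clustering loss defined in this section, the following NumPy sketch computes all three for a small latent matrix. The function names are hypothetical and gradient-based optimization is omitted; it is a sketch of the formulas, not the authors' training code.

```python
import numpy as np

def soft_assignments(Z, centers):
    """Student's t-kernel similarity between each cell and each cluster center, row-normalized."""
    d2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    q = 1.0 / (1.0 + d2)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(Q):
    """Sharpen Q by squaring and dividing by the per-cluster soft frequency, then row-normalize."""
    weight = Q ** 2 / Q.sum(axis=0)
    return weight / weight.sum(axis=1, keepdims=True)

def clustering_loss(P, Q, eps=1e-12):
    """KL(P || Q) summed over cells and clusters; eps guards against log(0)."""
    return float(np.sum(P * np.log((P + eps) / (Q + eps))))
```

Because `P` places more weight on confident assignments than `Q`, minimizing this loss pulls each cell toward its most likely cluster center.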

D. Autoencoder with pairwise constraints

1). Homogeneous constraints

When one or more of the input datasets have annotated labels/cell types, we use these labels to guide the clustering of the unlabeled cells. Based on the autoencoder architecture, we add pairwise constraints between cells[15] on the latent space according to the labels of cells in the reference datasets. We employ must-link (ML) constraints to push two cells toward similar soft labels if they belong to the same cell type. Conversely, we employ cannot-link (CL) constraints to push two cells toward distinct soft labels if they belong to different cell types. The must-link constraint loss is defined as:

$$L_{must\text{-}link} = -\sum_{(i,j) \in ML} \log \sum_k q_{ik} \times q_{jk} \tag{5}$$

where qik is the soft label of cell i for cluster k as described in the clustering section above. The constraints loss of cannot-link is defined as:

$$L_{cannot\text{-}link} = -\sum_{(i,j) \in CL} \log \left(1 - \sum_k q_{ik} \times q_{jk}\right) \tag{6}$$
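The two constraint losses can be illustrated with a short NumPy sketch. It assumes both losses are negative log terms summed over the constrained pairs, so that minimizing them raises the soft-label agreement of must-link pairs and lowers it for cannot-link pairs. The function names are hypothetical.

```python
import numpy as np

def must_link_loss(Q, pairs, eps=1e-12):
    """Sum over ML pairs of -log(sum_k q_ik * q_jk); small when soft labels agree."""
    i, j = np.asarray(pairs).T
    agreement = (Q[i] * Q[j]).sum(axis=1)
    return float(-np.log(agreement + eps).sum())

def cannot_link_loss(Q, pairs, eps=1e-12):
    """Sum over CL pairs of -log(1 - sum_k q_ik * q_jk); small when soft labels differ."""
    i, j = np.asarray(pairs).T
    agreement = (Q[i] * Q[j]).sum(axis=1)
    return float(-np.log(1.0 - agreement + eps).sum())
```

Here `Q` is the N x K soft-label matrix and `pairs` is a list of `(i, j)` cell index tuples.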

The number of constraints can be set according to the number of cells. In this study, we set 3000 must-link constraints and 9000 cannot-link constraints. By adding the constraints to the autoencoder model, the clustering results of the reference dataset(s) can be retained in the new clustering analysis and transferred to the new datasets. Moreover, although the constraints are only built between the cells in the reference dataset(s), they can help to improve the clustering performance of the unlabeled datasets by locating the centroid of each cluster.
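The text does not specify how the constrained pairs are drawn. One plausible approach, shown here purely as an assumption, is to sample random pairs of reference cells and sort them into must-link and cannot-link sets by comparing their labels. The function name, the sampling-with-replacement scheme, and the `max_draws` safeguard are all hypothetical.

```python
import numpy as np

def sample_constraints(labels, n_ml=3000, n_cl=9000, seed=0, max_draws=1_000_000):
    """Randomly draw must-link and cannot-link pairs from labeled reference cells."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    ml, cl = [], []
    draws = 0
    while (len(ml) < n_ml or len(cl) < n_cl) and draws < max_draws:
        i, j = rng.integers(0, labels.size, size=2)
        draws += 1
        if i == j:
            continue  # a cell cannot be paired with itself
        if labels[i] == labels[j] and len(ml) < n_ml:
            ml.append((int(i), int(j)))   # same cell type: must-link
        elif labels[i] != labels[j] and len(cl) < n_cl:
            cl.append((int(i), int(j)))   # different cell types: cannot-link
    return ml, cl
```

Pairs may repeat under this scheme, and fewer pairs than requested are returned if `max_draws` is reached first, for example when all cells share one label.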

2). Heterogeneous constraints

To connect the cells in different datasets, we employ the algorithm in Seurat V3 to calculate the anchor cells between two datasets. Briefly, Seurat uses canonical correlation vectors (CCV) to project the two datasets into a correlated low-dimensional space. Then mutual nearest neighbors in the two datasets are identified based on the L2-normalized CCV. We implement these algorithms using Seurat (V4.1.0). Firstly, the raw count data of each dataset are normalized, and the top 2000 highly variable genes are selected for each. Then, the function “SelectIntegrationFeatures” is used to find the anchors between the two datasets. These anchors are used as the must-links in scDILT (Supplementary S5S10).
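To make the mutual-nearest-neighbor step concrete, the sketch below finds MNN pairs between two embeddings that have already been projected into a shared space. It is not Seurat's implementation: the CCA projection, anchor filtering, and anchor scoring that Seurat performs are omitted, and the function names and the default `k` are assumptions.

```python
import numpy as np

def l2_normalize(A, eps=1e-12):
    """Scale each row to unit Euclidean length."""
    return A / (np.linalg.norm(A, axis=1, keepdims=True) + eps)

def mutual_nearest_neighbors(A, B, k=5):
    """Return (i, j) pairs where B[j] is among the k nearest of A[i] and vice versa."""
    A, B = l2_normalize(A), l2_normalize(B)
    sim = A @ B.T                                   # cosine similarity after normalization
    k_a = min(k, B.shape[0])
    k_b = min(k, A.shape[0])
    nn_ab = np.argsort(-sim, axis=1)[:, :k_a]       # for each A cell, its top-k B cells
    nn_ba = np.argsort(-sim.T, axis=1)[:, :k_b]     # for each B cell, its top-k A cells
    pairs = []
    for i in range(A.shape[0]):
        for j in nn_ab[i]:
            if i in nn_ba[j]:
                pairs.append((i, int(j)))
    return pairs
```

Each returned pair links one cell from each dataset and could serve as a cross-dataset must-link constraint.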

Combining the pairwise constraint loss, reconstruction loss, and clustering loss, the total loss of the scDILT is:

$$L = L_{ZINB} + \gamma L_{Clustering} + \beta L_{must\text{-}link} + \delta L_{cannot\text{-}link} \tag{7}$$

where γ, β and δ are the coefficients for the clustering loss, the must-link loss (which includes the loss from both homogeneous and heterogeneous constraints), and the cannot-link loss, respectively. In the experiments of this study, we set γ, β, and δ to 1, 0.5 and 1.5, respectively.
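As a small worked example of how the loss terms combine across the two training stages, the sketch below assumes that only the ZINB reconstruction loss is used during pretraining and that the full weighted sum is used during the clustering stage. The function name and the `pretraining` flag are hypothetical.

```python
def total_loss(l_zinb, l_clustering=0.0, l_must_link=0.0, l_cannot_link=0.0,
               gamma=1.0, beta=0.5, delta=1.5, pretraining=False):
    """Weighted sum of the loss terms; only the reconstruction loss during pretraining."""
    if pretraining:
        return l_zinb
    return l_zinb + gamma * l_clustering + beta * l_must_link + delta * l_cannot_link
```

With the coefficients used in this study, component losses of 2.0, 1.0, 4.0 and 2.0 combine to 2.0 + 1.0 + 2.0 + 3.0 = 8.0.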

This model is implemented in Python 3 using PyTorch [23]. The numbers of homogeneous and heterogeneous constraints are listed in Supplementary Note 2 and Supplementary Table 1.

E. Competing methods and Evaluation metrics

We compared our method with four state-of-the-art methods (Supplementary Note 3). All the methods employ the same data preprocessing, normalization and feature selection approaches.

In well-integrated data, cells of the same type from different datasets are assigned to the same cluster. Thus, we evaluate the performance of data integration by measuring the clustering performance using the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI) (Supplementary Note 4). The real datasets used in this study are listed in Supplementary Table 2.
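For readers unfamiliar with the two metrics, a minimal pure-Python sketch is given below. The precise definitions used in the study are in Supplementary Note 4. This sketch uses the standard pair-counting form of ARI and, as an assumption, normalizes the mutual information by the arithmetic mean of the two entropies. It expects at least two cells.

```python
import math
from collections import Counter

def adjusted_rand_index(true, pred):
    """ARI from the contingency table of two labelings (1.0 means identical partitions)."""
    n = len(true)
    pair, a, b = Counter(zip(true, pred)), Counter(true), Counter(pred)
    sum_ij = sum(math.comb(c, 2) for c in pair.values())
    sum_a = sum(math.comb(c, 2) for c in a.values())
    sum_b = sum(math.comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / math.comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_ij - expected) / (max_index - expected)

def normalized_mutual_info(true, pred):
    """Mutual information divided by the mean of the two label entropies."""
    n = len(true)
    pair, a, b = Counter(zip(true, pred)), Counter(true), Counter(pred)
    mi = sum(c / n * math.log(c * n / (a[u] * b[v])) for (u, v), c in pair.items())
    h_a = -sum(c / n * math.log(c / n) for c in a.values())
    h_b = -sum(c / n * math.log(c / n) for c in b.values())
    denom = (h_a + h_b) / 2
    return mi / denom if denom > 0 else 1.0
```

Both scores are invariant to how the cluster IDs are numbered, which is why they suit comparisons between predicted clusters and annotated cell types.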

III. RESULTS

We applied the same preprocessing steps to all simulated and real datasets (Supplementary Note 5). For the generation of simulated datasets, we utilized the R package Splatter (v1.18.2). The simulation parameters were estimated from a real dataset (Supplementary Note 6). Additionally, we collected four real datasets from different species and tissues (Supplementary Note 7). These datasets are first integrated using Seurat V3 with the top 4,000 features, and then the clustering results of scDILT are added to the Seurat object as metadata to display the clustering results in UMAP and t-SNE plots. To understand the pathway enrichment of the clustered cell groups, differential expression analysis was performed (Supplementary Note 8).

To compare the clustering performance of scDILT with other state-of-the-art methods, including Seurat V3, Harmony, CarDEC, and DESC, we conducted a comparative analysis on the simulated data. scDILT outperforms all competing methods (Supplementary Note 9).

A. Integration of data in the same sample but different batches

In the first real data experiment, we integrate two batches of scRNA-seq data from the spleen of mice SLN111 (Fig. 2) and SLN208 (Fig. 3). We assume similar cell type composition and proportion between the two batches. The first batch is used as the reference dataset with true cell type labels. We leverage these labels to guide clustering for all cells, ensuring accurate and consistent results across both batches. In Fig. 2a, the t-SNE plots display the latent representations generated by scDILT, Harmony, CarDEC, DESC, and Seurat V3 (from top to bottom) for the SLN111 dataset. The corresponding clustering performance of each method is depicted in Fig. 2b. scDILT achieves the highest ARI score (0.618) and the second-highest NMI score (0.625) among the competing methods, while the highest NMI score is 0.631. These results indicate that scDILT performs best on ARI and close to the best on NMI, demonstrating its efficacy in integrating and analyzing the SLN111 dataset. In the t-SNE plot generated by DESC, cells from different batches appear to be completely separated. This observation suggests that the differences between the two batches overshadow the variations among different cell types. Conversely, in the t-SNE plots of Seurat V3, CarDEC, Harmony and scDILT, cells from different batches are mixed together. Each cell type contains cells from both batches, indicating that the variations among cell types play a more prominent role in the clustering process. However, while CarDEC, Harmony and Seurat V3 can effectively remove the batch effect between the two datasets, they struggle to separate different cell types in the integrated latent representation (Fig. 2a). Multiple clusters within different cell types appear to be connected to each other, making it challenging to discern distinct cell types using these clustering algorithms. In contrast, scDILT separates different cell types in the latent space while effectively removing the batch effect. These results demonstrate the enhanced clustering accuracy of scDILT compared to the competing methods.

Fig 2.


a. Latent representation of the SLN111 dataset from scDILT and the competing methods. From top to bottom, t-SNE plots show the latent representations from scDILT, CarDEC, DESC, Harmony and Seurat V3, respectively. From left to right, cells are colored by batch, cluster, cell type and expression of gene CD14. b. Clustering performance of all methods, with ARI on the top and NMI at the bottom. c. Volcano plot from the DE analysis between cluster 14 (CD4 T cells) and the other clusters. d. GSEA plot of the top 5 enriched pathways in gene set CP REACTOME, with p-value and adjusted p-value listed on the right side.

Fig 3.


a. Latent representation of the SLN208 dataset from scDILT and the competing methods. From top to bottom, t-SNE plots are from scDILT, CarDEC, DESC, Harmony and Seurat V3, respectively. From left to right, cells are colored by batch, cluster, cell type and expression of gene CD79a. b. Clustering performance of all methods, with ARI on the top and NMI at the bottom. c. Volcano plot from the DE analysis between cluster 11 (mature B cells) and the other clusters. d. GSEA plot of the top five enriched pathways in gene set C8, with p-value and adjusted p-value listed on the right side.

To validate the clustering results from scDILT, we use differential expression (DE) analysis and gene set enrichment analysis (GSEA). We focus on cluster 14, comparing its gene expression with all other clusters using the Wilcoxon test. Fig. 2a shows that most cells in cluster 14 are CD4 T cells, indicating that cells of the same type from different batches are assigned to the same cluster after integration. Consequently, we select CD4 as a marker gene and plot its expression pattern on the t-SNE plots. This visualization supports the accurate clustering and integration of CD4 T cells achieved by scDILT. In the t-SNE plots of scDILT, Harmony, CarDEC, and Seurat V3, CD4 is mainly expressed in the cluster of CD4 T cells. However, in the t-SNE plot of DESC, CD4 is expressed in multiple clusters. Besides CD4, multiple T cell marker genes, including CD27[24], CD28[25], LCK[26], TCF7[27], and CD3D[28], show a high log2 fold change (>2) and a small P-value (<0.05) (Fig. 2c) in the result of scDILT. We then perform GSEA using the DEGs in the CD4 T cells. The gene set CP REACTOME from MSigDB Collections (cell type signatures) is used. Among the top 10 significant pathways (NES > 2, p-value < 0.05), 4 are T cell-related pathways (Fig. 2d), including “Generation of second messenger”, “Membrane trafficking”, “Signaling by GPCR”, and “TCR signaling”. These findings support the clustering results of scDILT.

Fig. 3 illustrates the results obtained for dataset SLN208. In this dataset, scDILT demonstrates superior performance compared to the competing methods in terms of removing batch effects, separating cell types, and overall clustering performance (Fig. 3a, b). Cluster 11 is selected for further downstream analyses. The cells in this cluster are mature B cells. Here we illustrate the expression of EBF1[30], which is a marker gene of multiple cell types including the B doublets, transitional B cells, and Ifit3-high B cells. EBF1 is expressed in multiple clusters in the t-SNE plots of scDILT and DESC but is concentrated in one large cluster in the t-SNE plots of CarDEC and Seurat V3. This result indicates that scDILT and DESC can separate the B cell sub-types after integration in the latent representation. Other marker genes, BCL11A, CD79B, CD74, CD19, CD55[30] and MS4A1, are also highly expressed in cluster 11 in the results of scDILT (Fig. 3c). In addition, in the GSEA results for cluster 11, 4 of the top 10 enriched pathways are B cell-related, including “MHC class II antigen presentation”, “Antigen activates B cell receptor BCR leading to generation of second messengers”, “Signaling by the B cell receptor BCR”, and “Interferon gamma signaling” (Fig. 3d). The “MHC class II antigen presentation” pathway describes the presentation of antigens by major histocompatibility complex class II molecules, essential for B cell activation and antigen recognition. The “Antigen activates B cell receptor BCR leading to generation of second messengers” pathway highlights the activation of B cell receptors (BCRs) by antigens, leading to the generation of second messengers such as cAMP, which regulate various cellular processes. The “Signaling by the B cell receptor BCR” pathway details the intracellular signaling cascades initiated upon BCR activation, triggering B cell proliferation, differentiation, and antibody production. The “Interferon gamma signaling” pathway involves the signaling of interferon-gamma, a cytokine that influences B cell activation, antibody class switching, and immune responses. These pathways collectively contribute to the orchestration of B cell-related immune processes. To sum up, the results in this part demonstrate that scDILT has superior performance in data integration and label transferring for datasets from the same tissue but sequenced in different batches.

B. Integration of scRNA-seq and scATAC-seq data

In the second experiment, we integrate the Single-cell Multiome ATAC Gene Expression (SMAGE-seq) dataset of human PBMC obtained from the 10X Genomics website. To integrate the multimodal data, we perform separate integration for the mRNA and ATAC modalities. The results of mRNA integration are presented in Fig. 4, while the results of ATAC integration are shown in Fig. 5. Notably, scDILT outperforms the other methods with the highest ARI and NMI scores for both scRNA-seq and scATAC-seq data (Fig. 4b and Fig. 5b). Although all methods effectively remove the batch effect in the mRNA data (Fig. 4a), certain challenges arise in the t-SNE plots. CarDEC fails to distinguish between CD4 memory cells and CD8 naive cells in the mRNA data, DESC separates CD14+ monocytes and CD4 memory cells into multiple clusters, and Seurat struggles to separate pre-B cells and B cell progenitors. However, scDILT successfully separates pre-B cells and B cell progenitors, albeit with slight overlap between CD4 memory cells and CD8 naive cells. In the t-SNE plot of the ATAC data (Fig. 5a), DESC fails to integrate the two batches, while Seurat and CarDEC remove batch effects but struggle to separate different cell types such as B cell progenitors and platelet cells in the latent representation. Only scDILT accurately separates these cell types while effectively removing batch effects. Furthermore, we assess the expression of the CD14 marker gene in CD14 monocyte cells using t-SNE plots, and confirm its exclusive expression in the CD14 monocyte cluster. Other marker genes for CD14 monocytes, including SAT1, ZEB2, LYZ, CD36, and CD74, also exhibit high expression in cluster 11, as identified by scDILT (Fig. 4c). In the pathway analysis, 3 of the top 10 enriched pathways are monocyte-related, including “lung classical monocyte cell”, “lung nonclassical monocyte cell” and “lung OLR1 classical monocyte cell” (Fig. 4d, Table S4). In summary, this experiment demonstrates that scDILT has superior performance in batch effect removal and cell type separation not only in scRNA-seq data but also in scATAC-seq data, providing an opportunity to integrate multi-omics datasets.

Fig 4.


a. Latent representation of the 10X granulocyte scRNA-seq dataset from scDILT and the competing methods. From top to bottom, t-SNE plots are from scDILT, CarDEC, DESC, Harmony and Seurat V3, respectively. From left to right, cells are colored by batch, cluster, cell type and expression of gene CD14. b. Clustering performance of all methods, with ARI on the top and NMI at the bottom. c. Volcano plot from the DE analysis between cluster 11 (CD14 monocytes) and the other clusters. d. GSEA plot of the top 5 enriched pathways in the C8 gene set, with p-value and adjusted p-value listed on the right side.

Fig 5.


a. Latent representation of the 10X granulocyte scATAC-seq dataset from scDILT and the competing methods. From top to bottom, t-SNE plots are from scDILT, CarDEC, DESC, Harmony and Seurat V3, respectively. From left to right, cells are colored by batch, cluster and cell type. b. Clustering performance of all methods, with ARI on the top and NMI at the bottom.

C. Integration of data in the sample with different treatments

In the third experiment, we integrate scRNA-seq data of human PBMC under different treatments. We assume that there are some differences in cell type composition and/or proportion among the datasets. We use four datasets in this experiment: 1) PBMC from a healthy human; 2) PBMC from a patient with Drug-induced hypersensitivity syndrome/drug reaction with eosinophilia and systemic symptoms (DiHS/DRESS); 3) PBMC from a DiHS/DRESS patient with sulfamethoxazole (SMX) and trimethoprim (TMP) treatment; 4) PBMC from a DiHS/DRESS patient with SMX, TMP, and tofacitinib (TOFA) treatment. Since only the healthy PBMC dataset has true labels, we use this dataset as the reference to guide the clustering of the other datasets. Fig. 6 displays the t-SNE plots of the latent representations obtained from scDILT and Seurat V3 on the PBMC datasets. Seurat V3 effectively removes the batch effect, but it fails to clearly distinguish between various cell types in the latent space. In contrast, scDILT demonstrates a satisfactory performance in both batch effect removal and cell type separation. Notably, the well-separated cell types in the latent space of scDILT allow for the observation of variations in cell type composition across different batches and the identification of distinct batch proportions among different cell types. In Fig. 6, the cells from the healthy PBMC sample are colored based on their true labels/cell types, while the remaining cells are shown in blue. Interestingly, two new clusters, cluster 1 and cluster 2, are identified in the patient samples but are absent from the healthy PBMC data. Clusters 1 and 2 exclusively contain cells from patients, with cluster 2 lacking cells from the TOFA treatment group. Conversely, cluster 12 exclusively comprises cells without any treatments and represents the CD14+ monocyte cell type. These findings suggest that treatments such as SMX, TMP, and/or TOFA have the potential to eliminate CD14+ monocytes in the peripheral blood of patients. In summary, this experiment demonstrates that scDILT has superior performance in batch effect removal and cell type separation, providing an opportunity to find new cell types in samples with different interventions.

Fig 6.


Latent representation of the PBMC dataset with different treatments from scDILT and the competing method. From top to bottom, t-SNE plots are from scDILT and Seurat V3, respectively. From left to right, cells are colored by batch, cluster and cell type. Legends of cell types and batches are attached to the right. Batch information: 1) PBMC from a healthy human; 2) PBMC from a patient with Drug-induced hypersensitivity syndrome/drug reaction with eosinophilia and systemic symptoms (DiHS/DRESS); 3) PBMC from a DiHS/DRESS patient with sulfamethoxazole (SMX) and trimethoprim (TMP) treatment; 4) PBMC from a DiHS/DRESS patient with SMX, TMP, and tofacitinib (TOFA) treatment.

D. Integration of data in the same tissue but different batches

In this experiment, we integrated scRNA-seq data from different batches of human lung samples. The first batch (GSE131907) consisted of 208,506 cells derived from 58 lung tissues of 44 patients, and the second batch (GSE123904) comprised 40,505 cells from 17 lung tissues of 15 patients. Because of the large size of these datasets, we randomly selected 500 cells from each cell type per batch for analysis. Although the cells originated from the same tissue, their cell-type compositions were not identical: of the 32 identified cell types, 11 were shared between the two datasets, 15 were unique to the first dataset, and 6 were unique to the second dataset. The first dataset served as the reference and provided the ground-truth labels.

The results of this experiment are shown in Fig. 7. scDILT achieved the highest ARI and NMI among all methods (Fig. 7b). In the t-SNE plot of DESC, cells were fragmented into numerous small clusters, with multiple cell types appearing in each cluster. In contrast, the t-SNE plots of scDILT, Harmony, CarDEC, and Seurat V3 showed well-separated clusters of varying sizes. However, CarDEC, Harmony, and Seurat V3 failed to separate certain cell types in the latent representation, so their t-SNE plots contain a large ‘cluster’ comprising several smaller clusters and multiple cell types. Only scDILT separated all clusters in the t-SNE plot while preserving the clusters unique to each dataset, in agreement with the ground truth. For example, cluster 35 consists of mesothelial cells unique to the second dataset, and cluster 25 consists of tS1 cells unique to the first dataset.

We selected cluster 5 (mast cells) for the downstream analyses. The expression of the marker gene KIT [29] is shown in Fig. 7a, where scDILT groups most of these cells from different batches into a single cluster. Two other marker genes of mast cells, GATA2 and MS4A2 [29], are also highly expressed in cluster 5, as shown in the volcano plot (Fig. 7c). In the GSEA results, 4 out of 10 enriched pathways are related to mast cells: “travaglini lung basophil mast 2 cell”, “travaglini lung basophil mast 1 cell”, “cui developing heart C7 mast cell”, and “durante adult olfactory neuroepithelium mast cells” (Fig. 7d, Table S3). Mast cells play important roles in various physiological processes, including immune responses, tissue homeostasis, and sensory perception. The pathways involved in these cell types may include processes related to cell activation, degranulation, immune signaling, cytokine production, and tissue-specific functions.

These findings indicate that scDILT effectively integrates datasets derived from the same tissue but from different patients, remaining robust to inter-patient variability while capturing the underlying biological heterogeneity. By accurately aligning the data, scDILT enables analysis of the features shared across patients and those unique to them, which facilitates the identification of biological insights and potential biomarkers associated with the tissue of interest.
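
The per-batch, per-cell-type subsampling described above can be sketched as follows. This is a minimal illustration and not the authors' script: the function name `subsample_per_type`, the tuple layout of the input, and the choice to keep every cell of a group that has fewer than 500 cells are all assumptions made for the example.

```python
import random
from collections import defaultdict


def subsample_per_type(cells, n_per_group=500, seed=0):
    """Randomly keep at most n_per_group cells per (batch, cell type) pair.

    cells: iterable of (cell_id, batch, cell_type) tuples (hypothetical layout).
    Groups smaller than n_per_group are kept whole (an assumption; the text
    does not say how small groups were handled).
    """
    rng = random.Random(seed)

    # Group cell ids by the (batch, cell_type) pair they belong to.
    groups = defaultdict(list)
    for cell_id, batch, cell_type in cells:
        groups[(batch, cell_type)].append(cell_id)

    selected = []
    # Iterate in sorted key order so the result is reproducible for a seed.
    for key in sorted(groups):
        members = groups[key]
        if len(members) > n_per_group:
            members = rng.sample(members, n_per_group)
        selected.extend(members)
    return selected
```

Sampling within each (batch, cell type) group rather than over the pooled cells keeps rare cell types represented, which matters here because many cell types are unique to one of the two datasets.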

Fig 7.

a. Latent representation of the lung cancer dataset from scDILT and the competing methods. From top to bottom, the t-SNE plots are from scDILT, CarDEC, DESC, Harmony, and Seurat V3, respectively. From left to right, cells are colored by batch, cluster, cell type, and expression of the gene KIT. b. Clustering performance of all methods, with ARI at the top and NMI at the bottom. c. Volcano plot from the DE analysis between cluster 5 (mast cells) and the other clusters. Genes are listed within the box. d. GSEA plot of the top 5 enriched pathways in the C8 gene set, with p-values and adjusted p-values listed on the right side.
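
For readers who wish to reproduce the two clustering metrics reported in panel b, the sketch below computes ARI and NMI from two label vectors using only the Python standard library. It is illustrative rather than the evaluation code used in this study. The NMI here is normalized by the arithmetic mean of the two entropies, which is an assumption, since the normalization is not stated in this section.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label vectors must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: treat as trivially identical

    # Pairs of cells that share a label in both, in truth only, in prediction only.
    both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    cols = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = rows * cols / total_pairs
    max_index = (rows + cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. one cluster in both labelings
    return (both - expected) / (max_index - expected)


def normalized_mutual_info(labels_true, labels_pred):
    """NMI, normalized by the arithmetic mean of the two entropies (assumed)."""
    n = len(labels_true)
    if n == 0 or n != len(labels_pred):
        raise ValueError("label vectors must be non-empty and of equal length")
    joint = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    mi = sum((c / n) * log(c * n / (rows[u] * cols[v]))
             for (u, v), c in joint.items())
    h_true = -sum((c / n) * log(c / n) for c in rows.values())
    h_pred = -sum((c / n) * log(c / n) for c in cols.values())

    denom = (h_true + h_pred) / 2
    if denom == 0:
        return 1.0  # both labelings consist of a single cluster
    return mi / denom
```

Both metrics are invariant to how cluster labels are named, so a predicted clustering that matches the ground truth up to relabeling scores 1.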

E. Model tests

We ran simulations on datasets with an increasing number of cells to investigate how the running time of scDILT scales with cell number. For a given dataset X with N cells, we combined two instances of X to create a larger dataset with 2N cells, simulating the integration process. As shown in Fig. 8a, the running time of scDILT increases linearly with the number of cells. Notably, when integrating a large dataset with 2×100,000 cells, scDILT completes all tasks, including integration and clustering, in approximately 50 minutes. Because scDILT is trained on mini-batches of 256 cells, its memory footprint remains low even for large datasets.
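
A scaling test of this kind can be sketched as below. The helper names (`make_doubled_dataset`, `time_call`, `linear_fit`) are hypothetical and are not part of the scDILT code base; the sketch only shows how a 2N-cell input with two batch labels might be built from one dataset, how a run could be timed, and how linearity could be checked with an ordinary least-squares fit. No measured running times are included.

```python
import time


def make_doubled_dataset(rows):
    """Stack a dataset on itself and tag the two copies as batch 0 and batch 1."""
    rows = list(rows)
    data = rows + rows
    batch = [0] * len(rows) + [1] * len(rows)
    return data, batch


def time_call(fn, *args):
    """Run fn(*args) once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def linear_fit(xs, ys):
    """Least-squares fit ys ~ slope * xs + intercept; returns (slope, intercept, r2)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) points of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must contain at least two distinct values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Coefficient of determination: 1 means the points lie exactly on a line.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r2 = 1.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot
    return slope, intercept, r2
```

In use, one would call `time_call` on the integration routine for each dataset size and pass the resulting (cell number, seconds) pairs to `linear_fit`; an r2 close to 1 would be consistent with linear scaling.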

Fig 8.

a. Running time of scDILT. We integrate two datasets with cell numbers from 2,000 to 10,000 in increments of 1,000. The running time of scDILT increases linearly with the number of cells. b. Relationship between γ (weight of the clustering loss) and clustering performance. We set γ to 0, 0.01, 0.1, 1, 10, and 100. c. Relationship between the total number of links and clustering performance. We set the total number of links to 0, 1000, 5000, 12000, 24000, and 48000. d. Relationship between the number of heterogeneous links and clustering performance. We set the number of heterogeneous links to 0, 100, 500, 1000, and 3000. For panels b, c, and d, clustering performance is evaluated using ARI (red) and NMI (blue) on two datasets, SLN111 (solid line) and SLN208 (dashed line).

All model tests for scDILT were conducted using two spleen lymph node datasets (SLN111 and SLN208). The clustering loss weight, γ, is an important parameter in our model. To evaluate the impact of the clustering loss on scDILT, we varied γ (0, 0.01, 0.1, 1, 10, and 100) while keeping the other parameters fixed (Fig. 8b). We observed that when γ is less than 1, the clustering loss enhances the clustering performance of scDILT, whereas excessively high γ values (>10) negatively affect performance. Additionally, we examined the contribution of homogeneous links to scDILT’s clustering performance. By increasing the total number of links (0, 1000, 5000, 12000, 24000, and 48000) while keeping the ratio of 3-fold cross-links to ML links constant (Fig. 8c), we found that an appropriate number of homogeneous links (approximately equal to the number of cells) significantly improves clustering performance, while excessive links lead to inferior results. This may be because overly tight linking hinders the clustering process. Finally, we evaluated the impact of heterogeneous links on scDILT’s clustering performance. We increased the number of heterogeneous links (0, 100, 500, 1000, and 3000) while keeping the sum of heterogeneous and homogeneous ML links constant (3000) and the number of homogeneous CL links fixed at 9000 (Fig. 8d). We observed that an appropriate number of heterogeneous links (<1000) enhances clustering performance, while too many heterogeneous links have a negative effect.
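
To make the link counts above concrete, the sketch below draws homogeneous must-link (ML) and cannot-link (CL) pairs from the labels of reference cells. It is a minimal illustration under stated assumptions, not the link-generation procedure of scDILT: the function name `sample_links`, the rejection-sampling strategy, and the `max_tries` safeguard are all introduced for this example. A pair is treated as ML when both cells carry the same reference label and as CL otherwise.

```python
import random


def sample_links(labels, n_ml, n_cl, seed=0, max_tries=1_000_000):
    """Sample unique ML and CL index pairs from reference labels.

    labels: list of cluster labels, one per reference cell.
    Returns (ml, cl), each a sorted list of (i, j) pairs with i < j.
    Stops early if max_tries random draws cannot fill both quotas.
    """
    n = len(labels)
    if n < 2:
        raise ValueError("need at least two cells to form a pair")
    rng = random.Random(seed)
    ml, cl = set(), set()
    tries = 0
    while (len(ml) < n_ml or len(cl) < n_cl) and tries < max_tries:
        tries += 1
        i, j = rng.sample(range(n), 2)  # two distinct cell indices
        pair = (min(i, j), max(i, j))
        if labels[i] == labels[j]:
            if len(ml) < n_ml:
                ml.add(pair)   # same label: must-link
        elif len(cl) < n_cl:
            cl.add(pair)       # different labels: cannot-link
    return sorted(ml), sorted(cl)
```

For instance, the setting in Fig. 8d with no heterogeneous links would correspond to a call such as `sample_links(reference_labels, n_ml=3000, n_cl=9000)`, where `reference_labels` is a placeholder for the annotated labels of the reference dataset.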

IV. CONCLUSION

This article presents scDILT, a novel tool for integrating scRNA-seq data from different batches or experiments. scDILT utilizes a conditional autoencoder and deep embedding clustering to effectively remove batch effects and ensure accurate clustering patterns across datasets. The inclusion of intra- and inter-dataset constraints enables scDILT to preserve clustering patterns in the reference datasets and accurately map cells in query datasets to pre-defined and annotated clusters. The extensive experiments demonstrate the superior performance of scDILT in data integration, making it a valuable tool for integrating single-cell datasets across different conditions, interventions, and time points. Additionally, scDILT shows potential for integrating multi-omics data, opening new possibilities for comprehensive analyses in the era of expanding single-cell sequencing technologies and available datasets.

Supplementary Material

supp2-3553068
supp1-3553068

Key points.

  • scDILT removes batch effects, using homogeneous constraints for reference patterns and heterogeneous constraints for new dataset cell mapping.

  • scDILT successfully integrates multi-omics single-cell datasets and outperforms other methods in terms of data integration.

  • scDILT holds promise for integrating single-cell datasets from various sources.

Footnotes

Declarations of interest: none

V. DATA AND CODE AVAILABILITY

The code, source data, and running scripts of scDILT are available on GitHub: https://github.com/Jianlan0816/scDILT.

VI. REFERENCES

  • [1] Lähnemann D et al., “Eleven grand challenges in single-cell data science,” Genome Biol, vol. 21, no. 1, p. 31, Feb 7 2020, doi: 10.1186/s13059-020-1926-6.
  • [2] Hicks SC, Townes FW, Teng M, and Irizarry RA, “Missing data and technical variability in single-cell RNA-sequencing experiments,” Biostatistics, vol. 19, no. 4, pp. 562–578, Oct 1 2018, doi: 10.1093/biostatistics/kxx053.
  • [3] Bacher R and Kendziorski C, “Design and computational analysis of single-cell RNA-sequencing experiments,” Genome Biol, vol. 17, p. 63, Apr 7 2016, doi: 10.1186/s13059-016-0927-y.
  • [4] Stegle O, Teichmann SA, and Marioni JC, “Computational and analytical challenges in single-cell transcriptomics,” Nature Reviews Genetics, vol. 16, no. 3, pp. 133–145, 2015.
  • [5] Barkas N et al., “Joint analysis of heterogeneous single-cell RNA-seq dataset collections,” Nat Methods, vol. 16, no. 8, pp. 695–698, Aug 2019, doi: 10.1038/s41592-019-0466-z.
  • [6] Korsunsky I et al., “Fast, sensitive and accurate integration of single-cell data with Harmony,” Nat Methods, vol. 16, no. 12, pp. 1289–1296, Dec 2019, doi: 10.1038/s41592-019-0619-0.
  • [7] Li X et al., “Deep learning enables accurate clustering with batch effect removal in single-cell RNA-seq analysis,” Nat Commun, vol. 11, no. 1, p. 2338, May 11 2020, doi: 10.1038/s41467-020-15851-3.
  • [8] Polanski K, Young MD, Miao Z, Meyer KB, Teichmann SA, and Park JE, “BBKNN: fast batch alignment of single cell transcriptomes,” Bioinformatics, vol. 36, no. 3, pp. 964–965, Feb 1 2020, doi: 10.1093/bioinformatics/btz625.
  • [9] Stuart T et al., “Comprehensive Integration of Single-Cell Data,” Cell, vol. 177, no. 7, pp. 1888–1902 e21, Jun 13 2019, doi: 10.1016/j.cell.2019.05.031.
  • [10] Welch JD, Kozareva V, Ferreira A, Vanderburg C, Martin C, and Macosko EZ, “Single-Cell Multi-omic Integration Compares and Contrasts Features of Brain Cell Identity,” Cell, vol. 177, no. 7, pp. 1873–1887 e17, Jun 13 2019, doi: 10.1016/j.cell.2019.05.006.
  • [11] Haghverdi L, Lun AT, Morgan MD, and Marioni JC, “Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors,” Nature Biotechnology, vol. 36, no. 5, pp. 421–427, 2018.
  • [12] Pasquini G, Rojo Arias JE, Schafer P, and Busskamp V, “Automated methods for cell type annotation on scRNA-seq data,” Comput Struct Biotechnol J, vol. 19, pp. 961–969, 2021, doi: 10.1016/j.csbj.2021.01.015.
  • [13] Song Q, Su J, and Zhang W, “scGCN is a graph convolutional networks algorithm for knowledge transfer in single cell omics,” Nature Communications, vol. 12, no. 1, pp. 1–11, 2021.
  • [14] Lakkis J et al., “A joint deep learning model enables simultaneous batch effect correction, denoising, and clustering in single-cell transcriptomics,” Genome Res, vol. 31, no. 10, pp. 1753–1766, Oct 2021, doi: 10.1101/gr.271874.120.
  • [15] Tian T, Zhang J, Lin X, Wei Z, and Hakonarson H, “Model-based deep embedding for constrained clustering analysis of single cell RNA-seq data,” Nature Communications, vol. 12, no. 1, pp. 1–12, 2021.
  • [16] Hinton GE and Salakhutdinov RR, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–7, Jul 28 2006, doi: 10.1126/science.1127647.
  • [17] Vincent P, Larochelle H, Bengio Y, and Manzagol P-A, “Extracting and composing robust features with denoising autoencoders,” in Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 1096–1103.
  • [18] Gayoso A et al., “Joint probabilistic modeling of single-cell multi-omic data with totalVI,” Nat Methods, vol. 18, no. 3, pp. 272–282, Mar 2021, doi: 10.1038/s41592-020-01050-x.
  • [19] Clevert D-A, Unterthiner T, and Hochreiter S, “Fast and accurate deep network learning by exponential linear units (elus),” arXiv preprint arXiv:1511.07289, 2015.
  • [20] Tian T, Wan J, Song Q, and Wei Z, “Clustering single-cell RNA-seq data with a model-based deep learning approach,” Nature Machine Intelligence, vol. 1, no. 4, pp. 191–198, 2019.
  • [21] Xie J, Girshick R, and Farhadi A, “Unsupervised deep embedding for clustering analysis,” in International Conference on Machine Learning, 2016: PMLR, pp. 478–487.
  • [22] Van der Maaten L and Hinton G, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. 11, 2008.
  • [23] Paszke A et al., “Automatic differentiation in pytorch,” 2017.
  • [24] Latorre I et al., “Study of CD27 and CCR4 markers on specific CD4+ T-cells as immune tools for active and latent tuberculosis management,” Frontiers in Immunology, p. 3094, 2019.
  • [25] June CH, Ledbetter JA, Linsley PS, and Thompson CB, “Role of the CD28 receptor in T-cell activation,” Immunology Today, vol. 11, pp. 211–216, 1990.
  • [26] Palacios EH and Weiss A, “Function of the Src-family kinases, Lck and Fyn, in T-cell development and activation,” Oncogene, vol. 23, no. 48, pp. 7990–8000, 2004.
  • [27] Utzschneider DT et al., “T cell factor 1-expressing memory-like CD8+ T cells sustain the immune response to chronic viral infections,” Immunity, vol. 45, no. 2, pp. 415–427, 2016.
  • [28] Noutsias M et al., “Expression of functional T-cell markers and T-cell receptor Vbeta repertoire in endomyocardial biopsies from patients presenting with acute myocarditis and dilated cardiomyopathy,” European Journal of Heart Failure, vol. 13, no. 6, pp. 611–618, 2011.
  • [29] Grimbaldeston MA, Chen C-C, Piliponsky AM, Tsai M, Tam S-Y, and Galli SJ, “Mast cell-deficient W-sash c-kit mutant KitW-sh/W-sh mice as a model for investigating mast cell biology in vivo,” The American Journal of Pathology, vol. 167, no. 3, pp. 835–848, 2005.
  • [30] Browne P, Petrosyan K, Hernandez A, and Chan JA, “The B-cell transcription factors BSAP, Oct-2, and BOB.1 and the pan–B-cell markers CD20, CD22, and CD79a are useful in the differential diagnosis of classic Hodgkin lymphoma,” American Journal of Clinical Pathology, vol. 120, no. 5, pp. 767–777, 2003.
