Abstract
Single-cell RNA sequencing has significantly advanced our understanding of cell heterogeneity and gene regulation. Batch-effect correction is essential for achieving robust data integration. Multiple methods have been developed to address this issue, particularly procedural approaches involving components such as anchoring or deep learning, which have achieved notable successes. However, order preservation, as an important feature, has been largely overlooked in procedural methods. Based on a monotonic deep learning network, we developed a correction method with an order-preserving feature. By comparing with existing methods, we demonstrated that our method effectively improved clustering performance and better retained the original inter-gene correlation and differential expression information.
Keywords: scRNA-sequencing, batch effect, order-preserving, monotonic deep learning network, inter-gene correlation, differential expression consistency
Introduction
The rapid advancement of single-cell RNA sequencing (scRNA-seq) technologies has significantly enhanced our understanding of cellular diversity and gene regulation in complex biological systems [1, 2]. By enabling the profiling of thousands of individual cells, scRNA-seq has revolutionized the study of cellular heterogeneity. However, integrating scRNA-seq datasets from different sources is often hindered by batch effects—systematic discrepancies arising from variations in experimental conditions, such as sample preparation, sequencing protocols, and platform differences [3, 4]. These batch effects can obscure true biological signals and distort downstream analyses, making their correction essential for robust cross-study comparisons [5].
In the context of batch-effect correction, the order-preserving feature of gene expression levels refers to the property of maintaining the relative rankings or relationships of gene expression levels (considering sequencing depth) within each batch after correcting batch effects. This feature ensures that the intrinsic order of gene expression levels is not disrupted during the correction process [6]. Maintaining the original order of gene expression levels helps to retain biologically meaningful patterns, such as relative expression levels between genes or cells, which are crucial for downstream analyses like differential expression or pathway enrichment studies [5]. Additionally, the order-preserving feature enhances the robustness of batch-effect correction methods, ensuring reliable data integration from diverse sources.
Several methods have been developed to correct batch effects, which can be broadly categorized into non-procedural methods and procedural methods, each employing distinct strategies. Non-procedural methods rely on direct statistical modeling to adjust batch effects without iterative feature alignment or sample matching. Examples include ComBat [7] and Limma [8], which were originally developed for bulk RNA-seq and later adapted for scRNA-seq. These methods adjust additive or multiplicative batch biases effectively, but their performance may be hindered in scRNA-seq due to its inherent sparsity and “dropout” effects, resulting from stochastic gene expression and RNA capture limitations.
To address this issue, procedural methods have been developed, involving multi-step computational workflows that align features or samples across batches. For example, Seurat v3 [9] uses canonical correlation analysis to identify shared subspaces and mutual nearest neighbors (MNNs) to anchor cells between batches. Similarly, Harmony [10] iteratively adjusts embeddings to align batches while preserving biological variation, and MMD-ResNet [11] uses deep learning to minimize distribution discrepancies. Moreover, Liger [12] and scVI [13] address this issue through factor decomposition and variational autoencoders [14], respectively, allowing them to correct batch effects while retaining complex biological signals.
Despite these advancements, several limitations remain. Firstly, deep learning-based methods, while powerful for modeling complex data structures, often suffer from interpretability issues, complicating biological analysis. Secondly, many approaches separate batch-effect correction from cell clustering, which can lead to the loss of rare cell type (CT) information [15]. Integrating batch-effect correction and clustering would better preserve biological signals [16]. More importantly, most current procedural methods neglect the order-preserving feature, which may result in the loss of valuable intra-batch information and misinterpretation of differential expression patterns. Although methods based on direct statistical modeling, such as ComBat, possess the order-preserving feature, the presence of a large number of zero values in scRNA-seq data often makes them ineffective in correcting batch effects in certain scenarios.
Therefore, we developed an order-preserving procedural method to correct batch effects. Our method performed initial clustering and utilized nearest neighbor (NN) information within and between batches to construct similarities between clusters. These similarities were then used to design a loss function [weighted maximum mean discrepancy (MMD)] for batch-effect correction, and we employed a monotonic deep learning network to ensure the intra-gene order-preserving feature. Compared to MMD-ResNet [11], we addressed potential class imbalances between different batches through weighted design and obtained a complete gene expression matrix. By comparing with existing methods, we demonstrated that our method not only improved clustering accuracy but also preserved inter-gene correlation. Furthermore, the order-preserving feature allowed us to retain differential expression information within each batch after correction, providing a more biologically interpretable framework for the integration of scRNA-sequencing data.
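The intra-gene order-preserving guarantee comes from monotonicity: if the correction applied to each gene is a non-decreasing function, the relative rankings of expression levels cannot change. A toy numpy sketch of one standard way to build such a network (positive weights via exponentiation plus increasing activations; the layer sizes and parameterization here are illustrative assumptions, not the paper's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def softplus(x):
    # numerically stable, strictly increasing activation
    return np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0.0)

class MonotonicMLP:
    """Toy monotone map: exponentiated (hence positive) weights combined
    with increasing activations make the output non-decreasing in every
    input coordinate."""
    def __init__(self, dims):
        self.raw_w = [rng.normal(size=(a, b)) for a, b in zip(dims[:-1], dims[1:])]
        self.bias = [rng.normal(size=b) for b in dims[1:]]

    def __call__(self, x):
        for rw, b in zip(self.raw_w[:-1], self.bias[:-1]):
            x = softplus(x @ np.exp(rw) + b)   # positive weights via exp
        return x @ np.exp(self.raw_w[-1]) + self.bias[-1]

net = MonotonicMLP([1, 8, 8, 1])
xs = np.linspace(-3.0, 3.0, 200).reshape(-1, 1)
ys = net(xs).ravel()
assert np.all(np.diff(ys) >= 0)  # ordering of inputs is preserved
```

Any cells with expression a ≤ b for a gene keep that ordering after passing through such a map, which is exactly the order-preserving property evaluated later.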
Results
Overview and evaluation
Our method was designed to align multiple batches of scRNA-seq data while preserving the intra-gene order of expression levels and inter-gene correlation during the correction process. The overall workflow is illustrated in Fig. 1. After preprocessing the scRNA-seq data, we initialized clustering using optional clustering algorithms and estimated the probability of each cell belonging to each cluster. Then, we utilized intra-batch and inter-batch NN information to evaluate the similarity among the obtained clusters, thereby completing intra-batch merging and inter-batch matching of similar clusters. To achieve batch-effect correction, we calculated the distribution distance between the reference batch and the query batch using the weighted maximum mean discrepancy. We finally minimized the loss through a global or partial monotonic deep learning network to obtain a corrected gene expression matrix. Our approach was thus divided into two modes: a global model and a partial model, with the partial model incorporating the cluster-membership probability matrix as an additional input to the network.
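The weighted MMD loss can be sketched as follows; the RBF kernel, its bandwidth, and the uniform weights are illustrative assumptions (the actual kernel choice and cluster-derived weights are described in Methods):

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # pairwise Gaussian kernel between rows of a and b
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def weighted_mmd2(x, y, wx, wy, gamma=1.0):
    """Squared MMD between two weighted empirical distributions.
    wx, wy: per-cell weights (e.g. cluster-membership probabilities),
    normalized here to sum to 1."""
    wx = wx / wx.sum()
    wy = wy / wy.sum()
    kxx = wx @ rbf_kernel(x, x, gamma) @ wx
    kyy = wy @ rbf_kernel(y, y, gamma) @ wy
    kxy = wx @ rbf_kernel(x, y, gamma) @ wy
    return kxx + kyy - 2.0 * kxy

rng = np.random.default_rng(1)
ref = rng.normal(0.0, 1.0, size=(100, 5))  # reference batch (toy embedding)
qry = rng.normal(0.5, 1.0, size=(100, 5))  # query batch with a mean shift
w = np.ones(100)

shifted = weighted_mmd2(ref, qry, w, w)
same = weighted_mmd2(ref, ref.copy(), w, w)
assert shifted > same  # shifted batch shows a larger discrepancy
```

Minimizing this quantity over the network's parameters pulls the corrected query distribution toward the reference; the weights counteract class imbalance between matched clusters.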
Figure 1.
Procedure of the order-preserving batch-effect correction method based on a monotonic deep learning network (Step 1: Preprocessing raw count data to obtain a normalized expression matrix; Step 2: Initializing clusters based on the normalized expression matrix and estimating the probability that each cell belongs to each cluster, i.e. the probability matrix; Step 3: Merging clusters within batches and matching clusters between batches based on NN information; Step 4: Minimizing the weighted MMD between the paired sets of clusters by the monotonic deep learning network; the probability matrix can be used as an additional input to the network, and all details of the method can be found in Methods and Supplementary Materials).
We evaluated our method using multiple scRNA-seq data from different sources. Performance was compared to five established batch-effect correction methods, including ComBat [7], Harmony [10], Seurat v3 [9], MNN Correct [17], and ResPAN [18], as well as uncorrected raw data. To visualize the effectiveness of our method, we used t-SNE (t-distributed Stochastic Neighbor Embedding) and UMAP (Uniform Manifold Approximation and Projection), both widely adopted techniques for dimensionality reduction and high-dimensional data visualization. Unlike principal component analysis (PCA), t-SNE and UMAP are particularly suited for scRNA-seq data [6] because they better preserve local and global structures while being scalable to large datasets.
We analyzed inter-gene correlation and differential expression consistency to highlight the advantages of the order-preserving feature during the correction process. To assess clustering performance after batch-effect correction, we focused on two main criteria: batch mixing and CT purity. Specifically, we employed three clustering-related metrics: Adjusted Rand Index (ARI) [19] for clustering accuracy, Average Silhouette Width (ASW) [20] for cluster compactness, and Local Inverse Simpson Index (LISI) [10] for neighborhood diversity. Detailed definitions of these metrics can be found in the Methods section.
Compared to the benchmark methods, our method demonstrated superior performance, particularly in maintaining inter-gene correlation, improving CT clustering accuracy, and preserving original differential expression information within batches.
Order-preserving feature
Most current procedural methods neglect the order-preserving feature within genes during batch-effect correction. To evaluate how well different methods preserve the original ranking of gene expression levels, for each dataset we selected the two CTs with the largest and smallest sample sizes. For each of these CTs, as well as for all samples, we plotted boxplots of Spearman correlation coefficients before and after correction by different methods. Because the large number of zeros in scRNA-seq datasets can result in many tied rankings, we only considered cells with non-zero raw counts for each gene in this analysis.
We excluded Harmony from this evaluation. According to the literature [10], the input of Harmony is a PCA embedding of the gene expression matrix, and its output is an embedded feature space of the same dimensionality, used mainly for subsequent clustering and visualization analyses. Because this output no longer retains the original data dimensions, it is not feasible to directly calculate Spearman correlation coefficients at the level of gene expression before versus after correction, and Harmony was therefore not included in the evaluation.
Among all listed methods, only the non-procedural method ComBat and our global monotonic model were able to preserve the order of (non-zero) gene expression levels before versus after batch-effect correction. The partial monotonic model could only ensure the order-preserving feature based on the same probability matrix (Fig. 2 and Supplementary Fig. S12).
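The check itself is simple: for each gene, compute Spearman's rho between pre- and post-correction values over cells with non-zero raw counts; a genuinely order-preserving correction yields rho = 1. A small illustration with simulated counts, where the two transforms are hypothetical stand-ins for an order-preserving and an order-breaking correction:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(2)
raw = rng.poisson(3.0, size=200).astype(float)   # one gene across 200 cells
nz = raw > 0                                     # exclude zeros (many ties)

# strictly increasing transform -> ranks unchanged
monotone = np.log1p(raw) * 1.7 + 0.3
rho_mono, _ = spearmanr(raw[nz], monotone[nz])

# additive noise scrambles the ranking
scrambled = raw + rng.normal(0.0, 3.0, size=raw.shape)
rho_scr, _ = spearmanr(raw[nz], scrambled[nz])

assert np.isclose(rho_mono, 1.0)  # order preserved
assert rho_scr < 1.0              # order disrupted
```

Boxplots of these per-gene coefficients across all genes give the summary shown in Fig. 2.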
Figure 2.
Boxplots of Spearman correlation coefficients between original data and batch-effect corrected data using different methods in Dataset 1 (only considering non-zero expression): (a) all samples in batch 1, (b) CT luminal_mature in batch 1, (c) CT luminal_progenitor in batch 1, (d) all samples in batch 2, (e) CT luminal_mature in batch 2, and (f) CT luminal_progenitor in batch 2.
Inter-gene correlation
Analyzing gene-gene interactions is essential for uncovering the intricate dynamics underlying biological processes and disease mechanisms. By identifying functionally related gene clusters, researchers gain insights into how groups of genes co-regulate cellular functions or contribute to disease progression. Constructing gene regulatory networks reveals not only direct gene interactions but also the complex layers of transcriptional regulation [21]. Therefore, maintaining inter-gene correlation during batch-effect correction is essential to preserve the biological integrity of scRNA-seq data.
Most existing batch-effect correction methods primarily focus on aligning cells across batches, often neglecting the preservation of inter-gene correlation structures within CTs. In contrast, our method employed a distribution distance (weighted MMD) as the objective function and utilized a monotonic network to incorporate the order-preserving feature, enabling better preservation of inter-gene correlation during the batch-effect correction process. This approach prevented the disruption of important gene regulatory relationships, thereby maintaining the biological relevance of the data after batch-effect correction.
To quantify the ability to maintain inter-gene correlation, we designed the following procedure. For robustness, we focused only on CTs with more than 30 cells in both the reference and query batches. For each CT, we selected significantly correlated gene pairs within that CT. To avoid the influence of low-expression genes, we considered only genes whose average expression level exceeded the average expression level across all cells. Subsequently, a one-sided correlation test (implemented in R) was performed for each gene pair across different batches. We controlled the false discovery rate (FDR) by requiring each significantly correlated gene pair to exhibit the same correlation direction in both batches and to have Benjamini–Hochberg adjusted P-values below .05.
Finally, we calculated the Pearson correlation of these gene pairs before and after batch-effect correction, and we evaluated the performance of different methods in preserving inter-gene correlation using multiple metrics, including root mean square error (RMSE), Pearson correlation, and Kendall correlation. Across all experimental scRNA-seq datasets, compared to the other methods except ComBat, our partial monotonic model and global monotonic model showed smaller RMSEs and higher Pearson and Kendall correlation coefficients in the vast majority of CTs (Fig. 3 and Supplementary Tables S5, S8, S9).
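The selection step can be sketched in Python as follows (the original analysis used R's correlation test; the expression vectors here are simulated stand-ins, and the helper names are illustrative):

```python
import numpy as np
from scipy import stats

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values."""
    p = np.asarray(pvals, dtype=float)
    order = np.argsort(p)
    ranked = p[order] * len(p) / (np.arange(len(p)) + 1)
    adj = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty_like(adj)
    out[order] = np.minimum(adj, 1.0)
    return out

def one_sided_cor_p(x, y):
    """One-sided (positive-correlation) Pearson test via the t-distribution."""
    r, _ = stats.pearsonr(x, y)
    n = len(x)
    t = r * np.sqrt((n - 2) / (1.0 - r * r))
    return r, stats.t.sf(t, df=n - 2)

rng = np.random.default_rng(3)
base = rng.normal(size=100)
gene_a = base + rng.normal(0.0, 0.3, 100)   # strongly co-expressed pair
gene_b = base + rng.normal(0.0, 0.3, 100)
gene_c = rng.normal(size=100)               # unrelated gene

_, p_ab = one_sided_cor_p(gene_a, gene_b)
_, p_ac = one_sided_cor_p(gene_a, gene_c)
adj = bh_adjust([p_ab, p_ac])
assert adj[0] < 0.05  # co-expressed pair survives BH at the .05 level
```

Requiring the same correlation direction in both batches on top of the BH threshold gives the FDR-controlled gene-pair set described above.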
Figure 3.
The correlation coefficients (y-axis) of significantly correlated gene pairs (x-axis) in the luminal mature CT (Dataset 1) before versus after batch-effect correction: (a–f) correlation coefficients computed from different methods; RMSE represents the root mean squared error between the uncorrected query batch and the corrected query batch.
Furthermore, we included two statistical tests in the difference evaluation. First, for each correction method, we performed a paired Wilcoxon test on the two sets of Pearson correlation coefficients obtained before and after correction (under the assumption of independence or approximate independence). In the vast majority of results, the differences between the two sets of Pearson correlation coefficients before and after correction obtained by our method were not statistically significant (P > .05). For the remaining results (where P < .05), the P-values obtained by our method were still larger than those obtained by the other methods. Second, we conducted a statistical test to assess the differences among methods in their capability to preserve inter-gene correlations. Specifically, for each method, we calculated the differences between the two sets of Pearson correlation coefficients before versus after batch-effect correction. As differences close to zero would be preferred regardless of sign, we took their absolute values (note that RMSEs were also equivalently calculated based on these absolute values). We then conducted a one-sided paired Wilcoxon test on the absolute differences generated by our method versus another method: the null hypothesis was that the two methods had no difference in inter-gene correlation preservation capability (i.e. equal medians of absolute differences), and the alternative hypothesis was that the median absolute difference of our method was smaller. The results showed that, for the vast majority of CTs, the median absolute differences obtained by our method were smaller than those of the other methods (P < .05, excluding the linear method ComBat; Supplementary Table S11). Therefore, our method demonstrated an improved capability in preserving inter-gene correlation (excluding the linear method ComBat).
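The second test above can be sketched as follows, with simulated absolute differences standing in for the real per-gene-pair values (the spread of the two methods is a hypothetical assumption for illustration):

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(4)
# |delta r| per significant gene pair: a method that drifts less vs one that
# drifts more after correction (simulated magnitudes, n = 200 gene pairs)
abs_diff_ours = np.abs(rng.normal(0.0, 0.05, 200))
abs_diff_other = np.abs(rng.normal(0.0, 0.20, 200))

# H1: the median absolute difference of our method is smaller
stat, p = wilcoxon(abs_diff_ours, abs_diff_other, alternative="less")
assert p < 0.05  # the smaller-drift method wins the one-sided paired test
```

The paired design matters here: both methods are evaluated on the same set of gene pairs, so differences are compared pair by pair rather than between pooled distributions.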
Although ComBat demonstrated good performance in preserving inter-gene correlation, it failed to effectively correct batch effects across multiple datasets in our subsequent evaluations. Even after ComBat correction, the same CTs from different batches were still unable to cluster together. The numerous zero values in scRNA-seq data interfered with ComBat’s Bayesian modeling, thereby affecting its ability to correct batch effects. In our subsequent results, we demonstrated that our method not only effectively preserved the original inter-gene correlation but also successfully corrected batch effects and improved clustering accuracy. This indicated that our method achieved a balance between preserving complex gene regulatory networks and correcting batch effects.
Clustering performance
We applied our partial and global monotonic models to nine experimental scRNA-seq datasets and one simulated dataset to evaluate their clustering performance after batch-effect correction and compared them with the other listed methods. The first dataset, derived from three independent studies of mammary epithelial cells, contains 9288 cells across three batches and three CTs (basal, luminal mature, and luminal progenitor) [22–24]. All methods corrected batch effects to varying degrees, and our approach showed superior CT separation and batch mixing, resulting in well-defined clusters (Supplementary Fig. S2). The partial model achieved the highest ARI F1 and the second-highest LISI F1, while the global model achieved the highest ASW F1 (Table 1). For the second dataset, consisting of lung adenocarcinoma cells collected from three cell lines (HCC827, H1975, and H2228) across three platforms [25], all listed methods successfully mixed the batches (Supplementary Fig. S3). The partial model achieved the highest ARI F1, ASW F1, and LISI F1, and the global model ranked second (Table 1).
Table 1.
Comparison of the methods based on the clustering metrics computed on the different datasets
| Dataset 1 | Dataset 2 | |||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ARI | ASW | LISI | ARI | ASW | LISI | |||||||||||||
| Method | CT | 1-B | F1 | CT | 1-B | F1 | CT | B | F1 | CT | 1-B | F1 | CT | 1-B | F1 | CT | B | F1 |
| Partial | 0.99 | 0.95 | 0.97 | 0.80 | 1.03 | 0.90 | 1.00 | 2.28 | 1.38 | 0.99 | 1.00 | 0.99 | 0.84 | 1.01 | 0.91 | 1.00 | 2.07 | 1.34 |
| Global | 0.98 | 0.94 | 0.96 | 0.86 | 1.01 | 0.93 | 1.01 | 2.18 | 1.36 | 0.97 | 0.99 | 0.98 | 0.71 | 1.02 | 0.85 | 1.01 | 2.05 | 1.33 |
| ComBat | 0.97 | 0.94 | 0.96 | 0.76 | 0.96 | 0.85 | 1.00 | 1.33 | 1.13 | 0.97 | 0.99 | 0.98 | 0.71 | 1.01 | 0.83 | 1.01 | 2.03 | 1.32 |
| Harmony | 0.98 | 0.94 | 0.96 | 0.82 | 1.01 | 0.91 | 1.00 | 2.27 | 1.38 | 0.97 | 1.00 | 0.98 | 0.71 | 1.01 | 0.84 | 1.01 | 2.01 | 1.32 |
| Seurat | 0.99 | 0.95 | 0.97 | 0.83 | 1.02 | 0.91 | 1.00 | 2.38 | 1.40 | 0.97 | 0.99 | 0.98 | 0.73 | 1.02 | 0.85 | 1.00 | 2.04 | 1.33 |
| MNN | 0.97 | 0.94 | 0.96 | 0.76 | 0.96 | 0.85 | 1.00 | 1.33 | 1.13 | 0.97 | 0.99 | 0.97 | 0.72 | 1.02 | 0.85 | 1.01 | 2.06 | 1.33 |
| ResPAN | 0.97 | 0.94 | 0.96 | 0.82 | 1.01 | 0.91 | 1.01 | 2.24 | 1.37 | 0.97 | 0.99 | 0.98 | 0.72 | 1.01 | 0.84 | 1.01 | 1.53 | 1.19 |
| Dataset 3 | Dataset 4 | |||||||||||||||||
| Partial | 0.99 | 0.99 | 0.99 | 0.80 | 1.04 | 0.90 | 1.00 | 1.74 | 1.27 | 0.98 | 1.00 | 0.99 | 0.71 | 0.99 | 0.83 | 1.01 | 1.75 | 1.26 |
| Global | 0.99 | 0.99 | 0.99 | 0.76 | 1.04 | 0.88 | 1.00 | 1.73 | 1.26 | 0.98 | 1.00 | 0.99 | 0.72 | 0.97 | 0.82 | 1.01 | 1.55 | 1.20 |
| ComBat | 0.99 | 0.99 | 0.99 | 0.73 | 1.04 | 0.86 | 1.00 | 1.40 | 1.16 | 0.88 | 0.90 | 0.89 | 0.50 | 0.81 | 0.62 | 1.00 | 1.12 | 1.06 |
| Harmony | 0.99 | 0.99 | 0.99 | 0.65 | 1.05 | 0.81 | 1.00 | 1.70 | 1.25 | 0.86 | 0.92 | 0.89 | 0.71 | 0.97 | 0.82 | 1.01 | 1.57 | 1.20 |
| Seurat | 0.99 | 0.99 | 0.99 | 0.81 | 0.97 | 0.88 | 1.00 | 1.73 | 1.26 | 0.97 | 1.00 | 0.98 | 0.72 | 0.99 | 0.84 | 1.01 | 1.81 | 1.28 |
| MNN | 0.99 | 0.99 | 0.99 | 0.73 | 1.03 | 0.86 | 1.00 | 1.73 | 1.26 | 0.98 | 1.00 | 0.99 | 0.62 | 0.97 | 0.76 | 1.02 | 1.54 | 1.19 |
| ResPAN | 0.98 | 0.99 | 0.99 | 0.67 | 1.03 | 0.82 | 1.00 | 1.02 | 1.05 | 0.55 | 0.76 | 0.64 | 0.57 | 0.86 | 0.68 | 1.02 | 1.12 | 1.04 |
| Dataset 5 | Dataset 6 | |||||||||||||||||
| Partial | 0.60 | 0.99 | 0.75 | 0.28 | 0.98 | 0.44 | 1.44 | 1.78 | 0.84 | 0.85 | 1.00 | 0.92 | 0.64 | 1.00 | 0.78 | 1.11 | 1.94 | 1.21 |
| Global | 0.61 | 0.99 | 0.75 | 0.28 | 0.97 | 0.44 | 1.44 | 1.78 | 0.84 | 0.83 | 1.00 | 0.91 | 0.63 | 1.00 | 0.78 | 1.13 | 1.95 | 1.19 |
| ComBat | 0.51 | 0.99 | 0.67 | 0.28 | 0.97 | 0.44 | 1.41 | 1.69 | 0.86 | 0.44 | 0.90 | 0.59 | 0.41 | 0.98 | 0.58 | 1.30 | 1.36 | 0.91 |
| Harmony | 0.53 | 0.98 | 0.69 | 0.27 | 0.98 | 0.43 | 1.39 | 1.83 | 0.91 | 0.83 | 1.00 | 0.91 | 0.58 | 1.00 | 0.73 | 1.18 | 1.94 | 1.15 |
| Seurat | 0.54 | 0.99 | 0.70 | 0.26 | 0.98 | 0.41 | 1.46 | 1.85 | 0.82 | 0.83 | 1.00 | 0.91 | 0.63 | 1.00 | 0.77 | 1.14 | 1.92 | 1.18 |
| MNN | 0.55 | 0.94 | 0.69 | 0.25 | 0.96 | 0.39 | 1.50 | 1.64 | 0.76 | 0.61 | 0.96 | 0.75 | 0.42 | 0.97 | 0.59 | 1.28 | 1.61 | 0.99 |
| ResPAN | 0.48 | 0.94 | 0.65 | 0.28 | 0.97 | 0.43 | 1.52 | 1.75 | 0.74 | 0.84 | 1.00 | 0.91 | 0.60 | 0.97 | 0.74 | 1.16 | 1.31 | 1.02 |
| Dataset 7 | Dataset 8 | |||||||||||||||||
| Partial | 0.99 | 0.58 | 0.73 | 0.76 | 0.81 | 0.78 | 1.00 | 1.75 | 1.27 | 0.95 | 0.95 | 0.97 | 0.48 | 0.98 | 0.64 | 1.07 | 1.70 | 1.20 |
| Global | 0.99 | 0.58 | 0.73 | 0.80 | 0.79 | 0.79 | 1.00 | 1.74 | 1.27 | 0.95 | 0.99 | 0.97 | 0.50 | 0.99 | 0.66 | 1.06 | 1.54 | 1.16 |
| ComBat | 0.53 | 0.16 | 0.25 | 0.61 | 0.46 | 0.52 | 1.00 | 1.00 | 1.00 | 0.93 | 0.99 | 0.96 | 0.47 | 0.96 | 0.63 | 1.05 | 1.24 | 1.07 |
| Harmony | 0.98 | 0.58 | 0.73 | 0.76 | 0.81 | 0.78 | 1.00 | 1.78 | 1.28 | 0.93 | 0.99 | 0.96 | 0.50 | 0.99 | 0.66 | 1.04 | 1.59 | 1.19 |
| Seurat | 0.78 | 0.43 | 0.55 | 0.65 | 0.72 | 0.68 | 1.00 | 1.43 | 1.17 | 0.94 | 0.99 | 0.96 | 0.53 | 0.98 | 0.69 | 1.05 | 1.70 | 1.21 |
| MNN | 0.75 | 0.43 | 0.55 | 0.64 | 0.46 | 0.52 | 1.00 | 1.35 | 1.14 | 0.94 | 0.99 | 0.96 | 0.57 | 0.98 | 0.72 | 1.05 | 1.68 | 1.21 |
| ResPAN | 0.97 | 0.58 | 0.73 | 0.83 | 0.78 | 0.80 | 1.00 | 1.78 | 1.28 | 0.93 | 0.99 | 0.96 | 0.47 | 0.98 | 0.63 | 1.06 | 1.59 | 1.18 |
| Dataset 9 | Dataset 10 | |||||||||||||||||
| Partial | 0.97 | 1.00 | 0.98 | 0.59 | 1.00 | 0.74 | 1.03 | 1.49 | 1.17 | 1.00 | 0.99 | 0.99 | 0.87 | 1.00 | 0.93 | 1.00 | 1.80 | 1.28 |
| Global | 0.97 | 1.00 | 0.98 | 0.43 | 1.01 | 0.61 | 1.03 | 1.47 | 1.16 | 1.00 | 0.99 | 0.99 | 0.88 | 0.99 | 0.93 | 1.00 | 1.82 | 1.29 |
| ComBat | 0.64 | 0.81 | 0.72 | 0.14 | 0.86 | 0.24 | 1.03 | 1.01 | 0.99 | 1.00 | 0.99 | 0.99 | 0.86 | 0.99 | 0.92 | 1.00 | 1.79 | 1.28 |
| Harmony | 0.84 | 0.92 | 0.88 | 0.41 | 0.93 | 0.57 | 1.02 | 1.08 | 1.02 | 1.00 | 0.99 | 0.99 | 0.86 | 0.99 | 0.92 | 1.00 | 1.77 | 1.27 |
| Seurat | 0.96 | 1.00 | 0.98 | 0.52 | 1.00 | 0.69 | 1.03 | 1.65 | 1.22 | 1.00 | 0.99 | 0.99 | 0.90 | 0.99 | 0.94 | 1.00 | 1.77 | 1.27 |
| MNN | 0.96 | 1.00 | 0.98 | 0.42 | 0.98 | 0.59 | 1.03 | 1.34 | 1.12 | 1.00 | 0.99 | 0.99 | 0.85 | 0.99 | 0.92 | 1.00 | 1.79 | 1.28 |
| ResPAN | 0.96 | 0.99 | 0.97 | 0.37 | 0.99 | 0.54 | 1.03 | 1.31 | 1.11 | 1.00 | 0.99 | 0.99 | 0.89 | 1.00 | 0.94 | 1.00 | 1.80 | 1.28 |
The results were evaluated on UMAP embeddings of the corrected data. Our method used Louvain as the initial clustering algorithm and determined the resolution parameter through ASW. Each metric was computed for CT purity (CT), batch mixing (B or 1-B), and the combination of both criteria (F1). The top-performing method for each metric was highlighted in bold.
To further test the flexibility of our method, we assessed their ability to detect new CTs by artificially removing CT H1975 from two batches while retaining it in the third batch (Dataset 3, a subset of Dataset 2). All listed methods successfully clustered the same CTs from different batches and distinguished the unique H1975 cells in the third batch (Supplementary Fig. S4). In a more challenging scenario, Dataset 7 comprised three batches: batch 1 contained only 293T cells (2885 cells), batch 2 contained only Jurkat cells (3258 cells), and batch 3 contained a 50/50 mix of both CTs (3388 cells). Harmony, ResPAN, and our method successfully integrated the batches, resulting in two distinct CTs after batch-effect correction. Other methods struggled in this scenario. Seurat and MNN Correct failed to effectively mix the 293T cells from different batches, and ComBat failed in both CTs (Supplementary Fig. S7).
Given the requirement of deep learning methods for substantial data, we also evaluated the performance of our method on a smaller dataset (Dataset 4), consisting of 704 mouse embryonic stem cells sequenced under three culture conditions [26]. Despite the limited data, our method effectively corrected batch effects and maintained high performance across all three clustering-related metrics (Supplementary Fig. S5, Table 1).
Furthermore, we applied our method to more scRNA-seq datasets and compared their performance with the listed batch-effect correction methods. Our method effectively corrected batch effects across different datasets, demonstrating their broad applicability (Figs 4 and 5, Supplementary Figs S6, S8 and S10). Our method accurately clustered the same CTs from different batches, exhibiting outstanding performance, particularly in clustering accuracy as measured by ARI (Table 1).
Figure 4.
Batch-effect correction for human blood dendritic cells' (DCs) scRNA-seq data (Dataset 6) composed of two batches: (a–h) t-SNE embeddings computed from the compared methods, with points colored by CT; (i–p) t-SNE embeddings computed from the compared methods, with points colored by batch.
Figure 5.
Batch-effect correction for a failing human heart dataset (Dataset 9) composed of three batches: (a–h) t-SNE embeddings computed from the compared methods, with points colored by CT; (i–p) t-SNE embeddings computed from the compared methods, with points colored by batch.
Differential expression consistency
Analyzing differentially expressed genes (DEGs) is crucial for uncovering molecular mechanisms underlying various biological conditions and diseases [27]. DEG analysis allows researchers to identify genes associated with specific diseases, potentially leading to the discovery of therapeutic targets and the development of personalized treatments. Moreover, DEGs serve as biomarkers for disease diagnosis, monitoring treatment responses, and understanding gene regulatory networks, shedding light on gene interactions and their influence on phenotypic traits [28]. Integrating multiple scRNA-seq datasets while preserving the original differential expression information within batches can help us achieve more reliable integration analysis results.
We observed that the mixing performance (LISI F1) of the global model was significantly lower than that of the partial model on certain datasets (such as Datasets 1, 4, and 8 in Table 1). Moreover, across all datasets, the partial model generally outperformed the global model in terms of mixing performance. Although this difference could be attributed to the relationship between the two models, the global monotonic model being a special case of the partial monotonic model, we found that the difference in their mixing performance was more likely caused by inconsistency in differential expression across batches.
A gene should exhibit consistent expression patterns across different batches. For example, if a gene showed an upregulated trend between two CTs in one batch, it should demonstrate a similar trend in another batch. Although many batch-effect correction methods have been developed for scRNA-seq data, the consistency of differential expression across batches is often overlooked in correction evaluations. Previous studies [29, 30] have emphasized the importance of assessing differential expression consistency to avoid drawing misleading biological conclusions.
In our study, we found that the performance of the global model was influenced by the degree of consistency in differential expression across batches. Taking Dataset 1 as an example, we performed a one-sided Wilcoxon rank-sum test for normalized gene expression between each pair of CTs (three pairs in total) within each batch (two query batches and one reference batch). Based on the P-values (p) obtained, we computed the corresponding z-scores (z) using the following transformation:

z = Φ⁻¹(1 − p),

where Φ⁻¹ is the inverse function of the standard normal cumulative distribution function (c.d.f.).
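Numerically, this transformation is the inverse survival function of the standard normal, so a one-sided p-value maps to a signed z-score whose sign encodes the direction of differential expression; a short check:

```python
import numpy as np
from scipy.stats import norm

def p_to_z(p):
    # z = Phi^{-1}(1 - p); isf is the inverse survival function
    return norm.isf(p)

assert np.isclose(p_to_z(0.05), 1.6449, atol=1e-3)  # one-sided .05 threshold
assert np.isclose(p_to_z(0.5), 0.0, atol=1e-12)     # p = .5 -> no evidence
# round trip: the survival function recovers the original p-value
assert np.isclose(norm.sf(p_to_z(0.01)), 0.01)
```

Small p-values (strong upregulation under the one-sided alternative) map to large positive z; p-values near 1 map to large negative z, so plotting z-scores from two batches against each other reveals directional (in)consistency by quadrant.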
The same procedure was also conducted on Dataset 2. We then created a scatter plot with the z-scores of the reference batch on the x-axis and the z-scores of the other batches on the y-axis. As shown in Fig. 6, for Dataset 2 the scatter plots are concentrated in the first and third quadrants, indicating a consistent differential expression direction for most genes between the reference and query batches. In contrast, for Dataset 1, approximately half of the points fall in the second and fourth quadrants, indicating that many genes exhibit opposite differential expression patterns across batches, suggesting significant inconsistency.
Figure 6.
The z-scores between different pairs of CTs across batches in Dataset 1 and Dataset 2: (a–c) z-scores obtained in Dataset 1; (d–f) z-scores obtained in Dataset 2 (in all graphs, the x-axis represents the z-scores obtained from the reference batch, and the y-axis represents the z-scores obtained from the query batches).
Due to the monotonicity constraint of the network, inconsistency in differential expression across batches limited the optimization of the global model. When gene expression trends among CTs were inconsistent across batches, the global model struggled to integrate these trends effectively, resulting in poor mixing performance. In contrast, the partial model alleviated this issue by using the probability matrix as an additional input. Therefore, in this scenario, the LISI F1 score of the global model was lower than that of the partial model (Table 1). Similar phenomena were also observed in Dataset 4 and Dataset 6 (Supplementary Fig. S11 and Table 1).
This result indicated that the direction of differential expression between CTs might vary across different batches. Over-reliance on the differential expression analysis results from a single batch, or an excessive pursuit of mixing performance, could lead to the loss of differential expression information within other batches, ultimately resulting in misleading conclusions. Therefore, preserving the original differential expression information within different batches during the correction process was crucial to ensure reliable downstream integration analysis.
To evaluate the ability of different batch-effect correction methods to preserve batch-specific differential expression information, we used Dataset 2 as an example. We selected the two CTs with the most cells (H1975 and H2228), conducted a one-sided Wilcoxon rank-sum test between these two CTs in the query batch before and after batch-effect correction, and converted the resulting P-values into z-scores. By plotting the z-scores before correction (x-axis) against those after correction (y-axis), we observed the changes in differential expression between the two CTs before versus after correction. We then compared the global model with the widely adopted method Seurat and a deep learning-based method ResPAN. To ensure fairness, all tests were performed on the same set of 2000 highly variable genes (HVGs). We used a significance threshold of P < .05 and the corresponding z-score cutoff. Points where the absolute values of the z-scores both before and after correction exceeded this cutoff and whose signs reversed were identified as outliers. Owing to the order-preserving feature of the global model, it exhibited the fewest outliers (Fig. 7 and Table 2), indicating that it more effectively retained the original differential expression information within the batch.
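As a concrete illustration, the outlier-detection procedure above can be sketched as follows. This is our own sketch, not the authors' code: `scipy.stats.ranksums` is used because its statistic is the normal-approximation z-score of the Wilcoxon rank-sum test, and the default cutoff of 1.96 is an assumed threshold.

```python
import numpy as np
from scipy.stats import ranksums

def de_zscores(expr, labels, ct_a, ct_b):
    """Per-gene Wilcoxon rank-sum z-scores comparing two cell types.

    expr: (cells x genes) normalized expression; labels: cell-type label per cell.
    The ranksums statistic is the z-score under the normal approximation.
    """
    a = expr[labels == ct_a]
    b = expr[labels == ct_b]
    return np.array([ranksums(a[:, g], b[:, g]).statistic
                     for g in range(expr.shape[1])])

def sign_flip_outliers(z_before, z_after, cutoff=1.96):
    """Genes significant both before and after correction whose DE direction reversed."""
    return np.where((np.abs(z_before) > cutoff)
                    & (np.abs(z_after) > cutoff)
                    & (np.sign(z_before) != np.sign(z_after)))[0]
```

A gene whose z-score flips from strongly positive to strongly negative after correction is flagged, matching the outlier definition above.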
Figure 7.

The z-scores between CTs in a query batch from Dataset 2 before versus after batch-effect correction (in all graphs, the x-axis represents the z-scores obtained from the uncorrected batch, and the y-axis represents the z-scores obtained from the corrected batch; outliers represent genes whose differential expression direction changed significantly before versus after batch-effect correction): (a–c) different batch-effect correction methods.
Table 2.
Comparison of different batch-effect correction methods in preserving original differential expression information

| Method | Dataset 1 P-value | Dataset 1 Benjamini–Hochberg | Dataset 1 Bonferroni | Dataset 2 P-value | Dataset 2 Benjamini–Hochberg | Dataset 2 Bonferroni | Dataset 3 P-value | Dataset 3 Benjamini–Hochberg | Dataset 3 Bonferroni | Dataset 4 P-value | Dataset 4 Benjamini–Hochberg | Dataset 4 Bonferroni |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Seurat | 86 | 48 | 15 | 69 | 38 | 1 | 154 | 100 | 4 | 19 | 0 | 0 |
| ResPAN | 167 | 103 | 38 | 501 | 385 | 161 | 204 | 178 | 96 | 333 | 125 | 71 |
| Global | 83 | 43 | 15 | 52 | 32 | 2 | 48 | 25 | 1 | 2 | 0 | 0 |

| Method | Dataset 5 P-value | Dataset 5 Benjamini–Hochberg | Dataset 5 Bonferroni | Dataset 6 P-value | Dataset 6 Benjamini–Hochberg | Dataset 6 Bonferroni | Dataset 8 P-value | Dataset 8 Benjamini–Hochberg | Dataset 8 Bonferroni | Dataset 9 P-value | Dataset 9 Benjamini–Hochberg | Dataset 9 Bonferroni |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Seurat | 19 | 0 | 0 | 10 | 0 | 0 | 40 | 19 | 5 | 6 | 0 | 0 |
| ResPAN | 15 | 0 | 0 | 59 | 11 | 4 | 70 | 19 | 3 | 4 | 0 | 0 |
| Global | 11 | 0 | 0 | 0 | 0 | 0 | 17 | 8 | 3 | 2 | 1 | 0 |
The numbers in the table represented the count of genes that potentially exhibited aberrant differential expression before versus after batch-effect correction, under the raw P-value threshold and after Benjamini–Hochberg and Bonferroni corrections; lower counts indicated better performance, so the best-performing method in each column was the one with the lowest count. More details can be found in Results/Differential expression consistency.
To further evaluate the preservation of original differential expression information across different datasets, we repeated this procedure on all experimental scRNA-seq datasets excluding Dataset 7, where only one CT was present in the query batch. We also applied Bonferroni and Benjamini–Hochberg corrections to control for multiple testing (the family-wise error rate and the false discovery rate, respectively) and enhance the reliability of the identified DEGs. Across all datasets, the global model consistently demonstrated the best performance in preserving original differential expression information, yielding the fewest outliers due to its order-preserving feature (Table 2).
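The two multiple-testing corrections can be sketched directly in NumPy (our illustrative re-implementation; the exact thresholds and pipeline used in the paper may differ):

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """Bonferroni correction: reject p_i if p_i <= alpha / m (controls the FWER)."""
    p = np.asarray(pvals, dtype=float)
    return p <= alpha / p.size

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the FDR).

    Sort the p-values, find the largest k with p_(k) <= alpha * k / m,
    and reject all hypotheses with p-values up to p_(k).
    """
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresh = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresh
    reject = np.zeros(m, dtype=bool)
    if below.any():
        kmax = np.max(np.where(below)[0])  # largest k satisfying the step-up rule
        reject[order[:kmax + 1]] = True
    return reject
```

Applying both to the same per-gene P-values yields the three outlier counts reported per dataset in Table 2.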
Different initial clustering
Our method defaulted to using the Louvain clustering algorithm for initial clustering. However, in practical applications, various clustering algorithms could be flexibly adopted.
In this study, we also used the Leiden clustering algorithm [31] and Gaussian Mixture Model (GMM) [32] for initial clustering. The Leiden algorithm, an improvement over Louvain [33], ensured higher partition quality and robustness. It optimized the community detection process by introducing a local moving phase, which achieved more accurate community assignments while maintaining efficiency. GMM provided a flexible way to model complex data distributions. One key advantage of GMM was “soft clustering,” where each data point was assigned a probability of belonging to multiple clusters rather than being strictly assigned to one. This approach was advantageous for handling overlapping clusters and capturing uncertainty in data assignments.
Despite using different clustering algorithms, our strategy of merging and matching clusters (as detailed in Step 2 of the Methods section) remained applicable. Based on three different clustering methods (Louvain, Leiden, and GMM), our method effectively corrected batch effects across all datasets (Fig. 8, Supplementary Fig. S9, and Supplementary Table S6). The clustering performance after correction may vary depending on different initial clustering methods. For instance, when using GMM clustering, the partial monotonic model achieved an ARI of 1 in Dataset 4 (Supplementary Table S6). When different initial clustering methods were considered, the subsequent analyses following the batch-effect correction by our method showed minor differences (Supplementary Table S6).
Figure 8.
Batch-effect correction for a failing human heart dataset (Dataset 9) composed of two batches: (a–f) t-SNE embeddings of the corrected expression matrix based on different initial clustering methods, with points colored by CT; (g–l) t-SNE embeddings of the corrected expression matrix based on different initial clustering methods, with points colored by batch.
Discussion
Most current procedural methods, which involve components such as anchoring, MNN, and deep learning, often overlook the order-preserving feature during batch-effect correction. Although the non-procedural ComBat method can preserve the original order of gene expression levels within genes, it struggles to effectively handle the abundance of zero values in scRNA-seq data, which may lead to suboptimal performance in batch-effect correction tasks. Therefore, we developed a procedural method with an order-preserving feature to correct batch effects.
Our method used initial clustering and NN information within and across batches to construct similarities between clusters. We integrated a deep learning network with the monotonicity property and a weighted maximum mean discrepancy (MMD) to perform batch-effect correction. Our method not only preserved the inherent biological signals but also maintained the original order of gene expression levels during correction, thereby better retaining batch-specific information and enhancing biological interpretability.
We tested our method on multiple experimental scRNA-seq datasets and simulated datasets. In benchmark comparisons with other batch-effect correction methods, such as Seurat, Harmony, and ResPAN, our method demonstrated superior performance. It not only helped maintain inter-gene correlation and preserved the original differential expression information within batches, but also achieved higher clustering accuracy by integrating initial clustering with batch-effect correction.
Current batch-effect correction methods and their evaluations often overlooked the inconsistency in differential expression across batches. Over-reliance on a single batch could lead to misleading results in downstream analyses. More importantly, based on the order-preserving feature, our global method better retained the original differential expression information within batches during the batch-effect correction process, thereby improving the reliability of integration analysis results.
Our method also has the following limitations. First, in scenarios involving rare or imbalanced CTs, there may be no overlap or only low overlap across batches for certain CTs (e.g. Dataset 7). Because our method leverages MNN information to identify potentially shared CTs across batches, unmatched CTs could be inaccurately aligned, thereby impairing the effectiveness of batch-effect correction; to mitigate this, we introduced a minimum threshold to filter out unreliable matches. Second, for datasets from complex tissues, the monotonicity constraint imposed on the neural network may limit its capacity to model subtle batch effects; to address this, we designed a triple structure and hidden layers with expansion nodes to enhance its modeling capability. Despite these potential limitations, the advantages of our method have been demonstrated: our analyses across multiple experimental and simulated datasets, along with comprehensive comparisons with existing approaches, showed the adaptability and robustness of our method in correcting batch effects.
To provide guidelines in practice, we recommend the following for selecting between the global and partial models based on methodological design and empirical performance. The global model is designed to enforce monotonicity across all samples, making it particularly suitable for scenarios where preserving global structural consistency and expression trends is critical. In contrast, the partial model applies monotonicity constraints only within subsets of samples that share the same initial clustering label. Furthermore, evaluation results highlight the complementary strengths of the two models. The partial model excels in batch mixing performance, demonstrating superior capability in mitigating batch effects across datasets. The global model shows better preservation of original differential gene expression. In terms of clustering performance and inter-gene correlation preservation, both models perform comparably. In practice, we recommend the partial model for applications where batch mixing performance is the primary concern, particularly in multi-batch single-cell datasets. Otherwise, when maintaining original differential gene expression is the focus, we recommend the global model.
With the continuous advancement of sequencing technology and multi-omics data, the order-preserving property in the batch-effect correction process deserves attention. Extending the order-preserving feature to other omics data and developing faster, more stable, and more effective batch-effect correction methods would be beneficial for better preserving batch-specific information.
Conclusion
In summary, we developed a comprehensive procedural method with an order-preserving feature to correct batch effects, which involved initial clustering, NNs, and a monotonic deep network. By applying our method to multiple scRNA-seq datasets, we demonstrated that it not only effectively corrected batch effects but also preserved inter-gene correlation. Furthermore, leveraging the order-preserving feature, our method retained differential expression information within batches after correction.
Methods
Our method’s workflow (Fig. 1) mainly involved the following steps: preprocessing, initializing clusters, identifying k-nearest-neighbor (KNN) pairs within batches and MNN pairs across batches, constructing a similarity matrix between clusters, merging clusters within batches, matching clusters across batches, and finally correcting batch effects based on a monotonic deep learning network. More details are provided in the Supplementary Materials.
Step 1: preprocessing
Four important tasks are completed in the preprocessing step: filtering low-quality cells and genes, cell normalization, log normalization, and detecting HVGs. All of the above steps were implemented with the Python module Scanpy.
Let $X_{gc}$ represent the raw count of gene $g$ in cell $c$. In the filtering step, low-quality cells with nGene below a threshold were removed, and genes expressed in fewer than three cells (nCells < 3) were excluded. For normalization, the counts for each cell were divided by the total counts across all genes, multiplied by a constant of 10 000, and a log transformation was applied to obtain the normalized expression value $\tilde{X}_{gc}$. Finally, we selected 2000 HVGs using the `highly_variable_genes` function. We chose 2000 HVGs based on the following consideration: this setting is recommended by the widely used single-cell analysis tool Seurat [9] and the Python toolkit Scanpy, both of which suggest selecting 2000 HVGs as input features for downstream analyses.
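The normalization arithmetic of Step 1 can be sketched in NumPy as below. This is an illustrative re-implementation of the counts-per-10 000 and log1p transform, not the authors' code; in practice it corresponds to Scanpy's `normalize_total(target_sum=1e4)` followed by `log1p`.

```python
import numpy as np

def normalize_log(counts, target_sum=10_000.0):
    """Counts-per-10 000 normalization followed by log1p.

    counts: (cells x genes) raw count matrix. Each cell's counts are
    divided by the cell's total, scaled to `target_sum`, then log1p'd,
    mirroring the Step 1 description in the text.
    """
    totals = counts.sum(axis=1, keepdims=True)
    scaled = counts / totals * target_sum
    return np.log1p(scaled)
```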
Step 2: initializing clusters
Let $\tilde{X}$ be the matrix of normalized expression from Step 1, including only the 2000 HVGs. We applied the `pca` function to the normalized data to obtain a low-dimensional embedding space (PCA with the default number of components) and applied the `neighbors` function to construct neighbor information (the default number of neighbors, based on Euclidean distance).
We offered several methods for initializing clusters. Our method defaulted to the Louvain method [33], a graph-based clustering method that has demonstrated strong performance. The Louvain method is a popular approach for community detection in large networks, optimizing modularity to identify densely connected groups. It operates in two phases: first, it assigns nodes to communities, and then it merges communities to maximize modularity. This procedure can be implemented with the `louvain` function in the Scanpy package, where a higher resolution yields more and smaller clusters. Our method used the ASW to select an appropriate resolution parameter (Algorithm 1 in the Supplementary Material). Our method can also use a default resolution of 1 to find a moderate number of CTs.
In the manuscript, we only presented the results obtained by determining the resolution through ASW. The results based on the default resolution are provided in the Supplementary Materials (Supplementary Table S7). Alternative methods include a Gaussian Mixture Model (GMM) [32] and the Leiden algorithm [31]. For GMM, we selected the number of clusters based on the Bayesian Information Criterion (BIC) [34]; Leiden, an improvement over Louvain, can be implemented with the `leiden` function in the Scanpy package.
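The BIC-based GMM initialization can be sketched with scikit-learn (an illustrative stand-in for our pipeline; the candidate range `k_range` is an assumption, and Louvain/Leiden would instead be run through Scanpy):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_cluster_bic(embedding, k_range=range(1, 6), seed=0):
    """Fit GMMs with varying component counts and keep the BIC-minimizing one.

    embedding: (cells x dims) low-dimensional matrix (e.g. the PCA from Step 2).
    Returns hard labels and the soft probability matrix P used in later steps.
    """
    best = min(
        (GaussianMixture(n_components=k, random_state=seed).fit(embedding)
         for k in k_range),
        key=lambda m: m.bic(embedding),
    )
    return best.predict(embedding), best.predict_proba(embedding)
```

Unlike Louvain/Leiden, the returned `P` is genuinely soft: each row is a probability vector over clusters rather than a one-hot assignment.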
We then calculated the probability of each cell belonging to each cluster, collected in a probability matrix $P$. For GMM, the posterior estimates of the latent variables represent the cluster probabilities; in the Louvain/Leiden clustering algorithms, cluster probabilities are binary (0 or 1).
Step 3: merging and matching clusters
Three important tasks are completed in the merging-and-matching step: finding KNN and MNN pairs, calculating a similarity matrix among clusters, and merging/matching the clusters obtained from Step 2. We aimed to find potentially identical CTs within the same batch based on KNN information and potentially identical CTs across batches based on MNN information. MNN-based methods, such as MNN [17] and BBKNN [35], have been shown to effectively reduce batch effects in scRNA-seq data.
Finding KNN and MNN pairs
Consider batches $A$ (reference) and $B$ (query) as an example. Let $Z$ be the matrix of scRNA-seq data in PCA embedding space (n_components = 100), where $Z_A$ is the submatrix of cells in the reference batch $A$.

Let $z_i^A$ be the vector of cell $i$ from batch $A$ in PCA embedding space. Denote $\mathrm{KNN}(A)$ as the set of KNN pairs within batch $A$; cells $i$ and $j$ form a KNN pair if and only if:

$$z_j^A \in \mathrm{NN}_k(z_i^A) \quad \text{and} \quad z_i^A \in \mathrm{NN}_k(z_j^A), \tag{1}$$

in which case the tuples $(i, j)$ and $(j, i)$ are both KNN pairs, and $\mathrm{NN}_k(z_i^A)$ represents the set of $k$ cells in batch $A$ that are nearest to cell $i$. We used a default of 10 neighbors and cosine distance for KNN calculations.
To correspond with the definition of KNN pairs within a batch, we let $\mathrm{MNN}(A, B)$ be the set of MNN pairs between batches $A$ and $B$; cell $i$ in batch $A$ and cell $j$ in batch $B$ form an MNN pair if and only if:

$$z_j^B \in \mathrm{NN}_k^{B}(z_i^A) \quad \text{and} \quad z_i^A \in \mathrm{NN}_k^{A}(z_j^B), \tag{2}$$

in which case the tuples $(i, j)$ and $(j, i)$ are both MNN pairs; $\mathrm{NN}_k^{B}(z_i^A)$ represents the set of $k$ cells in batch $B$ that are nearest to cell $i$ of batch $A$, and $\mathrm{NN}_k^{A}(z_j^B)$ represents the set of $k$ cells in batch $A$ that are nearest to cell $j$ of batch $B$. We used a default of 25 neighbors and cosine distance for MNN calculations.
In our study, we set the number of neighbors within each batch to 10 in order to better distinguish rare CTs [15], and the number of neighbors across batches to 25 in order to enhance the capability of matching potentially shared CTs [35].
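The mutual-neighbor search can be sketched with scikit-learn's `NearestNeighbors` using cosine distance (our illustrative implementation, not the authors' code):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def neighbor_sets(source, target, k):
    """For each row of `source`, the indices of its k nearest rows in `target`."""
    nn = NearestNeighbors(n_neighbors=k, metric="cosine").fit(target)
    return nn.kneighbors(source, return_distance=False)

def mutual_pairs(za, zb, k):
    """Mutual nearest-neighbor pairs (i in A, j in B) in the sense of Equation (2).

    Passing the same array for za and zb yields mutual KNN pairs within one
    batch (self-matches excluded, per Equation (1)).
    """
    a_to_b = neighbor_sets(za, zb, k)
    b_to_a = neighbor_sets(zb, za, k)
    pairs = [(int(i), int(j)) for i in range(len(za)) for j in a_to_b[i]
             if i != j or za is not zb]       # drop self-pairs within a batch
    return [(i, j) for (i, j) in pairs if i in b_to_a[j]]
```

For within-batch KNN pairs one would call `mutual_pairs(z_a, z_a, 10)`, and for cross-batch MNN pairs `mutual_pairs(z_a, z_b, 25)`, matching the defaults above.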
Similarity matrix
After finding all KNN and MNN pairs, we constructed similarities among all obtained clusters across batches. Let $K_A$ represent the number of clusters obtained in batch $A$ and $K_B$ the number of clusters obtained in batch $B$. The $K_A \times K_B$ similarity matrix $S$ can be defined as:
![]() |
(3) |
where
![]() |
(4) |
and
![]() |
(5) |
where $P_{ic}$, the entry in the $i$th row and $c$th column of $P$, represents the probability of cell $i$ belonging to cluster $c$ (obtained in Step 2). Larger clusters naturally lead to more KNN and MNN pairs, so we accounted for cluster size when computing similarity.
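As a loose illustration of the idea behind the similarity construction (a simplified stand-in of our own, not the authors' exact Equations (3)–(5)), one can count MNN pairs between each pair of clusters, weight them by the membership probabilities in $P$, and normalize by cluster sizes:

```python
import numpy as np

def cluster_similarity(pairs, p_a, p_b):
    """Probability-weighted, size-normalized MNN-pair counts between clusters.

    pairs: list of MNN pairs (i, j), with i a cell of batch A and j of batch B.
    p_a:   (cells_A x K_A) membership probabilities for batch A clusters.
    p_b:   (cells_B x K_B) membership probabilities for batch B clusters.
    Simplified stand-in for Equations (3)-(5); the exact weighting may differ.
    """
    counts = np.zeros((p_a.shape[1], p_b.shape[1]))
    for i, j in pairs:
        # the outer product spreads one pair over all (cluster_A, cluster_B) combos
        counts += np.outer(p_a[i], p_b[j])
    # expected cluster sizes; dividing by their geometric mean removes the
    # advantage that large clusters have in raw pair counts
    sizes = np.outer(p_a.sum(axis=0), p_b.sum(axis=0))
    return counts / np.sqrt(sizes)
```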
Merging and matching rule
To find potentially identical CTs within the same batch based on KNN information and across batches based on MNN information, we designed a merging/matching rule for the clusters obtained in Step 2.
Starting from an initial cluster in the query batch, we identified all corresponding clusters in the similarity matrix that exceeded the similarity threshold. This process continued iteratively until no new clusters were found. The discovered clusters were divided into paired sets based on their batch origin (query/reference); both sets may consist of multiple initial clusters. We then selected a new (previously undiscovered) cluster within the query batch and repeated the previous process until all initial clusters had been traversed. It is worth noting that not all clusters have corresponding paired clusters. We then gradually increased the similarity threshold and repeated the above procedure until the number of clusters that could be matched at the new threshold was less than that at the previous threshold. This step helped us differentiate as many distinct CTs as possible. The detailed rules for merging/matching clusters and adjusting the threshold (Algorithms 2 and 3) are provided in the Supplementary Material.
Step 4: partial/global monotonic deep networks
Monotonic neural networks offer significant benefits in terms of consistency and interpretability, particularly in medical applications. However, the architecture and activation functions of these networks must be carefully designed to maintain monotonicity, which can constrain the network’s ability to capture complex, non-linear relationships and potentially reduce overall accuracy in certain contexts. This area is actively evolving, with ongoing research aimed at enhancing these networks to be more adaptable across complex scenarios [36, 37]. Here, we introduced a three-layer feedforward neural network with weight constraints [36]. Let $w_{ij}$ denote the weight connecting input $i$ to hidden unit $j$ (with $H$ hidden units in total) and $v_j$ the weight connecting hidden unit $j$ to the output. Given an input $x = (x_1, \ldots, x_d)$ and an activation $\sigma$, the output function $f$ for a network with one hidden layer is given by:

$$f(x) = \sum_{j=1}^{H} v_j \, \sigma\!\left(\sum_{i=1}^{d} w_{ij} x_i + b_j\right) + b_0.$$
To maintain monotonicity, the network must satisfy:

$$x \preceq x' \implies f(x) \le f(x'),$$

where $\preceq$ indicates a partial ordering on $\mathbb{R}^d$, defined by $x \preceq x'$ if and only if $x_i \le x'_i$ for $i = 1, \ldots, d$. In this condition, the pair $(x, x')$ is called comparable.

To achieve monotonicity, the partial derivatives need to satisfy:

$$\frac{\partial f(x)}{\partial x_i} \ge 0, \quad i = 1, \ldots, d. \tag{6}$$

Given that $\sigma' > 0$, Equation (6) holds if and only if:

$$\sum_{j=1}^{H} v_j w_{ij} \, \sigma'\!\left(\sum_{l=1}^{d} w_{lj} x_l + b_j\right) \ge 0 \quad \text{for all } x.$$

This is equivalent to the constraint [38]:

$$v_j w_{ij} \ge 0 \quad \text{for all } i, j.$$
However, the comparability of network outputs under these constraints is guaranteed only when the input vectors are comparable. For incomparable input vectors, we cannot directly infer the order relationship of their corresponding outputs.

In batch-effect correction tasks, cells often have high-dimensional gene expression profiles that are incomparable. To ensure the order-preserving feature, the following global monotonicity (increasing) was defined:

$$\tilde{X}_{gc_1} \le \tilde{X}_{gc_2} \implies \hat{X}_{gc_1} \le \hat{X}_{gc_2},$$

where $c_1$ and $c_2$ represent different cells, $g$ represents a gene, and $\hat{X}$ is the batch-effect corrected expression matrix of $\tilde{X}$.
To address this issue, we proposed the deep learning network illustrated in Fig. 1, where each gene is an input unit. We demonstrated that independence across units in feedforward networks is necessary for ensuring global monotonicity (i.e. each hidden-layer node can correspond to only one input node and one output node; the corresponding proof details are available in the Supplementary Materials).
We provided two options (global/partial) for the network. In the global option, the final batch-effect corrected expression matrix satisfies the global monotonic property. The input of the network is a normalized gene expression matrix, and the output is a batch-corrected matrix. The network consists of three sub-networks connected in series, incorporating a residual structure, with each sub-network employing the weight constraints above to ensure monotonicity. Solid lines represent weights that are always active, while dashed lines indicate connections that are activated only when the corresponding input is zero. The dashed connections are asymmetrical: they act on zero-expression genes and use the levels of other genes for imputation.
In the partial option, the probability estimation matrix $P$ is introduced as an additional input to the network. This part of the input is fully connected to the middle layers, while the remaining structure is the same as in the global option.
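The independence-across-units construction can be illustrated numerically. The sketch below is our NumPy illustration, not the authors' PyTorch architecture: a one-dimensional subnetwork with non-negative weights and an increasing activation, which satisfies the sign constraint above and is therefore monotonically non-decreasing per gene.

```python
import numpy as np

def monotone_transform(x, w, b, v, b0):
    """One-dimensional monotonic subnetwork f(x) = sum_j v_j*sigmoid(w_j*x + b_j) + b0.

    With w_j >= 0 and v_j >= 0 (enforced here via absolute values, so that
    v_j * w_j >= 0) and an increasing activation, f is non-decreasing in x;
    applying such an f per gene preserves the order of expression levels.
    """
    w, v = np.abs(w), np.abs(v)                            # weight constraints
    hidden = 1.0 / (1.0 + np.exp(-(np.outer(x, w) + b)))   # sigmoid activation
    return hidden @ v + b0
```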
Weighted maximum mean discrepancy
We employed a weighted maximum mean discrepancy (MMD) as the loss function to align the distributions of identical CTs across different batches. MMD [39, 40] measures the distance between two probability distributions $p$ and $q$, defined for a function class $\mathcal{F}$ by:

$$\mathrm{MMD}[\mathcal{F}, p, q] = \sup_{f \in \mathcal{F}} \left( \mathbb{E}_{x \sim p}[f(x)] - \mathbb{E}_{y \sim q}[f(y)] \right).$$

If $\mathcal{F}$ is the unit ball of a reproducing kernel Hilbert space $\mathcal{H}$ with kernel $k$, the MMD can be written as the distance between the mean embeddings of $p$ and $q$:

$$\mathrm{MMD}(p, q) = \left\| \mu_p - \mu_q \right\|_{\mathcal{H}}, \tag{7}$$

where $\mu_p = \mathbb{E}_{x \sim p}[k(x, \cdot)]$ and $\mu_q = \mathbb{E}_{y \sim q}[k(y, \cdot)]$. Equation (7) can be written as

$$\mathrm{MMD}^2(p, q) = \mathbb{E}[k(x, x')] + \mathbb{E}[k(y, y')] - 2\,\mathbb{E}[k(x, y)], \tag{8}$$

where $x, x' \sim p$ are independent, as are $y, y' \sim q$. For a universal kernel $k$, $\mathrm{MMD}(p, q) = 0$ if and only if $p = q$.
In practice, the distributions $p$ and $q$ are unknown, so we approximate the MMD using observed values. In our work, $p$ represents the reference batch $A$, and $q$ represents the (corrected) query batch $B$. We designed the weighted MMD based on the information obtained in the above steps to account for potential class imbalances:
![]() |
where $k$ is a Gaussian kernel and:
![]() |
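The unweighted empirical estimator of Equation (8) with a Gaussian kernel can be sketched as follows (an illustrative version of our own that omits the cluster-derived weights; the bandwidth `sigma` is an assumed parameter):

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2)) for all row pairs of x and y."""
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    """Biased empirical estimate of MMD^2 per Equation (8):
    E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]."""
    return (gaussian_kernel(x, x, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean()
            - 2.0 * gaussian_kernel(x, y, sigma).mean())
```

In the weighted variant used as the loss, each kernel term is additionally multiplied by per-cell weights derived from the cluster-matching information, down-weighting over-represented CTs.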
Evaluation Metrics
To evaluate the effectiveness of various methods in correcting batch effects, we employed three evaluation metrics: ARI, ASW, and Local Inverse Simpson’s Index (LISI). These metrics quantify clustering quality, considering both batch mixing and CT purity.
ARI measures the agreement between the clustering result and a reference classification, adjusting for chance. It quantifies how well cells of the same type are grouped together after batch correction. ARI ranges from about −1 to 1, where 1 indicates perfect agreement and values near 0 indicate chance-level agreement. The ARI formula is:

$$\mathrm{ARI} = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\right] \Big/ \binom{n}{2}}{\frac{1}{2}\left[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\right] - \left[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\right] \Big/ \binom{n}{2}},$$

where $n_{ij}$ is the number of cells in both cluster $i$ and reference group $j$, $a_i$ is the number of cells in cluster $i$, $b_j$ is the number of cells in reference group $j$, and $n$ is the total number of cells.
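For instance, scikit-learn's `adjusted_rand_score` implements this formula; note that ARI is invariant to a permutation of cluster labels:

```python
from sklearn.metrics import adjusted_rand_score

# Identical partitions (up to label names) give ARI = 1.0.
reference = [0, 0, 1, 1, 2, 2]
clusters = [1, 1, 0, 0, 2, 2]   # same grouping, permuted labels
ari = adjusted_rand_score(reference, clusters)  # → 1.0
```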
ASW evaluates the separation and compactness of clusters by measuring how similar each cell is to its assigned cluster compared with other clusters. For CT purity, ASW assesses how well cells of the same type cluster together, while for batch mixing, it evaluates the extent of mixing between batches within clusters. The silhouette width for cell $i$ is defined as:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}},$$

where $a(i)$ is the average distance of cell $i$ to all other cells in its cluster, and $b(i)$ is the lowest average distance of cell $i$ to the cells in any other cluster; the ASW is the average of $s(i)$ over all cells. For CT purity, $a(i)$ and $b(i)$ are calculated based on the same and different CTs, respectively; for batch mixing, they are calculated based on cells from the same and different batches.
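A quick check with scikit-learn's `silhouette_samples`, which implements $s(i)$ above (a toy example of our own):

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Two perfectly separated clusters: every s(i) should be (close to) 1.
X = np.vstack([np.zeros((5, 2)), np.full((5, 2), 10.0)])
labels = np.array([0] * 5 + [1] * 5)
s = silhouette_samples(X, labels)   # per-cell silhouette widths s(i)
asw = silhouette_score(X, labels)   # ASW = mean of s(i)
```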
The Local Inverse Simpson’s Index (LISI) measures the degree of batch mixing and CT homogeneity at the local level, providing insight into how well cells are integrated across batches while maintaining CT distinctions. Two variations are used: batch LISI for batch mixing and CT LISI for CT purity. LISI is defined as:

$$\mathrm{LISI}(i) = \frac{1}{\sum_{g=1}^{G} p_g^2},$$

where $G$ is the number of cell groups (either batches or CTs), and $p_g$ is the proportion of the neighborhood of cell $i$ that belongs to group $g$. Batch LISI assesses the extent of batch mixing within local neighborhoods, with higher values indicating better mixing. CT LISI measures how pure local neighborhoods are with respect to CTs, with values close to 1 indicating better CT purity.
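The definition can be sketched directly from neighborhood group labels (an illustrative computation of our own; the original LISI implementation uses perplexity-based Gaussian neighborhood weights rather than hard counts):

```python
import numpy as np

def lisi(neighbor_groups, n_groups):
    """Inverse Simpson's index of the group proportions in one cell's neighborhood.

    neighbor_groups: group label (batch or CT index) of each neighbor of the cell.
    Returns a value in [1, n_groups]: 1 = a single group, n_groups = an even mix.
    Simplified from the perplexity-weighted LISI.
    """
    counts = np.bincount(neighbor_groups, minlength=n_groups)
    p = counts / counts.sum()
    return 1.0 / np.sum(p ** 2)
```

For two batches, a neighborhood evenly split between them gives a batch LISI of 2 (well mixed), while a single-batch neighborhood gives 1.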
Clustering evaluation
To quantify the clustering performance of different methods, we designed the following evaluation procedure. First, a dimension-reduction step was performed using UMAP for every dataset, before and after batch-effect correction by each method.
For the ARI, we performed Louvain clustering on the UMAP embeddings of both the original and corrected data. To select an appropriate resolution parameter for Louvain, we used the UMAP embedding of the original data and evaluated a set of candidate resolutions. The selection criterion was to maximize the grouping of cells of the same type from the same batch into the same cluster while ensuring that cells of the same type from different batches were assigned to different clusters, thereby reflecting the batch effect.
Key Points
We developed a batch-effect correction method with an order-preserving feature.
Our method excels in multiple biological tasks, demonstrating superior performance in clustering and batch effect correction compared to existing methods.
Based on the order-preserving feature, our method can better retain original inter-gene correlation and differential expression information.
Supplementary Material
Acknowledgements
During the preparation of this work the authors used ChatGPT in order to improve language and readability. After using this tool/service, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.
Contributor Information
Mingxuan Zhang, School of Mathematical Sciences, University of Science and Technology of China, Hefei, 230026 Anhui, China.
Yinglei Lai, School of Mathematical Sciences, University of Science and Technology of China, Hefei, 230026 Anhui, China; Department of Statistics, The George Washington University, Washington, DC 20052, United States.
Author contributions
M.Z. and Y.L. designed the research; M.Z. and Y.L. developed the methods; M.Z. and Y.L. contributed to the acquisition, analysis, and interpretation of the data; M.Z. drafted the manuscript; M.Z. and Y.L. revised the manuscript. All authors read and approved the final manuscript.
Conflict of interest
The authors declare that the sponsors have no competing financial interests.
Funding
This work was partially supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (XDA0460300/XDA0460303) and the National Natural Science Foundation of China (T2350710230). YL was also partially supported by a start-up fund from the University of Science and Technology of China.
Data availability
We analyzed nine published scRNA-seq datasets and one simulated dataset, which are available through the accession numbers reported in the original articles.
(1) Dataset 1 consists of the mammary epithelial cell dataset from three independent studies [22–24].
(2) Dataset 2 consists of the human lung cell dataset [25] collected from three lung adenocarcinoma cell lines HCC827, H1975, and H2228 on three different platforms with CELseq2, 10x Chromium, and Drop-seq protocols, respectively, which can be downloaded from https://github.com/LuyiTian/sc_mixology.
(3) Dataset 3 is a subset of Dataset 2, with the data for cell type H1975 removed in two batches. In the third batch, data for this cell type is retained.
(4) Dataset 4 consists of the mouse embryonic stem cell dataset [26]. The transcriptomes of 704 mouse embryonic stem cells were sequenced across three culture conditions (lif, 2i, and a2i) using the Fluidigm C1 microfluidics cell-capture platform followed by Illumina sequencing, which can be downloaded from http://www.ebi.ac.uk/teichmann-srv/espresso.
(5) Dataset 5 consists of the mouse mammary gland datasets [41] processed on the Microwell-seq platform from the Mouse Cell Atlas project, which can be downloaded from https://figshare.com/articles/dataset/MCA_DGE_Data/5435866.
(6) Dataset 6 from GEO accession GSE80171 consists of human blood dendritic cells’ (DCs) scRNA-seq data [42], generated by the same technology and coming from the same tissue. It is composed of two batches containing four different cell types, which can be downloaded from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE94820.
(7) Dataset 7 is composed of three batches, where batch 1 contains only 293T cells, batch 2 contains only Jurkat cells, and batch 3 consists of a 50/50 mixture of Jurkat and 293T cells [43, 44], which can be downloaded from http://scanorama.csail.mit.edu/data.tar.gz.
(8) Dataset 8 was constructed using human pancreatic data from two sources [45, 46]. The resulting dataset consists of celseq2 batch (accession GSE85241) and smartseq2 batch (accession E-MTAB-5061) with 15 different cell types, which can be downloaded from https://hemberg-lab.github.io/scRNA.seq.datasets/human/pancreas/.
(9) Dataset 9 is a failing human heart dataset [47]. Based on different technologies, we selected a total of 39 682 cells from female subjects under the condition of DCM, which include 14 different cell types.
(10) Dataset 10 is a simulated count data using the Splatter package [48]. The set contains two batches with unbalanced numbers of cells.
The details of the datasets used can be found in Supplementary Materials Table S3. The normalized datasets in this paper are available via https://github.com/MingxuanZhangUSTC/Order-preserving-correction.git.
Code availability
Our method is implemented in Python based on the PyTorch framework and is available via https://github.com/MingxuanZhangUSTC/Order-preserving-correction.git.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
References
- 1. Nguyen Q, Pervolarakis N, Blake K et al. Profiling human breast epithelial cells using single cell RNA sequencing identifies cell diversity. Nat Commun 2018;9:2028. 10.1038/s41467-018-04334-1
- 2. Matsumoto H, Kiryu H, Furusawa C et al. SCODE: an efficient regulatory network inference algorithm from single-cell RNA-seq during differentiation. Bioinformatics 2017;33:2314–21. 10.1093/bioinformatics/btx194
- 3. Hicks SC, Townes FW, Teng M et al. Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics 2018;19:562–78. 10.1093/biostatistics/kxx053
- 4. Lähnemann D, Köster J, Szczurek E et al. Eleven grand challenges in single-cell data science. Genome Biol 2020;21:1–35. 10.1186/s13059-020-1926-6
- 5. Leek JT, Scharpf RB, Bravo HC et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet 2010;11:733–9. 10.1038/nrg2825
- 6. Zhang X, Ye Z, Chen J et al. AMDBNorm: an approach based on distribution adjustment to eliminate batch effects of gene expression data. Brief Bioinform 2022;23:528. 10.1093/bib/bbab528
- 7. Smyth GK, Speed T. Normalization of cDNA microarray data. Methods 2003;31:265–73. 10.1016/S1046-2023(03)00155-5
- 8. Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 2007;8:118–27. 10.1093/biostatistics/kxj037
- 9. Stuart T, Butler A, Hoffman P et al. Comprehensive integration of single-cell data. Cell 2019;177:1888–1902.e21. 10.1016/j.cell.2019.05.031
- 10. Korsunsky I, Millard N, Fan J et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat Methods 2019;16:1289–96. 10.1038/s41592-019-0619-0
- 11. Shaham U, Stanton KP, Zhao J et al. Removal of batch effects using distribution-matching residual networks. Bioinformatics 2017;33:2539–46. 10.1093/bioinformatics/btx196
- 12. Welch JD, Kozareva V, Ferreira A et al. Single-cell multi-omic integration compares and contrasts features of brain cell identity. Cell 2019;177:1873–1887.e17. 10.1016/j.cell.2019.05.006
- 13. Lopez R, Regier J, Cole MB et al. Deep generative modeling for single-cell transcriptomics. Nat Methods 2018;15:1053–8. 10.1038/s41592-018-0229-2
- 14. Kingma DP, Welling M. Auto-encoding variational Bayes. In: Bengio Y, LeCun Y (eds), Proceedings of the 2nd International Conference on Learning Representations (ICLR 2014). arXiv:1312.6114.
- 15. Yu X, Xu X, Zhang J et al. Batch alignment of single-cell transcriptomics data using deep metric learning. Nat Commun 2023;14:960. 10.1038/s41467-023-36635-5
- 16. Li X, Wang K, Lyu Y et al. Deep learning enables accurate clustering with batch effect removal in single-cell RNA-seq analysis. Nat Commun 2020;11:2338. 10.1038/s41467-020-15851-3
- 17. Haghverdi L, Lun AT, Morgan MD et al. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat Biotechnol 2018;36:421–7. 10.1038/nbt.4091
- 18. Wang Y, Liu T, Zhao H. ResPAN: a powerful batch correction model for scRNA-seq data through residual adversarial networks. Bioinformatics 2022;38:3942–9. 10.1093/bioinformatics/btac427
- 19. Hubert L, Arabie P. Comparing partitions. J Classif 1985;2:193–218. 10.1007/BF01908075
- 20. Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 1987;20:53–65. 10.1016/0377-0427(87)90125-7
- 21. Zhang B, Horvath S. A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol 2005;4:i–43. 10.2202/1544-6115.1128
- 22. Bach K, Pensa S, Grzelak M et al. Differentiation dynamics of mammary epithelial cells revealed by single-cell RNA sequencing. Nat Commun 2017;8:1–11. 10.1038/s41467-017-02001-5
- 23. Pal B, Chen Y, Vaillant F et al. Construction of developmental lineage relationships in the mouse mammary gland by single-cell RNA profiling. Nat Commun 2017;8:1627. 10.1038/s41467-017-01560-x
- 24. Giraddi RR, Chung C-Y, Heinz RE et al. Single-cell transcriptomes distinguish stem cell state changes and lineage specification programs in early mammary gland development. Cell Rep 2018;24:1653–1666.e7. 10.1016/j.celrep.2018.07.025
- 25. Tian L, Dong X, Freytag S. et al. Benchmarking single cell rna-sequencing analysis pipelines using mixture control experiments. Nat Methods 2019;16:479–87. 10.1038/s41592-019-0425-8 [DOI] [PubMed] [Google Scholar]
- 26. Kolodziejczyk AA, Kim JK, Tsang JC. et al. Single cell rna-sequencing of pluripotent states unlocks modular transcriptional variation. Cell Stem Cell 2015;17:471–85. 10.1016/j.stem.2015.09.011 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27. Hawkins RD, Hon GC, Ren B. Next-generation genomics: an integrative approach. Nat Rev Genet 2010;11:476–86. 10.1038/nrg2795 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28. Lee TI, Young RA. Transcriptional regulation and its misregulation in disease. Cell 2013;152:1237–51. 10.1016/j.cell.2013.02.014 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29. Lai Y, Eckenrode SE, She J-X. A statistical framework for integrating two microarray data sets in differential expression analysis. BMC Bioinf 2009;10:1–11. 10.1186/1471-2105-10-S1-S23 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30. Lai Y, Zhang F, Nayak TK. et al. An efficient concordant integrative analysis of multiple large-scale two-sample expression data sets. Bioinformatics 2017;33:3852–60. 10.1093/bioinformatics/btx061 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31. Traag VA, Waltman L, Van Eck NJ. From Louvain to Leiden: Guaranteeing well-connected communities. Sci Rep 2019;9:1–12. 10.1038/s41598-019-41695-z [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32. Reynolds DA. Gaussian mixture models. In: Li SZ, Jain A (eds.), Encyclopedia of Biometrics. New York: Springer, 2009, 659–63. 10.1007/978-0-387-73003-5_196 [DOI] [Google Scholar]
- 33. Blondel VD, Guillaume J-L, Lambiotte R. et al. Fast unfolding of communities in large networks. J Stat Mech: Theory Exp 2008;2008:10008. [Google Scholar]
- 34. Schwarz G. Estimating the dimension of a model. The annals of statistics 1978;6:461–4. 10.1214/aos/1176344136 [DOI] [Google Scholar]
- 35. Polański K, Young MD, Miao Z. et al. BBKNN: fast batch alignment of single cell transcriptomes. Bioinformatics 2020;36:964–5. 10.1093/bioinformatics/btz625 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36. Daniels H, Velikova M. Monotone and partially monotone neural networks. IEEE Trans Neural Netw 2010;21:906–17. 10.1109/TNN.2010.2044803 [DOI] [PubMed] [Google Scholar]
- 37. You S, Ding D, Canini K. et al. Deep lattice networks and partial monotonic functions. In: Guyon I, von Luxburg U, Bengio S et al. (eds.), Advances in Neural Information Processing Systems 30 (NIPS 2017). NeurIPS Foundation, 2017, 2981–89. [Google Scholar]
- 38. Kay H, Ungar LH. Estimating monotonic functions and their bounds. AIChE Journal 2000;46:2426–34. 10.1002/aic.690461211 [DOI] [Google Scholar]
- 39. Gretton A, Borgwardt K, Rasch M. et al. A kernel method for the two-sample problem. In: Schölkopf B, Platt JC, Hofmann T (eds.), Advances in Neural Information Processing Systems 19. Cambridge, MA: MIT Press, 2006, 673–80. 10.7551/mitpress/7503.003.0069 [DOI] [Google Scholar]
- 40. Gretton A, Borgwardt KM, Rasch MJ. et al. A kernel two-sample test. JMachLearnRes 2012;13:723–73. [Google Scholar]
- 41. Han X, Wang R, Zhou Y. et al. Mapping the mouse cell atlas by microwell-seq. Cell 2018;172:1091–1107.e17. 10.1016/j.cell.2018.02.001 [DOI] [PubMed] [Google Scholar]
- 42. Villani A-C, Satija R, Reynolds G. et al. Single-cell RNA-seq reveals new types of human blood dendritic cells, monocytes, and progenitors. Science 2017;356:4573. 10.1126/science.aah4573 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43. Zheng GX, Terry JM, Belgrader P. et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun 2017;8:14049. 10.1038/ncomms14049 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44. Hie B, Bryson B, Berger B. Efficient integration of heterogeneous single-cell transcriptomes using scanorama. Nat Biotechnol 2019;37:685–91. 10.1038/s41587-019-0113-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45. Muraro M, Dharmadhikari G, Grün D. et al. A single-cell transcriptome atlas of the human pancreas. Cell Syst 2016;3:385–394.e3 e3. 10.1016/j.cels.2016.09.002 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46. Segerstolpe Å, Palasantza A, Eliasson P. et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metab 2016;24:593–607. 10.1016/j.cmet.2016.08.020 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47. Koenig AL, Shchukina I, Amrute J. et al. Single-cell transcriptomics reveals cell-type-specific diversification in human heart failure. Nat Cardiovasc Res 2022;1:263–80. 10.1038/s44161-022-00028-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48. Zappia L, Phipson B, Oshlack A. Splatter: simulation of single-cell rna sequencing data. Genome Biol 2017;18:174. 10.1186/s13059-017-1305-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
Supplementary Materials
Data Availability Statement
We analyzed nine published scRNA-seq datasets and one simulated dataset, which are available through the accession numbers reported in the original articles.
(1) Dataset 1 consists of the mammary epithelial cell dataset from three independent studies [22–24].
(2) Dataset 2 consists of the human lung cell dataset [25] collected from three lung adenocarcinoma cell lines (HCC827, H1975, and H2228) profiled on three different platforms with the CEL-seq2, 10x Chromium, and Drop-seq protocols, respectively, and can be downloaded from https://github.com/LuyiTian/sc_mixology.
(3) Dataset 3 is a subset of Dataset 2 in which cells of type H1975 were removed from two of the batches and retained in the third.
(4) Dataset 4 consists of the mouse embryonic stem cell dataset [26]. The transcriptomes of 704 mouse embryonic stem cells were sequenced across three culture conditions (LIF, 2i, and a2i) using the Fluidigm C1 microfluidics cell capture platform followed by Illumina sequencing; the data can be downloaded from http://www.ebi.ac.uk/teichmann-srv/espresso.
(5) Dataset 5 consists of the mouse mammary gland datasets [41] processed on the Microwell-seq platform from the Mouse Cell Atlas project, which can be downloaded from https://fgshare.com/articles/dataset/MCA_DGE_Data/5435866.
(6) Dataset 6 consists of scRNA-seq data of human blood dendritic cells (DCs) [42], generated with the same technology and derived from the same tissue. It comprises two batches containing four different cell types and can be downloaded from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE94820.
(7) Dataset 7 is composed of three batches, where batch 1 contains only 293T cells, batch 2 contains only Jurkat cells, and batch 3 consists of a 50/50 mixture of Jurkat and 293T cells [43, 44], which can be downloaded from http://scanorama.csail.mit.edu/data.tar.gz.
(8) Dataset 8 was constructed using human pancreatic data from two sources [45, 46]. The resulting dataset consists of a CEL-seq2 batch (accession GSE85241) and a Smart-seq2 batch (accession E-MTAB-5061) with 15 different cell types, and can be downloaded from https://hemberg-lab.github.io/scRNA.seq.datasets/human/pancreas/.
(9) Dataset 9 is a failing human heart dataset [47]. Across the different sequencing technologies, we selected a total of 39 682 cells from female subjects with dilated cardiomyopathy (DCM), covering 14 different cell types.
(10) Dataset 10 consists of simulated count data generated with the Splatter package [48]. It contains two batches with unbalanced numbers of cells.
The details of the datasets used can be found in Supplementary Materials Table S3. The normalized datasets used in this paper are available via https://github.com/MingxuanZhangUSTC/Order-preserving-correction.git.
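Splatter itself is an R package, so the simulation underlying Dataset 10 cannot be reproduced verbatim here; purely as an illustrative sketch (all parameter values hypothetical, not those used in this study), a two-batch gamma-Poisson count simulation with unbalanced cell numbers of the kind Splatter performs might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_batch(n_cells, n_genes, batch_shift, mean_expr, disp=0.5):
    """Draw gamma-Poisson (negative binomial) counts with a
    multiplicative per-gene batch factor, loosely mimicking the
    batch model of count simulators such as Splatter."""
    # Per-gene batch factor: log-normal, centered at exp(batch_shift)
    batch_factor = np.exp(rng.normal(loc=batch_shift, scale=0.1, size=n_genes))
    mu = mean_expr * batch_factor  # batch-shifted per-gene means
    # Gamma-Poisson mixture: lam ~ Gamma(1/disp, mu*disp) has mean mu
    lam = rng.gamma(shape=1.0 / disp, scale=mu * disp, size=(n_cells, n_genes))
    return rng.poisson(lam)

n_genes = 2000
base_means = rng.lognormal(mean=0.5, sigma=1.0, size=n_genes)

# Two batches with unbalanced numbers of cells (500 vs. 150)
batch1 = simulate_batch(500, n_genes, batch_shift=0.0, mean_expr=base_means)
batch2 = simulate_batch(150, n_genes, batch_shift=0.3, mean_expr=base_means)
counts = np.vstack([batch1, batch2])          # cells x genes count matrix
batch_labels = np.repeat([0, 1], [500, 150])  # batch assignment per cell
```

A correction method can then be evaluated on `counts` with `batch_labels` as the known batch covariate.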