Briefings in Bioinformatics. 2025 Jun 30;26(3):bbaf247. doi: 10.1093/bib/bbaf247

An order-preserving batch-effect correction method based on a monotonic deep learning framework

Mingxuan Zhang 1, Yinglei Lai 2,3
PMCID: PMC12207412  PMID: 40586320

Abstract

Single-cell RNA sequencing has significantly advanced our understanding of cell heterogeneity and gene regulation. Batch-effect correction is essential for achieving robust data integration. Multiple methods have been developed to address this issue, particularly procedural approaches involving components such as anchoring or deep learning, which have achieved notable successes. However, order preservation, an important feature, has been largely overlooked in procedural methods. Based on a monotonic deep learning network, we developed a correction method with an order-preserving feature. By comparing it with existing methods, we demonstrated that our method improved clustering performance and better retained the original inter-gene correlation and differential expression information.

Keywords: scRNA-sequencing, batch effect, order-preserving, monotonic deep learning network, inter-gene correlation, differential expression consistency

Introduction

The rapid advancement of single-cell RNA sequencing (scRNA-seq) technologies has significantly enhanced our understanding of cellular diversity and gene regulation in complex biological systems [1, 2]. By enabling the profiling of thousands of individual cells, scRNA-seq has revolutionized the study of cellular heterogeneity. However, integrating scRNA-seq datasets from different sources is often hindered by batch effects—systematic discrepancies arising from variations in experimental conditions, such as sample preparation, sequencing protocols, and platform differences [3, 4]. These batch effects can obscure true biological signals and distort downstream analyses, making their correction essential for robust cross-study comparisons [5].

In the context of batch-effect correction, the order-preserving feature of gene expression levels refers to the property of maintaining the relative rankings or relationships of gene expression levels (considering sequencing depth) within each batch after correcting batch effects. This feature ensures that the intrinsic order of gene expression levels is not disrupted during the correction process [6]. Maintaining the original order of gene expression levels helps to retain biologically meaningful patterns, such as relative expression levels between genes or cells, which are crucial for downstream analyses like differential expression or pathway enrichment studies [5]. Additionally, the order-preserving feature enhances the robustness of batch-effect correction methods, ensuring reliable data integration from diverse sources.

Several methods have been developed to correct batch effects, which can be broadly categorized into non-procedural methods and procedural methods, each employing distinct strategies. Non-procedural methods rely on direct statistical modeling to adjust batch effects without iterative feature alignment or sample matching. Examples include ComBat [7] and Limma [8], which were originally developed for bulk RNA-seq and later adapted for scRNA-seq. These methods adjust additive or multiplicative batch biases effectively, but their performance may be hindered in scRNA-seq due to its inherent sparsity and “dropout” effects, resulting from stochastic gene expression and RNA capture limitations.

To address this issue, procedural methods have been developed, involving multi-step computational workflows that align features or samples across batches. For example, Seurat v3 [9] uses canonical correlation analysis to identify shared subspaces and mutual nearest neighbors (MNNs) to anchor cells between batches. Similarly, Harmony [10] iteratively adjusts embeddings to align batches while preserving biological variation, and MMD-ResNet [11] uses deep learning to minimize distribution discrepancies. Moreover, Liger [12] and scVI [13] address this issue through factor decomposition and variational autoencoders [14], respectively, allowing them to correct batch effects while retaining complex biological signals.

Despite these advancements, several limitations remain. Firstly, deep learning-based methods, while powerful for modeling complex data structures, often suffer from interpretability issues, complicating biological analysis. Secondly, many approaches separate batch-effect correction from cell clustering, which can lead to the loss of rare cell type (CT) information [15]. Integrating batch-effect correction and clustering would better preserve biological signals [16]. More importantly, most current procedural methods neglect the order-preserving feature, which may result in the loss of valuable intra-batch information and misinterpretation of differential expression patterns. Although methods based on direct statistical modeling, such as ComBat, possess the order-preserving feature, the large number of zero values in scRNA-seq data often makes them ineffective in correcting batch effects in certain scenarios.

Therefore, we developed an order-preserving procedural method to correct batch effects. Our method performed initial clustering and utilized nearest neighbor (NN) information within and between batches to construct similarities between clusters. These similarities were then used to design a loss function [weighted maximum mean discrepancy (MMD)] for batch-effect correction. We also employed a monotonic deep learning network to ensure the intra-gene order-preserving feature. Compared to MMD-ResNet [11], we addressed potential class imbalances between different batches through a weighted design and obtained a complete gene expression matrix. By comparing it with existing methods, we demonstrated that our method not only improved clustering accuracy but also preserved inter-gene correlation. Furthermore, the order-preserving feature allowed us to retain differential expression information within each batch after correction, providing a more biologically interpretable framework for the integration of scRNA-seq data.

Results

Overview and evaluation

Our method was designed to align multiple batches of scRNA-seq data while preserving the intra-gene order of expression levels and inter-gene correlation during the correction process. The overall workflow is illustrated in Fig. 1. After preprocessing the scRNA-seq data, we initialized clustering using an optional clustering algorithm and estimated the probability of each cell belonging to each cluster. Then, we utilized intra-batch and inter-batch NN information to evaluate the similarity among the obtained clusters, thereby completing intra-batch merging and inter-batch matching of similar clusters. To achieve batch-effect correction, we calculated the distribution distance between the reference batch and the query batch using the weighted maximum mean discrepancy (MMD). Finally, we minimized this loss through a global or partial monotonic deep learning network to obtain a corrected gene expression matrix. Our approach was thus divided into two modes: a global model and a partial model, with the partial model incorporating the matrix of cluster-membership probabilities as an additional input to the network.
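One common way to obtain a monotonic network, sketched below for a single gene, is to force every effective weight to be positive and to use non-decreasing activations. This is an illustrative assumption rather than the architecture described in Methods; the class `MonotonicScalarNet`, its layer size, and its random initialisation are hypothetical.

```python
import math
import random


def softplus(x):
    """Map any real number to a strictly positive one (numerically stable)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


class MonotonicScalarNet:
    """Tiny 1-input, 1-output network that is non-decreasing in its input.

    Assumption: monotonicity comes from passing the raw parameters through
    softplus (so every effective weight is positive) and from using tanh,
    a non-decreasing activation. This is one standard construction and not
    necessarily the one used by the authors.
    """

    def __init__(self, hidden=8, seed=0):
        rng = random.Random(seed)
        self.w1 = [rng.gauss(0, 1) for _ in range(hidden)]
        self.b1 = [rng.gauss(0, 1) for _ in range(hidden)]
        self.w2 = [rng.gauss(0, 1) for _ in range(hidden)]
        self.b2 = rng.gauss(0, 1)

    def __call__(self, x):
        hidden = [math.tanh(softplus(w) * x + b)
                  for w, b in zip(self.w1, self.b1)]
        return sum(softplus(v) * h for v, h in zip(self.w2, hidden)) + self.b2


net = MonotonicScalarNet()
expression = [0.0, 0.3, 1.2, 2.5, 4.0]        # one gene, five cells, sorted
corrected = [net(x) for x in expression]
# The ordering of the cells is unchanged by the "correction"
print(all(a <= b for a, b in zip(corrected, corrected[1:])))
```

Because the constraint holds for any parameter values, training such a network with any loss cannot reverse the ranking of cells within a gene, which is the property the global model relies on.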

Figure 1. Procedure of the order-preserving batch-effect correction method based on a monotonic deep learning network (Step 1: Preprocessing raw count data to obtain a normalized expression matrix; Step 2: Initializing clusters based on the normalized expression matrix and estimating the probability that each cell belongs to each cluster, i.e. the matrix of cluster-membership probabilities; Step 3: Merging clusters within batches and matching clusters between batches based on NN information; Step 4: Minimizing the weighted MMD between the paired sets of clusters with a monotonic deep learning network; the matrix of cluster-membership probabilities can be used as an additional input to the network, and all details of the method can be found in Methods and Supplementary Materials).
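The weighted MMD loss used for batch-effect correction can be illustrated with a generic kernel two-sample statistic. The sketch below assumes a Gaussian kernel and normalised per-cell weights; the actual kernel and weighting scheme are specified in Methods, and the function names and toy data are hypothetical.

```python
import math


def gaussian_kernel(u, v, bandwidth=1.0):
    """RBF kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def weighted_mmd2(ref, query, ref_w=None, query_w=None, bandwidth=1.0):
    """Squared MMD between two weighted samples (biased V-statistic).

    Weights default to uniform and are rescaled to sum to one per sample.
    """
    def normalise(w, n):
        w = [1.0] * n if w is None else list(w)
        total = sum(w)
        return [x / total for x in w]

    a = normalise(ref_w, len(ref))
    b = normalise(query_w, len(query))

    def term(xs, ws, ys, vs):
        return sum(wi * vj * gaussian_kernel(x, y, bandwidth)
                   for x, wi in zip(xs, ws)
                   for y, vj in zip(ys, vs))

    return (term(ref, a, ref, a)
            + term(query, b, query, b)
            - 2.0 * term(ref, a, query, b))


reference = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
shifted = [[3.0, 3.0], [4.0, 3.0], [3.0, 4.0]]   # same shape, displaced
print(weighted_mmd2(reference, reference))       # identical samples
print(weighted_mmd2(reference, shifted))         # displaced samples score higher
```

Giving larger weights to cells from under-represented clusters is one way such a loss can counter class imbalance between batches.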

We evaluated our method using multiple scRNA-seq datasets from different sources. Performance was compared to five established batch-effect correction methods, namely ComBat [7], Harmony [10], Seurat v3 [9], MNN Correct [17], and ResPAN [18], as well as to the uncorrected raw data. To visualize the effectiveness of our method, we used t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP), both widely adopted techniques for dimensionality reduction and high-dimensional data visualization. Unlike principal component analysis (PCA), t-SNE and UMAP are particularly suited for scRNA-seq data [6] because they better preserve local and global structures, while being scalable to large datasets.

We analyzed inter-gene correlation and differential expression consistency to highlight the advantages of the order-preserving feature during the correction process. To assess clustering performance after batch-effect correction, we focused on two main criteria: batch mixing and CT purity. Specifically, we employed three clustering-related metrics: Adjusted Rand Index (ARI) [19] for clustering accuracy, Average Silhouette Width (ASW) [20] for cluster compactness, and Local Inverse Simpson Index (LISI) [10] for neighborhood diversity. Detailed definitions of these metrics can be found in the Methods section.
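As a concrete reminder of what the ARI measures, the sketch below computes it from two label vectors with the standard pair-counting formula. It is a generic implementation for illustration only; the exact evaluation protocol (including how the batch-mixing variant is derived) is given in Methods, and the toy labels are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same cells."""
    n = len(labels_true)
    cell_counts = Counter(zip(labels_true, labels_pred))   # contingency table
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / comb(n, 2)
    maximum = 0.5 * (sum_true + sum_pred)
    if maximum == expected:          # degenerate case: nothing to adjust
        return 1.0
    return (sum_cells - expected) / (maximum - expected)


cell_types = ["basal", "basal", "luminal", "luminal"]
clusters = [0, 0, 1, 2]              # one luminal cell split off
print(adjusted_rand_index(cell_types, clusters))
```

A value of 1 means the clustering reproduces the CT labels exactly (up to renaming), and values near 0 correspond to chance-level agreement.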

Compared to the benchmark methods, our method demonstrated superior performance, particularly in maintaining inter-gene correlation, improving CT clustering accuracy, and preserving original differential expression information within batches.

Order-preserving feature

Most current procedural methods neglect the within-gene order-preserving feature during batch-effect correction. To evaluate how well different methods preserve the original ranking of gene expression levels, we selected, for each dataset, the two CTs with the largest and smallest sample sizes. For each of these CTs, as well as for all samples, we plotted boxplots of Spearman correlation coefficients between the data before and after correction by the different methods. Because the large number of zeros in scRNA-seq datasets can result in many tied rankings, we considered only cells with non-zero raw counts for each gene in this analysis.
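The per-gene evaluation described above can be sketched as follows: keep only the cells with a non-zero raw count for the gene, rank the values before and after correction, and correlate the ranks. The helper names and the toy vectors are hypothetical, and the sketch assumes at least two retained cells with distinct values.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if a vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_nonzero(raw_counts, original, corrected):
    """Spearman correlation for one gene over cells with raw count > 0."""
    keep = [i for i, c in enumerate(raw_counts) if c > 0]
    before = [original[i] for i in keep]
    after = [corrected[i] for i in keep]
    return pearson(average_ranks(before), average_ranks(after))


raw = [0, 3, 1, 0, 7, 2]                          # raw counts of one gene
original = [math.log1p(c) for c in raw]           # normalized expression
monotone = [2.0 * v + 1.0 for v in original]      # an order-preserving change
print(spearman_nonzero(raw, original, monotone))
```

Any strictly increasing per-gene transformation yields a coefficient of 1 in this test, which is why an order-preserving method sits at the top of such boxplots.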

We excluded Harmony from this evaluation. According to the literature [10], the input of Harmony is a PCA embedding of the gene expression matrix, and its output is an embedded feature space of the same dimensionality, which is mainly used for subsequent clustering and visualization analyses. Because this output no longer retains the original data dimension, the Spearman correlation coefficient cannot be calculated directly at the level of gene expression before and after correction.

Among all listed methods, only the non-procedural method ComBat and our global monotonic model preserved the order of (non-zero) gene expression levels before versus after batch-effect correction. The partial monotonic model could ensure the order-preserving feature only conditional on the same matrix of cluster-membership probabilities (Fig. 2 and Supplementary Fig. S12).

Figure 2. Boxplots of Spearman correlation coefficients between original data and batch-effect corrected data using different methods in Dataset 1 (only considering non-zero expression): (a) all samples in batch 1, (b) CT luminal_mature in batch 1, (c) CT luminal_progenitor in batch 1, (d) all samples in batch 2, (e) CT luminal_mature in batch 2, and (f) CT luminal_progenitor in batch 2.

Inter-gene correlation

Analyzing gene-gene interactions is essential for uncovering the intricate dynamics underlying biological processes and disease mechanisms. By identifying functionally related gene clusters, researchers gain insights into how groups of genes co-regulate cellular functions or contribute to disease progression. Constructing gene regulatory networks reveals not only direct gene interactions but also the complex layers of transcriptional regulation [21]. Therefore, maintaining inter-gene correlation during batch-effect correction is essential to preserve the biological integrity of scRNA-seq data.

Most existing batch-effect correction methods primarily focus on aligning cells across batches, often neglecting the preservation of inter-gene correlation structures within CTs. In contrast, our method employed a distribution distance (weighted MMD) as the objective function and utilized a monotonic network to incorporate the order-preserving feature, enabling better preservation of inter-gene correlation during the batch-effect correction process. This approach prevented the disruption of important gene regulatory relationships, thereby maintaining the biological relevance of the data after batch-effect correction.

To quantify the ability to maintain inter-gene correlation, we designed the following procedure. For robustness, we focused only on CTs with more than 30 cells in both the reference and query batches. For each CT, we selected significantly correlated gene pairs within that CT. To avoid the influence of low-expression genes, we considered only genes whose average expression level exceeded the average expression level across all cells. Subsequently, a one-sided correlation test (implemented in R) was performed for each gene pair in each batch. We controlled the false discovery rate (FDR) by requiring each significantly correlated gene pair to exhibit the same correlation direction in both batches and to have Benjamini–Hochberg adjusted p-values below 0.05.
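The selection rule in the last step can be sketched as below: adjust the p-values of each batch with the Benjamini–Hochberg procedure, then keep the gene pairs that are significant in both batches and correlated in the same direction. The function names, the gene labels, and all numbers are hypothetical placeholders; the correlation tests themselves were run in R by the authors and are not reproduced here.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):            # from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_gene_pairs(pairs, r_ref, p_ref, r_query, p_query, alpha=0.05):
    """Keep pairs significant in both batches with the same correlation sign."""
    adj_ref = benjamini_hochberg(p_ref)
    adj_query = benjamini_hochberg(p_query)
    return [pair
            for pair, r1, q1, r2, q2
            in zip(pairs, r_ref, adj_ref, r_query, adj_query)
            if q1 < alpha and q2 < alpha and r1 * r2 > 0]


pairs = [("geneA", "geneB"), ("geneA", "geneC"), ("geneB", "geneC")]
selected = select_gene_pairs(
    pairs,
    r_ref=[0.6, 0.5, -0.4], p_ref=[0.001, 0.002, 0.003],
    r_query=[0.7, -0.5, -0.3], p_query=[0.001, 0.001, 0.04],
)
print(selected)   # the pair with opposite signs across batches is dropped
```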

Finally, we calculated the Pearson correlation of these gene pairs before and after batch-effect correction, and we evaluated the performance of different methods in preserving inter-gene correlation using multiple metrics, including root mean square error (RMSE), Pearson correlation, and Kendall correlation. In all experimental scRNA-seq datasets, compared to the other methods except ComBat, our partial and global monotonic models showed smaller RMSEs and higher Pearson and Kendall correlation coefficients in the vast majority of CTs (Fig. 3 and Supplementary Tables S5, S8, S9).
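Two of these summary metrics are simple enough to state in code. The sketch below computes the RMSE and Kendall's tau-b between the gene-pair correlations before and after correction; the function names and the example values are hypothetical, and the quadratic-time Kendall loop is meant for clarity rather than speed.

```python
import math


def rmse(before, after):
    """Root mean square error between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(before, after))
                     / len(before))


def kendall_tau_b(x, y):
    """Kendall's tau-b, counting concordant and discordant pairs directly."""
    concordant = discordant = tied_x = tied_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            tied_x += dx == 0
            tied_y += dy == 0
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    total = n * (n - 1) // 2
    return (concordant - discordant) / math.sqrt(
        (total - tied_x) * (total - tied_y))


# Correlations of five gene pairs before and after a hypothetical correction
r_before = [0.62, 0.48, -0.35, 0.71, -0.52]
r_after = [0.60, 0.50, -0.30, 0.69, -0.55]
print(rmse(r_before, r_after), kendall_tau_b(r_before, r_after))
```

A method that preserves inter-gene correlation well gives an RMSE close to 0 and a Kendall coefficient close to 1.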

Figure 3. Correlation coefficients (y-axis) of significantly correlated gene pairs (x-axis) in the luminal mature CT (Dataset 1) before versus after batch-effect correction: (a–f) correlation coefficients computed with different methods; RMSE denotes the root mean squared error between the uncorrected query batch and the corrected query batch.

Furthermore, we applied two statistical tests to evaluate these differences. First, for each correction method, we performed a paired Wilcoxon test on the two sets of Pearson correlation coefficients obtained before and after correction (under the assumption of independence or approximate independence). In the vast majority of cases, the differences between the two sets of coefficients obtained by our method were not statistically significant (p > 0.05). In the remaining cases (p < 0.05), the p-values obtained by our method were still larger than those obtained by the other methods. Second, we tested whether the methods differed in their capability to preserve inter-gene correlations. Specifically, for each method, we calculated the differences between the Pearson correlation coefficients before and after batch-effect correction. Because differences close to zero are preferred regardless of sign, we took their absolute values (note that the RMSEs were equivalently calculated from these absolute values). We then conducted a one-sided paired Wilcoxon test on the absolute differences generated by our method versus each other method: the null hypothesis was that the two methods did not differ in their capability to preserve inter-gene correlation (i.e. equal medians of the absolute differences), and the alternative hypothesis was that the median absolute difference of our method was smaller. For the vast majority of CTs, the median absolute differences obtained by our method were smaller than those of the other methods (p < 0.05, excluding the linear method ComBat; Supplementary Table S11). Therefore, our method demonstrated an improved capability in preserving inter-gene correlation (excluding the linear method ComBat).
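The second test can be sketched as a one-sided paired Wilcoxon signed-rank test on the absolute differences of two methods. The version below uses the large-sample normal approximation without tie or continuity corrections, so its p-values will differ slightly from those of exact implementations; the function name and the toy inputs are hypothetical, and at least one non-zero difference is assumed.

```python
from statistics import NormalDist


def paired_wilcoxon_less(ours, other):
    """One-sided signed-rank test, H1: median(ours - other) < 0.

    Normal approximation only; zero differences are dropped.
    """
    diffs = [a - b for a, b in zip(ours, other) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:                               # average ranks for tied |d|
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4.0
    sd = (n * (n + 1) * (2 * n + 1) / 24.0) ** 0.5
    return NormalDist().cdf((w_plus - mean) / sd)


# |r_before - r_after| per gene pair for two hypothetical methods
abs_diff_ours = [0.01 * k for k in range(1, 21)]
abs_diff_other = [v + 0.05 for v in abs_diff_ours]
print(paired_wilcoxon_less(abs_diff_ours, abs_diff_other))
```

A small p-value supports the alternative that the first method changes the gene-pair correlations less than the second.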

Although ComBat demonstrated good performance in preserving inter-gene correlation, it failed to effectively correct batch effects across multiple datasets in our subsequent evaluations. Even after ComBat correction, the same CTs from different batches were still unable to cluster together. The numerous zero values in scRNA-seq data interfered with ComBat’s Bayesian modeling, thereby affecting its ability to correct batch effects. In our subsequent results, we demonstrated that our method not only effectively preserved the original inter-gene correlation but also successfully corrected batch effects and improved clustering accuracy. This indicated that our method achieved a balance between preserving complex gene regulatory networks and correcting batch effects.

Clustering performance

We applied our partial and global monotonic models to nine experimental scRNA-seq datasets and one simulated dataset to evaluate their clustering performance after batch-effect correction and compared them with the other listed methods. The first dataset, derived from three independent studies of mammary epithelial cells, contains 9288 cells across three batches and three CTs (basal, luminal mature, and luminal progenitor) [22–24]. All methods corrected batch effects to varying degrees, and our approach showed superior CT separation and batch mixing, resulting in well-defined clusters (Supplementary Fig. S2). The partial model achieved the highest ARI F1 and the second-highest LISI F1, and the global model achieved the highest ASW F1 (Table 1). For the second dataset, consisting of lung adenocarcinoma cells collected from three cell lines (HCC827, H1975, and H2228) across three platforms [25], all listed methods successfully mixed the batches (Supplementary Fig. S3). The partial model achieved the highest ARI F1, ASW F1, and LISI F1, and the global model ranked second (Table 1).

Table 1. Comparison of the methods based on the clustering metrics computed on the different datasets

Dataset 1 || Dataset 2
Method | ARI (CT 1-B F1) | ASW (CT 1-B F1) | LISI (CT B F1) || ARI (CT 1-B F1) | ASW (CT 1-B F1) | LISI (CT B F1)
Partial | 0.99 0.95 0.97 | 0.80 1.03 0.90 | 1.00 2.28 1.38 || 0.99 1.00 0.99 | 0.84 1.01 0.91 | 1.00 2.07 1.34
Global | 0.98 0.94 0.96 | 0.86 1.01 0.93 | 1.01 2.18 1.36 || 0.97 0.99 0.98 | 0.71 1.02 0.85 | 1.01 2.05 1.33
ComBat | 0.97 0.94 0.96 | 0.76 0.96 0.85 | 1.00 1.33 1.13 || 0.97 0.99 0.98 | 0.71 1.01 0.83 | 1.01 2.03 1.32
Harmony | 0.98 0.94 0.96 | 0.82 1.01 0.91 | 1.00 2.27 1.38 || 0.97 1.00 0.98 | 0.71 1.01 0.84 | 1.01 2.01 1.32
Seurat | 0.99 0.95 0.97 | 0.83 1.02 0.91 | 1.00 2.38 1.40 || 0.97 0.99 0.98 | 0.73 1.02 0.85 | 1.00 2.04 1.33
MNN | 0.97 0.94 0.96 | 0.76 0.96 0.85 | 1.00 1.33 1.13 || 0.97 0.99 0.97 | 0.72 1.02 0.85 | 1.01 2.06 1.33
ResPAN | 0.97 0.94 0.96 | 0.82 1.01 0.91 | 1.01 2.24 1.37 || 0.97 0.99 0.98 | 0.72 1.01 0.84 | 1.01 1.53 1.19

Dataset 3 || Dataset 4
Method | ARI (CT 1-B F1) | ASW (CT 1-B F1) | LISI (CT B F1) || ARI (CT 1-B F1) | ASW (CT 1-B F1) | LISI (CT B F1)
Partial | 0.99 0.99 0.99 | 0.80 1.04 0.90 | 1.00 1.74 1.27 || 0.98 1.00 0.99 | 0.71 0.99 0.83 | 1.01 1.75 1.26
Global | 0.99 0.99 0.99 | 0.76 1.04 0.88 | 1.00 1.73 1.26 || 0.98 1.00 0.99 | 0.72 0.97 0.82 | 1.01 1.55 1.20
ComBat | 0.99 0.99 0.99 | 0.73 1.04 0.86 | 1.00 1.40 1.16 || 0.88 0.90 0.89 | 0.50 0.81 0.62 | 1.00 1.12 1.06
Harmony | 0.99 0.99 0.99 | 0.65 1.05 0.81 | 1.00 1.70 1.25 || 0.86 0.92 0.89 | 0.71 0.97 0.82 | 1.01 1.57 1.20
Seurat | 0.99 0.99 0.99 | 0.81 0.97 0.88 | 1.00 1.73 1.26 || 0.97 1.00 0.98 | 0.72 0.99 0.84 | 1.01 1.81 1.28
MNN | 0.99 0.99 0.99 | 0.73 1.03 0.86 | 1.00 1.73 1.26 || 0.98 1.00 0.99 | 0.62 0.97 0.76 | 1.02 1.54 1.19
ResPAN | 0.98 0.99 0.99 | 0.67 1.03 0.82 | 1.00 1.02 1.05 || 0.55 0.76 0.64 | 0.57 0.86 0.68 | 1.02 1.12 1.04

Dataset 5 || Dataset 6
Method | ARI (CT 1-B F1) | ASW (CT 1-B F1) | LISI (CT B F1) || ARI (CT 1-B F1) | ASW (CT 1-B F1) | LISI (CT B F1)
Partial | 0.60 0.99 0.75 | 0.28 0.98 0.44 | 1.44 1.78 0.84 || 0.85 1.00 0.92 | 0.64 1.00 0.78 | 1.11 1.94 1.21
Global | 0.61 0.99 0.75 | 0.28 0.97 0.44 | 1.44 1.78 0.84 || 0.83 1.00 0.91 | 0.63 1.00 0.78 | 1.13 1.95 1.19
ComBat | 0.51 0.99 0.67 | 0.28 0.97 0.44 | 1.41 1.69 0.86 || 0.44 0.90 0.59 | 0.41 0.98 0.58 | 1.30 1.36 0.91
Harmony | 0.53 0.98 0.69 | 0.27 0.98 0.43 | 1.39 1.83 0.91 || 0.83 1.00 0.91 | 0.58 1.00 0.73 | 1.18 1.94 1.15
Seurat | 0.54 0.99 0.70 | 0.26 0.98 0.41 | 1.46 1.85 0.82 || 0.83 1.00 0.91 | 0.63 1.00 0.77 | 1.14 1.92 1.18
MNN | 0.55 0.94 0.69 | 0.25 0.96 0.39 | 1.50 1.64 0.76 || 0.61 0.96 0.75 | 0.42 0.97 0.59 | 1.28 1.61 0.99
ResPAN | 0.48 0.94 0.65 | 0.28 0.97 0.43 | 1.52 1.75 0.74 || 0.84 1.00 0.91 | 0.60 0.97 0.74 | 1.16 1.31 1.02

Dataset 7 || Dataset 8
Method | ARI (CT 1-B F1) | ASW (CT 1-B F1) | LISI (CT B F1) || ARI (CT 1-B F1) | ASW (CT 1-B F1) | LISI (CT B F1)
Partial | 0.99 0.58 0.73 | 0.76 0.81 0.78 | 1.00 1.75 1.27 || 0.95 0.95 0.97 | 0.48 0.98 0.64 | 1.07 1.70 1.20
Global | 0.99 0.58 0.73 | 0.80 0.79 0.79 | 1.00 1.74 1.27 || 0.95 0.99 0.97 | 0.50 0.99 0.66 | 1.06 1.54 1.16
ComBat | 0.53 0.16 0.25 | 0.61 0.46 0.52 | 1.00 1.00 1.00 || 0.93 0.99 0.96 | 0.47 0.96 0.63 | 1.05 1.24 1.07
Harmony | 0.98 0.58 0.73 | 0.76 0.81 0.78 | 1.00 1.78 1.28 || 0.93 0.99 0.96 | 0.50 0.99 0.66 | 1.04 1.59 1.19
Seurat | 0.78 0.43 0.55 | 0.65 0.72 0.68 | 1.00 1.43 1.17 || 0.94 0.99 0.96 | 0.53 0.98 0.69 | 1.05 1.70 1.21
MNN | 0.75 0.43 0.55 | 0.64 0.46 0.52 | 1.00 1.35 1.14 || 0.94 0.99 0.96 | 0.57 0.98 0.72 | 1.05 1.68 1.21
ResPAN | 0.97 0.58 0.73 | 0.83 0.78 0.80 | 1.00 1.78 1.28 || 0.93 0.99 0.96 | 0.47 0.98 0.63 | 1.06 1.59 1.18

Dataset 9 || Dataset 10
Method | ARI (CT 1-B F1) | ASW (CT 1-B F1) | LISI (CT B F1) || ARI (CT 1-B F1) | ASW (CT 1-B F1) | LISI (CT B F1)
Partial | 0.97 1.00 0.98 | 0.59 1.00 0.74 | 1.03 1.49 1.17 || 1.00 0.99 0.99 | 0.87 1.00 0.93 | 1.00 1.80 1.28
Global | 0.97 1.00 0.98 | 0.43 1.01 0.61 | 1.03 1.47 1.16 || 1.00 0.99 0.99 | 0.88 0.99 0.93 | 1.00 1.82 1.29
ComBat | 0.64 0.81 0.72 | 0.14 0.86 0.24 | 1.03 1.01 0.99 || 1.00 0.99 0.99 | 0.86 0.99 0.92 | 1.00 1.79 1.28
Harmony | 0.84 0.92 0.88 | 0.41 0.93 0.57 | 1.02 1.08 1.02 || 1.00 0.99 0.99 | 0.86 0.99 0.92 | 1.00 1.77 1.27
Seurat | 0.96 1.00 0.98 | 0.52 1.00 0.69 | 1.03 1.65 1.22 || 1.00 0.99 0.99 | 0.90 0.99 0.94 | 1.00 1.77 1.27
MNN | 0.96 1.00 0.98 | 0.42 0.98 0.59 | 1.03 1.34 1.12 || 1.00 0.99 0.99 | 0.85 0.99 0.92 | 1.00 1.79 1.28
ResPAN | 0.96 0.99 0.97 | 0.37 0.99 0.54 | 1.03 1.31 1.11 || 1.00 0.99 0.99 | 0.89 1.00 0.94 | 1.00 1.80 1.28

The results were evaluated on the UMAP embedding of the corrected data. Our method used Louvain as the initial clustering algorithm and determined the resolution parameter through ASW. Each metric was computed for CT purity (CT), batch mixing (B or 1-B), and the combination of both criteria (F1). The top-performing methods were highlighted in bold.
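The exact definition of the combined F1 score is given in Methods. As a reading aid, the sketch below assumes it is the harmonic mean of the CT score and the batch-mixing score, which appears consistent (to rounding) with the ARI and ASW columns of Table 1. The LISI F1 entries do not follow this plain formula (see Dataset 5, for example), so the LISI scores are presumably rescaled first; the function name is hypothetical.

```python
def f1_score(ct_score, batch_score):
    """Harmonic mean of a CT-purity score and a batch-mixing score."""
    return 2.0 * ct_score * batch_score / (ct_score + batch_score)


# Partial model, Dataset 1 (values copied from Table 1)
print(round(f1_score(0.99, 0.95), 2))   # ARI: CT = 0.99, 1-B = 0.95
print(round(f1_score(0.80, 1.03), 2))   # ASW: CT = 0.80, 1-B = 1.03
```

The harmonic mean rewards methods that do well on both criteria at once, since a low score on either CT purity or batch mixing pulls the combined value down.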

To further test the flexibility of our method, we assessed its ability to detect new CTs by artificially removing CT H1975 from two batches while retaining it in the third batch (Dataset 3, a subset of Dataset 2). All listed methods successfully clustered the same CTs from different batches and distinguished the unique H1975 cells in the third batch (Supplementary Fig. S4). In a more challenging scenario, Dataset 7 comprised three batches: batch 1 contained only 293T cells (2885 cells), batch 2 contained only Jurkat cells (3258 cells), and batch 3 contained a 50/50 mix of both CTs (3388 cells). Harmony, ResPAN, and our method successfully integrated the batches, resulting in two distinct CTs after batch-effect correction. The other methods struggled in this scenario: Seurat and MNN Correct failed to effectively mix the 293T cells from different batches, and ComBat failed for both CTs (Supplementary Fig. S7).

Given the requirement of deep learning methods for substantial data, we also evaluated the performance of our method on a smaller dataset (Dataset 4), consisting of 704 mouse embryonic stem cells sequenced under three culture conditions [26]. Despite the limited data, our method effectively corrected batch effects and maintained high performance across all three clustering-related metrics (Supplementary Fig. S5, Table 1).

Furthermore, we applied our method to additional scRNA-seq datasets and compared its performance with that of the listed batch-effect correction methods. Our method effectively corrected batch effects across different datasets, demonstrating its broad applicability (Figs 4 and 5, Supplementary Figs S6, S8 and S10). It accurately clustered the same CTs from different batches and performed particularly well in clustering accuracy as measured by ARI (Table 1).

Figure 4. Batch-effect correction for scRNA-seq data from human blood dendritic cells (DCs) (Dataset 6), composed of two batches: (a–h) t-SNE embeddings computed from the compared methods, with points colored by CT; (i–p) t-SNE embeddings computed from the compared methods, with points colored by batch.

Figure 5. Batch-effect correction for a failing human heart dataset (Dataset 9), composed of three batches: (a–h) t-SNE embeddings computed from the compared methods, with points colored by CT; (i–p) t-SNE embeddings computed from the compared methods, with points colored by batch.

Differential expression consistency

Analyzing differentially expressed genes (DEGs) is crucial for uncovering molecular mechanisms underlying various biological conditions and diseases [27]. DEG analysis allows researchers to identify genes associated with specific diseases, potentially leading to the discovery of therapeutic targets and the development of personalized treatments. Moreover, DEGs serve as biomarkers for disease diagnosis, monitoring treatment responses, and understanding gene regulatory networks, shedding light on gene interactions and their influence on phenotypic traits [28]. Integrating multiple scRNA-seq datasets while preserving the original differential expression information within batches can help us achieve more reliable integration analysis results.

We observed that the mixing performance (LISI) of the global model was markedly lower than that of the partial model on certain datasets (such as Datasets 1, 4, and 8 in Table 1). Moreover, across all datasets, the partial model generally outperformed the global model in terms of mixing performance. Although this difference could be attributed to the relationship between the two models, in which the global monotonic model is a subset of the partial monotonic model, we found that the difference in their mixing performance was more likely caused by the inconsistency in differential expression across batches.

A gene should exhibit consistent expression patterns across different batches. For example, if a gene shows an upregulated trend between two CTs in one batch, it should demonstrate a similar trend in another batch. Although many batch-effect correction methods have been developed for scRNA-seq data, the consistency of differential expression across batches is often overlooked in correction evaluations. Previous studies [29, 30] have emphasized the importance of assessing differential expression consistency to avoid drawing misleading biological conclusions.

In our study, we found that the performance of the global model was influenced by the degree of consistency in differential expression across batches. Taking Dataset 1 as an example, we performed a one-sided Wilcoxon rank-sum test on normalized gene expression between each pair of CTs (three pairs in total) within each batch (two query batches and one reference batch). From the resulting p-values (p), we computed the corresponding z-scores (z) using the following transformation:

$$z = \Phi^{-1}(1 - p)$$

where $\Phi^{-1}$ is the inverse function of the standard normal cumulative distribution function (c.d.f.).
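A minimal sketch of this transformation is given below, assuming the sign convention z = Φ⁻¹(1 − p), under which small one-sided p-values map to large positive z-scores. The function name `p_to_z` and the clipping constants are hypothetical; clipping only keeps the result finite for p-values of exactly 0 or 1.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def p_to_z(p, tiny=1e-300):
    """Convert a one-sided p-value to a z-score, z = Phi^{-1}(1 - p)."""
    p = min(max(p, tiny), 1.0 - 1e-16)    # keep p strictly inside (0, 1)
    # -Phi^{-1}(p) equals Phi^{-1}(1 - p) by symmetry and avoids computing
    # 1 - p, which would round to 1.0 for very small p.
    return -_STD_NORMAL.inv_cdf(p)


for p in (0.001, 0.05, 0.5, 0.95):
    print(p, round(p_to_z(p), 3))
```

With this convention, a gene that is significantly higher in the first CT gets a positive z-score and a gene that is significantly lower gets a negative one, so the sign carries the direction of differential expression.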

The same procedure was also conducted on Dataset 2. We then created scatter plots in which the z-scores of the reference batch are on the x-axis and the z-scores of the other batches are on the y-axis. As shown in Fig. 6, for Dataset 2 the points are concentrated in the first and third quadrants, indicating a consistent direction of differential expression for most genes between the reference and query batches. In contrast, for Dataset 1, approximately half of the points fall in the second and fourth quadrants, indicating that many genes exhibit opposite differential expression patterns across batches and suggesting substantial inconsistency.
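The visual impression from such scatter plots can be summarised by the share of genes falling in the first and third quadrants. The sketch below is a hypothetical summary statistic, not one reported by the authors; the function name and the example z-scores are illustrative only.

```python
def sign_concordance(z_ref, z_query):
    """Fraction of genes whose z-scores share a sign in both batches.

    Genes with a z-score of exactly zero in either batch are ignored.
    """
    pairs = [(a, b) for a, b in zip(z_ref, z_query) if a != 0 and b != 0]
    same = sum(1 for a, b in pairs if (a > 0) == (b > 0))
    return same / len(pairs)


z_reference = [2.0, -1.5, 0.5, -3.0]
z_query = [1.0, -0.2, -0.7, 2.5]
print(sign_concordance(z_reference, z_query))
```

A value near 1 corresponds to points concentrated in the first and third quadrants, whereas a value near 0.5 corresponds to roughly half of the points lying in the second and fourth quadrants.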

Figure 6. The z-scores between different pairs of CTs across batches in Dataset 1 and Dataset 2: (a–c) z-scores obtained in Dataset 1; (d–f) z-scores obtained in Dataset 2 (in all the graphs, the x-axis represents the z-scores obtained from the reference batch, and the y-axis represents the z-scores obtained from query batches).

Due to the monotonicity constraint of the network, the inconsistency in differential expression across batches imposed limitations on the optimization of the global model. When the gene expression trends among CTs were inconsistent across different batches, the global model struggled to effectively integrate these trends, resulting in poor mixing performance. In contrast, the partial model alleviated this issue by using the matrix of cluster-membership probabilities as an additional input. Therefore, in this scenario, the LISI score of the global model was lower than that of the partial model (Table 1). Similar phenomena were also observed in Dataset 4 and Dataset 6 (Supplementary Fig. S11 and Table 1).

This result indicated that the direction of differential expression between CTs might vary across different batches. Over-reliance on the differential expression analysis results from a single batch, or an excessive pursuit of mixing performance, could lead to the loss of differential expression information within other batches, ultimately resulting in misleading conclusions. Therefore, preserving the original differential expression information within different batches during the correction process was crucial to ensure reliable downstream integration analysis.

To evaluate the ability of different batch-effect correction methods to preserve batch-specific differential expression information, we used Dataset 2 as an example. We selected the two CTs with the most cells (H1975 and H2228), conducted a one-sided Wilcoxon rank-sum test between these two CTs in the query batch before and after batch-effect correction, and converted the p-values into z-scores. By plotting the z-scores before correction (x-axis) against those after correction (y-axis), we observed the changes in differential expression between the two CTs before versus after correction. We compared the global model with the widely adopted method Seurat and the deep learning-based method ResPAN. To ensure fairness, all tests were performed on the same set of 2000 highly variable genes (HVGs). We used a significance threshold of p = 0.05, corresponding to an absolute z-score of 1.645. Points whose z-scores before and after correction both exceeded 1.645 in absolute value and had opposite signs were identified as outliers. Owing to its order-preserving feature, the global model exhibited the fewest outliers (Fig. 7 and Table 2), indicating that it more effectively retained the original differential expression information within the batch.
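The outlier rule can be written down directly. In the sketch below, a gene is counted when its z-scores before and after correction are both beyond the cutoff in absolute value and have opposite signs. The function name and the example z-scores are hypothetical, and the default cutoff corresponds to an unadjusted one-sided p-value of 0.05.

```python
from statistics import NormalDist

Z_CUTOFF = NormalDist().inv_cdf(0.95)     # about 1.645 for one-sided p = 0.05


def count_outliers(z_before, z_after, cutoff=Z_CUTOFF):
    """Count genes that are significant in both states with reversed sign."""
    return sum(1 for b, a in zip(z_before, z_after)
               if abs(b) > cutoff and abs(a) > cutoff and b * a < 0)


z_before = [2.5, -3.0, 1.0, 2.0, -2.2]
z_after = [2.1, 2.4, -2.0, -1.0, -2.5]
print(count_outliers(z_before, z_after))
```

Passing a stricter cutoff gives the counts under a multiple-testing threshold; for example, a Bonferroni threshold over 2000 genes would use the z-score corresponding to p = 0.05/2000.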

Figure 7. The z-scores between CTs in a query batch from Dataset 2 before versus after batch-effect correction (in all the graphs, the x-axis represents the z-scores obtained from the uncorrected batch, and the y-axis represents the z-scores obtained from the corrected batch; outliers represent genes whose differential expression direction changed significantly before versus after batch-effect correction): (a–c) different batch-effect correction methods.

Table 2.

Comparison of different batch-effect correction methods in preserving original differential expression information

| Dataset | Method | $P$ value | Benjamini–Hochberg | Bonferroni |
| --- | --- | --- | --- | --- |
| Dataset 1 | Seurat | 86 | 48 | 15 |
| Dataset 1 | ResPAN | 167 | 103 | 38 |
| Dataset 1 | Global | 83 | 43 | 15 |
| Dataset 2 | Seurat | 69 | 38 | 1 |
| Dataset 2 | ResPAN | 501 | 385 | 161 |
| Dataset 2 | Global | 52 | 32 | 2 |
| Dataset 3 | Seurat | 154 | 100 | 4 |
| Dataset 3 | ResPAN | 204 | 178 | 96 |
| Dataset 3 | Global | 48 | 25 | 1 |
| Dataset 4 | Seurat | 19 | 0 | 0 |
| Dataset 4 | ResPAN | 333 | 125 | 71 |
| Dataset 4 | Global | 2 | 0 | 0 |
| Dataset 5 | Seurat | 19 | 0 | 0 |
| Dataset 5 | ResPAN | 15 | 0 | 0 |
| Dataset 5 | Global | 11 | 0 | 0 |
| Dataset 6 | Seurat | 10 | 0 | 0 |
| Dataset 6 | ResPAN | 59 | 11 | 4 |
| Dataset 6 | Global | 0 | 0 | 0 |
| Dataset 8 | Seurat | 40 | 19 | 5 |
| Dataset 8 | ResPAN | 70 | 19 | 3 |
| Dataset 8 | Global | 17 | 8 | 3 |
| Dataset 9 | Seurat | 6 | 0 | 0 |
| Dataset 9 | ResPAN | 4 | 0 | 0 |
| Dataset 9 | Global | 2 | 1 | 0 |

The numbers in the table represented the count of genes that potentially exhibited aberrant differential expression before versus after batch-effect correction, with lower counts indicating better performance. The best-performing method was highlighted in bold. More details can be found in Results/Differential expression consistency.

To further evaluate the preservation of original differential expression information across different datasets, we repeated this procedure on all experimental scRNA-seq datasets except Dataset 7, where only one CT was present in the query batch. We also applied Bonferroni and Benjamini–Hochberg corrections for multiple testing, which control the family-wise error rate and the FDR, respectively, to enhance the reliability of the identified DEGs. Across all datasets, the global model consistently demonstrated the best performance in preserving original differential expression information, yielding the fewest outliers due to its order-preserving feature (Table 2).
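For reference, the two multiple-testing adjustments mentioned above can be sketched as follows. This is a generic illustration of the standard Bonferroni and Benjamini–Hochberg procedures, not the authors' code; the function names are hypothetical.

```python
def bonferroni(pvals):
    """Bonferroni-adjusted P values: multiply by the number of tests, cap at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted P values (step-up procedure)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value to the smallest so that adjusted
    # values are non-decreasing in the original P value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Illustrative P values only.
pvals = [0.01, 0.04, 0.03, 0.005]
bonf = bonferroni(pvals)
bh = benjamini_hochberg(pvals)
```

Bonferroni is the more conservative of the two, which is consistent with the smaller outlier counts in the Bonferroni columns of Table 2.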

Different initial clustering

Our method defaulted to using the Louvain clustering algorithm for initial clustering. However, in practical applications, various clustering algorithms could be flexibly adopted.

In this study, we also used the Leiden clustering algorithm [31] and the Gaussian Mixture Model (GMM) [32] for initial clustering. The Leiden algorithm, an improvement over Louvain [33], ensured higher partition quality and robustness. It extended the community detection process with a refinement phase between the local moving and aggregation phases, which yielded well-connected communities and more accurate community assignments while maintaining efficiency. GMM provided a flexible way to model complex data distributions. One key advantage of GMM was “soft clustering,” where each data point was assigned a probability of belonging to each cluster rather than being strictly assigned to one. This approach was advantageous for handling overlapping clusters and capturing uncertainty in data assignments.

Despite using different clustering algorithms, our strategy of merging and matching clusters (as detailed in Step 2 of the Methods section) remained applicable. Based on three different clustering methods (Louvain, Leiden, and GMM), our method effectively corrected batch effects across all datasets (Fig. 8, Supplementary Fig. S9, and Supplementary Table S6). The clustering performance after correction may vary depending on different initial clustering methods. For instance, when using GMM clustering, the partial monotonic model achieved an ARI of 1 in Dataset 4 (Supplementary Table S6). When different initial clustering methods were considered, the subsequent analyses following the batch-effect correction by our method showed minor differences (Supplementary Table S6).

Figure 8.

Batch-effect correction for a failing human heart dataset (Dataset 9) composed of two batches: (a–f) t-SNE embedding of the corrected expression matrix based on different initial clustering methods, in which the points were colored by CT; (g–l) t-SNE embedding of the corrected expression matrix based on different initial clustering methods, in which the points were colored by batch.

Discussion

Most current procedural methods, which involve components such as anchoring, MNN, and deep learning, often overlook the order-preserving feature during batch-effect correction. Although the non-procedural ComBat method can preserve the original order of expression levels within each gene, it struggles to handle the abundance of zero values in scRNA-seq data, which may lead to suboptimal performance in batch-effect correction tasks. Therefore, we developed a procedural method with an order-preserving feature to correct batch effects.

Our method used initial clustering and NN information within and across batches to construct similarities between clusters. We integrated a deep learning network with a monotonicity property and a weighted maximum mean discrepancy (MMD) loss to perform batch-effect correction. Our method not only preserved the inherent biological signals but also maintained the original order of gene expression levels during correction, thereby better retaining batch-specific information and enhancing biological interpretability.

We tested our method on multiple experimental scRNA-seq datasets and simulated datasets. In benchmark comparisons with other batch-effect correction methods, such as Seurat, Harmony, and ResPAN, our method demonstrated superior performance. It not only helped maintain inter-gene correlation and preserved the original differential expression information within batches, but also achieved higher clustering accuracy by integrating initial clustering with batch-effect correction.

Current batch-effect correction methods and their evaluations often overlooked the inconsistency in differential expression across batches. Over-reliance on a single batch could lead to misleading results in downstream analyses. More importantly, based on the order-preserving feature, our global method better retained the original differential expression information within batches during the batch-effect correction process, thereby improving the reliability of integration analysis results.

Our method also has the following limitations. First, in scenarios involving rare or imbalanced CTs, certain CTs may have little or no overlap across batches (e.g. Dataset 7). Because our method leverages MNN information to identify potentially shared CTs across batches, unmatched CTs could be inaccurately aligned, thereby impairing the effectiveness of batch-effect correction. To mitigate this, we introduced a minimum threshold to filter out unreliable matches. Second, for datasets with complex tissues, the monotonicity constraint imposed on the neural network may limit its capacity to model subtle batch effects. To mitigate this, we designed a triple structure and hidden layers with expansion nodes to enhance its modeling capability. Despite these potential limitations, our analyses across multiple experimental and simulated datasets, along with the comprehensive comparisons with existing approaches, demonstrated the adaptability and robustness of our method in correcting batch effects.

To provide guidelines in practice, we recommend the following for selecting between the global and partial models based on methodological design and empirical performance. The global model is designed to enforce monotonicity across all samples, making it particularly suitable for scenarios where preserving global structural consistency and expression trends is critical. In contrast, the partial model applies monotonicity constraints only within subsets of samples that share the same initial clustering label. Furthermore, evaluation results highlight the complementary strengths of the two models. The partial model excels in batch mixing performance, demonstrating superior capability in mitigating batch effects across datasets. The global model shows better preservation of original differential gene expression. In terms of clustering performance and inter-gene correlation preservation, both models perform comparably. In practice, we recommend the partial model for applications where batch mixing performance is the primary concern, particularly in multi-batch single-cell datasets. Otherwise, when maintaining original differential gene expression is the focus, we recommend the global model.

With the continuous advancement of sequencing technology and multi-omics data, the order-preserving property in the batch-effect correction process deserves attention. Extending the order-preserving feature to other omics data and developing faster, more stable, and more effective batch-effect correction methods would be beneficial for better preserving batch-specific information.

Conclusion

In summary, we developed a comprehensive procedural method with an order-preserving feature to correct batch effects, which involved initial clustering, NNs, and a monotonic deep network. By applying our method to multiple scRNA-seq datasets, we demonstrated that it not only effectively corrected batch effects but also preserved inter-gene correlation. Furthermore, leveraging the order-preserving feature, our method retained differential expression information within batches after correction.

Methods

Our method’s workflow (Fig. 1) mainly involved the following steps: preprocessing, initializing clusters, identifying $k$-nearest neighbor (KNN) pairs within batches and MNN pairs across batches, constructing a similarity matrix between clusters, merging clusters within batches, matching clusters across batches, and ultimately correcting batch effects based on a monotonic deep learning network. More details were provided in the Supplementary Materials.

Step 1: preprocessing

The preprocessing step comprised four tasks: filtering low-quality cells and genes, cell normalization, log normalization, and detecting HVGs. All of these tasks were implemented with the Python module Scanpy.

Let Inline graphic represent the raw count of gene Inline graphic in cell Inline graphic. In the filtering step, low-quality cells with nGene Inline graphic were removed, and genes expressed in fewer than three cells (nCells Inline graphic) were excluded. For normalization, the counts for each cell were divided by the total counts across all genes and multiplied by a constant of 10 000, and a log transformation was then applied to obtain the normalized expression value Inline graphic. Finally, we selected 2000 HVGs by using the Inline graphic function. We chose 2000 HVGs because this setting is recommended by both the widely used single-cell analysis tool Seurat [9] and the Python toolkit Scanpy as the input feature set for downstream analyses.
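The gene filtering and normalization arithmetic can be written out in a few lines of plain Python. This is a minimal sketch rather than the Scanpy pipeline itself: it assumes the log transformation is $\log(1 + x)$, and the function names `filter_genes` and `normalize_counts` are hypothetical.

```python
import math

def filter_genes(counts, min_cells=3):
    """Return indices of genes with a nonzero count in at least `min_cells` cells.
    `counts` is a list of cells, each a list of raw gene counts."""
    n_genes = len(counts[0])
    return [
        g for g in range(n_genes)
        if sum(1 for cell in counts if cell[g] > 0) >= min_cells
    ]

def normalize_counts(counts, scale=10_000):
    """Scale each cell to `scale` total counts, then apply log(1 + x) (assumed)."""
    normalized = []
    for cell in counts:
        total = sum(cell)
        if total == 0:
            # An empty cell stays all-zero instead of dividing by zero.
            normalized.append([0.0] * len(cell))
        else:
            normalized.append([math.log1p(c / total * scale) for c in cell])
    return normalized

# Toy matrix: 3 cells x 3 genes.
raw = [[1, 0, 3], [0, 0, 5], [2, 0, 2]]
kept = filter_genes(raw, min_cells=3)
norm = normalize_counts(raw)
```

Because the per-cell scaling factor is positive and the logarithm is increasing, zero counts remain zero and the ranking of genes within each cell is unchanged by this step.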

Step 2: initializing clusters

Let Inline graphic be the Inline graphic matrix of normalized expression from Step 1, including only the Inline graphic HVGs. We applied the Inline graphic function to the normalized data to obtain a low-dimensional embedding space (defaulting to Inline graphic-dimensional PCA) and applied the Inline graphic function to construct neighbor information (defaulting to Inline graphic neighbors based on Euclidean distance).

We offered several methods for initializing clusters. Our method defaulted to using the Louvain [33] method, a graph-based clustering method that has demonstrated strong performance. The Louvain method is a popular approach for community detection in large networks, optimizing modularity to identify densely connected groups. It operates in two phases: first, it assigns nodes to communities, and then it merges communities to maximize modularity. This procedure can be implemented by the function Inline graphic in the Inline graphic package, where a higher resolution yields more and smaller clusters. Our method used the ASW to select the appropriate resolution parameter (Algorithm 1 in the Supplementary Material). Our method can also use a default resolution of 1 to help find a moderate number of CTs.

In the manuscript, we only presented the results obtained by determining the resolution through ASW. The results based on the default resolution were provided in the Supplementary Materials (Supplementary Table S7). Alternative methods include a Gaussian Mixture Model [32] (GMM) and the Leiden [31] algorithm. For GMM, we selected the number of clusters based on the Bayesian Information Criterion (BIC) [34]. Leiden, an improvement over Louvain, can be implemented using the function Inline graphic in the Inline graphic package.

We then calculated the probability of each cell belonging to each cluster, denoted as Inline graphic. For GMM, the posterior estimates of latent variables represented the cluster probabilities. In the Louvain/Leiden clustering algorithms, cluster probabilities were binary (0 or 1).
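For hard clustering methods such as Louvain or Leiden, the probability matrix reduces to a one-hot encoding of the labels. A minimal sketch follows; the function name `hard_membership` is hypothetical.

```python
def hard_membership(labels):
    """Turn hard cluster labels into a cells x clusters matrix of 0/1 probabilities."""
    clusters = sorted(set(labels))
    column = {c: k for k, c in enumerate(clusters)}
    matrix = [
        [1.0 if column[label] == k else 0.0 for k in range(len(clusters))]
        for label in labels
    ]
    return matrix, clusters

# Three cells assigned to two clusters.
probs, cluster_names = hard_membership(["b", "a", "b"])
```

With a GMM, each row would instead hold posterior probabilities that sum to 1 without being restricted to 0 or 1.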

Step 3: merging and matching clusters

The merging and matching step comprised three tasks: finding KNN and MNN pairs, calculating the similarity matrix among clusters, and merging/matching the clusters obtained from Step 2. We aimed to find potentially identical CTs within the same batch based on KNN information and potentially identical CTs across batches based on MNN information. MNN-based methods, such as MNN [17] and BBKNN [35], have been shown to effectively reduce batch effects in scRNA-seq data.

Finding KNN and MNN pairs

Consider batches Inline graphic (reference) and Inline graphic (query) as an example. Let Inline graphic be a Inline graphic matrix of scRNA-seq data in PCA embedding space (n_components = 100), where Inline graphic is Inline graphic submatrix of cells in the reference batch Inline graphic.

Let Inline graphic be the vector of cell Inline graphic from batch Inline graphic in PCA embedding space. Denote Inline graphic as the set of KNN pairs within batch Inline graphic; cell Inline graphic and cell Inline graphic form a KNN pair if and only if:

graphic file with name DmEquation2.gif (1)

where the tuples Inline graphic and Inline graphic are both KNN pairs, and Inline graphic represents the set of cells in batch Inline graphic that are nearest to cell Inline graphic. We used a default of 10 neighbors and cosine distance for KNN calculations.

Analogously to the definition of KNN pairs within a batch, we let Inline graphic be the set of MNN pairs between batch Inline graphic and Inline graphic; cell Inline graphic and cell Inline graphic form an MNN pair if and only if:

graphic file with name DmEquation3.gif (2)

where the tuples Inline graphic and Inline graphic are both MNN pairs, and MNNInline graphic represents the set of cells in batch Inline graphic which are nearest to cell Inline graphic in batch Inline graphic, and MNNInline graphic represents the set of cells in batch Inline graphic which are nearest to cell Inline graphic in batch Inline graphic. We used a default of 25 neighbors and cosine distance for MNN calculations.

In our study, within each batch, we set the number of neighbors to 10 in order to better distinguish rare CTs [15]; across different batches, we set the number of neighbors to 25 in order to enhance the capability in matching potentially shared CTs [35].
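The cross-batch pairing can be sketched with a brute-force search under cosine distance. This is an illustrative implementation of the mutual nearest neighbor idea only; it assumes nonzero embedding vectors, and the function names are hypothetical. A practical implementation would use an approximate or tree-based neighbor search instead of sorting all distances.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def nearest(query, pool, k):
    """Indices of the k vectors in `pool` closest to `query`."""
    order = sorted(range(len(pool)), key=lambda j: cosine_distance(query, pool[j]))
    return set(order[:k])

def mnn_pairs(batch_a, batch_b, k=25):
    """Pairs (i, j) where cell j of batch B is among the k nearest neighbors
    of cell i of batch A, and vice versa."""
    nn_ab = [nearest(a, batch_b, k) for a in batch_a]
    nn_ba = [nearest(b, batch_a, k) for b in batch_b]
    return {(i, j) for i, nbrs in enumerate(nn_ab) for j in nbrs if i in nn_ba[j]}

# Toy embeddings: two reference cells and three query cells, with k = 1.
ref = [[1.0, 0.0], [0.0, 1.0]]
qry = [[0.9, 0.1], [0.1, 0.9], [1.0, 1.0]]
pairs = mnn_pairs(ref, qry, k=1)
```

The third query cell is nobody's nearest neighbor, so it forms no pair, which mirrors how cells without a counterpart in the other batch are left unmatched.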

Similarity matrix

After finding all KNN and MNN pairs, we constructed similarities between all clusters obtained in the different batches. Let Inline graphic represent the number of clusters obtained in batch Inline graphic, and Inline graphic the number of clusters obtained in batch Inline graphic. The similarity matrix Inline graphic can be defined as:

graphic file with name DmEquation4.gif (3)

where

graphic file with name DmEquation5.gif (4)

and

graphic file with name DmEquation6.gif (5)

where Inline graphic is the entry in the Inline graphicth row and Inline graphicth column of Inline graphic, representing the probability of cell Inline graphic belonging to the cluster Inline graphic obtained in Step 2. Larger clusters naturally lead to more KNN and MNN pairs, so we accounted for cluster size when computing similarity.

Merging and matching rule

To find potentially identical CTs within the same batch based on KNN information and across batches based on MNN information, we designed a merging/matching rule to merge and match the clusters obtained in Step 2.

Starting from an initial cluster in the query batch, we identified all corresponding clusters in the similarity matrix that exceed the similarity threshold. This process continued iteratively until no new clusters were found. The discovered clusters were divided into paired sets based on their batch origin (query/reference), denoted as Inline graphic. Both Inline graphic and Inline graphic may consist of multiple initial clusters. Then, we selected a new cluster within the query batch (previously undiscovered) and repeated the previous process until all initial clusters have been traversed. It is also worth noting that not all clusters have corresponding paired clusters. We then gradually increased the similarity threshold and repeated the above procedure until the number of clusters that can be matched at the new threshold was less than that at the previous threshold. This step helped us differentiate as many distinct CTs as possible. The detailed rules for merging/matching clusters and adjusting threshold (Algorithm 2 and Algorithm 3) were provided in the Supplementary Material.
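The traversal at a single fixed threshold can be sketched as a breadth-first search over the thresholded similarity matrix. This is a simplified illustration, not Algorithm 2 or Algorithm 3 from the Supplementary Material: the threshold adjustment loop is omitted, and the function name `match_clusters` and the `batch_of` labels are hypothetical.

```python
from collections import deque

def match_clusters(similarity, batch_of, threshold):
    """Group clusters linked by similarity above `threshold`.

    similarity: symmetric matrix over all clusters of both batches.
    batch_of:   "query" or "reference" for each cluster index.
    Returns a list of (query_clusters, reference_clusters) pairs.
    """
    n = len(similarity)
    visited = set()
    groups = []
    for start in range(n):
        # Each traversal starts from a not yet discovered query cluster.
        if batch_of[start] != "query" or start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        component = [start]
        while queue:
            current = queue.popleft()
            for other in range(n):
                if other not in visited and similarity[current][other] > threshold:
                    visited.add(other)
                    queue.append(other)
                    component.append(other)
        query_set = sorted(c for c in component if batch_of[c] == "query")
        ref_set = sorted(c for c in component if batch_of[c] == "reference")
        groups.append((query_set, ref_set))
    return groups

# Toy example: clusters 0 and 1 are from the query batch, 2 and 3 from the reference.
S = [
    [0.0, 0.1, 0.8, 0.0],
    [0.1, 0.0, 0.0, 0.2],
    [0.8, 0.0, 0.0, 0.0],
    [0.0, 0.2, 0.0, 0.0],
]
batches = ["query", "query", "reference", "reference"]
strict = match_clusters(S, batches, threshold=0.5)
loose = match_clusters(S, batches, threshold=0.15)
```

In the toy example, the stricter threshold leaves query cluster 1 without a reference partner, illustrating that not all clusters have corresponding paired clusters.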

Step 4: partial/global monotonic deep networks

Monotonic neural networks offer significant benefits in terms of consistency and interpretability, particularly in medical applications. However, the architecture and activation functions of these networks must be carefully designed to maintain monotonicity, which can constrain the network’s ability to capture complex, non-linear relationships and potentially reduce overall accuracy in certain contexts. This area is actively evolving, with ongoing research aimed at enhancing these networks to be more adaptable across complex scenarios [36, 37]. Here, we introduced a three-layer feedforward neural network with weight constraints [36]. Let Inline graphic denote the weight connecting input Inline graphic to hidden unit Inline graphic (with Inline graphic hidden units in total) and Inline graphic the weight connecting hidden unit Inline graphic to the output. Given an input Inline graphic (of dimension Inline graphic), the output function Inline graphic for a network with one hidden layer is given by:

graphic file with name DmEquation7.gif

To maintain monotonicity, the network must satisfy:

graphic file with name DmEquation8.gif

where Inline graphic indicates a partial ordering on Inline graphic, defined by Inline graphic, for Inline graphic. In this condition, the pair Inline graphic is called Inline graphic.

To achieve monotonicity, partial derivatives need to satisfy:

graphic file with name DmEquation9.gif (6)

Given that Inline graphic, Equation (6) holds if and only if:

graphic file with name DmEquation10.gif

This is equivalent to the constraint [38]:

graphic file with name DmEquation11.gif

However, the comparability of network outputs under these constraints is guaranteed only when the input vectors are comparable. For incomparable input vectors, we cannot directly infer the size relationship of their corresponding outputs.
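A small numerical sketch may help make the weight-constrained network concrete. It assumes the constraint takes the common form of nonnegative weights combined with an increasing activation (here tanh), and enforces nonnegativity by taking absolute values of unconstrained parameters; the function name `monotone_net` and all parameter values are hypothetical.

```python
import math

def monotone_net(x, w_raw, b, v_raw, c):
    """One-hidden-layer network that is non-decreasing in every input.

    Effective weights are |w_raw| and |v_raw|, so each partial derivative
    is a sum of products of nonnegative terms (tanh is increasing).
    """
    hidden = [
        math.tanh(sum(abs(w) * xi for w, xi in zip(row, x)) + bias)
        for row, bias in zip(w_raw, b)
    ]
    return sum(abs(v) * h for v, h in zip(v_raw, hidden)) + c

# Arbitrary unconstrained parameters, including negative raw values.
w_raw = [[0.5, -1.0], [-0.3, 0.8]]
b = [0.1, -0.2]
v_raw = [-1.5, 0.7]

low = monotone_net([0.0, 1.0], w_raw, b, v_raw, 0.0)
high = monotone_net([0.5, 1.0], w_raw, b, v_raw, 0.0)  # comparable input: every entry >= the first
```

For two inputs that are not comparable, such as `[1.0, 0.0]` and `[0.0, 1.0]`, the construction gives no guarantee about the order of the outputs, which is the limitation noted above.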

In batch-effect correction tasks, cells often have high-dimensional gene expressions that are incomparable. To ensure the order-preserving feature, the following global monotonicity (increasing) was defined:

graphic file with name DmEquation12.gif

where Inline graphic represent different cells, Inline graphic represents gene, and Inline graphic is the batch-effect corrected expression matrix of Inline graphic.

To address this issue, we proposed a deep learning network illustrated in Fig. 1, where each gene is an input unit. We demonstrated that independence across units in feedforward networks is necessary for ensuring global monotonicity (i.e. each hidden layer node can only correspond to one input node and one output node, corresponding proof details were available in the Supplementary Materials).
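The role of independence across units can be illustrated numerically. The sketch below assumes that global monotonicity means the within-gene ordering of cells is unchanged by correction; the helper names `correct_gene_wise` and `is_order_preserving` are hypothetical, and the per-gene functions are arbitrary increasing maps rather than trained networks.

```python
def correct_gene_wise(X, gene_maps):
    """Apply one scalar function per gene (column) to a cells x genes matrix."""
    return [[f(v) for f, v in zip(gene_maps, row)] for row in X]

def is_order_preserving(X, Y):
    """True if, for every gene, X[i][g] <= X[j][g] implies Y[i][g] <= Y[j][g]."""
    n_cells, n_genes = len(X), len(X[0])
    for g in range(n_genes):
        for i in range(n_cells):
            for j in range(n_cells):
                if X[i][g] <= X[j][g] and not Y[i][g] <= Y[j][g]:
                    return False
    return True

X = [[0.0, 2.0], [1.0, 1.0], [3.0, 0.0]]  # 3 cells x 2 genes

# Independent increasing maps, one per gene: the order is preserved.
Y_independent = correct_gene_wise(X, [lambda v: 2 * v + 1, lambda v: v ** 3])

# Mixing gene 1 into gene 0, even with a positive weight, can break the order.
Y_mixed = [[row[0] + 3 * row[1], row[1]] for row in X]
```

The mixed example uses only nonnegative weights, so it is monotone in the partial-order sense, yet it still reorders cells within gene 0 because the cells are not comparable as vectors.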

We provided two options (global/partial) for the network. In the global option, the final batch-effect corrected expression matrix satisfies the global monotonic property. The input of the network is a normalized gene expression matrix, and the output is a batch-corrected matrix. The network consists of three sub-networks connected in series, incorporating a residual structure, with each sub-network employing the weight constraint described above to ensure monotonicity. Solid lines represent weights that are always active, while dashed lines indicate connections that are activated only when the corresponding input is zero. The dashed lines are asymmetrical: they apply only to genes with zero expression and use the expression levels of other genes for imputation.

In the partial option, the probability estimation matrix Inline graphic is introduced as an additional input to the network. This part of the input is fully connected to the middle layers, while the remaining structure is the same as in the global option.

Weighted maximum mean discrepancy

We employed a weighted MMD as the loss function to align the distributions of identical CTs across different batches. MMD [39, 40] measures the distance between two probability distributions Inline graphic, defined for a function class Inline graphic by:

graphic file with name DmEquation13.gif

If Inline graphic is a reproducing kernel Hilbert space with kernel Inline graphic, the MMD can be written as the distance between the mean embeddings of Inline graphic and Inline graphic:

graphic file with name DmEquation14.gif (7)

where Inline graphic. Equation (7) can be written as

graphic file with name DmEquation15.gif (8)

where Inline graphic and Inline graphic are independent, as are Inline graphic and Inline graphic. For a universal kernel Inline graphic, MMDInline graphic if and only if Inline graphic.

In practice, the distributions Inline graphic and Inline graphic are unknown, so we approximate the MMD using observed values. In our work, we let Inline graphic represent the reference batch A, and Inline graphic represent the (corrected) query batch B. We designed the weighted MMD based on the information obtained in the above steps to account for potential class imbalances:

graphic file with name DmEquation16.gif

where Inline graphic is a Gaussian kernel and:

graphic file with name DmEquation17.gif
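As a rough illustration of a weighted MMD with a Gaussian kernel, the sketch below computes a weighted (biased) estimate of the squared MMD between two samples. The per-cell weights here are generic placeholders normalized to sum to one within each sample; they are an assumption for illustration and are not the weighting scheme defined in the equations above. The function names are hypothetical.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def weighted_mmd2(X, Y, wx, wy, sigma=1.0):
    """Weighted biased estimate of squared MMD between samples X and Y."""
    # Normalize the (hypothetical) weights so that each sample has total mass 1.
    sx, sy = sum(wx), sum(wy)
    wx = [w / sx for w in wx]
    wy = [w / sy for w in wy]
    xx = sum(a * b * gaussian_kernel(p, q, sigma)
             for a, p in zip(wx, X) for b, q in zip(wx, X))
    yy = sum(a * b * gaussian_kernel(p, q, sigma)
             for a, p in zip(wy, Y) for b, q in zip(wy, Y))
    xy = sum(a * b * gaussian_kernel(p, q, sigma)
             for a, p in zip(wx, X) for b, q in zip(wy, Y))
    return xx + yy - 2.0 * xy

# Toy one-dimensional samples.
reference = [[0.0], [1.0]]
query_far = [[5.0], [6.0]]
same = weighted_mmd2(reference, reference, [1.0, 1.0], [1.0, 1.0])
far = weighted_mmd2(reference, query_far, [1.0, 1.0], [1.0, 3.0])
```

Up-weighting cells from under-represented groups is one way such weights could counteract class imbalance; in a training loop this quantity would be minimized with respect to the network parameters that produce the corrected query batch.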

Evaluation Metrics

To evaluate the effectiveness of various methods in correcting batch effects, we employed three evaluation metrics: ARI, ASW, and Local Inverse Simpson’s Index (LISI). These metrics quantify clustering quality, considering both batch mixing and CT purity.

ARI measures the agreement between the clustering result and a reference classification, adjusting for chance. It quantifies how well cells of the same type are grouped together after batch correction. ARI equals 1 for perfect agreement, is close to 0 for random labeling, and can be negative when the agreement is worse than expected by chance. The ARI formula is:

$$\mathrm{ARI} = \frac{\sum_{ij}\binom{n_{ij}}{2} - \left[\sum_{i}\binom{a_i}{2}\sum_{j}\binom{b_j}{2}\right]\Big/\binom{n}{2}}{\frac{1}{2}\left[\sum_{i}\binom{a_i}{2} + \sum_{j}\binom{b_j}{2}\right] - \left[\sum_{i}\binom{a_i}{2}\sum_{j}\binom{b_j}{2}\right]\Big/\binom{n}{2}}$$

where:

  • $n_{ij}$ is the number of cells in both cluster $i$ and reference group $j$,

  • $a_i$ is the number of cells in cluster $i$,

  • $b_j$ is the number of cells in reference group $j$,

  • $n$ is the total number of cells.
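The ARI can be computed directly from the contingency counts described above. The following is a minimal sketch using only the standard library; it assumes at least two cells, and the function name is hypothetical.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_pred, labels_true):
    """Adjusted Rand index from pair counts in the contingency table."""
    n = len(labels_true)
    contingency = Counter(zip(labels_pred, labels_true))
    sum_ij = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_pred).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_true).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        # Degenerate case, e.g. both partitions place all cells in one group.
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

# Same partition with renamed labels versus a maximally discordant partition.
ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
ari_cross = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The second example yields a negative value, showing that the index can fall below zero when the clustering agrees with the reference less than chance would predict.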

ASW evaluates the separation and compactness of clusters by measuring how similar each cell is to its assigned cluster compared to other clusters. For CT purity, ASW assesses how well cells of the same type cluster together, while for batch mixing, it evaluates the extent of mixing between batches within clusters. The silhouette width for cell $i$ is defined as:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

where:

  • $a(i)$ is the average distance of cell $i$ to all other cells in its cluster,

  • $b(i)$ is the lowest average distance of cell $i$ to cells in any other cluster.

The ASW is the average of $s(i)$ over cells. For CT purity, $a(i)$ and $b(i)$ are calculated based on the same and different CTs, respectively. For batch mixing, they are calculated based on cells from the same and different batches.
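The silhouette computation can be sketched with Euclidean distances as follows. This is an illustrative brute-force version that assumes at least two distinct labels; the function names are hypothetical, and cells that are alone in their group are given a width of 0 by convention.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def silhouette_width(points, labels, i):
    """Silhouette width of cell i given group labels (CTs or batches)."""
    own = [j for j in range(len(points)) if labels[j] == labels[i] and j != i]
    if not own:
        return 0.0  # singleton group
    a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
    b = min(
        sum(euclidean(points[i], points[j]) for j in members) / len(members)
        for other in set(labels) - {labels[i]}
        for members in [[j for j in range(len(points)) if labels[j] == other]]
    )
    denom = max(a, b)
    return 0.0 if denom == 0 else (b - a) / denom

def average_silhouette_width(points, labels):
    return sum(silhouette_width(points, labels, i) for i in range(len(points))) / len(points)

# Two well-separated groups on a line.
pts = [[0.0], [1.0], [10.0], [11.0]]
asw = average_silhouette_width(pts, [0, 0, 1, 1])
```

Passing CT labels measures CT purity, whereas passing batch labels measures batch separation, where values near zero suggest well-mixed batches.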

LISI measures the degree of batch mixing and CT homogeneity at the local level, providing insights into how well cells are integrated across batches while maintaining CT distinctions. Two variations are used: Batch LISI for batch mixing and CT LISI for CT purity. LISI is defined as:

$$\mathrm{LISI}(i) = \frac{1}{\sum_{k=1}^{K} p_k(i)^2}$$

where:

  • $K$ is the number of cell groups (either batches or CTs),

  • $p_k(i)$ is the proportion of the neighborhood for cell $i$ that belongs to group $k$.

Because the inverse Simpson's index equals the effective number of groups in a neighborhood, it ranges from 1 (a single group) to $K$ (all groups equally represented). Batch LISI assesses the extent of batch mixing within local neighborhoods, with higher values indicating better mixing. CT LISI measures how pure local neighborhoods are with respect to CTs, with values close to 1 indicating better CT separation.
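The inverse Simpson's index for a single neighborhood can be sketched as below. This is a simplification: it uses unweighted label proportions among a given list of neighbors, whereas the published LISI uses distance-based neighborhood weights with a fixed perplexity. The function name is hypothetical.

```python
from collections import Counter

def inverse_simpson(neighbor_labels):
    """Inverse Simpson's index of the group labels in one cell's neighborhood."""
    n = len(neighbor_labels)
    counts = Counter(neighbor_labels)
    return 1.0 / sum((c / n) ** 2 for c in counts.values())

# A neighborhood split evenly between two batches versus one from a single batch.
mixed = inverse_simpson(["batch1", "batch1", "batch2", "batch2"])
pure = inverse_simpson(["batch1"] * 4)
```

The value can be read as the effective number of groups present around a cell: an evenly split two-batch neighborhood gives 2, and a single-batch neighborhood gives 1.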

Clustering evaluation

To quantify the clustering performance of different methods, we designed the following evaluation procedure. First, a dimension reduction step was performed using UMAP for every dataset before and after batch-effect correction based on different methods.

For the ARI, we performed Louvain clustering on the UMAP embeddings of both the original and corrected data. To select an appropriate resolution parameter for Louvain, we used the UMAP embedding of the original data and evaluated resolutions from the set Inline graphic. The selection criterion was to maximize the grouping of cells of the same type from the same batch into the same cluster while ensuring that cells of the same type from different batches were assigned to different clusters, thereby reflecting the batch effect.

Key Points

  • We developed a batch-effect correction method with order-preserving feature.

  • Our method excels in multiple biological tasks, demonstrating superior performance in clustering and batch effect correction compared to existing methods.

  • Based on the order-preserving feature, our method can better retain original inter-gene correlation and differential expression information.

Supplementary Material

supplementary_materials5_10_bbaf247

Acknowledgements

During the preparation of this work the authors used ChatGPT in order to improve language and readability. After using this tool/service, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.

Contributor Information

Mingxuan Zhang, School of Mathematical Sciences, University of Science and Technology of China, Hefei, 230026 Anhui, China.

Yinglei Lai, School of Mathematical Sciences, University of Science and Technology of China, Hefei, 230026 Anhui, China; Department of Statistics, The George Washington University, Washington, DC 20052, United States.

Author contributions

M.Z. and Y.L. designed the research; M.Z. and Y.L. developed the methods; M.Z. and Y.L. contributed to the acquisition, analysis, and interpretation of the data; M.Z. drafted the manuscript; M.Z. and Y.L. revised the manuscript; and all authors read and approved the final manuscript.

Conflict of interest

The authors declare that the sponsors have no competing financial interests.

Funding

This work was partially supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (XDA0460300/XDA0460303) and the National Natural Science Foundation of China (T2350710230). YL was also partially supported by a start-up fund from the University of Science and Technology of China.

Data availability

We analyzed nine published scRNA-seq datasets and one simulated dataset. The published datasets are available through the accession numbers reported in the original articles.

  • (1) Dataset 1 consists of the mammary epithelial cell dataset from three independent studies [22–24].

  • (2) Dataset 2 consists of the human lung cell dataset [25] collected from three lung adenocarcinoma cell lines HCC827, H1975, and H2228 on three different platforms with CELseq2, 10x Chromium, and Drop-seq protocols, respectively, which can be downloaded from https://github.com/LuyiTian/sc_mixology.

  • (3) Dataset 3 is a subset of Dataset 2, with the data for cell type H1975 removed in two batches. In the third batch, data for this cell type is retained.

  • (4) Dataset 4 consists of the mouse embryonic stem cell dataset [26]. The transcriptome of 704 mouse embryonic stem cells was sequenced across three culture conditions (lif, 2i, and a2i), using the Fluidigm C1 microfluidics cell capture platform followed by Illumina sequencing, which can be downloaded from http://www.ebi.ac.uk/teichmann-srv/espresso.

  • (5) Dataset 5 consists of the mouse mammary gland datasets [41] processed on the Microwell-seq platform from the Mouse Cell Atlas project, which can be downloaded from https://fgshare.com/articles/dataset/MCA_DGE_Data/5435866.

  • (6) Dataset 6 from GEO accession GSE80171 consists of human blood dendritic cells’ (DCs) scRNA-seq data [42], generated by the same technology and coming from the same tissue. It is composed of two batches containing four different cell types, which can be downloaded from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE94820.

  • (7) Dataset 7 is composed of three batches, where batch 1 contains only 293T cells, batch 2 contains only Jurkat cells, and batch 3 consists of a 50/50 mixture of Jurkat and 293T cells [43, 44], which can be downloaded from http://scanorama.csail.mit.edu/data.tar.gz.

  • (8) Dataset 8 was constructed using human pancreatic data from two sources [45, 46]. The resulting dataset consists of celseq2 batch (accession GSE85241) and smartseq2 batch (accession E-MTAB-5061) with 15 different cell types, which can be downloaded from https://hemberg-lab.github.io/scRNA.seq.datasets/human/pancreas/.

  • (9) Dataset 9 is a failing human heart dataset [47]. Based on different technologies, we selected a total of 39 682 cells from female subjects under the condition of DCM, which include 14 different cell types.

  • (10) Dataset 10 is a simulated count data using the Splatter package [48]. The set contains two batches with unbalanced numbers of cells.

The details of the datasets used can be found in Supplementary Materials Table S3. The normalized datasets in this paper are available via https://github.com/MingxuanZhangUSTC/Order-preserving-correction.git.

Code availability

Our method is implemented in Python based on the PyTorch framework and is available via https://github.com/MingxuanZhangUSTC/Order-preserving-correction.git.

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

References

  • 1. Nguyen Q, Pervolarakis N, Blake K. et al. Profiling human breast epithelial cells using single cell RNA sequencing identifies cell diversity. Nat Commun 2018;9:2028. doi: 10.1038/s41467-018-04334-1
  • 2. Matsumoto H, Kiryu H, Furusawa C. et al. SCODE: an efficient regulatory network inference algorithm from single-cell RNA-seq during differentiation. Bioinformatics 2017;33:2314–21. doi: 10.1093/bioinformatics/btx194
  • 3. Hicks SC, Townes FW, Teng M. et al. Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics 2018;19:562–78. doi: 10.1093/biostatistics/kxx053
  • 4. Lähnemann D, Köster J, Szczurek E. et al. Eleven grand challenges in single-cell data science. Genome Biol 2020;21:1–35. doi: 10.1186/s13059-020-1926-6
  • 5. Leek JT, Scharpf RB, Bravo HC. et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet 2010;11:733–9. doi: 10.1038/nrg2825
  • 6. Zhang X, Ye Z, Chen J. et al. AMDBNorm: an approach based on distribution adjustment to eliminate batch effects of gene expression data. Brief Bioinform 2022;23:528. doi: 10.1093/bib/bbab528
  • 7. Smyth GK, Speed T. Normalization of cDNA microarray data. Methods 2003;31:265–73. doi: 10.1016/S1046-2023(03)00155-5
  • 8. Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 2007;8:118–27. doi: 10.1093/biostatistics/kxj037
  • 9. Stuart T, Butler A, Hoffman P. et al. Comprehensive integration of single-cell data. Cell 2019;177:1888–1902.e21. doi: 10.1016/j.cell.2019.05.031
  • 10. Korsunsky I, Millard N, Fan J. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat Methods 2019;16:1289–96. doi: 10.1038/s41592-019-0619-0
  • 11. Shaham U, Stanton KP, Zhao J. et al. Removal of batch effects using distribution-matching residual networks. Bioinformatics 2017;33:2539–46. doi: 10.1093/bioinformatics/btx196
  • 12. Welch JD, Kozareva V, Ferreira A. et al. Single-cell multi-omic integration compares and contrasts features of brain cell identity. Cell 2019;177:1873–1887.e17. doi: 10.1016/j.cell.2019.05.006
  • 13. Lopez R, Regier J, Cole MB. et al. Deep generative modeling for single-cell transcriptomics. Nat Methods 2018;15:1053–8. doi: 10.1038/s41592-018-0229-2
  • 14. Kingma DP, Welling M. Auto-encoding variational Bayes. In: Bengio Y, LeCun Y (eds.), Proceedings of the 2nd International Conference on Learning Representations (ICLR 2014). arXiv:1312.6114.
  • 15. Yu X, Xu X, Zhang J. et al. Batch alignment of single-cell transcriptomics data using deep metric learning. Nat Commun 2023;14:960. doi: 10.1038/s41467-023-36635-5
  • 16. Li X, Wang K, Lyu Y. et al. Deep learning enables accurate clustering with batch effect removal in single-cell RNA-seq analysis. Nat Commun 2020;11:2338. doi: 10.1038/s41467-020-15851-3
  • 17. Haghverdi L, Lun AT, Morgan MD. et al. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat Biotechnol 2018;36:421–7. doi: 10.1038/nbt.4091
  • 18. Wang Y, Liu T, Zhao H. ResPAN: a powerful batch correction model for scRNA-seq data through residual adversarial networks. Bioinformatics 2022;38:3942–9. doi: 10.1093/bioinformatics/btac427
  • 19. Hubert L, Arabie P. Comparing partitions. J Classif 1985;2:193–218. doi: 10.1007/BF01908075
  • 20. Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 1987;20:53–65. doi: 10.1016/0377-0427(87)90125-7
  • 21. Zhang B, Horvath S. A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol 2005;4:i–43. doi: 10.2202/1544-6115.1128
  • 22. Bach K, Pensa S, Grzelak M. et al. Differentiation dynamics of mammary epithelial cells revealed by single-cell RNA sequencing. Nat Commun 2017;8:1–11. doi: 10.1038/s41467-017-02001-5
  • 23. Pal B, Chen Y, Vaillant F. et al. Construction of developmental lineage relationships in the mouse mammary gland by single-cell RNA profiling. Nat Commun 2017;8:1627. doi: 10.1038/s41467-017-01560-x
  • 24. Giraddi RR, Chung C-Y, Heinz RE. et al. Single-cell transcriptomes distinguish stem cell state changes and lineage specification programs in early mammary gland development. Cell Rep 2018;24:1653–1666.e7. doi: 10.1016/j.celrep.2018.07.025
  • 25. Tian L, Dong X, Freytag S. et al. Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments. Nat Methods 2019;16:479–87. doi: 10.1038/s41592-019-0425-8
  • 26. Kolodziejczyk AA, Kim JK, Tsang JC. et al. Single cell RNA-sequencing of pluripotent states unlocks modular transcriptional variation. Cell Stem Cell 2015;17:471–85. doi: 10.1016/j.stem.2015.09.011
  • 27. Hawkins RD, Hon GC, Ren B. Next-generation genomics: an integrative approach. Nat Rev Genet 2010;11:476–86. doi: 10.1038/nrg2795
  • 28. Lee TI, Young RA. Transcriptional regulation and its misregulation in disease. Cell 2013;152:1237–51. doi: 10.1016/j.cell.2013.02.014
  • 29. Lai Y, Eckenrode SE, She J-X. A statistical framework for integrating two microarray data sets in differential expression analysis. BMC Bioinf 2009;10:1–11. doi: 10.1186/1471-2105-10-S1-S23
  • 30. Lai Y, Zhang F, Nayak TK. et al. An efficient concordant integrative analysis of multiple large-scale two-sample expression data sets. Bioinformatics 2017;33:3852–60. doi: 10.1093/bioinformatics/btx061
  • 31. Traag VA, Waltman L, Van Eck NJ. From Louvain to Leiden: guaranteeing well-connected communities. Sci Rep 2019;9:1–12. doi: 10.1038/s41598-019-41695-z
  • 32. Reynolds DA. Gaussian mixture models. In: Li SZ, Jain A (eds.), Encyclopedia of Biometrics. New York: Springer, 2009, 659–63. doi: 10.1007/978-0-387-73003-5_196
  • 33. Blondel VD, Guillaume J-L, Lambiotte R. et al. Fast unfolding of communities in large networks. J Stat Mech: Theory Exp 2008;2008:10008.
  • 34. Schwarz G. Estimating the dimension of a model. Ann Stat 1978;6:461–4. doi: 10.1214/aos/1176344136
  • 35. Polański K, Young MD, Miao Z. et al. BBKNN: fast batch alignment of single cell transcriptomes. Bioinformatics 2020;36:964–5. doi: 10.1093/bioinformatics/btz625
  • 36. Daniels H, Velikova M. Monotone and partially monotone neural networks. IEEE Trans Neural Netw 2010;21:906–17. doi: 10.1109/TNN.2010.2044803
  • 37. You S, Ding D, Canini K. et al. Deep lattice networks and partial monotonic functions. In: Guyon I, von Luxburg U, Bengio S et al. (eds.), Advances in Neural Information Processing Systems 30 (NIPS 2017). NeurIPS Foundation, 2017, 2981–89.
  • 38. Kay H, Ungar LH. Estimating monotonic functions and their bounds. AIChE J 2000;46:2426–34. doi: 10.1002/aic.690461211
  • 39. Gretton A, Borgwardt K, Rasch M. et al. A kernel method for the two-sample problem. In: Schölkopf B, Platt JC, Hofmann T (eds.), Advances in Neural Information Processing Systems 19. Cambridge, MA: MIT Press, 2006, 673–80. doi: 10.7551/mitpress/7503.003.0069
  • 40. Gretton A, Borgwardt KM, Rasch MJ. et al. A kernel two-sample test. J Mach Learn Res 2012;13:723–73.
  • 41. Han X, Wang R, Zhou Y. et al. Mapping the mouse cell atlas by Microwell-seq. Cell 2018;172:1091–1107.e17. doi: 10.1016/j.cell.2018.02.001
  • 42. Villani A-C, Satija R, Reynolds G. et al. Single-cell RNA-seq reveals new types of human blood dendritic cells, monocytes, and progenitors. Science 2017;356:4573. doi: 10.1126/science.aah4573
  • 43. Zheng GX, Terry JM, Belgrader P. et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun 2017;8:14049. doi: 10.1038/ncomms14049
  • 44. Hie B, Bryson B, Berger B. Efficient integration of heterogeneous single-cell transcriptomes using Scanorama. Nat Biotechnol 2019;37:685–91. doi: 10.1038/s41587-019-0113-3
  • 45. Muraro M, Dharmadhikari G, Grün D. et al. A single-cell transcriptome atlas of the human pancreas. Cell Syst 2016;3:385–394.e3. doi: 10.1016/j.cels.2016.09.002
  • 46. Segerstolpe Å, Palasantza A, Eliasson P. et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metab 2016;24:593–607. doi: 10.1016/j.cmet.2016.08.020
  • 47. Koenig AL, Shchukina I, Amrute J. et al. Single-cell transcriptomics reveals cell-type-specific diversification in human heart failure. Nat Cardiovasc Res 2022;1:263–80. doi: 10.1038/s44161-022-00028-6
  • 48. Zappia L, Phipson B, Oshlack A. Splatter: simulation of single-cell RNA sequencing data. Genome Biol 2017;18:174. doi: 10.1186/s13059-017-1305-0

Articles from Briefings in Bioinformatics are provided here courtesy of Oxford University Press
