Abstract
In many biological applications of single-cell RNA sequencing (scRNA-seq), an integrated analysis of data from multiple batches or studies is necessary. Current methods typically achieve integration using shared cell types or covariance correlation between datasets, which can distort biological signals. Here we introduce an algorithm that uses the gene eigenvectors from a reference dataset to establish a global frame for integration. Using simulated and real datasets, we demonstrate that this approach, called Reference Principal Component Integration (RPCI), consistently outperforms other methods by multiple metrics, with clear advantages in preserving genuine cross-sample gene expression differences in matching cell types, such as those present in cells at distinct developmental stages or in perturbed versus control studies. Moreover, RPCI maintains this robust performance when multiple datasets are integrated. Finally, we applied RPCI to scRNA-seq data for mouse gut endoderm development and revealed temporal emergence of genetic programs helping establish the anterior-posterior axis in visceral endoderm.
ScRNA-seq has become an essential technology for resolving gene expression heterogeneity in single cells1–4 and has been widely used in many biological domains5–9. It remains challenging to integrate scRNA-seq datasets with inter-sample heterogeneity, for example, different cell subpopulation compositions among datasets or gene expression differences in the same cell types across datasets. In the past few years, many strategies have been developed10–21, including those based on canonical correlation analysis (CCA) to project cells from all datasets into a CCA space and others using shared cell types to learn an integrated space that removes batch differences. The CCA algorithm assumes that shared cell subpopulations have very similar gene expression across datasets (referred to hereafter as inter-sample ‘homogeneous data’; for example, replicated experiments). Under this scenario, the methods can integrate homogeneous data by maximizing cross covariance. Alternative approaches, such as Scanorama12, that explicitly leverage shared cell types have become more popular because they better handle datasets with some degree of inter-sample heterogeneity. All these methods perform reasonably well in integrating a pair of datasets, but their performances vary22.
The lack of a consistent global reference space for projecting all cells is likely a major limitation when these methods are applied to integrate heterogeneous data from multiple experiments or batches. For example, the CCA-based methods integrate datasets sequentially, and the hyperplane in each round of integration reduces the cell dissimilarity signals by seeking the maximal cross-covariance between datasets23. The use of information contained in shared cell types partially mitigates this limitation, but the integration methods starting from shared cell types, while intuitive, also have their own disadvantages: inappropriate shared cells might be chosen for different datasets, and, in particular, inter-sample heterogeneity (such as that existing in drug-treated samples with changes in cell states or even types from controls) might be distorted by over-integration (Fig. 1). To overcome this problem, we developed a novel algorithm, called RPCI. Unlike the classical principal component regression algorithm24 (see ‘Integration algorithms’ in Methods), RPCI introduces a new effective formula to calibrate cell similarity by a global reference gene eigenvector. It is an assumption-free method as it makes no presumption on inter-sample similarity and does not rely on shared cell types. Benchmarking it against 11 other integration approaches using simulated and real scRNA-seq datasets, we showed that RPCI outperformed them under various scenarios, especially in correctly detecting cell groups affected specifically by a gene knockout or identifying correct cell type relationships during ontogenesis. To make RPCI easily accessible, we implemented it in an R package (called ‘RISC’, for ‘robust integration of scRNA-seq data’) with other essential functions for scRNA-seq analysis.
Fig. 1 |. Schematic illustration of data integration.
a, Three heterogeneous datasets. Each contains three groups (a, b/b′, c/c′ and d) of cells, with only batch difference in the group a among the three datasets, but, in addition to batch effects, relatively small or large gene expression differences exist in b/b′ or c/c′, respectively, marked by different shapes. b, CCA strategy. It first integrates X and Y by projecting their cells into a maximal covariance hyperplane (‘CCA space 1’) to yield a new integrated data matrix N by XᵀY and then integrates the third dataset (Z) to form a second hyperplane (‘CCA space 2’) through cross-covariance of Z and N. c, Shared cell-type-based strategy. The integration uses shared cell types to define and correct batch effects and then calibrates the relationships from the non-shared cell types to the shared cell types with adjustment of batch correction among datasets, often resulting in non-linear integration. d,e, RPCI strategy. It uses the gene eigenvectors (Ur, for example, Ux or its top-ranked eigenvectors) of a reference dataset to decompose individual datasets independently. Red arrows point to the cell groups (b/b′ and c/c′) with true gene expression difference that is maintained better in RPCI than in other strategies.
Results
RPCI, a new framework for scRNA-seq data integration.
We illustrate the general strategy in scRNA-seq data integration with cartoons (Fig. 1) and simulated data (Fig. 2) containing three batches (or sets) of heterogeneous datasets from cell groups with distinct transcriptomes25, represented as X, Y and Z (Fig. 1a). For strategies like CCA10, all cells in X and Y are integrated into a hyperplane (N) defined by maximal cross covariance, and then the hyperplane will be further integrated into a second hyperplane from maximizing cross covariance of N and Z (Fig. 1b and Methods). As such, Z is not directly compared to X or Y, and, moreover, distortions can accumulate from each of the stepwise integrations. This limitation was reported previously12,21, and most current software has opted to use cell types shared across datasets to guide batch corrections11–21. Typically, these methods first identify common cell type(s) (for example, cell group a; note that we use cell groups because one cell type might have multiple transcriptomic states like b/b′ and c/c′) across datasets, use them to calculate batch correction factors (Bi, the i-th integration) and then apply Bi to merge all datasets (Fig. 1c). This has two caveats: (i) some software might select shared cell groups inappropriately (for example, choose either a or b but not both for X and Y), leading to erroneous batch corrections, and (ii) the method may not be able to correctly adjust the batch effects for non-shared cell groups. The calibration of inter-sample difference in related but not shared cell types (that is, the dissimilarities between b and b′ or between c and c′) will be determined by two factors: batch corrections (B) and weighted distances (W) to the shared cells. For instance, one such dissimilarity is based on Ba,XZ and the corresponding weighted distance, whereas another is based on Bab,XY and its corresponding weighted distance. Because the combined terms can lead to non-linear transformation, the difference in b/b′ and c/c′ is not guaranteed to be quantitatively comparable for integration.
This lack of a global standard might lead to increased distortions when three or more datasets are integrated (see Methods and Results below).
Fig. 2 |. Performance of RPCI in integrating simulated heterogeneous datasets.
a, Three sets of scRNA-seq data simulated by the ‘Symsim’ software. The left and right panels show the relationship of the cell groups as a dendrogram tree and relatively shifting positions, respectively. b, Gene expression variance of cell groups. The left panel plots the standard deviations (y axis) and average values (x axis) of gene expression of one cell group. The right panel shows the cell–cell correlations between cell groups based on gene expression variance. c, Pairwise data integration by RPCI. The top and bottom plots color cells by sets and groups, respectively. d, Top three PCs of the RPCI-integrated data retaining the expression distinction among cell groups. e, UMAP based on the integrated cell eigenvectors demonstrates that the RPCI integration correctly merged a across the three sets while preserving the difference in b/b′ and c/c′. The inserted dashed plot shows the correct merging of batches.
To address these problems, we introduce RPCI, which exploits the idea that gene eigenvectors, derived from gene expression variance, are expected to match or be the same in one cell type across datasets. As such, when decomposing two or more gene cell matrices by reference gene eigenvectors (that is, the left-singular vectors), common cell types will have matched cell eigenvectors (that is, the right-singular vectors), whereas distinct cell types will not. Thus, RPCI can use a global reference gene eigenvector to decompose all the datasets (Fig. 1d) and project all cells into this RPCI space (Fig. 1e). More importantly, the cell eigenvectors obtained for X, Y and Z capture the information of gene expression difference among UX, UY or UZ, and the linear transformation also ensures that the dissimilarities in cell groups (for example, b/b′ or c/c′) are quantitatively maintained in the integrated data (Fig. 1e, red arrows). In Methods (see ‘RPCI and data integration’), we provide the computational framework to support the validity of the RPCI strategy. Below, we demonstrate the performance of RPCI and benchmark it against top-performing tools in the field using both simulated and real scRNA-seq datasets.
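The core idea described above, decomposing every dataset with the gene eigenvectors of one reference, can be illustrated with a minimal NumPy sketch. This is not the RISC implementation or the RPCI calibration formula from Methods: the function name `reference_projection`, the per-dataset gene centering and the default of two components are all assumptions made for illustration, and Python is used here only for convenience (the released package is in R).

```python
import numpy as np

def reference_projection(reference, datasets, n_components=2):
    """Project genes-by-cells matrices onto reference gene eigenvectors.

    Hypothetical helper for illustration only; not part of RISC.
    """
    # Center each gene in the reference so the SVD reflects expression variance.
    ref_centered = reference - reference.mean(axis=1, keepdims=True)
    # Left-singular vectors of the reference are its gene eigenvectors.
    u, _, _ = np.linalg.svd(ref_centered, full_matrices=False)
    u_ref = u[:, :n_components]  # genes x k
    embeddings = []
    for data in datasets:
        # Assumption: center each dataset per gene before projecting.
        centered = data - data.mean(axis=1, keepdims=True)
        # The same linear map is applied to every dataset: k x cells.
        embeddings.append(u_ref.T @ centered)
    return u_ref, embeddings
```

Because every dataset passes through the same linear map `u_ref.T`, distances between cell groups are transformed identically across datasets, which is the property the text relies on for keeping b/b′ and c/c′ differences comparable.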
Benchmark of RPCI and other strategies using simulated data.
We first tested on homogeneous datasets from simulation26 and compared RPCI with Anchor, fastMNN, Scanorama, Harmony and other integration approaches10–21. By visualizing the integration results using uniform manifold approximation and projection (UMAP) or quantifying the integrations using a set of four metrics (kBET, LISI, ARI and SW13,22,27–29), we demonstrated that RPCI had the same or better performance, leading to correct removal of batch effects (Supplementary Fig. 1).
To show the key difference between RPCI and other tools, we turned to heterogeneous datasets (Supplementary Fig. 2a,b) generated by the ‘Symsim’ software25, which controlled the degrees of gene expression difference in two of the four cell groups (Fig. 2a): <1% differentially expressed (DE) genes in group a across sets, ~17% DE genes between groups b and b′ and ~26% DE genes between groups c and c′ (b/b′ and c/c′, hereafter referred to as ‘heterogeneous groups’). Note that d is a rare cell group with 20 cells. A good method not only needs to combine the homogeneous cell groups (a) and separate the rare cell group d but also should successfully retain the more subtle separation of the heterogeneous groups, b/b′ and c/c′, that arises from their gene expression difference.
To illustrate the fundamental principle of RPCI, we first looked at cell group correlations computed from gene expression variance and raw counts of the uncorrected data (Fig. 2b and Supplementary Fig. 2c). Both correlations captured the original cell group relationships in Supplementary Fig. 2a, with batch effects visible, indicating that gene expression variance contains the biological signals distinguishing cell groups. Using the gene expression variance of data Set 1 to build reference gene eigenvectors (the first ten eigenvectors), we decomposed individual datasets and projected all cells into the RPCI space. As shown in Fig. 2c–e, when combining Set 1 with Set 2 or Set 3 or all three sets together, the resulting cell group relationships were always correctly maintained, and the differences in b/b′ or c/c′ were preserved accurately, as visualized by principal component (PC) and UMAP plots.
Next, we integrated these three sets by 11 other tools and compared the performances by four benchmarking metrics (Supplementary Fig. 2d,e). The kBET scores, which mainly estimate the batch correction for shared cell groups29, indicated that RPCI was among the top performers, with scores greater than 0.85. The scores from LISI, ARI and SW measure both batch removal and correct cell type separation13,27,28. The high LISI, ARI and SW cell type scores of RPCI (>0.9) indicated that the RPCI-integrated data maintained the correct relationships of all the cell groups and supported that RPCI outperformed other tools. RPCI also received the top batch removal scores in these three metrics (>0.80). Examining the results of other tools, we found that the machine learning-based tools (such as scGen and BERMUDA) paid more attention to batch removal but overlooked cell group dissimilarity; by contrast, the tools designed for correcting eigenvectors (for example, Harmony and BEER) appeared to be less powerful in handling batch effects.
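Of the four metrics named above, ARI has a compact closed form, so a small sketch may help readers see what a cell type score of this kind measures. The code below implements the standard adjusted Rand index between two labelings (for example, known cell groups versus clusters found after integration); it is not the benchmarking code used in this study, and how the study converts ARI into separate cell type and batch scores is not reproduced here. The function name is an assumption.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Standard adjusted Rand index between two labelings of the same cells."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no cell pairs to compare
    # Contingency counts: cells sharing both a true label and a predicted label.
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)
    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())
    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_pairs - expected) / (max_index - expected)
```

A value of 1 means the two labelings agree up to renaming of the labels, and values near 0 mean agreement no better than chance.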
We analyzed closely how different tools used and modified the gene expression variance in integration. First, we changed the number of eigenvectors used for integration. The results suggested that RPCI could consistently produce correct integration with different numbers of top eigenvectors before being overwhelmed by noise, but other tools were less stable (Supplementary Fig. 3a), with the results also supported by t-distributed stochastic neighbor embedding (t-SNE) visualization (Supplementary Fig. 3b). One can potentially plot kBET or other integration metrics using different numbers of eigenvectors to study the stable performance of RPCI and choose the optimal number for integration (Supplementary Fig. 3c). Next, we compared the gene variance before and after integration. Compared with other tools, RPCI was the best at maintaining the variance (Supplementary Fig. 4a–c); the correlation coefficient was 0.90, and the regression slope reached 0.99, indicating the least distortion. This held when different data normalization algorithms were used, as RPCI had the highest R2 values. Lastly, we examined how many preset DE genes remained significant after integration and whether fold changes (FCs) in b/b′ or c/c′ comparisons were distorted. Based on conservative thresholds for detecting DE genes (Methods), the type I/type II errors (false positive/negative) from the RPCI-integrated data are presented in Supplementary Fig. 4d. The correlations of FCs in gene expression in b/b′ or c/c′ were 0.87 and 0.96 (Supplementary Fig. 4e), respectively, much higher than those observed for most other tools (>10% better than BEER and >20% better than others; Supplementary Fig. 4f). Note that not all integration tools output batch-corrected gene expression values, which are needed for this DE analysis.
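The variance comparison described above (a correlation coefficient and a regression slope relating gene variance before and after integration) can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, the regression is taken as post-integration variance on pre-integration variance, and matrices are plain lists with genes as rows and cells as columns. The exact procedure behind Supplementary Fig. 4 may differ.

```python
from statistics import mean, variance

def gene_variances(matrix):
    """Sample variance of each gene (row) across cells (columns)."""
    return [variance(row) for row in matrix]

def slope_and_correlation(before, after):
    """Least-squares slope of `after` on `before`, plus Pearson correlation.

    Assumes neither input is constant (otherwise a division by zero occurs).
    """
    mb, ma = mean(before), mean(after)
    sbb = sum((b - mb) ** 2 for b in before)
    saa = sum((a - ma) ** 2 for a in after)
    sba = sum((b - mb) * (a - ma) for b, a in zip(before, after))
    return sba / sbb, sba / (sbb * saa) ** 0.5
```

Under this reading, a slope near 1 together with a high correlation means that integration left the per-gene variance largely unchanged, whereas a slope well below 1 indicates that variance was compressed.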
The above analyses provide strong support for RPCI’s top performance, but we needed to further evaluate RPCI integration with datasets of more biological complexity. We took scRNA-seq data from peripheral blood mononuclear cells (PBMCs) (GSE94820)30 to construct a series of testing datasets, mimicking various experimental conditions and perturbations (Fig. 3a), starting from a base dataset (Set 0) with four cell types (see simulated data in Methods). The resultant cell type relationships across these semi-simulated datasets were shown by the correlations of gene expression variance and raw counts (Supplementary Fig. 5a), with the variance correlation clearly marking batches and biological signals between cell types from individual datasets. When integrating all six datasets or in a pairwise setting, using gene eigenvectors from Set 0 as the reference, we found that RPCI robustly removed batch effects (in CD16+, CD14+ Mono, DoubleNeg and pDC cells), correctly identified the rare cell type in Set 3 and accurately preserved intra-sample heterogeneities in CD1C and CD141 cells, respectively (Fig. 3b,c and Supplementary Fig. 5b). The expression patterns of the cell-type-specific marker genes support that RPCI integration appropriately aligned gene expression for individual cell types (Supplementary Fig. 5c).
Fig. 3 |. RPCI integration of semi-simulated datasets with diverse cell types and multiple perturbations.
a, A scheme illustrates the relationship of six datasets derived from published scRNA-seq data. We took a PBMC scRNA-seq dataset (Set 0) and added batch effects to all or specific cell types or introduced gene expression difference to a selected cell type (CD141) to generate five ‘simulated’ datasets, along with a second PBMC dataset from other developmental stages (Set 3). b, c, RPCI integration results of the six datasets in various configurations, using the unmodified Set 0 as the reference. UMAP plots show the merged cells before (b) and after (c) RPCI integration. d, Four metrics summarize the performance comparisons of RPCI with 11 other published tools in their integrations of all six datasets, with additional results in Supplementary Figs. 5 and 6 and Supplementary Table 1. In all LISI, ARI and SW plots, the x axis is cell type score, and the y axis is batch score.
We repeated our integration test of the six datasets using other tools, in the same input order of datasets and configurations. Both UMAP visualization of the integrated data (Supplementary Fig. 6a,b) and the benchmark metrics (Fig. 3d) indicated that these tools had various levels of success but, overall, performed less robustly than RPCI. RPCI was ranked in the top two in correcting batches as determined by the kBET score and always showed the optimal performance in distinguishing all the cell types as evaluated (Fig. 3d). As shown in Supplementary Fig. 6b, some tools either did not obtain complete integration (pDC in Scanorama and BEER; CD16+ Mono in Anchor and Harmony) or mixed some cell types (MMD-ResNet and Liger) or were totally unable to integrate these datasets (BERMUDA and ZinbWave). In addition, tools that showed good performance in batch correction did not sufficiently identify inter-sample heterogeneity, similar to what was shown above (Supplementary Fig. 2d,e). Moreover, as we expected from the lack of a global reference, most tools displayed better performance in integrating pairwise datasets than in integrating multiple datasets (Supplementary Table 1 and Fig. 3d). Taken together, these results strongly demonstrate the important advantages of RPCI in integrating multiple datasets from complicated conditions.
Performance of RPCI on real heterogeneous scRNA-seq data.
We next tested the performance of RPCI on published data with cross-sample difference introduced by genetic perturbation. We selected a dataset (GSE118545) from a single-nucleus RNA sequencing (snRNA-seq) platform that consisted of cells derived from wild-type (WT) and estrogen-related receptor α/γ knockout (referred to as ERR KO) mouse hearts31. The availability of biological triplicates (six samples and ~20,600 nuclei in total) also allowed us to address reproducibility. The WT and KO difference could be observed in the 4th PC of the cell eigenvectors from the data before integration, whereas the cell type information was in PCs 2–3 (Fig. 4a). The data qualities were evaluated and adjusted by preliminary data processing, and the sample relationship could be correctly established by gene expression similarity (Supplementary Fig. 7a,b). The cell types could be identified by the known marker genes provided by the authors, and the cell type relationship was confirmed by our correlation analysis of gene expression variance and raw counts (Supplementary Fig. 7c–e).
Fig. 4 |. Performance of RPCI in integrating WT and ERR KO snRNA-seq datasets.
a, 3D PC plots show the separation of cells in the raw data by genotypes (left, WT in blue and ERR KO in red) and cell types (right), indicating that batch effects are mainly reflected in PC 4, whereas cell type information is reflected in PCs 2–3 of the cell eigen matrix. The three replicates (WT-1/2/3 and KO-1/2/3) are not distinguished in these two plots. b, UMAP plots show the integrated data from eight different tools, with cells in the top and bottom panels colored by genotypes and cell types, respectively. For comparison and by subpopulation analysis, the same cell types (for example, dCM) with gene expression difference between WT and ERR KO are named and colored differently (for example, WT-dCM and KO-dCM). Here, the identification of cell types was based on the marker genes provided in the original study (expression patterns in Supplementary Fig. 7e). c, UMAP plots show the RPCI integration results from integration of one pair (WT-1 and KO-1) and two pairs of the three replicated datasets (WT-1, WT-2, KO-1 and KO-2).
We performed integration on the WT and ERR KO datasets by RPCI and other tools, with the same inputs and sample order. The results indicated that only RPCI integration showed a correct distinction between WT and ERR KO cardiomyocytes (CMs) (Fig. 4b), which is consistent with Hu et al.’s finding that developing and mature CMs (dCMs and mitoCMs) were the cell subpopulations most affected by ERR KO31. Note that this distinction can be visualized better with stereo three-dimensional (3D) UMAP32. The performance difference was shown in quantitative benchmark metrics: RPCI had the top scores in LISI, ARI and SW (Supplementary Fig. 7f). The expression patterns of cell type marker genes in the RPCI data further supported the correct integration (Supplementary Fig. 7e). We noticed some differences in data quality of the six replicates (Supplementary Fig. 7b), which provided an opportunity to study how data quality affected integration. We treated the WT and ERR KO datasets as pairs and integrated them sequentially (Fig. 4b,c and Supplementary Fig. 8a,b). Based on the metric scores in a series of pair integrations, the outperformance of RPCI was prominent and robust (Supplementary Figs. 7f and 8c). A large number of DE genes between the WT and KO CM subpopulations were also found in the RPCI data (Supplementary Fig. 8d,e), concordant with the original report31. Pathway enrichment analysis of the DE genes found pathways (for example, cardiac muscle contraction) known to be affected by ERR KO and additional new ones (Supplementary Fig. 8f). The DE analysis, thus, further supported the difference in WT and ERR KO CMs.
Application of RPCI to developmental single-cell data.
Another common task in scRNA-seq data integration is to analyze datasets from multiple developmental stages, often leading to trajectory analysis. This kind of integration is quite challenging because the shared cell types among datasets could be very limited. To address this, we applied RPCI to integrate scRNA-seq datasets studying the transcriptomic dynamics in the mouse embryonic endoderm with 17 datasets from embryonic day (E) 3.5 to E7.5 (GSE123046)33. Our application of RPCI showed that it successfully integrated these datasets from multiple time points and replicated library samples (Fig. 5a and Supplementary Fig. 9a,b) and distinguished cell groups related to major embryonic development: 1) visceral endoderm (VE), 2) epiblast (EPI) and 3) extra-embryonic ectoderm (ExE) (Fig. 5a). More importantly, examining the integrated VE cells using the time point (E4.5-E7.5) and cell type annotation (VE, visceral endoderm; exVE, extra-embryonic VE; emVE, embryonic VE) from the original publication, we found that RPCI aligned cells along the developmental time points correctly (Fig. 5b) while distinguishing the different cell types properly (Fig. 5c). We further investigated the EPI/ExE cells of the integrated data and identified 20 cell subpopulations (Supplementary Fig. 9c). Expression patterns of the cell type markers in these VE and EPI/ExE cells showed that they represented distinct subpopulations, further supporting our integration result (Supplementary Fig. 9d,e).
Fig. 5 |. Maintenance of correct cell type relationship in endoderm developmental trajectory in RPCI integration.
a, 3D UMAP plot of the mouse embryonic endoderm cells in the RPCI-integrated data (from E3.5 to E7.5), with annotations for EPI, ExE and VE. b, c, Cells in the VE developmental trajectory as revealed by RPCI integration. The VE cells in 3D UMAP were annotated with time points (b) or cell types (c). The b′ and c′ are 90° horizontal rotation views of b and c, respectively. d, Putative anterior-posterior VE (AVE and PVE) at E6.5 and E7.5. The top UMAP shows VE cells at different times, with the colors highlighting the AVE/PVE cells. The bottom panel shows a rotation view of the AVE/PVE cells. e, Heat map shows some important transcription factors with expression trends along AVE/PVE development stages, from E4.5 to E7.5.
One of the many important findings in the original study by Nowotschin et al. is that the spatial patterning along the anterior–posterior axis of gut endoderm could actually be observed in the single-cell data, even at E7.5, with a small fraction of the definitive endoderm and visceral endoderm cells primed toward anterior and posterior fate, respectively33. Not only did RPCI successfully identify E7.5 anterior VE as a distinct cell group, as in the original study, but also the RPCI-integrated data recognized four previously unreported emVE subpopulations that express known markers of either anterior or posterior endoderm cells33–36 (Supplementary Fig. 9f). We, thus, referred to them as putative anterior (AVE) and posterior (PVE) cells (Fig. 5d). The result strongly suggests that the spatial axis was detectable in VE cells not only at E7.5 but also at E6.5. We further analyzed the DE genes between AVE and PVE in either E6.5 or E7.5. Among them, we found transcription factors for the different VE groups (Fig. 5e), which potentially play roles in regulating early VE development; these roles need to be further tested. These findings strongly support that RPCI can correctly integrate highly heterogeneous datasets, leading to novel insights in cell lineage evolution that would otherwise be missed by other software.
For comparison, we performed integration on these datasets using five top-performing tools (Supplementary Fig. 10a). The benchmark metrics indicated that RPCI handled the integration more correctly than the others, especially in successfully aligning the various cell subpopulations along developmental time points (Supplementary Fig. 10b). Interestingly, Scanorama ranked second in metric scores, probably because its strategy emphasizes the integration of multiple panoramas and, thus, is quite powerful in estimating cell type similarities across time points.
Extensive test of RPCI on datasets of various heterogeneity.
We further evaluated the robust performance of RPCI integration using a large number of scRNA-seq datasets published by other investigators, with the goal to account for various scenarios in integration of data with different extents of inter-dataset heterogeneity and data quality. These datasets were obtained from various experimental37,38 or genetic39 perturbations, developmental stages6, tumor samples from patients with or without immune therapy40–42, multiple platforms and protocols43–47 or cross species48 (details in Supplementary Notes and Supplementary Figs. 11–17). The results support that RPCI is able to distinguish batch effects from biological signals, without over-integration or under-integration, and, in some cases, also leading to novel findings.
Finally, we performed an integration study to demonstrate that the RPCI-integrated data could be directly used for subsequent analysis by other software. Specifically, we showed that the RPCI-integrated data could be used by the trajectory analysis software PAGA49 to study the mouse mammary gland development (GSE111113 and E16 to 12-week adult stage50; Supplementary Fig. 18a,b). Even with the data at the middle time point (‘P4’) removed, RPCI data could still successfully be used to infer a trajectory consistent with the full datasets (Supplementary Fig. 18c,d).
Discussion
In this study, we developed and validated a new method, called RPCI, for integrating scRNA-seq datasets. We also conducted extensive evaluations and compared its performance with many tools that have been shown to be the top in recent benchmark studies22. Two key features of RPCI are that it does not depend on shared cell types and does not make any assumption of the similarity between datasets, although one probably would not use it to integrate non-biologically relevant datasets.
The RPCI approach uses reference gene eigenvectors—thus, choosing an optimal reference is an important step for RPCI. Although changing the reference would not affect the ability of RPCI to preserve inter-sample differences across heterogeneous datasets, we found some differences in the RPCI-integrated results by using different reference gene eigenvectors. To address this, our RISC software package includes a function to score the datasets to help choose the reference, by the analyses of cell clustering, standard deviation of eigenvectors and eigenvector distribution (Supplementary Fig. 19a and Methods). To illustrate its use, we used the PBMC datasets in Fig. 3 as an example. The benchmark metrics indicated that RPCI integrations with alternative references had overall concordant performances (Supplementary Fig. 19b–d). However, notably, the best metric score was observed when the set (Set 1) with the top gene eigenvector score was used as the reference data (Supplementary Fig. 19a). Set 5 had a very abnormal standard deviation score, probably due to how it was constructed computationally. Using it as the reference led to poor integration. It is worth pointing out that the results from other integration tools could also be affected by alternating use of ‘reference’ datasets. We alternated the integration order of cardiac snRNA-seq datasets in Fig. 4 and started from a KO sample. By visualization, we were able to see that the integration results from all tested tools were changed, but RPCI integration still showed the top performance (Supplementary Fig. 19e–g). In practice, users will likely try different reference data and compare the RPCI results. In our opinion, RPCI is especially suitable when the biological question is clear and the samples are integrable, such as in the cases of comparing treated and untreated samples or patients with controls. 
In addition to how to select the best reference, there is certainly room for additional improvement, such as to use meta cells instead of individual cells for increasing computing efficiency and to use data imputing for reducing dropout noises in the raw scRNA-seq data.
In addition to the 11 tools we tested here, there are many algorithms, such as Lasso and Elastic net, that have the ability to remove batch effects. However, without a global reference such as the gene eigenvectors in RPCI, the process for removing batch effects could distort biological signals between datasets, because it is difficult to form linear regression for intra-sample heterogeneity. On the other hand, we also need to consider computational efficiency and applicability of data qualities. Some complicated algorithms cannot be applied to poor-quality data, and others are extremely time consuming, such as ZinbWave16. In fact, we tested RPCI performance in integration of cells with normal and poor data quality (Supplementary Fig. 20) and demonstrated that RPCI can successfully and robustly handle cells with very low numbers of scRNA-seq reads. Furthermore, because we applied RPCI to more than 70 real datasets from 17 studies, with a large range of unique molecular identifiers (UMIs), nGenes and gene expression variance (Supplementary Fig. 20g,h), we think that RPCI is a robust scRNA-seq data integration approach that can handle data of various quality in complex experimental designs. Lastly, RPCI is similar to bulk RNA sequencing tools in terms of computational efficiency but much faster than many other scRNA-seq tools (Supplementary Fig. 21) and scalable to multiple and large scRNA-seq datasets.
Online content
Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41587-021-00859-x.
Methods
Our analyses used both simulated and real scRNA-seq data and implemented several algorithms. A full description of the data source is in Supplementary Table 2, and pre-processing of individual data is discussed in the Supplementary Notes. Here we describe the computational methods for general data filtering, data visualization, integration algorithm and the analytic metrics for integration performance. RPCI was released in the RISC R package (https://github.com/bioinfoDZ/RISC), including additional functions mentioned below—for example, clustering cells, identifying cluster marker genes and detecting DE genes between experimental conditions.
Data filtering and visualization.
UMAP, t-SNE and heat map.
To visualize the scRNA-seq data, we applied both t-SNE and UMAP using the R packages ‘Rtsne’ (v0.15)51 and ‘umap’ (v0.2.3.1)52, respectively. We decomposed the scRNA-seq gene cell matrix by singular value decomposition (SVD)53, used cell eigenvectors (that is, the right-singular vectors) as the input to calculate t-SNE or UMAP and then chose the first two components of t-SNE or UMAP for two-dimensional (2D) plotting and the first three for 3D plotting. To draw the heat map, we used the R package ‘pheatmap’54 and scaled gene expression values by rows, but the scale was set to ‘none’ when displaying cell-cell correlations.
Filtering and pre-processing scRNA-seq data.
For simulated data, we skipped data filtering, as the ‘splatter’ and ‘Symsim’ tools already control the data quality. For real scRNA-seq data, we removed the genes expressed in fewer than five cells of individual datasets and discarded cells expressing fewer than 200 genes. We also filtered out cells with extremely low UMIs or extremely high UMIs (potential doublets). To remove the effect of sequencing depth, we normalized scRNA-seq raw data (that is, counts). For homogeneous simulated data, we applied the ‘log(count + 1)’ method55 for normalization, for each $\text{count}_{i,j}$, with i indexing genes and j indexing cells (n cells in total):
Here, presumptive sequencing depth was set to $1 \times 10^{6}$. For heterogeneous simulated data, we applied five methods for normalization: 1) the ‘CPM’ method56 from the function ‘calculateCPM’ of the R package ‘scater’ (v1.16.2); 2) the ‘Scater’ method57 from the function ‘normalize’ of the R package ‘scater’; 3) computeSumFactors and normalize for the ‘Scran’ method58 using the R package ‘scran’ (v1.16.0); 4) the SCnorm function for the ‘scNorm’ method59 using the R package ‘SCnorm’ (v1.10.0); and 5) the ‘log(count + 1)’ method (‘log1p’)55. We used these different normalization methods to address whether the performances of scRNA-seq data integration methods are dependent on normalization algorithms. For real scRNA-seq data, we used the ‘log(count + 1)’ method for normalization to simplify the process, because we showed that normalization did not significantly affect our results. When analyzing snRNA-seq data, we excluded the counts from the mitochondrial genes in gene normalization, as these counts were not from nuclei.
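The filtering thresholds and the ‘log(count + 1)’ normalization described above can be illustrated with a minimal sketch. It is not the RISC implementation. The exact normalization formula is not reproduced in this text, so the form log(count / total counts of the cell × depth + 1) with a natural logarithm is an assumption, as are the function names `filter_counts` and `log_normalize` and the choice to count a cell's expressed genes after gene filtering. The UMI-based cell filter is omitted because no thresholds are stated.

```python
import math


def filter_counts(counts, min_cells=5, min_genes=200):
    """Filter a genes x cells count matrix (list of rows, one row per gene).

    Returns the filtered matrix plus the indices of retained genes and cells.
    """
    n_cells = len(counts[0]) if counts else 0
    # Keep genes detected (count > 0) in at least `min_cells` cells.
    kept_genes = [g for g, row in enumerate(counts)
                  if sum(1 for v in row if v > 0) >= min_cells]
    # Keep cells expressing at least `min_genes` of the retained genes
    # (assumption: gene filtering is applied before cell filtering).
    kept_cells = [c for c in range(n_cells)
                  if sum(1 for g in kept_genes if counts[g][c] > 0) >= min_genes]
    filtered = [[counts[g][c] for c in kept_cells] for g in kept_genes]
    return filtered, kept_genes, kept_cells


def log_normalize(counts, depth=1e6):
    """Depth-normalize each cell (column) and apply log(x + 1).

    Assumed form: log(count_ij / sum_i(count_ij) * depth + 1).
    """
    n_cells = len(counts[0]) if counts else 0
    totals = [sum(row[c] for row in counts) for c in range(n_cells)]
    return [[math.log(v / totals[c] * depth + 1.0) if totals[c] > 0 else 0.0
             for c, v in enumerate(row)]
            for row in counts]
```

With the defaults, `filter_counts` applies the five-cell and 200-gene thresholds from the text, and `log_normalize` uses the stated depth of 1 × 10^6.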
After count normalization, we scaled the normalized counts using the ‘scale’ function in the R package ‘base’60. The scaled counts merely contained gene signal information for individual cells and yielded a column-wise zero empirical mean for each column (that is, cell), thus satisfying the requirement for principal component analysis (PCA) and SVD23,53. The scaled data from different datasets were concatenated to generate a matrix that we refer to here as the raw, uncorrected or pre-integrated data matrix. When evaluating the performances of integration approaches, for both simulated and real scRNA-seq data, we used the same data files (either raw counts or normalized counts) as inputs to different tools for fair comparison.
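As a rough illustration of the scaling step, the sketch below centers each column (cell) and divides it by its sample standard deviation, which is what R's `scale` does by default. The function name `scale_columns` is hypothetical. Returning 0.0 for constant columns is a simplification, whereas R would return NaN.

```python
import statistics


def scale_columns(matrix):
    """Center and scale every column of a genes x cells matrix.

    Each column ends up with zero mean and unit sample variance (n - 1
    denominator). Needs at least two rows.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    scaled = [[0.0] * n_cols for _ in range(n_rows)]
    for c in range(n_cols):
        column = [matrix[r][c] for r in range(n_rows)]
        mean = statistics.fmean(column)
        sd = statistics.stdev(column)  # sample standard deviation
        for r in range(n_rows):
            # Simplification: constant columns map to 0.0 instead of NaN.
            scaled[r][c] = (column[r] - mean) / sd if sd > 0 else 0.0
    return scaled
```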
Gene expression variance and highly variable genes.
For simulated data, although we still calculated gene expression variance (coefficient of variation (CV)) for each gene61, we decomposed the gene cell matrix and performed data integration using all the genes (1,000 genes), without selecting variable genes. By contrast, for real scRNA-seq data, we identified highly variable genes by the quasi-Poisson model62 and used them for gene cell matrix decomposition and data integration. For some datasets across multiple time points with few shared highly variable genes, we also tried to input all the genes to RPCI integration. An example of data integration using all the genes is the mouse embryonic endoderm data. Because most scRNA-seq tools were designed for integration based on highly variable genes, we used their own approaches when evaluating their performances, unless specified otherwise. Moreover, to ensure a fair comparison, we input the same highly variable genes for all methods.
The default method in the RISC package uses three criteria to identify highly variable genes. First, it calculates the CV61 for each gene i, labeled as $C_i$ and given by
$$C_i = \frac{S_i}{\mu_i}$$
where $S_i$ and $\mu_i$ denote the standard deviation and mean value, respectively. To control for the correlation between $S_i$ and $\mu_i$, a quasi-Poisson regression is used to obtain genes with over-dispersed $C_i$. Specifically, genes are binned (for example, 20 bins) by their expression levels. For genes in a bin b, quasi-Poisson regression62 is used to predict the expected CV, $\hat{C}_{bi}$, from the mean $\mu_{bi}$, with θ denoting the quasi-Poisson over-dispersion parameter. The corresponding ratio between the observed $C_i$ and the predicted $\hat{C}_{bi}$ is given by $r_i = C_i / \hat{C}_{bi}$ for each gene. The genes with $r_i > 1$ and $C_i > 0.5$ are considered as highly variable for subsequent analysis. We limited the number of the highly variable genes by ranking $r_i$, with the z-score test.
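The selection logic (a per-gene CV, a per-bin expected CV, and the thresholds on the ratio and on the CV) can be sketched as follows. The regression itself is deliberately swapped out: the bin median of the CV stands in for the quasi-Poisson prediction, so this sketch does not reproduce the RISC model or its over-dispersion parameter θ. Equal-sized bins ordered by mean expression and the names `coefficient_of_variation` and `select_variable_genes` are likewise assumptions.

```python
import statistics


def coefficient_of_variation(values):
    """CV = standard deviation / mean (0.0 when the mean is not positive)."""
    mean = statistics.fmean(values)
    return statistics.stdev(values) / mean if mean > 0 else 0.0


def select_variable_genes(expr, n_bins=20, min_ratio=1.0, min_cv=0.5):
    """expr maps gene name -> expression values across cells.

    Genes are ordered by mean expression and split into equal-sized bins.
    Within each bin the expected CV is taken as the bin median, which is a
    stand-in for the quasi-Poisson prediction described in the text.
    """
    stats = {g: (statistics.fmean(v), coefficient_of_variation(v))
             for g, v in expr.items()}
    ranked = sorted(stats, key=lambda g: stats[g][0])
    bin_size = max(1, -(-len(ranked) // n_bins))  # ceiling division
    selected = []
    for start in range(0, len(ranked), bin_size):
        genes = ranked[start:start + bin_size]
        expected = statistics.median(stats[g][1] for g in genes)
        for g in genes:
            cv = stats[g][1]
            ratio = cv / expected if expected > 0 else 0.0
            # Thresholds from the text: ratio > 1 and CV > 0.5.
            if ratio > min_ratio and cv > min_cv:
                selected.append(g)
    return selected
```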
Integration algorithms.
To illustrate the process of data integration, we first describe the decomposition of single-cell gene cell matrices by SVD, because most scRNA-seq integration tools use it. For simplicity, we first consider two datasets (that is, two gene cell matrices) and then extend to multiple datasets. Notably, classic PCA has the same equation as CCA, as discussed below.
SVD and PCA.
Let $X_{n \times p}$ be a gene cell matrix with n rows of genes and p columns of cells. When we decompose $X_{n \times p}$ by SVD53, we generate eigenvectors,
$$X_{n \times p} = U_{n \times n}\, \Delta_{n \times p}\, V_{p \times p}^{T}$$
where Δn×p is singular values of Xn×p, and both Un×n and Vp×p are orthonormal eigenvectors and represent the left- and right-singular vectors of the matrix Xn×p, respectively. In PCA, the full PCs23 are decomposed from Xn×p given by
where W is the PC score and reflects the maximal variance of Xn×p. Because PCA as a dimensionality reduction algorithm emphasizes the total variances and effective signal of Xn×p in the first few PCs, PCA generally keeps the first l PCs l∈{1,2,…,p,n} in calculation. Therefore, the truncated Wl is given by23
Accordingly, we can further decompose Xn×p by eigenvectors of the first l components, and the decomposition would retain most of the biological signal of the original gene cell matrix.
$$X_{n \times p} \approx U_{n \times l}\, \Delta_{l \times l}\, V_{p \times l}^{T}$$
Here, Un×l is for gene signal and referred to as the gene eigenvectors, whereas Vp×l is for cell signal and termed as cell eigenvectors (Fig. 1).
CCA and data integration.
The essential function of CCA is to seek the maximal correlation between cross-covariance matrices. Let $Y_{n \times q}$ be another gene cell matrix with n rows of genes and q columns of cells. The CCA procedure10,11,18 is given by
$$\rho = \max_{\alpha_1, \alpha_2} \operatorname{corr}(X\alpha_1,\; Y\alpha_2)$$
where $\alpha_1$ and $\alpha_2$ are coefficients for X and Y, respectively, and the procedure seeks to maximize the correlation between X and Y. However, we do not need to calculate $\alpha_1$ and $\alpha_2$ in practice, because multiplying X and Y can achieve the same goal10,18. The correlation coefficient ρ is simplified and given by
$$\rho_{p \times q} = X^{T} Y$$
where $X^{T}X$ and $Y^{T}Y$ are equal to the identity matrix, as X and Y are from a scaled matrix with variance equal to 1 (see ‘Filtering and pre-processing scRNA-seq data’), and $\rho_{p \times q}$ has p rows of cells for X and q columns of cells for Y. Interestingly, the classic PCR has the same equation23,24 and is given by
$$\beta = (X^{T}X)^{-1} X^{T} Y = X^{T} Y$$
where $X^{T}X$ is equal to the identity matrix. When decomposing the matrices β or ρ by SVD,
$$\rho = X^{T} Y = V_X\, \Delta_X^{T}\, U_X^{T} U_Y\, \Delta_Y\, V_Y^{T}$$
When integrating homogeneous datasets with only batch differences, the gene eigenvectors $U_X$ are approximately equal to $U_Y$ and, thus, $U_X^{T} U_Y \approx I$. After decomposition of ρ, the left- and right-singular vectors of ρ are almost equal to $V_X$ and $V_Y$; therefore, CCA is expected to have a good performance.
However, a notable issue in the CCA procedure is the lack of a global (or standard) reference; ρ is completely based on the covariance between X and Y. For heterogeneous scRNA-seq datasets with true gene expression differences in cells (of the same types) among datasets, $U_X$ is different from $U_Y$, but CCA will still adjust the values of both cell eigenvectors ($V_X$ and $V_Y$) by $U_X^{T} U_Y$. When integrating three or more heterogeneous datasets, this error will accumulate, and the adjustment can make the resultant cell eigen matrices no longer directly comparable. Consequently, the real cell eigen difference across datasets can be distorted in the CCA results.
RPCI and data integration.
The core principle of RPCI is very different from existing methods, including those working on SVD and PCA spaces. Let two gene cell matrices be $X_{n \times p}$ and $Y_{n \times q}$. After decomposing the matrices by the first l eigenvectors using SVD, the gene cell matrices are also given by
$$X_{n \times p} \approx U_X\, \Delta_X\, V_X^{T}, \qquad Y_{n \times q} \approx U_Y\, \Delta_Y\, V_Y^{T}$$
However, we do not maximize covariance between X and Y, as CCA does, but decompose Y by $U_X$ into an RPCI space. This equation is given by
$$\gamma = U_X^{T} Y = U_X^{T} U_Y\, \Delta_Y\, V_Y^{T}$$
Importantly, we focus on γ and decompose it by SVD; the right-singular vectors of γ generate the adjusted cell eigenvectors $V'_Y$ of Y. Logically, the adjustment from $V_Y$ to $V'_Y$ is based mainly on $U_X^{T} U_Y$. When we have three or more datasets, all adjustments of cell eigenvector matrices (V) depend on $U_X^{T} U$ (Fig. 1d). Therefore, $U_X$ serves as a global reference, and all the adjusted cell eigenvectors V′ are comparable. Afterwards, all the adjusted V′ are projected into a reference space defined by $U_X$. Interestingly, U represents the gene expression variance of the target gene cell matrix, so the adjustment of V directly reflects the gene signal difference between the reference gene eigenvectors $U_X$ and the target U. More specifically, to integrate gene cell matrices X, Y and Z, the cell eigenvectors $V_X$, $V_Y$ and $V_Z$ are modified by $U_X^{T} U_X$, $U_X^{T} U_Y$ and $U_X^{T} U_Z$, respectively, so the adjusted $V'_X$ (equal to $V_X$), $V'_Y$ and $V'_Z$ can directly capture the difference of gene expression signal among X, Y and Z.
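The projection step can be sketched with NumPy under explicit assumptions. Reading ‘decompose Y by the reference gene eigenvectors’ as γ = U_X^T Y is an interpretation of the description above, and the function names are hypothetical. The sketch assumes already scaled genes x cells matrices that share the same gene order. It covers only the derivation of adjusted cell eigenvectors against one reference, not the later projection or gene expression alignment performed by RISC.

```python
import numpy as np


def gene_eigenvectors(X, l):
    """Top-l left-singular vectors (gene eigenvectors) of a genes x cells matrix."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :l]


def rpci_cell_eigenvectors(reference, targets, l):
    """Adjusted cell eigenvectors for the reference and every target dataset.

    Each dataset Y is projected onto the reference gene eigenvectors,
    gamma = U_X^T Y (assumed form), and the right-singular vectors of gamma
    are returned as a cells x l matrix. The reference comes first.
    """
    U_ref = gene_eigenvectors(reference, l)            # n genes x l
    adjusted = []
    for Y in [reference, *targets]:
        gamma = U_ref.T @ Y                            # l x cells
        _, _, Vt = np.linalg.svd(gamma, full_matrices=False)
        adjusted.append(Vt.T)                          # cells x l
    return U_ref, adjusted
```

For the reference itself, γ reduces to Δ_X V_X^T, so its adjusted cell eigenvectors coincide with V_X up to sign, in line with the statement that the adjusted V′_X equals V_X. Every target is adjusted against the same U_X, which is what makes the outputs comparable across three or more datasets.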
Regularization procedure.
In most cases, the dimension reduction and data integration of scRNA-seq data are based on the highly variable genes. Because the cell numbers of scRNA-seq data are often large, such as tens of thousands or more, the matrices $X_{n \times p}$ and $Y_{n \times q}$ are most likely in the condition $n < \min(p, q)$, where n here refers to the number of highly variable genes (in most cases <3,000). As such, a unique SVD might not be obtained from decomposing $X_{n \times p}$ and $Y_{n \times q}$. In RPCI procedures, $X_{n \times p}$ is decomposed by eigenvectors of the first l components (see ‘SVD and PCA’), where l ≪ max(p,q); therefore, $U_{n \times l}$ and $V_{p \times l}$ will be unique, that is, a regularized solution63,64.
Integration based on shared cell types.
This strategy is algorithmically different from the CCA, but it can be incorporated into a CCA-based integration method, as implemented in ‘Anchor’. It starts with finding shared cell types between datasets, but different tools implement this step quite differently10–21. Let two gene cell matrices be X and Y with shared cells from groups a and b and non-shared cells from group c for X and c′ for Y, as shown in Figs. 1c and 2a. To form an X/Y integration, this approach needs to successfully remove the batch effects in a/b by minimizing Bab (a batch correction matrix with groups a and b). When datasets are complex, this can become challenging because the tools need an appropriate threshold to determine what cell types are shared or matched across different datasets. Even if this step succeeds, an integration tool needs to work on correcting the batch effects in non-shared cell types (for example, c/c′) to achieve full integration. Approximately, and using Scanorama as an example, the difference in c/c′ for X/Y integration is based on the batch correction Bab and the weighted matrix from c to Xab and c′ to Yab and equal to
The full adjustment between X and Y is AXY, which contains two terms, BXY and :
When adding a third dataset Z that shares only cells of a with X, a distortion or incomparability can arise from merging X/Y/Z, because AXY and AXZ are from different references (different shared cell groups and different cell dissimilarities: AXY is based on Bab from the shared cells of a and b with the dissimilarity in c/c′, but AXZ is calculated from Ba from the shared cells of a with the dissimilarity in b/b′). Consequently, these incomparable corrections will lead to a distorted integration. As shown in Supplementary Fig. 2d, the integrated data from some tools in this category did not preserve the differences in b/b′ and c/c′.
Alignment of gene expression values across datasets.
For the downstream scRNA-seq analysis after integration, it is necessary for an integration tool to output the adjusted gene expression values. The RISC package generates and outputs batch-adjusted expression values for all genes expressed across samples, not only the top variable genes used in the data integration, with three options, one of which directly uses the RPCI-adjusted cell eigenvector for batch correction. Primarily, according to the ordinary least square model65 and the zero-inflated model66, we generate the integrated gene expression values according to the reference values xi and batch factors, while keeping genes with zero expression as zero in the integrated data. The equation is given by
where δ is for coefficient based on yi and xi. This method adjusts individual genes using a fixed batch model (such as DESeq2) and shows good performance in most cases, and it is the default method in the RISC tool. A second method implemented in RISC uses the RPCI model to remove batches, aligning the target matrix values based on the reference matrix X directly63, with the equation given by
where denoting RPCI-adjusted cell eigenvector (see ‘RPCI and data integration’). For this option, the use of all genes for integration will produce the best corrected gene expression values. The last method is based on the kernel assumption that most genes are not differentially expressed across datasets. Thus, we adjust the expression ranges λi of each gene across datasets (due to batch factor). Specifically, we use the linear regression model with Gaussian65 to define empirical confidence intervals for gene expression ranges across datasets. Then, we predict the integrated gene value by and xi. The equations are given by
where μx is for gene average expression and ξ is for coefficient based on λi and μx. This method can optimally preserve the original variance structures of individual datasets.
Analytic metrics.
To evaluate the performances of integration tools, we used four benchmark metrics: kBET, LISI, ARI and SW13,22,27,29,67. Additionally, we employed cell-cell correlation to analyze the relationships of cell populations, the difference of gene expression variance before and after integration to estimate batch removal and biological signal preservation and the numbers of DE genes in related cell groups. Notably, the metric scores can be affected by the number of cells with correct or incorrect local distribution. In data integration, the same biological signal distortion with different numbers of cells might generate quite different scores in these benchmark metrics.
kBET.
The kBET metric29 was developed in recent years for estimating the batch correction by batch mixture from each data point to its nearest neighbors—that is, the mixture distribution of batch-labeled points. For our analysis, we inputted the cell eigenvectors of each cell population from the raw and integrated data into the kBET. We used the top 20 cell eigenvectors for simulated data and the top 30 for real data. Then, we computed the kBET scores from different sample sizes (0%, 5%, 10%, 15%, 20% and 25%), according to a recent benchmark study of scRNA-seq tools22. Higher kBET scores represent better performance in batch correction. For a fair comparison of different integration tools, we used the same number of cell eigenvectors and the same cell type annotation as the inputs to kBET. However, fastMNN and Scanorama have different cell eigenvectors in visualization, so we selected the full cell eigenvectors of fastMNN (50 eigenvectors) and Scanorama (100 eigenvectors) for this test.
LISI.
Different from the kBET, the LISI13 not only scores the batch correction performances but also assesses the appropriate aggregation of cell populations. Interestingly, LISI also uses the local distribution of batch-labeled points to nearest neighbors, but the local distribution is based on the distance with a perplexity. Therefore, we inputted the UMAP values from the raw and integrated data into the LISI to generate two types of scores. The cell type LISI indicates the correct and independent cell population aggregation in the data, with 1 indicating that the same cell type is fully embedded, whereas the integration LISI (referred to here as batch LISI) measures batch removal, with a score equal to the number of (preset) batches indicating the optimal performance. Because not all cell types were present in all batches in many of the heterogeneous datasets (leading to inconsistent LISI), we further modified and scaled the cell type and batch LISI to values between 0 and 1 to make them more comparable, similarly to a recent benchmark paper22 and as shown in our online codes for computing LISI scores.
ARI.
ARI evaluates batch removal and cell population purity27. In the evaluation of batch correction, ARI measures the mixture of the preset batch-labeled points according to cell eigenvectors of the raw and integrated data. To evaluate cell population purity, ARI compares the original cell type annotation to the cell clustering results from the cell eigenvectors via the adjustedRandIndex function of the ‘mclust’ package27. We used the cell eigenvectors as the inputs of the ARI, and the number of cell eigenvectors was the same as what was inputted in the UMAP/t-SNE visualization. Higher scores indicate purer aggregation of cell population and more accurate batch correction. For fair comparison, we used the same number of cell eigenvectors and cell type annotation as the inputs of the ARI, but fastMNN and Scanorama had their own eigenvectors.
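For readers unfamiliar with the index, the standard adjusted Rand index between two labelings (for example, annotated cell types versus cluster assignments) can be computed from their contingency table. The following is a generic sketch of that formula with a hypothetical function name, not the ‘mclust’ code. The clustering step that produces the second labeling is not shown.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    if n < 2:
        return 1.0  # convention for degenerate input
    # Pair counts within contingency cells and within each marginal.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # e.g. both labelings put everything in one group
    return (sum_cells - expected) / (max_index - expected)
```

The index is 1 for identical partitions regardless of label names, near 0 for random agreement, and negative when agreement is worse than chance.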
SW.
Our analyses also used the silhouette coefficient67 to evaluate how the preset cell population information was kept in the integrated data, as described previously22. We first calculated the UMAP values of the integrated data and then detected the silhouette coefficient of the preset cell groups by the UMAP values, using the silhouette function of the R package ‘cluster’68. The silhouette coefficient reflects the independence of the cell populations in the integrated data. We also used the same process to calculate the conjunction of batch-labeled points and scored the batch removal. We defined the criteria of SW for each cell population: a silhouette coefficient less than 0 for real scRNA-seq data meant that the points from different batches were completely mixed and batch effects were removed, but a silhouette coefficient greater than 0.1 between cell populations indicated that cell type purity was preserved in the integrated data. As such, we could determine what percentage of cell populations had correct silhouette coefficients. For simulated data, a silhouette coefficient less than 0.15 was for batch mixture, and a silhouette coefficient greater than 0.3 was for cell type independence, due to smaller numbers of cells. Additionally, we weighted the SW by the local UMAP distribution of the cells from the same cell type (W). For instance, in the simulated data in Fig. 2 and Supplementary Fig. 2e, the SW of the cell groups was weighted by the average of the local UMAP distribution of the cells within each group (a, b, b′, c, c′ or d). To sum up, the SW score can be expressed by
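Independently of the weighted SW score, the plain silhouette coefficient that underlies it can be sketched as below. This is the textbook definition s(i) = (b − a) / max(a, b) on Euclidean distances between embedding coordinates (for example, 2D UMAP values). The function names are hypothetical, and the weighting by local UMAP distribution (W) is not implemented.

```python
import math


def silhouette_scores(points, labels):
    """Per-point silhouette coefficient using Euclidean distance."""
    groups = {}
    for idx, label in enumerate(labels):
        groups.setdefault(label, []).append(idx)
    scores = []
    for i, label in enumerate(labels):
        own = [j for j in groups[label] if j != i]
        if not own or len(groups) < 2:
            scores.append(0.0)  # usual convention for singletons
            continue
        # a: mean distance to the other members of the same group.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other group.
        b = min(sum(math.dist(points[i], points[j]) for j in members) / len(members)
                for other, members in groups.items() if other != label)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return scores


def group_means(scores, labels):
    """Average silhouette coefficient per label."""
    sums = {}
    for score, label in zip(scores, labels):
        total, count = sums.get(label, (0.0, 0))
        sums[label] = (total + score, count + 1)
    return {label: total / count for label, (total, count) in sums.items()}
```

Running it once with cell type labels and once with batch labels gives the two quantities compared against the thresholds quoted above: high values for cell types and low values for batches.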
Cell-cell group correlation.
Two kinds of cell population correlations were calculated in this study: cell population (group) correlation from the raw counts and correlation based on gene expression variance. In the former, we first combined raw count gene cell matrices of individual datasets after pre-processing the data (see ‘Filtering and pre-processing scRNA-seq data’). Next, we computed Pearson’s correlation coefficients (r) for the gene cell matrix and generated a cell-cell correlation matrix for individual cells. Then, according to cell population (group) annotation, the cell correlation matrix was averaged to obtain cell population (group) level correlation. For the latter, we first calculated the gene expression variance (see ‘Gene expression variance and highly variable genes’) of each cell population (group) and built a gene variance population matrix, with rows for gene expression variance of individual populations (groups) and columns for cell populations. This matrix was used to compute r, the Pearson correlation coefficient.
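The first kind of correlation (cell-cell Pearson r averaged by group annotation) can be sketched as follows. The function names are hypothetical, and including each cell's self-correlation in the within-group average is an assumption, as the text does not say how the diagonal is handled.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined for constant profiles
    return cov / math.sqrt(var_x * var_y)


def group_correlation(matrix, groups):
    """Average the cell-cell correlation matrix by group annotation.

    matrix is genes x cells and groups[c] is the label of cell c.
    Returns a dict mapping (group_1, group_2) to the mean Pearson r.
    """
    n_cells = len(groups)
    cells = [[row[c] for row in matrix] for c in range(n_cells)]
    sums, counts = {}, {}
    for i in range(n_cells):
        for j in range(n_cells):  # assumption: diagonal entries are included
            key = (groups[i], groups[j])
            sums[key] = sums.get(key, 0.0) + pearson(cells[i], cells[j])
            counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}
```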
Gene dispersion.
The preservation of gene expression variance (see ‘Gene expression variance and highly variable genes’)61 between the raw and integrated data is another important indicator for the accuracy of data integration. We calculated gene expression variances for each gene and then used linear regression analysis to evaluate the agreement of all gene expression variances before and after data integration, resulting in R2 for scoring the performance of integration tools (Supplementary Fig. 4b). Some methods do not align gene expression but merely adjust cell eigenvectors, so they do not have scores for gene dispersion.
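A minimal sketch of this agreement score, assuming the per-gene variance measure is the CV and the score is the R² of a simple linear regression of the post-integration values on the pre-integration values. The function names are hypothetical.

```python
import statistics


def gene_cv(matrix):
    """CV (standard deviation / mean) for each gene (row) of a genes x cells matrix."""
    cvs = []
    for row in matrix:
        mean = statistics.fmean(row)
        cvs.append(statistics.stdev(row) / mean if mean > 0 else 0.0)
    return cvs


def r_squared(before, after):
    """R^2 of the least-squares line after ~ before."""
    mean_x, mean_y = statistics.fmean(before), statistics.fmean(after)
    sxx = sum((x - mean_x) ** 2 for x in before)
    syy = sum((y - mean_y) ** 2 for y in after)
    if sxx == 0 or syy == 0:
        return float("nan")  # regression undefined for constant input
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(before, after))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(before, after))
    return 1.0 - ss_res / syy
```

A typical use would be `r_squared(gene_cv(raw), gene_cv(integrated))`, where `raw` and `integrated` are placeholder names for expression matrices with the same gene order.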
DE gene analysis.
We used four algorithms for detecting DE genes between the datasets for the same cell groups (or clusters): 1) negative binomial (NB) generalized linear model with the glm.nb function of the R package ‘MASS’ (v7.3–51.6)69; 2) the zlm function with generalized linear model (MAST) from the R package ‘MAST’ (v1.14.0)70; 3) the conjugate Dirichlet process mixture (scDD) from the R package ‘scDD’ (v1.12.0) with scDD function71; and 4) non-parametric earth mover’s distance (EMD) by executing the calculate_emd function of the R package ‘EMDomics’ (v2.18.0)72. For raw simulated data, we considered batches as a factor in the test model and identified the reference DE genes that appeared in four algorithms, with adjusted P value < 0.01 and log(FC) > 0.25 or < −0.25. Note that scDD and EMD cannot generate FC, so only adjusted P value was applied. For the integrated data, we performed the DE gene test without batch factor, but the DE genes were defined by the same criteria.
In this study, we also analyzed DE genes for real scRNA-seq data. There, we used the NB generalized linear model as the default method to perform DE gene analysis for the integrated data. We used the threshold of adjusted P value < 0.01 and the logFC > 0.5 or < −0.5 as the default. However, in the DE gene analysis of real scRNA-seq data, we filtered out the genes not expressed in at least ten single cells.
FC analysis.
In addition to DE gene analysis, we also assessed the consistency of individual genes’ FCs (including all the genes in the two cell groups, b/b′ and c/c′) in the raw and integrated data by correlation analysis (R2). The FCs are generated by either generalized linear regression or the NB model.
Reference selection.
To select the optimal reference dataset, we propose to use three tests to rank individual datasets (Supplementary Fig. 19): 1) cluster score: estimate how many cell clusters are in individual datasets, with the idea that the dataset with more clusters is a better reference; 2) PC′ stv (standard variance) score: estimate how many principal components (PC′ = UΔ, U for gene eigenvectors and Δ for the singular values) can explain the major gene expression variance in each dataset, with the idea that a good reference should have a higher PC′ stv score; 3) PC′ dis. (distribution) score: use the Kolmogorov-Smirnov test to detect the gene eigenvector distribution across individual datasets, with the idea that a good reference should not exhibit biased gene eigenvector distribution. From these scores, one can choose a reference by ranking the datasets using a weighted scheme of cluster score > PC′ stv score > PC′ dis. score. However, a dataset should not be used as the reference if its PC′ dis. score is too high, because it indicates a potential outlier sample. This method is implemented in the ‘InPlot’ function of the RISC package.
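One way to read the ranking scheme is as a priority ordering: compare cluster scores first, break ties with the PC′ stv score, then with the PC′ dis. score, after excluding datasets whose PC′ dis. score exceeds a cutoff. The sketch below encodes that reading. It is an interpretation, not the logic of ‘InPlot’. The function name, the `max_dis` cutoff and the treatment of a lower dis. score as better are assumptions, and how the three scores are computed is not shown.

```python
def rank_references(scores, max_dis=None):
    """Rank candidate reference datasets, best first.

    scores maps dataset name -> (cluster_score, stv_score, dis_score).
    Datasets with dis_score above `max_dis` (hypothetical cutoff) are
    dropped as potential outliers before ranking.
    """
    eligible = {name: s for name, s in scores.items()
                if max_dis is None or s[2] <= max_dis}
    # Higher cluster and stv scores rank first; a lower dis. score breaks ties.
    return sorted(eligible,
                  key=lambda name: (-eligible[name][0],
                                    -eligible[name][1],
                                    eligible[name][2]))
```

For example, with toy values chosen only for illustration, `rank_references({"A": (8, 0.9, 0.1), "B": (9, 0.2, 0.9)}, max_dis=0.5)` drops B despite its higher cluster score and returns `["A"]`.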
Other integration tools.
We generally ran those integration tools using the default parameters: Seurat/Anchor (v3.2.0)10,18, Scran/batchelor/fastMNN (v1.2.4)11, Harmony (v1.0)13, scMerge (v1.2.0)14, Liger (v0.5.0.9000)20, Scanorama (v1.6)12, BEER (v0.1.6)21, BERMUDA (v1.0.0)19, MMD_ResNet17, scGen (v1.1.5)15 and ZinbWave (v1.8.0)16. We inputted the raw counts into ZinbWave, following its standard protocols. For Seurat/Anchor, we followed its standard protocol for integrating scRNA-seq datasets. When integrating simulated data, the cell numbers were small, so we decreased the k.filter parameter of ‘FindIntegrationAnchors’ from the default 200 to 20. As the ‘FindIntegrationAnchors’ function can select a reference dataset, we set it to the same one used in RPCI. We used the batchelor11 to integrate both simulated and real scRNA-seq data with ‘multiBatchPCA’ and ‘fastMNN’ functions, according to their standard protocol. To align gene expression values, we used the ‘mnnCorrect’ function. When using Scanorama, we followed the standard protocol and used the ‘correct’ function to integrate data and correct batches with the default parameters. We also integrated data by BERMUDA, following its standard steps: clustering cells in each dataset by Seurat, detecting the similarity of cell clusters across datasets using the Spearman correlation and then merging datasets by BERMUDA. We performed data integration by BEER, following the standard protocol; as BEER chooses the reference and highly variable genes by itself, we could not control the inputs. We used ‘HarmonyMatrix’ to integrate datasets for Harmony and the ‘optimizeALS’ and ‘quantile_norm’ functions for Liger. Based on their standard protocols, both Harmony and Liger correct batch effects in cell eigenvectors but do not align gene expression values. We also ran scGen, scMerge and MMD_ResNet, following their standard vignettes. 
To be consistent, we set the same order of individual datasets for data integration with RPCI, Seurat, Scran, Scanorama and BERMUDA.
Computing efficiency test.
Computing efficiency was tested by increasing the number of cells or datasets. For the sequential test, we integrated two simulated datasets, each containing the same number of cells, with 10,000 genes per cell. The number of cells per dataset was increased from 50 to 50,000 (100 to 100,000 cells in total in data integration), and the time consumed by each integration tool was recorded. In parallel, we determined computing times for integrating multiple datasets, with 1,000 cells per dataset and 10,000 genes per cell, and the number of datasets increasing from two to 20. Notably, ZinbWave, Harmony and Liger would correct batch effects in cell eigenvectors across datasets but did not generate batch-corrected gene expression values. By contrast, fastMNN, Anchor, Scanorama, scMerge, BEER and RPCI correct both cell eigenvectors and gene expression values.
Reporting Summary.
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
All scRNA-seq datasets in this study were published previously, and their availabilities are described in Supplementary Table 2.
Code availability
RISC is provided as an R package and is freely available via GitHub (https://github.com/bioinfoDZ/RISC). Code for the analysis (and related source data) is provided in Code Ocean (https://codeocean.com/capsule/9098032).
Acknowledgements
We thank all the research groups that generated and shared the scRNA-seq data used in this study. We thank the members of the Zheng lab for valuable discussions, software testing and comments on the manuscript. We also acknowledge funding support from the National Institutes of Health (grants HL133120 to D.Z. and B.Z., HL153920 to D.Z., HD092944 to D.Z. and B.Z., and HD070454 to D.Z.).
Footnotes
Competing interests
The authors declare no competing interests.
Additional information
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41587-021-00859-x.
Peer review information Nature Biotechnology thanks the anonymous reviewers for their contribution to the peer review of this work.
References
- 1.Islam S et al. Quantitative single-cell RNA-seq with unique molecular identifiers. Nat. Methods 11, 163–166 (2014). [DOI] [PubMed] [Google Scholar]
- 2.Nawy T Single-cell sequencing. Nat. Methods 11, 18 (2014). [DOI] [PubMed] [Google Scholar]
- 3.Wang Y & Navin NE Advances and applications of single-cell sequencing technologies. Mol. Cell 58, 598–609 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Zheng GX et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun 8, 14049 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Azizi E et al. Single-cell map of diverse immune phenotypes in the breast tumor microenvironment. Cell 174, 1293–1308 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Rosenberg AB et al. Single-cell profiling of the developing mouse brain and spinal cord with split-pool barcoding. Science 360, 176–182 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Fan X et al. Spatial transcriptomic survey of human embryonic cerebral cortex by single-cell RNA-seq analysis. Cell Res. 28, 730–745 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Wang JX et al. Single-cell gene expression analysis reveals regulators of distinct cell subpopulations among developing human neurons. Genome Res. 27, 1783–1794 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Davie K et al. A single-cell transcriptome atlas of the aging Drosophila brain. Cell 174, 982–998 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Butler A, Hoffman P, Smibert P, Papalexi E & Satija R Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol 36, 411–420 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Haghverdi L, Lun ATL, Morgan MD & Marioni JC Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol 36, 421–427 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Hie B, Bryson B & Berger B Efficient integration of heterogeneous single-cell transcriptomes using Scanorama. Nat. Biotechnol 37, 685–691 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Korsunsky I et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Lin Y et al. scMerge leverages factor analysis, stable expression, and pseudoreplication to merge multiple single-cell RNA-seq datasets. Proc. Natl Acad. Sci. USA 116, 9775–9784 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15. Lotfollahi M, Wolf FA & Theis FJ scGen predicts single-cell perturbation responses. Nat. Methods 16, 715–721 (2019).
- 16. Risso D, Perraudeau F, Gribkova S, Dudoit S & Vert JP A general and flexible method for signal extraction from single-cell RNA-seq data. Nat. Commun. 9, 284 (2018).
- 17. Shaham U et al. Removal of batch effects using distribution-matching residual networks. Bioinformatics 33, 2539–2546 (2017).
- 18. Stuart T et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902 (2019).
- 19. Wang T et al. BERMUDA: a novel deep transfer learning method for single-cell RNA sequencing batch correction reveals hidden high-resolution cellular subtypes. Genome Biol. 20, 165 (2019).
- 20. Welch JD et al. Single-cell multi-omic integration compares and contrasts features of brain cell identity. Cell 177, 1873–1887 (2019).
- 21. Zhang F, Wu Y & Tian W A novel approach to remove the batch effect of single-cell data. Cell Discov. 5, 46 (2019).
- 22. Tran HTN et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol. 21, 12 (2020).
- 23. Jolliffe IT & Cadima J Principal component analysis: a review and recent developments. Philos. Trans. A Math. Phys. Eng. Sci. 374, 20150202 (2016).
- 24. Jolliffe IT Principal Component Analysis (Springer, 2011).
- 25. Zhang X, Xu C & Yosef N Simulating multiple faceted variability in single cell RNA sequencing. Nat. Commun. 10, 2611 (2019).
- 26. Zappia L, Phipson B & Oshlack A Splatter: simulation of single-cell RNA sequencing data. Genome Biol. 18, 174 (2017).
- 27. Scrucca L, Fop M, Murphy TB & Raftery AE mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. R J. 8, 289–317 (2016).
- 28. Rousseeuw PJ Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987).
- 29. Büttner M, Miao Z, Wolf FA, Teichmann SA & Theis FJ A test metric for assessing single-cell RNA-seq batch correction. Nat. Methods 16, 43–49 (2019).
- 30. Villani AC et al. Single-cell RNA-seq reveals new types of human blood dendritic cells, monocytes, and progenitors. Science 356, eaah4573 (2017).
- 31. Hu P et al. Single-nucleus transcriptomic survey of cell diversity and functional maturation in postnatal mammalian hearts. Genes Dev. 32, 1344–1357 (2018).
- 32. Liu Y, Singh VK & Zheng D Stereo3D: using stereo images to enrich 3D visualization. Bioinformatics 36, 4189–4190 (2020).
- 33. Nowotschin S et al. The emergent landscape of the mouse gut endoderm at single-cell resolution. Nature 569, 361–367 (2019).
- 34. Arnold SJ & Robertson EJ Making a commitment: cell lineage allocation and axis patterning in the early mouse embryo. Nat. Rev. Mol. Cell Biol. 10, 91–103 (2009).
- 35. Nowotschin S, Hadjantonakis AK & Campbell K The endoderm: a divergent cell lineage with many commonalities. Development 146, dev150920 (2019).
- 36. Stuckey DW, Di Gregorio A, Clements M & Rodriguez TA Correct patterning of the primitive streak requires the anterior visceral endoderm. PLoS ONE 6, e17620 (2011).
- 37. Kang HM et al. Multiplexed droplet single-cell RNA-sequencing using natural genetic variation. Nat. Biotechnol. 36, 89–94 (2018).
- 38. Pepe-Mooney BJ et al. Single-cell analysis of the liver epithelium reveals dynamic heterogeneity and an essential role for YAP in homeostasis and regeneration. Cell Stem Cell 25, 23–38 (2019).
- 39. Hill MC et al. A cellular atlas of Pitx2-dependent cardiac development. Development 146, dev180398 (2019).
- 40. Gordon SR et al. PD-1 expression by tumour-associated macrophages inhibits phagocytosis and tumour immunity. Nature 545, 495–499 (2017).
- 41. Savas P et al. Single-cell profiling of breast cancer T cells reveals a tissue-resident memory subset associated with improved prognosis. Nat. Med. 24, 986–993 (2018).
- 42. Yost KE et al. Clonal replacement of tumor-specific T cells following PD-1 blockade. Nat. Med. 25, 1251–1259 (2019).
- 43. Ding J et al. Systematic comparison of single-cell and single-nucleus RNA-sequencing methods. Nat. Biotechnol. 38, 737–746 (2020).
- 44. Grün D et al. De novo prediction of stem cell identity using single-cell transcriptome data. Cell Stem Cell 19, 266–277 (2016).
- 45. Muraro MJ et al. A single-cell transcriptome atlas of the human pancreas. Cell Syst. 3, 385–394 (2016).
- 46. Segerstolpe A et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metab. 24, 593–607 (2016).
- 47. Wang YJ et al. Single-cell transcriptomics of the human endocrine pancreas. Diabetes 65, 3028–3038 (2016).
- 48. Baron M et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Syst. 3, 346–360 (2016).
- 49. Wolf FA et al. PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells. Genome Biol. 20, 59 (2019).
- 50. Giraddi RR et al. Single-cell transcriptomes distinguish stem cell state changes and lineage specification programs in early mammary gland development. Cell Rep. 24, 1653–1666 (2018).
- 51. Maaten LVD Accelerating t-SNE using tree-based algorithms. J. Mach. Learn. Res. 15, 3221–3245 (2014).
- 52. Becht E et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 37, 38 (2018).
- 53. Alter O, Brown PO & Botstein D Singular value decomposition for genome-wide expression data processing and modeling. Proc. Natl Acad. Sci. USA 97, 10101–10106 (2000).
- 54. Kolde R pheatmap: Pretty Heatmaps https://rdrr.io/cran/pheatmap/ (2019).
- 55. Zwiener I, Frisch B & Binder H Transforming RNA-seq data to improve the performance of prognostic gene signatures. PLoS ONE 9, e85150 (2014).
- 56. Law CW, Chen Y, Shi W & Smyth GK voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 15, R29 (2014).
- 57. McCarthy DJ, Campbell KR, Lun AT & Wills QF Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics 33, 1179–1186 (2017).
- 58. Lun AT, Bach K & Marioni JC Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 17, 75 (2016).
- 59. Bacher R et al. SCnorm: robust normalization of single-cell RNA-seq data. Nat. Methods 14, 584–586 (2017).
- 60. R Core Team. R: A Language and Environment for Statistical Computing https://www.R-project.org/ (2019).
- 61. Koopmans LH, Owen DB & Rosenblatt JI Confidence intervals for the coefficient of variation for the normal and log normal distributions. Biometrika 51, 25–32 (1964).
- 62. Ver Hoef JM & Boveng PL Quasi-Poisson vs. negative binomial regression: how should we model overdispersed count data? Ecology 88, 2766–2772 (2007).
- 63. Gonzalez I, Déjean S, Martin P & Baccini A CCA: an R package to extend canonical correlation analysis. J. Stat. Softw. 23, 14 (2008).
- 64. Witten DM, Tibshirani R & Hastie T A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 10, 515–534 (2009).
- 65. Wooldridge JM Introductory Econometrics: A Modern Approach (Cengage, 2018).
- 66. Lambert D Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 34, 1–14 (1992).
- 67. Rousseeuw PJ Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987).
- 68. Maechler M, Rousseeuw P, Struyf A, Hubert M & Hornik K cluster: Cluster Analysis Basics and Extensions https://cran.r-project.org/package=cluster (2019).
- 69. Venables WN & Ripley BD Modern Applied Statistics with S (Springer, 2002).
- 70. Finak G et al. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol. 16, 278 (2015).
- 71. Korthauer KD et al. A statistical approach for identifying differential distributions in single-cell RNA-seq experiments. Genome Biol. 17, 222 (2016).
- 72. Nabavi S, Schmolze D, Maitituoheti M, Malladi S & Beck AH EMDomics: a robust and powerful method for the identification of genes differentially expressed between heterogeneous classes. Bioinformatics 32, 533–541 (2016).
Data Availability Statement
All scRNA-seq datasets used in this study were published previously; their availability is described in Supplementary Table 2.





