Abstract
Computational tools for integrative analyses of diverse single-cell experiments are facing formidable new challenges including dramatic increases in data scale, sample heterogeneity, and the need to informatively cross-reference new data with foundational datasets. Here, we present SCALEX, a deep-learning method that integrates single-cell data by projecting cells into a batch-invariant, common cell-embedding space in a truly online manner (i.e., without retraining the model). SCALEX substantially outperforms online iNMF and other state-of-the-art non-online integration methods on benchmark single-cell datasets of diverse modalities (e.g., single-cell RNA sequencing, scRNA-seq; single-cell assay for transposase-accessible chromatin using sequencing, scATAC-seq), especially for datasets with partial overlaps, accurately aligning similar cell populations while retaining true biological differences. We showcase SCALEX’s advantages by constructing continuously expandable single-cell atlases for human, mouse, and COVID-19 patients, each assembled from diverse data sources and growing with every new dataset. The online data integration capacity and superior performance make SCALEX particularly appropriate for large-scale single-cell applications to build upon previous scientific insights.
Subject terms: Data integration, Computational models, Sequencing, Bioinformatics
Integrative analyses of single-cell datasets are facing new challenges as data size and complexity grow. Here the authors present SCALEX, which projects cells from different datasets into a common latent space, allowing accurate online integration as well as cross-referencing with atlas-scale data.
Introduction
Single-cell experiments enable the decomposition of samples into their constituent, diverse cell-types and cell states1–4. Many computational tools have been developed for integrative analysis of single-cell datasets, all seeking to separate biological variations from non-biological noise, such as batch effects of different donors, conditions, and/or analytical platforms5,6. The scope of the integration task is expanding rapidly with technical advances for single-cell studies, which continue to grow in scale, now exceeding 1 million cells in some cases7,8. Moreover, the range of examined sample types is also increasing, and datasets now often include highly heterogeneous cell subsets9,10. Most importantly, as single-cell studies become more routine, new studies should be informatively cross-referenced to foundational research studies7,8,11–15. Thus, there is a growing need for integration tools that can manage large-scale single-cell data with complex cell-type compositions while also supporting accurate alignment to and exploration within existing datasets.
Most current single-cell data integration methods (e.g., Seurat16–18, MNN19, Harmony20, Conos21, Scanorama22, BBKNN23, etc.) are based on searching across batches for cell correspondences, such as similar individual cells or cell anchors/clusters. These methods suffer from three limitations. First, they are prone to mixing cell populations that only exist in some batches, which becomes a severe problem for the integration of complex datasets that contain non-overlapping cell populations in each batch (i.e., partially overlapping data)16,17. Second, they require computational resources that increase dramatically as the number of cells and of batches increase, making these methods increasingly unsuitable for today’s large-scale single-cell datasets7,8,11–15. Finally, these methods can only remove batch effects from the current dataset being assessed. Each time a new dataset is added, it requires an entirely new integration process that changes the existing integration results of previous studies. This requirement severely limits a tool’s ability to continuously integrate newly arriving single-cell data without recalculating existing integrations from scratch, a capacity referred to as “online” data integration24.
Online data integration ability is becoming increasingly crucial with today’s single-cell experiments. The recently developed tool, online iNMF24, an online version of LIGER25, iteratively applies integrative non-negative matrix factorization (iNMF) to decouple the shared and dataset-specific factors related to cell identities, and thus is able to incorporate new data with existing datasets on-the-fly. Another recently developed package, scvi-tools26, combining scVI27 with scArches28, applies a conditional variational autoencoder (VAE)29 framework to model the inherent distribution of the input single-cell data for data integration. However, the conditional VAE design of scVI requires model augmentation and retraining when integrating new data, meaning that scVI is not an online method. We want to highlight that this online integration ability meets a rapidly growing need in the life sciences and in biomedicine: it enables the alignment of data coming from new single-cell analyses (from the lab and clinic) into the substantial corpus of existing knowledge, especially that from previous foundational single-cell research. Put another way, the online integration capacity obviates the need to augment and/or retrain models when analyzing additional datasets, which both preserves hard-won scientific insights and saves a huge amount of computational resource.
Here, we developed SCALEX as a method for online integration of heterogeneous single-cell data based on a VAE framework. The encoder of SCALEX is designed to be a data projection function that only preserves batch-invariant biological data components when projecting single cells. Importantly, the projection function is a generalized one that requires no retraining on new data, thus allowing SCALEX to integrate single-cell data in an online manner. Working with an extensive collection of benchmark datasets, we demonstrate that SCALEX substantially outperforms online iNMF as well as non-online single-cell data integration tools, in terms of integration accuracy, scalability, and computational efficiency. These advantages make SCALEX particularly appropriate for the integration and research utilization of today’s single-cell datasets, which continue to grow along with the ongoing explosion of single-cell studies in biology and medicine.
Results
SCALEX implements a generalized encoder that enables online integration of single-cell data
To enable online integration, the fundamental design concept underlying SCALEX is to implement a generalized projection function that disentangles the batch-related components away from the batch-invariant components of single-cell data and projects the batch-invariant components into a common cell-embedding space. We previously applied VAE and designed SCALE (Single-Cell ATAC-seq Analysis via Latent feature Extraction) to model and analyze single-cell ATAC-seq data30. We found that the encoder of SCALE has the potential to disentangle cell-type-related and batch-related features in a low-dimensional embedding space.
Here, to obtain a generalized encoder for data projection without retraining, SCALEX includes three specific design elements (Fig. 1a, Supplementary Fig. 1, “Overview of the SCALEX model” in Methods). First, SCALEX implements a batch-free encoder that extracts only biologically related latent features ($z$) from input single-cell data ($x$) and a batch-specific decoder29 that reconstructs the original data from $z$ by incorporating batch information back during data reconstruction. Supplying batch information only to the decoder focuses the encoder exclusively on learning the batch-invariant biological components, which is crucial for the encoder generalizability. In contrast, scVI includes a set of batch-conditioned parameters in its encoder, which prevents the encoder from generalizing to new batches and thus precludes online data integration. Second, SCALEX includes a Domain-Specific Batch Normalization (DSBN)31 layer using multi-branch Batch Normalization32 in its decoder to support incorporation of batch-specific variations during single-cell data reconstruction. Third, the SCALEX encoder employs a mini-batch strategy that samples data from all batches (instead of a single batch), which more tightly follows the overall distribution of the input data. Note that each mini-batch is subjected to a Batch Normalization layer in the encoder to adjust the deviation of each mini-batch and to align it to the overall input distribution.
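The idea behind the second design element can be illustrated with a minimal sketch of domain-specific normalization. This is not the SCALEX implementation: the function name `dsbn_forward` is hypothetical, and the learnable scale/shift parameters and running statistics that a real Batch Normalization layer carries are omitted. The sketch only shows the core behavior assumed here, namely that each cell is normalized with the statistics of its own batch, so that one normalization "branch" exists per batch label.

```python
import math
from collections import defaultdict


def dsbn_forward(values, batch_labels, eps=1e-5):
    """Normalize every feature using statistics of the cell's own batch only.

    values:       list of feature vectors, one per cell
    batch_labels: list of batch identifiers, same length as values
    (Hypothetical helper; affine parameters and running stats are omitted.)
    """
    # Group row indices by batch label: one normalization branch per batch.
    groups = defaultdict(list)
    for i, b in enumerate(batch_labels):
        groups[b].append(i)

    n_features = len(values[0])
    out = [[0.0] * n_features for _ in values]
    for idx in groups.values():
        for j in range(n_features):
            col = [values[i][j] for i in idx]
            mean = sum(col) / len(col)
            var = sum((v - mean) ** 2 for v in col) / len(col)
            for i in idx:
                out[i][j] = (values[i][j] - mean) / math.sqrt(var + eps)
    return out


# Two batches measured on very different scales.
raw = [[1.0], [3.0], [10.0], [30.0]]
labels = ["a", "a", "b", "b"]
normalized = dsbn_forward(raw, labels)
```

In this toy input the two batches differ tenfold in scale, yet after per-batch normalization the corresponding cells receive nearly identical values, which is the kind of batch-specific shift the decoder-side layer is meant to absorb.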
We conducted extensive analyses of SCALEX hyperparameters and also tested the specific contributions of each design element by implementing a set of SCALEX test-variants, each lacking an individual or a combination of the design elements, and evaluating their performance for single-cell data integration (“Ablation studies using test-variants of SCALEX” in Methods). We found that each design element is crucial for the integration performance of SCALEX. More importantly, the combination of these design elements renders the encoder of SCALEX a generalized function capable of accurate projection of single cell data from different batches into a batch-invariant cell-embedding space, making SCALEX a truly online data integration method.
SCALEX integration is substantially more accurate than state-of-the-art single-cell data integration methods
We extensively assessed the basic data integration performance of SCALEX, following the evaluative framework proposed in a recent comparative study33. We examined multiple well-curated scRNA-seq datasets, including human pancreas (eight batches of five studies)34–38, heart (two batches of one study)39 and liver (two studies)40,41; as well as human non-small-cell lung cancer (NSCLC, four studies)42–45 and peripheral blood mononuclear cells (PBMC; two batches assayed by two different protocols)16. Our comparison included online iNMF and other state-of-the-art non-online single-cell data integration methods, including Seurat v3, Harmony, MNN, Conos, BBKNN, Scanorama, LIGER (i.e., batch iNMF), and scVI. We evaluated the integration performance of these tools based on the benchmark datasets by Uniform Manifold Approximation and Projection (UMAP)46 embedding visualization as well as a series of scoring metrics19,20,47–49.
With UMAP embedding, we note that all of the raw datasets displayed strong batch effects, with cell-types that were common in different batches separately distributed. Overall, SCALEX, Seurat v3, and Harmony achieved the best integration performance for most of the datasets by merging common cell-types across batches while keeping disparate cell-types apart (Supplementary Fig. 2). MNN, scVI, and Conos integrated many datasets but left some common cell-types not well-aligned. Online iNMF, LIGER, BBKNN, and Scanorama often had unmerged common cell-types, and sometimes incorrectly mixed distinct cell-types together. For example, considering the T cell populations between the two batches in the PBMC dataset (Fig. 1b), while SCALEX, Seurat v3, Harmony, MNN, and scVI integrations were effective, online iNMF misaligned some of the CD4 naïve T cells with CD8 naïve T cells, and misaligned some NK cells with CD8 T cells.
SCALEX substantially outperformed all of the other methods for cell-type clustering, as assessed by the adjusted Rand Index (ARI)47 and the Normalized Mutual Information (NMI)48 (Fig. 1c, Supplementary Figs. 3, 4). To quantify cell-type separation and batch mixing we used two paired metrics: a pair comprising the Silhouette score49 and the batch entropy mixing score19, as well as a pair comprising the cell-type and integration local inverse Simpson’s Indexes (cLISI and iLISI)20. Overall, SCALEX achieved the highest scores for cell-type separation, and tied with Seurat v3 and Harmony as the best-performing methods on the batch mixing metrics (Supplementary Fig. 4a). Interestingly, we observed that both LIGER and online iNMF often scored the lowest for cell-type separation yet the highest for batch mixing. However, after careful investigation, we concluded that a higher batch mixing score does not necessarily indicate better data integration, but instead often indicates an issue of over-correction, which we consider in-depth in a dedicated subsection below. Finally, we followed the protocol in a recent large-scale study for benchmarking single-cell integration methods to compare SCALEX against ten state-of-the-art methods using multiple scores50 (“Single-cell integration benchmarking (scIB)” in Methods). We observed that SCALEX outperformed all other tools on the pancreas, liver, and NSCLC datasets in terms of the overall score, and ranked the third on the PBMC dataset and the fourth on the heart dataset (Supplementary Fig. 4b).
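For readers unfamiliar with the clustering metrics, the adjusted Rand Index can be computed directly from the contingency table of true cell-type labels against cluster assignments. The following is a minimal standard-library sketch of the textbook ARI formula; the function name is illustrative, at least two cells are assumed, and the benchmark itself presumably relied on an established library implementation rather than this code.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Textbook ARI from pair counts; assumes len(labels_true) >= 2."""
    n = len(labels_true)
    contingency = Counter(zip(labels_true, labels_pred))
    row_sums = Counter(labels_true)
    col_sums = Counter(labels_pred)

    # Number of cell pairs that fall in the same group under each view.
    sum_cells = sum(comb(c, 2) for c in contingency.values())
    sum_rows = sum(comb(c, 2) for c in row_sums.values())
    sum_cols = sum(comb(c, 2) for c in col_sums.values())

    expected = sum_rows * sum_cols / comb(n, 2)   # chance agreement
    max_index = 0.5 * (sum_rows + sum_cols)       # best possible agreement
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_cells - expected) / (max_index - expected)


# Identical partitions up to renaming of cluster IDs score 1.0.
perfect = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# A clustering that cuts across both cell-types scores below zero.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the index is corrected for chance, a random assignment scores near zero and only a clustering that recovers the annotated cell-types approaches one.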
SCALEX is scalable to Atlas-level datasets and accommodates diverse data modalities
Single-cell datasets that contain a large number of cells and consist of heterogeneous and complex samples from multiple tissues have been termed “Atlas-level” datasets in a recent comparative study33. These Atlas-level datasets are posing new challenges to data integration tools. We tested the scalability and computation efficiency of SCALEX by applying it to a typical Atlas-level dataset, the Human Fetal Atlas dataset, which contains 4,317,246 cells from two data batches, GSE156793 and GSE134355 (Supplementary Fig. 5a, b, “Preprocessing for scRNA-seq” in Methods)8,15. SCALEX accurately integrated these two batches, showing good alignment of the same cell-types (Fig. 1d). In addition to SCALEX, only BBKNN, Scanorama, and scVI can be used to integrate this Atlas-level dataset; however, their integrations do not separate and align the cell-types well, as indicated by the UMAP embeddings (Supplementary Fig. 5c) and the low cell-type separation and batch mixing scores (Supplementary Fig. 5d).
We compared the computational efficiency of different methods using down-sampled datasets (of 10 kilo (K), 50 K, 250 K, 1 million (M), and 4 M cells) from this Human Fetal Atlas dataset. Both SCALEX and online iNMF were efficient in runtime and memory, with usage that increased only linearly with data size. scVI is also scalable to 4 M cells with acceptable memory usage, whereas Seurat v3, Harmony, Conos, and LIGER consumed runtime and/or memory that increased exponentially and thus did not scale beyond 1 M cells on a workstation with 64 central processing unit (CPU) cores and 256 gigabytes (GB) of memory (Fig. 1e). Notably, the deep learning framework of SCALEX enables it to run very efficiently on graphics processing unit (GPU) devices, requiring much reduced runtime (using about 20 minutes and 90 GB of memory on the 4 M dataset).
SCALEX can be used to integrate other modalities of single-cell data (e.g., scATAC-seq51,52, cellular indexing of transcriptomes and epitopes by sequencing, CITE-Seq53, etc.) and cross-modality data (e.g., simultaneous analysis of scRNA-seq and scATAC-seq). SCALEX substantially outperformed all other methods for integration of mouse brain scATAC-seq datasets (two batches assayed by single nucleus assay for transposase-accessible chromatin using sequencing, snATAC and 10X)54 (Fig. 1f, Supplementary Fig. 6a–c), and performed well for integration of additional single-cell data modalities including CITE-seq53 and spatial transcriptome MERFISH data55 (Supplementary Fig. 6d, e). We also used SCALEX to integrate a cross-modality dataset (scRNA-seq and scATAC-seq)56,57 and found that SCALEX correctly integrated the two modalities of data and distinguished rare cells that are specific to the scRNA-seq data, including pDC and platelet cells (Fig. 1g), doing so better than other methods including two additional methods scjoint58 and bindSC59, according to both UMAP embeddings and multiple analytical metrics (Supplementary Fig. 7).
SCALEX integrates partially overlapping datasets without over-correction
Many recent single-cell datasets, especially Atlas-level datasets, feature high sample heterogeneity and complex cell-type compositions9,10. These datasets often contain partially overlapping batches where each batch contains some non-overlapping cell populations. For example, the liver dataset is a partially overlapping dataset where the hepatocyte population contains multiple subtypes specific to different batches: three subtypes are specific to LIVER_GSE124395, and two other subtypes only appear in LIVER_GSE115469 (Supplementary Fig. 8).
This partial overlap problem presents a major challenge for single-cell data integration and often leads to an issue of over-correction (i.e., mixing of distinct cell-types), especially for local cell similarity-based methods16,17. For example, Seurat v3 mixed the hepatocyte-CXCL1, hepatocyte-CYP2A13, and hepatocyte-TAT-AS1 cells and Harmony mixed the hepatocyte-CYP2A13 and hepatocyte-TAT-AS1 cells (Fig. 2a). As a global integration method that projects cells into a common cell-embedding space, SCALEX is expected to be less sensitive to this problem. Indeed, we noticed that SCALEX correctly maintained the five hepatocyte subtypes apart (as did scVI; Fig. 2a). Unexpectedly, despite being a global method, online iNMF severely suffered from over-correction, mixing all five hepatocyte subtypes, and even mixing B cells and NK cells (Fig. 2a), presumably because its matrix factorization algorithm forced the alignment of distinct cell-types.
We defined an over-correction score, a metric to measure this over-correction problem based on the percent of cells with inconsistent cell-types in the neighborhood for each cell (“Over-correction score” in Methods). Formally, the over-correction score is a negative index, i.e., the higher the over-correction score, the more severe the extent of inaccurate mixing of cell-types. For the benchmark datasets, SCALEX had the lowest over-correction scores (Fig. 2b), whereas online iNMF yielded extremely high over-correction scores.
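A score of this kind can be sketched as follows. The exact definition is given in the Methods subsection referenced above and is not reproduced here, so the details below are assumptions: Euclidean distance in the embedding, a fixed neighborhood size `k`, and a simple average over cells of the fraction of neighbors whose cell-type differs from that of the cell itself. The function name is illustrative.

```python
import math


def over_correction_score(embeddings, cell_types, k=3):
    """Mean fraction of each cell's k nearest neighbors with a different cell-type.

    Assumed form only: the published score may use a different neighborhood
    size, distance, or weighting.
    """
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        # Rank all other cells by distance to cell i in the embedding.
        ranked = sorted(
            (math.dist(embeddings[i], embeddings[j]), j)
            for j in range(n) if j != i
        )
        neighbors = [j for _, j in ranked[:k]]
        mismatched = sum(cell_types[j] != cell_types[i] for j in neighbors)
        total += mismatched / len(neighbors)
    return total / n


# Two well-separated groups of three cells each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clean = over_correction_score(points, ["A", "A", "A", "B", "B", "B"], k=2)
mixed = over_correction_score(points, ["A", "B", "A", "B", "A", "B"], k=2)
```

When each spatial group holds a single cell-type the score is zero, and it rises once distinct cell-types share the same neighborhood, which matches the "higher is worse" reading described in the text.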
To systematically characterize the performance of different methods on partially overlapping datasets, we constructed test datasets with a range of common cell-types, that we generated based on down-sampling of the six major cell-types in the pancreas dataset (“Generation of partially overlapping datasets” in Methods). SCALEX integration was accurate for all cases, aligning the same cell-types without over-correction, whereas Seurat v3, Harmony, and online iNMF frequently mixed distinct cell-types (Fig. 2c, d). Although scVI showed one of the lowest levels of over-correction when integrating partially overlapping datasets, it is prone to mistakenly splitting one cell-type into many small groups. We noted that the severity of over-correction and error-splitting is amplified as the overlapping number decreases (Supplementary Fig. 9). When there were no common cell-types, both Seurat v3 and Harmony collapsed the six cell-types into three, mixing alpha with gamma cells, beta with delta cells, and acinar with ductal cells to varying extents, whereas scVI split alpha cells into 6 groups. We repeated this down-sampling analysis from the 12 cell-types in the PBMC dataset and observed similar results of over-correction and error-splitting (Supplementary Fig. 10).
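One plausible way to build such test datasets is sketched below. The actual procedure is described in the Methods subsection cited above and may differ; here we assume two batches (labeled 0 and 1), retain a chosen number of cell-types in both batches, and assign each remaining cell-type to only one batch in alternation. The function name and the alternation rule are hypothetical.

```python
def make_partial_overlap(cells, cell_types, n_common):
    """Filter cells so that only n_common cell-types are shared by both batches.

    cells:      list of (batch, cell_type) pairs, with batch in {0, 1}
    cell_types: ordered list of cell-type names
    (Assumed scheme: the first n_common types stay in both batches; the rest
    alternate between batch 0 and batch 1.)
    """
    common = set(cell_types[:n_common])
    exclusive = cell_types[n_common:]
    keep = {
        0: common | set(exclusive[0::2]),
        1: common | set(exclusive[1::2]),
    }
    return [(batch, ctype) for batch, ctype in cells if ctype in keep[batch]]


types = ["alpha", "beta", "delta", "gamma", "acinar", "ductal"]
cells = [(batch, ctype) for batch in (0, 1) for ctype in types]

subset = make_partial_overlap(cells, types, n_common=2)
shared = ({t for b, t in subset if b == 0} & {t for b, t in subset if b == 1})
```

With `n_common=2` only alpha and beta cells remain in both batches; setting `n_common=0` yields the fully non-overlapping case discussed in the text.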
SCALEX increases the scope and resolution of an existing cell space by adding new data through online projection
The generalizability of SCALEX’s encoder to project cells from various sources into a common cell-embedding space without model retraining allows SCALEX to integrate new single-cell data with existing data in an online manner. We tested the online data integration performance of SCALEX for newly arriving data based on the pancreas dataset. Prior to projection, we first used SCALEX to integrate the pancreas dataset and this accurately removed the strong batch effect that was evident in the raw data (Fig. 3a, Supplementary Fig. 11a, b).
We subsequently projected three new batches of scRNA-seq data60–62 for pancreas tissues (Fig. 3b) into this “pancreas cell space” using the same SCALEX encoder trained on the original pancreas dataset. After projection, most of the cells in the new batches were accurately aligned to the correct cell-types in the pancreas cell space, enabling their accurate annotation by cell-type label transfer (Fig. 3c, “Cell-type annotation by label transfer” in Methods). We benchmarked projection accuracy by calculating the ARI, the NMI, and the F1 scores to evaluate cell-type annotation by label transfer against the cell-type information in the original studies. We compared the results with online iNMF and scVI, the only two tools that are able to project cells into an existing cell space (note that data projection with scVI needs model retraining through scArches). SCALEX achieved the highest projection accuracy in comparison with online iNMF and scVI (Fig. 3c). scVI also achieved high accuracy, projecting most cells onto the correct locations, with only a few exceptions among alpha and ductal cells (Supplementary Fig. 11c). Online iNMF mixed distinct cell-types when incorporating new batches, e.g., projecting some alpha cells onto the locations of gamma and delta cells (Supplementary Fig. 11c), which in turn led to wrong annotations during label transfer (Fig. 3c).
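Label transfer after projection can be illustrated with a simple nearest-neighbor vote in the shared embedding. This is a hedged sketch rather than the procedure in Methods: we assume Euclidean distance, a fixed number of neighbors `k`, and an unweighted majority vote, and the function name `transfer_labels` is hypothetical.

```python
import math
from collections import Counter


def transfer_labels(ref_embeddings, ref_labels, query_embeddings, k=3):
    """Annotate each projected cell with the majority label of its k nearest
    reference cells (assumed scheme; ties resolve to the label seen first)."""
    predicted = []
    for q in query_embeddings:
        # Indices of the k reference cells closest to the query cell.
        nearest = sorted(
            range(len(ref_embeddings)),
            key=lambda i: math.dist(q, ref_embeddings[i]),
        )[:k]
        votes = Counter(ref_labels[i] for i in nearest)
        predicted.append(votes.most_common(1)[0][0])
    return predicted


# Reference cell space with two annotated populations.
reference = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
annotation = ["alpha", "alpha", "alpha", "beta", "beta", "beta"]

# Newly projected cells inherit the label of the surrounding population.
new_cells = [(0.2, 0.2), (5.5, 5.5)]
transferred = transfer_labels(reference, annotation, new_cells)
```

The quality of such a transfer depends entirely on the projection placing new cells near the correct reference population, which is why projection accuracy is evaluated through the resulting annotations.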
The ability to project new single-cell data into an existing cell-embedding space allows SCALEX to readily enrich this cell space with additional informative details (i.e., to add biological resolution). To verify this, we projected two additional melanoma data batches (SKCM_GSE72056, SKCM_GSE123139)10,63 onto the previously constructed PBMC space. Again, SCALEX correctly projected all common cell-types onto the same locations in the PBMC cell space (Fig. 3d), but online iNMF mixed tumor cells with plasma, monocyte and CD8 T cells, and scVI split the CD8 T cells into several distinct groups (Supplementary Fig. 12). Importantly, we noticed that for the tumor and plasma cells only present in the melanoma data batches, SCALEX did not project these cells onto any existing cell populations in the PBMC space; rather, it projected them onto new locations close to similar cells, with the plasma cells projected to a location near B cells, and the tumor cells projected to a location near HSC cells (Fig. 3e). This indicates that SCALEX can enrich an existing cell space with new cell-types through data projection.
SCALEX projection also enables post hoc annotation of unknown cell-types in an existing cell space using new data. For instance, we noted a group of previously uncharacterized cells in the pancreas dataset (Fig. 3a). We found that these cells displayed high expression levels of known epithelial gene markers. We therefore assembled a collection of epithelial cells from the bronchial epithelium dataset64, and then projected these epithelial cells onto the pancreas cell space. We found that a group of antigen-presenting airway epithelial (SLC16A7 + epithelial) cells were projected onto the same location of the uncharacterized cells (Fig. 3f). These data, together with the observation that both cell populations showed similar marker gene expression (Fig. 3g), suggest that these uncharacterized cells are also SLC16A7 + epithelial cells. Note that online iNMF and scVI were not able to identify this small group of epithelial cells, because they were split into several smaller groups and/or were often mixed with other cell-types (Supplementary Fig. 2). SCALEX thus enables discovery science in cell biology by supporting exploratory analysis with large numbers of diverse datasets.
SCALEX integration constructs expandable single-cell atlases
The ability to combine heterogenous data into a common cell-embedding space makes SCALEX a powerful tool to construct a single-cell atlas from a collection of diverse datasets. We applied SCALEX integration to three large and complex datasets: the Mouse Atlas dataset (comprising multiple organs from two studies assayed by 10X, Smart-seq2, and Microwell-seq12,14), the Human Atlas dataset (comprising multiple organs from two studies assayed by 10X and Microwell-seq15,65), and the Human Fetal Atlas dataset8,15 (Supplementary Fig. 13).
Despite the strong batch effects in the raw data, SCALEX accurately integrated the three batches of the Mouse Atlas data into a common cell-embedding space (Fig. 4a–c, Supplementary Fig. 14a). Common cell-types were well-aligned at the same position in the cell space, including B, T, and endothelial cells present in all tissues, and proximal tubule, urothelial, and hepatocytic cells from particular tissues. Distinct cell-types were located separately, such as sperm, Leydig, and small intestine cells from the Microwell-seq data, and keratinocyte stem cells and large intestine cells from the Smart-seq2 data, indicating that biological variations were well preserved (Supplementary Fig. 14b, c). We compared SCALEX with all other methods and found that SCALEX performed the best for cell-type clustering, especially for avoiding over-correction (Fig. 4d, e, Supplementary Fig. 13b).
Importantly, atlases generated with SCALEX can be further expanded by projecting new single-cell data to support comparative studies of cells both in the original atlas and in the new data. To illustrate this utility, we projected two additional data batches of aged mouse tissues from Tabula Muris Senis (Smart-seq2 and 10X)13 and two single tissue datasets (lung and kidney)66 onto the SCALEX Mouse Atlas cell space. We found that cells in the new data batches were correctly projected onto the locations of the same cell-types in the cell-embedding space of the initial atlas (Fig. 4f) as confirmed by the accurate cell-type annotations for the new data by label transfer (Fig. 4g).
Following the same strategy, we constructed a SCALEX Human Atlas by integration of multiple tissues from two studies (GSE134355, GSE159929) (Supplementary Fig. 15a, b). SCALEX effectively eliminated the batch effects in the original data and integrated the two datasets (Supplementary Fig. 15c, d). Again, we were able to correctly project two additional human skin datasets (GSE130973, GSE147424)67,68 onto the Human Atlas cell space (Supplementary Fig. 15e), and accurately annotated these projected skin cells (Supplementary Fig. 15f). In sum, these results illustrate that SCALEX enables: i) researchers to evaluate their project-specific single cell datasets by leveraging existing information in large-scale (and ostensibly well annotated) cell atlases; and ii) atlas creators to informatively integrate new datasets and derive new biological insights from new research programs.
An integrative SCALEX COVID-19 PBMC Atlas revealed distinct immune responses among COVID-19 patients
Many single-cell studies have been conducted to analyze COVID-19 patient immune responses69–76. However, these studies often suffer from small sample size and/or limited sampling of various disease states70,76. For a comprehensive study, we used SCALEX to generate a COVID-19 PBMC Atlas, integrating data from nine COVID-19 studies, involving a total of 860,746 single cells in 10 batches69–75 (Fig. 5a, Supplementary Dataset 1). We identified 22 cell-types, each of which has support from gene expression data for canonical markers (Fig. 5b, c, Supplementary Fig. 16a, “Cell-type annotation by clustering” in Methods). Cells across different studies were integrated accurately with the same cell-types aligned together, confirming the integration performance of SCALEX (Supplementary Fig. 16b), which was much better than the other methods (Supplementary Fig. 16c, d).
Interestingly, we found that some cell subpopulations were differentially associated with patient status (Fig. 5d). A subpopulation of CD14 monocytes (CD14-ISG15-Mono) was characterized by its high expression of Type I interferon-stimulated genes (ISGs) and genes enriched with immune-response-related gene ontology (GO) terms (Fig. 5e, f). The frequency of CD14-ISG15-Mono cells increased significantly from mild/moderate to severe patients (Fig. 5g, Supplementary Fig. 17a, “Analysis of changes in cell-type frequency across multiple conditions” in Methods). Within the COVID-19 patients, we observed a significant decrease in ISG gene expression in CD14-ISG15-Mono cells between the mild/moderate and severe cases, suggesting an immune exhaustion-like response in severe COVID-19 patients69 (Fig. 5e).
Additionally, a neutrophil subpopulation (NCF1-Immature_Neutrophil), characterized by decreased expression of the genes responsible for neutrophil activation but elevated expression of genes enriched with viral-process-related GO terms, was specifically enriched in severe versus mild/moderate patients (Supplementary Fig. 17b, c). A plasma cell subpopulation (MZB1-Plasma), characterized by decreased expression of genes related to antibody production and enriched for immune and inflammatory response-related GO terms, was also enriched in severe patients (Supplementary Fig. 17d, e). Thus, the SCALEX COVID-19 PBMC atlas, generated by integrating a highly diverse collection of single-cell data from individual studies, identified multiple immune cell-types that become progressively dysfunctional during COVID-19 disease progression74. Importantly, these cell trends were not and could not have been detected in the small-scale, individual studies that served as the basis for our SCALEX COVID-19 PBMC atlas.
Online integration of the SCALEX COVID-19 PBMC Atlas with the SC4 consortium study
Our analysis based on the SCALEX COVID-19 PBMC Atlas yielded findings consistent with two conclusions from the Single Cell Consortium for COVID-19 in China (SC4) study, a recent large-scale effort that generated a single-cell atlas of over 1 million cells from 171 COVID-19 patients and 25 healthy controls7 (Supplementary Fig. 18a). First, both studies observed the same set of immune cell subpopulations which displayed differential associations with COVID-19 severity. The proportions of CD14 monocytes, megakaryocytes, plasma cells, and pro T cells were elevated with increasing disease severity, while the proportion of pDC and mDC cells decreased (Fig. 5g). Second, based on calculating the same cytokine score and inflammatory score (defined in the SC4 study) for the cells in our SCALEX COVID-19 PBMC Atlas, we confirmed that the monocyte subpopulations are associated with cytokine storms triggered by SARS-CoV-2 infection and are further elevated in severe patients77 (Fig. 5j, “Inflammatory and cytokine score analysis” in Methods, adjusted p-value < 0.01).
SCALEX’s online integration capacity enables us to project the SC4 consortium dataset into the cell space of the SCALEX COVID-19 PBMC Atlas. We found that the cell-types of the two atlases were well-aligned (Fig. 5h, i, Supplementary Fig. 18b, c). Integration of the SC4 data further substantially improved both the scope and resolution of the SCALEX COVID-19 PBMC Atlas. First, these data added macrophages and epithelial cells to the cell space, enabling investigation of their potential involvement in COVID-19. The integration also supported more precise characterization of specific cell subpopulations. For example, the megakaryocyte population, not distinguished in either the SCALEX COVID-19 PBMC Atlas or the SC4 Atlas (Supplementary Fig. 16c), was divided into two subpopulations in the combined atlas after projection of SC4 (Fig. 5h). An exploratory functional analysis of the differentially expressed genes in these two newly delineated megakaryocyte subpopulations (TUBA8-Mega and IGKC-Mega, Supplementary Fig. 18d, e) revealed enrichment for the GO terms “humoral immune response” for IGKC-Mega cells, yet enrichment for “negative regulation of platelet activation” for TUBA8-Mega cells (Fig. 5k). These results illustrate how the continuously expandable single-cell atlases generated using SCALEX capitalize on existing large-scale data resources and also facilitate the discovery of new biological and biomedical insights.
Discussion
Single cell studies are becoming more and more prevalent, growing larger and larger in scale, and expanding in the scope of sample types, often with quite heterogenous cell subsets. Thus, there is a great need for data integration tools to accurately and efficiently handle these Atlas-level datasets33. Further, there is also a need for online integration capacity to continuously incorporate incoming new data with existing integrations without having to recalculate from scratch24. By design, SCALEX learns a generalized projection function to project heterogeneous single-cell data into a common cell-embedding space, enabling it to achieve bona fide online data integration. SCALEX is also computationally efficient, and preserves biological variations and avoids over-correction when integrating partially overlapping datasets.
These features make SCALEX particularly useful for Atlas-level datasets, allowing the integration of many single-cell studies to support ongoing, very large-scale research programs throughout the life sciences and biomedicine. We speculate that use of SCALEX to project single-cell datasets from highly diverse cancer types to construct a pan-cancer single-cell atlas may lead to the discovery of previously unknown cell-types that are common to divergent carcinomas and that function in pathogenesis, malignant progression, and/or metastasis.
Methods
Overview of the SCALEX model
SCALEX applies a variational autoencoder (VAE) to project the different batches of datasets into the same batch-invariant low-dimensional embeddings by learning a batch-free encoder and a batch-specific decoder simultaneously. Since the encoder and decoder are coupled to learn a batch-free encoder, a batch label (b) is only exposed to the decoder within the domain-specific batch normalization (DSBN), thus the decoder captures the batch information while the encoder learns the domain-invariant features. In the encoder, SCALEX takes the input expression profile ($x$) across all the batches as a whole mixture distribution without distinguishing their batch sources and extracts the mean ($\mu$) and variance ($\sigma^2$) of the latent representations (z) in a 10-dimensional embedding space to learn their global data structure. A standard multivariate Gaussian prior is used for z, while the approximated distribution of z is re-parameterized by $z = \mu + \sigma \odot \epsilon$, where $\epsilon$ is sampled from $\mathcal{N}(0, I)$. In the decoder, SCALEX maps the latent representations with batch label (b) back to their original profile. To enable the decoder to capture the batch-specific variations, a DSBN layer is applied to learn a batch-specific normalization for each batch label (b), before transforming them back to their original profile with the new batch variations. To learn the global distribution and avoid over-correction on partially overlapping datasets, within each mini-batch in the training process, SCALEX randomly samples data from all batches and trains on them together with Batch Normalization to smooth the batch-specific shifts and align to the global distribution. Once trained, the encoder of SCALEX generalizes to any batch and serves as a universal function for globally mapping different batches of datasets into the same batch-invariant space.
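SCALEX is trained to maximize the log-likelihood of the observed single-cell sequencing data ($x$):
$$\log p(x) = \log \int p(x, z)\, dz \tag{1}$$
$$= \log \int q(z \mid x)\, \frac{p(x, z)}{q(z \mid x)}\, dz \tag{2}$$
$$\geq \mathbb{E}_{q(z \mid x)}\left[\log \frac{p(x, z)}{q(z \mid x)}\right] \tag{3}$$
The objective is thereby transformed into the evidence lower bound (ELBO), which can be further decomposed into two terms:
$$\mathrm{ELBO} = \mathbb{E}_{q(z \mid x)}\left[\log p(x \mid z)\right] - D_{KL}\left(q(z \mid x) \,\|\, p(z)\right) \tag{4}$$
The first term is the reconstruction term, which minimizes the distance between the generated output data ($\hat{x}$) and the original input data ($x$), calculated as the binary cross entropy between $x$ and $\hat{x}$. The second term is the regularization term, which minimizes the Kullback-Leibler divergence between the posterior distribution and the prior distribution of the latent representations ($z$). To enable a more flexible alignment in the latent space, we adjusted the coefficient of the second term to 0.5 after hyper-parameter optimization via a grid search; thus, the final loss function is:
$$\mathcal{L} = -\mathbb{E}_{q(z \mid x)}\left[\log p(x \mid z)\right] + 0.5\, D_{KL}\left(q(z \mid x) \,\|\, p(z)\right) \tag{5}$$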
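As a rough illustration of the loss described above (a binary cross-entropy reconstruction term plus a KL regularization term weighted by 0.5), the following minimal sketch evaluates both terms for a single cell. It assumes a diagonal Gaussian posterior and a standard normal prior, for which the KL divergence has a closed form; the function names are hypothetical and are not taken from the SCALEX code base.

```python
import math


def bce(x, x_hat, eps=1e-8):
    """Binary cross entropy summed over features; values are expected in [0, 1]."""
    return -sum(
        xi * math.log(xh + eps) + (1.0 - xi) * math.log(1.0 - xh + eps)
        for xi, xh in zip(x, x_hat)
    )


def kl_to_standard_normal(mu, var):
    """Closed-form KL( N(mu, diag(var)) || N(0, I) )."""
    return 0.5 * sum(m * m + v - 1.0 - math.log(v) for m, v in zip(mu, var))


def vae_loss(x, x_hat, mu, var, beta=0.5):
    """Reconstruction term plus beta-weighted KL term (beta=0.5 as in the text)."""
    return bce(x, x_hat) + beta * kl_to_standard_normal(mu, var)


# Toy example: three scaled expression values and a 2-dimensional latent space.
x = [0.0, 1.0, 0.5]
x_hat = [0.1, 0.9, 0.5]
loss = vae_loss(x, x_hat, mu=[0.3, -0.2], var=[0.8, 1.1])
print(f"loss = {loss:.4f}")
```

Lowering beta below 1 weakens the pull of the posterior toward the prior, which is one way to read the "more flexible alignment" mentioned in the text.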
The overall network architecture of SCALEX consists of an encoder and a decoder. The encoder is a two-layer neural network (fully connected [1024]-BN-ReLU-fully connected [10]) that outputs the mean ($\mu$) and variance ($\sigma^2$) of the 10-dimensional latent representations ($z$), which are then obtained by reparameterization. The decoder has only one layer (no hidden layer), directly connecting the latent representations ($z$) to the output ($\hat{x}$) (fully connected-DSBN-Sigmoid): the latent representations ($z$) and the batch label (b) are provided as input to the domain-specific batch normalization, which is followed by a Sigmoid activation function. We used the Adam78 optimizer with a weight decay of 5e-4 and betas of (0.9, 0.999; the exponential decay rates for the first and second moment estimates) to optimize the model with a learning rate of 2e-4. We adopted a mini-batch strategy to iteratively optimize the model: in each mini-batch, we randomly sampled data from all batches instead of from the same batch, with a mini-batch size of 64. The maximum number of training iterations is 30,000, and early stopping is triggered when there has been no improvement for 10 epochs. The hyper-parameters were chosen after a grid search. SCALEX is very robust to all of these hyper-parameters; all of the results in this manuscript were produced under the same parameters.
Domain-specific batch normalization (DSBN)
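Batch normalization (BN)32 is a widely used training technique in deep neural networks to reduce internal covariate shift. A BN layer whitens activations within a mini-batch of samples, followed by scaling and shifting with learned affine parameters $\gamma$ and $\beta$. For a mini-batch of $m$ samples, $\mathcal{B} = \{x_1, \ldots, x_m\}$:
$$\mu_{\mathcal{B}} = \frac{1}{m} \sum_{i=1}^{m} x_i \tag{6}$$
$$\sigma_{\mathcal{B}}^2 = \frac{1}{m} \sum_{i=1}^{m} \left(x_i - \mu_{\mathcal{B}}\right)^2 \tag{7}$$
$$\hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}} \tag{8}$$
$$y_i = \gamma \hat{x}_i + \beta \tag{9}$$
where $\mu_{\mathcal{B}}$ is the mini-batch mean, $\sigma_{\mathcal{B}}^2$ is the mini-batch variance, $\hat{x}_i$ is the output normalized by $\mu_{\mathcal{B}}$ and $\sigma_{\mathcal{B}}^2$, $y_i$ is the BN output obtained by scaling and shifting with parameters $\gamma$ and $\beta$, and $\epsilon$ is a constant added to the mini-batch variance for numerical stability.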
Domain-specific batch normalization (DSBN)31 is a combination of multiple sets of BN, each specific to one domain. DSBN learns domain-specific affine parameters $\gamma_d$ and $\beta_d$ for each domain, where d is the domain label; here, domains represent different batches. In the neural network, DSBN serves like a multi-channel BN and switches to the corresponding BN given the domain label d. The DSBN layer can be written as:
$$y_i = \gamma_d \hat{x}_i + \beta_d \tag{10}$$
where d is the batch label, $\hat{x}_i$ is the activation normalized with the mini-batch statistics of domain d, and $\gamma_d$ and $\beta_d$ are domain-specific affine parameters for domain d.
DSBN captures the domain-specific information by estimating mini-batch statistics and learning affine parameters for each domain separately, thus enabling the network to learn the domain-invariant features.
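To make the relation between BN and DSBN concrete, the sketch below normalizes a single scalar feature. The plain BN function uses one set of statistics and affine parameters, whereas the DSBN function keeps a separate set per domain label. This is a simplified, dependency-free illustration: the function names and the `params` dictionary are hypothetical, and running statistics used at inference time are omitted.

```python
import math


def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a mini-batch of scalars, then scale by gamma and shift by beta."""
    m = len(xs)
    mean = sum(xs) / m
    var = sum((x - mean) ** 2 for x in xs) / m
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in xs]


def dsbn(xs, domains, params, eps=1e-5):
    """Apply a separate batch norm per domain; params maps d -> (gamma_d, beta_d)."""
    out = [0.0] * len(xs)
    for d in set(domains):
        idx = [i for i, label in enumerate(domains) if label == d]
        gamma_d, beta_d = params[d]
        normalized = batch_norm([xs[i] for i in idx], gamma_d, beta_d, eps)
        for i, y in zip(idx, normalized):
            out[i] = y
    return out


# Two batches with the same structure but a large batch-specific shift.
xs = [1.0, 2.0, 3.0, 101.0, 102.0, 103.0]
domains = ["a", "a", "a", "b", "b", "b"]
ys = dsbn(xs, domains, params={"a": (1.0, 0.0), "b": (1.0, 0.0)})
print(ys)
```

With identity affine parameters, the shift between the two batches disappears after per-domain normalization, which illustrates how the batch-specific variation can be absorbed by this layer rather than by the encoder.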
Preprocessing for scRNA-seq
We downloaded gene expression matrices and preprocessed them using the following procedure: i). Cells with fewer than 600 genes and genes present in fewer than 3 cells were filtered out. ii). Total counts of each cell were normalized to 10,000. iii). Values of each gene were subjected to log transformation with an offset of 1. iv). The top 2000 highly variable genes were identified. v). Values of each gene were normalized to the range of 0-1 within each batch by the MaxAbsScaler function in the scikit-learn package in Python. The processed matrix was used as input for the SCALEX model for downstream differential gene expression analysis.
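The arithmetic of steps ii), iii), and v) can be sketched on a toy count matrix as follows. This is a dependency-free stand-in rather than the pipeline itself: the study used Scanpy and the scikit-learn MaxAbsScaler, filtering and highly-variable-gene selection are omitted here, and the function name is hypothetical. The sketch assumes every cell has a nonzero total count, which the filtering step would guarantee.

```python
import math


def normalize_log_scale(counts, batches, target_sum=10_000):
    """counts: cells x genes (list of lists); batches: one batch label per cell."""
    # ii) scale each cell to target_sum total counts, iii) apply log(1 + x)
    logged = []
    for cell in counts:
        total = sum(cell)
        logged.append([math.log1p(v * target_sum / total) for v in cell])

    # v) divide each gene by its maximum absolute value within each batch
    n_genes = len(counts[0])
    scaled = [row[:] for row in logged]
    for b in set(batches):
        rows = [i for i, label in enumerate(batches) if label == b]
        for g in range(n_genes):
            max_abs = max(abs(logged[i][g]) for i in rows)
            if max_abs > 0:
                for i in rows:
                    scaled[i][g] = logged[i][g] / max_abs
    return scaled


counts = [[10, 0, 90], [5, 5, 0], [0, 20, 20], [1, 1, 2]]
batches = ["a", "a", "b", "b"]
scaled = normalize_log_scale(counts, batches)
for row in scaled:
    print([round(v, 3) for v in row])
```

Because the scaling is done per batch, every gene that is expressed in a batch reaches a maximum of exactly 1 within that batch, which keeps the input compatible with the Sigmoid output of the decoder.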
For the human fetal atlas dataset, we collected two batches (batch GSE156793, which contains 4,062,980 cells by sciRNA-seq3, and batch GSE134355, which contains 254,266 cells by Microwell-seq). We then selected the cells from the common tissues (1,369,619 cells) for integration and computational efficiency benchmarking (down-sampled from different data sizes including 10 K, 50 K, 250 K, 1 M, and 4 M).
Preprocessing for scATAC-seq
We downloaded open chromatin profile matrices (peaks or bins), merged them by peaks (or bins), and processed them using the following procedure: i). The combined matrix was binarized, and bins present in fewer than 3 cells were filtered out. ii). The top 30,000 most variable peaks (or bins) were selected using the select_var_feature function in the EpiScanpy79 package. iii). Total counts of each cell were normalized to the median of the total counts of all cells by using the normalize_total function in the Scanpy80 package, with the parameter target_sum=None. iv). Values of each peak (or bin) were normalized to the range of 0-1 within each batch by the MaxAbsScaler function in the scikit-learn package in Python. The processed matrix was used as input for the SCALEX model.
Preprocessing for cross-modality data (scRNA-seq and scATAC-seq)
We first created a gene activity matrix with the GeneActivity function in the Signac81 R package to quantify the activity of each gene from scATAC-seq data. We then combined the gene activity score matrix with the scRNA-seq data matrix as two individual “batches” for integration. The subsequent preprocessing followed the same procedure used for the scRNA-seq data (above).
Clustering
For Harmony, MNN, Conos, BBKNN, Scanorama, scVI, LIGER, and online iNMF, we used their latent features with method-specific default dimensions for further clustering. For Seurat v3, we initially performed integration and obtained the 2000-dimensional latent feature vectors following the standard workflow, and then we used PCA for dimensionality reduction because 2000-dimensional latent feature vectors are too high-dimensional to cluster directly. Finally, we used 50-dimensional PCA latent feature vectors for clustering. For Conos and BBKNN, since they do not provide latent feature vectors after integration (and we failed to extract the latent feature vectors from either their constructed neighborhood graphs or joint graphs), we used UMAP features for downstream clustering.
To ensure a fair comparison, we used the scanpy.tl.leiden and scanpy.tl.louvain functions for clustering with resolution=0.5. For BBKNN and Conos, since resolution=0.5 generates too many clusters, we also included clustering results with resolution=0.05, which were used in our benchmark comparison (more details in Supplementary Dataset 2).
Visualization
The UMAP algorithm46 was used for visualization. We applied the neighbors function from the Python package Scanpy with the parameters n_neighbors=30 and metric=“Euclidean” for computing the neighbor graph, followed by the umap function with min_dist=0.1 to visualize cells in a two-dimensional space. Tissue anatomy diagrams were generated with the gganatogram (v2) R package82,83.
Adjusted Rand Index
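The Rand Index (RI) computes a similarity score between two clustering assignments by considering matched and unmatched assignment pairs, independent of the number of clusters. The Adjusted Rand Index (ARI) score is the RI adjusted for chance, calculated as follows:
$$\mathrm{ARI} = \frac{\mathrm{RI} - E[\mathrm{RI}]}{\max(\mathrm{RI}) - E[\mathrm{RI}]} \tag{11}$$
Given the contingency table, the ARI can also be represented by:
$$\mathrm{ARI} = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\right] / \binom{n}{2}}{\frac{1}{2}\left[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\right] - \left[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\right] / \binom{n}{2}} \tag{12}$$
where $n_{ij}$ is the number of cells assigned to cluster i in one clustering and to cluster j in the other, $a_i$ and $b_j$ are the row and column sums of the contingency table, and $n$ is the total number of cells.
The ARI score is 0 for random prediction and 1 for a perfect match.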
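The contingency-table form of the ARI can be checked on small label vectors with a few lines of Python. The sketch below uses only the standard library and the standard pair-counting formula; in practice, a library routine such as adjusted_rand_score in scikit-learn serves the same purpose.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """ARI computed from the contingency table of two label assignments."""
    n = len(labels_a)
    # Pairs of cells that fall into the same cell of the contingency table.
    sum_ij = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs that share a cluster in each assignment separately (row/column sums).
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:  # degenerate case, e.g. one cluster in both
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


# Same partition with permuted label names: a perfect match.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
# Partitions that disagree on every pair.
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

The second call illustrates that the ARI can fall below 0 when two assignments agree less often than expected by chance.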
Normalized mutual information
$$\mathrm{NMI} = \frac{I(P; T)}{\sqrt{H(P)\, H(T)}} \tag{13}$$
where P and T are categorical distributions for the predicted and real clustering, I is the mutual information, and H is the Shannon entropy.
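A minimal sketch of the NMI computation follows. It assumes that the mutual information is normalized by the geometric mean of the two entropies; other normalizations (for example, the arithmetic mean) are also in common use, so this choice should be read as an assumption rather than as the definitive variant used here.

```python
from collections import Counter
from math import log, sqrt


def entropy(labels):
    """Shannon entropy of a label assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(pred, true):
    """Mutual information between predicted and reference labels."""
    n = len(pred)
    joint = Counter(zip(pred, true))
    count_p, count_t = Counter(pred), Counter(true)
    return sum(
        (c / n) * log((c * n) / (count_p[p] * count_t[t]))
        for (p, t), c in joint.items()
    )


def nmi(pred, true):
    """Mutual information normalized by the geometric mean of the entropies."""
    h_p, h_t = entropy(pred), entropy(true)
    if h_p == 0 or h_t == 0:  # a single cluster carries no information
        return 1.0 if h_p == h_t else 0.0
    return mutual_information(pred, true) / sqrt(h_p * h_t)


print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # identical partitions
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))  # independent partitions
```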
Silhouette score
We used the silhouette score to assess the separation of biological populations with the function silhouette_score in the scikit-learn package in Python. The silhouette score was computed by combining the average intra-cluster distance (a) and the average nearest-cluster distance (b) for each cell:
$$s = \frac{b - a}{\max(a, b)} \tag{14}$$
Here, we took UMAP embeddings as input to calculate silhouette score.
Batch entropy mixing score
The batch entropy mixing score (adapted from “entropy of batch mixing”19) was used to assess the regional mixing of cells from different batches, with a high score suggesting that cells from different batches are well mixed together.
The batch entropy mixing score was computed as follows:
1. Calculated the proportion $P_i$ of cell numbers in each batch to the total cell numbers.
2. Randomly chose 30 cells from all batches.
3. Calculated the 30 nearest neighbors for each randomly chosen cell.
4. The regional mixing entropy for each cell was defined as:
$$E = -\sum_{i=1}^{n} p_i' \log p_i' \tag{15}$$
where $n$ is the number of batches, $p_i$ is the proportion of cells from batch i in a given region, such that $\sum_{i=1}^{n} p_i = 1$, and $p_i'$ is a correction term to eliminate the deviation caused by the different cell numbers in different batches:
$$p_i' = \frac{p_i / P_i}{\sum_{j=1}^{n} p_j / P_j} \tag{16}$$
The total mixing entropy was then calculated as the sum of the regional mixing entropies.
5. Repeated (2)-(4) for 10 iterations with different randomly chosen cells and calculated the average, E, as the final batch entropy mixing score.
Note that to mitigate the effect of misalignment of batch-specific cell-types, we calculated the batch entropy mixing score only based on cells from cell-types that are common in different batches.
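The regional entropy with batch-size correction can be illustrated for a single neighborhood as follows. The sketch assumes that the correction divides each local batch proportion by the corresponding global proportion and renormalizes so that the corrected proportions sum to 1; the function name is hypothetical, and the neighbor search and the averaging over repeated random draws are left out.

```python
from collections import Counter
from math import log


def corrected_mixing_entropy(neighbor_batches, global_props):
    """Entropy of batch labels in one neighborhood, corrected for batch size.

    neighbor_batches: batch labels of the nearest neighbors of one cell.
    global_props: dict mapping batch label -> global proportion P_i.
    """
    k = len(neighbor_batches)
    counts = Counter(neighbor_batches)
    # Local proportion p_i relative to the global proportion P_i (assumed form).
    ratios = {b: (counts.get(b, 0) / k) / P for b, P in global_props.items()}
    total = sum(ratios.values())
    entropy = 0.0
    for r in ratios.values():
        p_corrected = r / total
        if p_corrected > 0:
            entropy -= p_corrected * log(p_corrected)
    return entropy


all_batches = ["A"] * 90 + ["B"] * 10
global_props = {b: c / len(all_batches) for b, c in Counter(all_batches).items()}

# A neighborhood that mirrors the global 9:1 composition is treated as well mixed.
mixed = corrected_mixing_entropy(["A"] * 27 + ["B"] * 3, global_props)
# A neighborhood drawn from a single batch has zero entropy.
unmixed = corrected_mixing_entropy(["A"] * 30, global_props)
print(mixed, unmixed)
```

Without the correction, the 27:3 neighborhood would score well below the two-batch maximum of log 2 even though it reflects the best mixing achievable for batches of such unequal size.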
Local inverse Simpson’s Index (LISI)
The LISI metric was proposed by Korsunsky et al. 201920 to assess batch and cell-type mixing. We calculated integration LISI (iLISI) and cell-type LISI (cLISI) values using the compute_lisi function in the lisi R package. UMAP embeddings, batch labels, and cell-type labels were used as input in calculation. Briefly,
$$\mathrm{LISI}(x_i) = \frac{1}{\sum_{y \in Y} p(y \mid x_i)^2}, \quad i = 1, \ldots, N \tag{17}$$
where $x_i$ is the i-th cell’s UMAP embedding in the dataset of size N, and Y is the set of unique values with respect to the type of LISI we are computing (i.e., Y is the values of “batch label” for calculating iLISI and the values of “cell-type label” for calculating cLISI). The probability $p(y \mid x_i)$ refers to the “relative abundance” of the covariate y within the KNN (k-nearest neighborhood) of $x_i$. A Gaussian kernel-based distribution of neighborhoods was used and the perplexity was fixed to 30.
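The core of the LISI is the inverse Simpson's index of the labels found in a cell's neighborhood. The sketch below computes that quantity with uniform neighbor weights, which is a deliberate simplification: the lisi R package weights neighbors with a Gaussian kernel at a fixed perplexity, and that weighting is not reproduced here.

```python
from collections import Counter


def inverse_simpson(neighbor_labels):
    """Inverse Simpson's index of the labels in one neighborhood (uniform weights)."""
    k = len(neighbor_labels)
    return 1.0 / sum((c / k) ** 2 for c in Counter(neighbor_labels).values())


# One label only: the effective number of labels is 1.
print(inverse_simpson(["batch1"] * 10))
# Two labels in equal proportion: the effective number of labels is 2.
print(inverse_simpson(["batch1"] * 5 + ["batch2"] * 5))
```

Read this way, a high iLISI (computed on batch labels) indicates good batch mixing, whereas a cLISI (computed on cell-type labels) close to 1 indicates that neighborhoods contain a single cell-type.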
Over-correction score
We defined an over-correction score to assess the level of over-correction problem, based on calculating the percentage of cells with inconsistent cell-types in each cell’s neighborhood. We calculated the over-correction score over all cells, and for each cell i we averaged the frequency of the k-nearest neighboring cells with distinct cell-types to the cell i (see the following equation).
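$$\text{over-correction score} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{k} \sum_{j \in N_k(i)} I(c_i, c_j) \tag{18}$$
where n is the total cell number, k is the number of nearest neighbors considered for each cell, $N_k(i)$ denotes the k-nearest neighbors of cell i, the cell-type of cell i is $c_i$, the cell-type of the neighboring cell j is $c_j$, and $I$ is an indicator function defined as:
$$I(c_i, c_j) = \begin{cases} 1, & c_i \neq c_j \\ 0, & c_i = c_j \end{cases} \tag{19}$$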
Formally, the over-correction score is a negative index, i.e., the higher the over-correction score, the more severe the extent of inaccurate mixing of cell-types.
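The definition above translates directly into code. The following sketch uses a brute-force Euclidean neighbor search, which is only practical for small examples; the function name is hypothetical, and a real analysis would rely on an efficient nearest-neighbor index.

```python
import math


def over_correction_score(embeddings, cell_types, k):
    """Average fraction of each cell's k nearest neighbors with a different cell-type."""
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        # Brute-force k nearest neighbors of cell i, excluding the cell itself.
        neighbors = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(embeddings[i], embeddings[j]),
        )[:k]
        mismatches = sum(cell_types[j] != cell_types[i] for j in neighbors)
        total += mismatches / k
    return total / n


# Two well-separated cell-types: no neighbor has a different label.
separated = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
print(over_correction_score(separated, ["T"] * 3 + ["B"] * 3, k=2))

# Alternating cell-types along a line: every nearest neighbor has a different label.
interleaved = [[0], [1], [2], [3]]
print(over_correction_score(interleaved, ["T", "B", "T", "B"], k=1))
```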
F1 score
We calculated the F1 score with the function f1_score with average=“macro” in the scikit-learn package in Python.
Single-cell integration benchmarking (scIB)
The scores for all 12 examined metrics were calculated using the Python package scIB50 with default parameters. The batch_correction_mean, bio_conservation_mean, and overall scores (rectangles) were calculated as described in the work of Luecken et al.50 to assess the performances of different methods in terms of the batch removal, the conservation of biological variance, and the overall accuracy scores, respectively.
Comparison with other integration methods
We compared SCALEX to nine other batch effect removal methods (see below for specific details of each method). For each dataset as input for all methods, we performed the same filtration, followed by method-specific normalization, batch correction and visualization. Note that for visual comparison, we also included the embeddings of the raw input data, wherein we performed dimensionality reduction by Principal Component Analysis (PCA)84 followed by UMAP visualization to see the batch effects. No correction function was used. All parameters were kept as default values.
Scanorama (v1.6). We performed the preprocessing pipelines as stated above (the same applies to the methods below), and used the Scanpy and scanorama Python packages for integration. For the highly_variable_genes function, we set flavor=“seurat”, batch_key=“batch”, and n_top_genes=2,000. After extracting highly variable genes, we divided the datasets according to the batch labels and formed a new list of datasets as the input for the correct_scanpy function. The integration matrix was kept for downstream analysis. All other parameters were kept at their default values.
BBKNN (v1.3.12). We used Scanpy and bbknn Python packages and followed the suggested pipelines for integration. For the highly_variable_genes function, we set flavor=“seurat”, batch_key=“batch”, and n_top_genes=2,000. After selecting cell neighbors at the low-dimensional space from the PCA analysis, we performed the bbknn function with neighbors_within_batch=5, n_pcs=20, and trim=0. All other parameters were default.
scVI (scvi-tools, v0.11): We used the scvi Python package and followed the suggested pipelines. Batch information was added to the VAE model by setting n_batch.
Seurat v3 (v3.2.3): We used the Seurat R package and followed the standard integration workflow. We normalized different batches of a dataset separately. For the FindVariableFeatures function, we set selection.method=“vst” and nfeatures=2000 to select 2000 highly variable genes for each batch of a dataset. For the FindIntegrationAnchors function, we set k.filter=100. All other parameters were kept at default values. If the number of input cells in a dataset exceeded 50,000, we employed the reciprocal PCA and reference-based integration to improve computational efficiency.
Harmony (v1.0): We used the harmony R package. We created a Seurat object with all cells and performed the standard workflow. After PCA, we used the RunHarmony function for integration. All parameters were default.
Conos (v1.3.1): We used the Conos R package. For each batch of dataset, we used the basicSeuratProc and RunTSNE functions for preprocessing. After that, we built a joint graph using the buildGraph function with k=30 and k.self=5. All other parameters were default.
MNN (FastMNN, v0.3.0): We used the SeuratWrappers R package. We created a Seurat object with all cells and performed the standard workflow. Then we used the RunFastMNN function with default parameters for integration.
Online iNMF and LIGER (LIGER, v1.0.0): We used the rliger R package. For the online iNMF method, we used the online_iNMF function with k=20, miniBatch_size=5,000 and max.epochs=5. For the LIGER method, we used the optimizeALS function with k=20. All other parameters were the default values. Different from other methods, online iNMF only loads one mini-batch from the whole data in the HDF5 file format (converted from the original data format by the rhdf5 R package) for a memory-efficient implementation; accordingly, a file conversion issue with the down-sampled human fetal atlas dataset of 4 M data size prevented online iNMF from calculating computational efficiency with the 4 M.
scJoint: We used the scJoint Python package. We pre-processed the data into the standard input format for scJoint, and then modified the config.py file in the scJoint package and set the same training config parameters as used in the tutorial of “Analysis of PBMC data from 10x Genomics using scJoint” (https://github.com/sydneybiox/scJoint/blob/main/tutorial/Analysis%20of%2010xGenomics%20data%20using%20scJoint.ipynb).
bindSC (v1.0.0): We used the bindSC R package. Following the tutorial, we first performed dimension reductions for gene expression, for the gene activity scores, and for the chromatin accessibility profiles, using the dimReduce function with K=30. Subsequently, we ran the BiCCA function with lambda=0.5, alpha=0.5, and K=20. All other parameters were default.
Cell-type annotation by clustering
This type of annotation was used for de novo annotation of a single-cell dataset. We used the Leiden clustering85 method for cell clustering (specifically employing the leiden function from the Python package Scanpy with default parameters). Then, for each cluster, we annotated its cell-type based on: i) cell-type annotations of each cell in the original study, if available, or ii) expression levels of canonical marker genes in each cell. A majority vote strategy was used when needed. Similar to Ren et al. 2021, we also employed a hierarchical annotation strategy, i.e., we first clustered all cells in a dataset into several major clusters, and then further clustered some of the large clusters into minor clusters.
Single cell projection
We defined single cell projection as the operation to convert high-dimensional single-cell data (e.g., gene expression profiles in scRNA-seq or open chromatin profiles in scATAC-seq) to low-dimensional representations in the common SCALEX cell-embedding space using the trained encoder.
Cell-type annotation by label transfer
This type of annotation was used for annotation of a new single-cell data batch using the annotations in a large single-cell dataset as a reference, or for post hoc annotations of unknown cell population(s) in a large dataset using new batches of data of known cell-types. Both scenarios require “single cell projection” (see above).
The basic idea of cell-type annotation by label transfer is that the same cell-types will occupy the same locations in the low-dimensional SCALEX cell-embedding space; thus, cell-type annotations in one data batch can be transferred to another data batch for the cells positioned at the same locations. Technically, we used the KNeighborsClassifier function from the scikit-learn package to train a prediction model, using the representations (in the low-dimensional cell-embedding space) of the single-cell data with known cell-type labels as input. We then used this model to make cell-type predictions for cells without annotations, using their representations (in the low-dimensional cell-embedding space) as input. For comparison, label transfer for online iNMF follows the same procedure as for SCALEX, predicting the cell-type based on the projected locations.
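The principle of label transfer in the embedding space can be sketched with a plain k-nearest-neighbor majority vote. This is a dependency-free stand-in for the scikit-learn KNeighborsClassifier used in the study; the function name, the toy embeddings, and the choice of k are illustrative assumptions.

```python
import math
from collections import Counter


def knn_label_transfer(ref_embeddings, ref_labels, query_embeddings, k=5):
    """Predict a label for each query cell by majority vote among its k nearest
    reference cells (Euclidean distance in the shared embedding space)."""
    predictions = []
    for q in query_embeddings:
        nearest = sorted(
            range(len(ref_embeddings)),
            key=lambda j: math.dist(q, ref_embeddings[j]),
        )[:k]
        votes = Counter(ref_labels[j] for j in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return predictions


# Annotated reference cells and unannotated query cells in a shared 2-D space.
ref_embeddings = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
ref_labels = ["T cell"] * 3 + ["B cell"] * 3
query_embeddings = [[0.5, 0.5], [10.5, 10.5]]
predicted = knn_label_transfer(ref_embeddings, ref_labels, query_embeddings, k=3)
print(predicted)
```

The approach works only to the extent that the reference and query cells were projected by the same encoder, so that distances between them are meaningful.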
Similarity matrix and confusion matrix
We used a similarity matrix to evaluate the congruence of two different batches for the same cell-types in the common cell-embedding space. Technically, we merged all cells with the same cell-type label and calculated an average representation (in the low-dimensional cell-embedding space) for the cell-type. This was repeated for all cell-types. We then calculated the similarity matrix S = [Sij] for the cell-type similarities of the two batches, where Sij is the Pearson correlation coefficient between the average representation of cell-type i in data_batch_1 and the average representation of cell-type j in data_batch_2.
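A minimal sketch of the similarity-matrix computation is given below: cells are averaged per cell-type within each batch, and the Pearson correlation is computed between every pair of average representations. The helper names and the toy 3-dimensional embeddings are hypothetical, and the sketch assumes that no average representation is constant across dimensions, since the correlation would then be undefined.

```python
from math import sqrt


def mean_by_type(embeddings, labels):
    """Average embedding vector for each cell-type label."""
    sums, counts = {}, {}
    for vec, label in zip(embeddings, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for d, value in enumerate(vec):
            acc[d] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length vectors."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm = sqrt(sum((a - mean_u) ** 2 for a in u) * sum((b - mean_v) ** 2 for b in v))
    return cov / norm


def similarity_matrix(emb_1, labels_1, emb_2, labels_2):
    """S[i][j]: correlation of cell-type i in batch 1 with cell-type j in batch 2."""
    means_1, means_2 = mean_by_type(emb_1, labels_1), mean_by_type(emb_2, labels_2)
    types_1, types_2 = sorted(means_1), sorted(means_2)
    matrix = [[pearson(means_1[i], means_2[j]) for j in types_2] for i in types_1]
    return types_1, types_2, matrix


emb_1 = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.2], [3.0, 2.0, 1.0]]
emb_2 = [[1.1, 2.0, 3.0], [3.0, 2.1, 1.0]]
types_1, types_2, S = similarity_matrix(emb_1, ["B", "B", "T"], emb_2, ["B", "T"])
print(types_1, types_2, S)
```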
We used the confusion matrix to evaluate the accuracy of cell-type annotations (prediction) when a gold-standard annotation is available, which is typical for “cell-type annotation by label transfer” (see above). In cell-type annotation by label transfer, we predict the cell-types for a single-cell data_batch_1, using the annotations in another data_batch_2. When data_batch_1 was already annotated with cell-types, we can calculate the confusion matrix C=[Cij] to compare the cell-type predictions with the existing cell-type annotations, where Cij equals the percentage of cells known to be in cell-type i and predicted to be in cell-type j.
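The confusion matrix described above can be sketched as follows. The text leaves the denominator of the percentage open; this sketch assumes that each row is normalized by the number of cells known to be in cell-type i, so that every row sums to 100. The function name is hypothetical.

```python
from collections import Counter


def confusion_matrix_percent(true_labels, predicted_labels):
    """C[i][j]: percentage of cells of known type i that were predicted as type j."""
    types = sorted(set(true_labels) | set(predicted_labels))
    counts = Counter(zip(true_labels, predicted_labels))
    row_totals = Counter(true_labels)
    matrix = []
    for t in types:
        row = [
            100.0 * counts[(t, p)] / row_totals[t] if row_totals[t] else 0.0
            for p in types
        ]
        matrix.append(row)
    return types, matrix


true_labels = ["B", "B", "T", "T"]
predicted_labels = ["B", "T", "T", "T"]
types, C = confusion_matrix_percent(true_labels, predicted_labels)
print(types)
print(C)
```

With this normalization, the diagonal entries can be read as per-cell-type recall, which makes rare and abundant cell-types directly comparable.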
Generation of partially overlapping datasets
To simulate partially overlapping datasets from the pancreas dataset, we used the pancreas_celseq2 and pancreas_smartseq2 data batches, and worked with only six cell-types (alpha, beta, ductal, acinar, delta, gamma). For each simulated partially overlapping dataset, we randomly selected three to six cell-types from each batch, and counted the number of the common cell-types, which was used as the indicator for the overlapping level (whole integers, 0 to 6). We required the union of cell-types in the newly simulated partially overlapping dataset to cover all six cell-types.
For the PBMC dataset, we used both of the two data batches and worked with twelve cell-types (B, CD4 T, CD4 naive T, CD8 T, CD8 naive T, DC, HSC, Megakaryocyte, NK, monocyte-CD14, monocyte-FCGR3A, pDC). We used the same down-sampling strategy as for the pancreas dataset (above).
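The sampling scheme for the pancreas simulation can be sketched as follows. The text states the constraints (three to six cell-types per batch, with the union covering all six) but not how they were enforced; the rejection loop below is one simple way to satisfy them and should be read as an assumption, as should the function name.

```python
import random

CELL_TYPES = ["alpha", "beta", "ductal", "acinar", "delta", "gamma"]


def simulate_partial_overlap(rng, cell_types=CELL_TYPES, min_types=3):
    """Draw cell-type subsets for two batches until their union covers all types.

    Returns both subsets and the number of shared cell-types (the overlap level).
    """
    while True:
        size_1 = rng.randint(min_types, len(cell_types))
        size_2 = rng.randint(min_types, len(cell_types))
        subset_1 = set(rng.sample(cell_types, size_1))
        subset_2 = set(rng.sample(cell_types, size_2))
        if subset_1 | subset_2 == set(cell_types):
            return subset_1, subset_2, len(subset_1 & subset_2)


rng = random.Random(0)  # fixed seed so that the simulation is reproducible
subset_1, subset_2, overlap = simulate_partial_overlap(rng)
print(sorted(subset_1), sorted(subset_2), overlap)
```

An overlap level of 0 arises when each batch keeps three cell-types and the two subsets are disjoint; a level of 6 arises when both batches keep all six.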
Analysis of changes in cell-type frequency across multiple conditions
To identify differences in cell-type frequency among the scRNA-seq data from the mild/moderate, severe, and convalescent COVID-19 patients, as well as the healthy and influenza patient controls, we applied a Dirichlet-multinomial regression model. This model accounts for the constraint that the cell frequencies in an scRNA-seq dataset are not independent of each other. In detail, we normalized the regression coefficients to a standard normal distribution and calculated a z score, and then conducted significance testing based on the regression model generated by the DirichReg function in the R package DirichletReg (v0.7).
Differential gene expression analysis and Gene Ontology term enrichment analysis
Differential gene expression analysis was performed on all expressed genes using the rank_genes_groups function with method=“t-test” in the Scanpy package, comparing two given cell-types in a COVID-19 single-cell atlas. A gene was considered differentially expressed when its log2 fold change between the two compared conditions was >1 and the Benjamini-Hochberg adjusted p-value was <0.01. The top 200 highly expressed genes sorted by scores (implemented in Scanpy) of each cell-type were used as the input for GO analysis, and enriched GO terms were acquired for each group of cells from the “GO_Biological_Process_2018” dataset using the Python package gseapy (v0.10.1).
Inflammatory and cytokine score analysis
We defined the inflammatory score and the cytokine score for each cell following Ren et al. 20217, based on the expression of a defined collection of cytokine genes and inflammatory-response-related genes (Supplementary Dataset 3). We then calculated the cytokine and inflammatory scores from the raw gene expression profile using the score_genes function implemented in Scanpy.
Ablation studies using test-variants of SCALEX
To accomplish an accurate generalized encoder, the design of the full SCALEX included the following specific innovations:
an asymmetric autoencoder that inputs batch information only to the decoder (i.e., never to the encoder) (See diagram in Supplementary Fig. 1);
a DSBN layer in the decoder to release the encoder from the burden of capturing the batch-specific variations;
a mini-batching strategy that samples data from all batches simultaneously (rather than single batches iteratively) and thus more tightly follows the same overall distribution of the full input dataset; this strategy includes a Batch Normalization layer in the encoder that adjusts the deviation of each mini-batch and aligns them to the overall input distribution.
We conducted ablation studies to investigate the contributions of each design element of SCALEX. That is, we analyzed the performance (for integration and projection tasks) of the full SCALEX and four SCALEX “test-variants”, each with a distinct network architecture (Supplementary Figs. 19–21). These can be summarized as follows:
Full SCALEX (referred to as “Baseline” in the following)
This Baseline model includes an encoder without batch labeling, sampling from all batches with Batch Normalization, decoder with DSBN, and beta=0.5. All other ablations are compared relative to this.
Encoder with batch label
A test-variant of Baseline in which the encoder also receives the batch label as input. We found that the integration performance of this variant is similar to that of the full SCALEX, showing only a slight reduction in the evaluation scores (Supplementary Fig. 19). However, the addition of batch information at the beginning precludes online integration of newly arriving data. Put another way, this SCALEX test-variant is not capable of integrating single-cell data in a truly online manner.
Decoder without DSBN
A test-variant of Baseline with the DSBN layer removed from the decoder. The DSBN layer combines multiple batch-specific Batch Normalization layers to capture the batch-specific information; this approach has been demonstrated to be effective for domain adaptation, as it provides a weak alignment across different domains. We observed an obvious drop in the integration performance (in terms of all evaluation scores) and a slight drop in the projection performance of this SCALEX test-variant, based on the UMAP embeddings (Supplementary Figs. 19, 20).
Sampling by batch without the Batch Normalization (BN) layer in the encoder
A test-variant of Baseline in which the Batch Normalization layer is removed from the encoder and each mini-batch is sampled from a single batch instead of from all the batches. The integration performance of this test-variant was obviously worse than that of the full SCALEX (Supplementary Fig. 19). The projection performance also dropped obviously, with clear deviations from the common cell-embedding space (Supplementary Fig. 20).
Regular autoencoder
A test-variant that uses a regular autoencoder instead of a VAE framework. This variant performed the worst among all the variants we tested, for both the integration and projection tasks (Supplementary Figs. 19, 20).
Note that we also explored altering the beta factor, setting it to 1 instead of 0.5. The integration performance of the SCALEX test-variant with a beta factor of 1 is worse than that of SCALEX with a beta factor of 0.5 (Supplementary Fig. 21).
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Acknowledgements
We thank Jianbin Wang, Jin Gu and Fuchou Tang for helpful comments and advice. This work is supported by the State Key Research Development Program of China (Grant No. 2019YFA0110002, Q.C.Z.), the National Natural Science Foundation of China (Grants No. 32125007 and 91940306, Q.C.Z.), the Beijing Advanced Innovation Center for Structural Biology, and the Tsinghua-Peking Joint Center for Life Sciences. We thank the Tsinghua University Branch of China National Center for Protein Sciences (Beijing) for computational facility support. This work is also supported by the King Abdullah University of Science and Technology (KAUST) Office of Research Administration (ORA) under Award No. FCC/1/1976-44-01, FCC/1/1976-45-01, URF/1/4352-01-01, and URF/1/4663-01-01 (X.G.).
Author contributions
Q.C.Z. conceived and supervised the project. L.X. designed and implemented the SCALEX model. L.X. and K.T. validated the SCALEX model. L.X., K.T., Y.L., W.N., and X.G. analyzed the results. L.X. and Q.C.Z. wrote the manuscript, with inputs from all the authors.
Peer review
Peer review information
Nature Communications thanks Huachao Huang, Yang Xu, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Data availability
All data analyzed in this study are publicly available; the data sources are detailed in Supplementary Dataset 1. All other relevant data supporting the key findings of this study are available within the article and its Supplementary Information files or from the corresponding author upon reasonable request.
Code availability
SCALEX86 is available at https://github.com/jsxlei/SCALEX. For reproducibility, the scripts for benchmarks and several case studies are also available in the above repository.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally: Lei Xiong, Kang Tian.
Supplementary information
The online version contains supplementary material available at 10.1038/s41467-022-33758-z.
References
- 1. Kolodziejczyk AA, Kim JK, Svensson V, Marioni JC, Teichmann SA. The technology and biology of single-cell RNA sequencing. Mol. Cell. 2015;58:610–620. doi: 10.1016/j.molcel.2015.04.005.
- 2. Papalexi E, Satija R. Single-cell RNA sequencing to explore immune cell heterogeneity. Nat. Rev. Immunol. 2018;18:35–45. doi: 10.1038/nri.2017.76.
- 3. Tanay A, Regev A. Scaling single-cell genomics from phenomenology to mechanism. Nature. 2017;541:331–338. doi: 10.1038/nature21350.
- 4. Satpathy AT, et al. Massively parallel single-cell chromatin landscapes of human immune cell development and intratumoral T cell exhaustion. Nat. Biotechnol. 2019;37:925–936. doi: 10.1038/s41587-019-0206-z.
- 5. Leek JT, et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet. 2010;11:733–739. doi: 10.1038/nrg2825.
- 6. Hicks SC, Townes FW, Teng M, Irizarry RA. Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics. 2018;19:562–578. doi: 10.1093/biostatistics/kxx053.
- 7. Ren X, et al. COVID-19 immune features revealed by a large-scale single-cell transcriptome atlas. Cell. 2021. doi: 10.1016/j.cell.2021.01.053.
- 8. Cao, J. et al. A human cell atlas of fetal gene expression. Science 370, doi: 10.1126/science.aba7721 (2020).
- 9. Azizi E, et al. Single-cell map of diverse immune phenotypes in the breast tumor microenvironment. Cell. 2018;174:1293–1308.e1236. doi: 10.1016/j.cell.2018.05.060.
- 10. Li H, et al. Dysfunctional CD8 T cells form a proliferative, dynamically regulated compartment within human melanoma. Cell. 2019;176:775–789.e718. doi: 10.1016/j.cell.2018.11.043.
- 11. Regev, A. et al. The Human Cell Atlas. eLife 6, doi: 10.7554/eLife.27041 (2017).
- 12. Tabula Muris Consortium, et al. Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. Nature. 2018;562:367–372. doi: 10.1038/s41586-018-0590-4.
- 13. Tabula Muris Consortium. A single-cell transcriptomic atlas characterizes ageing tissues in the mouse. Nature. 2020;583:590–595. doi: 10.1038/s41586-020-2496-1.
- 14. Han X, et al. Mapping the Mouse Cell Atlas by Microwell-Seq. Cell. 2018;172:1091–1107.e1017. doi: 10.1016/j.cell.2018.02.001.
- 15. Han X, et al. Construction of a human cell landscape at single-cell level. Nature. 2020;581:303–309. doi: 10.1038/s41586-020-2157-4.
- 16. Butler A, Hoffman P, Smibert P, Papalexi E, Satija R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 2018;36:411–420. doi: 10.1038/nbt.4096.
- 17. Stuart T, et al. Comprehensive Integration of Single-Cell Data. Cell. 2019;177:1888–1902.e1821. doi: 10.1016/j.cell.2019.05.031.
- 18. Hao, Y. et al. Integrated analysis of multimodal single-cell data. Cell, doi: 10.1016/j.cell.2021.04.048 (2021).
- 19. Haghverdi L, Lun ATL, Morgan MD, Marioni JC. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 2018;36:421–427. doi: 10.1038/nbt.4091.
- 20. Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods, doi: 10.1038/s41592-019-0619-0 (2019).
- 21. Barkas N, et al. Joint analysis of heterogeneous single-cell RNA-seq dataset collections. Nat. Methods. 2019;16:695–698. doi: 10.1038/s41592-019-0466-z.
- 22. Hie B, Bryson B, Berger B. Efficient integration of heterogeneous single-cell transcriptomes using Scanorama. Nat. Biotechnol. 2019. doi: 10.1038/s41587-019-0113-3.
- 23. Polanski K, et al. BBKNN: fast batch alignment of single cell transcriptomes. Bioinformatics. 2020;36:964–965. doi: 10.1093/bioinformatics/btz625.
- 24. Gao, C. et al. Iterative Refinement of Cellular Identity from Single-Cell Data Using Online Learning. bioRxiv, 2020.01.16.909861, doi: 10.1101/2020.01.16.909861 (2020).
- 25. Welch, J. D. et al. Single-Cell Multi-omic Integration Compares and Contrasts Features of Brain Cell Identity. Cell 177, 1873–1887.e1817 (2019).
- 26. Gayoso, A. et al. scvi-tools: a library for deep probabilistic analysis of single-cell omics data. bioRxiv, 2021.04.28.441833, doi: 10.1101/2021.04.28.441833 (2021).
- 27. Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
- 28. Lotfollahi, M. et al. Query to reference single-cell integration with transfer learning. bioRxiv, 2020.07.16.205997, doi: 10.1101/2020.07.16.205997 (2020).
- 29. Kingma, D. P. & Welling, M. Auto-encoding variational bayes. arXiv:1312.6114 (2013).
- 30. Xiong L, et al. SCALE method for single-cell ATAC-seq analysis via latent feature extraction. Nat. Commun. 2019;10:4576. doi: 10.1038/s41467-019-12630-7.
- 31. Chang, W.-G., You, T., Seo, S., Kwak, S. & Han, B. Domain-specific batch normalization for unsupervised domain adaptation. arXiv:1906.03950 (2019).
- 32. Ioffe, S. & Szegedy, C. Batch Normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167 (2015).
- 33. Luecken, M. D. et al. Benchmarking atlas-level data integration in single-cell genomics. bioRxiv, 2020.05.22.111161, doi: 10.1101/2020.05.22.111161 (2020).
- 34. Lawlor N, et al. Single-cell transcriptomes identify human islet cell signatures and reveal cell-type-specific expression changes in type 2 diabetes. Genome Res. 2017;27:208–222. doi: 10.1101/gr.212720.116.
- 35. Segerstolpe A, et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metab. 2016;24:593–607. doi: 10.1016/j.cmet.2016.08.020.
- 36. Muraro MJ, et al. A single-cell transcriptome atlas of the human pancreas. Cell Syst. 2016;3:385–394.e383. doi: 10.1016/j.cels.2016.09.002.
- 37. Grun D, et al. De novo prediction of stem cell identity using single-cell transcriptome data. Cell Stem Cell. 2016;19:266–277. doi: 10.1016/j.stem.2016.05.010.
- 38. Baron M, et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Syst. 2016;3:346–360.e344. doi: 10.1016/j.cels.2016.08.011.
- 39. Litvinukova M, et al. Cells of the adult human heart. Nature. 2020;588:466–472. doi: 10.1038/s41586-020-2797-4.
- 40. Aizarani N, et al. A human liver cell atlas reveals heterogeneity and epithelial progenitors. Nature. 2019;572:199–204. doi: 10.1038/s41586-019-1373-2.
- 41. MacParland SA, et al. Single cell RNA sequencing of human liver reveals distinct intrahepatic macrophage populations. Nat. Commun. 2018;9:4383. doi: 10.1038/s41467-018-06318-7.
- 42. Lambrechts D, et al. Phenotype molding of stromal cells in the lung tumor microenvironment. Nat. Med. 2018;24:1277–1289. doi: 10.1038/s41591-018-0096-5.
- 43. Song Q, et al. Dissecting intratumoral myeloid cell plasticity by single cell RNA-seq. Cancer Med. 2019;8:3072–3085. doi: 10.1002/cam4.2113.
- 44. Zilionis R, et al. Single-Cell Transcriptomics of Human and Mouse Lung Cancers Reveals Conserved Myeloid Populations across Individuals and Species. Immunity. 2019;50:1317–1334.e1310. doi: 10.1016/j.immuni.2019.03.009.
- 45. Kim N, et al. Single-cell RNA sequencing demonstrates the molecular and cellular reprogramming of metastatic lung adenocarcinoma. Nat. Commun. 2020;11:2285. doi: 10.1038/s41467-020-16164-1.
- 46. McInnes, L., Healy, J. & Melville, J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426 (2018).
- 47. Hubert L, Arabie P. Comparing partitions. J. Classification. 1985;2:193–218. doi: 10.1007/BF01908075.
- 48. Amelio, A. & Pizzuti, C. in Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015 1584–1585 (Association for Computing Machinery, Paris, France, 2015).
- 49. Rousseeuw PJ. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Computational Appl. Math. 1987;20:53–65. doi: 10.1016/0377-0427(87)90125-7.
- 50. Luecken MD, et al. Benchmarking atlas-level data integration in single-cell genomics. Nat. Methods. 2022;19:41–50. doi: 10.1038/s41592-021-01336-8.
- 51. Cusanovich DA, et al. Multiplex single cell profiling of chromatin accessibility by combinatorial cellular indexing. Science. 2015;348:910–914. doi: 10.1126/science.aab1601.
- 52. Buenrostro JD, et al. Single-cell chromatin accessibility reveals principles of regulatory variation. Nature. 2015;523:486–490. doi: 10.1038/nature14590.
- 53. Stoeckius M, et al. Simultaneous epitope and transcriptome measurement in single cells. Nat. Methods. 2017;14:865–868. doi: 10.1038/nmeth.4380.
- 54. Fang R, et al. Comprehensive analysis of single cell ATAC-seq data with SnapATAC. Nat. Commun. 2021;12:1337. doi: 10.1038/s41467-021-21583-9.
- 55. Chen KH, Boettiger AN, Moffitt JR, Wang S, Zhuang X. RNA imaging. Spatially resolved, highly multiplexed RNA profiling in single cells. Science. 2015;348:aaa6090. doi: 10.1126/science.aaa6090.
- 56. 10x Genomics. 10k Peripheral blood mononuclear cells (PBMCs) from a healthy donor, Single Cell ATAC Dataset by Cell Ranger 1.0.1. (2018).
- 57. 10x Genomics. 10k PBMCs from a Healthy Donor (v3 chemistry), Single Cell Gene Expression Dataset by Cell Ranger 3.0.0. (2018).
- 58. Lin, Y. et al. scJoint integrates atlas-scale single-cell RNA-seq and ATAC-seq data with transfer learning. Nat. Biotechnol., doi: 10.1038/s41587-021-01161-6 (2022).
- 59. Dou, J. et al. Unbiased integration of single cell multi omics data. bioRxiv, doi: 10.1101/2020.12.11.422014 (2020).
- 60. Wang YJ, et al. Single-cell transcriptomics of the human endocrine pancreas. Diabetes. 2016;65:3028–3038. doi: 10.2337/db16-0405.
- 61. Enge M, et al. Single-cell analysis of human pancreas reveals transcriptional signatures of aging and somatic mutation patterns. Cell. 2017;171:321–330.e314. doi: 10.1016/j.cell.2017.09.004.
- 62. Xin Y, et al. Pseudotime ordering of single human beta-cells reveals states of insulin production and unfolded protein response. Diabetes. 2018;67:1783–1794. doi: 10.2337/db18-0365.
- 63. Tirosh I, et al. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science. 2016;352:189–196. doi: 10.1126/science.aad0501.
- 64. Plasschaert LW, et al. A single-cell atlas of the airway epithelium reveals the CFTR-rich pulmonary ionocyte. Nature. 2018;560:377–381. doi: 10.1038/s41586-018-0394-6.
- 65. He S, et al. Single-cell transcriptome profiling of an adult human cell atlas of 15 major organs. Genome Biol. 2020;21:294. doi: 10.1186/s13059-020-02210-0.
- 66. Kimmel JC, et al. Murine single-cell RNA-seq reveals cell-identity- and tissue-specific trajectories of aging. Genome Res. 2019;29:2088–2103. doi: 10.1101/gr.253880.119.
- 67. Sole-Boldo L, et al. Single-cell transcriptomes of the human skin reveal age-related loss of fibroblast priming. Commun. Biol. 2020;3:188. doi: 10.1038/s42003-020-0922-4.
- 68. He H, et al. Single-cell transcriptome analysis of human skin identifies novel fibroblast subpopulation and enrichment of immune subsets in atopic dermatitis. J. Allergy Clin. Immunol. 2020;145:1615–1628. doi: 10.1016/j.jaci.2020.01.042.
- 69. Schulte-Schrepping J, et al. Severe COVID-19 Is Marked by a Dysregulated Myeloid Cell Compartment. Cell. 2020;182:1419–1440.e1423. doi: 10.1016/j.cell.2020.08.001.
- 70. Lee, J. S. et al. Immunophenotyping of COVID-19 and influenza highlights the role of type I interferons in development of severe COVID-19. Sci. Immunol. 5, doi: 10.1126/sciimmunol.abd1554 (2020).
- 71. Wilk AJ, et al. A single-cell atlas of the peripheral immune response in patients with severe COVID-19. Nat. Med. 2020;26:1070–1076. doi: 10.1038/s41591-020-0944-y.
- 72. Guo C, et al. Single-cell analysis of two severe COVID-19 patients reveals a monocyte-associated and tocilizumab-responding cytokine storm. Nat. Commun. 2020;11:3924. doi: 10.1038/s41467-020-17834-w.
- 73. Yao C, et al. Cell-type-specific immune dysregulation in severely ill COVID-19 patients. Cell Rep. 2021;34:108590. doi: 10.1016/j.celrep.2020.108590.
- 74. Zhang JY, et al. Single-cell landscape of immunological responses in patients with COVID-19. Nat. Immunol. 2020;21:1107–1118. doi: 10.1038/s41590-020-0762-x.
- 75. Ballestar, E. et al. Single cell profiling of COVID-19 patients: an international data resource from multiple tissues. medRxiv, 2020.11.20.20227355, doi: 10.1101/2020.11.20.20227355 (2020).
- 76. Bernardes JP, et al. Longitudinal Multi-omics Analyses Identify Responses of Megakaryocytes, Erythroid Cells, and Plasmablasts as Hallmarks of Severe COVID-19. Immunity. 2020;53:1296–1314.e1299. doi: 10.1016/j.immuni.2020.11.017.
- 77. Chen G, et al. Clinical and immunological features of severe and moderate coronavirus disease 2019. J. Clin. Invest. 2020;130:2620–2629. doi: 10.1172/JCI137244.
- 78. Kingma, D. P. & Ba, J. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 (2014).
- 79. Danese, A., Richter, M. L., Fischer, D. S., Theis, F. J. & Colomé-Tatché, M. EpiScanpy: integrated single-cell epigenomic analysis. bioRxiv, doi: 10.1101/648097 (2019).
- 80. Wolf FA, Angerer P, Theis FJ. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 2018;19:15. doi: 10.1186/s13059-017-1382-0.
- 81. Stuart, T., Srivastava, A., Lareau, C. & Satija, R. Multimodal single-cell chromatin analysis with Signac. bioRxiv, doi: 10.1101/2020.11.09.373613 (2020).
- 82. Maag JLV. gganatogram: An R package for modular visualisation of anatograms and tissues based on ggplot2. F1000Res. 2018;7:1576. doi: 10.12688/f1000research.16409.1.
- 83. Petryszak R, et al. Expression Atlas update–an integrated database of gene and protein expression in humans, animals and plants. Nucleic Acids Res. 2016;44:D746–D752. doi: 10.1093/nar/gkv1045.
- 84. Abdi H, Williams LJ. Principal component analysis. WIREs Comput. Stat. 2010;2:433–459. doi: 10.1002/wics.101.
- 85. Traag VA, Waltman L, van Eck NJ. From Louvain to Leiden: guaranteeing well-connected communities. Sci. Rep. 2019;9:5233. doi: 10.1038/s41598-019-41695-z.
- 86. Xiong, L., Tian, K. & Li, Y. SCALEX (v1.0.0). Zenodo, doi: 10.5281/zenodo.7032217 (2022).