Batch-Corrected Distance Mitigates Temporal and Spatial Variability for Clustering and Visualization of Single-Cell Gene Expression Data

Shaoheng Liang; Jinzhuang Dou; Ramiz Iqbal; Ken Chen

doi:10.21203/rs.3.rs-3134332/v1

This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2023 Jul 26:rs.3.rs-3134332. [Version 1] doi: 10.21203/rs.3.rs-3134332/v1

Batch-Corrected Distance Mitigates Temporal and Spatial Variability for Clustering and Visualization of Single-Cell Gene Expression Data

Shaoheng Liang ^1,^2,^3,^*, Jinzhuang Dou ¹, Ramiz Iqbal ¹, Ken Chen ¹

PMCID: PMC10402204 PMID: 37547002

Abstract

Clustering and visualization are essential parts of single-cell gene expression data analysis. The Euclidean distance used in most distance-based methods is not optimal. The batch effect, i.e., the variability among samples gathered from different times, tissues, and patients, introduces large between-group distance and obscures the true identities of cells. To solve this problem, we introduce Batch-Corrected Distance (BCD), a metric using temporal/spatial locality of the batch effect to control for such factors. We validate BCD on simulated data as well as applied it to a mouse retina development dataset and a lung dataset. We also found the utility of our approach in understanding the progression of the Coronavirus Disease 2019 (COVID-19). BCD achieves more accurate clusters and better visualizations than state-of-the-art batch correction methods on longitudinal datasets. BCD can be directly integrated with most clustering and visualization methods to enable more scientific findings.

1. Introduction

Gene expression reflects the identity of a cell. Single-cell RNA sequencing (scRNA-seq) technologies profile thousands of cells simultaneously [1], enabling trajectory inference to reveal the course of cell development and transformation [2]. Although a large amount of data have been gathered from different tissues among large cohorts of patients [3], technical variances among separately assayed samples often overshadow the similarity of cells, resulting in disconnected trajectories (Figure 1a) that hinders discovery of underlying biological processes. This phenomenon, often called the batch effect, complicates the single-cell sequencing data analysis. For samples collected from the same condition (i.e., same time and tissue but different participants), differences among samples are usually considered batch effects and get corrected, but over-correction is suspected in some cases [4]. For longitudinal data, where true biological changes and nuisance factors are entwined (Figure 1c), the distinction between biological and batch effects becomes more elusive.

Figure 1: — Illustration of how BCD handles batch effect differently.

As a practical example, researchers have collected data from the developing retina of 13 mice at different ages [5]. Ideally, the embedding of such data should reveal a fluent trajectory of how pluripotent stem cells differentiate/evolve into multipotent stem cells, and finally to specific cell types. However, as a mouse does not survive the tissue extraction, each sample in the dataset is from a unique mouse. The batch effect exists among samples collected from different mice. Because there is no bijective (i.e., injective and surjective) mapping for cells from different samples, existing methods addressing batch effect in longitudinal data assume that the features are measured on the same set of entities (cells) at different time points [6] are not applicable to single-cell data.

To address this issue, multiple methods have been published [7]. For example, Seurat utilizes mutual nearest neighbor (MNN) to identify similar clusters in different batches and integrates those batches by removing the differences [8]. Harmony also integrates proximal clusters, but in an iterative way through soft clustering [9]. Liger uses nonnegative matrix factorization (NMF) to separate common and sample-specific features [10]. A neural network approach, scVI, combines variational autoencoder and zero-inflated model to visualize and correct the data. Notably, Harmony generates integrated clusters and visualization without giving a corrected expression profile. It is not deemed a significant drawback, however, because corrected datasets are often hard to interpret. Researchers usually only cluster and visualize the data using such methods, and recur to statistical tests that control for the batch effect in the downstream analyses [11]. These methods have been utilized in a few large-cohort studies and satisfyingly removed the batch effect among samples [8, 9]. However, none of them are designed for longitudinal data.

In an effort to address this issue, we noticed the “locality” among the single-cell samples. Specifically, samples gathered at closer time points are expected to be more biologically alike. If we decompose the difference between two samples to batch effect and biological effect. The former is uniform between all samples, but the latter increases with the distance between samples. Thus, the relative strength of the effects decreases over distance. In other words, the observed difference between two adjacent time points is more likely to be the batch effect, than that of two distant time points.

Here, we define a distance metric, termed Batch-Corrected Distance (BCD) which exploits the locality to precisely remove the batch effect (Figure 1b) but keep biologically meaningful information that forms the trajectory. As a distance metric, it is naturally compatible with all state-of-the-art distance-based clustering and embedding methods. We compare BCD with Euclidean distance, Seurat integration, and Harmony on simulated data and mouse retina development data, and show applications of BCD on COVID-19 patient data and human fetal lung development data. The results clearly show the benefit of BCD to biology studies.

2. Results

2.1. The Batch-Corrected Distance

Across single-cell samples, genes (and their combinations) are differentially affected by the batch effect. Traditional clustering and trajectory inference methods based on metrics that assign equal weights to all genes, such as the Euclidean distance, fail to address this discrepancy in data. We introduce Batch-Corrected Distance (BCD), where a weight matrix (covariance matrix) for genes is used to offset the batch effect. The matrix is directly inferred from the gene expression data and time labels or spatial coordinates (Methods), based on the temporal/spatial locality intuition: the batch effect is uniform across samples, while the biological effect increases over the distance of samples.

To validate BCD, we simulated a dataset with seven samples. We assume that all cells are differentiated from an initial state, stem cells named S. S differentiates to two multipotent cell types A and B, which differentiate to terminal types A1, A2 and B1, B2, respectively (Figure 2a). We simulated 200 genes, and assigned a unique ideal gene expression profile for each cell type, denoted as $s, a, b, a_{1}, a_{2}, b_{1}$ and $b_{2}$ . Because the development of the cell types is gradual [12], we used weighted mean to represent the process in the simulated samples. For example, to simulate intermediate states between A and A1, we used $(θ a + (1 - θ) a_{1})$ as its profile, where $θ$ is set to be uniformly distributed in a range corresponds to the developing stage of a sample (Table 1). The ambiguous cell types ( $θ$ close to 0.5) are denoted as “A->A1”, while cells more similar to “A” or “A1” ( $θ$ close to 0 or 1) are labeled as them each.

Figure 2: — Results for the simulated data. Left panels are colored by samples. Right panels are colored by cell types. Each row represents a method.

Table 1:

Simulated dataset

		Composition

Day	Samples	S	S->A S->B	A B	A->A1 A->A2 B->B1 B->B2	A1 A2 B1 B2

1	1	.8	.2
2	2		.5	.5
3	2			.5	.5
4	2				.5	.5

Open in a new tab

We randomly selected a set of genes to be “susceptible to the batch effect”, and added different additive components on those genes in different samples. For every single cell, we further added random noise into the ideal profile and used a Poisson distribution to generate the gene expression. The number of samples at each time point, and compositions of samples are shown in Table 1. It can be seen that the cell types gradually evolve over time.

We used Seurat to process the data with BCD, Euclidean distance, Seurat integration, and Harmony. In the process, 150 highly variable genes are selected, and 30 PCs are used. The result of BCD are shown in Figure 2b and 2c. Given the prior knowledge that all cell types originate from S, it can be clearly seen that it branches to A and B, and further to A1, A2, and B1, B2, respectively. Small batch effect can still be seen within A1, A2, B1, and B2, but are largely mitigated. The samples are also well organized by their time. Day 1 is at the center, day 4 is separated to A1 and A2 at the top-left and B1 and B2 at the bottom-right corner, while day 2–3 corresponds to the trajectory of which the cells differentiate.

In contrast, Euclidean distance, the baseline, does not correct for any batch effect (Figure 2d and 2e). Consequently, the terminal types A1, A2, B1, and B2 each splits into two groups, corresponds to different samples. It also fails to illustrate the evolution of the cell types. Seurat integration shows high capacity of removing batch effects, as the cells from multiple days are mixed together (Figure 2f). However, it is an over-correction. For example, S, S->A and A are mixed together, and so do A->A1 and A1, and A->A2 and A2. The same applies to the branch of B. It also leads to incorrect trajectory inferences, as A is now closer to A than A->A1. Overall, it makes it harder to delineate the cell differentiation (Figure 2g). The result of Harmony is slightly better (Figure 2h and 2c). The two batches are fairly mixed for A1, A2, B1, and B2. The trajectory of S->A to A, then to A->A1 and A->A2, and finally to A1 and A2 can be seen, although the B->B2 and B2 are misplaced. Besides, the branch of A and branch of B are still in disconnected clusters, leaving their relationship with S undiscovered. Overall, BCD performs the best in the four methods. For time consumption, BCD uses less than 1 second, while Harmony uses 9 seconds and Seurat integration uses 20 seconds.

2.2. Mouse retina development dataset

We applied BCD to a dataset including 13 single-cell specimens collected from developing retina of one sample at E12 (day 12 embryo), two at E14, one at E16, two at E18, one at P0 (postnatal day 0), two at P2, one at P5, two at P8, and one at P14 [5]. A total of 110,359 cells are collected and process with Seurat, where 2,000 highly variable genes are selected, and 30 PCs are used.

The development of mouse retina is well-understood by the field. Briefly, retinal progenitor cells (RPCs, including early RPCs and late RPCs) differentiate into Neurogenic cells, which further differentiate into photoreceptor precursors, Amacrine cells, horizontal cells, and retinal ganglion cells [13]. The photoreceptor precursors differentiate into cone cells, rod cells, and bipolar cells [14]. These cell types are all neural cells. Late RPCs also differentiate to Müller glia [13]. These cells form a neural network where each terminal cell type has a unique function. Cone cells and rod cells forming the input layer are photoreceptors that work in light and dark environment, respectively. The signal from cone cells and rod cells propagate through the bipolar cells first and then retinal ganglion cells to go to the brain. Horizontal cells provides horizontal connections between photoreceptors and bipolar cells, while amacrine cells perform similar function between bipolar cells and retinal ganglion cells. These cells form two hidden layers. Müller glia is an auxiliary cell type that supports the aforementioned neural cells.

The result of BCD is shown in Figure 3b and 3c. There is a clear trajectory of the cells gradually evolving from E11 to P14. The aforementioned evolution trajectory of the cell types can also be seen along the trajectory. The Euclidean distance also performs relatively well on this dataset. However, clear batch effect can be seen in Figure 3d, where cells are clustered by samples. As a result, a cell type is divided into multiple clusters in Figure 3d, making it harder to delineate the differentiation trajectory of the cell types. Both Seurat integration and Harmony over-corrects the batch effect. Although the samples are well mixed (Figure 3f and 3h), many cell types are mixed. For example, early RPC, late RPC, and Müller glia are not distinguishable from the result, and so do horizontal cells, amacrine cells, bipolar cells, and retinal ganglion cells (Figure 3g and 3i). Even worse, the bipolar cells, which should be differentiated from photoreceptor precursors, are misplaced to directly under neurogenic cells. These mistakes are because these methods do not utilize the temporal locality of the samples. For this dataset, BCD uses less than 2 seconds, while Harmony uses 5 minutes. Seurat integration costs more than 3 hours.

2.3. COVID-19 immune compartment datasets

The global pandemic COVID-19 has reportedly infected 6.3 million people worldwide, with the death toll at 380 thousand. Understanding the immune response to the virus is essential to developing treatments. We applied BCD to a recently published dataset including immune compartment samples collected from 13 participants [15] including 4 health controls (HC1–4), 3 moderate (M1–3) cases, and 6 severe (S1–6) cases. Although these are all different participants, we consider HC, M, and S a time course to reflect the progression of the disease. We used Seurat and BCD to process the data. During the process, 2,000 highly variable genes are selected, and 30 PCs are used.

The result is shown in figure 4. The largest group is macrophages, which recognize and destroy virus-infected cells. The result shows from top to bottom a gradual change from those in health controls, to moderate cases, and to severe cases. In contrast, Euclidean distance yields disconnected groups of macrophages because of the batch effect, while Seurat integration and Harmony overcorrect the effect and confuse cells from moderate cases with those from health controls (Supplementary Material). The changes in gene expression that drive the differences can be attained by differential expression analysis. Similar trajectories can also be seen on T cells and plasma cells, which kill infected cells and produce antibodies, respectively. Using these pieces of information, researchers can identify the most effective form of immune cells and find ways to transform others into it to treat the disease.

2.4. Human Lung datasets

Besides studies of the immune compartment, knowledge of lung development may also help cure the disease and restore the functionality of the lung [16]. Miller et al. has produced a single-cell dataset of cells from [17]. Cells from fetal human lungs are collected at week 11.5 (W11.5), W15, W18, and W21, and are available for trachea, small airways in the lung, and the distal tip of the lung.

We explored the dataset using BCD. Because both temporal and spatial information of the samples is available, we use a vector [week, location] to label each sample, where week is the number of weeks mentioned above, and location is set to be 0, 2, and 4 for trachea, small airways, and distal lung, respectively. The result is shown in Figure 5. A branching trajectory can be seen from W11.5 to W18 for distal lung and small airways mesenchymal cells. The cells from the two locations are similar at the early stage (W11.5), but become more distinct when they are more developed (W15 and W18). In contrast, affected by the batch effect, Euclidean distance and Harmony show W15 and W18 small airway mesenchymal cells as isolated clusters, with no connection with W11.5, while Seurat mixes all mesenchymal cells across the two locations and all times points, blurring the trajectory (Supplementary Material). Similar trajectories also show for Epithelial cells, endothelial cells, and pericytes. Researchers may use these pieces of information to further study the changes in gene expression in the cell type developments and develop treatments.

Figure 5: — Results for the human fetal lung development dataset. Arrows are added for visual reference.

3. Discussion

In order to build a reliable trajectory of cell type development from a longitudinal dataset, the batch effect should be corrected for. This is a new problem as state-of-the-art batch effect correction methods cannot utilize the time/spacial information, and thus result in over-correction. Our method, BCD, utilizes such information to accurately identify and remove the batch effect, while preserving the correct structure of the data. The results based on the BCD clearly show the evolution of the cell types through time. The discovered trajectory helps researchers make hypotheses of the physiological changes during development or pathological changes as a result of disease infection. The changes can be ascertained by differential gene expression analysis.

BCD assumes that the batch effect is shared by all the samples. This is a reasonable assumption because many kinds of batch effects have a biological basis. For example, Cellular stress response stimulated by the sample preparation affects certain genes. In the case the set of genes do change, the local linear approximation may be used. Because the sums and products of distances are guaranteed to be valid distances, multiple BCDs each correcting for the batch effect in a specific set of samples can be combined as a consensus distance.

BCD is intrinsically a linear transformation, which may be insufficient for more complex batch effects including interactions of genes. Nevertheless, it is a proof of concept that the time/spatial locality should be considered in batch correction for longitudinal datasets, which are on the rise. Nonlinear methods may be invented based on the same concept. For example, kernelization can be a direct extension to BCD. Besides, BCD is very fast to run and can be used in the exploration of the dataset with negligible cost.

Although all the experiments are from a biology background, the scope of BCD is not confined to it. It can be applied to any longitudinal/spatial dataset affected by batch effects where the temporal/spatial locality holds. BCD also illustrates that batch the effect correction problem is related to the alternative clustering problems. Over the past few years, many advanced alternative clustering models have been introduced, and translate them to this context may result in better performance.

In summary, we defined a novel pairwise distance of the cells, namely Batch-Corrected Distance (BCD), where the effect of the unwanted clustering is controlled. Results show our method achieves more accurate clusters and better visualizations than state-of-the-art methods on longitudinal datasets. The BCD can be directly integrated into most clustering and visualization methods to enable more scientific findings.

4. Methods

4.1. The Batch-Corrected Distance

Trajectory inference, in general, aims to find a graph $G = (V, E)$ which reflects the hop-by-hop gradual change $(E)$ of cells $(V_{i}, V_{j}, \dots \in V)$ by optimizing

\min \sum_{(V_{i}, V_{j}) \in E} ‖ x_{i} - x_{j} ‖

(1)

subjecting to a set of constraints [2, 18]. Here, $x_{i}$ is the profile of cell $i$ , which can be the whole gene expression, or the first few principal components (PCs). Eulidean distance is the most widely used metric, while the Mahalanobis distance

{∥x_{i} - x_{j}∥}_{Σ^{- 1}} = \sqrt{{(x_{i} - x_{j})}^{⊤} Σ^{- 1} (x_{i} - x_{j})},

(2)

where $Σ$ is the covariance matrix calculated from all the samples, may be used to account for different (co)varainces among features. To account for the batch effect, we propose to redefine the $Σ$ as

\tilde{Σ} = \sum_{i = 1}^{n} \sum_{j : i \notin C_{j}} W_{i j} (x_{i} - m_{j}) {(x_{i} - m_{j})}^{⊤},

(3)

where $C_{j} = {i ∣ c e l l i i s i n b a t c h j}$ information and $m_{j}$ is the centroid of each batch. If weight $W_{i j} \equiv 1$ , it degrades to the metric defined by Qi and Davidson [19] for generating an alternative clustering different from the original $C$ . In essence, it removes the variances across the batches, while retain the variance within each batch. For a longitudinal dataset, each sample is (and all the cells in the sample are) collected from a time point, denoted as $t_{j}$ (and $τ_{i}$ ). To utilize the temporal/longitudinal locality, we set $W_{i j}$ to be

W_{i j} = \exp (- \frac{{∥ τ_{i} - t_{j} ∥}^{2}}{2 l^{2}}),

(4)

where $l$ (set to 1 in our experiments) is the length scale within which two samples are considered temporally/spatially close. The covariance of proximal time points are weighted more, and thus are suppressed in the refined distance. When the dataset contains both temporal and spatial labels, $τ_{i}$ and $t_{j}$ can be vectors that include both labels. Inversed Cholesky-decomposed $\tilde{Σ}$ can be used to transform the data. If first k PCs are used, the computational complexity is $O (n k^{2} + k^{3})$ .

4.2. Gene expression data processing

We use Seurat [20], an R package, to analyze the gene expression data. The package provides functionalities to normalize data, find highly variable features (i.e., genes) by variance stabilizing transformation, scale the features, perform principal component analysis (PCA), and visualize the result with UMAP (uniform manifold approximation and projection). This is the de facto standard single-cell data analysis protocol. The normalization step, in specific, normalizes the summation of gene expression in each cell to be one. The scaling step standardizes each gene so that the average expression over all cells is zero, and the standard deviation is one. UMAP is a nonlinear embedding method visualize data by their distance [21].

Also provided in Seurat is a data integration method that corrects batch effect [8]. It first projects samples into a common subspace using canonical correlation analysis (CCA), and then finds MNNs in the CCA subspace as “anchors” to correct the data. We refer to it as Seurat integration (not to be confused with the entire Seurat protocol). Harmony first projects the data into a lower-dimensional PCA space, and then iteratively removes batch effects. At each iteration, it clusters cells while maximizing the diversity of batches within each cluster and calculates a correction factor for each cell to remove the batch effect. Tran et al. [7] show by systematic assessments that both methods are state-of-the-art. Thus, we compare BCD with them. To ensure good comparability, we implemented BCD with an interface to Seurat. This choice also makes BCD easy to use for biology researchers familiar with Seurat.

7. Acknowledgements

This work was supported in part by grant number 2018-182735 to KC, Human Breast Cell Atlas Seed Network Grant (HCA3-0000000147) to KC from the Chan Zuckerberg Initiative DAF, an advised fund of Silicon Valley Community Foundation, grant RP180248 to KC from Cancer Prevention & Research Institute of Texas, The University of Texas MD Anderson Cancer Center Pre-Cancer Atlas Project to KC, and The University of Texas MD Anderson Cancer Center Colorectal Cancer Moonshot and P30 CA016672 (US National Institutes of Health/National Cancer Institute) to the University of Texas Anderson Cancer Center Core Support Grant.

Footnotes

⁹

Competing Interests

The authors declare no competing interests.

⁵

Code Availability

The software package is available at https://github.com/KChen-lab/bcd. The scripts to generate all results are included in the repository.

6. Data Availability

All datasets used in this work are public data. They are available in public repositories, including the mouse retina (GSE118614), COVID-19 (GSE145926), and human lung (E-MTAB-8221).

References

[1].Lim B., Lin Y., and Navin N., “Advancing cancer research and medicine with single-cell genomics,” Cancer Cell, vol. 37, no. 4, pp. 456–470, 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
[2].Saelens W., Cannoodt R., Todorov H., and Saeys Y., “A comparison of single-cell trajectory inference methods,” Nature biotechnology, vol. 37, no. 5, pp. 547–554, 2019. [DOI] [PubMed] [Google Scholar]
[3].Regev A., Teichmann S. A., Lander E. S., Amit I., Benoist C., Birney E., Bodenmiller B., Campbell P., Carninci P., Clatworthy M., et al. , “Science forum: the human cell atlas,” Elife, vol. 6, p. e27041, 2017. [DOI] [PMC free article] [PubMed] [Google Scholar]
[4].Nygaard V., Rødland E. A., and Hovig E., “Methods that remove batch effects while retaining group differences may lead to exaggerated confidence in downstream analyses,” Biostatistics, vol. 17, no. 1, pp. 29–39, 2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
[5].Clark B. S., Stein-O’Brien G. L., Shiau F., Cannon G. H., Davis-Marcisak E., Sherman T., Santiago C. P., Hoang T. V., Rajaii F., James-Esposito R. E., et al. , “Single-cell rna-seq analysis of retinal development identifies nfi factors as regulating mitotic exit and late-born cell specification,” Neuron, vol. 102, no. 6, pp. 1111–1126, 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
[6].Müller C., Schillert A., Röthemeier C., Trégouët D.-A., Proust C., Binder H., Pfeiffer N., Beutel M., Lackner K. J., Schnabel R. B., et al. , “Removing batch effects from longitudinal gene expression-quantile normalization plus combat as best approach for microarray transcriptome data,” PloS one, vol. 11, no. 6, 2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
[7].Tran H. T. N., Ang K. S., Chevrier M., Zhang X., Lee N. Y. S., Goh M., and Chen J., “A benchmark of batch-effect correction methods for single-cell rna sequencing data,” Genome Biology, vol. 21, no. 1, pp. 1–32, 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
[8].Stuart T., Butler A., Hoffman P., Hafemeister C., Papalexi E., Mauck III W. M., Hao Y., Stoeckius M., Smibert P., and Satija R., “Comprehensive integration of single-cell data,” Cell, vol. 177, no. 7, pp. 1888–1902, 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
[9].Korsunsky I., Millard N., Fan J., Slowikowski K., Zhang F., Wei K., Baglaenko Y., Brenner M., Loh P.-r., and Raychaudhuri S., “Fast, sensitive and accurate integration of single-cell data with harmony,” Nature methods, pp. 1–8, 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
[10].Welch J. D., Kozareva V., Ferreira A., Vanderburg C., Martin C., and Macosko E. Z., “Single-cell multi-omic integration compares and contrasts features of brain cell identity,” Cell, vol. 177, no. 7, pp. 1873–1887, 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
[11].Finak G., McDavid A., Yajima M., Deng J., Gersuk V., Shalek A. K., Slichter C. K., Miller H. W., McElrath M. J., Prlic M., et al. , “Mast: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell rna sequencing data,” Genome biology, vol. 16, no. 1, p. 278, 2015. [DOI] [PMC free article] [PubMed] [Google Scholar]
[12].Schiebinger G., Shu J., Tabaka M., Cleary B., Subramanian V., Solomon A., Gould J., Liu S., Lin S., Berube P., et al. , “Optimal-transport analysis of single-cell gene expression identifies developmental trajectories in reprogramming,” Cell, vol. 176, no. 4, pp. 928–943, 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
[13].Stenkamp D. L., “Development of the vertebrate eye and retina,” in Progress in molecular biology and translational science, vol. 134, pp. 397–414, Elsevier, 2015. [DOI] [PMC free article] [PubMed] [Google Scholar]
[14].Brzezinski J. A. and Reh T. A., “Photoreceptor cell fate specification in vertebrates,” Development, vol. 142, no. 19, pp. 3263–3273, 2015. [DOI] [PMC free article] [PubMed] [Google Scholar]
[15].Liao M., Liu Y., Yuan J., Wen Y., Xu G., Zhao J., Cheng L., Li J., Wang X., Wang F., et al. , “Single-cell landscape of bronchoalveolar immune cells in patients with covid-19,” Nature Medicine, pp. 1–3, 2020. [DOI] [PubMed] [Google Scholar]
[16].Golchin A., Seyedjafari E., and Ardeshirylajimi A., “Mesenchymal stem cell therapy for covid-19: present or future,” Stem Cell Reviews and Reports, pp. 1–7, 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
[17].Miller A. J., Yu Q., Czerwinski M., Tsai Y.-H., Conway R. F., Wu A., Holloway E. M., Walker T., Glass I. A., Treutlein B., et al. , “In vitro and in vivo development of the human airway at single-cell resolution,” Developmental Cell, 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
[18].Qiu X., Mao Q., Tang Y., Wang L., Chawla R., Pliner H. A., and Trapnell C., “Reversed graph embedding resolves complex single-cell trajectories,” Nature methods, vol. 14, no. 10, p. 979, 2017. [DOI] [PMC free article] [PubMed] [Google Scholar]
[19].Qi Z. and Davidson I., “A principled and flexible framework for finding alternative clusterings,” in Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 717–726, 2009. [Google Scholar]
[20].Butler A., Hoffman P., Smibert P., Papalexi E., and Satija R., “Integrating single-cell transcriptomic data across different conditions, technologies, and species,” Nature biotechnology, vol. 36, no. 5, pp. 411–420, 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
[21].McInnes L., Healy J., and Melville J., “Umap: Uniform manifold approximation and projection for dimension reduction,” arXiv preprint arXiv:1802.03426, 2018. [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

All datasets used in this work are public data. They are available in public repositories, including the mouse retina (GSE118614), COVID-19 (GSE145926), and human lung (E-MTAB-8221).

[R1] [1].Lim B., Lin Y., and Navin N., “Advancing cancer research and medicine with single-cell genomics,” Cancer Cell, vol. 37, no. 4, pp. 456–470, 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R2] [2].Saelens W., Cannoodt R., Todorov H., and Saeys Y., “A comparison of single-cell trajectory inference methods,” Nature biotechnology, vol. 37, no. 5, pp. 547–554, 2019. [DOI] [PubMed] [Google Scholar]

[R3] [3].Regev A., Teichmann S. A., Lander E. S., Amit I., Benoist C., Birney E., Bodenmiller B., Campbell P., Carninci P., Clatworthy M., et al. , “Science forum: the human cell atlas,” Elife, vol. 6, p. e27041, 2017. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R4] [4].Nygaard V., Rødland E. A., and Hovig E., “Methods that remove batch effects while retaining group differences may lead to exaggerated confidence in downstream analyses,” Biostatistics, vol. 17, no. 1, pp. 29–39, 2016. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R5] [5].Clark B. S., Stein-O’Brien G. L., Shiau F., Cannon G. H., Davis-Marcisak E., Sherman T., Santiago C. P., Hoang T. V., Rajaii F., James-Esposito R. E., et al. , “Single-cell rna-seq analysis of retinal development identifies nfi factors as regulating mitotic exit and late-born cell specification,” Neuron, vol. 102, no. 6, pp. 1111–1126, 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R6] [6].Müller C., Schillert A., Röthemeier C., Trégouët D.-A., Proust C., Binder H., Pfeiffer N., Beutel M., Lackner K. J., Schnabel R. B., et al. , “Removing batch effects from longitudinal gene expression-quantile normalization plus combat as best approach for microarray transcriptome data,” PloS one, vol. 11, no. 6, 2016. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R7] [7].Tran H. T. N., Ang K. S., Chevrier M., Zhang X., Lee N. Y. S., Goh M., and Chen J., “A benchmark of batch-effect correction methods for single-cell rna sequencing data,” Genome Biology, vol. 21, no. 1, pp. 1–32, 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R8] [8].Stuart T., Butler A., Hoffman P., Hafemeister C., Papalexi E., Mauck III W. M., Hao Y., Stoeckius M., Smibert P., and Satija R., “Comprehensive integration of single-cell data,” Cell, vol. 177, no. 7, pp. 1888–1902, 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R9] [9].Korsunsky I., Millard N., Fan J., Slowikowski K., Zhang F., Wei K., Baglaenko Y., Brenner M., Loh P.-r., and Raychaudhuri S., “Fast, sensitive and accurate integration of single-cell data with harmony,” Nature methods, pp. 1–8, 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R10] [10].Welch J. D., Kozareva V., Ferreira A., Vanderburg C., Martin C., and Macosko E. Z., “Single-cell multi-omic integration compares and contrasts features of brain cell identity,” Cell, vol. 177, no. 7, pp. 1873–1887, 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R11] [11].Finak G., McDavid A., Yajima M., Deng J., Gersuk V., Shalek A. K., Slichter C. K., Miller H. W., McElrath M. J., Prlic M., et al. , “Mast: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell rna sequencing data,” Genome biology, vol. 16, no. 1, p. 278, 2015. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R12] [12].Schiebinger G., Shu J., Tabaka M., Cleary B., Subramanian V., Solomon A., Gould J., Liu S., Lin S., Berube P., et al. , “Optimal-transport analysis of single-cell gene expression identifies developmental trajectories in reprogramming,” Cell, vol. 176, no. 4, pp. 928–943, 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R13] [13].Stenkamp D. L., “Development of the vertebrate eye and retina,” in Progress in molecular biology and translational science, vol. 134, pp. 397–414, Elsevier, 2015. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R14] [14].Brzezinski J. A. and Reh T. A., “Photoreceptor cell fate specification in vertebrates,” Development, vol. 142, no. 19, pp. 3263–3273, 2015. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R15] [15].Liao M., Liu Y., Yuan J., Wen Y., Xu G., Zhao J., Cheng L., Li J., Wang X., Wang F., et al. , “Single-cell landscape of bronchoalveolar immune cells in patients with covid-19,” Nature Medicine, pp. 1–3, 2020. [DOI] [PubMed] [Google Scholar]

[R16] [16].Golchin A., Seyedjafari E., and Ardeshirylajimi A., “Mesenchymal stem cell therapy for covid-19: present or future,” Stem Cell Reviews and Reports, pp. 1–7, 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R17] [17].Miller A. J., Yu Q., Czerwinski M., Tsai Y.-H., Conway R. F., Wu A., Holloway E. M., Walker T., Glass I. A., Treutlein B., et al. , “In vitro and in vivo development of the human airway at single-cell resolution,” Developmental Cell, 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R18] [18].Qiu X., Mao Q., Tang Y., Wang L., Chawla R., Pliner H. A., and Trapnell C., “Reversed graph embedding resolves complex single-cell trajectories,” Nature methods, vol. 14, no. 10, p. 979, 2017. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R19] [19].Qi Z. and Davidson I., “A principled and flexible framework for finding alternative clusterings,” in Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 717–726, 2009. [Google Scholar]

[R20] [20].Butler A., Hoffman P., Smibert P., Papalexi E., and Satija R., “Integrating single-cell transcriptomic data across different conditions, technologies, and species,” Nature biotechnology, vol. 36, no. 5, pp. 411–420, 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R21] [21].McInnes L., Healy J., and Melville J., “Umap: Uniform manifold approximation and projection for dimension reduction,” arXiv preprint arXiv:1802.03426, 2018. [Google Scholar]

PERMALINK

This is a preprint.

Batch-Corrected Distance Mitigates Temporal and Spatial Variability for Clustering and Visualization of Single-Cell Gene Expression Data

Shaoheng Liang

Jinzhuang Dou

Ramiz Iqbal

Ken Chen

Abstract

1. Introduction

Figure 1:

2. Results

2.1. The Batch-Corrected Distance

Figure 2:

Table 1:

2.2. Mouse retina development dataset

Figure 3:

2.3. COVID-19 immune compartment datasets

Figure 4:

2.4. Human Lung datasets

Figure 5:

3. Discussion

4. Methods

4.1. The Batch-Corrected Distance

4.2. Gene expression data processing

7. Acknowledgements

Footnotes

6. Data Availability

References

Associated Data

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Cite

Add to Collections

PERMALINK

This is a preprint.

Batch-Corrected Distance Mitigates Temporal and Spatial Variability for Clustering and Visualization of Single-Cell Gene Expression Data

Shaoheng Liang

Jinzhuang Dou

Ramiz Iqbal

Ken Chen

Abstract

1. Introduction

Figure 1:

2. Results

2.1. The Batch-Corrected Distance

Figure 2:

Table 1:

2.2. Mouse retina development dataset

Figure 3:

2.3. COVID-19 immune compartment datasets

Figure 4:

2.4. Human Lung datasets

Figure 5:

3. Discussion

4. Methods

4.1. The Batch-Corrected Distance

4.2. Gene expression data processing

7. Acknowledgements

Footnotes

6. Data Availability

References

Associated Data

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases