Abstract
Time-course single-cell RNA sequencing (scRNA-seq) enables researchers to probe genome-wide expression dynamics at the single-cell scale. However, when gene expression is affected jointly by time and cellular identity, analyzing such data (including cell type annotation and the modeling of cell type-dependent dynamics) becomes challenging. To address this problem, we propose SNOW (SiNgle cell flOW map), a deep learning algorithm that deconvolves single-cell time series data into time-dependent and time-independent contributions. SNOW has a number of advantages. First, it enables cell type annotation based on the time-independent dimensions. Second, it yields a probabilistic model that can be used to discriminate between biological temporal variation and batch effects contaminating individual timepoints, and provides an approach to mitigate batch effects. Finally, it is capable of projecting cells forward and backward in time, yielding time series at the individual cell level. This enables gene expression dynamics to be studied without the need for clustering or pseudobulking, which can be error prone and result in information loss. We describe our probabilistic framework in detail and demonstrate SNOW using data from three distinct time-course scRNA-seq studies. Our results show that SNOW is able to construct biologically meaningful latent spaces, remove batch effects, and generate realistic time series at the single-cell level. By way of example, we illustrate how the generated time series may be used to enhance the detection of cell type-specific circadian gene expression rhythms, and note that the approach may be readily extended to other time-series analyses.
1. Introduction
Gene expression is shaped by intrinsic cellular identities and extrinsic environmental conditions. Today, single-cell RNA sequencing (scRNA-seq) technologies enable us to probe how gene expression changes across cell types under various experimental conditions [1-5], with applications ranging from organ development [6, 7] to cancer progression [8, 9] and more recently to the circadian rhythm [10, 11]. To understand the dynamics of these processes, studies have started to directly observe how gene expression profiles change over time via time-course scRNA-seq profiling [7, 12-14], and a number of methods have been developed to characterize and model scRNA-seq time-series data. For example, Waddington-OT [6] applies unbalanced optimal transport to compute the likelihood of cell state transitions. To gain mechanistic insights, PRESCIENT (Potential eneRgy undErlying Single Cell gradIENTs) [15] constructs a global potential function, $\Psi(x)$, and uses $-\nabla\Psi(x)$ to estimate how gene expression, $x$, changes over time via the Euler scheme $x(t + \Delta t) = x(t) - \nabla\Psi(x(t))\,\Delta t$, up to a stochastic noise term. However, this potential function is constructed on the PCA space, which may not represent the relevant geometry and cannot be mapped back to the original gene expression space after the dimensionality is reduced. To overcome this limitation, scNODE [16] uses a variational autoencoder [17] to construct a lower dimensional space with which to find governing equations that recapitulate the observed dynamics.
All the aforementioned methods are variants of parameterizing a flow that satisfies the optimal transport constraint. This approach is useful in contexts where temporal variation affects all cells, such as during development, where cells move smoothly on a lower dimensional space along the same paths (Figure 1A, top). However, this may not be the best description for systems where cells can act in a highly cell type-specific manner over time. In these cases, the paths they take may not be immediately obvious (Figure 1A, bottom), and may even prevent us from correctly annotating cell types. As a result, in this latter situation, it is desirable to remove the effect of time to facilitate cell type annotation, which is usually achieved by integrating and batch-correcting the time points. Since the removal of temporal effects also removes biologically meaningful dynamics that one may wish to study, further analyses use non-integrated data to study the average expression for each cell type over time. While data integration remains an active field of research [18-21], conclusions drawn from such analyses will depend on the quality of the integration and cell type annotation from the first stage.
To address these problems, we sought to simultaneously decompose gene expression into time–dependent and time–independent components (Figure 1B). By doing this, we can conduct cell type annotation using the time–independent component, study dynamics without requiring cell type annotation, and project cells forward and backward in time to generate time–series for each individual cell (Figure 1C, top). When cell type labels are provided, one can combine time series generated from individual cells to mitigate the impact of batch effects (Figure 1C, bottom panel). Here, we describe SNOW (SiNgle cell flOW map), an unsupervised probabilistic approach for the annotation, normalization and interpolation of single cell time series data. Our approach parameterizes a zero-inflated negative binomial distribution using latent coordinates computed from the count data. To demonstrate its utility, we show that the latent space constructed by SNOW can capture biologically meaningful structure and map cells collected at one time point to past and future states. By constraining the second derivative of generated time series, SNOW also indirectly removes potential batch effects contaminating the time–series. To our knowledge, SNOW is the only method focusing on the analysis of time series of differentiated cells, in which the effects of time and cell state may be mixed in the data (Figure 1A, bottom).
2. SNOW algorithm
We aim to achieve three things with SNOW. First, we wish to construct a time-independent characterization of the cell state to facilitate cell type annotation. This is achieved by minimizing the Wasserstein distance between the prior, $p(z)$, and the latent distribution conditioned on sampling time, $q(z \mid t)$. Second, we wish to map cells forward and backward in time such that their model-generated gene expression time series matches that of the population average (Figure 1C, top). To increase the smoothness of the interpolated trajectories, we incorporated in the loss function the second derivative of the generated time series to penalize high curvature (see Methods for more detail). As a consequence of this second derivative loss, batch effects in the form of a sudden increase/decrease in expression will be simultaneously removed (Figure 1C, bottom). Third, we wish to infer the sample collection time for an untimed sample, which is an active field of research in chronobiology [22-24]. To do this, we incorporated two additional terms in the loss function: one related to predicting the actual sampling time of each cell, and another related to predicting the sampling time of a cell after being mapped to another time by the model.
To achieve this, SNOW models the observed count of gene $g$ from cell $c$ collected at time $t$, $x_{cg}$, as a sample from a zero-inflated negative binomial (ZINB) distribution that is dependent on the observed library size of the cell ($l_c$), time ($t$), and cell state ($z_c$). The cell state $z_c$ is a low-dimensional vector computed by an encoder network that represents the time-independent biological variation contributing to $x_c$. To remove the effect of time, we constrain the variational posterior conditioned on time, $q(z \mid t)$, to be close to the prior $p(z)$. The resulting time-independent representation of the cell state can, if desired, be used to conduct cell type annotation (Figure 1B). In the process of computing the log likelihood, $z_c$ and $t$ are used to construct $\rho_{cg}$, which represents the expected percentage of all reads in cell $c$ that originate from gene $g$ at time $t$. By changing $t$ as an input to the decoder, we can generate a gene expression profile of a cell collected at past or future times. In other words, we create an object similar to a flow map, in which the expected expression of the past/future state of a cell can be generated without time integration, which would be required if the system were parameterized by a system of ordinary differential equations.
Details of the algorithm are given below.
2.1. General probabilistic framework
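We model the count matrix $X \in \mathbb{N}^{N \times G}$ with a zero-inflated negative binomial (ZINB) distribution [25, 26], where $N$ and $G$ are the number of cells and genes in the sample, respectively. Without zero-inflation, a given entry within $X$, $x_{cg}$, is modeled as:
$$p_{\mathrm{NB}}(x_{cg} \mid z_c, t, l_c) = \frac{\Gamma(x_{cg} + \theta_{cg})}{\Gamma(\theta_{cg})\,\Gamma(x_{cg} + 1)} \left(\frac{\theta_{cg}}{\theta_{cg} + l_c\rho_{cg}}\right)^{\theta_{cg}} \left(\frac{l_c\rho_{cg}}{\theta_{cg} + l_c\rho_{cg}}\right)^{x_{cg}} \tag{1}$$
where $\Gamma$ is the standard gamma function, $z_c$ the (time-independent) encoded state of $x_c$ sampled at $t$, $\theta_{cg}$ the gene- and cell-specific inverse dispersion, $l_c$ the library size of cell $c$, and $\rho_{cg}$ the count fraction of gene $g$ in cell $c$ such that $\sum_g \rho_{cg} = 1$. $\rho_{cg}$ and $\theta_{cg}$ are optimized using neural networks $f_\rho$ and $f_\theta$ respectively.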
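Zero-inflation is added with the following form:
$$p_{\mathrm{ZINB}}(x_{cg} = 0 \mid z_c, t, l_c) = \pi_{cg} + (1 - \pi_{cg})\, p_{\mathrm{NB}}(0 \mid z_c, t, l_c) \tag{2}$$
$$p_{\mathrm{ZINB}}(x_{cg} = k \mid z_c, t, l_c) = (1 - \pi_{cg})\, p_{\mathrm{NB}}(k \mid z_c, t, l_c), \quad k > 0 \tag{3}$$
where the dropout probability $\pi_{cg}$ is parameterized with a neural network $f_\pi$. Since elements of $x_c$ are conditionally independent of each other given $z_c$ and $t$, we can compute the probability of observing the count profile of a particular cell as:
$$p(x_c \mid z_c, t, l_c) = \prod_{g=1}^{G} p_{\mathrm{ZINB}}(x_{cg} \mid z_c, t, l_c) \tag{4}$$
Or equivalently:
$$\log p(x_c \mid z_c, t, l_c) = \sum_{g=1}^{G} \log p_{\mathrm{ZINB}}(x_{cg} \mid z_c, t, l_c) \tag{5}$$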
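The per-cell log likelihood described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the SNOW implementation: the function names are hypothetical, the negative binomial is parameterized by its mean (library size times count fraction) and inverse dispersion, and the dropout probability enters as a mixture weight on zero counts.

```python
import math

def nb_log_pmf(x, mu, theta):
    """Log negative binomial probability of count x with mean mu > 0
    and inverse dispersion theta > 0."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )

def zinb_log_pmf(x, mu, theta, pi):
    """Log ZINB probability: a dropout occurs with probability pi,
    otherwise the count follows the negative binomial."""
    nb = nb_log_pmf(x, mu, theta)
    if x == 0:
        return math.log(pi + (1.0 - pi) * math.exp(nb))
    return math.log(1.0 - pi) + nb

def cell_log_likelihood(counts, rho, theta, pi, library_size):
    """Sum of per-gene ZINB log probabilities for one cell, using the
    conditional independence of genes given the latent state and time."""
    return sum(
        zinb_log_pmf(x, library_size * r, th, p)
        for x, r, th, p in zip(counts, rho, theta, pi)
    )

# Toy cell with three genes (all values are illustrative).
example = cell_log_likelihood(
    counts=[0, 3, 1], rho=[0.2, 0.5, 0.3],
    theta=[2.0, 2.0, 2.0], pi=[0.1, 0.1, 0.1], library_size=10.0,
)
```

Working in log space with `math.lgamma` avoids overflow in the gamma-function ratio for large counts.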
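Our framework allows the generation of "virtual" cells by assuming a Gaussian prior, a commonly used prior for building variational auto-encoders, as follows:
$$z \sim \mathcal{N}(0, I)$$
$$\rho = f_\rho(z, t), \quad \theta = f_\theta(z, t)$$
$$y \sim \mathrm{GammaPoisson}(l\rho, \theta)$$
$$h \sim \mathrm{Bernoulli}(f_\pi(z, t))$$
$$x_g = \begin{cases} y_g & \text{if } h_g = 0 \\ 0 & \text{otherwise} \end{cases}$$
where $z$ is the time-independent latent representation of a cell; $\rho$ is the normalized expression profile (or count fraction) enforced by using a softmax activation function in the last layer of $f_\rho$; $x$ is the count profile of the virtual cell; and $l$ is the observed library size. The Gamma-Poisson process generates $y$ following a negative binomial distribution with mean $l\rho$, while $h$ is a binary vector that represents dropouts. $f_\rho$, $f_\theta$, and $f_\pi$ are neural networks that map the latent space and time back to the full gene space, $\mathbb{R}^G$.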
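To make the generative process concrete, the following sketch draws one "virtual" cell: a latent vector is sampled from a standard normal prior, a decoder maps it (together with time) to count fractions, inverse dispersions and dropout probabilities, and counts are drawn through a Gamma-Poisson mixture followed by dropout. The decoder here (`toy_decoder`) is a hypothetical stand-in for the trained networks, and the Gamma parameterization (shape equal to the inverse dispersion, scale chosen so that the mean is library size times count fraction) is an assumption.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for the small means used here."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def sample_virtual_cell(decode, latent_dim, t, library_size, rng):
    z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]  # z ~ N(0, I)
    logits, theta, pi = decode(z, t)
    rho = softmax(logits)  # count fractions sum to one
    counts = []
    for r, th, p in zip(rho, theta, pi):
        mean = library_size * r
        # Gamma(shape=th, scale=mean/th) has expectation `mean`; a Poisson
        # draw with this rate is negative binomial with mean `mean`.
        rate = rng.gammavariate(th, mean / th)
        y = sample_poisson(rate, rng)
        dropout = rng.random() < p  # Bernoulli dropout indicator
        counts.append(0 if dropout else y)
    return counts

def toy_decoder(z, t):
    """Hypothetical decoder: three genes, one of which oscillates in time."""
    logits = [z[0], z[1], math.cos(2.0 * math.pi * t / 24.0)]
    return logits, [5.0, 5.0, 5.0], [0.0, 0.0, 0.0]

cell = sample_virtual_cell(toy_decoder, 2, 6.0, 20.0, random.Random(0))
```

Changing `t` while holding `z` fixed is what allows the same cell to be generated at a different time point.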
2.2. SNOW loss function
A number of methods have used variational autoencoders (VAEs) [17] to model count data from single-cell RNA seq [16, 25, 27]. All have used loss functions reminiscent of the evidence lower bound (ELBO), which constrains the shape of the latent space indirectly via the KL-divergence term:
$$\mathrm{ELBO} = \mathbb{E}_{q(z \mid x)}\left[\log p(x \mid z)\right] - \mathrm{KL}\left(q(z \mid x)\,\|\,p(z)\right) \tag{6}$$
(See derivation of ELBO in Supplement.) In the above expression, $p(z)$, the prior distribution of the representations $z$, has been chosen for convenience to be $\mathcal{N}(0, I)$ and $q(z \mid x)$ is the variational posterior distribution of $z$ constructed by the encoder network. The KL-divergence term provides the model some level of robustness, as it essentially requires nearby points in the latent space to be decoded into similar objects. However, as the dimensionality of the data grows, the log-likelihood term of the ELBO will dominate over the regularizing KL-divergence term. While this is unaccounted for in SCVI [25], both scNODE [16] and SCVIS [27] incorporate scaling factors to maintain the strength of the regularization of the latent space. By definition, maximizing the ELBO can lead to the maximization of the marginal likelihood $p(x)$,
$$\log p(x) = \mathrm{ELBO} + \mathrm{KL}\left(q(z \mid x)\,\|\,p(z \mid x)\right) \tag{7}$$
When $\mathrm{KL}\left(q(z \mid x)\,\|\,p(z \mid x)\right) = 0$, or equivalently $q(z \mid x) = p(z \mid x)$, the ELBO will be equal to the marginal log likelihood, $\log p(x)$. However, when the ELBO is not tight, its optimization can lead to an enlargement of the approximation error, $\mathrm{KL}\left(q(z \mid x)\,\|\,p(z \mid x)\right)$. To account for this, SNOW regularizes the latent space directly by minimizing the distance between the latent distribution and the prior as measured by the Wasserstein distance. Briefly, in addition to the log likelihood term, the SNOW loss function begins with two main regularization terms, the former of which regularizes the latent space and the latter of which enables predictions of the sampling time:
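The relationship between the ELBO, the marginal log likelihood and the approximation error can be checked numerically on a toy model with a discrete latent variable. The sketch below is illustrative only (the function name and the two-state example are assumptions): it computes the ELBO for an arbitrary variational distribution and confirms that the gap to the marginal log likelihood is exactly the KL divergence to the true posterior.

```python
import math

def elbo_and_gap(prior, likelihood, q):
    """Toy discrete-latent model.

    prior[k]      = p(z = k)
    likelihood[k] = p(x | z = k) for one fixed observation x
    q[k]          = variational posterior q(z = k | x), all entries > 0
    Returns (ELBO, KL(q || true posterior), log p(x)).
    """
    evidence = sum(p * l for p, l in zip(prior, likelihood))  # p(x)
    posterior = [p * l / evidence for p, l in zip(prior, likelihood)]
    recon = sum(qk * math.log(l) for qk, l in zip(q, likelihood))
    kl_prior = sum(qk * math.log(qk / p) for qk, p in zip(q, prior))
    kl_post = sum(qk * math.log(qk / pk) for qk, pk in zip(q, posterior))
    return recon - kl_prior, kl_post, math.log(evidence)

# A loose variational posterior: the ELBO sits strictly below log p(x).
elbo, gap, log_px = elbo_and_gap([0.5, 0.5], [0.2, 0.6], [0.3, 0.7])
```

Here the true posterior is (0.25, 0.75), so any other choice of `q` leaves a strictly positive gap.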
(8) |
In the above expression, regularizes the latent space and enforces time–independence via:
(9) |
where $W_2(P, Q)$ denotes the Wasserstein-2 distance between distributions $P$ and $Q$. This regularization enables the generation of a "virtual" cell when $z$ is sampled from $p(z)$. To ensure our model can generate proper "synthetic" cells sampled from different time points, we enforced two things. First, the time-independent components of the "synthetic" cells should follow the same distribution as that of the real cells (a Gaussian distribution). Second, the sampling time of the "synthetic" cells should remain predictable. To achieve this, we impose:
(10) |
where is the sampling time of the “synthetic” cells. And finally, we constrain the second derivative of the generated time series to enforce smoothness:
(11) |
where is the number of genes and is the average of over all generated time points. In practice, we find that computing for a randomly selected gene, , in each training loop to be computationally cheaper and sufficient to generate smooth time series, giving the final form of our loss function:
(12) |
which preserves the latent space distribution, its time independence, and ensures the smoothness of the generated time series.
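The idea behind the smoothness term can be illustrated with a finite-difference curvature penalty on a single generated time series. This is a sketch under assumptions rather than the exact loss used by SNOW: the second derivative is approximated by central differences on an evenly spaced grid, and the series is divided by its mean over the generated time points so that the penalty is comparable across genes with different expression levels.

```python
def curvature_penalty(series, dt=1.0):
    """Mean squared second finite difference of a mean-normalized series.

    `series` holds the generated expression of one gene at evenly spaced
    time points (at least three, with a non-zero mean); `dt` is the spacing.
    """
    mean = sum(series) / len(series)
    scaled = [v / mean for v in series]
    second = [
        (scaled[i - 1] - 2.0 * scaled[i] + scaled[i + 1]) / dt ** 2
        for i in range(1, len(scaled) - 1)
    ]
    return sum(d * d for d in second) / len(second)

# A sudden spike at one time point is penalized far more than a smooth bump.
spiky = curvature_penalty([1.0, 1.0, 10.0, 1.0, 1.0])
smooth = curvature_penalty([1.0, 1.2, 1.4, 1.2, 1.0])
```

Because an isolated spike at a single time point produces large second differences, minimizing such a term discourages the abrupt changes characteristic of batch effects.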
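In practice, we simplify the calculation by replacing the Wasserstein-2 distance with a more computationally tractable form, the sliced Wasserstein distance [28], defined as:
$$SW_2^2(P, Q) = \int_{\mathbb{S}^{d-1}} W_2^2(P_\omega, Q_\omega)\, d\omega \tag{13}$$
where the distributions $P_\omega$ and $Q_\omega$ can be generated by first sampling from $P$ and $Q$ directly before projecting them in a random direction, $\omega$, sampled uniformly from the unit sphere $\mathbb{S}^{d-1}$. Given a set of $N$ data points $X$ with an unknown underlying distribution $P$, the sliced-Wasserstein distance with respect to a known distribution, such as the standard normal, can be easily computed as:
$$SW_2^2\left(P, \mathcal{N}(0, I)\right) \approx \frac{1}{NK} \sum_{k=1}^{K} \sum_{i=1}^{N} \left(\tilde{X}_{ik} - \tilde{Y}_{ik}\right)^2 \tag{14}$$
where $\tilde{X} = X\Omega$ and $\tilde{Y} = Y\Omega$, with $Y$ a set of $N$ samples from the standard normal and $\Omega$ a matrix whose $K$ columns are random unit directions, and we assume the columns of $\tilde{X}$ and $\tilde{Y}$ are sorted such that elements of both $\tilde{X}$ and $\tilde{Y}$ are arranged in ascending/descending order.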
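The sliced Wasserstein computation against a standard normal reference can be sketched as follows. This is a minimal standard-library illustration, not the SNOW code: the function names and the default number of projections are assumptions, and the estimate returned is the squared distance averaged over random directions. Since the projection of a standard normal sample onto a unit vector is itself standard normal in one dimension, the reference samples are drawn directly in one dimension.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Direction drawn uniformly from the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def sliced_w2_to_standard_normal(points, n_projections=50, rng=None):
    """Monte Carlo estimate of the squared sliced Wasserstein-2 distance
    between the empirical distribution of `points` and N(0, I)."""
    if rng is None:
        rng = random.Random(0)
    n, dim = len(points), len(points[0])
    total = 0.0
    for _ in range(n_projections):
        omega = random_unit_vector(dim, rng)
        # Project the data onto the direction and sort.
        proj = sorted(sum(p * w for p, w in zip(x, omega)) for x in points)
        # Equivalent to projecting N(0, I) samples onto omega.
        ref = sorted(rng.gauss(0.0, 1.0) for _ in range(n))
        # In 1-D, optimal transport matches sorted samples in order.
        total += sum((a - b) ** 2 for a, b in zip(proj, ref)) / n
    return total / n_projections
```

Sorting reduces each one-dimensional transport problem to an order-matched comparison, which is what makes the sliced form cheap relative to the full Wasserstein distance.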
2.3. Neural network optimization
By default, SNOW uses a 3-layer encoder neural network with 256 fully connected neurons per layer and ReLU activation to project count data onto a 32-dimensional latent space $z$. Subsequently, $z$ and $t$ were used as input to individual neural networks ($f_\rho$, $f_\theta$ and $f_\pi$) with the same structure as the encoder network to generate the count fraction, inverse dispersion and dropout probability. To ensure that $f_\pi$ generates probabilities, its last layer is activated by a sigmoid function so that its output ranges from 0 to 1. We further clamped the dropout probability between 0.01 and 0.99 to avoid numerical problems in the log likelihood. As mentioned above, the last layer of $f_\rho$ is activated by a softmax function to enforce that its outputs sum to one. During each training loop, we focus only on a randomly selected small subset of the data, by default 300 cells. Everything within the loss function is computed from information contained within this subset of 300 cells, which enables our method to be applied to larger datasets in a memory-efficient manner.
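The output constraints and mini-batching described above can be sketched without a deep learning library. The helper names below are hypothetical; the clamp bounds (0.01 and 0.99) and the batch size (300) follow the text, while everything else is an illustrative assumption.

```python
import math
import random

def softmax(logits):
    """Count fractions: non-negative and summing to one."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def clamped_sigmoid(v, low=0.01, high=0.99):
    """Dropout probability: sigmoid output clamped away from 0 and 1."""
    if v >= 0.0:
        s = 1.0 / (1.0 + math.exp(-v))
    else:
        e = math.exp(v)  # numerically stable branch for large negative v
        s = e / (1.0 + e)
    return min(max(s, low), high)

def sample_minibatch(n_cells, batch_size=300, rng=None):
    """Indices of a random subset of cells used for one training step."""
    if rng is None:
        rng = random.Random(0)
    return rng.sample(range(n_cells), min(batch_size, n_cells))

batch = sample_minibatch(2325)  # e.g. the number of LD clock neurons
```

Clamping keeps both the dropout probability and its complement strictly positive, so their logarithms stay finite.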
In all test cases, the optimization of the model parameters was done with the ADAM [29] optimizer as implemented by pytorch [30] with a learning rate of 0.0005, , , and a weight decay of 0.0001. No scheduler was used to change the learning rate during the training process.
3. Materials and Methods
3.1. Datasets
The circadian Drosophila clock neuron dataset
The Drosophila clock neuron dataset [10] (mean UMI/cell = 20060) was collected from Drosophila clock neurons every four hours with two replicates (12 time points in total) under both light-dark (LD) and dark-dark (DD) cycles. We focused our analysis on cells subject to the LD cycle, which comprise 2325 cells. Count data were downloaded from the Gene Expression Omnibus under the accession code GSE157504 and the relevant metadata from https://github.com/rosbashlab/scRNA_seq_clock_neurons. Data integration was conducted using the IntegrateData function from Seurat [18] with ndim = 1:50 and k.weight = 100. The resulting counts were used as input to the model.
The circadian mouse aorta dataset
The mouse aorta dataset [31] (mean UMI/cell = 14181) was collected every 6 hours (4 time points in total) under LD conditions, with a total of 21998 cells. H5ad files of the smooth muscle cells (SMC) and fibroblasts were downloaded from https://www.dropbox.com/sh/tl0ty163vyg265i/AAApt14eybExMMPK7VVDmfvga. Raw counts were used as input to the model.
The lung regeneration dataset
The lung regeneration dataset [32] (mean UMI/cell = 1585) was collected every day for two weeks (day 1 through day 14), and on days 21, 28, 36 and 54. We used AT2 cells, ciliated cells and club cells because they are activated after bleomycin treatment, resulting in a total of 24383 cells. Gene expression data were downloaded from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE141259 along with the associated metadata. Raw counts were used as input to the model.
3.2. Identifying batch effects
To identify genes potentially affected by batch effects, we looked for two types of patterns: spurious expression and spurious detection. We consider a gene to have spurious expression if its maximum normalized expression at one time point is five times higher than its average over all time points; and we consider a gene to have spurious detection if its maximum capture rate (the number of cells that contain the gene divided by the total number of cells collected at that time point) at one time point is five times greater than its mean capture rate over all time points. Here, we used the empirical count fraction $\rho$ as normalized expression. To exclude genes that are almost never detected, we only used those with an average normalized expression (across all time points) greater than 0.00001 and an average capture rate (across all time points) over 5%. In the clock neuron dataset, this analysis identifies 124 genes with unusual capture rates and 24 genes with unusual expression, with no overlap between the two sets.
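The two criteria can be expressed as a short function operating on per-time-point summaries of one gene. This is a sketch of the rule as stated in the text; the function name, argument layout and the use of strict inequalities are assumptions.

```python
def flag_gene(expr_by_time, capture_by_time, fold=5.0,
              min_mean_expr=1e-5, min_mean_capture=0.05):
    """Flag a gene for spurious expression and/or spurious detection.

    expr_by_time    : normalized expression of the gene at each time point
    capture_by_time : fraction of cells expressing the gene at each time point
    """
    mean_expr = sum(expr_by_time) / len(expr_by_time)
    mean_capture = sum(capture_by_time) / len(capture_by_time)
    # Genes that are almost never detected are excluded from the analysis.
    if mean_expr <= min_mean_expr or mean_capture <= min_mean_capture:
        return {"spurious_expression": False, "spurious_detection": False}
    return {
        "spurious_expression": max(expr_by_time) > fold * mean_expr,
        "spurious_detection": max(capture_by_time) > fold * mean_capture,
    }

# Expression jumps 100-fold at a single time point out of twelve.
spike = flag_gene([1e-4] * 11 + [1e-2], [0.5] * 12)
```

Note that a maximum can only exceed five times the mean when there are more than five time points, so the rule is informative for the 12-point clock neuron series but not for a 4-point series.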
3.3. Detecting circadian behavior on a single cell level
To conduct cycling detection for each individual cell, we generated de novo time series by concatenating the time-independent representation of the cell state, $z$, with time, $t$. We generated time series comprising 24 time points spanning 48 hours. We then conducted harmonic regression on these time series, resulting in a $p$ value, phase estimate and amplitude estimate for each gene from each cell.
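A single-harmonic regression of this kind can be sketched as follows. The sketch assumes a 24-hour period and an evenly spaced grid covering whole periods (for example 24 points at 2-hour spacing over 48 hours), in which case the cosine and sine regressors are orthogonal and the least-squares coefficients have a closed form; the exact grid and test statistic used in the study may differ. The $p$ value comes from an F-test of the harmonic model against a constant, using the closed-form survival function of the F distribution with 2 numerator degrees of freedom.

```python
import math

def harmonic_regression(times, values, period=24.0):
    """Fit values ~ m + a*cos(wt) + b*sin(wt); return (p_value, phase, amplitude).

    Assumes evenly spaced `times` covering an integer number of periods.
    Phase is the time of peak expression, in the units of `times`.
    """
    n = len(values)
    omega = 2.0 * math.pi / period
    mean = sum(values) / n
    a = 2.0 / n * sum(v * math.cos(omega * t) for t, v in zip(times, values))
    b = 2.0 / n * sum(v * math.sin(omega * t) for t, v in zip(times, values))
    fitted = [mean + a * math.cos(omega * t) + b * math.sin(omega * t)
              for t in times]
    rss = sum((v - f) ** 2 for v, f in zip(values, fitted))
    tss = sum((v - mean) ** 2 for v in values)
    d2 = n - 3  # residual degrees of freedom
    if tss <= 0.0:
        p_value = 1.0  # a perfectly flat series carries no rhythm
    elif rss <= 0.0:
        p_value = 0.0  # perfect fit
    else:
        f_stat = max(((tss - rss) / 2.0) / (rss / d2), 0.0)
        # Survival function of F(2, d2) in closed form.
        p_value = (1.0 + 2.0 * f_stat / d2) ** (-d2 / 2.0)
    amplitude = math.hypot(a, b)
    phase = (math.atan2(b, a) / omega) % period
    return p_value, phase, amplitude

times = [2.0 * i for i in range(24)]  # 24 points over 48 hours
values = [5.0 + 2.0 * math.cos(2.0 * math.pi * (t - 6.0) / 24.0) for t in times]
p_value, phase, amplitude = harmonic_regression(times, values)
```

Reporting the amplitude alongside the $p$ value matters because a nearly flat but very smooth generated series can still yield a small $p$ value.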
4. Results
4.1. SNOW constructs biologically meaningful latent spaces
Time can have a profound impact on single-cell data when it contributes to gene expression together with cell state. To illustrate this, we used UMAP [33] to create lower dimensional embeddings of time-series scRNA-seq data collected from the fly clock neurons [10] and the mouse aorta [31], both with existing cell type annotations (see Supplement for details regarding cell type annotation). We observed that the effect of time strongly drove clustering in the UMAP space (Figure 2, left column). As illustrated in the top row of Figure 2, while the UMAP space can separate the smooth muscle cells (SMCs) and fibroblasts, the SMC cluster contains subclusters, each corresponding to a different sampling time. This effect is even stronger in the fly clock neurons, where the UMAP projection separates into small, disjoint clusters, each containing cells of different cell types sampled at a particular point in time.
To construct a representation of the cell state that is independent of time for cell type annotation, we regularized the probability distribution of the latent coordinates conditioned on the sampling time, $q(z \mid t)$, by minimizing its sliced Wasserstein distance with respect to the prior (see details in Methods). This approach is in principle more efficient than minimizing the maximum mean discrepancy [34] described in previous work [25] because fewer computations are needed. With our approach, we were able to create latent representations of the cells that capture the original cell type annotation while remaining independent of their sampling times (Figure 2, right column).
Close examination of the SNOW latent space generated from the Drosophila data (Figure 2, bottom row) revealed that it retains variation attributable to cell type. Adding the original cell type annotations to the UMAP plot of the SNOW-processed data (Figure 2, bottom right), we find dorsal neurons ('DN's) located on the top and right side, and lateral neurons ('LN's) on the left (Figure 2C). Interestingly, we observed that a group of dorsal neurons (6:DN1p, 18:DN1p, 19:DN2) and lateral neurons (9:LNd_NPF, 12:LNd) merged into two larger clusters in our latent space (Figure 2, bottom right, and Figure S2). On the other hand, we also observed that cluster 14 breaks into at least two smaller clusters (Figure S2). To identify the origin of this discrepancy, we conducted data integration with Seurat [35] (see details in Methods) with the features used to train our model, and made similar observations (Figure S2, S3). This result suggests that the merging and breaking of clusters in our embedding can be attributed to the small feature set used in the original annotation of cell types. Additionally, we applied SNOW to a time series dataset charting the regeneration of mouse lungs subjected to bleomycin-mediated injury [32] and observed that cells significantly affected by bleomycin in the original gene expression space are now embedded closer to their untreated counterparts in the UMAP space generated from SNOW (Figure S4).
4.2. SNOW maps cells forward and backward in time
SNOW generates a latent space that is independent of time and contains a decoder that reconstructs the transcriptome when the latent state and time are both supplied. In principle, then, it is possible to provide the latent state and an unsampled time to generate an expression profile of a specific cell at another timepoint. To test whether we can produce expression dynamics for each cell that resemble the average of its cell type, we generated de novo time series by concatenating the latent representation of a cell, $z$, with time, $t$. Because the concatenated $t$ can be different from the sampling time of the cell, we refer to this as the "pseudo" sampling time. We generated time series using latent representations of the mouse aorta, which was sampled every six hours for one day, by using 100 equally spaced pseudo sampling times. One might then reasonably ask: if the generated data had in fact been observed data, would the encoder network have correctly identified the time that was used to generate the pseudo sample? By supplying the generated expression profile back to the encoder network, we observed that we are capable of re-inferring the pseudo sampling time of each cell accurately (Figure 3A), with a mean absolute error of 0.80 and 0.79 hours for the smooth muscle cells and the fibroblasts respectively (Figure 3B). Overlaying the mean absolute error on its UMAP projection identified no regions with particularly large errors (Figure 3C).
To further validate our approach, we averaged the generated time series for all cells from the same cell type and compared this population average to the experimental data (red lines in Figure 3D). Using well-characterized circadian genes as examples, we observed that the population average of the generated time series exhibits clear oscillatory dynamics and matches the empirical observations closely (Figure 3D). It is worth noting that no constraint was imposed during the training process to shape the generated population average. This observation suggests that the agreement between the observed and the generated dynamics is a consequence of a successful deconstruction of the gene expression into time-dependent and time-independent components.
We next repeated this test on the clock neuron dataset, sampled every 4 hours for two days, and observed that our model remained competent at "predicting" pseudo sampling times (Figure S5A), with a mean absolute error ranging from 1.5 hours to less than 3 hours. As before, we observed SNOW-generated oscillations in known circadian markers in concordance with experimental observation (Figure 3E, F). Despite the proximity of the 1:DN1p_CNMa cluster and the 2:s_LNv cluster in the UMAP space (Figure 2, bottom left), we observed the mean expression level of the generated expression time series of CNMa to differ tenfold, suggesting that our use of a fixed latent space standard deviation did not prevent the model from learning the distinctiveness of each cell type.
Interestingly, we found that the quality of the generated time series is closely tied to the size of the latent standard deviation ($\sigma$). In the clock neuron dataset, we observed that a small $\sigma$ leads to dampened oscillations in the long run (Figure S5B). However, this effect is not apparent in the mouse aorta data (Figure S5B), potentially because of its larger sample size, simpler cell type composition, and fewer sampling times.
4.3. SNOW corrects batch effects
While batch effects can be difficult to identify and correct, the fact that samples are related in time provides us a potential route of correction by prohibiting abrupt changes of expression, formally achieved by constraining the second derivative of the generated time series. To test whether SNOW can reduce the impact of batch effects in time–course data, we first identified genes that have been potentially affected. We consider a gene to be severely impacted by a batch effect if it is mostly detected only at a single time point. For those that are consistently detected across time points, we assume they are affected if their expression level at a particular time point is much higher than that of the rest (see Methods). With these two criteria, we identified 148 genes within the 1:DN1p_CNMa cluster from the clock neuron data and observed that 117 of them are considered to be features by Seurat [35]. As Seurat identifies features by looking for outliers on a mean–variance plot, it is expected, and alarming, that genes satisfying our criteria will be considered as features. By constructing time series using all sampled 1:DN1p_CNMa cells to span the entirety of the experiment, we observed that the generated signal is unaffected by the outlier samples (Figure 4A).
Interestingly, we observed that a large proportion of the selected genes appear to be impacted by a batch effect at time ZT38. Direct visualization of the expression level of putative batch–affected genes on the UMAP space implies that these genes, which were not originally used as features, may contribute to the disagreement between the original cell type assignment and our latent space. For example, Figure 4B illustrates that cells annotated as 1:DN1p_CNMa neurons that had an elevated expression of batch–affected genes are located away from the main cluster. This suggests that what appears to be batch effect may simply be an artifact of bad cell type annotation. Since we can compute the likelihood of making an observation, if cells considered to be 1:DN1p_CNMa neurons at ZT38 were, in fact, of some other origin, cells collected at ZT38 would stand out from the rest of the time series, but the log–likelihood would not. To test this, we computed the log likelihoods of observing the experimental data and observed that gene-wise log likelihood also shows a sharp drop at the time when gene expression peaks (Figure S6), indicating that the observed expression level has a low probability of occurrence under our statistical model. Computing the log likelihood of observing the entire cell by summing up the probabilities of observing each gene, we noticed a drop at ZT38 for almost all cell types (Figure 4C, S6C), in agreement with our observation that a large fraction of the identified genes were impacted at ZT38. Additionally, this drop of log likelihood at ZT38 remained even when all cells were pooled together (Figure S6B), suggesting that the expression peaks we observed at ZT38 cannot solely be attributed to cell type assignment.
To summarize, we showed that SNOW can generate time series that are unaffected by outlier samples and that our underlying statistical framework is capable of detecting batch–affected genes.
4.4. SNOW allows unsupervised identification of circadian rhythms in gene expression
The discovery of tissue–specific circadian regulation [36] and advances in single cell technologies have led to studies that report cell-type specific circadian oscillation [10, 11]. While circadian time series conducted on the tissue level can be directly supplied to a number of readily available cycling detection algorithms [37, 38], single cell data requires some special considerations. First, proper cell type annotation requires the removal of all temporal effects. While this can be achieved via data integration, integrated data cannot be used for cycling detection, forcing users to conduct cell type annotation with integrated data but perform cycling detection with “raw” data. Moreover, one needs to choose whether to consider each cell as a replicate or to construct pseudobulk data for each time point. However, considering cells as replicates can be highly computationally inefficient, and it has been shown that constructing pseudobulk profiles can generate false positives, especially for genes with low expression [39].
With SNOW, we can generate a transcriptome-wide time series for all cells by projecting them forward and backward in time, thus enabling us to conduct cycling detection at the single-cell level. For each cell, we are now able to obtain a $p$ value, phase estimate, and estimated amplitude for each gene using harmonic regression. To test if these $p$ values are biologically meaningful, we first took their average across all cells and observed that the known circadian genes vri and tim had the lowest $p$ values among all genes. Next, we compared our results to the published list of per-cell-type cycling genes [10]. As demonstrated in Figure S7, genes that were reported to have rhythmic expression in multiple cell types had smaller average (across all cells) $p$ values and larger average (across all cells) amplitudes.
We then investigated the biological interpretation of the cell-level statistics. Overlaying harmonic regression $p$ values on the UMAP space showed that vri and tim are highly cyclic in all cells (Figure 5A). Additionally, known circadian genes such as Clk and per were also considered highly cyclic in most of the annotated clusters (Figure 5A). To ensure that SNOW can capture cell/cell type-specific features, we also overlaid the estimated oscillation amplitude and phase for each cell (Figure 5A). Interestingly, we observed high oscillation amplitudes of vri and tim in all labeled clusters. By contrast, cluster 16, an unnamed cluster, stood out for having a much lower amplitude despite its close proximity to the high-amplitude dorsal neurons on the UMAP space. Additionally, despite the fact that the phases of vri, tim and Clk were reported to be largely identical across cell types [10], we observed that SNOW is capable of discerning fine phase differences between clusters at the single-cell level (Figure 5A).
We observed that there are cases where the harmonic regression $p$ values from flat genes are low, which leads to disagreement between our analysis and the published cycling genes. By looking at the estimated amplitudes, we found that these disagreements can be resolved by using amplitude criteria that exclude cells/clusters with low oscillation amplitudes (Figure S8). We also observed that sky, which was reported to be cycling in the 2:s_LNv and 1:DN1p_CNMa clusters, also appeared to be cycling in two other DN1p clusters (Figure 5B, left panel). While the estimated amplitudes of sky from the two DN1p clusters are smaller than that of the 1:DN1p_CNMa cluster, they are similar to that of 2:s_LNv neurons (Figure 5B, right panel). A closer look at the time series generated from the two DN1p groups revealed expression dynamics distinct from that of 1:DN1p_CNMa but similar to that of 2:s_LNv, suggesting sky may be cycling in a larger population of dorsal neurons than previously believed. Another gene, Ddc, which was also reported to cycle in the 2:s_LNv neurons, showed high $p$ values and low amplitudes in our analysis (Figure S9A). Comparing SNOW-generated time series to the experimental observations (Figure S9B) suggests that this may have been a false positive in the original analysis. On the other hand, we observed that two dorsal neuron groups (7:DN1p, 20:DN3) in which Ddc was not originally reported to be cycling showed low $p$ values and high amplitudes in the SNOW-generated data (Figure S9B), possibly suggesting a false negative.
In summary, we showed that SNOW can aid the identification of rhythmic genes by first generating a time series for each cell and then conducting cycling detection at the single–cell level. As a result, cycling detection does not depend on the accuracy of cell type annotation. SNOW can therefore be used in combination with traditional analyses that assign cell types prior to pseudobulking for cycling detection. For example, it can increase confidence in the identification of cycling genes by confirming that they are rhythmic in the majority of individual cells; detect potential false negatives in the pseudobulk analysis (especially for rare cell types that may not be sampled at all time points); and avoid false positives by removing potential batch effects. It can also potentially identify subsets of cells of a single type (or a single cluster) that are differentially cycling, an effect that may be missed in analyses where cells of the same type are treated as replicates or pseudobulked for cycling detection.
5. Discussion
We presented SNOW (SiNgle cell flOW map), a deep learning framework for the annotation, normalization, and generation of single–cell time series scRNA-seq data. SNOW takes raw count data as input and maximizes the log likelihood of the experimental observations. The count data are modeled as following a zero-inflated negative binomial distribution, similar to previous works [25, 26]. SNOW then deconvolves the data internally into time-dependent and time-independent components by minimizing the sliced Wasserstein distance between relevant distributions. The time–independent component can be used readily for cell type annotation, and the time–dependent component can be used to generate artificial time series for individual cells. We demonstrated the utility of SNOW by applying it to multiple single cell datasets with vastly different cell numbers, sampling frequencies, and sequencing depths.
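To make the count model concrete, the sketch below evaluates the log probability of a single count under a zero-inflated negative binomial distribution and sums it over genes to obtain a per-cell log likelihood. The mean/inverse-dispersion parameterization (`mu`, `theta`), the zero-inflation probability `pi`, and the function names are assumptions chosen for illustration; they follow a common convention and are not taken from the SNOW code.

```python
import math


def nb_log_pmf(x, mu, theta):
    """Log pmf of a negative binomial with mean mu > 0 and inverse dispersion theta > 0."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def zinb_log_pmf(x, mu, theta, pi):
    """Log pmf of a zero-inflated negative binomial with dropout probability 0 <= pi < 1."""
    log_nb = nb_log_pmf(x, mu, theta)
    if x == 0:
        # A zero can come from the inflation component or from the NB itself.
        return math.log(pi + (1.0 - pi) * math.exp(log_nb))
    return math.log(1.0 - pi) + log_nb


def cell_log_likelihood(counts, mus, thetas, pis):
    """Sum of per-gene ZINB log probabilities for one cell."""
    return sum(
        zinb_log_pmf(x, mu, theta, pi)
        for x, mu, theta, pi in zip(counts, mus, thetas, pis)
    )


# Toy parameters (assumed): three genes with model-predicted means 1, 2 and 4.
mus, thetas, pis = [1.0, 2.0, 4.0], [2.0, 2.0, 2.0], [0.1, 0.1, 0.1]
typical = cell_log_likelihood([0, 2, 5], mus, thetas, pis)
aberrant = cell_log_likelihood([50, 60, 80], mus, thetas, pis)
```

In this toy example, the cell whose counts deviate strongly from the predicted means receives a much lower log likelihood, which is the kind of signal that can be used to flag samples that the model explains poorly.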
SNOW has a number of advantages. First, most methods for analyzing single cell time series data focus on developmental processes, in which the effects of time and cell type are associated. These methods largely rely on finding an optimal transport map between cells sampled at distinct time points [15, 16, 40]. While such methods are appropriate when cells are gradually transitioning from one state to another following the same general trajectories (as illustrated in Figure 1A, top), they are difficult to apply to time series from mature cells, where expression changes with time in a cell type–specific manner. SNOW addresses this issue by deconvolving the effects of time and cell type.
Second, we demonstrated that SNOW can be used to identify and eliminate batch effects. By modeling count data with a zero-inflated negative binomial distribution, we were able to use the estimated probability of observing each gene expression profile to identify samples from the clock neuron dataset that are likely to be batch-affected. In these samples, we observed an association between aberrant gene expression and a drop in log likelihood, empirically supporting the validity of our approach. As illustrated in Figure 4A, by generating time series at the single–cell level and constraining their second derivatives, the effect of a batch-affected time point can be mitigated.
Third, SNOW is capable of generating time–series data for individual cells. We demonstrated how this capability can be used to enhance the analysis of circadian signals by conducting cycling detection on individual cells. Our approach does not rely on the correct identification of clusters or the construction of pseudobulk expression profiles, and is therefore less sensitive to outlier time points and to errors in cell type identification. By looking for genes exhibiting rhythmic patterns across many cells, our approach can increase the confidence of detected cycling genes and potentially identify false positive or false negative cyclers from traditional analyses.
Several existing methods bear some similarities to SNOW, with important differences. SCVI [25], DCA [26], and many other methods [27, 41-43] are all built on variational autoencoders [17], which are typically trained by optimizing the evidence lower bound (ELBO). However, because the ELBO constrains the latent space only through the KL divergence term (eq 6), the learned latent dimensions may be correlated and may fail to satisfy the assumption that the prior distribution has an identity covariance, enlarging the gap between the ELBO and the actual log likelihood. In this situation, one would fail to generate “realistic” virtual gene expression profiles by passing samples drawn from the prior distribution through the decoder network. While having an “irregular” latent space that fails to match the prior distribution may not impact the performance of the model in other tasks such as clustering and identifying cell types, enforcing independence between the latent dimensions is known to improve model interpretability [44, 45]. To address this, scNODE [16] and SCVIS [27] introduced a scaling factor on the KL divergence term (similar to β-VAEs [44]) to enforce a stronger constraint on the latent space, thereby encouraging a more efficient representation of the data. More recently, various methods have been proposed [45-48] to directly enforce independence between latent dimensions by minimizing the total correlation of the latent variables. In SNOW, we enforced independence between latent dimensions and alignment with the prior distribution simultaneously by minimizing the sliced Wasserstein distance.
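The sliced Wasserstein distance mentioned above can be estimated by projecting two samples onto random directions and comparing the sorted projections. The following is a minimal NumPy sketch under stated assumptions: both samples have the same size, the squared 2-Wasserstein cost is used, and the function name `sliced_wasserstein` and the number of projections are illustrative choices rather than details of the SNOW implementation.

```python
import numpy as np


def sliced_wasserstein(x, y, n_projections=128, rng=None):
    """Monte Carlo estimate of the squared sliced 2-Wasserstein distance.

    x, y: arrays of shape (n_samples, n_dims) with identical shapes.
    """
    if x.shape != y.shape:
        raise ValueError("x and y must have the same shape")
    rng = np.random.default_rng(rng)
    # Random unit vectors defining the one-dimensional slices.
    dirs = rng.normal(size=(n_projections, x.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    # In 1D, optimal transport between equal-size samples matches sorted values.
    px = np.sort(x @ dirs.T, axis=0)
    py = np.sort(y @ dirs.T, axis=0)
    return float(np.mean((px - py) ** 2))


# Toy latent codes (assumed): one well-behaved sample and one whose
# dimensions are perfectly correlated, both compared to draws from N(0, I).
prior = np.random.default_rng(0).normal(size=(2000, 4))
independent = np.random.default_rng(1).normal(size=(2000, 4))
correlated = np.repeat(independent[:, :1], 4, axis=1)

d_independent = sliced_wasserstein(independent, prior, rng=0)
d_correlated = sliced_wasserstein(correlated, prior, rng=0)
```

Because a sample with correlated dimensions differs from an isotropic Gaussian along many projection directions, its distance to the prior sample is larger, so minimizing this quantity penalizes both correlation between latent dimensions and mismatch with the prior.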
As mentioned in previous sections, we developed SNOW to solve the following problem: when both cell state and time affect gene expression, removing temporal effects to facilitate cell type annotation also removes biologically meaningful gene expression dynamics. This problem is related to what MrVI [41] attempts to solve by constructing a sample-unaware representation (u) and a sample-aware representation (z), where u is used to conduct cell type annotation and z is used to model how sample-related covariates (such as a batch or a time point) affect gene expression. In some sense, SNOW and MrVI are designed to solve the same problem, except that SNOW specializes in continuous covariates (time) and MrVI in discrete covariates. Our explicit enforcement of statistical independence between the latent space and time, which is absent in both MrVI and SCVI, naturally defines cell state as a time–invariant quantity. By supplying the decoder with time and the time–independent representation of cell type, SNOW can generate data “sampled” from intermediate time points, which is not possible if time is simply treated as a batch label, as it is in MrVI. SNOW also has the additional benefit of enforcing smoothness by constraining the second derivative with respect to time, which is not possible if time is treated as a categorical variable.
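The two properties described above, evaluation at arbitrary time points and a smoothness constraint on the second time derivative, can be sketched as follows. The decoder here is a hypothetical stand-in (`toy_decoder`), and the central finite-difference penalty with step `h` is an assumed way of constraining the second derivative; neither is taken from the SNOW code.

```python
import numpy as np


def second_derivative_penalty(decode, z, t, h=0.5):
    """Mean squared central finite-difference estimate of d^2/dt^2 decode(z, t)."""
    d2 = (decode(z, t + h) - 2.0 * decode(z, t) + decode(z, t - h)) / h ** 2
    return float(np.mean(d2 ** 2))


def toy_decoder(z, t, period=24.0):
    # Hypothetical decoder: the time-invariant code z sets a baseline (z[0])
    # and an oscillation amplitude (z[1]); t may be any real-valued time.
    w = 2.0 * np.pi / period
    return z[0] + z[1] * np.cos(w * t)


z = np.array([5.0, 2.0])              # fixed, time-independent cell code
t_grid = np.linspace(0.0, 24.0, 49)   # includes times between sampled points
series = toy_decoder(z, t_grid)       # one generated time series for this cell

smooth = second_derivative_penalty(toy_decoder, z, t_grid)
wiggly = second_derivative_penalty(
    lambda z, t: toy_decoder(z, t, period=6.0), z, t_grid
)
```

A penalty of this form is only defined when time enters the decoder as a continuous input: the slowly varying series receives a small penalty, while a rapidly fluctuating one is penalized more heavily.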
We made a few assumptions during the construction of our framework. First, we built the latent representation of each cell as a time–invariant object. For mature cells, we observed that this time–invariant object corresponds to cell type. Biologically speaking, this assumption holds as long as cells of the same type consistently express reliably detectable type–specific marker genes. In developmental systems, this time–invariant object should in turn capture the lineage of each cell if lineage–specific markers are being expressed. However, when gene expression undergoes substantial changes and no lineage– or cell type–specific markers are present, our first assumption will be violated. Second, we assumed that gene expression is predominantly affected by two components, namely cell type (or lineage) and time. This assumption implies that our framework is not applicable to developmental systems where bifurcations are present. For example, if a stem cell population has differentiated into three distinct cell types at a given time point, all expressing the same lineage–specific markers, the decoder cannot generate three distinct sets of gene expression profiles when the input lineage (stem cell) and time are fixed. When the gene expression of each cell changes along the same non-bifurcating trajectory but at different speeds, our second assumption requires that the input time for the model be replaced by the estimated pseudotime of each cell in order to correctly identify lineage. As SNOW is thus only applicable to a small subset of developmental processes, we recommend using SNOW to analyze mature systems.
Nonetheless, we expect our work to be of interest to those studying dynamic processes in complex tissues. Additional features can be easily added into our method to handle more complex datasets, and approaches employed in our work, such as data integration or the enforcement of statistical independence, can also be extracted and adopted for other analyses.
Supplementary Material
Acknowledgements
This work was supported by NSF grant DMS-1764421, Simons Foundation grant 597491, and NIH grant R01AG068579.
Footnotes
Code and Data Availability
Code for our analysis is available on Bitbucket (https://bitbucket.org/biocomplexity/snow/src/main/).
References
- [1] Peidli Stefan, Green Tessa D., Shen Ciyue, Gross Torsten, Min Joseph, Garda Samuele, Yuan Bo, Schumacher Linus J., Taylor-King Jake P., Marks Debora S., Luna Augustin, Blüthgen Nils, and Sander Chris. scPerturb: harmonized single-cell perturbation data. Nature Methods, January 2024.
- [2] Aissa Alexandre F., Islam Abul B. M. M. K., Ariss Majd M., Go Cammille C., Rader Alexandra E., Conrardy Ryan D., Gajda Alexa M., Rubio-Perez Carlota, Valyi-Nagy Klara, Pasquinelli Mary, Feldman Lawrence E., Green Stefan J., Lopez-Bigas Nuria, Frolov Maxim V., and Benevolenskaya Elizaveta V. Single-cell transcriptional changes associated with drug tolerance and response to combination therapies in cancer. Nature Communications, 12(1):1628, March 2021.
- [3] Chang Matthew T., Shanahan Frances, Nguyen Thi Thu Thao, Staben Steven T., Gazzard Lewis, Yamazoe Sayumi, Wertz Ingrid E., Piskol Robert, Yang Yeqing Angela, Modrusan Zora, Haley Benjamin, Evangelista Marie, Malek Shiva, Foster Scott A., and Ye Xin. Identifying transcriptional programs underlying cancer drug response with TraCe-seq. Nature Biotechnology, 40(1):86–93, January 2022.
- [4] Dixit Atray, Parnas Oren, Li Biyu, Chen Jenny, Fulco Charles P., Jerby-Arnon Livnat, Marjanovic Nemanja D., Dionne Danielle, Burks Tyler, Raychowdhury Raktima, Adamson Britt, Norman Thomas M., Lander Eric S., Weissman Jonathan S., Friedman Nir, and Regev Aviv. Perturb-Seq: Dissecting Molecular Circuits with Scalable Single-Cell RNA Profiling of Pooled Genetic Screens. Cell, 167(7):1853–1866.e17, December 2016.
- [5] Adamson Britt, Norman Thomas M., Jost Marco, Cho Min Y., Nuñez James K., Chen Yuwen, Villalta Jacqueline E., Gilbert Luke A., Horlbeck Max A., Hein Marco Y., Pak Ryan A., Gray Andrew N., Gross Carol A., Dixit Atray, Parnas Oren, Regev Aviv, and Weissman Jonathan S. A Multiplexed Single-Cell CRISPR Screening Platform Enables Systematic Dissection of the Unfolded Protein Response. Cell, 167(7):1867–1882.e21, December 2016.
- [6] Schiebinger Geoffrey, Shu Jian, Tabaka Marcin, Cleary Brian, Subramanian Vidya, Solomon Aryeh, Gould Joshua, Liu Siyan, Lin Stacie, Berube Peter, Lee Lia, Chen Jenny, Brumbaugh Justin, Rigollet Philippe, Hochedlinger Konrad, Jaenisch Rudolf, Regev Aviv, and Lander Eric S. Optimal-Transport Analysis of Single-Cell Gene Expression Identifies Developmental Trajectories in Reprogramming. Cell, 176(4):928–943.e22, February 2019.
- [7] Bastidas-Ponce Aimée, Tritschler Sophie, Dony Leander, Scheibner Katharina, Tarquis-Medina Marta, Salinno Ciro, Schirge Silvia, Burtscher Ingo, Böttcher Anika, Theis Fabian, Lickert Heiko, and Bakhti Mostafa. Massive single-cell mRNA profiling reveals a detailed roadmap for pancreatic endocrinogenesis. Development, page dev.173849, January 2019.
- [8] Patel Anoop P., Tirosh Itay, Trombetta John J., Shalek Alex K., Gillespie Shawn M., Wakimoto Hiroaki, Cahill Daniel P., Nahed Brian V., Curry William T., Martuza Robert L., Louis David N., Rozenblatt-Rosen Orit, Suvà Mario L., Regev Aviv, and Bernstein Bradley E. Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science, 344(6190):1396–1401, June 2014.
- [9] Tirosh Itay, Izar Benjamin, Prakadan Sanjay M., Wadsworth Marc H., Treacy Daniel, Trombetta John J., Rotem Asaf, Rodman Christopher, Lian Christine, Murphy George, Fallahi-Sichani Mohammad, Dutton-Regester Ken, Lin Jia-Ren, Cohen Ofir, Shah Parin, Lu Diana, Genshaft Alex S., Hughes Travis K., Ziegler Carly G. K., Kazer Samuel W., Gaillard Aleth, Kolb Kellie E., Villani Alexandra-Chloé, Johannessen Cory M., Andreev Aleksandr Y., Van Allen Eliezer M., Bertagnolli Monica, Sorger Peter K., Sullivan Ryan J., Flaherty Keith T., Frederick Dennie T., Jané-Valbuena Judit, Yoon Charles H., Rozenblatt-Rosen Orit, Shalek Alex K., Regev Aviv, and Garraway Levi A. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science, 352(6282):189–196, April 2016.
- [10] Ma Dingbang, Przybylski Dariusz, Abruzzi Katharine C, Schlichting Matthias, Li Qunlong, Long Xi, and Rosbash Michael. A transcriptomic taxonomy of Drosophila circadian neurons around the clock. eLife, 10:e63056, January 2021.
- [11] Wen Shao’ang, Ma Danyi, Zhao Meng, Xie Lucheng, Wu Qingqin, Gou Lingfeng, Zhu Chuanzhen, Fan Yuqi, Wang Haifang, and Yan Jun. Spatiotemporal single-cell analysis of gene expression in the mouse suprachiasmatic nucleus. Nature Neuroscience, 23(3):456–467, March 2020.
- [12] Calderon Diego, Blecher-Gonen Ronnie, Huang Xingfan, Secchia Stefano, Kentro James, Daza Riza M., Martin Beth, Dulja Alessandro, Schaub Christoph, Trapnell Cole, Larschan Erica, O’Connor-Giles Kate M., Furlong Eileen E. M., and Shendure Jay. The continuum of Drosophila embryonic development at single-cell resolution. Science, 377(6606):eabn5800, August 2022.
- [13] Di Bella Daniela J., Habibi Ehsan, Stickels Robert R., Scalia Gabriele, Brown Juliana, Yadollahpour Payman, Yang Sung Min, Abbate Catherine, Biancalani Tommaso, Macosko Evan Z., Chen Fei, Regev Aviv, and Arlotta Paola. Molecular logic of cellular diversification in the mouse cerebral cortex. Nature, 595(7868):554–559, July 2021.
- [14] Farrell Jeffrey A., Wang Yiqun, Riesenfeld Samantha J., Shekhar Karthik, Regev Aviv, and Schier Alexander F. Single-cell reconstruction of developmental trajectories during zebrafish embryogenesis. Science, 360(6392):eaar3131, June 2018.
- [15] Yeo Grace Hui Ting, Saksena Sachit D., and Gifford David K. Generative modeling of single-cell time series with PRESCIENT enables prediction of cell trajectories with interventions. Nature Communications, 12(1):3222, May 2021.
- [16] Zhang Jiaqi, Larschan Erica, Bigness Jeremy, and Singh Ritambhara. scNODE: Generative Model for Temporal Single Cell Transcriptomic Data Prediction. preprint, Bioinformatics, November 2023.
- [17] Kingma Diederik P and Welling Max. Auto-Encoding Variational Bayes. arXiv preprint, version 11, 2013.
- [18] Butler Andrew, Hoffman Paul, Smibert Peter, Papalexi Efthymia, and Satija Rahul. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nature Biotechnology, 36(5):411–420, May 2018.
- [19] Korsunsky Ilya, Millard Nghia, Fan Jean, Slowikowski Kamil, Zhang Fan, Wei Kevin, Baglaenko Yuriy, Brenner Michael, Loh Po-ru, and Raychaudhuri Soumya. Fast, sensitive and accurate integration of single-cell data with Harmony. Nature Methods, 16(12):1289–1296, December 2019.
- [20] Yu Xiaokang, Xu Xinyi, Zhang Jingxiao, and Li Xiangjie. Batch alignment of single-cell transcriptomics data using deep metric learning. Nature Communications, 14(1):960, February 2023.
- [21] Qin Lu, Zhang Guangya, Zhang Shaoqiang, and Chen Yong. Deep Batch Integration and Denoise of Single-Cell RNA-Seq Data. Advanced Science, 11(29):2308934, August 2024.
- [22] Huang Yitong and Braun Rosemary. Platform-independent estimation of human physiological time from single blood samples. Proceedings of the National Academy of Sciences, 121(3):e2308114120, January 2024.
- [23] Braun Rosemary, Kath William L., Iwanaszko Marta, Kula-Eversole Elzbieta, Abbott Sabra M., Reid Kathryn J., Zee Phyllis C., and Allada Ravi. Universal method for robust detection of circadian state from gene expression. Proceedings of the National Academy of Sciences, 115(39), September 2018.
- [24] Anafi Ron C., Francey Lauren J., Hogenesch John B., and Kim Junhyong. CYCLOPS reveals human transcriptional rhythms in health and disease. Proceedings of the National Academy of Sciences, 114(20):5312–5317, May 2017.
- [25] Lopez Romain, Regier Jeffrey, Cole Michael B., Jordan Michael I., and Yosef Nir. Deep generative modeling for single-cell transcriptomics. Nature Methods, 15(12):1053–1058, December 2018.
- [26] Eraslan Gökcen, Simon Lukas M., Mircea Maria, Mueller Nikola S., and Theis Fabian J. Single-cell RNA-seq denoising using a deep count autoencoder. Nature Communications, 10(1):390, January 2019.
- [27] Ding Jiarui, Condon Anne, and Shah Sohrab P. Interpretable dimensionality reduction of single cell transcriptome data with deep generative models. Nature Communications, 9(1):2002, May 2018.
- [28] Deshpande Ishan, Zhang Ziyu, and Schwing Alexander. Generative Modeling using the Sliced Wasserstein Distance. arXiv preprint, version 1, 2018.
- [29] Kingma Diederik P. and Ba Jimmy. Adam: A Method for Stochastic Optimization, January 2017. arXiv:1412.6980 [cs].
- [30] Paszke Adam, Gross Sam, Massa Francisco, Lerer Adam, Bradbury James, Chanan Gregory, Killeen Trevor, Lin Zeming, Gimelshein Natalia, Antiga Luca, Desmaison Alban, Köpf Andreas, Yang Edward, DeVito Zach, Raison Martin, Tejani Alykhan, Chilamkurthy Sasank, Steiner Benoit, Fang Lu, Bai Junjie, and Chintala Soumith. PyTorch: An Imperative Style, High-Performance Deep Learning Library, December 2019. arXiv:1912.01703 [cs, stat].
- [31] Auerbach Benjamin J., FitzGerald Garret A., and Li Mingyao. Tempo: an unsupervised Bayesian algorithm for circadian phase inference in single-cell transcriptomics. Nature Communications, 13(1):6580, November 2022.
- [32] Strunz Maximilian, Simon Lukas M., Ansari Meshal, Kathiriya Jaymin J., Angelidis Ilias, Mayr Christoph H., Tsidiridis George, Lange Marius, Mattner Laura F., Yee Min, Ogar Paulina, Sengupta Arunima, Kukhtevich Igor, Schneider Robert, Zhao Zhongming, Voss Carola, Stoeger Tobias, Neumann Jens H. L., Hilgendorff Anne, Behr Jürgen, O’Reilly Michael, Lehmann Mareike, Burgstaller Gerald, Königshoff Melanie, Chapman Harold A., Theis Fabian J., and Schiller Herbert B. Alveolar regeneration through a Krt8+ transitional stem cell state that persists in human lung fibrosis. Nature Communications, 11(1):3559, July 2020.
- [33] McInnes Leland, Healy John, and Melville James. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint, version 3, 2018.
- [34] Louizos Christos, Swersky Kevin, Li Yujia, Welling Max, and Zemel Richard. The Variational Fair Autoencoder. arXiv preprint, version 6, 2015.
- [35] Stuart Tim, Butler Andrew, Hoffman Paul, Hafemeister Christoph, Papalexi Efthymia, Mauck William M., Hao Yuhan, Stoeckius Marlon, Smibert Peter, and Satija Rahul. Comprehensive Integration of Single-Cell Data. Cell, 177(7):1888–1902.e21, June 2019.
- [36] Zhang Ray, Lahens Nicholas F., Ballance Heather I., Hughes Michael E., and Hogenesch John B. A circadian gene expression atlas in mammals: Implications for biology and medicine. Proceedings of the National Academy of Sciences, 111(45):16219–16224, November 2014.
- [37] Thaben Paul F. and Westermark Pål O. Detecting Rhythms in Time Series with RAIN. Journal of Biological Rhythms, 29(6):391–400, December 2014.
- [38] Hughes Michael E., Hogenesch John B., and Kornacker Karl. JTK_cycle: An Efficient Nonparametric Algorithm for Detecting Rhythmic Components in Genome-Scale Data Sets. Journal of Biological Rhythms, 25(5):372–380, October 2010.
- [39] Xu Bingxian and Braun Rosemary. Detecting Rhythmic Gene Expression in Single Cell Transcriptomics. preprint, Bioinformatics, December 2023.
- [40] Tong Alexander, Huang Jessie, Wolf Guy, van Dijk David, and Krishnaswamy Smita. TrajectoryNet: A Dynamic Optimal Transport Network for Modeling Cellular Dynamics. arXiv preprint, version 2, 2020.
- [41] Boyeau Pierre, Hong Justin, Gayoso Adam, Kim Martin, McFaline-Figueroa José L., Jordan Michael I., Azizi Elham, Ergen Can, and Yosef Nir. Deep generative modeling of sample-level heterogeneity in single-cell genomics, October 2022.
- [42] Gao Haoxiang, Hua Kui, Wu Xinze, Wei Lei, Chen Sijie, Yin Qijin, Jiang Rui, and Zhang Xuegong. Building a learnable universal coordinate system for single-cell atlas with a joint-VAE model. Communications Biology, 7(1):977, August 2024.
- [43] Cao Kai, Gong Qiyu, Hong Yiguang, and Wan Lin. A unified computational framework for single-cell data integration with optimal transport. Nature Communications, 13(1):7419, December 2022.
- [44] Higgins Irina, Matthey Loïc, Pal Arka, Burgess Christopher P., Glorot Xavier, Botvinick Matthew M., Mohamed Shakir, and Lerchner Alexander. beta-vae: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2016.
- [45] Kim Hyunjik and Mnih Andriy. Disentangling by Factorising, July 2019. arXiv:1802.05983 [cs, stat].
- [46] Gao Shuyang, Brekelmans Rob, Steeg Greg Ver, and Galstyan Aram. Auto-Encoding Total Correlation Explanation, February 2018. arXiv:1802.05822 [cs, stat].
- [47] Chen Ricky T. Q., Li Xuechen, Grosse Roger, and Duvenaud David. Isolating Sources of Disentanglement in Variational Autoencoders, April 2019. arXiv:1802.04942 [cs, stat].
- [48] Esmaeili Babak, Wu Hao, Jain Sarthak, Bozkurt Alican, Siddharth N., Paige Brooks, Brooks Dana H., Dy Jennifer, and van de Meent Jan-Willem. Structured Disentangled Representations, December 2018. arXiv:1804.02086 [cs, stat].