Briefings in Bioinformatics. 2025 Jun 27;26(3):bbaf287. doi: 10.1093/bib/bbaf287

Evaluating discrepancies in dimensionality reduction for time-series single-cell RNA-sequencing data

Maren Hackenberg, Laia Canal Guitart, Rolf Backofen, Harald Binder
PMCID: PMC12203088  PMID: 40576031

Abstract

There are various dimensionality reduction techniques for visually inspecting dynamical patterns in time-series single-cell RNA-sequencing (scRNA-seq) data. However, the lack of one-to-one correspondence between cells across time points makes it difficult to uniquely uncover temporal structure in a low-dimensional manifold. The use of different techniques may thus lead to discrepancies in the representation of dynamical patterns, yet the extent of these discrepancies remains unclear. To investigate this, we propose an approach for reasoning about such discrepancies based on synthetic time-series scRNA-seq data generated by variational autoencoders. The synthetic dynamical patterns induced in a low-dimensional manifold reflect biologically plausible temporal patterns, such as dividing cell clusters during a differentiation process. We consider manifolds from different dimensionality reduction techniques, such as principal component analysis, t-distributed stochastic neighbor embedding, uniform manifold approximation and projection, and single-cell variational inference. We illustrate how the proposed approach allows for reasoning about the extent to which low-dimensional manifolds, obtained from different techniques, can capture different dynamical patterns. None of these techniques was found to be consistently superior, and the results indicate that they may not reliably represent dynamics when used in isolation, underscoring the need to compare multiple perspectives. Thus, the proposed synthetic dynamical pattern approach provides a foundation for guiding future methods development to detect complex patterns in time-series scRNA-seq data.

Keywords: single-cell RNA-sequencing, time-series data, dimensionality reduction, evaluation, synthetic data, deep learning

Introduction

Dimensionality reduction is an essential component of any exploratory single-cell RNA sequencing (scRNA-seq) analysis workflow to reveal the underlying patterns of cell heterogeneity, and many computational approaches have been proposed [1, 2]. Among the most popular techniques are t-distributed stochastic neighbor embedding (t-SNE) [3] and uniform manifold approximation and projection (UMAP) [4, 5]. They come in many recently developed variants, such as parametric versions [6–8], versions with explainability components [9] or versions for longitudinal structure [10]. Other commonly used approaches include linear techniques based on principal component analysis (PCA) [11].

Despite their widespread use, these techniques were primarily designed for visualization and clustering rather than capturing biological dynamics. Their ability to meaningfully represent temporal processes remains an open question, as they may distort trajectories depending on the dataset and underlying assumptions of each method.

While PCA, t-SNE, and UMAP are often used for visualization in two or three dimensions, there are also deep learning–based methods that support further downstream tasks beyond visualization, such as batch correction, noise reduction, imputation or synthetic data generation [12–14]. The latter often use an autoencoder architecture, where a low-dimensional representation is learned in an unsupervised way via artificial neural networks, e.g. in the single-cell variational inference (scVI) framework [12, 15] or related approaches [16, 17].

These dimensionality reduction techniques are also used for more complex settings, in particular with experimental designs that comprise multiple time points to provide insights into dynamic biological processes such as differentiation, proliferation, or response to stimuli [e.g. 18–20]. Since the sequencing protocol is usually destructive, there is no one-to-one correspondence between cells at different time points, making it difficult to identify cellular trajectories [21]. Although there are approaches to computationally trace cell populations over time [e.g. 22–25], these often tend to be rather complex and specific to the scenario for which they have been developed. In practice, researchers often resort to simultaneous dimensionality reduction of the data from all time points using PCA, t-SNE, or UMAP, with subsequent visualization and comparison between time points [26–29].

While these methods are known to effectively cluster cells based on similarities in their gene expression profiles [30], they are not explicitly designed to capture biological dynamics [31], i.e. they are used under the implicit assumption that they capture the underlying lower-dimensional manifold in which the dynamics occur [32]. However, this assumption is not explicitly enforced, such that different approaches may yield conflicting representations of the same biological process which do not capture the underlying structure. Thus, researchers cannot reliably use the spatial arrangement of clusters seen in visualizations across time points to infer how cells transition between states because continuous relationships between cells in high-dimensional gene expression space might be distorted or lost [33, 34]. Indeed, an increasing number of results suggest that such representations might be misleading [34–37]. However, the extent of such potential problems has not been investigated systematically so far. We therefore propose an approach for reasoning about discrepancies in dimensionality reduction representations of temporal patterns.

For evaluating these discrepancies in dimensionality reduction of time-series scRNA-seq data, a key challenge is that no gold standard exists for determining which technique provides the most biologically meaningful representation. In general, dimensionality reduction techniques for scRNA-seq are typically evaluated using real or simulated data. For settings without temporal structure, current benchmarks focus on clustering performance with real data, such as for cell type discovery, where a ground truth is known (e.g. [38–40]). However, such ground truth is usually unavailable for time-series data, making evaluation more challenging. While there are some approaches for simulating scRNA-seq data with temporal structure, such as the splatter R package [41], these typically do not allow for introducing more complex developmental patterns. In particular, there is no consideration of the original manifold in which the patterns occur, making it difficult to assess whether a dimensionality reduction approach has identified an appropriate representation. Therefore, we propose a novel approach in which we extract manifolds from real data and directly induce temporal patterns in these.

Specifically, we consider deep generative approaches for obtaining manifolds and generating synthetic data, which are a promising tool for obtaining realistic synthetic single-cell RNA-seq data [14, 42]. In particular, we suggest using a variational autoencoder (VAE) [43] for generating synthetic time-series data based on a snapshot scRNA-seq dataset. We propose an approach for representing low-dimensional manifolds from different techniques via VAEs, specifically from PCA, t-SNE, UMAP, and scVI.

In these manifolds, we use vector fields that describe different plausible temporal patterns, such as dividing cell clusters during a differentiation process. Applying these vector fields to each manifold, we can simulate a dynamic process directly within the low-dimensional space, reflecting biologically meaningful cellular transitions. The VAE can then map back from the transformed manifold to gene expression space to create benchmark datasets with different known underlying structures (i.e. different temporal patterns introduced in different underlying manifolds).

We use the proposed approach to investigate to what extent low-dimensional manifolds, obtained from different techniques, can capture different dynamical patterns.

As a starting point, we describe the typical strategy currently used for dimensionality reduction and visualization of time-series scRNA-seq data before giving a brief overview of specific dimensionality reduction techniques, namely PCA, t-SNE, UMAP, and scVI. Subsequently, we comparatively evaluate them with different dynamic patterns induced by our proposed synthetic data approach. We discuss their discrepancies, highlighting the need for a comparative evaluation, and consequences for data analysis strategy and future methods development.

We provide an implementation of our approach, including tutorial notebooks on GitHub at https://github.com/laia-cg/scManifoldDynamics.

Materials and methods

Illustration of a typical strategy for dimensionality reduction on a time-series scRNA-seq dataset

When visually analyzing temporal patterns in time-series scRNA-seq data using dimensionality reduction, the different techniques are typically applied to the joint dataset of all cells from all time points. This embedding projects the entire dataset into a common low-dimensional space. For visualization, the embedded representation is then split by time point, i.e. for each time point, a separate plot is created, showing only the cells captured at that time point in the joint representation. This allows for observing the distribution of cells at each time point separately and for comparing the locations of cell type clusters across time points to trace developmental patterns visually.
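The following sketch illustrates this joint-embedding-then-split workflow on toy data, using a simple PCA via singular value decomposition as the joint embedding; the matrix `counts` and the `timepoint` labels are hypothetical placeholders for a real dataset, not part of our implementation.

```julia
using LinearAlgebra, Statistics

# Toy stand-ins for a real time-series dataset (genes × cells plus time labels)
counts = rand(Float32, 2000, 500)
timepoint = rand(1:4, 500)

# Joint 2D embedding of all cells from all time points (here: PCA via SVD)
X = counts .- mean(counts, dims = 2)          # center each gene
U, S, V = svd(X)
embedding = (U[:, 1:2]' * X)'                 # cells × 2 joint coordinates

# Split the shared embedding by time point; each subset would be plotted separately
for t in sort(unique(timepoint))
    cells_t = findall(==(t), timepoint)
    coords_t = embedding[cells_t, :]          # coordinates of cells captured at time t
end
```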

Figure 1 illustrates this procedure, applying PCA (Panel A), t-SNE (Panel B), UMAP (Panel C), and scVI (Panel D) in turn on a time-series dataset of human embryonic stem cells grown as embryoid bodies from four time points spanning 27 days during differentiation [32]. In this example, while PCA does not resolve most of the local cluster structure in the data, some structures are captured to some extent by all of the other three methods, e.g. a transition from the cells indicated in yellow via the subgroup indicated in brown, and a subgroup indicated in grey to the cells shown in blue. However, the general visual impression of the emerging temporal pattern is rather distinct, indicating that the different methods arrive at rather different representations of the underlying biological development process. From the plot alone and in the absence of ground-truth information, it is not clear, however, whether these discrepancies are merely due to different properties of each algorithm and optimization procedure, e.g. UMAP generally produces sharper clusters than scVI, where the latent representation is modeled as a Gaussian distribution, or whether there are qualitative differences in the perspective of different approaches on the underlying dynamics.

Figure 1. The different rows correspond to different dimensionality reduction techniques applied to the dataset: the first row shows PCA, the second t-SNE, the third UMAP, and the last row scVI. The first column displays the entire dataset, grouped by time point. The subsequent columns represent only the cells from four selected time points, grouped by Louvain clusters. Clustering follows the original authors’ specifications. For each method, we computed a joint dimensionality reduction using cells from all time points and split the representation by time point for visualization.

Dimensionality reduction techniques as manifold learning tools to reveal cellular dynamics

Using dimensionality reduction to visualize temporal structure in scRNA-seq data assumes that the underlying cellular state space is a lower-dimensional manifold, where cells transition smoothly between states. Waddington’s landscape metaphor of cellular development [44], though limited given today’s knowledge on cellular differentiation [45, 46], illustrates this idea. Differentiation is conceptualized as the process of a marble (corresponding to a cell) rolling down a landscape of hills and valleys (the cellular state space manifold), determining the cell’s fate. Time-series scRNA-seq data corresponds to snapshots of this process at different stages.

Any dimensionality reduction approach generates a possible version of this underlying, intrinsically unobserved, cellular state space. When visualizing this manifold, similar cells typically cluster together [31]. In stable, fully differentiated cell populations, using dimensionality reduction for visualization may be sufficient to understand their organization. However, for dynamic processes, dimensionality reduction techniques should not only place similar cells close to each other but also reflect developmental trajectories.

We focus on four commonly used dimensionality reduction techniques, covering a range of complexity and different prototypical mechanisms [2]. Specifically, we consider PCA as a deterministic matrix factorization-based linear method, nonlinear and probabilistic neighbor-based algorithmic approaches like t-SNE and UMAP, and scVI as a nonlinear, probabilistic method based on neural networks. These different characteristics of the approaches are reflected in their underlying objective when optimizing an embedding—finding the rotation and projection that capture maximum variance, embedding a graph or a distribution over nearest neighbors, or minimizing a reconstruction loss (Fig. 2A). All of them operate in an unsupervised manner and optimize data-intrinsic properties, without allowing structural information, e.g. on temporal dynamics, to be incorporated in their original (and widely used) versions.

Figure 2. (A) Different popular dimensionality reduction approaches for scRNA-seq data. (B) We train a (modified) VAE on a snapshot scRNA-seq dataset to obtain (1) a two-dimensional representation, which can optionally mimic a t-SNE, UMAP, or PCA manifold via supervised training, and (2) a matching scVI decoder, which can generate single-cell data based on the respective manifold. (C) Next, we introduce artificial dynamic patterns in the obtained manifold by applying vector fields and use the trained VAE to generate corresponding high-dimensional data for each synthetic time point. (D) We finally concatenate the datasets to create a synthetic time-series dataset, apply different dimensionality reduction techniques and compare.

Principal component analysis

PCA is a linear technique for dimensionality reduction based on matrix factorization by singular value decomposition (SVD) [47]. Specifically, SVD provides a rotation of the original data matrix such that the directions of greatest variation in the data align with the new coordinate axes, called the principal components (PCs). By projecting onto the first two PCs, a dimensionality reduction in 2D is obtained, which retains as much of the original variation in the data as possible.
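As a minimal illustration of these mechanics, the following sketch computes the PCs via SVD of a centered toy matrix, the fraction of variance captured by each component, and the projection onto the first two PCs.

```julia
using LinearAlgebra, Statistics

X = rand(Float32, 2000, 300)                  # toy genes × cells matrix
Xc = X .- mean(X, dims = 2)                   # center each gene
U, S, V = svd(Xc)                             # left singular vectors = principal axes
explained = S .^ 2 ./ sum(S .^ 2)             # fraction of variance per principal component
pc_scores = U[:, 1:2]' * Xc                   # 2 × cells projection onto PC1 and PC2
```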

PCA is a linear, global approach which does not capture fine-grained local structures or nonlinearities [48]. Further, it may be sensitive to outliers and the scaling of the data [49, 50].

t-distributed stochastic neighbor embedding

To capture more complex, nonlinear patterns in the data, t-distributed stochastic neighbor embedding (t-SNE) [3] has emerged as a popular tool for single-cell transcriptomics [1, 9]. The approach builds a probability distribution over pairwise distances of cells in the high-dimensional gene expression space, such that for each cell, other cells with similar transcriptomic profiles are assigned a higher probability. t-SNE then embeds points into a two-dimensional space by defining a corresponding distribution over the low-dimensional distances and minimizing the Kullback–Leibler divergence between the two distributions via gradient descent. The resulting embedding preserves local neighbor relations between similar cells, i.e. relative distances. t-SNE is thus designed to focus on local rather than global structure of the data. While it provides finely resolved cell clusters, the global distances between clusters cannot be meaningfully interpreted [51]. The approach can be sensitive to the choice of hyperparameters and the initialization of the algorithm [1]. It is computationally rather intensive and thus does not scale well to large datasets. t-SNE is an inherently nonparametric technique. As the embedding is optimized for each cell in relation to its neighbors, for projecting new cells onto a given embedding, the embedding has to be re-computed based on the new neighbor relations. To circumvent this, parametric alternatives employ neural networks to learn a functional mapping that approximates the t-SNE embedding [7, 52, 53].
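The following simplified sketch illustrates the t-SNE objective only: Gaussian affinities in the high-dimensional space (with a single fixed bandwidth instead of the per-cell perplexity calibration used in practice), Student-t affinities in the low-dimensional space, and the Kullback–Leibler divergence that the algorithm minimizes over the embedding coordinates.

```julia
using LinearAlgebra

# Simplified t-SNE objective: KL(P || Q) between high-dimensional Gaussian
# affinities P and low-dimensional Student-t affinities Q (fixed bandwidth σ).
function tsne_objective(X, Y; σ = 1.0)
    n = size(X, 2)                                            # X: genes × cells, Y: 2 × cells
    D2x = [sum(abs2, X[:, i] .- X[:, j]) for i in 1:n, j in 1:n]
    P = exp.(-D2x ./ (2σ^2)); P[diagind(P)] .= 0; P ./= sum(P)
    D2y = [sum(abs2, Y[:, i] .- Y[:, j]) for i in 1:n, j in 1:n]
    Q = 1 ./ (1 .+ D2y);      Q[diagind(Q)] .= 0; Q ./= sum(Q)
    return sum(p * log(p / q) for (p, q) in zip(P, Q) if p > 0)
end

X, Y = rand(100, 50), randn(2, 50)                            # toy data and toy embedding
@show tsne_objective(X, Y)                                    # t-SNE minimizes this over Y
```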

Uniform manifold approximation and projection

Similar to t-SNE, the UMAP algorithm is a nonlinear technique for dimensionality reduction that embeds points in a low-dimensional space, preserving local neighborhood information. Specifically, it constructs a k-nearest-neighbor (kNN)-based graph that is subsequently embedded into a lower-dimensional space. UMAP is theoretically grounded in Riemannian geometry, as it is based on the assumption that the data is uniformly distributed on a locally connected Riemannian manifold and that the Riemannian metric is (approximately) locally constant [4]. While producing visually similar representations to t-SNE, its main advantage over t-SNE is that it can better capture global relations to some extent [4, 54]. This is achieved by constructing fuzzy graphs that can also capture long-distance relationships beyond immediate local neighbor information and thus better approximate the overall topology. Further, UMAP is more scalable due to a more efficient optimization algorithm. However, similar to t-SNE, it is sensitive to hyperparameters, and evidence suggests that it effectively binarizes similarity information [36, 51]. Like t-SNE, it is nonparametric and requires re-optimization to embed new data. A parametric version using neural networks has been proposed [8].
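The following is a highly simplified sketch of only the first stage of UMAP, the construction of a fuzzy kNN graph with exponentially decaying membership strengths; the per-point bandwidth calibration and the subsequent optimization of a 2D layout of this graph are omitted.

```julia
using LinearAlgebra

# Fuzzy kNN graph: directed membership strengths decaying with distance beyond
# the nearest neighbor, symmetrized by a fuzzy set union.
function fuzzy_knn_graph(X, k)
    n = size(X, 2)
    W = zeros(n, n)
    for i in 1:n
        d = [norm(X[:, i] .- X[:, j]) for j in 1:n]
        d[i] = Inf
        nbrs = partialsortperm(d, 1:k)          # indices of the k nearest neighbors
        ρ = d[nbrs[1]]                           # distance to the closest neighbor
        for j in nbrs
            W[i, j] = exp(-(d[j] - ρ))           # directed membership strength
        end
    end
    return W .+ W' .- W .* W'                    # fuzzy union to symmetrize the graph
end

G = fuzzy_knn_graph(rand(50, 200), 10)           # toy genes × cells input, k = 10
```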

Single-cell variational inference

An autoencoder consists of an encoder and decoder neural network that map data to a lower-dimensional latent space and back (Fig. 2A). The neural network parameters are optimized by minimizing the reconstruction error between the original input and the reconstructed output, such that the model learns to compress the main characteristics of the data in the lower-dimensional space. This approach has been successfully employed for dimensionality reduction of scRNA-seq data [17]. The VAE extends the autoencoder to a probabilistic version [43, 55] by modeling the latent space as a random variable, using variational inference [56] to approximate the conditional probability distributions of the latent variable given the data and vice versa in the encoder and decoder. Training maximizes a lower bound on the true data likelihood called the evidence lower bound (ELBO), such that the model learns to approximate the underlying high-dimensional data distributions, enabling subsequent synthetic data generation. For scRNA-seq, the scVI model adapts VAEs to negative binomial distributions while accounting for library size and batch effects [12, 57–59]. Synthetic scRNA-seq data from such models can be leveraged, e.g. for investigating dominant patterns in scRNA-seq data [60] or experimental planning [14, 42]. The low-dimensional representation can be used for downstream tasks such as classification or clustering [12, 15]. While such analyses are typically based on a richer representation with more latent dimensions, here we focus on the ability of scVI to visually capture temporal patterns in comparison to PCA, t-SNE, and UMAP and therefore use a two-dimensional latent space.
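To make the ELBO concrete, the following is a minimal Gaussian VAE sketch in Flux with a two-dimensional latent space; the actual scVI model instead uses a negative binomial likelihood and accounts for library size and batch effects, which is omitted here for brevity.

```julia
using Flux

n_genes, n_hidden, n_latent = 2000, 128, 2
encoder = Chain(Dense(n_genes => n_hidden, relu), Dense(n_hidden => 2n_latent))
decoder = Chain(Dense(n_latent => n_hidden, relu), Dense(n_hidden => n_genes))

# Evidence lower bound (ELBO): reconstruction term minus the KL divergence between
# the approximate posterior N(μ, σ²) and the standard normal prior.
function elbo(x)
    h = encoder(x)
    μ, logσ = h[1:n_latent, :], h[n_latent+1:end, :]
    z = μ .+ exp.(logσ) .* randn(Float32, size(μ)...)   # reparameterization trick
    recon = -sum(abs2, x .- decoder(z))                  # Gaussian reconstruction term
    kl = 0.5f0 * sum(exp.(2 .* logσ) .+ μ .^ 2 .- 1 .- 2 .* logσ)
    return (recon - kl) / size(x, 2)                     # per-cell ELBO, maximized in training
end

x = rand(Float32, n_genes, 64)                           # toy mini-batch of cells
@show elbo(x)
```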

Generating synthetic time-series scRNA-seq data, using different manifolds for introducing temporal structure

Our approach for reasoning about dynamical patterns in different manifolds is based on synthetic time-series scRNA-seq data, to enable access to ground-truth information. For synthetic data generation, we use a VAE approach based on the scVI framework [12, 15]. We train a VAE on a real snapshot scRNA-seq dataset, i.e. without temporal structure, and subsequently introduce a synthetic temporal pattern via vector field dynamics, as explained below. To explore dynamics in different manifolds, we adapt scVI by introducing a supervised encoder, such that the latent representation is trained to align with a given t-SNE, UMAP, or PCA manifold (Fig. 2B). We chose this VAE-based approach, because t-SNE and UMAP are nonparametric, i.e. do not represent the embedding into low-dimensional space as an explicit function and cannot generate new data points, whereas the VAE is a parametric approach that is inherently generative, i.e. allows for creating synthetic data.

To train the adapted scVI model with a supervised encoder, we add the mean squared error between the VAE latent representation and the coordinates of a precomputed t-SNE, UMAP, or PCA embedding to the loss function. The trained encoder then acts as a surrogate to parameterize the embedding function of t-SNE, PCA, or UMAP. Given a snapshot dataset, we thus compute t-SNE, PCA, and UMAP manifolds, and train VAEs with supervised encoders for each, alongside a standard scVI model for a pure VAE manifold. For each approach, we use two-dimensional manifolds for visualization.
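A sketch of the assumed form of this supervised loss is shown below: the usual VAE loss (here Gaussian, for brevity, rather than the full scVI likelihood) plus a mean squared error pulling the latent means towards precomputed t-SNE, UMAP, or PCA coordinates; the weighting factor λ is a hypothetical illustration, not a value taken from our implementation.

```julia
using Flux, Statistics

n_genes, n_hidden, n_latent = 2000, 128, 2
encoder = Chain(Dense(n_genes => n_hidden, relu), Dense(n_hidden => 2n_latent))
decoder = Chain(Dense(n_latent => n_hidden, relu), Dense(n_hidden => n_genes))

latent_mean(x) = encoder(x)[1:n_latent, :]                # latent means of the encoder

function neg_elbo(x)                                      # simplified Gaussian VAE loss
    h = encoder(x)
    μ, logσ = h[1:n_latent, :], h[n_latent+1:end, :]
    z = μ .+ exp.(logσ) .* randn(Float32, size(μ)...)
    recon = sum(abs2, x .- decoder(z))
    kl = 0.5f0 * sum(exp.(2 .* logσ) .+ μ .^ 2 .- 1 .- 2 .* logσ)
    return (recon + kl) / size(x, 2)
end

# Supervised encoder loss: negative ELBO plus MSE between latent means and the
# precomputed 2D embedding (t-SNE, UMAP, or PCA coordinates); λ is hypothetical.
supervised_loss(x, target; λ = 1.0f0) = neg_elbo(x) + λ * mean(abs2, latent_mean(x) .- target)

x = rand(Float32, n_genes, 64)                            # toy mini-batch
target = randn(Float32, n_latent, 64)                     # toy precomputed embedding coordinates
@show supervised_loss(x, target)
```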

For (supervised) VAE training, hyperparameters such as the learning rate and the number of training epochs are chosen in a dataset-dependent manner based on monitoring convergence of the loss function. We aim to keep all hyperparameters as close as possible to the default values in the scVI framework. Details on hyperparameter configurations are reported in the Supplementary Material Section 5. For PCA, t-SNE and UMAP, we also report hyperparameter configurations for each dataset in the Supplementary Material Section 4. We further discuss the effect of hyperparameter choice in Section 3 and show results with different hyperparameters in Supplementary Material Section 3.

We select one dimensionality reduction approach as the “ground-truth” manifold for simulating temporal structure, and use the corresponding supervised encoder VAE. Then, we introduce temporal dynamics by transforming the two-dimensional representation using a vector field (Fig. 2C).

This geometric perspective allows for applying biologically meaningful transformations to the geometry of the cell manifold as a whole rather than specifying individual cellular trajectories, aligning with the landscape metaphor [61, 62].

Formally, a vector field $V$ on a set $S \subseteq \mathbb{R}^2$ assigns to each point $p \in S$ a vector $V(p) \in \mathbb{R}^2$. For our applications in $\mathbb{R}^2$, the vector field is associated with a dynamical system governed by the differential equations

$$\frac{dx}{dt} = f(x, y), \qquad \frac{dy}{dt} = g(x, y).$$

This formulation helps in defining transformations $f$ and $g$ which can mimic the behavior of the dynamical system without needing explicit solutions to the differential equations:

$$V(x, y) = f(x, y)\,\mathbf{e}_x + g(x, y)\,\mathbf{e}_y.$$

In practice, $f$ and $g$ are each defined as a series of transformations, each dependent on a set of parameters $\theta_i$ that tailor each transformation to the desired behavior.

As a result, the components $f$ and $g$ transform any point $(x, y)$ in the domain $S$ as

$$V(x, y) = \sum_{i \in I} \left[ f_i(x, y; \theta_i)\,\mathbf{e}_x + g_i(x, y; \theta_i)\,\mathbf{e}_y \right],$$

where $\mathbf{e}_x$ and $\mathbf{e}_y$ are the unit vectors in the $x$ and $y$ directions, respectively, and $I$ is the set of indices for the transformations.

The additive structure allows for stacking various transformations to manipulate the vector field $V$, making the design visually intuitive and accessible for users. We provide a detailed guide on how to design customized transformations based on this approach, available as a tutorial notebook in our GitHub repository. While our framework allows for freely customizable transformations, we also provide predefined vector fields for common use cases as ready-to-use solutions in our implementation.

The vector fields are designed to correspond to meaningful and biologically plausible developmental patterns, e.g. dividing and spreading cell groups, as illustrated in Fig. 2C.

We apply the vector field dynamics to the two-dimensional representation of the snapshot dataset and iterate this process over a fixed number of time steps. At each step we also add a small random offset to each cell’s representation, to account for the stochasticity inherent in cellular evolution and potential measurement noise. The equations describing the dynamics are tailored specifically for each dataset and incorporate available cell type annotations.
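The following sketch, with illustrative names and parameter values not taken from our implementation, shows one such pattern: cells of a chosen cluster are pushed away from the cluster centroid in opposite directions, the field is applied iteratively over several time steps, and a small Gaussian offset is added at each step.

```julia
using Statistics

# Vector field for a "dividing cluster": cells of `split_cluster` move away from
# the cluster centroid along the x-axis, in opposite directions for the two halves.
function divide_cluster_field(Z, labels, split_cluster; strength = 0.5)
    V = zeros(size(Z))                                    # Z: 2 × cells latent coordinates
    idx = findall(==(split_cluster), labels)
    c = mean(Z[:, idx], dims = 2)                         # centroid of the dividing cluster
    for i in idx
        V[1, i] = strength * sign(Z[1, i] - c[1])         # push left half left, right half right
    end
    return V
end

Z0, labels = randn(2, 300), rand(1:4, 300)                # toy latent coordinates and clusters
timepoints = [copy(Z0)]
for t in 1:3                                              # three additional synthetic time points
    Zprev = timepoints[end]
    push!(timepoints, Zprev .+ divide_cluster_field(Zprev, labels, 2) .+ 0.05 .* randn(2, 300))
end
```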

To generate corresponding high-dimensional data, we pass the transformed latent space from each artificial time point to the VAE decoder, obtaining parameters of a high-dimensional negative binomial distribution for sampling. This generates synthetic high-dimensional time-series data from a real snapshot dataset with artificially introduced temporal structure, specified as a user-defined vector field, in a chosen low-dimensional manifold. Since both the manifold in which the dynamics happen and the dynamical process itself are known, this approach provides a controlled benchmark with a known ground truth.
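As an illustration of this sampling step (with toy decoder outputs in place of a trained model), the sketch below draws counts from a negative binomial with mean μ and gene-wise inverse dispersion θ, translated into the (r, p) parameterization used by Distributions.jl.

```julia
using Distributions

# Draw counts with mean μ[g, c] and variance μ + μ²/θ[g]; in Distributions.jl's
# (r, p) parameterization this corresponds to r = θ and p = θ / (θ + μ).
function sample_counts(μ, θ)
    return [rand(NegativeBinomial(θ[g], θ[g] / (θ[g] + μ[g, c])))
            for g in axes(μ, 1), c in axes(μ, 2)]
end

μ = 5.0 .* rand(2000, 300)        # toy decoder means (genes × cells) for one time point
θ = fill(2.0, 2000)               # toy gene-wise inverse dispersion
counts_t = sample_counts(μ, θ)    # synthetic count matrix for this time point
```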

Finally, we concatenate the generated data across time points to obtain a complete time-series dataset, on which we apply all dimensionality reduction approaches (Fig. 2D). Knowing the original manifold and the true underlying pattern then enables us to reason about discrepancies in their representations, and investigate the sensitivity of the different approaches with respect to the manifold in which the dynamics happen, and to the dynamical pattern itself.

To assess our findings quantitatively, we use the 1-Wasserstein distance, also known as the Earth Mover’s Distance (EMD), to compute the shifts in the distributions of cells between time points in the low-dimensional representations and compare these distances between representations obtained from different techniques. Optimal transport-based methods have been successfully applied to similar distributional comparisons in single-cell data [63–65]. Specifically, we estimate low-dimensional densities using kernel density estimation and compute pairwise differences using the Wasserstein distance, approximated via the entropically-regularized Sinkhorn algorithm for computational efficiency [66]. Details are explained in the Supplementary Material, Section 4.
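A minimal sketch of this comparison is given below: it computes the entropically regularized transport cost between the 2D coordinates of cells from two time points with uniform weights (rather than the kernel density estimates described above), as a rough proxy for the EMD.

```julia
using LinearAlgebra

# Entropically regularized optimal transport (Sinkhorn) between two 2D point clouds
# with uniform weights; returns the transport cost as an approximation of the EMD.
function sinkhorn_cost(A, B; ε = 0.1, iters = 200)
    n, m = size(A, 2), size(B, 2)
    C = [norm(A[:, i] .- B[:, j]) for i in 1:n, j in 1:m]   # ground cost matrix
    K = exp.(-C ./ ε)
    a, b = fill(1 / n, n), fill(1 / m, m)
    u, v = ones(n), ones(m)
    for _ in 1:iters                                         # Sinkhorn fixed-point iterations
        u = a ./ (K * v)
        v = b ./ (K' * u)
    end
    T = Diagonal(u) * K * Diagonal(v)                        # entropically regularized plan
    return sum(T .* C)
end

Z1 = randn(2, 300)                                           # toy coordinates at time point 1
Z2 = randn(2, 300) .+ [2.0, 0.0]                             # shifted coordinates at time point 2
@show sinkhorn_cost(Z1, Z2)
```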

Results

To illustrate our approach across different contexts, we have applied it on three snapshot scRNA-seq datasets. In this section, we present results using a dataset of 8000 peripheral blood mononuclear cells (PBMCs) [67], referred to as the PBMC data, and a dataset of 3000 mouse brain cells [68], which we refer to as the Zeisel data. These datasets have been used as the initial basis at the first time point for applying our method to generate the high-dimensional synthetic time-series datasets. To further assess the robustness of our approach, we also applied it to a third, larger and more heterogeneous dataset of 17,000 immune cells from murine spleen and lymph nodes [69], referred to as the Lymph data. Detailed results for the latter dataset are provided in the Supplementary Material Section 6.

For each dataset, we have compared different hyperparameter configurations of t-SNE, UMAP, and scVI and selected the parameters that provided the visually most informative representation, reflecting the experience of a domain expert user without in-depth programming expertise. Even if there is some combination of pre-processing and hyperparameters that might lead to “better” capturing a specific pattern, it would be impossible to identify this without access to the ground truth. We have not found any systematic structure in hyperparameter configurations that would in general allow temporal patterns to be seen more clearly. A detailed comparison is provided in the Supplementary Material Sections 2 and 3.

All plots are colored according to original cell type annotation from the first time point. As this annotation is no longer consistent when clusters experience differentiating processes, we refer to the clusters by their color annotation and not cell type label.

Figure 3 shows an exemplary result on the PBMC data. In this case, the blue cluster differentiates, creating two new separate clusters: one exhibits increasingly similar gene expression profiles to those of the yellow cluster, whereas the other one shows a growing resemblance to the profiles of the pink and red cell types. We have taken the t-SNE and UMAP representations based on the original data as first time points, and transformed the cell clusters according to the described pattern in both spaces (Fig. 3A and C respectively). From each of these artificial time points, we have then generated corresponding high-dimensional gene expression datasets, and applied t-SNE, UMAP, PCA, and scVI on the resulting dataset. Our design induces temporal patterns in different manifolds, allowing us to evaluate how the same transformation is captured by different methods across these manifolds. As expected, when the differentiation process occurs in the t-SNE manifold, t-SNE effectively depicts the transformation (Fig. 3B), likely because the method is operating in its own representation space. However, when the same transformation takes place in the UMAP manifold, t-SNE can still capture the differentiation, but it is visually less evident (Fig. 3D).

Figure 3. (A and C) Latent representations from the supervised VAE, trained to match the t-SNE (A) and UMAP (C) embeddings of the PBMC dataset (leftmost panel), used as initial time point (t1), and transformed to induce a visual temporal pattern corresponding to the differentiation of the artificially transformed B-cells, shown in panels t2–t4. (B and D) t-SNE applied to the high-dimensional datasets generated by decoding the transformed latent representations in the t-SNE (A) and UMAP (C) manifolds using the supervised VAE. Cell annotations correspond to manually assigned cell types at the original time point (t1). This comparison illustrates how the same transformation is captured differently by t-SNE across these manifolds.

This inconsistency is observed across various combinations of methods and manifolds, where certain methods perform better in specific manifolds and for particular transformations, but no single method dominates across all scenarios. For example, we can observe in Fig. 4 how PCA enables us to see two cell clusters evolving in different directions in terms of their gene expression profiles within a UMAP manifold (Fig. 4B), but the same type of transformation is no longer perceptible when it involves different clusters and directions (Fig. 4D).

Figure 4. (A and C) Latent representations from the supervised VAE, trained to match the t-SNE (A) and UMAP (C) embeddings of the Zeisel dataset (leftmost panel), used as the initial time point (t1), and transformed to induce a visual temporal pattern, shown in panels t2–t4. These patterns correspond to structural transitions in the artificially transformed endothelial-mural and microglia populations in A, and in the endothelial-mural and interneurons in C. (B and D) PCA applied to the high-dimensional datasets generated by decoding the differently transformed latent representations in the UMAP manifold using the supervised VAE. Cell annotations correspond to manually assigned cell types at the original time point (t1). This comparison highlights how the performance of the same method (PCA) can vary depending on the transformation applied, even if this transformation is done in the same manifold.

To complement the visual evaluation with a quantitative measure, we computed the EMD between the cellular distributions at different time points in the low-dimensional representation. Briefly, the EMD values confirm the visual impression, by showing that representations which capture a temporal pattern in a visually similar way also have similar EMD values, whereas a greater discrepancy in how a pattern is visually captured is also associated with a larger difference in EMD value. Detailed results are shown in the Supplementary Material Section 4.

In summary, the empirical evaluation shows that the different dimensionality reduction techniques often have different perspectives on the same underlying temporal structure and thus cannot be assumed to accurately reflect an underlying developmental process. These findings do not allow us to recommend one method as superior in general, but rather emphasize the need for careful method selection based on the characteristics of both the temporal patterns and the manifolds they exist in. In this context, our framework enables a more comprehensive characterization of their performance and can aid targeted selection of an optimal dimensionality reduction approach if some knowledge about the anticipated pattern is available.

Discussion

The increasing availability of molecular datasets with complex structure and multiple measurements, such as from several time points, promises novel insights into biological mechanisms, such as understanding developmental trajectories at single-cell resolution.

For time-series scRNA-seq data, like the motivational example in Fig. 1, the true underlying dynamics are typically not fully known. While there may be some knowledge about the assumed underlying process (e.g. which cell types evolve during differentiation and which markers are up- and downregulated), this is usually incomplete, and dimensionality reduction and subsequent modeling is applied precisely to infer these temporal processes. In particular, even with some knowledge of the expected driving process, it is unclear how to identify the “true” low-dimensional manifold where this process can ideally be observed. Often, researchers apply unsupervised dimensionality reduction approaches for visual inspection of temporal patterns, which cannot be guaranteed to identify this manifold, as common approaches optimize purely data-intrinsic criteria.

To investigate to what extent different approaches can nonetheless reveal complex underlying structure such as temporal development patterns, we developed a VAE approach for generating synthetic data with temporal structure based on a chosen manifold (Fig. 2). Specifically, the approach allows for defining a hypothetical “ground-truth” manifold, artificially introducing temporal structure in this space, and generating corresponding synthetic high-dimensional gene expression data.

While it was to be expected that the choice of the dimensionality reduction technique affects which patterns can be seen, comparing performance based only on visual representations is challenging for time-series single-cell datasets, especially since there are no established benchmark datasets or ground-truth information specifically for this purpose. Our synthetic data approach provides a solution by creating such datasets based on real data that approximate an underlying dynamic system, using vector fields as an intuitive geometric framework.

We used some of these datasets to compare four popular techniques, namely PCA, t-SNE, UMAP, and scVI, and observed in the examples that no method is consistently superior. We conclude that relying on a single technique cannot be assumed to be sufficient to reliably represent dynamic patterns and thus consider it highly beneficial to look at different representations to get a more comprehensive picture. Alternatively, dimensionality reduction techniques could be extended to specifically target manifolds with the most variability over time, to flexibly pick up the manifold where the temporal dynamics actually unfold.

The main advantages of the proposed synthetic data approach are its versatility for introducing dynamic patterns via an intuitive geometric vector field approach and its flexibility with respect to different manifolds where dynamics can occur. This is enabled by a supervised component that allows the model to mimic the two-dimensional representations from different techniques. This flexibility allows for inducing different temporal patterns across various learned manifolds and posing questions such as, “We have a four-time-point dataset where one cell cluster divides over time in the manifold learned by t-SNE. Can UMAP, when trained on this generated data, also reveal this temporal pattern? Does it do so in the same way, or are there differences? Is this generalizable to any cluster division in a t-SNE space, or does it depend on the division’s speed? How significantly can parameter tuning affect the visualization?”

When researchers have prior knowledge or hypotheses regarding the dynamics they expect, experimenting with our approach can help to identify the most suitable underlying low-dimensional manifold where the expected processes are best observed.

The presented approach assumes an equal number of cells at each time point, which in reality might differ considerably. To address this, cells could be subsampled at specific time points, using either random dropout or subsampling according to an assumed missingness distribution (e.g. if specific cell types are assumed to become more or less abundant at later time points). As an illustration, we have implemented a prototypical example of this idea and show an exemplary result in the Supplementary Material Section 5. Additionally, the VAE decoder can be used to generate synthetic data of arbitrary size, such that even from a latent representation with an equal number of cells at each time point, we can generate high-dimensional data with different numbers of cells at each time point. Investigating the effect of varying cell numbers on the representations of different dimensionality reduction approaches, applied to the high-dimensional data, can then help to further characterize their perspective on the underlying dynamics. The synthetic data approach could be further enhanced by incorporating more complex patterns.
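A prototypical sketch of such subsampling is shown below (all names and probabilities are illustrative, not taken from our implementation): each cell at a given synthetic time point is kept with a probability that can depend on its cell type, mimicking cell types that become less abundant over time.

```julia
# Keep each cell with a cell-type-dependent probability; unspecified types use `default`.
function subsample_cells(cell_types, keep_prob::Dict; default = 1.0)
    return [i for (i, ct) in enumerate(cell_types) if rand() < get(keep_prob, ct, default)]
end

cell_types = rand(["B", "T", "NK"], 500)                 # toy annotations at one time point
kept = subsample_cells(cell_types, Dict("B" => 0.4))     # indices of retained cells (B cells depleted)
```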

The quantitative comparison of discrepancies in the representations from different techniques via the EMD offers a first approach to objectively evaluate dimensionality reduction approaches for preserving temporal patterns. Yet, the EMD only allows for relative comparisons between techniques and the absolute value itself is not straightforward to interpret. We also observed that for t-SNE and UMAP, the EMD mostly tended to be within a specific range of values across different patterns and datasets, limiting the comparability of absolute values across techniques. Future work could thus focus on developing further quantitative measures to directly compare the performance of different techniques, offering a standardized way to evaluate manifold learning in time-series single-cell data.

In summary, our approach addresses a critical gap in analyzing time-series scRNA-seq data, offering a way to reason about discrepancies between dimensionality reduction techniques. By leveraging synthetic datasets with controlled temporal dynamics, we provide a tool for exploring how different methods capture key developmental patterns, enabling researchers to assess multiple perspectives for adequately representing underlying biological processes.

Key Points

  • Different dimensionality reduction techniques, such as PCA, t-SNE, UMAP, and scVI, can produce inconsistent representations of temporal patterns in time-series scRNA-seq data.

  • We propose a synthetic data approach using variational autoencoders (VAEs) to introduce biologically plausible temporal patterns, for reasoning about discrepancies in the representations obtained when subsequently applying different dimensionality reduction techniques.

  • The approach enables researchers to assess the ability of various methods to capture dynamic processes, in particular highlighting the limitations of relying on a single technique.

  • We provide an implementation and tutorial notebooks to guide researchers in evaluating representations and interpreting single-cell dynamics effectively.

Supplementary Material

supp_bbaf287
supp_bbaf287.pdf (39.1MB, pdf)

Acknowledgments

The authors thank Martin Treppner and Moritz Hess for help with data pre-processing and Julian Pfeifle for his support and helpful advice during the course of this research. This work is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation)—Project-ID 322977937—GRK 2344 (MH) and Project-ID 499552394—SFB 1597 (HB, RB, MH).

Contributor Information

Maren Hackenberg, Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Stefan-Meier-Straße 26, 79106 Freiburg, Germany; Freiburg Center for Data Analysis, Modeling and AI, University of Freiburg, Ernst-Zermelo-Straße 1, 79106 Freiburg, Germany.

Laia Canal Guitart, Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Stefan-Meier-Straße 26, 79106 Freiburg, Germany; Freiburg Center for Data Analysis, Modeling and AI, University of Freiburg, Ernst-Zermelo-Straße 1, 79106 Freiburg, Germany.

Rolf Backofen, Bioinformatics Group, Department of Computer Science, University of Freiburg, Georges-Kohler-Allee 106, 79110 Freiburg, Germany; Centre for Integrative Biological Signaling Studies (CIBSS), University of Freiburg, Schänzlestr. 18, 79104 Freiburg im Breisgau, Germany.

Harald Binder, Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Stefan-Meier-Straße 26, 79106 Freiburg, Germany; Freiburg Center for Data Analysis, Modeling and AI, University of Freiburg, Ernst-Zermelo-Straße 1, 79106 Freiburg, Germany; Centre for Integrative Biological Signaling Studies (CIBSS), University of Freiburg, Schänzlestr. 18, 79104 Freiburg im Breisgau, Germany.

Author contributions

M.H. and L.C.G. developed and performed the experiments and wrote the manuscript. R.B. contributed to the writing of the manuscript. H.B. proposed the idea and supervised the project. All authors reviewed the manuscript and approved the final version.

Conflict of interest

No competing interest is declared.

Data availability

The exemplary time-series dataset is the Embryoid body data. For generating synthetic data, we use two publicly available snapshot scRNA-seq datasets: the PBMC8k data, a dataset of peripheral blood mononuclear cells (PBMCs) from [67], and the Zeisel data, a heterogeneous dataset of mouse brain cells [68]. Details of data acquisition and pre-processing can be found in the Supplementary Material, Section 1.

We use version 1.6.7 of the Julia programming language [70] for all our analyses and models. For training scVI models in Julia and their supervised encoder adaptation for targeted generation of synthetic data, we have written a Julia version of the original scVI model from [12] based on the Python scvi-tools ecosystem [15], including pre-processing functionality based on the anndata [71] and scanpy [72] packages. As this has not been comprehensively developed in Julia before, we have created a corresponding Julia package available at https://github.com/maren-ha/scVI.jl. The complete code to reproduce our analysis and all results in this manuscript can be found at https://github.com/laia-cg/scManifoldDynamics, including tutorial Jupyter notebooks for a user-friendly introduction. Details about the hyperparameters of t-SNE and UMAP, the VAE architecture and training procedure can be found in the Supplementary Material.

References

  • 1. Kobak  D, Berens  P. The art of using t-SNE for single-cell transcriptomics. Nat Commun  2019;10:5416. 10.1038/s41467-019-13056-x [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2. Luecken  MD, Theis  FJ. Current best practices in single-cell RNA-seq analysis: a tutorial. Mol Syst Biol  2019;15:e8746. 10.15252/msb.20188746 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. van der Maaten  L, Hinton  G. Visualizing data using t-SNE. J Mach Learn Res  2008;9:2579–605. [Google Scholar]
  • 4. McInnes  L, Healy  J, Saul  N. et al.  UMAP: uniform manifold approximation and projection. J Open Source Softw  2018;3:861. 10.21105/joss.00861 [DOI] [Google Scholar]
  • 5. Becht  E, McInnes  L, Healy  J. et al.  Dimensionality reduction for visualizing single-cell data using umap. Nat Biotechnol  2019;37:38–44. 10.1038/nbt.4314 [DOI] [PubMed] [Google Scholar]
  • 6. Graving  JM, Couzin  ID. VAE-SNE: a deep generative model for simultaneous dimensionality reduction and clustering  bioRxiv. 2020.
  • 7. Crecchi  F, de Bodt  C, Verleysen  M. Perplexity-free parametric t-SNE. In: Proceedings of the European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning ESANN, 2020.  arXiv preprint: arXiv:2010.01359.
  • 8. Sainburg  T, McInnes  L, Gentner  TQ. Parametric UMAP embeddings for representation and semisupervised learning. Neural Comput  2021;33:2881–907. 10.1162/neco_a_01434 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. Han  H, Zhang  T, Li  C. et al.  Explainable t-SNE for single-cell RNA-seq data analysis  bioRxiv. 2022.
  • 10. Dadu  A, Satone  VK, Kaur  R. et al.  Application of aligned-UMAP to longitudinal biomedical studies. Patterns  2023;4:100741. 10.1016/j.patter.2023.100741 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11. Tsuyuzaki  K, Sato  H, Sato  K. et al.  Benchmarking principal component analysis for large-scale single-cell RNA-sequencing. Genome Biol  2020;21:9. 10.1186/s13059-019-1900-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Lopez  R, Regier  J, Cole  MB. et al.  Deep generative modeling for single-cell transcriptomics. Nat Methods  2018;15:1053–8. 10.1038/s41592-018-0229-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13. Amodio  M, Van Dijk  D, Srinivasan  K. et al.  Exploring single-cell data with deep multitasking neural networks. Nat Methods  2019;16:1139–45. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Treppner  M, Salas-Bastos  A, Hess  M. et al.  Synthetic single cell RNA sequencing data from small pilot studies using deep generative models. Sci Rep  2021;11:1–11. 10.1038/s41598-021-88875-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15. Gayoso  A, Lopez  R, Xing  G. et al.  A python library for probabilistic analysis of single-cell omics data. Nat Biotechnol  2022;40:163–6. 10.1038/s41587-021-01206-w [DOI] [PubMed] [Google Scholar]
  • 16. Eraslan  G, Simon  LM, Mircea  M. et al.  Single-cell RNA-seq denoising using a deep count autoencoder. Nat Commun  2019b;10:390. 10.1038/s41467-018-07931-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Ding  J, Condon  A, Shah  SP. Interpretable dimensionality reduction of single cell transcriptome data with deep generative models. Nat Commun  2018;9:1–13. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18. Ding  J, Sharon  N, Bar-Joseph  Z. Temporal modelling using single-cell transcriptomics. Nat Rev Genet  2022;23:355–68. 10.1038/s41576-021-00444-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19. Vandereyken  K, Sifrim  A, Thienpont  B. et al.  Methods and applications for single-cell and spatial multi-omics. Nat Rev Genet  2023;24:494–515. 10.1038/s41576-023-00580-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20. Baysoy  A, Bai  Z, Satija  R. et al.  The technological landscape and applications of single-cell multi-omics. Nat Rev Mol Cell Biol  2023;24:695–713. 10.1038/s41580-023-00615-w [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21. Weinreb  C, Wolock  S, Tusi  BK. et al.  Fundamental limits on dynamic inference from single-cell snapshots. Proc Natl Acad Sci  2018;115:e2467–76. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Schiebinger  G, Shu  J, Tabaka  M. et al.  Optimal-transport analysis of single-cell gene expression identifies developmental trajectories in reprogramming. Cell  2019;176:1517. 10.1016/j.cell.2019.02.026 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23. Tran  TN, Bader  GD. Tempora: cell trajectory inference using time-series single-cell RNA sequencing data. PLoS Comput Biol  2020;16:e1008205. 10.1371/journal.pcbi.1008205 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Bergen  V, Soldatov  RA, Kharchenko  PV. et al.  RNA velocity—current challenges and future perspectives. Mol Syst Biol  2021;17:e10282. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25. Yeo  GHT, Saksena  SD, Gifford  DK. Generative modeling of single-cell time series with PRESCIENT enables prediction of cell trajectories with interventions. Nat Commun  2021;12:3222. 10.1038/s41467-021-23518-w [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26. Qiu  C, Martin  BK, Welsh  IC. et al.  A single-cell time-lapse of mouse prenatal development from gastrula to birth. Nature  2024;626:1084–93. 10.1038/s41586-024-07069-w [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27. Chen  C-C, Tran  W, Song  K. et al.  Temporal evolution reveals bifurcated lineages in aggressive neuroendocrine small cell prostate cancer trans-differentiation. Cancer Cell  2023;41:2066–2082.e9. 10.1016/j.ccell.2023.10.009 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28. Cao  J, Spielmann  M, Qiu  X. et al.  The single-cell transcriptional landscape of mammalian organogenesis. Nature  2019;566:496–502. 10.1038/s41586-019-0969-x [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29. Farrell  JA, Wang  Y, Riesenfeld  SJ. et al.  Single-cell reconstruction of developmental trajectories during zebrafish embryogenesis. Science  2018;360:eaar3131. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Feng  C, Liu  S, Zhang  H. et al.  Dimension reduction and clustering models for single-cell RNA sequencing data: a comparative study. Int J Mol Sci  2020;21:2181. 10.3390/ijms21062181 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. Moon  KR, Stanley  JS III, Burkhardt  D. et al.  Manifold learning-based methods for analyzing single-cell RNA-sequencing data. Curr Opin Syst Biol  2018;7:36–46. [Google Scholar]
  • 32. Moon  K. Embryoid Body Data for Phate  2018.
  • 33. Cooley  SM, Hamilton  T, Aragones  SD. et al.  A novel metric reveals previously unrecognized distortion in dimensionality reduction of scRNA-seq data  bioRxiv. 2022.
  • 34. Chari  T, Pachter  L. The specious art of single-cell genomics. PLoS Comput Biol  2023;19:1–20. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35. Pal  K, Sharma  M. Performance evaluation of non-linear techniques umap and t-SNE for data in higher dimensional topological space. In: 2020 4th International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC), Piscataway, NJ, USA: Institute of Electrical and Electronics Engineers (IEEE), pp. 1106–10, 2020.
  • 36. Damrich  S, Hamprecht  FA. On UMAPs true loss function. In: Ranzato  M, Beygelzimer  A, Dauphin  Y. et al. (eds.), Advances in Neural Information Processing Systems, Vol. 34, pp. 5798–809. Curran Associates, Inc., 2021. [Google Scholar]
  • 37. Xia  L, Lee  C, Li  JJ. Statistical method scDEED for detecting dubious 2D single-cell embeddings and optimizing t-SNE and UMAP hyperparameters. Nat Commun  2024;15:1753. Publisher: Nature Publishing Group [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38. Sun  S, Zhu  J, Ma  Y. et al.  Accuracy, robustness and scalability of dimensionality reduction methods for single-cell RNA-seq analysis. Genome Biol  2019;20:269. 10.1186/s13059-019-1898-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39. Xiang  R, Wang  W, Yang  L. et al.  A comparison for dimensionality reduction methods of single-cell RNA-seq data. Front Genet  2021;12. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40. Pandit  V, Swain  AK, Yadav  P. Comparison of dimensionality reduction and clustering methods for single-cell transcriptomics data  bioRxiv. 2022.
  • 41. Zappia  L, Phipson  B, Oshlack  A. Splatter: simulation of single-cell RNA sequencing data. Genome Biol  2017;18:174. 10.1186/s13059-017-1305-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42. Treppner  M, Haug  S, Köttgen  A. et al.  Designing single cell rna-sequencing experiments for learning latent representations  bioRxiv. 2022. bioRxiv preprint
  • 43. Kingma  DP, Welling  M. Auto-encoding variational bayes. In: Bengio Y, LeCun Y (eds.), 2nd International Conference on Learning Representations (ICLR), Conference Track Proceedings. Banff, Alberta, Canada: International Conference on Learning Representations (ICLR), 2014.
  • 44. Waddington  CH. The Strategy of the Genes. Routledge, London, 2014, 10.4324/9781315765471. [DOI] [Google Scholar]
  • 45. Ferrell  JE. Bistability, bifurcations, and Waddington’s epigenetic landscape. Curr Biol  2012;22:R458–66. 10.1016/j.cub.2012.03.045 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46. Henikoff  S. The epigenetic landscape: an evolving concept. In: Sharon YR. Dent (ed.), Frontiers in Epigenetics and Epigenomics. Lausanne, Switzerland: Frontiers Media SA, 2023. [Google Scholar]
  • 47. Hotelling  H. Analysis of a complex of statistical variables into principal components. J Educ Psychol  1933;24:417–41. 10.1037/h0071325 [DOI] [Google Scholar]
  • 48. Shah  MZH, Ahmed  Z, Lisheng  H. Weighted linear local tangent space alignment via geometrically inspired weighted pca for fault detection. IEEE Trans Industr Inform  2022;19:210–9. [Google Scholar]
  • 49. Cook  RD. Detection of influential observation in linear regression. Technometrics  1977;19:15–8. [Google Scholar]
  • 50. Jolliffe  IT, Cadima  J. Principal component analysis: a review and recent developments. Phil Trans R Soc Lond Ser A Math Phys Eng Sci  2016;374:20150202. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51. Chari  T, Banerjee  J, Pachter  L. The specious art of single-cell genomics  bioRxiv. 2021. [DOI] [PMC free article] [PubMed]
  • 52. Cho  H, Berger  B, Peng  J. Generalizable and scalable visualization of single-cell data using neural networks. Cell Syst  2018;7:185, 08–191.e4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53. van der Maaten  L. Learning a parametric embedding by preserving local structure. In: proceedings of the 12th International Conference on Artificial Intelligence and Statistics, volume 5 of Proceedings of Machine Learning Research. van Dyk  D, Welling  M (eds.), pp. 384–91. Hilton Clearwater beach resort, Clearwater Beach, Florida USA: PMLR, 2009. [Google Scholar]
  • 54. Allaoui  M, Kherfi  ML, Cheriet  A. Considerably improving clustering algorithms using umap dimensionality reduction technique: A comparative study. In: International Conference on Image and Signal Processing, 317–25. Springer, 2020, 10.1007/978-3-030-51935-3_34. [DOI] [Google Scholar]
  • 55. Kingma  DP, Welling  M. An introduction to variational autoencoders. Found Trends Mach Learn  2019;12:307–92. [Google Scholar]
  • 56. Blei  DM, Kucukelbir  A, McAuliffe  JD. Variational inference: A review for statisticians. J Am Stat Assoc  2017;112:859–77. 10.1080/01621459.2017.1285773 [DOI] [Google Scholar]
  • 57. Jiang  R, Sun  T, Song  D. et al.  Statistics or biology: The zero-inflation controversy about scRNA-seq data. Genome Biol  2022;23:1–24. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58. Prabhakaran  S, Azizi  E, Carr  A. et al.  Dirichlet process mixture model for correcting technical variation in single-cell gene expression data. In: Proceedings of the 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research. Balcan  MF, Weinberger  KQ (eds.), pp. 1070–9. New York, NY, USA: PMLR, 2016. [PMC free article] [PubMed] [Google Scholar]
  • 59. Cole  MB, Risso  D, Wagner  A. et al.  Performance assessment and selection of normalization procedures for single-cell RNA-seq. Cell Systems  2019;8:315–328.e8. 10.1016/j.cels.2019.03.010 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60. Hess  M, Hackenberg  M, Binder  H. Exploring generative deep learning for omics data by using log-linear models. Bioinformatics  2020;36:5045–53. 10.1093/bioinformatics/btaa623 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61. Waddington  CH. Principles of Development and Differentiation. 1966.
  • 62. Moris  N, Pina  C, Arias  AM. Transition states and cell fate decisions in epigenetic landscapes. Nat Rev Genet  2016;17:693–703. 10.1038/nrg.2016.98 [DOI] [PubMed] [Google Scholar]
  • 63. Klein  D, Palla  G, Lange  M. et al.  Mapping cells through time and space with moscot. Nature  2025;1–11. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 64. Bunne  C, Schiebinger  G, Krause  A. et al.  Optimal transport for single-cell and spatial omics. Nature Reviews Methods Primers  2024;4:58. 10.1038/s43586-024-00334-2 [DOI] [Google Scholar]
  • 65. Schiebinger  G, Shu  J, Tabaka  M. et al.  Optimal-transport analysis of single-cell gene expression identifies developmental trajectories in reprogramming. Cell  2019;176:928–943.e22. 10.1016/j.cell.2019.01.006 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66. Cuturi  M. Sinkhorn distances: lightspeed computation of optimal transport. In: Christopher JC Burges, Léon Bottou, Max Welling, Zoubin Ghahramani, and Kilian Q Weinberger (eds.), Advances in Neural Information Processing Systems  2013;26. [Google Scholar]
  • 67. Zheng  GXY, Terry  JM, Belgrader  P. et al.  Massively parallel digital transcriptional profiling of single cells. Nat Commun  2017;8:14049. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 68. Zeisel  A, Muñoz-Manchado  AB, Codeluppi  S. et al.  Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science  2015;347:1138–42. [DOI] [PubMed] [Google Scholar]
  • 69. Gayoso  A, Steier  Z, Lopez  R. et al.  Joint probabilistic modeling of single-cell multi-omic data with totalvi. Nat Methods  2021;18:272–82. 10.1038/s41592-020-01050-x [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70. Bezanson  J, Edelman  A, Karpinski  S. et al.  Julia: a fresh approach to numerical computing. SIAM Review  2017;59:65–98. 10.1137/141000671 [DOI] [Google Scholar]
  • 71. Virshup  I, Rybakov  S, Theis  FJ. et al.  Anndata: annotated data  bioRxiv, 2021.12.16.473007, 01 2021. bioRxiv preprint. 10.1101/2021.12.16.473007 [DOI]
  • 72. Wolf  FA, Angerer  P, Theis  FJ. Scanpy: large-scale single-cell gene expression data analysis. Genome Biol  2018;19:15. [DOI] [PMC free article] [PubMed] [Google Scholar]
