PLOS Computational Biology. 2022 Sep 28;18(9):e1010031. doi: 10.1371/journal.pcbi.1010031

Towards reliable quantification of cell state velocities

Valérie Marot-Lassauzaie 1,2,#, Brigitte Joanne Bouman 1,3,#, Fearghal Declan Donaghy 1, Yasmin Demerdash 4,5,6, Marieke Alida Gertruda Essers 4,5,7, Laleh Haghverdi 1,*
Editor: Wei Li 8
PMCID: PMC9550177  PMID: 36170235

Abstract

A few years ago, it was proposed to use the simultaneous quantification of unspliced and spliced messenger RNA (mRNA) to add a temporal dimension to high-throughput snapshots of single cell RNA sequencing data. This concept can yield additional insight into the transcriptional dynamics of the biological systems under study. However, current methods for inferring cell state velocities from such data (known as RNA velocities) are afflicted by several theoretical and computational problems, hindering realistic and reliable velocity estimation. We discuss these issues and propose new solutions for addressing some of the current challenges in consistency of data processing, velocity inference and visualisation. We translate our computational conclusions into two velocity analysis tools: a detailed method, κ-velo, and a heuristic method, eco-velo, each of which uses a different set of assumptions about the data.

Author summary

Single cell transcriptomics has been used to study dynamical biological processes such as cell differentiation or disease progression. An ideal study of these systems would track individual cells in time, but this is not directly feasible since cells are destroyed as part of the sequencing protocol. Because cells progress asynchronously, single cell snapshot datasets often capture cells at different stages of progression. The challenge is to infer both the overall direction of progression (pseudotime) and the single-cell-specific variations in the progression. Computational methods for inferring the overall direction are well advanced, but attempts to address the single-cell-level variations of the dynamics are newer. Simultaneous measurement of the abundances of new (unspliced) and older (spliced) mRNA in the same single cell adds a temporal dimension to the data, which can be used to infer the time derivative of a single cell's progression through the dynamical process. State-of-the-art methods for inference of cell state velocities from RNA-seq data (also known as RNA velocity) have multiple unaddressed issues. In this manuscript, we discuss these issues and propose new solutions. In previous works, agreement of RNA velocity estimations with pseudotime has been used as validation. We show that this in itself is not proof that a method works reliably, and that the overall direction of progression has to be distinguished from individual cells’ behaviour. We propose two new methods (one detailed and one cost-efficient heuristic) for estimation and visualisation of RNA velocities and show that our methods faithfully capture the single-cell variances and overall trend in simulations. We further apply the methods to different datasets and show how they can help us gain biological insight from real data.


This is a PLOS Computational Biology Methods paper.

Introduction

Single cell transcriptomics has facilitated the study of asynchronous cellular processes such as cell differentiation in the high-dimensional gene expression space. Development of computational methods for extracting temporal information from snapshots of the system has attracted much attention in recent years. The output of these methods is typically a pseudo-temporal ordering of cells, representing their progression along the (deterministic) path of directed differentiation. However, this ordering does not reflect the intrinsic stochastic characteristics of the process and leaves several biologically interesting questions unanswered. Can cells go back along de-differentiation paths? If yes, how far and how likely is that? How strong is the stochastic component of the dynamics compared to the deterministic directed part? Answering these questions would allow quantification of cell fate plasticity in different transcriptional regions.

RNA velocity, proposed by [1] (and the corresponding package called velocyto), was a breakthrough towards obtaining a more complete description of the dynamics of cell differentiation. Simultaneous measurement of abundances of nascent unspliced and mature spliced mRNA in single cells adds a temporal dimension to the collected data which can be used to infer the temporal motion of cells in transcriptomic space. A later method, scVelo [2], further advanced the concept by solving the transcriptional dynamics of splicing kinetics and velocity inference. Other extensions included additional temporal layers of gene regulation such as protein levels [3] or chromatin accessibility [4] to the unspliced and spliced mRNA levels to extract further information on cell state dynamics. Recently, there have also been advancements in using cell state velocities to study the degree of cell plasticity [5]. For all these methods, it is important to first ensure robust and reliable estimation of single cell velocities. Ideally, the estimated velocities should capture both the overall course in the population as well as the single-cell specific (stochastic) part of the dynamics. However, reliable inference of cell state velocities is still impeded by multiple computational issues. Some weaknesses in current velocity visualisation approaches, as well as challenges in inclusion of genes with multiple dynamics, have been pointed out in [1, 2, 6, 7]. Another issue on scale invariance of gene-wise velocity components was described in more detail in [8]. Current methods either do not address this scale invariance issue or address it incompletely using unrealistic assumptions. Moreover, there are several inconsistencies in the current methods’ processing pipeline and the stochastic part of the dynamics is lost through multiple layers of data imputation and smoothing. 
In parallel to this study, [9] and [10] point out some of the limitations and problems of current velocity visualisation methods. [9] also suggests that, due to the highly stochastic nature of the gene expression process, currently used (deterministic) approaches are insufficient, and proposes the development of probabilistic alternatives. More recently, a variational inference method for RNA velocity estimation has also become available [11].

In this manuscript, we argue that when dealing with highly stochastic processes, deterministic approaches are only useful for describing average velocities over specific time intervals, rather than instantaneous velocities, which cannot be measured in practice. We propose two different approaches for estimation and visualisation of RNA velocities. In κ-velo, we first design a processing workflow specifically adapted to downstream velocity calculations, thereby addressing problems in previously used workflows. We then solve for the gene-wise reaction rate parameters and propose an approach to relate velocity components across genes, hence resolving the scale invariance issue. We also present a new visualisation method that more faithfully represents the stochastic part of the velocities. In addition, we propose eco-velo, a heuristic method that bypasses several cumbersome, computationally costly and stochasticity-killing steps used by other available methods.

A table of contents is provided in S1 Table of contents.

Methods

Dynamical inference

Building high-dimensional cell state velocities as vector sums of their gene-wise components (as is the current practice) requires careful handling of two major issues: ambiguity of the time scales and the relative scaling between different velocity components. In this section, we discuss current problems in state-of-the-art velocity estimation approaches and introduce our novel κ-velo and eco-velo approaches.

The time scale over which average cell state velocities are reported

In the physical world, we can only measure average velocities in a given time interval Δt. As Δt → 0 measured velocities get closer to instantaneous velocities, which are impossible to measure directly. When adding multiple velocity components one would ideally need to measure all gene-wise displacement components Δxg in the same interval Δt. Mathematically:

$$V = \sum_{g=1}^{G} \frac{\Delta x_g}{\Delta t} \tag{1}$$

where V is the G-dimensional velocity vector and Δt is the same for all genes. However, in the RNA velocity framework (even without scale invariance problem discussed in the next subsection) we use:

$$V = \sum_{g=1}^{G} v_g, \qquad v_g = \frac{\Delta x_g}{\Delta t_g} \tag{2}$$

the result of which differs from Eq 1 for non-smooth expression dynamics. Using a different Δtg for each gene g raises an immediate question: which time interval does the average cell state velocity V calculated from Eq (2) correspond to? The obscurity in the physical meaning of velocities calculated in this way is more pronounced when including genes with noisy expression dynamics, e.g. bursting genes, for which velocities will change depending on the time scale (S1 Fig). For such genes, it would be interesting to experimentally measure velocities at multiple time scales. This could help us better understand the extent of cell fate plasticity. One would expect to see more variance in the direction of individual cells' velocities reported at small time scales, whereas velocities over sufficiently large time scales would better align with the pseudotemporal direction of differentiation.
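The difference between Eq 1 and Eq 2 can be made concrete with a small numerical sketch. The two toy trajectories below (one smooth, one with a fast oscillation standing in for bursty expression) and the interval lengths are illustrative assumptions, not quantities taken from any dataset.

```python
import numpy as np

def average_velocity(x, t, dt):
    """Average velocity of the trajectory x(.) over the interval [t, t + dt]."""
    return (x(t + dt) - x(t)) / dt

# Hypothetical expression trajectories for two genes
def smooth_gene(t):
    return 2.0 * t

def bursty_gene(t):
    return 2.0 * t + np.sin(20.0 * t)

t0 = 0.0
# Eq 1: every gene-wise component is measured over the same interval
v_common = np.array([average_velocity(smooth_gene, t0, 1.0),
                     average_velocity(bursty_gene, t0, 1.0)])
# Eq 2: each gene-wise component is measured over its own interval
v_mixed = np.array([average_velocity(smooth_gene, t0, 1.0),
                    average_velocity(bursty_gene, t0, 0.05)])
```

In this toy setting the component of the smooth gene is unaffected, whereas the component of the bursty gene, and hence the direction of the summed vector, depends strongly on the interval used.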

Scale invariance of gene-wise velocity components

According to the RNA velocity formalism:

$$\frac{du_g}{dt} = \alpha_g - \beta_g u_g, \qquad \frac{ds_g}{dt} = \beta_g u_g - \gamma_g s_g = v_g \tag{3}$$

where ug and sg represent the numbers of unspliced and spliced counts for gene g; αg, βg and γg represent the transcription rate, splicing rate and mature mRNA degradation rate, respectively; and vg represents the instantaneous velocity component of gene g.

Eq 3 provides a deterministic (smooth) explanation of gene transcription and splicing events, in which the kinetic rate parameters are assumed to be constant (equal to their mean value over a relatively large time interval). In the absence of temporal measurements (i.e., when working with snapshot u-s count data), the actual time scales over which the assumption of constant kinetic rates would be valid for each gene are not known. In essence, one has to work with a time-independent relation between u and s counts, which we know as the u-s phase portrait of the data. This implies that scaling dt (or equivalently t) by κ does not change the u-s phase portrait of a gene. This scaling factor in the left hand side denominators of Eq 3 can be absorbed into the right hand side (RHS) of the equation, suggesting that if (αg, βg, γg) is a solution, (καg, κβg, κγg) is also a solution for any κ. This complicates deduction of the relative scaling of different genes, as was also shown in previous studies [8]. To get a valid high-dimensional velocity vector V, one needs to know the real scaling factor κg for each gene:

$$V = \sum_{g=1}^{G} v_g = \sum_{g=1}^{G} \kappa_g \frac{ds_g}{dt}\, \hat{e}_g \tag{4}$$

where $\hat{e}_g$ represents the unit vector for gene g.
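The scale invariance described above can be checked numerically. The sketch below uses the standard closed-form solution of Eq 3 for constant rates and an induction starting at (u, s) = (0, 0); the parameter values are arbitrary illustrative choices.

```python
import numpy as np

def us_solution(alpha, beta, gamma, t):
    """Closed-form solution of Eq 3 for induction from (u, s) = (0, 0)."""
    u = alpha / beta * (1.0 - np.exp(-beta * t))
    s = (alpha / gamma * (1.0 - np.exp(-gamma * t))
         + alpha / (gamma - beta) * (np.exp(-gamma * t) - np.exp(-beta * t)))
    return u, s

alpha, beta, gamma, kappa = 5.0, 1.0, 0.5, 3.0   # arbitrary example values
t = np.linspace(0.0, 10.0, 50)

u_ref, s_ref = us_solution(alpha, beta, gamma, t)
# All rates multiplied by kappa, time compressed by the same factor
u_fast, s_fast = us_solution(kappa * alpha, kappa * beta, kappa * gamma, t / kappa)

# Gene-wise velocity components ds/dt for both parameter sets
v_ref = beta * u_ref - gamma * s_ref
v_fast = kappa * beta * u_fast - kappa * gamma * s_fast
```

Both parameter sets trace the same points on the u-s phase portrait, while the velocity components differ by exactly the factor κ, which is why the phase portrait alone cannot determine the relative scaling of genes.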

To overcome the scale invariance, velocyto assumes κβ = 1 (i.e. same splicing rate) for all genes. scVelo assumes that the time of the observed kinetics (i.e. turning on, reaching stationary state and turning off) on the u-s phase portrait is equal for all genes (using a default constant of 20 hours). They then fit a latent time between 0 and this constant to the cells on the phase portrait of each gene, and scale the other kinetic parameters accordingly.

Having a global (i.e., gene independent for each cell) latent time would put the time scales of different genes’ u-s phase portraits in perspective and resolve the scale invariance issue. However, we note that optimisation of the cells’ global latent time is not part of the expectation maximisation procedure in scVelo. Rather, after fitting the gene-specific parameters (including the latent times of the cells), scVelo uses a multi-step ad-hoc voting method among the fitted latent times from multiple high-likelihood genes to calculate a global latent time for the cells. This approach does not realistically address the relative scale of different genes, and the full cycle time for all genes remains equal by assumption.

Instead, we suggest that a proxy for typical travel times between cell states could be used as global latent time. For example, one could use pseudotime or the cell density scaled version of it called universal time [12] as proxy for actual transition time between cell states. Here, the accuracy of pseudotime recovery for multiple branches of typical differentiation processes would be crucial for estimation of the velocity parameters.

In κ-velo, we circumvent prior recovery of global latent times and use an equivalent per-gene approach. Here, we use the number of cells between two cell states as a proxy for the typical travel time between them on the gene specific u-s phase portraits. This approach assumes that the probability of capturing cells in a given expression state is proportional to the time cells spend in that state. In eco-velo we take a different approach, which does not decompose the gene-wise velocity components in the first place but, similarly to velocyto, relies on strongly simplifying assumptions on the kinetic rate parameters.

First approach: κ-velo

Our first approach recovers the full dynamics of splicing kinetics and addresses the scale invariance problem by using a proxy of travel time between cell states. In this subsection, we drop the gene-wise indices g as we address the scaling factor κ for one gene at a time.

Consider one gene with true parameters of reaction rate θtrue = (κα, κβ, κγ). In a first step, we recover an arbitrary solution of the reaction rate parameters with β = 1, i.e θ = (α, 1, γ), and in a second step we recover the κ which scales this solution to its actual magnitude relative to the other genes. Below we elaborate on each of the two steps.

The analytical solutions to Eq 3 are given by: [2]

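$$u(t) = u_0\, e^{-\beta (t - t_0)} + \frac{\alpha}{\beta}\left(1 - e^{-\beta (t - t_0)}\right)$$

$$s(t) = s_0\, e^{-\gamma (t - t_0)} + \frac{\alpha}{\gamma}\left(1 - e^{-\gamma (t - t_0)}\right) + \frac{\alpha - \beta u_0}{\gamma - \beta}\left(e^{-\gamma (t - t_0)} - e^{-\beta (t - t_0)}\right) \tag{5}$$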

where t ∈ (t1, …, tn) is the gene specific latent time assigned to each cell and u0 = u(t0), s0 = s(t0) are the initial conditions. Transcriptional regulation is inscribed in α, which is set to 0 at downregulation. Cells can then either be in up- or downregulation, as encoded in the parameter ki, with k = 1 at upregulation and k = 0 at downregulation. We set (u0, s0) = (0, 0) in the upregulation phase (k = 1) and (u0, s0) = (u(tswitch), s(tswitch)) in the downregulation phase (k = 0).

We note that if t was given as a global (gene independent) latent time assigned to each cell, the scale invariance problem would already be resolved. However, in practice we do not have t.

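From the solution u(t), we get $e^{-\beta (t - t_0)} = \frac{\beta u(t) - \alpha}{\beta u_0 - \alpha}$. Therefore, $e^{-\gamma (t - t_0)} = \left(e^{-\beta (t - t_0)}\right)^{\gamma/\beta} = \left(\frac{\beta u(t) - \alpha}{\beta u_0 - \alpha}\right)^{\gamma/\beta}$. Substituting this expression into s(t), we get the time-independent relation between unspliced and spliced counts. For β = 1 specifically, we get: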

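$$s(u) = \left(s_0 - \frac{\alpha}{\gamma} + \frac{\alpha - u_0}{\gamma - 1}\right)\left(\frac{u - \alpha}{u_0 - \alpha}\right)^{\gamma} + \frac{u - \alpha}{\gamma - 1} + \frac{\alpha}{\gamma} \tag{6}$$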

This is the form of a function s(u) which we can directly fit to the data points in a u-s phase portrait.
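As a consistency check, Eq 6 can be evaluated along the parametric solution of Eq 5. The following is a minimal sketch for the upregulation phase with β = 1 and (u0, s0) = (0, 0); the function name and parameter values are illustrative assumptions, and the case γ = 1 (where Eq 6 is singular) is not handled.

```python
import numpy as np

def s_of_u(u, alpha, gamma, u0=0.0, s0=0.0):
    """Eq 6: spliced counts as a function of unspliced counts (beta = 1)."""
    ratio = (u - alpha) / (u0 - alpha)        # equals exp(-(t - t0)) when beta = 1
    coef = s0 - alpha / gamma + (alpha - u0) / (gamma - 1.0)
    return coef * ratio**gamma + (u - alpha) / (gamma - 1.0) + alpha / gamma

alpha, gamma = 5.0, 0.5                       # arbitrary example values
t = np.linspace(0.0, 5.0, 40)

# Parametric solution (Eq 5) with beta = 1, t0 = 0 and (u0, s0) = (0, 0)
u_t = alpha * (1.0 - np.exp(-t))
s_t = (alpha / gamma * (1.0 - np.exp(-gamma * t))
       + alpha / (gamma - 1.0) * (np.exp(-gamma * t) - np.exp(-t)))

# Time-independent relation (Eq 6) evaluated at the same u values
s_from_u = s_of_u(u_t, alpha, gamma)
```

The downregulation branch follows from the same function by setting alpha to 0 and passing the switching point as (u0, s0).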

In practice, there is one more amendment needed for parameter fitting. As current procedures for assignment of the sequence reads to either unspliced or spliced mRNA are biased towards spliced assignments and heavily underestimate the unspliced counts, a function of the form Eq 6 cannot approximate the data unless we upscale the measured u counts by a (gene-specific) factor mg (see Note A in S1 Appendix). Thus instead of Eq 3 we now have:

$$m_g \frac{du_g}{dt} = \alpha_g - m_g \beta_g u_g, \qquad \frac{ds_g}{dt} = m_g \beta_g u_g - \gamma_g s_g = v_g \tag{7}$$

We note that, through necessity from current data qualities, scVelo also scales u, but scales u to have the same variance as s [2]. In fact scaling in this way is equivalent to setting κγ/κβ ≈ 1 (see Note A in S1 Appendix). We also note that upscaling u by mg is different from separate normalisation as here the counts of that gene are multiplied by the same constant for all cells, whereas a separate normalisation will affect cells differently for the same gene. With mg, Eq 6 becomes:

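$$s(u) = \left(s_0 - \frac{\alpha}{\gamma} + \frac{\alpha - m u_0}{\gamma - 1}\right)\left(\frac{m u - \alpha}{m u_0 - \alpha}\right)^{\gamma} + \frac{m u - \alpha}{\gamma - 1} + \frac{\alpha}{\gamma} \tag{8}$$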

which we use for fitting to the u-s data and inference of the parameters (α, γ, uswitch, m) (see Note B in S1 Appendix for the details of our expectation maximisation (EM) procedure).

Once the EM is done, we recover the time scale κ for each gene. Let Δtij be a measure of time that can be used to relate time between two states i, j across genes, with i before j in time. Consider one gene with true reaction rate parameters θtrue = (κα, κβ, κγ) and recovered parameters θ = (α, β, γ), and let ui, uj be the measured unspliced counts for cells i, j. Note that for that gene, i and j need to be in the same state of transcriptional induction or repression, because the speed of a gene is only measurable during transcriptional change, i.e. outside of steady-state. If the cells spend time in steady state, the change in transcriptional state will not be proportional to the distance in time, which is why we only consider cells in the same state.

Considering the time scale κ in the solution for u(t) in Eq 5, yields for two measurements from cell i and j:

$$m u_j = m u_i\, e^{-\beta \kappa \Delta t_{ij}} + \frac{\alpha}{\beta}\left(1 - e^{-\beta \kappa \Delta t_{ij}}\right) \tag{9}$$

Solving for κΔtij we get:

$$\kappa \Delta t_{ij} = \frac{1}{\beta} \log \frac{m u_i - \alpha/\beta}{m u_j - \alpha/\beta} \tag{10}$$

with β = 1, and m and α inferred from EM. As a proxy for the true Δtij values (which we do not have), we use the number of cells that occur between cells i and j, which we call d(i, j). The rationale is that, in snapshot data, the probability of capturing cells in a specific region of the expression space is proportional to the time cells typically spend in that region. This assumption serves as a valid approximation for most single-cell datasets, but is undermined in the presence of non-uniform cell proliferation and death rates as well as biased sampling of cell types (e.g. enrichment for specific cell types).

Let us call the RHS of Eq 10 f(i, j). For cell pairs that are in the same transcriptional phase (i.e., the upregulation or the downregulation phase), f(i, j) has a linear relation to d(i, j), with the slope given by κ. However, if either (or both) of the cells is in steady-state, f will be smaller than expected from this linear relation. Thus, plotting f(i, j) versus d(i, j) for random pairs of i, j produces a parallelogram whose left slope equals κ. To recover κ, we fit a parallelogram with minimum area to the data points, while maximising the number of points inside the parallelogram (Note C in S1 Appendix and S2 Fig).
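The relation between f(i, j) and d(i, j) can be illustrated on noise-free simulated data. The sketch below is a deliberate simplification: all simulated cells are in the upregulation phase and none is in steady state, so an ordinary least-squares slope through the origin is used in place of the parallelogram fit of Note C. The function names, the use of the rank difference as d(i, j), and all parameter values are assumptions made for illustration.

```python
import numpy as np

def simulate_u(alpha, kappa, t):
    """Unspliced counts during upregulation from u0 = 0 (beta = 1, m = 1)."""
    return alpha * (1.0 - np.exp(-kappa * t))

def slope_f_vs_d(u, alpha, n_pairs=2000, seed=1):
    """Through-origin slope of f(i, j) (RHS of Eq 10) against d(i, j)."""
    rng = np.random.default_rng(seed)
    order = np.argsort(u)
    rank = np.empty_like(order)
    rank[order] = np.arange(len(u))          # position of each cell along the dynamics
    i = rng.integers(0, len(u), n_pairs)
    j = rng.integers(0, len(u), n_pairs)
    keep = rank[i] < rank[j]                 # keep only pairs with i before j
    i, j = i[keep], j[keep]
    f = np.log((u[i] - alpha) / (u[j] - alpha))
    d = (rank[j] - rank[i]).astype(float)    # cell-count proxy for the time difference
    return np.sum(f * d) / np.sum(d * d)

rng = np.random.default_rng(0)
t = rng.uniform(0.0, 2.0, 500)               # hidden true times, shared by both genes

slope_gene1 = slope_f_vs_d(simulate_u(5.0, 1.0, t), alpha=5.0)   # true kappa = 1
slope_gene2 = slope_f_vs_d(simulate_u(8.0, 4.0, t), alpha=8.0)   # true kappa = 4
kappa_ratio = slope_gene2 / slope_gene1
```

Each slope is expressed in units of "per cell" rather than per unit time, but because both genes are measured in the same cells, the ratio of the slopes recovers the relative scaling of the two genes.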

Here, we inferred κ from the unspliced counts data. One could similarly use the spliced counts data and infer κ from the s(t) solution in Eq 5, which yields κ estimations congruent with those inferred from the unspliced data (see Note D in S1 Appendix and S3 Fig). However, as u(t) depends only on α, β while s(t) depends on α, β, γ, i.e. on one more imputed parameter, we consider recovery of κ values from u(t) as more straightforward and less error-prone.

After determination of the gene-wise κ, we obtain the high-dimensional, correctly scaled parameters Θ = (A, B, Γ), with Ag = κgαg, Bg = κgβg and Γg = κgγg. We collect the gene-wise unspliced count scaling parameters mg into M. For calculating the high-dimensional velocity for cell i, we thus use $V_i = B M U_i - \Gamma S_i$, where Ui and Si respectively represent the G-dimensional u and s counts in cell i (G being the number of genes).
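A minimal sketch of this final step is given below, assuming that B, M and Γ act gene-wise (i.e., as diagonal scalings) and that cells are stored as rows. The function name and the toy numbers are hypothetical.

```python
import numpy as np

def cell_state_velocities(U, S, kappa, beta, gamma, m):
    """V_i = B M U_i - Gamma S_i with gene-wise (diagonal) B, M and Gamma."""
    B = kappa * beta              # B_g = kappa_g * beta_g
    Gamma = kappa * gamma         # Gamma_g = kappa_g * gamma_g
    return B * m * U - Gamma * S  # broadcasts the gene-wise factors over cells (rows)

# Toy example: 2 cells x 2 genes
U = np.array([[1.0, 2.0], [0.5, 0.0]])
S = np.array([[2.0, 1.0], [1.0, 3.0]])
kappa = np.array([1.0, 2.0])
beta = np.ones(2)                 # beta = 1 in the recovered solution
gamma = np.array([0.5, 1.0])
m = np.array([2.0, 1.0])

V = cell_state_velocities(U, S, kappa, beta, gamma, m)
```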

Second approach: Eco-velo

Our second approach eco-velo estimates cell state velocities directly in the high-dimensional gene space by calculating the displacement for each cell in a fixed time interval. This approach eliminates the need for cumbersome and error-prone gene-wise parameter estimations. It also specifies the time interval over which high-dimensional velocities are reported, a feature that the gene-wise parameter estimation approaches (including κ-velo) are missing. Specification of the velocity estimation time interval can be important for data sets that include multiple non-smooth-dynamics genes where short-term cell velocities can deviate significantly from their long-term velocity directions.

Starting from Eq 3, for the change of the spliced counts of gene g over Δt we can write:

$$v_g = \beta_g u_g(t) - \gamma_g s_g(t)$$

$$s_g(t + \Delta t) = s_g(t) + v_g \Delta t = s_g(t)\,(1 - \gamma_g \Delta t) + \beta_g u_g(t)\, \Delta t \tag{11}$$

By fixing Δtg = 1/γg (this is the time in which existing spliced reads for gene g will be degraded) we get:

$$s_g(t + \Delta t_g) = \beta_g u_g(t)\, \Delta t_g = \frac{\beta_g}{\gamma_g}\, u_g(t) \tag{12}$$

This means that knowing βg/γg is sufficient to estimate the cell state displacements over Δtg. If we further assume all genes have a similar β and γ, we can conclude that the unspliced counts u(t) in a cell are proportional (with a constant factor β/γ) to its spliced counts at the later time point (t + 1/γ).

The assumption of similar γ as well as β across genes allows us to avoid decomposing high-dimensional velocities into gene-wise components for velocity estimation and then recombining the estimated components. This leads to another level of simplification that turns out to be very handy as a heuristic velocity estimation from u and s counts, where we can find cell state displacements by mapping U to S. We do so by searching for the nearest neighbours (NNs) of a cell's U profile among the S profiles, retaining only those S profiles that in turn have this U profile among their first k nearest neighbours in U. We call these pairs mutual nearest neighbours (MNNs). Note that not every point needs to have MNNs. The velocity arrow then goes from a cell’s position in S space to the mean of the first k MNNs in S of that same cell’s U profile. Here, u and s counts can be used directly for estimating cell state velocity directions without any need for smoothing and parameter fitting.
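The mutual nearest neighbour search described above can be sketched with plain NumPy as follows. This is an illustrative brute-force version, not the eco-velo implementation: the function name, the use of Euclidean distance and the toy data are assumptions, and in practice the inputs would be the processed U and S matrices.

```python
import numpy as np

def eco_velo_displacements(U, S, k=10):
    """Displacement of each cell towards the mean of its mutual nearest neighbours."""
    n = U.shape[0]
    # D[i, j]: distance between the unspliced profile of cell i and the spliced profile of cell j
    D = np.linalg.norm(U[:, None, :] - S[None, :, :], axis=2)

    s_near_u = np.zeros((n, n), dtype=bool)   # s_j is among the k nearest S profiles of u_i
    s_near_u[np.arange(n)[:, None], np.argsort(D, axis=1)[:, :k]] = True

    u_near_s = np.zeros((n, n), dtype=bool)   # u_i is among the k nearest U profiles of s_j
    u_near_s[np.argsort(D, axis=0)[:k, :], np.arange(n)[None, :]] = True

    mutual = s_near_u & u_near_s
    displacement = np.full(S.shape, np.nan)   # NaN marks cells without any MNN
    for i in range(n):
        partners = np.flatnonzero(mutual[i])
        if partners.size > 0:
            displacement[i] = S[partners].mean(axis=0) - S[i]
    return displacement, mutual

# Toy data: 6 cells on a line; the unspliced profile of each cell equals the
# spliced profile of the next cell, so the expected displacement is +1
S_toy = np.arange(6, dtype=float)[:, None]
U_toy = S_toy + 1.0
disp, mnn = eco_velo_displacements(U_toy, S_toy, k=1)
```

In the toy example the last cell has no mutual partner and therefore receives no arrow, which mirrors the remark that not every point needs to have MNNs.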

The strong assumptions of eco-velo (similar γ and β across genes) may not hold for every biological process and every subset of genes. Thus, one would ideally select a set of genes that are only transcriptionally regulated (via α), but not post-transcriptionally regulated (involving gene-specific β and γ rates). An example of such a case seems to occur in Fig 1E of the original RNA velocity paper [1], where the authors observed for bulk RNA-seq measurements of cell cycle genes in the mouse liver over a time course of the circadian cycle, that unspliced mRNAs appear predictive of spliced mRNA at the next time point with a similar signal intensity coefficient. Conditioned on its assumptions, eco-velo (in contrast to the methods based on gene-wise parameter estimation) specifies the time interval of the reported velocities and also skips several error-prone parameter estimation and data smoothing steps. Which method is more appropriate to use depends on how well the assumptions of each method are satisfied for the given experimental setting and data quality, as well as on the purpose of the velocity analysis (e.g. estimating the overall velocity directions or obtaining the average cell state velocities over a specific time scale).

Visualisation

La Manno et al. [1] suggested using projection of the end of the velocity vectors (s+vΔt) with Δt = 1 on an embedding of the spliced counts. While projection using principal component analysis (PCA) (Note E in S1 Appendix) is the most accurate low-dimensional representation of cell state velocities, it usually does not capture the full complexity of differentiation manifolds with several subpopulations in high-dimensional gene space. Projection of the velocities onto non-parametric nonlinear embeddings (which do not have gene-defined axes) is more challenging. To work around this difficulty, velocyto projects the velocities in a direction relative to the neighbouring cells. This is done by computing a transition probability matrix P containing probabilities of cell-to-cell transitions in accordance with the velocity vector: $P_{ij} = \exp\left(\frac{\mathrm{corr}(\rho(s_j - s_i), \rho(v_i))}{\sigma^2}\right)$, with σ the kernel width parameter, $\rho(x) = \mathrm{sgn}(x)\sqrt{|x|}$ a variance-stabilising transformation and corr() the Pearson correlation coefficient. The matrix is row-normalised so that ∑j Pij = 1. Given n observations and Yi the position of cell i on a K-dimensional embedding, the projected end of the velocity vector for cell i is calculated as Yi+ΔYi, where:

$$\Delta Y_i = \sum_{j} \left(P_{ij} - \frac{1}{n}\right) \frac{Y_j - Y_i}{\lVert Y_j - Y_i \rVert} \tag{13}$$

To project the velocities, scVelo uses a similar approach to velocyto but with a slightly different P matrix that computes the cosine similarity (the uncentred analogue of the Pearson correlation) directly on the Δsij and vi vectors without using the ρ(x) transformation, via $P_{ij} = \exp\left(\frac{\cos(s_j - s_i, v_i)}{\sigma^2}\right)$. The vector summation proposed in Eq 13 and used in velocyto and scVelo is questionable for three reasons. First, this approach is not faithful to the length of the velocity vectors, e.g., two velocity vectors with the same direction but different lengths (in the same neighbourhood) in the high-dimensional space will be visualised with similar lengths. That is because they will be assigned the same Pij, as neither the Pearson correlation nor the cosine similarity respects the length of the vectors. Second, $\frac{Y_j - Y_i}{\lVert Y_j - Y_i \rVert}$ does not in general provide an orthonormal basis, as the directions of several neighbouring cells relative to cell i can be correlated on the low-dimensional embedding. As a result, this approach may change the direction of the velocity vectors depending on how much the orthonormality principle is disturbed for a given neighbourhood. For example, if the chosen neighbourhood extends further along the differentiation path than across it, velocities will be visualised as smoother vectors along the path. Third, $\left(P_{ij} - \frac{1}{n}\right)$ can be negative even if the velocity direction vi is correlated with the direction of a neighbouring cell j, which is not correct.
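The first of these points can be verified with a small sketch of the velocyto-style projection of Eq 13. The code below is written from the formulas in the text and is not taken from the velocyto or scVelo code bases; using all other cells as the neighbourhood, setting n to the neighbourhood size, dividing the correlation by σ², and the random toy data are all assumptions made for illustration.

```python
import numpy as np

def rho(x):
    """Variance-stabilising transformation sgn(x) * sqrt(|x|)."""
    return np.sign(x) * np.sqrt(np.abs(x))

def pearson(a, b):
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def projected_displacement(i, S, v_i, Y, sigma=0.5):
    """Eq 13 for cell i, using all other cells as its neighbourhood."""
    others = np.array([j for j in range(len(S)) if j != i])
    corr = np.array([pearson(rho(S[j] - S[i]), rho(v_i)) for j in others])
    P = np.exp(corr / sigma**2)
    P /= P.sum()                                   # row normalisation
    directions = Y[others] - Y[i]
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    return ((P - 1.0 / len(others))[:, None] * directions).sum(axis=0)

rng = np.random.default_rng(0)
S = rng.normal(size=(30, 8))     # hypothetical spliced counts (cells x genes)
Y = rng.normal(size=(30, 2))     # hypothetical 2D embedding of the same cells
v = rng.normal(size=8)           # hypothetical velocity vector of cell 0

dY_short = projected_displacement(0, S, v, Y)
dY_long = projected_displacement(0, S, 10.0 * v, Y)   # same direction, ten times longer
```

Because the correlation is invariant to a positive rescaling of the velocity vector, the two projected arrows coincide even though the high-dimensional velocities differ tenfold in length.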

Nyström projection (velocity visualisation for κ-velo)

To deal with visualisation of complex data manifolds which require nonlinear embeddings, in κ-velo we propose using the Nyström projection, which is more faithful to the actual high-dimensional estimated cell state velocities than the current practices. We use a nonlinear visualisation of the (normalised) spliced counts of the single cells as the start of the velocity vectors and project the end points of the velocity vectors onto this existing embedding using the Nyström method. Nyström projection has also been used for other single cell data integration applications, e.g. in [13, 14]. The choice of nonlinear embedding is arbitrary and can be, for example, diffusion maps [15], t-distributed stochastic neighbor embedding (t-SNE) [16] or uniform manifold approximation and projection (UMAP) [17].

If a K dimensional embedding Ytrain has been created for ntrain data points Xtrain and we want to project a set of ntest points Xtest on the existing map, we first compute a transition probability matrix between the new and old data points, P′ of size [ntest, ntrain] calculated as:

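$$Z(i) = \sum_{j=1}^{n_{train}} \exp\left(-\frac{\lVert x_i - x_j \rVert^2}{2\sigma_i^2}\right), \qquad x_i, x_j \in X_{train}$$

$$P'(i,j) = \frac{1}{Z(i)} \exp\left(-\frac{\lVert x_i - x_j \rVert^2}{2\sigma_i^2}\right), \qquad x_i \in X_{train},\; x_j \in X_{test} \tag{14}$$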

Note that when the test data is exactly the same as the training set (Xtrain = Xtest), P′ would ideally be the same transition matrix as the one used for generating the train set embedding. To this end, one would use the same parameters σi as were used in constructing the transition matrix for the train set embedding (see Note F in S1 Appendix for the special case of projection on diffusion maps). The projection of new points Ytest is then given by:

$$Y_{test} = P'_{[n_{test} \times n_{train}]}\, Y_{train} \tag{15}$$

In our application, ntrain equals ntest as each velocity vector has a start as well as an end point.

For cell i Eq 15 implies:

$$Y_{test}(i,k) = \sum_{j} P'(i,j)\, Y_{train}(j,k), \qquad i \in \{1, \dots, n_{test}\},\; j \in \{1, \dots, n_{train}\},\; k \in \{1, \dots, K\} \tag{16}$$

This looks to some extent similar in form to the previously used velocity projection methods in Eq 13. One major difference, however, is that we calculate the end of the velocity arrow on the embedding space rather than the displacement, hence avoiding the collapse of velocity vectors with different lengths onto the same visualised length. Another advantage is that here Ytrain(j, k) more likely presents an orthonormal basis considering all data points, and is hence less affected by the shape of neighbourhoods chosen arbitrarily and independently of the generation of the reference embedding. For some embedding methods like diffusion maps (generally those which perform an analytical embedding optimisation, in contrast to the methods using iterative optimisation techniques such as gradient descent), the embedding Ytrain(j, k) is indeed guaranteed to be orthonormal (i.e., $\sum_j Y_{train}^2(j,k) = 1$ and $\sum_j Y_{train}(j,k_l)\, Y_{train}(j,k_m) = 0$ for $k_l \neq k_m$). Lastly, all terms in P′ are positive, making the projected point a weighted average of the data points in the train set.

Note that the Nyström theorem is only valid for projection of test data points which are close enough to the data points existing in the training set. That is, extrapolation for test data to expression regions which have not been sampled in the training set is not possible. In κ-velo, we ensure closeness of the end point of the velocity vector to the existing data manifold of spliced counts by adequately down scaling all inferred high-dimensional velocities by the same factor.
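The projection of Eqs 14 to 16 can be sketched in a few lines. The version below makes two simplifying assumptions that should be stated clearly: it uses a single global kernel width σ instead of point-specific σi, and it normalises each row of P′ over the training points so that every projected point is a weighted average of training embedding coordinates. The function name, the toy data and the down-scaling factor are hypothetical.

```python
import numpy as np

def nystrom_project(X_train, Y_train, X_test, sigma):
    """Place test points on an existing embedding as kernel-weighted averages."""
    # Squared distances between every test point (rows) and training point (columns)
    sq_dist = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq_dist / (2.0 * sigma**2))
    P = W / W.sum(axis=1, keepdims=True)      # each row sums to 1
    return P @ Y_train, P

# Toy data: 5 cells on a line with a hypothetical 1D embedding
X_train = np.arange(5, dtype=float)[:, None]
Y_train = 2.0 * X_train
V = np.full_like(X_train, 4.0)                # hypothetical high-dimensional velocities
scale = 0.1                                   # common down-scaling factor for all velocities

# Start points: projecting the training data itself recovers the embedding
Y_start, _ = nystrom_project(X_train, Y_train, X_train, sigma=0.2)
# End points: projecting the down-scaled ends of the velocity vectors
Y_end, P = nystrom_project(X_train, Y_train, X_train + scale * V, sigma=0.2)
```

In this toy example the last cell, whose velocity points out of the sampled region, is projected back onto the edge of the embedding, which illustrates why extrapolation beyond the training manifold is not possible.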

In light of the above, linear projection, e.g. by PCA (Note E in S1 Appendix), although not capable of capturing the complexity of datasets which consist of multiple branching events and subpopulations, remains the only approach in which the visualised arrows are a true representation of the high-dimensional velocity vectors. None of the non-parametric nonlinear projection approaches can deal with projection of out-of-distribution data points, implying that near the boundaries of the differentiation paths, where actual velocities may point in directions going out of the existing manifold of the start points of the velocity vectors, velocity visualisations will be less reliable. Moreover, embedding methods which may not keep the continuity of the data manifold (e.g. t-SNE and UMAP) are more prone to artefacts from the projection of out-of-distribution data points.

Even though our non-linear projection method does not explicitly depend on the dimension of the train and test data sets, we recommend using the same gene space for projecting the velocities (i.e., for computing $P'_{[n_{test} \times n_{train}]}$) as the gene space that was used for generating the trained embedding, i.e., we use the spliced counts matrix of the filtered gene set S as the training data in κ-velo. This ensures that the embedding only represents a space that can be spanned by following the velocity directions, thus making the embedding a closed set under addition of velocity vectors. Therefore, we calculate the embedding on the same space used for parameter recovery and velocity estimation. This also means that if we use imputed counts for parameter recovery, we calculate the low-dimensional embedding on those imputed counts.

Visualisation for eco-velo

For eco-velo, visualisation of velocities is integrated within the inference of the velocities and hence does not require visualisation by projection. We identify the first k mutual nearest neighbours (MNNs) [18] of U and S for every cell, which we use to visualise the velocities on a low-dimensional embedding of the spliced counts. We simply draw an arrow starting from the position of a cell on the embedding to the mean of the coordinates of its first k MNNs on the same embedding. That means that our velocity arrows point from $s_i$ to $\sum_{j}^{k} s_j / k$, where these sj are the first k MNNs of ui for cell i. These arrows correspond to a relatively large Δt in which all current spliced counts in the cell would be degraded. For ease of visualisation and to obtain an un-cramped map without intersecting cell velocities, we then scale all velocities by the same factor so that the arrows only point in the direction of the future state and not all the way to it.

Processing

Before calculating the velocities, single-cell RNAseq datasets are preprocessed (aligning the reads and counting numbers of unspliced and spliced reads) and subsequently processed (filtering, normalisation, etc.). Both the κ-velo and the eco-velo workflows start with processing raw U and S count matrices. Since the methods are based on different assumptions, the processing steps differ per method. Below, we will describe the processing protocol for both approaches.

Processing pipeline of κ-velo

To reduce the number of dimensions of the dataset, we select only genes with high variability. Variability is calculated on the spliced counts using analytic Pearson residuals [19]. We then filter out genes with extremely low u or s counts because we want to focus only on genes with significant velocity signal. After gene filtering, the counts in each cell are size-normalised. Since the size of a cell is represented by its u and s counts together, the counts in each cell are normalised using the sum of the counts for u and s. To recover the dynamics, the noise in the u and s counts has to be reduced. To this end, all counts are imputed by averaging the counts of each cell’s nearest neighbours. The nearest neighbours for each cell are found in PCA space calculated on scaled s counts. For a more detailed description of each step see Note G in S1 Appendix and S4 Fig.
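
As a rough illustration of the joint size normalisation and the nearest-neighbour imputation described above, consider the following NumPy sketch. It is not the authors' implementation: the function names, the choice of the median total count as the target size, and the defaults for `n_neighbours` and `n_pcs` are assumptions made for the example, and the dense pairwise distance matrix is only practical for small datasets.

```python
import numpy as np

def normalise_jointly(U, S):
    """Scale each cell (row) by its total u + s counts.

    The target size (median total over cells) is an assumption of this sketch.
    Cells with zero total counts are assumed to have been removed.
    """
    total = U.sum(axis=1) + S.sum(axis=1)
    factor = np.median(total) / total
    return U * factor[:, None], S * factor[:, None]

def impute_knn(U, S, n_neighbours=30, n_pcs=15):
    """Average u and s over each cell's nearest neighbours in PCA space."""
    # Scale the spliced counts gene-wise before PCA.
    Z = (S - S.mean(axis=0)) / (S.std(axis=0) + 1e-12)
    # PCA via singular value decomposition.
    left, singular_values, _ = np.linalg.svd(Z, full_matrices=False)
    pcs = left[:, :n_pcs] * singular_values[:n_pcs]
    # Squared Euclidean distances between all pairs of cells.
    d2 = ((pcs[:, None, :] - pcs[None, :, :]) ** 2).sum(axis=-1)
    # Each neighbourhood includes the cell itself (distance zero).
    idx = np.argsort(d2, axis=1)[:, :n_neighbours]
    return U[idx].mean(axis=1), S[idx].mean(axis=1)
```

Because both matrices are divided by the same per-cell factor, the ratio between u and s within a cell is left unchanged, which is the property the joint normalisation is meant to preserve.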

Processing pipeline of eco-velo

Similar to κ-velo processing, the eco-velo workflow starts by filtering the dataset for genes with high variability and sufficient u and s counts. After this, all non-zero counts are log-transformed and both count matrices are normalised separately. Here, we deviate from the κ-velo protocol, because u and s counts are treated as separate modalities. Following standard MNN protocols [18], the counts are L2 normalised.
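
The two transformations named above could look as follows. This is a minimal sketch under stated assumptions: the text does not specify the exact form of the log-transform, so the example takes the natural logarithm of the non-zero entries and leaves zeros untouched (a `log1p` transform would be a common alternative), and it applies the L2 normalisation per cell. The function names are hypothetical.

```python
import numpy as np

def log_nonzero(X):
    """Log-transform the non-zero entries of X and leave zeros at zero."""
    X = np.asarray(X, dtype=float)
    out = np.zeros_like(X)
    mask = X > 0
    out[mask] = np.log(X[mask])
    return out

def l2_normalise(X):
    """Divide every cell (row) by its Euclidean norm; all-zero rows stay zero."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty cells
    return X / norms

def process_eco_velo(U, S):
    """Treat U and S as separate modalities and process them independently."""
    return l2_normalise(log_nonzero(U)), l2_normalise(log_nonzero(S))
```

After the L2 normalisation, Euclidean distances between cells depend only on the angle between their expression profiles, which makes the unspliced and spliced matrices comparable despite their different overall magnitudes.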

Overview of the workflow for κ-velo and eco-velo

Both the κ-velo and the eco-velo workflows consist of three main steps: processing, velocity calculation and visualisation (Fig 1). First the data is processed as described in Section “Processing”. In κ-velo, after processing, we recover the scaled parameters ακ, βκ and γκ for all genes in the dataset. For downstream velocity analysis, only genes with a likelihood above a certain threshold are used. All other genes are filtered out to reduce the technical noise caused by poorly recovered or noisy genes. Additionally, the user is provided with an option to remove genes where the order of clusters in the recovered dynamics does not match the known hierarchy of the cell types (e.g. when an assigned upregulation starts at the most differentiated cells and ends in the progenitor population). Using the scaled parameters, a high-dimensional velocity vector is calculated for each cell. To visualise the cells and velocities, we compute an embedding (e.g. PCA, UMAP) using the processed (i.e. filtered, normalised and imputed) and scaled s counts. Lastly, the velocities are projected onto the embedding.

Fig 1. An overview of RNA velocity analysis steps in the κ-velo and eco-velo workflow.


The eco-velo workflow includes fewer steps. After processing, the u counts are used to find the first five mutual nearest neighbours of each cell in S space. The embedding is calculated using processed (i.e. filtered and normalised) s counts and velocities are projected onto the embedding by averaging the position of the cell’s first five mutual nearest neighbours.
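
The mutual nearest neighbour search between unspliced and spliced profiles can be illustrated with a brute-force sketch. The function below is an assumption-laden example rather than the package's code: it returns, for each cell i, the cells j whose s_j is among the k nearest spliced profiles of u_i while u_i is also among the k nearest unspliced profiles of s_j. It may therefore return fewer than k neighbours, and it does not exclude the cell's own spliced profile.

```python
import numpy as np

def mutual_nearest_neighbours(U, S, k=5):
    """Brute-force MNN search between rows of U (queries) and rows of S."""
    U = np.asarray(U, dtype=float)
    S = np.asarray(S, dtype=float)
    # d[i, j] is the squared Euclidean distance between u_i and s_j.
    d = ((U[:, None, :] - S[None, :, :]) ** 2).sum(axis=-1)
    # For each u_i, the k closest spliced profiles.
    nn_u_to_s = np.argsort(d, axis=1)[:, :k]
    # For each s_j, the k closest unspliced profiles.
    nn_s_to_u = np.argsort(d, axis=0)[:k, :].T
    mnn = []
    for i in range(U.shape[0]):
        # Keep only the neighbours for which the relation is mutual.
        mnn.append([int(j) for j in nn_u_to_s[i] if i in nn_s_to_u[j]])
    return mnn

# Toy example in one dimension: each u lies one step ahead of its own s.
toy_U = np.array([[1.0], [2.0], [3.0]])
toy_S = np.array([[0.0], [1.0], [2.0]])
toy_mnn = mutual_nearest_neighbours(toy_U, toy_S, k=1)
```

In the toy example the first two cells are matched to the cell one step ahead of them, whereas the last cell has no mutual neighbour because its future state has not been sampled.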

Simulation data

For the simulation, we randomly sampled g log-normally distributed parameters of reaction rates, scaled by a scaling factor κ: θ = (κα, κβ, κγ). The true time of the n observations is sampled from a uniform distribution. The time points are such that the final mature steady cell state, for which all genes would reach steady-state, has not been sampled. The u and s counts are simulated following u(t), s(t) with added random normal noise (Note H in S1 Appendix). We simulate the data such that the time of activation of each gene’s transcription is inversely proportional to the gene’s speed. This means that the fastest genes are only active towards the end of the differentiation trajectory. The resulting differentiation trajectory has high velocity variation at the beginning when most genes are not yet committed to change and more deterministic dynamics with higher speed at the end of the trajectory. The motivation for this simulation scenario is to include regions with both high- and low variance velocities and to have velocities for some cells pointing to future states outside of the space observed in the original set. See Note H in S1 Appendix for a more detailed description.
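
A strongly simplified version of such a simulation is sketched below. It assumes the standard splicing model du/dt = α − βu, ds/dt = βu − γs with u(0) = s(0) = 0 and simulates the upregulation phase only; the gene-specific activation times, the downregulation phase and the exact noise model of Note H are not reproduced. All distribution parameters, the uniform sampling of κ and the function names are assumptions of this example.

```python
import numpy as np

def splicing_kinetics(alpha, beta, gamma, t):
    """Analytic u(t), s(t) for constant rates and u(0) = s(0) = 0 (gamma != beta)."""
    u = alpha / beta * (1.0 - np.exp(-beta * t))
    s = (alpha / gamma * (1.0 - np.exp(-gamma * t))
         + alpha / (gamma - beta) * (np.exp(-gamma * t) - np.exp(-beta * t)))
    return u, s

def simulate(n_cells=500, n_genes=10, t_max=1.0, noise_sd=0.05, seed=0):
    rng = np.random.default_rng(seed)
    # Log-normally distributed reaction rates (assumed location and spread).
    alpha = rng.lognormal(mean=1.0, sigma=0.3, size=n_genes)
    beta = rng.lognormal(mean=0.0, sigma=0.3, size=n_genes)
    gamma = rng.lognormal(mean=-0.5, sigma=0.3, size=n_genes)
    # One scaling factor per gene, here drawn uniformly between 1 and 15.
    kappa = rng.uniform(1.0, 15.0, size=n_genes)
    # True time of each observation, sampled uniformly.
    t = rng.uniform(0.0, t_max, size=n_cells)
    u, s = splicing_kinetics(kappa * alpha, kappa * beta, kappa * gamma, t[:, None])
    # Additive normal noise on both modalities.
    u_obs = u + rng.normal(0.0, noise_sd, size=u.shape)
    s_obs = s + rng.normal(0.0, noise_sd, size=s.shape)
    return u_obs, s_obs, t, kappa
```

Scaling all three rates of a gene by κ leaves its curve in the u-s plane unchanged and only changes how fast the curve is traversed, which is why the scaling factors cannot be read off the phase portrait alone.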

Real data

We demonstrate the performance of κ-velo and eco-velo on four different datasets and compare them with the state-of-the-art scVelo. The first dataset is a subset of the pancreatic endocrinogenesis dataset [20]. The second is a subset of the murine gastrulation dataset [21]. Both datasets were obtained using the 10x genomics platform. The third dataset consists of mouse Schwann cell precursors (SCPs) differentiating into chromaffin cells, obtained using SMART-seq2 [22]. For these three datasets, our RNA velocity analysis starts from the U and S count matrices, which were originally analysed in [2] (pancreatic endocrinogenesis), [6] (murine gastrulation) and [1] (chromaffin cells). Lastly, we also analyse a dataset of murine hematopoiesis [23], obtained using the 10x genomics platform. We used velocyto’s sequence alignment and u-s counting pipeline to get the U and S count matrices, as this dataset has not been analysed for RNA velocity before. For all four datasets, we ran the complete κ-velo and eco-velo workflow as described in Section “Overview of the workflow for κ-velo and eco-velo”. See Note I in S1 Appendix for further details of parameter and threshold settings for each dataset.

Results

In this section, we first demonstrate the artefacts of scVelo’s velocity projection on simulation data with known cell state velocities (i.e. no velocity inference step involved) and compare scVelo to visualisation with linear and nonlinear projection methods. We then compare our velocities with the velocities returned by scVelo on simulation. Afterwards, we show computational experiments on real data which support the design of the processing steps we propose and use in this manuscript. In the last section we apply κ-velo and eco-velo on real datasets: first a pancreas endocrinogenesis dataset and then a hematopoiesis dataset. To validate the method on different sequencing technologies, we also applied it to a dataset of Schwann cell precursors (SCPs) differentiating into chromaffin cells (κ-velo: S5A and S5B Fig and eco-velo: S5C Fig).

PCA and Nyström projection faithfully represent the high-dimensional velocity vectors

Ideally, a visualisation of cell state velocities should faithfully represent all aspects of the high-dimensional vector. The visualisation should respect the direction of velocity vectors as well as their magnitude (speed of change). This can be particularly difficult if the new states are in gene space not yet observed in the original set, e.g. the velocities point further than existing points. The embedding should also preserve local variations, representing fluctuations of the dynamics and cell plasticities. To assess these points, we compare existing RNA velocity visualisation methods with ours, on simulated data where the true high-dimensional velocities are known and do not need to be inferred. We design a simulation to assess all these aspects of the projection. In that simulation the cells follow a hidden true time with a high variance at the beginning and faster transitions towards the end of the trajectory (Section “Simulation data”). The final stable cell state is not yet reached in our simulation, and the velocities of the latest cells point towards not yet observed future states. Projection of the velocities on a PCA embedding (Fig 2A) reliably represents all these aspects. scVelo’s velocity projection on the same PCA embedding (Fig 2B) smooths over the biologically interesting variation and removes the information on speed of change (i.e. disproportionately changes the length of the velocity vectors). The Nyström projection method (Fig 2C) captures the expected cell to cell variation, as well as the direction and length of the simulated velocities on PCA (Fig 2D). The velocities are also well represented when projected on a non-linear embedding such as t-SNE (Fig 2E and S6A and S6B Fig, UMAP shown in S6C and S6D Fig). t-SNE tends to map regions of higher density, e.g. of slower velocities, in gene space to a larger space in the embedding as highlighted by the cells outlined in blue and red.
Consequently, the velocity arrows are also visualised in a scale proportional to the distance of cells in a given region on the embedding, hence looking longer than their true length in gene space. On embeddings that do not distort cell to cell distances in the gene space such as PCA or diffusion maps with a constant kernel width, the length of the velocity arrows is well represented by Nyström projection (S7 Fig diffusion map and Fig 2D PCA). We note that, unlike PCA projection, neither scVelo’s projection nor Nyström projection is able to project the ends of velocity arrows that lie outside the distribution of existing data points.
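
The reason why a PCA projection preserves direction and relative length is that PCA is a linear map: embedding a displaced state s + v and subtracting the embedding of s yields the velocity multiplied by the loading matrix, independently of where the cell lies. The following sketch illustrates this with plain NumPy; the function names are hypothetical and the example is not taken from the authors' package.

```python
import numpy as np

def pca_fit(X, n_components=2):
    """Return the gene-wise mean and the loading matrix (genes x components)."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components].T

def pca_transform(X, mean, loadings):
    """Embed cells: centre with the training mean, then apply the loadings."""
    return (X - mean) @ loadings

def project_velocities(V, loadings):
    """Embed velocity vectors; the mean cancels, so only the loadings are needed."""
    return V @ loadings
```

Since the mean drops out, the end point of an arrow is well defined even when s + v lies outside the region covered by the observed cells, which is the property that the non-linear projections lack.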

Fig 2. Visualisation of simulated velocities with linear and nonlinear projection methods.


A. Velocities projected on PCA embedding. The blue outline highlights a region of high velocity variation and the red outline shows a low-variance, high-velocity region. The arrows in the PCA linear projection capture both the plasticity in direction and magnitude of the velocities. B. Velocities projected on PCA embedding by scVelo. scVelo smooths the velocities as an artefact of the projection method, thereby losing the information on cell state velocity variation as illustrated in the cells outlined in blue. scVelo also loses the information on vector length as shown in the cells outlined in red. C. Summary of velocity projection using the Nyström method. D-E. Velocities projected by the Nyström projection method shown on PCA in (D) and t-SNE in (E).

κ-velo recovers simulated velocities

To ensure that the high-dimensional velocity vector points in the right direction we need to address the scale invariance of gene-wise velocity components (as discussed in Section “Dynamical inference”, Fig 3A). We introduce κ-velo, a method that recovers the full transcriptional dynamics from s as a function of u and thus does not need to fit a hidden latent time to the cells. The method then uses the cell densities as a proxy of time spent in a specific region of the expression space (Fig 3B) to relate velocities across genes and solve the scale invariance issue. To validate our method we simulate reaction kinetics following randomly sampled parameters scaled by a factor κ varied between 1 and 15. The method recovers the scaling factors (Fig 3C). In fact, using cell densities to infer the scaling factors is equivalent to using true time for a given differentiation branch (S8 Fig). Note that the recovery becomes more difficult for higher κ. Very fast genes have few or no cells in transient state so in those cases we would need to sample more cells to reliably recover κ. We note that the scale of recovered κ and true κ is still off by a constant factor related to the chosen Δt, but if all components are scaled by the same factor, the direction of the high dimensional vector is still correct. After scaling, the high dimensional κ-velo velocity vector is much closer to truth (Fig 3D and S9 Fig), than scVelo’s velocity vector. In fact, the errors in the scVelo vectors are proportional to the relative scale of the genes (Fig 3D). Because the high-dimensional vector is not directly conceivable to the human mind, low-dimensional representations of the velocities are usually used for interpretation of the result. We also compare the vectors after projection on a PCA embedding (S10 Fig) and find that they are also much closer to truth, both for direction and length (Fig 3E and S11 Fig). 
Here, for both κ-velo and scVelo, we find the biggest errors in regions of lowest and highest velocities, but scVelo’s errors are much higher than κ-velo’s.
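
The effect of a wrong relative scaling on the direction of the velocity vector, and the two error measures used for comparison, can be illustrated with a toy example. The functions below are a minimal sketch: they compute a row-wise cosine similarity and a plain difference of vector norms, without the variance normalisation described in Note J, and the two-gene vectors are invented for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Row-wise cosine similarity; rows are assumed to be non-zero vectors."""
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return num / den

def norm_difference(a, b):
    """Row-wise difference in vector length."""
    return np.linalg.norm(a, axis=1) - np.linalg.norm(b, axis=1)

def angle_degrees(a, b):
    """Row-wise angle between vectors, clipped for numerical safety."""
    return np.degrees(np.arccos(np.clip(cosine_similarity(a, b), -1.0, 1.0)))

# Toy cell with two genes: gene 2 changes ten times faster than gene 1.
true_v = np.array([[1.0, 10.0]])
unscaled_v = np.array([[1.0, 1.0]])  # relative scale between genes is lost
rescaled_v = 3.0 * true_v            # all components off by one common factor

angle_unscaled = angle_degrees(true_v, unscaled_v)
angle_rescaled = angle_degrees(true_v, rescaled_v)
```

Losing the gene-wise scale rotates the vector, whereas a common factor applied to all components changes only its length and leaves the direction intact.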

Fig 3. Scaling of gene-wise velocity components.


A. If the gene-wise velocities are incorrectly scaled the high-dimensional velocity vector will change direction (displacement angle θ). B. We propose to use cell densities as a proxy of time. For the same time interval, the displacement in u will be proportional to a gene’s speed. This allows us to relate velocities across genes and solve the scale invariance problem. C. To validate κ-velo, we simulate splicing kinetics scaled by a scaling factor κ and evaluate how well the factors are recovered. D. We compare the κ-velo and scVelo velocities to the true velocities for two genes with different speeds. The high-dimensional velocity vectors are normalised to have equal variance for ease of comparison. E. The high-dimensional vector is projected on the first two principal components to evaluate differences between true velocities and recovered velocities. We return the change in direction (cosine similarity) and length (difference in vector norm) (Note J in S1 Appendix) for κ-velo and scVelo. To make the length comparable, the vectors are variance-normalised. Note the log-scale for frequency.
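
The scale invariance illustrated in panels A and B can be made explicit with a short derivation. The block below is a sketch that assumes the standard splicing model with constant rates and, for the density argument, cells sampled uniformly in time over an interval of length T on a monotone part of the trajectory; the notation u_κ, s_κ and p_κ is introduced here for illustration only.

```latex
% Standard splicing model for one gene
\frac{du}{dt} = \alpha - \beta u, \qquad \frac{ds}{dt} = \beta u - \gamma s .

% Rescaling all rates by \kappa only rescales time:
% u_\kappa(t) = u(\kappa t) and s_\kappa(t) = s(\kappa t) solve
\frac{du_\kappa}{dt} = \kappa\,(\alpha - \beta u_\kappa), \qquad
\frac{ds_\kappa}{dt} = \kappa\,(\beta u_\kappa - \gamma s_\kappa),
% so the curve in the (u, s) plane is identical for every \kappa,
% while the velocity at a given point (u, s) is multiplied by \kappa.

% With cells sampled uniformly in time, the density of cells along u is
p_\kappa(u) = \frac{1}{T}\left|\frac{du_\kappa}{dt}\right|^{-1}
            = \frac{1}{\kappa\,T\,\lvert \alpha - \beta u \rvert},
% i.e. a gene that is \kappa times faster shows a \kappa times lower
% cell density in its transient region.
```

Under these assumptions the phase portrait fixes the rates only up to the common factor κ, and the cell density in the transient region supplies the missing information.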

Careful processing prevents introduction of artefacts

To illustrate the importance of processing, we apply our processing pipeline to a dataset of erythroid development during murine gastrulation. Previously, it has been shown that scVelo falsely predicts de-differentiation at the end of erythroid development. This has been attributed to the contribution of genes with multiple rate kinetics (MURK genes) to the velocity calculation [6]. In our processing pipeline, we not only filter out low-variability genes, but also remove genes with insufficient u and s counts. After normalisation, the counts are imputed by averaging spliced or unspliced counts across neighbouring cells, thereby smoothing the data. This usually produces unreliable results for genes with only a few u or s counts (S9 Fig). After filtering, in scVelo’s processing pipeline, the count matrices are normalised separately. This separate normalisation introduces artefacts in the u-s phase portrait (S13A and S13B Fig), which can be traced back to variation in the ratio between total unspliced and total spliced counts between cell types. We found that some of the patterns identifying MURK genes were artefacts of this normalisation (S13A and S13B Fig). Furthermore, many MURK genes in the original publication were imputed from very low counts and are filtered out in our pipeline. Comparing the original processing pipeline to our processing steps, we reduce the number of MURK genes from 98 to 18 (S13C Fig), correcting most of the false de-differentiation.

After recovery of the parameters, we remove low-likelihood genes where the learned parameters do not fit the u-s phase portrait well. This prevents us from including the (usually noisy) genes for which the recovered parameters could be incorrect (S4 Fig: step 5). The calculated velocities for those genes would therefore not accurately reflect true dynamics. Even after filtering of low-likelihood genes, we still find genes where the recovered dynamics do not match the known order of cell types. For example, early upregulation or late downregulation can often not be easily differentiated based on the u-s phase portrait alone (S4 Fig: step 6). This could ultimately lead to incorrect velocity assignments. To avoid this issue, we can use prior information about the temporal order of cell types to perform one more round of filtering if that information is given (see Note G in S1 Appendix: step 6). We use this information to exclude genes where the fitted state assignments of up- or downregulation do not fit the expected state assignments. After both filtering steps, we calculate the low-dimensional embedding on the reduced gene set, so that the embedding only represents space that can be reached by velocities.

κ-velo explains cell state plasticities and speed of transcriptional change in pancreas endocrinogenesis

To test whether κ-velo’s velocity estimations better capture the different time scales of genes, we apply our method to a dataset of developing mouse pancreas cells sampled at embryonic day 15.5 [20]. The endocrine progenitor cells differentiate into four main fates: alpha, beta, delta and epsilon cells. In previous work, scVelo delineated cycling progenitors and the endocrine cell differentiation.

After processing, we recover the reaction rate parameters fitted by scVelo and κ-velo. True splicing rates are difficult to determine and different ranges have been reported [24] but none come close to the more than 10000-fold range reported by scVelo (Fig 4A and S14 Fig). We report a range of splicing rates close to 30-fold (Fig 4A), which is more in line with the reported ranges. After scaling, we can distinguish fast and slow genes based on their κβ. Among the fast genes, we find genes associated with the cell cycle such as Adk, while slow genes are constantly up- or downregulated during the whole differentiation trajectory (Fig 4B). This is consistent with prior expectation as the cell cycle in developing mouse pancreas takes less than a day [25], while pancreatic endocrine cell differentiation starts at embryonic day 9 and goes until day 15.5 in the analysed sample. We also find fast genes that are upregulated during commitment to a cell fate at the end of the differentiation trajectory, such as Gcg and Nnat. We note that when filtering genes based on prior knowledge of the expected order of cell types, we also filter many cycling genes that tend to have high variance, and thus partially incorrect state assignments.

Fig 4. κ-velo on pancreas endocrinogenesis.


A. Range of splicing rate β estimated by scVelo (in red) and κ-velo (in blue). B. Examples of fast and slow genes, selected according to κβ. Learned kinetics are shown by blue (upregulation) and orange (downregulation) curves. C. Velocities from κ-velo projected onto a UMAP embedding using κ-velo projection. D. Velocities from scVelo projected onto the same UMAP embedding using κ-velo projection. E. Embedded velocities as returned by scVelo. For ease of comparison, plotting style was matched to (C) and (D). F-G. Quantitative comparison of the projected velocities from κ-velo (C) and scVelo (E) on the low dimensional embedding. We return the norm of the errors in F and the cosine similarity in G.

We display the high dimensional vector field in a UMAP embedding of the data and compare the κ-velo velocities (Fig 4C) to scVelo velocities (Fig 4D), both projected with Nyström projection to compare only the velocity vectors (S15 Fig shows projections of the velocities on a PCA embedding, S16 Fig shows smoothed velocities on the UMAP embedding). The κ-velo velocities better capture the differences in speed along the trajectory, as well as the progression within the four terminal states. scVelo’s embedding (Fig 4E) smooths over the velocities, returning a view that partially appears more consistent with the expected direction of differentiation but not with the actual noisy velocity vectors. Comparing the projected velocities of the full scVelo pipeline (Fig 4E) to those of the κ-velo pipeline (Fig 4C), we see that the methods most strongly disagree in the high-plasticity ductal population (Fig 4F and 4G and S17 Fig). There is also a strong disagreement in the delta cells, which scVelo predicts to differentiate into the alpha cells, as well as in the alpha cells themselves that are predicted to have very small velocities all along the branch. Looking at the u-s phase portraits of single genes such as Gcg (Fig 4B), we see that the cells are still differentiating and the full alpha branch has not reached the terminal state yet.

κ-velo recovers multiple differentiation paths in hematopoietic system

RNA velocity analysis of single-cell datasets of differentiation of hematopoietic stem cells into different blood progenitor cells has proved difficult in the past [6, 7], and often the predicted velocities display a direction reversal. This reversal was attributed to genes with more complex kinetics leading to u-s phase portraits that do not have the shape expected from the current RNA velocity model. To investigate the potential of κ-velo on more complex datasets, we applied the method to a dataset of murine hematopoietic stem and progenitor cells (HSPCs) [23]. The HSPCs in this dataset were acquired by sorting bone marrow cells using a broad Linneg c-Kit+ (LK) gating strategy. Additionally, the dataset has been enriched for long-term hematopoietic stem cells (HSCs), which are usually less abundant than other populations. HSCs, which have a high multipotent potential (as indicated by the stemness score, Fig 5A), are at the beginning of the differentiation trajectory, and give rise to all mature blood cells [26]. In this dataset, these final states of mature blood cells are not yet reached since only HSPCs were included. Using a curated set of cell type gene markers, we identify the HSCs and progenitor populations, matching the original annotations (S18 Fig) [23].

Fig 5. κ-velo on hematopoiesis.


A. UMAP embedding with cells coloured for stemness score. B. κ-velo-recovered velocities projected onto UMAP embedding of the cells using Nyström projection. C. Velocities from scVelo projected onto the same UMAP embedding plotted using scVelo’s velocity stream plot. D-F. Recovered dynamics in u-s portrait and expression UMAP of two fast genes Fcnb in D and Ermap in E and one slow gene Pum2 in F.

The κ-velo pipeline correctly recovers the overall differentiation paths from the HSCs to various progenitor populations, such as to the myeloid and megakaryocyte progenitors (Fig 5B), while still capturing cell specific velocity variations (S19 Fig shows smoothed velocities on the embedding). The velocities show higher plasticity in the regions with higher stemness score and more commitment towards the ends of the differentiation branches. On the same dataset, scVelo recovers velocities in the exact opposite directions with velocities pointing from the more differentiated progenitor cells towards the HSCs (Fig 5C). We also identify fast genes, such as Fcnb and Ermap (Fig 5D and 5E), which are known to be involved in the commitment to the myeloid lineage and erythroid lineage respectively [27, 28]. Pum2 is identified as a slow gene because its downregulation takes place over the full span from stem cell to progenitor (Fig 5F). This gene is known to suppress differentiation in HSCs [29].

Eco-velo approximates cell state velocities using minimal data processing and computation

As a heuristic method that does not require cumbersome recovery of the rate parameters, we apply eco-velo on some of the introduced data sets. By simply taking the unspliced counts as a proxy of a cell’s future state (Fig 6A), we can skip a few gene set filtering steps, imputation and parameter fitting, all of which are computationally expensive and can kill some of the true signal variability. We validate the model on a simulated dataset (Fig 6B and S20 Fig), where the model recovers the expected flow. We then test eco-velo on the pancreas endocrinogenesis dataset and the hematopoiesis dataset (pancreas endocrinogenesis: Fig 6C and S21 Fig for smoothed velocities, hematopoiesis dataset: S22 Fig). Since the method is based on the assumption that genes have the same splicing and degradation rates, and we know that cell cycle genes have different rates in the pancreas endocrinogenesis dataset, we exclude them from this analysis. The model delineates the directional flow from progenitor cells to alpha and beta cell fates. eco-velo also captures the high cell plasticities in the ductal population seen in Fig 4C. The final state of epsilon cells is also captured (S21 Fig smoothed) but the dynamics within the delta cells cannot be resolved. For delta and epsilon cells the issues could arise from trying to capture future states within sparse populations that are transcriptionally close to the more abundant population of alpha cells. A quantitative comparison of the projected velocities from eco-velo and κ-velo is shown in S23 Fig, where we see a strong similarity in the Ngn3 low endocrine progenitor, but more variation between the methods in the cycling ductal population as well as in the terminal states. Given the strong theoretical assumptions of the model, eco-velo still captures the complex lineages of endocrinogenesis remarkably well. For the hematopoiesis dataset however, eco-velo is unable to capture the dynamics correctly (S22 Fig). 
Similarly to scVelo, the velocities are falsely projected back to the most stem-like state, hinting that the more basic assumptions about the splicing dynamics in eco-velo may not hold up for this particular biological process.

Fig 6. Eco-velo as an alternative to computationally costly reaction rate parameter recovery.


A. Under certain conditions, a cell’s unspliced state will represent the cell’s future spliced state. To infer velocities, we look for the first MNN between a cell’s unspliced counts and other cells’ spliced counts. We draw an arrow from the cell to the identified MNN. B. We validate eco-velo on simulation and visualise the resulting velocities on t-SNE. C. eco-velo on pancreas endocrinogenesis.

Computational efficiency of the methods

We report the runtime on an Intel Core i5 CPU at 2 GHz with 4 cores and 16 GB of RAM. On the pancreatic endocrinogenesis dataset with 3696 cells and the top 5000 highly variable genes, the κ-velo workflow takes 15 minutes while the eco-velo workflow takes about 40 seconds. The full scVelo pipeline on the same dataset takes about 8 minutes.

Data and software availability

All analysed datasets are publicly available. The pancreatic endocrinogenesis dataset is available from the Gene Expression Omnibus (GEO) under accession GSE132188 [20]. The murine gastrulation dataset is available on the ArrayExpress database (http://www.ebi.ac.uk/arrayexpress) under accession number E-MTAB-6967 [21]. For both datasets the count matrices can be downloaded directly from the scVelo Python implementation (https://scvelo.org) v0.2.4. The raw data from the chromaffin dataset is available on GEO under accession number GSE99933 [22]. The count matrices are made available by [1] at http://velocyto.org. The count matrices of the HSPC dataset are available on our GitHub page: https://github.com/HaghverdiLab/velocity_notebooks. This GitHub page also contains all notebooks necessary to reproduce the results reported in this paper. A Python implementation of the κ-velo and eco-velo pipelines can be found at https://github.com/HaghverdiLab/velocity_package.

Discussion

In this manuscript, we study some of the current challenges in the inference of cell state velocities from scRNA-seq data and suggest novel approaches for tackling these problems. We argue that one of the interests in obtaining single cell velocities is to quantify the variation of dynamics among individual cells. This variance in single cell velocities can inform us about fluctuations of the dynamics, cell state plasticities and heterogeneity. We demonstrate that the processing procedure, several data smoothing steps and the visualisation approach in existing methods kill such biologically meaningful variance. The resulting information is closer to knowledge we could get from pseudotemporal ordering of cells than to the true single-cell velocity directions; one gets good looking cell velocity maps (i.e. conforming to the expected pseudotime directions) that do not reflect the reality of the information contained in the u-s mRNA data.

For applications in which obtaining the average cell state velocities over the specific time scale of mRNAs degradation is desired, we propose the eco-velo approach. It eliminates multiple cumbersome and error-prone steps, such as the gene-wise parameter estimation and visualisation of high-dimensional velocities.

For more detailed velocity analyses, we designed the κ-velo approach. The method recovers the full dynamics of splicing kinetics and addresses the relative scaling of velocity components across genes. We also design a consistent processing pipeline and suggest a new visualisation approach. We demonstrate how our model achieves better estimation of velocities than current methods on simulation. On real data, our method returns more plausible ranges of splicing rates and velocity magnitudes in several differentiation regions. κ-velo’s velocity components’ scaling is based on the assumption that cell densities can be used as a proxy of typical travel time between two cell states. Heterogeneous cell birth and death rates along the differentiation path could partly disturb this assumption. To further improve this model, one could therefore consider estimating the heterogeneous cell birth and death rates based on the activity of apoptotic and proliferation genes [30]. Our results on simulation data (S8 Fig) demonstrate that the true global time of cells also resolves the scale-invariance issue. This indicates that other proxies of the true global time, e.g. cell density-scaled pseudotime, may also be used for inferring the relative scaling of velocity components among the genes in future work.

As described in Section “Careful processing prevents introduction of artefacts”, we find that there can be difficulty in fitting reaction rate parameters for genes that do not display clear kinetic patterns of up- or downregulation on the u-s phase portrait. In the current version of κ-velo, we filter out genes where the fitted state assignments do not match the known pseudotemporal order of cell types. In future work, we could use this prior information as initialisation in the parameter fitting procedure. The recovered high-dimensional velocity vectors now contain the deterministic part, but also capture the stochasticity of the dynamics. This can be used to perform several downstream analyses and answer questions about cell’s progress through the dynamical process.

In the past, recovery of cell specific global latent time has been done after velocity analysis [2]. The recovery of a cell’s global time was based on a heuristic integration of time assignment from individual genes. However, the gene-wise assignments of latent time are error-prone and additionally do not take into account the time that genes spend in steady-state. Integrating these errors does not necessarily mean that they cancel out. For these two reasons, recovery of global latent time should be done more carefully in follow-up studies with strategies similar to CellRank [5], where several sequential cell state transitions are chained together to construct long transition paths along the differentiation manifold. Alternatively, estimation of global latent time may be integrated in the expectation maximisation procedure, similar to the approach in a recent preprint [11].

We also raise awareness about the time scales for which average velocities are being estimated. It would be interesting to measure velocities at multiple time scales to get an overview on the “plans” individual cells have in preparation for their short- or long-term developmental journey. One way of studying the changes that cells undergo at different time scales would be by inferring velocities from different sets of genes related to these time scales. For example, one could investigate velocities on the time scale of the cell cycle or of the entire differentiation process. This also supports growing interest in inferring cell state velocities from other pairs of single-cell data modalities, e.g. mRNA coupled with protein levels [3], as they correspond to different time scales of gene regulation. Furthermore, inferring cell state velocities from modalities in which measurements are more accurate (in comparison to the uncertainty in quantification of unspliced-spliced mRNA counts) can enhance our ability to understand the biological variation in cell state velocities rather than variations due to measurement noise.

Estimation of cell state velocities in the presence of multiple time point measurements or multiple batches of data collection is another important problem. However, the solution is not trivial, as existing batch effect correction methods can distort the proportions between the s and u counts from separate batches. One possible strategy is to estimate the velocities within each batch separately and then project and visualise the estimated velocities on a shared embedding of all batches. Investigation of different approaches and possibilities remains open.
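A minimal sketch of this per-batch strategy, in Python with NumPy, is given here under stated assumptions. It uses a simple steady-state estimator (v = u − γs, with one least-squares γ per gene) purely as a stand-in for the velocity model, and a plain PCA of the pooled spliced counts as the shared embedding. The function names are hypothetical and are not part of the κ-velo or eco-velo packages.

```python
import numpy as np

def steady_state_velocity(U, S):
    """Stand-in estimator (NOT kappa-velo): per-gene v = u - gamma * s,
    with gamma fitted by least squares through the origin."""
    gamma = (U * S).sum(axis=0) / np.maximum((S * S).sum(axis=0), 1e-12)
    return U - gamma * S

def batchwise_velocities(U, S, batch):
    """Estimate velocities separately within each batch (no batch correction of u or s)."""
    V = np.zeros_like(S, dtype=float)
    for b in np.unique(batch):
        idx = batch == b
        V[idx] = steady_state_velocity(U[idx], S[idx])
    return V

def shared_pca_projection(S, V, n_components=2):
    """Fit one PCA on the spliced counts of all batches and project the velocities onto it."""
    mean = S.mean(axis=0)
    _, _, Vt = np.linalg.svd(S - mean, full_matrices=False)
    W = Vt[:n_components].T          # genes x components
    # PCA is linear, so the displacement from s to s + v projects as V @ W.
    return (S - mean) @ W, V @ W

# Usage on toy data (assumed shapes: cells x genes)
rng = np.random.default_rng(0)
S = rng.poisson(5.0, size=(60, 8)).astype(float)
U = rng.poisson(2.0, size=(60, 8)).astype(float)
batch = np.repeat(np.array([0, 1, 2]), 20)
V = batchwise_velocities(U, S, batch)
embedding, V_embedded = shared_pca_projection(S, V)
```

Because each batch is fitted on its own counts, the ratio between u and s within a batch is left untouched. Whether the pooled embedding itself should be built in a batch-aware way is not addressed by this sketch.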

To conclude, we suggest that a comprehensive grasp of what we are actually estimating and visualising as cell state velocities is crucial for obtaining a full description of cell differentiation dynamics. True cell state velocities encompass both stochastic and deterministic parts of the biological dynamics. This information can be complementary to attempts to describe cell differentiation as a full diffusion process [1, 5, 12, 15, 31–33], which contains three terms: deterministic, stochastic, and cell birth and death rates. Reliable quantification of cell state velocities in different transcriptional regions can put the relative magnitude (i.e. coefficients) of these terms into perspective in relation to one another.
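For orientation, one common way to write such a three-term diffusion process is the population balance equation sketched here. The notation is an assumption and is not defined in this manuscript: p(x, t) is the density of cells at state x, v(x) the deterministic drift, D(x) the diffusion coefficient of the stochastic part, and R(x) the net rate of cell birth minus death.

```latex
\frac{\partial p(x,t)}{\partial t}
  = -\nabla \cdot \big( v(x)\, p(x,t) \big)
  + \nabla \cdot \big( D(x)\, \nabla p(x,t) \big)
  + R(x)\, p(x,t)
```

In this form, the relative magnitudes of v, D and R in a given region of state space are the coefficients that cell state velocity estimates could help put into perspective.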

Supporting information

S1 Table of contents. Table of contents of the main text.

(PDF)

S1 Appendix. Supplementary Notes A-J.

Details on theory, the algorithms, the simulation and processing of the data.

(PDF)

S1 Fig. Average velocities for different time scales can be very different if the expression dynamics are not smooth.

On the left is the example of two noisy genes: the average velocity over Δt1 is very different from the average velocity over Δt2. For smooth gene dynamics as shown on the right, the average velocities are more similar.

(TIFF)

S2 Fig. Density estimation for two simulated genes with different time scales.

c = 10⁻³ is a constant scaling factor. The two simulated genes have the same reaction parameters θ, but those for gene 2 are scaled by 10. (A) A slow gene, where no cells are in steady-state. The slope of the line gives us κ_g1 directly. (B) A fast gene, where a lot of cells are in steady-state. The slope of the red line gives us κ_g2.

(TIFF)

S3 Fig. Comparison of recovery of scaling factors from unspliced counts (Eq 10) and from spliced counts (Note D in S1 Appendix).

(A) On simulation; the simulation is the same as in main Fig 3. (B) On the pancreas endocrinogenesis dataset.

(TIFF)

S4 Fig. Overview of all processing steps in the κ-velo workflow.

In the middle, a schematic representation of how the spliced and unspliced matrices change during each step is shown. A size reduction of the coloured area indicates a filtering step where the number of genes is reduced. A change in colour represents a data manipulation, which does not change the number of cells or genes, but changes the values in the matrix. On the left, some extra information is provided for some of the processing steps. More detailed information can be found in Note G in S1 Appendix. On the right, the u-s phase portraits of several example genes are shown to demonstrate how the different steps change the phase portraits, as well as which kinds of genes are selected or removed in the filtering steps. Each of the genes is selected from the pancreas endocrinogenesis dataset that is analysed in main Fig 4.

(TIFF)

S5 Fig. κ-velo and eco-velo applied on the chromaffin dataset.

The chromaffin dataset includes Schwann cell precursors (SCPs) (blue) differentiating into chromaffin cells (green). In the original paper, the purple cluster was identified as sympathoblasts and the yellow and red clusters as “bridge” cells [22]. (A) κ-velo applied on the chromaffin dataset using PCA embedding for visualisation. Principal components (PCs) 1 and 2 on the left, PCs 2 and 3 on the right. (B) κ-velo applied on the chromaffin dataset using UMAP embedding for visualisation (left: raw vector visualisation, right: smoothed vector visualisation). (A) and (B) show that κ-velo correctly captures the differentiation from SCPs into chromaffin cells. Interestingly, there also seems to be a more committed differentiation in the bridge cells than in the SCPs at the beginning of the manifold. (C) eco-velo applied on the chromaffin dataset using UMAP embedding for visualisation (left: raw vector visualisation, right: smoothed vector visualisation).

(TIFF)

S6 Fig. Projection of the velocity arrows (test set data points) onto existing embedding of initial cell positions (training set).

We compare our projection approach (left column) to scVelo’s [2] (right column) projection for t-SNE [16] in (A) and (B) and UMAP [17] in (C) and (D).

(TIFF)

S7 Fig. Projection of the velocity arrows (test set data points) onto existing diffusion map embedding of initial cell positions (training set).

We compare our projection approach in (A) to scVelo’s [2] projection in (B).

(TIFF)

S8 Fig. Recovery of the scaling factor κ from true time on simulation.

The simulation is the same as in main Fig 3. The factors are recovered similarly to the density approach described in Note C in S1 Appendix, except that d(i, j) is calculated from t_i, the true simulated time of cell i: d(i, j) = |t_i − t_j|. Plotting d on the x-axis and f on the y-axis, the slope of the corresponding line gives us κ. Here, since we have true time, we do not need to exclude steady-states. (A) Comparison of the scaling factors recovered from true time to the true simulated factors. Note that here the range of recovered scaling factors is equivalent to the true factors because they were recovered from true time and not from a proxy of time that might be off by some constant factor. (B) Comparison of the factors recovered from the density approach to the factors recovered from true time.

(TIFF)

S9 Fig. Comparison of the high-dimensional velocities recovered by κ-velo and scVelo on simulation for 100 genes with different speeds.

(A) High-dimensional velocity vector. One point represents a velocity for one cell for one gene. (B) We evaluate differences between true high-dimensional velocities and recovered velocities. We report the change in direction (cosine similarity), the change in length (difference in vector norm) and the overall norm of the errors between real velocities and κ-velo velocities (in blue), or scVelo velocities (in red). To make the lengths comparable, the high-dimensional vectors are normalised to have equal variance. Note the log-scale for frequency.

(TIFF)
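The three error measures named in the S9 Fig caption can be sketched as follows in Python with NumPy. This is an illustration only: the function name is hypothetical, and the normalisation shown (dividing each set of vectors by its global standard deviation) is an assumed way of giving both sets equal variance, not necessarily the one used for the figure.

```python
import numpy as np

def velocity_errors(v_true, v_rec, eps=1e-12):
    """Per-cell comparison of true and recovered velocity vectors (cells x genes)."""
    v_true = np.asarray(v_true, dtype=float)
    v_rec = np.asarray(v_rec, dtype=float)
    # Assumed normalisation: rescale both sets to unit overall variance.
    v_true = v_true / max(v_true.std(), eps)
    v_rec = v_rec / max(v_rec.std(), eps)
    norm_true = np.linalg.norm(v_true, axis=1)
    norm_rec = np.linalg.norm(v_rec, axis=1)
    # Change in direction: cosine similarity between the two vectors.
    cosine = (v_true * v_rec).sum(axis=1) / np.maximum(norm_true * norm_rec, eps)
    # Change in length: difference of the two vector norms.
    length_diff = norm_true - norm_rec
    # Overall error: norm of the difference vector.
    error_norm = np.linalg.norm(v_true - v_rec, axis=1)
    return cosine, length_diff, error_norm
```

A recovered velocity that differs from the true one only by a constant positive factor gives a cosine similarity of 1 and zero for both length difference and error norm, since the rescaling removes the common factor.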

S10 Fig. Comparison of velocities recovered by κ-velo and scVelo on simulation projected on PCA embedding of spliced counts.

(A) Real simulated velocities, (B) velocities recovered by κ-velo and (C) velocities recovered by scVelo, projected on PCA. Cells on PCA coloured by norm of the errors between real velocities and (D) κ-velo velocities, or (E) scVelo velocities.

(TIFF)

S11 Fig. Comparison of velocities recovered by κ-velo and scVelo on simulation projected on 2D-PCA embedding of spliced counts.

(A) Norm of the errors: ‖v_t − v_r‖, with v_t the true 2D velocity vector on PCA and v_r the recovered vector. (B) Change in direction (cosine similarity) and length (difference in vector norm: ‖v_t‖ − ‖v_r‖) for each cell in PCA space.

(TIFF)

S12 Fig. The u-s phase portrait of Acly, Dpysl2 and Gnaz (raw counts, after normalisation and after recovering of dynamics).

The u-s phase portrait of Acly, Dpysl2 and Gnaz (from the pancreas endocrinogenesis dataset), which are all genes with insufficient unspliced counts. Here, we show how scVelo would recover the dynamics if these genes were not filtered out.

(TIFF)

S13 Fig. Applying κ-velo processing pipeline on erythroid lineage dataset.

The scRNA-seq dataset on the erythroid lineage of mouse gastrulation [21] has been described in the context of RNA velocity by Barile et al. [6]. Here, we show that the subset has a varying ratio of total unspliced to total spliced counts in different cell types (A). This results in artefacts when using the standard scVelo processing pipeline (U and S normalised separately) (B, second row). Those artefacts are mostly resolved by normalising U and S combined (B, third row), which is part of the κ-velo processing workflow (B, last row). Using the κ-velo processing workflow fixes some of the reported de-differentiation (C).

(TIFF)

S14 Fig. Comparison of recovered reaction rate parameters on pancreas endocrinogenesis dataset.

Range of transcription rate α, splicing rate β, and degradation rate γ estimated by scVelo (in red) and κ-velo (in blue).

(TIFF)

S15 Fig. PCA projection of velocities in the pancreas endocrinogenesis dataset.

(A) Velocities returned by κ-velo projected on PCA embedding of spliced counts. (B) Velocities returned by scVelo projected on PCA embedding of spliced counts. We note that the gene space used is different for the two methods, as they have different criteria for gene selection. scVelo uses 1809 genes, while κ-velo uses 134.

(TIFF)

S16 Fig. Smoothed κ-velo projection of velocities in the pancreas endocrinogenesis dataset.

The two UMAPs compare (A) smoothed scVelo velocities projected by Nyström projection and (B) smoothed κ-velo velocities projected by Nyström projection. Velocities were smoothed by averaging over the 30 nearest neighbours. Neighbourhoods are calculated in S space.

(TIFF)

S17 Fig. Quantitative comparison of low-dimensional projection of velocities.

We compare scVelo velocities projected by scVelo (v1) to κ-velo velocities projected by Nyström projection (v2) for every cell. (A) UMAP coloured by cell types. (B) Difference in the norm of the two vectors ‖v1‖ − ‖v2‖.

(TIFF)

S18 Fig. UMAP embedding of the HSPC dataset as calculated in the κ-velo pipeline.

Cells are coloured by (A) our assigned cell types (see Note I in S1 Appendix) or (B) the cell type assignments from the original data analysis [23].

(TIFF)

S19 Fig. Smoothed κ-velo projection of velocities in the HSPC dataset.

Velocities were smoothed by averaging over the 30 nearest neighbours. Neighbourhoods are calculated in S space. Non-smoothed projection in main Fig 5B.

(TIFF)

S20 Fig. Eco-velo projection of velocities (calculated on simulations) shown on PCA in (A) and UMAP in (B).

(TIFF)

S21 Fig. Smoothed eco-velo projection of velocities in the pancreas endocrinogenesis dataset.

Velocities were smoothed by averaging over the 50 nearest neighbours. Neighbourhoods are calculated in S space.

(TIFF)

S22 Fig. Eco-velo applied on HSPC dataset using UMAP embedding for visualisation.

Left: raw vector visualisation, right: smoothed vector visualisation. Like scVelo (main Fig 5C), the velocities point from the more differentiated populations back to the stem cells.

(TIFF)

S23 Fig. Quantitative comparison of low-dimensional projection of velocities.

We compare, for every cell, κ-velo velocities projected by Nyström projection (v1) to eco-velo velocities (v2) projected onto the UMAP calculated in the κ-velo pipeline and shown in main Fig 4. (A) UMAP coloured by cell types. (B) Cosine similarity between the two vectors. (C) Norm of the difference between the two vectors ‖v1 − v2‖. (D) Difference in the norm of the two vectors ‖v1‖ − ‖v2‖. Cells are coloured in grey when we do not have a velocity value for eco-velo, i.e. the cell does not have a mutual nearest neighbour within the top 50 neighbours.

(TIFF)

Data Availability

All analysed datasets are publicly available. The pancreatic endocrinogenesis dataset is available from the Gene Expression Omnibus (GEO) under accession GSE132188. The murine gastrulation dataset is available on the ArrayExpress database (http://www.ebi.ac.uk/arrayexpress) under accession number E-MTAB-6967. For both datasets the count matrices can be downloaded directly from the scVelo Python implementation (https://scvelo.org) v0.2.4. The raw data from the chromaffin dataset are available on GEO under accession number GSE99933. The count matrices are made available at http://velocyto.org. The count matrices of the HSPC dataset are available on our GitHub page: https://github.com/HaghverdiLab/velocity_notebooks. This GitHub page also contains all notebooks necessary to reproduce the results reported in this paper. A Python implementation of the κ-velo and eco-velo pipeline can be found at https://github.com/HaghverdiLab/velocity_package.

Funding Statement

This study was supported by the Max Delbrück Center for Molecular Medicine as well as the Bundesministerium für Bildung und Forschung (BMBF) grant for ‘junior consortia in systems medicine’ to LH (01ZX1911B) and the Dietmar Hopp Foundation, as well as SFB873 funded by the Deutsche Forschungsgemeinschaft (DFG) to MAGE. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. La Manno G, Soldatov R, Zeisel A, Braun E, Hochgerner H, Petukhov V, et al. RNA velocity of single cells. Nature. 2018;560(7719):494–498. doi: 10.1038/s41586-018-0414-6
  • 2. Bergen V, Lange M, Peidli S, Wolf FA, Theis FJ. Generalizing RNA velocity to transient cell states through dynamical modeling. Nature Biotechnology. 2020;38(12):1408–1414. doi: 10.1038/s41587-020-0591-3
  • 3. Gorin G, Svensson V, Pachter L. Protein velocity and acceleration from single-cell multiomics experiments. Genome biology. 2020;21(1):1–6. doi: 10.1186/s13059-020-1945-3
  • 4. Li C, Virgilio M, Collins KL, Welch JD. Single-cell multi-omic velocity infers dynamic and decoupled gene regulation. bioRxiv [Preprint]. 2021. [cited 2022 Sep 7]. Available from: https://www.biorxiv.org/content/10.1101/2021.12.13.472472v1.
  • 5. Lange M, Bergen V, Klein M, Setty M, Reuter B, Bakhti M, et al. CellRank for directed single-cell fate mapping. Nature methods. 2022; p. 1–12. doi: 10.1038/s41592-021-01346-6
  • 6. Barile M, Imaz-Rosshandler I, Inzani I, Ghazanfar S, Nichols J, Marioni JC, et al. Coordinated changes in gene expression kinetics underlie both mouse and human erythroid maturation. Genome Biol. 2021;22(1):197. doi: 10.1186/s13059-021-02414-y
  • 7. Bergen V, Soldatov RA, Kharchenko PV, Theis FJ. RNA velocity—current challenges and future perspectives. Molecular systems biology. 2021;17(8):e10282. doi: 10.15252/msb.202110282
  • 8. Li T, Shi J, Wu Y, Zhou P. On the Mathematics of RNA Velocity I: Theoretical Analysis. CSIAM Transactions on Applied Mathematics. 2021;2(1):1–55. doi: 10.4208/csiam-am.SO-2020-0001
  • 9. Gorin G, Fang M, Chari T, Pachter L. RNA velocity unraveled. bioRxiv [Preprint]. 2022. [cited 2022 Sep 7]. Available from: https://www.biorxiv.org/content/10.1101/2022.02.12.480214v1.
  • 10. Zheng SC, Stein-O’Brien G, Boukas L, Goff LA, Hansen KD. Pumping the brakes on RNA velocity—understanding and interpreting RNA velocity estimates. bioRxiv [Preprint]. 2022. [cited 2022 Sep 7]. Available from: https://www.biorxiv.org/content/10.1101/2022.06.19.494717v2.
  • 11. Gu Y, Blaauw DT, Welch J. Variational Mixtures of ODEs for Inferring Cellular Gene Expression Dynamics. In: International Conference on Machine Learning. PMLR; 2022. p. 7887–7901.
  • 12. Haghverdi L, Büttner M, Wolf FA, Buettner F, Theis FJ. Diffusion pseudotime robustly reconstructs lineage branching. Nature methods. 2016;13(10):845–848. doi: 10.1038/nmeth.3971
  • 13. Angerer P, Haghverdi L, Büttner M, Theis FJ, Marr C, Buettner F. destiny: diffusion maps for large-scale single-cell data in R. Bioinformatics. 2016;32(8):1241–1243. doi: 10.1093/bioinformatics/btv715
  • 14. Fang R, Preissl S, Li Y, Hou X, Lucero J, Wang X, et al. SnapATAC: Comprehensive analysis of single cell ATAC-seq data with SnapATAC. Nat Commun. 2021;12(1):1337. doi: 10.1038/s41467-021-21583-9
  • 15. Haghverdi L, Buettner F, Theis FJ. Diffusion maps for high-dimensional single-cell analysis of differentiation data. Bioinformatics. 2015;31(18):2989–2998. doi: 10.1093/bioinformatics/btv325
  • 16. van der Maaten L, Hinton G. Visualizing Data using t-SNE. Journal of Machine Learning Research. 2008;9(86):2579–2605.
  • 17. McInnes L, Healy J, Melville J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv [Preprint]. 2020 [cited 2022 Sep 7]. Available from: https://arxiv.org/abs/1802.03426.
  • 18. Haghverdi L, Lun AT, Morgan MD, Marioni JC. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nature biotechnology. 2018;36(5):421–427. doi: 10.1038/nbt.4091
  • 19. Lause J, Berens P, Kobak D. Analytic Pearson residuals for normalization of single-cell RNA-seq UMI data. Genome biology. 2021;22(1):1–20. doi: 10.1186/s13059-021-02451-7
  • 20. Bastidas-Ponce A, Tritschler S, Dony L, Scheibner K, Tarquis-Medina M, Salinno C, et al. Comprehensive single cell mRNA profiling reveals a detailed roadmap for pancreatic endocrinogenesis. Development. 2019;146(12):dev173849. doi: 10.1242/dev.173849
  • 21. Pijuan-Sala B, Griffiths JA, Guibentif C, Hiscock TW, Jawaid W, Calero-Nieto FJ, et al. A single-cell molecular map of mouse gastrulation and early organogenesis. Nature. 2019;566(7745):490–495. doi: 10.1038/s41586-019-0933-9
  • 22. Furlan A, Dyachuk V, Kastriti ME, Calvo-Enrique L, Abdo H, Hadjab S, et al. Multipotent peripheral glial cells generate neuroendocrine cells of the adrenal medulla. Science. 2017;357(6346):eaal3753. doi: 10.1126/science.aal3753
  • 23. Demerdash Y, Bouman BJ, Haghverdi L, Essers M. Unbiased, longitudinal analysis of the inflammatory response of HSPCs at the single-cell level resolves controversies regarding the HSPC stress response. Presented during EHA2022 as a poster. 2022;EHA library: Demerdash Y. 06/10/22; 358259; P1401.
  • 24. Alpert T, Herzel L, Neugebauer KM. Perfect timing: splicing and transcription rates in living cells. Wiley Interdiscip Rev RNA. 2017;8(2). doi: 10.1002/wrna.1401
  • 25. Kim YH, Larsen HL, Rué P, Lemaire LA, Ferrer J, Grapin-Botton A. Cell cycle-dependent differentiation dynamics balances growth and endocrine differentiation in the pancreas. PLoS Biol. 2015;13(3):e1002111. doi: 10.1371/journal.pbio.1002111
  • 26. Nestorowa S, Hamey FK, Pijuan Sala B, Diamanti E, Shepherd M, Laurenti E, et al. A single-cell resolution map of mouse hematopoietic stem and progenitor cell differentiation. Blood. 2016;128(8):e20–e31. doi: 10.1182/blood-2016-05-716480
  • 27. Kwok I, Becht E, Xia Y, Ng M, Teh YC, Tan L, et al. Combinatorial single-cell analyses of granulocyte-monocyte progenitor heterogeneity reveals an early uni-potent neutrophil progenitor. Immunity. 2020;53(2):303–318. doi: 10.1016/j.immuni.2020.06.005
  • 28. Ye TZ, Gordon CT, Lai YH, Fujiwara Y, Peters LL, Perkins AC, et al. Ermap, a gene coding for a novel erythroid specific adhesion/receptor membrane protein. Gene. 2000;242(1-2):337–345. doi: 10.1016/S0378-1119(99)00516-8
  • 29. Zayas J, George J, Nachtman R, Jurecic R. RNA-Binding Protein Pum2 Promotes Self-Renewal and Suppresses Differentiation of Multipotent Hematopoietic Cells by Maintaining Them in Inactive CD34- State. Blood. 2007;110(11):2231. doi: 10.1182/blood.V110.11.2231.223117557896
  • 30. Schiebinger G, Shu J, Tabaka M, Cleary B, Subramanian V, Solomon A, et al. Optimal-transport analysis of single-cell gene expression identifies developmental trajectories in reprogramming. Cell. 2019;176(4):928–943. doi: 10.1016/j.cell.2019.01.006
  • 31. Wagner DE, Weinreb C, Collins ZM, Briggs JA, Megason SG, Klein AM. Single-cell mapping of gene expression landscapes and lineage in the zebrafish embryo. Science. 2018;360(6392):981–987. doi: 10.1126/science.aar4362
  • 32. Qiu X, Zhang Y, Martin-Rufino JD, Weng C, Hosseinzadeh S, Yang D, et al. Mapping transcriptomic vector fields of single cells. Cell. 2022;185(4):690–711. doi: 10.1016/j.cell.2021.12.045
  • 33. Cho H, Rockne RC. Mathematical modeling with single-cell sequencing data. bioRxiv [Preprint]. 2019. [cited 2022 Sep 7]; p. 710640. Available from: https://www.biorxiv.org/content/10.1101/710640v1.
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010031.r001

Decision Letter 0

Ilya Ioshikhes, Wei Li

19 May 2022

Dear Dr. Haghverdi,

Thank you very much for submitting your manuscript "Towards reliable quantification of cell state velocities" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Wei Li, Ph.D.

Guest Editor

PLOS Computational Biology

Ilya Ioshikhes

Deputy Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The review is attached as a docx file.

Reviewer #2: In this manuscript, the authors aimed to solve several key questions regarding the velocity estimation and visualization. They developed κ-velo which enables the estimation of the relative magnitude of velocity components across genes. At the same time, they developed a new method to visualize the velocity. Using both simulated and real data, the authors demonstrated that their method outperforms scVelo, one of the state-of-the-art methods for velocity estimation and visualization. The main comments and concerns are as follows:

1. The algorithm developed in this manuscript is not available, at least not user-friendly. Scripts in GitHub can only be used to generate figures for this manuscript. It is difficult for users to use this algorithm to analyze their internal data or other public data other than those mentioned in this manuscript.

2. In the manuscript, the authors address several key issues of velocity estimation and visualization in the Introduction section. All work in the manuscript is aimed at addressing these issues. Although these issues have been reported in detail in other studies, a brief description of these issues can help readers better understand this manuscript.

3. In Section 3.1, the authors demonstrate that their PCA embedding method can reliably represent high plasticity at the beginning and commit fate faster at the end of the trajectory through simulated data. However, with the nonlinear projection method, it appears that these phenomena cannot be observed. Does this mean that PCA-based methods and nonlinear projection methods can only show limited aspects of the velocity?

4. The operation of κ-velo does not take much computing resources and time. In this case, is eco-velo mode still necessary? Does this model have any other advantages or might be suitable for specific situations?

5. Are there any other datasets or cell types that can be used to test the performance of κ-velo? For example, single-cell data of different T-cell states.

Reviewer #3: See attached PDF.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Attachment

Submitted filename: PLOS-towards-reliable.docx

Attachment

Submitted filename: review.pdf

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010031.r003

Decision Letter 1

Ilya Ioshikhes, Wei Li

26 Aug 2022

Dear Dr. Haghverdi,

We are pleased to inform you that your manuscript 'Towards reliable quantification of cell state velocities' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Wei Li, Ph.D.

Guest Editor

PLOS Computational Biology

Ilya Ioshikhes

Section Editor

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: I believe that the authors have addressed all my concerns with either additional detailed analyses or appropriate theoretical clarifications. The limitations and possible improvements of current RNA velocity approaches that the authors pointed out will be valuable to the research community.

Reviewer #2: The authors have addressed all my questions and I have no further questions. I recommend accepting.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010031.r004

Acceptance letter

Ilya Ioshikhes, Wei Li

20 Sep 2022

PCOMPBIOL-D-22-00426R1

Towards reliable quantification of cell state velocities

Dear Dr Haghverdi,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Zsofi Zombor

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Table of contents. Table of contents of the main text.

    (PDF)

    S1 Appendix. Supplementary Notes A-J.

    Details on theory, the algorithms, the simulation and processing of the data.

    (PDF)

    S1 Fig. Average velocities for different time scales can be very different if the expression dynamics are not smooth.

    On the left is the example of two noisy genes: the average velocity over Δt1 is very different from the average velocity over Δt2. For smooth gene dynamics as shown on the right, the average velocities are more similar.

    (TIFF)

    S2 Fig. Density estimation for two simulated genes with different time scales.

    c = 10^−3 is a constant scaling factor. The two simulated genes have the same reaction parameters θ but those for gene 2 are scaled by 10. (A) A slow gene, where no cells are in steady-state. The slope of the line gives us κ_g1 directly. (B) A fast gene, where a lot of cells are in steady-state. The slope of the red line gives us κ_g2.

    (TIFF)

    S3 Fig. Comparison of recovery of scaling factors from unspliced counts (Eq 10) and from spliced counts (Note D in S1 Appendix).

    (A) On simulation; the simulation is the same as in main Fig 3. (B) On the pancreas endocrinogenesis dataset.

    (TIFF)

    S4 Fig. Overview of all processing steps in the κ-velo workflow.

    In the middle, a schematic representation of how the spliced and unspliced matrices change during each step is shown. A size reduction of the coloured area indicates a filtering step where the number of genes is reduced. A change in colour represents a data manipulation, which does not change the number of cells or genes, but changes the values in the matrix. On the left, some extra information is provided for some of the processing steps. More detailed information can be found in Note G in S1 Appendix. On the right, the u-s phase portraits of several example genes are shown to demonstrate how the different steps change the phase portraits, as well as which kinds of genes are selected or removed in the filtering steps. Each of the genes is selected from the pancreas endocrinogenesis dataset that is analysed in main Fig 4.

    (TIFF)

    S5 Fig. κ-velo and eco-velo applied on the chromaffin dataset.

    The chromaffin dataset includes Schwann cell precursors (SCPs) (blue) differentiating into chromaffin cells (green). In the original paper, the purple cluster was identified as sympathoblasts and the yellow and red clusters as “bridge” cells [22]. (A) κ-velo applied on chromaffin dataset using PCA embedding for visualisation. Principal component (PC) 1 and 2 left and PC 2 and 3 right. (B) κ-velo applied on chromaffin dataset using UMAP embedding for visualisation (left: raw vector visualisation, right: smoothed vector visualisation). (A) and (B) show that κ-velo correctly captures the differentiation from SCPs into chromaffin cells. Interestingly, there also seems to be a more committed differentiation in the bridge cells than in the SCPs at the beginning of the manifold. (C) eco-velo applied on chromaffin dataset using UMAP embedding for visualisation (left: raw vector visualisation, right: smoothed vector visualisation).

    (TIFF)

    S6 Fig. Projection of the velocity arrows (test set data points) onto existing embedding of initial cell positions (training set).

    We compare our projection approach (left column) to scVelo’s [2] (right column) projection for t-SNE [16] in (A) and (B) and UMAP [17] in (C) and (D).

    (TIFF)

    S7 Fig. Projection of the velocity arrows (test set data points) onto existing diffusion map embedding of initial cell positions (training set).

    We compare our projection approach in (A) to scVelo’s [2] projection in (B).

    (TIFF)

    S8 Fig. Recovery of the scaling factor κ from true time on simulation.

    The simulation is the same as in main Fig 3. The factors are recovered similarly to the density approach described in Note C in S1 Appendix, except that d(i, j) is calculated from t_i, the true simulated time of cell i: d(i, j) = |t_i − t_j|. Plotting d on the x-axis and f on the y-axis, the slope of the corresponding line gives us κ. Here, since we have true time, we do not need to exclude steady-states. (A) Comparison of the scaling factors recovered from true time to the true simulated factors. Note that here the range of recovered scaling factors is equivalent to the true factors because they were recovered from true time and not from a proxy of time that might be off by some constant factor. (B) Comparison of the factors recovered from the density approach to the factors recovered from true time.

    (TIFF)

    S9 Fig. Comparison of the high-dimensional velocities recovered by κ-velo and scVelo on simulation for 100 genes with different speeds.

    (A) High-dimensional velocity vector. One point represents a velocity for one cell for one gene. (B) We evaluate differences between true high-dimensional velocities and recovered velocities. We return the change in direction (cosine similarity), length (difference in vector norm) and the overall norm of the errors between real velocities and κ-velo velocities (in blue), or scVelo velocities (in red). To make the lengths comparable, the high-dimensional vectors are normalised to have equal variance. Note the log-scale for frequency.

    (TIFF)
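The three per-cell quantities named in the S9 Fig caption (cosine similarity, difference in vector norm, and norm of the error) can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function name `compare_velocities`, the random test data, and the choice to rescale each matrix by a single global standard deviation are assumptions made for illustration only.

```python
import numpy as np


def compare_velocities(v_true, v_rec):
    """Per-cell comparison of two (n_cells, n_genes) velocity matrices."""
    # Rescale both matrices to unit overall variance so lengths are comparable
    # (assumption: one global scale per matrix, not one per gene).
    v_true = v_true / v_true.std()
    v_rec = v_rec / v_rec.std()

    norm_true = np.linalg.norm(v_true, axis=1)
    norm_rec = np.linalg.norm(v_rec, axis=1)

    # Change in direction: cosine similarity (epsilon guards zero-length vectors).
    cosine = np.sum(v_true * v_rec, axis=1) / (norm_true * norm_rec + 1e-12)
    # Change in length: difference of the two vector norms.
    length_diff = norm_true - norm_rec
    # Overall error: norm of the difference vector.
    error_norm = np.linalg.norm(v_true - v_rec, axis=1)
    return cosine, length_diff, error_norm


# Hypothetical data: a "true" velocity matrix and a noisy recovery of it.
rng = np.random.default_rng(0)
v_true = rng.normal(size=(200, 100))
v_rec = v_true + 0.1 * rng.normal(size=(200, 100))
cosine, length_diff, error_norm = compare_velocities(v_true, v_rec)
```

Reporting direction and length separately, in addition to the error norm, makes it possible to tell whether a method gets the direction right but misjudges the speed, or the other way round.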

    S10 Fig. Comparison of velocities recovered by κ-velo and scVelo on simulation projected on PCA embedding of spliced counts.

    (A) Real simulated velocities (B) velocities recovered by κ-velo and (C) velocities recovered by scVelo projected on PCA. Cells on PCA coloured by norm of the errors between real velocities and (D) κ-velo velocities, or (E) scVelo velocities.

    (TIFF)

    S11 Fig. Comparison of velocities recovered by κ-velo and scVelo on simulation projected on 2D-PCA embedding of spliced counts.

    (A) Norm of the errors ‖v_t − v_r‖, with v_t the true 2D velocity vector on PCA and v_r the recovered vector. (B) Change in direction (cosine similarity) and length (difference in vector norm, ‖v_t‖ − ‖v_r‖) for each cell in PCA space.

    (TIFF)

    S12 Fig. The u-s phase portraits of Acly, Dpysl2 and Gnaz (raw counts, after normalisation and after recovery of dynamics).

    The u-s phase portrait of Acly, Dpysl2 and Gnaz (from the pancreas endocrinogenesis dataset), which are all genes with insufficient unspliced counts. Here, we show how scVelo would recover the dynamics if these genes were not filtered out.

    (TIFF)

    S13 Fig. Applying κ-velo processing pipeline on erythroid lineage dataset.

    The scRNA-seq dataset on the erythroid lineage of mouse gastrulation [21] has been described in the context of RNA velocity by Barile et al. [6]. Here, we show that the subset has a varying ratio of total unspliced to total spliced counts in different cell types (A). This results in artefacts when using the standard scVelo processing pipeline (U and S normalised separately) (B, second row). Those artefacts are mostly resolved by normalising U and S combined (B, third row), which is part of the κ-velo processing workflow (B, last row). Using the κ-velo processing workflow fixes some of the reported de-differentiation (C).

    (TIFF)
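The difference between normalising U and S separately and normalising them combined, as described in the S13 Fig caption, can be sketched as follows. This is an illustrative assumption rather than the κ-velo implementation: the function name `normalise_combined`, the use of the median total count as the target size, and the simulated Poisson counts are all hypothetical. The point shown is that a single shared size factor per cell leaves each cell's unspliced-to-spliced ratio unchanged.

```python
import numpy as np


def normalise_combined(U, S):
    """Scale unspliced (U) and spliced (S) counts with one shared size factor per cell."""
    # Total counts per cell, unspliced and spliced together.
    total = U.sum(axis=1) + S.sum(axis=1)
    # Assumed target size: the median total across cells; guard against empty cells.
    factor = np.median(total) / np.maximum(total, 1)
    return U * factor[:, None], S * factor[:, None]


# Hypothetical count matrices of shape (n_cells, n_genes).
rng = np.random.default_rng(1)
U = rng.poisson(2, size=(50, 20)).astype(float)
S = rng.poisson(5, size=(50, 20)).astype(float)

U_norm, S_norm = normalise_combined(U, S)

# The per-cell ratio of total unspliced to total spliced counts is preserved.
ratio_before = U.sum(axis=1) / S.sum(axis=1)
ratio_after = U_norm.sum(axis=1) / S_norm.sum(axis=1)
```

Normalising the two matrices separately would instead force every cell to the same total U and the same total S, which removes any cell-type-specific variation in that ratio.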

    S14 Fig. Comparison of recovered reaction rate parameters on pancreas endocrinogenesis dataset.

    Range of transcription rate α, splicing rate β, and degradation rate γ estimated by scVelo (in red) and κ-velo (in blue).

    (TIFF)

    S15 Fig. PCA projection of velocities in the pancreas endocrinogenesis dataset.

    (A) Velocities returned by κ-velo projected on PCA embedding of spliced counts. (B) Velocities returned by scVelo projected on PCA embedding of spliced counts. We note that the gene space used is different for the two methods, as they have different criteria for gene selection. scVelo uses 1809 genes, while κ-velo uses 134.

    (TIFF)

    S16 Fig. Smoothed κ-velo projection of velocities in the pancreas endocrinogenesis dataset.

    The two UMAPs compare (A) smoothed scVelo velocities projected by Nyström projection and (B) smoothed κ-velo velocities projected by Nyström projection. Velocities were smoothed by averaging over the 30 nearest neighbours. Neighbourhoods are calculated in S space.

    (TIFF)
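The smoothing step mentioned in the S16 Fig caption (averaging velocities over the 30 nearest neighbours, with neighbourhoods computed in S space) can be sketched with plain NumPy. The function name `smooth_velocities`, the brute-force distance computation, and the decision to count each cell as its own nearest neighbour are assumptions for illustration; the published pipeline may differ in these details.

```python
import numpy as np


def smooth_velocities(S, V, k=30):
    """Average each cell's velocity over its k nearest neighbours in S space."""
    # Pairwise squared Euclidean distances between cells (brute force, small n only).
    sq = np.sum(S ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (S @ S.T)
    # Indices of the k closest cells per row; the cell itself is included (assumption).
    nn = np.argpartition(d2, kth=k - 1, axis=1)[:, :k]
    # V[nn] has shape (n_cells, k, n_genes); average over the neighbour axis.
    return V[nn].mean(axis=1)


# Hypothetical spliced-state matrix S and velocity matrix V, both (n_cells, n_genes).
rng = np.random.default_rng(2)
S = rng.normal(size=(100, 10))
V = rng.normal(size=(100, 10))
V_smooth = smooth_velocities(S, V, k=30)
```

Larger values of k give smoother arrows at the cost of blurring differences between neighbouring cells, which is why the captions show raw and smoothed visualisations side by side.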

    S17 Fig. Quantitative comparison of low-dimensional projection of velocities.

    We compare scVelo velocities projected by scVelo (v1) to κ-velo velocities projected by Nyström projection (v2) for every cell. (A) UMAP coloured by cell types. (B) Difference in the norm of the two vectors ‖v1‖ − ‖v2‖.

    (TIFF)

    S18 Fig. UMAP embedding of the HSPC dataset as calculated in the κ-velo pipeline.

    Cells are coloured by (A) our assigned cell types (see Note I in S1 Appendix) or (B) the cell type assignments from the original data analysis [23].

    (TIFF)

    S19 Fig. Smoothed κ-velo projection of velocities in the HSPC dataset.

    Velocities were smoothed by averaging over the 30 nearest neighbours. Neighbourhoods are calculated in S space. Non-smoothed projection in main Fig 5B.

    (TIFF)

    S20 Fig. Eco-velo projection of velocities (calculated on simulations) shown on PCA in (A) and UMAP in (B).

    (TIFF)

    S21 Fig. Smoothed eco-velo projection of velocities in the pancreas endocrinogenesis dataset.

    Velocities were smoothed by averaging over the 50 nearest neighbours. Neighbourhoods are calculated in S space.

    (TIFF)

    S22 Fig. Eco-velo applied on HSPC dataset using UMAP embedding for visualisation.

    Left: raw vector visualisation, right: smoothed vector visualisation. Like scVelo (main Fig 5C), the velocities point from the more differentiated populations back to the stem cells.

    (TIFF)

    S23 Fig. Quantitative comparison of low-dimensional projection of velocities.

    We compare κ-velo velocities projected by Nyström projection (v1) to eco-velo velocities (v2) projected onto the UMAP calculated in the κ-velo pipeline and shown in main Fig 4 for every cell. (A) UMAP coloured by cell types. (B) Cosine similarity between the two vectors. (C) Norm of the difference between the two vectors ‖v1 − v2‖. (D) Difference in the norm of the two vectors ‖v1‖ − ‖v2‖. Cells are coloured in grey when we do not have a velocity value for eco-velo, i.e. the cell does not have a mutual nearest neighbour within the top 50 neighbours.

    (TIFF)
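The S23 Fig caption notes that a cell has no eco-velo velocity when it lacks a mutual nearest neighbour within its top 50 neighbours. A generic mutual-nearest-neighbour search between two point sets could look like the sketch below. Which two representations eco-velo actually compares is described in the main text, not here, so `A` and `B` are placeholders, and the function name `first_mutual_nn` and the return value of -1 for "no neighbour found" are hypothetical choices.

```python
import numpy as np


def first_mutual_nn(A, B, k=50):
    """For each row of A, the index of its closest mutual nearest neighbour in B, or -1."""
    # Squared Euclidean distances between every point in A and every point in B.
    d2 = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2.0 * (A @ B.T)
    k_ab = min(k, B.shape[0])
    k_ba = min(k, A.shape[0])
    # Neighbour lists sorted by increasing distance.
    a_to_b = np.argsort(d2, axis=1)[:, :k_ab]    # top-k points in B for each point in A
    b_to_a = np.argsort(d2.T, axis=1)[:, :k_ba]  # top-k points in A for each point in B
    result = np.full(A.shape[0], -1)
    for i in range(A.shape[0]):
        for j in a_to_b[i]:
            # Mutual: j is among i's neighbours and i is among j's neighbours.
            if i in b_to_a[j]:
                result[i] = j
                break
    return result


# Tiny hypothetical example: the third point of A is far away and finds no partner.
A = np.array([[0.0], [1.0], [10.0]])
B = np.array([[0.1], [1.2]])
mnn = first_mutual_nn(A, B, k=1)
```

Cells for which the search returns -1 correspond to the grey cells in the figure: they are left without a velocity rather than being assigned one from a one-sided neighbour.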

    Attachment

    Submitted filename: PLOS-towards-reliable.docx

    Attachment

    Submitted filename: review.pdf

    Attachment

    Submitted filename: Response_First_Revision.pdf

    Data Availability Statement

    All analysed datasets are publicly available. The pancreatic endocrinogenesis dataset is available from the Gene Expression Omnibus (GEO) under accession GSE132188. The murine gastrulation dataset is available on the ArrayExpress database (http://www.ebi.ac.uk/arrayexpress) under accession number E-MTAB-6967. For both datasets the count matrices can be downloaded directly from the scVelo Python implementation (https://scvelo.org) v0.2.4. The raw data from the chromaffin dataset are available on GEO under accession number GSE99933. The count matrices are made available at http://velocyto.org. The count matrices of the HSPC dataset are available on our GitHub page: https://github.com/HaghverdiLab/velocity_notebooks. This GitHub page also contains all notebooks necessary to reproduce the results reported in this paper. A Python implementation of the κ-velo and eco-velo pipeline can be found at https://github.com/HaghverdiLab/velocity_package.


    Articles from PLoS Computational Biology are provided here courtesy of PLOS