Abstract
Motivation
A number of pseudotime methods have provided point estimates of the ordering of cells for scRNA-seq data. A still limited number of methods also model the uncertainty of the pseudotime estimate. However, there is still a need for a method to sample from complicated and multi-modal distributions of orders, and to estimate changes in the amount of the uncertainty of the order during the course of a biological development, as this can support the selection of suitable cells for the clustering of genes or for network inference.
Results
In applications to scRNA-seq data we demonstrate the potential of GPseudoRank to sample from complex and multi-modal posterior distributions and to identify phases of lower and higher pseudotime uncertainty during a biological process. GPseudoRank also correctly identifies cells precocious in their antiviral response and links uncertainty in the ordering to metastable states. A variant of the method extends the advantages of Bayesian modelling and MCMC to large droplet-based scRNA-seq data sets.
Availability and implementation
Our method is available on GitHub: https://github.com/magStra/GPseudoRank.
1. Introduction
Single cell RNA-seq (scRNA-seq) technology can assay mRNA expression levels in individual cells and has revealed substantial inter-cell heterogeneity. Technical noise contributes to this heterogeneity but part of it is attributable to biologically meaningful inter-cell differences, see, for instance, Brennecke et al., 2013; Vallejos et al., 2015. Due to the destruction of the cells as a result of the measurement process, scRNA-seq only provides a single measurement per cell (Stegle et al., 2015), never time series data following the development of the same single cell. However, individual cells progress through changes at different time scales (Trapnell et al., 2014). Thus it is possible to obtain a form of time series data even from cross-sectional data by statistical means, an approach referred to as pseudotime ordering.
Most approaches to pseudotemporal ordering are based on representing cells as ng-dimensional vectors, where ng is the number of selected genes. Algorithms exploit the neighbourhood structure of these vectors to find a pseudotemporal ordering, a linear ordering of all or most cells so that cells which are close in ℝ^ng are also close in the linear ordering.
Wanderlust (Bendall et al., 2014) and SLICER (Welch, 2017; Welch et al., 2016) are two examples of methods based on k nearest neighbours graphs. SLICER additionally first applies LLE (local linear embedding) (Roweis and Saul, 2000) for dimensionality reduction. A number of methods are based on diffusion maps (Angerer et al., 2016; Haghverdi et al., 2015, 2016; Setty et al., 2016). TSCAN (Ji and Ji, 2016, 2015) is based on the construction of a minimum spanning tree (MST) between centroids of clusters, with an intermediate clustering step. Another well-known method using MST and clustering is Monocle 2 (Qiu et al., 2017), which applies graph structure learning (Mao et al., 2015).
The approaches mentioned above and a number of others provide a single pseudotime ordering without modelling uncertainty. Campbell and Yau (2016) examined the stability of Monocle’s pseudotime estimation when applied to random subsets of cells. They showed that the estimates can vary significantly. Thus quantification of uncertainty in pseudotime is crucial to avoid overconfidence. There are two existing methods for pseudotime estimation using MCMC to sample from a posterior distribution (Campbell and Yau, 2016; Reid and Wernisch, 2016), and a few others using variational methods (Ahmed et al., 2018; Reid and Wernisch, 2016; Welch et al., 2017). Both groups of methods use Gaussian processes (GPs, see Section 2.1) to model the data. However, these methods sample from, or approximate in the case of variational inference, posterior distributions of continuous pseudotime vectors in ℝ^n, rather than sampling the ordering as a permutation.
We propose GPseudoRank, an algorithm sampling from a posterior distribution of pseudo-orders instead of pseudotimes, avoiding the exploration of pseudotime assignments that all map to the same ordering. MCMC samplers (such as NUTS (Hoffman and Gelman, 2014)) suitable for use in continuous pseudotime spaces make local moves that can have problems exploring bi-modal posteriors. GPseudoRank, by contrast, exploits a range of local and long-distance MCMC moves tailored to traverse the space of permutations efficiently. It also provides continuous pseudotime estimates by deriving a pseudotime vector from a fixed ordering through a deterministic transformation. This is based on the observation that most continuous pseudotime vectors with high likelihood are concentrated around pseudotime vectors derived from orderings through this transformation.
2. Methods
2.1. Single-cell trajectories as stochastic processes
We assume we have preprocessed log-transformed gene expression data in the form yg(c) of gene g = 1, . . . , ng, in cell c = 1, . . . , T (see Section 2.7 for preprocessing steps). We start with a vector of time points τ = (τ1, . . . , τT) and define an ordering of cells as a permutation o = (o1, . . . , oT), oi ∈ {1, . . . , T}, oi ≠ oj for i ≠ j, where oi is the index of the cell assigned to time τi in the ordering. We model the gene expression trajectories yg = (yg(o1), . . . , yg(oT)) for each gene g by Gaussian processes (GPs) (Rasmussen and Williams, 2006), conditional on an ordering o of the cells. A GP is a distribution over functions of time in terms of a mean function μ and a covariance function Σ. For an input vector τ = (τ1, . . . , τT) of time points, μ(τ) is a vector of T mean values of function evaluations at these time points and Σ(τ) a T × T matrix of covariances of function evaluations at these points. The distribution of functions f ~ GP(μ, Σ) is described by stating that, for any vector of time points τ = (τ1, . . . , τT), evaluations f(τi) follow a multivariate normal (f(τ1), . . . , f(τT)) ~ N(μ(τ), Σ(τ)). Here we use a squared exponential covariance function for Σ:
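\[ \Sigma(\tau_i, \tau_j) = \sigma_w^2 \exp\left( -\frac{(\tau_i - \tau_j)^2}{2 l^2} \right) + \sigma_\varepsilon^2 \delta_{ij} \tag{1} \]
where σ_w² is a scale parameter, l a length scale, σ_ε² a term representing measurement noise, and δij equals 1 if i = j and 0 otherwise.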
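The covariance matrix Σ(τ) can be assembled directly from a vector of time points. The following is a minimal sketch, assuming the common parameterisation of the squared exponential kernel with 2l² in the denominator; the function name and the argument names sigma_w2, length_scale and sigma_eps2 are illustrative and not taken from the GPseudoRank code.

```python
import numpy as np

def se_covariance(tau, sigma_w2, length_scale, sigma_eps2):
    """Squared exponential covariance with measurement noise on the diagonal."""
    tau = np.asarray(tau, dtype=float)
    diff = tau[:, None] - tau[None, :]            # pairwise time differences
    K = sigma_w2 * np.exp(-diff**2 / (2.0 * length_scale**2))
    return K + sigma_eps2 * np.eye(tau.size)

# Three time points: nearby points are more strongly correlated than distant ones.
K = se_covariance([0.1, 0.5, 0.9], sigma_w2=1.0, length_scale=0.5, sigma_eps2=0.1)
```

The diagonal entries equal the sum of the scale and noise terms, which is the decomposition of total variability used for the priors in Section 2.3.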
Given an ordering o, the expression data for gene g can be ordered accordingly: yg(o) = (yg(o1), . . . , yg(oT)) and we model this trajectory as
\[ y_g(o) \mid o \sim \mathcal{N}\left( \mu(\tau), \Sigma(\tau) \right) \tag{2} \]
for each gene g = 1, . . . , ng, where τ = (τ1, . . . , τT) are time points. In practice, we assume a zero-mean GP: μ = 0. To adjust the data accordingly we subtract the overall mean across all genes and cells from each entry in the matrix of gene expression levels (see Section 2.7.2).
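To make the role of the ordering concrete, the sketch below evaluates the zero-mean GP log-likelihood of the reordered data, summed over genes, for a given ordering. It assumes a squared exponential kernel with 2l² in the denominator and independent genes sharing one set of GP parameters; the function and argument names are hypothetical and this is not the reference implementation.

```python
import numpy as np

def gp_log_likelihood(Y, order, tau, sigma_w2, length_scale, sigma_eps2):
    """Sum over genes of the zero-mean GP log density of the reordered data.

    Y has shape (n_genes, n_cells); order[i] is the cell placed at time tau[i].
    """
    Y = np.asarray(Y, dtype=float)
    tau = np.asarray(tau, dtype=float)
    Y_ord = Y[:, order] - Y.mean()                # reorder cells, subtract overall mean
    diff = tau[:, None] - tau[None, :]
    K = sigma_w2 * np.exp(-diff**2 / (2.0 * length_scale**2))
    K += sigma_eps2 * np.eye(tau.size)
    L = np.linalg.cholesky(K)                     # K = L L^T
    alpha = np.linalg.solve(L, Y_ord.T)           # whitened data, one column per gene
    n_genes, T = Y_ord.shape
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    quad = np.sum(alpha**2)                       # sum over genes of y^T K^{-1} y
    return -0.5 * (quad + n_genes * log_det + n_genes * T * np.log(2.0 * np.pi))

# Smooth toy trajectories: the generating order should beat a random one.
tau = np.linspace(0.05, 0.95, 20)
Y = np.vstack([np.sin(2 * np.pi * (tau + g / 5.0)) for g in range(5)])
true_order = np.arange(20)
shuffled = np.random.default_rng(0).permutation(20)
ll_true = gp_log_likelihood(Y, true_order, tau, 0.5, 0.3, 0.05)
ll_shuffled = gp_log_likelihood(Y, shuffled, tau, 0.5, 0.3, 0.05)
```

Because the kernel rewards smooth trajectories, an ordering that scrambles neighbouring cells is penalised through the quadratic term.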
2.2. Geodesic mapping
Pseudotime should not be confused with physical time in which cell development unfolds. In order to identify the latent time points τ = (τ1, . . . , τT), which we assume to be unknown, together with the smoothness parameters of the GP, we have to make additional assumptions. The overall scale can be fixed by assuming τi ∈ [0, 1]. Each cell can then be assigned some rank time from equidistant time points ((i – 0.5)/T | i = 1, . . . , T). Rank time is similar to the concept of master time developed in Welch et al., 2017. Simply identifying pseudotime with rank time has some drawbacks. Rank time depends on the number of cells sampled per capture time, which often is rather arbitrary. It also does not allow any local change in scale. We therefore suggest a different route for identifying latent time points. We assume the covariance structure, essentially the smoothness of the process, is independent of time, that is, the GP is stationary. Pseudotime can then be considered a latent variable measuring biological development rather than physical time (Ahmed et al., 2018; Campbell and Yau, 2016; Reid and Wernisch, 2016; Welch et al., 2017). For periods of slower development, for example, pseudotime intervals will be shorter than physical time intervals and longer for faster development. In order to account for such change in scale over time we compute time points for any given ordering o as follows (recall oj is the index of the cell in position j).
\[ \tau_j(o) = \frac{1}{Z} \sum_{i=2}^{j} \lVert y(o_i) - y(o_{i-1}) \rVert_2, \qquad j = 1, \dots, T \tag{3} \]
where y(oj) = (y1(oj), . . . , yng(oj))^T and ∥·∥2 is the Euclidean norm in ℝ^ng. We set the normalising constant Z to the total path length, the sum of ∥y(oi) − y(oi−1)∥2 over i = 2, . . . , T, to obtain pseudotimes τ(o) in the interval [0, 1]. For cells next to each other in the order o, this mapping puts them closer in pseudotime if they are similar in their expression profiles and further apart if they are less so. That is, the j-th time point τj is the geodesic distance of cell oj from the first cell o1, where we approximate the geodesic distance as the sum of the Euclidean distances between the cells ranked next to each other, similar to the dimensionality reduction method Isomap (Tenenbaum et al., 2000). Geodesic distances have previously been used for pseudotime estimation, see for instance Qiu et al., 2017; Welch, 2017.
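The mapping from an ordering to pseudotimes can be sketched in a few lines. The example assumes that the cumulative distances are normalised by the total path length, so that the first cell receives pseudotime 0 and the last cell pseudotime 1; the function name is illustrative.

```python
import math

def geodesic_pseudotime(cells, order):
    """Map an ordering to pseudotimes in [0, 1] via cumulative Euclidean distances.

    cells[c] is the expression vector of cell c; order[j] is the cell at position j.
    """
    cumulative = [0.0]
    for prev, curr in zip(order, order[1:]):
        cumulative.append(cumulative[-1] + math.dist(cells[prev], cells[curr]))
    total = cumulative[-1]
    if total == 0.0:                  # all cells identical: no scale to recover
        return [0.0] * len(order)
    return [c / total for c in cumulative]

# Cells 0 and 1 are far apart (distance 5), cells 1 and 2 are close (distance 1).
cells = [[0.0, 0.0], [3.0, 4.0], [3.0, 5.0]]
tau = geodesic_pseudotime(cells, order=[0, 1, 2])
```

In this toy example the two similar cells end up close together in pseudotime, whereas rank time would space all three cells equally.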
2.3. Gaussian process priors
The correct ordering o of cells is distinguished by comparatively low measurement noise σ_ε² in (1), since most of the variation is captured by the trajectory whose variability is determined by the scale parameter σ_w². Therefore informative priors for the noise parameters are necessary to ensure the model concentrates probability mass around the correct order and to prevent a sampling or estimation algorithm from getting trapped in local modes. Furthermore, since total variability is a sum of measurement noise and signal variability, we sample only σ_w² and set σ_ε² = V − σ_w², where V is the sample variance taken across the entire ng × T matrix of gene expression levels of T cells for ng genes. The priors are as follows:
We set υ = 0.01 for all the single-cell data sets considered (see Section 2.7.2). A strong prior is preferable for single-cell data because of their high noise levels. With a vague prior on the length scale, the posterior length scale tends to be too short and the GP tends to overfit.
2.4. MCMC sampling
Markov Chain Monte Carlo (MCMC) methods (Gilks et al., 1996) are widely used to sample from continuous posterior densities in Bayesian statistics. After convergence, MCMC chains provide samples from the posterior distribution. More specifically, our method uses a Metropolis-Hastings approach (Hastings, 1970; Metropolis et al., 1953). For each given state of the Markov Chain, a new state is proposed using a proposal distribution, and accepted with a probability given by an acceptance ratio. While the construction of proposal distributions is often straightforward in the continuous case, we developed a set of proposal moves to sample from the discrete distribution of orders (see Section 2.5). For the sampling of the GP parameters we use Gaussian proposal distributions, adapting their standard deviation during burn-in aiming at acceptance rates between 0.45 and 0.5.
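The adaptation of the Gaussian proposal standard deviation during burn-in can be illustrated as follows. The target band of 0.45 to 0.5 is taken from the text; the multiplicative update with a factor of 1.1 is an assumption made for this sketch, and the actual adaptation rule may differ.

```python
def adapt_proposal_sd(sd, n_accepted, n_proposed, low=0.45, high=0.5, factor=1.1):
    """Rescale a Gaussian proposal standard deviation towards the target band."""
    rate = n_accepted / n_proposed
    if rate > high:       # accepting too often: take bolder steps
        return sd * factor
    if rate < low:        # rejecting too often: take smaller steps
        return sd / factor
    return sd             # acceptance rate already inside the band

# After a batch of 100 proposals with 60 acceptances the step size grows.
new_sd = adapt_proposal_sd(1.0, n_accepted=60, n_proposed=100)
```

Such an update would only be applied during burn-in, since changing the proposal after burn-in would alter the stationary behaviour of the chain.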
2.5. Sampling orderings
In the following we propose a Metropolis-Hastings algorithm for the sampling of the orderings. Preliminary experience with a variety of combinatorial moves to sample permutations led to the following set of five core moves, each with probability pj, j = 1, . . . , 5. In the following, we use sampling parameters n0, γ, n3, n3a:
Move 1, iterated swapping of neighbouring cells: draw the number of swaps to be applied, r1, uniformly from 1, . . . , n0 and draw r1 swap positions P1, . . . , Pr1 from 1, . . . , T – 1 with replacement. Then iterate for j = 1, . . . , r1: swap cell at position Pj with its neighbor at position Pj + 1.
Move 2, swapping of cells with short L1-distances: select two positions i and j according to probability pij ∝ exp(–d(ci, cj)^2/γ), where d refers to the L1 distances of cells ci and cj (as ng-dimensional vectors) in these positions. Move ci to position j and cj to position i.
Move 3, reversing segments between cells with short L1-distances: obtain two positions i and j as in move 2 and reverse the ordering of all cells in between, including cells at i and j.
Move 4, short random permutations: draw a number r2 of short permutations uniformly from 1, . . . , n3. For each j = 1, . . . , r2, draw a number r3,j uniformly from 3, . . . , max(n3a, 3) and a cell position kj uniformly from 1, . . . , T – r3,j. Randomly permute the cells at positions kj, . . . , kj + r3,j.
Move 5, reversing the entire ordering.
The rationale for moves 2 and 3 is that two cells which are positioned apart in the ordering should only be exchanged (move 2) or the segment between them reversed (move 3) if these cells have similar expression profiles and the smoothness of the trajectory remains intact after the move. For move 1 we use a default setting of n0 = ⌊T/4⌋ for the simulation studies. For move 4 we set n3 = ⌊T/20⌋, and n3a = ⌊T/12⌋. The distributions for choosing moves 2 and 3 may be tempered, that is taken to the power of a factor 0 < α < 1, to lower acceptance rates if required.
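The five proposal moves can be sketched as functions that take an ordering and return a proposed ordering. This is an illustrative reading of the descriptions above using 0-based positions, not the reference implementation: it covers only the proposals, not the Metropolis-Hastings acceptance step, omits tempering, and assumes T exceeds the length of any short permutation. All function names are hypothetical.

```python
import math
import random

def move1_neighbour_swaps(order, n0, rng):
    """Move 1: apply between 1 and n0 swaps of adjacent positions."""
    new = list(order)
    for _ in range(rng.randint(1, n0)):
        p = rng.randrange(len(new) - 1)           # swap position p with p + 1
        new[p], new[p + 1] = new[p + 1], new[p]
    return new

def draw_close_pair(order, cells, gamma, rng):
    """Draw positions i < j with probability proportional to exp(-d^2 / gamma)."""
    pairs, weights = [], []
    for i in range(len(order)):
        for j in range(i + 1, len(order)):
            d = sum(abs(a - b) for a, b in zip(cells[order[i]], cells[order[j]]))
            pairs.append((i, j))
            weights.append(math.exp(-d * d / gamma))
    return rng.choices(pairs, weights=weights, k=1)[0]

def move2_swap(order, cells, gamma, rng):
    """Move 2: exchange two cells with a short L1 distance."""
    i, j = draw_close_pair(order, cells, gamma, rng)
    new = list(order)
    new[i], new[j] = new[j], new[i]
    return new

def move3_reverse_segment(order, cells, gamma, rng):
    """Move 3: reverse the segment between two similar cells, ends included."""
    i, j = draw_close_pair(order, cells, gamma, rng)
    new = list(order)
    new[i:j + 1] = new[i:j + 1][::-1]
    return new

def move4_short_permutations(order, n3, n3a, rng):
    """Move 4: randomly permute between 1 and n3 short stretches of cells."""
    new = list(order)
    for _ in range(rng.randint(1, n3)):
        r = rng.randint(3, max(n3a, 3))
        k = rng.randrange(len(new) - r)           # positions k, ..., k + r stay in range
        segment = new[k:k + r + 1]
        rng.shuffle(segment)
        new[k:k + r + 1] = segment
    return new

def move5_reverse_all(order):
    """Move 5: reverse the entire ordering."""
    return list(reversed(order))

# Toy example: 12 one-dimensional cells; parameter values chosen for illustration.
rng = random.Random(0)
cells = [[float(c)] for c in range(12)]
order = list(range(12))
proposals = [
    move1_neighbour_swaps(order, n0=3, rng=rng),
    move2_swap(order, cells, gamma=10.0, rng=rng),
    move3_reverse_segment(order, cells, gamma=10.0, rng=rng),
    move4_short_permutations(order, n3=1, n3a=3, rng=rng),
    move5_reverse_all(order),
]
```

Note that for small T the default ⌊T/20⌋ would be zero, so the toy example sets n3 explicitly. Every move returns a permutation of the same cells, so the sampler never leaves the space of orderings.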
For the simulation studies we apply all possible combinations of moves 1 to 4 with equal probabilities and move 5 with a probability of 0.002. For the 5 experimental data sets analysed (see Section 2.7.2), we chose default parameters depending on the number of cells (see Section 6 of the supplement), and slightly adapted some of them to optimise convergence rates. For details on the parameters for the proposal distribution for all the data sets, see Table 5 in the supplement.
As our posterior distribution is a symmetric function of the order, each order and its reverse will be sampled with equal probability from the posterior distribution. We remove this symmetry in further analysis by reversing sampled orders which are negatively correlated with the capture times, if available, or else with the expression of marker genes. For details and an application, see Section 4 of the supplementary materials.
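Removing the reversal symmetry amounts to a sign check on each sampled order. The sketch below uses the sign of the covariance between position and capture time as the criterion; whether the original analysis uses Pearson or rank correlation is not stated here, but the sign test is the same idea. The function name is illustrative.

```python
def orient_order(order, capture_time):
    """Return the order or its reverse, whichever runs forward in capture time.

    order[i] is the cell at position i; capture_time[c] is the capture time of cell c.
    """
    T = len(order)
    mean_pos = (T - 1) / 2.0
    times = [capture_time[c] for c in order]
    mean_time = sum(times) / T
    # Sign of the covariance between position and capture time.
    cov = sum((i - mean_pos) * (t - mean_time) for i, t in enumerate(times))
    return list(reversed(order)) if cov < 0 else list(order)

capture_time = [0, 0, 1, 1, 2, 2]
oriented = orient_order([5, 4, 3, 2, 0, 1], capture_time)   # sampled back to front
```

Applying this to every posterior sample places all samples in the same orientation before positions or pseudotimes are summarised.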
2.6. Method for large data sets
To decrease run times for very large data sets, we perform a preprocessing step clustering cells into a large number of very small clusters using k-means clustering. If capture times are provided, this is done separately for each group of cells at the different capture times. The recommended number of clusters for each capture time is one eighth of the number of cells at that capture time. One might also want to set an absolute minimum number of clusters per capture time. The number of clusters may be decreased substantially for very large data sets, as they include larger numbers of similar cells, making our method applicable to data sets with tens of thousands of cells. We then apply GPseudoRank to the k centroids, reducing the computational complexity of each individual likelihood computation. The proposed preprocessing step also drastically reduces the number of samples required for convergence by reducing the size of the sample space.
The posterior distribution of orderings of the centroids of the mini-clusters is obtained. To the individual cells of a mini-cluster we assign the posterior pseudotimes of its centroid. To assess the accuracy of this approximation, we applied it to two medium-sized data sets with 307 and 550 cells respectively, where inference with the exact model is feasible. For details and a comparison to sparse GP approximation, see Section 3 of the supplementary materials.
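The mini-cluster preprocessing can be sketched with a plain Lloyd's k-means run separately within each capture time. The helper names, the fixed number of iterations, the seeding and the min_clusters argument are assumptions for illustration; a production analysis would more likely rely on an established k-means implementation.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = X[labels == c]
            if len(members) > 0:                  # keep old centroid if cluster is empty
                centroids[c] = members.mean(axis=0)
    return centroids, labels

def mini_clusters(X, capture_times, fraction=1 / 8, min_clusters=1):
    """Cluster cells within each capture time; X has shape (n_cells, n_genes).

    Returns the stacked centroids and, per cell, the index of its centroid.
    """
    X = np.asarray(X, dtype=float)
    capture_times = np.asarray(capture_times)
    cell_to_centroid = np.empty(len(X), dtype=int)
    all_centroids = []
    offset = 0
    for t in np.unique(capture_times):
        idx = np.flatnonzero(capture_times == t)
        k = min(len(idx), max(min_clusters, int(round(len(idx) * fraction))))
        centroids, labels = kmeans(X[idx], k)
        cell_to_centroid[idx] = labels + offset
        all_centroids.append(centroids)
        offset += k
    return np.vstack(all_centroids), cell_to_centroid

# 32 simulated cells at two capture times give 2 mini-clusters per capture time.
rng = np.random.default_rng(1)
X = rng.normal(size=(32, 5))
capture_times = np.repeat([0, 1], 16)
centroids, cell_to_centroid = mini_clusters(X, capture_times)
centroid_pseudotime = np.array([0.1, 0.2, 0.8, 0.9])   # made-up values for illustration
cell_pseudotime = centroid_pseudotime[cell_to_centroid]
```

The last line shows the approximation step: each cell simply inherits the pseudotime of the centroid of its mini-cluster.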
2.7. Data sets
2.7.1. Simulated data
The efficacy of the individual moves and of combinations of different moves for different types of data is first assessed on simulated data. We simulate ng = 50 genes for T = 90 cells. For each simulation study we generate 16 data sets. On each of these data sets we run MCMC chains using all the possible combinations of the four proposed moves (with equal probability for combinations of more than one move). Since in the simulations we are mostly interested in the assessment of ordering moves and not in parameter estimation, we fix the GP parameters to their true values and fix time points to rank time.
Simulation 1: three capture times, low noise. Each of the 16 data sets is generated as follows. First 90 temporal input points are drawn uniformly from [0, 1]. Then for each of the 50 genes in each of the simulated data sets, a parameter set for a GP underlying the trajectory of the simulated gene is drawn from
The data are assumed to be obtained at three capture times with 30 cells each.
Simulation 2: two capture times, low noise. The setup is similar to simulation 1, but with two capture times, where 30 cells are assigned to the first capture time, and the remaining 60 to the second.
Simulation 3: three capture times, high noise. The setup is similar to simulation 1, but with a higher noise level.
2.7.2. Single cell RNA-seq and RT-PCR data sets
Shalek et al. (2014) examined the response of primary mouse bone-marrow-derived dendritic cells in three different conditions using single-cell RNA-seq. We apply GPseudoRank to the lipopolysaccharide stimulated (LPS) condition. Shalek et al. (2014) identified four modules of genes. As in Reid and Wernisch (2016), we use a total of 74 genes from the four modules with the highest temporal variance relative to their noise levels. The number of cells is 307, with 49 unstimulated cells, 75 captured after 1h, 65 after 2h, 60 after 4h, and 58 after 6h. We use an adjustment for cell size developed by Anders and Huber (2010), also used in Reid and Wernisch (2016).
Klein et al., 2015 generated a droplet-based data set of mouse embryonic stem cells after leukemia inhibitory factor withdrawal (0d, 2d, 4d, 7d). We apply GPseudoRank to the main branch with 1543 cells identified in a previous publication (the third branch in Figure 2c of Haghverdi et al., 2016). Shin et al., 2015 generated an in-vivo scRNA-seq data set of mouse adult hippocampal quiescent neural stem cells and their immediate progeny and used 101 cells for their subsequent analysis. Stumpf et al., 2017 generated RT-PCR data following the development of mouse embryonic stem cells along the neuronal lineage (0h, 24h, 48h, 72h, 96h, 120h, 172h) (550 cells after preprocessing). Shalek et al., 2013 obtained scRNA-seq data from mouse bone-marrow-derived dendritic cells after exposure to lipopolysaccharide for 4 hours (18 cells). We refer to the data sets as Shalek, Klein, Shin, Stumpf, and Shalek13, respectively. For a description of the data sets, their availability (all publicly available) and preprocessing steps, see Section 2 of the supplementary materials. For all data sets with different capture times (except the Stumpf data set, which only contains 96 genes), we use an ANOVA test (Murphy, 2012, ch. 7) for differences in mean expression between capture times (the mean being taken across the cells of one capture time) to select a set of genes most relevant to the ordering. In the absence of capture times, we use genes with both a high mean expression and high variance (for details see Section 2 of the supplement).
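Gene filtering by capture time can be illustrated with a per-gene one-way ANOVA F statistic. This is a minimal sketch of the standard statistic only; how many genes are retained, and any multiple-testing treatment, are described in the supplement and are not reproduced here. The function name is illustrative.

```python
import numpy as np

def anova_f(expression, capture_times):
    """One-way ANOVA F statistic per gene; expression has shape (n_genes, n_cells)."""
    expression = np.asarray(expression, dtype=float)
    capture_times = np.asarray(capture_times)
    groups = np.unique(capture_times)
    n_cells = expression.shape[1]
    grand_mean = expression.mean(axis=1)
    ss_between = np.zeros(expression.shape[0])
    ss_within = np.zeros(expression.shape[0])
    for t in groups:
        block = expression[:, capture_times == t]
        group_mean = block.mean(axis=1)
        ss_between += block.shape[1] * (group_mean - grand_mean) ** 2
        ss_within += ((block - group_mean[:, None]) ** 2).sum(axis=1)
    ms_between = ss_between / (len(groups) - 1)
    ms_within = ss_within / (n_cells - len(groups))
    return ms_between / ms_within

# Gene 0 shifts with capture time, gene 1 does not.
expression = [[1, 2, 3, 2, 3, 4, 3, 4, 5],
              [1, 2, 3, 1, 2, 3, 1, 2, 3]]
capture_times = [0, 0, 0, 1, 1, 1, 2, 2, 2]
f_stat = anova_f(expression, capture_times)
ranking = np.argsort(-f_stat)          # genes ordered from most to least time-dependent
```

Genes whose mean expression changes between capture times obtain large F values and are ranked first.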
2.8. Convergence assessment and analysis of posterior distribution
For thorough convergence assessment, we run multiple MCMC chains. For the simulated data sets we run 100,000 iterations per MCMC chain for 5 chains and apply a thinning factor of 10. For the scRNA-seq data sets we used the same thinning factor and at least 3 MCMC chains for convergence analysis (12 for the Shalek and Shin data sets, where we ran analyses without capture times). The number of samples required depends on the data set (see Table 5 in the supplement with all the examples, providing approximate guidance on the number of samples required for similar data sets). In order to assess convergence and not to bias the sampler towards specific orderings, all chains are seeded with random starting orders and with random GP parameters sampled from the prior distribution. However, we do restrict starting orders to permutations of cells within, but not across, capture times. The restriction, while resulting in faster convergence, is not actually necessary (see Section 4 of the supplementary materials for details).
To check convergence, we use the Gelman-Rubin R̂-statistic (Gelman and Rubin, 1992), corrected for sampling variability (Brooks and Gelman, 1998), implemented in the R-package coda (Plummer et al., 2006). The R̂-statistic estimates the factor by which the pooled variance across all the chains is larger than the within-sample variance. For convergent chains, R̂ approaches 1 as the number of samples tends to infinity. According to Brooks and Gelman, 1998, convergence may be assumed to have been reached if R̂ < 1.2. We apply the stricter recommendation of R̂ < 1.1 (Gelman and Shirley, 2011). We compute the statistics for the following two quantities: first, the log-likelihood, and second the L1-distances of the sampled cell positions from a fixed reference set of cell positions, for which we use the true order, if known, and 1, . . . , T, where T is the number of cells, in case of scRNA-seq data. We compute the statistics a number of times during sampling, each time discarding the first 50% (Gelman and Shirley, 2011). We compare the speed of convergence for different combinations of proposal moves in the simulation studies. See Section 1 in the supplementary materials for details.
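The two ingredients of the convergence check can be sketched as follows. The first function computes the basic Gelman-Rubin statistic after discarding the first half of each chain; it omits the correction for sampling variability of Brooks and Gelman (1998) that coda applies, so values will differ slightly from those reported. The second computes the L1 distance of sampled cell positions from reference positions. Function names are illustrative.

```python
from statistics import mean, variance

def gelman_rubin(chains):
    """Basic potential scale reduction factor for equally long scalar chains.

    The first half of every chain is discarded as burn-in.
    """
    kept = [chain[len(chain) // 2:] for chain in chains]
    n = len(kept[0])
    W = mean([variance(c) for c in kept])          # mean within-chain variance
    B = n * variance([mean(c) for c in kept])      # between-chain variance
    pooled = (n - 1) / n * W + B / n
    return (pooled / W) ** 0.5

def l1_distance_to_reference(order, reference_position):
    """L1 distance between sampled cell positions and reference positions.

    order[pos] is the cell at position pos; reference_position[c] is the
    reference position of cell c.
    """
    return sum(abs(pos - reference_position[cell]) for pos, cell in enumerate(order))

# Chains that agree after burn-in versus chains stuck in different regions.
agreeing = [[0, 0, 0, 0, 1, 2, 3, 4], [9, 9, 9, 9, 1, 2, 3, 4]]
disagreeing = [[0, 0, 0, 0, 1, 2, 3, 4], [0, 0, 0, 0, 11, 12, 13, 14]]
r_agree = gelman_rubin(agreeing)
r_disagree = gelman_rubin(disagreeing)
dist = l1_distance_to_reference([1, 0, 2], reference_position=[0, 1, 2])
```

In practice the scalar chains passed to the first function would be the log-likelihood trace or the trace of L1 distances, one per MCMC chain.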
While distance from reference orders is an efficient way of obtaining a statistic for convergence assessment, more insights into the structure of the orders can be obtained from low dimensional representations, for example by MDS (multidimensional scaling) (Borg and Groenen, 2005) on the (Euclidean) distance matrix of the position vectors of the cells. It also allows us to visualize comparisons with alternative methods. Figure 2 shows the TSCAN solution located in one of the areas of higher density of the GPseudoRank solution, while the solution found by SLICER lies somewhat in between two modes, around the centre of the distribution.
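A low-dimensional view of sampled orders can be obtained by converting each order to a vector of cell positions and applying classical (Torgerson) MDS to the Euclidean distances between these vectors. The sketch below is one standard variant of MDS and is not necessarily the one used for the figures; function names are illustrative.

```python
import numpy as np

def positions_from_order(order):
    """Position vector: entry c is the position of cell c in the order."""
    pos = np.empty(len(order), dtype=int)
    pos[np.asarray(order)] = np.arange(len(order))
    return pos

def classical_mds(points, n_components=2):
    """Classical MDS of the Euclidean distances between the rows of points."""
    points = np.asarray(points, dtype=float)
    sq_dists = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    n = len(points)
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ sq_dists @ centering    # double-centred Gram matrix
    eigvals, eigvecs = np.linalg.eigh(gram)           # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))

# Four sampled orders of four cells: two near the identity, two near its reverse.
sampled_orders = [[0, 1, 2, 3], [1, 0, 2, 3], [3, 2, 1, 0], [3, 2, 0, 1]]
P = np.array([positions_from_order(o) for o in sampled_orders])
embedding = classical_mds(P, n_components=2)
```

Orders that differ by a local swap land close together in the embedding, while an order and its near-reverse are far apart, which is how separate modes become visible.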
3. Results
3.1. Simulation studies
This section summarises the insights gained from the simulation studies. For details on the assessment criteria and results, see Section 1 in the supplementary materials.
Simulation 1. Any combination of moves leads to good convergence, and although there are differences in the speed and level of convergence, any combination of moves is recommended.
Simulation 2. There are only two capture times, hence there is more variety in the starting orders for each chain. The performance of the combinations of moves is different from simulation 1. Move 3 performs better than any other single move.
Move 3 generally traverses the space of permutations faster by reversing whole segments of an ordering and it is the only move for which all R̂-statistics go below 1.1 within the first 10,000 thinned samples. The combination of moves ranked first according to the criteria described in Section 1 of the supplementary materials is the combination 1,2,3,4 of all the moves.
Simulation 3. All moves and combinations thereof perform well in this situation, though move 3, while still achieving reasonable levels of convergence, is now the comparatively less well performing single move. The combination of all four moves performs well.
3.2. Pseudotemporal uncertainty varies during response to infection
For the scRNA-seq data from Shalek et al. (2014), collected at five different capture times, the true cell ordering is unknown. To check convergence of orders the R̂-statistic is computed both on the log-likelihood and on the L1-distances of the permutation of cell positions to an arbitrary reference permutation (Figure 1).
Figure 1 shows that the R̂-statistic falls below the threshold of 1.1 after 6,000 thinned samples (see also Table 5 in the supplement). We therefore discard a burn-in of 3,000 thinned samples at the beginning of each chain, as recommended by Gelman and Shirley (2011). Indeed, by the 1.1 threshold for the R̂-statistic, 6,000 thinned samples would have been sufficient for convergence.
Figure 2 demonstrates again the value of providing a posterior distribution for orders, rather than a single estimate: TSCAN and SLICER give different results. Figure 3 illustrates the uncertainty of the pseudotime over the mean pseudotime. To ensure that the inverted U-shape in the amount of uncertainties of the first two capture times at 0h and 1h is not a sampling artifact, cells from these capture times were mixed together for initialising the sampler (that is, capture time information was discarded). On the other hand, despite being separated during initialisation of the sampler, cells from capture times 4h and 6h are completely merged, again indicating that the sampler has reached convergence.
Overall uncertainty in the ordering of cells is markedly lower around capture time 2h, when the reaction to the infection has set in, but is not yet complete. The slight U-shape in the amount of uncertainty for capture times 0h, 1h, and 4h/6h seems to be an experimental batch effect of capturing multiple heterogeneous cells at different time points. Within a batch (or merged batches 4h and 6h) cells which are either lagging behind or slightly ahead in their development are assigned a more specific pseudotime with lower uncertainty behind or ahead of the bulk of cells whose pseudotimes are, in contrast, more interchangeable with higher uncertainty. However, for other data sets (for instance with different time intervals between capture times) this graph might look different, as, for example, in supplement Figure 7 for the Stumpf data set.
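The relation between posterior uncertainty and mean pseudotime can be summarised per cell from the MCMC output. The sketch below uses the posterior standard deviation as the measure of uncertainty, which is an assumption made for illustration; the toy samples are invented.

```python
from statistics import mean, stdev

def pseudotime_summary(samples):
    """Per-cell posterior mean and standard deviation of pseudotime.

    samples[s][c] is the pseudotime of cell c in posterior sample s.
    """
    per_cell = list(zip(*samples))                 # transpose to cells x samples
    return [(mean(values), stdev(values)) for values in per_cell]

# Three posterior samples for three cells; only the middle cell is uncertain.
samples = [[0.0, 0.5, 1.0],
           [0.0, 0.6, 1.0],
           [0.0, 0.4, 1.0]]
summary = pseudotime_summary(samples)
```

Plotting the second entry of each pair against the first gives a curve of uncertainty over mean pseudotime, from which phases of lower and higher uncertainty can be read off.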
GPseudoRank identifies two precocious cells, pointed out in the original analysis by Shalek et al., 2014, ahead in terms of their response to the stimulus, see Figure 4. Shalek et al., 2014 identified a set of genes particularly associated with antiviral response. Ahmed et al., 2018 and Reid and Wernisch, 2016 also used this score to demonstrate that their methods identify two cells at capture time 1h precocious in their antiviral response. Figure 5 shows the average expression of a set of genes associated with antiviral response for each cell. As expected, this antiviral score increases over pseudotime, confirming that the pseudotime assignment captures a biological phenomenon. In contrast to Figure 5, both DeLorean (Reid and Wernisch, 2016) and GrandPrix (Ahmed et al., 2018) show considerable edge effects in comparable plots (Reid and Wernisch, 2016, Fig. 4, Ahmed et al., 2018, Fig. 2). Such edge effects are not biologically motivated and presumably algorithmic artifacts which GPseudoRank is able to avoid by restricting pseudotimes to a finite interval and by using a geodesic mapping.
3.3. Posterior uncertainty modelling for droplet-based scRNA-seq
The mini-cluster approximation allows us to apply GPseudoRank to larger data sets. Figure 6 shows that the uncertainty in the ordering for the Klein data set (see Section 2.7.2) clearly exceeds that of other data sets in the early stages of the process. There is high uncertainty of cell positions at the beginning of the process as seen in the large area of intermediate densities in the lower left of Figure 6. This reflects the metastable state found early in the main branch in Figure 2c in Haghverdi et al., 2016. According to Haghverdi et al., 2016 such states can be defined as states with a high density in diffusion pseudotime, as many cells progress through this state slowly. With GPseudoRank we are able to identify such states in terms of the uncertainty of the posterior cell position in terms of rank time: this uncertainty is large if many cells are in a similar state and their ordering is more uncertain compared to phases where cells are more clearly separated by their progress. That is, uncertainty in rank time corresponds to metastable states.
The time to convergence at the 1.1-level for the Gelman-Rubin statistic for the Klein data set was 6 min on a laptop. For details on computation times for all single-cell data sets analysed, see Table 5 in the supplementary materials.
3.4. Multi-modal structure of posterior distributions
MDS shows that posterior distributions of cell position vectors tend to be multimodal (see Figure 2). To understand better what this complicated structure of the posterior positions means biologically, we applied GPseudoRank to a small data set of only 18 cells (Shalek13), as the structure of the posterior distribution of the cell orderings is easier to understand with a smaller number of possible orderings. We performed MDS (supplement Figure 13a), clustered the MDS projections into 4 clusters, and then computed, for the medoids of the 4 clusters, the antiviral score as for Figure 5. The result is shown in supplement Figure 13b. It shows that differences between different regions of the posterior distribution correspond to differences mainly in the second part of the orders, rather than the first. More precisely, the different regions of the posterior of the orders correspond to different trajectories of the antiviral score in the second part of the orders. While there is little uncertainty in the first half of the orders, the cells in the second half correspond to a metastable state, as in Figure 6. However, even in this metastable state, some orderings of cells are more likely than others as shown by the multi-modal structure of the posterior distribution. This indicates that there might be additional structure even in metastable states that can be revealed by algorithms such as GPseudoRank.
4. Discussion
GPseudoRank is a new type of Gaussian process latent variable model for pseudotemporal ordering. It samples orderings instead of pseudotimes, with combinatorial proposal moves designed to allow the Metropolis-Hastings sampler to make large changes to permutations and still achieve a high acceptance rate. This specific proposal distribution allows the sampler to explore complicated posterior distributions (see Figure 2). Point estimation methods are only able to find a single estimate of the order, and are therefore at most able to capture one mode or find an estimate that lies between several modes (see again Figure 2).
The applications to scRNA-seq and RT-PCR data illustrate another advantage of sampling from the posterior of orderings: the amount of uncertainty about the position of a cell can vary with time. In the Shalek data set, the uncertainty is lowest in the middle of the process, where the heterogeneity of cells with regard to their progress through the response to the infection is highest. This identifies parts of the process with increased change and higher biological variability compared to technical noise. For other data sets, the noise levels are highest at the beginning (Klein data), in the middle (Stumpf data), or at the end (Shalek13 data).
The uncertainty of the orders is relevant to any further analysis that models scRNA-seq data in terms of time-series data. This applies, for instance, to any type of network inference where the order of the input time series is relevant, including GP models (Penfold et al., 2012) and vector-autoregressive ones (Opgen-Rhein and Strimmer, 2007). Alternatively, identifying the regions of the process where the uncertainty of a cell’s position is low can support the selection of suitable cells for the clustering of genes for example.
Variational inference, which avoids sampling altogether, is considered a computationally efficient if only approximate Bayesian inference alternative to MCMC sampling. Considering that it samples from the full posterior distribution of the orders, GPseudoRank is very efficient and though runtimes obviously exceed those of well-designed variational methods (Ahmed et al., 2018), the mini-cluster approximation allows GPseudoRank to be applied to large data sets without losing much insight concerning the structure of the posterior distribution. GPseudoRank with the mini-cluster approximation described in Section 2.6 takes 6 min to converge on a laptop for a data set with more than 1500 cells. GPseudoRank can be applied to medium-sized data sets without approximation methods, taking about 50 min to converge at the 1.1-level of the Gelman-Rubin statistic for a data set with 550 cells. However, with the mini-cluster approximation it takes 1 minute to reach the same level of convergence.
Overall, GPseudoRank offers new insights into biological phenomena and experimental artifacts. It quantifies the amount and variability of uncertainty in single-cell ordering (Figures 2, 3 and 6). Assessing the degree of uncertainty enables spotting experimental batch effects created by sampling from a continuous spectrum of developmental stages at only a few capture times. Our approach is also able to identify precocious cells (Figure 4). By combining a geodesic pseudotime mapping with sampling permutations, GPseudoRank also avoids edge effects present in other GP methods for pseudotime ordering (Figure 5).
Except for relative measurements such as qPCR, GPseudoRank is applied to log-transformed data. This is common practice among pseudotime methods: see, among many others, Haghverdi et al., 2016; Ji and Ji, 2016; Reid and Wernisch, 2016; Welch et al., 2016. Count data could be modelled directly in GPseudoRank by changing the likelihood function to, say, the negative binomial distribution with GPs modelling the mean. However, this would require additional sampling of latent mean values for a small gain in accuracy over a log-normal approximation, which is usually very accurate for large counts.
We have illustrated GPseudoRank on a number of scRNA-seq data sets. Welch et al. (2017) developed a GP-based method for the inference of multi-omics pseudotime profiles through manifold alignment. A similar extension of GPseudoRank would give insight into the time-varying and multi-modal uncertainty structure of orderings in the multi-omics case.
Ordering problems are not restricted to the analysis of single-cell data. For instance, with clinical health record data the actual time of the onset of a disease is not usually known. It would be interesting to use an approach similar to GPseudoRank to order the measurements for different patients relative to each other. Unlike in the case of cells, the order and times of the measurements are known for each individual person. However, neither the rate of progression of the illness for the individual person, which is analogous to the difference between actual time and pseudotime, nor the relative progression of the illness across different people is known. More generally, our approach of combining local and wider proposal moves for MCMC in a sample space of distributions suggests new ways of addressing a number of discrete sampling problems, such as covariate selection or ranking in mixture models for clustering.
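To illustrate the general idea of mixing local and wider moves on a discrete space of orderings, the sketch below pairs an adjacent swap (local) with a segment reversal (wider) inside a Metropolis step. These two moves, the mixing probability `p_local` and the function names are assumptions chosen for illustration; they are not a description of the proposal set implemented in GPseudoRank. Both moves are symmetric, so the acceptance ratio needs no proposal correction.

```python
import numpy as np

def swap_adjacent(order, rng):
    """Local move: exchange two neighbouring elements of the ordering."""
    new = order.copy()
    i = rng.integers(len(order) - 1)
    new[i], new[i + 1] = new[i + 1], new[i]
    return new

def reverse_segment(order, rng):
    """Wider move: reverse the ordering between two randomly chosen positions."""
    new = order.copy()
    i, j = np.sort(rng.choice(len(order), size=2, replace=False))
    new[i:j + 1] = order[i:j + 1][::-1]
    return new

def metropolis_step(order, log_target, rng, p_local=0.5):
    """One Metropolis update using a random mixture of the two symmetric moves."""
    move = swap_adjacent if rng.random() < p_local else reverse_segment
    proposal = move(order, rng)
    # Accept with probability min(1, target(proposal) / target(order)).
    if np.log(rng.random()) < log_target(proposal) - log_target(order):
        return proposal
    return order
```

Here `log_target` is any user-supplied function returning the unnormalised log posterior of an ordering. The same pattern could in principle be applied to other discrete problems, with moves adapted to the structure of the sample space.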
Supplementary Material
Supplementary information: Supplementary data are available at Bioinformatics online.
Funding
MS, JR and LW are funded by the UK Medical Research Council (Grant Ref MC_UU_00002/1).
References
- Ahmed S, et al. GrandPrix: Scaling up the Bayesian GPLVM for single-cell data. Bioinformatics. 2018:bty533. doi: 10.1093/bioinformatics/bty533.
- Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11(10):R106. doi: 10.1186/gb-2010-11-10-r106.
- Angerer P, et al. destiny: diffusion maps for large-scale single-cell data in R. Bioinformatics. 2016;32(8):1241–1243. doi: 10.1093/bioinformatics/btv715.
- Bendall SC, et al. Single-cell trajectory detection uncovers progression and regulatory coordination in human B cell development. Cell. 2014;157(3):714–725. doi: 10.1016/j.cell.2014.04.005.
- Borg I, Groenen P. Modern Multidimensional Scaling: Theory and Applications. 2nd edition. Springer; 2005.
- Brennecke P, et al. Accounting for technical noise in single-cell RNA-seq experiments. Nat Meth. 2013;10(11):1093–1095. doi: 10.1038/nmeth.2645.
- Brooks S, Gelman A. General methods for monitoring convergence of iterative simulations. J Comput Graph Stat. 1998;7(4):434–455.
- Campbell K, Yau C. Order under uncertainty: robust differential expression analysis using probabilistic models for pseudotime inference. PLOS Comput Biol. 2016;12(11):e1005212. doi: 10.1371/journal.pcbi.1005212.
- Gelman A, Rubin D. Inference from iterative simulation using multiple sequences. Stat Sci. 1992;7(4):457–472.
- Gelman A, Shirley K. Inference from simulations and monitoring convergence. In: Brooks S, Gelman A, Jones GL, Meng X-L, editors. Handbook of Markov Chain Monte Carlo. Vol. 6. CRC; Boca Raton: 2011. pp. 163–174.
- Gilks W, et al. Markov Chain Monte Carlo in Practice. Chapman and Hall; London: 1996.
- Haghverdi L, et al. Diffusion maps for high-dimensional single-cell analysis of differentiation data. Bioinformatics. 2015;31(18):2989–2998. doi: 10.1093/bioinformatics/btv325.
- Haghverdi L, et al. Diffusion pseudotime robustly reconstructs lineage branching. Nat Meth. 2016;13(10):845–848. doi: 10.1038/nmeth.3971.
- Hastings W. Monte Carlo sampling methods using Markov chains and their applications. Biometrika. 1970;57(1):97–109.
- Hoffman M, Gelman A. The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. J Mach Learn Res. 2014:1593–1623.
- Ji Z, Ji H. TSCAN: Tools for Single-Cell ANalysis. R package version 1.14.0. 2015.
- Ji Z, Ji H. TSCAN: Pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis. Nucleic Acids Res. 2016;44(13):e117. doi: 10.1093/nar/gkw430.
- Klein A, et al. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell. 2015;161(5):1187–1201. doi: 10.1016/j.cell.2015.04.044.
- Mao Q, et al. Dimensionality reduction via graph structure learning; Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15; New York: ACM; 2015. pp. 765–774.
- Metropolis N, et al. Equation of state calculations by fast computing machines. J Chem Phys. 1953;21:1087–1092.
- Murphy K. Machine learning: a probabilistic perspective. The MIT Press; Cambridge, MA: 2012.
- Opgen-Rhein R, Strimmer K. Learning causal networks from systems biology time course data: an effective model selection procedure for the vector autoregressive process. BMC Bioinformatics. 2007;8(2):S3. doi: 10.1186/1471-2105-8-S2-S3.
- Penfold C, et al. Nonparametric Bayesian inference for perturbed and orthologous gene regulatory networks. Bioinformatics. 2012;28(12):i233–i241. doi: 10.1093/bioinformatics/bts222.
- Plummer M, et al. CODA: convergence diagnosis and output analysis for MCMC. R News. 2006;6(1):7–11.
- Qiu X, et al. Reversed graph embedding resolves complex single-cell trajectories. Nat Meth. 2017;14:979–982. doi: 10.1038/nmeth.4402.
- Rasmussen C, Williams C. Gaussian processes for machine learning. MIT Press; Cambridge, MA: 2006.
- Reid J, Wernisch L. Pseudotime estimation: deconfounding single cell time series. Bioinformatics. 2016;32(19):2973–2980. doi: 10.1093/bioinformatics/btw372.
- Roweis S, Saul L. Nonlinear dimensionality reduction by locally linear embedding. Science. 2000;290(5500):2323–2326. doi: 10.1126/science.290.5500.2323.
- Setty M, et al. Wishbone identifies bifurcating developmental trajectories from single-cell data. Nat Biotech. 2016;34(6):637–645. doi: 10.1038/nbt.3569.
- Shalek A, et al. Single-cell transcriptomics reveals bimodality in expression and splicing in immune cells. Nature. 2013;498:236. doi: 10.1038/nature12172.
- Shalek A, et al. Single-cell RNA-seq reveals dynamic paracrine control of cellular variation. Nature. 2014;510:363–369. doi: 10.1038/nature13437.
- Shin J, et al. Single-cell RNA-seq with Waterfall reveals molecular cascades underlying adult neurogenesis. Cell Stem Cell. 2015;17(3):360–372. doi: 10.1016/j.stem.2015.07.013.
- Stegle O, et al. Computational and analytical challenges in single-cell transcriptomics. Nat Rev Genet. 2015;16(3):133–145. doi: 10.1038/nrg3833.
- Stumpf P, et al. Stem cell differentiation as a non-Markov stochastic process. Cell Systems. 2017;5(3):268–282.e7. doi: 10.1016/j.cels.2017.08.009.
- Tenenbaum J, et al. A global geometric framework for nonlinear dimensionality reduction. Science. 2000;290(5500):2319–2323. doi: 10.1126/science.290.5500.2319.
- Trapnell C, et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat Biotechnol. 2014;32(4):381–386. doi: 10.1038/nbt.2859.
- Vallejos C, et al. BASiCS: Bayesian analysis of single-cell sequencing data. PLOS Comput Biol. 2015;11(6):1–18. doi: 10.1371/journal.pcbi.1004333.
- Welch J. SLICER: Selective Locally Linear Inference of Cellular Expression Relationships. R package version 0.2.0. 2017.
- Welch J, et al. SLICER: inferring branched, nonlinear cellular trajectories from single cell RNA-seq data. Genome Biol. 2016;17(1):106. doi: 10.1186/s13059-016-0975-3.
- Welch J, et al. MATCHER: manifold alignment reveals correspondence between single cell transcriptome and epigenome dynamics. Genome Biol. 2017;18(1):138. doi: 10.1186/s13059-017-1269-0.