Optimal sequencing budget allocation for trajectory reconstruction of single cells

Noa Moriel; Edvin Memet; Mor Nitzan

doi:10.1093/bioinformatics/btae258

Bioinformatics. 2024 Jun 28;40(Suppl 1):i446–i452.

PMCID: PMC11211845 PMID: 38940162

Abstract

Background

Charting cellular trajectories over gene expression is key to understanding dynamic cellular processes and their underlying mechanisms. While advances in single-cell RNA-sequencing technologies and computational methods have pushed forward the recovery of such trajectories, trajectory inference remains a challenge due to the noisy, sparse, and high-dimensional nature of single-cell data. This challenge can be alleviated by increasing either the number of cells sampled along the trajectory (breadth) or the sequencing depth, i.e. the number of reads captured per cell (depth). Generally, these two factors are coupled due to an inherent breadth-depth tradeoff that arises when the sequencing budget is constrained due to financial or technical limitations.

Results

Here we study the optimal allocation of a fixed sequencing budget to optimize the recovery of trajectory attributes. Empirical results reveal that reconstruction accuracy of internal cell structure in expression space scales with the logarithm of either the breadth or depth of sequencing. We additionally observe a power law relationship between the optimal number of sampled cells and the corresponding sequencing budget. For linear trajectories, non-monotonicity in trajectory reconstruction across the breadth-depth tradeoff can impact downstream inference, such as expression pattern analysis along the trajectory. We demonstrate these results for five single-cell RNA-sequencing datasets encompassing differentiation of embryonic stem cells, pancreatic beta cells, hepatoblast and multipotent hematopoietic cells, as well as induced reprogramming of embryonic fibroblasts into neurons. By addressing the challenges of single-cell data, our study offers insights into maximizing the efficiency of cellular trajectory analysis through strategic allocation of sequencing resources.

1 Introduction

Single-cell RNA-sequencing (scRNA-seq) technologies, measuring the gene expression levels of cellular populations at single-cell resolution, have been instrumental in uncovering principles of development, tissue homeostasis, reprogramming, and cascades of cell decisions and their underlying mechanisms at resolutions and scales that have been inaccessible until recently (Kolodziejczyk et al. 2015, Hedlund and Deng 2018, Hwang et al. 2018, Papalexi and Satija 2018). Such dynamic analysis often requires computational methods for trajectory inference, that is, the reconstruction of temporal trajectories of cellular states from the static population snapshots provided by scRNA-seq data [see recent reviews (Cannoodt et al. 2016, Hwang et al. 2018, Kester and van Oudenaarden 2018, Ding et al. 2022) and benchmarking analysis (Saelens et al. 2019)]. In practice, however, trajectory inference can be challenging; scRNA-seq data is noisy and sparse due to both stochastic biological processes and technical noise arising from experimental limitations, including constraints on the fraction of cells and reads that can be captured (Lähnemann et al. 2020). Moreover, extrinsic biological variation is introduced by factors that are partially independent of the temporal process, such as the physical positioning of cells (Wagner et al. 2016).

Biological and technical noise can be reduced by increased biological replicates as well as improved experimental protocols and design choices (Grün and van Oudenaarden 2015, Kolodziejczyk et al. 2015, Stegle et al. 2015, Bacher and Kendziorski 2016, Ecker et al. 2017, Haque et al. 2017, Tung et al. 2017, Birnbaum 2018, Bass et al. 2019, Dal Molin and Di Camillo 2019, Seirup et al. 2020). In particular, technical variation can be controlled by adjusting the sequencing depth (i.e. the number of reads sequenced) (Pollen et al. 2014, Shalek et al. 2014, Streets and Huang 2014, Rizzetto et al. 2017, Tung et al. 2017). Yet, sequencing depth can only be set within experimental and financial limitations, many times at the expense of the number of cells assayed (breadth), under a given sequencing budget. This breadth-depth tradeoff is a fundamental experimental design challenge (Heimberg et al. 2016) which can be modeled in terms of a constant sequencing budget constraint, $B = n_{c} n_{r}$ , where B is the total number of reads sequenced across all assayed cells, $n_{c}$ is the number of cells assayed, and $n_{r}$ is the average number of reads sequenced per cell (Fig. 1A and B). Thus, given a limited sequencing budget, the experimenter can navigate between acquiring noisy information (few reads per cell) for many cells and acquiring higher-quality information (many reads per cell) for fewer cells. Several recent works approached the challenge of optimizing the breadth-depth tradeoff under a fixed sequencing budget for tasks including identification of transcriptional programs (Heimberg et al. 2016), modeling gene expression distributions (Svensson et al. 2019), profiling rare cell types (Torre et al. 2018), and gene expression estimation (Zhang et al. 2020). However, optimizing this tradeoff for dynamic processes is an unresolved challenge (Ding et al. 2022).

Figure 1. — The breadth-depth tradeoff of trajectory reconstruction. (A) Under a fixed scRNA-seq sequencing budget B, each of the $n_{c}$ cells that are captured (red) has an average coverage of $n_{r} = B / n_{c}$ reads. An experimenter can assay (i) few cells, many reads per cell, (iii) many cells, few reads per cell, or (ii) an intermediate option. (B) Fixed sequencing budget curves, for high ( $B_{1}$ ) and low ( $B_{2}$ ) budgets, show the tradeoff between the number of cells and reads per cell. *(Insets)* Illustrations of cell positions (red dots) and ground truth trajectory in gene expression space (black line). (C) For a given sequencing budget, the trajectory reconstruction error can be non-monotonic with an optimum at an intermediate number of sampled cells. (D) The reconstruction error is computed *a priori* in terms of pairwise distances between cells and manifested *a posteriori* through properties of the reconstructed trajectory such as its pseudotime ordering and the patterns of gene expression along it.

Here, we approach this task by analyzing the breadth-depth tradeoff in the context of cellular trajectory reconstruction (Fig. 1C and D). We find that the accuracy of reconstruction of internal cell structure, i.e. cell-to-cell distances in expression space, scales with the logarithm of either the sequencing depth or breadth. We show how these scaling relations determine the optimal breadth-depth allocation for a fixed sequencing budget, and we demonstrate how such choices affect trajectory reconstruction for diverse scRNA-seq datasets that capture cellular trajectories along differentiation and reprogramming. The code is available at https://github.com/nitzanlab/trajectory_reconstruction_tradeoff.

2 Materials and methods

2.1 Reconstruction error

We measure the reconstruction error as the discrepancy between the normalized pairwise geodesic distances of the complete data and the subsampled data. To compute the pairwise geodesic distances, given the raw read count matrix $X \in \mathbb{N}^{g \times n_{c}}$ , where g is the number of genes and $n_{c}$ is the number of cells, we first apply standard preprocessing involving logarithmic transformation using $\log (X + 1)$ , followed by Principal Component Analysis (PCA) for dimensionality reduction to the top 10 principal components. This process serves as an approximation of cellular positioning in gene expression space. Subsequently, this reduced latent space representation is used to construct a k-Nearest Neighbors graph, wherein the value of k is determined as the minimum required to achieve a fully connected graph. The inferred distances between pairs of cells are computed as the shortest path distances across this graph. The reconstruction error is then calculated as the L1 norm of the difference between the normalized distances of the complete and subsampled data, averaged over cell pairs (Equation 1). Normalization is done by dividing each distance ( $d_{i j}$ between cells i and j) by the maximal distance observed across all cell pairs $(k, l)$ , $\max_{k, l \in cells} (d_{k l})$ .
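The distance computation described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `geodesic_distances` is hypothetical, PCA is done with a plain SVD, the kNN graph is symmetrized by taking the union of neighbor relations, and cells are assumed to be distinct in PCA space (all of these are assumptions not stated in the text).

```python
import numpy as np
from scipy.sparse.csgraph import connected_components, shortest_path


def geodesic_distances(counts, n_pcs=10):
    """Normalized pairwise geodesic distances for a (genes x cells) count matrix.

    Hypothetical helper; assumes at least two distinct cells.
    """
    # log(X + 1) transform, then put cells on rows for PCA.
    logged = np.log1p(counts.T.astype(float))
    centered = logged - logged.mean(axis=0)
    # PCA via SVD: project cells onto the top principal components.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    n_pcs = min(n_pcs, s.size)
    embedding = u[:, :n_pcs] * s[:n_pcs]

    # Euclidean distances between all pairs of cells in PCA space.
    diff = embedding[:, None, :] - embedding[None, :, :]
    euclid = np.sqrt((diff ** 2).sum(axis=-1))
    n_cells = euclid.shape[0]
    order = np.argsort(euclid, axis=1)  # column 0 is the cell itself

    # Increase k until the symmetrized kNN graph is a single component.
    for k in range(1, n_cells):
        graph = np.zeros_like(euclid)
        rows = np.repeat(np.arange(n_cells), k)
        cols = order[:, 1:k + 1].ravel()
        graph[rows, cols] = euclid[rows, cols]
        graph = np.maximum(graph, graph.T)  # assumption: undirected union graph
        n_components, _ = connected_components(graph, directed=False)
        if n_components == 1:
            break

    # Shortest-path (geodesic) distances, scaled to a maximum of 1.
    geodesic = shortest_path(graph, method="D", directed=False)
    return geodesic / geodesic.max()
```

The dense pairwise-distance matrix keeps the sketch short; it is adequate for datasets of a few hundred cells such as those analyzed here, but a sparse neighbor search would be preferable for larger data.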

2.2 Modeling the empirical reconstruction error

The read reconstruction error, $ε_{t}$ , is modeled as follows:

ε_{t} \approx \begin{cases} a + b \log p_{t} & \text{for } p_{t} < p_{t}^{sat} \\ ε_{t}^{sat} & \text{otherwise} \end{cases}

(2)

where $p_{t}^{sat}$ denotes the transition into a saturated sequencing regime where the reconstruction error, $ε_{t}^{sat}$ , is approximately constant.

The cell reconstruction error, $ε_{c}$ , is fitted to the fraction of captured cells, $p_{c}$ , by $ε_{c} \approx α + β \log p_{c}$ .

In both cases, $a, b, α,$ and $β$ are constants that parameterize the relationships between the subsampling rates and the respective reconstruction errors.
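As an illustration of how the constants might be estimated, the sketch below fits both relationships by ordinary least squares on $\log p$. The function names are hypothetical, the saturation point `p_sat` is assumed to be supplied by the user (the text does not say how it is chosen), and estimating the plateau as the mean error above `p_sat` is likewise an assumption.

```python
import numpy as np


def fit_log_linear(p, err):
    """Least-squares fit of err ~ intercept + slope * log(p)."""
    slope, intercept = np.polyfit(np.log(p), err, deg=1)
    return intercept, slope


def fit_read_error(p_t, err_t, p_sat):
    """Fit the piecewise read-error model: log-linear below p_sat, constant above.

    p_sat is an assumed, user-supplied saturation threshold.
    """
    p_t, err_t = np.asarray(p_t, dtype=float), np.asarray(err_t, dtype=float)
    unsaturated = p_t < p_sat
    a, b = fit_log_linear(p_t[unsaturated], err_t[unsaturated])
    # Assumption: the plateau is estimated as the mean error in the saturated regime.
    eps_sat = err_t[~unsaturated].mean()
    return a, b, eps_sat
```

The cell-error constants $α$ and $β$ can be obtained from the same helper, e.g. `alpha, beta = fit_log_linear(p_c, err_c)`.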

Based on empirical results (see Supplementary Fig. S7), we model the overall reconstruction error as $ε = \max (ε_{c}, ε_{t})$ . Consequently, to predict the optimal cell subsampling probability, $\hat{p_{c}}$ , we solve $ε_{c} = ε_{t}$ for $p_{c}$ . Given $\tilde{B}$ , the fractional sequencing budget, and $p_{c}$ , the cell subsampling probability, the respective cell reconstruction error is $ε_{c} = α + β \log p_{c}$ . When sequencing is unsaturated ( $p_{t} < p_{t}^{sat}$ ), the estimated read reconstruction error is $ε_{t} = a + b \log p_{t} = a + b \log \frac{\tilde{B}}{p_{c}}$ , and $\hat{p_{c}} = \exp \left( \frac{a - α + b \log \tilde{B}}{β + b} \right)$ . When sequencing is saturated ( $p_{t} \geq p_{t}^{sat}$ ), sequencing at read subsampling probability $p_{t}^{sat}$ is optimal. Hence, altogether:

\hat{p_{c}} \sim \begin{cases} {\tilde{B}}^{γ} & \text{for } p_{t} < p_{t}^{sat} \\ \tilde{B} / p_{t}^{sat} & \text{otherwise} \end{cases}

(3)

with $γ = \frac{b}{β + b}$ .
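The prediction can be written as a short function. This is a sketch under the model stated above; the function name and the default `p_t_sat=1.0` (no saturation) are assumptions, and the result is not clipped to the interval $[0, 1]$.

```python
import math


def predicted_optimal_pc(budget_frac, a, b, alpha, beta, p_t_sat=1.0):
    """Predicted optimal cell fraction for a fractional budget (hypothetical helper)."""
    # Unsaturated regime: solve alpha + beta*log(p_c) = a + b*log(B/p_c) for p_c.
    p_c = math.exp((a - alpha + b * math.log(budget_frac)) / (beta + b))
    # Saturated regime: if the implied read fraction B/p_c reaches p_t_sat,
    # sequence at p_t_sat and spend the remaining budget on cells.
    if budget_frac / p_c >= p_t_sat:
        p_c = budget_frac / p_t_sat
    return p_c
```

In the unsaturated regime, changing the budget by a factor $c$ changes the prediction by $c^{γ}$, which is the power law of Equation 3.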

In Fig. 3B, the inferred optimal cell subsampling probability, $\hat{p_{c}}$ , is contrasted with the empirical optimal cell subsampling probability, $p_{c}^{*}$ . To obtain the empirical optimum $p_{c}^{*}$ for each budget $\tilde{B}$ , we compute the reconstruction error for $p_{c} \in [0.01, 0.9]$ (and the corresponding $p_{t} = \frac{\tilde{B}}{p_{c}}$ ), requiring a minimum of 5 cells and an average of at least 20 reads per cell (see Fig. 3A for the reconstruction error tradeoff curves). The empirical optimum $p_{c}^{*}$ is the cell subsampling probability corresponding to the minimal average reconstruction error, averaged over 50 repetitions and over a rolling window of 4 $p_{c}$ values.
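One way to locate the empirical optimum from a table of errors is sketched below. The function name is hypothetical, the thresholds on cell and read numbers are assumed to have been applied when building the grid, and reporting the centre of the best rolling window is an assumption, since the text does not specify which $p_{c}$ within the window is reported.

```python
import numpy as np


def empirical_optimum(p_c_grid, errors, window=4):
    """Return the p_c that minimizes the smoothed mean reconstruction error.

    errors has shape (n_repetitions, n_grid): one row per subsampling repeat.
    """
    p_c_grid = np.asarray(p_c_grid, dtype=float)
    mean_err = np.asarray(errors, dtype=float).mean(axis=0)
    # Rolling mean over `window` consecutive p_c values (fully covered windows only).
    smoothed = np.convolve(mean_err, np.ones(window) / window, mode="valid")
    start = int(np.argmin(smoothed))
    # Assumption: report the centre of the best window.
    return float(p_c_grid[start:start + window].mean())
```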

Figure 3. — Reconstruction error and optimal sampling across fixed sequencing budgets. (A) Reconstruction error as a function of sampled cell fraction, where each curve corresponds to a different sequencing budget. Solid lines represent the mean error and the shaded regions represent standard deviation over 50 repetitions. (B) For an extended series of sequencing budgets, we measure the empirical optimal cell sampling probability, $p_{c}^{*}$ (colored stars, see budget legend in A), mark the range of $p_{c}$ whose reconstruction error is 0.01 away from optimal (gray arrows), and the minimal and maximal tested $p_{c}$ in dashed gray lines. Sequencing budgets are written in their fractional form, $\tilde{B} = B / B^{0}$ (x-axis and colors). From Equation 3 (see Section 2), we compute the predicted optimal cell sampling probability, $\hat{p_{c}}$ (black stars). Mean Squared Error (MSE) is computed over predicted ( $\hat{p_{c}}$ ) and empirical ( $p_{c}^{*}$ ) optimal cell sampling probabilities where prediction is within the range of $p_{c}$ . Titles in B correspond to the titles of the columns in (A).

2.3 Pseudotime labeling of linear trajectories with diffusion pseudotime

We follow a basic pipeline suggested in Scanpy (Alexander Wolf et al. 2018) of ordering cells by diffusion pseudotime based on (Haghverdi et al. 2016). This involves preprocessing the data (Zheng et al. 2017), computing the neighborhood graph over reduced principal component representation, while avoiding disconnected components by increasing the number of neighbors when such exist, and computing the corresponding diffusion map and the resulting pseudotime ordering of the cells. Alternative pseudotime ordering methods are described in the Supplementary data.

2.4 Computing ordered expression pattern over linear trajectories

For this analysis, we focus on the 50 genes exhibiting the highest expression levels (averaged over all cells) from the top 10% highly variable genes chosen with Scanpy (Alexander Wolf et al. 2018) (Section 2). For each gene, we compute its ordered expression pattern by first binning cells into 5 equally sized groups according to their pseudotime ordering of the full data (see Section 2 and Supplementary Notes). Then, each gene’s expression is averaged per bin and normalized by the bin with maximal expression. In-silico runs in which no reads of the gene are captured are ignored. For each gene, we compute the Pearson correlation of the normalized, ordered expression of the subsampled and full data and report for each subsampling experiment the averaged correlation across genes.
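For a single gene, the binning and correlation steps could look like the sketch below. The function names are hypothetical; cells are assumed to be supplied together with their full-data pseudotime values, and genes with no captured reads are assumed to have been filtered out beforehand (otherwise the normalization divides by zero).

```python
import numpy as np


def ordered_pattern(expr, pseudotime, n_bins=5):
    """Mean expression of one gene in equally sized pseudotime bins, max-normalized."""
    order = np.argsort(pseudotime)
    bins = np.array_split(np.asarray(expr, dtype=float)[order], n_bins)
    means = np.array([chunk.mean() for chunk in bins])
    return means / means.max()


def pattern_correlation(expr_full, pt_full, expr_sub, pt_sub):
    """Pearson correlation between the full-data and subsampled patterns of one gene."""
    full = ordered_pattern(expr_full, pt_full)
    sub = ordered_pattern(expr_sub, pt_sub)
    return float(np.corrcoef(full, sub)[0, 1])
```

Averaging `pattern_correlation` over the selected genes gives one score per subsampling experiment.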

2.5 Inferring gene expression patterns over linear trajectories

We follow the same procedure as for computing the ordered expression pattern (see Section 2), except that instead of binning cells by their pseudotime ordering of the full data (considered as ground-truth ordering), we use the inferred pseudotime ordering of the subsampled data, using pseudotime inference methods as described in Section 2 and in the Supplementary data.

3 Results

3.1 Reconstruction error under subsampling of cells or reads per cell

The quality of the reconstruction of a dynamical process can depend on the quality of the scRNA-seq data, the characteristics of the underlying trajectory, and the specific trajectory inference algorithm used (Kester and van Oudenaarden 2018, Saelens et al. 2019, Wagner and Klein 2020). Since many trajectory inference methods, such as PAGA (Alexander Wolf et al. 2019) and Wanderlust (Bendall et al. 2014), are based on a computation of pairwise distances (Kester and van Oudenaarden 2018) over a low dimensional representation of cells [e.g. using log transformation and PCA, as advocated in (Ahlmann-Eltze and Huber 2023)], we center our analysis around the estimation quality of pairwise geodesic distances of such cell embeddings. Specifically, we examine the impact of subsampling cells or reads per cell on inferred pairwise distances $d'_{i j}$ between cells i, j, computed as shortest path distances over a k-Nearest Neighbor graph of the cells, relative to the distances $d_{i j}$ before subsampling, which we take as a proxy for the ground truth (see Section 2). That is, we quantify the reconstruction error $ε$ in terms of the discrepancy between the inferred and ground-truth cell-to-cell distances:

ε = \frac{1}{n_{c}^{2}} \sum_{i, j} | \hat{d}_{i j} - \hat{d}'_{i j} |,

(1)

where $\hat{d}_{i j}$ and $\hat{d}'_{i j}$ are each scaled to a maximum of 1 (e.g. $\hat{d}_{i j} = d_{i j} / \max_{k, l} (d_{k l})$ ).
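Equation 1 translates directly into a few lines of code. In this sketch the function name is hypothetical, and both inputs are assumed to be square distance matrices over the same set of cells, in the same order.

```python
import numpy as np


def reconstruction_error(dist_ref, dist_sub):
    """Mean absolute difference between two max-normalized distance matrices."""
    dist_ref = np.asarray(dist_ref, dtype=float)
    dist_sub = np.asarray(dist_sub, dtype=float)
    # Scale each matrix so that its largest entry equals 1.
    ref_hat = dist_ref / dist_ref.max()
    sub_hat = dist_sub / dist_sub.max()
    n_cells = ref_hat.shape[0]
    return float(np.abs(ref_hat - sub_hat).sum() / n_cells ** 2)
```

Because each matrix is normalized by its own maximum, a uniform rescaling of all distances leaves the error at zero; only changes in relative geometry are penalized.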

We begin by studying how subsampling only cells or only reads per cell (while the other quantity remains fixed) affects the quality of trajectory reconstruction. For either read or cell subsampling (binomial sampling of reads and uniform random sampling of cells; see examples in Supplementary Fig. S5), we vary either $p_{t}$ , the fraction of read counts ( $p_{t} = n_{r} / n_{r}^{0}$ ), or $p_{c}$ , the fraction of cells captured ( $p_{c} = n_{c} / n_{c}^{0}$ ), and compute the reconstruction error of the pairwise geodesic distances between cells in the subsampled data, relative to their ground-truth distances. We denote by $ε_{t}$ (or $ε_{c}$ ) the reconstruction error when subsampling only reads (or only cells). Of note, when a Unique Molecular Identifier (UMI) is incorporated in the RNA-sequencing protocol, multiple reads can be understood as originating from a single transcript (leading to differences in the subsampling of reads versus transcripts, particularly for deep sequencing). Here, we use reads and transcripts interchangeably to represent the captured expression, as done in Zhang et al. (2020).
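The two subsampling operations can be sketched with NumPy's random generator. The function name and the rounding of the retained cell number are assumptions; the count matrix is taken to be genes by cells with non-negative integer entries.

```python
import numpy as np


def subsample(counts, p_c, p_t, rng):
    """Subsample a (genes x cells) count matrix to fraction p_c of cells and p_t of reads."""
    n_cells = counts.shape[1]
    n_keep = max(1, int(round(p_c * n_cells)))
    # Uniform random sampling of cells, without replacement.
    kept = np.sort(rng.choice(n_cells, size=n_keep, replace=False))
    # Binomial sampling of reads: each read is kept independently with probability p_t.
    thinned = rng.binomial(counts[:, kept], p_t)
    return thinned, kept
```

Setting `p_c=1.0` varies only the read depth and setting `p_t=1.0` varies only the number of cells, matching the two single-factor experiments described above.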

We analyzed five publicly available, deeply sequenced, mouse-derived single-cell datasets, including those of embryonic stem cells differentiation into primitive endoderm cells [“mESC”; (Hayashi et al. 2018)], maturation of pancreatic beta cells [“beta”; (Qiu et al. 2017)], hepatoblasts differentiation into hepatocytes and its branched cholangiocytes [“hepatoblast”; (Yang et al. 2017)], embryonic fibroblasts reprogramming into induced neuronal or myocyte cells [“fibroblasts”; (Treutlein et al. 2016)], and transitioning of multipotent hematopoietic cells towards lineage-specific progenitors [“hematopoiesis”; (Olsson et al. 2016)], see Fig. 2A and Supplementary Table S1 for further details.

Figure 2. — The independent effects of subsampling reads/cells on the accuracy of inferred cell-cell distances. (A) scRNA-seq datasets used for evaluation, embedded and plotted in the first two principal components following log transform (see Section 2) and colored by “milestones” as standardized in Saelens *et al.* (2019). (B) The resulting reconstruction error $ε_{t}$ decreases linearly with $\log p_{t}$ (fitted by a dotted black line), where $p_{t}$ is the fraction of sampled read counts. Large values of $p_{t}$ denote a regime of saturated sequencing in which the error $ε_{t}^{sat}$ is modeled as constant. The mean number of reads per cell ( $n_{r}$ ) is noted on the top x-axis. (C) The reconstruction error $ε_{c}$ decreases linearly with $\log p_{c}$ (fitted by a black dotted line), where $p_{c}$ is the fraction of sampled cells. The number of cells ( $n_{c}$ ) is noted on the top x-axis. In both (B) and (C), subsampling is performed 10 times; the mean error is denoted by a gray line and the standard deviation by a shaded region. $R^{2}$ (coefficient of determination) is computed between each sample and the fitted curve. Titles in (B) and (C) correspond to the titles of the columns in (A).

Across these datasets, the reconstruction error $ε$ decreases approximately linearly with $\log (p_{c})$ (Fig. 2C; mean $R^{2} = 0.68$ ), as well as with $\log (p_{t})$ (Fig. 2B; mean $R^{2} = 0.89$ ) for $p_{t} < p_{t}^{sat}$ , beyond which lies a regime dominated by sequencing saturation, where $ε (p_{t} \geq p_{t}^{sat}) =$ constant (see Section 2 for details). While we focus on deeply sequenced datasets (Fig. 2), which have correspondingly small sample sizes (355–562 cells per dataset), the reconstruction error scales similarly with cell subsampling for larger datasets, as we show for scRNA-seq data of pancreatic endocrinogenesis composed of 3696 cells (Bastidas-Ponce et al. 2019) (Supplementary Fig. S6).

3.2 Reconstruction error under sequencing budget constraints

Given a fixed sequencing budget, B, how can one minimize the reconstruction error? Concretely, what is the optimal number of cells to assay, $n_{c}^{*}$ , alternatively modeled as the optimal cell subsampling probability, $p_{c}^{*}$ , that achieves minimal reconstruction error? The inherent tradeoff between the number of cells and the number of reads per cell given a constant budget B (where $B = n_{c} n_{r}$ ) suggests that any shift in the balance between sequencing breadth and depth would lead to conflicting effects on the reconstruction error, as individually, the error increases with subsampling of both cells and reads (Fig. 2B and C).

To study this tradeoff, we examine the reconstruction error when subsampling cells and reads for the scRNA-seq datasets described above under constant budgets (Fig. 3A). That is, we subsample a fraction of the total number of cells $n_{c}^{0}$ and of the total number of reads $n_{r}^{0}$ in the original data in order to synthetically simulate less favorable experimental conditions of a limiting sequencing budget B that is strictly smaller than $B^{0} = n_{c}^{0} n_{r}^{0}$ . We then compute the reconstruction error over a range of cell sampling probabilities $p_{c}$ (in other words, over a range of cell numbers and corresponding read numbers).

At low sequencing budgets, the reconstruction error shows non-monotonicity as a function of $p_{c}$ , and is minimized at small sampled cell fractions (Fig. 3A). For example, for the mouse embryonic stem cell data (Hayashi et al. 2018), with a budget of $B \approx 60 K$ reads (or equivalently, a fractional sequencing budget of $\tilde{B} = B / B^{0} = 6.0 \times 10^{-4}$ ), the reconstruction error is minimized when sampling $n_{c}^{*} \approx 60$ cells with $n_{r}^{*} \approx 1000$ reads per cell. As the budget increases, the optimal number of cells to assay increases (Fig. 3A).

To model how the optimal subsample of cells, $p_{c}^{*}$ , that minimizes the reconstruction error $ε$ , varies with the sequencing budget B, we approximate $ε$ by the maximum over both cell and read errors ( $ε = \max (ε_{t}, ε_{c})$ ), supported empirically in Supplementary Fig. S7. Using this modeling assumption, the reconstruction error is expected to be minimal when $ε_{t} = ε_{c}$ . Hence, when sequencing is unsaturated, the predicted optimal number of cells to assay, $\hat{p_{c}}$ , follows a power-law, $\hat{p_{c}} \sim {\tilde{B}}^{γ}$ where $\tilde{B}$ is the fractional sequencing budget, $γ = \frac{b}{β + b}$ , and $b, β$ are inferred based on the fit of reconstruction error as a function of $p_{c}$ and $p_{t}$ , individually (Section 2). These predictions for optimal allocation of sequencing budgets ( $\hat{p_{c}}$ ) are found to be in good agreement with the empirical optimal allocations ( $p_{c}^{*}$ ) for the five scRNA-seq datasets presented above (Fig. 3B; mean $MSE = 0.06$ ).
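The reasoning that the maximum of the two errors is minimized where they cross can be checked numerically. The sketch below uses made-up fit parameters (hypothetical values, chosen with negative slopes so that each error falls as its sampling fraction grows) and compares a brute-force search along the fixed-budget curve with the closed-form crossing point; it ignores the saturated regime.

```python
import math


def modeled_error(p_c, budget_frac, a, b, alpha, beta):
    """epsilon = max(eps_c, eps_t) along the fixed-budget curve p_t = B / p_c."""
    eps_c = alpha + beta * math.log(p_c)
    eps_t = a + b * math.log(budget_frac / p_c)
    return max(eps_c, eps_t)


# Hypothetical parameters for illustration only (not fitted to any dataset).
a, b, alpha, beta = 0.05, -0.04, 0.02, -0.03
budget_frac = 1e-3

# Grid of cell fractions; starting above budget_frac keeps p_t = B / p_c below 1.
grid = [budget_frac + i * 1e-4 for i in range(1, 1000)]
best = min(grid, key=lambda p: modeled_error(p, budget_frac, a, b, alpha, beta))

# Closed-form crossing point of eps_c and eps_t.
closed_form = math.exp((a - alpha + b * math.log(budget_frac)) / (beta + b))
print(f"grid optimum: {best:.4f}, closed form: {closed_form:.4f}")
```

With these slopes, the cell error decreases and the read error increases as $p_{c}$ grows along the budget curve, so their maximum is V-shaped and the grid optimum falls within one grid step of the crossing point.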

3.3 Breadth-depth sequencing tradeoff is reflected in trajectory reconstruction for cells and gene expression

To assess the downstream effects of the breadth-depth sequencing tradeoff on trajectory reconstruction beyond cell-cell distances, we next analyzed the inferred gene expression pattern of highly expressed and variable genes across linear differentiation trajectories, see Fig. 4. Inferring the change in gene expression across a reconstructed linear trajectory requires (i) accurate pseudotime ordering of the cells (Fig. 4B), and (ii) high-quality capture of the expression pattern (Fig. 4C). When both are achieved, the quality of the inferred expression pattern across the reconstructed trajectory is expected to be high (Fig. 4D). See Section 2 for further details.

Figure 4. — Low reconstruction error for intermediate cell-read budget allocation manifests in downstream tasks. We analyze the mESC and beta cell datasets using fractional budgets $\tilde{B} = 6.0 \times 10^{-4}$ and $\tilde{B} = 7.7 \times 10^{-5}$ , respectively. We compute (A) the reconstruction error, together with the Pearson correlation between the complete and subsampled data in terms of: (B) diffusion pseudotime, (C) gene expression patterns, given the true pseudotime (ordering of the full data), and (D) gene expression patterns over the inferred pseudotime (see Section 2). Throughout, we highlight sampling experiments of minimal reconstruction error (green) and of deeper or broader choices, marked in red or in blue, respectively.

We find that using the sequencing budget to sequence many cells at low read counts (high breadth, low depth) can impede the recovery of the arrangement of cells for both mouse ESC and beta cell differentiation datasets (Fig. 4B). On the other hand, given the cellular ordering along the trajectory, estimation of the corresponding mean gene expression can benefit from a large batch of samples (Fig. 4C, see Supplementary data). While each of these substeps, individually, may benefit from lending the sequencing budget towards either deeper, or broader sampling, the complete task of recovering gene expression pattern benefits from an intermediate budget allocation (Fig. 4D). This result captures the tradeoff within the temporal gene expression inference task, demonstrated in Fig. 4B, C, as inference of cellular ordering is generally disrupted when capturing many cells and few reads per cell, and gene expression inference is disrupted when deeply sampling few cells. We obtain similar results using alternative pseudotime inference methods, see Supplementary Figures S8 and S9.

4 Discussion

In this paper, we analyzed the optimal allocation of a sequencing budget in single-cell RNA-sequencing experiments so as to optimize cellular trajectory reconstruction. Such optimization can play a key role both in smaller-scale pilot experiments, which serve to inform the design of subsequent larger-scale studies, and in high-throughput screenings, where efficient use of resources is key. While the optimal tradeoff can depend on biological features such as the precise topology of the underlying trajectory, or the rates at which cells progress along the trajectory, we abstract away much of this intricacy by modeling directly the change of reconstruction error with read or cell subsampling. The combined effect of subsampling cells and reads can be complex; however, we find that the reconstruction error is, to a good approximation, determined by the dominant of the two error sources. This provides an interpretation of the reconstruction error observed over the breadth-depth tradeoff and a prediction of the optimal experimental design.

While in this work we focus on optimizing sequencing budget allocation to enhance the quality of trajectory reconstruction, different objectives can lead to substantial variations in the optimal budget allocation strategy. For example, Heimberg et al. (2016) showed that shallow sequencing can suffice for the recovery of transcriptional programs due to the modularity of gene expression. Specifically, the error of the inferred programs, or of the low-dimensional structure of the data, saturates with increasing depth, starting at relatively low depth. When targeting genes of exceedingly low expression, Zhang et al. (2020) suggest assaying extremely few cells, a strategy which can be suboptimal for trajectory reconstruction (see Supplementary data).

While we demonstrate several advantages of our chosen budget with regard to the cellular trajectory, our study can be broadened to address multi-faceted objectives (e.g. preserving topology and gene expression pattern of specific genes) and to be posed as a satisfaction, rather than an optimization, problem. Our approach can be generalized to additional biological structures or processes, such as optimizing the breadth-depth tradeoff for the inference of the spatial configuration of tissues, and integrated with structural and expression-based prior knowledge.

Acknowledgements

We thank Rebecca Boiarsky and David Sontag for early discussions on this work, and Zoe Piran for comments on the manuscript.

Contributor Information

Noa Moriel, School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem 9190401, Israel.

Edvin Memet, Department of Physics, Harvard University, Cambridge, MA 02138, United States.

Mor Nitzan, School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem 9190401, Israel; Racah Institute of Physics, The Hebrew University of Jerusalem, Jerusalem 9190401, Israel; Faculty of Medicine, The Hebrew University of Jerusalem, Jerusalem 9112102, Israel.

Supplementary data

Supplementary data are available at Bioinformatics online.

Conflict of interest

None declared.

Funding

This work was supported by the Israeli Council for Higher Education Ph.D. fellowship (N.M.), the Center for Interdisciplinary Data Science Research at the Hebrew University of Jerusalem (N.M.), Alon Fellowship, and the Israel Science Foundation [1079/21 to M.N.].

Data availability

Single-cell datasets of mESC differentiation into primitive endoderm cells, pancreatic beta cells maturation, hepatoblasts bifurcating differentiation, embryonic fibroblasts reprogramming, and of hematopoiesis were downloaded from https://zenodo.org/record/1443566#.YEExrpMzbDI (Saelens et al. 2019). The pancreatic endocrinogenesis dataset (Bastidas-Ponce et al. 2019) was downloaded through scVelo (Bergen et al. 2020). See dataset statistics in Supplementary Table S1.

Code availability

Code is available at: https://github.com/nitzanlab/trajectory_reconstruction_tradeoff.

References

Ahlmann-Eltze C, Huber W.. Comparison of transformations for single-cell RNA-seq data. Nat Methods 2023;20:665–72. [DOI] [PMC free article] [PubMed] [Google Scholar]
Alexander Wolf F, Angerer P, Theis FJ.. Scanpy: large-scale single-cell gene expression data analysis. Genome Biol 2018;19:15. [DOI] [PMC free article] [PubMed] [Google Scholar]
Alexander Wolf F, Hamey FK, Plass M. et al. Paga: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells. Genome Biol 2019;20:59. [DOI] [PMC free article] [PubMed] [Google Scholar]
Bacher R, Kendziorski C.. Design and computational analysis of single-cell RNA-sequencing experiments. Genome Biol 2016;17:63. [DOI] [PMC free article] [PubMed] [Google Scholar]
Bass AJ, Robinson DG, Storey JD. Determining sufficient sequencing depth in RNA-seq differential expression studies. bioRxiv, 10.1101/635623,2019, preprint: not peer reviewed. [DOI]
Bastidas-Ponce A, Tritschler S, Dony L. et al. Comprehensive single cell mRNA profiling reveals a detailed roadmap for pancreatic endocrinogenesis. Development 2019;146:dev173849. [DOI] [PubMed] [Google Scholar]
Bendall SC, Davis KL, Amir EaD. et al. Single-cell trajectory detection uncovers progression and regulatory coordination in human B cell development. Cell 2014;157:714–25.
Bergen V, Lange M, Peidli S. et al. Generalizing RNA velocity to transient cell states through dynamical modeling. Nat Biotechnol 2020;38:1408–14.
Birnbaum KD. Power in numbers: single-cell RNA-seq strategies to dissect complex tissues. Annu Rev Genet 2018;52:203–21.
Cannoodt R, Saelens W, Saeys Y. Computational methods for trajectory inference from single-cell transcriptomics. Eur J Immunol 2016;46:2496–506.
Dal Molin A, Di Camillo B. How to design a single-cell RNA-sequencing experiment: pitfalls, challenges and perspectives. Brief Bioinform 2019;20:1384–94.
Ding J, Sharon N, Bar-Joseph Z. Temporal modelling using single-cell transcriptomics. Nat Rev Genet 2022;23:355–68.
Ecker JR, Geschwind DH, Kriegstein AR. et al. The BRAIN Initiative Cell Census Consortium: lessons learned toward generating a comprehensive brain cell atlas. Neuron 2017;96:542–57.
Grün D, van Oudenaarden A. Design and analysis of single-cell sequencing experiments. Cell 2015;163:799–810.
Haghverdi L, Büttner M, Wolf FA. et al. Diffusion pseudotime robustly reconstructs lineage branching. Nat Methods 2016;13:845–8.
Haque A, Engel J, Teichmann SA. et al. A practical guide to single-cell RNA-sequencing for biomedical research and clinical applications. Genome Med 2017;9:75.
Hayashi T, Ozaki H, Sasagawa Y. et al. Single-cell full-length total RNA sequencing uncovers dynamics of recursive splicing and enhancer RNAs. Nat Commun 2018;9:619.
Hedlund E, Deng Q. Single-cell RNA sequencing: technical advancements and biological applications. Mol Aspects Med 2018;59:36–46.
Heimberg G, Bhatnagar R, El-Samad H. et al. Low dimensionality in gene expression data enables the accurate extraction of transcriptional programs from shallow sequencing. Cell Syst 2016;2:239–50.
Hwang B, Lee JH, Bang D. Single-cell RNA sequencing technologies and bioinformatics pipelines. Exp Mol Med 2018;50:1–14.
Kester L, van Oudenaarden A. Single-cell transcriptomics meets lineage tracing. Cell Stem Cell 2018;23:166–79.
Kolodziejczyk AA, Kim JK, Svensson V. et al. The technology and biology of single-cell RNA sequencing. Mol Cell 2015;58:610–20.
Lähnemann D, Köster J, Szczurek E. et al. Eleven grand challenges in single-cell data science. Genome Biol 2020;21:31.
Olsson A, Venkatasubramanian M, Chaudhri VK. et al. Single-cell analysis of mixed-lineage states leading to a binary cell fate choice. Nature 2016;537:698–702.
Papalexi E, Satija R. Single-cell RNA sequencing to explore immune cell heterogeneity. Nat Rev Immunol 2018;18:35–45.
Pollen AA, Nowakowski TJ, Shuga J. et al. Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex. Nat Biotechnol 2014;32:1053–8.
Qiu W-L, Zhang Y-W, Feng Y. et al. Deciphering pancreatic islet β cell and α cell maturation pathways and characteristic features at the single-cell level. Cell Metab 2017;25:1194–205.e4.
Rizzetto S, Eltahla AA, Lin P. et al. Impact of sequencing depth and read length on single cell RNA sequencing data of T cells. Sci Rep 2017;7:12781.
Saelens W, Cannoodt R, Todorov H. et al. A comparison of single-cell trajectory inference methods. Nat Biotechnol 2019;37:547–54.
Seirup M, Chu L-F, Sengupta S. et al. Reproducibility across single-cell RNA-seq protocols for spatial ordering analysis. PLoS One 2020;15:e0239711.
Shalek AK, Satija R, Shuga J. et al. Single-cell RNA-seq reveals dynamic paracrine control of cellular variation. Nature 2014;510:363–9.
Stegle O, Teichmann SA, Marioni JC. Computational and analytical challenges in single-cell transcriptomics. Nat Rev Genet 2015;16:133–45.
Streets AM, Huang Y. How deep is enough in single-cell RNA-seq? Nat Biotechnol 2014;32:1005–6.
Svensson V, da Veiga Beltrame E, Pachter L. Quantifying the tradeoff between sequencing depth and cell number in single-cell RNA-seq. bioRxiv, 10.1101/762773, 2019, preprint: not peer reviewed.
Torre E, Dueck H, Shaffer S. et al. Rare cell detection by single-cell RNA sequencing as guided by single-molecule RNA FISH. Cell Syst 2018;6:171–9.e5.
Treutlein B, Lee QY, Camp JG. et al. Dissecting direct reprogramming from fibroblast to neuron using single-cell RNA-seq. Nature 2016;534:391–5.
Tung P-Y, Blischak JD, Hsiao CJ. et al. Batch effects and the effective design of single-cell gene expression studies. Sci Rep 2017;7:39921.
Wagner A, Regev A, Yosef N. Revealing the vectors of cellular identity with single-cell genomics. Nat Biotechnol 2016;34:1145–60.
Wagner DE, Klein AM. Lineage tracing meets single-cell omics: opportunities and challenges. Nat Rev Genet 2020;21:410–27.
Yang L, Wang W-H, Qiu W-L. et al. A single-cell transcriptomic analysis reveals precise pathways and regulatory mechanisms underlying hepatoblast differentiation. Hepatology 2017;66:1387–401.
Zhang MJ, Ntranos V, Tse D. Determining sequencing depth in a single-cell RNA-seq experiment. Nat Commun 2020;11:774.
Zheng GXY, Terry JM, Belgrader P. et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun 2017;8:14049.

Associated Data


Supplementary Materials

btae258_Supplementary_Data

btae258_supplementary_data.pdf (2.5 MB, PDF)

