PLOS Computational Biology. 2020 Oct 1;16(10):e1008240. doi: 10.1371/journal.pcbi.1008240

PhyDOSE: Design of follow-up single-cell sequencing experiments of tumors

Leah L Weber 1,#, Nuraini Aguse 1,#, Nicholas Chia 2,3, Mohammed El-Kebir 1,*
Editor: Niranjan Nagarajan 4
PMCID: PMC7553321  PMID: 33001973

Abstract

The combination of bulk and single-cell DNA sequencing data of the same tumor enables the inference of high-fidelity phylogenies that form the input to many important downstream analyses in cancer genomics. While many studies simultaneously perform bulk and single-cell sequencing, some studies have analyzed initial bulk data to identify which mutations to target in a follow-up single-cell sequencing experiment, thereby decreasing cost. Bulk data provide an additional untapped source of valuable information, composed of candidate phylogenies and associated clonal prevalence. Here, we introduce PhyDOSE, a method that uses this information to strategically optimize the design of follow-up single cell experiments. Underpinning our method is the observation that only a small number of clones uniquely distinguish one candidate tree from all other trees. We incorporate distinguishing features into a probabilistic model that infers the number of cells to sequence so as to confidently reconstruct the phylogeny of the tumor. We validate PhyDOSE using simulations and a retrospective analysis of a leukemia patient, concluding that PhyDOSE’s computed number of cells resolves tree ambiguity even in the presence of typical single-cell sequencing errors. We also conduct a retrospective analysis on an acute myeloid leukemia cohort, demonstrating the potential to achieve similar results with a significant reduction in the number of cells sequenced. In a prospective analysis, we demonstrate the advantage of selecting cells to sequence across multiple biopsies and that only a small number of cells suffice to disambiguate the solution space of trees in a recent lung cancer cohort. In summary, PhyDOSE proposes cost-efficient single-cell sequencing experiments that yield high-fidelity phylogenies, which will improve downstream analyses aimed at deepening our understanding of cancer biology.

Author summary

Cancer development in a patient can be explained using a phylogeny—a tree that describes the evolutionary history of a tumor and has therapeutic implications. A tumor phylogeny is constructed from sequencing data, commonly obtained using either bulk or single-cell DNA sequencing technology. The accuracy of tumor phylogeny inference increases when both types of data are used, but single-cell sequencing may become prohibitively costly with increasing number of cells. Here, we propose a method that uses bulk sequencing data to guide the design of a follow-up single-cell sequencing experiment. Our results suggest that PhyDOSE provides a significant decrease in the number of cells to sequence compared to the number of cells sequenced in existing studies. The ability to make informed decisions based on prior data can help reduce the cost of follow-up single cell sequencing experiments of tumors, improving accuracy of tumor phylogeny inference and ultimately getting us closer to understanding and treating cancer.


This is a PLOS Computational Biology Methods paper.

Introduction

Tumorigenesis follows an evolutionary process during which cells gain and accumulate somatic mutations that lead to cancer [1]. The most natural expression of an evolutionary process is a phylogeny—a tree that describes the order and branching points of events in the history of a cellular population. Tumor phylogenies are critical to understanding and ultimately treating cancer, with recent studies using tumor phylogenies to identify mutations that drive cancer progression [2, 3], assess the interplay between the immune system and the clonal architecture of a tumor [4, 5], and identify common evolutionary patterns in tumorigenesis and metastasis [6, 7]. These downstream analyses critically rely on accurate phylogenies that are inferred from sequencing data of a tumor.

The majority of current cancer genomics data consist of pairs of matched normal and tumor samples that have undergone bulk DNA sequencing. Bulk data is composed of sequences from cells with distinct genomes. More specifically, we observe frequencies $f = [f_i]$ for the set of somatic mutations in the tumor (Fig 1A). Many deconvolution methods have been proposed for tumor phylogeny inference from such data [8–13], typically inferring a set $\mathcal{T}$ of equally plausible trees (Fig 1B). These approaches are unsatisfactory, as candidate trees with different topologies may alter conclusions in downstream analyses. Single-cell sequencing (SCS), as opposed to bulk sequencing, enables us to observe specific clones present within the tumor. These clones correspond to the leaves of the true phylogeny, allowing phylogeny inference methods to reconstruct the tree itself once we observe all clones in the tumor [14–17]. However, the elevated error rates of SCS, as well as its high cost [18], make it prohibitive as a standalone method for phylogeny inference. As such, hybrid methods have been recently proposed to infer high-fidelity phylogenies from combined bulk and SCS data obtained from the same tumor [19, 20]. Furthermore, without a rigorous framework to determine the number of single cells to sequence, this decision is currently guided by budget constraints or arbitrarily determined by exogenous factors, such as thresholds for a sequencing run. This could result in excessive costs by sequencing too many cells or sunk costs associated with an unsuccessful experiment when an insufficient number of cells are sequenced.

Fig 1. PhyDOSE computes the number of single cells to sequence to identify the true phylogeny.

(A) Mutation frequencies f obtained from bulk DNA sequencing data. (B) The solution space $\mathcal{T}$ of trees inferred from f. We show a distinguishing feature of $T_1$ (orange and green). (C) For tree $T_1$, PhyDOSE suggests that k = 2 single cells suffice to observe clones that are unique to $T_1$. (D) In a follow-up SCS experiment we observe k = 2 cells, one from the orange clone and one from the green clone. As such, we eliminate trees $T_2$ and $T_3$, concluding that phylogeny $T_1$ is the true phylogeny T*.

Several hybrid datasets have been obtained by performing bulk and single-cell DNA sequencing simultaneously [21, 22]. However, there is merit in first performing bulk sequencing to guide follow-up SCS experiments. For instance, several studies first identified a subset of single-nucleotide variants from the bulk data to target in subsequent SCS experiments, thereby reducing costs compared to conventional whole-genome SCS approaches [23–25]. A recently introduced method, SCOPIT, computes how many cells are needed to observe all clones of a tumor, given estimates on the smallest prevalence of a clone as well as the number of clones with that smallest prevalence to detect [26]. The authors provide no guidance on how to obtain these two quantities. Here, we build upon this work by directly incorporating knowledge encoded by the trees $\mathcal{T}$ inferred from the initial bulk sequencing data. Indeed, by using data from an SCS experiment we may eliminate trees from $\mathcal{T}$ that do not align with the observed clones (Fig 1). In other words, if we observe all clones in a tumor, it is possible to determine the phylogeny of the tumor. However, is it possible to achieve the same goal by observing fewer clones? If so, how many cells are necessary for us to observe the required clones?

We introduce Phylogenetic Design Of Single-cell sequencing Experiments (PhyDOSE), a method to strategically design a follow-up SCS experiment aimed at inferring the true phylogeny (Fig 1). Given a set $\mathcal{T}$ of candidate trees inferred from initial bulk data, we describe how to distinguish a single tree T from the remaining trees in $\mathcal{T}$ using features unique to T. In particular, if our SCS experiment results in observing cells corresponding to a distinguishing feature of T, we may conclude that T is in fact the true tree. This means that we can typically identify T using only a subset of the clones. To determine the number of cells to sequence, we introduce a probabilistic model that incorporates SCS errors and models successful SCS experiments as a tail probability of a multinomial distribution (Fig 1D). Finally, we reconcile the sampled cells utilizing these distinguishing features to infer the true phylogeny (Fig 1D) and provide heuristics for considering uncertainty in frequency estimates and determining the number of cells to sequence across multiple available biopsies. We validate PhyDOSE using both simulated data and a retrospective analysis of a leukemia patient for whom both bulk sequencing and SCS have been performed. We also demonstrate the utility of PhyDOSE by prospectively computing how many cells are needed to resolve the uncertainty in phylogenies of a recent acute myeloid leukemia cohort [27] and lung cancer cohort [3]. The cost-efficient SCS experiments enabled by PhyDOSE will yield high-fidelity phylogenies, improving downstream analyses aimed at understanding tumorigenesis and developing treatment plans.

Materials and methods

We introduce Phylogenetic Design Of Single-cell sequencing Experiments (PhyDOSE), a method to determine the number of single cells to sequence to identify the true phylogeny given initial bulk sequencing data. PhyDOSE is implemented in C++/R and is available as an R package at https://github.com/elkebir-group/phydoser. This section describes the various methodological components of PhyDOSE.

Problem statement

Let n be the number of single-nucleotide variants, or simply mutations, identified from initial bulk sequencing data of a matched normal and tumor biopsy sample. For each mutation i, we observe the variant allele frequency (VAF), i.e. the fraction of aligned reads that harbor the tumor allele at the locus of mutation i. Specialized methods exist that combine copy number information and VAFs to infer a cancer cell fraction $f_i$ for each mutation i, which is the proportion of cells in the tumor biopsy that contain at least one copy of the mutation [3, 28–30]. Here, we refer to cancer cell fractions as frequencies. Typically, phylogenies $\mathcal{T}$ inferred by current methods from frequencies $f = [f_i]$ adhere to the infinite sites assumption. That is, each mutation i is introduced exactly once at vertex $v_i$ and never subsequently lost.

When we sequence a single cell from the same tumor biopsy, assuming no errors, we identify a clone of the tumor. In other words, we observe a set of mutations that must form a connected path in the unknown true phylogeny T*. By repeatedly sequencing single cells until we observe all clones in the tumor, we will have observed all root-to-vertex paths of T*, thus identifying tree T* itself. We assume that (i) the true unknown phylogeny T* is among the trees in $\mathcal{T}$ and that (ii) mutations among single cells that we sample from the tumor biopsy follow the same distribution as f. These assumptions are important for the mathematical derivation of PhyDOSE, but violations are common in practice. Through simulations, we explore the impact of violating these assumptions and show that our approach is robust to many realistic scenarios.

This leads to the following question and problem statement with respect to these two assumptions. How many single cells do we need to identify T* with confidence level γ?

Problem 1 (SCS Power Calculation (SCS-PC)). Given a set $\mathcal{T}$ of candidate phylogenies, frequencies f and confidence level γ, find the minimum number k* of single cells needed to determine the true phylogeny T* among $\mathcal{T}$ with probability at least γ.

Clearly, we do not know which phylogeny in $\mathcal{T}$ is the true underlying phylogeny T* of the tumor. Thus, we consider a slightly different problem: In the T-SCS-PC problem (defined formally at the end of the section), we are given an arbitrary phylogeny $T \in \mathcal{T}$ and want to perform a similar power calculation when conditioning on T being the true phylogeny. By solving the T-SCS-PC problem for all trees $T_1, \ldots, T_{|\mathcal{T}|}$, we obtain the numbers $k(T_1), \ldots, k(T_{|\mathcal{T}|})$ of single cells needed for each tree. As T* is in $\mathcal{T}$, the maximum number among $k(T_1), \ldots, k(T_{|\mathcal{T}|})$ is an upper bound on the number of single cells required to identify T* with probability at least γ. To solve the T-SCS-PC problem, we need to determine for which SCS experiments we can conclude that T is the true phylogeny.

Observe that each tree T in $\mathcal{T}$ describes a unique set of clones, corresponding to the sets of mutations encountered in all root-to-vertex paths of T (Fig 1). Thus, if we observe all clones of a phylogeny T in our SCS experiments, we may conclude that T is the true phylogeny. What is the probability of doing so? To answer this question, we must compute the prevalence of each clone in the tumor biopsy.

For phylogenies that adhere to the infinite sites assumption, the prevalence u(T, f) = [ui] of the clones in the tumor biopsy are uniquely determined by the phylogeny T and frequencies f as

$$u_i = f_i - \sum_{j \in \delta_T(i)} f_j \qquad \forall i \in [n], \tag{1}$$

where $\delta_T(i)$ is the set of children of the node where mutation i was introduced [9].

Tumor phylogeny inference methods guarantee that the phylogenies $\mathcal{T}$ inferred from frequencies f have clonal prevalences $u(T, f) = [u_i]$ that are nonnegative and satisfy $\sum_{i=1}^{n} u_i \le 1$, where the remainder $u_0 = 1 - \sum_{i=1}^{n} u_i$ is the prevalence of the normal clone. Thus, conditioning on a phylogeny T and frequencies f, sequencing one cell from the tumor will lead us to observe one of the n + 1 clones of T with probabilities $(u_0, \ldots, u_n)$. In other words, the outcome of this SCS experiment with one cell is a draw from the categorical distribution $\mathrm{Cat}(u_0, \ldots, u_n)$. The possible outcomes of an SCS experiment composed of k cells thus follow a multinomial distribution $\mathrm{Mult}(k, u_0, \ldots, u_n)$. Hence, the probability of observing all tumor clones of T in such an SCS experiment with k cells corresponds to the tail probability of the multinomial where each of the n tumor clones is observed at least once.
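As a minimal sketch of Eq (1), the following Python snippet derives clonal prevalences from a tree and its frequencies. The child-list encoding of the tree, the function name `clonal_prevalence` and the toy numbers are illustrative assumptions; they are not taken from the phydoser package.

```python
from typing import Dict, List


def clonal_prevalence(children: Dict[int, List[int]],
                      f: Dict[int, float]) -> Dict[int, float]:
    """Eq (1): u_i = f_i minus the sum of f_j over the children j of i.

    `children` maps a mutation to the mutations introduced directly below
    it (hypothetical encoding). Key 0 of the result is the normal clone.
    """
    u = {i: f[i] - sum(f[j] for j in children.get(i, [])) for i in f}
    if any(v < -1e-9 for v in u.values()):
        raise ValueError("tree is inconsistent with these frequencies")
    # The normal clone takes whatever prevalence the tumor clones leave.
    u[0] = 1.0 - sum(u.values())
    return u


# Toy tree: mutation 1 is the root, 2 and 3 are its children, 4 is below 3.
children = {1: [2, 3], 3: [4]}
f = {1: 0.9, 2: 0.3, 3: 0.4, 4: 0.1}
u = clonal_prevalence(children, f)
```

Sequencing one cell then corresponds to a single draw from the categorical distribution over the values of `u`.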

The corresponding power calculation is to determine the smallest number k for which the tail probability is greater than or equal to the confidence level γ. Note that this power calculation for observing all clones has been previously introduced [26].
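The power calculation for observing all clones can be sketched as follows. This example evaluates the multinomial tail probability by inclusion-exclusion over the clones that might be missed, which is a simple exact alternative and not the Poisson-based fast calculation of [26]; the function names and the search bound `k_max` are assumptions made for illustration.

```python
from itertools import combinations
from typing import Sequence


def prob_all_observed(u: Sequence[float], required: Sequence[int], k: int) -> float:
    """P(every clone in `required` is seen at least once among k cells)."""
    total = 0.0
    for r in range(len(required) + 1):
        for missing in combinations(required, r):
            # Probability that none of the k cells comes from `missing`.
            p_none = (1.0 - sum(u[i] for i in missing)) ** k
            total += (-1) ** r * p_none
    return total


def min_cells(u: Sequence[float], required: Sequence[int],
              gamma: float, k_max: int = 100_000) -> int:
    """Smallest k whose tail probability is at least gamma."""
    for k in range(k_max + 1):
        if prob_all_observed(u, required, k) >= gamma:
            return k
    raise ValueError("confidence level not reached within k_max cells")


# Index 0 is the normal clone; clones 1..4 are tumor clones.
u = [0.1, 0.2, 0.3, 0.3, 0.1]
k_all = min_cells(u, required=[1, 2, 3, 4], gamma=0.95)
```

The cost of this sketch grows exponentially with the number of required clones, so it is only suitable for small examples.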

Importantly, in many cases we need not observe all clones of T to distinguish T from the remaining phylogenies $\mathcal{T} \setminus \{T\}$ (Fig 2). This means that we may conclude that T is the true phylogeny with an SCS experiment with fewer cells. To formalize this notion, we start by defining a featurette.

Fig 2. The SCS power calculation for phylogeny T (T-SCS-PC) problem.

(A) We are given frequencies f and a tree $T_1$ that we want to distinguish from the other trees $\{T_2, T_3\}$. The pair $(T_1, f)$ uniquely determines the clonal prevalence $u(T_1, f)$. (B) Featurettes of $T_1$ correspond to root-to-vertex paths, yielding distinguishing features $\Pi_1$ and $\Pi_2$, each with one featurette absent in $T_2$ and another absent in $T_3$. (C) With k = 2 cells, we must observe clones from either $\Pi_1$ or $\Pi_2$ for a successful outcome, resulting in probability $\Pr(Y_2 \mid u(T_1, f)) \approx 0.12$. (D) To increase this probability to γ = 0.95, we need k* = 32 cells.

Definition 1. A featurette τ is a subset of mutations.

We say that a featurette τ is present in a phylogeny T if the nodes/mutations of τ form a connected path of T starting at the root node; otherwise we say that τ is absent in T. The same featurette, however, may be present in more than one phylogeny. Thus, multiple featurettes may be required to distinguish a phylogeny T from the remaining phylogenies $\mathcal{T} \setminus \{T\}$.

Definition 2. A set Π of featurettes is a distinguishing feature of T if (i) for all featurettes $\tau \in \Pi$ it holds that τ is present in T, and (ii) for each remaining phylogeny $T' \in \mathcal{T} \setminus \{T\}$ there exists a featurette $\tau' \in \Pi$ where τ′ is absent in T′.

Thus, an SCS experiment where we observe one cell from each clone of a distinguishing feature Π of T enables us to conclude that phylogeny T is the true phylogeny. As discussed, every phylogeny T has a trivial distinguishing feature, which is composed of all featurettes present in T. Moreover, T may have multiple distinguishing features. Therefore, we must consider the complete set of all distinguishing features, which we call the distinguishing feature family.

Definition 3. The set $\Phi(T, \mathcal{T} \setminus \{T\})$ composed of all distinguishing features of T with respect to $\mathcal{T} \setminus \{T\}$ is a distinguishing feature family of T.
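Definitions 1 to 3 can be made concrete with a small brute-force sketch. The parent-map encoding of a tree, the three toy trees and all function names below are hypothetical and chosen only for illustration; the enumeration is exponential in the number of featurettes and is not how phydoser computes the family.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Optional, Set

Tree = Dict[str, Optional[str]]   # mutation -> parent mutation (None at the root)
Featurette = FrozenSet[str]


def featurettes(tree: Tree) -> Set[Featurette]:
    """All root-to-vertex paths of `tree`, each as a set of mutations."""
    paths = set()
    for v in tree:
        path, node = [], v
        while node is not None:
            path.append(node)
            node = tree[node]
        paths.add(frozenset(path))
    return paths


def is_distinguishing(feature: Iterable[Featurette], tree: Tree,
                      others: List[Tree]) -> bool:
    """Definition 2: all featurettes present in `tree`, one absent in each other tree."""
    feature = list(feature)
    present = featurettes(tree)
    if not all(tau in present for tau in feature):
        return False
    other_sets = [featurettes(o) for o in others]
    return all(any(tau not in s for tau in feature) for s in other_sets)


def distinguishing_family(tree: Tree, others: List[Tree]) -> List[FrozenSet[Featurette]]:
    """Definition 3 by brute force over all non-empty sets of featurettes."""
    taus = sorted(featurettes(tree), key=sorted)
    return [frozenset(c)
            for r in range(1, len(taus) + 1)
            for c in combinations(taus, r)
            if is_distinguishing(c, tree, others)]


T1 = {"a": None, "b": "a", "c": "b"}   # chain a -> b -> c
T2 = {"a": None, "b": "a", "c": "a"}   # a with children b and c
T3 = {"a": None, "c": "a", "b": "c"}   # chain a -> c -> b
family = distinguishing_family(T1, [T2, T3])
```

In this toy example, the featurette {a, b, c} alone does not distinguish T1 because it is also present in T3, whereas {a, b} together with {a, b, c} does.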

Let $(c_0, \ldots, c_n)$ be the outcome of an SCS experiment of k cells, where $c_i \ge 0$ is the number of cells observed of clone i and $\sum_{i=0}^{n} c_i = k$. This experiment is successful if, among the k sequenced cells, we observe the clones of at least one distinguishing feature $\Pi \in \Phi(T, \mathcal{T} \setminus \{T\})$, i.e. $c_i > 0$ for all clones i in some distinguishing feature $\Pi \in \Phi(T, \mathcal{T} \setminus \{T\})$. As discussed, conditioning on frequencies f and T being the true phylogeny, outcomes $(c_0, \ldots, c_n)$ of SCS experiments of k cells follow a multinomial distribution $\mathrm{Mult}(k, u_0, \ldots, u_n)$ where $u(T, f) = [u_i]$ is defined as in (1). Let $Y_k$ denote the event of a successful outcome. We are interested in computing the probability $\Pr(Y_k \mid u(T, f))$, which equals the sum of the probabilities of all successful outcomes. More specifically, we want to determine the smallest number k* of single cells to sequence such that $\Pr(Y_{k^*} \mid u(T, f))$ is at least the prescribed confidence level γ (Fig 2).

Problem 2 (SCS Power Calculation for Phylogeny T (T-SCS-PC)). Given a set $\mathcal{T}$ of candidate phylogenies and a phylogeny $T \in \mathcal{T}$, frequencies f and confidence level γ, find the minimum number k* of single cells needed such that $\Pr(Y_{k^*} \mid u(T, f)) \ge \gamma$.

In Section A.1 in S1 Text, we prove that the above problem is NP-hard.

Theorem 1. T-SCS-PC is NP-hard.

Multiple biopsies

The SCS-PC problem is only applicable to bulk sequencing data obtained from a single biopsy, i.e. the number of cells calculated is only for an SCS experiment on one sample. However, bulk samples from tumors are often obtained from multiple biopsies, each with different mutation frequencies and consequently different clonal prevalences. One approach to support such data is to solve the SCS-PC problem for each biopsy in isolation and select the biopsy that requires the smallest number of cells. However, a more cost-effective approach that also better captures intra-tumor heterogeneity is to perform a follow-up SCS experiment with cells from multiple biopsies. In particular, the naive selection approach might not yield a solution if the clones of a distinguishing feature do not co-occur in a single biopsy.

With b ≥ 1 biopsies, the input changes from a frequency vector $f \in [0, 1]^n$ to a frequency matrix $F \in [0, 1]^{b \times n}$, whose entries $f_{pi}$ indicate the frequency of mutation i in biopsy p. Similarly, the output changes from an integer $k^* \in \mathbb{N}$ to a count vector $k^* \in \mathbb{N}^b$, such that each entry $k_p^*$ indicates the number of single cells to sequence from biopsy p. We have the following two generalizations of the SCS-PC problem and the T-SCS-PC problem for the case of b ≥ 1 biopsies.

Problem 3 (Multi-Sample SCS Power Calculation (Mul-SCS-PC)). Given a set $\mathcal{T}$ of candidate phylogenies, frequencies F from b biopsies and confidence level γ, find the numbers $k^* = [k_1^*, \ldots, k_b^*]$ of single cells needed from each biopsy to determine the true phylogeny T* among $\mathcal{T}$ with probability at least γ such that the total number $\|k^*\|_1 = \sum_{p=1}^{b} k_p^*$ of cells is minimum.

Problem 4 (Multi-Sample SCS Power Calculation for Phylogeny T (T-Mul-SCS-PC)). Given a set $\mathcal{T}$ of candidate phylogenies and a phylogeny $T \in \mathcal{T}$, frequencies F from b biopsies and confidence level γ, find the numbers $k^* = [k_1^*, \ldots, k_b^*]$ of single cells needed from each biopsy such that $\Pr(Y_{k^*} \mid U(T, F)) \ge \gamma$ and the total number $\|k^*\|_1 = \sum_{p=1}^{b} k_p^*$ of cells is minimum.

For the case where b = 1, the T-SCS-PC and the T-Mul-SCS-PC problems are identical, which yields the following hardness result.

Corollary 1. T-Mul-SCS-PC is NP-hard.

Multinomial power calculation

To solve the T-SCS-PC problem, it suffices to have an algorithm that computes $\Pr(Y_k \mid u(T, f))$, which is the probability of concluding that T is the true phylogeny. Using this algorithm we identify k* by starting from k = 0 and simply incrementing k until the corresponding probability $\Pr(Y_k \mid u(T, f))$ is at least the prescribed confidence level γ. In the following, we describe how to efficiently compute $\Pr(Y_k \mid u(T, f))$.

Recall that the outcome of an SCS experiment composed of k cells corresponds to a vector $c = [c_i]$, where $c_i \ge 0$ is the number of cells that we observe from clone i and $\sum_{i=0}^{n} c_i = k$. In a successful outcome c we observe at least one cell for each featurette in at least one distinguishing feature $\Pi \in \Phi(T, \mathcal{T} \setminus \{T\})$, where $\Phi(T, \mathcal{T} \setminus \{T\})$ is the distinguishing feature family. For brevity, we will write Φ rather than $\Phi(T, \mathcal{T} \setminus \{T\})$.

Let c(Π, k) denote the set of all outcomes where we observe at least one cell for each featurette in a distinguishing feature Π, i.e. $\sum_{i=0}^{n} c_i = k$, and for all i ∈ {0, …, n} it holds that $c_i > 0$ if clone i is a featurette in Π and $c_i \ge 0$ otherwise. The set c(Φ, k) of successful outcomes is defined as the union $\bigcup_{\Pi \in \Phi} c(\Pi, k)$. Any SCS outcome $c = (c_0, \ldots, c_n)$ is distributed according to $\mathrm{Mult}(k, u(T, f))$. Since successful outcomes enable us to conclude that T is the true phylogeny, we have

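$$\Pr(Y_k \mid u(T, f)) = \sum_{c \in c(\Phi, k)} \mathrm{Mult}(c \mid k, u(T, f)) = \sum_{c \in c(\Phi, k)} \frac{k!}{\prod_{i=0}^{n} c_i!} \prod_{i=0}^{n} u_i^{c_i}. \tag{2}$$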

If there is only one distinguishing feature Π, i.e. Φ = {Π}, then the desired probability is a standard tail probability of the multinomial where we sum up the probabilities of the outcomes $c = [c_i]$ in c(Π, k), i.e. those with $\sum_{i=0}^{n} c_i = k$, $c_i > 0$ if clone i is a featurette of Π and $c_i \ge 0$ otherwise. A fast calculation of this tail probability was developed using a connection to the conditional probability of independent Poisson random variables [26, 31]. If there are multiple distinguishing features but they are pairwise disjoint (i.e. no two distinct distinguishing features share the same featurette), then we simply have

$$\Pr(Y_k \mid u(T, f)) = \sum_{\Pi \in \Phi} \sum_{c \in c(\Pi, k)} \mathrm{Mult}(c \mid k, u(T, f)), \tag{3}$$

and we can apply the fast computation [26] to obtain each independent tail probability. However, the equality in the above equation does not hold if the family Φ is composed of distinguishing features with overlapping featurettes. Incorrectly applying this equation will lead us to overestimate the value of k*. Since single-cell sequencing is expensive, overestimating the number of cells to sequence in an SCS experiment can be costly and unnecessary. A naive alternative would be to enumerate all $(n + 1)^k$ SCS outcomes by brute force, but this does not scale. Instead, to calculate $\Pr(Y_k \mid u(T, f))$ exactly, we propose to use the inclusion-exclusion principle as follows.

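$$\Pr(Y_k \mid u(T, f)) = \sum_{\emptyset \neq \Phi' \subseteq \Phi} (-1)^{|\Phi'| + 1} \sum_{c \in c(I(\Phi'), k)} \mathrm{Mult}(c \mid k, u(T, f)), \tag{4}$$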

where I(Φ′) is the set of all featurettes in Φ′, i.e. I(Φ′) = ⋃Π∈Φ′ Π (Fig 3A).

Fig 3. PhyDOSE implementation details.

(A) To account for minimal distinguishing features that share featurettes, we use the inclusion-exclusion principle to compute $\Pr(Y_k \mid u(T, f))$. Here, $\Pi_1$ (red) and $\Pi_2$ (blue) share a featurette (with ‘triangle’ and ‘heart’ mutations). (B) To enumerate the set Φ* of minimal distinguishing features of $T_1$, we reduce the problem to Set Cover and repeatedly identify minimum covers. Here, the universe U is composed of trees $\{T_2, T_3\}$ and there is a subset in F for each featurette τ of $T_1$ composed of the trees where τ is absent.

Thus, we need to compute $2^{|\Phi|} - 1$ tail probabilities, each of which can be computed using the fast calculation in SCOPIT [26].
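The inclusion-exclusion sum of Eq (4), together with the search for k*, can be sketched as below. Each inner tail probability is itself computed by inclusion-exclusion over missed clones rather than by the fast calculation of [26], and the clone labels, prevalences and function names are assumptions made for this example.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, Sequence


def tail_prob(u: Dict[int, float], required: Iterable[int], k: int) -> float:
    """P(each clone in `required` is observed at least once among k cells)."""
    required = list(required)
    total = 0.0
    for r in range(len(required) + 1):
        for missing in combinations(required, r):
            total += (-1) ** r * (1.0 - sum(u[i] for i in missing)) ** k
    return total


def success_prob(u: Dict[int, float], features: Sequence[FrozenSet[int]], k: int) -> float:
    """Eq (4): alternating sum over all non-empty subfamilies of `features`."""
    total = 0.0
    for r in range(1, len(features) + 1):
        for sub in combinations(features, r):
            union = frozenset().union(*sub)        # I(Phi') in the text
            total += (-1) ** (r + 1) * tail_prob(u, union, k)
    return total


def k_star(u: Dict[int, float], features: Sequence[FrozenSet[int]],
           gamma: float, k_max: int = 10_000) -> int:
    """Increment k from 0 until the success probability reaches gamma."""
    for k in range(k_max + 1):
        if success_prob(u, features, k) >= gamma:
            return k
    raise ValueError("confidence level not reached within k_max cells")


# Clone 0 is normal; two distinguishing features share featurette/clone 2.
u = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}
features = [frozenset({1, 2}), frozenset({2, 3})]
k_needed = k_star(u, features, gamma=0.95)
```

With k = 2 cells, the two features contribute 0.12 and 0.24 and cannot both be observed, so the success probability is 0.36.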

In the worst case, Φ has $O(2^n)$ distinguishing features resulting in $O(2^n)$ tail probabilities. We now describe one final optimization that will significantly reduce the number of required computations. This is based on the following observation.

Observation 1. If Π is a distinguishing feature of T then for all featurettes τ present in T it holds that Π ∪ {τ} is a distinguishing feature of T.

This means that distinguishing features in Φ form a partially ordered set under the set inclusion relation. We call a distinguishing feature Π minimal if there does not exist another distinguishing feature Π′ ∈ Φ that is a proper subset of Π, i.e. Π′ ⊊ Π.

A direct consequence of Observation 1 is that the outcome of an SCS experiment is successful when we observe all featurettes of a distinguishing feature Π, and remains so even if we observe additional featurettes τ′ ∉ Π.

As such, successful outcomes w.r.t. Φ equal those w.r.t. the set Φ* of all minimal distinguishing features of T.

Observation 2. It holds that c(Φ*, k) = c(Φ, k).

Therefore, it suffices to restrict our attention to only Φ* rather than the complete family Φ when computing Pr(Yku(T, f)) using (4). Section B.1 in S1 Text describes how to find Φ* by reducing the problem to that of finding all minimal set covers, which we solve in an iterative fashion using integer linear programming.
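To illustrate the reduction to minimal set covers, here is a brute-force sketch that enumerates all inclusion-minimal covers for a tiny instance. It replaces the iterative integer linear programming approach of S1 Text with exhaustive search, and the featurette names and the `absent_in` mapping are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def minimal_covers(universe: Set[str],
                   absent_in: Dict[str, Set[str]]) -> List[FrozenSet[str]]:
    """All inclusion-minimal sets of featurettes that cover `universe`.

    absent_in[tau] is the set of other trees in which featurette tau is
    absent; a cover therefore rules out every tree in `universe`.
    """
    names = sorted(absent_in)
    found: List[FrozenSet[str]] = []
    # Candidates are visited by increasing size, so any smaller cover
    # contained in a candidate has already been recorded in `found`.
    for r in range(1, len(names) + 1):
        for combo in combinations(names, r):
            cand = frozenset(combo)
            if any(m <= cand for m in found):
                continue  # contains a smaller cover, hence not minimal
            covered = set().union(*(absent_in[t] for t in combo))
            if covered >= universe:
                found.append(cand)
    return found


# Featurettes of the tree of interest and the other trees lacking them.
absent_in = {"a": set(), "ab": {"T3"}, "abc": {"T2"}, "abd": {"T2", "T3"}}
phi_star = minimal_covers({"T2", "T3"}, absent_in)
```

Each returned cover corresponds to one minimal distinguishing feature in Φ*.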

Power calculation for multiple biopsies

We now discuss the T-Mul-SCS-PC problem, which is the generalization of the T-SCS-PC problem to b ≥ 1 biopsies. The key probability is $\Pr(Y_k \mid U(T, F))$, i.e. the probability of concluding that T is the true tree when sequencing $k = [k_1, \ldots, k_b]^\top$ cells from each biopsy. In the following, we discuss an exact (but computationally intensive) approach to compute this probability as well as a fast heuristic.

Given the numbers $k = [k_1, \ldots, k_b]^\top$ of cells to sequence from each biopsy, an outcome of the corresponding SCS experiment across b biopsies is defined as a matrix $C = [c_{pi}]$ such that $c_{pi} \in \mathbb{N}$ is the number of cells that we observe from clone i in biopsy p, and $\sum_{i=0}^{n} c_{pi} = k_p$ for all biopsies p. A successful outcome for a distinguishing feature Π is an outcome where we observe at least one cell for each featurette in Π. Let C(Π, k) denote the set of all successful outcomes for distinguishing feature Π. The set C(Φ*, k) of successful outcomes for the minimal distinguishing feature family Φ* is defined as the union $\bigcup_{\Pi \in \Phi^*} C(\Pi, k)$. Since sequencing of each biopsy proceeds independently, the probability of observing an outcome $C = [c_1, \ldots, c_b]^\top$ across b biopsies equals

$$\Pr(c_1, \ldots, c_b \mid k, U(T, F)) = \prod_{p=1}^{b} \Pr(c_p \mid k_p, u(T, f_p)) = \prod_{p=1}^{b} \mathrm{Mult}(c_p \mid k_p, u(T, f_p)). \tag{5}$$

Hence, the desired tail probability of successful outcomes for the minimal distinguishing feature family Φ* equals

$$\Pr(Y_k \mid U(T, F)) = \sum_{[c_1, \ldots, c_b]^\top \in C(\Phi^*, k)} \prod_{p=1}^{b} \mathrm{Mult}(c_p \mid k_p, u(T, f_p)). \tag{6}$$

For a single biopsy (b = 1), the probability $\Pr(Y_k \mid U(T, F))$ corresponds to sums of multinomial tail probabilities, enabling fast calculation using [26] as discussed in the previous section. This is no longer the case for b > 1 biopsies where the tail probability is over b independent multinomial distributions (see product in Eq (6)). A naive way to compute $\Pr(Y_k \mid U(T, F))$ would be to exhaustively enumerate all SCS outcomes with k cells, which scales exponentially in k. As such, we develop a heuristic approach, which selects a subset of required featurettes/clones in each biopsy that together form a distinguishing feature in Φ* and achieve the smallest total number of cells with confidence level γ. Section B.4 in S1 Text provides more details and Figure B in S1 Text provides an example where the heuristic returns a suboptimal solution.
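For small instances, the success probability for a single distinguishing feature across several biopsies can be evaluated without enumerating outcomes. The sketch below uses inclusion-exclusion over the featurettes that might be missed in every biopsy, relying only on the independence of biopsies in Eq (5). It is neither the exhaustive enumeration nor the heuristic described above, it is exponential in the size of the feature, and the function name and toy prevalences are assumptions.

```python
from itertools import combinations
from typing import Sequence


def prob_feature_multi(U: Sequence[Sequence[float]], k: Sequence[int],
                       feature: Sequence[int]) -> float:
    """P(each clone in `feature` is observed in at least one biopsy).

    U[p][i] is the prevalence of clone i in biopsy p and k[p] is the
    number of cells sequenced from biopsy p.
    """
    total = 0.0
    for r in range(len(feature) + 1):
        for missing in combinations(feature, r):
            # Biopsies are independent, so the probability that the clones
            # in `missing` are never sampled is a product over biopsies.
            p_none = 1.0
            for u_p, k_p in zip(U, k):
                p_none *= (1.0 - sum(u_p[i] for i in missing)) ** k_p
            total += (-1) ** r * p_none
    return total


# Clone 1 occurs only in biopsy 0 and clone 2 only in biopsy 1.
U = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.5]]
p_split = prob_feature_multi(U, k=[1, 1], feature=[1, 2])
p_single = prob_feature_multi(U, k=[5, 0], feature=[1, 2])
```

In this toy example, sequencing from one biopsy alone can never succeed, whereas one cell from each biopsy succeeds with probability 0.25, which mirrors the case where the clones of a distinguishing feature do not co-occur in a single biopsy.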

Consideration of SCS error rates

One current challenge with SCS is that the false negative rate per site is quite high with typical rates up to 0.4 for the commonly used multiple displacement amplification (MDA) method [32]. On the other hand, current false positive rates are low and are typically less than 0.0005 for MDA-based whole-genome amplification [32]. A false negative is defined as not observing a mutation that is present in the cell. A false positive occurs when we observe the presence of a mutation that did not occur in that cell.

With PhyDOSE, we propose one possible method for incorporating the false negative rate β when it is known. Specifically, sampled cells follow a categorical distribution $u = [u_0, \ldots, u_n]$ when conditioned on tree T. Hence, the probability of sampling a cell from clone i equals $u_i$. True positives, i.e. correctly observing a mutation in a clone, follow a Bernoulli distribution with parameter 1 − β. To observe a featurette/clone i that has $n_i$ mutations and a prevalence of $u_i$, we thus need to have $n_i$ true positives. In other words, assuming independence among mutations, we require $n_i$ successful draws from a Bernoulli distribution parameterized by 1 − β. As such, we derive new clonal prevalences $u'(T, f, \beta) = [u'_i]$ from $u(T, f) = [u_i]$. Additionally assuming independence between the events of a cell being sampled from clone i and the absence of false negatives, we set $u'_i = u_i (1 - \beta)^{n_i}$ where $n_i$ is the number of mutations in featurette/clone i. We set $u'_0$ to be equal to $1 - \sum_{i=1}^{n} u'_i$. This adjustment results in a reduction of the clonal prevalence and ultimately increases the value of k*. The issue of false positives is less serious as error rates are low enough to be negligible.
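The false negative adjustment amounts to a few lines of arithmetic. The following sketch assumes prevalences are stored in a list whose index 0 is the normal clone; the function name and example values are illustrative.

```python
from typing import List, Sequence


def adjust_for_false_negatives(u: Sequence[float], n_mut: Sequence[int],
                               beta: float) -> List[float]:
    """u'_i = u_i * (1 - beta) ** n_i for tumor clones i >= 1.

    n_mut[i] is the number of mutations in clone i; n_mut[0] is ignored
    because index 0 is the normal clone, which receives the remainder.
    """
    adjusted = [0.0] + [u_i * (1.0 - beta) ** n_i
                        for u_i, n_i in zip(u[1:], n_mut[1:])]
    adjusted[0] = 1.0 - sum(adjusted[1:])
    return adjusted


u = [0.1, 0.5, 0.4]          # normal clone, clone 1, clone 2
n_mut = [0, 1, 2]            # clone 2 carries two mutations
u_adj = adjust_for_false_negatives(u, n_mut, beta=0.2)
```

Clones with more mutations are penalized more strongly, since every one of their mutations must be observed without a false negative.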

Prioritizing candidate trees post SCS experiment

The final step is to prioritize candidate trees after performing an SCS experiment with the number k* of cells computed by PhyDOSE. To this end, we compute the support of each tree $T \in \mathcal{T}$. Intuitively, support(T) is the number of cells that support the conclusion that T is the actual phylogeny. Formally, we say that a distinguishing feature Π of a tree T is observed if each featurette of Π is observed in at least one cell. Using this, we define support(T) as the number of cells that correspond to featurettes of an observed distinguishing feature Π of T. Per Observation 1, it suffices to restrict our attention to the set Φ* of minimal distinguishing features.

There are two outcomes of an SCS experiment with k* cells. Either there is no tree $T \in \mathcal{T}$ with non-zero support or there are one or more trees with non-zero support. In the former case, the SCS experiment has failed, which is expected to occur with probability 1 − γ. In the latter case, where more than one tree may have non-zero support in the presence of false negatives and false positives, we return the set of trees with maximum support.
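A small sketch of the support computation is given below. It assumes error-free cells represented by their mutation sets and counts the cells matching any featurette of any observed minimal distinguishing feature; how cells are counted when several features are observed is an interpretation made for this example, and all names are hypothetical.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable, List

Featurette = FrozenSet[str]


def support(cells: Iterable[Featurette],
            min_features: List[List[Featurette]]) -> int:
    """Number of cells matching featurettes of observed distinguishing features."""
    counts = Counter(cells)
    supporting = set()
    for feature in min_features:
        # A feature is observed if every featurette was seen in some cell.
        if all(counts[tau] > 0 for tau in feature):
            supporting.update(feature)
    return sum(counts[tau] for tau in supporting)


def best_trees(cells: List[Featurette],
               features_by_tree: Dict[str, List[List[Featurette]]]) -> List[str]:
    """Trees with maximum non-zero support; empty if the experiment failed."""
    scores = {t: support(cells, feats) for t, feats in features_by_tree.items()}
    top = max(scores.values(), default=0)
    return [t for t, s in scores.items() if s == top and s > 0]


cells = [frozenset("ab"), frozenset("abc"), frozenset("abc"), frozenset("a")]
features_by_tree = {
    "T1": [[frozenset("ab"), frozenset("abc")]],
    "T2": [[frozenset("ac")]],
}
winners = best_trees(cells, features_by_tree)
```

Here three of the four cells support T1, while T2 has no observed distinguishing feature.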

Alternatively, we may use existing methods that infer tumor phylogeny from SCS data [14, 16, 33] or a combination of SCS and bulk data [19, 20].

k* confidence interval

A common challenge in bulk sequencing is uncertainty in the cancer cell fractions f due to sampling of reads as well as sequencing and mapping errors. Following standard practice [9, 10, 34], we account for this uncertainty by taking confidence intervals $[f^-, f^+]$ as input. Typically, such confidence intervals are obtained by viewing variant read counts as draws from a binomial or beta-binomial distribution. Importantly, uncertainty in the cancer cell fraction leads to uncertainty in clonal prevalences. Therefore, for each tree $T \in \mathcal{T}$, we utilize $[f^-, f^+]$ to construct an interval $[k_-^*(T), k_+^*(T)]$ of the number of single cells reflecting the extreme values the clonal prevalences may assume. To find these values, we must consider the frequencies $[f^-, f^+]$ using constraints from the tree T following the sum condition (1) and the featurettes in a distinguishing feature of T. We do this using a heuristic that we describe in Section B.2 in S1 Text. We set the overall interval $[k_-^*, k_+^*]$ conservatively as the confidence interval $[k_-^*(T), k_+^*(T)]$ from the tree T with the maximum $k_+^*(T)$ among all trees $T \in \mathcal{T}$.

phydoser R package

We developed PhyDOSE and the associated optimizations into a freely available R package named phydoser. The functions in the phydoser R package are grouped into four areas: (i) I/O support, (ii) pre- and post-processing, (iii) PhyDOSE implementation and (iv) visualization. For I/O support, phydoser offers a suite of functions to read and convert external data into the data structures required by the R implementation of PhyDOSE. The pre- and post-processing capabilities include generation of the distinguishing feature families of each tree, the frequency inputs for the k* confidence interval and the computation of the support metric for a completed SCS experiment. The implementation of PhyDOSE includes both a single biopsy and a multiple biopsy mode. Lastly, the visualization functions facilitate creation of high-resolution tree graphics while annotating the distinguishing feature or a specific featurette of a tree. phydoser is available at https://github.com/elkebir-group/phydoser.

Results

In this section, we demonstrate the application of PhyDOSE to simulated and real data. We begin by validating our method using simulated data. Next, we provide retrospective results for a leukemia patient [23] and an acute myeloid leukemia cohort [27] where both bulk and single-cell DNA sequencing have been performed [23]. Finally, we use PhyDOSE to perform a prospective analysis to determine the required number of single cells to identify the true phylogeny in a non-small cell lung cancer patient cohort [3]. Source data and results can be found at https://github.com/elkebir-group/PhyDOSE.

Simulations

Design

We used simulations to assess (i) the benefit of PhyDOSE’s distinguishing feature analysis, (ii) robustness to uncertainty due to sequencing errors and (iii) robustness to violations of PhyDOSE’s model assumptions. We generated simulated data where the ground truth tree T* is known. Given a fixed number c of clones and n mutations, we first generated a ground truth tree T* with c vertices uniformly at random using Prüfer sequences [35] and randomly distributed the n mutations to the c clones while ensuring that every clone had at least one mutation. Next, we generated clonal prevalences u = [u_i] by drawing from a symmetric (n + 1)-dimensional Dirichlet distribution with concentration parameter 0.2. We used rejection sampling to ensure that each clonal prevalence u_i was at least 0.05. Let σ(i) be the set of clones that contain mutation i. We generated frequencies f = [f_i] by setting f_i = ∑_{j ∈ σ(i)} u_j for each mutation i ∈ {1, …, n}. We used the SPRUCE algorithm to enumerate the set 𝒯 of trees given frequencies f [9].
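As a rough illustration of the simulation design above, the following Python sketch decodes a random Prüfer sequence into a tree, draws clonal prevalences by rejection sampling from a symmetric Dirichlet distribution, and sums prevalences over each subtree to obtain mutation frequencies. It is not the authors' simulation code. The function names are hypothetical, the tree is rooted at vertex 0 by assumption, each clone is assumed to introduce exactly one mutation (as in the conditions with 7 clones and 7 mutations), and the Dirichlet distribution is taken over the c clones.

```python
import random


def prufer_to_edges(seq, c):
    """Decode a Prüfer sequence of length c - 2 into the c - 1 edges
    of a labeled tree on vertices 0..c-1."""
    degree = [1] * c
    for v in seq:
        degree[v] += 1
    edges = []
    for v in seq:
        # Attach the smallest remaining leaf to the next sequence entry.
        leaf = min(u for u in range(c) if degree[u] == 1)
        edges.append((leaf, v))
        degree[leaf] -= 1
        degree[v] -= 1
    # Exactly two vertices of degree 1 remain; they form the last edge.
    u, w = [x for x in range(c) if degree[x] == 1]
    edges.append((u, w))
    return edges


def simulate_instance(c, concentration=0.2, min_prev=0.05, rng=None):
    """Return (parent, u, f) for one simulated instance with c clones.

    Assumptions (hypothetical, for illustration only): vertex 0 is the
    root and clone i introduces exactly mutation i.
    """
    rng = rng or random.Random()
    seq = [rng.randrange(c) for _ in range(c - 2)]
    adj = {v: [] for v in range(c)}
    for a, b in prufer_to_edges(seq, c):
        adj[a].append(b)
        adj[b].append(a)

    # Orient the tree away from the root with a breadth-first traversal.
    parent = {0: None}
    order = [0]
    for v in order:
        for w in adj[v]:
            if w not in parent:
                parent[w] = v
                order.append(w)

    # Symmetric Dirichlet draw via normalized gamma variates, rejected
    # until every clonal prevalence is at least min_prev.
    while True:
        g = [rng.gammavariate(concentration, 1.0) for _ in range(c)]
        total = sum(g)
        u = [x / total for x in g]
        if min(u) >= min_prev:
            break

    # f_i is the sum of u_j over the clones j that contain mutation i,
    # i.e. clone i and all of its descendants.
    f = list(u)
    for v in reversed(order[1:]):
        f[parent[v]] += f[v]
    return parent, u, f
```

Calling `simulate_instance(7, rng=random.Random(0))` yields one reproducible instance. With these settings the rejection step may need a few thousand draws, because a concentration of 0.2 favors sparse prevalence vectors.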

To account for common single-cell sequencing errors, we varied false negative rates β ∈ {0, 0.2} and doublet rates δ ∈ {0, 0.1}. We generated, for each simulation instance, 10,000 single cells sampled under the specified false negative rates β and doublet rates δ according to the bulk clonal prevalence u. To account for uncertainty in bulk sequencing, we additionally obtained confidence intervals on the cancer cell fractions of simulation instances sim3a, using a binomial distribution (with confidence α = 0.05) and a mean coverage of 1000x (drawn from a Poisson distribution). Modeling additional uncertainty in bulk sequencing, sim4a instances consist of n = 100 mutations each and use PyClone [36] to cluster the 100 mutations before enumerating 𝒯.

Recall that PhyDOSE has three model assumptions: (i) the ground truth tree T* is among the candidate trees 𝒯, (ii) correspondence between clonal prevalences in bulk and subsequent single-cell sequencing samples, and (iii) the infinite sites assumption for mutations. To assess (i), for simulation conditions ‘b’ (sim1b, sim2b and sim3b), we randomly sampled 10% of the trees output by SPRUCE [9]. To assess (ii), we varied single-cell clonal prevalences from the bulk clonal prevalences by resampling û ~ Dir(λu). We tuned the parameter λ so that the clonal prevalence varied by an absolute average of 5% and 20% from the clones of the ground truth tree T*, which resulted in λ = 2000 and λ = 50, respectively (Figure C in S1 Text). To assess (iii), we introduced violations of the infinite sites assumption (ISA) in the form of mutation losses. Specifically, we introduced one mutation loss in each of the instances of sim1a as follows. First, we randomly picked two distinct mutations (i, j) where i is introduced prior to j in the ground truth tree T*. Then, we designated the descendant mutation j as a loss of mutation i in each candidate tree T ∈ 𝒯.
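The resampling step used to assess assumption (ii) can be sketched as follows. This is a minimal illustration, assuming that a Dirichlet draw with parameter vector λu is obtained by normalizing independent gamma variates. The function names are hypothetical and the deviation measure is one plausible reading of "absolute average", not necessarily the measure used by the authors.

```python
import random


def distort_prevalences(u, lam, rng=None):
    """Draw distorted prevalences from Dirichlet(lam * u).

    Larger lam concentrates the draw around the bulk prevalences u.
    """
    rng = rng or random.Random()
    g = [rng.gammavariate(lam * ui, 1.0) for ui in u]
    total = sum(g)
    return [x / total for x in g]


def mean_abs_deviation(u, lam, draws=200, rng=None):
    """Average absolute per-clone deviation between u and resampled draws."""
    rng = rng or random.Random()
    dev = 0.0
    for _ in range(draws):
        u_hat = distort_prevalences(u, lam, rng)
        dev += sum(abs(a - b) for a, b in zip(u, u_hat)) / len(u)
    return dev / draws
```

Evaluating `mean_abs_deviation` over a grid of λ values is one way to tune λ to a target level of distortion, in the spirit of the tuning described above.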

In total, we generated 100 simulation instances under each of the eight conditions specified in Table 1.

Table 1. Simulation conditions.

We generate simulated data under eight conditions with 100 instances each. These conditions have varying subsets of candidate trees, number of clones, number of mutations per clone, clonal prevalence distortions, false negative and doublet rates. To analyze violations of the infinite sites assumption analysis, we introduced mutation losses in the sim1a instances. To analyze uncertainty in cancer cell fractions, we used the sim3a instances with a coverage of 1000x.

ID     % of Trees  Clones  Mutations  Prevalence Noise  FNR β  Doublet δ
sim1a  100%        7       7          0%                0      0
sim1b  10%         7       7          0%                0      0
sim2a  100%        7       7          5%                0      0
sim2b  10%         7       7          5%                0      0
sim2c  100%        7       7          20%               0      0
sim3a  100%        7       7          5%                0.2    0.1
sim3b  10%         7       7          5%                0.2    0.1
sim4a  100%        10      100        5%                0.2    0.1

The benefit of PhyDOSE’s distinguishing feature analysis

We compared PhyDOSE against SCOPIT [26], an existing method to design SCS experiments which takes as input a confidence level and the prevalence rate of each clone. Since SCOPIT does not connect the clones to be observed with a phylogenetic tree, we ran SCOPIT under two regimes. In the first regime (called ‘SCOPIT’), we input the clonal prevalence rates for each T ∈ 𝒯 and take the maximum SCOPIT output of all trees as an upper bound. In the second regime (called ‘SCOPIT (true clones)’), we supplied SCOPIT with the prevalence rates of the simulated ground truth clones. The comparison to PhyDOSE was conducted at confidence level γ = 0.95.

PhyDOSE yielded a significant reduction in the number of cells to sequence compared to SCOPIT (Fig 4A). This is even the case when we provided the clonal prevalences of the ground truth tree to SCOPIT but not to PhyDOSE. In particular, for sim1a, SCOPIT required a median of 18.7 times as many cells as PhyDOSE, whereas SCOPIT (true clones) required 1.5 times as many cells (Fig 4A). In absolute numbers, PhyDOSE computed a median number of k* = 35 cells compared to 544 cells computed by SCOPIT and 59 cells computed by SCOPIT (true clones) (Figure D in S1 Text).

Fig 4. Simulations demonstrate that PhyDOSE’s calculated number of single cells resolves tree ambiguity in bulk sequencing data.

We used confidence level γ = 0.95 to determine the number k* of single cells to sequence. (A) SCOPIT to PhyDOSE cell ratio on a log scale when considering SCOPIT in a worst case regime where the true phylogeny is unknown and a best case regime where SCOPIT utilizes the clonal prevalence of the clones in the simulated ground truth tree. (B) Recall metrics of the tree inferred by SPhyR [16] by randomly sampling k*/2, k* and 2k* simulated single cells. (C) Number |Π| of featurettes among minimal distinguishing features Φ* when compared between the enumerated candidate set 𝒯 (conditions a) and the downsampled candidate set (conditions b). (D) Number |𝒯| of trees in the candidate set when enumerated by SPRUCE [9] (condition a) and when downsampling the enumerated candidate set (condition b). (E) Number k* of cells identified by PhyDOSE.

To assess the accuracy of PhyDOSE’s k* value, we generated follow-up in silico SCS experiments. Specifically, we ran our approach for prioritizing candidate trees and SPhyR [16] on sampled single cells. For the former, we performed 100 experiments for each simulation instance, reporting the number of experiments that successfully recovered the ground truth tree T*. We counted an experiment as successful if we correctly and uniquely selected the ground truth tree as T*. For the latter, we also considered the performance of SPhyR when sampling half and double the number k* of cells determined by PhyDOSE.

Figure F in S1 Text shows that the prioritization approach worked particularly well for sim1a with a median success rate of 96%. Similarly, SPhyR [16] was able to identify the true tree in the majority of cases after sampling PhyDOSE’s computed number k* of cells. To quantify the similarity between the tree T estimated by SPhyR and the true tree T*, we used two commonly used tree distance metrics, ancestral and incomparable pair recall. Ancestral pair recall is defined as |A(T) ∩ A(T*)|/|A(T*)| where A(T) (A(T*)) is the set of ordered pairs of mutations that occur on distinct edges of the same branch of T (T*). Incomparable pair recall is defined as |I(T) ∩ I(T*)|/|I(T*)| where I(T) (I(T*)) is the set of unordered pairs of mutations that occur on edges in distinct branches in T (T*). For sim1a, the median of both metrics is 1 when sampling k* cells, reflecting that SPhyR identified the true tree in the majority of cases (Fig 4B). Moreover, we found greater gains in performance between sampling k* cells versus k*/2 cells than sampling 2k* cells versus k* cells.
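The two recall metrics defined above can be computed with a few lines of Python. The sketch below is illustrative only: it assumes each tree is given as a dictionary mapping every mutation to its parent mutation (None for the root) with one mutation per edge, and it returns 1.0 by convention when the true tree has no pairs of the given kind. The function names are hypothetical.

```python
from itertools import combinations


def ancestors(parent, v):
    """Set of mutations on the path from v's parent up to the root."""
    result = set()
    while parent[v] is not None:
        v = parent[v]
        result.add(v)
    return result


def pair_sets(parent):
    """Return (ancestral, incomparable) pair sets of a mutation tree."""
    anc = {v: ancestors(parent, v) for v in parent}
    # Ordered pairs (a, v): a occurs above v on the same branch.
    ancestral = {(a, v) for v in parent for a in anc[v]}
    # Unordered pairs on distinct branches: neither is an ancestor of the other.
    incomparable = {
        frozenset((a, b))
        for a, b in combinations(parent, 2)
        if a not in anc[b] and b not in anc[a]
    }
    return ancestral, incomparable


def recall(inferred, truth):
    """Fraction of the true pairs that also appear in the inferred tree."""
    return len(inferred & truth) / len(truth) if truth else 1.0


# True tree: a -> b, a -> c, c -> d. Inferred tree: the chain a -> b -> c -> d.
true_tree = {"a": None, "b": "a", "c": "a", "d": "c"}
inferred_tree = {"a": None, "b": "a", "c": "b", "d": "c"}
A_true, I_true = pair_sets(true_tree)
A_inf, I_inf = pair_sets(inferred_tree)
ancestral_recall = recall(A_inf, A_true)        # 1.0: all true ancestral pairs kept
incomparable_recall = recall(I_inf, I_true)     # 0.0: the chain has no branching
```

The example shows why both metrics are reported: a linear chain recovers every true ancestral pair yet misses all branching events.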

Fig 4C shows the number of clones in each minimal distinguishing feature identified by PhyDOSE, ranging from 1 to 5 with a median of 3 for simulation conditions ‘a’. Importantly, this number is smaller than the total number of 7 clones, demonstrating that a distinguishing feature yields an efficient representation of a tree. This led to a smaller number k* of cells inferred by PhyDOSE compared to SCOPIT without sacrificing performance in the tree reconstruction from the follow-up SCS experiment.
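To see why requiring fewer clones lowers the number of cells, consider a simplified power calculation: the smallest k such that k cells sampled independently according to the clonal prevalences contain at least one cell from every required clone with probability at least γ. The sketch below computes this by inclusion-exclusion. It is a stand-in under stated assumptions, not PhyDOSE's algorithm: it treats a single set of required clones, ignores false negatives and doublets, and uses hypothetical function names.

```python
from itertools import combinations


def prob_all_observed(prevalences, k):
    """Probability that each listed clone is seen at least once in k cells.

    Inclusion-exclusion over the subsets of clones that are missed entirely.
    """
    total = 0.0
    for r in range(len(prevalences) + 1):
        for subset in combinations(prevalences, r):
            total += (-1) ** r * (1.0 - sum(subset)) ** k
    return total


def min_cells(prevalences, gamma=0.95, k_max=100000):
    """Smallest k with prob_all_observed >= gamma, or infinity if none exists."""
    if any(p <= 0 for p in prevalences):
        return float("inf")  # a clone with zero prevalence is never observed
    for k in range(len(prevalences), k_max + 1):
        if prob_all_observed(prevalences, k) >= gamma:
            return k
    return float("inf")


# Observing 2 specific clones needs no more cells than observing all 4.
few = min_cells([0.2, 0.3])
many = min_cells([0.2, 0.3, 0.1, 0.1])
```

Under this simplified model, dropping clones from the required set can only lower the returned k, which is the intuition behind using a small distinguishing feature instead of all clones of a tree.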

Robustness to violations of PhyDOSE’s model assumptions

We used simulations to assess PhyDOSE’s performance when model assumptions are violated. We begin with the case where the ground truth tree T* is not guaranteed to be present among the candidate trees 𝒯. In sim1b, we downsampled the set of candidate trees of the instances in sim1a to 10% (Fig 4D). Similarly to sim1a, PhyDOSE significantly reduced the required number k* of single cells compared to SCOPIT (Fig 4A). We analyzed follow-up in silico SCS experiments using the prioritization approach and SPhyR [16]. Since the prioritization approach only considers candidate trees 𝒯, which may not contain the ground truth tree, it is not surprising that performance dropped to a median success rate of 0%. However, SPhyR was able to identify the ground truth tree in the majority of cases (median of 1 for both incomparable and ancestral pair recall).

Next, we assessed the impact of clonal prevalence distortions between bulk and single-cell data in sim2a, sim2b and sim2c. We found PhyDOSE to be robust to random clonal prevalence noise between bulk and single-cell sequencing, as evidenced by a drop of only 1% in the median percentage of successful in silico SCS experiments when there is no downsampling of trees (Figure C in S1 Text). Additionally, the recall performance metrics (Fig 4B, Figure C in S1 Text) are not substantially different between sim1a versus sim2a and sim2c, showing similar trends when using k*/2 and 2k* cells. We attribute this to the fact that PhyDOSE’s use of distinguishing features relies on the clonal prevalence of a few key clones. Furthermore, PhyDOSE performs well for the case where candidate trees have been downsampled and clonal prevalences have been distorted (sim2b, see Fig 4B).

Finally, we assessed the sensitivity of PhyDOSE to ISA violations by applying it to a candidate set 𝒯 of 1-Dollo phylogenies obtained from sim1a. Specifically, the resulting simulation instances had candidate trees composed of 6 mutations, one of which has undergone a loss. We performed 100 in silico experiments for each of the 100 simulation instances. Like sim1a under the ISA, the median percentage of successful experiments is 95% for the 1-Dollo phylogenies (Figure H in S1 Text). However, 23 simulation instances had a success rate of 0%. This increased variance is to be expected as distinguishing features may no longer be distinguishable from those of other trees when mutation losses occur. Nevertheless, PhyDOSE’s suggested SCS experiments for these 23 instances significantly reduced the number of candidate trees from a median of 161 trees pre-experiment to a median of 5 trees post-experiment (Figure H in S1 Text). In all cases, the ground truth tree was included in the candidate set after performing the SCS experiment. Hence, as distinguishing features provide an efficient representation for each tree using only a subset of clones, PhyDOSE performed well on data with violations of the infinite sites assumption.

Robustness to uncertainty due to sequencing errors

Sequencing errors occur in both the initial bulk sequencing experiment as well as the follow-up single-cell sequencing experiment. We begin by considering common SCS errors using false negative rate β = 0.2 and doublet rate δ = 0.1 in sim3a and sim3b. While the number k* significantly increased compared to the simulations without false negatives (Fig 4E), PhyDOSE’s computed number k* of cells remained an order of magnitude smaller than SCOPIT but not for SCOPIT (true clones) (Fig 4). The latter is due to PhyDOSE’s adjustments of the clonal prevalence when factoring in a false negative rate of β = 0.2 in trees other than T* (Figure E in S1 Text). Tree inference using SPhyR in subsequent SCS experiments based on PhyDOSE’s k* identified the ground truth tree in the majority of cases (median of 1 for both incomparable and ancestral pair recall).

Figure G in S1 Text shows PhyDOSE performance in sim4a instances with mutation clusters inferred by PyClone [36]. We included the clustered pair recall in our analysis, defined as |C(T) ∩ C(T*)|/|C(T*)|, where C(T) (C(T*)) is defined as the set of unordered pairs of mutations that are introduced on the same edge in T (T*). At k* cells, the median ancestral pair recall was 0.96, the incomparable pair recall was 0.86 and the clustered pair recall was 0.94, showing a reduction in performance from the first sim1, sim2 and sim3 simulations due to the additional errors introduced by PyClone [36].

Finally, we evaluated the performance of PhyDOSE in the presence of cancer cell fraction uncertainty by adapting sim3a instances. We simulated variant and total read counts with a coverage of 1000x. Then, we constructed a binomial proportion 95%-confidence interval using the Jeffreys prior interval [37]. Next, we used the mean of this interval to enumerate the candidate set of trees with SPRUCE [10]. Using the method described in Section B.2 in S1 Text, we constructed a confidence interval [k-*, k+*] for each replication (Figure I in S1 Text). We performed 100 in silico experiments for each replication using both k-* and k+*, selecting the tree T ∈ 𝒯 with the maximum support(T) as T*. We obtained a median percentage of successful trials of 86% (IQR: 44%–93%) and 86% (IQR: 31%–98%) when randomly sampling k-* and k+* in silico cells, respectively (Figure I in S1 Text).

Running time

We performed an empirical run time analysis on a server with two Intel Xeon Gold 5120 CPUs @ 2.20GHz and 512 GB RAM. Performing the power calculation of k* is fast [26] when |Φ| = 1, and the median of |Φ| in our simulations was 1. Therefore, the main bottleneck in PhyDOSE is the determination of Φ for each tree in the candidate set using the algorithm presented in Section B.1 in S1 Text. Note that this step is embarrassingly parallel because Φ is computed independently for each T ∈ 𝒯. However, we leave the parallel phydoser implementation as future work. To additionally explore how PhyDOSE scales with |𝒯|, we generated an additional simulation set with 10 mutations and no mutation clusters and set a ten minute time limit on finding the distinguishing features. We calculated the runtime in seconds for each simulation instance when Φ is solved sequentially for each T ∈ 𝒯. The largest input to complete (|𝒯| = 9901) required 619 seconds, with 92% of the time spent on finding the distinguishing features. The results are displayed in Figure J in S1 Text.

In summary, our simulations demonstrate that PhyDOSE’s distinguishing feature analysis results in significantly fewer cells to sequence than SCOPIT [26] without a subsequent loss in power to identify the true phylogeny. Moreover, we find that PhyDOSE is robust to typical sequencing errors in both the bulk and SCS data as well as violations of model assumptions.

Retrospective analysis of an acute lymphoblastic leukemia patient

We considered a cohort of six childhood acute lymphoblastic leukemia (ALL) patients whose blood was sequenced using bulk and targeted single-cell DNA sequencing [23]. The number of sequenced single cells per patient varied between 96 and 150. To validate our approach, we used PhyDOSE to calculate the number k(T*) of cells needed to identify the true phylogeny T* that is consistent with both data types, thereby retrospectively determining whether fewer single cells suffice to determine T*, decreasing the cost of replicate experiments. In addition, we assessed whether the calculated number k(T*) yielded T* using in silico SCS experiments.

Due to the absence of published copy-number aberration information for this dataset, we focused our attention on patient 2 whose single-cell phylogeny adhered to the infinite sites assumption and whose variant allele frequencies suggested the absence of copy-number aberrations (as detailed in Section C in S1 Text). For this patient, 16 autosomal mutations in 115 cells were sequenced [23]. We note that the authors had no knowledge of the number of cells that would suffice to infer the tumor phylogeny of the patient. Using the infinite sites assumption and assuming the absence of copy-number aberrations, we define the cancer cell fraction, or frequency f_i of each mutation i in the bulk data as 2 ⋅ VAF(i). We define the SCS mutation frequency as the fraction of single cells that harbor the mutation. Strikingly, there is a clear correlation between the bulk and SCS mutation frequencies, supporting PhyDOSE’s first assumption (Fig 5A). We excluded mutation CMTM8 because of a notable discrepancy in frequencies (0.4 in bulk vs. 0.2 in SCS). Using SPRUCE [9], we enumerated the set 𝒯 of trees from the bulk data, yielding over 2.5 million trees. This number is mainly driven by 3 mutations (ATRNL1, LINC00052 and TRRAP) with a VAF less than 0.05. Excluding these 3 mutations resulted in a more tractable number of 2,576 trees. We note that in practice we may similarly exclude mutations because of very low VAFs or less importance in downstream analyses. Fig 5B shows the single tree T* ∈ 𝒯 that was consistent with the cleaned single-cell data, supporting PhyDOSE’s second assumption.

Fig 5. Retrospective analysis of ALL patient 2 [23] and AML cohort [27] demonstrates that fewer cells suffice for replication.

Panels (A)-(D) consider ALL patient 2 [23] and panel (E) considers the AML cohort [27]. (A) There is a strong correlation between bulk and single-cell mutation frequencies. Colors indicate mutation clusters from SCS data and excluded mutations are indicated by ‘x’. (B) Phylogeny T* that is consistent with the SCS and bulk data. (C) Percent of successful outcomes in 100 in silico SCS experiments, obtained by sampling from the 115 sequenced cells without replacement following PhyDOSE’s calculated number k(T*) of cells (103 for γ = 0.95 and 50 for γ = 0.75). Exclusive outcomes (yellow) uniquely identified T* whereas tied outcomes (purple) yielded a small set of candidate phylogenies that include T*. (D) Number of candidate phylogenies in the case of ties. (E) The distribution of PhyDOSE’s k* for γ ∈ {0.75, 0.95} of all patients in the AML cohort with |𝒯| > 2 as well as the number of cells that were originally sequenced.

We ran PhyDOSE using varying confidence levels γ ∈ {0.75, 0.95} and an estimated false negative rate of β = 0.2 reported by the authors [23]. PhyDOSE calculated that k(T*) = 103 cells suffice to identify T* with confidence level γ = 0.95. Indeed, performing 100 in silico SCS experiments, by sampling k(T*) cells among the 115 sequenced cells without replacement, yielded a success rate of 99% (Fig 5C).

To reduce costs, we explored what would have happened retrospectively with a lower confidence level γ of 0.75. PhyDOSE calculated that k(T*) = 50 cells are needed for γ = 0.75, which is a significant cost saving over γ = 0.95. Performing 100 in silico SCS experiments yielded a success rate of uniquely identifying T* of 66%, which was lower than the expected rate of 75%. Furthermore, we noted that in an additional 26% of experiments the correct phylogeny T* was among the trees with the highest overall support (Fig 5C). The number of trees in the tied set of successes varied from 2 to 6 (Figure L in S1 Text), showing that although PhyDOSE did not uniquely identify the tree, it was able to significantly reduce the original set of 2,576 trees (Figure L in S1 Text and Figure M in S1 Text).

In summary, this retrospective analysis shows that the true tree for patient 2 could have been identified confidently with fewer cells than the 115 cells initially sequenced [23]. With a lower confidence level γ, PhyDOSE computes that far fewer cells are required, significantly reducing costs but at the expense of a lower success rate of uniquely identifying the true phylogeny. Nevertheless, the resulting SCS experiment will eliminate a large fraction of the original set of candidate phylogenies due to the incorporation of distinguishing features in the PhyDOSE power calculation.

Retrospective analysis of an acute myeloid leukemia cohort

Morita et al. [27] performed high-throughput targeted microfluidic single-cell sequencing using the Tapestri platform [38] on a cohort of 77 patients with acute myeloid leukemia (AML). The authors additionally performed bulk sequencing in order to confirm the presence of a mutation in the single-cell data. We note that the authors restricted their analysis to somatic mutations (SNVs and indels) that did not occur in regions affected by additional copy number aberrations.

Here, we utilized the published bulk sequencing VAFs of the SNVs in each patient, eliminating any mutations not detected via bulk sequencing, to enumerate a set of candidate trees using SPRUCE [9]. We restrict our analysis to the 24 patients where bulk sequencing data was available and SPRUCE identified more than one candidate tree. The median number of mutations for these patients was 4 (IQR: 3–5). We retrospectively used PhyDOSE at confidence levels γ ∈ {0.75, 0.95} to estimate the cells needed to perform an equivalent single-cell experiment. We used false negative rate β = 0.049, which is the mean of the per patient published false negative rate. In the original study, a median of 7,584 cells per patient (IQR: 6,194–8,361) were sequenced. Fig 5E shows the distribution of PhyDOSE k* for the 24 patients (median is 2, IQR: 2–6, max is 316) at γ ∈ {0.75, 0.95} versus the total number of cells sequenced in [27]. For γ = 0.95, the median value of k* was 274 cells (IQR: 230–497). This is a significant reduction from the number of cells sequenced per patient in [27] with a median percent reduction at confidence level γ = 0.95 of 95.4% (IQR: 92.2%–98.0%) (Table A in S1 Text).

For these 24 patients, Morita et al. [27] sequenced 153,558 cells while the PhyDOSE design at confidence level γ = 0.95 requires 8,144 cells (Table A in S1 Text). Using that a Tapestri run of 10,000 cells costs $795 and including additional sequencing costs of $200 per run with NovaSeq or $1000 per run with MiSeq [39, 40], we estimate the costs of the original study as $15,920 and the estimated costs of the PhyDOSE design as $1,995. This assumes the original study utilized 16 Tapestri runs with NovaSeq while PhyDOSE requires 1 Tapestri run with the more expensive MiSeq to avoid multiplexing. Thus, designing the experiment with PhyDOSE would have yielded a 93.75% cost savings. Further, we note that this study design requires targeted sequencing and that the potential cost savings when the experimental design requires whole genome sequencing would further increase.

Prospective analysis of a non-small cell lung cancer cohort

Using PhyDOSE, we prospectively determined the number of cells needed to uniquely identify the true phylogeny for the 25 out of 100 patients in the TRACERx non-small-cell lung cancer cohort that have multiple candidate trees [3]. The authors previously identified the set 𝒯 of candidate trees for each patient using CITUP [11] after clustering mutations with PyClone [36]. The authors also reported the cancer cell fraction of each mutation cluster in each bulk sample. The number of trees in the candidate set for each patient ranged from 2 to 17, with each containing mutation clusters with between 5 and 882 mutations (Table B in S1 Text).

Assuming high confidence in the co-occurrence of mutations in a cluster, mutation clusters alleviate the issue of false negatives, i.e. it suffices to only observe a small number of mutations to impute the presence of the other mutations in the same cluster. Here, with a typical SCS false negative rate of 0.2, the probability of all mutations in the smallest cluster (with size 5) dropping out thus equals 0.2⁵ = 0.00032, a probability that can be neglected. As such, we set β = 0. Unlike in the simulations and the previous real datasets, multiple bulk samples corresponding to distinct spatial locations were available for analysis per patient. In addition to the naive method where we select a single biopsy that minimizes k*, we used the multiple biopsy heuristic to infer numbers k* of cells for each biopsy. For both methods, we used confidence level γ = 0.95.
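The dropout argument above is a one-line calculation. The sketch below reproduces it under the assumption that false negatives strike each mutation in a cluster independently at rate β. The function name is hypothetical.

```python
def cluster_dropout_probability(beta, cluster_size):
    """Probability that every mutation in a cluster is a false negative,
    assuming independent dropout at rate beta per mutation."""
    return beta ** cluster_size


# Smallest cluster in the cohort (5 mutations) at a false negative rate of 0.2.
p_smallest = cluster_dropout_probability(0.2, 5)  # approximately 0.00032
```

Since the probability shrinks geometrically with cluster size, larger clusters are even less likely to be missed entirely, which motivates setting β = 0 for this cohort.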

Following the naive approach, PhyDOSE returned a finite value of k* for 24 out of the 25 patients. The naive approach yielded k* = ∞ for patient CRUK0037 because for each of the 5 biopsies there is a tree for which none of the distinguishing features is observable. That is, the clonal prevalence of one of the comprising featurettes is 0. By contrast, the heuristic calculated a total of 243 cells (R1: 36; R2: 49; R3: 60; R4: 38 and R5: 51 cells) for this patient. For patients CRUK0013 and CRUK0076, the naive approach required the sequencing of more cells from a single biopsy than the multiple biopsy heuristic (CRUK0013: 1,051 vs. 215 cells; CRUK0076: 47,479 vs. 48 cells). For the remaining 22 patients, treating the samples independently yields the same number of cells from the selected biopsy as the heuristic. Table B in S1 Text and Fig 6 provide detailed numbers.

Fig 6. PhyDOSE multiple biopsy heuristic calculated numbers k* of cells per biopsy for the lung cancer cohort [3] at confidence level γ = 0.95.

These strikingly low values of total number of cells for the 25 patients with multiple candidate trees and multiple biopsies demonstrate the benefit of using PhyDOSE to strategically optimize the design of follow-up single cell experiments.

Discussion

In this work, we showed that the mutation frequencies f and the set 𝒯 of tumor phylogenies inferred from initial bulk data contain valuable information to provide guidance for follow-up SCS experiments. We introduced PhyDOSE, a method to calculate the number k* of single cells needed to infer the true phylogeny T* given f, 𝒯 and a user-specified confidence level γ. Underpinning our method is the observation that often only a subset of clones suffices to distinguish one tree T ∈ 𝒯 from the remaining trees 𝒯 \ {T}. Although PhyDOSE is motivated by the output of deconvolution methods for bulk sequencing, it is agnostic to the method used to obtain the candidate set as long as the clonal prevalence rates of the distinguishing features can be estimated. Thus, the input set 𝒯 of candidate trees can be obtained from preliminary single-cell and/or bulk sequencing data. Similarly, PhyDOSE is agnostic to the phylogeny inference method used to analyze data from the proposed SCS experiment. We also provided heuristics for realistic scenarios that arise in practice, such as handling uncertainty in the estimation of cancer cell fractions and the availability of multiple biopsies.

We validated PhyDOSE using simulations and a retrospective analysis of leukemia patients [23, 27], concluding that PhyDOSE’s computed number k* of cells resolves tree ambiguity, even in the presence of SCS errors. Our simulations showed that PhyDOSE remains robust in the presence of sequencing errors and violations of model assumptions, outperforming the competing method, SCOPIT [26]. In a prospective analysis, we demonstrated that only a small number of cells suffice to disambiguate the solution space of trees in a recent non-small cell lung cancer cohort [3]. In summary, PhyDOSE proposes cost-efficient SCS experiments that will yield high-fidelity phylogenies, which may consequently improve downstream analyses in cancer genomics aimed at deepening our understanding of cancer biology.

There are several future research directions. First, in the case of multiple bulk samples, although we propose an exact calculation, we only implement a heuristic since the exact calculation does not scale to realistic problem sizes. Developing an implementation of the exact calculation in the case of multiple samples would yield a further cost reduction in the experimental design since the heuristic overestimates the number of cells at a given confidence level γ. Second, to further reduce SCS costs, we might want to include a mutation selection step as part of our approach to perform targeted rather than whole-genome sequencing. Third, similar ideas can be used to design follow-up sequencing experiments using alternative sequencing technologies such as long read sequencing. Alternatively, performing additional bulk sequencing rather than single-cell sequencing might be more cost-effective, especially when obtaining a bulk sample with distinct clonal prevalences [10, 41]. Fourth, we plan to develop an easy-to-use Shiny user interface to facilitate the use of PhyDOSE for the design of sequencing experiments. Fifth, to improve robustness in the presence of SCS errors, we plan to explore alternative definitions of successful SCS experiment outcomes, requiring that more than one cell is observed for each featurette of a distinguishing feature. This will enable us to address errors such as doublets and false positives in an SCS experiment. Similar ideas can be used to address uncertainty in mutation clusters inferred from bulk sequencing data. Sixth, the concept of distinguishing features may be useful to summarize diverse solution spaces in cancer phylogenetics [42]. Finally, we plan to explore evolutionary models beyond the infinite sites model, such as the Dollo parsimony model where mutations might be lost [16], requiring a more careful approach to find the distinguishing features of a tree.

Supporting information

S1 Text. Supplementary materials.

(PDF)

Data Availability

Simulated and real data are available at https://github.com/elkebir-group/PhyDOSE. Source code and R package are available at https://github.com/elkebir-group/phydoser (under AGPL-3.0 license).

Funding Statement

L.L.W., N.A., N.C. and M.E.K. were supported by UIUC Center for Computational Biotechnology and Genomic Medicine (grant: CSN 1624790). M.E.K. was supported by the National Science Foundation (grant: CCF 1850502). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Nowell PC. The clonal evolution of tumor cell populations. Science. 1976;194(4260):23–8. doi: 10.1126/science.959840
  • 2. McGranahan N, Favero F, de Bruin EC, Birkbak NJ, Szallasi Z, Swanton C. Clonal status of actionable driver events and the timing of mutational processes in cancer evolution. Science Translational Medicine. 2015;7(283):283ra54. doi: 10.1126/scitranslmed.aaa1408
  • 3. Jamal-Hanjani M, Wilson GA, McGranahan N, Birkbak NJ, Watkins TB, Veeriah S, et al. Tracking the evolution of non–small-cell lung cancer. New England Journal of Medicine. 2017;376(22):2109–2121. doi: 10.1056/NEJMoa1616288
  • 4. Zhang AW, McPherson A, Milne K, Kroeger DR, Hamilton PT, Miranda A, et al. Interfaces of Malignant and Immunologic Clonal Dynamics in Ovarian Cancer. Cell. 2018;173(7):1755–1769.e22. doi: 10.1016/j.cell.2018.03.073
  • 5. Łuksza M, Riaz N, Makarov V, Balachandran VP, Hellmann MD, Solovyov A, et al. A neoantigen fitness model predicts tumour response to checkpoint blockade immunotherapy. Nature. 2017;551(7681):517. doi: 10.1038/nature24473
  • 6. Turajlic S, Xu H, Litchfield K, Rowan A, Chambers T, Lopez JI, et al. Tracking Cancer Evolution Reveals Constrained Routes to Metastases: TRACERx Renal. Cell. 2018;173(3):581–594.e12. doi: 10.1016/j.cell.2018.03.057
  • 7. Turajlic S, Xu H, Litchfield K, Rowan A, Horswell S, Chambers T, et al. Deterministic Evolutionary Trajectories Influence Primary Tumor Growth: TRACERx Renal. Cell. 2018;173(3):595–610.e11. doi: 10.1016/j.cell.2018.03.043
  • 8. Deshwar AG, Vembu S, Yung CK, Jang GH, Stein L, Morris Q. PhyloWGS: Reconstructing subclonal composition and evolution from whole-genome sequencing of tumors. Genome Biology. 2015;16(1):35. doi: 10.1186/s13059-015-0602-8
  • 9. El-Kebir M, Oesper L, Acheson-Field H, Raphael BJ. Reconstruction of clonal trees and tumor composition from multi-sample sequencing data. Bioinformatics. 2015;31(12):i62–i70. doi: 10.1093/bioinformatics/btv261
  • 10. El-Kebir M, Satas G, Oesper L, Raphael BJ. Inferring the Mutational History of a Tumor Using Multi-state Perfect Phylogeny Mixtures. Cell Systems. 2016;3(1):43–53. doi: 10.1016/j.cels.2016.07.004
  • 11. Malikic S, McPherson AW, Donmez N, Sahinalp CS. Clonality Inference in Multiple Tumor Samples using Phylogeny. Bioinformatics. 2015. doi: 10.1093/bioinformatics/btv003
  • 12. Yuan K, Sakoparnig T, Markowetz F, Beerenwinkel N. BitPhylogeny: a probabilistic framework for reconstructing intra-tumor phylogenies. Genome Biology. 2015;16(1):36. doi: 10.1186/s13059-015-0592-6
  • 13. Popic V, Salari R, Hajirasouliha I, Kashef-Haghighi D, West RB, Batzoglou S. Fast and scalable inference of multi-sample cancer lineages. Genome Biology. 2015;16(1):91. doi: 10.1186/s13059-015-0647-8
  • 14. Jahn K, Kuipers J, Beerenwinkel N. Tree inference for single-cell data. Genome Biology. 2016;17(1):86. doi: 10.1186/s13059-016-0936-x
  • 15. Ross EM, Markowetz F. OncoNEM: inferring tumor evolution from single-cell sequencing data. Genome Biology. 2016;17(1):69. doi: 10.1186/s13059-016-0929-9
  • 16. El-Kebir M. SPhyR: tumor phylogeny estimation from single-cell sequencing data under loss and error. Bioinformatics. 2018;34(17):i671–i679. doi: 10.1093/bioinformatics/bty589
  • 17. Zafar H, Tzen A, Navin N, Chen K, Nakhleh L. SiFit: inferring tumor trees from single-cell sequencing data under finite-sites models. Genome Biology. 2017;18(1):178. doi: 10.1186/s13059-017-1311-2
  • 18. Navin NE. Cancer genomics: one cell at a time. Genome Biology. 2014;15(8):452. doi: 10.1186/s13059-014-0452-9
  • 19. Malikic S, Jahn K, Kuipers J, Sahinalp SC, Beerenwinkel N. Integrative inference of subclonal tumour evolution from single-cell and bulk sequencing data. Nature Communications. 2019;10(1):1–12. doi: 10.1038/s41467-019-10737-5
  • 20. Malikic S, Mehrabadi FR, Ciccolella S, Rahman MK, Ricketts C, Haghshenas E, et al. PhISCS: a combinatorial approach for subperfect tumor phylogeny reconstruction via integrative use of single-cell and bulk sequencing data. Genome Research. 2019;29(11):1860–1877. doi: 10.1101/gr.234435.118
  • 21. Kuboki Y, Fischer CG, Guthrie VB, Huang W, Yu J, Chianchiano P, et al. Single-cell sequencing defines genetic heterogeneity in pancreatic cancer precursor lesions. The Journal of Pathology. 2019;247(3):347–356. doi: 10.1002/path.5194
  • 22. Leung ML, Davis A, Gao R, Casasent A, Wang Y, Sei E, et al. Single cell DNA sequencing reveals a late-dissemination model in metastatic colorectal cancer. Genome Research. 2017; p. gr.209973.116. doi: 10.1101/gr.209973.116
  • 23. Gawad C, Koh W, Quake SR. Dissecting the clonal origins of childhood acute lymphoblastic leukemia by single-cell genomics. Proceedings of the National Academy of Sciences. 2014;111(50):17947–17952. doi: 10.1073/pnas.1420822111
  • 24. McPherson A, Roth A, Laks E, Masud T, Bashashati A, Zhang AW, et al. Divergent modes of clonal spread and intraperitoneal mixing in high-grade serous ovarian cancer. Nature Genetics. 2016. doi: 10.1038/ng.3573
  • 25. Kim C, Gao R, Sei E, Brandt R, Hartman J, Hatschek T, et al. Chemoresistance evolution in triple-negative breast cancer delineated by single-cell sequencing. Cell. 2018;173(4):879–893. doi: 10.1016/j.cell.2018.03.041
  • 26. Davis A, Gao R, Navin NE. SCOPIT: sample size calculations for single-cell sequencing experiments. BMC Bioinformatics. 2019;20(1):566. doi: 10.1186/s12859-019-3167-9
  • 27. Morita K, Wang F, Jahn K, Kuipers J, Yan Y, Matthews J, et al. Clonal Evolution of Acute Myeloid Leukemia Revealed by High-Throughput Single-Cell Genomics. bioRxiv. 2020.
  • 28. Bolli N, Avet-Loiseau H, Wedge DC, Van Loo P, Alexandrov LB, Martincorena I, et al. Heterogeneity of genomic evolution and mutational profiles in multiple myeloma. Nature Communications. 2014;5. doi: 10.1038/ncomms3997
  • 29. Dentro SC, Wedge DC, Van Loo P. Principles of Reconstructing the Subclonal Architecture of Cancers. Cold Spring Harbor Perspectives in Medicine. 2017;7(8):a026625. doi: 10.1101/cshperspect.a026625
  • 30. Stephens PJ, Tarpey PS, Davies H, Van Loo P, Greenman C, Wedge DC, et al. The landscape of cancer genes and mutational processes in breast cancer. Nature. 2012;486(7403):400–404. doi: 10.1038/nature11017
  • 31. Levin B. A Representation for Multinomial Cumulative Distribution Functions. The Annals of Statistics. 1981;9(5):1123–1126. doi: 10.1214/aos/1176345593
  • 32. Fu Y, Li C, Lu S, Zhou W, Tang F, Xie XS, et al. Uniform and accurate single-cell sequencing based on emulsion whole-genome amplification. Proceedings of the National Academy of Sciences of the United States of America. 2015;112(38):11923–11928. doi: 10.1073/pnas.1513988112
  • 33. Zafar H, Navin N, Chen K, Nakhleh L. SiCloneFit: Bayesian inference of population structure, genotype, and phylogeny of tumor clones from single-cell genome sequencing data. Genome Research. 2019;29(11):1847–1859. doi: 10.1101/gr.243121.118
  • 34. El-Kebir M, Satas G, Raphael BJ. Inferring parsimonious migration histories for metastatic cancers. Nature Genetics. 2018;50(5):718–726. doi: 10.1038/s41588-018-0106-z
  • 35. Prüfer H. Neuer Beweis eines Satzes über Permutationen. Arch Math Phys. 1918;27:742–4.
  • 36. Roth A, Khattra J, Yap D, Wan A, Laks E, Biele J, et al. PyClone: statistical inference of clonal population structure in cancer. Nature Methods. 2014;11(4):396–398. doi: 10.1038/nmeth.2883
  • 37. Brown LD, Cai TT, DasGupta A. Interval estimation for a binomial proportion. Statistical Science. 2001; p. 101–117.
  • 38. Pellegrino M, Sciambi A, Treusch S, Durruthy-Durruthy R, Gokhale K, Jacob J, et al. High-throughput single-cell DNA sequencing of acute myeloid leukemia tumors with droplet microfluidics. Genome Research. 2018;28(9):1345–1352. doi: 10.1101/gr.232272.117
  • 39. DECIBIO INSIGHTS. Mission Bio Launches Single-Cell DNA Analysis Platform at ASHG 2017; 2017. Available from: https://www.decibio.com/2017/10/17/mission-bio-launches-single-cell-dna-analysis-platform-ashg-2017/.
  • 40. Bioinformatics. ASHG 2017: New Single-cell, CRISPR and NGS Products Highlight Lab Technology’s Progress; 2017. Available from: https://bioinfoinc.com/digest/ashg-2017-new-single-cell-crispr-ngs-products-highlight-lab-technologys-progress/.
  • 41. Qi Y, Pradhan D, El-Kebir M. Implications of non-uniqueness in phylogenetic deconvolution of bulk DNA samples of tumors. Algorithms for Molecular Biology. 2019;14(1):23–14. doi: 10.1186/s13015-019-0155-6
  • 42. Aguse N, Qi Y, El-Kebir M. Summarizing the solution space in tumor phylogeny inference by multiple consensus trees. Bioinformatics. 2019;35(14):i408–i416. doi: 10.1093/bioinformatics/btz312
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008240.r001

Decision Letter 0

Florian Markowetz, Niranjan Nagarajan

1 Jun 2020

Dear Dr. El-Kebir,

Thank you very much for submitting your manuscript "PhyDOSE: Design of Follow-up Single-cell Sequencing Experiments of Tumors" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

While the reviewers acknowledge that this work represents a rigorous methodological contribution to this field, they raise concerns about comparisons with other methods (reviewer #3) and practical limitations (reviewer #1, #2) that need to be addressed in a revised manuscript.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Florian Markowetz

Deputy Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: This paper develops a method, PhyDOSE, for designing follow-up single-cell sequencing experiments to resolve tumor phylogenies based on initial bulk sequence analysis. This paper is based on a conference submission that has already been through a round of revision in response to the conference reviewers. Although I was not one of the reviewers of the conference version, I recognize that it is more polished than a typical new submission as a result and that the conference reviewers already raised what would have been some of my concerns about the paper. I therefore read it in the spirit of a paper that has already been through a round of revisions. The general problem is well motivated. Although the cost of single-cell sequencing is dropping rapidly, it remains costly for large studies and there is good reason to use it efficiently. The approach proposed here appears sensible and technically sound, making good use of prior theory of the authors on tumor phylogeny enumeration, adding robust handling for common kinds of data errors in single-cell sequencing, and connecting the theory nicely to a rigorous new probabilistic model for use in power calculations. The method is well backed up by empirical analysis of simulated data and three real cohorts, showing the method to be effective and to improve substantially on naïve approaches in at least some cases. Some natural concerns about the approach in the conference version, such as robustness to biased sampling or to failing to identify some clones, were raised by the conference reviewers and effectively rebutted by new experiments. The method does still make some restrictive assumptions and limitations, as was pointed out by the prior reviewers. While these are not fully resolved even in the revision, the authors do have fair responses for them.

Given this, I have little to add in the way of criticism and would consider my remaining points largely discretionary. My only real substantive concern is that some of the limitations of the method raised in the conference reviews are still limitations and one might question whether it is sufficient in some cases to note them and defer them to future work. I refer here essentially to the points raised in the final paragraph of Discussion.

In that regard, the use of the infinite sites model is questionable enough that one could argue it needs to be at least demonstrated that the method is reasonably robust to violations. While many methods in this space use the infinite sites assumption, it is well established that it is not consistently accurate and is at least becoming more accepted that methods must handle some violations. I think it is fine to defer to future work extension to a more robust model like Dollo parsimony, which would understandably require some significant changes to the theory and algorithms, so long as the method works reasonably well on data that violates the assumption without that.

The paper also considers a criticism about the use of single rather than multiple bulk samples and defers that question to future work. There is good evidence that phylogeny inference from single bulk samples is simply not accurate enough to be the basis for even the initial step of a combined bulk and single-cell study, and so one might reasonably argue that accommodating multiple bulk samples is so important that it should be part of even a first method of this class.

I will also just raise as a discretionary thought some other possible scenarios where I could imagine this method being useful. I wonder if the method could be applied if there has already been some bulk and some single-cell sequencing done, as in some studies to date, and we want to plan further single-cell sequencing. Would the method be adaptable to such a case? Or could it do better if we assume multiple batches of single-cell sequencing, with an opportunity to reevaluate after each batch? I can accept that these are getting far enough afield that they do not need to be solved in this paper, but might also be questions for future work.

Reviewer #2: This paper discusses PhyDOSE, a method to perform power calculations for single-cell sequencing when we need to disentangle the clone tree associated with a tumour sample. The idea is that, while we often perform a bulk sequencing experiment to assess a number of possible trees that fit the variant allele frequencies (VAFs), it often happens that more than one tree is equally likely to fit the data. If we can generate single-cell sequencing data for a number X of single cells, then we can disambiguate which tree best fits the data. PhyDOSE is a method that tells us what the value of X should be to bound the probability that we can determine a unique best tree.

The paper is clear, and the problem is known in the field. There is abundant literature explaining/showing that determining a single tree from bulk data can be challenging; therefore the solution of sequencing single cells can be appealing, as much as other approaches. The ILP formulation of the problem seems to be correct, and the results and methods are consistent with the theory.

However, there are some major limitations of the work in the current form. I think fixing them would make the main message (a computational design technology) appealing.

- I do not think that you can prescind from the fact that many datasets collect multiple tumour bulks at once, as you also note in your Discussion. This requires a multivariate problem definition, in principle, that you need at least to discuss. There are multi-region sequencing simulators that you can use in this respect, if you want to try to simulate data. The current work presents instead an independence assumption, and uses that in the current analyses (sect. "Prospective Analysis of a Non-small Cell Lung Cancer Cohort"). The choice of PhyDOSE is to minimise the number of cell estimates across all samples; is this supported by some consideration?

- [related to the above] What do the authors mean by saying that "Mutation clusters alleviate the issue of false negatives, i.e. it suffices to only observe a single mutation to impute the presence of the other mutations in the same cluster"? Imputation can be tricky; if I observe a low-VAF mutation in 3 out of 4 biopsies, I think that the imputation should depend on the coverage at the locus. If high enough, imputation can be supported by a statistical argument based on Binomial testing on read counts (what are the odds of not seeing a mutation with a certain VAF at my current coverage). If low, imputation might generate false positives. Is it possible to frame this uncertainty in PhyDOSE's computation of the optimal number of cells for these scenarios?

- You present some reductions in cell numbers that are not exceptionally striking. Can you justify a difference in sequencing cost for the effort of using your design method? At the end of the day, if one does not save a substantial amount of sequencing costs, why would he/she bother using PhyDOSE? I think you need to provide stronger evidence of why your computations can be important for a molecular biologist who is designing a new experiment. If the reduction is not substantial, I think that your contribution would be just theoretical and could be less appealing. In the context of sequencing technologies, you should put effort into understanding the cost for standard experimental setups (e.g., I presume you would be using either a deep sequencing panel, or a digital-PCR assay) and their possible parametrisation. On your real data you can effectively discuss these reductions (assuming certain costs since you did not generate the data).

- It is increasingly evident that a number of "clusters" identified through standard VAF deconvolution methods can represent random ancestors constituted by neutrally evolving mutations (https://doi.org/10.1101/586560). Clustering tail mutations is also wrong because tail lineages are polyphyletic. You should discuss this when you consider the problem of using certain subsets of mutations. Some of the input clusters should be removed from the clone trees, and you could discuss what happens if you end up taking cells from those clusters to design your experiment. This is important because many of your inconsistencies in assembling a bulk clone tree stem from low-frequency mutations, but the low-VAF spectrum is where most of the neutral mutations reside; if those are removed, how often do you remain with a non-identifiable tree?

Reviewer #3: The authors report on a new method, PhyDOSE, for determining the number of cells to sequence in a single-cell sequencing experiment based on information from bulk data. The bulk data is first used to estimate the mutation frequencies and then this information is used to estimate the number of cells. The authors state that their method improves upon SCOPIT, since the latter assumes knowledge about the number of clones and the frequency of the smallest clone. The authors study the performance of their method on simulated and empirical datasets.

I would like the authors to address the following questions:

1. How does the reliability of the mutations called from the bulk data affect the performance of the method? What if some/many of those mutations were wrong?

2. Why not compare to SCOPIT? After all, there are methods for estimating clonality from bulk data. Why not run such a method, get the number of clones and frequencies, and use those as inputs to SCOPIT? I think it's very important to do this comparison.

3. I think the model of evolution must be incorporated into the problem formulation, as the number of cells and mutations needed depends on whether the infinite-sites assumption holds or not.

My main issue with this method (which applies to SCOPIT as well, I feel, even though I don't know the details of how SCOPIT works) is that the number of cells to sequence is not the only/main quantity of interest in an SCS experiment. The number of spatial regions to sample and sequence in order to capture the heterogeneity is as important, and that number must be a lower bound on the number of cells to sequence. So, I'm not sure how useful these methods will be in practice. Yes, SCOPIT has been published for a year only, but it still has no real citations (the two citations it has are this article and one that develops a simpler method for scRNA data). As scDNAseq becomes even less expensive, I doubt the number of cells is the bottleneck; it's the spatial regions to sample and sequence (indeed, some recent studies, mainly focused on CNA detection, are now sequencing thousands of single cells).

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions, please see http://journals.plos.org/compbiol/s/submission-guidelines#loc-materials-and-methods

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008240.r003

Decision Letter 1

Florian Markowetz, Niranjan Nagarajan

12 Aug 2020

Dear Dr. El-Kebir,

We are pleased to inform you that your manuscript 'PhyDOSE: Design of Follow-up Single-cell Sequencing Experiments of Tumors' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Niranjan Nagarajan

Associate Editor

PLOS Computational Biology

Florian Markowetz

Deputy Editor

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The revisions have satisfied all of my concerns about the paper. The authors have made a number of significant improvements to address the original criticisms, including extending the theory and code to multiple samples and uncertainty in the inferences, and empirically demonstrating robustness to data errors and violations of the infinite sites model, along with other more minor changes. The new material is substantial and very responsive to the critiques. I do not see any new problems introduced by the new material or have any other issues to raise. As before, I consider this an innovative and technically rigorous contribution to important current problems in cancer genomics that should be of interest to many working in computational biology or cancer research.

Reviewer #2: The authors presented an extended version with a new technical improvement to handle multiple biopsies of the same tumour. The formulation of the new problems follows from the single-sample ILP one. A new heuristic is proposed to solve some algorithmic complexity issues in some cases, but in general this is reasonable given the problem complexity.

This improvement addresses a point that was also shared by another reviewer, and I feel it has been addressed properly.

I also asked the authors to motivate in practical terms the advantage of using this tool, showing the drop in costs obtained by doing a proper experimental design with it. The authors have motivated this for one case study that uses Tapestri; I have never used Tapestri so I cannot confirm the reported costs, but the advantage is evident and I think this might be an important motivation to use this approach to design cancer evolution assays that need to find the exact tumour clone tree.

Reviewer #3: The authors have done an excellent job responding to all comments, revising the writing, and running more experiments that I believe have strengthened the paper significantly.

I'm satisfied with it.

Very minor comments: What are A and B in panels (c) and (d) of Fig. 4? This should be described in the caption.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: None

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008240.r004

Acceptance letter

Florian Markowetz, Niranjan Nagarajan

23 Sep 2020

PCOMPBIOL-D-20-00693R1

PhyDOSE: Design of Follow-up Single-cell Sequencing Experiments of Tumors

Dear Dr El-Kebir,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Laura Mallard

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol


    Articles from PLoS Computational Biology are provided here courtesy of PLOS