DeCiFering the Elusive Cancer Cell Fraction in Tumor Heterogeneity and Evolution

Gryte Satas; Simone Zaccaria; Mohammed El-Kebir; Benjamin J Raphael

doi:10.1016/j.cels.2021.07.006

Author manuscript; available in PMC: 2022 Oct 20.

Published in final edited form as: Cell Syst. 2021 Aug 19;12(10):1004–1018.e10. doi: 10.1016/j.cels.2021.07.006


PMCID: PMC8542635 NIHMSID: NIHMS1729957 PMID: 34416171

Abstract

The cancer cell fraction (CCF), or proportion of cancerous cells in a tumor containing a single-nucleotide variant (SNV), is a fundamental statistic used to quantify tumor heterogeneity and evolution. Existing CCF estimation methods from bulk DNA sequencing data assume that every cell with an SNV contains the same number of copies of the SNV. This assumption is unrealistic in tumors with copy-number aberrations that alter SNV multiplicities. Furthermore, the CCF does not account for SNV losses due to copy-number aberrations, confounding downstream phylogenetic analyses. We introduce DeCiFer, an algorithm that overcomes these limitations by clustering SNVs using a novel statistic, the descendant cell fraction (DCF). The DCF quantifies both the prevalence of an SNV at the present time and its past evolutionary history using an evolutionary model that allows mutation losses. We show that DeCiFer yields more parsimonious reconstructions of tumor evolution than previously reported for 49 prostate cancer samples.

eTOC blurb

Analyses of tumor heterogeneity and evolution from bulk DNA sequencing data typically rely on the estimation of the cancer cell fraction (CCF) of a somatic single-nucleotide variant, defined as the fraction of cancer cells that contain the mutation. Estimation of the CCF is complicated for variants that overlap copy number aberrations, and current approaches make overly simplistic assumptions about the interactions between single-nucleotide variants and copy number aberrations. We introduce the descendant cell fraction (DCF), a novel statistic that accounts for loss of mutations during tumor evolution, and derive the DeCiFer algorithm that jointly estimates DCFs and clusters mutations using a phylogenetic model.

Introduction

Cancer arises from an evolutionary process during which somatic mutations accumulate in the genome of different cells, yielding a heterogeneous tumor composed of different subpopulations of cells, or clones, that have distinct complements of mutations (Nowell 1976). Quantifying the heterogeneity within a tumor is essential for understanding carcinogenesis and devising personalized treatment strategies (Burrell et al. 2013, Andor et al. 2015, McGranahan & Swanton 2017). While recent single-cell DNA sequencing technologies enable high-resolution measurements of tumor heterogeneity (Navin 2015, Gawad et al. 2016, Kim et al. 2018, Gao et al. 2016, Laks et al. 2019, Myers et al. 2020, Zaccaria & Raphael 2021), the vast majority of cancer studies in research and clinical settings (The et al. 2020, Jamal-Hanjani et al. 2017, Dewey et al. 2014) rely on DNA sequencing of bulk tumor samples, where an individual sample comprises a mixture of thousands of different tumor cells. To quantify tumor heterogeneity using bulk sequencing data, most cancer sequencing studies analyze somatic single-nucleotide variants (SNVs) as these mutations are ubiquitous in cancer. The fundamental quantity used to quantify tumor heterogeneity from SNVs is the cancer cell fraction (CCF) – also known as the cellular prevalence or the mutation cellularity – which is the proportion of cancer cells that contain the SNV. CCFs form the basis for many cancer analyses, including: studying tumor heterogeneity (Jamal-Hanjani et al. 2017, Rajput et al. 2017, Dentro et al. 2020), reconstructing clonal evolution and metastatic progression (Gundem et al. 2015, Brastianos et al. 2015, Jamal-Hanjani et al. 2017, Cresswell et al. 2020), identifying selection (Williams et al. 2016, Lakatos et al. 2020, Bozic et al. 2010), and analyzing changes in mutational processes over time (Rubanova et al. 2020, Harrigan et al. 2020, Christensen et al. 2020). 
In these and other studies, the underlying assumption is that groups of SNVs with the same CCF are likely to be present in the same cancer cells and thus occurred on the same branch of the phylogenetic tree that describes the evolution of the tumor (Figure 1b).

Figure 1: a, DNA sequencing of a bulk tumor sample yields two signals of somatic aberrations: (Top) at a position containing an SNV (green triangle), the number of total and variant sequencing reads provide an estimate of the variant allele frequency (VAF) of the SNV; (Bottom) the read depth (and B-allele frequency – not pictured) across genomic regions reveal allele-specific copy numbers (*x, y*) (black and gray lines) and their proportions μ_{(x, y)}. Bulk sequencing data does not measure the number of copies of an SNV in each cell/clone; these unknown mutation multiplicities complicate the estimation of the cancer cell fraction (CCF) of an SNV. b, Estimating the CCF of an SNV requires knowledge of the genotype at a locus defined as a triple (*x, y, m*) of integers indicating the allele-specific copy number at the locus and the mutation multiplicity m of the SNV. CCFs quantify heterogeneity within a tumor at the present time and are related to the past evolutionary history of a tumor. SNVs present in the same cells have the same CCF. Under some mild evolutionary assumptions the converse is also true and thus researchers typically group SNVs with similar CCFs. c, Nearly all existing methods to estimate CCFs rely on the Constant Mutation Multiplicity (CMM) assumption to constrain the possible genotypes at a locus. The constant mutation multiplicity assumption yields sets of genotypes with the property that all genotypes that contain the mutation have the same number M of copies of the mutation (highlighted in green). However, the constant mutation multiplicity assumption may admit sets of genotypes that are produced by unlikely evolutionary events such as homoplasy, and may also exclude other reasonable sets of genotypes, including copy-neutral loss of heterozygosity (right).
d, We propose a less restrictive assumption on genotypes, the single split copy number assumption, as well as evolutionary constraints to enumerate potential biologically plausible sets of genotypes from a genotype tree. e, We introduce the Descendant Cell Fraction (DCF), a novel statistic that summarizes both the prevalence of the SNV and its evolutionary history. A pair of SNVs (light and dark green triangles) occurring on the same branch of a phylogenetic tree may have different CCF values if one of the SNVs (dark green) is later lost due to a copy-neutral loss of heterozygosity. This same pair of SNVs will have the same DCF value. Thus, clustering SNVs with similar DCFs groups mutations under an evolutionary model that allows mutation losses.

Importantly, the CCF of an SNV is not directly measured in bulk DNA sequencing data of a tumor sample. Rather, the CCF must be inferred from the DNA sequencing reads that align to the locus containing the SNV. Specifically, the CCF is calculated from the variant allele frequency (VAF), or proportion of copies of the locus in the sample that contain the SNV. The VAF in turn is estimated as the proportion of variant reads at the SNV locus (Figure 1a). For a heterozygous SNV in a diploid genomic region, the CCF is twice the VAF scaled by the proportion of cancer cells in the tumor sample, or tumor purity. However, somatic copy-number aberrations, including loss-of-heterozygosity events, that overlap at an SNV locus can alter the number of copies of the SNV in a cell, and these events substantially complicate the estimation of the CCF.

Calculation of CCF from the VAF requires knowledge of the mutation multiplicity, or the number of copies of the SNV in a cell or clone. Copy-number aberrations and loss-of-heterozygosity events alter the mutation multiplicities of SNVs. While copy-number aberrations and loss-of-heterozygosity events can be estimated from bulk sequencing data (Van Loo et al. 2010, Carter et al. 2012, Nik-Zainal et al. 2012, Oesper et al. 2013, 2014, Fischer et al. 2014, Zaccaria & Raphael 2020), knowledge of the copy numbers at a locus – even allele-specific copy numbers and subclonal copy-number proportions – is insufficient to determine mutation multiplicities. Indeed, there are often multiple possible values for the unobserved mutation multiplicities that provide equally plausible explanations for the observed read counts and copy numbers at an SNV locus. In statistical terms, the CCF is not identifiable from DNA sequencing data (Figure 1a). Since copy-number aberrations that amplify or delete large genomic segments, chromosomal arms, and even the whole genome (Zack et al. 2013, McPherson et al. 2016, Gerstung et al. 2020, Watkins et al. 2020) are frequent in cancer – particularly in solid tumors where up to ∼90% of tumors (Weaver & Cleveland 2006) may contain copy-number aberrations – it is imperative to have robust methods to calculate CCFs from bulk sequencing data.

Multiple computational methods have been developed in recent years to estimate CCFs in bulk sequencing data. These methods can be categorized into two different strategies. The first strategy is to severely restrict the possible mutation multiplicities of SNVs; specifically, many methods (Roth et al. 2014, Miller et al. 2014, McGranahan et al. 2015, Dentro et al. 2017, Jamal-Hanjani et al. 2017, Yuan et al. 2018, Cun et al. 2018, Dentro et al. 2020, Xiao et al. 2020, Gerstung et al. 2020, Lakatos et al. 2020, Tarabichi et al. 2021) assume that all cells harboring an SNV have the same mutation multiplicity. We refer to this assumption as the constant mutation multiplicity assumption (Figure 1c). Many methods rely on the constant mutation multiplicity assumption, as well as additional heuristics, to select a single value of the CCF for each SNV. These methods either calculate the CCF for each SNV separately (McGranahan et al. 2015, Dentro et al. 2017, Jamal-Hanjani et al. 2017, Cun et al. 2018, Dentro et al. 2020, Gerstung et al. 2020) or simultaneously infer CCFs and cluster SNVs across one or multiple samples, as done by PyClone (Roth et al. 2014), Ccube (Yuan et al. 2018), and others (Miller et al. 2014, Yuan et al. 2015, Caravagna et al. 2020, Xiao et al. 2020). While the constant mutation multiplicity assumption reduces the ambiguity in the calculation of the CCF, the assumption alone is insufficient to fully resolve such ambiguity. Moreover, the heuristics used to select the mutation multiplicity of an SNV – e.g., rounding the estimated average mutation multiplicity to the nearest integer (Dentro et al. 2017) – may introduce unexpected biases into the resulting analyses. More importantly, the constant mutation multiplicity assumption is often violated in practice.
For example, an SNV occurring before an amplification may result in cancer cells with different mutation multiplicities: a group of cells without the amplification harboring a single copy of the SNV, and another group of cells with the amplification and multiple copies of the SNV. Scenarios such as this are frequent in solid tumors that often have subclonal copy-number aberrations (Carter et al. 2012, Zack et al. 2013, The et al. 2020, Watkins et al. 2020). Thus, the constant mutation multiplicity assumption is both too restrictive to model many real tumors and also too weak to overcome the issue of non-identifiability.

The second strategy to estimate CCFs is a phylogenetic approach using an evolutionary model that includes both SNVs and copy-number aberrations. Methods that use this strategy include PhyloWGS (Deshwar et al. 2015), SPRUCE (El-Kebir et al. 2016) and Canopy (Jiang et al. 2016). The evolutionary models employed in these methods do not make the constant mutation multiplicity assumption and thus allow more realistic scenarios such as mutation losses. However, this flexibility comes at a cost of computational efficiency: none of the current methods scale to the large numbers of SNVs measured in current cancer sequencing studies, and these methods may take days or weeks to process samples with a moderate number of mutations (≈1000). To address scalability, these methods group mutations into clusters where all mutations in a cluster are assumed to occur on the same branch of the phylogenetic tree describing the evolution of the tumor. Specifically, PhyloWGS (Deshwar et al. 2015) simultaneously clusters mutations and reconstructs a phylogeny, while Canopy (Jiang et al. 2016) and SPRUCE (El-Kebir et al. 2016) require mutation clusters in input. However, this pre-clustering approach is difficult because the CCFs of the mutations are not known in advance. If one clusters mutations using CCFs derived under the constant mutation multiplicity assumption then the restriction on mutation multiplicities reduces or eliminates the advantage of the phylogenetic approach. Because of these limitations, phylogenetic methods are not as widely used as methods that rely on the constant mutation multiplicity assumption.

In addition to the difficulties in estimating the CCF, there is another important limitation of the CCF itself: in many cases, the CCF is not the correct quantity to use for phylogenetic analysis. Specifically, the CCF measures only the prevalence of a mutation in the tumor at the present time and does not necessarily provide complete information about the past history of the mutation. Two mutations that occurred during the same cell division may have very different CCF values if one of these mutations is later lost due to a deletion (Figure 1e) (Tarabichi et al. 2021). In this case, one mutation may have a high CCF – suggesting that the mutation occurred early in the evolution of the tumor – while the other mutation has a low CCF value, misleadingly suggesting that the mutation occurred late during evolution. In fact, mutation losses are common in cancers that contain many copy-number aberrations (McPherson et al. 2016, Satas et al. 2020, El-Kebir 2018). Jamal-Hanjani et al. (2017) described this issue in the TRACERx sequencing study of non-small-cell lung cancer patients and proposed the “phyloCCF”, an ad hoc correction of the inferred CCF for SNVs in genomic regions affected by subclonal deletions. However, the phyloCCF still relies on the constant mutation multiplicity assumption and thus models only some of the effects of copy-number aberrations on CCFs; moreover, no standalone implementation of the method is currently available.

In this paper, we propose a new approach to analyze tumor heterogeneity and evolution in tumors that contain both SNVs and copy-number aberrations, addressing both limitations in current approaches to estimate CCF and limitations in the CCF quantity itself for phylogenetic analysis. We first show how to compute the CCF under the single split copy number assumption (Figure 1d), an assumption that relies on standard evolutionary models for SNVs and copy-number aberrations and is more realistic than the constant mutation multiplicity assumption. We then introduce a novel statistic, the descendant cell fraction (DCF) (Figure 1e), that generalizes the CCF to account for mutation losses. The DCF provides an elegant mapping between the quantities measured in bulk DNA sequencing data – copy-number aberrations and read counts of SNVs – and the evolutionary history of SNVs. Specifically, SNVs that co-occur on the same branch of the phylogenetic tree will have the same DCF. We utilize this mapping to derive a probabilistic model to estimate the CCF or DCF while accounting for uncertainties in DNA sequencing data. Finally, we address the issue of non-identifiability in the CCF and DCF by sharing information across multiple SNVs and samples. We implement our approach in an algorithm named DeCiFer (Figure 2). DeCiFer can be viewed as an intermediate between scalable approaches that compute CCFs using the restrictive constant mutation multiplicity assumption without an evolutionary model and phylogenetic approaches that simultaneously model the evolution of all SNVs and copy-number aberrations but do not scale to large numbers of mutations. DeCiFer combines the advantages of both approaches (e.g., clustering of mutations and joint evolution of SNVs and copy-number aberrations) while avoiding some of their major limitations (e.g., constant mutation multiplicity assumption and scalability). We show that DeCiFer outperforms existing methods on simulated data.
Finally, we use DeCiFer to analyze 49 metastatic prostate cancer samples (Gundem et al. 2015), and we show that DeCiFer infers DCFs that result in more realistic and more parsimonious evolutionary histories for these tumors compared to existing approaches.

Figure 2: (Left) Inputs to DeCiFer are variant a and total t read counts for SNVs identified in one or more tumor samples as well as the inferred copy numbers (*x, y*) and proportions μ_{(x, y)} for each SNV locus. (Middle) DeCiFer examines possible genotype trees for each SNV conforming to the single split copy-number (SSCN) assumption and evolutionary constraints. Each genotype tree defines a transformation between the input data and a DCF value d in each sample. DeCiFer solves the probabilistic mutation clustering and genotype selection problem, simultaneously selecting a genotype tree for each SNV and clustering the DCF values determined by the selected genotype trees for all SNVs. DeCiFer uses a coordinate ascent approach, alternating between optimization of the DCF for each cluster given genotypes and cluster assignments followed by the selection of genotypes and cluster assignments given the DCF of each cluster. (Right) Clusters output by DeCiFer often contain groups of SNVs with distinct VAFs (green points) but similar DCF values. These are a result of differing mutation multiplicities at SNV loci as well as loss of mutations in subpopulations of tumor cells.

Results

The Cancer Cell Fraction: Constant Mutation Multiplicity Assumption

The cancer cell fraction (CCF) c of an SNV is defined as the fraction of cancer cells in a sample that contain at least one copy of the SNV. The CCF is not directly observed from bulk sequencing data; rather, one observes the total number t of reads that align to the SNV locus and the corresponding number a of reads with the variant allele (Figure 1a). If the SNV locus is diploid (i.e., no copy-number aberrations), the standard approach (Roth et al. 2014, Miller et al. 2014, McGranahan et al. 2015, Dentro et al. 2017, Jamal-Hanjani et al. 2017, Yuan et al. 2018, Cun et al. 2018, Dentro et al. 2020, Xiao et al. 2020, Gerstung et al. 2020, Lakatos et al. 2020, Caravagna et al. 2020, Cresswell et al. 2020, Tarabichi et al. 2021) estimates the CCF c from the fraction $\hat{v} = a / t$ of variant reads as $c \approx \frac{1}{ρ} 2 \hat{v}$, where ρ is the tumor purity – i.e., fraction of cancer cells in the sample – which also may be inferred from bulk sequencing data (Van Loo et al. 2010, Carter et al. 2012, Nik-Zainal et al. 2012, Oesper et al. 2013, 2014, Fischer et al. 2014, Zaccaria & Raphael 2020). Note that $\hat{v}$ is the maximum likelihood estimate of the variant allele frequency (VAF) v – i.e., the proportion of copies of the locus in the sample that contain the SNV. More generally, if the SNV locus is aneuploid due to copy-number aberrations, nearly all existing methods (Roth et al. 2014, Miller et al. 2014, McGranahan et al. 2015, Dentro et al. 2017, Jamal-Hanjani et al. 2017, Yuan et al. 2018, Cun et al. 2018, Dentro et al. 2020, Xiao et al. 2020, Gerstung et al. 2020, Lakatos et al. 2020, Tarabichi et al. 2021) estimate the CCF by using the following generalization of the equation for the diploid case:

$$c \approx \frac{1}{ρ} \left(\frac{ρ N_{tot} + 2 (1 - ρ)}{M}\right) \hat{v}, \tag{1}$$

where N_tot is the average total copy number in cancer cells and M is the number of copies of the SNV in cancer cells that contain the SNV.
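As a minimal sketch, Eq. (1) can be evaluated directly from read counts. The function name `ccf_cmm` and the example numbers are illustrative assumptions and are not part of DeCiFer; the sketch also uses the point estimate $\hat{v} = a / t$ and ignores sequencing uncertainty.

```python
def ccf_cmm(a, t, purity, n_tot, mult):
    """Eq. (1): CCF estimate under the constant mutation multiplicity assumption.

    a, t    : variant and total read counts at the SNV locus
    purity  : tumor purity rho
    n_tot   : average total copy number in cancer cells (N_tot)
    mult    : assumed mutation multiplicity M (hypothetical input, chosen by the user)
    """
    vaf_hat = a / t                                    # point estimate of the VAF
    frac_cn = purity * n_tot + 2.0 * (1.0 - purity)    # rho * N_tot + 2 * (1 - rho)
    return vaf_hat * frac_cn / (purity * mult)


# Diploid, heterozygous SNV: N_tot = 2 and M = 1, so Eq. (1) reduces to 2 * vaf / rho.
diploid_ccf = ccf_cmm(a=30, t=100, purity=0.8, n_tot=2, mult=1)
print(diploid_ccf)
```

With these illustrative inputs the estimate is approximately 0.75, which agrees with the diploid formula $c \approx \frac{1}{ρ} 2 \hat{v}$.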

While Eq. (1) has become the standard in the field, this equation incorporates strong assumptions about tumor clonal composition and evolution. To describe these assumptions, we define the genotype of an SNV locus in a cell as a triple (x, y, m) of non-negative integers, where the copy numbers (x, y) correspond to the number of maternal and paternal copies of the locus and the mutation multiplicity m ≤ x + y is the number of copies with the SNV. The CCF c is then the fraction of cancer cells that have genotypes (x, y, m) with m ≥ 1. Thus, we see that Eq. (1) assumes that m = M is fixed across all the cancer cells that contain the SNV (Figure 1c), which we state formally as follows.

Constant Mutation Multiplicity Assumption.

At every SNV locus, there exists an integer M ≥ 1 such that all genotypes at the locus have the form (x, y, m) where either m = 0 or m = M.

The Cancer Cell Fraction: Single Split Copy Number Assumption

The constant mutation multiplicity assumption severely limits the genotypes at a locus and is often violated in tumors with copy-number aberrations, as we will demonstrate in Results. Here, we define a less restrictive assumption on genotypes, the single split copy number assumption, that facilitates the computation of CCFs from bulk sequencing data under commonly used evolutionary models. Formally, we define a genotype set Γ as the set of genotypes at an SNV locus. Each genotype (x, y, m) in Γ has a corresponding genotype proportion g_{(x, y, m)} that gives the prevalence of the genotype in the sample. Let $g = {[g_{(x, y, m)}]}_{(x, y, m) \in Γ}$ denote the genotype proportions for genotypes in Γ, and note that the genotype proportions satisfy $g_{(x, y, m)} \geq 0$ and $\sum_{(x, y, m) \in Γ} g_{(x, y, m)} = 1$. Given tumor purity ρ and a pair (Γ, g) of a genotype set and genotype proportions, the CCF c is uniquely determined by the following equation:

$$c = \frac{1}{ρ} \sum_{(x, y, m) \in Γ_{CCF}} g_{(x, y, m)}, \tag{2}$$

where $Γ_{CCF} = \{(x, y, m) \in Γ \mid m \geq 1\} \subseteq Γ$ is the set of genotypes that contain the SNV.

Unfortunately, Eq. (2) is not directly applicable to bulk DNA sequencing data because neither the genotype set Γ nor the genotype proportions g are directly measured in bulk data. Rather, from the aligned sequencing reads, one can estimate the VAF v of an SNV as well as the proportions μ_{(x, y)} of cells with copy number (x, y) at the locus. The copy-number proportions $μ = [μ_{(x, y)}]$ may be inferred using current tools for copy number deconvolution (Van Loo et al. 2010, Carter et al. 2012, Nik-Zainal et al. 2012, Oesper et al. 2013, 2014, Fischer et al. 2014, Zaccaria & Raphael 2020). The key question is: what genotypes and genotype proportions are consistent with the estimated copy number proportions μ and VAF v? The relationship between these quantities is given by the following equations.

$$μ_{(x, y)} = \sum_{(x, y, m) \in Γ} g_{(x, y, m)} \quad \text{for all copy numbers } (x, y). \tag{3}$$

$$v = \frac{1}{F} \sum_{(x, y, m) \in Γ} m \cdot g_{(x, y, m)}, \tag{4}$$

where F is the fractional copy number defined as $\sum_{(x, y)} (x + y) \cdot μ_{(x, y)}$. Note that F is the average copy number over all cells, including both cancer and normal cells; in contrast, N_tot in Eq. (1) is the average copy number in cancer cells only. Thus, we have that $F = ρ N_{tot} + 2 (1 - ρ)$ for SNVs in autosomal chromosomes.
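The forward direction of Eqs. (2)–(4) is easy to sketch: given hypothetical genotype proportions g (which are not observable in bulk data), one can compute the CCF, the copy-number proportions μ, the fractional copy number F, and the VAF v. The dictionary encoding and function names below are assumptions made for illustration. The example locus has an SNV that precedes a subclonal amplification, with purity 0.8.

```python
from collections import defaultdict


def ccf_from_genotypes(g, purity):
    """Eq. (2): total proportion of genotypes with m >= 1, rescaled by purity."""
    return sum(p for (x, y, m), p in g.items() if m >= 1) / purity


def copy_number_proportions(g):
    """Eq. (3): mu[(x, y)] sums the proportions of all genotypes sharing (x, y)."""
    mu = defaultdict(float)
    for (x, y, m), p in g.items():
        mu[(x, y)] += p
    return dict(mu)


def fractional_copy_number(mu):
    """F: average total copy number over all cells (cancer and normal)."""
    return sum((x + y) * p for (x, y), p in mu.items())


def vaf_from_genotypes(g):
    """Eq. (4): expected VAF given genotype proportions."""
    F = fractional_copy_number(copy_number_proportions(g))
    return sum(m * p for (x, y, m), p in g.items()) / F


# Hypothetical locus: keys are genotypes (x, y, m), values are proportions.
# (1, 1, 0) pools normal cells (0.2) and unmutated cancer cells (0.2).
g = {(1, 1, 0): 0.4, (1, 1, 1): 0.3, (2, 1, 2): 0.3}
mu = copy_number_proportions(g)
print(mu, fractional_copy_number(mu), vaf_from_genotypes(g), ccf_from_genotypes(g, 0.8))
```

In this example the mutated cells carry either one or two copies of the SNV, so the genotype set violates the constant mutation multiplicity assumption even though it arises from a simple evolutionary history.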

Given copy-number proportions μ and VAF v, there are often many pairs (Γ, g) that satisfy Eqs. (3) and (4); i.e., these equations are severely underdetermined. Thus, it is necessary to impose additional constraints on the pairs (Γ, g) that are considered. We make the following assumption.

Single Split Copy Number Assumption.

At every mutation locus, there is exactly one copy number $(x^{*}, y^{*})$ with two distinct genotypes $(x^{*}, y^{*}, 0)$ and $(x^{*}, y^{*}, m^{*})$ .

We denote a genotype set as Γ* if it adheres to the single split copy number assumption. Genotype sets Γ* have two desirable properties: (1) They arise from standard evolutionary models for SNVs and copy-number aberrations (detailed in STAR Methods); (2) If genotype proportions g satisfying Eqs. (3) and (4) exist, then they are unique (see Eq. (13)). Note that these properties are not necessarily true for genotype sets derived from the constant mutation multiplicity assumption. We say that a genotype set Γ is consistent with v and μ provided there exist corresponding genotype proportions g satisfying Eqs. (3) and (4). STAR Methods describes necessary and sufficient conditions for consistency.
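The two assumptions can be contrasted with a small checker. This is a sketch under one reading of the definitions above: a set satisfies the single split copy number assumption if exactly one copy number (x*, y*) appears with the two genotypes (x*, y*, 0) and (x*, y*, m*), and every other copy number appears with a single genotype. The function names and the example genotype sets are hypothetical.

```python
from collections import defaultdict


def satisfies_cmm(genotypes):
    """Constant mutation multiplicity: all mutated genotypes share one value M."""
    mutated = {m for (_x, _y, m) in genotypes if m >= 1}
    return len(mutated) <= 1


def find_split(genotypes):
    """Return (x*, y*, m*) if the set has a single split copy number, else None."""
    by_cn = defaultdict(set)
    for x, y, m in genotypes:
        by_cn[(x, y)].add(m)
    # Copy numbers that occur with more than one mutation multiplicity.
    split = [(cn, ms) for cn, ms in by_cn.items() if len(ms) > 1]
    if len(split) != 1:
        return None
    (x, y), ms = split[0]
    if len(ms) != 2 or 0 not in ms:
        return None
    return (x, y, max(ms))


# SNV followed by an amplification of the mutated allele: violates CMM.
amplified = {(1, 1, 0), (1, 1, 1), (2, 1, 2)}
# Two copy numbers each split into mutated and unmutated cells: allowed by CMM,
# but it implies the mutation state changed twice (a homoplasy-like scenario).
double_split = {(1, 1, 0), (1, 1, 1), (2, 1, 0), (2, 1, 1)}

print(satisfies_cmm(amplified), find_split(amplified))
print(satisfies_cmm(double_split), find_split(double_split))
```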

As the genotype proportions g for a single split copy number genotype set Γ* are uniquely determined given VAF v and copy-number proportions μ, the CCF c is uniquely determined as well (by Eq. (2)). Thus, we have the following relationship between CCF c and VAF v for a single split copy number genotype set Γ*.

Theorem 1.

Given tumor purity ρ, VAF v, copy-number proportions μ, and a single split copy number genotype set Γ* consistent with v and μ, the CCF c is uniquely determined by

$$c = \frac{1}{ρ m^{*}} \left[v F - \sum_{\substack{(x, y, m) \in Γ_{CCF} \\ (x, y) \neq (x^{*}, y^{*})}} (m - m^{*}) \cdot μ_{(x, y)}\right], \tag{5}$$

where $Γ_{CCF} = \{(x, y, m) \in Γ^{*} \mid m \geq 1\} \subseteq Γ^{*}$ is the set of genotypes containing the mutation.

Further details and the proof of Theorem 1 are in STAR Methods. Note that, as with the constant mutation multiplicity assumption, the CCF is non-identifiable under the single split copy number assumption since μ and v are not sufficient to determine Γ*. STAR Methods discusses how to enumerate and select genotype sets Γ*.
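A minimal sketch of Eq. (5) follows, assuming the genotype set Γ* and its split genotype (x*, y*, m*) are supplied by the caller; the function name and inputs are illustrative. The example uses a hypothetical locus with purity 0.8 in which an SNV precedes a subclonal amplification, so mutated cells have genotype (1, 1, 1) or (2, 1, 2), and it also evaluates Eq. (1) with M = 1 and M = 2 for comparison.

```python
def ccf_single_split(v, mu, genotypes, split, purity):
    """Eq. (5): CCF for a single split copy number genotype set.

    v         : variant allele frequency
    mu        : dict mapping copy number (x, y) to its proportion
    genotypes : iterable of genotypes (x, y, m) in the set
    split     : the split genotype (x*, y*, m*)
    purity    : tumor purity rho
    """
    x_s, y_s, m_s = split
    F = sum((x + y) * p for (x, y), p in mu.items())
    # Sum over mutated genotypes whose copy number differs from the split one.
    correction = sum(
        (m - m_s) * mu[(x, y)]
        for (x, y, m) in genotypes
        if m >= 1 and (x, y) != (x_s, y_s)
    )
    return (v * F - correction) / (purity * m_s)


# Hypothetical inputs: true genotype proportions are
# (1, 1, 0): 0.4, (1, 1, 1): 0.3, (2, 1, 2): 0.3, so the true CCF is 0.6 / 0.8.
purity = 0.8
mu = {(1, 1): 0.7, (2, 1): 0.3}
genotypes = {(1, 1, 0), (1, 1, 1), (2, 1, 2)}
F = 2 * 0.7 + 3 * 0.3
v = (1 * 0.3 + 2 * 0.3) / F

ccf = ccf_single_split(v, mu, genotypes, split=(1, 1, 1), purity=purity)
# Eq. (1) rewritten with F = rho * N_tot + 2 * (1 - rho), for M = 1 and M = 2.
cmm_estimates = {M: v * F / (purity * M) for M in (1, 2)}
print(ccf, cmm_estimates)
```

Under these assumed inputs, Eq. (5) recovers a CCF of about 0.75, matching Eq. (2), while Eq. (1) gives roughly 1.125 for M = 1 and 0.5625 for M = 2, so neither choice of a single multiplicity reproduces the true value.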

Recall that the VAF v is not directly measured from sequencing data; rather one observes only the total read count t and variant read count a. Thus, the affine transformation in Eq. (5) cannot be used directly to compute the CCF c from sequencing data. Many existing methods (Miller et al. 2014, McGranahan et al. 2015, Jamal-Hanjani et al. 2017, Dentro et al. 2017, Cun et al. 2018, Dentro et al. 2017, Xiao et al. 2020, Lakatos et al. 2020, Cresswell et al. 2020) calculate the CCF by assuming that the proportion $\hat{v} = a / t$ of variant reads is an accurate estimate of the VAF and do not evaluate uncertainty due to sequencing errors and coverage. In STAR Methods, we introduce a probabilistic model to estimate the CCF from read counts of SNVs.

Descendant Cell Fraction

We derive a new quantity, the descendant cell fraction (DCF), a generalization of the CCF that accounts for potential SNV losses. The DCF of a mutation is the proportion of cells in a sample that are descendants of the ancestral cell where the mutation was first introduced. As an example, consider two SNVs that occurred at the same time in the same cell. If one of these SNVs is subsequently lost due to a deletion, these SNVs would have distinct CCFs at the time of sampling. However, the DCF for both SNVs would be the same, as they have the same set of descendant cells in the sample. Note that the DCF equals the CCF for SNVs that are not affected by deletions.

To define the DCF formally, we introduce the notion of a genotype tree $T_{Γ} = (Γ, E)$ , which is a rooted tree whose vertex set is a genotype set Γ and whose directed edges E encode evolutionary relationships between pairs of genotypes. While a tumor phylogeny models the evolutionary history of all SNVs in the tumor, a genotype tree describes the evolutionary history of only a single SNV. As such, inference of genotype trees of individual SNVs is a less challenging task than comprehensive phylogeny inference. We summarize a genotype tree T_Γ and genotype proportions g by the DCF, which will enable us to assign a genotype tree to each SNV subject to a parsimony constraint regarding the number of distinct DCF values. Specifically, the DCF d of an SNV is defined as

$$d = \frac{1}{ρ} \sum_{(x, y, m) \in Γ_{DCF}} g_{(x, y, m)}, \tag{6}$$

where $Γ_{DCF} \subseteq Γ$ is the subset of genotypes that are descendants of the vertex corresponding to genotype (x*, y*, m*). Thus, while the CCF is the total prevalence of the subset $Γ_{CCF} \subseteq Γ$ of genotypes with mutation multiplicity m ≥ 1 at the present time, the DCF is the total prevalence of genotypes that are descendants of the genotype (x*, y*, m*) where the mutation is first introduced. We have the following theorem.

Theorem 2.

Given tumor purity ρ, VAF v, copy-number proportions μ, a single split copy number genotype set Γ* consistent with v and μ, and a genotype tree T_Γ*, the DCF d is uniquely determined by

$$d = \frac{1}{ρ m^{*}} \left[v F - \sum_{\substack{(x, y, m) \in Γ_{DCF} \\ (x, y) \neq (x^{*}, y^{*})}} (m - m^{*}) \cdot μ_{(x, y)}\right], \tag{7}$$

where $Γ_{DCF}$ is the set of genotypes in genotype tree $T_{Γ^{*}}$ that are descendants of the state (x*, y*, m*).

Observe that the only difference between Eqs. (7) and (5) is the use of $Γ_{DCF}$ rather than $Γ_{CCF}$.
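Because Eqs. (5) and (7) differ only in the genotype subset, both can be sketched with one helper. The edge-list encoding of the genotype tree, the function names, and the example values are assumptions for illustration; descendants of (x*, y*, m*) are taken to include that genotype itself. The example is a copy-neutral loss of heterozygosity that removes the SNV from a subclone: (1, 1, 0) → (1, 1, 1) → (2, 0, 0), with purity 0.8.

```python
def descendants(edges, root):
    """All vertices reachable from root in the genotype tree (root included)."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(children.get(node, []))
    return seen


def affine_fraction(v, mu, subset, split, purity):
    """Shared form of Eqs. (5) and (7); `subset` is Gamma_CCF or Gamma_DCF."""
    x_s, y_s, m_s = split
    F = sum((x + y) * p for (x, y), p in mu.items())
    correction = sum(
        (m - m_s) * mu[(x, y)] for (x, y, m) in subset if (x, y) != (x_s, y_s)
    )
    return (v * F - correction) / (purity * m_s)


# Hypothetical genotype tree and proportions:
# (1, 1, 0): 0.2 (normal cells), (1, 1, 1): 0.5, (2, 0, 0): 0.3 (SNV lost by LOH).
edges = [((1, 1, 0), (1, 1, 1)), ((1, 1, 1), (2, 0, 0))]
split = (1, 1, 1)
mu = {(1, 1): 0.7, (2, 0): 0.3}
purity = 0.8
v = (1 * 0.5) / (2 * 0.7 + 2 * 0.3)

gamma_dcf = descendants(edges, split)
gamma_ccf = {gt for gt in descendants(edges, (1, 1, 0)) if gt[2] >= 1}

ccf = affine_fraction(v, mu, gamma_ccf, split, purity)
dcf = affine_fraction(v, mu, gamma_dcf, split, purity)
print(ccf, dcf)
```

In this example the CCF is about 0.625 because 30% of the sample has lost the SNV, whereas the DCF is about 1.0 because every cancer cell descends from the cell in which the SNV arose.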

DeCiFer: Simultaneous Clustering and Genotype Selection using the DCF

Theorem 2 defines the DCF d given VAF v, copy-number proportions μ and a genotype tree T_Γ* for a single split copy number genotype set Γ*. However, neither Γ* nor T_Γ* is directly observed in bulk data, and there are frequently multiple possible values that are consistent with the observed data. Examining SNVs individually, there is no way to distinguish between these values. However, by evaluating SNVs jointly and assuming that there are a small number of possible values of the DCF, we obtain constraints that reduce ambiguity in the selection of Γ* or T_Γ* for individual SNVs. Specifically, we make the following assumption.

Assumption.

There exist DCF values d₁, ..., d_k such that for every SNV in a tumor sample at least one d_j is a valid DCF for the SNV (i.e., solution of Equation (7)).

Under this assumption, SNVs may be partitioned into k groups according to their DCF. However, since SNVs may have more than one possible DCF value, the problem of identifying these groups entails the simultaneous selection of a genotype tree for each SNV (which determines the DCF value of the SNV) and the clustering of SNVs into k groups according to their DCF values.

Here, we describe this simultaneous selection and clustering problem in the more general setting where we have observations from p bulk sequencing samples from the same patient. Thus, we are given variant read counts $a_{i} = {[a_{i, ℓ}]}_{ℓ = 1, \dots, p}$, total read counts $t_{i} = {[t_{i, ℓ}]}_{ℓ = 1, \dots, p}$ and copy-number proportions $M_{i} = {[μ_{i, ℓ}]}_{ℓ = 1, \dots, p}$ for each SNV i in each sample $ℓ$. Let $G_{i}$ be the set of genotype trees T_Γ* for single split copy number genotype sets Γ* that are consistent with $M_{i}$. Let $s_{i} \in \{1, \dots, |G_{i}|\}$ be a selection of a genotype tree T_Γ* for SNV i. Let $z_{i} \in \{1, \dots, k\}$ be a cluster assignment for SNV i. We aim to find genotype selections s, cluster assignments z, and cluster DCFs $D = [d_{1}, \dots, d_{k}]$, where $d_{j, ℓ}$ is the DCF of cluster j in sample $ℓ$, that maximize the posterior probability of the parameters given the observed read counts. Eq. (30) in STAR Methods gives this posterior probability of the DCF for an individual SNV in one sample. We assume that given cluster assignments and DCFs, variant read counts are conditionally independent across samples and across SNVs. Thus, to compute the objective, we take the product across samples and SNVs. This leads to the following problem.

Probabilistic Mutation Clustering and Genotype Selection Problem.

Given a set $G_{i}$ of pairs of genotype sets and trees, variant read counts a_i, total read counts t_i, and copy-number proportions $M_{i}$ for each SNV i, as well as an integer k > 0, find (i) DCFs $D^{*} = [d_{1}, \dots, d_{k}]$ and, for each SNV i, select (ii) $s_{i}^{*} \in \{1, \dots, |G_{i}|\}$ and (iii) $z_{i}^{*} \in \{1, \dots, k\}$ such that

\prod_{i = 1}^{n} \prod_{ℓ = 1}^{p} \Pr (d_{z_{i}^{*}, ℓ}^{*} ∣ a_{i, ℓ}, t_{i, ℓ}, μ_{i, ℓ}, Γ_{i, s_{i}^{*}}, T_{i, s_{i}^{*}})

(8)

is maximum.

While the hardness of this problem is open, the variant of the problem where every VAF v is observed instead of read counts a, t is NP-complete as it is equivalent to the well-studied Hitting Set problem (STAR Methods).

We introduce DeCiFer, an algorithm to solve the Probabilistic Mutation Clustering and Genotype Selection problem. DeCiFer uses a coordinate ascent approach to solve Eq. (8) by alternately optimizing (i) the cluster assignments z and genotype set assignments s for individual SNVs and (ii) the cluster DCFs D.
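The alternation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not DeCiFer's implementation: a plain binomial log-likelihood stands in for the posterior of Eq. (30), cluster DCFs are updated by a simple grid search, and the callback `vaf_of(i, s, l, d)`, which maps a DCF to the expected VAF under candidate genotype tree s, is a hypothetical placeholder for the relationship that DeCiFer derives from each genotype tree.

```python
import math

def binom_loglik(a, t, v):
    """Log-likelihood of a variant reads among t total at expected VAF v
    (binomial coefficient dropped, since it does not depend on v)."""
    v = min(max(v, 1e-9), 1 - 1e-9)  # clamp so log() stays finite
    return a * math.log(v) + (t - a) * math.log(1 - v)

def coordinate_ascent(a, t, n_cands, vaf_of, D, grid=101, max_iters=50):
    """a[i][l], t[i][l]: read counts; n_cands[i]: number of candidate trees
    for SNV i; vaf_of(i, s, l, d): expected VAF (hypothetical callback);
    D[j][l]: initial cluster DCFs. Returns (z, s, D)."""
    n, p, k = len(a), len(a[0]), len(D)
    D = [list(row) for row in D]
    dcf_grid = [g / (grid - 1) for g in range(grid)]  # DCF values on [0, 1]

    def score(i, s, j):
        return sum(binom_loglik(a[i][l], t[i][l], vaf_of(i, s, l, D[j][l]))
                   for l in range(p))

    z, sel = [None] * n, [None] * n
    for _ in range(max_iters):
        # Step (i): best (cluster, genotype tree) pair per SNV, with D fixed.
        new = [max(((j, s) for j in range(k) for s in range(n_cands[i])),
                   key=lambda js: score(i, js[1], js[0])) for i in range(n)]
        new_z, new_sel = [js[0] for js in new], [js[1] for js in new]
        if new_z == z and new_sel == sel:
            break  # assignments are stable: a local optimum on this grid
        z, sel = new_z, new_sel
        # Step (ii): best DCF per cluster and sample, with assignments fixed.
        for j in range(k):
            members = [i for i in range(n) if z[i] == j]
            for l in range(p):
                if members:
                    D[j][l] = max(dcf_grid, key=lambda d: sum(
                        binom_loglik(a[i][l], t[i][l], vaf_of(i, sel[i], l, d))
                        for i in members))
    return z, sel, D

# Toy input (assumed): one sample, pure tumor, diploid locus, so that
# VAF = DCF * multiplicity / 2; SNV 3 has two candidate multiplicities.
mult = [[1], [1], [1], [1, 2]]
toy_vaf = lambda i, s, l, d: d * mult[i][s] / 2
z, sel, D = coordinate_ascent([[500], [500], [200], [400]], [[1000]] * 4,
                              [1, 1, 1, 2], toy_vaf, [[0.9], [0.3]])
print(z, sel, D)
```

On this toy input the fourth SNV is first assigned to the high-DCF cluster with multiplicity 1 and then moves to the low-DCF cluster with multiplicity 2 once the cluster DCFs have been updated, which is the kind of joint reassignment that motivates optimizing z and s together. Like any coordinate ascent, the sketch only reaches a local optimum, so results depend on the initial DCFs.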

DeCiFer imposes stronger constraints on the allowed genotype sets than imposed by the single split copy number assumption, since some single split copy number genotype sets are not evolutionarily plausible (for example, see the genotype set with homoplasy in Figure 1d). Specifically, we assume that the allowed genotype trees T_Γ for an individual SNV conform to the following evolutionary model. First, each mutation is introduced exactly once but may be subsequently lost or amplified due to copy-number aberrations. Second, each allele-specific copy number (x, y) of the SNV locus is attained exactly once. Thus, viewing SNVs as two-state characters and copy-number aberrations as multi-state characters, we have the Dollo model for SNVs and the infinite alleles assumption for copy-number aberrations. Finally, any change in mutation multiplicity must be caused by a corresponding change in copy number. These constraints were formally described by El-Kebir et al. (2016) (Definition 12, Supplementary Material). Note that all genotype sets Γ that meet these constraints are also single split copy number genotype sets. Under the evolutionary constraints, the mutation multiplicity m* for the split copy number (x*, y*) is m* = 1. DeCiFer enumerates $G_{i}$ using the same tree enumeration procedure introduced in SPRUCE (El-Kebir et al. 2016).

Further details of DeCiFer model selection and implementation are in STAR Methods. DeCiFer is available at https://github.com/raphael-group/decifer.

Simulations

We compared DeCiFer with PyClone (Roth et al. 2014), a commonly used method for clustering and estimation of CCFs, on simulated data. PyClone simultaneously infers mutation multiplicity for SNVs and clusters mutations based on CCF values computed from these multiplicities. However, PyClone does not include subclonal copy-number aberrations in its model. To analyze subclonal copy-number aberrations, we used two variations of PyClone. First, for mutations with subclonal copy-number aberrations, we estimate a clonal copy number by computing the average allele-specific copy number at the locus over all clones and rounding to the nearest integers. Second, we used a procedure described in the TRACERx study (Jamal-Hanjani et al. 2017) to estimate CCFs before running PyClone. Specifically, we estimated CCFs for each SNV using a constant mutation multiplicity-based method that accounts for subclonal copy-number aberrations (Dentro et al. 2017). Then we adjusted the variant read counts input to PyClone to match the computed CCF for a heterozygous variant on a diploid locus. Note that by pre-computing CCFs, this procedure pre-selects mutation multiplicities independently for each SNV. We refer to this procedure as adjusted PyClone in this section (or “Adj. PyClone” in Figure 3). Further details are in the Quantification and Statistical Analysis section of STAR Methods.
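One possible reading of the read-count adjustment is sketched below. It is an assumption-laden illustration, not the TRACERx or DeCiFer code: it assumes that a heterozygous SNV on a diploid locus has expected VAF equal to purity × CCF / 2, and the function name and the `purity` parameter (defaulting to a pure sample) are hypothetical.

```python
def adjusted_variant_reads(total_reads, ccf, purity=1.0):
    """Variant read count that a heterozygous SNV on a diploid locus would
    show at the given CCF (assumed model: VAF = purity * CCF / 2)."""
    expected_vaf = purity * ccf / 2.0
    return round(total_reads * expected_vaf)

# An SNV with precomputed CCF 0.8 at 100 total reads in a pure sample.
print(adjusted_variant_reads(100, 0.8))
```

After such an adjustment every SNV can be passed to the clustering method as if it sat on a diploid locus, which is why the mutation multiplicity is effectively fixed before clustering.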

Figure 3: — a, The error in quantifying cell fractions (DCF for DeCiFer and CCFs for PyClone and Adjusted PyClone) on simulated data with 5 samples, 1000 mutations and varying number of clusters. Each jitter-boxplot gives results for twenty simulated data instances. b, The performance of each method in clustering SNVs as measured by adjusted Rand index. c, The difference between the inferred number $\hat{k}$ of clusters and the true number k of clusters. d-g, Results from one simulation with k = 2 clusters demonstrate differences between CCF and DCF. d, Variant allele frequencies from two out of five samples: green mutations are the truncal cluster and the purple mutations are a subtruncal cluster that is absent in these two samples. Mutations that are lost in at least one sample are highlighted with a black outline. e, The relationship between the true clusters and the clusters inferred by DeCiFer and PyClone. DeCiFer infers two clusters and accurately recovers the true cluster assignments for all mutations. f, PyClone infers four clusters subdividing the truncal cluster into three separate clusters. The two additional clusters (yellow and blue) are defined by mutations that were lost in subpopulations of tumor cells. g, Adjusted PyClone infers five clusters subdividing the truncal cluster into four separate clusters.

We compared the three methods on simulated data with 2, 4, 8, and 12 clusters of SNVs generated as follows. Each simulated instance contained 1000 SNVs, 5 samples and an expected read depth of 100×. We simulated input data by partitioning the SNVs into the given number of clusters and simulating an evolutionary process where cells accumulate clusters of SNVs as well as copy-number aberrations (whole chromosome, whole arm, and focal events) which may amplify or delete overlapping SNVs. To simulate the sequencing of every SNV in each sample, we draw the total number of reads from a Poisson distribution with rate equal to the coverage and the number of variant reads from a binomial distribution, as done in previous studies (Deshwar et al. 2015, Satas et al. 2020, Satas & Raphael 2017). For each number of clusters, we simulate 20 instances. The simulation process is described in detail in STAR Methods.
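The read-sampling step can be illustrated with a short sketch. It assumes that the true VAF of each SNV has already been computed from the simulated genotypes, and it uses simple textbook samplers from the standard library; the function names are hypothetical and this is not the simulation code used in the study.

```python
import math
import random

def poisson(rng, lam):
    """Knuth's multiplication method. Adequate for rates around 100; a
    library sampler would be preferable for much larger rates."""
    threshold = math.exp(-lam)
    count, prod = 0, rng.random()
    while prod > threshold:
        count += 1
        prod *= rng.random()
    return count

def binomial(rng, n, p):
    """Number of successes in n independent Bernoulli(p) trials."""
    return sum(rng.random() < p for _ in range(n))

def simulate_reads(vafs, coverage=100, seed=0):
    """Return one (variant_reads, total_reads) pair per true VAF."""
    rng = random.Random(seed)
    reads = []
    for v in vafs:
        total = poisson(rng, coverage)     # total reads ~ Poisson(coverage)
        variant = binomial(rng, total, v)  # variant reads ~ Binomial(total, v)
        reads.append((variant, total))
    return reads

for variant, total in simulate_reads([0.0, 0.25, 0.5], coverage=100, seed=1):
    print(variant, total)
```

Fixing the seed makes a simulated instance reproducible, which is convenient when the same instance is passed to several methods for comparison.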

We find that DeCiFer outperforms both PyClone and the adjusted PyClone across all inputs according to three metrics. First, we compare methods using cell fraction error – the mean absolute difference between the inferred cell fraction (where cell fraction is DCF for DeCiFer and CCF for PyClone) and the true cell fraction over all SNVs in all samples (Figure 3a). Second, we evaluate the quality of clustering results by computing the adjusted Rand index across all pairs of SNVs (Figure 3b). We find that while the adjusted PyClone is more effective at inferring cluster CCFs, this adjustment does not improve and may worsen the accuracy of mutation clustering. This may be caused by the selection of mutation multiplicities when computing the input for adjusted PyClone. Finally, we observe that DeCiFer has the highest accuracy in inferring the number of clusters (Figure 3c). Both versions of PyClone overestimate the number of clusters for these instances, with PyClone inferring between 40 and 120 clusters on several instances. In contrast, DeCiFer accurately estimates the number of clusters for most inputs and moderately underestimates the number of clusters for data with 12 clusters.
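For concreteness, the first two metrics can be computed as in the sketch below. The cell fraction error follows the definition given above, and the adjusted Rand index uses the standard pair-counting formula; the function names and input layout (one row of per-sample cell fractions per SNV, one cluster label per SNV) are assumptions made for this example.

```python
from collections import Counter
from math import comb

def cell_fraction_error(inferred, true):
    """Mean absolute difference over all SNVs (rows) and samples (columns)."""
    pairs = [(x, y) for row_i, row_t in zip(inferred, true)
             for x, y in zip(row_i, row_t)]
    return sum(abs(x - y) for x, y in pairs) / len(pairs)

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two clusterings of the same SNVs
    (requires at least two SNVs)."""
    n = len(labels_a)
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (together_both - expected) / (maximum - expected)

print(cell_fraction_error([[1.0, 0.5]], [[0.9, 0.5]]))
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

The adjusted Rand index ignores cluster label names, so a method is not penalized for numbering its clusters differently from the ground truth.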

To further characterize the differences between methods, we examine one simulated instance (random seed s = 1) with k = 2 clusters (Figure 3d-g). In this example, the true clustering consists of a truncal cluster (green) and a subtruncal cluster (purple) which is present in 2 of 5 samples. DeCiFer infers two clusters (Figure 3e), accurately grouping all mutations in the truncal cluster. PyClone infers four clusters, subdividing the truncal cluster into three separate clusters (Figure 3e, f). In contrast, adjusted PyClone infers five clusters, subdividing the truncal cluster into four separate clusters (Figure 3e, g). The SNVs in two of the extra clusters inferred by PyClone and adjusted PyClone (yellow and blue clusters in Figure 3f, g) are affected by mutation losses, which DeCiFer correctly groups together by using the DCF. The extra cluster inferred by adjusted PyClone (orange cluster in Figure 3g) results from incorrect selection of mutation multiplicities. PyClone correctly groups together the SNVs in the orange and green clusters because it selects mutation multiplicities while clustering. This example illustrates the importance of evaluating mutation multiplicities simultaneously while clustering SNVs and also accounting for mutation losses, key features of DeCiFer.

We performed further simulations to evaluate the performance of DeCiFer across a range of parameters (Figure S1). We simulate data with different numbers of SNVs (100, 250, 500, or 1000), samples (1, 3, 5, or 7) and varying expected read depth (25×, 100×, 200×, or 1000×). We see that performance generally increases with larger numbers of samples (Figure S1a) and higher read depth (Figure S1a). The number of SNVs minimally affects the performance of DeCiFer and PyClone (Figure S1b). We also compared DeCiFer with PhyloWGS (Deshwar et al. 2015), a method that infers tumor phylogenies while simultaneously clustering SNVs into clones. We ran PhyloWGS on simulated instances with 100 and 1000 SNVs. While reconstruction of a tumor phylogeny jointly from copy-number aberrations and all SNVs should in principle produce the most accurate mutation multiplicities, we found that on instances with 100 SNVs, PhyloWGS had the lowest performance (Figure S2). On instances with 1000 SNVs, we found that PhyloWGS did not converge in reasonable time (<3 days). These results suggest that phylogeny inference is challenging for large numbers of SNVs in the presence of copy-number aberrations, leading to convergence issues for approaches like PhyloWGS that attempt to simultaneously cluster SNVs and infer a phylogenetic tree relating these clusters. In contrast, DeCiFer completed the largest instances in under 1.5 hours and produced highly accurate clusters. This suggests that the approach of simultaneous selection of genotype trees for individual SNVs while clustering DCF values retains many of the advantages of full phylogenetic inference.

Finally, we evaluated DeCiFer across varying sizes of mutation clusters. We simulated data with three clusters: a truncal cluster and two subtruncal clusters. The truncal cluster comprises 20% of mutations, and we simulate instances where the smaller cluster comprises 40%, 20%, 10% and 5% of mutations. For each value of prevalence, we generated 10 instances with 1000 mutations, 3 samples and an expected read depth of 100×. We find that while the cell fraction error increases and the adjusted Rand index decreases when the prevalence of the smaller cluster decreases, the drop in performance is modest (Figure S3), suggesting that DeCiFer’s results are fairly stable under varying distributions of SNVs across tumor clones.

Metastatic Prostate Cancer

We analyzed SNVs and copy-number aberrations identified in whole-genome sequencing data of 49 tumor samples from 10 metastatic prostate cancer patients from Gundem et al. (2015). The initial published analysis (Gundem et al. 2015) of these patients inferred CCFs for a subset of validated SNVs using the constant mutation multiplicity assumption, clustered these SNVs into tumor clones according to their CCF values, and built a phylogenetic tree describing the evolution of these clones. We analyzed SNVs and copy-number aberrations from another published analysis of these same samples (Zaccaria & Raphael 2020), computing the DCF of each SNV using DeCiFer and computing the CCF of each SNV using the method from Dentro et al. (2017) that relies on the constant mutation multiplicity assumption. Further details of the analysis are in the Quantification and Statistical Analysis section of STAR Methods.

We found that the DCFs computed by DeCiFer are substantially different from the constant multiplicity CCFs for a large number of SNVs (Figure 4 and Figure S4). We summarize these differences according to two commonly used classifications of SNVs. First, we classify SNVs as clonal if they are inferred to be present in all cells in a tumor sample (CCF ≈ 1) or subclonal if they are inferred to be present only in a subpopulation (CCF ≪ 1). The clonal/subclonal classification categorizes SNVs according to their presence in cancer cells at the time of sampling. Second, we classify SNVs as truncal if they are inferred to occur before the most recent common ancestor of all cancer cells in the sample, and subtruncal otherwise. Most previous studies (Roth et al. 2014, Miller et al. 2014, McGranahan et al. 2015, Dentro et al. 2017, Yuan et al. 2018, Cun et al. 2018, Dentro et al. 2020, Xiao et al. 2020, Gerstung et al. 2020, Lakatos et al. 2020, Caravagna et al. 2020, Cresswell et al. 2020) assume that clonal SNVs and truncal SNVs are identical. However, this assumption does not necessarily hold when SNVs are lost due to deletions in cancer cells. Using CCFs, one cannot distinguish such lost mutations from mutations that never occurred in a sample. However, the DCFs of these two cases will be different. Thus, by computing the DCF, DeCiFer has the ability to more precisely designate SNVs as truncal, especially those that were subsequently deleted and are subclonal. We classify SNVs as truncal if DCF ≈ 1 or subtruncal if DCF ≪ 1 (Figure 4a).

Overall, >23,000 SNVs across all samples had a change in classification (Figure 4a) when using constant multiplicity CCFs versus DCFs inferred by DeCiFer. Multiple factors contribute to such changes. First, we found that ∼8,500 SNVs across all samples that were classified as subclonal using constant multiplicity CCFs are classified as truncal by DeCiFer (Figure 4b). This difference is due to the loss of mutations by deletions. Such losses affect a moderate percentage of SNVs across all patients (5–32%, Figure S4) consistent with other recent estimates of the frequency of mutation losses (McPherson et al. 2016). We also see a large number of classification differences that cannot be explained by losses of SNVs. For example, we found that ∼12,000 SNVs across all samples are classified as clonal using constant multiplicity CCFs and as subtruncal by DeCiFer (Figure 4b). These correspond to a moderate percentage of SNVs across all patients (3–40%, Figure S4). These differences are explained by choices of different mutation multiplicities (and different genotype sets Γ) for these SNVs. We will show below that the genotypes selected by DeCiFer often result in more parsimonious explanations for the observed data.

Another key difference in classification of SNVs is the classification of mutations as absent in a sample. SNVs without any observed variant reads in a sample (VAF = 0) are typically classified as absent from the sample. As a result, current cancer sequencing studies generally assume in downstream analyses that these SNVs were never present in the observed cancer cells in that sample or their ancestors. However, SNVs can be deleted by copy-number aberrations during tumor evolution and, when all cancer cells in a sample have been affected by such deletions, truncal SNVs may appear as absent. We find that 1,560 SNVs across all samples classified as absent using the constant multiplicity CCFs are classified as truncal or subtruncal by DeCiFer (Figure 4b), corresponding to 0–5% of the total number of SNVs across patients (Figure S4c).
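The two classification schemes and the treatment of absent SNVs can be summarized in a small sketch. The tolerance `tol` used to decide whether a cell fraction is close to 1 is a hypothetical parameter chosen for illustration; the text above only specifies CCF ≈ 1 and DCF ≈ 1, and the function is not part of DeCiFer.

```python
def classify(vaf, ccf, dcf, tol=0.1):
    """Return (ccf_based, dcf_based) labels for one SNV in one sample.
    tol is a hypothetical tolerance standing in for "approximately 1"."""
    if vaf == 0:
        ccf_based = "absent"          # no variant reads observed
    elif ccf >= 1 - tol:
        ccf_based = "clonal"
    else:
        ccf_based = "subclonal"
    if dcf == 0:
        dcf_based = "absent"          # never present in any ancestor
    elif dcf >= 1 - tol:
        dcf_based = "truncal"
    else:
        dcf_based = "subtruncal"
    return ccf_based, dcf_based

# A truncal SNV deleted in every cancer cell of the sample: no variant
# reads are observed, yet all cells descend from a cell that carried it.
print(classify(vaf=0.0, ccf=0.0, dcf=1.0))
```

Keeping the two labels separate makes the cases discussed above explicit: a lost truncal SNV receives the pair ("absent", "truncal") or ("subclonal", "truncal") rather than a single ambiguous label.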

The differences between the inferred constant multiplicity CCFs and the DCFs do not simply result in different classifications of SNVs but also have a critical impact on downstream phylogenetic analysis. For example, on chromosome 5q in prostate cancer patient A17, we found two groups comprising 284 SNVs in total with different VAFs in sample A17-D that have been differently classified (Figure 5a, top). While one group (green) is classified as clonal and the other group (brown) as subclonal based on constant multiplicity CCFs, DeCiFer classifies both groups as truncal. A previous copy-number analysis (Zaccaria & Raphael 2020) identified the presence of cancer cells with different copy numbers in the same region (Figure 5a, bottom): 61% of cancer cells have a copy-neutral loss of heterozygosity (i.e., copy number (2, 0)), while the remaining cells are heterozygous diploid (i.e., copy number (1, 1)). Following the constant mutation multiplicity assumption, the mutation multiplicity of the clonal cluster (green) is 2 in all cancer cells (Figure 5b), indicating the presence of cells with genotype (1, 1, 2). As each of these SNVs is present on both copies of the locus, this implies that each of the 142 SNVs occurred twice (homoplasy), once on each homologous chromosome. While recurrent mutations, or homoplasy, may occasionally happen in tumor evolution (Kuipers et al. 2017), observing homoplasy in 142 SNVs – all on the same chromosomal arm – seems highly unlikely.

Figure 5: — a, (Top) Two groups of 284 SNVs on chromosome 5q (brown and green) have similar VAFs in sample A17-F but different VAFs in sample A17-D, while being affected by the same copy-number aberration. (Middle) Positions of green and brown SNVs on chromosome 5q. (Bottom) Inferred copy numbers in samples: all cells in A17-F are heterozygous diploid, while 61% of cancer cells in A17-D have a copy-neutral loss of heterozygosity. b, A constant mutation multiplicity based method (Dentro et al. 2017) infers CCFs that separate the 284 SNVs into a clonal cluster (green) and a subclonal cluster (brown). Following the constant mutation multiplicity assumption, the clonal SNVs (green) are inferred to have mutation multiplicity of 2 in two distinct tumor clones (clone I and II). This leads to a phylogenetic reconstruction (right) with the unrealistic conclusion that each of the 142 SNVs occurred twice independently on both homologous chromosomes (i.e., 142 homoplasy events). c, DeCiFer infers DCF ≈ 1 for all 284 SNVs by identifying different mutation multiplicities (blue dashed box) for SNVs in the green cluster across the two clones (I and II) and by identifying loss of brown SNVs in a subset of tumor cells. Thus all SNVs in the green and brown clusters are truncal. This results in a realistic tumor phylogeny where the mutation multiplicities are consistent with the observed copy-neutral loss of heterozygosity.

On the same patient A17, DeCiFer infers DCFs that result in much more realistic mutation multiplicities and phylogenetic reconstruction. In particular, DeCiFer infers that the two groups of SNVs (green and brown in Figure 5a) are part of the same truncal cluster (i.e., DCF ≈ 1) but have different mutation multiplicities: the SNVs in the first group (green) have a multiplicity of 1 in one clone and a multiplicity of 2 in the other, a scenario that is not allowed under the constant mutation multiplicity assumption. DeCiFer’s results are consistent with a realistic evolutionary scenario where the two groups of SNVs occurred on different alleles of the chromosome: the first group (green) was amplified during the copy-neutral loss of heterozygosity event, while the second group (brown) was lost. Further supporting this reconstruction is the observation that the two groups of SNVs are randomly distributed over chromosome 5q (Figure 5a), indicating that the differences in VAF between the green and brown SNVs are not due to an error in the copy numbers for one group. Thus, DeCiFer’s classification of the SNVs in the brown group as truncal but lost in a subpopulation of cancer cells results in a simpler and more realistic reconstruction of tumor evolution compared to the classification of these SNVs as subclonal according to their constant multiplicity CCFs.
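As a rough check of this scenario, the VAF definition in STAR Methods (Definition 2) can be evaluated for sample A17-D. The calculation assumes that normal cells are heterozygous diploid without the SNVs and writes ρ for the tumor purity of A17-D, whose value is not given here. With 61% of cancer cells at copy number (2, 0) and the remaining 39% at (1, 1), the fractional copy number is F = 2, and the multiplicities described above (2 and 1 for the green SNVs, 0 and 1 for the brown SNVs) give

v_{green} = \frac{2 \cdot 0.61 ρ + 1 \cdot 0.39 ρ}{2} = 0.805 ρ, \qquad v_{brown} = \frac{0 \cdot 0.61 ρ + 1 \cdot 0.39 ρ}{2} = 0.195 ρ

Under these assumptions, both groups can be present in the ancestor of all cancer cells (DCF ≈ 1) while showing a roughly fourfold difference in VAF in this sample.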

On prostate cancer patient A12, we see an example of mutations that are classified as absent using constant multiplicity CCFs but classified as truncal by DeCiFer. Specifically, chromosome 6q contains 86 SNVs that are split into two groups with different VAFs across three samples (Figure 6a, top): the first group (green) has VAFs between 0.4–0.8 in all samples, while the second group (magenta) has VAFs between 0.1–0.4 in samples A12-C and A12-D but VAFs ≈ 0 in the remaining sample A12-A. These two groups of SNVs have different constant multiplicity CCFs across samples and result in the inference of three distinct tumor clones with a complicated evolution characterized by recurrent mutation (homoplasy) of 58 SNVs in the magenta cluster (Figure 6b). However, a previous copy-number analysis (Zaccaria & Raphael 2020) revealed the presence of a copy-neutral loss of heterozygosity on chromosome 6q in these samples. The proportions of cancer cells that have the loss-of-heterozygosity event closely follow the VAFs of the SNVs in the magenta group (Figure 6a, bottom): 100%, 66%, and 0% of cancer cells have a copy-neutral loss of heterozygosity in A12-A, A12-C, and A12-D, respectively. DeCiFer infers that all 86 SNVs in the green and magenta groups are truncal with DCFs ≈ 1, and that the magenta group of SNVs was lost during the copy-neutral loss of heterozygosity event (Figure 6c). Indeed, in sample A12-A, where all cancer cells have the copy-neutral loss of heterozygosity, these SNVs have VAF = 0. Note that the phyloCCF correction introduced in Jamal-Hanjani et al. (2017) to address issues of mutation loss would not have identified the magenta SNVs as truncal. Since phyloCCF is only applied to genomic regions affected by subclonal deletions and on each sample independently, it only considers SNVs with VAF > 0. Therefore, phyloCCF cannot distinguish between mutation losses and absences in a given sample as done by DeCiFer in this example. Thus DeCiFer yields a more parsimonious reconstruction of tumor evolution than obtained with constant multiplicity CCFs, with fewer tumor clones and no massive homoplasy.

Figure 6: — a, (Top) Two groups of 87 SNVs on chromosome 6q (green and magenta) have different VAFs in samples A12-A and A12-C, with one group (magenta) having VAF = 0 in A12-A. (Middle) Positions of green and magenta SNVs on chromosome 6q. (Bottom) Inferred copy numbers in samples: sample A12-A only contains cancer cells with a copy-neutral loss of heterozygosity, A12-D only contains heterozygous diploid cells, and A12-C contains a mixture of both. b, A constant mutation multiplicity based method (Dentro et al. 2017) infers CCFs that separate the SNVs into a clonal cluster in all samples (green) and another cluster (magenta), which is clonal in A12-D, absent in A12-A, and subclonal in A12-C. Following the constant mutation multiplicity assumption, the mutation multiplicities in each sample indicate the presence of three distinct tumor clones (labeled I, II, and III). This leads to a phylogenetic reconstruction (right) with the unrealistic conclusion that 58 SNVs occurred twice independently on both homologous chromosomes in Clone III (58 homoplasy events). c, DeCiFer infers DCF ≈ 1 for all SNVs by identifying different mutation multiplicities (blue dashed box) for SNVs in the green cluster in sample A12-C and by identifying the loss of 28 magenta SNVs in a subset of tumor cells due to a copy-neutral loss of heterozygosity. Thus, all SNVs in the green and magenta clusters are truncal. Notably, these 28 SNVs have DCF ≈ 1 also in A12-A even though their VAF = 0; this is because all cancer cells have a copy-neutral loss of heterozygosity in sample A12-A. The inferred losses of these 28 SNVs are well supported by the fact that the proportion of cancer cells with the copy-neutral loss of heterozygosity (second row of table in (a)) closely follows their variations of VAF across samples (VAFs in (a)).

DeCiFer also classifies ∼12,000 SNVs as subtruncal across all samples that are classified as clonal using constant multiplicity CCFs (Figure 4b). Such SNVs correspond to between 3–41% of the total number of SNVs in different patients (Figure S4c). We observe this in prostate cancer patient A24, where 167 SNVs on chromosome 7 form four distinct groups with different VAFs in sample A24-A. In this genomic region, there are two groups of cancer cells with different copy numbers (Figure 7a): 27% of cancer cells have a loss of heterozygosity with amplifications (i.e., allele-specific copy numbers {4, 0}), while the remaining cancer cells are diploid (i.e., allele-specific copy numbers {1, 1}). Using the constant multiplicity CCFs, these SNVs form two clusters (Figure 7b): one subclonal cluster composed of 30 SNVs, and one clonal cluster composed of the remaining 137 SNVs. Based on the constant mutation multiplicity assumption, all the clonal SNVs are inferred to have a constant SNV multiplicity of either 1 or 2. However, this solution is biologically unlikely: if an SNV was present in the cell where the amplification occurred, then the cell would have 4 copies of the SNV. Instead, only 1 or 2 copies are inferred to have the SNV, implying that back mutations simultaneously occurred for 33 SNVs (light blue) and homoplasy events simultaneously occurred for 26 SNVs (dark grey). DeCiFer identifies one group of 60 SNVs (light and dark blue) belonging to a subtruncal cluster (CCF ≈ 0.7) while the remaining SNVs (light and dark grey) belong to the truncal cluster (Figure 7c). The DeCiFer solution corresponds to a more realistic reconstruction of tumor evolution, where differences in mutation multiplicities explain different cluster VAFs and are consistent with an amplification event.

Figure 7: — a, 167 somatic SNVs are located on chromosome 7 of prostate cancer patient A24. b, Four clusters of these SNVs have the same values of VAF in sample A24-E but different values in A24-A. c, Different cells have different copy numbers in the two samples for the genomic region harboring these SNVs: while all cells in A24-E are diploid in this region (i.e., copy numbers (1, 1)), 20% of cells in A24-A have a loss of heterozygosity with an amplification (i.e., copy numbers (4, 0)). d, An existing method (Dentro et al. 2017) using the constant mutation multiplicity assumption infers that three clusters (cyan, and light and dark grey) of SNVs in (b) have CCF ≈ 1 in both samples, corresponding to truncal SNVs. Instead, the method infers that the remaining cluster (dark blue) is subclonal in A24-A, which thus includes two tumor clones (clone I and II). Based on the values of VAF and the constant mutation multiplicity assumption, the method infers that each of the three clusters of clonal SNVs has the same SNV multiplicity: two of these clusters (light grey and cyan) have SNV multiplicity of 1 and the other cluster (dark grey) has an SNV multiplicity of 2. Since all the SNVs are affected by an amplification of the retained allele and since SNVs are also amplified with the corresponding allele, this result implies the unrealistic occurrence of back mutations for all the clusters; for example, three back mutations must have occurred for one of these clusters (cyan). e, DeCiFer infers that two (light and dark grey) of the clusters in (b) have DCF ≈ 1 and correspond to truncal SNVs, while the other two remaining clusters (light and dark blue) have DCF ≈ 0.6 and correspond to subtruncal SNVs. Specifically, DeCiFer obtains this result by identifying the loss of two clusters (light grey and dark blue) located on the same allele (black) and the presence of different mutation multiplicities in different clones (clones II and III) for the other two remaining clusters (dark grey and light blue), resulting in a realistic tumor phylogeny.

Discussion

The cancer cell fraction (CCF) is the cornerstone of tumor heterogeneity and evolution analysis using single-nucleotide variants (SNVs). However, we demonstrated that current approaches to estimate CCFs suffer from major limitations and these limitations have striking consequences on real data. First, nearly all existing methods to estimate CCFs are based on the constant mutation multiplicity assumption that is violated in many tumors. Second, the CCF is not the correct quantity to group mutations for phylogenetic analysis in the case where SNV losses occur due to copy-number aberrations, a case that is common in solid tumors. In this work, we address these limitations by: (i) introducing the single split copy number assumption, a more realistic alternative to the constant mutation multiplicity assumption; (ii) defining the descendant cell fraction (DCF), a generalization of the CCF that accounts for SNV losses; (iii) developing DeCiFer, an algorithm that simultaneously estimates DCFs (or CCFs) of individual SNVs and clusters SNVs into a small number of groups according to these DCFs (or CCFs) across multiple tumor samples. We show that DeCiFer improves the identification of SNVs that co-occurred in the same tumor clone on both simulated and real cancer data, yielding more realistic reconstructions of tumor evolution compared to earlier approaches based on CCFs inferred using the constant mutation multiplicity assumption. DCF clusters account for mutation losses and differences in copy number, and thus can be used as input to standard tumor phylogeny methods (Popic et al. 2015, Qiao et al. 2014, El-Kebir et al. 2015, Malikic et al. 2015, Husić et al. 2019). This will enable phylogeny inference for realistically sized problems containing thousands of SNVs whose copy numbers may differ within and across tumor samples. DeCiFer can also be run with CCFs instead of DCFs, which may be preferable for certain applications such as neoantigen prediction (Cai et al. 2018) where identifying the clonal status of mutations is important.

While DeCiFer enables us to overcome some major limitations of previous studies, there are several avenues for future improvements. First, we assume that the given copy numbers and proportions are exact. However, methods that infer copy numbers and proportions from bulk DNA sequencing data are subject to errors and may miss copy-number aberrations that are small or present at low proportion in a sample. This uncertainty could be incorporated into the DeCiFer model. Second, further improvements in SNV clustering are possible, such as better modeling of the tail of low prevalence SNVs that are expected due to neutral evolution (Caravagna et al. 2020). Third, breakpoints of structural variants could also be analyzed by DeCiFer since the prevalence of these mutations is proportional to read counts (Cmero et al. 2020). Finally, the genotype trees selected by DeCiFer for each SNV could be combined into tumor phylogenies, perhaps using consensus tree methods. DeCiFer provides a robust framework to decipher tumor heterogeneity in the presence of copy-number aberrations, providing a tool to improve understanding of tumor development and evolution.

STAR Methods

RESOURCE AVAILABILITY

Lead Contact

Requests for further information should be directed to and will be fulfilled by the lead contact, Ben Raphael (braphael@princeton.edu).

Data and Code Availability

DeCiFer code is available on GitHub at https://github.com/raphael-group/decifer (DOI 10.5281/zenodo.5082565) and it is distributed through Bioconda at https://bioconda.github.io/recipes/decifer/README.html. The simulations as well as the processed data and DeCiFer’s results for the prostate cancer patients analyzed in this study are available on GitHub at https://github.com/raphael-group/decifer-data. Whole-genome DNA sequencing data for all samples of the prostate cancer patients are available from the European Genome-phenome Archive (EGA) under accession number EGAS00001000262.

METHOD DETAILS

Derivation of Theorem 1

In this section, we derive in detail Main Text Theorem 1, which describes the relationship between CCF c and VAF v for a single split copy number genotype set Γ*. We will begin by restating several definitions provided in the main text. Given the tumor purity ρ, genotype set Γ and genotype proportions g uniquely determine the CCF c.

Definition 1.

The cancer cell fraction (CCF) c is defined in terms of a genotype set Γ and genotype proportions g and tumor purity ρ as

c = \frac{1}{ρ} \sum_{(x, y, m) \in Γ_{CCF}} g_{(x, y, m)},

(9)

where Γ_CCF is the subset of genotypes $(x, y, m) \in Γ$ where m ≥ 1.

The genotype set Γ and genotype proportions $g = [g_{(x, y, m)}]$ uniquely determine the VAF v as follows.

Definition 2.

The variant allele frequency (VAF) v is defined in terms of a genotype set Γ and genotype proportions g as

v = \frac{1}{F} \sum_{(x, y, m) \in Γ} m \cdot g_{(x, y, m)}

(10)

where

F = \sum_{(x, y, m) \in Γ} (x + y) \cdot g_{(x, y, m)}

(11)

is the fractional copy number of the locus as defined by Γ and g.
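Definitions 1 and 2 can be illustrated with a short Python sketch. This is not part of the DeCiFer implementation: the function names, the dictionary representation of g, and the example proportions are hypothetical, and we assume that g includes normal cells (genotype (1, 1, 0) with proportion 1 − ρ) so that all proportions sum to 1.

```python
def fractional_copy_number(g):
    """Eq. (11): F is the sum of (x + y) * g over all genotypes (x, y, m)."""
    return sum((x + y) * p for (x, y, m), p in g.items())


def vaf(g):
    """Eq. (10): fraction of the copies at the locus that carry the SNV."""
    mutated = sum(m * p for (x, y, m), p in g.items())
    return mutated / fractional_copy_number(g)


def ccf(g, purity):
    """Eq. (9): fraction of tumor cells with at least one mutated copy."""
    return sum(p for (x, y, m), p in g.items() if m >= 1) / purity


# Hypothetical sample: 20% normal cells (purity 0.8), and a tumor in which
# every cell has copy-number state (2, 1) but only some cells carry the SNV.
g = {(1, 1, 0): 0.2, (2, 1, 0): 0.3, (2, 1, 1): 0.5}
print(fractional_copy_number(g), vaf(g), ccf(g, purity=0.8))
```

In this example F is approximately 2.8, the VAF is approximately 0.179, and the CCF is 0.625, which shows how a CCF above one half can correspond to a VAF well below one quarter when the locus is amplified.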

The genotype set Γ and genotype proportions $g = [g_{(x, y, m)}]$ uniquely determine the copy-number proportions μ as follows.

Definition 3.

The copy-number proportions μ are defined in terms of a genotype set Γ and genotype proportions g as

μ_{(x, y)} = \sum_{m : (x, y, m) \in Γ} g_{(x, y, m)}

(12)

for all copy-number states (x, y).

We define consistency between the VAF v, copy-number proportions μ, genotype set Γ, and the underlying genotype proportions g in terms of the above equations.

Definition 4.

Genotype set and proportions (Γ, g) are consistent with VAF v and copy-number proportions μ provided that Eqs. (10) and (12) are satisfied.

Definition 5.

Genotype set Γ is consistent with VAF v and copy-number proportions μ provided that there exist genotype proportions g such that Eqs. (10) and (12) are satisfied.

In the main text, we introduce the single split copy number assumption, which follows from simple evolutionary models on SNVs and copy-number aberrations: the Dollo model for SNVs and the infinite alleles model for allele-specific copy numbers.

Assumption 1. Single Split Copy Number.

At every SNV locus, there is exactly one copy-number state $(x^{*}, y^{*})$ with two distinct genotypes $(x^{*}, y^{*}, 0)$ and $(x^{*}, y^{*}, m^{*})$.

We claim that, for a single split copy number genotype set Γ*, consistent genotype proportions g are unique whenever they exist. Specifically, they take the following form.

Lemma 1.

Given VAF v, copy-number proportions μ, and a single split copy number genotype set Γ*, if genotype proportions g exist such that $(Γ^{*}, g)$ are consistent with VAF v and copy-number proportions μ, they are uniquely determined as

g_{(x, y, m)} = \begin{cases} μ_{(x, y)}, & \text{if } (x, y) \neq (x^{*}, y^{*}), \\ (1 - λ) \cdot μ_{(x^{*}, y^{*})}, & \text{if } (x, y) = (x^{*}, y^{*}) \text{ and } m = 0, \\ λ \cdot μ_{(x^{*}, y^{*})}, & \text{if } (x, y) = (x^{*}, y^{*}) \text{ and } m = m^{*}, \end{cases}

(13)

where

λ = \frac{1}{m^{*} μ_{(x^{*}, y^{*})}} \left[ v F - \sum_{\substack{(x, y, m) \in Γ^{*} \\ (x, y) \neq (x^{*}, y^{*})}} m \cdot μ_{(x, y)} \right].

(14)

This follows from the definition of consistency and Eqs. (10) and (12). Detailed proofs of this and subsequent lemmas and theorems are provided in the Proofs section. Note that λ has a natural interpretation as the proportion of cells with copy-number state $(x^{*}, y^{*})$ that have the mutation, thus quantifying the “split” of the single split copy number state.
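The closed form in Lemma 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not the DeCiFer implementation: the function name and data layout are hypothetical, Γ* is given as a list of (x, y, m) tuples, and μ is assumed to give the total proportion of cells in each copy-number state, so that the μ values sum to 1 and F equals the sum of (x + y) · μ_{(x, y)}.

```python
def single_split_proportions(v, mu, gamma_star):
    """Return (lam, g) from Eqs. (13) and (14) for a single split genotype set."""
    # Group the mutation multiplicities by copy-number state.
    by_state = {}
    for (x, y, m) in gamma_star:
        by_state.setdefault((x, y), []).append(m)

    # The split state is the only copy-number state with two genotypes.
    split_states = [s for s, ms in by_state.items() if len(ms) == 2]
    if len(split_states) != 1:
        raise ValueError("expected exactly one split copy-number state")
    split = split_states[0]
    m_star = max(by_state[split])

    # Fractional copy number, assuming mu sums to 1 over copy-number states.
    F = sum((x + y) * p for (x, y), p in mu.items())

    # Mutated copies contributed by the non-split states.
    other = sum(m * mu[(x, y)] for (x, y, m) in gamma_star if (x, y) != split)

    lam = (v * F - other) / (m_star * mu[split])  # Eq. (14)

    g = {}
    for (x, y, m) in gamma_star:  # Eq. (13)
        if (x, y) != split:
            g[(x, y, m)] = mu[(x, y)]
        elif m == 0:
            g[(x, y, m)] = (1 - lam) * mu[split]
        else:
            g[(x, y, m)] = lam * mu[split]
    return lam, g


# Hypothetical input: state (2, 1) is split between genotypes (2, 1, 0) and (2, 1, 1).
mu = {(1, 1): 0.2, (2, 1): 0.8}
gamma_star = [(1, 1, 0), (2, 1, 0), (2, 1, 1)]
lam, g = single_split_proportions(v=0.5 / 2.8, mu=mu, gamma_star=gamma_star)
```

For this input λ is approximately 0.625, meaning that about 62.5% of the cells with copy-number state (2, 1) carry the SNV. The sketch does not check that λ lies in [0, 1].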

Lemma 1 gives the closed form of consistent genotype proportions, but such proportions may not exist for a given VAF v, copy-number proportions μ, and single split copy number genotype set Γ*. Below, we describe two conditions that together are necessary and sufficient for their existence.

Lemma 2.

Given VAF v, copy-number proportions μ, and a single split copy number genotype set Γ*, genotype proportions g such that $(Γ^{*}, g)$ are consistent with VAF v and copy-number proportions μ exist if and only if

1. For all copy-number states (x, y) such that $μ_{(x, y)} > 0$, there exists a genotype $(x, y, m) \in Γ^{*}$ for some m.
2. λ, as defined in Eq. (14), is in the range [0, 1].

The constraint $λ \in [0, 1]$ also defines the range of VAFs that are possible for a given μ and Γ*.
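The two conditions of Lemma 2 translate into a simple check, sketched below. The function name, the numerical tolerance, and the choice to take a precomputed λ (from Eq. (14)) as input are assumptions made for illustration only.

```python
def consistent_proportions_exist(mu, gamma_star, lam, tol=1e-12):
    """Check the two conditions of Lemma 2 for a precomputed value of lam."""
    # Condition 1: every copy-number state with positive proportion
    # has at least one genotype in the genotype set.
    covered = {(x, y) for (x, y, m) in gamma_star}
    all_states_covered = all(state in covered for state, p in mu.items() if p > 0)

    # Condition 2: the splitting parameter lies in [0, 1], up to a tolerance.
    lam_in_range = -tol <= lam <= 1 + tol

    return all_states_covered and lam_in_range


mu = {(1, 1): 0.2, (2, 1): 0.8}
gamma_star = [(1, 1, 0), (2, 1, 0), (2, 1, 1)]
print(consistent_proportions_exist(mu, gamma_star, lam=0.625))
```

The tolerance only absorbs floating-point rounding at the boundaries of [0, 1] and has no counterpart in the lemma itself.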

Lemma 3.

A VAF v is feasible for copy-number proportions μ and a single split copy number genotype set Γ* consistent with μ provided v is in the range $[v^{-}, v^{+}]$ where

v^{-} = \frac{1}{F} \sum_{\substack{(x, y, m) \in Γ^{*} \\ (x, y) \neq (x^{*}, y^{*})}} m \cdot μ_{(x, y)} \quad \text{and} \quad v^{+} = v^{-} + \frac{m^{*} \cdot μ_{(x^{*}, y^{*})}}{F}.

(15)
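The feasible VAF interval of Lemma 3 can be computed directly from Eq. (15), as in the following sketch. The function name, the explicit `split_state` and `m_star` arguments, and the example values are hypothetical; as before, we assume that μ sums to 1 over copy-number states, so that F is the sum of (x + y) · μ_{(x, y)}.

```python
def feasible_vaf_range(mu, gamma_star, split_state, m_star):
    """Return (v_minus, v_plus) from Eq. (15)."""
    # Fractional copy number, assuming mu sums to 1 over copy-number states.
    F = sum((x + y) * p for (x, y), p in mu.items())

    # Mutated copies in states other than the split state: these are present
    # regardless of lam, so they set the lower bound.
    fixed = sum(m * mu[(x, y)] for (x, y, m) in gamma_star if (x, y) != split_state)

    v_minus = fixed / F                                 # lam = 0
    v_plus = v_minus + m_star * mu[split_state] / F     # lam = 1
    return v_minus, v_plus


# Hypothetical locus: the SNV arises in state (2, 1) and is later amplified
# to two copies in state (3, 1).
mu = {(1, 1): 0.2, (2, 1): 0.5, (3, 1): 0.3}
gamma_star = [(1, 1, 0), (2, 1, 0), (2, 1, 1), (3, 1, 2)]
v_minus, v_plus = feasible_vaf_range(mu, gamma_star, split_state=(2, 1), m_star=1)
```

Here F is approximately 3.1, so the interval is approximately [0.194, 0.355]. An observed VAF outside this interval would rule out this genotype set for the given copy-number proportions.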

As the CCF is determined by genotype proportions, we can use Lemma 1 to characterize CCF c in terms of λ.

Lemma 4.

Given (i) copy-number proportions μ, (ii) a single split copy number genotype set Γ*, and (iii) the tumor purity ρ, the CCFs c resulting from genotype proportions g such that $(Γ^{*}, g)$ is consistent with μ are uniquely determined as

c = \frac{1}{ρ} \left[ λ \cdot μ_{(x^{*}, y^{*})} + \sum_{\substack{(x, y, m) \in Γ_{CCF} \\ (x, y) \neq (x^{*}, y^{*})}} μ_{(x, y)} \right]

(16)

for any value of the splitting parameter λ ∈ [0, 1].
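Eq. (16) shows that the CCF is a linear function of the splitting parameter λ. A minimal sketch follows, in which the function name, arguments, and example values are hypothetical and μ is again assumed to sum to 1 over copy-number states, including the normal state.

```python
def ccf_from_split(lam, mu, gamma_star, split_state, purity):
    """Eq. (16): CCF as a function of the splitting parameter lam."""
    # Non-split states whose genotype carries the SNV contribute fully.
    mutated_other = sum(
        mu[(x, y)]
        for (x, y, m) in gamma_star
        if m >= 1 and (x, y) != split_state
    )
    # The split state contributes only the fraction lam of its cells.
    return (lam * mu[split_state] + mutated_other) / purity


mu = {(1, 1): 0.2, (2, 1): 0.8}
gamma_star = [(1, 1, 0), (2, 1, 0), (2, 1, 1)]
for lam in (0.0, 0.625, 1.0):
    print(lam, ccf_from_split(lam, mu, gamma_star, split_state=(2, 1), purity=0.8))
```

In this example all tumor cells share the split state (2, 1), so the CCF ranges from 0 at λ = 0 to 1 at λ = 1 and coincides with λ in between.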

These results jointly yield Theorem 1. For a detailed derivation, see Proofs.