Abstract
Motivation:
New low-coverage single-cell DNA sequencing technologies enable the measurement of copy number profiles from thousands of individual cells within tumors. From this data, one can infer the evolutionary history of the tumor by modeling transformations of the genome via copy number aberrations. A widely used model to infer such copy number phylogenies is the copy number transformation (CNT) model in which a genome is represented by an integer vector and a copy number aberration is an event that either increases or decreases the number of copies of a contiguous segment of the genome. The CNT distance between a pair of copy number profiles is the minimum number of events required to transform one profile to another. While this distance can be computed efficiently, no efficient algorithm has been developed to find the most parsimonious phylogeny under the CNT model.
Results:
We introduce the zero-agnostic copy number transformation (ZCNT) model, a simplification of the CNT model that allows the amplification or deletion of regions with zero copies. We derive a closed form expression for the ZCNT distance between two copy number profiles and show that, unlike the CNT distance, the ZCNT distance forms a metric. We leverage the closed-form expression for the ZCNT distance and an alternative characterization of copy number profiles to derive polynomial time algorithms for two natural relaxations of the small parsimony problem on copy number profiles. While the alteration of zero copy number regions allowed under the ZCNT model is not biologically realistic, we show on both simulated and real datasets that the ZCNT distance is a close approximation to the CNT distance. Extending our polynomial time algorithm for the ZCNT small parsimony problem, we develop an algorithm, Lazac, for solving the large parsimony problem on copy number profiles. We demonstrate that Lazac outperforms existing methods for inferring copy number phylogenies on both simulated and real data.
1. Introduction
Tumor evolution is characterized by both small and large genomic alterations that alter the fitness of cancer cells [31]. Copy number aberrations, i.e. modifications to the number of copies of a genomic segment, are an important and frequent sub-class of such alterations that drive prognostic and metastatic outcomes [1]. Deriving the evolutionary history of copy number aberrations, herein referred to as copy number phylogenies, is thus important for understanding the emergence of primary tumors and the development of subpopulations of cells that evade treatment and/or metastasize to other anatomical sites.
Recent technological and computational improvements in single-cell sequencing have enabled the mapping of high resolution copy number profiles in single cells. For example, the high-throughput 10x Genomics Single-cell Copy Number Variation solution [2, 48] produces ultra-low coverage (< 0.05×) whole genome sequencing data from ≈ 2000 individual cells. Other recent technologies, including DLP/DLP+ [49, 23, 15] and ACT [28], produce similar data. Multiple computational methods [48, 45, 44, 10, 20, 24] have been introduced to infer high resolution copy number profiles, integer vectors that contain the number of copies of each genomic segment, from this type of data. Other recent methods can infer copy number profiles from thousands of cells or spatial locations from single-cell RNA sequencing (scRNA-seq) [16], scATAC-seq [47], or spatial transcriptomics data [12].
The increasing availability of technologies to measure genomic copy number in thousands of cells motivates the development of methods to infer cellular phylogenies from copy number profiles. However, there are multiple challenges in inferring phylogenies from copy number profiles. First, copy number aberrations are diverse, ranging from small duplications and deletions [3] to whole chromosome shattering and reconstruction events [41]. Second, a single copy number aberration can alter the number of copies of a large section of the genome simultaneously. This means that loci on the genome cannot be treated as independent phylogenetic characters, a widely-used assumption in phylogenetics [18, 19, 39, 46]. Finally, the increasing size (> 10,000 cells) and resolution (< 5Kb bins) of copy number profiles require increasingly scalable algorithms.
One widely used model of copy number evolution is the copy number transformation (CNT) model [38]. In the CNT model, a genome is represented as a vector of non-negative integers and copy number aberrations correspond to the increase or decrease of the entries in a contiguous interval of coordinates in the vector, explicitly modeling the non-independence of copy number amplifications and deletions. The CNT distance is the minimum number of copy number events needed to transform one profile to another. The CNT distance is computable in linear time [51] and has been used to define an evolutionary distance between profiles. Since the CNT distance is not symmetric, a variety of symmetrized CNT distances have also been used to construct copy number phylogenies using distance-based phylogenetic methods [38, 11, 50, 22]. Further, owing to its effectiveness, the CNT model has become the basis of a variety of distinct models [50, 22, 8] for copy number evolution.
While the CNT model is described by specific events (or mutations), there has been little work on constructing phylogenetic trees under the CNT model using the method of maximum parsimony. Even the small parsimony problem, where the topology of the tree is given and one aims to infer the ancestral profiles that minimize the total number of copy number events on the tree, has no known efficient solution. For example, for the special case of a two leaf tree, the best algorithm for the CNT small parsimony problem [51] has a running time that depends on both the largest allowed copy number and the number of loci [11]. Without an efficient algorithm for the small parsimony problem under the CNT model, one cannot hope to solve the large parsimony problem, where the topology of the tree is unknown.
We introduce a relaxation of the CNT model, called the zero-agnostic copy number transformation (ZCNT) model, that approximates the CNT model and has a number of desirable properties. Unlike the CNT model, the ZCNT model allows for the amplification of zero copy number regions. While such an operation is not biologically realistic, we show that this relaxation makes the ZCNT distance a metric, in contrast to the CNT distance. Moreover, we derive a closed form expression for the ZCNT distance between two profiles. We use this closed form expression as well as an alternative characterization of copy number profiles to solve two relaxations of the small parsimony problem in polynomial time. To our knowledge, this is the first attempt to solve the small parsimony problem for a segment-based (i.e. non-independent) model of copy number evolution. We then use our efficient algorithm for the (relaxed) small parsimony problem to design an algorithm, Lazac (Large-scale Analysis of Zero Agnostic Copy number), for inferring copy number phylogenies by solving the large parsimony problem. We show on simulated data that Lazac is > 100× faster than other phylogenetic methods and also more accurate in recovering the ground truth phylogeny. On single-cell whole-genome sequencing data from human breast and ovarian tumors, Lazac finds phylogenies that are more consistent with both copy number clones and single-nucleotide variants (SNVs).
2. Copy number transformations
A copy number profile $u = [u_1, \ldots, u_m]$ is a vector of non-negative integers where $u_i$ is the number of copies of locus $i$. Suppose we measure the copy number profile of $n$ cells of a tumor across $m$ loci in a single-cell DNA sequencing experiment. We encode the copy number profiles in an $n \times m$ copy number matrix $C = [c_{i,j}]$ where $c_{i,j}$ is the copy number of cell $i$ at locus $j$. The copy number profile of cell $i$ is then the $i$-th row of this matrix, and is denoted $C_i$.
One of the most basic phylogenetic principles is that nearly perfect measurement of evolutionary distances enables exact recovery of the evolutionary history [40]. It is thus not surprising that many of the successful attempts at inferring copy number phylogenies focus on finding good methods to compute an evolutionary distance between a pair of copy number profiles. Early methods for computing evolutionary distances on copy number data [3, 7] employed simple measures of distance such as the Hamming, weighted Hamming, and distance between copy number profiles. However, these distances do not account for dependencies between loci caused by long CNAs spanning contiguous segments of the genome, leading to inaccurate phylogenetic reconstruction [38, 22].
In this section, we describe and investigate the copy number transformation (CNT) model, one of the most well-known and successful evolutionary models for copy number evolution in cancer. The CNT model was originally introduced in MEDICC [38] and extended in subsequent studies [51, 11, 50, 8, 22]. Since the CNT model only allows intrachromosomal copy number events, it is sufficient to consider the case of a single chromosome, and thus for ease of exposition we will describe the model using a single chromosome.
The fundamental operation in the CNT model is a copy number event which increases or decreases (by one) the entries in a contiguous interval of a copy number profile, defined formally as follows.
Definition 1 (Copy number event).
A copy number event is a function $c_{s,t,b}$ that maps a copy number profile $u$ with $m$ loci to a profile $c_{s,t,b}(u)$ described by its entries as
$$c_{s,t,b}(u)_i = \begin{cases} u_i + b & \text{if } s \le i \le t \text{ and } u_i \neq 0, \\ u_i & \text{otherwise,} \end{cases}$$
where $1 \le s \le t \le m$ and $b \in \{+1, -1\}$. We denote such a function as $c$ when clear by context.
That is, an amplification (resp. deletion) increases (resp. decreases) the copy number of all non-zero entries in the interval between positions $s$ and $t$, or alternatively a copy number event skips the zero entries (Figure 1). Thus, once a locus is lost (i.e. $u_i = 0$), the locus cannot be regained or deleted further. A copy number transformation is the composition $c_k \circ \cdots \circ c_1$ of multiple copy number events $c_1, \ldots, c_k$.
Figure 1:
(a) The results of applying a copy number transformation $c_{2,6,+1}$ under both the CNT model and ZCNT model. Under the CNT model a zero cannot be increased via the amplification, but the zero-agnostic CNT model allows the zero to increase to one copy. (b) An instance of the ZCNT small parsimony problem: given a tree with copy number profiles labeling the leaves, the goal is to infer the ancestral copy number profiles that minimize the total ZCNT distance across all edges.
Several copy number problems have been previously studied to compute evolutionary distances under the CNT model. The first, and simplest, is the copy number transformation problem, originally introduced in [38], which defines a distance between two copy number profiles. Put simply, the distance between two profiles is the length of the shortest copy number transformation needed to transform one profile to another.
Definition 2 (Copy number transformation distance).
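Given two copy number profiles $u$ and $v$, the copy number transformation distance is
$$d_{\mathrm{CNT}}(u, v) = \min \{\, k : (c_k \circ \cdots \circ c_1)(u) = v \,\}$$
where $c_k \circ \cdots \circ c_1$ is a CNT composed of $k$ copy number events. Alternatively, $d_{\mathrm{CNT}}(u, v) = \infty$ if no such transformation exists.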
[51, 43] show there is a (non-trivial) strongly linear time algorithm (i.e. with time complexity linear in the number of loci) for computing the CNT distance. Unfortunately, the CNT distance is not symmetric (i.e. the distance from $u$ to $v$ need not equal the distance from $v$ to $u$), which makes it difficult to use in distance-based phylogenetic methods such as neighbor joining [34].
In order to apply distance-based phylogenetic methods, multiple approaches to symmetrize the distance have been introduced. [51] use a mean correction, replacing the asymmetric CNT distance with a symmetric distance obtained by averaging the distances in the two directions.
Alternatively, several authors [38, 22, 11] define the distance between two profiles in terms of a closely related median profile $w$. Specifically, the median distance between two profiles $u$ and $v$ is defined to be the smallest value of
$$d_{\mathrm{CNT}}(w, u) + d_{\mathrm{CNT}}(w, v)$$
over all profiles $w$, where $d_{\mathrm{CNT}}$ denotes the CNT distance. Computing this median distance is called the copy number triplet problem in [11]. Unfortunately, no efficient algorithm is known for the copy number triplet problem. The fastest known algorithm requires time and space that grow with the maximum allowed copy number [11].
3. Small and large copy number parsimony
The small parsimony problem for copy number profiles is the following: given a tree whose leaves are labeled by copy number profiles, infer ancestral copy number profiles that minimize the total dissimilarity between profiles across all edges (Figure 1). For evolutionary models in which each character evolves independently and has finitely many states (e.g. single nucleotide substitution models), the small parsimony problem is solved in polynomial time via Sankoff’s algorithm, a dynamic programming algorithm [36]. Unfortunately, the CNT model presents two major challenges in solving the small parsimony problem. First, since copy number events affect multiple loci simultaneously, the loci cannot be analyzed independently, in contrast to most phylogenetic characters. Second, the space of possible copy number profiles is a priori unbounded, since the maximum copy number of a segment in a genome is unknown. Thus, it is not surprising that there is no published solution to the small parsimony problem for CNT dissimilarity, with the exception of the special case of two-leaf trees [11]. Here, we formalize both the CNT small parsimony problem and the corresponding large parsimony problem, the latter of which was previously described in [11].
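A copy number phylogeny $(T, \ell)$ is a rooted tree $T$ and leaf labeling $\ell$ that assigns a copy number profile to each leaf. Let $E(T)$, $V(T)$, and $L(T)$ denote the edges, vertices, and leaves of $T$, respectively. In our applications below, each leaf of $T$ represents one of the cells (or bulk samples) from a tumor. An ancestral labeling of a copy number phylogeny $(T, \ell)$ is a vertex labeling $\ell'$ of $T$ that agrees with $\ell$ on the leaves of $T$, i.e. $\ell'(v) = \ell(v)$ when $v \in L(T)$. We say that $(T, \ell)$ is a copy number phylogeny for copy number matrix $C$ if $T$ has $n$ leaves such that $\ell$ labels each leaf by a row of $C$. Formally, if $(T, \ell)$ is a copy number phylogeny for a copy number matrix $C$, then there exists a cell assignment $\sigma$ that assigns each cell $i$ to a leaf $\sigma(i)$ such that $\ell(\sigma(i)) = C_i$, the $i$-th row of $C$.
We define the cost of a vertex labeled copy number phylogeny $(T, \ell')$ as the total number of copy number events required to explain the phylogeny:
$$\mathrm{cost}(T, \ell') = \sum_{(w, w') \in E(T)} d_{\mathrm{CNT}}(\ell'(w), \ell'(w')),$$
where $d_{\mathrm{CNT}}$ denotes the CNT distance from the profile of the parent $w$ to the profile of the child $w'$.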
We now introduce the small parsimony problem [14] under the copy number transformation model.
Problem 1 (CNT Small parsimony).
Given a copy number phylogeny, find an ancestral labeling such that (i) the labeling agrees with the given leaf labeling for all leaves and (ii) the cost of the labeled phylogeny is minimized.
The parsimony score is defined as the cost of the solution to the CNT small parsimony problem. To the best of our knowledge the CNT small parsimony problem (Problem 1) has not been analyzed in the literature. We believe this is due to the difficulty of solving the CNT small parsimony problem. That is, for even a special case of two-leaf trees, referred to as the copy number triplet problem [11], no strongly polynomial time algorithm is known (Section 2).
The CNT large parsimony problem, defined in [11], aims to find a vertex labeled copy number phylogeny for a matrix with minimum cost.
Problem 2 (CNT Large parsimony).
Given a copy number matrix $C$, find a copy number phylogeny for $C$ and an ancestral labeling such that the cost of the labeled phylogeny is minimized.
Unsurprisingly, [11] showed that the above large parsimony problem (Problem 2) is NP-hard. They also formulated an integer linear program (ILP) to solve the problem exactly. However, this ILP consists of variables and does not scale to the size of current real data sets with thousands of cells.
4. The zero-agnostic CNT model
The copy number transformation (CNT) model imposes the constraint that once a locus is lost (has zero copy number), the locus remains with zero copies for all time. While this constraint is biologically realistic, the constraint also makes the inference problems – including the CNT small (and large) parsimony problems – computationally hard to solve. Here, we show that relaxing the constraint that copy number events do not alter zero entries leads to a simpler model with favourable mathematical properties. We call this the zero-agnostic copy number model (Figure 1) to indicate that the model allows the amplification and deletion of loci with zero copies. Formally, we define a zero-agnostic copy number event as follows.
Definition 3 (Zero-agnostic copy number event).
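A zero-agnostic copy number event is a function $c_{s,t,b}$ that maps a profile $u$ with $m$ loci to a profile $c_{s,t,b}(u)$ described by its entries as
$$c_{s,t,b}(u)_i = \begin{cases} u_i + b & \text{if } s \le i \le t, \\ u_i & \text{otherwise,} \end{cases}$$
where $1 \le s \le t \le m$ and $b \in \{+1, -1\}$. We denote such a function as $c$ when clear by context.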
Thus, a zero-agnostic copy number event either increases or decreases the number of copies of all loci in the interval regardless of whether the loci have zero copies. While our formulation allows for the number of copies of a locus to decrease below zero, one can show that given two profiles with non-negative entries, it is always possible to find an optimal ZCNT such that no intermediate profile has negative entries. Specifically, as a corollary to the commutativity of zero-agnostic copy number transformations (Proposition 2), one can re-order events such that amplifications always occur first.
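To make the difference between Definitions 1 and 3 concrete, the following minimal Python sketch applies the same event to one profile under both models. The function names `cnt_event` and `zcnt_event`, the 1-indexed inclusive interval convention, and the example profile are assumptions made for illustration; they are not taken from the Lazac implementation.

```python
from typing import List


def cnt_event(profile: List[int], s: int, t: int, b: int) -> List[int]:
    """CNT event: add b (+1 or -1) to every NON-ZERO entry at positions s..t.

    Positions are 1-indexed and inclusive (an assumption of this sketch).
    """
    assert 1 <= s <= t <= len(profile) and b in (+1, -1)
    return [x + b if s <= i <= t and x != 0 else x
            for i, x in enumerate(profile, start=1)]


def zcnt_event(profile: List[int], s: int, t: int, b: int) -> List[int]:
    """Zero-agnostic event: add b to every entry at positions s..t, zeros included."""
    assert 1 <= s <= t <= len(profile) and b in (+1, -1)
    return [x + b if s <= i <= t else x
            for i, x in enumerate(profile, start=1)]


# Hypothetical example profile with a lost locus at position 3.
profile = [2, 2, 0, 1, 2, 2, 2]
print(cnt_event(profile, 2, 6, +1))   # the zero at position 3 is skipped
print(zcnt_event(profile, 2, 6, +1))  # the zero at position 3 becomes 1
```

Under the zero-agnostic model every event is undone by the opposite event on the same interval, which is not true under the CNT model once an entry reaches zero.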
Due to space constraints, we do not include all proofs in the main text. Any proof not present in the main text can be found in Supplementary Proofs C.
4.1. Delta profiles
We simplify our analysis of zero-agnostic copy number events by examining their effect on the differences between the copy number of adjacent loci. In particular, while a zero-agnostic copy number event on the interval from $s$ to $t$ increments (or decrements) all entries $u_i$ where $s \le i \le t$, it only alters two differences between adjacent loci, namely the difference $u_s - u_{s-1}$ and the difference $u_{t+1} - u_t$. To formalize this idea, we first define delta profiles, vectors obtained by taking the differences in copy number between adjacent loci.
Definition 4.
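A delta profile $q \in \mathbb{Z}^{m+1}$ is any vector that satisfies the balancing condition:
$$\sum_{i=1}^{m+1} q_i = 0. \tag{1}$$
Or equivalently, $q_{m+1} = -\sum_{i=1}^{m} q_i$. We denote the set of delta profiles in $\mathbb{Z}^{m+1}$ as $\mathcal{D}_m$.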
The above definition provides us with a convenient (and useful) description of the image of the following difference transformation, which we call the delta map.
Definition 5.
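The delta map $\delta$ maps a copy number profile $u$ with $m$ loci to a delta profile $\delta(u)$ with $m + 1$ entries by taking the differences in adjacent copy number loci. Specifically,
$$\delta(u)_i = \begin{cases} u_1 - 2 & \text{if } i = 1, \\ u_i - u_{i-1} & \text{if } 1 < i \le m, \\ 2 - u_m & \text{if } i = m + 1, \end{cases}$$
where the constant 2 represents a normal, diploid copy number.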
A basic property of the delta map is that it is invertible.
Proposition 1.
The delta map is invertible.
Since the delta map is one-to-one and onto with respect to the set of delta profiles, each delta profile then corresponds to a unique copy number profile.
Interestingly, a copy number event applied to a copy number profile only affects two entries of the delta profile , meaning that loci of the corresponding delta profile are (nearly) independent. We formalize this in the following definition of a delta event.
Definition 6 (Delta event).
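A delta event is a function $e_{s,t,b}$ that maps a delta profile $q$ to a delta profile $e_{s,t,b}(q)$ described by its entries as
$$e_{s,t,b}(q)_i = \begin{cases} q_i + b & \text{if } i = s, \\ q_i - b & \text{if } i = t, \\ q_i & \text{otherwise,} \end{cases}$$
where $s < t$ and $b \in \{+1, -1\}$. We denote such a function as $e$ when clear by context.
A delta transformation is the composition $e_k \circ \cdots \circ e_1$ of multiple delta events $e_1, \ldots, e_k$. We now state the connection between delta events and zero-agnostic copy number (ZCNT) events in the following theorem and corollary.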
Theorem 1.
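Let $c_{s,t,b}$ be a zero-agnostic copy number event on the interval from $s$ to $t$ and $e_{s,t+1,b}$ be the delta event that adds $b$ to entry $s$ and subtracts $b$ from entry $t + 1$. Then, for every profile $u$,
$$\delta(c_{s,t,b}(u)) = e_{s,t+1,b}(\delta(u)).$$
Corollary 1.
Let $c_k \circ \cdots \circ c_1$ be a zero-agnostic copy number transformation and $e_k \circ \cdots \circ e_1$ be the corresponding delta transformation. Then, for every profile $u$,
$$\delta((c_k \circ \cdots \circ c_1)(u)) = (e_k \circ \cdots \circ e_1)(\delta(u)).$$
Proof.
The corollary follows by induction on $k$ and repeated application of Theorem 1. ■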
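The correspondence between zero-agnostic events and delta events can be checked numerically. The sketch below pads a profile with the diploid constant 2 on both ends, as described for the delta map, and verifies on one example that applying an event and then the delta map gives the same result as applying the delta map and then the matching delta event. All function names and the 1-indexed conventions are assumptions of this sketch.

```python
from typing import List


def delta_map(u: List[int]) -> List[int]:
    """Differences between adjacent loci after padding both ends with 2."""
    padded = [2] + list(u) + [2]
    return [padded[i] - padded[i - 1] for i in range(1, len(padded))]


def inverse_delta_map(q: List[int]) -> List[int]:
    """Recover the profile by a running sum that starts from the diploid value 2."""
    profile, running = [], 2
    for x in q[:-1]:          # the last entry is fixed by the balancing condition
        running += x
        profile.append(running)
    return profile


def zcnt_event(u: List[int], s: int, t: int, b: int) -> List[int]:
    """Zero-agnostic event on the 1-indexed inclusive interval s..t."""
    return [x + b if s <= i <= t else x for i, x in enumerate(u, start=1)]


def delta_event(q: List[int], s: int, t: int, b: int) -> List[int]:
    """Delta event: add b to entry s and subtract b from entry t (1-indexed)."""
    out = list(q)
    out[s - 1] += b
    out[t - 1] -= b
    return out


u = [2, 3, 0, 1]                      # hypothetical example profile
q = delta_map(u)
print(q, sum(q))                      # the entries sum to zero
print(inverse_delta_map(q) == u)      # the delta map is invertible
# Event on loci 2..3 corresponds to the delta event on entries 2 and 4.
print(delta_map(zcnt_event(u, 2, 3, +1)) == delta_event(q, 2, 4, +1))
```

Because only two entries of the delta profile change per event, the analysis that follows can work entirely in delta space.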
4.2. Computing the ZCNT distance
Let $d_{\mathrm{ZCNT}}(u, v)$ be the minimum number of zero-agnostic copy number events needed to transform the copy number profile $u$ to $v$. In this section we derive a closed form expression for $d_{\mathrm{ZCNT}}(u, v)$.
We begin by noting that $d_{\mathrm{ZCNT}}(u, v)$ is equal to the minimum number of delta events needed to transform delta profile $\delta(u)$ to $\delta(v)$, which we denote $d_{\Delta}(\delta(u), \delta(v))$. This follows from the equivalence between the copy number transformations and the corresponding delta transformation (Corollary 1). Thus, it suffices to only consider delta profiles and delta events; for the rest of the section all profiles $q$ and $q'$ are delta profiles unless otherwise specified.
We start by observing two basic facts: delta transformations are commutative and $d_{\Delta}$ forms a metric.
Proposition 2.
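A delta transformation is commutative. That is, the application of $e_k \circ \cdots \circ e_1$ to a profile $q$ is identical to the application of $e_{\pi(k)} \circ \cdots \circ e_{\pi(1)}$ where $\pi$ is any permutation of $\{1, \ldots, k\}$.
Proposition 3.
$d_{\Delta}$ is a distance metric. Further, $d_{\Delta}(q, q') = d_{\Delta}(q - q', \mathbf{0})$.
Note that this also implies that zero-agnostic copy number transformations are commutative and that $d_{\mathrm{ZCNT}}$ is a distance metric. To see this, let $c_k \circ \cdots \circ c_1$ be a zero-agnostic copy number transformation and $e_k \circ \cdots \circ e_1$ be the corresponding delta transformation, then for any vector $u$ and permutation $\pi$,
$$\delta((c_{\pi(k)} \circ \cdots \circ c_{\pi(1)})(u)) = (e_{\pi(k)} \circ \cdots \circ e_{\pi(1)})(\delta(u)) = (e_k \circ \cdots \circ e_1)(\delta(u)) = \delta((c_k \circ \cdots \circ c_1)(u))$$
where the first equivalence follows from Corollary 1, the second from Proposition 2, and the third from Corollary 1. Since the delta map is invertible, this implies that $(c_{\pi(k)} \circ \cdots \circ c_{\pi(1)})(u) = (c_k \circ \cdots \circ c_1)(u)$, which proves that a zero-agnostic copy number transformation is commutative. To see that $d_{\mathrm{ZCNT}}$ is a distance metric, it suffices to observe that $d_{\mathrm{ZCNT}}(u, v) = d_{\Delta}(\delta(u), \delta(v))$ implies symmetry and reflexivity. Then, the triangle inequality is satisfied since the composition of a zero-agnostic copy number transformation from $u$ to $v$ and $v$ to $w$ yields a copy number transformation from $u$ to $w$.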
From our characterization of delta profiles, we derive our expression for the distance between delta profiles.
Theorem 2.
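For delta profiles $q$ and $q'$, $d_{\Delta}(q, \mathbf{0}) = \frac{1}{2} \lVert q \rVert_1$. Thus, $d_{\Delta}(q, q') = \frac{1}{2} \lVert q - q' \rVert_1$.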
Proof.
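Since each event decreases the total magnitude $\lVert q \rVert_1 = \sum_i |q_i|$ of $q$ by at most two, to transform $q$ to the $\mathbf{0}$ profile requires at least $\frac{1}{2} \lVert q \rVert_1$ events.
We prove the other direction by induction on $\lVert q \rVert_1$. Clearly, if the sum is zero, the claim holds. Otherwise, since the entries of $q$ sum to zero by the balancing condition (1), $q$ has both a positive entry $q_i > 0$ and a negative entry $q_j < 0$, and we can choose $e$ to be any event that decrements $q_i$ and increments $q_j$. Applying $e$ to $q$ results in a delta profile $q'$ such that $\lVert q' \rVert_1 = \lVert q \rVert_1 - 2$. Invoking the induction hypothesis then yields a sequence of $\frac{1}{2} \lVert q \rVert_1$ events to transform $q$ to the $\mathbf{0}$ profile.
The second statement follows from Proposition 3. ■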
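The inductive argument in the proof is constructive, and a short Python sketch can make it explicit. The function below repeatedly pairs a negative entry with a positive entry of a delta profile and records one delta event per step, so that the number of recorded events equals half the sum of absolute values of the entries. The event encoding `(s, t, b)` with `s < t` (add `b` to entry `s`, subtract `b` from entry `t`, 1-indexed) and the choice of always taking the first negative and first positive entry are assumptions of this sketch; any other pairing would work equally well.

```python
from typing import List, Tuple

Event = Tuple[int, int, int]  # (s, t, b): add b to entry s, subtract b from entry t


def apply_delta_event(q: List[int], s: int, t: int, b: int) -> List[int]:
    out = list(q)
    out[s - 1] += b
    out[t - 1] -= b
    return out


def delta_events_to_zero(q: List[int]) -> List[Event]:
    """Greedy sequence of delta events transforming q to the all-zero profile."""
    assert sum(q) == 0, "a delta profile must satisfy the balancing condition"
    q = list(q)
    events: List[Event] = []
    while any(q):
        # Entries sum to zero and are not all zero, so both signs occur.
        neg = next(i for i, x in enumerate(q) if x < 0)
        pos = next(i for i, x in enumerate(q) if x > 0)
        # Increment the negative entry and decrement the positive one,
        # written as an event (s, t, b) with s < t.
        event = (neg + 1, pos + 1, +1) if neg < pos else (pos + 1, neg + 1, -1)
        q = apply_delta_event(q, *event)
        events.append(event)
    return events


q = [0, 1, -3, 1, 1]              # hypothetical delta profile
events = delta_events_to_zero(q)
print(len(events), sum(abs(x) for x in q) // 2)  # both counts agree
```

Each step lowers the sum of absolute values by exactly two, which is why the loop terminates after the number of steps given by the lower bound.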
As a corollary to the above theorem and the equivalence between zero-agnostic copy number transformations and delta transformations (Corollary 1), we have our closed form expression for the ZCNT distance between copy number profiles.
Corollary 2.
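For copy number profiles $u$ and $v$,
$$d_{\mathrm{ZCNT}}(u, v) = \frac{1}{2} \lVert \delta(u) - \delta(v) \rVert_1.$$
Further, as a corollary to the fact that $d_{\mathrm{ZCNT}}$ is a distance metric, the following median distance is trivially computed in linear time:
Corollary 3.
Given two copy number profiles $u$ and $v$, both $w = u$ and $w = v$ minimize the median distance $d_{\mathrm{ZCNT}}(w, u) + d_{\mathrm{ZCNT}}(w, v)$ over all choices of copy number profiles $w$. Thus,
$$\min_{w} \{ d_{\mathrm{ZCNT}}(w, u) + d_{\mathrm{ZCNT}}(w, v) \} = d_{\mathrm{ZCNT}}(u, v).$$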
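The closed form translates directly into a few lines of code. The following Python sketch computes the ZCNT distance as half the sum of absolute differences between the two delta profiles, assuming the delta map pads each profile with the diploid value 2 on both ends. The function names and example profiles are illustrative assumptions rather than the Lazac implementation.

```python
from typing import List


def delta_map(u: List[int]) -> List[int]:
    """Differences between adjacent loci after padding both ends with 2."""
    padded = [2] + list(u) + [2]
    return [padded[i] - padded[i - 1] for i in range(1, len(padded))]


def zcnt_distance(u: List[int], v: List[int]) -> int:
    """Half the L1 distance between delta profiles; runs in linear time."""
    assert len(u) == len(v)
    total = sum(abs(a - b) for a, b in zip(delta_map(u), delta_map(v)))
    # The difference of two delta profiles sums to zero, so its L1 norm is even.
    return total // 2


# Regaining a lost locus costs one event under ZCNT (impossible under CNT).
print(zcnt_distance([2, 0, 2], [2, 1, 2]))
# Two amplifications of the first two loci.
print(zcnt_distance([2, 2, 2, 2], [4, 4, 2, 2]))

# Median property: an intermediate profile w cannot beat w = u or w = v.
u, v, w = [2, 2, 2, 2], [4, 4, 2, 2], [3, 3, 2, 2]
print(zcnt_distance(w, u) + zcnt_distance(w, v) >= zcnt_distance(u, v))
```

The distance is symmetric by construction, so no separate symmetrization step is needed before using it in distance-based methods such as neighbor joining.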
5. Algorithms
5.1. ZCNT small parsimony
We show below that the special form of the ZCNT model enables us to solve two natural relaxations of the small parsimony problem in polynomial time. First, using the equivalence between copy number profiles and delta profiles described above, we formulate the small parsimony problem (Problem 1) using the ZCNT model as follows.
Problem 3 (ZCNT Small Parsimony).
Given a copy number matrix , a tree and cell assignment , find a vertex labeling minimizing
such that the following two conditions are satisfied:
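(i) the leaf assigned to each cell is labeled by the delta profile of that cell, for all cells,
(ii) the label of each vertex satisfies the balancing condition (1), for all vertices.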
To solve the above problem, we recall the general form of the Sankoff-Rousseau recurrence [9, 36] for solving the small parsimony problem. Let $f_v(x)$ be the cost of the optimal labeling of $T_v$ that agrees with copy number matrix $C$ and has label $x$ for the root. Let $T_v$ denote the sub-tree rooted at $v$ and suppose that $v$ has children $w$ and $w'$. Then, by condition (ii), and the requirement that the ancestral labeling lies in the set $\mathcal{D}$ of delta profiles, we have the following recurrence relation [9]:
$$f_v(x) = \min_{y \in \mathcal{D}} \{ f_w(y) + d(x, y) \} + \min_{z \in \mathcal{D}} \{ f_{w'}(z) + d(x, z) \}, \tag{2}$$
where $d$ denotes the ZCNT distance expressed on delta profiles.
This recurrence has several difficulties. First, the labels are unbounded and their entries can take on any integer value. Thus, it is impossible to store a dynamic programming table for $f_v$ without imposing bounds on the maximum copy number. Further, even when the entries are constrained to a bounded interval, the dynamic programming table has size exponential in the number of loci. Second, because of the balancing condition (1), one cannot analyze the loci independently.
Despite these challenges, the recurrence (2) is a substantial improvement over the analogous recurrence under the CNT model. In fact, if we remove either the balancing (1) or the integrality condition, we can solve this recurrence in (resp. strong or weak) polynomial time.
Theorem 3.
If the balancing condition (1) is dropped, the ZCNT small parsimony problem can be solved in time. If the integrality condition is dropped, the ZCNT small parsimony problem can be solved in (weakly) polynomial time using a linear program with variables and constraints.
Both of these facts derive from our closed form expression for the ZCNT distance between two copy number profiles in terms of the $\ell_1$ norm. We sketch the ideas here, and refer to Supplementary Results B.2, B.1 for proofs of these claims.
For the first case, when we drop the balancing condition, we can analyze the loci independently as there is no constraint on the entries of the ancestral profiles. Then, it suffices to observe that since the distance corresponds to the absolute difference, the cost function of the recurrence has a nice structure and we do not have to store an infinitely large dynamic programming table. When the integrality condition is removed, then, since both (i) and (ii) are linear constraints on the profiles and because $\ell_1$ norm minimization can be written as a linear program (LP), there is an LP formulation of the ZCNT small parsimony problem. As it is well known that LPs can be solved in (weakly) polynomial time, this concludes the second case.
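To illustrate the first relaxation, the sketch below scores a fixed binary tree one delta entry at a time. It uses the classical interval recursion for parsimony under absolute-difference costs (often called Wagner parsimony), which is a standard technique swapped in here for illustration; the algorithm actually used by Lazac is described in the supplement and may differ. The nested-tuple tree encoding, the function names, and the convention that the score is half the summed per-entry cost are assumptions of this sketch.

```python
def delta_map(u):
    """Differences between adjacent loci after padding both ends with 2."""
    padded = [2] + list(u) + [2]
    return [padded[i] - padded[i - 1] for i in range(1, len(padded))]


def map_leaves(tree, fn):
    """Apply fn to every leaf; internal nodes are 2-tuples, leaves are lists or ints."""
    if isinstance(tree, tuple):
        return tuple(map_leaves(child, fn) for child in tree)
    return fn(tree)


def wagner_interval(tree):
    """Return (lo, hi, cost) for one integer character on a binary tree.

    [lo, hi] is the set of optimal states at the root of the subtree.
    """
    if not isinstance(tree, tuple):
        return tree, tree, 0
    lo1, hi1, cost1 = wagner_interval(tree[0])
    lo2, hi2, cost2 = wagner_interval(tree[1])
    lo, hi = max(lo1, lo2), min(hi1, hi2)
    if lo <= hi:                       # child intervals overlap: no extra cost
        return lo, hi, cost1 + cost2
    return hi, lo, cost1 + cost2 + (lo - hi)   # pay for the gap between them


def relaxed_small_parsimony(tree):
    """ZCNT small parsimony score with the balancing condition dropped."""
    delta_tree = map_leaves(tree, delta_map)
    leaf = delta_tree
    while isinstance(leaf, tuple):     # walk down to any leaf to get the length
        leaf = leaf[0]
    total = sum(wagner_interval(map_leaves(delta_tree, lambda q, j=j: q[j]))[2]
                for j in range(len(leaf)))
    return total / 2


# Hypothetical three-cell tree; leaves are copy number profiles.
tree = (([2, 2, 2], [2, 3, 2]), [2, 3, 3])
print(relaxed_small_parsimony(tree))
```

On this example the relaxed score is below 2, the cost of labeling both internal vertices with the profile [2, 3, 2], because the relaxed ancestral delta profiles need not sum to zero.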
5.2. Lazac algorithm for ZCNT large parsimony
We develop a tree-search algorithm, Lazac, to find approximate solutions to the ZCNT large parsimony problem (Problem 2). Our procedure searches the space of copy number trees for a given copy number matrix C using sub-tree interchange operations [27] and relies heavily on the efficient algorithm we developed for the small parsimony problem (Problem 3) when the balancing condition (1) is dropped. The procedure is similar to the tree search procedure we developed for lineage tracing data [37]. Complete details on our tree search procedure are in Supplementary Methods A.1. Lazac is implemented in C++17 and is freely available at: github.com/raphael-group/lazac-copy-number.
6. Results
6.1. Comparison of copy number distances and phylogenies on prostate cancer data
We first investigated the differences between the CNT and ZCNT distances on copy number profiles inferred from bulk whole-genome sequencing data from ten metastatic prostate cancer patients [17]. We analyzed the copy number profiles for these patients published in [22]. For each pair of copy number profiles from distinct samples (e.g. anatomical sites) from the same patient, we computed the CNT distance and the ZCNT distance. We found that for all ten patients, the median relative difference over all pairs of samples was less than 5%, and for most patients the relative difference was even smaller (Supplementary Figure 5, 6).
6.2. Evaluation on simulated data
We compared Lazac to several state-of-the-art methods for inferring copy number phylogenies – namely MEDICC2 [22], MEDALT [43], Sitka [35], and WCND [50] – on simulated data.
Lazac inferred the most accurate phylogenetic trees across varying numbers of cells and loci. In particular, we found that on all but one parameter setting, Lazac had the lowest median RF (Figure 2) and Quartet distance (Supplementary Methods A.5) on simulated instances (Supplementary Figure 1). Further, on the large phylogenies, Lazac showed an even larger improvement in median RF (Figure 2) and Quartet distance (Supplementary Methods A.5) over other methods, owing to its scalability. In terms of speed, Lazac was the fastest method on every instance, taking roughly 250 seconds or less to run on the largest simulated dataset containing 600 cells. Further, it was approximately 100 times faster than the other top performing methods Sitka and MEDICC2 (Figure 2b).
Figure 2:
(a) Comparison of reconstruction accuracy (RF distance) on simulated data for several state-of-the-art methods for copy number tree reconstruction with varying number of cells across four sets of loci and seven random seeds. (b) Timing results for varying number of cells and a fixed number of loci. As MEDICC2 was too slow to run on more than 150 cells (with a 2 hour time limit), we exclude it from comparisons with more than 150 cells.
As a further evaluation of the differences between the ZCNT and CNT distances, we compared the trees obtained using distance-based phylogenetic methods with the ZCNT and CNT distances. Specifically, we compared the performance of applying neighbor joining on the ZCNT distances, referred to as Lazac-NJ, to three distance-based methods for reconstructing copy number phylogenies: MEDICC2 [22], WCND [50], and MEDALT [43] on simulated data. MEDICC2 and WCND compute distances based on extensions of the CNT model and then apply neighbor joining to infer phylogenies. As such, they allow for a natural benchmark with which to compare our simpler ZCNT distance. Lazac-NJ had nearly identical (within 1%) median RF and Quartet distance compared to other distance-based methods (Supplementary Figure 2, 3) and was often the top performer. This provides evidence that even by itself, the ZCNT distance is useful for phylogenetic reconstruction.
6.3. Single-cell DNA sequencing data
We used Lazac to analyze single-cell whole genome sequencing (WGS) data from 25 human breast and ovarian tumor samples [15]. This dataset was generated using the DLP+ [23] single-cell whole-genome sequencing technology, which produces ≈ 0.04× coverage from a median (resp. mean) of 636 (resp. 1457) cells per sample. The original study used Sitka [35], a method that uses the breakpoints between copy number segments as phylogenetic markers, to construct copy number phylogenies using this data.
We found that the phylogenies inferred by Lazac are substantially different than the phylogenies constructed by Sitka. Specifically, the normalized RF distance between pairs of phylogenies was greater than 0.90 in all cases (Supplementary Figure 9). In many cases, the normalized RF distance was 1, indicating that the phylogenies completely disagree.
To investigate these differences, we analyzed the concordance between the phylogenies and the assignments of cells to copy number clones, reported in the original publication [15]. Specifically, we defined the clonal discordance score as the parsimony score of the clonal labeling; i.e. the minimum number of changes in clone label that are required to label the leaves with the published clone labels. Thus, for a dataset with $k$ clone labels the minimum possible clonal discordance score for a tree is $k - 1$, corresponding to the case where each clone label is a clade in the tree. We find that on 18/25 of the samples, the Lazac phylogenies had substantially lower clonal discordance scores than the Sitka phylogenies (Supplementary Figure 4, 10), showing that the Lazac phylogenies were more concordant with the copy number clones compared to the published phylogenies. As a qualitative example, the phylogeny constructed for sample SA1184 by Lazac also appears more concordant with the copy number clones than that inferred by Sitka (Figure 3). Further details on the clonal discordance analysis are in Supplementary Methods A.3.
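A parsimony score of a discrete leaf labeling on a fixed tree can be computed with Fitch's algorithm. The Python sketch below shows this for a binary tree encoded as nested tuples with clone labels at the leaves. The encoding, the function names, the restriction to binary trees, and the toy trees are assumptions of this sketch; the exact procedure used in the analysis is given in Supplementary Methods A.3.

```python
def fitch(tree):
    """Return (candidate label set, number of label changes) for a binary tree.

    Leaves are clone labels (strings); internal nodes are 2-tuples of subtrees.
    """
    if isinstance(tree, str):
        return {tree}, 0
    left_set, left_cost = fitch(tree[0])
    right_set, right_cost = fitch(tree[1])
    common = left_set & right_set
    if common:                                   # children can share a label
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def clonal_discordance(tree):
    """Minimum number of clone label changes needed to explain the leaf labels."""
    return fitch(tree)[1]


# Hypothetical trees over three clone labels A, B, C.
concordant = ((("A", "A"), ("B", "B")), "C")     # every clone forms a clade
discordant = ((("A", "B"), ("A", "B")), "C")     # clones A and B are interleaved
print(clonal_discordance(concordant))            # equals k - 1 for k = 3 labels
print(clonal_discordance(discordant))            # strictly larger
```

A lower score means fewer label changes are needed along the tree, which is the sense in which a phylogeny is more concordant with the published clones.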
Figure 3:
The copy number phylogenies inferred by Lazac (a) and Sitka (b) on sample SA1184 with the leaves colored by the corresponding clone labels, as visualized using Iroki [29]. The normalized RF distance between the two trees is 0.9869.
As a further evaluation of the Lazac and Sitka phylogenies, we examined whether somatic single-nucleotide variants (SNVs) supported the splits in each phylogeny, following the approach of [48]. Note that these SNVs were not used in the inference of either phylogeny, and thus they provide independent validation of the phylogeny. Given the extremely low sequence coverage (0.04× per cell), it is not possible to reliably measure SNVs of individual cells. Thus, we performed this analysis on the three samples (SA039, SA604, SA1035) with the largest number of cells. We identified subtrees in the phylogeny with at least 5% and at most 15% of the cells and identified SNVs present in the subpopulation of cells in these subtrees. Following the approach in [48], we performed a permutation test to determine whether the subtree is supported by more SNVs than expected (Supplementary Methods A.4). For all three samples, we found that the Lazac phylogenies had a greater fraction of supported subtrees than the Sitka phylogenies (Supplementary Figure 11). On the largest sample, SA1035, we identified five out of six supported subtrees (supported by 3175, 3334, 3799, 3435, and 3402 SNVs) for the Lazac phylogeny compared to only three of eight statistically significant subtrees (supported by 3426, 3129, and 3362 SNVs) for the Sitka phylogeny.
6.4. Approximation error of ZCNT small parsimony relaxations
We investigated the approximation error produced by the relaxations (Section 5.1) used in our two polynomial time algorithms for the ZCNT small parsimony problem. To perform this investigation, we first generated a set of 200 copy number phylogenies by stochastically perturbing a phylogeny inferred by Sitka [35] from single-cell whole genome sequencing data (Section 6.3). Then, for each phylogeny, we computed the optimal solution to the ZCNT small parsimony problem and its two relaxations using (integer) linear programming.
Importantly, we found that the exact solution to the ZCNT small parsimony problem and the solution obtained by relaxing the integrality condition were identical in every case. This leads us to believe that the relaxed linear program has a special structure, which we state as the following conjecture:
Conjecture 1.
The constraint matrix A of the linear program obtained by relaxing the integrality condition of the ZCNT small parsimony problem is totally unimodular.
If true, this would imply that ZCNT small parsimony can be solved exactly in polynomial time using linear programming.
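The implication rests on a standard fact: if the constraint matrix is totally unimodular and the right-hand side is integral, every vertex of the feasible polyhedron is integral, so an optimal vertex solution of the relaxed linear program also solves the integer program. The sketch below is a brute-force check of the definition (every square submatrix has determinant −1, 0, or 1). It runs in exponential time and is only suitable for tiny matrices; the two example matrices are toy inputs chosen for illustration, not the ZCNT constraint matrix, and the function names are hypothetical.

```python
from itertools import combinations
from typing import List


def det(m: List[List[int]]) -> int:
    """Exact integer determinant by Laplace expansion along the first row."""
    n = len(m)
    if n == 1:
        return m[0][0]
    total = 0
    for j, entry in enumerate(m[0]):
        if entry == 0:
            continue
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * entry * det(minor)
    return total


def is_totally_unimodular(a: List[List[int]]) -> bool:
    """Return True iff every square submatrix of `a` has determinant in {-1, 0, 1}."""
    n_rows, n_cols = len(a), len(a[0])
    for k in range(1, min(n_rows, n_cols) + 1):
        for rows in combinations(range(n_rows), k):
            for cols in combinations(range(n_cols), k):
                sub = [[a[i][j] for j in cols] for i in rows]
                if det(sub) not in (-1, 0, 1):
                    return False
    return True


# An interval matrix (consecutive ones in every row) is totally unimodular.
interval = [[1, 1, 0], [0, 1, 1], [1, 1, 1]]
# The vertex-edge incidence matrix of a triangle has determinant 2.
triangle = [[1, 1, 0], [0, 1, 1], [1, 0, 1]]
print(is_totally_unimodular(interval), is_totally_unimodular(triangle))
```

A check of this kind can refute the conjecture on a small instance but cannot prove it in general; a proof would need a structural argument about the constraint matrix.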
In contrast, dropping the balancing condition resulted in solutions with a lower score, implying that the balancing condition meaningfully constrains the solution space. Nonetheless, the two scores were highly correlated: the Pearson correlation between the score produced by dropping the balancing condition and the exact ZCNT small parsimony score was R² = 0.972 (p < 10⁻¹⁰⁰) across the 200 copy number phylogenies (Supplementary Figure 7). Further, as the number of stochastic perturbations increased, the parsimony scores of both the relaxed and the exact solutions increased (Supplementary Figure 8). This serves as a sanity check for the ZCNT small parsimony score, since the phylogeny generally moves further from the ground truth as the number of stochastic perturbations increases.
7. Discussion
We introduced the zero-agnostic copy number transformation (ZCNT) model, a relaxation of the CNT model that allows for modification of zero copy number regions. We derived a closed-form expression for the ZCNT distance and presented polynomial time algorithms to solve two natural approximations of the small parsimony problem for copy number profiles. We used our efficient algorithm for the small parsimony problem to derive a method, Lazac, to solve the large parsimony problem for copy number profiles. We demonstrated that on both real and simulated data, Lazac found better copy number phylogenies than existing methods.
There are multiple directions for future work. First, the complexity of the small parsimony problems for both the CNT and ZCNT models remains open, though we conjecture, and provide empirical evidence, that the latter is solvable in polynomial time. Second, the algorithm we developed for the ZCNT large parsimony problem relies on a simple hill-climbing search using nearest-neighbor interchange operations. We expect that a more advanced approach that uses state-of-the-art techniques from phylogenetics [27, 32] could substantially improve both inference speed and accuracy. Finally, we plan to apply Lazac to other large single-cell WGS datasets [48, 28]. We anticipate that the scalability and accuracy of Lazac will be useful in analyzing the increasing amount of single-cell WGS cancer sequencing data.
Supplementary Material
8. Acknowledgements
We thank Uthsav Chitra and Gillian Chu for their helpful comments and review of this manuscript during several stages of its preparation. This work was supported by National Cancer Institute (NCI) grant U24CA248453 to B.J.R.
Footnotes
Availability: Lazac is implemented in C++17 and is freely available at: github.com/raphael-group/lazac-copy-number.
References
- [1]. Albertson Donna G, Collins Colin, McCormick Frank, and Gray Joe W. Chromosome aberrations in solid tumors. Nature genetics, 34(4):369–376, 2003.
- [2]. Andor Noemi, Lau Billy T, Catalanotti Claudia, Sathe Anuja, Kubit Matthew, Chen Jiamin, Blaj Cristina, Cherry Athena, Bangs Charles D, Grimes Susan M, et al. Joint single cell DNA-seq and RNA-seq of gastric cancer cell lines reveals rules of in vitro evolution. NAR Genomics and Bioinformatics, 2(2):lqaa016, 2020.
- [3]. Beerenwinkel Niko, Schwarz Roland F, Gerstung Moritz, and Markowetz Florian. Cancer evolution: mathematical models and computational inference. Systematic biology, 64(1):e1–e25, 2015.
- [4]. Bogdanowicz Damian and Giaro Krzysztof. Matching split distance for unrooted binary phylogenetic trees. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 9(1):150–160, 2011.
- [5]. Bogdanowicz Damian, Giaro Krzysztof, and Wróbel Borys. TreeCmp: comparison of trees in polynomial time. Evolutionary Bioinformatics, 8:EBO–S9657, 2012.
- [6]. Cardona Gabriel, Rosselló Francesc, and Valiente Gabriel. Extended Newick: it is time for a standard representation of phylogenetic networks. BMC bioinformatics, 9(1):1–8, 2008.
- [7]. Chowdhury Salim Akhter, Shackney Stanley E, Heselmeyer-Haddad Kerstin, Ried Thomas, Schäffer Alejandro A, and Schwartz Russell. Phylogenetic analysis of multiprobe fluorescence in situ hybridization data from tumor cell populations. Bioinformatics, 29(13):i189–i198, 2013.
- [8]. Cordonnier Garance and Lafond Manuel. Comparing copy-number profiles under multi-copy amplifications and deletions. BMC genomics, 21(2):1–12, 2020.
- [9]. Csűrös Miklós. How to infer ancestral genome features by parsimony: Dynamic programming over an evolutionary tree. In Models and Algorithms for Genome Evolution, pages 29–45. Springer, 2013.
- [10]. Dong Xiao, Zhang Lei, Hao Xiaoxiao, Wang Tao, and Vijg Jan. SCCNV: a software tool for identifying copy number variation from single-cell whole-genome sequencing. bioRxiv. Preprint, 10:535807, 2019.
- [11]. El-Kebir Mohammed, Raphael Benjamin J, Shamir Ron, Sharan Roded, Zaccaria Simone, Zehavi Meirav, and Zeira Ron. Complexity and algorithms for copy-number evolution problems. Algorithms for Molecular Biology, 12(1):1–11, 2017.
- [12]. Elyanow Rebecca, Zeira Ron, Land Max, and Raphael Benjamin J. STARCH: Copy number and clone inference from spatial transcriptomics data. Physical Biology, 18(3):035001, 2021.
- [13]. Farris James S. Methods for computing Wagner trees. Systematic Biology, 19(1):83–92, 1970.
- [14]. Fitch Walter M. Toward defining the course of evolution: minimum change for a specific tree topology. Systematic Biology, 20(4):406–416, 1971.
- [15]. Funnell Tyler, O’Flanagan Ciara H, Williams Marc J, McPherson Andrew, McKinney Steven, Kabeer Farhia, Lee Hakwoo, Salehi Sohrab, Vázquez-García Ignacio, Shi Hongyu, et al. Single-cell genomic variation induced by mutational processes in cancer. Nature, pages 1–10, 2022.
- [16]. Gao Teng, Soldatov Ruslan, Sarkar Hirak, Kurkiewicz Adam, Biederstedt Evan, Loh Po-Ru, and Kharchenko Peter V. Haplotype-aware analysis of somatic copy number variations from single-cell transcriptomes. Nature Biotechnology, pages 1–10, 2022.
- [17]. Gundem Gunes, Van Loo Peter, Kremeyer Barbara, Alexandrov Ludmil B, Tubio Jose, Papaemmanuil Elli, Brewer Daniel S, Kallio Heini ML, Högnäs Gunilla, Annala Matti, et al. The evolutionary history of lethal metastatic prostate cancer. Nature, 520(7547):353–357, 2015.
- [18]. Gusfield Dan. Efficient algorithms for inferring evolutionary trees. Networks, 21(1):19–28, 1991.
- [19]. Gusfield Dan. ReCombinatorics: the algorithmics of ancestral recombination graphs and explicit phylogenetic networks. MIT press, 2014.
- [20]. Hui Sandra and Nielsen Rasmus. SCONCE: a method for profiling copy number alterations in cancer evolution using single-cell whole genome sequencing. Bioinformatics, 38(7):1801–1808, 2022.
- [21]. Jones Matthew G, Khodaverdian Alex, Quinn Jeffrey J, Chan Michelle M, Hussmann Jeffrey A, Wang Robert, Xu Chenling, Weissman Jonathan S, and Yosef Nir. Inference of single-cell phylogenies from lineage tracing data using Cassiopeia. Genome biology, 21(1):1–27, 2020.
- [22]. Kaufmann Tom L, Petkovic Marina, Watkins Thomas BK, Colliver Emma C, Laskina Sofya, Thapa Nisha, Minussi Darlan C, Navin Nicholas, Swanton Charles, Van Loo Peter, et al. MEDICC2: whole-genome doubling aware copy-number phylogenies for cancer evolution. Genome biology, 23(1):1–27, 2022.
- [23]. Laks Emma, McPherson Andrew, Zahn Hans, Lai Daniel, Steif Adi, Brimhall Jazmine, Biele Justina, Wang Beixi, Masud Tehmina, Ting Jerome, et al. Clonal decomposition and DNA replication states defined by scaled single-cell genome sequencing. Cell, 179(5):1207–1221, 2019.
- [24]. Mallory Xian F, Edrisi Mohammadamin, Navin Nicholas, and Nakhleh Luay. Methods for copy number aberration detection from single-cell DNA-sequencing data. Genome biology, 21(1):1–22, 2020.
- [25]. Markowska Magda, Cąkała Tomasz, Miasojedow Błażej, Aybey Bogac, Juraeva Dilafruz, Mazur Johanna, Ross Edith, Staub Eike, and Szczurek Ewa. CONET: Copy number event tree model of evolutionary tumor history for single-cell data. Genome Biology, 23(1):1–35, 2022.
- [26]. McInnes Leland, Healy John, and Melville James. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.
- [27]. Minh Bui Quang, Schmidt Heiko A, Chernomor Olga, Schrempf Dominik, Woodhams Michael D, Von Haeseler Arndt, and Lanfear Robert. IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era. Molecular biology and evolution, 37(5):1530–1534, 2020.
- [28]. Minussi Darlan C, Nicholson Michael D, Ye Hanghui, Davis Alexander, Wang Kaile, Baker Toby, Tarabichi Maxime, Sei Emi, Du Haowei, Rabbani Mashiat, et al. Breast tumours maintain a reservoir of subclonal diversity during expansion. Nature, 592(7853):302–308, 2021.
- [29]. Moore Ryan M, Harrison Amelia O, McAllister Sean M, Polson Shawn W, and Wommack K Eric. Iroki: automatic customization and visualization of phylogenetic trees. PeerJ, 8:e8584, 2020.
- [30]. Novozhilov Artem S, Karev Georgy P, and Koonin Eugene V. Biological applications of the theory of birth-and-death processes. Briefings in bioinformatics, 7(1):70–85, 2006.
- [31]. Nowell Peter C. The clonal evolution of tumor cell populations: Acquired genetic lability permits stepwise selection of variant sublines and underlies tumor progression. Science, 194(4260):23–28, 1976.
- [32]. Price Morgan N, Dehal Paramvir S, and Arkin Adam P. FastTree 2–approximately maximum-likelihood trees for large alignments. PloS one, 5(3):e9490, 2010.
- [33]. Robinson David F and Foulds Leslie R. Comparison of phylogenetic trees. Mathematical biosciences, 53(1–2):131–147, 1981.
- [34]. Saitou Naruya and Nei Masatoshi. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular biology and evolution, 4(4):406–425, 1987.
- [35]. Salehi Sohrab, Dorri Fatemeh, Chern Kevin, Kabeer Farhia, Rusk Nicole, Funnell Tyler, Williams Marc J, Lai Daniel, Andronescu Mirela, Campbell Kieran R, et al. Cancer phylogenetic tree inference at scale from 1000s of single cell genomes. 2020.
- [36]. Sankoff David and Rousseau Pascale. Locating the vertices of a Steiner tree in an arbitrary metric space. Mathematical Programming, 9(1):240–246, 1975.
- [37]. Sashittal Palash, Schmidt Henri, Chan Michelle M, and Raphael Benjamin J. Startle: a star homoplasy approach for CRISPR-Cas9 lineage tracing. bioRxiv, 2022.
- [38]. Schwarz Roland F, Trinh Anne, Sipos Botond, Brenton James D, Goldman Nick, and Markowetz Florian. Phylogenetic quantification of intra-tumour heterogeneity. PLoS computational biology, 10(4):e1003535, 2014.
- [39]. Semple Charles, Steel Mike, et al. Phylogenetics, volume 24. Oxford University Press on Demand, 2003.
- [40]. Steel Mike. Phylogeny: discrete and random processes in evolution, pages 111–145. SIAM, 2016.
- [41]. Stephens PJ, Greenman CD, Fu BY, Yang FT, Bignell GR, Mudie LJ, Pleasance ED, Lau KW, Beare D, Stebbings LA, et al. Massive genomic rearrangement acquired in a single catastrophic event during cancer development. Cell, 144(1):27–40, 2011.
- [42]. Wagner Warren H. Problems in the classification of ferns. Recent advances in botany, pages 841–844, 1961.
- [43]. Wang Fang, Wang Qihan, Mohanty Vakul, Liang Shaoheng, Dou Jinzhuang, Han Jincheng, Minussi Darlan Conterno, Gao Ruli, Ding Li, Navin Nicholas, et al. MEDALT: single-cell copy number lineage tracing enabling gene discovery. Genome biology, 22(1):1–22, 2021.
- [44]. Wang Rujin, Lin Dan-Yu, and Jiang Yuchao. SCOPE: a normalization and copy-number estimation method for single-cell DNA sequencing. Cell systems, 10(5):445–452, 2020.
- [45]. Wang Xuefeng, Chen Hao, and Zhang Nancy R. DNA copy number profiling using single-cell sequencing. Briefings in bioinformatics, 19(5):731–736, 2018.
- [46]. Warnow Tandy. Computational phylogenetics: an introduction to designing methods for phylogeny estimation. Cambridge University Press, 2017.
- [47]. Wu Chi-Yun, Lau Billy T, Kim Heon Seok, Sathe Anuja, Grimes Susan M, Ji Hanlee P, and Zhang Nancy R. Integrative single-cell analysis of allele-specific copy number alterations and chromatin accessibility in cancer. Nature biotechnology, 39(10):1259–1269, 2021.
- [48]. Zaccaria Simone and Raphael Benjamin J. Characterizing allele- and haplotype-specific copy numbers in single cells with CHISEL. Nature biotechnology, 39(2):207–214, 2021.
- [49]. Zahn Hans, Steif Adi, Laks Emma, Eirew Peter, VanInsberghe Michael, Shah Sohrab P, Aparicio Samuel, and Hansen Carl L. Scalable whole-genome single-cell library preparation without preamplification. Nature methods, 14(2):167–173, 2017.
- [50]. Zeira Ron and Raphael Benjamin J. Copy number evolution with weighted aberrations in cancer. Bioinformatics, 36(Supplement_1):i344–i352, 2020.
- [51]. Zeira Ron, Zehavi Meirav, and Shamir Ron. A linear-time algorithm for the copy number transformation problem. Journal of Computational Biology, 24(12):1179–1194, 2017.