Abstract
Motivation
Cancer phylogenies are key to studying tumorigenesis and have clinical implications. Due to the heterogeneous nature of cancer and limitations in current sequencing technology, current cancer phylogeny inference methods identify a large solution space of plausible phylogenies. To facilitate further downstream analyses, methods that accurately summarize such a set of cancer phylogenies are imperative. However, current summary methods are limited to a single consensus tree or graph and may miss important topological features that are present in different subsets of candidate trees.
Results
We introduce the Multiple Consensus Tree (MCT) problem to simultaneously cluster and infer a consensus tree for each cluster. We show that MCT is NP-hard, and present an exact algorithm based on mixed integer linear programming (MILP). In addition, we introduce a heuristic algorithm that efficiently identifies high-quality consensus trees, recovering all optimal solutions identified by the MILP in simulated data at a fraction of the time. We demonstrate the applicability of our methods on both simulated and real data, showing that our approach selects the number of clusters depending on the complexity of the solution space .
Availability and implementation
Supplementary information
Supplementary data are available at Bioinformatics online.
1 Introduction
Cancer results from an evolutionary process, during which somatic mutations accumulate in a population of cells (Nowell, 1976), resulting in the formation of multiple tumor clones with distinct sets of mutations (Fig. 1a). A phylogenetic tree, or phylogeny, is a model that represents this process. Mathematically, a phylogenetic tree for a tumor is a rooted tree T, whose leaves correspond to extant cells and whose internal vertices correspond to ancestral cells. The root of T is a normal cell, containing no somatic mutations. In classic phylogenetics, we aim to infer T given the leaf set L(T) under an appropriate evolutionary model. However, due to extensive uncertainty in single-cell DNA sequencing data (Navin, 2014) and the presence of mixed cellular populations in bulk DNA sequencing data (El-Kebir et al., 2015), we do not observe the leaves of T. Rather, our data consist of individually-sequenced cells that may contain errors that must be corrected, or cell populations that have been sequenced in bulk, resulting in mutation frequencies. As a consequence of this ambiguity, tumor phylogeny inference methods for both data types typically infer multiple phylogenetic trees with distinct topologies and distinct leaf sets that represent alternative evolutionary histories (Fig. 1b).
The majority of current methods in cancer phylogenetics make the infinite sites assumption, which states that a mutation is gained only once and never subsequently lost (Dang et al., 2017; Deshwar et al., 2015; Donmez et al., 2016; El-Kebir et al., 2015, 2016; Jahn et al., 2016; Jiang et al., 2016; Jiao et al., 2014; Malikic et al., 2015; Popic et al., 2015; Ross and Markowetz, 2016; Strino et al., 2013; Yuan et al., 2015). Under this assumption, we may represent a phylogenetic tree T by a mutation tree (El-Kebir et al., 2016; Jahn et al., 2016). More specifically, we contract unlabeled edges of T to obtain , whose vertices we label by the mutations that were introduced on the incoming edges (Fig. 1a). Tumor phylogenies that adhere to the infinite sites assumption have been used to identify mutations that drive cancer progression (Jamal-Hanjani et al., 2017; McGranahan et al., 2015), assess the interplay between the immune system and the clonal architecture of a tumor (Łuksza et al., 2017; Zhang et al., 2018) and identify common evolutionary patterns in tumorigenesis and metastasis (Turajlic et al., 2018a, b). These downstream analyses critically rely on the accuracy of the input phylogenetic tree. Thus, methods to accurately summarize the solution space are essential, so as to remove inference errors and identify common dependencies between mutations in the input trees.
A common approach employed in several studies (Deshwar et al., 2015; El-Kebir et al., 2015; Jiao et al., 2014) summarizes the solution space by constructing the parent-child graph, which is a directed, edge-weighted graph that represents the union of all trees in . That is, each edge (u, v) of this graph corresponds to an edge in a tree T in the solution space and is weighted by the number of occurrences in (Fig. 1c). A key deficiency of the parent-child graph is that it does not accurately represent topological features of the solution space, i.e. patterns of co-occurrence and mutual exclusivity among edges in individual trees in the solution space. Moreover, downstream analyses require a single phylogenetic tree as input and are unable to operate directly on the parent-child graph.
To overcome the latter limitation, Govek et al. (2018) introduced the Single Consensus Tree problem, which aims at constructing a consensus tree that best represents the solution space . To quantify similarity or distance between two trees, one needs a distance function. Recently, Karpov et al. (2018) introduced a tree edit distance measure that can be efficiently computed using dynamic programming. Using a distance function that directly measures edge similarity, Govek et al. (2018) seek a consensus tree with minimum total distance to the trees in . The main drawback to summarizing by a single tree is that important topological features may be missed, which is especially the case when contains multiple clusters of distinctive trees. We note that there is a large body of work for consensus tree problems in classic phylogenetics (cf. Warnow, 2017). These methods are often based on bipartitions of a fixed leaf set. However, as mentioned above, the leaf set is typically unknown a priori in cancer phylogenetics due to the nature of the input data, preventing the direct application of consensus tree methods that rely on fixed leaf sets.
In this paper, we introduce the Multiple Consensus Tree (MCT) problem of simultaneously grouping trees into k clusters and reconstructing a consensus tree for each cluster with minimum total distance. The MCT approach better summarizes solution spaces with distinct topological features, overcoming limitations of current approaches (Fig. 1d). We prove that MCT is NP-hard, and give an exact approach based on mixed integer linear programming (MILP) that is able to efficiently solve small instances to optimality. In addition, we introduce a heuristic based on coordinate ascent that scales to large input instances. We benchmark our methods on simulated data, showing that the heuristic approach yields solution of quality comparable to that of the MILP approach at only a fraction of the time. We demonstrate the applicability of the MCT problem on recent lung cancer data. Our methods enable one to draw informed conclusions in downstream phylogenetic analyses of tumors.
2 Problem statement
The key object in this paper is a mutation tree, which is a defined as follows.
Definition 1
A mutation tree T is a rooted tree whose m nodes are uniquely labeled by mutations .
We obtain a mutation tree from a phylogenetic tree that satisfies the infinite sites assumption by first contracting its unlabeled edges, and then labeling the resulting vertices by the mutations present on their incoming edges (Fig. 1a). To summarize a set of mutation trees (Fig. 1b), we consider the following distance function, which was shown to be a distance metric by Govek et al. (2018).
Definition 2
Let and be two rooted trees on the same vertex set V. The parent-child distance is the number of edges unique to either tree, i.e.
(1) Mathematically, the parent-child distance of two rooted trees and is the size of the symmetric difference between E and . This distance has been used extensively in the tumor phylogeny inference literature to compare inferred trees to simulated trees (El-Kebir et al., 2015; Malikic et al., 2015; Popic et al., 2015). Govek et al. (2018) used the parent-child distance to define a consensus tree for a set input trees as follows.
Definition 3
A consensus tree for rooted trees with the same vertex set V is a rooted tree R with vertex set V.
Subsequently, Govek et al. (2018) introduced the Single Consensus Tree problem, which given a set of input trees seeks a consensus tree R with minimum total distance .
Problem 1
(Single Consensus Tree (SCT)) Given distinct rooted trees with the same vertex set, find a consensus tree R such that is minimum.
To better account for extensive ambiguity in the topology of solution trees, we introduce the Multiple Consensus Tree problem, which generalizes the Single Consensus Tree to k clusters.
Problem 2
(Multiple Consensus Tree (MCT)) Given distinct rooted trees and integer , find a clustering and consensus trees such that (i) no cluster is empty, i.e. σ is surjective, and (ii) is minimum.
3 Combinatorial structure and complexity
Section 3.1 characterizes the solution space of the MCT problem. Section 3.2 shows that this problem is NP-hard. Proofs are in the supplement due to space constraints.
3.1 Combinatorial characterization of optimal solutions
To characterize the space of solutions to the MCT, we start by reviewing results for the SCT problem (Govek et al., 2018). Given input trees , Govek et al. (2018) defined the parent-child graph as follows.
Definition 4
(Govek et al. (2018)) The parent-child graph of a set of trees is a weighted directed graph with the same vertex set V as each input tree, an edge if and only if there exists an input tree where , and weight equal to the number of input trees with edge (u, v), i.e.
(2) Subsequently, the authors showed that solutions to an SCT instance are maximum weight spanning arborescences in the parent-child graph . We note that maximum weight spanning arborescences and branchings (with multiple root vertices) have frequent applications in computational biology (e.g. Desper et al., 1999).
Theorem 1
(Govek et al., 2018) Given input trees , there exists a consensus tree R with minimum distance that is a maximum weight spanning arborescence in the parent-child graph .
We have the following two lemmas that follow from the above theorem.
Lemma 1
There exists an optimal consensus tree R to SCT instance where each edge (u, v) of R occurs in an input tree.
Lemma 2
There exists an optimal consensus tree R to SCT instance where if an edge (u, v) is present in all trees then (u, v) is an edge of the consensus tree R.
Let be the size of the vertex set V of a set of input trees. We prove the following relationship between the weight of any spanning arborescence R in and its distance to input trees .
Lemma 3
The total distance of any spanning arborescence of parent-child graph to input trees equals .
We have the following important proposition.
Proposition 1
Given a clustering , the MCT problem decomposes into k independent SCT problems.
From the above proposition and Theorem 1, we obtain the following corollaries that are generalizations of Lemmas 1, 2 and 3.
Corollary 1 There exists an optimal solution to MCT instance where each edge of each consensus tree occurs in an input tree in the set of trees assigned to cluster s.
Corollary 2 There exists an optimal solution to MCT instance where if an edge (u, v) is present in all trees assigned to cluster s then (u, v) is an edge of the consensus tree Rs.
Corollary 3 There exists an optimal solution to MCT instance with total distance
(3) where is the weight of a maximum weight spanning arborescence Rs of the parent-child graph obtained from .
As the number of k of clusters increases the minimum total distance will decrease, as shown by the following proposition.
Proposition 2
The minimum total distance of an MCT instance is monotonically decreasing with increasing number k of clusters.
3.2 Complexity
Theorem 2 Multiple Consensus Tree (MCT) is NP-hard.
We give a polynomial-time reduction from the Clique problem, a known NP-complete problem (Garey and Johnson, 1979).
Problem 3
(Clique) Given an undirected, simple graph H with vertex set V(H), edge set E(H) and a positive integer , decide whether G contains a clique of size c.
To reduce a Clique instance (H, c) to an MCT instance , we introduce the notation to indicate the subset of edges that are incident to v, i.e. . For each vertex vi of H, we construct a tree . The vertex set U of Ti is defined as and the edge set Ai contains directed edges and . We set . Since all the input trees are on the same vertex set U and , it holds that is an instance of MCT problem. Clearly, this construction can be performed in time polynomial in n and m. Figure 2 shows an example of reduction.
Defining the cost as the total distance , we have the following two lemmas that provide lower bounds on the cost of any feasible solution to obtained from a Clique instance (H, c).
Lemma 4
The cost of a clustering that partitions into parts of sizes is at least . This bound is tight if and only if the input trees assigned to each cluster s encode a clique in the undirected graph H.
Lemma 5
The cost of any clustering of is at least . This bound is tight if and only if σ contains k – 1 singleton clusters and one cluster with c trees that encode the vertices of a clique in the undirected graph H.
Finally, we use the above two lemmas to prove the following lemma, from which the theorem follows.
Lemma 6
There is a clique of size c in the undirected graph H if and only if the corresponding MCT instance has an optimal solution with cost .
4 Material and methods
This section introduces three algorithms for Multiple Consensus Tree that exploit the combinatorial structure identified in the previous section. Section 4.4 describes a procedure for selecting the number k of clusters, balancing the decrease in distance and the additional complexity with increasing k.
4.1 Brute force algorithm
By Proposition 1, each MCT instance decomposes into k SCT instances when given the clustering σ. Thus, one can identify optimal solutions to by exhaustively generating all clusterings σ, retaining clusterings that have minimum total distance. The number of clusterings is given by the Stirling number of the second kind (Knuth, 1997), which is bounded by kn. Given σ, we must solve k maximum weight spanning arborescence problems on sets of trees. Gabow et al. (1986) give an algorithm that identifies a maximum (minimum) weight spanning r-arborescence rooted at a given vertex r of a weighted directed graph in time. For simplicity, we bound the number of edges in each parent-child graph by . As such, the complexity of identifying an optimal consensus tree of a set of trees is . It follows that the time of identifying the optimal set of consensus trees is bounded by Therefore the complexity of the brute force algorithm is .
4.2 Mixed integer linear program
We introduce a mixed integer linear program (MILP) that models the feasible solution space of an MCT instance . To do so, we model (i) the surjective clustering function , (ii) the consensus trees as spanning arborescences, (iii) the weight of each consensus tree Rs and (iv) additional cuts to improve performance. Let m be the number of vertices in the shared vertex set of input trees .
4.2.1 Clustering
We introduce binary variables to model clustering . More specifically, we require if and if for each cluster s and input tree i. To that end, we introduce the following constraints.
(4) |
(5) |
In addition, we require σ to be surjective. That is, each cluster s contains at least one tree, which we model as follows.
(6) |
4.2.2 Consensus trees
By Proposition 1, the MCT problem decomposes into k instances of the SCT problem. By Theorem 1, we know that each SCT instance is a maximum weight spanning arborescence problem with unknown root. Consider the subproblem of a cluster . To model the edges of the consensus tree Rs, we introduce variables for each ordered pair of vertices such that if consensus tree Rs contains the edge (p, q) and otherwise. We require that Rs is a spanning arborescence of the vertex set , i.e. Rs contains a single vertex p that does not have a parent. To indicate the root vertex, we introduce variables for each vertex p such that if p is the root of Rs and otherwise. We have the following constraints that model a single root vertex and the presence of a unique parent of each non-root vertex.
(7) |
(8) |
(9) |
(10) |
For each order pair , let if there exists an input tree containing the edge (p, q) and otherwise. By Corollary 1, we have that each edge (p, q) of Rs must occur in at least one input tree . As such, we have the following constraint:
(11) |
Next, we need to model connectivity, i.e. from the root vertex p of Rs every other vertex must be reachable. In other words, we need to prevent that Rs has cycles. For a subset of vertices, let be the subset of directed edges (p, q) occurring in the input trees where and . More formally, . For any cut set , it must hold that U contains either the root vertex or there must be at least one incoming edge to U. This is captured by the following constraint.
(12) |
The spanning arborescence polytope defined by constraints (7)–(12) has integral vertices (Schrijver, 2003). In other words, we do not require variables y and z to be binary.
4.2.3 Parent-child distance
For each ordered pair , let if input tree Ti contains the edge (p, q) and otherwise. To model the distance of input tree to its corresponding consensus tree , we introduce the variable which indicates that trees Ti and Rs contain the edge (p, q) and Ti is assigned to cluster s. In other words, is the product of and . We thus have
(13) |
Using Corollary 3, we obtain the following objective function.
(14) |
We model using the following constraints, which force to 0 if one of is 0.
(15) |
(16) |
(17) |
(18) |
By integrality of x and y, we do not require w to be binary variables. Moreover, by the direction of the objective function, we do not need to force to 1 if .
4.2.4 Additional cuts
To improve performance of the ILP, we use Corollary 1 to require that Rs contains the edge (p, q) only if there exists a tree containing the edge (p, q). To that end, we introduce the following constraint.
(19) |
By Corollary 2, if all input trees contain the edge (p, q) then there exists an optimal solution in which consensus tree Rs contains (p, q) as well. This is captured by the following constraint.
(20) |
Finally, we introduce the following symmetry breaking constraints that impose an ordering on σ such that .
(21) |
4.2.5 Cut separation
The number of constraints (12) grows exponentially in m. Therefore, we do not include these constraints in our formulation. Following a standard approach (Wolsey, 1998), we separate these constraints during the branch-and-bound procedure by identifying a minimum cut in a directed graph. Excluding constraints (12), our formulation has variables and constraints. Supplementary Figure S1 contains the full MILP.
Algorithm 1:
Input: Trees and number k > 0 of clusters
Output: Consensus trees and clustering σ 1 random clustering
2
3 whileΔ > 0do
4 fortokdo
5 Let be the parent-child graph of input trees with edge weights
6 Compute max weight spanning arborescence Rs of
7 fortondo
8 σ(i)←argminsϵ[k]d(Ti,Rs)
9
10
11
12 return(, σ)
4.3 Coordinate ascent heuristic
We use coordinate ascent to solve the Multiple Consensus Tree heuristically. The idea is to identify consensus trees and clusterings alternatingly, starting from a random clustering σ. Then, for each cluster , we construct the parent-child graph from the set of input trees in cluster s. From , we obtain the consensus tree Rs by computing the maximum weight spanning arborescence of the graph. Finally, we update the clustering σ by reassigning each to a cluster such that is minimized. These steps are repeated until convergence is achieved (Algorithm 1). To avoid getting stuck in local optima, we allow the user to specify the number of restarts, initializing each restart with a new randomly-generated clustering. Alternatively, we allow the user to specify a time limit, restarting the algorithm until the running time exceeds the time limit.
4.4 Model selection for the number k of clusters
Given input trees with m vertices, the number k of clusters ranges from 1 to n. To decide which number k of clusters to use, we apply the Bayesian Information Criterion (BIC). Note that this criterion requires a likelihood of the data given the model. In our case, the model corresponds to a solution to MCT instance . We need to define a likelihood function that is proportional to the probability of generating the data given solution . To do so, we define the normalized distance between two trees T and as
(22) |
Therefore, the mean normalized distance of a set of n trees and a solution equals
(23) |
We assume that the probability of generating a tree T in by a model is proportional to the mean normalized similarity defined as
(24) |
Note that as k increases, the sum of the distances of the optimal solutions to a set of trees is strictly decreasing by Proposition 2. Therefore, as k increases, the likelihood of optimal solutions is increasing. Assuming independence in generating each input tree, the probability of the model generating a set of n trees is , which is proportional to .
However, as k increases, the complexity of the model, i.e. the number of parameters in solution , is also increasing. Using Proposition 1, optimal consensus trees are determined by the clustering σ. The clustering σ contains k clusters, amounting to the following Bayesian Information Criterion (BIC).
(25) |
(26) |
The factor of 1/2 ensures that the two terms are of similar scale. The task is now to choose k such that the above equation is minimized.
5 Results
We implemented the three algorithms (BF, MILP and CA) in C++ using the LEMON graph library (http://lemon.cs.elte.hu). We implemented MILP using CPLEX v12.8 (https://www.ibm.com/analytics/cplex-optimizer). In this section, we illustrate the application of our methods to simulated and real data. Specifically, Section 5.1 provides results of our algorithms on simulated data, whereas Section 5.2 applies our methods to recent lung cancer data (Jamal-Hanjani et al., 2017).
5.1 Simulations
To evaluate our methods, we simulate bulk DNA sequencing data of tumors using a previously published tumor simulator (El-Kebir et al., 2018). We generate a total of 45 instances, composed of either five or ten bulk samples per instance and mutation clusters (Supplementary Table S1). Subsequently, we run the SPRUCE algorithm (El-Kebir et al., 2015) to enumerate the set of mutation trees for each instance. The mean number of trees is 47 (Supplementary Table S1). We group the 45 simulated instances by the number of mutation trees into three classes, resulting in 16 ‘small’ instances with 6-10 trees, 15 ‘medium’ instances with 11–39 trees and 14 ‘large’ instances with 40-352 trees (Fig. 3a).
For each class of instances (small, medium or large) and number of clusters, we run the mixed integer linear program (MILP) and the brute force algorithm (BF) restricted to a running time of 1 h. In addition, we run the coordinate ascent (CA) algorithm in two modes: (i) using a time limit of 1 h, and (ii) restricted to 100 restarts. We run each algorithm in single-threaded mode on a computer with two Intel Xeon CPUs at 2.6 GHz (32 cores) and 512 GB of RAM.
Supplementary Table S2 shows the number of instances solved to optimality by MILP and BF. We find that MILP outperforms the BF algorithm, solving 65% of instances to optimality versus 45.6% for BF. All small instances were solved to optimality by MILP, whereas BF failed to solved two small instances with k = 5 clusters within the time limit. In particular, performance of BF decreases with increasing number k of clusters and number n of input trees, reflecting the exponential increase in the number kn of enumerated clusterings with increasing number n of trees. Similarly, MILP performance decreases with increasing n and k (Supplementary Fig. S1). The instances that were solved to optimality by MILP include all instances solved to optimality by BF. For these reasons, we exclude BF from further analyses and focus on MILP and CA.
To investigate the behavior of CA versus the MILP algorithm, we compute the mean normalized distance for each simulated instance and output . This distance is defined in Section 4.4. We find that CA using only 100 restarts identifies solutions with similar mean normalized distance as CA and MILP using a time limit of 1 h (Fig. 3b). These 100 restarts were completed in seconds (Fig. 3c). Thus, CA with a small number of restarts computes high-quality consensus trees at only a fraction of the time required by MILP. Moreover, the CA algorithm with 100 restarts recovers all optimal solutions computed by MILP (Supplementary Table S2).
Finally, we consider one simulated instance to illustrate the advantages of the Multiple Consensus Tree over previous approaches, and to illustrate the model selection step for choosing the number k of clusters. The instance we consider has n = 9 trees and m = 9 mutation clusters (Supplementary Fig. S2). Thus, the maximum number k of clusters equals n = 9. We use CA with 100 restarts to compute consensus trees and clusterings for each number of clusters. In line with Proposition 2, Figure 4a shows that the mean normalized distance decreases with increasing number k of clusters. In particular, for k = 9 clusters, each containing a single input tree.
Applying the Bayesian Information Criterion (BIC), we select the solution with k = 2 clusters (Fig. 4a). The two resulting consensus trees R1 and R2 contain and input trees, respectively (Fig. 4a). Figure 4b and c show the parent-child graphs and , with colored edges indicating the two corresponding consensus trees. In these figures, we see that the two consensus trees R1 and R2 differ in vertices d, e and g. Input trees include the edge (b, e) whereas input trees include the edge (a, e). In addition, trees in include a branch composed of edges (e, g) and (g, d), whereas trees contain d and g as siblings of parent b. Importantly, these topological features are not apparent when summarizing by the parent-child graph or by constructing a single consensus tree from . That is, the parent-child graph does not show patterns of co-occurrence and mutual exclusivity among edges. For instance, edge (b, e) does not co-occur with edges (b, d) or (b, g) in , which cannot be concluded from (Fig. 4c). Furthermore, the unique optimal consensus tree R obtained from does not contain the edge (b, e) (Fig. 4c), which occurs in 4 out of 9 input trees (Supplementary Fig. S2). Hence, R is an incomplete summary of . Only by summarizing using multiple consensus trees do these topological features become apparent. Supplementary Figure S3 shows the distribution of the identified number k of clusters for each class of instances, showing that the number k of clusters selected by BIC increases with the number n of trees.
5.2 Real data
We consider a lung cancer cohort of 100 patients (Jamal-Hanjani et al., 2017), composed of tumors that have undergone multi-region bulk DNA sequencing. Jamal-Hanjani et al. (2017) used PyClone (Roth et al., 2014) to cluster mutations with similar cancer cell fractions and ran CITUP (Malikic et al., 2015) to compute solution spaces for each tumor, identifying multiple trees for 25 patients. We focus our analysis on patients CRUK0013 and CRUK0037, the only two patients with more than four reported trees. Jamal-Hanjani et al. (2017) identified 8 trees for patient CRUK0013 (Supplementary Fig. S4) and 17 trees for patient CRUK0037 (Supplementary Fig. S5). To summarize these trees, we run CA coupled with the model selection procedure for the number k of clusters.
First, we consider patient CRUK0013, which has m = 9 vertices/mutation clusters. Figure 5a shows the relationship between the number k of clusters and the mean normalized distance computed by the CA method. The decrease in distance from k = 1 to k = 2 is modest. Consequently, the BIC prefers the k = 1 solution. Inspection of the parent-child graph and consensus tree R reveals that the consensus tree R covers 55 out 64 edges in , where the 9 uncovered edges are incoming to v8 and v9. In particular, there are no patterns of co-occurrence or mutual exclusivity among the edges leading to v8 and v9 in individual trees in (Fig. 5c), justifying the choice for k = 1 cluster. This example, in addition to our simulated data results (Supplementary Fig. S3), show that our method does not overfit the input data when there are no clear topological features in the solution space.
By contrast, for patient CRUK0037, with m = 10 vertices (mutations clusters) and n = 17 trees, our method infers k = 2 clusters (Fig. 6a). Inspection of the n = 17 input trees reveals that there is variation in the placement of five vertices, as shown by the parent-child graph (Fig. 6b). We focus our attention on vertices v5, v7 and v10, each with two possible parents. Figure 6c shows the contingency table of all combinations of these three clusters, enabling us to observe that and are mutually exclusive. This pattern of mutual exclusivity is not apparent in the parent-child graph obtained from all trees (Fig. 6b). Furthermore, the placement of the three mutation clusters in the k = 1 consensus tree obtained from this graph is supported by only 2 out of 17 input trees. Thus, with k = 1, neither the parent-child graph nor the consensus tree provide an adequate summary of the solution space of this patient.
Our method partitions the input trees into k = 2 clusters: one cluster with seven input trees (Fig. 6d) and the other cluster with the remaining ten trees (Fig. 6e). This partition identifies patterns of co-occurrence and mutual exclusivity that are unique to each cluster. All seven trees in the first cluster contain the edge , whereas the remaining ten trees in the second cluster contain the edge . On the other hand, the trees in the first cluster exhibit mutual exclusivity between and , whereas these two edges are present in 7/10 trees in the second cluster. Similarly, edges and are mutually exclusivity in all ten trees in the second cluster, whereas these two edges are present in 5/7 trees in the first cluster. Thus, our method partitions the solution space of 17 trees into two clusters with distinct topological features. In addition, our method infers a consensus tree for each of the two clusters. The placement of v5, v7 and v10 in the consensus tree of the first cluster is supported by 3/7 trees assigned to this cluster, being the dominant topological feature among these seven trees (Fig. 6f). Similarly, the consensus tree of the second cluster highlights the most representative placement of these three vertices (supported by 5/10 trees, see Fig. 6g).
The first consensus tree contains the branch , whereas the second consensus tree contains the branch . Vertex v10 contains the driver mutation HOOK3, whose placement may alter conclusions in downstream analyses, including those that assess tumor fitness to immunotherapy (Łuksza et al., 2017) or identify repeated evolutionary trajectories among driver mutations (Turajlic et al., 2018b). To avoid incorrect conclusions both consensus trees must be considered in these analyses. Our method facilitates such more robust downstream analyses, by simultaneously clustering input trees according to shared topological features, identifying the right number of clusters depending on the degree of differences among solution trees.
6 Discussion
We introduced the Multiple Consensus Tree (MCT) problem that enables one to accurately summarize a solution set composed of tumor phylogenies with distinct topological features using multiple consensus trees, overcoming limitations of current approaches. Current approaches that summarize by constructing a graph that is the union of all edges in fail to account for mutual exclusivity or co-occurence of edges in individual trees (Deshwar et al., 2015; El-Kebir et al., 2016; Jiao et al., 2014). In a similar vein, summarizing by constructing a single consensus tree as described by Govek et al. (2018) may fail to represent topological features that are specific to a subset of trees in .
Mathematically, MCT is a generalization of the Single Consensus Tree to k consensus trees. That is, given input trees and integer k > 0, we aim to simultaneously partition into k disjoint, non-empty clusters and reconstruct a consensus tree for each cluster with minimum total distance. We proved that MCT is NP-hard. In addition, we presented two exact approaches based on mixed integer linear programming (MILP) and exhaustive enumeration. Using simulated data, we showed that the MILP efficiently solves small instances to optimality. In addition, we introduced a heuristic based on coordinate ascent that scales to large input instances. By benchmarking our methods on simulated data, we showed that the heuristic approach recovered all optimal solutions identified by the MILP at only a fraction of the time. We demonstrated the applicability of the MCT problem on lung cancer data, illustrating that our model selection step selects the right number k of clusters depending on the degree of differences among solution trees.
There are a couple of avenues for future research. First, we used the parent-child distance in this manuscript. One could consider alternative distance functions, such as the tree distance function recently introduced by Karpov et al. (2018). Second, the complexity of the MCT given fixed number k of clusters remains open. As we have seen in our analysis of real and simulated data, it is often the case that . Thus, an algorithm that is fixed parameter tractable in k would have immediate practical applications. Third, there may be multiple optimal solutions to MCT. More specifically, for a fixed clustering there might be multiple optimal consensus trees, and there might be multiple clusterings with the same total distance. Similarly to the original problem, it will be an interesting direction to identify common patterns and differences among such optimal solutions. Fourth, the mutation trees considered in this manuscript adhere to the infinite sites assumption. Recent works in cancer phylogenetics have considered other evolutionary models, such as the infinite alleles model (El-Kebir et al., 2016), the Dollo parsimony model (Bonizzoni et al., 2017; El-Kebir, 2018) or the finite sites model (Zafar et al., 2017). It will be an interesting question to adapt the methodology and problem to trees that employ these alternative models of evolution. Finally, a characterization of the distribution of trees in the solution space and their topological features under an error model of single-cell or bulk DNA sequencing has not been attempted yet. Akin to the work by Steel and Penny (1993) in classic phylogenetics, such work would provide much needed theoretical guidance on the larger issue of non-uniqueness of solutions in cancer phylogenetics.
Funding
This work was supported by UIUC Center for Computational Biotechnology and Genomic Medicine (grant: CSN 1624790) and the National Science Foundation (grant: 1850502).
Conflict of Interest: none declared.
Supplementary Material
References
- Bonizzoni P. et al. (2017) Beyond perfect phylogeny: multisample phylogeny reconstruction via ilp. In Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, ACM-BCB ’17. ACM, New York, NY, USA, pp. 1–10.
- Dang H.X. et al. (2017) ClonEvol: clonal ordering and visualization in cancer sequencing. Ann. Oncol., 28, 3076–3082. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Deshwar A.G. et al. (2015) PhyloWGS: reconstructing subclonal composition and evolution from whole-genome sequencing of tumors. Genome Biol., 16, 35.. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Desper R. et al. (1999) Inferring tree models for oncogenesis from comparative genome hybridization data. JCB, 6, 37–51. [DOI] [PubMed] [Google Scholar]
- Donmez N. et al. (2016) Clonality inference from single tumor samples using low coverage sequence data In: Singh,M. (Ed.), Research in Computational Molecular Biology, Vol. 9649. Springer International Publishing, Cham, pp. 83–94. [DOI] [PubMed] [Google Scholar]
- El-Kebir M. (2018) SPhyR: tumor phylogeny estimation from single-cell sequencing data under loss and error. Bioinformatics, 34, i671–i679. [DOI] [PMC free article] [PubMed] [Google Scholar]
- El-Kebir M. et al. (2015) Reconstruction of clonal trees and tumor composition from multi-sample sequencing data. Bioinformatics, 31, i62–i70. [DOI] [PMC free article] [PubMed] [Google Scholar]
- El-Kebir M. et al. (2016) Inferring the mutational history of a tumor using multi-state perfect phylogeny mixtures. Cell Syst., 3, 43–53. [DOI] [PubMed] [Google Scholar]
- El-Kebir M. et al. (2018) Inferring parsimonious migration histories for metastatic cancers. Nat. Genet., 50, 718–726. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gabow H.N. et al. (1986) Efficient algorithms for finding minimum spanning trees in undirected and directed graphs. Combinatorica, 6, 109–122. [Google Scholar]
- Garey M.R., Johnson D.S. (1979) Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co, New York, NY, USA. [Google Scholar]
- Govek K. et al. (2018) A consensus approach to infer tumor evolutionary histories. In: Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (BCB '18). ACM, New York, NY, pp. 63–72. [Google Scholar]
- Jahn K. et al. (2016) Tree inference for single-cell data. Genome Biol., 17, 86.. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Jamal-Hanjani M. et al. (2017) Tracking the evolution of non–small-cell lung cancer. N. Engl. J. Med., 376, 2109–2121. [DOI] [PubMed] [Google Scholar]
- Jiang Y. et al. (2016) Assessing intratumor heterogeneity and tracking longitudinal and spatial clonal evolutionary history by next-generation sequencing. Proc. Natl. Acad. Sci. USA, 113, E5528–37. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Jiao W. et al. (2014) Inferring clonal evolution of tumors from single nucleotide somatic mutations. BMC Bioinformatics, 15, 35.. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Karpov N. et al. (2018) A multi-labeled tree edit distance for comparing “Clonal Trees” of tumor progression. In Parida, L. and Ukkonen, E. (eds.) 18th International Workshop on Algorithms in Bioinformatics (WABI 2018), volume 113 of Leibniz International Proceedings in Informatics (LIPIcs), Dagstuhl, Germany. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, pp. 22:1–22:19.
- Knuth D.E. (1997) The Art of Computer Programming, Volume 1 (3rd Ed.): Fundamental Algorithms. Addison Wesley Longman Publishing Co., Inc, Redwood City, CA, USA. [Google Scholar]
- Łuksza M. et al. (2017) A neoantigen fitness model predicts tumour response to checkpoint blockade immunotherapy. Nature, 551, 517.. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Malikic S. et al. (2015) Clonality inference in multiple tumor samples using phylogeny. Bioinformatics, 31, 1349–1356. [DOI] [PubMed] [Google Scholar]
- McGranahan N. et al. (2015) Clonal status of actionable driver events and the timing of mutational processes in cancer evolution. Sci. Transl. Med., 7, 283ra54. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Navin N.E. (2014) Cancer genomics: one cell at a time. Genome Biol., 15, 452.. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nowell P.C. (1976) The clonal evolution of tumor cell populations. Science, 194, 23–28. [DOI] [PubMed] [Google Scholar]
- Popic V. et al. (2015) Fast and scalable inference of multi-sample cancer lineages. Genome Biol., 16, 91.. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ross E.M., Markowetz F. (2016) OncoNEM: inferring tumor evolution from single-cell sequencing data. Genome Biol., 17, 69.. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Roth A. et al. (2014) PyClone: statistical inference of clonal population structure in cancer. Nat. Methods, 11, 396–398. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Schrijver A. (2003) Combinatorial Optimization – Polyhedra and Efficiency. Springer, New York. [Google Scholar]
- Steel M.A., Penny D. (1993) Distributions of tree comparison metrics—some new results. Syst. Biol., 42, 126–141. [Google Scholar]
- Strino F. et al. (2013) TrAp: a tree approach for fingerprinting subclonal tumor composition. Nucleic Acids Res., 41, e165. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Turajlic S. et al. (2018a) Deterministic evolutionary trajectories influence primary tumor growth: TRACERx renal. Cell. doi: 10.1016/j.cell.2018.03.043. [DOI] [PMC free article] [PubMed]
- Turajlic S. et al. (2018b) Tracking cancer evolution reveals constrained routes to metastases: TRACERx renal. Cell. doi: 10.1016/j.cell.2018.03.057. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Warnow T. (2017) Computational Phylogenetics: An Introduction to Designing Methods for Phylogeny Estimation, 1st edn. Cambridge University Press, New York, NY, USA. [Google Scholar]
- Wolsey L. (1998) Integer Programming. Wiley Series in Discrete Mathematics and Optimization. Wiley, New York. [Google Scholar]
- Yuan K. et al. (2015) BitPhylogeny: a probabilistic framework for reconstructing intra-tumor phylogenies. Genome Biol., 16, 1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zafar H. et al. (2017) SiFit: inferring tumor trees from single-cell sequencing data under finite-sites models. Genome Biol., 18, 178. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhang A.W. et al. (2018) Interfaces of malignant and immunologic clonal dynamics in ovarian cancer. Cell, 173, 1755–1769.e22. [DOI] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.