Identification of conserved evolutionary trajectories in tumors

Ermin Hodzic; Raunak Shrestha; Salem Malikic; Colin C Collins; Kevin Litchfield; Samra Turajlic; S Cenk Sahinalp

doi:10.1093/bioinformatics/btaa453

. 2020 Jul 13;36(Suppl 1):i427–i435. doi: 10.1093/bioinformatics/btaa453

Identification of conserved evolutionary trajectories in tumors

Ermin Hodzic ^b1, Raunak Shrestha ^b2, Salem Malikic ^b3, Colin C Collins ^b4,^b5, Kevin Litchfield ^b6, Samra Turajlic ^b6,^b7, S Cenk Sahinalp ^b8,^✉

PMCID: PMC7355238 PMID: 32657374

Abstract

Motivation

As multi-region, time-series and single-cell sequencing data become more widely available; it is becoming clear that certain tumors share evolutionary characteristics with others. In the last few years, several computational methods have been developed with the goal of inferring the subclonal composition and evolutionary history of tumors from tumor biopsy sequencing data. However, the phylogenetic trees that they report differ significantly between tumors (even those with similar characteristics).

Results

In this article, we present a novel combinatorial optimization method, CONETT, for detection of recurrent tumor evolution trajectories. Our method constructs a consensus tree of conserved evolutionary trajectories based on the information about temporal order of alteration events in a set of tumors. We apply our method to previously published datasets of 100 clear-cell renal cell carcinoma and 99 non-small-cell lung cancer patients and identify both conserved trajectories that were reported in the original studies, as well as new trajectories.

Availability and implementation

CONETT is implemented in C++ and available at https://github.com/ehodzic/CONETT.

Supplementary information

Supplementary data are available at Bioinformatics online.

1. Introduction

Cancer is a disease initiated by somatic genomic alterations. Accumulation of such alterations drives the progression of a cancer to an advanced form. As the tumor grows, new subpopulations of the tumor cells emerge with distinct genomic alteration profiles. After subsequent rounds of selection and expansion, these subpopulations give rise to substantial intra-tumor heterogeneity, which is arguably one of the main challenges in cancer management and treatment.

As multi-region, time-series and single-cell sequencing data become more widely available; it is becoming clear that certain tumors share evolutionary characteristics with others. With new computational methods to identify recurrent cancer progression patterns from multi-dimensional tumor sequencing data, it may become possible to predict the likely course of evolution and perhaps an effective treatment strategy for certain cancer types (Diaz-Uriarte and Vasallo, 2019; Hosseini et al., 2019; Lipinski et al., 2016). For example, novel approaches for targeted immuotherapy (Crowther et al., 2020) could be developed into combinatorial strategies that simultaneously target multiple subclonal populations if a comprehensive characterization of recurrent subclonal expansion patterns could be established. Unfortunately, elevated levels of intra-tumor heterogeneity, with intricate subclonal mutational landscapes dominated by inconsequential passenger alterations, may make it difficult to discriminate predictable tumor progression signals from noise introduced by these statistically insignificant subpopulations.

In the last few years, several computational methods have been developed with the goal of inferring the subclonal composition and/or evolutionary history of tumors from tumor biopsy sequencing data, including El-Kebir et al. (2015), Deshwar et al. (2015), Popic et al. (2015), Malikic et al. (2015), El-Kebir et al. (2016), Jahn et al. (2016), Ross and Markowetz (2016), Zaccaria et al. (2018), Eaton et al. (2018), Govek et al. (2018), Malikic et al. (2019a,b) and Zafar et al. (2019). Phylogenetic trees for representing tumor evolution processes reported by such methods provide (at least one of) the likely relative temporal order of genomic alteration events. Unfortunately, the trees they report for many tumor samples differ significantly from other tumors with similar characteristics. In contrast, some recent studies have been able to recapitulate tumor evolution more accurately through long-term clinical follow-up (Abbosh et al., 2017; Turajlic et al., 2018; Yates et al., 2017). These studies were able to find evidence of recurrent progression patterns involving two or more co-occurring genomic alterations. Since, until recently, no computational tool was purposefully developed to identify recurrent patterns of tumor evolution across multiple tumors, these studies typically relied on manual inspection of all possible sequences of alteration events, starting from the root of the tumor phylogeny (representing germline), all the way to leaf nodes (Turajlic et al., 2018). Recently, methods have been developed that cluster tumor phylogenetic trees into groups of closely related trees with respect to edge similarity/distance measures (Aguse et al., 2019; Caravagna et al., 2018; Yusuke et al., 2017). In particular, the REVOLVER method (Caravagna et al., 2018) employs a maximum-likelihood learning strategy to construct a joint hidden tree model for all tumors in a cohort, which is then used to infer an individual tree depicting the temporal order of clonal alterations in each individual tumor (this particular aspect of the method is later improved upon by Khakabimamaghani et al., 2019). This is followed by a hierarchical clustering of the trees for the purpose of detecting edges shared within each cluster. We note here that REVOLVER neither constructs likely evolutionary trajectories [which may be done manually to a degree (Caravagna et al., 2018) via the use of shared edges] nor does it look into whether the trajectories are conserved in a significant portion of tumors.

In parallel to the effort summarized above, the problem of finding common sets of genomic alterations in cross-sectional tumor data has been investigated by several computational approaches. These approaches typically employ molecular interaction networks to identify connected subnetworks of genes that have been subjected to somatic alterations, that are shared across tumor samples in the sense that at least a fixed number (e.g. one) of the genes in the subnetwork is altered in each tumor sample (Alcaraz et al., 2012; Bomersbach et al., 2016; Hristov and Singh, 2017; Kim et al., 2013; Leiserson et al., 2015; Ulitsky et al., 2010; Vandin et al., 2011). Recently, cd-CAP (Hodzic et al., 2019) took a stricter approach, requiring preservation of every alteration in the subnetwork in at least a user defined number of tumor samples in a cohort, introducing the notion of a conserved alteration pattern. Through cd-CAP, it has become possible to detect sizable networks of genes, altered in a similar way across tumor samples. However none of these approaches consider the temporal order of alterations to make inference about evolutionary processes common in certain tumor types.

In this article, we present a novel computational method, CONETT (CONserved Evolutionary Trajectories in Tumors), for combinatorial detection of recurrent tumor evolution trajectories. CONETT considers as input a partial temporal ordering of alteration events for each tumor in a given set of tumors; for any tumor the input is given in the form of a directed graph where each alteration event is represented by a distinct node. Given a pair of alteration events a and b, one of the following must be true for each tumor: (i) a directed edge from a to b only (alternatively from b to a only) indicates that a is an ancestor of b, (ii) a directed edge from a to b as well as a directed edge from b to a (or a bidirectional a-b edge), indicates that the two events are known to have an ancestor–descendant relationship but their specific ordering is unknown, (iii) an ‘anti-edge’ between a and b indicates that the two events belong to two distinct lineages or (iv) a ‘don’t care’ edge between a and b indicates that either a or b or both are not observed in that tumor or no information is available with respect to their ordering (We require that the input graph for each tumor should be transitive, i.e. if there are directed edges from a to b and from b to c then there must exist a directed edge from a to c.). Given such input, and a ‘root’ (e.g. driver) alteration event s, CONETT constructs a ‘consensus’ phylogeny tree on a large subset of tumors whose topology ‘captures’ the ancestor–descendant relationship for the largest possible number of event pairs in these subset of tumors. A path from the root of that phylogeny to any other event is said to form an ‘evolutionary trajectory’ of alteration events, ‘conserved’ in this subset of tumors.

In its most stringent setup, CONETT constructs a consensus tree (rooted at the germline ‘event’ g) of all tumors in the cohort with maximum total node ‘depth’ (distance from the root), under the constraint that if event a is an ancestor of event b in the consensus tree, then in each individual tumor in which b has been observed, there must exist a directed edge from a to b. As a result, not all alteration events can be a part of the consensus tree. Thus CONETT inferred consensus tree maximizes the number of ancestor–descendant orderings between pairs of alteration events that do not conflict with those orderings observed in any individual tumor graph. In a more general setup, CONETT can relax this constraint for a small fraction of tumor graphs. It also allows the user to specify a node $s \neq g$ as the ‘driver’ event.

We have applied CONETT to the TRACERx clear-cell renal cell carcinoma (ccRCC) dataset (Turajlic et al., 2018) involving 100 tumors, as well as TRACERx non-small-cell lung cancer (NSCLC) dataset (Jamal-Hanjani et al., 2017) involving 99 tumors, and identified a number of significantly conserved evolutionary trajectories which involve sequence-altered genes and copy number alteration events—not reported in the original studies.

2 Materials and methods

Let $P$ be a set of tumors for which it is possible to infer (possibly through the use of time-series, multi-region, single-cell or single-molecule sequencing data) a partial ancestral ordering of alteration events, labeled by a gene or chromosome and the type of alteration affecting it (e.g. somatically altered single gene or copy number altered entire chromosomal arm; note that CONETT allows the user to distinguish distinct types of somatic alterations such as single-nucleotide alterations, short insertions, deletions, inversions and duplications within a gene, as well as copy number gains and losses within a chromosomal arm. However, the studies that published the datasets that CONETT was applied to in this paper, did not associate significant biological differences with these distinctions, and thus, for the sake of simplicity and the ability to compare our results, we similarly largely avoided employing such distinctions). Let $A$ be the set of all alteration events that are observed in at least one tumor $p \in P$ . For a given tumor $p \in P$ , we define a tumor graph G_p as a graph that has a node for each alteration event in $A$ (note that G_p includes a node for each event in $A$ , but not every node in G_p corresponds to an event observed in p). For each pair of distinct nodes $a, b \in V (G_{p})$ exactly one of the following must be true: the set of edges $E (G_{p})$ contains (i) either a directed edge from a to b only or from b to a only, indicating that a is an ancestor of b (or alternatively, that b is an ancestor of a); (ii) a directed edge from a to b as well as a directed edge from b to a, indicating that the two events are known to have an ancestor–descendant relationship but the specific ordering is unknown; (iii) an ‘anti-edge’ between a and b, indicating that the two events belong to two distinct lineages; or (iv) a ‘don’t care’ edge between a and b, indicating that alteration events corresponding to either one or both of the nodes are not observed in that tumor or no information is available with respect to their ordering. We require that each tumor graph should be transitive with respect to its directed edges, i.e. if there are directed edges from a to b and from b to c then there must exist a directed edge from a to c. Thus, edges of a tumor graph G_p capture the complete available information about partial ancestor–descendant ordering of alteration events in tumor p (Fig. 1).

Fig. 1. — Overview of CONETT’s framework. The method takes as its input a partial ancestor–descendant ordering of alteration events across a number of tumors and produces a tree of conserved evolutionary trajectories stemming from a common root event which maximizes the number of ancestor–descendant ordered pairs of nodes. (A) Input. Tumor graphs represent partial ancestor–descendant order of alteration events in tumors. In this particular example, any pair of events that do not have an actual edge between them in a tumor should be thought to have an ‘anti-edge’; similarly any event not present in a tumor should be thought to have a ‘don’t care’ edge to all events present in that tumor. (B) The conserved evolutionary trajectory tree. In the most stringent setting CONETT computes the maximum total depth ‘consensus’ phylogeny tree where the ordering of alteration events do not conflict with the partial orderings of alteration events in any of the tumors

We say that an ordered set of nodes $e = (v_{1}, \dots, v_{k})$ represents an evolutionary trajectory that is conserved in a tumor graph $G_{p} (p \in P)$ if for each pair of nodes $v_{i}, v_{i + 1}$ in e there is a directed edge from v_i to $v_{i + 1}$ . In its most stringent setting, CONETT builds a tree T that includes ancestor–descendant relationships observed in all tumors in $P$ —i.e. the path from the root to any given node v in T is conserved in all tumor graphs G_p for which the alteration event corresponding to v is observed in tumor p. Among such trees T, CONETT’s objective is to compute the tree $T_{P}$ that maximizes the total path length from the root to every other node v; we call this tree, the ‘consensus’ tree for $P$ , and each path in $T_{P}$ from the root to any other node a ‘conserved evolutionary trajectory’. Intuitively, the consensus tree maximizes the number of orderings between alteration event pairs. We note here that we represent the ‘germline’ as a special (pseudo) alteration event g from which each G_p has a directed edge to every other node v. In this stringent setting, the root of T is specified to be g, and thus T will include each event in $A$ as a node—sometimes as a singleton leaf, with g as its parent (this is because the $T_{P}$ maximizes the total path length). Singleton leaves can later be filtered out as they reveal no new information.

In a more general setting (which we use for all our experimental results), CONETT offers the ability to identify longer evolutionary trajectories that are conserved only in a subset of tumor graphs in $P$ , by relaxing the constraint that each trajectory should be conserved in every tumor graph. Additionally, this setting allows tumor graphs to be non-transitive (for reasons explained later in this section). Finally, this general setting allows the user to specify as the root, any alteration/node $s \in A$ , to be identified as the ‘driver’ alteration event. For every other event v, let $T {(s)}_{v}$ , denote the set of all tumor graphs G_p, where there is a directed path from s to v. We then formulate the following problem.

Maximum conserved evolutionary trajectory tree (MCETT) problem: Construct a tree $T_{P, s}$ rooted at s with the maximum total path length from s to every other node v, such that every path $(s, u_{1}, \dots, u_{k}, v)$ in $T_{P, s}$ is conserved in at least $\max (1, γ | T {(s)}_{v} |)$ tumor graphs in $T {(s)}_{v}$ , for a user defined $γ \in (0, 1]$ .

Note that in order to achieve maximum total path length (i.e. node ‘depth’), the solution to the MCETT problem necessarily includes all nodes v for which $T {(s)}_{v} \neq \emptyset$ . CONETT solves MCETT problem through an ILP (integer linear programming) formulation as described in Section 2.1. However, if s is set to be the germline node and γ = 1, the solution to the MCETT problem is the consensus tree, we defined above for the most stringent setting of CONETT, and for this case, we describe a simple polynomial time algorithm below.

A polynomial time algorithm for MCETT problem for γ = 1: Let s denote the root alteration event (not necessarily representing the germline). For any alteration/node $v \in A$ , let $E_{v}^{*}$ denote the set of edges (u, v), conserved in every tumor graph in $T {(s)}_{v}$ and let $E^{*} = \cup_{v} E_{v}^{*}$ . The solution to the MCETT problem for this setting is then the spanning tree of the nodes in $E^{*}$ in which the sum of distances between root s and every other node v is maximum possible. It is possible to compute this spanning tree through a simple depth first search strategy as shown in Supplementary Methods.

For the more general setting of MCETT where $γ < 1$ , the maximum depth spanning tree strategy above could be used as an efficient heuristic that works reasonably accurately for higher values of γ. However, it is possible to solve MCETT exactly for any value of γ through an integer linear programming formulation as described below in Section 2.1. In this formulation, we do not differentiate (i) node pairs with a single directed edge from, (ii) those with edges in both directions. We also do not differentiate, (iii) node pairs with an anti-edge from (iv) those with a don’t care edge. We show how to differentially ‘incentivize’ the first two edge types toward positive contributions, and ‘penalize’ the next two edge types toward negative contributions within the objective of this ILP formulation later in the article.

2.1 ILP formulation for solving the MCETT problem

Given a set of tumor graphs ${G_{p}} (p \in P)$ , a root alteration s and a value of the parameter $γ \in (0, 1]$ , the goal is to construct a tree T as the solution to the MCETT problem. First, for each alteration event $v \in A$ , CONETT finds the maximum-size subset $T {(s)}_{v}$ of all tumor graphs in which there is a directed path from s to v in each $G_{p} \in T {(s)}_{v}$ . Then CONETT forms an evolutionarily conserved alteration set $S = {v \in A | T {(s)}_{v} \neq \emptyset}$ , which is the set of all nodes v for which $T {(s)}_{v}$ is non-empty and thus the node must be included in the solution tree.

Given an evolutionarily conserved alteration set S, CONETT’s ILP formulation in Figure 2 specifies the tree $T_{P, s}$ built upon the node set S, with s as its root, that represents the maximum node depth tree in which each path represents a conserved evolutionary trajectory with respect to the value of γ. In the ILP formulation, binary variable $X_{u, v}$ , defined for each pair of nodes $u, v \in S$ , is set to 1 if (u, v) is a tree edge of $T_{P, s}$ , and set to 0 otherwise. For each tumor graph G_p and node $v \in S$ , variable $P T H_{v}^{p}$ is set to 1 if the exact path in $T_{P, s}$ leading from s to v is conserved is tumor graph G_p. Last, for each node $v \in S$ , the variable D_v represents the distance between s and v in $T_{P, s}$ . The objective of the ILP is to maximize the sum of distances from the root to all nodes in S.

Fig. 2. — The ILP formulation for inferring the conserved evolutionary trajectory tree

Now we describe each of the constraints in the ILP formulation. First, the root node s is set to have distance 0 by Constraint (1), and it cannot be a child of any other node due to Constraint (2). Every other node v has distance that is equal to the distance of its parent node u increased by 1. The edge to node v from its parent u is determined by auxiliary variables $P D_{u, v}$ : Due to the constraint (4), $P D_{u, v}$ is set to 0 if u is not the parent of v; The maximization in the objective function, combined with the Constraints (3) and (5) ensures that $P D_{u, v}$ is set to D_u if u is the parent of v and that distance of v is equal to the distance of its parent u increased by 1.

To ensure that $T_{P, s}$ is a tree, we require that each node can have only one parent; this is ensured by Constraint (6). We enforce that the nodes of $T_{P, s}$ form a connected component of size $| S |$ by considering a fictitious network flow originating at root node s of $| S | - 1$ units—by Constraint (7). For each directed edge (u, v), the value of flow along that edge is represented by variables $f (e_{u, v})$ .

The flow loses one unit at each node—by Constraint (8). If there is a positive amount of flow through the edge (u, v), then it is a tree edge in $T_{P, s}$ —by Constraint (9). If there is no flow through the edge (u, v) then it is not in $T_{P, s}$ —by Constraint (10).

The set of Constraints (11–17) ensures that the evolutionary trajectories represented by paths in $T_{P, s}$ are conserved. Constraint (17) requires each trajectory (or path) from the root s to a node v in the tree to be conserved in at least $γ | T {(s)}_{v} |$ tumor graphs. Similar to the use of auxiliary variables $P D_{u, v}$ above, we introduce auxiliary variables $PTH A_{u, v}^{p}$ , set to 1 if and only if there is a path from s to u in tumor graph G_p, and (u, v) is a tree edge of $T_{P, s}$ —by Constraints (13–15). A tree path from s to v exists in a tumor graph G_p if G_p both includes the path from s to u as well as the tree edge from u to v—by Constraint (16). Here, the indicator constant $α_{u, v}$ takes the value 1 if the edge $(u, v) \in G_{p}$ , and 0 otherwise. By default, the root node s has a path in tumor graph G_p if it is observed in the tumor p—by Constraints (11) and (12).

Note that only $X_{u, v}$ variables are required to be binary. The decision variables $P T H_{v}^{p}$ can be continuous, with the addition of constraints that they cannot be larger than 1. This reduces the complexity of the model and the running time and space required to solve it.

2.2 Not transitive input graphs

Section 2.1 describes finding the evolutionarily conserved alteration set $S = {v \in A | T {(s)}_{v} \neq \emptyset}$ in a prior step to building the MCETT from a given root s. The transitivity of directed edges in the input tumor graphs makes this easy— simply add to S all nodes v for which there exists a directed edge (s, v) in at least one of the tumor graphs.

However, in some applications the user may wish to break the transitivity in a tumor graph G_p so that while it includes directed edges (u, w) and (w, v), it does not have (u, v). This could be desirable if there is strong belief/confidence that the intermediate node/alteration w was necessary for the emergence of alteration v in G_p. In such cases, the user may ensure that an evolutionary trajectory involving alterations u and v does not ‘skip’ a high-confidence intermediate alteration w by excluding the edge (u, v). Note that this may lead to some nodes being excluded from the resulting consensus tree: e.g. given u as the root, suppose that half of the tumor graphs in the cohort include the path u, w, v and the other half include the directed edge (u, v) but not the directed edge (w, v). If $γ > 0.5$ then v will not be included in the consensus tree.

Furthermore, the relaxation of the transitivity property makes the problem of finding the evolutionarily conserved alteration set non-trivial; this is due to the fact that in any G_p, not all nodes on the directed path from root s to another node v will necessarily be the children of s. We show how to address this issue by a pre-processing step described here. Let $S^{*}$ be the largest set of nodes such that for each node $v \in S^{*}$ , there are at least $γ | T {(s)}_{v} |$ graphs which include a path from s to v consisting only of nodes in $S^{*}$ . If a node v is in the resulting consensus tree, the path from s to v must consist only of nodes in $S^{*}$ (since $S^{*}$ is defined as the largest such set). This implies that a node that is not in $S^{*}$ cannot be in the consensus tree, implying that computing the set $S^{*}$ and pruning out all nodes not included in it can reduce the solution space. Given a node s, we formulate the maximum path-conserved subgraph identification (MPCSI) problem whose solution is the set $S^{*}$ and describe a polynomial-time algorithm to compute it. Note that on an input set of directed graphs ${G_{i}} (i \in I)$ , we use the notation $G = (\cup_{i \in I} V (G_{i}), \cup_{i \in I} E (G_{i}))$ to denote the smallest supergraph of all graphs $G_{i} (i \in I)$ (so that each G_i is a subgraph of $G$ ).

MPCSI problem: Given ${G_{i}}$ $(i \in I)$ , their smallest supergraph $G$ , a root node $s \in G$ , and a number t_v for each $v \in G$ , find the largest set of nodes $S^{*}$ of $G$ such that for each node $v \in S^{*}$ there are at least t_v graphs G_i which include a path from s to v consisting only of nodes in $S^{*}$ .

To solve MPCSI problem for a given root s, CONETT uses an iterative algorithm which starts with $S^{*} = {s}$ . In each iteration, the algorithm considers adding to $S^{*}$ a new node v to which there is a directed edge from any node in $S^{*}$ in at least t_v graphs. Node v is only added to $S^{*}$ if there are at least t_v graphs that include a path from s to v consisting only of nodes already in $S^{*}$ . For finding out whether v satisfies this condition, CONETT constructs a data structure P(v) for each node $v \in G$ , which maintains the set of graphs that have a directed edge from any node in $S^{*}$ to v in the form of a bitmap. Naturally, P(s) is initially equal to the set of all graphs where s is present. For every other node v, P(v) is initially equal to the set of tumor graphs that include edge (s, v). The algorithm terminates if there is no $v \in (G - S)$ such that $| P_{v} | \geq t_{v}$ .

Otherwise it identifies any node $u \in (G - S^{*})$ such that $| P_{u} | \geq t_{v}$ and adds it to $S^{*}$ . Then, it checks for each node $v \in S^{*}$ for which $(u, v) \in E (G)$ , whether the addition of u to $S^{*}$ adds any new graphs to P_v. If this is the case, the algorithm inserts the new graphs to P_v and recursively propagates the new graphs to any neighbor $w \in S^{*}$ of v that does not already have them in P_w. Supplementary Figure S1 illustrates the necessity of the recursive path propagation within $S^{*}$ . Finally, the algorithm updates P_z for every node $z \in (G - S^{*})$ for which there is an edge $(v, z) \in E (G) (v \in S^{*})$ .

The correctness of the algorithm follows from the invariant that at each iteration, the algorithm considers paths from s to nodes that are either in $S^{*}$ or are neighbors of nodes in $S^{*}$ , thereby ensuring the existence of at least t_v paths (each in a separate graph). We also note that for any given $v \in S^{*}$ , the total number of times that P_v is updated by the algorithm cannot be more than the number of graphs. Furthermore the total number of times the algorithm can visit v is no more than the number of edges in $G$ . As a result, the running time of the algorithm is $O (| V (G) | \cdot (| P | + | E (G) |))$ .

2.3 Optional constraints and parameters

CONETT allows the user to add additional constraints to the ILP solution to the MCETT problem described in Section 2.1, via a new set of options/parameters as follows: (i) CONETT allows the user to restrict the set of nodes which are included into the consensus tree to more recurrent alteration events that ‘follow’ the root event in a specified minimum number of tumor graphs t. This is achieved by removing all nodes v from the evolutionarily conserved alteration set S for which $| T {(s)}_{v} | < t$ , prior to building the tree. (ii) CONETT also allows the user to add constraints on the ancestor–descendant ordering of nodes in conserved evolutionary trajectories of the resulting tree $T_{P, s}$ . Specifically, it calculates a ‘confidence score’ of u being an ancestor of v based on evidence provided by all tumor graphs in which root s is an ancestor of u and v that do not have ‘don’t care’ edges. Given a user-defined value δ, representing the threshold for confidence scores between each pair of nodes in the evolutionarily conserved alteration set S, CONETT adds constraints to the ILP formulation that require each ancestor–descendant pair of nodes in the tree to have a confidence score at least δ.

CONETT’s confidence score for node u being an ancestor of node v, penalizes those tumor graphs that include (i) a directed edge (v, u) without a directed edge (u, v) and (ii) an anti-edge between u and v, while considering tumor graphs with a don’t care edge between u and v neutral. Specifically, let $T {(s)}_{u, v} = T {(s)}_{u} \cap T {(s)}_{v}$ be the subset of all tumor graphs in which there exists a directed path from s to both u and v (note that no tumor graph in this set has a ‘don’t care’ edge between u and v). Let, $P_{u, v} \subseteq T {(s)}_{u, v}$ be the subset of tumor graphs that contain a directed edge (u, v) (the presence of directed edge (v, u) does not make a difference). Also let $R_{u, v} \subseteq T {(s)}_{u, v}$ be the set of tumor graphs that includes the directed edge (v, u) but not the edge (u, v). Similarly, let $Q_{u, v} \subseteq T {(s)}_{u, v}$ be the subset of tumor graphs which include an anti-edge between u and v (In the case that transitivity of directed edges in the tumor graphs is broken, then $P_{u, v}$ represents the subset of tumor graphs in which there is a directed path from u to v (they may lie on a cycle). Then $R_{u, v}$ represents the subset of tumor graphs in which there is a directed path from v to u but no directed path from u to v is present. Since, in general, anti-edges do not need to be explicitly specified, so long as don’t care edges are specified, we may consider $Q_{u, v}$ to represent the subset of tumor graphs in which there is no directed path between either u to v or v to u.). Note that $T {(s)}_{u, v} = P_{u, v} \cup R_{u, v} \cup Q_{u, v}$ . Then the confidence score of u being and ancestor of v in the subset of tumor graphs $T {(s)}_{u} \cup T {(s)}_{v}$ is set to be $W_{u, v} = \frac{| P_{u, v} | - | R_{u, v} |}{| T {(s)}_{u, v} |}$ . As can be seen below, this (as well as alternative) confidence score(s) penalizing (or incentivizing) specific edge types is easily incorporated in CONETT’s ILP formulation.

(18)	$\forall (u, v, w) :$	$P A_{u, v, w} \leq A_{u, w}$
(19)		$P A_{u, v, w} \leq A_{w, v}$
(20)		$P A_{u, v, w} \geq A_{u, w} + A_{w, v} - 1$
(21)	$\forall (u, v) :$	$A_{u, v} \geq X_{u, v}$
(22)		$\| S \| \cdot A_{u, v} \geq \sum_{w \in S} P A_{u, v, w}$
(23)		$A_{u, v} \leq X_{u, v} + \sum_{w \in S} P A_{u, v, w}$
(24)	$\forall (u, v) s . t . W_{u, v} < δ :$	$X_{u, v} = 0$
(25)		$A_{u, v} = 0$

Open in a new tab

For each pair of nodes u and v in S, we add a variable $A_{u, v}$ , which is set to 1 if u is an ancestor of v in $T_{P, s}$ , and 0 otherwise. Constraints (24) and (25) ensure that all ancestor–descendant pairs of nodes in $T_{P, s}$ have a high-confidence score. We introduce auxiliary variables $P A_{u, v, w}$ , set to 1 if u is an ancestor of w and w is an ancestor of v and set to 0 otherwise—by Constraints (18–20). A parent of a node is its ancestor by default, by Constraint (21). Constraints (22) and (23) impose transitivity on ancestor–descendant relationships: u is an ancestor of v if and only if u is an ancestor of w and w is an ancestor of v, for some intermediate node w.

2.4 Assessment of the statistical significance of MPCSI solutions

In order to assess the statistical significance of an evolutionarily conserved alteration set S computed by CONETT (or a solution to the MPCSI problem for graphs that are not transitive), we perform a variation of the standard permutation test on the input data. Let $N_{a, p}$ represent the number of nodes (genes and chromosome arms) altered by alteration type a in tumor graph G_p, and let $F_{v, a}$ represent the number of tumors in which node v occurs altered by a (as mentioned earlier, CONETT can be set to distinguish alteration types on individual genes or chromosomal arms. We have not treated distinct alteration types differentially in the algorithms described above, however in case certain genes/chromosomal arms included in S are not strongly associated with a specific alteration type, (across the tumors) our empirical test can be set to differentiate alteration types as described here. Otherwise a can be set to the same value for all nodes). For each tumor $p \in P$ , and each alteration type a, we randomly draw (without repetitions) $N_{a, p}$ nodes, and mark them as altered by a, in case they were altered by a in at least one tumor graph in the original input. The nodes are randomly drawn from a distribution based on their recurrence frequency, i.e. each node v is drawn with the probability $\frac{F_{v, a}}{\sum_{u \in V (G)} F_{u, a}}$ . That way, the average recurrence frequency of each node and alteration type in the permuted data is equal to the original recurrence frequency, preserving hidden inter-dependencies among somatic mutations and copy number alterations that were the cause of the observed recurrence frequencies and evolutionarily conserved alteration sets. Additionally, consistent with observations on the TRACERx ccRCC dataset we use in this paper, special care was devoted to ensuring that a copy number event affecting a whole chromosome does not appear together with a copy number event altering either of its arms within the same tumor. Last, the randomly chosen altered nodes replace the original nodes of the input tumor graph G_p after being subject to a random permutation.

The above process is repeated 1000 times. In order to estimate the statistical significance of a set S of size k, obtained with root alteration event s that occurs altered in F_s tumors, we run the method on each generated permutation of the input data using the same parameters, and record the number of times that the resulting set has a size of k or greater when computed with a root that occurs altered in F_s or more tumors. That number, divided by the total number of experiments (1000 for our case), gives us the empirical P-value estimate for S.

3 Results

3.1 Data and pre-processing

3.1.1 Clear-cell renal cell carcinoma

We obtained ccRCC data from the TRACERx renal study (Turajlic et al., 2018), which gathered multi-region sequencing data from 100 tumors. The study collected non-silent mutation data in about 110 genes that are deemed to be high-confidence ccRCC driver genes, obtained through a sequencing panel from which single-nucleotide variants (SNVs), dinucleotide variants (DNVs) and small insertion and deletions (INDELs) were inferred. The study also detected somatic copy number alterations, which were reported only on the level of chromosome arms or whole chromosome—provided at least 50% of the chromosomal arm had a copy number alteration. Then, in each tumor, the study clustered all the alterations and reconstructed a phylogenetic tree of clonal hierarchy.

For every somatically altered gene and copy number altered chromosomal arm that is found in a particular tumor’s phylogenetic tree, we created a node labeled by the name of the gene (or chromosome) and the type of alteration that affects it. Since, in this dataset, SNVs, DNVs and small INDELs events result in inactivation of the genes that they affect, we treated them all as a single ‘inactivating’ alteration type. Then, we build a directed graph on this set of nodes for a particular tumor, such that a directed edge is drawn from node u to node v if the clonal node in the phylogenetic tree which carries the alteration affecting node u is an ancestor of the clonal node in the phylogenetic tree which carries the alteration affecting node v. In the case of alteration events A and B that belong to the same clone, we draw parallel directed edge, in opposite directions, between the nodes that are associated with A and B. In five tumor trees, there was immediate branching right after the germline node. To ensure that there are no cross links in our tumor graphs (which can happen if a gene is present in both branches), leading to false precedence relationship between ancestors of the cross link source and descendants of the cross link destination, we split such trees at the root and treat them as two separate tumor phylogenetic trees.

As a result, we obtained tumor graphs from 105 ‘tumors’ (after splitting the 5 trees with branching at germline node), containing 242 different nodes in total, representing driver genes affected by inactivating sequence alterations, and copy number gains and losses on chromosomes—both on whole chromosome and arm level. Supplementary Figure S2 shows recurrence frequencies of the 242 events; the y axis shows the recurrence frequency (in log scale), and events are sorted along the x axis based on their y values for clarity purposes. The figure reveals the sparsity of the data—more than a third of genes that carry sequence alterations each occur in only a single tumor, and more than a half occur in at most 2. Among more recurrently altered genes and chromosome arms (those that are found altered by the same alteration type in at least 10 tumors), there are 19 (35%) copy number gain-altered, 26 (50%) copy number loss-altered and 12 (9%) sequence-altered nodes.

3.1.2 Non-small-cell lung cancer

We obtained data from 99 NSCLC patients from (Caravagna et al., 2018). Originating from whole-exome sequencing of multiple spatially separated regions from the TRACERx lung cancer study (Jamal-Hanjani et al., 2017), Caravagna et al. (2018) provide cancer cell fraction values, as well as clustering information, for SNVs and focal copy number alterations in 79 putative driver genes of NSCLC (we note that there was no case of a gene being affected by alterations of different types in any patient). Using CITUP (Malikic et al., 2015), we inferred phylogenetic trees with the clustering information provided by Caravagna et al. (2018), and for each patient we selected the tree with the minimum error score (which was always unique). We then construct tumor graphs from the phylogenetic trees similarly to the procedure used to generate tumor graphs in the ccRCC data (Section 3.1.1).

Contrary to the TRACERx ccRCC data, in which the clustering of alterations contained chromosome-level copy number alterations, the clustering of alterations did not contain copy number alterations on a chromosome level in this dataset. Since in Caravagna et al. (2018), the analysis was performed on the gene level, ignoring the specific alteration type (or having just one universal alteration type), we took a similar approach to make our results comparable by merging SNVs and gene deletions into a single ‘inactivating’ alteration type. However, we kept gene amplifications as separate alteration type since the downstream effect significantly differs from SNVs and deletions.

3.2 Conserved evolutionary trajectories in ccRCC

We first used CONETT to identify statistically significant alteration sets that consisted of highly-recurrent alteration events following the root event, and associated trajectories in the TRACERx renal study (see Section 3.1.1), by considering as a root node each gene and chromosomal arm that are found to be altered in at least 10 tumors. For each root, in this experiment we set the optional parameter t in the MPCSI formulation to a high fraction of the root’s recurrence frequency ( $\geq 48 %$ ). We empirically assessed the statistical significance of these alteration sets using methodology described in Section 2.4. From the evolutionarily conserved alteration sets, we constructed conserved evolutionary trajectory trees, setting the γ parameter in the MCETT formulation to $\frac{1}{3}$ and the ancestor–descendant order confidence score threshold of $δ = 0.85$ (see Section 2.3). Below we report on the 10 most significant evolutionarily conserved alteration sets.

Four of these sets contain two or three nodes, involving genes and copy number altered chromosome arms that are co-clonal (Fig. 3E): these include sequence alterations on driver gene VHL and copy number loss on Chromosome 3p, which are known to be founder alterations in ccRCC. Conserved evolutionary trajectories rooted either by sequence alterations on VHL or copy number loss on 3p (Fig. 3A and B) both include sequence alterations in PBRM1 and copy number loss on Chromosome 14q as highly conserved follow-up events (P-value < 0.001), with sequence alteration in PBRM1 typically occurring before copy number loss on 14q. Copy number gain on Chromosome 5 (occurring in about a quarter of all tumors) is accompanied by co-clonal sequence alteration in VHL, and is followed by copy number loss on Chromosome 14q and copy number gain on Chromosome 7, each observed in about half of the tumors (P-value < 0.008, Fig. 3C). The remaining three conserved trajectories are all rooted by the copy number loss on Chromosome 6, occurring in 19 tumors (Fig. 3D shows the largest trajectory tree). It is followed by losses on 14q and 9 at a high conservation rate of 70% (P-value < 0.001), gains on Chromosome 7 at 55% (P-value < 0.010) and sequence alterations of VHL and loss on Chromosome 4 at 50% (P-value < 0.001). Loss on Chromosome 4 follows loss of Chromosome 9 in more than a third of the tumors.

Next, we looked into trees of trajectories rooted by the most recurrent events in the dataset. As sequence alterations in VHL (found in 77% tumors) and copy number loss on Chromosome 3p (found in 89% tumors) play a key role in initiation of ccRCC, we examined the two evolutionarily conserved alteration sets and conserved evolutionary trajectories that are rooted by either of these events and contain nodes that are altered in at least 10 tumors each (t = 10, both of the sets have P-value < 0.001). We also examined conserved evolutionary trajectories that start with the germline node, with the same value of t, as it is present in the whole dataset. The resulting trees of trajectories are shown on Supplementary Figures S6–S8; while Figure 3F shows a number of select trajectories from those trees.

Consistent with the trees reported in the original TRACERx renal study, the trajectories obtained by CONETT include a series of sequence alteration events VHL → PBRM1 → SETD2 observed in 14 tumors, as well as VHL → BAP1 observed in 15 tumors. Additionally, CONETT also detected two conserved evolutionary trajectories terminating with a sequence alteration in gene KDM5C that collectively cover 80% of tumors where it is present: VHL & loss_3p → gain_5q → KDM5C in eight tumors and VHL → PBRM1 → KDM5C in nine tumors; these trajectories were not reported in the original study. Note that KDM5C is a chromatin modifier, similar to BAP1; its dysregulation alters the activity of many other genes (Guo and Zhang, 2017). Additionally, VHL → PBRM1 → TSC1 is found in nine tumors, representing 75% tumors where TSC1 is present. CONETT also identified a number of less frequently altered genes that do not display high-confidence ancestor–descendant ordering. These alteration events in genes such as MTOR (in 12/17 tumors), OBSCN (in 7/12 tumors), and MUC16 (in 10/16 tumors) all seem to branch out from an alteration in VHL.

Inactivations of both PBRM1 and BAP1 are known to rarely co-occur in ccRCC (Peña-Llopis et al., 2012; Turajlic et al., 2018), and trees of conserved evolutionary trajectories on Supplementary Figures S6–S8 always place them into different branches. However, most of the time when they do co-occur, PBRM1 alteration precedes BAP1 alteration. We thus ran CONETT using PBRM1 alteration as the root event, to find out whether BAP1 sequence alteration appears as a conserved follow-up event, using t = 10 (P-value < 0.001). The resulting tree of conserved evolutionary trajectories is shown on Supplementary Figure S9. Alteration on BAP1 was not identified as a downstream event, strengthening the hypothesis that their co-occurrence is most likely due to upstream events such as loss on 3p and inactivation of VHL.

Loss on Chromosome 14q is the second most recurrent large-scale copy number loss event (after loss on 3p), occurring in 50 tumors. However, contrary to loss on 3p which regularly appears as the earliest alteration event in tumors, loss on 14q is always placed later, after both VHL and loss on 3p. It is often preceded by inactivation of PBRM1, although not in every single tumor (Supplementary Fig. S9 shows that loss on 14q follows inactivation of PBRM1 in 22 tumors). There is also strong evidence of precedence of gains on Chromosomes 5 and 5q (which are themselves mutually exclusive) to loss on Chromosome 14q—trajectories rooted by loss on 14q do not detect gains on Chromosome 5 or 5q as downstream events (Supplementary Fig. S10), but trajectories rooted by gains on 5 and 5q do detect loss on 14q as a downstream event (Supplementary Fig. S11). Additionally, Supplementary Figure S11A shows presence of mutual exclusivity of losses on Chromosomes 3p and 14q in tumors that contain gain on Chromosome 5, whereas that relationship is not noticed in tumors that contain gain on 5q (Supplementary Fig. S11B). This pattern of mutual exclusivity was not reported in the original ccRCC TRACERx study.

3.3 Conserved evolutionary trajectories in NSCLC

We used CONETT to identify conserved evolutionary trajectories in the TRACERx NSCLC study (see Section 3.1.2). These data were analyzed by REVOLVER (Caravagna et al., 2018), allowing us to compare the CONETT-identified trajectories with the edges in clusters identified by their method. Since REVOLVER uses ‘collective’ information from the whole set of tumors to infer individual tumor phylogenetic trees, in order to make our results more comparable we set the value for the optional parameter δ = 1—ensuring that the temporal order of ancestor–descendant pairs of nodes in our trees is conflict-free with not just those tumors in which the specific trajectory that they lie on is conserved, but also the wider set of tumors in which the root is ancestor of both the nodes (please see Section 2.3 for exact details). We used the same value for the parameter $γ = \frac{1}{3}$ as in the analysis of the ccRCC dataset.

In Caravagna et al. (2018), the authors highlighted evolutionary trajectories constructed from edges in cluster C5 (Fig. 4B). Using CONETT, with the SNV-altered gene CDKN2A as the root, we were able to recover the full topological order of the genes in cluster C5 (Fig. 4A). The CONETT-identified tree captures more occurrences of tree edges CDKN2A → TP53 and TP53 → TERT, as well as edges from TP53 toward NOTCH1 and COL5A2, which were tree edges in cluster C5 but are just forward edges in CONETT (dashed gray lines); due to CONETT trajectories capturing additional ancestor–descendant relationship between amplification of TERT and SNVs on sibling nodes FAT1 and NOTCH1.

Additionally, we used CONETT to construct trees of trajectories with known early drivers (as well as the germline node) as roots (Supplementary Figs S12 and S13). Figure 4C shows select conserved trajectories which are supported by both trees. The trajectories capture TP53 → SOX2 & PIK3CA amplifications → NFE2L2 (in 4/9 tumors), as well as TP53 → KEAP1 (in 6/12 tumors)—inactivations of NFE2L2 and KEAP1 may be linked to increased chemoresistance (Jeong et al., 2020). All TP53-rooted evolutionarily conserved alteration sets were found to be statistically significant (P-value <0.001). Interestingly, trajectories rooted by SNV and amplification of EGFR did not pick up any node other than TP53, which they are often found by REVOLVER to be in the same cluster with (Fig. 4D). Even when we merge amplifications and SNVs of EGFR, RB1 and PIK3CA just barely emerge as downstream altered genes. KRAS-rooted trajectories capture MGA and KMT2D, consistent with the germline-rooted tree (Supplementary Fig. S12).

4. Discussion

We present CONETT, a novel computational method for combinatorial detection of conserved evolutionary trajectories in tumors. CONETT is the first such method that goes beyond just identifying clusters of common edges and directly produces a consensus phylogeny which aims to maximize the number of ancestor–descendant ordered pairs of alteration events. Given a root alteration event, we show how to construct a conserved evolutionary trajectory tree consisting of the maximum set of events that can be found on conserved evolutionary trajectories from the root via an ILP formulation. We applied our method to recently published datasets of 100 ccRCC tumors and 99 NSCLC tumors, and identified conserved evolutionary trajectories of sequence-altered genes and copy number altered chromosomal arms. CONETT identified all conserved evolutionary trajectories involving sequence-altered genes reported as significant by the original studies; it also identified several additional conserved trajectories which were not reported earlier.

As per the original studies of TRACERx renal and lung cancer data, we have not differentiated sequence-alteration types and have treated all sequence-altered genes identically; this allowed us to compare CONETT results with those of the original studies. However CONETT’s framework is sufficiently general to offer the user the ability to differentiate somatic alteration types in the trajectories it identifies so as to perform in depth analysis of larger and richer datasets to be published in the near future.

We acknowledge that the consensus tree that CONETT’s ILP formulation currently constructs represents a single tree topology in which each node occurs only once. That forces the method to select only a single trajectory from a given root toward each node, even though there might be multiple different trajectories that each occur in a significant fraction of tumors (as evidenced by the two different trajectories toward KDM5C discovered by CONETT and shown on Supplementary Figures S6 and S8, together covering 80% of tumors in which the gene is sequence-altered). In future, it would be useful to devise a way to simultaneously identify different conserved trajectories in tumor evolution toward the same node.

As larger and more precise datasets emerge, as well as depending on the motivation for and the specific application of the analysis, it might become desirable to use an alternative version of tumor graphs which do not have the property of transitivity of the directed edges. This could be desirable if there is strong belief that an ancestor event u was necessary for the emergence of another follow-up event v. In such cases, removing transitive edges toward v would ensure that detected evolutionary trajectories involving v would not be able to ‘skip’ the high-confidence intermediate alteration event u. Our method can incorporate such data into its framework as is; we also give a polynomial time algorithm to detect a set of events that is a superset of the nodes in the resulting tree—allowing to prune the search space and reduce the running time and space of the ILP model.

Additionally, we introduced an optional scoring scheme which measures the confidence of ancestry for a pair of nodes based on the evidence given by the temporal order of the two events in the input tumor data. Currently, our method uses a common threshold for all possible ancestor–descendant pairs of nodes, imposing a uniform level of stringency. It is worth exploring how to impose a level of stringency specific to each individual pair of nodes on evidence available for their partial ordering in the future.

Funding

This work is supported in part by the Intramural Research Program of the National Institutes of Health, National Cancer Institute and utilized the computational resources of the NIH Biowulf high performance computing cluster (http://hpc.nih.gov) and Gurobi (http://www.gurobi.com) to solve some optimization problems.

Conflict of Interest: none declared.

Supplementary Material

btaa453_Supplementary_Data

Click here for additional data file.^{(802.4KB, pdf)}

References

Abbosh C. et al. ; The TRACERx Consortium. (2017) Phylogenetic ctDNA analysis depicts early-stage lung cancer evolution. Nature, 545, 446–451. [DOI] [PMC free article] [PubMed] [Google Scholar]
Aguse N. et al. (2019) Summarizing the solution space in tumor phylogeny inference by multiple consensus trees. Bioinformatics, 35, i408–i416. [DOI] [PMC free article] [PubMed] [Google Scholar]
Alcaraz N. et al. (2012) Efficient key pathway mining: combining networks and omics data. Integr. Biol., 4, 756–764. [DOI] [PubMed] [Google Scholar]
Bomersbach A. et al. (2016) An efficient branch and cut algorithm to find frequently mutated subnetworks in cancer. In: International Workshop on Algorithms in Bioinformatics, pp. 27–39. Springer, Aarhus, Denmark.
Caravagna G. et al. (2018) Detecting repeated cancer evolution from multi-region tumor sequencing data. Nat. Methods, 15, 707–714. [DOI] [PMC free article] [PubMed] [Google Scholar]
Crowther M.D. et al. (2020) Genome-wide CRISPR-Cas9 screening reveals ubiquitous T cell cancer targeting via the monomorphic MHC class I-related protein MR1. Nat. Immunol., 21, 178–185. [DOI] [PMC free article] [PubMed] [Google Scholar]
Deshwar A.G. et al. (2015) PhyloWGS: reconstructing subclonal composition and evolution from whole-genome sequencing of tumors. Genome Biol., 16, 35. [DOI] [PMC free article] [PubMed] [Google Scholar]
Diaz-Uriarte R., Vasallo C. (2019) Every which way? On predicting tumor evolution using cancer progression models. PLoS Comput. Biol., 15, e1007246. [DOI] [PMC free article] [PubMed] [Google Scholar]
Eaton J. et al. (2018) Deconvolution and phylogeny inference of structural variations in tumor genomic samples. Bioinformatics, 34, i357–i365. [DOI] [PMC free article] [PubMed] [Google Scholar]
El-Kebir M. et al. (2015) Reconstruction of clonal trees and tumor composition from multi-sample sequencing data. Bioinformatics, 31, i62–i70. [DOI] [PMC free article] [PubMed] [Google Scholar]
El-Kebir M. et al. (2016) Inferring the mutational history of a tumor using multi-state perfect phylogeny mixtures. Cell Syst., 3, 43–53. [DOI] [PubMed] [Google Scholar]
Govek K. et al. (2018) A consensus approach to infer tumor evolutionary histories. In: Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics—BCB ‘18. ACM Press, New York, NY, USA.
Guo X., Zhang Q. (2017) The emerging role of histone demethylases in renal cell carcinoma. J. Kidney Cancer VHL, 4, 1–5. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hodzic E. et al. (2019) Combinatorial detection of conserved alteration patterns for identifying cancer subnetworks. GigaScience, 8, giz024. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hosseini S.-R. et al. (2019) Estimating the predictability of cancer evolution. Bioinformatics, 35, i389–i397. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hristov B.H., Singh M. (2017) Network-based coverage of mutational profiles reveals cancer genes. Cell Syst., 5, 221–229.e4. [DOI] [PMC free article] [PubMed] [Google Scholar]
Jahn K. et al. (2016) Tree inference for single-cell data. Genome Biol., 17, 86. [DOI] [PMC free article] [PubMed] [Google Scholar]
Jamal-Hanjani M. et al. (2017) Tracking the evolution of non–small-cell lung cancer. N. Engl. J. Med., 376, 2109–2121. [DOI] [PubMed] [Google Scholar]
Jeong Y. et al. (2020) Role of KEAP1/NFE2l2 mutations in the chemotherapeutic response of patients with non-small cell lung cancer. Clin. Cancer Res., 26, 274–281. [DOI] [PMC free article] [PubMed] [Google Scholar]
Khakabimamaghani S. et al. (2019) Collaborative intra-tumor heterogeneity detection. Bioinformatics, 35, i379–i388. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kim Y.-A. et al. (2013) Module cover–a new approach to genotype-phenotype studies. In: Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing, p. 135. NIH Public Access, World Scientific Publishing (Biocomputing). [PMC free article] [PubMed]
Leiserson M.D.M. et al. (2015) Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes. Nat. Genet., 47, 106–114. [DOI] [PMC free article] [PubMed] [Google Scholar]
Lipinski K.A. et al. (2016) Cancer evolution and the limits of predictability in precision cancer medicine. Trends Cancer, 2, 49–63. [DOI] [PMC free article] [PubMed] [Google Scholar]
Malikic S. et al. (2015) Clonality inference in multiple tumor samples using phylogeny. Bioinformatics, 31, 1349–1356. [DOI] [PubMed] [Google Scholar]
Malikic S. et al. (2019. a) Integrative inference of subclonal tumour evolution from single-cell and bulk sequencing data. Nat. Commun., 10, 2750. [DOI] [PMC free article] [PubMed] [Google Scholar]
Malikic S. et al. (2019. b) PhISCS: a combinatorial approach for subperfect tumor phylogeny reconstruction via integrative use of single-cell and bulk sequencing data. Genome Res., 29, 1860–1877. [DOI] [PMC free article] [PubMed] [Google Scholar]
Peña-Llopis S. et al. (2012) BAP1 loss defines a new class of renal cell carcinoma. Nat. Genet., 44, 751–759. [DOI] [PMC free article] [PubMed] [Google Scholar]
Popic V. et al. (2015) Fast and scalable inference of multi-sample cancer lineages. Genome Biol., 16, 91. [DOI] [PMC free article] [PubMed] [Google Scholar]
Ross E.M., Markowetz F. (2016) OncoNEM: inferring tumor evolution from single-cell sequencing data. Genome Biol., 17, 1–14. [DOI] [PMC free article] [PubMed] [Google Scholar]
Turajlic S. et al. (2018) Deterministic evolutionary trajectories influence primary tumor growth: TRACERx renal. Cell, 173, 595–610.e11. [DOI] [PMC free article] [PubMed] [Google Scholar]
Ulitsky I. et al. (2010) DEGAS: de novo discovery of dysregulated pathways in human diseases. PLoS One, 5, e13367. [DOI] [PMC free article] [PubMed] [Google Scholar]
Vandin F. et al. (2011) Algorithms for detecting significantly mutated pathways in cancer. J. Comput. Biol., 18, 507–522. [DOI] [PubMed] [Google Scholar]
Yates L.R. et al. (2017) Genomic evolution of breast cancer metastasis and relapse. Cancer Cell, 32, 169–184.e7. [DOI] [PMC free article] [PubMed] [Google Scholar]
Yusuke M. et al. (2017) phyC: clustering cancer evolutionary trees. PLoS Comput. Biol., 13, e1005509. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zaccaria S. et al. (2018) Phylogenetic copy-number factorization of multiple tumor samples. J. Comput. Biol., 25, 689–708. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zafar H. et al. (2019) SiCloneFit: Bayesian inference of population structure, genotype, and phylogeny of tumor clones from single-cell genome sequencing data. Genome Res., 29, 1847–1859. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

btaa453_Supplementary_Data

Click here for additional data file.^{(802.4KB, pdf)}

[btaa453-B1] Abbosh C. et al. ; The TRACERx Consortium. (2017) Phylogenetic ctDNA analysis depicts early-stage lung cancer evolution. Nature, 545, 446–451. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btaa453-B2] Aguse N. et al. (2019) Summarizing the solution space in tumor phylogeny inference by multiple consensus trees. Bioinformatics, 35, i408–i416. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btaa453-B3] Alcaraz N. et al. (2012) Efficient key pathway mining: combining networks and omics data. Integr. Biol., 4, 756–764. [DOI] [PubMed] [Google Scholar]

[btaa453-B4] Bomersbach A. et al. (2016) An efficient branch and cut algorithm to find frequently mutated subnetworks in cancer. In: International Workshop on Algorithms in Bioinformatics, pp. 27–39. Springer, Aarhus, Denmark.

[btaa453-B5] Caravagna G. et al. (2018) Detecting repeated cancer evolution from multi-region tumor sequencing data. Nat. Methods, 15, 707–714. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btaa453-B6] Crowther M.D. et al. (2020) Genome-wide CRISPR-Cas9 screening reveals ubiquitous T cell cancer targeting via the monomorphic MHC class I-related protein MR1. Nat. Immunol., 21, 178–185. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btaa453-B7] Deshwar A.G. et al. (2015) PhyloWGS: reconstructing subclonal composition and evolution from whole-genome sequencing of tumors. Genome Biol., 16, 35. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btaa453-B8] Diaz-Uriarte R., Vasallo C. (2019) Every which way? On predicting tumor evolution using cancer progression models. PLoS Comput. Biol., 15, e1007246. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btaa453-B9] Eaton J. et al. (2018) Deconvolution and phylogeny inference of structural variations in tumor genomic samples. Bioinformatics, 34, i357–i365. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btaa453-B10] El-Kebir M. et al. (2015) Reconstruction of clonal trees and tumor composition from multi-sample sequencing data. Bioinformatics, 31, i62–i70. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btaa453-B11] El-Kebir M. et al. (2016) Inferring the mutational history of a tumor using multi-state perfect phylogeny mixtures. Cell Syst., 3, 43–53. [DOI] [PubMed] [Google Scholar]

[btaa453-B12] Govek K. et al. (2018) A consensus approach to infer tumor evolutionary histories. In: Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics—BCB ‘18. ACM Press, New York, NY, USA.

[btaa453-B13] Guo X., Zhang Q. (2017) The emerging role of histone demethylases in renal cell carcinoma. J. Kidney Cancer VHL, 4, 1–5. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btaa453-B14] Hodzic E. et al. (2019) Combinatorial detection of conserved alteration patterns for identifying cancer subnetworks. GigaScience, 8, giz024. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btaa453-B15] Hosseini S.-R. et al. (2019) Estimating the predictability of cancer evolution. Bioinformatics, 35, i389–i397. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btaa453-B16] Hristov B.H., Singh M. (2017) Network-based coverage of mutational profiles reveals cancer genes. Cell Syst., 5, 221–229.e4. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btaa453-B17] Jahn K. et al. (2016) Tree inference for single-cell data. Genome Biol., 17, 86. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btaa453-B18] Jamal-Hanjani M. et al. (2017) Tracking the evolution of non–small-cell lung cancer. N. Engl. J. Med., 376, 2109–2121. [DOI] [PubMed] [Google Scholar]

[btaa453-B19] Jeong Y. et al. (2020) Role of KEAP1/NFE2l2 mutations in the chemotherapeutic response of patients with non-small cell lung cancer. Clin. Cancer Res., 26, 274–281. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btaa453-B20] Khakabimamaghani S. et al. (2019) Collaborative intra-tumor heterogeneity detection. Bioinformatics, 35, i379–i388. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btaa453-B21] Kim Y.-A. et al. (2013) Module cover–a new approach to genotype-phenotype studies. In: Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing, p. 135. NIH Public Access, World Scientific Publishing (Biocomputing). [PMC free article] [PubMed]

[btaa453-B22] Leiserson M.D.M. et al. (2015) Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes. Nat. Genet., 47, 106–114. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btaa453-B23] Lipinski K.A. et al. (2016) Cancer evolution and the limits of predictability in precision cancer medicine. Trends Cancer, 2, 49–63. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btaa453-B24] Malikic S. et al. (2015) Clonality inference in multiple tumor samples using phylogeny. Bioinformatics, 31, 1349–1356. [DOI] [PubMed] [Google Scholar]

[btaa453-B25] Malikic S. et al. (2019. a) Integrative inference of subclonal tumour evolution from single-cell and bulk sequencing data. Nat. Commun., 10, 2750. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btaa453-B26] Malikic S. et al. (2019. b) PhISCS: a combinatorial approach for subperfect tumor phylogeny reconstruction via integrative use of single-cell and bulk sequencing data. Genome Res., 29, 1860–1877. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btaa453-B27] Peña-Llopis S. et al. (2012) BAP1 loss defines a new class of renal cell carcinoma. Nat. Genet., 44, 751–759. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btaa453-B28] Popic V. et al. (2015) Fast and scalable inference of multi-sample cancer lineages. Genome Biol., 16, 91. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btaa453-B29] Ross E.M., Markowetz F. (2016) OncoNEM: inferring tumor evolution from single-cell sequencing data. Genome Biol., 17, 1–14. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btaa453-B30] Turajlic S. et al. (2018) Deterministic evolutionary trajectories influence primary tumor growth: TRACERx renal. Cell, 173, 595–610.e11. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btaa453-B31] Ulitsky I. et al. (2010) DEGAS: de novo discovery of dysregulated pathways in human diseases. PLoS One, 5, e13367. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btaa453-B32] Vandin F. et al. (2011) Algorithms for detecting significantly mutated pathways in cancer. J. Comput. Biol., 18, 507–522. [DOI] [PubMed] [Google Scholar]

[btaa453-B33] Yates L.R. et al. (2017) Genomic evolution of breast cancer metastasis and relapse. Cancer Cell, 32, 169–184.e7. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btaa453-B34] Yusuke M. et al. (2017) phyC: clustering cancer evolutionary trees. PLoS Comput. Biol., 13, e1005509. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btaa453-B35] Zaccaria S. et al. (2018) Phylogenetic copy-number factorization of multiple tumor samples. J. Comput. Biol., 25, 689–708. [DOI] [PMC free article] [PubMed] [Google Scholar]

[btaa453-B36] Zafar H. et al. (2019) SiCloneFit: Bayesian inference of population structure, genotype, and phylogeny of tumor clones from single-cell genome sequencing data. Genome Res., 29, 1847–1859. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

Identification of conserved evolutionary trajectories in tumors

Ermin Hodzic

Raunak Shrestha

Salem Malikic

Colin C Collins

Kevin Litchfield

Samra Turajlic

S Cenk Sahinalp

Abstract

Motivation

Results

Availability and implementation

Supplementary information

1. Introduction

2 Materials and methods

Fig. 1.

2.1 ILP formulation for solving the MCETT problem

Fig. 2.

2.2 Not transitive input graphs

2.3 Optional constraints and parameters

2.4 Assessment of the statistical significance of MPCSI solutions

3 Results

3.1 Data and pre-processing

3.1.1 Clear-cell renal cell carcinoma

3.1.2 Non-small-cell lung cancer

3.2 Conserved evolutionary trajectories in ccRCC

Fig. 3.

3.3 Conserved evolutionary trajectories in NSCLC

Fig. 4.

4. Discussion

Funding

Supplementary Material

References

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases