Abstract
The metastatic spread of a cancer can be reconstructed from DNA sequencing of primary and metastatic tumours, but doing so requires solving a challenging combinatorial optimization problem. This problem often has multiple solutions that cannot be distinguished based on current maximum parsimony principles alone. Current algorithms use ad hoc criteria to select among these solutions, and decide, a priori, what patterns of metastatic spread are more likely, which is itself a key question posed by studies of metastasis seeking to use these tools. Here we introduce Metient, a freely available open-source tool which proposes multiple possible hypotheses of metastatic spread in a cohort of patients and rescores these hypotheses using independent data on genetic distance of metastasizing clones and organotropism. Metient is more accurate and is up to 50x faster than the current state of the art. Given a cohort of patients, Metient can calibrate its parsimony criteria, thereby identifying shared patterns of metastatic dissemination in the cohort. Reanalyzing metastasis in 169 patients based on 490 tumors, Metient automatically identifies cancer type-specific trends of metastatic dissemination in melanoma, high-risk neuroblastoma and non-small cell lung cancer. Metient’s reconstructions usually agree with semi-manual expert analysis; however, in many patients, Metient identifies more plausible migration histories than experts, and further finds that polyclonal seeding of metastases is more common than previously reported. By removing the need for hard constraints on what patterns of metastatic spread are most likely, Metient introduces a way to further our understanding of cancer type-specific metastatic spread.
Keywords: migration history inference, metastasis, mixed-variable combinatorial optimization
Introduction
Metastasis is associated with 90% of cancer deaths, yet its causes and physiology remain poorly understood1. It remains unclear how often multiple clones seed metastases, how often metastases are capable of seeding other metastases, and if there is a relationship between seeding clones and organ-specific metastases2–10. It is also not known whether metastatic potential is rare, and thus gained once in the same cancer, or common, and thus gained multiple times11–14. The answers to all these questions would improve the understanding and clinical management of metastasis, but obtaining them requires reconstructing migration histories of metastatic clones from clinical sequencing data, which, until recently, was very challenging2–4.
Recent algorithms have tackled this challenge using maximum parsimony principles. These algorithms identify parsimonious migration histories that explain the clonal compositions of primary tumors and one or more matched metastatic tumors5,15–17. However, different definitions of parsimony can disagree on the best solution, and current algorithms resolve these conflicts using ad hoc rules15–17. For example, a common rule is to only allow metastases to be seeded from the primary14, whereas determining whether metastases can seed other metastases is, itself, an important question. Indeed, one prevailing model in oncology, the “sequential progression model” – which posits that lymph node metastases give rise to distant metastases – is the rationale for surgical removal of lymph nodes18. However, a recent phylogenetic analysis found that the sequential model only applied to a third of patients in a colorectal cohort19. By pre-biasing their reconstructions with ad hoc rules, current algorithms undermine a key goal in making these reconstructions: determining which patterns of metastatic spread are prevalent in different cancer types.
To address this dilemma and overcome the limitations of previous tools (Supplementary Table 1), we introduce Metient (metastasis + gradient). Metient is a principled statistical algorithm that proposes multiple potential hypotheses of metastatic spread in a patient and resolves parsimony conflicts using other, readily-available data. Metient achieves this through two key innovations. First, it adapts recent stochastic optimization algorithms for discrete variables to the problem of combinatorial optimization, thereby enabling efficient sampling of multiple parsimonious solutions. Second, it introduces new biological criteria, termed metastasis priors, to calibrate its parsimony criteria and select among equally parsimonious solutions. These calibrated criteria can also be used to uncover cancer type-specific trends in metastatic spread.
On realistic simulated data, Metient outperforms parsimony-only models in accurately recovering the true migration history. When applied to patient cohorts with metastatic breast20, skin3, ovarian4, neuroblastoma9, and lung cancer14, Metient automatically identifies all plausible expert-assigned migration histories. In notable cases, it also uncovers more plausible reconstructions, often when prior expert analyses pre-selected a favored seeding pattern.
Through its unbiased automated approach, Metient reveals that metastases are often seeded polyclonally and that most metastatic seeding follows a single, shared evolutionary trajectory. The cancer type-specific models learned by Metient reflect known differences in metastasis biology, suggesting that Metient can offer insights into metastatic dissemination for new cancer cohorts.
Metient is free, open-source software that includes easy-to-use visualization tools to compare multiple hypotheses on metastatic dissemination. Metient is accessible at https://github.com/morrislab/metient/.
Results
The Metient algorithm
Migration history inference algorithms take DNA sequencing data from primary and metastatic tumor samples as input, along with an unlabeled clone tree that encodes the genetic ancestry of cancer clones (Figure 1a). These inputs are used to estimate the proportions of clonal populations in anatomical sites (referred to as “witness nodes” in Figure 1b). The internal nodes of the clone tree are then labeled with anatomical sites, defining the historical migrations: a clone that migrates to a new site receives a different label than its parent clone (Figure 1b) and the tree edge that connects them is deemed a “migration edge”. The final output is referred to as a “migration history”17 (Figure 1b).
MACHINA17 is the most widely used and most advanced migration history reconstruction algorithm. It scores migration histories using three parsimony metrics: migrations—the number of times a clone migrates to a different site4,15–17; comigrations—the number of migration events in which one or more clones travel from one site to another17; and seeding sites—the number of anatomical sites that seed another site17. MACHINA searches for the most parsimonious history by minimizing these three metrics.
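The three parsimony metrics can be illustrated with a minimal Python sketch. It is not MACHINA's or Metient's implementation: the function name, the input layout and the toy tree are hypothetical, and comigrations are approximated as the number of distinct (source site, destination site) pairs, which ignores cases where the same pair of sites is connected by separate migration events.

```python
# Illustrative sketch only; not MACHINA's or Metient's implementation.

def parsimony_metrics(edges, site_of):
    """Count migrations, comigrations and seeding sites for a labeled clone tree.

    edges   : list of (parent_clone, child_clone) pairs
    site_of : dict mapping each clone to its anatomical site label
    """
    # A migration edge connects two clones with different site labels.
    migration_edges = [
        (site_of[parent], site_of[child])
        for parent, child in edges
        if site_of[parent] != site_of[child]
    ]
    migrations = len(migration_edges)
    # Simplifying assumption: migration edges that share both source and
    # destination site are treated as one comigration event.
    comigrations = len(set(migration_edges))
    # A seeding site is any site that is the source of a migration edge.
    seeding_sites = len({source for source, _ in migration_edges})
    return migrations, comigrations, seeding_sites


# Hypothetical five-clone tree, used only to exercise the function.
edges = [(0, 1), (0, 2), (1, 3), (2, 4)]
site_of = {0: "primary", 1: "primary", 2: "lung", 3: "lung", 4: "brain"}
metrics = parsimony_metrics(edges, site_of)
print(metrics)
```

In this toy tree two clones travel from the primary to the lung and one travels from the lung to the brain, so the sketch counts three migrations, two comigrations and two seeding sites.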
This search involves solving a mixed-variable combinatorial optimization problem, consisting of continuous variables (the clone proportions matrix in Figure 1b), and discrete variables (the labeled clone tree matrix in Figure 1b). MACHINA, and other prior approaches, formulate this problem as a mixed integer linear programming (MILP) problem that they solve using commercial solvers21. However, using an MILP imposes strong limitations on the types of scoring functions that can be applied to migration histories, as MILPs require hard constraints and a linear objective. Moreover, MILP solvers identify only a single optimal solution, whereas there are often multiple solutions which are either equally parsimonious, or that trade off one parsimony metric for another (e.g., reducing the number of seeding sites by increasing the number of migration events). Returning a single solution obscures these possibilities, and the ad hoc rules used to distinguish among multiple solutions often introduce implicit bias into the reconstructions.
To address these issues, Metient takes a more systematic approach by first defining a “Pareto front”22 for each patient (Figure 1c). To do so, Metient searches for migration histories under a wide range of parsimony models (Supplementary Table 2). A parsimony model is represented by a set of parsimony weights, $w_m$, $w_c$ and $w_s$, assigned, respectively, to the number of migrations (indicated by $m$), comigrations ($c$), and seeding sites ($s$). A migration history’s parsimony score, $p$, is the model-weighted average of these three parsimony metrics, i.e., $p = w_m m + w_c c + w_s s$. Different parsimony models favor different histories on the Pareto front. Efficiently recovering this Pareto front required replacing the current state-of-the-art MILP with newly developed stochastic gradient descent methods that employ a low-variance gradient estimator for the discrete categorical distribution over migration histories parameterized by the parsimony model23,24 (Figure 1b; Methods, Supplementary Information). Metient’s gradient descent approach converges to a solution many times faster than the MILP, and it also helps to define the Pareto front by identifying multiple local maxima of the migration history score for each parsimony model (Methods, Supplementary Information). In addition, this approach reduces a large combinatorial search space of possible migration histories to only the most plausible explanations of metastatic spread for a given patient.
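To make the relationship between parsimony models and the Pareto front concrete, the following Python sketch filters a set of candidate histories down to the non-dominated ones and shows that different weightings of migrations, comigrations and seeding sites select different histories. The candidate metric triples and the weights are hypothetical, and the exhaustive filter shown here stands in for, and is not, Metient's gradient-based search.

```python
# Illustrative sketch only; the candidates and weights below are made up.

def parsimony_score(metrics, weights):
    """Weighted sum of (migrations, comigrations, seeding sites)."""
    return sum(w * x for w, x in zip(weights, metrics))


def dominates(a, b):
    """True if history a is no worse than b on every metric and differs from b."""
    return a != b and all(x <= y for x, y in zip(a, b))


def pareto_front(candidates):
    """Return the metric triples that no other candidate dominates."""
    unique = sorted(set(candidates))
    return [h for h in unique if not any(dominates(o, h) for o in unique)]


# Hypothetical (migrations, comigrations, seeding sites) for four histories.
candidates = [(3, 2, 2), (4, 2, 1), (4, 3, 2), (5, 2, 1)]
front = pareto_front(candidates)

# Two hypothetical parsimony models: one penalizes migrations most heavily,
# the other penalizes seeding sites most heavily.
migration_heavy = (0.6, 0.3, 0.1)
seeding_heavy = (0.2, 0.2, 0.6)
best_migration_heavy = min(front, key=lambda h: parsimony_score(h, migration_heavy))
best_seeding_heavy = min(front, key=lambda h: parsimony_score(h, seeding_heavy))
print(front, best_migration_heavy, best_seeding_heavy)
```

Two of the four candidates survive the filter, and each hypothetical model prefers a different one, which is the kind of conflict that a single fixed parsimony model would resolve without reporting the alternative.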
Metient-calibrate fits cancer type-specific parsimony models
To illustrate the importance of defining a Pareto front of multiple possible patterns of metastatic spread, we defined four different cancer type-specific patient cohorts consisting of genomic sequencing of matched primary and multiple metastases: melanoma3, high-grade serous ovarian cancer (HGSOC)4, high-risk neuroblastoma (HR-NB)9, and non-small cell lung cancer (NSCLC)14. After applying quality control (Supplementary Information), we arrived at a dataset of 479 tumors (143 with multi-region sampling) in total from 167 patients (melanoma: n=7, HGSOC: n=7, HR-NB: n=27, NSCLC: n=126). Applying Metient to these patients, we discovered that 45% (75/167) had multiple Pareto-optimal migration histories, and that the complexity of the Pareto front increased with the number of metastases: 79% (27/34) of patient cases with three or more metastases had multiple Pareto-optimal histories. Often the choice among these different Pareto-optimal histories substantially impacted the interpretation of metastatic spread. For example, Figure 1c shows a patient with metastatic breast cancer with two Pareto-optimal reconstructions: one in which a lymph node metastasis gives rise to all other metastatic tumors, and another where most metastases are seeded directly from the primary tumor. Here, forcing an arbitrary choice between the two reconstructions determines whether one concludes that the lymph node acted as a staging site for metastatic spread.
MACHINA, and all previous methods4,15,17, resolve parsimony conflicts by minimizing migrations first, and then comigrations, thus implementing a parsimony model where the weight on migrations far exceeds the weight on comigrations ($w_m \gg w_c$). However, no single parsimony model is appropriate for all cancer types. For example, in ovarian cancer, clusters of metastatic cells are thought to “passively” disseminate to the peritoneum or omentum through peritoneal fluid25–27. As such, metastatic events are more likely to be polyclonal, i.e., multiple clones seed metastases, so we might expect many more migrations than comigrations. In many solid cancers, metastatic cells make a “pit stop” at regional lymph nodes before disseminating to other distant sites28, and for the estimated 23.4% of patients with lymph node metastases across cancer types29, multiple seeding sites may be common. Different cancer type-specific patterns of metastatic spread are reflected in differences in trends in the relative numbers of migrations, comigrations, and seeding sites, and prespecifying a cancer type-independent parsimony model can prevent the recovery of these patterns. Furthermore, in our cohorts, we found that there were often multiple, equally parsimonious migration histories. MACHINA selects among these randomly, or via predefined constraints on the allowable patterns of metastatic spread.
In contrast, Metient uses metastasis priors to both define a cancer type-specific parsimony model and to rank equally parsimonious histories. These priors incorporate additional biological constraints relevant to migration histories. We provide a tool, Metient-calibrate, that fits a patient cohort-specific parsimony model using the metastasis priors (Figure 1d–f; Methods). This calibrated model is used to rank Pareto-optimal histories that differ in their metrics. Metient also provides a pan-cancer parsimony model, calibrated to all four cohorts combined, for use when an appropriate patient cohort is not available.
Metient provides two metastasis priors. One, genetic distance, can be applied to any cohort. The other, organotropism, can be used when appropriate tissue-type information is available for the sequenced tumor samples. The genetic distance prior considers the average genetic distance of migration edges in the labeled clone tree, where the genetic distance on an edge is the number of mutations gained in the child clone and not present in the parent clone. In general, we expect genetic distance to tend to be higher on migration edges than other clone tree edges for a number of reasons. First, the colonizing clones of a metastasis have undergone a clonal expansion in their metastatic site, which makes their private mutations more easily detectable by finite depth sequencing. In contrast, the vast majority of private mutations in the source tumor will not be at high enough cellular frequency to be detectable, and subclones detected in the source tumor need not have undergone a clonal expansion30. In addition to increased mutation detectability, colonizing cells likely have more mutations than randomly selected cells in the source population due to the strong selection pressures they faced in metastasizing, as strong selection pressures select, perhaps indirectly, for higher mutation rates in asexually reproducing populations31–33. Finally, metastases exhibit greater genomic instability29,34,35, possibly as a consequence of these selection pressures, which is associated with heightened mutation rates36. Indeed, metastases across many cancer types have moderately or significantly higher tumor mutation burden (TMB) than matched primaries29,35,37. Metient’s genetic distance prior deems more probable those migration histories with higher averaged genetic distances on migration edges (Methods, Supplementary Information). Figure 1d illustrates an example of using the genetic distance prior to select between two equally parsimonious migration histories.
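The idea behind the genetic distance prior can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Metient's scoring function: it only computes the mean number of mutations on migration edges for two hypothetical, equally parsimonious labelings of the same three-clone tree, and all names and mutation counts are invented for the example.

```python
# Illustrative sketch only; mutation counts and labels are hypothetical.

def mean_migration_distance(edge_mutations, site_of):
    """Average number of new mutations on edges whose endpoints differ in site.

    edge_mutations : dict mapping (parent, child) to the number of mutations
                     gained in the child and absent from the parent
    site_of        : dict mapping each clone to its anatomical site label
    """
    distances = [
        n_mutations
        for (parent, child), n_mutations in edge_mutations.items()
        if site_of[parent] != site_of[child]
    ]
    return sum(distances) / len(distances) if distances else 0.0


# Same clone tree, two labelings that each contain exactly one migration.
edge_mutations = {(0, 1): 5, (1, 2): 40}
history_a = {0: "primary", 1: "primary", 2: "lung"}  # migration on edge (1, 2)
history_b = {0: "primary", 1: "lung", 2: "lung"}     # migration on edge (0, 1)

score_a = mean_migration_distance(edge_mutations, history_a)
score_b = mean_migration_distance(edge_mutations, history_b)
preferred = "A" if score_a > score_b else "B"
print(score_a, score_b, preferred)
```

Both labelings have the same parsimony metrics, so parsimony alone cannot separate them; a prior that favors longer migration edges would rank history A first in this toy case.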
The second metastasis prior, organotropism, is derived from data from 25,775 Memorial Sloan Kettering metastatic cancer patients29 on the preference that some cancer types have to colonize other organs38. We used these data to construct a matrix for 27 common cancer types, where each entry is the frequency of metastasis to a particular anatomical site that is observed in patients with that cancer type (Figure 1e). Note that there are no direct data for frequencies of migrations from one metastatic site to another metastatic site, so Metient only uses this matrix to score migrations coming from the primary site (Methods). For example, breast cancer metastasizes to lung more often than brain, so Metient’s organotropism prior favors a solution with migrations to the brain from a breast-seeded lung metastasis over one with migrations from a breast-seeded brain metastasis to the lung (Figure 1e). Indeed, brain to lung metastasis is rare39. As we illustrate in later sections, our metastasis priors lead to better performance on simulated benchmarks, and more plausible migration history reconstructions than using maximum-parsimony rules and cancer type-independent rules. Nonetheless, Metient reports all Pareto-optimal solutions; in this example, both solutions in Figure 1e are visualized in a simple summary report, so that these multiple hypotheses can be easily evaluated by the user.
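The breast cancer example can be expressed as a small Python sketch. Everything in it is an assumption made for illustration: the frequencies are placeholders and are not values from the Memorial Sloan Kettering data, and scoring a history by the sum of log frequencies over migrations that leave the primary site is one plausible choice, not necessarily the formula Metient uses.

```python
import math

# PLACEHOLDER frequencies for illustration only. These are NOT values from
# the Memorial Sloan Kettering data; they only encode "lung > brain".
TOY_FREQUENCY = {"breast": {"lung": 0.30, "brain": 0.10}}


def organotropism_score(migrations, primary_site, cancer_type, freq=TOY_FREQUENCY):
    """Sum log frequencies over migrations whose source is the primary site.

    Migrations between metastatic sites are skipped, mirroring the text's
    statement that only migrations from the primary are scored.
    """
    score = 0.0
    for source, destination in migrations:
        if source == primary_site:
            score += math.log(freq[cancer_type][destination])
    return score


# Two hypothetical histories with identical parsimony metrics.
history_a = [("breast", "lung"), ("lung", "brain")]   # lung seeds brain
history_b = [("breast", "brain"), ("brain", "lung")]  # brain seeds lung

score_a = organotropism_score(history_a, "breast", "breast")
score_b = organotropism_score(history_b, "breast", "breast")
preferred = "A" if score_a > score_b else "B"
print(preferred)
```

Because only the primary-to-metastasis step is scored, the history in which the primary seeds the more frequently colonized site (lung) receives the higher score.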
Importantly, Metient uses its metastasis priors to complement but not replace its parsimony model. In our benchmarking analyses on simulated data, we find that using genetic distance alone to score migration histories performs poorly and can result in the inference of highly non-parsimonious migration histories (Supplementary Tables 3 and 4; see also PathFinder40). Instead, the metastasis priors are only used once the Pareto front is defined, to calibrate parsimony models and to rank equally parsimonious solutions.
Simulated data validates the genetic distance prior and shows that Metient is state-of-the-art
To assess Metient’s new objective and gradient-based optimization on data with a provided ground-truth, we ran benchmarking analyses along with the state-of-the-art migration history inference method (MACHINA17) on simulated data, originally used to validate MACHINA, for 80 patients with 5–11 tumor sites and various patterns of metastatic spread.
First, to assess the added value of the genetic distance prior, we used Metient-calibrate to fit a calibrated parsimony model, and compared calibrated Metient with a version of Metient that used the parsimony model implied by MACHINA. We fit two calibrated models, one on a cohort with primary-only seeding and another on a cohort with metastasis-to-metastasis seeding. Metient-calibrate improved recovery of the ground truth migration graph (Figure 1c) over a fixed parsimony model (Calibrate vs. Evaluate (MP) in Supplementary Table 3), showcasing the ability of the metastasis priors to learn metastatic patterns specific to a cohort and improve overall accuracy. In addition, Metient-calibrate predicts ground truth seeding clones and migration graphs at least as accurately as MACHINA, with overall improvements as tree sizes get larger (Figure 2a,b) and significant improvements in inferring the seeding clones for patients with more complex metastasis-to-metastasis seeding (Figure 2b top; p=0.0021).
Notably, although the Metient framework is non-deterministic, it identifies the same top solution 97% of the time across multiple runs (Figure 2c). Furthermore, in addition to its improved accuracy, Metient runs up to 55x faster (3.95s with Metient-64 vs. 221.19s with MACHINA for a cancer tree with 18 clones and 9 tumors), showcasing our framework’s scalability even as tree sizes get very large (Figure 2d).
Validation of organotropism prior
To validate the organotropism prior, we ran Metient, using the pan-cancer parsimony model, on samples available from two patients with metastatic breast cancer20 where site labels could be mapped to those used in our organotropism matrix. When faced with multiple parsimonious migration histories, Metient chooses a more plausible tree, wherein lung to brain seeding is preferred over brain to lung seeding, which is clinically rare39 (Figure 3a).
Multi-cancer analysis of clonality, phyleticity, and dissemination patterns
Having established that Metient can accurately recover ground-truth and learn cohort-specific metastatic patterns on simulated data, we next sought to apply the method to real patient data from the melanoma, HGSOC, HR-NB and NSCLC cohorts to investigate shared and unique patterns of metastatic dissemination. Due to missing or inadequate anatomical site labels for many patients in these cohorts, we were unable to use Metient’s organotropism matrix on these cohorts, and we only calibrated to genetic distance.
Using Metient, we examined three aspects of metastatic dissemination across the four cohorts. The first aspect is seeding pattern, which can be sub-categorized as single-source from the primary or from another site, multi-source, or reseeding (Figure 4a). The other two criteria are clonality, i.e., the number of distinct clones seeding metastases (Figure 4b), and phyleticity, i.e., whether metastatic potential is gained in one or multiple evolutionary trajectories of the clone tree (Figure 4c; Methods). We distinguish between genetic polyclonality, in which more than one clone seeds metastases in a patient, and site polyclonality, in which more than one clone seeds an individual site (Figure 4b; Methods). We introduce this distinction to highlight cases where each metastasis is seeded by a single clone, but all sites are not seeded by the same clone (i.e., the cancer is genetically polyclonal but site monoclonal), because these may be cases where different site-specific mutations are needed for metastasis. We also update the previous definitions of metastasis-initiating clones (commonly called seeding clones). We define a seeding or colonizing clone as a node in a migration history whose parent has a different label than itself (Methods), because this clone is the only one guaranteed to have the mutations necessary to establish the metastasis. Previous work often refers to the parent of the colonizing clone as the seeding clone14,17, although this clone may not have all of the mutations required for the observed metastasis.
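These definitions can be illustrated with a short Python sketch. It is a simplified reading of the definitions in the text, not Metient's implementation: the function names and the toy tree are hypothetical, clones are treated as single nodes with one site label each, and a history is called monophyletic when all colonizing clones lie on a single root-to-leaf path of the clone tree.

```python
from collections import Counter

# Illustrative sketch only; a simplified reading of the definitions above.

def ancestors(clone, parent_of):
    """Return the set of proper ancestors of a clone."""
    found = set()
    node = parent_of[clone]
    while node is not None:
        found.add(node)
        node = parent_of[node]
    return found


def classify(parent_of, site_of):
    """Return (genetic clonality, site clonality, phyleticity)."""
    # A colonizing clone is a node whose parent carries a different site label.
    seeds = {
        clone: site_of[clone]
        for clone, parent in parent_of.items()
        if parent is not None and site_of[parent] != site_of[clone]
    }
    genetic = "polyclonal" if len(seeds) > 1 else "monoclonal"
    clones_per_site = Counter(seeds.values())
    site = "polyclonal" if any(n > 1 for n in clones_per_site.values()) else "monoclonal"
    # Monophyletic if the colonizing clones form a chain of ancestors.
    by_depth = sorted(seeds, key=lambda c: len(ancestors(c, parent_of)))
    on_one_path = all(
        shallow in ancestors(deep, parent_of)
        for shallow, deep in zip(by_depth, by_depth[1:])
    )
    phyleticity = "monophyletic" if on_one_path else "polyphyletic"
    return genetic, site, phyleticity


# Hypothetical tree: the root (clone 0) has children 1 and 2; clone 3 descends from 1.
parent_of = {0: None, 1: 0, 2: 0, 3: 1}
site_of = {0: "primary", 1: "lung", 2: "liver", 3: "lung"}
result = classify(parent_of, site_of)
print(result)
```

The toy tree is an instance of the case highlighted in the text: two different clones seed metastases (genetically polyclonal), each site is seeded by a single clone (site monoclonal), and the colonizing clones sit on separate branches.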
Consistent with expert annotations3,4,9,14,17, Metient finds that single-source seeding from the primary tumor is the most common pattern in every cohort (Figure 4d). However, Metient identifies a larger fraction of polyclonal migration patterns than previous reports8,14: 53.3% of patients have sites that are seeded by different clones, i.e., genetically polyclonal (Figure 4e), and 38.3% of patients have at least one site seeded by multiple clones, i.e. site polyclonal (Figure 4f). Overall, Metient estimates that 34.1% of sites (107/314) are seeded by multiple clones; nearly double prior estimates of site polyclonality (19.2%) based on an analysis of breast, colorectal and lung cancer patients8. Notably, parsimony model choice influences the polyclonality of migration histories, because reducing the number of seeding sites tends to increase the number of polyclonal migrations (Supplementary Figure S1a). However, the higher polyclonality in Metient’s reconstructions does not result from an assumption of primary-only seeding, as done in prior work, which would result in even more polyclonal migrations (Supplementary Figure S1a, Supplementary Information).
Metient’s phyleticity estimates mirror previous reports: 77.2% of patients (129/167) have a monophyletic tree where metastatic potential is gained once and maintained (Figure 4g). For some patients, this is due to the root clone being observed in one or more metastatic sites (Supplementary Figure S1b), and for other patients, all colonizing clones belong to a single path of the clone tree. Either scenario suggests that metastatic potential is less likely to be gained via multiple, independent evolutionary trajectories across cancers.
Cancer type-specific metastasis trends
We next examined cancer type-specific differences in metastatic trends, first using a bootstrapping approach to ensure that the parsimony metric weights were reproducible and reflective of population-level patterns for a particular cancer type. We fit parsimony metric weights to 100 bootstrapped samples of patients within the cohort (Methods), and found that 98.4% of patients ranked the same top solution across bootstrap samples, indicating that Metient can learn a reproducible cancer type-specific model even for the melanoma and HGSOC cohorts, which have only seven patients each.
These cancer type-specific parsimony metric weights lead to cohort-specific choices on how Metient ranks a patient’s Pareto front of migration histories. For example, Metient chooses the solution on the Pareto front with lowest migration number (i.e. colonizing clones) for HR-NB patient H103207 (Figure 4h), but the solution with the median value of each metric for NSCLC patient CRUK0290 (Figure 4i). To systematically assess the impact of cohort-specific rankings we computed the percentage of polyclonality and number of seeding sites in the top ranked solution for patients with each cancer type. Overall, we found a significantly higher fraction of polyclonal migrations in melanoma than HGSOC, HR-NB and NSCLC patients (Figure 4j). One explanation for this heightened polyclonality in melanoma patients is that all patients in the cohort had locoregional skin metastases, a common “in-transit” metastatic site around the primary melanoma or between the primary melanoma and regional lymph nodes. These locoregional sites could have multiple cancer cells traveling together through hematogenous or lymphatic routes to seed new localized tumors41. The HR-NB and NSCLC cohorts had significantly higher percentages of metastasis-to-metastasis seeding than melanoma (Figure 4k). As described below, in the HR-NB cohort, multiple patients exhibit metastasis-to-metastasis seeding within an organ or between commonly metastatic sites. In the NSCLC cohort, 76.2% of patients have lymph node metastases, from which it is known that further metastases are commonly seeded42. Indeed, Metient predicted that 75% (12/16) of NSCLC patients who had metastasis-to-metastasis seeding had seeding from a lymph node to other metastases.
Metastasis priors identify biologically relevant migration histories and alternative explanations of spread
A core advance of Metient is its ability to identify and rank the Pareto-optimal histories of a patient’s cancer. To assess how well our top ranked solution aligns with the most biologically plausible explanation, we compared our inferred migration histories to previously reported, expert-annotated seeding patterns.
Of the 167 patients analyzed, 152 patients had an expert or model-derived annotation available. Because the HR-NB annotations only indicate the presence of a migration between two sites and not the directionality, for an overall comparison of these 152 patients we compared our site-to-site migrations to those that were previously reported (i.e., a binarized representation of migration graph (Figure 1c)). In 84% of patients (128/152), Metient-calibrate’s highest ranked solution aligns with the previously reported migration history. For the remaining 24 patients, Metient either identifies a more parsimonious history or recovers the expert annotation on the Pareto front but the metastasis priors prefer a different history than the expert. We provide a detailed case-by-case comparison in the Supplementary Information and Supplementary Figures S2, S3, S4, S5, and highlight some of the interesting cases below.
Metient predicted metastasis-to-metastasis seeding for two HR-NB cases (H103207, H132384), which were previously reported to have initially seeded directly from the primary9. HR-NB patient H103207 shows evidence of two possible metastasis-to-metastasis seeding scenarios. One, which is ranked the highest by the calibrated parsimony metrics, posits a serial progression of metastatic seeding from the primary to the right lung, then to the liver, and finally to the left lung. The other, which has the second highest rank, posits seeding from the primary to the liver and then the left lung (Figure 5a). While the exact prevalence of metastasis-to-metastasis seeding between the liver and lung in HR-NB is unknown, both are common sites of metastases across cancer types due to cancer cells’ ability to take advantage of rich blood supply, vascular organization and physiology38. Colonization of the lung by clones from a primary liver tumor is common38,43,44 and, similarly, the liver is a common site of metastasis for primary lung cancer patients38,45, suggesting that transitions from a liver-competent cancer clone to a lung-competent one and vice versa could also be common. For this patient, multiple colonizing clones emerge on distinct branches of the clone tree, providing another line of evidence that the suggested metastasis-to-metastasis seeding probably occurred (Supplementary Figure S2a). Specifically, the CNS-colonizing clones appear on a shared branch, and the lung- and liver-colonizing clones appear on a separate, shared branch after further primary tumor evolution occurred (Supplementary Figure S2a). This suggests that evolution within the primary tumor gave rise to multiple clones with organ-specific metastatic competence, and is concordant with the clonal analysis reported by Gundem et al.9 for this patient.
Patient H132384 also shows evidence of metastasis-to-metastasis seeding, but from bone-to-bone, first to the left cervical and secondarily to the chest wall (Figure 5b). Metastasizing cells exhibit organ-specific genetic and phenotypic changes to survive in a new microenvironment38, suggesting that seeding an additional tumor within the same organ microenvironment is more likely than a secondary migration from the primary adrenal tumor in this case. In addition, prior experimental evidence shows that bone metastases prime and reprogram cells to form further secondary metastases46,47. These posited metastasis-to-metastasis seedings are thus supported by site proximity or organotropism, or both, and these Metient reconstructions were made without providing such information.
Next we compared the inferred migration histories from the NSCLC samples we analyzed to an in-depth analysis of the same samples by the TRACERx consortium14. The TRACERx analysis enforces a primary single-source dissemination model, i.e., that metastases are only seeded from the lung, for its analysis of clonality and phyleticity. While Metient generally agrees with this dissemination model, Metient predicts metastasis-to-metastasis seeding for several (12.7%; 16/126) patients (Figure 6a). CRUK0484 is one such patient where Metient proposes that an initial metastasizing clone to the rib leads to secondary metastasis formation in the scapula (Figure 6b), which we propose is a more plausible solution based on the same line of reasoning described for the bone-to-bone metastasis predicted in HR-NB patient H132384 above.
When comparing the TRACERx classifications of clonality and phyleticity for each patient to those implied by Metient’s highest-scoring solution, we find 84.1% agreement (106/126) in clonality (Figure 6c) and 78% agreement (96/123) in phyleticity (Figure 6d) (three patients classified as “mixed” phyleticity by TRACERx were excluded). The discrepancies between these classifications stem from the way in which metastasis-initiating clones are defined. TRACERx identifies shared clones between a primary tumor and its metastases, defining the seeding clone as the most recent shared clone between the primary tumor and the metastasis. In contrast, Metient uses the entire migration history to define seeding clones (Methods) and accounts for metastasis-to-metastasis seeding, rather than assuming that seeding occurs only from the primary tumor. As a result, Metient has significantly higher sensitivity in detecting colonizing populations within metastases and, subsequently, increases the detection of polyclonal and polyphyletic events.
In 20 NSCLC patients, Metient inferred that multiple colonizing clones are needed to explain the full migration history, whereas no history is consistent with the TRACERx identified colonizing clones. For example, for patient CRUK0256 (Figure 6e), only the root clone is shared between primary and metastases, making it the only seeding clone by TRACERx’s definition. However, according to the clone tree and the observed presence of clone 6 in LN_SU_FLN1 and clone 5 in both LN_SU_FLN1 and LN_SU_LN1, we conclude that there must have been either a metastasis-to-metastasis seeding event (Figure 6e solution 1), or two clones originally from the primary (no longer detectable in the metastatic samples due to either ongoing evolution or undersampling) that seeded the metastases (Figure 6e solution 2). In either migration history, multiple clones had to participate in seeding in order to explain the clone tree and observed clones inferred from the sequencing data.
Inference of phyleticity is also impacted by the use of the clone tree to determine colonizing clones, as the path connecting colonizing clones is used to determine if metastatic competence arises once or multiple times during evolution. Because the number of colonizing clones is underestimated in the TRACERx analysis, monoclonal seeding is inferred more often, automatically classifying these histories as monophyletic. Furthermore, we find 27 cases where TRACERx classifies a patient as monophyletic and Metient classifies the same patient as polyphyletic; in such cases the multiple clones needed to explain seeding occur on separate paths of the clone tree (e.g. patient CRUK0762, Figure 6f). Therefore, while we agree that monophyleticity is the majority pattern in NSCLC (63%), we suggest that polyphyleticity might be underestimated due to less sensitivity in previous methods’ ability to detect colonizing clones.
Discussion
We have presented and validated Metient, a new framework for reconstructing the migration histories of metastases. In contrast to prior work, Metient defines a Pareto front of possible migration histories, and then uses metastasis priors to resolve parsimony conflicts in a data-dependent manner. Another key innovation is that it adapts Gumbel straight-through stochastic gradient estimation to optimize the combinatorial problem required for history reconstruction. Collectively, these advances improve performance on simulated data, improve biological interpretation on real data, and define a Pareto front in a fraction of the time that MACHINA, the current state-of-the-art, takes to output a single solution. Notably, Metient uses open source software packages, whereas other methods rely on commercial MILP solvers. Metient, due to its much improved speed, could easily be adapted to much larger migration history reconstruction problems, such as those posed by single-cell data.
Here we show that by selecting among Pareto-optimal solutions using a pre-specified parsimony model and ad hoc rules, previous algorithms biased the conclusions of studies of metastatic spread. In one study14, primary-only seeding was assumed when analyzing migration histories, so plausible histories with metastasis-to-metastasis seeding were ignored, even when they were identified by MACHINA. Metient instead provides an unbiased means of identifying cancer type-specific trends in metastasis biology, addressing a critical problem in metastasis research.
Metient’s increased precision in identifying colonizing clones allowed it to detect almost twice as much polyclonality as previously reported, suggesting that it is common for multiple clones to contribute to metastatic progression. Despite this, Metient still inferred that metastatic potential rarely emerges independently in separate evolutionary paths.
Currently, Metient uses genetic distance and organotropism as its metastasis priors; however, the Metient framework is designed to be easily extensible. Adding a new prior simply requires writing a scoring function, because Metient uses automatic differentiation to compute its gradient updates. For instance, the framework could be easily extended to incorporate mutational signatures as a prior, since metastases exhibit shifts in mutational signature composition48,49.
Metient has some limitations. It scales well in compute time for larger clone trees or more samples but, because the loss landscape complexity increases substantially, in some cases (less than 1%), Metient became stuck in local minima. This problem was resolved when we ran Metient multiple times and with larger sample sizes, and we recommend this practice with larger reconstruction problems. One criterion for convergence is that the Pareto front remains unchanged between runs. Other migration history algorithms are also highly sensitive to the complexity of the loss landscape, and convergence issues that they face are not necessarily resolved by rerunning the algorithm. Also, Metient is not designed to consider subclonal copy number alterations (CNAs) when correcting its estimated variant allele frequencies for CNAs. Using the descendant cell fraction (DCF)50 or phylogenetic cancer cell fraction (phyloCCF)51 as inputs to Metient could solve this. Alternatively, one could input which clones are in which samples directly into Metient instead of the allele frequencies. Finally, we note that the choice of clustering and tree inference algorithm used when inputting data into Metient can impact both the clonality and phyleticity classifications. In an attempt to most accurately compare our migration histories to previously reported results, where possible, we use the same clustering and trees inferred for the original datasets.
In conclusion, we show that Metient offers a fast and adaptable, fully automated framework that leverages bulk DNA sequencing data to probe enduring questions in metastasis research.
Methods
Estimating observed clone proportions
The first step of Metient is to estimate the binary presence or absence of clone tree ($T$) nodes in each site. The clone tree can either be provided as input, or inferred from the DNA sequencing data using, e.g., Orchard52, PairTree53, SPRUCE54, CITUP55, or EXACT56. Building on a previous approach as described by Wintersinger et al.53, Metient estimates the proportion of clones in each site using the input clone tree and read count data from bulk DNA sequencing. For a genomic locus $j$ in anatomical site $k$, the probability of observing read count data $x_{kj}$ is defined using the following:
$$p(x_{kj} \mid F_{kj}) = \mathrm{Binom}(a_{kj} \mid a_{kj} + r_{kj},\ \omega_{kj} F_{kj})$$
$a_{kj}$ is the number of reads that map to genomic locus $j$ in anatomical site $k$ with the variant allele
$r_{kj}$ is the number of reads that map to genomic locus $j$ in anatomical site $k$ with the reference allele
$\omega_{kj}$ is a conversion factor from mutation cellular frequency ($F_{kj}$) to variant allele frequency (VAF) for genomic locus $j$ in anatomical site $k$
Using a binomial model, we then estimate the proportion of anatomical site $k$ containing clone $c$ using $p(x \mid F) = p(x \mid UB)$, where $F = UB$ is the mutation cellular frequency matrix. $B \in \{0,1\}^{C \times M}$ is 1:1 with a clone tree, where $C$ is the number of clones and $M$ is the number of mutations or mutation clusters, and $B_{cm} = 1$ if clone $c$ contains mutation $m$ (Figure 1b). $U \in [0,1]^{K \times C}$, where $K$ is the number of anatomical sites, and $U_{kc}$ is the fraction of anatomical site $k$ made up by clone $c$ (Figure 1b). An L1 regularization is used to promote sparsity, since we expect most values in $U$ to be zero. For details on how to set $\omega$, see “Variant read probability calculation ($\omega$)” in Supplementary Information. An alternative way to find a point estimate of $U$ is using a previously described projection algorithm for this problem52,53,56,57. A point estimate can be found by optimizing the following quadratic approximation to the binomial likelihood of $U$ given $B$ and $\hat{F}$:
$$U^{*} = \operatorname*{arg\,min}_{U} \left\lVert W \odot (\hat{F} - UB) \right\rVert_F^2 \quad \text{s.t.} \quad U \geq 0,\ U\mathbf{1} = \mathbf{1} \qquad (1)$$
where $\lVert \cdot \rVert_F$ is the Frobenius norm, $\mathbf{1}$ is a vector of 1s, $\hat{F}$ are the observed mutation frequencies, $W$ is a matrix of inverse-variances for each mutation in each sample derived from $\hat{F}$, and $\odot$ is the Hadamard, i.e., element-wise product. The definition for $W$ is as described in previous work53,56.
We use $U$ (estimated in either of the previously described ways) to determine if a clone $c$ is present in an anatomical site $k$. If $c$ is present, we attach a witness node with label $k$ (leaf nodes connected by dashed lines in Figure 1b, c) to clone $c$ in clone tree $T$. We deem $c$ to be present in $k$ if $U_{kc} > 5\%$ for a given anatomical site $k$ and clone $c$. If a clone does not make up 5% of any of the anatomical sites, and is a leaf node of the clone tree $T$, we remove this node since it is not well estimated by the data.
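As an illustration, the quadratic objective and the presence rule can be sketched in Python as follows. This is a minimal sketch and not Metient's implementation: the function names, the toy matrices, the squared form of the norm, and the strict 5% cutoff are assumptions made for this example.

```python
import numpy as np

def weighted_residual_loss(U, B, F_hat, W):
    """Squared Frobenius norm of the inverse-variance-weighted residual (assumed form)."""
    residual = W * (F_hat - U @ B)          # element-wise (Hadamard) weighting
    return float(np.sum(residual ** 2))

def present_clones(U, threshold=0.05):
    """Boolean (sites x clones) matrix: True where a clone exceeds the cutoff."""
    return U > threshold

def prunable_leaves(children, present):
    """Leaf clones that are not present in any anatomical site."""
    n_clones = present.shape[1]
    return [c for c in range(n_clones)
            if not children.get(c) and not present[:, c].any()]

# Toy clone tree: 0 -> 1, 1 -> 2, 1 -> 3 (hypothetical example data).
children = {0: [1], 1: [2, 3]}
B = np.array([[1, 0, 0, 0],                 # B[c, m] = 1 if clone c carries mutation m
              [1, 1, 0, 0],
              [1, 1, 1, 0],
              [1, 1, 0, 1]], dtype=float)
U = np.array([[0.60, 0.40, 0.00, 0.00],     # U[k, c] = fraction of site k made up by clone c
              [0.10, 0.00, 0.88, 0.02]])
F_hat = U @ B                               # noiseless observed frequencies
W = np.ones_like(F_hat)

present = present_clones(U)
print(weighted_residual_loss(U, B, F_hat, W))
print(prunable_leaves(children, present))
```

In this toy example clone 3 is a leaf that never exceeds the cutoff, so it is the only candidate for removal, while each present clone would receive a witness node for the corresponding site.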
Here the term “anatomical site” is used to describe a distinct tumor mass. If multiple samples are taken from the same tumor mass, we combine them as described in “Bulk DNA sequencing pre-processing: Non-small Cell Lung Cancer Dataset”.
Note that read count data are only used to determine which clones are present in which sites; if a matrix indicating the presence or absence of each clone in each anatomical site is available, it can be used as an input to replace the read count data. These clone-to-site assignment matrices can be derived, e.g., from single-cell data.
Labeling the clone tree
The next step in inferring a migration history is to jointly infer a labeling of the clone tree and resolve polytomies, i.e., nodes with more than two children. Polytomy resolution is discussed in the section “Resolving polytomies”.
Because we are interested in identifying multiple hypotheses of metastatic spread, Metient seeks to find multiple possible labelings of a clone tree $T$. Each possible labeling is represented by a matrix $V \in \{0,1\}^{K \times C}$, where $K$ is the number of anatomical sites and $C$ is the number of clones, and $V_{kc} = 1$ if clone $c$ is first detected in anatomical site $k$. Each column of $V$ is a one-hot vector. We solve for an individual $V$ by optimizing the evidence lower bound, or ELBO, as defined by:
(2) |
where the first term evaluates a labeling based on parsimony, genetic distance, and organotropism, and the second term is the entropy term. The observed clone proportion matrix $U$ has been optimized as described in the previous section “Estimating observed clone proportions”, or taken as input from the user. See Supplementary Information for a full derivation of this objective. Because the labeling matrix $V$ is a matrix of discrete categorical variables, we do not optimize $V$ directly, but rather the underlying probabilities of each category, which we optimize using a Gumbel-softmax estimator (see “Gumbel-softmax optimization”).
Gumbel-softmax optimization
In the previous section, we described how to score the matrix representation of the labeled clone tree, $V$. Here, we describe how to optimize $V$ via the straight-through estimator of the Gumbel-Softmax distribution23,24. We start with a matrix $\psi \in \mathbb{R}^{K \times C}$ of randomly initialized values, where $K$ is the number of anatomical sites and $C$ is the number of clones, and each column $\psi_{c}$ represents the unnormalized log probabilities of clone $c$ being labeled in each site $k$:
1. At every iteration, for each clone $c$, we sample $K$ i.i.d. samples $g_{1c}, \ldots, g_{Kc}$ from Gumbel(0,1) and compute $y_{kc} = \exp\big((\psi_{kc} + g_{kc})/\tau\big) \big/ \sum_{k'=1}^{K} \exp\big((\psi_{k'c} + g_{k'c})/\tau\big)$.
2. We then sample from the categorical distribution represented by the column vector $y_{c}$ by setting $i = \arg\max_{k} y_{kc}$ and represent that sample with a one-hot encoding in $V$, i.e., $V_{kc} = 1$ if $k = i$, 0 otherwise.
3. Then we evaluate the ELBO using a stochastic approximation based on $V$, and take the gradient of this ELBO in the backward pass, thus implementing the straight-through estimator.
During training, we start with a high temperature $\tau$ to permit exploration, then gradually anneal $\tau$ to a small but non-zero value so that the Gumbel-Softmax distribution, $y$, resembles a one-hot vector.
At the end of training, as $\tau$ approaches 0, the gradient becomes unbiased and $y$ approaches $V$. In order to capture multiple modes of the posterior distribution, each representing different hypotheses about the migration history, we optimize multiple $V$ in parallel. To do this, we set up steps 1–3 such that $V^{(1)}, \ldots, V^{(N)}$ are solved for in parallel58 (with a different random initialization for each parallel process), where $N$ is equal to the sample size and is calculated according to the size of the inputs. See Supplementary Information for further explanation.
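The sampling steps above can be sketched with NumPy as follows. This is a minimal forward-pass sketch under stated assumptions, not Metient's code: the function names, the exponential annealing schedule, and its default values are hypothetical. In an automatic differentiation framework, the straight-through estimator is commonly obtained by using the one-hot sample in the forward pass while back-propagating through the soft sample.

```python
import numpy as np

def gumbel_softmax_sample(psi, tau, rng):
    """Draw one labeling from logits psi (K sites x C clones).

    Returns the relaxed sample y (columns sum to 1) and its one-hot version.
    """
    u = rng.uniform(low=1e-12, high=1.0, size=psi.shape)
    g = -np.log(-np.log(u))                      # i.i.d. Gumbel(0, 1) noise
    z = (psi + g) / tau
    z = z - z.max(axis=0, keepdims=True)         # numerical stability
    y = np.exp(z)
    y = y / y.sum(axis=0, keepdims=True)         # softmax over sites, per clone
    one_hot = np.zeros_like(y)
    one_hot[y.argmax(axis=0), np.arange(y.shape[1])] = 1.0
    return y, one_hot

def anneal_tau(step, tau_start=10.0, tau_end=0.1, decay=0.99):
    """Hypothetical schedule: exponential decay with a non-zero floor."""
    return max(tau_end, tau_start * decay ** step)

rng = np.random.default_rng(0)
psi = rng.normal(size=(3, 5))                    # 3 sites, 5 clones
for step in (0, 100, 1000):
    y, V = gumbel_softmax_sample(psi, anneal_tau(step), rng)
    print(step, anneal_tau(step), V.argmax(axis=0))
```

Running several such samplers from different random initializations of `psi` corresponds to optimizing multiple labelings in parallel.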
Resolving polytomies
An overview of the algorithm to resolve polytomies is given in Supplementary Figure S7a and b.
If a node $i$ in $T$ has more than 2 children, we create a new “resolver” node for every site in which either $i$ or $i$’s children are observed. Specifically, for every node $i$ in $T$, we look at the set of nodes containing node $i$ and node $i$’s children. We then tally the anatomical sites of all witness nodes for nodes in this set. If any anatomical site is counted at least twice, a resolver node with that anatomical site label is added as a new child of $i$. The genetic distance between the parent node $i$ and its new resolver node is set to 0 since there are no observed mutations between the two nodes.
We allow each child of $i$ to stay as a child of $i$, or become a child of one of the resolver nodes of $i$.
Any resolver nodes that are unused (i.e., have no children) or which do not improve the migration history (i.e., the parsimony metrics without the resolver node are the same or worse) are removed.
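The tallying rule for proposing resolver nodes can be sketched as follows. This is an illustrative sketch only: the function name and the dictionary-based tree representation are hypothetical, and the later steps (reassigning children and pruning unhelpful resolver nodes) are not shown.

```python
from collections import Counter

def resolver_sites(node, children, witness_sites):
    """Sites for which a resolver node would be proposed under `node`.

    children: dict mapping a node to its list of child clones.
    witness_sites: dict mapping a node to the sites of its witness nodes.
    """
    kids = children.get(node, [])
    if len(kids) <= 2:                       # not a polytomy
        return []
    tally = Counter()
    for n in [node] + kids:
        tally.update(set(witness_sites.get(n, ())))
    return sorted(site for site, count in tally.items() if count >= 2)

children = {0: [1, 2, 3]}
witness_sites = {0: ["lung"], 1: ["liver"], 2: ["liver"], 3: ["lung", "brain"]}
print(resolver_sites(0, children, witness_sites))
```

Here "liver" and "lung" are each counted twice among node 0 and its children, so two resolver nodes would be proposed, whereas "brain" is counted once and yields none.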
Fixing optimal subtrees
To improve convergence, we perform two rounds of optimization when solving for a labeled clone tree and resolving polytomies:
1. Solve for labeled trees and resolve polytomies jointly (as described in previous sections).
2. For each pair of labeled tree and polytomy-resolved tree, find optimal subtrees, i.e., find the largest subtrees, as defined by the greatest number of nodes, in which all nodes have the same label. This means that there is no other possible optimal labeling for this subtree (there are 0 migrations, 0 comigrations, 0 seeding sites), and we can keep it fixed. Fix these nodes’ labelings and adjacency matrix connections (if using polytomy resolution).
3. Repeat step 1 for any nodes that have not been fixed in step 2.
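One way to find such subtrees is a single post-order pass over the labeled tree, sketched below. The function name and tree representation are hypothetical, and this sketch also reports single leaves, which are trivially uniform; an implementation might require a minimum subtree size.

```python
def uniform_subtree_roots(root, children, label):
    """Roots of the maximal subtrees in which every node has the same label."""
    uniform = {}

    def visit(node):
        ok = True
        for child in children.get(node, []):
            child_ok = visit(child)          # always recurse before combining
            ok = ok and child_ok and label[child] == label[node]
        uniform[node] = ok
        return ok

    visit(root)

    roots = []

    def collect(node):
        if uniform[node]:
            roots.append(node)               # maximal: stop descending here
        else:
            for child in children.get(node, []):
                collect(child)

    collect(root)
    return roots

children = {0: [1, 2], 1: [3, 4]}
label = {0: "primary", 1: "liver", 2: "primary", 3: "liver", 4: "liver"}
print(uniform_subtree_roots(0, children, label))
```

In this example the subtree rooted at node 1 is labeled entirely "liver", so its labels could be held fixed while the remaining nodes are re-optimized.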
Metient-calibrate
In Metient-calibrate, we aim to fit a patient cohort-specific parsimony model using the metastasis priors. To score a migration history using genetic distance, we use the following equation: , where contains the normalized number of mutations between clones, and if clone is the parent of clone and clone and clone have different anatomical site labels.
To score a migration history using organotropism, we use the following equation: , where vector contains the frequency at which the primary seeds other anatomical sites, and vector contains the number of migrations from the primary site to all other anatomical sites for a particular migration history.
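A possible form of these two scores is sketched below. The functional form is an assumption made for this example: each score is taken to be a sum of negative log terms over migration edges, chosen to match the negative-log-frequency interpretation used for organotropism. The names `D`, `X`, `o` and `w` are illustrative labels for the normalized genetic distance matrix, the migration-edge indicator matrix, the organotropism frequency vector, and the vector of primary-to-site migration counts.

```python
import numpy as np

def genetic_distance_score(D, X, eps=1e-9):
    """Assumed form: sum of -log(D_ij) over parent-child edges that migrate."""
    return float(-(np.log(D + eps) * X).sum())

def organotropism_score(o, w, eps=1e-9):
    """Assumed form: sum of -log(o_k) weighted by the migrations into site k."""
    return float(-(np.log(o + eps) * w).sum())

# Two clones; clone 0 is the parent of clone 1 and the edge crosses sites.
D = np.array([[0.0, 0.5],
              [0.0, 0.0]])
X = np.array([[0.0, 1.0],
              [0.0, 0.0]])
o = np.array([0.5, 0.25])       # frequency at which the primary seeds each site
w = np.array([1.0, 2.0])        # primary-to-site migrations in this history

print(genetic_distance_score(D, X))
print(organotropism_score(o, w))
```

Under this assumed form, lower scores correspond to migrations on genetically longer edges and to frequently seeded sites.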
To optimize the parsimony metric weights, Metient identifies a Pareto front of labeled trees for each patient and scores these trees based on (1) the weighted parsimony metrics and (2) the metastasis priors: genetic distance and, if appropriate anatomical labels are available, organotropism. These form the parsimony distribution and metastasis prior distribution, respectively. We initialize with equal weights and use gradient descent to minimize the cross entropy loss between the parsimony distribution and metastasis prior distribution for all patients in the cohort. Once the optimization converges, Metient rescores the trees on the Pareto front using the fitted weights, to identify the maximum calibrated parsimony solution, and genetic distance and organotropism are used to break ties between equally parsimonious migration histories. See Supplementary Information for a more detailed derivation.
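The calibration step can be sketched as follows, under explicit assumptions: both distributions are taken to be softmax distributions over each patient's Pareto front (of negative weighted parsimony and of negative prior score), and the weights are left unconstrained. These choices, the function names, and the toy numbers are hypothetical; the detailed derivation is in the Supplementary Information.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def cohort_loss(w, cohort):
    """Mean cross entropy between prior and parsimony distributions."""
    total = 0.0
    for metrics, prior_scores in cohort:
        target = softmax(-prior_scores)          # metastasis prior distribution
        p = softmax(-metrics @ w)                # parsimony distribution
        total += -(target * np.log(p)).sum()
    return float(total / len(cohort))

def fit_weights(cohort, steps=500, lr=0.05):
    n_metrics = cohort[0][0].shape[1]
    w = np.full(n_metrics, 1.0 / n_metrics)      # equal initial weights
    for _ in range(steps):
        grad = np.zeros(n_metrics)
        for metrics, prior_scores in cohort:
            # Analytic gradient of the cross entropy with respect to w.
            grad += metrics.T @ (softmax(-prior_scores) - softmax(-metrics @ w))
        w = w - lr * grad / len(cohort)
    return w

# Each row: (migrations, comigrations, seeding sites) for one Pareto-optimal tree.
cohort = [
    (np.array([[3.0, 2.0, 1.0], [2.0, 2.0, 2.0], [4.0, 1.0, 1.0]]),
     np.array([1.0, 2.5, 0.7])),
    (np.array([[2.0, 2.0, 1.0], [3.0, 1.0, 1.0]]),
     np.array([0.4, 1.2])),
]
w0 = np.full(3, 1.0 / 3.0)
w_fit = fit_weights(cohort)
print(cohort_loss(w0, cohort), cohort_loss(w_fit, cohort))
```

Patients with a single tree on their Pareto front contribute a constant to this loss, so they do not affect the fitted weights.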
Metient-evaluate
In Metient-evaluate, weights for each maximum parsimony metric (migrations, comigrations, seeding sites) and optionally, genetic distance and organotropism, are taken as input. These weights are used to rank the solutions on the Pareto front. If no weights are inputted, we provide a pan-cancer parsimony model calibrated to the four cohorts (melanoma, HGSOC, HR-NB, NSCLC) discussed in this work.
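Ranking a Pareto front with user-supplied weights can be sketched as follows. The helper names and the example weights are hypothetical, and the sketch covers only the three parsimony metrics, with lower values being better.

```python
def pareto_front(metrics):
    """Indices of rows not dominated by any other row (lower is better)."""
    front = []
    for i, a in enumerate(metrics):
        dominated = any(
            all(bk <= ak for ak, bk in zip(a, b)) and
            any(bk < ak for ak, bk in zip(a, b))
            for j, b in enumerate(metrics) if j != i
        )
        if not dominated:
            front.append(i)
    return front

def rank_by_weights(metrics, weights):
    """Indices sorted by weighted parsimony score, best first."""
    scores = [sum(w * m for w, m in zip(weights, row)) for row in metrics]
    return sorted(range(len(metrics)), key=lambda i: scores[i])

# (migrations, comigrations, seeding sites) for four candidate histories.
metrics = [(3, 2, 1), (2, 2, 2), (4, 1, 1), (4, 2, 2)]
front = pareto_front(metrics)
ranking = rank_by_weights([metrics[i] for i in front], (0.5, 0.3, 0.2))
print(front, [front[i] for i in ranking])
```

The fourth history is dominated by the first and is dropped; the remaining three are ordered by the supplied weights.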
Defining the organotropism matrix
Data from the MSK-MET study29 for 25,775 patients with annotations of distant metastases locations was downloaded from the publicly available cbioportal59. Each patient had annotations of one of 27 primary cancer types and the presence or absence of a metastasis in one of 21 distant anatomical sites. The original authors extracted this data from electronic health records and mapped it to a reference set of anatomical sites. We sum over all patients to build a 27 × 21, cancer type by metastatic site occurrence matrix. We then normalize the rows to turn these into frequencies. We interpret the negative log frequencies as a “relative time to metastasis”, and only score migrations from the primary site to other sites, because there is no data to indicate frequencies of seeding from metastatic sites to other metastatic sites, or back to the primary. We make this data available for users, with the option for users to instead input their own organotropism vector for each patient.
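The construction of the occurrence matrix and its negative log frequencies can be sketched as follows. The record format, function names, and toy data are hypothetical and do not reflect the layout of the MSK-MET files.

```python
import numpy as np

def occurrence_matrix(records, cancer_types, sites):
    """Count, per cancer type, how many patients have a metastasis at each site."""
    counts = np.zeros((len(cancer_types), len(sites)))
    for cancer_type, met_sites in records:
        for site in met_sites:
            counts[cancer_types.index(cancer_type), sites.index(site)] += 1
    return counts

def relative_time_to_metastasis(counts):
    """Row-normalize counts to frequencies and take negative logs."""
    freq = counts / counts.sum(axis=1, keepdims=True)
    with np.errstate(divide="ignore"):       # sites never observed map to +inf
        return freq, -np.log(freq)

cancer_types = ["lung", "melanoma"]
sites = ["brain", "liver"]
records = [("lung", {"brain", "liver"}), ("lung", {"brain"}), ("melanoma", {"liver"})]

counts = occurrence_matrix(records, cancer_types, sites)
freq, neg_log = relative_time_to_metastasis(counts)
print(freq)
print(neg_log)
```

A cancer type with no recorded metastases would produce a row of undefined values here, so real data would need that case handled explicitly.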
Evaluations on simulated data
We use the simulated data for 80 patients provided by MACHINA17 to benchmark our method’s performance. To prepare inputs to Metient, we use the same clustering algorithm and clone tree inference algorithm used in MACHINA (MACHINA17 and SPRUCE54, respectively) in order to accurately compare only our migration history inference algorithm (including polytomy resolution) against MACHINA’s. All performance scores are reported using MACHINA’s PMH-TI mode and Metient-calibrate with a sample size of 1024, both with default configurations. We do not use polytomy resolution for Metient-calibrate in these results, since it does not improve performance on simulated data (Supplementary Tables 3, 4). However, this performance is not necessarily indicative of polytomy resolution working poorly, because it actually finds more parsimonious solutions than the ground truth solution in 75% of simulated data (Supplementary Figure S6).
Evaluation metrics.
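We use the same migration graph and seeding clones F1-scores as MACHINA. Given a reconstructed migration graph $G$, its recall and precision with respect to the ground truth migration graph $G^{*}$ are calculated as follows:
$$\text{recall} = \frac{|E(G) \cap E(G^{*})|}{|E(G^{*})|}, \qquad \text{precision} = \frac{|E(G) \cap E(G^{*})|}{|E(G)|}$$
where $E(G)$ are the edges of $G$, and multiple edges between the same two sites are included in $E(G)$. When there are multiple edges from site $s$ to site $t$, $E(G) \cap E(G^{*})$ contains $\min(a, b)$ of these edges, where $a$ and $b$ are the number of edges from site $s$ to site $t$ in $G$ and $G^{*}$, respectively.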
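The multi-edge handling described above amounts to a multiset intersection, which can be sketched with Python's `collections.Counter`. The function name and the example edge lists are hypothetical.

```python
from collections import Counter

def migration_graph_scores(inferred_edges, true_edges):
    """Recall, precision and F1 over multisets of (source site, target site) edges."""
    inferred, truth = Counter(inferred_edges), Counter(true_edges)
    # Counter intersection keeps min(count_inferred, count_true) per edge.
    overlap = sum((inferred & truth).values())
    recall = overlap / sum(truth.values())
    precision = overlap / sum(inferred.values())
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return recall, precision, f1

inferred = [("P", "M1"), ("P", "M1"), ("M1", "M2")]
truth = [("P", "M1"), ("P", "M2")]
print(migration_graph_scores(inferred, truth))
```

In this example only one of the two inferred P-to-M1 edges is matched, giving a recall of 1/2 and a precision of 1/3.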
Recall and precision of the seeding clones in the inferred migration history (which includes inference of both the clone tree labeling and observed clone proportions) is calculated as follows:
$$\text{recall} = \frac{|S \cap S^{*}|}{|S^{*}|}, \qquad \text{precision} = \frac{|S \cap S^{*}|}{|S|}$$
where $S$ is the set of mutations, i.e., the subclone, associated with the clone nodes that have an outgoing migration edge in the inferred migration history, and $S^{*}$ is the corresponding set for the ground truth. For an example, see solution A of Figure 1c. The definition for seeding clones used in these evaluations is distinct from how we define seeding clones in the rest of the paper (“Defining colonizing clones, clonality, and phyleticity” in Methods). Specifically, if there is an edge between two nodes $(u, v)$, where the labelings of $u$ and $v$ are not equal, we define the seeding clone as $v$. However, in order to consistently compare to MACHINA in these evaluations, we use their definition and define the seeding clone as $u$. We note that identifying the mutations of $v$ is generally a harder problem.
Timing benchmarks.
All timing benchmarks (Figure 2e) were run on 8 Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz CPU cores with 8 gigabytes of RAM per core. Runtime of each method is the time needed to run inference and save dot files of the inferred migration histories (and for Metient, an additional serialized file with the results of the top k migration histories). We compare MACHINA’s PMH-TI mode to Metient-calibrate with a sample size of 1024, both with default configurations. These are the same modes used to report comparisons in F1-scores. Each value in Figure 2e is the time needed to run one patient’s tree. Because Metient-calibrate has an additional inference step where parsimony metric weights are fit to a cohort, we take the time needed for this additional step and divide it by the number of patient trees in the cohort, and add this time to each patient’s migration history runtime.
Defining colonizing clones, clonality, and phyleticity
A colonizing clone is defined as a node in a migration history whose parent is a different color than itself. There are two exceptions to this rule: when node $i$ has a parent with a different color than itself, but node $i$ is a witness node (Figure 1c) or a polytomy resolver node (e.g. A_POL in Supplementary Figure S7a). In these cases, the node does not represent any new mutations, but rather contains the same mutations as its parent. For these two cases, the colonizing clone is defined to be $i$’s parent node.
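This definition, including its two exceptions, can be sketched as follows. The function name, the dictionary-based tree, and the node identifiers are hypothetical.

```python
def colonizing_clones(parent, site, no_new_mutations):
    """Colonizing clones of a labeled migration history.

    parent: dict node -> parent node (None for the root).
    site: dict node -> anatomical site label (its "color").
    no_new_mutations: set of witness and polytomy resolver nodes.
    """
    result = set()
    for node, par in parent.items():
        if par is None or site[node] == site[par]:
            continue                          # no migration on this edge
        # Witness and resolver nodes carry no new mutations, so the
        # colonizing clone is their parent instead.
        result.add(par if node in no_new_mutations else node)
    return result

parent = {0: None, 1: 0, 2: 1, "w": 0}
site = {0: "primary", 1: "primary", 2: "liver", "w": "lung"}
print(colonizing_clones(parent, site, no_new_mutations={"w"}))
```

Node 2 changes site relative to its parent and is a colonizing clone; the witness node "w" also changes site, but the colonizing clone reported for that edge is its parent, node 0.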
In order to rectify different meanings of the terms “monoclonal” and “polyclonal” used in previous work, we define two terms:
genetic clonality: if all sites are seeded by the same colonizing clone, this patient is genetically monoclonal, otherwise, genetically polyclonal.
site clonality: if each site is seeded by one colonizing clone, but not necessarily the same colonizing clone, this patient is site monoclonal, otherwise, site polyclonal.
Genetic clonality and site clonality are depicted schematically in Figure 4b.
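The two definitions can be sketched as follows, given a mapping from each seeded site to the set of colonizing clones that seed it. The function name and input format are hypothetical.

```python
def classify_clonality(seeders_by_site):
    """Return (genetic clonality, site clonality) for one patient."""
    all_seeders = set().union(*seeders_by_site.values())
    genetic = "monoclonal" if len(all_seeders) == 1 else "polyclonal"
    site = ("monoclonal"
            if all(len(s) == 1 for s in seeders_by_site.values())
            else "polyclonal")
    return genetic, site

print(classify_clonality({"lymph node": {2}, "liver": {2}}))
print(classify_clonality({"lymph node": {2}, "liver": {5}}))
print(classify_clonality({"lymph node": {2, 5}}))
```

The second call shows the case that motivates the distinction: each site is seeded by a single clone, but the seeding clones differ between sites, so the patient is site monoclonal and genetically polyclonal.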
To define phyleticity, we first extract all colonizing clones from a migration history. We then identify the colonizing clone closest to the root, i.e., the colonizing clone with the shortest path to the root. If all other colonizing clones are descendants of this clone, the migration history is monophyletic; otherwise, it is polyphyletic. Under this definition, if a tree is monophyletic, then there are no independent evolutionary trajectories that give rise to colonizing clones. This is depicted schematically in Figure 4c.
In order to accurately compare our phyleticity measurements to TRACERx, we use their definition in Figure 6d and the TRACERx comparison analysis. To apply their definition to our migration histories, we extract colonizing clones as described above, and then determine if there is a Hamiltonian path in the clone tree that connects the colonizing clones, i.e., we determine if there is a path in the clone tree that visits each colonizing clone exactly once. If such a Hamiltonian path exists, we call this migration history monophyletic under the TRACERx definition, and polyphyletic otherwise.
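Both phyleticity definitions can be sketched as follows. The function names are hypothetical, and the second function assumes that the connecting path runs from ancestor to descendant along a single branch of the clone tree, which is an interpretation made for this sketch.

```python
def ancestors(node, parent):
    """Ancestors of a node, nearest first."""
    out = []
    while parent[node] is not None:
        node = parent[node]
        out.append(node)
    return out

def is_monophyletic(colonizing, parent):
    """All colonizing clones descend from the colonizing clone closest to the root."""
    depth = {c: len(ancestors(c, parent)) for c in colonizing}
    top = min(colonizing, key=depth.get)
    return all(c == top or top in ancestors(c, parent) for c in colonizing)

def is_monophyletic_single_path(colonizing, parent):
    """Assumed reading of the path-based definition: clones lie on one branch."""
    ordered = sorted(colonizing, key=lambda c: len(ancestors(c, parent)))
    return all(a in ancestors(b, parent) for a, b in zip(ordered, ordered[1:]))

# Clone tree: 0 -> 1, 0 -> 4, 1 -> 2, 1 -> 3
parent = {0: None, 1: 0, 2: 1, 3: 1, 4: 0}
for clones in ({1, 2}, {1, 2, 3}, {1, 4}):
    print(sorted(clones),
          is_monophyletic(clones, parent),
          is_monophyletic_single_path(clones, parent))
```

The set {1, 2, 3} illustrates how the two definitions can disagree: clones 2 and 3 both descend from clone 1, but they sit on separate branches and so cannot be joined by a single ancestor-to-descendant path.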
Bootstrap sampling for fitting parsimony metric weights
Running Metient-calibrate on the 167 patients from the melanoma, HGSOC, HR-NB and NSCLC datasets infers a Pareto front of migration histories for each patient. For each dataset, we subset patients that have a Pareto front with size greater than one, and take 100 bootstrap samples of patients from this subset. Patients with a single solution on the Pareto front do not have an impact on the cross-entropy loss used to fit the parsimony metric weights. For each bootstrap sample of patients, their Pareto front migration histories are used to fit the parsimony metric weights (“Calibrate alignment” in Supplementary Information). For each of the parsimony metric weights fit to a bootstrap sample, we evaluated how these weights would order the Pareto front, and evaluated how consistently the same top solution was chosen. We average the percent of times the same solution is ranked as the top solution across the four datasets.
Acknowledgments
We thank Julia Simundza for her valuable feedback on this manuscript, and Deeksha Madala for coming up with the method name Metient. K.G. is supported by NIH grants R37CA266185, U2CCA233284 and U54CA274492. This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. 227260-01 (D.K.) and NIH/NCI Cancer Center Support Grant P30 CA008748 (Vickers).
Footnotes
Code availability
Metient is available as a software package installable with pip at https://github.com/morrislab/metient/. Tutorials for usage can be found at https://github.com/morrislab/metient/tree/main/tutorial. Code to reproduce figures from this manuscript can be found at https://github.com/morrislab/metient/tree/main/metient/jupyter_notebooks.
Data availability
The HR-NB dataset was accessed from the NCI’s Cancer Research Data Commons (https://datacommons.cancer.gov) under the study phs03111.v1.p1. The anatomical site labels for TRACERx patients used data generated by The TRAcking Non-small Cell Lung Cancer Evolution Through Therapy (Rx) (TRACERx) Consortium and provided by the UCL Cancer Institute and The Francis Crick Institute. The TRACERx study is sponsored by University College London, funded by Cancer Research UK and coordinated through the Cancer Research UK and UCL Cancer Trials Centre. The organotropism matrix derived from MSK-MET is available at https://github.com/morrislab/metient/blob/main/metient/data/msk_met/msk_met_freq_by_cancer_type.csv. The following publicly available datasets were used: melanoma3, breast20, HGSOC4, NSCLC14, MSK-MET29.
Bibliography
- 1. Ganesh Karuna and Massagué Joan. Targeting metastatic cancer. Nature medicine, 27(1):34–44, 2021.
- 2. Gundem Gunes, Van Loo Peter, Kremeyer Barbara, Alexandrov Ludmil B, Tubio Jose MC, Papaemmanuil Elli, Brewer Daniel S, Kallio Heini ML, Högnäs Gunilla, Annala Matti, et al. The evolutionary history of lethal metastatic prostate cancer. Nature, 520(7547):353–357, 2015.
- 3. Zachary Sanborn J, Chung Jongsuk, Purdom Elizabeth, Wang Nicholas J, Kakavand Hojabr, Wilmott James S, Butler Timothy, Thompson John F, Mann Graham J, Haydu Lauren E, et al. Phylogenetic analyses of melanoma reveal complex patterns of metastatic dissemination. Proceedings of the National Academy of Sciences, 112(35):10995–11000, 2015.
- 4. McPherson Andrew, Roth Andrew, Laks Emma, Masud Tehmina, Bashashati Ali, Zhang Allen W, Ha Gavin, Biele Justina, Yap Damian, Wan Adrian, et al. Divergent modes of clonal spread and intraperitoneal mixing in high-grade serous ovarian cancer. Nature genetics, 48(7):758–767, 2016.
- 5. Birkbak Nicolai J and McGranahan Nicholas. Cancer genome evolutionary trajectories in metastasis. Cancer cell, 37(1):8–19, 2020.
- 6. Wei Q, Ye Z, Zhong X, Li L, Wang C, Myers RE, Palazzo JP, Fortuna D, Yan A, Waldman SA, et al. Multiregion whole-exome sequencing of matched primary and metastatic tumors revealed genomic heterogeneity and suggested polyclonal seeding in colorectal cancer metastasis. Annals of oncology, 28(9):2135–2141, 2017.
- 7. Hu Zheng, Ding Jie, Ma Zhicheng, Sun Ruping, Seoane Jose A, Shaffer J Scott, Suarez Carlos J, Berghoff Anna S, Cremolini Chiara, Falcone Alfredo, et al. Quantitative evidence for early metastatic seeding in colorectal cancer. Nature genetics, 51(7):1113–1122, 2019.
- 8. Hu Zheng, Li Zan, Ma Zhicheng, and Curtis Christina. Multi-cancer analysis of clonality and the timing of systemic spread in paired primary tumors and metastases. Nature genetics, 52(7):701–708, 2020.
- 9. Gundem Gunes, Levine Max F, Roberts Stephen S, Cheung Irene Y, Medina-Martínez Juan S, Feng Yi, Arango-Ossa Juan E, Chadoutaud Loic, Rita Mathieu, Asimomitis Georgios, et al. Clonal evolution during metastatic spread in high-risk neuroblastoma. Nature Genetics, pages 1–12, 2023.
- 10. Brown David, Smeets Dominiek, Székely Borbála, Larsimont Denis, Szász A Marcell, Adnet Pierre-Yves, Rothé Françoise, Rouas Ghizlane, Nagy Zsófia I, Faragó Zsófia, et al. Phylogenetic analysis of metastatic progression in breast cancer using somatic mutations and copy number aberrations. Nature communications, 8(1):14944, 2017.
- 11. Brastianos Priscilla K, Carter Scott L, Santagata Sandro, Cahill Daniel P, Taylor-Weiner Amaro, Jones Robert T, Van Allen Eliezer M, Lawrence Michael S, Horowitz Peleg M, Cibulskis Kristian, et al. Genomic characterization of brain metastases reveals branched evolution and potential therapeutic targets. Cancer discovery, 5(11):1164–1177, 2015.
- 12. Turajlic Samra, Xu Hang, Litchfield Kevin, Rowan Andrew, Chambers Tim, Lopez Jose I, Nicol David, O’Brien Tim, Larkin James, Horswell Stuart, et al. Tracking cancer evolution reveals constrained routes to metastases: Tracerx renal. Cell, 173(3):581–594, 2018.
- 13. Noorani Ayesha, Li Xiaodun, Goddard Martin, Crawte Jason, Alexandrov Ludmil B, Secrier Maria, Eldridge Matthew D, Bower Lawrence, Weaver Jamie, Lao-Sirieix Pierre, et al. Genomic evidence supports a clonal diaspora model for metastases of esophageal adenocarcinoma. Nature genetics, 52(1):74–83, 2020.
- 14. Bakir Maise Al, Huebner Ariana, Martínez-Ruiz Carlos, Grigoriadis Kristiana, Watkins Thomas B. K., Pich Oriol, Moore David A., Veeriah Selvaraju, Ward Sophia, Laycock Joanne, et al. The evolution of non-small cell lung cancer metastases in tracerx. Nature, Apr 2023.
- 15. Dang HX, White BS, Foltz SM, Miller CA, Luo Jingqin, Fields RC, and Maher CA. Clonevol: clonal ordering and visualization in cancer sequencing. Annals of oncology, 28(12):3076–3082, 2017.
- 16. Reiter Johannes G, Makohon-Moore Alvin P, Gerold Jeffrey M, Bozic Ivana, Chatterjee Krishnendu, Iacobuzio-Donahue Christine A, Vogelstein Bert, and Nowak Martin A. Reconstructing metastatic seeding patterns of human cancers. Nature communications, 8(1):14114, 2017.
- 17. El-Kebir Mohammed, Satas Gryte, and Raphael Benjamin J. Inferring parsimonious migration histories for metastatic cancers. Nature genetics, 50(5):718–726, 2018.
- 18. Zhang Chong, Zhang Lin, Xu Tianlei, Xue Ruidong, Yu Liang, Zhu Yuelu, Wu Yunlong, Zhang Qingqing, Li Dongdong, Shen Shuohao, et al. Mapping the spreading routes of lymphatic metastases in human colorectal cancer. Nature communications, 11(1):1993, 2020.
- 19. Naxerova Kamila, Reiter Johannes G, Brachtel Elena, Lennerz Jochen K, Van De Wetering Marc, Rowan Andrew, Cai Tianxi, Clevers Hans, Swanton Charles, Nowak Martin A, et al. Origins of lymphatic and distant metastases in human colorectal cancer. Science, 357(6346):55–60, 2017.
- 20. Hoadley Katherine A, Siegel Marni B, Kanchi Krishna L, Miller Christopher A, Ding Li, Zhao Wei, He Xiaping, Parker Joel S, Wendl Michael C, Fulton Robert S, et al. Tumor evolution in two patients with basal-like breast cancer: a retrospective genomics study of multiple metastases. PLoS medicine, 13(12):e1002174, 2016.
- 21. Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2023.
- 22. Stiglitz Joseph E. Pareto optimality and competition. The Journal of Finance, 36(2):235–251, 1981.
- 23. Jang Eric, Gu Shixiang, and Poole Ben. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
- 24. Maddison Chris J, Mnih Andriy, and Teh Yee Whye. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
- 25. Lengyel Ernst. Ovarian cancer development and metastasis. The American journal of pathology, 177(3):1053–1064, 2010.
- 26. Mitra Anirban K. Ovarian cancer metastasis: a unique mechanism of dissemination. IntechOpen, 2016.
- 27. Gui Philippe and Bivona Trever G. Evolution of metastasis: New tools and insights. Trends in Cancer, 8(2):98–109, 2022.
- 28. Reticker-Flynn Nathan E, Zhang Weiruo, Belk Julia A, Basto Pamela A, Escalante Nichole K, Pilarowski Genay OW, Bejnood Alborz, Martins Maria M, Kenkel Justin A, Linde, et al. Lymph node colonization induces tumor-immune tolerance to promote distant metastasis. Cell, 185(11):1924–1942, 2022.
- 29. Nguyen Bastien, Fong Christopher, Luthra Anisha, Smith Shaleigh A, DiNatale Renzo G, Nandakumar Subhiksha, Walch Henry, Chatila Walid K, Madupuri Ramyasree, Kundra Ritika, et al. Genomic characterization of metastatic patterns from prospective clinical sequencing of 25,000 patients. Cell, 185(3):563–575, 2022.
- 30. Williams Marc J, Werner Benjamin, Barnes Chris P, Graham Trevor A, and Sottoriva Andrea. Identification of neutral tumor evolution across cancer types. Nature genetics, 48(3):238–244, 2016.
- 31.Taddei François, Radman Miroslav, Maynard-Smith John, Toupance Bruno, Gouyon Pierre-Henri, and Godelle Bernard. Role of mutator alleles in adaptive evolution. Nature, 387(6634):700–702, 1997. [DOI] [PubMed] [Google Scholar]
- 32.Mao Emily F, Lane Laura, Lee Jean, and Miller Jeffrey H. Proliferation of mutators in a cell population. Journal of bacteriology, 179(2):417–422, 1997. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Gentile Christopher F, Yu Szi-Chieh, Serrano Sebastian Akle, Gerrish Philip J, and Sniegowski Paul D. Competition between high- and higher-mutating strains of Escherichia coli. Biology letters, 7(3):422–424, 2011.
- 34.Turajlic Samra and Swanton Charles. Metastasis as an evolutionary process. Science, 352(6282):169–175, 2016.
- 35.Martínez-Jiménez Francisco, Movasati Ali, Brunner Sascha Remy, Nguyen Luan, Priestley Peter, Cuppen Edwin, and Van Hoeck Arne. Pan-cancer whole-genome comparison of primary and metastatic solid tumours. Nature, pages 1–9, 2023.
- 36.Sansregret Laurent and Swanton Charles. The role of aneuploidy in cancer evolution. Cold Spring Harbor perspectives in medicine, 7(1):a028373, 2017.
- 37.Christensen Ditte S, Ahrenfeldt Johanne, Sokač Mateo, Kisistók Judit, Thomsen Martin K, Maretty Lasse, McGranahan Nicholas, and Birkbak Nicolai J. Treatment represents a key driver of metastatic cancer evolution. Cancer Research, 82(16):2918–2927, 2022.
- 38.Gao Yang, Bado Igor, Wang Hai, Zhang Weijie, Rosen Jeffrey M, and Zhang Xiang H-F. Metastasis organotropism: redefining the congenial soil. Developmental cell, 49(3):375–391, 2019.
- 39.Alvord Ellsworth C. Why do gliomas not metastasize? Archives of Neurology, 33(2):73–75, 1976.
- 40.Kumar Sudhir, Chroni Antonia, Tamura Koichiro, Sanderford Maxwell, Oladeinde Olumide, Aly Vivian, Vu Tracy, and Miura Sayaka. PathFinder: Bayesian inference of clone migration histories in cancer. Bioinformatics, 36(Supplement_2):i675–i683, 2020.
- 41.Wolf Ingrid H, Richtig Erika, Kopera Daisy, and Kerl Helmut. Locoregional cutaneous metastases of malignant melanoma and their management. Dermatologic surgery, 30:244–247, 2004.
- 42.Sleeman Jonathan, Schmid Anja, and Thiele Wilko. Tumor lymphatics. In Seminars in cancer biology, volume 19, pages 285–297. Elsevier, 2009.
- 43.Lee Yeu-Tsu Margaret and Geer Deborah A. Primary liver cancer: pattern of metastasis. Journal of surgical oncology, 36(1):26–31, 1987.
- 44.Wu Wenrui, He Xingkang, Andayani Dewi, Yang Liya, Ye Jianzhong, Li Yating, Chen Yanfei, and Li Lanjuan. Pattern of distant extrahepatic metastases in primary liver cancer: a SEER based study. Journal of Cancer, 8(12):2312, 2017.
- 45.Riihimäki Matias, Hemminki A, Fallah Mahdi, Thomsen Hauke, Sundquist Kristina, Sundquist Jan, and Hemminki Kari. Metastatic sites and survival in lung cancer. Lung cancer, 86(1):78–84, 2014.
- 46.Bado Igor L, Zhang Weijie, Hu Jingyuan, Xu Zhan, Wang Hai, Sarkar Poonam, Li Lucian, Wan Ying-Wooi, Liu Jun, Wu William, et al. The bone microenvironment increases phenotypic plasticity of ER+ breast cancer cells. Developmental cell, 56(8):1100–1117, 2021.
- 47.Zhang Weijie, Bado Igor L, Hu Jingyuan, Wan Ying-Wooi, Wu Ling, Wang Hai, Gao Yang, Jeong Hyun-Hwan, Xu Zhan, Hao Xiaoxin, et al. The bone microenvironment invigorates metastatic seeds for further dissemination. Cell, 184(9):2471–2486, 2021.
- 48.Ashley Charles W, Da Cruz Paula Arnaud, Kumar Rahul, Mandelker Diana, Pei Xin, Riaz Nadeem, Reis-Filho Jorge S, and Weigelt Britta. Analysis of mutational signatures in primary and metastatic endometrial cancer reveals distinct patterns of DNA repair defects and shifts during tumor progression. Gynecologic oncology, 152(1):11–19, 2019.
- 49.Angus Lindsay, Smid Marcel, Wilting Saskia M, van Riet Job, Van Hoeck Arne, Nguyen Luan, Nik-Zainal Serena, Steenbruggen Tessa G, Tjan-Heijnen Vivianne CG, Labots Mariette, et al. The genomic landscape of metastatic breast cancer highlights changes in mutation and signature frequencies. Nature genetics, 51(10):1450–1458, 2019.
- 50.Satas Gryte, Zaccaria Simone, El-Kebir Mohammed, and Raphael Benjamin J. DeCiFering the elusive cancer cell fraction in tumor heterogeneity and evolution. Cell systems, 12(10):1004–1018, 2021.
- 51.Jamal-Hanjani Mariam, Wilson Gareth A, McGranahan Nicholas, Birkbak Nicolai J, Watkins Thomas BK, Veeriah Selvaraju, Shafi Seema, Johnson Diana H, Mitter Richard, Rosenthal Rachel, et al. Tracking the evolution of non–small-cell lung cancer. New England Journal of Medicine, 376(22):2109–2121, 2017.
- 52.Kulman Ethan, Kuang Rui, and Morris Quaid. Orchard: building large cancer phylogenies using stochastic combinatorial search. arXiv preprint arXiv:2311.12917, 2023.
- 53.Wintersinger Jeff A, Dobson Stephanie M, Kulman Ethan, Stein Lincoln D, Dick John E, and Morris Quaid. Reconstructing complex cancer evolutionary histories from multiple bulk DNA samples using Pairtree. Blood Cancer Discovery, pages OF1–OF12, 2022.
- 54.El-Kebir Mohammed, Satas Gryte, Oesper Layla, and Raphael Benjamin J. Inferring the mutational history of a tumor using multi-state perfect phylogeny mixtures. Cell systems, 3(1):43–53, 2016.
- 55.Malikic Salem, McPherson Andrew W, Donmez Nilgun, and Sahinalp Cenk S. Clonality inference in multiple tumor samples using phylogeny. Bioinformatics, 31(9):1349–1356, 2015.
- 56.Ray Surjyendu, Jia Bei, Safavi Sam, van Opijnen Tim, Isberg Ralph, Rosch Jason, and Bento José. Exact inference under the perfect phylogeny model. arXiv preprint arXiv:1908.08623, 2019.
- 57.Jia Bei, Ray Surjyendu, Safavi Sam, and Bento José. Efficient projection onto the perfect phylogeny model. Advances in Neural Information Processing Systems, 31, 2018.
- 58.Li Yaoxin, Liu Jing, Lin Guozheng, Hou Yueyuan, Mou Muyun, and Zhang Jiang. Gumbel-softmax-based optimization: a simple general framework for optimization problems on graphs. Computational Social Networks, 8(1):1–16, 2021.
- 59.Gao Jianjiong, Aksoy Bülent Arman, Dogrusoz Ugur, Dresdner Gideon, Gross Benjamin, Sumer S Onur, Sun Yichao, Jacobsen Anders, Sinha Rileen, Larsson Erik, et al. Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Science signaling, 6(269):pl1–pl1, 2013.
- 60.Sankoff David. Minimal mutation trees of sequences. SIAM Journal on Applied Mathematics, 28(1):35–42, 1975.
- 61.Tarabichi Maxime, Salcedo Adriana, Deshwar Amit G, Leathlobhair Máire Ni, Wintersinger Jeff, Wedge David C, Van Loo Peter, Morris Quaid D, and Boutros Paul C. A practical guide to cancer subclonal reconstruction from DNA sequencing. Nature methods, 18(2):144–155, 2021.
- 62.Roth Andrew, Khattra Jaswinder, Yap Damian, Wan Adrian, Laks Emma, Biele Justina, Ha Gavin, Aparicio Samuel, Bouchard-Côté Alexandre, and Shah Sohrab P. PyClone: statistical inference of clonal population structure in cancer. Nature methods, 11(4):396–398, 2014.
- 63.Gillis Sierra and Roth Andrew. PyClone-VI: scalable inference of clonal population structures using whole genome data. BMC bioinformatics, 21:1–16, 2020.
- 64.Nik-Zainal Serena, Van Loo Peter, Wedge David C, Alexandrov Ludmil B, Greenman Christopher D, Lau King Wai, Raine Keiran, Jones David, Marshall John, Ramakrishna Manasa, et al. The life history of 21 breast cancers. Cell, 149(5):994–1007, 2012.
Data Availability Statement
The HR-NB dataset was accessed from the NCI’s Cancer Research Data Commons (https://datacommons.cancer.gov) under the study phs03111.v1.p1. The anatomical site labels for TRACERx patients used data generated by The TRAcking Non-small Cell Lung Cancer Evolution Through Therapy (Rx) (TRACERx) Consortium and provided by the UCL Cancer Institute and The Francis Crick Institute. The TRACERx study is sponsored by University College London, funded by Cancer Research UK and coordinated through the Cancer Research UK and UCL Cancer Trials Centre. The organotropism matrix derived from MSK-MET is available at https://github.com/morrislab/metient/blob/main/metient/data/msk_met/msk_met_freq_by_cancer_type.csv. The following publicly available datasets were used: melanoma3, breast20, HGSOC4, NSCLC14, MSK-MET29.