Significance
The knowledge that cancer is an evolutionary process is old, but only recently has sequencing technology provided data for clinically relevant evolutionary analyses of cancer. Approaches developed for evolutionary biology can reveal the relationships among clonal lineages, the ancestral states of gene sequences, and the timing of evolutionary events. We performed whole exome sequencing of cancer tissues from multiple sites of dozens of subjects, demonstrating nonlinear patterns of tumor progression and early origins of metastatic lineages and quantifying the times of occurrence of driver mutations. These findings direct research attention away from the search for genes that induce metastasis toward genes that are mutated early in tumorigenesis, providing therapeutic targets effective against both primary tumors and metastases.
Keywords: tumor phylogenetics, ancestral reconstruction, cancer, chronograms, oncogenes
Abstract
Many aspects of the evolutionary process of tumorigenesis that are fundamental to cancer biology and targeted treatment have been challenging to reveal, such as the divergence times and genetic clonality of metastatic lineages. To address these challenges, we performed tumor phylogenetics using molecular evolutionary models, reconstructed ancestral states of somatic mutations, and inferred cancer chronograms to yield three conclusions. First, in contrast to a linear model of cancer progression, metastases can originate from divergent lineages within primary tumors. Evolved genetic changes in cancer lineages likely affect only the proclivity toward metastasis. Single genetic changes are unlikely to be necessary or sufficient for metastasis. Second, metastatic lineages can arise early in tumor development, sometimes long before diagnosis. The early genetic divergence of some metastatic lineages directs attention toward research on driver genes that are mutated early in cancer evolution. Last, the temporal order of occurrence of driver mutations can be inferred from phylogenetic analysis of cancer chronograms, guiding development of targeted therapeutics effective against primary tumors and metastases.
It has long been understood that tumorigenesis is an evolutionary process (1) associated with the accumulation of somatic mutations (2). However, many aspects of that process that are fundamental to cancer biology and targeted treatment have been challenging to reveal, such as the divergence times and genetic clonality of metastatic lineages (3, 4). Somatic mutations have revealed tumor type-specific drivers by comparison of primary tumor and normal tissues (5, 6), and studies examining the evolutionary process of cancer across multiple sites have used a handful of subjects to identify ubiquitous, shared, and private mutations (1) and to reconstruct a number of tumor phylogenies using parsimony or unweighted pair group methods with arithmetic mean (1, 7) but have lacked the power to generalize about the tumorigenic or metastatic process across cancer types (1).
Tumor phylogenetics, applied to a larger sample with explicit molecular evolutionary models, can reconstruct ancestral states of somatic mutations and infer cancer chronograms, revealing novel information about the timing of gene mutations and their contributions to tumorigenesis and metastasis and addressing three fundamental aspects of cancer biology. First, the topology of divergence of primary and metastatic lineages can differentiate between a linear model of cancer progression, in which all metastatic tumors are descended from a single original primary cell such that all metastases are more closely related to each other than they are to any tissue in the primary tumor, and a branched model, in which metastases can originate from divergent lineages within primary tumors. Second, molecular evolutionary trees and chronograms can quantify how early metastatic lineages arise in tumor development, clarifying the role of mutations in facilitating metastasis. Last, integration of temporal inferences across patients can convey the order of occurrence of driver mutations, guiding development of targeted therapeutics effective against primary tumors and metastases.
Here, we perform tumor phylogenetics to address these questions. Although ascertaining variable degrees of tumor heterogeneity (1) by computational analyses of subclonality within primary tumors has proven challenging (8), another approach to revealing heterogeneity is analysis of the sequence divergence of major clones extracted from distant sites. We replayed the “tape of cancer” and mapped genetic mutations on the tree of cancer evolution extending from normal tissue to primary tumor and metastases. Analyzing new exome sequence data from primary and metastatic sites, we applied maximum likelihood and Bayesian approaches to reveal phylogenetic relationships and tumor evolution chronology. We identified genetic mutations associated with tumorigenesis that commonly precede the first genetic divergence of all cancer lineages, also examining those that precede all metastases. Furthermore, we quantified the temporal distributions of the first genetic divergence of metastases from the primary tumor and evaluated the temporal order of gene mutations in cancer.
Results and Discussion
We generated a unique dataset of sequenced normal, primary, and matched metastatic tumor tissues from 40 subjects with 13 types of cancer, including 13 subjects with lung cancer and 7 subjects with pancreatic cancer. We sequenced exomes of normal, primary, and matched metastatic tumor tissues, including 32 primary tumors and 139 sites of metastases, ranging from two to seven metastases per subject (Fig. 1A, Fig. S1A, Dataset S1, and SI Materials and Methods). Twenty-four of these primary tumors were identified clinically without ambiguity whereas, in eight lung cancer subjects, the primary tumor was identified with less certainty. Exomes were captured from normal tissues and tumors of sufficient purity and sequenced to an average of nearly 200× coverage per targeted base, 95% of which were covered by more than 20 independent reads (Fig. S1B and Dataset S2). After alignment to the reference human genome, comparison of normal and all matched tumor sequences identified 20–5,370 somatic mutations in each subject. Variant calls were tested for a subset of somatic mutations by Sanger sequencing and were 100% validated (Dataset S3).
Fig. 1.
Tumor samples and methodology. (A) Cancer type and number of metastases analyzed for 40 subjects in our study. (B) Maximum likelihood tree of normal tissue (blue circle, 416), primary tumor of the cervix (red circle, P), and metastatic tumors (black circles, M2, M3, and M4) from subject 416, with a horizontal scale proportional to the number of mutations, and with all known driver mutations mapped to branches. The red internode represents the lineage ancestral to the primary tumor and all metastases. The orange internode represents the lineage ancestral to all metastases but not to the primary tumor. Genes in red have more than one mutation occurring on multiple branches; mutation locations are indicated in parentheses. Numbers at each internode indicate the statistical support for the corresponding branch (1 means 100% support). (C) Inferred cancer chronogram for subject 416, scaled in years, encompassing the first genetic divergence from Normal sequence (29.4 y), the first genetic divergence of metastases (17 y, blue dashes), and the diagnosis time (8 mo, red dashes). The phylogeny for subject 416 exhibited a diversity of cancer driver mutations occurring across multiple branches.
Fig. S1.
Tumor samples and methodology. (A) Formalin-fixed paraffin-embedded (FFPE) cored samples of normal, primary, and metastatic tumor tissue from one patient (427). (B) Coverage of targeted bases at 20× or greater (% of bases, orange points) and sequencing coverage (% of reads mapping to genome, blue points; % of reads mapping to targeted exome, red points). Error bars indicate ±1 SD, except tumor purity (%, yellow) of primary and metastatic tumors, for which error bars indicate the 25th and 75th percentile of the empirical distribution.
Tumor Phylogenetics of Multiregion Tumors Revealed the Origins of Metastatic Lineages from Divergent Lineages Within Primary Tumors.
We constructed 40 multiple sequence alignments, each alignment including the somatic variants from all tumors within each subject and their matched normal tissue sequence (Dataset S4). To determine the genetic relationship of these tumors, we applied parsimony-based (9), maximum likelihood (10), and Bayesian inference (11) to the multiple sequence alignment, estimating phylogenies of the tumor samples within subjects (Fig. 1B and Fig. S2). We then calibrated these phylogenies with tumor type-specific data on tumor cell division times (12) and with the clinical timings of diagnosis, biopsy, resection, and autopsy for each patient (Datasets S5 and S6) to evaluate the evolutionary pattern and tempo of the genetic divergence of primary and metastatic tumor lineages (Fig. 1C and Fig. S3).
Fig. S2.
Maximum likelihood evolutionary trees for each of the 40 subjects, aligned with the genomic LOH (blue lines). The blue circle labeled with the subject ID number represents the normal tissue, the red circle labeled “P” represents the primary tumor, and the black circles represent metastatic tumors (“M1”, “M2”, “M3”, etc.). Internodes are scaled to be proportional to the number of mutations. Bootstrap branch support is reported at corresponding internal nodes. LOH is plotted with gray dashed lines indicating divisions between chromosomes. Red bars on the LOH plots indicate nonsilent mutations in driver genes. Yellow bars indicate insertions or deletions in driver genes.
Fig. S3.
Chronograms with diagnosis (red dashed line), biopsy, and resection times. Gray dashed lines indicate the times of inferred genetic divergences of tumors. The blue violin plots indicate the 95% central interquartile distribution of branching times for each node.
Rare lineage-specific events, such as new somatic mutations or epigenetic marks, might trigger metastasis, a hypothesis that is a component of linear models of progression (13). If such rare lineage-specific events induce metastasis from the primary tumor, then primary tumor lineages would be expected to produce a single monophyletic clade of metastatic lineages with the primary tumor and normal tissue as outgroups (i.e., all metastatic lineages would share a common ancestor that is more recent than their most recent common ancestor with the primary tumor; e.g., Fig. 1B). In contrast, if production of metastatic lineages is rare but independent of causative genetic and epigenetic alterations (i.e., stochastic production of metastatic lineages from all extant lineages), then primary lineages would not be an outgroup to all metastatic lineages in all patients. The stochastic expectation for the primary lineage being the outgroup in a patient with n tumor samples is equal to the number of possible phylogenetic trees with the primary tumor constrained to be an outgroup, (2n − 5)!!, divided by the number of possible phylogenetic trees, (2n − 3)!!, which simplifies to 1/(2n − 3).
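The stochastic expectation can be checked numerically. The following is a minimal sketch rather than the authors' code: the function names are illustrative, and n is read here as the number of tumor tips (the primary tumor plus its metastases), with normal tissue as the root.

```python
from fractions import Fraction


def double_factorial(k: int) -> int:
    """Return k!!, with (-1)!! = 0!! = 1 by convention."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def outgroup_probability(n: int) -> Fraction:
    """Chance that the primary tumor is the outgroup to all metastases
    when every rooted tree on n tumor tips is equally likely."""
    if n < 2:
        raise ValueError("need a primary tumor and at least one metastasis")
    constrained = double_factorial(2 * n - 5)  # trees with the primary as outgroup
    total = double_factorial(2 * n - 3)        # all rooted trees on n tips
    return Fraction(constrained, total)


# The ratio of double factorials collapses to 1/(2n - 3) for every n checked.
for n in range(2, 9):
    assert outgroup_probability(n) == Fraction(1, 2 * n - 3)
    print(n, outgroup_probability(n))
```

Under this reading, a subject with a primary tumor and two metastases (n = 3) has a one-in-three chance of showing the primary tumor as the outgroup by chance alone, and the chance falls as more metastases are sampled.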
Of the 24 cancer phylogenies that featured a clinically unambiguous primary tumor and two or more metastases, 16 featured a well-supported topological position of the primary tumor also consistent with patterns of loss of heterozygosity (Fig. S2). Of these 16, we found that 6 (38%) exhibited a most likely topology in which metastatic tumor lineages were not monophyletic and the primary tumor was not the outgroup to all metastases (e.g., Fig. 2A) [Bayesian posterior probability (BPP) of the primary tumor as the outgroup ranged from 13–30%, median 21%] (Dataset S7). Monophyly and paraphyly of metastases were not correlated with any particular tumor type. This remarkable frequency of metastatic tumor paraphyly rejects a general model stating that a rare, heritable, and lineage-specific event is necessary to induce metastasis. Instead, it favors a model where a metastatic event can occur stochastically from a heterogeneous primary tumor. Our phylogenies clearly demonstrate a branched evolution model of tumorigenesis and metastasis, as advocated by Burrell et al. (14). Furthermore, they demonstrate that not only do metastatic lineages evolve in parallel with the lineage sequenced from the primary tumor (3), but also they originate from divergent lineages within primary tumors.
Fig. 2.
Four maximum likelihood cancer molecular evolutionary trees, with a horizontal scale proportional to the number of mutations. (A) Subject 424 had a colon primary tumor and metastases to the duodenum (M1) and liver (M2). The primary tumor was an ingroup to all metastases in 80.4% of the Bayesian posterior of trees for subject 424. (B) Subject 446 had a pancreatic adenocarcinoma primary tumor and metastases to the kidney (M2), bowel (M3), and liver (M4). The primary tumor was an outgroup to all metastases in 99.8% of the Bayesian posterior of trees for subject 446. (C) Subject 435 had a poorly differentiated lung adenocarcinoma primary tumor and metastases to the lung (M0), liver (M1), pancreas (M3), hilar lymph node (M4), paraprostatic soft tissue (M5), perirenal soft tissue (M6), and mediastinum (M7). The primary tumor was an outgroup to all metastases in 100% of the Bayesian posterior of trees for subject 435. (D) Subject 459 had a lung adenocarcinoma primary tumor and metastases to the lung (M1), liver (M2), spleen (M3), kidney (M4), adrenal (M5) and paratracheal lymph node (M6). The primary tumor was an outgroup to all metastases in 100% of the Bayesian posterior of trees for subject 459.
In contrast, 10 phylogenies exhibited a topology in which the unambiguous primary tumor was the outgroup to all metastatic tumors (BPP 67–100%; median 100%; e.g., Fig. 2 B–D and Dataset S7). This proportion [62.5%; 95% credible interval (CI) 35–85%] is significantly higher than the 19% random expectation for this 16-subject subset (Dataset S7). To incorporate the cancer phylogenies of the 8 additional subjects whose inferred topology with regard to the primary tumor was indicative but moderately to highly uncertain, we integrated over uncertainty of all trees using the Bayesian posterior distributions for all 24 subjects with clinically unambiguous primary tumors. The result was consistent with our previous analysis, yielding a posterior average of 14.3 out of 24 subjects (60%; CI 46–75%), with their primary tumor as the outgroup (Dataset S7).
We then included an additional eight cancer phylogenies with two or three metastases for which the clinical identification of the resected tumors as primary was deemed to be of moderate confidence, yielding a consistent 55% (CI 44–69%) of phylogenies with the primary tumor as the outgroup (Dataset S7). The results are significantly higher than the random expectation (21% for the 32 subjects). This higher value demonstrates that heritable genetic, epigenetic, or other lineage-specific events can contribute a proclivity within lineages toward metastasis of the primary tumor. However, this result also demonstrates that this lineage-specific effect is not so strong as to universally lead to monophyletic metastases (Dataset S7; P < 10^−11 that at least one of the subjects had a primary tumor that was an ingroup to metastases). Thus, either heritable genetic or epigenetic events at best induce a proclivity toward increased metastasis, or they typically occur early in the evolution of the primary tumor and thus are regularly present in all or nearly all primary tissue long before detection.
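Integrating over topological uncertainty can be illustrated with per-subject posterior probabilities. The sketch below is one plausible reading, not the authors' procedure: it assumes each subject contributes its Bayesian posterior probability (BPP) that the primary tumor is the outgroup, treats subjects as independent, and uses made-up BPP values rather than those in Dataset S7.

```python
import math


def expected_outgroup_count(bpp):
    """Posterior mean number of subjects whose primary tumor is the outgroup."""
    return sum(bpp)


def prob_at_least_one_ingroup(bpp):
    """Probability that at least one subject has a primary tumor that is an
    ingroup to metastases, assuming independence across subjects."""
    return 1.0 - math.prod(bpp)


# Hypothetical per-subject posterior probabilities (NOT the values in Dataset S7).
example_bpp = [1.0, 0.998, 0.67, 0.21, 0.13, 0.30]
print(expected_outgroup_count(example_bpp))
print(prob_at_least_one_ingroup(example_bpp))
```

Summing probabilities rather than counting most likely topologies lets weakly resolved subjects contribute fractionally, which is how a noninteger count such as 14.3 out of 24 can arise.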
The simple linear model specified (that all metastatic tumors are descended from a single original primary cell, such that all metastases are more closely related to each other than they are to any tissue in the primary tumor) requires no explicit modeling of the processes of tumorigenesis and metastasis to imply a specific phylogenetic pattern of metastases and primary tumors. There are many additional questions about the process of metastasis that cannot be addressed without more complex models of the processes underlying tumorigenesis and metastasis and comprehensive intratumor sampling (15). Nevertheless, our phylogenetic analysis demonstrates that (i) paraphyly happens, violating the simplest of linear null hypotheses about how primary and metastatic tumors are related (the hypothesis that all metastases are more closely related to each other than they are to any tissue in the primary tumor) and that (ii) paraphyly happens less than would be expected completely by chance, indicating that there is a proclivity within some lineages toward metastasis.
Inference of Cancer Chronograms Revealed the Early Genetic Divergence of Primary and Metastatic Tumor Lineages.
Examination of the inferred molecular evolutionary trees illustrates that metastatic lineages can diverge genetically from the primary tumor early in the cancer history. In the cancer molecular evolutionary trees for 11 out of 40 subjects (Fig. S2), the shared ancestral lineage of all tumors was shorter than the subsequent branch lengths leading to a metastatic tissue sampled at autopsy. To examine this pattern of early metastasis further, we analyzed the implications of the inferred tree topology combined with constraints from the timing of clinical sample acquisition. For this purpose, we transformed the molecular evolutionary trees into chronograms (Fig. 1C), applying a molecular clock calibrated with the timings of diagnosis, biopsy, surgical resection, and autopsy, and parameterized by cell division times of primary tumor cells (Datasets S5 and S6). For all subjects, we mapped the time of the first genetic divergence of tumor lineages to a 0–1 timescale, where 0 represented the time of genetic origin of tumorigenesis from normal tissue, and 1 represented death of the patient. In the cancer chronograms of 7 subjects, the most recent common ancestor of the primary tumor and metastases occurred in the first half of both the chronological timescale and the tree with branches scaled to mutations (e.g., Fig. 3 A vs. B–D). Thus, the divergence of metastatic lineages from the primary tumor can occur closer in time to the first genetic divergence of the tumor from normal tissue than to death.
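The 0–1 timescale is a linear rescaling of each chronogram. The sketch below assumes that event times are expressed in years before death and uses the values quoted for subject 416 in Fig. 1C under that assumption; the function name is illustrative.

```python
def to_unit_timescale(years_before_death: float,
                      origin_years_before_death: float) -> float:
    """Map an event to 0 (first divergence from normal tissue) .. 1 (death)."""
    if origin_years_before_death <= 0:
        raise ValueError("the origin must precede death")
    if not 0 <= years_before_death <= origin_years_before_death:
        raise ValueError("the event must fall between origin and death")
    return 1.0 - years_before_death / origin_years_before_death


# Subject 416 (Fig. 1C), reading each time as "before death" (an assumption).
origin = 29.4                                         # first divergence from normal
print(round(to_unit_timescale(17.0, origin), 2))      # first metastatic divergence, about 0.42
print(round(to_unit_timescale(8 / 12, origin), 2))    # diagnosis, about 0.98
```

On this scale, a value below 0.5 marks an event that lies closer to the genetic origin of the tumor than to death.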
Fig. 3.
Timings of the first genetic divergence from normal tissue sequence (blue circle), of the first genetic divergence of metastases (blue dashes) and of diagnosis (red dashes) during tumor progression. (A) Subject 459 (lung adenocarcinoma, aged 54 y at death) provided an example of early diagnosis of the primary tumor without diagnosis of metastases, but also early divergence of the metastases. (B–D) Subjects 414 (lung, large cell, aged 25 y), 418 (ovarian, aged 47 y), and 439 (renal clear cell carcinoma, aged 58 y) provided examples of late metastasis in which diagnosis of the primary tumor and metastases occurred after the first genetic divergence of metastasis. (E) Probability density for the occurrence of the first genetic divergence of metastases and for the time of diagnosis. The x axis is scaled from 0 (the first genetic divergence of primary tumor tissue from normal tissue) to 1 (death). In our set of 40 lethal cancers, the first genetic divergences of metastatic lineages (blue triangles) are distributed so as to often occur earlier than diagnosis time (red triangles).
The distribution of the times of the earliest divergences of the metastatic and primary tumor lineages is not uniform (Fig. 3E; P < 0.001). The mean time of the most recent common ancestors of all metastatic and primary tumor lineages was 0.72 on our 0–1 timescale whereas the mean cancer diagnosis time was 0.90, late in the unique genetic history of the cancer (Fig. 3E; P < 0.001). Saliently, in 35 of our subjects (87.5%), genetic divergence of the first metastatic lineage had already occurred by the time of diagnosis of the primary tumor.
In six cases (15%) in which no metastatic tumors were clinically identified at diagnosis of the primary tumor, we found that metastases had in four cases already genetically diverged. In another five subjects (12.5%), genetic divergence of the first metastatic lineage occurred after diagnosis. In three of these five, however, metastases were diagnosed together with the primary tumor at clinical presentation. Thus, either the metastases sampled were not those present at diagnosis or the true genetic divergences of metastases in these subjects were earlier than inferred by our methods. Timing of divergence of the metastatic and primary tumor lineages was uncorrelated with tumor type. Based on these results, lineages that proceed to metastasis can genetically differentiate from the primary tumor lineage early in the evolutionary and temporal history of cancer, and genetic divergence of metastases often predates diagnosis even when they are not evident at diagnosis. When metastases genetically diverge early, there is less time for the accumulation of any mutations conferring the ability to metastasize subsequent to tumorigenesis. Therefore, early genetic divergence of metastases favors a parallel model (16) in which frequently disseminated cancer cells rarely establish themselves, rather than a linear model (13) of cancer evolution requiring somatic mutations to produce metastases. The observed early genetic divergence of metastases has implications for the retrospective examination of medical care, including whether surgery is seeding distant metastases (17). For example, claims that metastases necessarily arose subsequent to a delay in diagnosis and resection could be refuted.
Our results indicate that metastases from primary tumors can be produced early and stochastically. They suggest that clinical treatments addressing genetically heterogeneous metastases could be warranted even when the sole diagnosis is of a primary tumor (18). In subjects with early or stochastic metastases, the likelihood that crucial metastasis-inducing mutations will be identified as suitable targets for pharmacological intervention is low. Nevertheless, if present, mutations conveying an increased chance of metastasis would be expected to occur before the divergence of any metastatic tumor lineages from the primary tumor. Thus, their identification by examination of the shared ancestry of metastatic lineages separate from the evolution of the sampled primary tumor lineage would be of particular importance (19).
Ancestral State Reconstruction Revealed the Temporal Occurrence of Mutations in Driver Genes Along Tumor Progression.
To identify mutations that are divergent from the primary tumor lineage but shared by evolved metastatic lineages, we applied molecular evolutionary models to estimate ancestral sequences in the phylogeny, inferred the gene sequence at every branch point, and mapped all mutations to internodes. Internodes diverging from a primary tumor lineage that constitute a shared ancestral lineage of all metastatic lineages might be particularly likely to be associated with mutations that facilitate metastasis (Fig. 1B). We evaluated such mutations with MutSigCV (5) and found genes exhibiting an abnormal burden of mutations.
The early mutations disposing a cell lineage toward cancer are often thought to play key functional roles because they are shared by all tumors (primary and metastatic) and are far more likely to be drivers of tumorigenesis that are key to the origin and persistence of cancer than those that are acquired subsequently. Therefore, we examined whether any genes exhibited an abnormal burden of nonsilent mutations mapping specifically to the origin of the primary tumor (Fig. 1B). MutSigCV identified the well-known tumor suppressor TP53 and the oncogene KRAS (false discovery rate of ≤0.1) as genes repeatedly mutated early in the evolution of these diverse tumors. Their frequent presence in the root of cancer lineages implies that they play key formative roles in the origin of cancer and that they deserve redoubled attention for their roles in tumorigenesis. The potential to therapeutically target these longstanding drivers of cancer continues to evolve, including targeting TP53 via the TP53/MDM2 interaction (20), targeting KRAS via KRAS G12C (21), targeting posttranslational modifications of KRAS (22), or targeting upstream or downstream effectors (23).
To investigate whether mutations of a larger set of known cancer driver genes (6, 19; Dataset S8) tend to occur early in cancer evolution, we compared the occurrence of somatic mutations in driver versus nondriver genes on the early branch shared by all primary and metastatic tumors (Fig. 1B) with their occurrence in subsequent branches. Across all subjects, we found that nonsilent tumor type-specific driver mutations (Datasets S9 and S10) are enriched in the early branch (35 early drivers, 13 late drivers, 2,080 early nondrivers, 3,056 late nondrivers; P = 0.0001, Fisher’s exact test) after removing hypermutated subject 430 (24). Statistical significance held even when retaining subject 430 (P = 0.001; Dataset S11) or when tallying all mutations in cancer drivers (P = 0.001; Dataset S11).
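The enrichment test on the 2 × 2 table of counts can be sketched with a standard two-sided Fisher's exact test. This is an illustrative standard-library implementation, not the authors' script; the P value it prints may differ from the reported one depending on sidedness, rounding, and how mutations were tallied.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def pmf(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = pmf(a)
    # Sum the probabilities of all tables no more probable than the observed one.
    tail = sum(p for p in map(pmf, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, tail)


# Counts quoted in the text: drivers (early, late), nondrivers (early, late).
early_drv, late_drv, early_non, late_non = 35, 13, 2080, 3056
odds_ratio = (early_drv * late_non) / (late_drv * early_non)
p_value = fisher_exact_two_sided(early_drv, late_drv, early_non, late_non)
print(f"odds ratio = {odds_ratio:.2f}, P = {p_value:.2g}")
```

An odds ratio above 1 indicates that driver mutations fall on the early shared branch more often than nondriver mutations do.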
Furthermore, in the early branch, the ratio of nonsilent versus silent mutations (dN/dS) was significantly higher in tumor type-specific drivers than in nondrivers (41 dN driver, 1 dS driver; 4,418 dN nondriver, 1,908 dS nondriver; P = 0.0001). This significance still held when we tallied mutations in all cancer genes from Dataset S8 for all subjects (P = 0.0027; Dataset S12). However, dN/dS was not significantly higher in tumor-specific drivers in the late branches than in nondrivers (P = 0.11; Dataset S12). The significance of results for both early and late branches remained unchanged after removing hypermutated tumors from subject 430 from the analysis (Dataset S12).
Perhaps a high somatic selection intensity for these driver mutations within clonal cancer lineages means that they either are critical initiating events or are likely to outreplicate other mutations within a small clonal neoplasm. Accordingly, previous assessments of the important genes in cancer (6, 19; Dataset S8) correlate with our findings of their timing early in tumorigenesis. For example, mutations in six cancer driver genes known to play key roles in cancer (KRAS, TP53, PIK3CA encoding a phosphoinositide 3-kinase, KMT2C and KMT2D encoding histone methyltransferases, and ALK encoding a receptor tyrosine kinase) occurred in multiple cancer patients across tumor types. Based on the inferred locations of these mutations in the cancer chronograms (and consistent with our early driver finding), KRAS and TP53, involved in cancer-related pathways such as MAPK signaling and telomere maintenance, are likely to be key early mutations in tumorigenesis. Mutations in KMT2D and PIK3CA frequently occurred midway along the cancer history (P < 0.001), and ALK and KMT2C frequently occurred late (P < 0.001) in the evolution of cancer (Fig. 4). Note that TP53 would be expected to receive its first mutation earlier, on average, than a gene with a smaller mutational target such as KRAS, which is oncogenic only with mutations at codon sites 12, 13, 18, 61, and 117, because the mutational target size of genes is highly relevant to when they are likely to receive their first disabling mutation. The temporal interdispersion of mutations of a tumor suppressor with high mutational target size (TP53) within oncogenes with low mutational target sizes (KRAS, PIK3CA, KMT2D, ALK, and KMT2C) implies that the temporal order inferred is partially driven by positive selection and possibly epistatic interactions rather than solely being driven by the waiting time to the first driver mutation.
Fig. 4.
Inferred distributions of the temporal occurrence of mutations (∆) in cancer driver genes KRAS (dark purple line, no shading), TP53 (black line, gray shading), PIK3CA (olive line, olive shading), KMT2D (medium purple line, no shading), ALK (light gray line, light gray shading), and KMT2C (light purple line, no shading) across diverse cancer types. Probability densities for the appearance of alleles with nonsilent mutations across these cases indicate that mutations of KRAS and TP53 tend to occur earlier than mutations of PIK3CA or KMT2D (P < 0.001), which, in turn, tend to occur earlier than mutations of ALK or KMT2C (P < 0.001). Mutations are depicted at the midpoint of the interval during which the gene was inferred to be mutated in the particular subject.
Conclusion
First, we have demonstrated that genetic lineages of metastases can arise early in primary tumors, sometimes long before primary tumor diagnosis. This result directs research efforts away from prevention of metastasis-facilitating mutations and toward a better understanding of fundamental drivers of tumorigenesis. Second, we demonstrate conclusively that, in contrast with the longstanding model of linear progression of cancer, metastases can originate from divergent lineages within primary tumors. This result argues that, although evolved genetic changes in cancer lineages seem to affect the proclivity of tumor cells to metastasize, it is unlikely that there are single genetic changes that are necessary or sufficient for metastasis. Lastly, we demonstrate the temporal order of occurrence of relevant driver mutations, indicating their relative roles in tumorigenesis. Although many studies have identified therapeutically targetable alterations associated with late-stage tumors (25), evolutionary analyses of the timing of mutations in driver genes in cancerous tissues will be important not only because pharmaceutical treatment could have a large therapeutic effect on tumors exhibiting these early-appearing mutations, but also because any such therapeutic effect is likely to be exerted across otherwise heterogeneous primary and metastatic tumor lineages. These alleles represent targets for pharmaceutical intervention that exist in nearly all subsequent lineages, and their targeting would provide the greatest potential for long-term success of therapy.
Materials and Methods
Ethics Statement and Specimen Sampling.
The tissues assessed in this study were obtained from the Yale Pathology Archives under Yale University Human Investigation Committee Protocol no. 0304025173 to D.L.R., which allows retrieval of archived tissue that was consented or has been approved for use with waiver of consent. Formalin-fixed paraffin-embedded (FFPE) primary tumor tissues, metastases, and normal tissues were obtained from 40 subjects (Dataset S1). Clinical characteristics, tumor type, numbers of metastatic samples, and locations are summarized in Fig. 1.
DNA Extraction, Exome Sequencing, Variant Calling, and Variant Validation.
Genomic DNA was extracted from FFPE core biopsies using the BiOstic FFPE Tissue DNA isolation kit (MO BIO) following standard procedures that we have used successfully in the past to recover single nucleotide variants (26). Genomic DNA underwent targeted exome capture using the NimbleGen SeqCap EZ Human Exome Library v2.0 (Roche), followed by sequencing on the Illumina HiSeq platform to generate 74-base paired end reads, as in Choi et al. (27). To determine variant sites, sequences from all tumor tissues were aligned to the human reference genome (hg19) using Eland (Illumina), and single nucleotide variants were called using SAMtools (28). Tumor purity was estimated using the mean of minor allele frequencies of SNPs within regions of loss of heterozygosity (LOH). Somatic mutations were identified by comparing read counts from tumor samples (primary and/or metastasis) with those from corresponding normal tissues, as described in Choi et al. (27).
To accurately characterize the nucleotide state at the variant sites across tumor tissues within a patient, somatic mutations were further analyzed to account for tumor impurity, loss of heterozygosity, and other factors that confound variant calling (Fig. S4). To eliminate false positive calls present in common SNP databases, variants were removed if they were present in the National Heart, Lung, and Blood Institute Exome Sequencing Project (release ESP6500SI-V2), 1000 Genomes (release no. 84, March 2012), or the Yale Human Exome Database. To validate somatic mutations identified (Dataset S4), we performed direct Sanger sequencing of known cancer genes after targeted PCR (Dataset S3) for patients depicted in Figs. 1 and 4.
Fig. S4.
Base calling approach. At a given site, the read counts for the four nucleotides are labeled A, B, C, and D, in descending order of frequency. The lower bounds of the multinomial 95% confidence intervals for A, B, and C are denoted Amin, Bmin, and Cmin, respectively (the least frequent nucleotide read count, D, is ignored). Where these values equal zero (i.e., the confidence interval overlaps zero), we treat the signal for the corresponding nucleotide as noise. The final call, as reference, alternative, or missing, is deduced from the remaining signals.
Phylogenetic Analyses and Ancestral State Reconstruction.
Somatic variant sites across each group of corresponding tumor and normal samples were concatenated into multiple sequence alignments and analyzed by phylogenetic and evolutionary methods (Dataset S4). Trees that maximized parsimony were conclusively identified with an exhaustive tree bisection–reconnection search by PAUP 4.0 (9). Maximum likelihood trees were estimated using GARLI v2.0 (10), which applies a modified genetic algorithm to infer the best topology, branch lengths, and substitution model parameters simultaneously, assessed by the convergence of log likelihood (lnL) scores in two runs. Bayesian inference of tree topology was performed using MrBayes (11).
To better understand tumor progression, we inferred sequences for the ancestor of all tumors and all other ancestral/internal nodes in the phylogeny. Ancestral state reconstruction was performed using the package baseml from PAML 4.7 (29), implemented with the tree topology from the maximum likelihood tree based on the alignment using the K80 model. The inferred ancestral state for each internal node was compared with the sequence state of the adjacent node to infer mutations that occurred on each branch of the phylogeny (Fig. 1B). All of the mutated genes were then mapped on each branch of the phylogeny for each of the 40 subjects for further temporal analyses.
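The branch-wise comparison described above can be illustrated with a minimal Python sketch. It assumes that ancestral sequences have already been reconstructed (e.g., by baseml) and that the tree is available as a child-to-parent map. The node names, the toy sequences, and the function `mutations_per_branch` are hypothetical and are not part of the original pipeline.

```python
def mutations_per_branch(parent_of, seqs):
    """Return, for each (parent, child) branch, the sites whose state differs.

    parent_of: dict mapping each non-root node to its parent node.
    seqs: dict mapping every node (sampled or inferred) to its sequence
          over the concatenated somatic variant sites.
    """
    changes = {}
    for child, parent in parent_of.items():
        diffs = []
        for site, (anc, der) in enumerate(zip(seqs[parent], seqs[child])):
            if anc != der:
                # (site index, ancestral state, derived state)
                diffs.append((site, anc, der))
        changes[(parent, child)] = diffs
    return changes


# Hypothetical toy data: normal tissue as root, one inferred tumor ancestor.
parent_of = {
    "anc_all_tumors": "normal",
    "primary": "anc_all_tumors",
    "met_1": "anc_all_tumors",
}
seqs = {
    "normal": "ACGT",
    "anc_all_tumors": "ACTT",
    "primary": "ACTT",
    "met_1": "GCTT",
}
branch_changes = mutations_per_branch(parent_of, seqs)
```

In this toy case the G-to-T change at site 2 maps to the earliest branch (normal to the ancestor of all tumors), whereas the A-to-G change at site 0 maps to the branch leading to the metastasis.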
Inference of Cancer Chronograms.
Because mutations in self-renewing tissues are cumulative over time and directly correlated with age (30), we used a molecular clock phylogenetic approach to infer the order of genetic divergence of tumors, calibrating it with the timings of diagnosis, biopsy, and/or surgical resection of the primary tumor. We inferred chronograms for all cancer phylogenies (Fig. S3), under a relaxed uncorrelated-clock model specifying uniform branching priors using mcmctree implemented in PAML 4.7 (29), which accommodates rate heterogeneity, including increases in the rate of mutations (31) among tumor tissues over time and across the phylogeny. We specified a gamma prior distribution for the rate of somatic cell substitutions at 8.4 × 10⁻⁸ site⁻¹·y⁻¹ (32), scaled to the observed potential doubling time (Tpot) and SD for each specific tumor (12).
Temporal Inference of Genetic Divergence of Tumors and Diagnosis Time.
To characterize the distribution of the times of the first genetic divergences of tumor lineages, we mapped them to the 0–1 timescale inferred for each chronogram, where zero represented the genetic origin of tumorigenesis from normal tissue and one represented death of the patient (Fig. 1C). For each subject, we then drew 1,000 values from the posterior distribution for the genetic origin of tumorigenesis from the normal tissue, and 1,000 values from the posterior distribution for the time of first diagnosis. We calculated the mean as a central estimate for each of these distributions for each subject and also calculated the maximum likelihood beta distribution for each of these sets of 40 time points using Mathematica 9. For comparison, we plotted the beta distributions superimposed on our relative timescale (Fig. 3). To evaluate whether inferred beta distributions were statistically significantly different from uniform, the inferred maximum likelihood model was compared with the uniform beta distribution (α = 1, β = 1) via a likelihood ratio test evaluated against the χ² distribution with two degrees of freedom. To establish the significance of difference between distributions of tumor ancestor times and diagnosis times, we compared the maximum likelihood null model of a single beta distribution for all inferred times to the maximum likelihood of two distinct beta distributions for tumor ancestor times and diagnosis times via a nested likelihood ratio test evaluated against the χ² distribution with two degrees of freedom.
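The two likelihood ratio tests described above can be sketched in Python as follows. This is an illustration only: the original analysis used Mathematica 9, whereas the sketch below substitutes a simple coordinate pattern search for the maximum likelihood fit. All function names are hypothetical. The sketch assumes every time point lies strictly between 0 and 1 and that the values are not all identical. It uses the fact that the upper tail of the χ² distribution with two degrees of freedom is exp(−x/2).

```python
from math import exp, lgamma, log


def beta_loglik(a, b, xs):
    """Log-likelihood of Beta(a, b) for observations xs in (0, 1)."""
    log_norm = lgamma(a) + lgamma(b) - lgamma(a + b)
    return sum((a - 1) * log(x) + (b - 1) * log(1 - x) for x in xs) - len(xs) * log_norm


def fit_beta(xs):
    """Maximum likelihood Beta fit by pattern search on (log a, log b)."""
    la, lb = 0.0, 0.0  # start at the uniform distribution, a = b = 1
    best = beta_loglik(1.0, 1.0, xs)
    step = 1.0
    while step > 1e-6:
        improved = False
        for da, db in ((step, 0.0), (-step, 0.0), (0.0, step), (0.0, -step)):
            cand = beta_loglik(exp(la + da), exp(lb + db), xs)
            if cand > best:
                la, lb, best = la + da, lb + db, cand
                improved = True
        if not improved:
            step /= 2  # refine the search once no neighbor improves
    return exp(la), exp(lb), best


def chi2_sf_2df(stat):
    """Upper-tail probability of the chi-square distribution with 2 df."""
    return exp(-stat / 2)


def lrt_vs_uniform(xs):
    """Test a fitted Beta(a, b) against the uniform Beta(1, 1); 2 df."""
    a, b, ll = fit_beta(xs)
    stat = 2 * (ll - beta_loglik(1.0, 1.0, xs))
    return a, b, stat, chi2_sf_2df(stat)


def lrt_two_groups(xs, ys):
    """Nested test: one Beta for pooled data vs. one Beta per group; 2 df."""
    ll_null = fit_beta(xs + ys)[2]
    ll_alt = fit_beta(xs)[2] + fit_beta(ys)[2]
    stat = max(0.0, 2 * (ll_alt - ll_null))  # guard against numerical noise
    return stat, chi2_sf_2df(stat)
```

For example, `lrt_vs_uniform(times)` would be applied to the 40 per-subject mean times on the 0–1 scale, and `lrt_two_groups(ancestor_times, diagnosis_times)` to the two sets of times being compared.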
Temporal Inference of Gene Occurrence Along Tumor Progression.
We performed Fisher’s exact test to determine whether mutations in tumor driver genes occurred significantly early during tumor progression. To detect the temporal order of mutated alleles, we assembled the time intervals when genes were inferred to be mutated across all subjects on a 0–1 timescale (Fig. 1C). Each mutation was associated with a branch in the cancer phylogeny by our ancestral state inference. We integrated over uncertainty in the timing of occurrence of the mutation by sampling uniformly from the associated branch on 1,000 of our 0–1-scaled chronograms in our Bayesian posterior and averaging the resulting sample set. Using Mathematica 9, we fitted these averages (Fig. 4) to the maximum likelihood beta distribution for when the mutations commonly arise on the 0–1 timescale.
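The integration over timing uncertainty can be sketched as follows. The sketch assumes that each sampled chronogram supplies the start and end of the branch carrying the mutation, together with the times of tumor origin and of death, all in the same arbitrary unit. This input layout and the function name `mean_scaled_mutation_time` are hypothetical.

```python
import random


def mean_scaled_mutation_time(branches, seed=0):
    """Average one uniform draw per posterior chronogram on the 0-1 scale.

    branches: list of (branch_start, branch_end, origin, death) tuples,
              one per sampled chronogram, with origin < death.
    """
    rng = random.Random(seed)
    draws = []
    for start, end, origin, death in branches:
        span = death - origin
        lo = (start - origin) / span  # branch start on the 0-1 scale
        hi = (end - origin) / span    # branch end on the 0-1 scale
        draws.append(rng.uniform(lo, hi))
    return sum(draws) / len(draws)
```

With 1,000 posterior chronograms, `branches` would hold 1,000 tuples for a given mutation, and the returned average is the value that would then enter the beta distribution fit.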
SI Materials and Methods
Specimen Sampling.
The tissues assessed in this study were obtained by Yale Human Investigation Committee Protocol 0304025173 to Dr. David L. Rimm, which allows retrieval of tissue, from archives, that was consented or has been approved for use with waiver of consent. FFPE tissues were obtained from 40 subjects who underwent autopsy and surgical procedures at Yale University/New Haven Hospital from 1991 to 2012. Inclusion for analysis in this study was limited to specimens where tissue and sequence were available from primary tumor and at least two metastatic sites. Clinical characteristics, including tumor type, number of metastatic samples, and locations are summarized in Fig. 1 and Fig. S1. After slide histopathological review, regions of interest in the FFPE blocks were cored using a 1-mm-diameter hollow needle (JG18-0.5X; Jensen Global), followed by specimen removal with a narrower needle (JG22-1.5X; Jensen Global). Total combined core length was ∼3–6 mm (e.g., Fig. S1A). Precut and postcut hematoxylin and eosin stained sections were scanned as tissue morphology controls.
For our experimental design, we sampled only a single portion of the primary tumor and single portions of multiple metastases. This sampling is sufficient to test the hypotheses addressed here. If metastatic lineages branch off very early in a cancer history, sampling of additional primary tumor clones would add to the branching topology but would not change how early the branching would occur—because it would not influence the divergence from other observed samples. Metastatic lineages would then branch off the single clone from the primary tumor at exactly the same depth of time that they do in our tumor phylogenetics. Furthermore, in cases in which the primary tumor clone were to branch internal to the metastases, placement of additional primary tumor clones would not “reverse” an observed, strongly supported nonmonophyly: the first clone abrogating the monophyly would always remain internal to the clade of metastases. Additional primary tumor clones could branch off internal to an erstwhile nonmonophyletic set of metastases; this observation would provide even stronger support for the frequent branched and nonlinear evolution demonstrated in our manuscript. Lastly, with regard to chronological inferences, sampling of additional primary tumor clones would increase the power to more precisely infer timing of divergences and increase the power to more precisely infer timing of mutation occurrence. However, our design of sampling more metastases should yield the greatest power by sampling the most divergent tissues: additional sites of metastasis rather than additional primary tumor clones.
Variant Calling, Variant Validation, and Mutation Burden Analyses.
For all variant calls, to classify a somatic mutation as “reference,” “alternative,” or “missing” at a given variant site, we considered the number of reads supporting each nucleotide. By calculating multinomial confidence intervals on the read counts of the nucleotides, we identified the number of signals for distinct nucleotides present at each site and made the final call based on the content of these signals. For positions with a single signal (i.e., where only one nucleotide exhibited confidence intervals that did not overlap zero), we called the site homozygous. If the nucleotide was the same as the reference allele, the site was marked “reference”; otherwise, it was marked “alternative.” For positions with two nucleotides with read proportions above zero and confidence intervals of zero or less for the remaining nucleotides, the position was called heterozygous and both alleles were compared with the normal sample. If the reference allele was not among these nucleotides, the site was marked as “missing”; otherwise, it was called “alternative.” For those sites with confidence intervals above zero for more than two nucleotides, reads from each sample from the originating patient were compared to assess the likelihood of alignment error. Positions with more than two nucleotide signals in more than one sample per patient were considered prone to misalignment, and the third and any further nucleotide signals were ignored (Fig. S4). Sites with fewer than 14 reads in total were considered “low coverage sites” that required additional processing. Where the multinomial algorithm produced an alternative call for a low coverage site, this call was preserved if the same alternative genotype was observed in a non-low coverage sample from the same patient. In all other cases, low coverage sites were marked as missing.
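The per-site decision rule can be sketched in Python as follows. The source does not state which multinomial interval was used. The sketch therefore assumes a Bonferroni-corrected normal-approximation (Wald) interval, whose lower bound can reach zero or fall below it. The function names and return labels are hypothetical. The cross-sample steps (misalignment assessment and rescue of low coverage calls) are only flagged, not implemented.

```python
from math import sqrt
from statistics import NormalDist


def lower_bounds(counts, alpha=0.05):
    """Lower bounds of simultaneous Wald intervals (an assumed method)."""
    n = sum(counts.values())
    k = len(counts)
    z = NormalDist().inv_cdf(1 - alpha / (2 * k))  # Bonferroni over k bases
    bounds = {}
    for base, c in counts.items():
        p = c / n
        bounds[base] = p - z * sqrt(p * (1 - p) / n)
    return bounds


def call_site(counts, ref, min_reads=14):
    """Classify one site in one sample from its nucleotide read counts."""
    if sum(counts.values()) < min_reads:
        return "low_coverage"  # needs the cross-sample rule described above
    signals = [b for b, lb in lower_bounds(counts).items() if lb > 0]
    if len(signals) == 1:  # homozygous site
        return "reference" if signals[0] == ref else "alternative"
    if len(signals) == 2:  # heterozygous site
        return "alternative" if ref in signals else "missing"
    if not signals:
        return "missing"
    return "check_misalignment"  # more than two signals
```

For instance, 39 reads of A and a single read of G at a site with reference allele A yield one signal under this rule, so the lone G read is treated as noise and the site is called “reference.”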
To resolve as many missing calls as possible, we performed visual inspection of the alignments at sites marked missing using an in-house R script. These sites were manually reclassified based on whether the reads that produced the alternative call were the result of mismapping (marking them as alternative or missing accordingly). At any site where mismapping was determined to have produced the alternative calls, the position was also called manually for all other samples from the same subject. In addition, positions called as reference in a tumor sample were recalled as missing where the site lay within an LOH region and ancestral reconstruction indicated a reversal mutation on the branch of the phylogenetic tree leading to that tumor sample. Final somatic variant calls after eliminating false positive calls in common SNP databases are summarized in Dataset S4.
Formalin fixation adversely affects the quality of nucleic acids and causes degradation of material (33). We have successfully recovered single nucleotide variants from FFPE in this study and previous studies (26). Consistent with previous studies, variant calls tested by Sanger sequencing in this study were 100% validated (Dataset S3). Nevertheless, accurate detection of genome-wide copy number variation (CNV) from our FFPE samples, beyond detection of loss of heterozygosity, was not feasible. Although novel techniques are being developed to improve CNV calling in formalin-fixed tissues (34), high-throughput sequencing of FFPE samples demonstrated large variation in read depths that decreased both the sensitivity and specificity of current CNV prediction algorithms. Therefore, we excluded CNV analysis in our study to decrease potential noise in our phylogenetic analysis, which is heavily reliant on the accuracy of input data (35). The depth of coverage of our exome sequencing was not sufficient to examine the intratumor heterogeneity. However, we were able to recover the dominant clones yielding robust phylogenetic results.
Phylogenetic Analyses and Ancestral State Reconstruction.
For maximum likelihood and Bayesian inference, we specified the generalized time reversible substitution model with equal base frequencies (base frequencies are approximately equal in the human exome), gamma-distributed rate variation, and no invariant sites. For GARLI and PAUP, branch support within the optimal tree was assessed by bootstrap proportion, and the consensus tree and bootstrap support values were summarized using sumtrees.py (36). Maximum likelihood inference was run for 5 million generations with 100 bootstrap replicates. PAUP was run using an exhaustive search and 1,000 bootstrap replicates.
Bayesian inference of the tumor phylogeny was performed with two runs of 2 million generations, and four Markov chains for each run. The large number of generations guaranteed that the split probability of the two runs was below 0.05. Two independent runs converged as indicated by a potential scale reduction factor close to 1.0. Inferred parameters of branches and nodes, as well as the Bayesian consensus tree based on 50% majority rule, were summarized by evaluating the posterior distribution, estimated by sampling the Markov chain after discarding the first 50% of generations to ensure stationarity of the chain. Branch support was assessed by proportion of trees featuring the branch in the Bayesian posterior tree set.
To represent every evolutionary tree in Fig. S2, including those cases where the cancer evolutionary tree was ambiguous, we depicted the tree inferred by maximum likelihood. To integrate over the ambiguity of these trees in our Bayesian analyses, we tallied the probabilities of primary tumor as outgroup or ingroup within the cancer phylogenies based on the posterior support for Bayesian trees. For each subject, we calculated the ratio of inferred cancer evolutionary trees with the primary tumor as the outgroup to all metastases. To perform this calculation, we tallied the proportion of 2,000 trees from the posterior sets of two independent runs for each subject, each independent run with 1,000 trees after discarding the first 50% of generations. Then we summed the proportions inferred for each subject to get the probabilistic number of subjects with the primary tumor as the outgroup for the 24 subjects with clinically unambiguously identified primary tumors. We performed this inference again, including the additional eight subjects whose primary tumors were clinically identified with moderate confidence. To get the 95% confidence interval, we resampled trees for each subject and summed the resulting set to yield a proportion of trees with the primary tumor as the outgroup. We ranked 10,000 independent resamplings and retrieved the values ranked 25th and 975th as the lower and upper 95% confidence interval for the ratio of the primary tumor as the outgroup. For the 32 subjects with primary tumors, we calculated the P value for the claim that at least one primary was an ingroup by calculating the equivalent probability that all primaries were outgroups (i.e., multiplying the Bayesian posterior probabilities of the primary as the outgroup for each subject).
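The tallies described above can be sketched in Python as follows. The sketch takes one posterior proportion per subject (the fraction of posterior trees with the primary tumor as the outgroup). It assumes that resampling trees amounts to redrawing each subject's posterior tree set with replacement, which is equivalent to a binomial draw. That reading, the function names, and the percentile indexing are assumptions and may differ from the procedure actually used.

```python
import random
from math import prod


def expected_outgroup_count(post_probs):
    """Probabilistic number of subjects with the primary as the outgroup."""
    return sum(post_probs)


def p_all_outgroups(post_probs):
    """Probability that the primary is the outgroup in every subject."""
    return prod(post_probs)


def resampled_interval(post_probs, n_trees=2000, n_resamples=1000,
                       level=0.95, seed=0):
    """Percentile interval for the summed proportions (assumed scheme)."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_resamples):
        total = 0.0
        for p in post_probs:
            # Redraw n_trees trees; each has the primary as outgroup w.p. p.
            hits = sum(rng.random() < p for _ in range(n_trees))
            total += hits / n_trees
        totals.append(total)
    totals.sort()
    lo = totals[int((1 - level) / 2 * n_resamples)]
    hi = totals[int((1 + level) / 2 * n_resamples) - 1]
    return lo, hi
```

Here `p_all_outgroups` corresponds to the P value for the claim that at least one primary tumor was an ingroup, because it is the probability of the complementary event under independence across subjects.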
Inference of Cancer Chronograms.
Patient resection, biopsy, and/or diagnosis times were used to calibrate the inference of time (in months) in the chronograms (Datasets S5 and S6). For 19 subjects, the maximum likelihood tree topologies were used with normal tissue enforced as root. Because mcmctree requires trees with only binary divergences, for 18 subjects where the maximum likelihood trees contained polytomies, we used the maximum parsimony tree topologies. For subjects 435 and 432, where both maximum likelihood and maximum parsimony trees exhibited polytomies, we used the binary Bayesian tree. For subject 456, where maximum likelihood, maximum parsimony, and Bayesian trees exhibited polytomies, all different binary topologies were evaluated by the Shimodaira–Hasegawa test. Those with the highest likelihood were used (29).
To infer chronograms with mcmctree implemented in PAML, we specified uniform priors (λ = 2, µ = 2, and ρ = 0.1) for the lineage birth rate λ, lineage death rate µ, and sampling fraction ρ (37). All topologies were estimated by four different Markov chains conducted for at least 200,000 iterations, sampling every eight iterations, until convergence was achieved, surmised by effective sample size (ESS) estimates of >200 in Tracer 1.6 (38). For four chronograms that did not achieve convergence with the uniform priors (432, 438, 440, and 458), other priors were attempted. For these four chronograms, convergence was achieved with priors skewed toward early branching (λ = 10, µ = 5, and ρ = 0.001) but could not be achieved with the uniform prior skewed toward long internal and short external branches (37). We were unable to achieve convergence to a unimodal posterior for four chronograms; without convergence, the estimated tumor ages were incompatible with patients’ ages. These four were subject 410, with ovarian carcinoma, for which the extant data on cell division time Tpot (39) likely provided too broad a distribution to constrain the timing—ovarian carcinoma is also subject to hormonal influence (gonadotropins) that may vary with birth control pills and menopause (40); subject 443, with a bladder carcinoma, which also has a broadly distributed Tpot (41), as well as variable mutation rates arising from environmental carcinogens (6); subject 439, with a rhabdomyosarcoma, for which no informative Tpot was found, likely because the tumor occurred in a very rare skeletal muscle; and subject 441, with a kidney clear cell renal carcinoma, for which extreme heterogeneity of mutation rates might have been problematic because metastases can replicate 2.5–10 times faster than the primary renal source (42) whereas the primary kidney tumor usually exhibits a constant growth rate (43).
In 27 of the subjects, all tissues sequenced were removed simultaneously at autopsy, providing no sequenced tissue sample at a deeper time to associate with sequence data to establish a mapping of units of molecular evolutionary distance to units of chronological time. These chronograms were inferred without time unit calibration, so their timescale should not be interpreted in actual months or years. In a few cases, the primary tumor was resected at an earlier date than the autopsy that provided metastatic tissue samples, providing a single calibration point for associating molecular evolutionary distances with time. These chronograms are presented in chronological time (Fig. 3 and Fig. S3), although the precision of the temporal inferences arising from this procedure is low because of molecular evolutionary stochasticity, the potential for dynamic changes in the mutation and substitution rates in evolving cancer lineages, the minimal number of calibration points, and the fact that the single calibration times are generally recent (patients were diagnosed on average 13.3 mo before death) relative to the entire time course (mean patient age of 61 y). The inference of the relative timing of serial mutations is less sensitive to these issues; thus, the chronograms present a way to compare relative timing on a scale from primary tumorigenesis to death. To convey the inherent uncertainties of chronograms, we included blue violin plots that indicate the 95% central interquartile distribution of branching time estimates for each node (Fig. S3).
In some cases, all tissues from a subject were removed simultaneously at autopsy, providing no calibration points to associate units of molecular evolutionary distance with units of chronological time. In these cases, chronograms were inferred with the unit of chronological time associated with the branch lengths treated as unknown. For each patient, the chronograms were resolved in two steps, constrained so that the normal somatic tissue was the fixed root of the phylogeny. In step one, mcmctree was run on only the tumor sequences to infer the most recent common ancestor (MRCA) of cancerous tissues, without including a sample whose sequence is identical to the common ancestor (the somatic tissue). In step two, the mean estimate of the first genetic divergence of the tumors was used as the calibration point for their most recent separation time from the normal somatic sequence in a second run of mcmctree with all samples, which generated the final chronograms (Fig. S3).
By using tumor type-specific mutation rates, we took into consideration changes in genes such as tumor suppressors, whose absence may allow mutations to accumulate at a faster pace (44). Nevertheless, mutations in mutator genes such as OGG1 and MSH6 can impair DNA repair (45) and actively induce a higher mutation rate than in ordinary tumor cells (24). For patients 408, 430, and 440, we found nonsilent somatic mutations in MSH6, a mismatch repair gene previously associated with cancer (46). For patient 407, we found nonsilent somatic mutations in OGG1 (8-oxoguanine DNA glycosylase), an enzyme that repairs guanine lesions in DNA (47). Because we did not have a way to readily incorporate the increase in mutation at the time of mutations in MSH6 and OGG1 into our chronological inference, we did not estimate the chronological time of tumorigenesis for these patients. Even though we observed heterogeneous mutation rates among primary and metastatic lineages, we did not see a fixed pattern regarding rate heterogeneity within phylogenetic trees for any particular tumor type. To further check whether there is any general rule explaining the rate heterogeneity of branches within tumor phylogenetic trees, we performed a quantitative analysis to check whether there is a correlation between the number of DNA repair genes on each branch and the mutation rate for the corresponding branch. We tallied somatic nonsilent mutations occurring within 178 human DNA repair genes (48). There were 194 mutations occurring in 29 out of the total 40 subjects. However, there was no significant correlation between the number of DNA repair genes mutated (x) and the mutation rate within the branch (y; y = 0.0024x + 2.6235, R² = 4 × 10⁻⁶). Thus, we did not find any general rule explaining the rate heterogeneity of branches within tumor phylogenetic trees.
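The regression summary reported above (slope, intercept, and R²) corresponds to an ordinary least-squares fit, which can be sketched in Python as follows. The sketch assumes one (x, y) pair per branch, with x the number of mutated DNA repair genes and y the branch mutation rate. The function name `ols_r2` is hypothetical, and the sketch does not reproduce the reported values.

```python
def ols_r2(xs, ys):
    """Least-squares slope, intercept, and R-squared for paired values."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    r2 = (sxy * sxy) / (sxx * syy)  # squared Pearson correlation
    return slope, intercept, r2
```

An R² near zero, as reported here, indicates that the number of mutated DNA repair genes explains essentially none of the variance in branch mutation rate.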
For the four breast carcinoma patients, the Bayesian inference for the chronograms did not converge, which indicates that mutations and time estimates were inconsistent with the molecular evolutionary models used to infer chronograms. From the tumor types analyzed here, breast tumors have the largest variance in potential doubling times (49). Characteristics particular to breast carcinoma could influence tumor growth and lead to heterogeneous mutation rates, such as hormonal influence and clustered hypermutation propensities (50).
Temporal Inference of Gene Occurrence Along Tumor Progression.
For Fisher’s exact tests, we assembled tumor type-specific driver gene lists for the tumor types featuring well-established drivers, totaling 159 genes (Dataset S9). We counted the number of mutations for tumor type-specific driver genes for each subject in the earliest branch, leading from normal tissue to the ancestor/internode of all tumors, and later branches (all subsequent branches other than the branch leading from normal to the ancestor of all tumors). Similarly, we counted the number of mutations of nondriver genes (genes other than the tumor type-specific drivers for each subject with a particular type of tumors) in the early and late branches. Finally, we applied Fisher’s exact test to determine whether there was any association among the four categories of mutation counts of tumor drivers and nondrivers in early branches (i.e., the root of the phylogeny, from normal tissue to the first metastatic divergence) and late branches (i.e., branches between inferred ancestral tumor states and other inferred ancestral tumor states, and branches between inferred ancestral tumor states and sampled tumors). Furthermore, we used Fisher's exact test to assess whether the ratio of nonsilent versus silent mutations (dN/dS) was significantly higher in cancer driver genes than nondriver genes in the early branches, and, independently, in the late branches.
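The 2 × 2 comparison of drivers versus nondrivers in early versus late branches can be sketched in Python as follows. This is a minimal two-sided Fisher's exact test built on the hypergeometric distribution, using the convention of summing all tables that are no more probable than the observed one. The source does not state the software or sidedness used, so both are assumptions, and the function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P value for the table [[a, b], [c, d]].

    Suggested layout (hypothetical): a = driver mutations on early branches,
    b = driver mutations on late branches, c = nondriver mutations on early
    branches, d = nondriver mutations on late branches.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k in the top-left cell.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_obs = prob(a)
    k_min = max(0, col1 - row2)
    k_max = min(row1, col1)
    total = sum(prob(k) for k in range(k_min, k_max + 1)
                if prob(k) <= p_obs * (1 + 1e-9))  # tolerance for ties
    return min(1.0, total)
```

The same function applies to the second comparison described above, with nonsilent and silent mutation counts in driver and nondriver genes forming the four cells, evaluated separately for the early and the late branches.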
To identify the driver gene list across all tumors examined, we compiled lists of well-established tumor type-specific driver genes that could be applied to 21 out of 40 subjects (Dataset S9; ten subjects did not have a well-established tumor type-specific driver gene list, and nine of the remaining 30 subjects did not reveal any mutations in their well-established driver gene list). Sequences from 21 of 30 subjects exhibited mutations in 17 established tumor type-specific driver genes. These genes were AMER1 (FAM123B), APC, ARID5B, CDKN2A, EGFR, EP300, FBXW7, KMT2D (MLL2), KRAS, NOTCH1, PIK3CA, RSBN1L, SELP, SMAD2, SMAD4, TP53, and UGT2B10 (Dataset S10). Two out of 11 genes featured specific sites that were recurrently mutated across subjects and were known to be cancer drivers in other cancer types (CTNNB1 and ALK) (Datasets S8 and S13). Well-established tumor type-specific drivers (Dataset S10) and well-established cancer drivers irrespective of cancer type (Dataset S8) were both used to compare timing of occurrence of drivers vs. nondrivers in early and late branches (Dataset S11), as well as to compare the ratio of nonsilent and silent mutations in drivers and nondrivers in early branches, and, independently, in the late branches (Dataset S12).
Tumor genes can be subject to strong positive selection (51) that can violate constant rates of substitution assumed by molecular clocks in evolutionary methods. Together with heterogeneities in cell proliferation and mutation rates and chemotherapy, these factors could distort our time estimates. However, most somatic mutations arising in tumor tissues are believed to be neutral or close to neutral in selective effect (52). Nevertheless, to be careful, we specified hypotheses that were dependent on the relative timing among mutations (and their temporal order), rather than on the accuracy of the scaling of inferred chronology of events by calibrations such as the time of patient diagnosis.
To detect the temporal order of mutated alleles, we compared the distribution of inferred mutation times for six well-established driver genes (Datasets S9 and S10) that experienced nonsilent mutations in at least four of the subjects (Dataset S14): KRAS, TP53, KMT2D, PIK3CA, ALK, and KMT2C. Each mutation was associated with a branch in the cancer phylogeny by our ancestral state inference. We integrated over uncertainty to obtain the maximum likelihood beta distribution for when the mutations commonly arise on the 0–1 timescale. To evaluate whether inferred beta distributions of occurrence time were statistically significantly different between TP53/KRAS, PIK3CA/KMT2D, and ALK/KMT2C, the maximum likelihood null model of a single beta distribution for combined nonsilent mutations of pairs of TP53 and KRAS with PIK3CA and KMT2D was compared with the maximum likelihood of distinct beta distributions for all nonsilent mutations in the individual genes. Similarly, we combined nonsilent mutations of pairs of PIK3CA and KMT2D with ALK and KMT2C, compared with the maximum likelihood of distinct beta distributions for all nonsilent mutations in the individual genes. Statistical significance was assessed via a nested likelihood ratio test evaluated against the χ² distribution with two degrees of freedom.
Acknowledgments
We thank Douglas Brash, Gilbert Omenn, Andrew Clark, Tandy Warnow, Daniel Hartl, Allen Rodrigo, and Angelika Hofmann for providing feedback on draft versions of the manuscript. We also thank the Yale High Performance Computing Center for providing computational resources and Robert Bjornson for assistance regarding disposition of clusters and servers to the project. We thank Irina Tikhonova and Christopher Castaldi in the Yale Center for Genome Analysis for sample preparation and sequencing. This project was supported by Gilead Sciences, Inc. A.I. was supported by Fundação de Amparo à Pesquisa do Estado de São Paulo scholarships 12/04818-5 and 13/15144-8.
Footnotes
The authors declare no conflict of interest.
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1525677113/-/DCSupplemental.
References
- 1. Gerlinger M, et al. Genomic architecture and evolution of clear cell renal cell carcinomas defined by multiregion sequencing. Nat Genet. 2014;46(3):225–233. doi: 10.1038/ng.2891.
- 2. Weinberg RA. Tumor suppressor genes. Science. 1991;254(5035):1138–1146. doi: 10.1126/science.1659741.
- 3. Harbst K, et al. Molecular and genetic diversity in the metastatic process of melanoma. J Pathol. 2014;233(1):39–50. doi: 10.1002/path.4318.
- 4. Nguyen DX, Bos PD, Massagué J. Metastasis: From dissemination to organ-specific colonization. Nat Rev Cancer. 2009;9(4):274–284. doi: 10.1038/nrc2622.
- 5. Lawrence MS, et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature. 2013;499(7457):214–218. doi: 10.1038/nature12213.
- 6. Lawrence MS, et al. Discovery and saturation analysis of cancer genes across 21 tumour types. Nature. 2014;505(7484):495–501. doi: 10.1038/nature12912.
- 7. Naxerova K, Jain RK. Using tumour phylogenetics to identify the roots of metastasis in humans. Nat Rev Clin Oncol. 2015;12(5):258–272. doi: 10.1038/nrclinonc.2014.238.
- 8. Miller CA, et al. SciClone: Inferring clonal architecture and tracking the spatial and temporal patterns of tumor evolution. PLOS Comput Biol. 2014;10(8):e1003665. doi: 10.1371/journal.pcbi.1003665.
- 9. Swofford DL. 2003. PAUP*, Phylogenetic Analysis Using Parsimony (* and Other Methods), Version 4.
- 10. Zwickl DJ. 2006. Genetic algorithm approaches for the phylogenetic analysis of large biological sequence datasets under the maximum likelihood criterion. PhD thesis (The University of Texas, Austin).
- 11. Ronquist F, et al. MrBayes 3.2: Efficient Bayesian phylogenetic inference and model choice across a large model space. Syst Biol. 2012;61(3):539–542. doi: 10.1093/sysbio/sys029.
- 12. Rew DA, Wilson GD. Cell production rates in human tissues and tumours and their significance. Part II: Clinical data. Eur J Surg Oncol. 2000;26(4):405–417. doi: 10.1053/ejso.1999.0907.
- 13. Hynes RO. Metastatic potential: Generic predisposition of the primary tumor or rare, metastatic variants-or both? Cell. 2003;113(7):821–823. doi: 10.1016/s0092-8674(03)00468-9.
- 14. Burrell RA, McGranahan N, Bartek J, Swanton C. The causes and consequences of genetic heterogeneity in cancer evolution. Nature. 2013;501(7467):338–345. doi: 10.1038/nature12625.
- 15. Hong WS, Shpak M, Townsend JP. Inferring the origin of metastases from cancer phylogenies. Cancer Res. 2015;75(19):4021–4025. doi: 10.1158/0008-5472.CAN-15-1889.
- 16. Klein CA. Parallel progression of primary tumours and metastases. Nat Rev Cancer. 2009;9(4):302–312. doi: 10.1038/nrc2627.
- 17. Baum M, Demicheli R, Hrushesky W, Retsky M. Does surgery unfavourably perturb the “natural history” of early breast cancer by accelerating the appearance of distant metastases? Eur J Cancer. 2005;41(4):508–515. doi: 10.1016/j.ejca.2004.09.031.
- 18. Gray JW. Evidence emerges for early metastasis and parallel evolution of primary and metastatic tumors. Cancer Cell. 2003;4(1):4–6. doi: 10.1016/s1535-6108(03)00167-3.
- 19. Vogelstein B, et al. Cancer genome landscapes. Science. 2013;339(6127):1546–1558. doi: 10.1126/science.1235122.
- 20. Wang S, et al. SAR405838: An optimized inhibitor of MDM2-p53 interaction that induces complete and durable tumor regression. Cancer Res. 2014;74(20):5855–5865. doi: 10.1158/0008-5472.CAN-14-0799.
- 21. Ostrem JM, Peters U, Sos ML, Wells JA, Shokat KM. K-Ras(G12C) inhibitors allosterically control GTP affinity and effector interactions. Nature. 2013;503(7477):548–551. doi: 10.1038/nature12796.
- 22. Ahearn IM, Haigis K, Bar-Sagi D, Philips MR. Regulating the regulator: Post-translational modification of RAS. Nat Rev Mol Cell Biol. 2012;13(1):39–51. doi: 10.1038/nrm3255.
- 23. Downward J. Targeting RAS signalling pathways in cancer therapy. Nat Rev Cancer. 2003;3(1):11–22. doi: 10.1038/nrc969.
- 24. Harfe BD, Jinks-Robertson S. DNA mismatch repair and genetic instability. Annu Rev Genet. 2000;34:359–399. doi: 10.1146/annurev.genet.34.1.359.
- 25. Vasan N, et al. A targeted next-generation sequencing assay detects a high frequency of therapeutically targetable alterations in primary and metastatic breast cancers: Implications for clinical practice. Oncologist. 2014;19(5):453–458. doi: 10.1634/theoncologist.2013-0377.
- 26. Goh G, et al. Recurrent activating mutation in PRKACA in cortisol-producing adrenal tumors. Nat Genet. 2014;46(6):613–617. doi: 10.1038/ng.2956.
- 27. Choi M, et al. K+ channel mutations in adrenal aldosterone-producing adenomas and hereditary hypertension. Science. 2011;331(6018):768–772. doi: 10.1126/science.1198785.
- 28. Li H, et al.; 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25(16):2078–2079. doi: 10.1093/bioinformatics/btp352.
- 29.Yang Z. PAML 4: Phylogenetic analysis by maximum likelihood. Mol Biol Evol. 2007;24(8):1586–1591. doi: 10.1093/molbev/msm088. [DOI] [PubMed] [Google Scholar]
- 30.Tomasetti C, Vogelstein B, Parmigiani G. Half or more of the somatic mutations in cancers of self-renewing tissues originate prior to tumor initiation. Proc Natl Acad Sci USA. 2013;110(6):1999–2004. doi: 10.1073/pnas.1221068110. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Palles C, et al. CORGI Consortium WGS500 Consortium Germline mutations affecting the proofreading domains of POLE and POLD1 predispose to colorectal adenomas and carcinomas. Nat Genet. 2013;45(2):136–144. doi: 10.1038/ng.2503. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Lynch M. Rate, molecular spectrum, and consequences of human mutation. Proc Natl Acad Sci USA. 2010;107(3):961–968. doi: 10.1073/pnas.0912629107. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Tokuda Y, et al. Fundamental study on the mechanism of DNA degradation in tissues fixed in formaldehyde. J Clin Pathol. 1990;43(9):748–751. doi: 10.1136/jcp.43.9.748. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Yu YP, Michalopoulos A, Ding Y, Tseng G, Luo J-H. High fidelity copy number analysis of formalin-fixed and paraffin-embedded tissues using Affymetrix Cytoscan HD chip. PLoS One. 2014;9(4):e92820. doi: 10.1371/journal.pone.0092820. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Townsend JP, Su Z, Tekle YI. Phylogenetic signal and noise: Predicting the power of a data set to resolve phylogeny. Syst Biol. 2012;61(5):835–849. doi: 10.1093/sysbio/sys036. [DOI] [PubMed] [Google Scholar]
- 36.Sukumaran J, Holder M. 2008. SumTrees, Summarization of Split Support on Phylogenetic Trees (part of the DendroPy Phylogenetic Computation Library), Version 2(3)
- 37.Yang Z. Computational Molecular Evolution. Oxford Univ Press; Oxford: 2006. [Google Scholar]
- 38.Rambaut A, Drummond A, Suchard M. 2013. Tracer, Version 1. 6.
- 39.Erba E, et al. Cell kinetics of human ovarian cancer with in vivo administration of bromodeoxyuridine. Ann Oncol. 1994;5(7):627–634. doi: 10.1093/oxfordjournals.annonc.a058935. [DOI] [PubMed] [Google Scholar]
- 40.Choi J-H, Wong AST, Huang H-F, Leung PCK. Gonadotropins and ovarian cancer. Endocr Rev. 2007;28(4):440–461. doi: 10.1210/er.2006-0036. [DOI] [PubMed] [Google Scholar]
- 41.Rew DA, Thomas DJ, Coptcoat M, Wilson GD. Measurement of in vivo urological tumour cell kinetics using multiparameter flow cytometry: Preliminary study. Br J Urol. 1991;68(1):44–48. doi: 10.1111/j.1464-410x.1991.tb15255.x. [DOI] [PubMed] [Google Scholar]
- 42.Oda T, et al. Growth rates of primary and metastatic lesions of renal cell carcinoma. Int J Urol. 2001;8(9):473–477. doi: 10.1046/j.1442-2042.2001.00353.x. [DOI] [PubMed] [Google Scholar]
- 43.Zhang J, Kang SK, Wang L, Touijer A, Hricak H. Distribution of renal tumor growth rates determined by using serial volumetric CT measurements. Radiology. 2009;250(1):137–144. doi: 10.1148/radiol.2501071712. [DOI] [PubMed] [Google Scholar]
- 44.Schofield MJ, Hsieh P. DNA mismatch repair: Molecular mechanisms and biological function. Annu Rev Microbiol. 2003;57:579–608. doi: 10.1146/annurev.micro.57.030502.090847. [DOI] [PubMed] [Google Scholar]
- 45.Kariola R, et al. MSH6 missense mutations are often associated with no or low cancer susceptibility. Br J Cancer. 2004;91(7):1287–1292. doi: 10.1038/sj.bjc.6602129. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Edelmann W, et al. Mutation in the mismatch repair gene Msh6 causes cancer susceptibility. Cell. 1997;91(4):467–477. doi: 10.1016/s0092-8674(00)80433-x. [DOI] [PubMed] [Google Scholar]
- 47.Rowland MM, Schonhoft JD, McKibbin PL, David SS, Stivers JT. Microscopic mechanism of DNA damage searching by hOGG1. Nucleic Acids Res. 2014;42(14):9295–9303. doi: 10.1093/nar/gku621. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Wood RD, Mitchell M, Lindahl T. Human DNA repair genes, 2005. Mutat Res. 2005;577(1-2):275–283. doi: 10.1016/j.mrfmmm.2005.03.007. [DOI] [PubMed] [Google Scholar]
- 49.Stanton PD, Cooke TG, Forster G, Smith D, Going JJ. Cell kinetics in vivo of human breast cancer. Br J Surg. 1996;83(1):98–102. doi: 10.1002/bjs.1800830130. [DOI] [PubMed] [Google Scholar]
- 50.Nik-Zainal S, et al. Breast Cancer Working Group of the International Cancer Genome Consortium Mutational processes molding the genomes of 21 breast cancers. Cell. 2012;149(5):979–993. doi: 10.1016/j.cell.2012.04.024. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Ostrow SL, Barshir R, DeGregori J, Yeger-Lotem E, Hershberg R. Cancer evolution is associated with pervasive positive selection on globally expressed genes. PLoS Genet. 2014;10(3):e1004239. doi: 10.1371/journal.pgen.1004239. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Sottoriva A, et al. A Big Bang model of human colorectal tumor growth. Nat Genet. 2015;47(3):209–216. doi: 10.1038/ng.3214. [DOI] [PMC free article] [PubMed] [Google Scholar]