Skip to main content
PLOS ONE logoLink to PLOS ONE
. 2021 Aug 23;16(8):e0243595. doi: 10.1371/journal.pone.0243595

Extreme value theory as a framework for understanding mutation frequency distribution in cancer genomes

Natsuki Tokutomi 1,*, Kenta Nakai 1,2, Sumio Sugano 3,4
Editor: Zechen Chong5
PMCID: PMC8382180  PMID: 34424899

Abstract

Currently, the population dynamics of preclonal cancer cells before clonal expansion of tumors has not been sufficiently addressed thus far. By focusing on preclonal cancer cell population as a Darwinian evolutionary system, we formulated and analyzed the observed mutation frequency among tumors (MFaT) as a proxy for the hypothesized sequence read frequency and beneficial fitness effect of a cancer driver mutation. Analogous to intestinal crypts, we assumed that sample donor patients are separate culture tanks where proliferating cells follow certain population dynamics described by extreme value theory (EVT). To validate this, we analyzed three large-scale cancer genome datasets, each harboring > 10000 tumor samples and in total involving > 177898 observed mutation sites. We clarified the necessary premises for the application of EVT in the strong selection and weak mutation (SSWM) regime in relation to cancer genome sequences at scale. We also confirmed that the stochastic distribution of MFaT is likely of the Fréchet type, which challenges the well-known Gumbel hypothesis of beneficial fitness effects. Based on statistical data analysis, we demonstrated the potential of EVT as a population genetics framework to understand and explain the stochastic behavior of driver-mutation frequency in cancer genomes as well as its applicability in real cancer genome sequence data.

Introduction

The “Big Bang” model of cancer development and population genetics of cancer cells

To deconvolve complex biology of cancer, it is useful to trace the temporal order of population dynamics events of cancer cells as well as the underlying somatic genetic events [1]. In cancer, a long mutation and selection process precedes a rapid population increase that results in a clonal expansion which will be observed as formation of a tumor. This model is called the “Big Bang” model of cancer development (Fig 1A) and first proposed for the colorectal cancer growth [2].

Fig 1. Extreme value theory is applied to modeling the fitness landscape of preclonal cancer cells.

Fig 1

(A) Overview of cancer evolution. According to the Big Bang model of colorectal tumor growth, a tumor originates from a single clonal expansion. Clonal evolution proceeds after a Big Bang that produces a single clone and multiple subclones. The majority of the subclones are generated at the Big Bang, and only a fraction of them are generated later. These subclones do not experience stringent selection. The period before the Big Bang, on the other hand, is called preclonal evolution. We assume a small and constant population size on cancer precursor cells at this stage. We model population dynamics at this stage with the SSWM regime. The transition from the preclonal dynamics to the Big Bang is modeled with greedy adaptation. The triangles in the siphon indicate subclones in the tumor cell population. The rectangles and circles indicate normal and malignant cells. In the schematic, mutation A, for example, is at position i in gene X. Similarly, mutation B can be either at position j in gene Y, or at position i’ in gene X. (B) In the schematic, the dotted lines indicate fitness thresholds that allow cells to cause a clonal expansion in the given environment. We assume that the cell which had reached the threshold and started the clonal expansion has the highest fitness level among the preclonal mutant cells. (C) Non-additive mutation effects in diploid genetics. Fitness effects of mutant alleles at the same locus and at different genomic loci can interact resulting in non-additive mutation fitness effects. A heterozygous mutant locus contains a mutant allele and a wildtype allele. A homozygous mutant locus contains two mutant alleles whose fitness effects can interact. In the mutation-selection balance (MSB) model, such fitness effect interaction is called dominance effect. In contrast, fitness effect interaction at any different genomic loci is called epistasis. We distinguish dominance effect from epistasis. Stars indicate mutations. The circle indicates a cell. The vertical bars indicate a copy of the cell’s genome. (D) Dependence relations among the ten assumptions. In the schematic, AB means A is a prerequisite for B. For example, the Additive Dominance Assumption is necessary for the Proportionality Assumption. Only a subset of all dependence relationships is shown.

Unlike population genetics theories of familial human cancers [3], the population genetics of cancer cells focuses solely on dividing somatic cells within a human individual. Those cells reproduce themselves by asexual somatic cell division and thus are not Mendelian population [4]. However, population genetics models such as SSWM [5], Moran process [6], and branching process [7] are powerful tools for describing population dynamics of cancer cells.

Mutation and selection are two driving forces of cancer evolution. The mutation-selection balance (MSB) model [811] assumes infinite population, presence of epistasis [12] and dominance effect. Implication of this model to cancer evolution is enormous [13, 14]. In this model, the loss of genetic variation by selection is equal to the gain of it by mutation. The increase in fitness (i.e., fitness effect) that a mutation confers to the organism is defined as a selection coefficient which may be affected by several types of epistasis [810, 15]. In addition, the dominance effect in diploid organisms such as human cancer cells is quantified as dominance coefficient (Fig 1C) [8, 9, 1517].

According to the “Big Bang” model, the “Big Bang” event separates the cancer development process into two parts: the clonal expansion phase and the preclonal evolution phase. In the clonal expansion phase, multiple tumor subclones are formed with a single expansion of cell population at an early stage of tumor growth (Fig 1A) [2]. In this model, these cells not only acquire clonal lesions that are shared among many cells within the tumor but also acquire subclonal lesions that are observed only in a fraction of the tumor cells at almost the same time. These subclonal lesions are not subjected to stringent selection in cancer evolution (i.e., neutral evolution), thus leading to a state of intra-tumor-heterogeneity (ITH). However, it is known that cancer as a whole is a Darwinian evolutionary system, in which driver mutations undergo selective evolution and passenger mutations undergo neutral evolution [18]. Population genetics studies on cancer evolution in this period have been performed elsewhere, typically with emphasis on neutral evolution, ITH and an increasing population size [7, 1921]. In contrast, in the tumor initiation step, the preclonal evolution of cancer cells before the population expansion, where the rate of such increase is so small that it is unobservable, has not been sufficiently addressed thus far.

Preclonal evolution of cancer cells precedes the “Big Bang”

Preclonal evolution before the expansion of the cell population is thought to be selective, with cancer cells acquiring driver mutations that extremely increase cellular fitness [22]. All cancer driver mutations are beneficial for the survival of cancer cells, and acquisition of these mutations brings fitness gain to cancer cells [23]. Among population genetics theories that are able to describe such behavior of the fitness effects of beneficial mutations, the SSWM foundation provides a first-approximation fitness landscape of beneficial mutations [24, 25]. As a specialization of the MSB model for asexual population such as human cancer cells [26, 27], this model assumes that beneficial mutations are introduced and fixed in the population successionally [11, 28, 29] and have independent fitness effects: no epistasis [30] and no dominance (i.e., epistasis and dominance effects are “additive”)(Fig 1C, S1B Fig) [31]. In this framework, all beneficial mutations that occur in the population will increase fitness monotonically: i.e., they are on “selectively accessible” paths [25]. Thus, their adaptive walk is typically short.

The SSWM foundation is mainly characterized by two key assumptions: strong selection and weak mutation (Fig 1D) [5]. Here, “strong selection” is equivalent to excluding neutral mutations from consideration. Because all cancer driver mutations are beneficial, the strong selection assumption holds for these mutations. In contrast, “weak mutation” means excluding cases in which multiple mutations exist simultaneously within the population. In this assumption, we consider only a genotype that is made by introducing a single mutation to a single wild-type genome. We model population adaptation as a repetition of this step, in which a genotype that is selected from such genotypes becomes a new wild type in the population. Although this assumption always holds true for a microscopic landscape with a small population size, it will be violated if multiple clones interact and multiple mutations compete against each other during their fixation process (clonal interference) [32]. In the preclonal evolution of cancer cells, the weak mutation assumption holds because we assume a small, genetically homogenous ancestral cell population of the tumor sample of interest.

An additional but fundamental component of the SSWM foundation is the additive dominance assumption (Fig 1D). Here we refer to dominance as non-additive effect of mutations at a single locus in a diploid genome, separating it from epistasis [33] that involves multiple genomic loci such as different genes on the same pathway. Human cancer cells are often assimilated to a population of asexual dividing diploid organisms and thus can have dominance effect at any genomic loci (Fig 1C). In the MSB model, the degree of the non-additivity in this context is quantified by dominance coefficient that describes fitness effect of the heterozygous mutant allele on an additive scale (S1B Fig) [34]. In this study, the dominance of a beneficial mutation is assumed additive as long as its fitness effect is small [31, 35, 36].

In the preclonal evolution of cancer, a non-silent mutation within a driver gene can have small fitness effect due to the complexity of cancer pathology. Even within a single cancer type, tumor micro-environments of each tumor sample are sometimes greatly different from each other [37, 38]. As a result, it is possible that a driver mutation which has caused a Big Bang in one tumor sample remains to have only limited fitness effect in another. In addition, elements of tumor micro-environments including resource availability may differ depending on location and change over time [3943]. This will partly explain the genetic diversity of cancer including difference in necessary driver genes among tumors. And fundamentally, a clonal expansion of the cell population which is the direct cause of the Big Bang is expected to occur only once in the Big Bang model [2]. This means that the fitness effects of beneficial mutations induced before the Big Bang are relatively small in comparison with the driver mutation which is the direct cause of the single clonal expansion. Therefore, we assume that the dominance effect of beneficial mutations including those in driver genes in a preclonal cancer cell population is small and to remain approximately additive before the Big Bang (the additive dominance assumption).

In contrast, driver mutations which are causative of the Big Bang are often characterized by non-additive fitness effects in the event. These “causative” driver mutations are each a direct cause of the Big Bang in the tumor sample of interest, and thus likely to have non-additive interaction such as dominance and recessiveness (S1A and S1B Fig). Of the two categories of cancer driver genes (i.e., oncogenes (OGs) and tumor suppressor genes (TSGs)) [44], the former such as NRAS and KRAS is likely to have dominant driver mutations [45], while the latter such as Rb and TP53 is likely to have recessive driver mutations [4648]. These events represent not only breaking of the additive dominance assumption, but also breaking of the weak mutation assumption since they cause a rapid population expansion. Similarly, any such expansion-causing oncogenic events including chromosomal missegregation [4952], epigenetic lesion [53], copy number alteration [54], mutant driver homozygosity [55], and loss-of-heterozygosity [48, 56, 57] are all assumed to break the weak mutation assumption.

Breaking of SSWM dynamics primes clonal expansion of cancer cell population

An expansion of the cancer cell population size generates multiple clones with multiple different beneficial mutations, thus causing clonal evolution [20, 58] of the newer cells. Here, the weak mutation assumption is violated and the population dynamics of the cells is thought to shift from SSWM to greedy adaptation (Fig 1A) [59, 60]. A phenomenon wherein mutations with greater fitness effects are certainly fixed in the population, resulting in increased repeatability of the fitness trajectory of cells in such a situation [61], has been confirmed by an Escherischia coli experiment [62]. The fact that a limited number of driver mutations has been observed repeatedly in multiple cancer samples suggests not only the repeatability of the oncogenic process, but also the underlying evolutionary structure itself.

An increase in population size not only strengthens the deterministic traits [61] of the population by clonal interference but also enables its fitness valley crossing [25, 63]. This gives rise to an escape genotype that is not on the selectively accessible paths in SSWM, thereby counteracting the deterministic traits of the population. Stochastic tunneling, a critical element in the valley-crossing mechanism, enables fixation of deleterious mutations as well as neutral evolution [20, 64]. These theoretical aspects of cell population behavior have been validated by E. coli experiments [65]. In cancer, such population dynamics corresponds to the subclonal evolution of passenger mutations that enhances ITH.

Importantly, an increase in population size also introduces another complexity in population dynamics of cancer cells. Since the weak mutation assumption is violated, a newer mutation can arise when an older mutation has not been fixed. With multiple mutations in a single cell present, the effects of those mutations can interact resulting in a non-additive fitness effect at different loci (i.e., epistasis) [33, 66]. The effect of epistasis on fitness landscape on a practical time scale has been analytically and computationally examined in relation to the weak-mutation limit and the distribution of fitness effects [67]. Roles of epistasis in cancer initiation are highlighted elsewhere [68, 69].

Epistatic effect introduced by violation of the weak mutation assumption changes fitness landscape of each cell in the population. This may result in a rugged fitness landscape [33, 70, 71] in which previously “beneficial” driver mutations become no longer beneficial. For example, in a tumor with an already activated driver pathway, an activating mutation in a driver gene on that pathway no longer contributes to tumorigenesis. This corresponds to violation of the strong selection assumption that contributes to the neutral evolution [20] of cancer driver genes. This in combination with the increasing population size reinforces the basis of applicability of the Moran process [6, 20] and, if the increase is exponential, the branching process [7, 33, 44, 47, 68, 72]. Neutral evolution itself enhances ITH [19].

Extreme value theory as a framework for explaining mutant sample frequency distributions of cancer driver mutations

Among the evolutionary models mentioned above, this manuscript focuses on SSWM (and the ten assumptions: see Discussion and Fig 1D) to explain the acquisition of driver mutations by cancer cells. According to the mutational landscape by Gillespie, the fitness of a wild type allele at a given gene locus tends to lie on the right side of a distribution because this fitness is usually high, and the fitness of beneficial mutations lie further to the right [5, 31]. In this setting, the fitness values are extreme, and thus their statistical behavior is described by EVT [73, 74].

The first application of EVT to cancer preclonal evolution examined the proliferation of stem cells in intestinal crypts [75]. Within the internal space of an individual crypt, the adaptation and evolution of the cell population will be completely independent of the states of different crypts. In our analysis, we employ an analogy that compares an intestinal crypt and a tumor sample. Similar to a single crypt in the colon, a tumor sample has multiple cells that are each independent evolutionary players.

In cancer genomics, “mutation frequency” in general refers to the variant allele frequency (VAF) which is defined as a fraction of mutant reads to total reads at a given genomic site.

VAF=#{mutatedreads}/#{totalreads} (1)

This is approximately proportional to the mutant cells contained in a single sample if the tumor sample purity is assumed constant. Recent studies have shown that the neutral evolution of the cancer genome results in a power law distribution of tumor bulk-sample VAFs reported by a next-generation sequencer [76, 77].

In contrast, we define MFaT as a normalized frequency of mutant samples within a single dataset.

MFaT=#{mutatedtumors}/#{totaltumors} (2)

This measure is proportional to patient/sample frequency in the dataset, and thus focuses more on the repeatability instead of the clonality of a mutation. In this study, we adopt MFaT as a proxy for expectation of VAF and fitness effect of a beneficial mutation (the proportionality assumption). Since beneficial mutations experience selective evolution in carcinogenesis, the distribution of MFaTs will be different from those of VAFs of neutral mutations. It should be noted that this formulation is valid only with driver mutations in the Big Bang model [2].

Materials and methods

Study design

The overall analysis consists of two parts: total tumor analysis and tumor type-specific analysis. The former contains two categories of cancer driver definitions: “driver-gene definition” and “driver-site definition”. The “driver-gene definition” refers to a list of gene symbols which are experimentally or computationally identified driver genes. The “driver-site definition” specifies nucleotide positions as well as their gene symbols. In the tumor type-specific analysis, only driver-gene definitions are used. An intersection set is calculated for every combination of a large-scale cancer genomic dataset and a cancer driver definition. Taking intersection of a given dataset with a “driver-gene definition” equals to applying filtering by gene symbols (symbol-based filtering). On the other hand, taking intersection of a given dataset with a “driver-site definition” equals to applying filtering by genomic positions (position-based filtering).

Throughout the analyses, different mutations in the same gene but at different nucleotide positions are counted separately. For example, if a gene G contained only non-silent mutations at positions P and Q, the gene G has two mutations. Sample frequencies in each dataset are calculated based on combinations of G and P, distinguishing post-substitution nucleotides (e.g., an A-to-C mutation and an A-to-G mutation at P are distinguished).

Datasets

In total, we used unique doubleton protein-altering single nucleotide variations (SNVs) that numbered 68 973 from 15 285 patients in the ICGC Release 27 dataset [78], 325 244 from 24 355 tumors in the COSMIC Version 85 dataset [79], and 91 312 from 11 089 samples in the Chang dataset [80]. These groups of unique mutations respectively correspond to groups of unique genomic sites: 68 613 in the ICGC dataset, 322 962 in the COSMIC dataset, and 90 885 in the Chang dataset. These mutations were annotated as being either in the “missense variant,” “initiator codon variant,” “stop gained,” or “stop lost” mutation class. In all datasets, the reference genome version is GRCh37, and the annotation is based on Ensembl v75.

In addition, we used the RTCGA mutations dataset (https://rtcga.github.io/RTCGA/) to enable a tumor-type-specific analysis over cancer clonal mutations identified in an identical sequencing strategy. This dataset includes 178 701 unique somatic protein-altering SNVs from 2 732 samples and corresponding 177 898 unique genomic sites. The samples are collected from eight cancer types: BLCA (127), BRCA (949), HNSC (266), LIHC (191), LUAD (225), PRAD (268), SKCM (340), and THCA (366) (sample sizes indicated in parentheses). The sequencing strategy is whole exome sequencing (WXS) performed with two platforms (Illumina GAIIx, Illumina HiSeq) at six centers (broad.mit.edu, genome.wustl.edu, hgsc.bcm.edu, ucsc.edu, mdanderson.org, and bcgsc.ca). All mutations are annotated as being in either the “missense mutation,” “nonsense mutation,” “translation start site,” or “nonstop mutation” class. We defined mutations with VAFs in the range of [0.25, 1.00] as clonal after a pioneering study in the field [76].

We used the following cancer driver-gene definitions in our analysis: the Mutational Driver definition in the IntOgen Cancer Drivers Database (2014.12) dataset [81], the COSMIC Cancer Gene Census [79], and the Tokheim Oncogenes and Tumor Suppressor Genes [82]. Also, we used the following data as cancer driver mutation site definitions: SNVs whose driver activity is “known” in the IntOgen 2016.5 Driver Mutations Catalog Mutation Analysis dataset [83], SNVs in the DoCM database [84], and amino acid substitution information per gene in a recent study (S4 Table in [85]).

Data processing

For the RTCGA dataset, mutations were filtered based on the VAF threshold described in the previous section. For the rest of the data, mutations were filtered based on the criteria as in [7880].

MFaT values for each nucleotide mutation were calculated on the basis of the fraction of mutated sample/patient count in the total samples/patients (for details, see Supplementary Materials in S1 Appendix). In brief, the calculation was performed with special attention to minimize the effect of artificial manipulation to data that would possibly be affecting observed MFaT distributions. To filter out potentially low-confidence mutations, we selected mutations that appeared at least twice in a dataset (total sample/patient counts are based only on sequences containing such mutations). This would have ameliorated the problem of the original datasets which are lacking correction or control of tumor sample purity. Protein-altering mutations were selected based on all possible annotations to avoid unintentionally omitting mutations that are synonymous in one annotation and at the same time protein-altering in another (the strong selection assumption). Sample/patient counts were calculated in a manner sensitive to reference/alternate allele combinations (e.g., a T to C mutation and a T to A mutation at the same genomic site are distinguished throughout the analysis).

The definition and calculation of MFaT

We defined mutant allele frequency among tumors (MFaT) as a frequency of samples within a cancer genome dataset that have a given mutation at a given genomic site (see Eq 2). The MFaT value is defined at any genomic sites corresponding to individual mutations recorded in the dataset. In this formulation, the count of samples having a mutation is normalized by the number of total samples within the dataset, permitting a comparison of mutated sample frequencies between different datasets.

To practically and approximately calculate continuous MFaT values, we utilized multiple total-tumor cancer genome datasets. In general, observed MFaT values will be discrete in any dataset with a sample size less than 2 000. The total-tumor datasets, which typically have thousands of sequencing samples, will provide enough resolution in MFaT values to satisfy the continuity assumption. With the aims of ensuring individual mutation observations and reducing computational cost, we restricted our analysis with doubleton (i.e., observed at least twice in each dataset) somatic protein-altering SNVs. A protein-altering mutation in our study is defined as a mutation that changes the protein sequence in either of the annotated transcripts. On mutations that do not satisfy this condition, or non-protein-altering mutations, one can define MFaTs although the proportionality assumption does not hold over the values (for details, see Discussion).

To exclude mutations with smaller fitness effects, we restricted our analysis to the mutation sites with MFaT values ranked in the top 200 in the case of total-tumor analysis and top 50 in the case of type-specific analysis. Here, we assigned smaller ranks to greater MFaT values. For tied MFaT observations, we assigned the maximum rank (e.g., for MFaT observations {0.1, 0.3, 0.3, 0.5}, ranks {4, 3, 3, 1} will be assigned). The number of plotted MFaT values after the filtering is shown in the figures as “b,” and the effective threshold “th” is shown where relevant.

The maximization of MFaT

In practice for large-scale cancer genome analysis, it is common to observe multiple mutations on a single genomic site. This is due to the phenomenon that a single site is recurrently mutated to differing bases (substitution) in a set of samples. Since we consider only simple substitutions at this time, three alternate sequences of a given genomic site are possible (e.g., {A, T, C} against a reference base G). This could result in, at maximum, three different MFaT values for one site (e.g., MFaT for G to A, MFaT for G to T, and G to C).

In general, the number of cases of possible cancer cell environments that dominate cancer evolution via selective pressure is unlimited. Thus, the number of possible selective coefficients (i.e., the fitness effect) at a given genomic site is well approximated by positive infinity (the infinite micro-environments assumption). This is equivalent to the case where the number of possible alternate sequences of a given genomic site is infinite.

In addition, we assume that the mutational selection coefficient of a given site is maximized among possible alternate values after evolutionary selection (the maximization assumption).

Si=max(Si,1,Si,2,,Si,n)(n) (3)

where S denotes a mutational selection coefficient, i is the index for genomic sites, and n is the number of possible values, respectively. This is consistent with the selectionist idea of the survival of the fittest, and the block maxima model in the extreme value theory.

To suffice for the maximization assumption of MFaT observations, we selected the maxima of MFaT values in each genomic site, and excluded the rest from our analysis. As a result, the counts of MFaT values, mutations, and sites will all be equal. Here, we define this process as “maximization” of MFaTs, which will enable more exact and reliable data processing in future analyses. In the case of a tumor-type-specific analysis, we performed this maximization process for the respective tumor types.

Parameter estimation

We estimated the generalized extreme value distribution (GEV) and generalized Pareto distribution (GPD) parameters by the maximum likelihood (Nelder-Mead optimization) method using the “evd” R package (i.e., evd::fgev, evd::fpot functions). Initial values for the optimization are set in the GEV parameter estimation in the total-tumor analysis using the ICGC, COSMIC, and Chang datasets (shape: 1.25, location: 0, scale: 0), while they are not set in bootstrap simulations using these datasets. We also used different initial values (shape: 1.0, location: 0, scale: 0) in the total-tumor analysis using the RTCGA dataset.

For parameter estimation of the Pareto distribution, we defined its probability density function (PDF) and maximum likelihood estimators as the following, referring to implementations in the “VGAM” R package. Here, n is the length of the observed data vector.

fPareto(x)=αxmαxα+1x>xm (4)
x^m=minixi (5)
α^=ni=1nlnxi-nlnx^m (6)

Goodness-of-fit assessment by χ2 goodness-of-fit test

To assess goodness-of-fit of the GEV and Pareto distributions, we performed a conventional χ2 goodness-of-fit test over the total-tumor and tumor-type-specific cancer driver MFaTs using the “stats” R package (stats::chisq.test function; rescale.p and simulate.p.value flags were both set to TRUE). The theoretical frequencies of the MFaTs were calculated using parameters estimated by the maximum likelihood method (see Parameter Estimation section) and then compared with the observed MFaT frequencies. In this setting, the null hypothesis H0 stated, “the observed distribution is identical with the theoretical,” and the alternative hypothesis H1 stated, “the observed distribution is different from the theoretical.” Consequently, a higher significance level such as p < 0.05 rejected the H0 and indicated that “the observed distribution is significantly different from the GEV,” for example. Conversely, a lower significance level such as p = 0.50 weakly supported the notion that the observed and theoretical distributions are alike.

Bootstrap simulation and confidence interval

Bootstrap simulations over parameter estimators were performed using the “boot” R package (boot::boot function) in the total-tumor analysis. In all analyses, the iteration number is set to n = 1 × 105. We did not set initial values in the cases of the ICGC, COSMIC, and Chang datasets, while they were set in the case of the RTCGA dataset (shape: 1.0, location: 0, scale: 0).

We calculated 95% confidence intervals for the GEV shape parameter (i.e., tail index) using simulated total-tumor bootstrap distributions and a variety of calculation methods (in boot::boot.ci function): the first-order normal approximation (Normal), the basic bootstrap interval (Basic), the studentized bootstrap interval (Student), the bootstrap percentile interval (Percentile), and the adjusted bootstrap percentile (BCa) interval.

Density plot

For MFaT observations in each total-tumor case, we plotted theoretical probability density functions (PDFs) obtained from the parameter estimation of distributions and empirical PDFs obtained from observed values. We used the “evd” R package (evd::dgev and evd::dgpd functions) to plot GEV and GPD PDFs. Similarly, we defined the PDF of the Pareto distribution by referring to the “VGAM” R package (VGAM::dpareto function) (see Parameter Estimation section). We used the “stats” R package (stats::density function) to draw the empirical density function obtained from observations.

Data normalization in Q-Q Plot

We calculated a theoretical value of an observed MFaT value xk with Q(k/(N + d)), where Q(⋅) is the quantile function of GEV, k is the rank of the observation when the values are ranked from the smallest, N is the number of observations (N = b in plots), and d is the numerator correction factor (both b and d are shown in plots). The quantile function is defined by an inverse function of the cumulative distribution function (CDF) of GEV. We used the “evd” R package (evd::qgev function) to calculate the quantile function values.

Formulation of Fréchet plot

Based on the Weibull plot introduced in [86], we developed and drew a Fréchet plot (see demonstration in Supplementary Materials in S1 Appendix). After the assessment of the goodness-of-fit of the GEV distribution to the observations, we calculated slope and intercept values in a linear regression. In the cases of tumor-type-specific and gene-specific Fréchet plots, we added an additional data point MFaT = 1.0 to the observations in calculating each value of the empirical distribution function Femp(x) against the observed MFaT x. This satisfied the condition 0 < Femp(x) < 1 so that the y-axis value y(x)= −ln(−ln Femp(x)) is finite. In the gene-specific cases, we plotted only genes with more than two unique MFaT values. In such cases, we analyzed skin cutaneous melanoma (SKCM) tumors separately that are more enriched in mutations.

Bayesian parameter estimation by Markov Chain Monte Carlo simulation

In Bayesian extreme value analysis, we estimated tumor-type-specific GEV parameters using the Markov Chain Monte Carlo (MCMC) approach. We used the “evdbayes” R implementation for this purpose. Prior distributions of respective GEV parameters (i.e., shape, scale, and location) were assumed independent and normal. This means that the variance-covariance matrix that is used to calculate prior distribution is diagonal. Also, this distribution is equal to the trivariate (i.e., with three variables) normal distribution with variables that are mutually independent. In addition, we determined the prior parameters regarding the results of the bootstrap simulation, which are dependent on observed data. This means that the prior distribution used in this step is “informative.”

Specifically, the normal parameters in the informative priors were set as follows (mean ± standard deviation): shape 1.0 ± 0.5, scale 2.5 × 10−4 ± 3.2 × 10−4, and location 1.5 × 10−3 ± 1 × 10−4. The standard deviations of proposal distribution in MCMC were set as follows: shape 0.01, scale 1 × 10−5, and location 1 × 10−4. The burn-in number of MCMC was set to 5 000, whereas the number of iterations was 1 × 105. The initial values of the MCMC simulation were set equal to the expectation of the normal priors. However, for scale parameters in tumor-type-specific parameter estimation, we set the next value τ as the initial values considering the loss of genetic diversity in sequencing samples due to the shrinking sample size, which has been a problem in tumor-type-specific analyses.

τ=1/[Samplesizeofthetumortype] (7)

After the MCMC simulation, we estimated tumor-type-specific GEV parameters using an expected à posteriori (EAP) estimator.

Estimation of mutational fitness effects by MFaT

Consider the relative fitness Wi,A of a cancer cell with a post-mutational sequence A at a certain genomic site i. We define this value using the classic evolutionary selection coefficient, or fitness effect, Si,A:

Wi,A=1+Si,A (8)

With this definition, we formulate the selection coefficient whereby the cell’s genome is mutated from a pre-mutation (or reference) sequence R (with length 1 bp) to a post-mutational sequence A [27].

Si,RA=Wi,A-Wi,RWi,R=Si,A-Si,R1+Si,R (9)

Then, we consider cancer cell environments that determine selective pressure throughout cancer evolution. We consider the set of all possible environmental states Θ and its elements θ that determine the value of a selection coefficient of a cancer cell with a given genotype. Next, we define the selection coefficient whereby the cell mutates from a pre-mutational sequence R to a post-mutational sequence A at a genomic site i within a given environmental state θ:

Si,RA,θ=Si,A,θ-Si,R,θ1+Si,R,θ (10)

In addition, we assume that the fitness effect of a cancer driver mutation after preclonal evolution is maximized (the maximization assumption). In other words, the fitness effect of a given mutation is the maximum of many alternative values that are possible depending on the combinations of alternate sequence information and possible cancer cell environments. Under this formulation, the mutational selection coefficient (MSC), whose values are unique to cancer driver mutation sites, is defined by:

Si=max({Si,Rd,θ}dD,θΘ) (11)

Here, D is the set of post-mutational DNA sequences that are possible at the site i, and d is its single element. D is dependent on the genomic site i and the scope of consideration of mutational classes. For example, in the case considering only simple substitutions, then D = {A, T, C} if the pre-mutational sequence R was G. In this case, the size (i.e., the number of elements) of D is #{D} = 3. In contrast, the set Θ is independent of genomic sites and pre- and post-mutational sequences. Here, we assume the size of the set #{Θ} is infinite (i.e., the number of possible micro-environmental states is numerous) (the infinite micro-environments assumption). Then, the number of possible selection coefficients n = #{D} ⋅ #{Θ} for a given site i tends to infinity, and n → ∞ holds.

Let the individual values of Si,Rd,θ at respective genomic sites i be continuous and independent and identically distributed (IID) about D and Θ (the IID assumption and the continuity assumption). Based on the maximization, IID, and continuity assumptions, the block maxima model in extreme value theory is valid over Si, demonstrating that Si converges to the generalized extreme value distribution (GEV) as n → ∞. Finally, as this limit holds due to the infinite micro-environments assumption, the probability distribution of Si will be GEV over the set of genomic sites J(iJ).

Here, we assume that MSC at a site i is proportional to the MFaT of the site i (the proportionality assumption).

Si[MFaT]i (12)

We generalize this by considering normalizing constants of MFaTs, σ0 and μ0.

Si=[MFaT]i-μ0σ0 (13)

In general, fitness W is unobservable, and so is the relative fitness Wi,A by a mutation at site i in preclonal cancer cells. Thus, here we heuristically set μ0 = 0 and σ0 = 1. Then, the estimator of MSC at site i is given by

Si^=[MFaT]i (14)

EAP estimation of MSCs by GEV-binomial model

We consider the probability distribution of the sample count k of tumor samples which have a certain post-mutational sequence at a given genomic site i. Since 0 < MFaT < 1, here we consider the binomial distribution Binom(m, Si) with Si (the MSC at site i) as a ratio parameter and m (the number of tumor samples) as a size parameter. We update the Bayesian knowledge of the parameter Si in our Bayesian framework using observations of k and m. Specifically, we assume GEV as a prior distribution over Si, and then we calculate the posterior distribution using discrete observations at individual genomic sites.

The prior distribution of Si is expressed using the PDF of GEV fGEV(s) with three parameters (shape ξ, scale σ, and location μ) as follows:

fGEV(s)=1σA(s)-1ξ-1exp[-A(s)-1ξ] (15)
ξ0;A(s)=1+ξs-μσ>0 (16)

Here, A(s) is a function of s, the argument of the PDF. The likelihood at each value of tumor sample counts, k = 0, 1, 2, …, m, is expressed using the probability mass function (PMF) of the binomial distribution fBinom(k) as follows:

P(X=k|Si=s)=fBinom(k)=(mk)sk(1-s)m-k (17)

From Bayes theorem and appropriate assumptions over PMF and PDF, we assume the following equation over the posterior distribution of Si

P(Si=s|X=k)=P(X=k|Si=s)P(Si=s)P(X=k) (18)

Here, the necessary assumptions are

P(Si=s)=fGEV(s) (19)
P(X=k)=iP(X=k|Si) (20)

For the stepwise calculation of P(X = k), we need to consider tumor sample counts at all genomics sites. However, this is easily achieved by normalization of the numerator P(X = k|Si = s)P(Si = s). Finally, we estimate Si by the expected à posteriori (EAP) estimation and obtain its 95% confidence interval (EAP ± 1.96 APSD) by calculating the à posteriori standard deviation (APSD).

Results

The probability distribution of driver MFaT is likely a Type II extreme value distribution

A visual inspection of the distribution shape using a density plot (Fig 2A and S2A Fig) suggested that the probability distribution of cancer driver mutation MFaTs is approximately equal to the extreme value distribution (for details, see Materials and methods). A Q-Q plot with Sugano normalization (see Materials and methods) confirmed that the driver MFaT distribution is approximately described by the extreme value distribution given the relationship between observed and theoretical values (Fig 2B, S2B Fig). In addition, the arrangements of data points in Fréchet plots are approximately linear (R2 > 0.92 in symbol-based filtering cases and R2 > 0.97 in position-based filtering cases), suggesting driver MFaTs follow the Fréchet distribution (Fig 2E, S2E Fig). The results of a bootstrap simulation (Fig 2C and 2D, S2C and S2D Fig) over the tail index as the shape parameter of the generalized extreme value distribution (GEV) revealed that the tail index of driver MFaTs is common among different cancer genome datasets (1.25 ± 0.62 with a 95% confidence interval), and its value is positive (Type II extreme value distribution or Fréchet distribution). Also, the result suggested that the tail index can contribute to the estimate of systematic biases within a dataset, and in this case, the ICGC dataset harbors greater noise compared to other datasets (Fig 2C, S2C Fig).

Fig 2. Exploratory plots on cancer driver mutation MFaTs in the total-tumor analysis with symbol-based filtering.

Fig 2

In these figures, the “ICGC-IntOgen,” for example, denotes a set of mutations as an intersection of the ICGC mutations with the IntOgen driver-gene definition (i.e., symbol-based filtering)(for details, see Materials and methods). (A) Density plot of MFaTs. The colored solid line is probability density of observations, the black solid line is the probability density function (PDF) of GEV, and the dotted lines are the PDFs of the GPD and Pareto distributions. For each plot, a gene symbol and an amino acid substitution of a mutation with the highest MFaT value is shown in a box. (B) Q-Q plot of MFaTs. Here, “b” denotes the number of genomic sites of beneficial mutations considered, “th” denotes the effective threshold against MFaTs when selecting mutations according to ranks (for details, see Materials and methods), and “d” denotes the parameter in the normalization in the Q-Q plot. The straight line has the equation y = x. In the Q-Q plot, each point denotes a pair consisting of an observation and its corresponding theoretical value. (C) Bootstrap distributions of the tail index. The white point indicates the median of the distribution. The black square shows the first and the third quantiles. (D) Bootstrap confidence intervals. The dotted line shows the value that is in range of all confidence interval cases. (E) Fréchet plot. The R2 values and the equation for the linear regression lines are shown. The symbols “b” and “th” are as defined in (B).

Finally, the results of a χ2 goodness-of-fit test did not exclude the possibility that the actual distribution of driver MFaTs is GEV (with a significance level α = 0.05) (Tables 1 and 2). In this statistical test, the null hypothesis H0 that may be rejected is “the observed distribution is identical with the theoretical,” and the alternative hypothesis H1 that may be accepted is “the observed distribution is different from the theoretical.” Thus, the former null hypothesis (H0) is our claim. For example, in the case of ICGC-DoCM in the driver-site analyses, we cannot conclude that the two distributions are different because the null hypothesis is not rejected at a significance level α = 0.05 (p = 0.51 > 0.05). This does not state that the two distributions are identical but weakly supports the claim that the distribution of driver MFaT is GEV.

Table 1. The χ2 goodness-of-fit test p-values of total tumors with symbol-based filtering (rank < 200).

Combination Case MFaT Driver Definition GEV p-value Pareto p-value
ICGC-IntOgen ICGC IntOgen 0.280 0.005
ICGC-CGC ICGC CGC 0.301 0.009
ICGC-Tokheim ICGC Tokheim 0.269 0.081
COSMIC-IntOgen COSMIC IntOgen 1.000 0.758
COSMIC-CGC COSMIC CGC 1.000 0.767
COSMIC-Tokheim COSMIC Tokheim 1.000 1.000
CHANG-IntOgen CHANG IntOgen 1.000 1.000
CHANG-CGC CHANG CGC 1.000 1.000
CHANG-Tokheim CHANG Tokheim 1.000 1.000

Note that in this setting, a higher significance level such as p < 0.05 indicates that the observed and theoretical distributions are different, and a lower significance level such as p = 0.50 weakly supports a notion that the two distributions are alike. In this table, the GEV p-values are all above 0.25 maintaining the null hypothesis (H0) that the two distributions are identical, while some of the Pareto p-values have fallen below 0.05 accepting the alternative hypothesis (H1) that the two distributions are different.

Table 2. The χ2 goodness-of-fit test p-values of total tumors with position-based filtering.

Combination Case MFaT Driver Definition GEV p-value Pareto p-value
ICGC-IntOgen ICGC IntOgen 0.405 0.336
ICGC-DoCM ICGC DoCM 0.529 0.316
ICGC-Tokheim ICGC Bailey 0.244 0.203
COSMIC-IntOgen COSMIC IntOgen 1.000 1.000
COSMIC-DoCM COSMIC DoCM 1.000 1.000
COSMIC-Bailey COSMIC Bailey 1.000 0.672
CHANG-IntOgen CHANG IntOgen 1.000 1.000
CHANG-DoCM CHANG DoCM 1.000 1.000
CHANG-Bailey CHANG Bailey 0.632 0.625

Note that in this setting, a lower significance level with higher p-value such as p = 0.50 weakly supports a notion that the observed and theoretical distributions are alike. In this table, p-values are equal or higher in the GEV than the Pareto distribution.

Collectively, the total-tumor analysis, using both symbol-based and position-based filtering methods, strongly suggested that cancer driver mutation MFaTs in total-tumor datasets are appropriately modeled by the Type II extreme value (Fréchet) distribution.

Bayesian MCMC approach confirms that tumor-type-specific driver MFaT distributions are also Fréchet type

We then asked whether driver MFaT distributions were also of the Fréchet type in each tumor type. We used the RTCGA dataset for this analysis regarding data availability. The results of the “total”-tumor analysis using the RTCGA dataset (Fig 3A and 3B) confirmed that estimated values of GEV parameters in the case of the RTCGA dataset are reproducible and independent of differences in filtering methods, as shown with the ICGC, COSMIC and CHANG datasets (Fig 2C and S2C Fig). Tumor-type-specific analyses of eight tumor types (Fig 3C and 3D) confirmed that the results were similar to the results of the total-tumor analysis. Parameter estimation by Bayesian MCMC (Fig 3C) showed the MFaT distributions belonged to the Fréchet type, although some degree of variability in GEV parameters according to the differences in tumor types were observed. In addition, the comparison of histograms of observations to estimated densities (Fig 3D) confirmed that the probability distribution of GEV approximately describes the actual distributions of type-specific driver MFaTs. The arrangements of data points in Fréchet plots are approximately linear (R2 > 0.90 in any of eight tumor types), suggesting that driver MFaTs follow the Fréchet distribution (Fig 3E). Finally, the results of the χ2 goodness-of-fit test did not reject the null hypothesis that the distribution is GEV, except for the case of THCA (Table 3).

Fig 3. Bayesian MCMC approach for estimating tumor-type-specific GEV parameters.

Fig 3

In these analyses, only mutations filtered by gene symbols are analyzed (i.e., symbol-based filtering). In the total-tumor analysis, we used the IntOgen, CGC, and Tokheim driver-gene definitions, and in the case of tumor-type-specific analyses, we used only the IntOgen definition. (A) Density plot of MFaTs in the cases of the RTCGA total-tumor with symbol-based filtering. The colored solid line is the probability density of observations, the black solid line is the probability density function (PDF) of GEV, and the dotted lines are the PDFs of the GPD and Pareto distributions, respectively. For each plot, a gene symbol and an amino acid substitution of a mutation with the highest MFaT value is shown in a box. Here, “b” denotes the number of genomic sites of beneficial mutations considered, and “th” denotes the effective threshold against MFaTs when selecting mutations according to ranks (for details, see Materials and methods). (B) Bootstrap distributions of GEV parameters in the total-tumor analysis. We used the maximum likelihood method for the parameter estimation. The white point indicates the median of the distribution. The black square shows the first and the third quantiles. (C) Bootstrap distributions of GEV parameters in the tumor-type-specific analysis. We used the Bayesian MCMC method for the parameter estimation. The white point indicates the median of the distribution. The gray square shows the first and the third quantiles. (D) MFaT histograms with tumor-type-specific GEV densities. Black bars show frequencies in the histogram. The colored lines are estimated densities. (E) Type-specific Fréchet plots. The R2 values and the equations of the linear regression lines are shown. Here, “b” denotes the number of genomic sites of beneficial mutations considered.

Table 3. The χ2 goodness-of-fit test p-values of the RTCGA dataset with tumor-type-specific symbol-based filtering (rank < 50).

Dataset Tumor Type GEV p-value
RTCGA BLCA 1.000
RTCGA BRCA 0.434
RTCGA HNSC 1.000
RTCGA LIHC 1.000
RTCGA LUAD 0.506
RTCGA PRAD 1.000
RTCGA SKCM 0.381
RTCGA THCA 0.002

Note that in this setting, a higher p-value such as p = 0.50 weakly supports a notion that the observed and theoretical distributions are alike. In this table, p-values are above 0.30 except for the case of THCA.

Collectively, the parameter estimation by the Bayesian MCMC approach confirmed that the probability distributions of tumor-type-specific driver MFaTs are also of the Fréchet type, as shown for the total tumors (Figs 2 and 3A and 3B, S2 Fig).

GEV-binomial model estimates mutational selection coefficients (MSCs) of cancer driver protein mutations

Here, we assessed, through the estimation of MSCs, the degree of contribution of each amino acid mutation in driver proteins to oncogenesis by applying the Bayesian approach in the framework of EVT using the RTCGA mutations dataset. We recognize that a set of assumptions are required for estimating MSCs through MFaTs in the framework of EVT. The details of such assumptions are provided in the Discussion section.

In this series of analyses, we were able to compare these mutational fitness effects estimated by mutation frequencies between tumor types, because the estimators utilize MFaTs normalized by respective sample counts. This normalization thus far is independent of the classes of genes and other features of interest. Thus, EVT as a field of population genetics is consistent with the quantitative comparison of mutational fitness effects among tumor types involving both OGs and TSGs.

For example, we estimated the MSCs of BRAF mutations among various tumor types (Fig 4A). The estimated MSC was the highest for BRAF V600E in thyroid cancer (THCA), followed by BRAF V600E in skin cutaneous melanoma (SKCM). The BRAF V600E mutation in lung adenocarcinoma (LUAD) or other amino acid changes in BRAF showed relatively low MSCs.

Fig 4. The EAP estimates of driver mutation MSCs.

Fig 4

Tumor-type-specific posterior distributions of mutational selection coefficients (MSCs) and EAP (expected à posteriori) estimates in driver-protein mutations calculated by the GEV-binomial model (for details, see Materials and methods). For ease of plotting, we discretized probability densities of posteriors to obtain probability masses. Among the probability masses that count the total number of tumor samples (i.e., the parameter m in the GEV-binomial model), we omitted those tail probabilities that are smaller than 1/1000 in the plot. The white points at the center of each violin in the plots represent EAP estimates given the distribution. The shapes of plotted distributions have information of MFaT tails that cannot be modeled by a simple binomial model. (A) Violin plots of tumor-type-specific posterior distributions of MSCs in genes. Violins are colored according to each amino acid substitution to enable a comparison among MSC distributions of identically coordinated mutations observed in different tumor types. Genes that have four “mutation-tumor type” combinations or more in the RTCGA dataset are shown. (B) Violin plots of tumor-type-specific posterior distributions of MSCs in the TP53 protein sequences. The violins are arranged according to the amino acid coordinates of the mutations and are colored according to the amino acid residue after the substitution introduced by the mutation.

In the literature, the impact on the fitness effect of the BRAF V600E mutation is likely different among tumor types, including thyroid cancer, skin cutaneous melanoma, and lung adenocarcinoma. The mutation has been associated with poor prognosis and mortality in patients with papillary thyroid cancer [8789]. This is likely the strongest association between these three tumor types. The prevalence of this mutation is reported also in melanoma [90] and is shown to induce metastasis of melanoma in mice [91]. However, this is conditional to PTEN loss, suggesting weaker association compared to the case of thyroid cancer. The contribution of this mutation is even weaker in lung adenocarcinoma [92]. The estimated MSCs of BRAF mutations among tumor types are consistent with these known facts.

Although not much is known for the impact of each mutations on the fitness effect among the tumor types for other genes in Fig 4A, OGs (i.e., HRAS and PIK3CA) tend to have a few “mutation-tumor type” combinations that show relatively high MSCs. In contrast, in the case of TSGs (i.e., CDKN2A, CTNNB1, and SPOP), MSCs were small and the differences between “mutation-tumor type” combinations are also small. This tendency is evident in the case of TP53 (Fig 4B).

Discussion

Driver MFaTs are expectations of VAFs in the Big Bang model of cancer evolution

According to the Big Bang model of the cancer genome [2], individual tumor samples have independent and different oncogenic trajectories. Moreover, a driver mutation that has had an impact on carcinogenesis within a single sample is a “public” mutation shared by all cells in the tumor. In one sample, a class of driver mutations may have an impact on carcinogenesis, while in another sample, it does not. If a mutation at a certain driver-site within a tumor sample has an impact on carcinogenesis, then all of the tumor cells have that mutation, and about that driver site, VAF = 1.0 holds if the tumor sample purity is ideal (i.e., 100%). Similarly, if a mutation at a certain driver site within a tumor sample did not impact the oncogenic process, that mutation should not be observed in any of the tumor cells, and about that driver site, VAF = 0.0 holds regardless of tumor sample purity.

If these two important aspects of cancer driver mutation VAFs are considered, the value of aggregated tumor VAF (i.e., VAF that is aggregated across tumor samples in a given dataset; for details, see Supplementary Materials in S1 Appendix) will be equal to the value of mutant allele frequency among tumors (MFaT), which is given by the ratio of mutated samples to total samples. The use of MFaT will thus be a powerful approach in normalizing, investigating, and deciphering the records of preclonal evolution in large-scale cancer genome data.

The ten assumptions in extreme value cancer genomics

With the framework of SSWM (strong selection and weak mutation) in population genetics [5, 25], we were able to mechanistically and stochastically describe the preclonal evolution of cancer. To achieve this and perform valid extreme value analysis over cancer driver MFaTs, we propose the following ten assumptions. Some of the relationships among these assumptions are shown in Fig 1D. These assumptions will specify the scope of the application of the theory and will enable precise interpretation of the results.

  1. The Single Macro-Environment Assumption: Cancer evolution is an evolutionary process in which cancer evolves and adapts to a single macro-environment. This is based on a notion that, from a viewpoint that considers multiple sets of patients and multiple tissue types, the oncogenic and progressive processes are repeatable (the macro-repeatability assumption). This also is a prerequisite for two other assumptions (i.e., the proportionality assumption and the maximization assumption) stated below.

  2. The Infinite Micro-Environments Assumption: In a microscopic point of view that focuses on patients’ genetic background, physical condition, tissue type, and tissue micro-structure, as well as the genetic diversity of cancer itself, the uncertainty of evolutionary processes, and many other critical aspects of cancer evolution, we have infinite cases of possible cancer micro-environments. This is a prerequisite of the independent and identical distribution assumption and the maximization assumption stated below.

  3. The Additive Dominance Assumption: To translate knowledge of population genetics of monoploid organisms to cancer evolution in which diploid cancer cells play major roles, we assume that, in preclonal evolution of cancer cells, the fitness effect of a heterozygous allele is approximately additive [93] to explain the fitness effect of the two associated homozygotes [31]. This approximation is useful when fitness effects of beneficial mutations are small [31, 35, 36] like those mutations before violation of the weak mutation assumption. For activating mutation alleles in OGs, we assume that as long as there are only wildtype or heterozygous alleles in the population, effects of these alleles are additive considering the effect of wildtype as zero. Activating mutations in OGs may violate the weak mutation assumption while allowing us to retain the additive dominance assumption. For loss-of-function mutation alleles in TSGs, violation of the additive dominance assumption is equivalent to mutation homozygosity [55] or LoH [48, 56, 57] as a carcinogenic event.

  4. The No Epistasis Assumption: Similar to the additive dominance assumption, we assume that mutation effects at different genomic loci exhibit no epistasis and thus are additive [30]. Although the framework itself is capable of incorporating epistasis by reassigning selection coefficients at each evolutionary step [30, 94], we focus on a fact that the no epistasis assumption is practically not overly violated, possibly because the majority of the tumor samples has driver mutations on non-redundant pathways. It is known that for tumorigenesis, at least six cancer hallmarks should be established with corresponding pathways affected either by driver mutations or by other tumorigenic events [95].

  5. The Strong Selection Assumption: The mutations in the scope are all either beneficial or deleterious, and no neutral mutations are considered [5]. Because we presume driver mutations are all beneficial in cancer evolution [18], we can safely accept this assumption over driver mutations. In the analysis, the validity of this assumption is ensured by removing mutations with lower observed fitness gains. For the case of total-tumor analysis with symbol-based filtering, we focused on mutation sites with the top 200 MFaT values, and for type-specific analysis, we focused on the top 50.

  6. The Weak Mutation Assumption: The beneficial mutations are fixed independently in the population [25]. This is equivalent to the exclusion of clonal interference from the scope. In the stage of preclonal evolution of cancer, it is plausible to assume that the probability of acquiring multiple mutations within a single generation is sufficiently low. This leads to an approximation of the presence of a single clone at any time. In combination with the strong selection assumption, we obtain the SSWM fitness landscape in which the population adapts to the environment following a series of selective sweep.

  7. The Proportionality Assumption: From the above-stated single macro-environment assumption, cancer adapts and evolves in a single macro-environment. Here, we assume that MFaT is proportional to the fitness gain that a mutation yields in the single macro-environment. This enables quantification of mutational fitness gains in the cancer macro-environment using MFaT.

  8. The Continuous Fitness Effect Assumption (Continuity Assumption): So far, the fitness effect that a mutation yields in cancer evolution is only indirectly observable. Thus, the precise formulation of the probability distribution of that variable is unknown as well as its existence. However, the repeatability of cancer evolution implies that such fitness effects by a mutation have a certain probability distribution, and the complexity suggests that the variable is approximately continuous. From the above, we assume that the mutational fitness gains in cancer evolution have a certain continuous probability distribution.

  9. The Independent and Identical Distribution Assumption (IID Assumption): In general, if two mutations had different genomic coordinates, then the phenotypic effects of the two mutations also vary. This is because different genomic sites encode different structures and functions of the organism. For example, mutations in the first and third letters of triplets in the codon table will yield different amino acid substitutions (i.e., the first- and third- letter substitutions are independent). In contrast, the fitness effect that a phenotypic effect of a given mutation confers on the organism is dependent on the environment in which the organism adapts and evolves. From the above-mentioned infinite micro-environments assumption, we have numerous cases of such environments in cancer evolution. Under these possible environments, we assume that the fitness effect of a given, single mutation has a certain probability distribution that is independent of a genomic site of the mutation (i.e., any given mutation have identical probability distribution). Then, the value of the fitness effect of a mutation is independent and identically distributed (i.i.d.) across genomic sites. This is equivalent to excluding cases of interaction of mutation effects (i.e., epistasis) from the scope.

  10. The Maximization Assumption: From the infinite micro-environments assumption, the number of cases of possible cancer micro-environments is infinite. Fitness effect caused by a mutation at a given genomic site have a different value in a different micro-environment. It is known that, in the preclonal evolution of cancer, such micro-environments play critical roles in the evolution of cancer cells. Here, from the single macro-environment assumption, we consider adaptation of cancer cells to the single cancer macro-environment in the preclonal evolution step. We consider that, under such a macro-environment, cancer cells are selected based on combinations of different micro-environments and different fitness effects of cancer driver mutations. Thus, we assume that mutational fitness gain at a given cancer driver site is maximized across possible values (the block maxima model) after selection in the preclonal evolution. This is consistent with the idea of “survival of the fittest” in the theory of natural selection.
    Si=max(Si,1,Si,2,,Si,n)(n) (21)
    Here, S is a selection coefficient, i is an index for genomic sites, and n is the number of possible micro-environments.

    Although some of the above assumptions may not fit with our current knowledge of cancer biology, the results of our analysis suggest that they may hold at least for the first approximation.

The probability distribution of cancer driver MFaTs is likely the Fréchet type

At the current level of our knowledge, statistical properties of fitness effect distributions of cancer driver mutations are to a large extent unknown. A primary reason for this is the lack of a method that directly measures such fitness effects. However, the population genetics of cancer cells based on SSWM and extreme value theory has the flexibility to deal with the behavior of cancer cell populations while avoiding this problem. With these frameworks and several appropriate assumptions, we are able to discuss at least some of the properties of the fitness effect distributions from observed mutation frequencies as a result of adaptation and evolution of cells, without knowing the exact fitness effect distribution in each mutation in each cell. In other words, if the distribution of interest is continuous, the distribution of maxima drawn from samples from that distribution will be one of these three types: Gumbel, Fréchet, or Weibull [96].

Many “ordinary” probability distributions, such as normal, exponential, and gamma, belong to the Gumbel maximum domain of attraction. Based on this fact and discussion that Fréchet-type and Weibull-type distributions are not “biological,” Gumbel-type distributions have been justified as distributions of fitness effects of beneficial mutations [97]. In addition, a historical background in which such a fitness effect distribution has been considered to be exponential also supported this preconception (Fisher’s geometric model; [27, 98]). However, recent theoretical advances clarified that distributions that belong to the Fréchet and Weibull domains are also possible [73].

Biological experiments roughly supported this non-Gumbel hypothesis of fitness effects of beneficial mutations. A recent experiment involving two virus strains showed that fitness effects practically yielded by a beneficial mutation do not follow an exponential distribution [99101]. The mathematical background of this experiment is that, if values of the fitness effects have a right-truncated distribution due to their upper limit being characteristic of a given experimental setting, then the maxima of values drawn from that distribution will follow the Weibull distribution. In this experiment, a fitness effect value of a mutation is quantified as a count of formed plaques and is a proxy for virus particles [102, 103]. A quantity based on a count of biological entities is among the most powerful candidates for a variable to quantify the fitness effects of a mutation.

Also, an Escherichia coli experiment designed as an application of EVT empirically confirmed that the fitness effects of fixed beneficial mutations follow a distribution with a positive mode [104]. Although experimental settings including the method to quantify mutant fitness are greatly different from this study, the Fréchet distribution as a statistical distribution that describes the behavior of fitness effects of fixed driver mutations in tumor samples also has a positive mode in its mathematics.

Our study suggested that the distribution of fitness effects of driver mutations calculated from a sample frequency in a large-scale sequence dataset is of the Fréchet type (Figs 2 and 3), while it also allows distributions of the fitness effects of the individual mutations to remain unknown. The results of goodness-of-fit tests (Tables 13) did not reject the null hypothesis that the given two distributions are identical, supporting the possibility that the distribution of MFaTs as estimates of mutational selection coefficients (MSCs) is Fréchet distribution. Mathematically, a zero value of the shape parameter of the generalized extreme value (GEV) distribution means the Gumbel type, and a positive value means the Fréchet type. Also, the Fréchet distribution itself belongs to the attraction domain of the Fréchet distribution. These results not only present a problem to the previously held Gumbel hypothesis [73] from a practical point of view but also suggest the applicability of the Fréchet distribution in cancer genomics (Fig 4). In the case of THCA (Table 3), the null hypothesis was rejected in the goodness-of-fit test and it did not reproduce this result. It is obvious from the graph that this irreproducibility is due to the lack of mutations used in the analysis (Fig 3E).

The applicability of extreme value theory in cancer genomics

The posterior distributions of tumor-type-specific mutational selection coefficients (MSCs) of driver mutations calculated by the GEV-binomial model (Fig 4) contain information of distribution tails described by EVT. In the violin plots, because the EAP estimates drawn as white dots contain information of the tails that cannot be handled by a simple binomial model, those estimates have shifted to the right from the central point, as suggested by shapes of the posterior distributions. Such shifts are significant in posterior distributions of mutations, such as the S33P mutation in the CTNNB1 gene in the LIHC tumor type and the Q61R mutation in the HRAS gene in the THCA tumor type (Fig 4A). Similarly, while EAP estimates of driver mutation MSCs in the TP53 gene strongly reflect MFaTs that are mutant sample frequencies, these values also reflect information of the tails so as to be more continuous (Fig 4B). The shifts in these estimates suggest the applicability of EVT in cancer genomic analyses that entail estimation of fitness effects of beneficial mutations.

Conclusion

Based on statistical data analysis involving multiple tumor types and multiple definitions of cancer driver mutations, this study not only demonstrates that EVT helps us to understand the statistical distribution of driver-mutation frequencies in the cancer genome, which is a critical aspect in cancer genetics, but also suggests its applicability in cancer genomics based on its potential to model the tail behavior of mutation frequency distributions.

Supporting information

S1 Fig. The SSWM population dynamics of cells and the additivity of fitness effects of mutant alleles.

(PDF)

S2 Fig. Exploratory plots on cancer driver mutation MFaTs in the total-tumor analysis with position-based filtering.

(PDF)

S1 Appendix. Supplementary materials.

(PDF)

S1 Script. Get_IcgcProteinAltering.py.

(PY)

S1 File. Database_ICGC_temp_PostMax.tsv.gz.

(GZ)

S2 File. Database_COSMIC_temp_PostMax.tsv.gz.

(GZ)

S3 File. Database_CHANG_temp_PostMax.tsv.gz.

(GZ)

S4 File. Driver_TotalGene_database_IntOGen.tsv.

(TSV)

S5 File. Driver_TotalGene_database_CGC.tsv.

(TSV)

S6 File. Driver_TotalGene_database_Tokheim.tsv.

(TSV)

S7 File. Driver_TotalSite_database_IntOGen.tsv.

(TSV)

S8 File. Driver_TotalSite_database_DoCM.tsv.

(TSV)

S9 File. Driver_TotalSite_database_Bailey.tsv.

(TSV)

S10 File. Database_RTCGA_temp_Format.tsv.gz.

(GZ)

S11 File. Database_RTCGA_temp_ProteinAlteringSnv.tsv.gz.

(GZ)

S12 File. Database_RTCGA_temp_ProteinAlteringSnvDoubleton.tsv.gz.

(GZ)

S13 File. Database_RTCGA_temp_PostMaxTotal.tsv.

(TSV)

S14 File. Database_RTCGA_temp_PostMaxType.tsv.

(TSV)

S15 File. Driver_TypeGene_database_IntOGen.tsv.

(TSV)

S16 File. Drivertype_RTCGA-IntOGen_temp_PreRank.tsv.

(TSV)

S17 File. R_analysis_TotalTumor.zip.

(ZIP)

S18 File. R_analysis_TumorTypeSpecific.zip.

(ZIP)

S19 File. Table_NormalizedEap.tsv.

(TSV)

Acknowledgments

NT thanks Takao Suetake for his support and encouragement throughout this research.

Data Availability

Data used in this work are available from the following links. The reformatted files that can be directly processed by our scripts are provided as supplemental files. For details, see S1 Appendix. ICGC: https://icgc.org COSMIC: https://cancer.sanger.ac.uk/cosmic CHANG: https://github.com/taylor-lab/hotspots/blob/master/LINK_TO_MUTATIONAL_DATA RTCGA: https://rtcga.github.io/RTCGA/ IntOGen: https://www.intogen.org/search DoCM: http://docm.info.

Funding Statement

SS received funding (grant number: 17934018) from the National Bioscience Database Center (NBDC) of the Japan Science and Technology Agency (JST)(URL: https://biosciencedbc.jp/en/). The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1.Attolini CSO, Cheng YK, Beroukhim R, Getz G, Abdel-Wahab O, Levine RL, et al. A mathematical framework to determine the temporal sequence of somatic genetic events in cancer. Proceedings of the National Academy of Sciences. 2010;107(41):17604–17609. doi: 10.1073/pnas.1009117107 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Sottoriva A, Kang H, Ma Z, Graham TA, Salomon MP, Zhao J, et al. A Big Bang model of human colorectal tumor growth. Nature genetics. 2015;47(3):209–216. doi: 10.1038/ng.3214 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Frank SA. Genetic predisposition to cancer—insights from population genetics. Nature reviews genetics. 2004;5(10):764–772. doi: 10.1038/nrg1450 [DOI] [PubMed] [Google Scholar]
  • 4.Dobzhansky T. A review of some fundamental concepts and problems of population genetics. In: Cold Spring Harbor Symposia on Quantitative Biology. vol. 20. Citeseer; 1955. p. 1–15. [DOI] [PubMed]
  • 5.Orr HA. The genetic theory of adaptation: a brief history. Nature Reviews Genetics. 2005;6(2):119–127. doi: 10.1038/nrg1523 [DOI] [PubMed] [Google Scholar]
  • 6.Durrett R, Foo J, Leder K. Spatial Moran models, II: cancer initiation in spatially structured tissue. Journal of mathematical biology. 2016;72(5):1369–1400. doi: 10.1007/s00285-015-0912-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Durrett R. Population genetics of neutral mutations in exponentially growing cancer cell populations. The annals of applied probability: an official journal of the Institute of Mathematical Statistics. 2013;23(1):230. doi: 10.1214/11-AAP824 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Charlesworth B. Mutation-selection balance and the evolutionary advantage of sex and recombination. Genetics Research. 1990;55(3):199–221. doi: 10.1017/S0016672300025532 [DOI] [PubMed] [Google Scholar]
  • 9.Chasnov J. Mutation-selection balance, dominance and the maintenance of sex. Genetics. 2000;156(3):1419–1425. doi: 10.1093/genetics/156.3.1419 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Hermisson J, Redner O, Wagner H, Baake E. Mutation–selection balance: ancestry, load, and maximum principle. Theoretical population biology. 2002;62(1):9–46. doi: 10.1006/tpbi.2002.1582 [DOI] [PubMed] [Google Scholar]
  • 11.Desai MM, Fisher DS. Beneficial mutation–selection balance and the effect of linkage on positive selection. Genetics. 2007;176(3):1759–1798. doi: 10.1534/genetics.106.067678 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Phillips PC. Epistasis—the essential role of gene interactions in the structure and evolution of genetic systems. Nature Reviews Genetics. 2008;9(11):855–867. doi: 10.1038/nrg2452 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Nunney L. The population genetics of multistage carcinogenesis. Proceedings of the Royal Society of London Series B: Biological Sciences. 2003;270(1520):1183–1191. doi: 10.1098/rspb.2003.2351 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Persi E, Wolf YI, Horn D, Ruppin E, Demichelis F, Gatenby RA, et al. Mutation–selection balance and compensatory mechanisms in tumour evolution. Nature Reviews Genetics. 2020; p. 1–12. [DOI] [PubMed] [Google Scholar]
  • 15.CHARLESWORTH B. The effect of synergistic epistasis on the inbreeding load. Genetics Research. 1998;71(1):85–89. doi: 10.1017/S0016672398003140 [DOI] [PubMed] [Google Scholar]
  • 16.Zhang XS, Hill WG. Genetic variability under mutation selection balance. Trends in Ecology & Evolution. 2005;20(9):468–470. doi: 10.1016/j.tree.2005.06.010 [DOI] [PubMed] [Google Scholar]
  • 17.Orr HA. The population genetics of beneficial mutations. Philosophical Transactions of the Royal Society B: Biological Sciences. 2010;365(1544):1195–1201. doi: 10.1098/rstb.2009.0282 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Greaves M, Maley CC. Clonal evolution in cancer. Nature. 2012;481(7381):306–313. doi: 10.1038/nature10762 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Hu Z, Sun R, Curtis C. A population genetics perspective on the determinants of intra-tumor heterogeneity. Biochimica et Biophysica Acta (BBA)-Reviews on Cancer. 2017;1867(2):109–126. doi: 10.1016/j.bbcan.2017.03.001 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Niida A, Iwasaki WM, Innan H. Neutral theory in cancer cell population genetics. Molecular biology and evolution. 2018;35(6):1316–1321. doi: 10.1093/molbev/msy091 [DOI] [PubMed] [Google Scholar]
  • 21.Caravagna G, Heide T, Williams MJ, Zapata L, Nichol D, Chkhaidze K, et al. Subclonal reconstruction of tumors by using machine learning and population genetics. Nature Genetics. 2020;52(9):898–907. doi: 10.1038/s41588-020-0675-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Tomasetti C, Marchionni L, Nowak MA, Parmigiani G, Vogelstein B. Only three driver gene mutations are required for the development of lung and colorectal cancers. Proceedings of the National Academy of Sciences. 2015;112(1):118–123. doi: 10.1073/pnas.1421839112 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.McFarland CD, Mirny LA, Korolev KS. Tug-of-war between driver and passenger mutations in cancer and other adaptive processes. Proceedings of the National Academy of Sciences. 2014;111(42):15138–15143. doi: 10.1073/pnas.1404341111 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Lobkovsky AE, Koonin EV. Replaying the tape of life: quantification of the predictability of evolution. Frontiers in genetics. 2012;3:246. doi: 10.3389/fgene.2012.00246 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.De Visser JAG, Krug J. Empirical fitness landscapes and the predictability of evolution. Nature Reviews Genetics. 2014;15(7):480–490. doi: 10.1038/nrg3744 [DOI] [PubMed] [Google Scholar]
  • 26.Gillespie JH. Molecular evolution over the mutational landscape. Evolution. 1984; p. 1116–1129. doi: 10.2307/2408444 [DOI] [PubMed] [Google Scholar]
  • 27.Orr HA. The population genetics of adaptation: the adaptation of DNA sequences. Evolution. 2002;56(7):1317–1330. doi: 10.1554/0014-3820(2002)056[1317:TPGOAT]2.0.CO;2 [DOI] [PubMed] [Google Scholar]
  • 28.Desai MM, Fisher DS, Murray AW. The speed of evolution and maintenance of variation in asexual populations. Current biology. 2007;17(5):385–394. doi: 10.1016/j.cub.2007.01.072 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Sniegowski PD, Gerrish PJ. Beneficial mutations and the dynamics of adaptation in asexual populations. Philosophical Transactions of the Royal Society B: Biological Sciences. 2010;365(1544):1255–1263. doi: 10.1098/rstb.2009.0290 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Unckless RL, Orr HA. The population genetics of adaptation: multiple substitutions on a smooth fitness landscape. Genetics. 2009;183(3):1079–1086. doi: 10.1534/genetics.109.106757 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Gillespie JH. A simple stochastic gene substitution model. Theoretical population biology. 1983;23(2):202–215. doi: 10.1016/0040-5809(83)90014-X [DOI] [PubMed] [Google Scholar]
  • 32.Park SC, Krug J. Clonal interference in large populations. Proceedings of the National Academy of Sciences. 2007;104(46):18135–18140. doi: 10.1073/pnas.0705778104 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Bauer B, Gokhale CS. Repeatability of evolution on epistatic landscapes. Scientific reports. 2015;5(1):1–6. doi: 10.1038/srep09607 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Otto SP. The advantages of segregation and the evolution of sex. Genetics. 2003;164(3):1099–1118. doi: 10.1093/genetics/164.3.1099 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Simmons MJ, Crow JF. Mutations affecting fitness in Drosophila populations. Annual review of genetics. 1977;11(1):49–78. doi: 10.1146/annurev.ge.11.120177.000405 [DOI] [PubMed] [Google Scholar]
  • 36.Gillespie JH. Substitution processes in molecular evolution. I. Uniform and clustered substitutions in a haploid model. Genetics. 1993;134(3):971–981. doi: 10.1093/genetics/134.3.971 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Gout S, Huot J. Role of cancer microenvironment in metastasis: focus on colon cancer. Cancer Microenvironment. 2008;1(1):69–83. doi: 10.1007/s12307-008-0007-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Nagarsheth N, Wicha MS, Zou W. Chemokines in the cancer microenvironment and their relevance in cancer immunotherapy. Nature Reviews Immunology. 2017;17(9):559. doi: 10.1038/nri.2017.49 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Yun J, Rago C, Cheong I, Pagliarini R, Angenendt P, Rajagopalan H, et al. Glucose deprivation contributes to the development of KRAS pathway mutations in tumor cells. Science. 2009;325(5947):1555–1559. doi: 10.1126/science.1174229 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Kareva I. Biological stoichiometry in tumor micro-environments. PloS one. 2013;8(1):e51844. doi: 10.1371/journal.pone.0051844 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Zapata L, Pich O, Serrano L, Kondrashov FA, Ossowski S, Schaefer MH. Negative selection in tumor genome evolution acts on essential cellular functions and the immunopeptidome. Genome biology. 2018;19(1):1–17. doi: 10.1186/s13059-018-1434-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Bhandari V, Hoey C, Liu LY, Lalonde E, Ray J, Livingstone J, et al. Molecular landmarks of tumor hypoxia across cancer types. Nature genetics. 2019;51(2):308–318. doi: 10.1038/s41588-018-0318-2 [DOI] [PubMed] [Google Scholar]
  • 43.Somarelli JA, Gardner H, Cannataro VL, Gunady EF, Boddy AM, Johnson NA, et al. Molecular biology and evolution of cancer: from discovery to action. Molecular biology and evolution. 2020;37(2):320–326. doi: 10.1093/molbev/msz242 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Bozic I, Antal T, Ohtsuki H, Carter H, Kim D, Chen S, et al. Accumulation of driver and passenger mutations during tumor progression. Proceedings of the National Academy of Sciences. 2010;107(43):18545–18550. doi: 10.1073/pnas.1010978107 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Guerrero I, Villasante A, Corces V, Pellicer A. Loss of the normal N-ras allele in a mouse thymic lymphoma induced by a chemical carcinogen. Proceedings of the National Academy of Sciences. 1985;82(23):7810–7814. doi: 10.1073/pnas.82.23.7810 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Knudson AG. Mutation and cancer: statistical study of retinoblastoma. Proceedings of the National Academy of Sciences. 1971;68(4):820–823. doi: 10.1073/pnas.68.4.820 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Iwasa Y, Michor F, Komarova NL, Nowak MA. Population genetics of tumor suppressor genes. Journal of theoretical biology. 2005;233(1):15–23. doi: 10.1016/j.jtbi.2004.09.001 [DOI] [PubMed] [Google Scholar]
  • 48.Hino O, Kobayashi T. Mourning Dr. Alfred G. Knudson: the two-hit hypothesis, tumor suppressor genes, and the tuberous sclerosis complex. Cancer science. 2017;108(1):5–11. doi: 10.1111/cas.13116 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Lengauer C, Kinzler KW, Vogelstein B. Genetic instability in colorectal cancers. Nature. 1997;386(6625):623–627. doi: 10.1038/386623a0 [DOI] [PubMed] [Google Scholar]
  • 50.Michor F, Iwasa Y, Vogelstein B, Lengauer C, Nowak MA. Can chromosomal instability initiate tumorigenesis? In: Seminars in cancer biology. vol. 15. Elsevier; 2005. p. 43–49. [DOI] [PubMed] [Google Scholar]
  • 51.Geigl JB, Obenauf AC, Schwarzbraun T, Speicher MR. Defining ‘chromosomal instability’. Trends in Genetics. 2008;24(2):64–69. doi: 10.1016/j.tig.2007.11.006 [DOI] [PubMed] [Google Scholar]
  • 52.Bakhoum SF, Compton DA, et al. Chromosomal instability and cancer: a complex relationship with therapeutic potential. The Journal of clinical investigation. 2012;122(4):1138–1143. doi: 10.1172/JCI59954 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Jones PA, Baylin SB. The fundamental role of epigenetic events in cancer. Nature reviews genetics. 2002;3(6):415–428. doi: 10.1038/nrg816 [DOI] [PubMed] [Google Scholar]
  • 54.Zack TI, Schumacher SE, Carter SL, Cherniack AD, Saksena G, Tabak B, et al. Pan-cancer patterns of somatic copy number alteration. Nature genetics. 2013;45(10):1134–1140. doi: 10.1038/ng.2760 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55.Vázquez-García I, Salinas F, Li J, Fischer A, Barré B, Hallin J, et al. Clonal heterogeneity influences the fate of new adaptive mutations. Cell reports. 2017;21(3):732–744. doi: 10.1016/j.celrep.2017.09.046 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56.Naylor S, Johnson B, Minna J, Sakaguchi A. Loss of heterozygosity of chromosome 3p markers in small-cell lung cancer. Nature. 1987;329(6138):451–454. doi: 10.1038/329451a0 [DOI] [PubMed] [Google Scholar]
  • 57.Ryland GL, Doyle MA, Goode D, Boyle SE, Choong DY, Rowley SM, et al. Loss of heterozygosity: what is it good for? BMC medical genomics. 2015;8(1):1–12. doi: 10.1186/s12920-015-0123-z [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58.Nowell PC. The clonal evolution of tumor cell populations. Science. 1976;194(4260):23–28. doi: 10.1126/science.959840 [DOI] [PubMed] [Google Scholar]
  • 59.Jain K, Krug J, Park SC. Evolutionary advantage of small populations on complex fitness landscapes. Evolution: International Journal of Organic Evolution. 2011;65(7):1945–1955. doi: 10.1111/j.1558-5646.2011.01280.x [DOI] [PubMed] [Google Scholar]
  • 60.Zhao B, Hemann MT, Lauffenburger DA. Modeling tumor clonal evolution for drug combinations design. Trends in cancer. 2016;2(3):144–158. doi: 10.1016/j.trecan.2016.02.001 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61.Szendro IG, Franke J, de Visser JAG, Krug J. Predictability of evolution depends nonmonotonically on population size. Proceedings of the National Academy of Sciences. 2013;110(2):571–576. doi: 10.1073/pnas.1213613110 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62.Rozen DE, Habets MG, Handel A, de Visser JAG. Heterogeneous adaptive trajectories of small populations on complex fitness landscapes. PLoS One. 2008;3(3):e1715. doi: 10.1371/journal.pone.0001715 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63.Weissman DB, Desai MM, Fisher DS, Feldman MW. The rate at which asexual populations cross fitness valleys. Theoretical population biology. 2009;75(4):286–300. doi: 10.1016/j.tpb.2009.02.006 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 64.Iwasa Y, Michor F, Nowak MA. Stochastic tunnels in evolutionary dynamics. Genetics. 2004;166(3):1571–1579. doi: 10.1534/genetics.166.3.1571 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 65.Woods RJ, Barrick JE, Cooper TF, Shrestha U, Kauth MR, Lenski RE. Second-order selection for evolvability in a large Escherichia coli population. Science. 2011;331(6023):1433–1436. doi: 10.1126/science.1198914 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66.Svensson EI, Berger D. The role of mutation bias in adaptive evolution. Trends in ecology & evolution. 2019;34(5):422–434. doi: 10.1016/j.tree.2019.01.015 [DOI] [PubMed] [Google Scholar]
  • 67.Good BH, Desai MM. The impact of macroscopic epistasis on long-term evolutionary dynamics. Genetics. 2015;199(1):177–190. doi: 10.1534/genetics.114.172460 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 68.Bauer B, Siebert R, Traulsen A. Cancer initiation with epistatic interactions between driver and passenger mutations. Journal of Theoretical Biology. 2014;358:52–60. doi: 10.1016/j.jtbi.2014.05.018 [DOI] [PubMed] [Google Scholar]
  • 69.van de Haar J, Canisius S, Michael KY, Voest EE, Wessels LF, Ideker T. Identifying epistasis in cancer genomes: a delicate affair. Cell. 2019;177(6):1375–1383. doi: 10.1016/j.cell.2019.05.005 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70.Poelwijk FJ, Kiviet DJ, Weinreich DM, Tans SJ. Empirical fitness landscapes reveal accessible evolutionary paths. Nature. 2007;445(7126):383–386. doi: 10.1038/nature05451 [DOI] [PubMed] [Google Scholar]
  • 71.Loewe L, Hill WG. Introduction: The population genetics of mutations: good, bad and indifferent. Philosophical Transactions: Biological Sciences. 2010; p. 1153–1167. doi: 10.1098/rstb.2009.0317 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 72.Regoes RR. Population genetics meets cancer genomics. Proceedings of the National Academy of Sciences. 2010;107(43):18241–18242. doi: 10.1073/pnas.1013177107 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 73.Joyce P, Rokyta DR, Beisel CJ, Orr HA. A general extreme value theory model for the adaptation of DNA sequences under strong selection and weak mutation. Genetics. 2008;180(3):1627–1643. doi: 10.1534/genetics.108.088716 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 74.John S, Seetharaman S. Exploiting the adaptation dynamics to predict the distribution of beneficial fitness effects. PloS one. 2016;11(3):e0151795. doi: 10.1371/journal.pone.0151795 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 75.Calabrese P, Tavaré S, Shibata D. Pretumor progression: clonal evolution of human stem cell populations. The American journal of pathology. 2004;164(4):1337–1346. doi: 10.1016/S0002-9440(10)63220-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 76.Williams MJ, Werner B, Barnes CP, Graham TA, Sottoriva A. Identification of neutral tumor evolution across cancer types. Nature genetics. 2016;48(3):238–244. doi: 10.1038/ng.3489 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 77.Williams MJ, Werner B, Heide T, Curtis C, Barnes CP, Sottoriva A, et al. Quantification of subclonal selection in cancer from bulk sequencing data. Nature genetics. 2018;50(6):895–903. doi: 10.1038/s41588-018-0128-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 78.Consortium ICG, et al. International network of cancer genome projects. Nature. 2010;464(7291):993. doi: 10.1038/nature08987 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 79.Forbes SA, Beare D, Boutselakis H, Bamford S, Bindal N, Tate J, et al. COSMIC: somatic cancer genetics at high-resolution. Nucleic acids research. 2017;45(D1):D777–D783. doi: 10.1093/nar/gkw1121 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 80.Chang MT, Asthana S, Gao SP, Lee BH, Chapman JS, Kandoth C, et al. Identifying recurrent mutations in cancer reveals widespread lineage diversity and mutational specificity. Nature biotechnology. 2016;34(2):155–163. doi: 10.1038/nbt.3391 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 81.Rubio-Perez C, Tamborero D, Schroeder MP, Antolín AA, Deu-Pons J, Perez-Llamas C, et al. In silico prescription of anticancer drugs to cohorts of 28 tumor types reveals targeting opportunities. Cancer cell. 2015;27(3):382–396. doi: 10.1016/j.ccell.2015.02.007 [DOI] [PubMed] [Google Scholar]
  • 82.Tokheim CJ, Papadopoulos N, Kinzler KW, Vogelstein B, Karchin R. Evaluating the evaluation of cancer driver genes. Proceedings of the National Academy of Sciences. 2016;113(50):14330–14335. doi: 10.1073/pnas.1616440113 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 83.Tamborero D, Rubio-Perez C, Deu-Pons J, Schroeder MP, Vivancos A, Rovira A, et al. Cancer Genome Interpreter annotates the biological and clinical relevance of tumor alterations. Genome medicine. 2018;10(1):25. doi: 10.1186/s13073-018-0531-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 84.Ainscough BJ, Griffith M, Coffman AC, Wagner AH, Kunisaki J, Choudhary MN, et al. DoCM: a database of curated mutations in cancer. Nature methods. 2016;13(10):806–807. doi: 10.1038/nmeth.4000 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 85.Bailey MH, Tokheim C, Porta-Pardo E, Sengupta S, Bertrand D, Weerasinghe A, et al. Comprehensive characterization of cancer driver genes and mutations. Cell. 2018;173(2):371–385. doi: 10.1016/j.cell.2018.02.060 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 86.Cherkasov A, Jones SJ. Structural characterization of genomes by large scale sequence-structure threading. BMC bioinformatics. 2004;5(1):1–16. doi: 10.1186/1471-2105-5-101 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 87.Kebebew E, Weng J, Bauer J, Ranvier G, Clark OH, Duh QY, et al. The prevalence and prognostic value of BRAF mutation in thyroid cancer. Annals of surgery. 2007;246(3):466. doi: 10.1097/SLA.0b013e318148563d [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 88.Kim TH, Park YJ, Lim JA, Ahn HY, Lee EK, Lee YJ, et al. The association of the BRAFV600E mutation with prognostic factors and poor clinical outcome in papillary thyroid cancer: A meta-analysis. Cancer. 2012;118(7):1764–1773. doi: 10.1002/cncr.26500 [DOI] [PubMed] [Google Scholar]
  • 89.Xing M, Alzahrani AS, Carson KA, Viola D, Elisei R, Bendlova B, et al. Association between BRAF V600E mutation and mortality in patients with papillary thyroid cancer. Jama. 2013;309(14):1493–1501. doi: 10.1001/jama.2013.3190 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 90.Si L, Kong Y, Xu X, Flaherty KT, Sheng X, Cui C, et al. Prevalence of BRAF V600E mutation in Chinese melanoma patients: large scale analysis of BRAF and NRAS mutations in a 432-case cohort. European journal of cancer. 2012;48(1):94–100. doi: 10.1016/j.ejca.2011.06.056 [DOI] [PubMed] [Google Scholar]
  • 91.Dankort D, Curley DP, Cartlidge RA, Nelson B, Karnezis AN, Damsky WE Jr, et al. Braf V600E cooperates with Pten loss to induce metastatic melanoma. Nature genetics. 2009;41(5):544–552. doi: 10.1038/ng.356 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 92.Paik PK, Arcila ME, Fara M, Sima CS, Miller VA, Kris MG, et al. Clinical characteristics of patients with lung adenocarcinomas harboring BRAF mutations. Journal of clinical oncology. 2011;29(15):2046. doi: 10.1200/JCO.2010.33.1280 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 93.Paixão T, Barton NH. The effect of gene interactions on the long-term response to selection. Proceedings of the National Academy of Sciences. 2016;113(16):4422–4427. doi: 10.1073/pnas.1518830113 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 94.Weinreich DM, Delaney NF, DePristo MA, Hartl DL. Darwinian evolution can follow only very few mutational paths to fitter proteins. science. 2006;312(5770):111–114. doi: 10.1126/science.1123539 [DOI] [PubMed] [Google Scholar]
  • 95.Hanahan D, Weinberg RA. Hallmarks of cancer: the next generation. cell. 2011;144(5):646–674. doi: 10.1016/j.cell.2011.02.013 [DOI] [PubMed] [Google Scholar]
  • 96.Beisel CJ, Rokyta DR, Wichman HA, Joyce P. Testing the extreme value domain of attraction for distributions of beneficial fitness effects. Genetics. 2007;176(4):2441–2449. doi: 10.1534/genetics.106.068585 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 97.Orr HA. The distribution of fitness effects among beneficial mutations. Genetics. 2003;163(4):1519–1526. doi: 10.1093/genetics/163.4.1519 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 98.Orr HA. The distribution of fitness effects among beneficial mutations in Fisher’s geometric model of adaptation. Journal of theoretical biology. 2006;238(2):279–285. doi: 10.1016/j.jtbi.2005.05.001 [DOI] [PubMed] [Google Scholar]
  • 99.Rokyta DR, Joyce P, Caudle SB, Wichman HA. An empirical test of the mutational landscape model of adaptation using a single-stranded DNA virus. Nature genetics. 2005;37(4):441–444. doi: 10.1038/ng1535 [DOI] [PubMed] [Google Scholar]
  • 100.Ferris MT, Joyce P, Burch CL. High frequency of mutations that expand the host range of an RNA virus. Genetics. 2007;176(2):1013–1022. doi: 10.1534/genetics.106.064634 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 101.Rokyta DR, Beisel CJ, Joyce P, Ferris MT, Burch CL, Wichman HA. Beneficial fitness effects are not exponential for two viruses. Journal of molecular evolution. 2008;67(4):368. doi: 10.1007/s00239-008-9153-x [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 102.Chao L. Fitness of RNA virus decreased by Muller’s ratchet. Nature. 1990;348(6300):454–455. doi: 10.1038/348454a0 [DOI] [PubMed] [Google Scholar]
  • 103.Rokyta D, Badgett M, Molineux I, Bull J. Experimental genomic evolution: extensive compensation for loss of DNA ligase activity in a virus. Molecular Biology and Evolution. 2002;19(3):230–238. doi: 10.1093/oxfordjournals.molbev.a004076 [DOI] [PubMed] [Google Scholar]
  • 104.Rozen DE, De Visser JAG, Gerrish PJ. Fitness effects of fixed beneficial mutations in microbial populations. Current biology. 2002;12(12):1040–1045. doi: 10.1016/S0960-9822(02)00896-5 [DOI] [PubMed] [Google Scholar]

Decision Letter 0

Zechen Chong

23 Dec 2020

PONE-D-20-36757

Extreme value theory as a general framework for understanding mutation frequency distribution in cancer genomes

PLOS ONE

Dear Dr. TOKUTOMI,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please check the reviewers' comments and address all the issues. Otherwise, the manuscript will not be considered for publication. The main issues include the readability of the manuscript, lack of referring to current research, and insufficient biological insights besides other issues. 

Please submit your revised manuscript by 1/31/2021. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Zechen Chong

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: I Don't Know

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: No

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors present a framework based on extreme value theory to model cancer mutation frequencies. They demonstrate that this model explains the distributions observed in several large cancer genomic datasets.

1. The main issue is the insufficient framing of the study within existing work, beginning with the abstract claiming that "Currently, there is no recognized population genetics framework describing the population dynamics of cancer cells that is applicable to real cancer genome data." The meaning of the last part of the sentence is unclear to me, but in any case, there have been many studies analyzing cancer mutations under population genetics frameworks. For example:

Iwasa, Y., Michor, F., Komarova, N.L. and Nowak, M.A., 2005. Population genetics of tumor suppressor genes. Journal of theoretical biology, 233(1), pp.15-23.

Attolini, C.S.O., Cheng, Y.K., Beroukhim, R., Getz, G., Abdel-Wahab, O., Levine, R.L., Mellinghoff, I.K. and Michor, F., 2010. A mathematical framework to determine the temporal sequence of somatic genetic events in cancer. Proceedings of the National Academy of Sciences, 107(41), pp.17604-17609.

Durrett, R. Population genetics of neutral mutations in exponentially growing cancer cell populations. Ann. Appl. Probabil. 23, 230–250 (2013).

Hu, Z., Sun, R. and Curtis, C., 2017. A population genetics perspective on the determinants of intra-tumor heterogeneity. Biochimica et Biophysica Acta (BBA)-Reviews on Cancer, 1867(2), pp.109-126.

Niida, A., Iwasaki, W.M. and Innan, H., 2018. Neutral theory in cancer cell population genetics. Molecular biology and evolution, 35(6), pp.1316-1321.

Caravagna, G., Heide, T., Williams, M. J., Zapata, L., Nichol, D., Chkhaidze, K., ... & Chesler, L. (2020). Subclonal reconstruction of tumors by using machine learning and population genetics. Nature Genetics, 52(9), 898-907.

The Introduction cites studies in reference to elements of the proposed framework, but omits much research that seems similar to the study as a whole. And in regards to extreme value theory, a paper with important background on EVT applied to beneficial mutations is listed in the references (#41) but doesn't seem to actually be cited in the manuscript.

Without such connections to related existing research, it is unclear whether the framework is novel or replicates existing methods.

2. MFaT is defined in two subsections ("The Definition and Calculation of MFaT" in Methods and "The Definition of MFaT" in Results). This seems excessive, especially since it gives the impression of a novel metric, even though the fraction of tumors with a particular mutation is a very common statistic in cancer genomics studies.

3. Page 17: "In the field of population genetics, a lot of effort has been devoted solely to monoploid organisms. To translate this knowledge to population genetics of cancer cells, we need an additional assumption for the fitness of heterozygotes: the fitness of a heterozygote is exactly intermediate between the two associated homozygotes." This seems to imply that diploid organisms have been insufficiently studied in population genetics, when in fact, much of population genetics has been devoted to the study of diploid organisms (especially humans), with dominance being a crucial phenomenon in the models. Similarly, the assumption of heterozygote fitness being exactly intermediate is rarely true for cancer mutations.

4. Page 17/18: The framework assumes there is no genomic epistasis, but important cancer mutations often exhibit epistasis, for example between mutations with redundant effects in the same pathway. More justification should be given that this assumption is not overly violated in practice.

Minor points:

5. Gene names should be italicized.

6. Line 614, caner -> cancer

Reviewer #2: The manuscript by Tokutomi et al uses extreme value theory (EVT) to help explain driver mutation frequencies in large public cancer datasets. The paper is interesting, but has several shortcomings. This reviewer is more qualified to comment on the more biological aspects of this manuscript.

1) The manuscript is hard to understand, especially for a more general audience of PLOS ONE, versus a more specialized journal. In particular, the “model” is not very well-described. The data appear to be mutation frequencies, where a limited number of driver mutations (in aggregate) have high mutation frequencies (MFaT) but most driver mutations have very low mutation frequencies. This phenomenon is well-described and well-known. The manuscript could better describe, perhaps with a cartoon exactly what is being modeled biologically. Otherwise, it appears to be mainly a curve-fitting type of exercise, which is interesting but of uncertain significance.

2) The data do not adjust for sample purity, which will reinforce the low frequency driver mutations

3) It seems unlikely that “all” cancers follow the same rules or even have the same driver mutations. Yet the manuscript combines multiple types of cancer. It is uncertain if the analysis works better for individual cancer type, or if the analysis works better when combining all the data (ie as in Figure 1). It would be helpful if some of the peaks with high MFat (TP53?) could be identified in Fig1 A.

4) Along these lines, it is unclear if different mutations in the same gene but at different nucleotides are counted as a “driver” mutation. (Does TP53 count as a single driver or can it be different drivers?---this reviewer has trouble with the description on page 4)

5) It would be interesting or perhaps informative if the authors also looked at passenger mutations (such as TTN) to check if the distributions do or do not follow an EVT type of distribution. This is optional, but could be interesting because of the uncertainty of whether many low frequency “driver” mutations are in fact driver mutations.

6) The paper concludes that EVT and the analysis can help estimate the fitness of beneficial mutations. It would be useful if the authors provide a list of beneficial mutations and their estimated fitness from their analysis (if possible).

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2021 Aug 23;16(8):e0243595. doi: 10.1371/journal.pone.0243595.r002

Author response to Decision Letter 0


8 Mar 2021

First, we wish to thank reviewers for providing critical insights and many good suggestions. We revised our manuscript based on these suggestions. We think our manuscript has been improved in terms of structure, clarity, correctness, and connection to existing research.

The point-to-point response:

Reviewer #1:

Point #1-1: Insufficient framing and connection.

The main issue is the insufficient framing of the study within existing work, beginning with the abstract claiming that "Currently, there is no recognized population genetics framework describing the population dynamics of cancer cells that is applicable to real cancer genome data." The meaning of the last part of the sentence is unclear to me, but in any case, there have been many studies analyzing cancer mutations under population genetics frameworks. For example:

Iwasa, Y., Michor, F., Komarova, N.L. and Nowak, M.A., 2005. Population genetics of tumor suppressor genes. Journal of theoretical biology, 233(1), pp.15-23.

Attolini, C.S.O., Cheng, Y.K., Beroukhim, R., Getz, G., Abdel-Wahab, O., Levine, R.L., Mellinghoff, I.K. and Michor, F., 2010. A mathematical framework to determine the temporal sequence of somatic genetic events in cancer. Proceedings of the National Academy of Sciences, 107(41), pp.17604-17609.

Durrett, R. Population genetics of neutral mutations in exponentially growing cancer cell populations. Ann. Appl. Probabil. 23, 230–250 (2013).

Hu, Z., Sun, R. and Curtis, C., 2017. A population genetics perspective on the determinants of intra-tumor heterogeneity. Biochimica et Biophysica Acta (BBA)-Reviews on Cancer, 1867(2), pp.109-126.

Niida, A., Iwasaki, W.M. and Innan, H., 2018. Neutral theory in cancer cell population genetics. Molecular biology and evolution, 35(6), pp.1316-1321.

Caravagna, G., Heide, T., Williams, M. J., Zapata, L., Nichol, D., Chkhaidze, K., ... & Chesler, L. (2020). Subclonal reconstruction of tumors by using machine learning and population genetics. Nature Genetics, 52(9), 898-907.

The Introduction cites studies in reference to elements of the proposed framework, but omits much research that seems similar to the study as a whole. And in regards to extreme value theory, a paper with important background on EVT applied to beneficial mutations is listed in the references (#41) but doesn't seem to actually be cited in the manuscript.

Without such connections to related existing research, it is unclear whether the framework is novel or replicates existing methods.

Response to the Point #1-1:

We rewrote and re-organized the Introduction section in order to clarify the framing of our work within antecedent studies.

More than 50 new reference papers including six papers that the reviewer kindly referred to are cited in the Introduction section.

The EVT background paper (formerly reference #41, in this revised manuscript #20, Orr HA, 2010) is cited twice (L29 and L106).

In addition, we included the following elements in the Introduction to provide sufficient connection to existing research.

An important population genetics model (the mutation-selection balance; MSB) that seems similar to the strong selection and weak mutation (SSWM) model in context of cancer evolution (L23-L44).

SSWM model applications to cancer evolution after the clonal expansion (L51-L63).

Knudson's Two Hit hypothesis (L19-L22).

Mode of somatic cell division (L139-L151).

Point #1-2: MFaT is doubly described.

MFaT is defined in two subsections ("The Definition and Calculation of MFaT" in Methods and "The Definition of MFaT" in Results). This seems excessive, especially since it gives the impression of a novel metric, even though the fraction of tumors with a particular mutation is a very common statistic in cancer genomics studies.

Response to the Point #1-2:

We removed one MFaT definition from the Results section. The related description is merged to the MFaT definition in the Materials and Methods section (L312-L335). Although the fraction of tumors with a particular mutation is a common statistic, we need to avoid confusion with another common statistic, VAF. To achieve this, explanatory paragraphs are added to the Introduction section (L232-L247).

Point #1-3: Limited description of diploid genetics.

Page 17: "In the field of population genetics, a lot of effort has been devoted solely to monoploid organisms. To translate this knowledge to population genetics of cancer cells, we need an additional assumption for the fitness of heterozygotes: the fitness of a heterozygote is exactly intermediate between the two associated homozygotes." This seems to imply that diploid organisms have been insufficiently studied in population genetics, when in fact, much of population genetics has been devoted to the study of diploid organisms (especially humans), with dominance being a crucial phenomenon in the models. Similarly, the assumption of heterozygote fitness being exactly intermediate is rarely true for cancer mutations.

Response to the Point #1-3:

We clarified the related assumption, approximation and condition. The item "The Additivity Assumption" is renamed to "The Additive Dominance Assumption" to clarify the extent to which the assumption remains approximately valid (L653-L666). Mutations in oncogenes and tumor suppressor genes are treated as exceptions (L660-L666). In addition, the dominance coefficient in relation to cancer biology is discussed in the Introduction section (L29, L42, and L97-L151).

Point #1-4: Lack of justification on no genomic epistasis.

Page 17/18: The framework assumes there is no genomic epistasis, but important cancer mutations often exhibit epistasis, for example between mutations with redundant effects in the same pathway. More justification should be given that this assumption is not overly violated in practice.

Response to the Point #1-4: An item is added.

A new item "The No Epistasis Assumption" is added to the Discussion section (L667-L675). In the item, the complexity of tumor phenotype and its associated pathways are discussed (L670-L675).

Point #1-5: Italicizing gene symbols

Gene names should be italicized.

Response to the Point #1-5:

We italicized gene names.

Point #1-6: Spell missing (caner -> cancer)

Line 614, caner -> cancer

Response to the Point #1-6:

We corrected the misspelling.

Reviewer #2:

Point #2-1: Cartoon of the model.

The manuscript is hard to understand, especially for a more general audience of PLOS ONE, versus a more specialized journal. In particular, the “model” is not very well-described. The data appear to be mutation frequencies, where a limited number of driver mutations (in aggregate) have high mutation frequencies (MFaT) but most driver mutations have very low mutation frequencies. This phenomenon is well-described and well-known. The manuscript could better describe, perhaps with a cartoon exactly what is being modeled biologically. Otherwise, it appears to be mainly a curve-fitting type of exercise, which is interesting but of uncertain significance.

Response to the Point #2-1: A new figure is added.

A new figure with four panels including cartoons is added so as to clarify our model (Figure 1). And this figure was referred to several times in the Introduction (L67, L102-L104, and L130) section. At the same time, we reorganized and rewrote the Introduction section so as to clarify our assumptions for the models as well as the model itself. More than 50 new reference papers cited in the Introduction section in order to clarify the framing of our work within antecedent studies. In addition, we included the following elements in the Introduction to provide sufficient connection to existing research.

An important population genetics model (the mutation-selection balance; MSB) that seems similar to the strong selection and weak mutation (SSWM) model in context of cancer evolution (L23-L44).

SSWM model applications to cancer evolution after the clonal expansion (L51-L63).

Knudson's Two Hit hypothesis (L19-L22).

Mode of somatic cell division (L139-L151).

Point #2-2: Lack of reference to adjustment of sample purity.

The data do not adjust for sample purity, which will reinforce the low frequency driver mutations

Response to the Point #2-2: Reference added.

As the reviewer implied, large-scale genomics datasets have different levels of sample purity. Therefore, we restricted ourselves to separated analysis within each dataset. Adjusting datasets each containing tens of thousands of specimens for sample purity is often computationally impractical, leading to a shared agreement that referring to the database name in combination with an original filtering method is sufficient. For example:

Wong, Wing Chung, et al. "CHASM and SNVBox: toolkit for detecting biologically important single nucleotide mutations in cancer." Bioinformatics 27.15 (2011): 2147-2148.

Mao, Yong, et al. "CanDrA: cancer-specific driver missense mutation annotation with optimized features." PloS one 8.10 (2013): e77945.

Cisneros, Luis, et al. "Ancient genes establish stress-induced mutation as a hallmark of cancer." PLoS One 12.4 (2017): e0176258.

However, to ensure consistency of the sample purity, it is widely accepted to refer to the original paper of such a dataset in which mutation filtering criteria are described. Thus, we referred to such literatures in the "Data Processing" item in the Materials and Methods section (L294-L296).

Point #2-3: Unclear graph representation.

It seems unlikely that “all” cancers follow the same rules or even have the same driver mutations. Yet the manuscript combines multiple types of cancer. It is uncertain if the analysis works better for individual cancer type, or if the analysis works better when combining all the data (ie as in Figure 1). It would be helpful if some of the peaks with high MFat (TP53?) could be identified in Fig1 A.

Response to the Point #2-3:

With sincere respect to the points that the reviewer made, we improved appearance of the graphs. In all density plots of total-tumor MFaTs, mutations of the highest MFaTs are indicated (Figs 2A, 3A, and 4A).

Point #2-4: Lack of clarity and troublesome descriptions.

Along these lines, it is unclear if different mutations in the same gene but at different nucleotides are counted as a “driver” mutation. (Does TP53 count as a single driver or can it be different drivers?---this reviewer has trouble with the description on page 4)

Response to the Point #2-4:

To address this issue, we replaced old text and added a new item. The original contents of the item "Data Processing" is now moved to S1 Appendix and newer text is supplied in the item (L293-L310). In addition, a new item "Study Design" is added to clarify the method of counting driver mutations (L249-L262).

Point #2-5: Low frequency potential driver mutations.

It would be interesting or perhaps informative if the authors also looked at passenger mutations (such as TTN) to check if the distributions do or do not follow an EVT type of distribution. This is optional, but could be interesting because of the uncertainty of whether many low frequency “driver” mutations are in fact driver mutations.

Response to the Point #2-5:

The suggestion is remarkably interesting. But, in current manuscript, we wish to concentrate on establishing and presenting the EVT framework. Also, we feel one needs some careful considerations involving effect sizes of mutations (Some of our assumptions hold exclusively for driver genes). We hope to pursue this passenger mutation problem as a separate study in future. Thank you very much for your great suggestion.

Point #2-6: The table of calculated fitness effects.

The paper concludes that EVT and the analysis can help estimate the fitness of beneficial mutations. It would be useful if the authors provide a list of beneficial mutations and their estimated fitness from their analysis (if possible).

Response to the Point #2-6:

We added tables of estimated driver fitness effects for each cancer types in S1 Appendix (Appendix Tables 1-8). In addition, the single table for these tumor types is included in the Supplementary Files (S19 File table NormalizedEap.tsv). We hope these tables will be useful.

Attachment

Submitted filename: rebuttal_Review01_v14.pdf

Decision Letter 1

Zechen Chong

19 Apr 2021

PONE-D-20-36757R1

Extreme value theory as a framework for understanding mutation frequency distribution in cancer genomes

PLOS ONE

Dear Dr. Tokutomi,

Thank you for resubmitting your manuscript to PLOS ONE. The revision has been carefully reviewed by the original reviewers. However, there are still some issues have not been addressed. Therefore, we invited another reviewer to further judge. Please take seriously about the reviewers' comments and address all the issues sufficiently. 

Please submit your revised manuscript by Jun 03 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Zechen Chong

Academic Editor

PLOS ONE

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

Reviewer #3: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Partly

Reviewer #3: Partly

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: I Don't Know

Reviewer #3: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: No

Reviewer #3: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: (No Response)

Reviewer #2: The manuscript continues to be very difficult to understand. The concepts and the logic seem insufficient for a more general readership that is seen with PLOS ONE. The added "cartoon" (Fig 1) illustrates general principles, but does not seem to allow a reader to "see" how extreme value theory is being applied in the paper.

Reviewer #3: This is an interesting paper that applies the extreme value theory to model the mutation frequency distribution in the cancer genome. The authors showed that the driver mutation count distribution can be well fitted by the Frechet-type extreme distribution across varying cancer genomic datasets (e.g. TCGA, ICGC, etc). Based on the model fitting, the authors concluded that early tumor evolution (pre-clonal expansion) likely follows the classic strong selection and weak mutation (SSWM) regime in population genetics theory.

1) It’s not clear to me regarding the relationship between SSWM model and the extreme value distribution. Can the authors provide a mathematical prove that the SSWM regime of tumor evolution will give rise to Frechet-type extreme value distribution for driver mutation number? Or others have proven this?

2) I don’t understand why the authors focused on specific site of driver genes? It seems gene-level mutation distribution fits the Frechet-type distribution better than site-level distribution (Fig 2 and 3). I suggest the authors to show gene-level mutation distribution fitting in main figures and site-level mutation distribution in supplementary data. Also, Fig 2a and 3a are both site-level fitting, why it was said Fig 2 is on gene level while Fig 3 is site level.

3) Is the dominance h (Fig. 1D) important in the mathematical analysis? I didn’t see any information regarding h in results.

4) The introduction section is too long and should be cutted and many of the information is misleading.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2021 Aug 23;16(8):e0243595. doi: 10.1371/journal.pone.0243595.r004

Author response to Decision Letter 1


22 Jul 2021

Reviewer #1: (No Response)

Reviewer #2: The manuscript continues to be very difficult to understand. The concepts and the logic seem insufficient for a more general readership that is seen with PLOS ONE. The added "cartoon" (Fig 1) illustrates general principles, but does not seem to allow a reader to "see" how extreme value theory is being applied in the paper.

We completely changed Fig 1 to clarify not only the medical and cell-biological aspects of our study, but also the point where extreme value theory (EVT) is applied to the cellular fitness landscape (Fig 1A, B, D). We also made big modification in the Introduction part. In the first section (The “Big Bang” Model of Cancer Development and Population Genetics of Cancer Cells) in Introduction, the concepts and the logic are made more straightforward for the reader's ease of understanding. We wish these changes, in combination with the schematic S1FigC in Supporting information, may allow a reader to "see" the way by which EVT is applied in the paper.

Reviewer #3: This is an interesting paper that applies the extreme value theory to model the mutation frequency distribution in the cancer genome. The authors showed that the driver mutation count distribution can be well fitted by the Frechet-type extreme distribution across varying cancer genomic datasets (e.g. TCGA, ICGC, etc). Based on the model fitting, the authors concluded that early tumor evolution (pre-clonal expansion) likely follows the classic strong selection and weak mutation (SSWM) regime in population genetics theory.

1) It’s not clear to me regarding the relationship between SSWM model and the extreme value distribution. Can the authors provide a mathematical prove that the SSWM regime of tumor evolution will give rise to Frechet-type extreme value distribution for driver mutation number? Or others have proven this?

There is no direct relationship between the SSWM model and the extreme value distribution. However, the SSWM model is important here as a prerequisite for an assumption that the adaptation process is Markov process. This allows us to assume that only mutations with fitness effects which is the maximum or above a certain threshold will fix in each adaptation step. The "above the threshold" model gives rise to Generalized Pareto Distribution (GPD); and the "maxima" model gives rise to Generalized Extreme Value distribution (GEV).

More precisely, Joyce et al (Joyce et al, 2008) demonstrated that the fitness landscape of adaptation of DNA sequence would follow GPD under the assumption of the SSWM and the "above the threshold" model. In this paper, we used the "block maxima" model because it fitted better with the "Big Bang" model of the cancer evolution. The Fisher-Tippett-Gnedenko theorem proves that the maxima from independent and identically distributed random variables will follow GEV distributions.

(Joyce P et al, 2008): Joyce, Paul, et al. "A general extreme value theory model for the adaptation of DNA sequences under strong selection and weak mutation." Genetics 180.3 (2008): 1627-1643.

2) I don’t understand why the authors focused on specific site of driver genes? It seems gene-level mutation distribution fits the Frechet-type distribution better than site-level distribution (Fig 2 and 3). I suggest the authors to show gene-level mutation distribution fitting in main figures and site-level mutation distribution in supplementary data. Also, Fig 2a and 3a are both site-level fitting, why it was said Fig 2 is on gene level while Fig 3 is site level.

Classical population genetics theories analyze evolution of a DNA sequence as a set of nucleotide residues instead of gene symbols (e.g., (Orr HA, 2002)). To perform "gene-level" analysis, we imagine that more careful formulation and study design are necessary. Thus, in fact, all the analyses performed in this paper were at "site-level". We wish that we have a chance to design "gene-level" analysis in a separate study in future.

We agree that the formerly Fig 3A (now S2FigA) shows lack of data points in ICGC cases. We have moved the formerly Fig 3 figures to Supporting information. Also, misleading labels in the figure and table texts (i.e., "driver-gene definition" and "driver-site definition") are now replaced with clearer ones (i.e., "symbol-based filtering" and "position-based filtering"). In addition, the corresponding descriptions in Results (L466-L467, L486-L487, L496) and Discussion (L617) are updated.

(Orr HA, 2002): Orr, H. Allen. "The population genetics of adaptation: the adaptation of DNA sequences." Evolution 56.7 (2002): 1317-1330.

3) Is the dominance h (Fig. 1D) important in the mathematical analysis? I didn’t see any information regarding h in results.

We agree that although the dominance coefficient h may serve our understanding, it is less important in the context of the Introduction part. The details of this coefficient are omitted from Introduction and its related figures are moved to Supporting information (S1FigA, S1FigB).

4) The introduction section is too long and should be cutted and many of the information is misleading.

We removed redundancy in the description and made it more straightforward, especially in the first section of the introduction. As a result, the introduction part has decreased by 24 % in word count (3088 -> 2354 words, citations and formula included). We wish the description in the section has become more concise, straightforward, and clear.

Attachment

Submitted filename: Response to Reviewers_v03.pdf

Decision Letter 2

Zechen Chong

11 Aug 2021

Extreme value theory as a framework for understanding mutation frequency distribution in cancer genomes

PONE-D-20-36757R2

Dear Dr. TOKUTOMI,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Zechen Chong

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

Reviewer #3: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: I Don't Know

Reviewer #3: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: (No Response)

Reviewer #2: The authors have made a commendable effort to better explain their approach to a more general audience. The text and diagrams are much better.

Reviewer #3: My concerns have been well addressed by the authors. This is an interesting theoretical work on cancer evolution. I would like to see this manuscript available.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Acceptance letter

Zechen Chong

13 Aug 2021

PONE-D-20-36757R2

Extreme value theory as a framework for understanding mutation frequency distribution in cancer genomes 

Dear Dr. Tokutomi:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Zechen Chong

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. The SSWM population dynamics of cells and the additivity of fitness effects of mutant alleles.

    (PDF)

    S2 Fig. Exploratory plots on cancer driver mutation MFaTs in the total-tumor analysis with position-based filtering.

    (PDF)

    S1 Appendix. Supplementary materials.

    (PDF)

    S1 Script. Get_IcgcProteinAltering.py.

    (PY)

    S1 File. Database_ICGC_temp_PostMax.tsv.gz.

    (GZ)

    S2 File. Database_COSMIC_temp_PostMax.tsv.gz.

    (GZ)

    S3 File. Database_CHANG_temp_PostMax.tsv.gz.

    (GZ)

    S4 File. Driver_TotalGene_database_IntOGen.tsv.

    (TSV)

    S5 File. Driver_TotalGene_database_CGC.tsv.

    (TSV)

    S6 File. Driver_TotalGene_database_Tokheim.tsv.

    (TSV)

    S7 File. Driver_TotalSite_database_IntOGen.tsv.

    (TSV)

    S8 File. Driver_TotalSite_database_DoCM.tsv.

    (TSV)

    S9 File. Driver_TotalSite_database_Bailey.tsv.

    (TSV)

    S10 File. Database_RTCGA_temp_Format.tsv.gz.

    (GZ)

    S11 File. Database_RTCGA_temp_ProteinAlteringSnv.tsv.gz.

    (GZ)

    S12 File. Database_RTCGA_temp_ProteinAlteringSnvDoubleton.tsv.gz.

    (GZ)

    S13 File. Database_RTCGA_temp_PostMaxTotal.tsv.

    (TSV)

    S14 File. Database_RTCGA_temp_PostMaxType.tsv.

    (TSV)

    S15 File. Driver_TypeGene_database_IntOGen.tsv.

    (TSV)

    S16 File. Drivertype_RTCGA-IntOGen_temp_PreRank.tsv.

    (TSV)

    S17 File. R_analysis_TotalTumor.zip.

    (ZIP)

    S18 File. R_analysis_TumorTypeSpecific.zip.

    (ZIP)

    S19 File. Table_NormalizedEap.tsv.

    (TSV)

    Attachment

    Submitted filename: rebuttal_Review01_v14.pdf

    Attachment

    Submitted filename: Response to Reviewers_v03.pdf

    Data Availability Statement

    Data used in this work are available from the following links. The reformatted files that can be directly processed by our scripts are provided as supplemental files. For details, see S1 Appendix. ICGC: https://icgc.org COSMIC: https://cancer.sanger.ac.uk/cosmic CHANG: https://github.com/taylor-lab/hotspots/blob/master/LINK_TO_MUTATIONAL_DATA RTCGA: https://rtcga.github.io/RTCGA/ IntOGen: https://www.intogen.org/search DoCM: http://docm.info.


    Articles from PLoS ONE are provided here courtesy of PLOS

    RESOURCES