Abstract
Cancers are typically fueled by sequential accumulation of driver mutations in a previously healthy cell. Some of these mutations, such as inactivation of the first copy of a tumor suppressor gene, can be neutral, and some, like those resulting in activation of oncogenes, may provide cells with a selective growth advantage. We study a multi-type branching process that starts with healthy tissue in homeostasis and models accumulation of neutral and advantageous mutations on the way to cancer. We provide results regarding the sizes of premalignant populations and the waiting times to the first cell with a particular combination of mutations, including the waiting time to malignancy. Finally, we apply our results to two specific biological settings: initiation of colorectal cancer and age incidence of chronic myeloid leukemia. Our model allows for any order of neutral and advantageous mutations and can be applied to other evolutionary settings.
Keywords: Cancer initiation, Driver mutations, Cancer incidence, Branching process
Introduction
Cancer is a genetic disease fueled by accumulation of driver mutations which confer a selective growth advantage to tumor cells (Vogelstein and Kinzler 2004). For solid cancers, typically more than one driver mutation is required for the development of malignancy, while a single genetic alteration may be sufficient to cause certain types of leukemia (Vogelstein et al. 2013). With the emergence of advanced sequencing technology, specific driver genes, including oncogenes, tumor suppressor genes and DNA repair genes, have been found to be responsible for carcinogenesis. For example, tumor suppressor genes APC, TP53 and oncogene KRAS are the most commonly mutated driver genes in colorectal cancer (Morin et al. 1997; Fearon 2011; Tomasetti et al. 2015), and fusion gene BCR-ABL is found to cause chronic myeloid leukemia (Deininger et al. 2000).
Some of the key questions in cancer research involve uncovering the identities, the number, the order and the effects of specific driver mutations on tumorigenesis. To facilitate mathematical quantification of the carcinogenic process, stochastic models can be used to model the accumulation of driver mutations, in particular population sizes and arrival time distributions for premalignant and malignant subpopulations. This approach goes back to the multi-stage theory of Armitage and Doll (1954), in which the shape of a cancer age incidence curve is shown to be associated with the required number of driver mutations. More recently, branching processes have been employed to investigate the age incidence of cancer (Meza et al. 2008; Paterson et al. 2020; Wang et al. 2022), cancer relapse and treatment response (Komarova and Wodarz 2005; Bozic et al. 2013; Foo et al. 2014; Avanzini and Antal 2019), and cancer heterogeneity (Durrett et al. 2011).
In the context of cancer initiation, the onset of the process occurs in healthy tissue, when a previously healthy cell receives the first oncogenic alteration. The process proceeds through abnormal growth of the altered subpopulation, acquisition of subsequent driver mutations and further waves of clonal expansion. Previous works that studied accumulation of driver mutations on the way to cancer focused on modeling evolution in exponentially growing populations (Durrett and Moseley 2010; Bozic et al. 2010; Nicholson et al. 2023). These works analyze a process that starts with a single cell that already has selective growth advantage, and model the evolution arising from this single activated cell.
In this paper, we study a process in which the large initial cell population is in homeostasis, capturing the population dynamics both before and during the exponential growth stage. In our model, any sequence of neutral or advantageous genetic alterations can occur and eventually lead to malignancy. Building upon Durrett and Moseley (2010) and Nicholson et al. (2023), we give explicit formulas for population size and arrival time distributions given the order, mutation rates and fitness increments of the driver genes along a specific mutational pathway. Our results are applicable to other multi-hit models that involve the evolution of an initially non-growing population.
Model
Inspiration for our model comes from initiation of colorectal cancer, which is thought to require inactivation of two tumor suppressor genes and activation of one oncogene (Vogelstein et al. 2013; Tomasetti et al. 2015; Paterson et al. 2020). Tumor suppressor genes, such as APC and TP53, are the most commonly mutated genes in colorectal cancer, and require inactivation (through genetic alterations) of both alleles to act as cancer driver genes. Oncogenes, such as KRAS or BRAF, which are also commonly mutated in colorectal cancer, require a single activating mutation in one allele of the gene in question. In other words, initiation of colorectal cancer requires five genetic alterations (two each in two tumor suppressor genes and one in an oncogene). If the first of the five alterations is activation of an oncogene, the crypts carrying that mutation can already exhibit selective growth advantage compared to neighboring crypts, as their rate of crypt fission (division) is significantly increased (Snippert et al. 2014). However, if the first alterations are in tumor suppressor genes, the first one to three alterations may not immediately lead to selective advantage (Paterson et al. 2020). This is because inactivation of a single allele of a tumor suppressor gene typically does not provide selective growth advantage to crypts. Furthermore, some driver genes, such as TP53, do not provide selective growth advantage when they are the initial driver alteration, but may lead to abnormal growth if another mutation is subsequently obtained (Paterson et al. 2020).
We study a multi-type branching process generalization of the process above, that starts with a large wild-type population at homeostasis, corresponding to healthy tissue (type 0). As colorectal crypts in homeostasis rarely divide or die (Nicholson et al. 2018), we set the division and death rates of the initial population to 0. In the model, we allow for a number of further oncogenic alterations that initially do not provide selective growth advantage, which occur with distinct constant rates per crypt. After a sufficient number of neutral alterations, the next alteration leads to selective growth advantage in the form of increased division rate. This corresponds, for example, to the inactivation of the second allele of tumor suppressor gene APC. After that, subsequent oncogenic mutations, which may be initially neutral, or provide additional selective growth advantage, can accrue. Once a sufficient number of mutations is collected, the crypt becomes malignant. The model can be summarized by the following diagram:
| 1 |
More formally, we study a continuous-time branching process with types forming a linear evolutionary pathway from type 0 (the healthy type) to type (the malignant type). We denote the population size of type i at time t by . The process is started at time 0 with a large healthy population, . In general, population sizes of individual types may change due to three events: division, death, and mutation. Type i cells (or crypts) divide into two daughter cells (crypts) of the same type at rate , die at rate , and mutate into type cells at rate . We define to be the net growth rate of type i.
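The dynamics described above can be simulated exactly with a Gillespie-type algorithm. The following is a minimal sketch under stated assumptions, not the authors' simulation code: the function name `simulate_pathway` and the argument names `alpha` (division rates), `beta` (death rates) and `nu` (mutation rates) are chosen here for illustration, and a mutation is modeled as a type change, so the mutating cell leaves type i and joins type i+1.

```python
import random


def simulate_pathway(n0, alpha, beta, nu, t_max, seed=0):
    """Gillespie simulation of a linear multi-type branching process.

    n0    : initial number of type-0 cells
    alpha : per-cell division rates, one entry per type (assumed name)
    beta  : per-cell death rates, one entry per type (assumed name)
    nu    : per-cell mutation rates i -> i+1 (entry of the last type is ignored)
    Returns the list of population sizes at time t_max.
    """
    rng = random.Random(seed)
    n_types = len(alpha)
    z = [n0] + [0] * (n_types - 1)
    t = 0.0
    while True:
        # Propensity of each (type, event) pair; the final type cannot mutate.
        rates = []
        for i in range(n_types):
            mut = nu[i] if i < n_types - 1 else 0.0
            rates.append((z[i] * alpha[i], z[i] * beta[i], z[i] * mut))
        total = sum(sum(r) for r in rates)
        if total == 0.0:
            return z  # no further event is possible
        t += rng.expovariate(total)
        if t > t_max:
            return z
        # Pick one event with probability proportional to its propensity.
        # (In the rare case of a rounding miss, no event is applied.)
        x = rng.random() * total
        for i, (div, die, mut) in enumerate(rates):
            if x < div:
                z[i] += 1          # division: one more cell of type i
                break
            x -= div
            if x < die:
                z[i] -= 1          # death: one fewer cell of type i
                break
            x -= die
            if x < mut:
                z[i] -= 1          # mutation: the cell changes type
                z[i + 1] += 1
                break
            x -= mut
```

When all division and death rates are zero, as for the homeostatic types, the total population is conserved and only the type composition changes. The cost of this direct method grows with the number of events, so it becomes slow once an advantageous type has expanded substantially.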
Fig. 1.
a Model illustration. Our model concerns an evolutionary process that starts with a large healthy population (blue circles). In this example, the first oncogenic alteration (yellow) does not provide selective growth advantage. The subsequent genetic alteration (orange) results in growth advantage. Orange cells divide at a higher rate, breaking the homeostasis while still not being considered cancerous. After another genetic alteration takes place, the malignant type (red) emerges. b Comparison with prior publications. Prior work mainly focuses on the cases in which the initial type has a positive growth rate. In contrast, we allow the first several types to have zero growth rates, representing cells still in homeostasis.
In the model, the initial type is at homeostasis, with net growth rate . We also assume that the first mutations that accumulate in the process are neutral, leading to no change in division or death rates. The next mutation provides selective growth advantage, leading to positive net growth rate of the -st type, . Subsequent mutations may be advantageous or neutral. The main quantity of interest in the model is the waiting time to the first type cell (crypt)
Our model is related to previous models from Durrett and Moseley (2010), Nicholson and Antal (2019), and Nicholson et al. (2023). In particular, Durrett and Moseley (2010) considered a model for clonal expansion, in which the branching process starts with a single cell with a positive growth rate and subsequent net growth rates are strictly increasing. Nicholson and Antal (2019) studied the evolution of drug resistance. Their model focused on a branching process in which the first type has the largest positive net growth rate. Recently, Nicholson et al. (2023) considered a branching process model that allows an arbitrary sequence of growth rates following the initial supercritical type. In contrast, this paper focuses on the scenario in which the initial type(s) have a zero growth rate, corresponding to homeostatic tissue. For comparison, in Fig. 1b we have listed prior publications that consider a model similar to this work but focus on different parameter regimes.
Results
In this section, we provide analytic results to estimate population sizes and arrival times in the branching process model described in the previous section, and compare them with exact computer simulations of the process. For simplicity, we only discuss the case when mutations are advantageous or neutral and there is no cell death. The scenarios that allow deleterious mutations and cell death are discussed in A.4. We also present two possible applications of the model: initiation of colorectal cancer and incidence of chronic myeloid leukemia.
Population Sizes
Individual cells in the model (1) evolve independently. Therefore, the population can be stratified into N independent lineages, each of which consists of cells descended from a single original healthy (type 0) cell. Consequently, the population size of a neutral type l, , counts the number of healthy cells that have evolved to type l, but have not changed to type yet. In particular, at any fixed time t, the population of type l is distributed as a Binomial(N, p(t)), with
| 2 |
being the time-dependent success probability (for derivation, see A.3). This success probability has the same form as the rate of incidence in Armitage and Doll (1954) with a single unit initial population (see Durrett and Moseley (2010), equation (1)). It follows that the expectation of reads
In the small mutation rate regime, p(t) is a small number, which causes the variance to have a magnitude similar to the mean value:
Therefore, the populations of neutral types approximately grow as a power function (Fig. 2).
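The power-law growth of neutral types can be checked numerically. The sketch below assumes a notation that is not taken from the text: `rates[i]` stands for the mutation rate from type i to type i+1, and types 0 through l are neutral. For a single lineage, the probability of being of type l at time t follows from the standard solution of a linear chain of exponential transitions with pairwise distinct rates, and its leading order for small rates is a power of t.

```python
import math


def occupancy_probability(rates, l, t):
    """P(a single lineage is of type l at time t), starting from type 0.

    rates[i] is the (assumed) mutation rate from type i to type i+1.
    The entries rates[0..l] must be pairwise distinct.
    """
    u = rates[: l + 1]
    prefactor = math.prod(u[:l])  # product of the first l rates
    total = 0.0
    for j, uj in enumerate(u):
        denom = math.prod(um - uj for m, um in enumerate(u) if m != j)
        total += math.exp(-uj * t) / denom
    return prefactor * total


def power_law_approximation(rates, l, t):
    """Leading-order small-rate form: (u_0 ... u_{l-1}) * t**l / l!."""
    return math.prod(rates[:l]) * t**l / math.factorial(l)


# Illustrative parameter values (not from the paper).
rates = [1e-3, 2e-3, 3e-3]
n_cells = 10**6
p_exact = occupancy_probability(rates, 2, 10.0)
p_approx = power_law_approximation(rates, 2, 10.0)
mean = n_cells * p_exact                      # Binomial mean
variance = n_cells * p_exact * (1 - p_exact)  # Binomial variance, close to the mean
```

Because the success probability is small, the Binomial variance is nearly equal to the mean. The explicit sum involves cancellation between terms of opposite sign, so it loses accuracy when the rates are extremely small or nearly equal.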
Fig. 2.

Population sizes in a multi-type branching process. We consider a 5-type branching process with two initial neutral types (with zero growth rate) and three advantageous types (with positive growth rate). Panels a and b display two different realizations of the process. Solid lines represent computer simulations, and dashed lines represent asymptotic behaviors. Type 1 (light yellow) population grows linearly; Type 2 (orange), type 3 (red), and type 4 (black) populations grow exponentially at large times. Parameter values:
Following Durrett and Moseley (2010), Nicholson and Antal (2019), and Nicholson et al. (2023), we approximate the population sizes of advantageous types (i.e. types with positive growth rate) in a parameter regime of large times and small mutation rates. For , we show that
| 3 |
where is a constant, and is a random variable. This approximation separates the stochasticity from the time dependence: the population can be decomposed into the product of a time-dependent deterministic function and a time-independent random variable .
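The same kind of separation can be seen in the simplest supercritical case, a single-type pure-birth (Yule) process started from one cell. This is a simplified stand-in rather than the multi-type model studied here, and the names `lam` and `sample_yule` are assumptions made for illustration. For a Yule process with division rate `lam`, the population at time t is geometrically distributed with success probability exp(-lam * t), and the rescaled population exp(-lam * t) * Z(t) converges to an exponential random variable with mean 1.

```python
import math
import random


def sample_yule(lam, t, rng):
    """Sample the size at time t > 0 of a Yule process started from one cell.

    The size is geometric on {1, 2, ...} with success probability exp(-lam*t);
    it is sampled here by inverse transform.
    """
    p = math.exp(-lam * t)
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    return math.floor(math.log(u) / math.log1p(-p)) + 1


def scaled_samples(lam, t, n_samples, seed=0):
    """Samples of exp(-lam*t) * Z(t), the time-independent random amplitude."""
    rng = random.Random(seed)
    scale = math.exp(-lam * t)
    return [scale * sample_yule(lam, t, rng) for _ in range(n_samples)]


# The rescaled population has mean 1 regardless of the observation time.
mean_early = sum(scaled_samples(1.0, 5.0, 20000, seed=0)) / 20000
mean_late = sum(scaled_samples(1.0, 10.0, 20000, seed=1)) / 20000
```

On a logarithmic scale, each realization is then a straight line with slope `lam` whose intercept is set by the random amplitude.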
Random variable can be characterized using its Laplace transform:
| 4 |
Here denotes the polylogarithm (DLMF 2022, 25.12.10). can be computed iteratively, with , and
for . Finally, we have
We show that approximations (3) and (4) are in excellent agreement with exact computer simulations of the process in Figs. 2 and 7. In particular, in Fig. 2 we depict two realizations of the same branching process. The realizations in panels a and b share the same asymptotic growth rates. Specifically, the growth rates (the slope of the dashed line) for types 2, 3, and 4 are characterized by and (, or 3 in Eq. (3)), respectively. However, the intercepts of the two dashed lines (for each of the types 2-4) differ because the limiting random variables and (Eq. (3)) have non-identical values in the two realizations.
Fig. 7.

Random amplitude. Laplace transforms of and obtained from Eq. (4) and computer simulations of the process described in Fig. 2. Solid lines depict the simulated Laplace transforms, which are obtained by computing the Laplace transform of scaled populations at . Dashed lines show formula (4). Parameter values: . Number of realizations in computer simulation: 1000
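An empirical Laplace transform of the kind plotted here can be estimated from simulated samples by averaging exp(-theta * x) over realizations. The sketch below is a generic estimator, not the authors' code. The function name is an assumption, and the check uses exponential samples, whose transform 1 / (1 + theta) is known in closed form, rather than the scaled populations of the paper.

```python
import math
import random


def empirical_laplace_transform(samples, theta):
    """Monte Carlo estimate of E[exp(-theta * X)] from samples of X."""
    return sum(math.exp(-theta * x) for x in samples) / len(samples)


# Sanity check on a distribution with a known transform: X ~ Exp(1).
rng = random.Random(0)
exp_samples = [rng.expovariate(1.0) for _ in range(20000)]
estimate = empirical_laplace_transform(exp_samples, 1.0)  # close to 0.5
```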
Arrival Times
Before the arrival of the first advantageous type , the total population of the branching process stays fixed. The only possible event for any cell is to change its type into the subsequent type. For a single cell, each alteration requires an exponential waiting time. In a population of cells, the waiting time for a specific type is the minimum time for individual cells to reach that type. This results in the following waiting time distribution for type :
| 5 |
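The minimum-over-lineages argument above can be turned into a short numerical sketch. The notation is assumed rather than taken from the text: `rates` holds the pairwise distinct mutation rates of the neutral steps leading to the target type, and `n_cells` is the initial number of healthy cells. A single lineage reaches the target after a sum of independent exponential times, and the first arrival in the population is the minimum over `n_cells` independent lineages.

```python
import math


def hypoexponential_cdf(rates, t):
    """CDF of a sum of independent exponentials with pairwise distinct rates."""
    survival = 0.0
    for j, uj in enumerate(rates):
        weight = math.prod(um / (um - uj) for m, um in enumerate(rates) if m != j)
        survival += weight * math.exp(-uj * t)
    return 1.0 - survival


def neutral_waiting_time_cdf(n_cells, rates, t):
    """P(the first cell of type len(rates) has appeared by time t)."""
    f = min(max(hypoexponential_cdf(rates, t), 0.0), 1.0)  # guard against rounding
    if f >= 1.0:
        return 1.0
    # 1 - (1 - f)**n_cells, evaluated stably for small f and large n_cells.
    return -math.expm1(n_cells * math.log1p(-f))


# Illustrative values (not from the paper): two neutral steps, 10**4 cells.
rates = [1e-3, 2e-3]
exact = neutral_waiting_time_cdf(10**4, rates, 10.0)
# Small-rate form: 1 - exp(-N * u_0 * u_1 * t**2 / 2!)
approx = 1.0 - math.exp(-(10**4) * math.prod(rates) * 10.0**2 / 2.0)
```

For small mutation rates the two values are close, which is the regime targeted by the approximations in this section.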
The arrival time of a type that appears after the homeostasis has been partially broken (i.e. whose growth rate is positive) can be split into two segments: (i) The time from the beginning of the process to the arrival of the first advantageous cell, and (ii) The time from the first advantageous cell to the first target type cell. Adapting results from Nicholson et al. (2023), we find an estimate of (ii). Then, treating (i) as a time delay of (ii), we make use of the hypo-exponential distribution to obtain the following approximate formula for the waiting time distribution for type , :
| 6 |
The shape of the waiting time curve is largely determined by , the growth rate of the first advantageous type. characterizes the amount of time delayed in (i). represents the median evolution time from a single type to the first type . The value of can be derived by the following iterative scheme, which depends on whether -th alteration is neutral or advantageous:
For derivation of results (5) and (6), see A.3.
We show that approximations (5) and (6) are in good agreement with exact computer simulations of the process in Fig. 3.
Fig. 3.

Comparison of analytic results and computer simulations for the waiting time distribution of neutral and advantageous types. Solid lines depict cumulative distribution functions for waiting times to types 1 through 4 in the model described in Fig. 2. Points denote probabilities obtained from computer simulations of the process, with bars showing the 95% confidence interval. Yellow line and orange line show approximation (5); red line and black line show approximation (6). Parameter values: . Number of realizations in computer simulation: 1000
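The caption does not state how the 95% confidence intervals were computed. One common choice for a simulated probability, shown here purely as an assumption, is the Wilson score interval for a binomial proportion, where `successes` is the number of realizations in which the target type has arrived by a given time and `n` is the number of realizations.

```python
import math


def wilson_interval(successes, n, z=1.959963984540054):
    """Wilson score interval for a binomial proportion (default: 95% level)."""
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Example: the target type has arrived in 500 of 1000 realizations.
low, high = wilson_interval(500, 1000)
```

Unlike the simple normal approximation, this interval stays inside [0, 1] when the estimated probability is close to 0 or 1, which is the case at early and late times.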
Application: Colorectal Cancer Initiation
Colorectal cancer (CRC) is the end result of a process in which healthy tissue accumulates sequential oncogenic alterations. Multiple driver genes are identified to contribute to this cancerous transformation, but the effect of mutational order of the driver genes on cancer initiation time is not fully understood. Recent work (Paterson et al. 2020) developed a multi-type branching process model to study CRC initiation through acquisition of three common driver genes, tumor suppressors APC and TP53, and the KRAS oncogene. Both alleles of a tumor suppressor gene need to be inactivated for it to function as a driver gene, while only one mutant allele is sufficient for the activation of an oncogene. It follows that CRC initiation involves five sequential genetic alterations. In the model, these genetic alterations may take place through either loss of heterozygosity (LOH) or mutation in any order and at constant rates (Table 1). Zhang et al. (2023) recently studied the waiting time distributions along a single mutational pathway in the order of APC inactivation, KRAS activation, and TP53 inactivation.
Table 1.
CRC driver genes and corresponding parameter values
| Gene | APC inactivation | | KRAS activation | TP53 inactivation | |
|---|---|---|---|---|---|
| Alteration | LOH | Mutation | Mutation | LOH | Mutation |
| Rate (per year) | | | | | |
| Fitness advantage (per year) | 0.20 | | 0.07 | 0 | |
We consider a model of colorectal cancer that starts with wild type crypts. Colonic crypts are basic functional units found in the epithelium of the colon. Within a single crypt, cells rapidly renew and migrate upward. New mutations that appear in individual cells of the crypt are either lost quickly or fixate in the crypt (Campbell et al. 1996), which enables us to focus on crypts as units of selection (Paterson et al. 2020). The number of crypts in the human colon is approximately – (Tomasetti et al. 2015; Potten et al. 2003; Paterson et al. 2020). Along each individual pathway, population sizes and waiting time distributions can be estimated using the formulas derived in this paper. To demonstrate this, we select two different mutational pathways to CRC and compare our waiting time approximations and the exact computer simulation of the process (Fig. 4). In the first pathway, wild type colonic crypts undergo APC inactivation, KRAS activation, and TP53 inactivation consecutively (Fig. 4a). In the second pathway, APC inactivation is followed by TP53 inactivation, and KRAS activation (Fig. 4b).
Fig. 4.
CRC waiting times. Comparison of analytic results and computer simulations for the waiting time distributions of types 3, 4 and 5. Points denote probabilities obtained from computer simulations of the process, with bars showing the 95% confidence interval. Solid lines depict cumulative distribution functions for waiting times obtained from equation (6). In panel a, the mutational order is APC inactivation, KRAS activation, and TP53 inactivation. In panel b the mutational order is APC inactivation, TP53 inactivation, and KRAS activation. Parameter values: crypts. Mutation rates and selective growth advantages are listed in Table 1. Number of realizations in computer simulation:
The two numerical verifications (Figs. 3 and 4) indicate that the analytic results and the exact waiting time distributions are in good agreement. In particular, for types with no selective growth advantage (i.e. types before the first supercritical type), only small mutation rate approximations have been carried out. This results in good agreement for both early times and later times (e.g. in Fig. 3 and in Fig. 4). However, for supercritical types, one can still observe discrepancies between the approximations and the computer simulations at early times and for higher types. The error at early times comes from the fact that the approximation for the waiting time distribution for type relies on the large time limit for the population size of type k, which is less accurate for early times.
For the error observed for higher types, we note that the approximations for population sizes are performed in an iterative way. In particular, to obtain an approximate population size for type , one uses the approximation for population size of type i. Therefore, the approximation error accumulates through the iterations, and the approximation becomes less accurate as the types increase.
In Appendix C, we introduce an approach to improve the approximation by employing a more precise estimate of the population size of the first type with a positive growth rate (type ). Specifically, we employ a nonhomogeneous Poisson approximation such that the distribution of the waiting time to type is related to an integral of the population size of type over time. Typically, to evaluate the integral, the population of type over time is approximated as a multiplication of an exponential function of time and a time-independent random variable. We have found linear terms in addition to the exponential term such that the population size can be more accurately estimated at early times. Calculating the integral with these linear terms improves the approximation of the distribution function of the waiting time for type (Fig. 9).
Fig. 9.
CRC waiting times. Comparison of the original approximations, Eqs. (5) and (6), with the improved approximations, Eqs. (C28) and (C29). Points denote probabilities obtained from computer simulations of the process, with bars showing the 95% confidence interval. Solid lines depict cumulative distribution functions for waiting times obtained from Eq. (6) ( through ). Dashed lines depict the approximation obtained from Eqs. (C28) () or (C29) ( and ). In panel a, the mutational order is APC inactivation, KRAS activation, and TP53 inactivation. In panel b the mutational order is APC inactivation, TP53 inactivation, and KRAS activation. Parameter values: crypts. Mutation rates and selective growth advantages are listed in Table 1. Number of realizations in computer simulation:
Application: Incidence of Chronic Myeloid Leukemia
Chronic myeloid leukemia (CML) is an uncommon type of cancer that is thought to arise in hematopoietic stem cells. The fusion oncogene BCR-ABL has been identified as the initiating event of CML carcinogenesis (Deininger et al. 2000). Michor et al. (2006) established a single-hit model that characterizes the malignant transformation of healthy hematopoietic stem cells. In the model, a Moran process is employed to describe the underlying stem cell dynamics. The process starts with a fixed number of healthy stem cells. At each division, a cell is randomly picked and replaced by a newly produced cell, which can carry an oncogenic mutation with some probability. The mutant cell has a selective growth advantage compared to healthy stem cells, leading to clonal expansion of the mutant population. It is assumed that the detection rate of CML is proportional to the population of mutant cells. Michor et al. (2006) derive the detection probability explicitly, fit their model to CML prevalence data and conclude that BCR-ABL alone might be sufficient to initiate CML.
Here, we find that a single-hit branching process model can also recover the CML age-prevalence curve. To this end, we consider a three-type branching process in which type 0 cells are healthy hematopoietic stem cells, type 1 cells are mutant stem cells with activated BCR-ABL, and type 2 corresponds to CML that has been detected. We assume that healthy stem cells (type 0) are at homeostasis and have a 0 growth rate, and that mutant stem cells (type 1) have a positive growth rate . In our model, the probability of CML detection at time t can be characterized by the type 2 waiting time distribution . We use an improved approximation equation as an estimate for :
| 7 |
For the derivation of (7), see Section C in the Appendix. Curve fitting in log space is performed using the CML age-prevalence data to identify parameter values, including the number of healthy hematopoietic stem cells N, the production rate for BCR-ABL mutants , the CML detection rate , and the growth rate of mutant cells (Fig. 5). In the approximation, N, , and appear together in the form . Consequently, only their product can be identified, and it is of the order of . The growth rate of mutant cells is identified to be per year.
Fig. 5.

CML prevalence. Comparison of the cumulative probability distribution of CML detection (prevalence) from SEER data (Table 1 in Michor et al. (2006)) and equation (7). Parameter values with 95% confidence interval: per year
Recently, the number of hematopoietic stem cells was estimated to be in the range of 50,000-200,000 using deep sequencing and phylogenetic inference (Lee-Six et al. 2018). Mitchell et al. (2022) used a similar approach and inferred that the hematopoietic stem cell population is in the range of 20,000-200,000. These works suggest that the number of healthy hematopoietic stem cells is of the order of —. Thus, we obtain an estimate for the product , which is on the order of —.
In Fig. 5, we show that Eq. (7) is in good agreement with the age-prevalence curve of CML.
Discussion
In this work, we study a multi-type branching process that starts with a large cellular population in homeostasis, and models accumulation of neutral and advantageous mutations on the way to malignancy. We derive approximations for population size and arrival time distributions for initial types with no phenotypic changes compared to healthy tissue, as well as for later types that grow abnormally. Applications to modeling the initiation of colorectal cancer and age-prevalence of chronic myeloid leukemia demonstrate the applicability of our results. Besides cancer evolution, our results are also applicable to other biological phenomena that involve a transformation of a non-growing population through sequential genetic or phenotypic alterations.
We note that the approximations presented here assume that mutation rates are much smaller than growth rates of advantageous types. In particular, for the approximations to be valid, the initial mutation rates have to be small enough compared with the first positive growth rate so that the subsequent mutation occurs when the population of the first advantageous type grows exponentially. This assumption is most likely to be violated when there is a large influx into the first type with a positive growth rate from the previous type, resulting in polynomial population growth when a subsequent mutation occurs.
Acknowledgements
This work is supported by the National Science Foundation Grant DMS-2045166.
Appendix A Methods
In this section, we provide technical details and derive the results presented in the main text. We start by listing assumptions regarding the parameter values that underlie the mathematical proofs. Part of them could be relaxed and will be discussed in A.4. We assume that genetic alterations occur at distinct rates:
Assumption 1
All the mutation rates are mutually different, i.e. .
We also assume that the mutation rates of genetic alterations are much lower than the growth rates of advantageous mutants. This results in
Assumption 2
(Small Mutation Rates) and , . The is in the sense that when taking any , for any j is unaffected.
As we are mainly concerned with neutral and advantageous mutations, we have
Assumption 3
For all , .
Lastly, for simplicity, we initially assume the death rates are zero, i.e.
Assumption 4
For all , .
The last two assumptions (3, 4) are not necessary, and we will discuss the case when these two assumptions do not hold in A.4.
We build upon the work by Nicholson et al. (2023), which provides long-time approximations for population sizes and waiting times in a branching process model with a surviving supercritical initial type. To state the procedure of developing results in this paper, it is necessary to introduce two sub-processes of the main model. Let be the vector with the th coordinate being 1 and all other coordinates being 0, representing the case in which the process initially consists of only a single type i cell. As a Markovian process, the model (1) is induced by the initial distribution , i.e. N initial healthy cells. We will consider two sub-processes of the main model: (i) a process that starts with a single healthy cell, i.e. induced by an initial distribution , and (ii) a process that starts with a single type cell, i.e. induced by an initial distribution .
An outline for obtaining the approximations in this paper includes three steps: First, we employ the results of Nicholson et al. (2023) to approximate population sizes and waiting time distributions of . Next, by using the fact that is essentially a delayed version of , we establish approximations of population sizes and arrival times for . Finally, we move from the process with a single initial cell to the model with N initial cells and establish our main results by utilizing the branching property.
A.1 Properties of the Model Initiated by a Single Cell with Positive Growth Rate
For , approximations for the arrival times and population sizes from type to type p are exhaustively discussed in Nicholson et al. (2023). To apply their results, we introduce the following new notation:
: Number of times has been attained over types .
: Arrival time until the first type cell, i.e. .
: Median arrival time of type in the process . In other words, .
Under our assumptions (1–4), Nicholson et al. (2023) show that there exists (see A.1.1) an approximation of process , denoted by such that . Additionally, for ,
where is a known function of the mutation rates and is a Mittag-Leffler distributed random variable. The Laplace transform of reads
with being a constant that depends on the parameters. For computing the value of , see the recursive formulation after Eq. (4). The Laplace transforms of and are connected through
| A1 |
with
| A2 |
Importantly, the recursive relationship does not depend on the distribution of due to the construction of (see A.1.1). This large-time small-mutation-rate limit leads to the following population size approximation for type :
| A3 |
It follows that as well.
From this approximation, Nicholson et al. (2023) find that the arrival time of type can be estimated by
| A4 |
with being the median of . can be expressed using by
Alternatively, a recursive formulation of is presented in the main text (see Eq. (6)).
We note that there is a subtle difference between the model in Nicholson et al. (2023) and our model regarding the mutation events. In our model, a mutation from type i to type causes the population of type i to decrease by one, i.e. , while in Nicholson et al. (2023), a mutation event occurs during an asymmetric division in which the type i population is not changed, i.e. . Because this difference only exists for “growing” types, whose net growth rates are assumed to be much greater than mutation rates, the model of Nicholson et al. (2023) can be treated as a good approximation for the types to in our model.
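The size of this difference can be seen by comparing expected population sizes of a single growing type under the two conventions. The sketch below uses assumed symbols that are not taken from the text: $\lambda$ for the net growth rate and $u$ for the outgoing mutation rate of the type in question, with influx from the previous type ignored.

```latex
\begin{align*}
\text{mutation as a type change:}\quad
  & \frac{d}{dt}\,\mathbb{E}[Z(t)] = (\lambda - u)\,\mathbb{E}[Z(t)]
  \;\Longrightarrow\; \mathbb{E}[Z(t)] = Z(0)\,e^{(\lambda - u)t},\\
\text{mutation at asymmetric division:}\quad
  & \frac{d}{dt}\,\mathbb{E}[Z(t)] = \lambda\,\mathbb{E}[Z(t)]
  \;\Longrightarrow\; \mathbb{E}[Z(t)] = Z(0)\,e^{\lambda t}.
\end{align*}
```

The two means differ by the factor $e^{-ut}$, which amounts to replacing the growth rate $\lambda$ by $\lambda - u$, a small relative change whenever $u \ll \lambda$.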
A.1.1 Construction of
The general strategy of constructing the approximation was first introduced by Durrett and Moseley (2010) and recently studied by Nicholson et al. (2023). A rigorous mathematical description is presented in Nicholson et al. (2023). Briefly, in the construction, we use a random variable to establish a two-type stochastic process , where counts a Poisson process with intensity for each single realization of . is found by taking a large time limit of , so that is a good approximation of . After that, is found to be a large time limiting random variable of , and one constructs by the same methodology. Nicholson et al. (2023) established results that incorporate small mutation rate limits in the approximations. In our case, the construction starts at type where is imposed. Importantly, Nicholson et al. (2023) indicate that the population dynamics of , in a small transition rate limit, is fully induced by the initial type, i.e. . Furthermore, in the approximate model, the probability distribution of the waiting time to type (i.e. ) can be expressed by directly (Eq. (A4)).
A.2 Properties of the Model that Starts with a Single Cell with a Zero Growth Rate
Now we move our focus to the process , which represents the branching process that starts with a single type 0 cell. Recall that the growth rates for type 0 through type k are zero. This indicates that the process becomes after the initial cell collects the first mutations and changes its type into type . Let be the arrival time of type in . We find that for
and for
| A5 |
Due to the above expression of , for any , the population size follows a Bernoulli distribution with a time-dependent parameter
Thus, to understand the properties of , we need to obtain the distribution of , . Since each mutation among the first neutral mutations takes an exponentially distributed time to occur, we have
where denotes an exponentially distributed random variable with density . From our Assumption 1, whenever . Thus, being a sum of independent exponential random variables with distinct rates, follows a hypoexponential distribution with density
Using this density function, we find that
Thus, it follows that the Bernoulli parameter p(t) for the neutral population is given by
| A6 |
In the last equation, we have used a Taylor expansion in at 0.
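The distribution of a sum of independent exponential waiting times with pairwise distinct rates, and the resulting Bernoulli parameter, can be evaluated numerically. The sketch below is illustrative only: the names `hypoexp_cdf`, `occupancy_probability` and `leading_order`, and the list `rates` of neutral mutation rates, are assumptions introduced here. It uses the fact that a single cell is of type l at time t exactly when type l has arrived and type l+1 has not. The function `leading_order` returns the leading term one would expect from a small-rate Taylor expansion, which may differ in form from the expression used in the main text.

```python
import math

def hypoexp_cdf(rates, t):
    """P(E_1 + ... + E_n <= t) for independent E_i ~ Exp(rates[i]), rates pairwise distinct."""
    if not rates:
        return 1.0  # empty sum: the required type is present from time 0
    tail = 0.0
    for i, ui in enumerate(rates):
        coeff = 1.0
        for j, uj in enumerate(rates):
            if j != i:
                coeff *= uj / (uj - ui)
        tail += coeff * math.exp(-ui * t)
    return 1.0 - tail

def occupancy_probability(rates, l, t):
    """P(single initial cell is of type l at time t): type l arrived, type l+1 not yet."""
    return hypoexp_cdf(rates[:l], t) - hypoexp_cdf(rates[:l + 1], t)

def leading_order(rates, l, t):
    """Leading small-rate term: (product of the first l rates) * t^l / l!."""
    return math.prod(rates[:l]) * t ** l / math.factorial(l)

# Example with three hypothetical neutral mutation rates.
rates = [1e-3, 2e-3, 3e-3]
exact = occupancy_probability(rates, 2, 1.0)
approx = leading_order(rates, 2, 1.0)
```

For rates that are very close to each other the coefficients in `hypoexp_cdf` become large and the sum suffers from cancellation, so this direct formula is best reserved for well-separated rates.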
Next, we want to identify the large-time small-mutation-rate population behavior for . To do that, we first observe that admits a large-time small-mutation-rate limit.
Lemma 1
For the process , the following large-time small-mutation-rate limit exists almost surely:
Proof
Notice that . Thus, for each realization , we have that
This shows that a large-time small-mutation-rate limit still exists after has been shifted by a waiting time .
We denote the above new limiting random variable by . Using Eq. (A2), we find the Laplace transform of
| A7 |
denotes the Gaussian hypergeometric function (DLMF 2022, 15.1.1). Next, let be the approximation of . We construct an auxiliary process following the procedure described in A.1.1. As an approximation of , the auxiliary process suggests that
where
The constant is given by the recursive relationship after Eq. (4). In addition, the construction also guarantees that
| A8 |
Taking advantage of (A8), we get
| A9 |
We find that (A9) can be simplified in the small regime, leading to
| A10 |
The derivation of Eq. (A10) is provided in Section B.
A.3 Properties of the Model that Starts with a Large Non-growing Population
Finally, we consider the population dynamics of that starts with N type 0 cells. By the branching property, and are related through:
| A11 |
where is a collection of independent processes that are identically distributed as . Equality (A11) reflects that, in a multitype branching process model, each individual in the initial population evolves independently.
For types with zero growth rates, we find that for any ,
where are i.i.d. copies of the process , and are i.i.d. copies of . This expression indicates that the population of type l (with zero growth rate) at time t follows a Binomial (N, p(t)) distribution. We show that this result is in good agreement with exact computer simulations of the process in Fig. 6.
Fig. 6.

Population size distribution of a non-growing population. In the model described in Fig. 2, type 1 does not have a selective growth advantage. The simulated cumulative distribution (CDF) of type 1 population at is presented by the blue bars. Binomial CDFs with exact theoretical success probability p(t) (A6) and approximate p(t) (2) are presented in solid lines. Parameter values: . Number of realizations in computer simulation: 1000
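The Binomial(N, p(t)) description of a non-growing type can be evaluated directly. The following is a minimal sketch for the simplest case of one neutral intermediate type; the function names and the rate symbols `u0` (type 0 to type 1) and `u1` (type 1 to type 2) are assumptions introduced for illustration, and the parameter values in the example are arbitrary rather than those of Fig. 6.

```python
import math

def type1_probability(u0, u1, t):
    """P(a single initial type 0 cell is of type 1 at time t); requires u0 != u1."""
    return u0 / (u1 - u0) * (math.exp(-u0 * t) - math.exp(-u1 * t))

def binomial_cdf(n, p, x):
    """P(Binomial(n, p) <= x) for 0 <= p < 1, computed with the pmf recursion."""
    pmf = (1.0 - p) ** n                            # P(X = 0)
    total = 0.0
    for j in range(x + 1):
        total += pmf
        pmf *= (n - j) / (j + 1) * p / (1.0 - p)    # P(X = j+1) from P(X = j)
    return min(total, 1.0)

# Example: CDF of the type 1 population among N = 10_000 initial cells.
p = type1_probability(1e-5, 1e-4, 50.0)
cdf_values = [binomial_cdf(10_000, p, x) for x in range(0, 15)]
```

The recursion avoids forming large binomial coefficients explicitly, but it still assumes that (1 - p)^N does not underflow, which holds when N p(t) is moderate.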
The arrival time of type l can be treated as the minimum of the type l arrival times among the processes that each start with a single cell, i.e.
It follows that Since is hypo-exponentially distributed, we find
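Because the arrival time in the N-cell process is the minimum of N independent single-cell arrival times, its distribution function is 1 - (1 - F(t))^N, where F is the single-cell distribution function. A minimal sketch, with hypothetical function names and with F(t) passed in as a number `f_single`:

```python
import math

def population_arrival_cdf(n_cells, f_single):
    """CDF of the minimum of n_cells i.i.d. arrival times: 1 - (1 - F)^N."""
    if f_single >= 1.0:
        return 1.0
    # log1p/expm1 keep precision when f_single is tiny and n_cells is huge.
    return -math.expm1(n_cells * math.log1p(-f_single))

def poisson_approximation(n_cells, f_single):
    """Large-N, small-F approximation 1 - exp(-N F)."""
    return -math.expm1(-n_cells * f_single)

# Example: one million initial cells, single-cell probability 1e-6.
exact = population_arrival_cdf(10**6, 1e-6)
approx = poisson_approximation(10**6, 1e-6)
```

The second function is included only to show how the exact expression relates to the exponential form that typically appears in small-mutation-rate approximations.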
For types with positive growth rates, relationship (A11) indicates that
where and are i.i.d. copies of . The Laplace transform of is found to be
We find an approximate version of in the small mutation rate parameter regime
| A12 |
For the arrival time of post-advantageous types, we take advantage of the relationship (A8) and simplification (A10) to get the distribution of
| A13 |
A.4 Allowing Death and Fitness Decreasing Events
Following Nicholson et al. (2023), we allow deleterious mutations and positive death rates in the model (1) after the first advantageous mutation as long as the first type with non-zero growth rate is supercritical. Below, we list the results that reflect this parameter regime relaxation. We first introduce the following notation:
: Running-max fitness.
: Number of times has been attained over types .
Note that when Assumption 3 holds, type i always has the largest growth rate among all types before i. Thus, .
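The two quantities just introduced are simple to compute from a list of growth rates. The sketch below is an illustration under an assumption: the running maximum and its attainment count are taken over all growing types up to and including the current one, which is how we read the definitions above. The function name `running_max_fitness` is hypothetical.

```python
def running_max_fitness(growth_rates):
    """Return (delta, r): the running maximum of the growth rates and,
    for each position, how many times that maximum has been attained so far."""
    delta, r = [], []
    current, count = None, 0
    for lam in growth_rates:
        if current is None or lam > current:
            current, count = lam, 1      # a new maximum resets the count
        elif lam == current:
            count += 1                   # the current maximum is attained again
        delta.append(current)
        r.append(count)
    return delta, r

# Strictly increasing rates: the count is always 1.
print(running_max_fitness([0.1, 0.2, 0.3]))
# A deleterious step followed by a return to the previous maximum.
print(running_max_fitness([1.0, 0.5, 1.0, 2.0, 2.0]))
```

With strictly increasing growth rates the running maximum coincides with the current growth rate and the count equals one, consistent with the remark about Assumption 3.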
The waiting time distribution of type can be approximated in a form similar to Eq. (6):
| A14 |
The only difference between (6) and (A14) is that the median time has been changed to . Here, we have (see displays (2) and (5) in Nicholson et al. (2023))
with satisfying and
for . Finally, we have
Appendix B Derivation of the Simplified Distribution Function
In this section, we derive Eq. (A10), an approximation of Eq. (A9), using Taylor expansion of hypergeometric functions.
B.1 The Leading Order Term of the Distribution Function
The main goal is to find the leading order term of
| B15 |
as . Let
Then (B15) can be written as
Now, considering the Taylor expansion of G at , we have
Thus, we have that
Then, by the fact that (see equation (7) in Supplement File (1) of Bozic et al. (2013))
we have
In the next section, we will give an explicit expression of .
B.2 Computing the Partial Derivatives of the Hypergeometric Function at a Particular Point
Here we discuss the derivatives of function G at 0. Let
It follows that
Thus, we have
Next, the derivative of g is given by the following lemma.
Lemma 2
Let be the Hypergeometric function and define
Then we have
where denotes the polylogarithm (DLMF 2022, 25.12.10).
With this lemma, we immediately see that
Proof
Following Ancarani and Gasaneo (2009), to find the mth derivative of g, we define and apply the hypergeometric differential equation. The hypergeometric function satisfies the following second-order ODE:
Thus, for F, we have that
| B16 |
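The hypergeometric differential equation used in this step, z(1 - z)F'' + [c - (a + b + 1)z]F' - abF = 0, can be sanity-checked numerically. The sketch below is not part of the proof: it sums the Gauss series for |z| < 1 and uses the standard identity that differentiating the series in z raises each parameter by one and multiplies by ab/c. The function names and the test parameters are arbitrary choices made here.

```python
import math

def hyp2f1_series(a, b, c, z, terms=400):
    """Gauss hypergeometric series 2F1(a, b; c; z), valid for |z| < 1."""
    term, total = 1.0, 1.0
    for n in range(terms):
        term *= (a + n) * (b + n) / ((c + n) * (n + 1)) * z
        total += term
    return total

def ode_residual(a, b, c, z):
    """Left-hand side of z(1-z)F'' + [c-(a+b+1)z]F' - abF, which should vanish."""
    f = hyp2f1_series(a, b, c, z)
    f1 = a * b / c * hyp2f1_series(a + 1, b + 1, c + 1, z)
    f2 = (a * (a + 1) * b * (b + 1) / (c * (c + 1))
          * hyp2f1_series(a + 2, b + 2, c + 2, z))
    return z * (1 - z) * f2 + (c - (a + b + 1) * z) * f1 - a * b * f

residual = ode_residual(0.5, 1.2, 2.3, 0.3)
```

The residual should be at the level of floating-point round-off for any parameters with |z| < 1 and c not a non-positive integer.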
Next, since F is analytic in u, we can take a derivative with respect to u on both sides, which results in
Then, by the formula (DLMF 2022, 15.5.21)
we get that
Let . If we look at the derivative at on both sides, we get
where we have used the fact that . The above equation is a second order linear equation, but one can see that by letting , the ODE reduces to a first order equation. Hence, we get a unique solution
For a general m, the ODE for is
| B17 |
This can be obtained by taking derivatives of both sides of (B16) m times. Let us denote
We use induction to show that
satisfies (B17). For the base case when , we have proved above that
Next, we move to the induction part. Suppose that our claim holds true for m, that is
solves (B17). Then, the ()-st equation reads
From the derivative formula of the polylogarithm (see (18) at Polylogarithm), we have that
It follows that
Thus, we see that
Multiplying both sides by , we have
This shows that and finishes the induction. Finally, we conclude that
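The induction step relies on a derivative rule for the polylogarithm. The standard form of that rule, which we presume is the one referenced above, is z d/dz Li_s(z) = Li_{s-1}(z). A minimal numerical check using the defining power series for |z| < 1 follows; the function names and the step size `h` are choices made here for illustration.

```python
import math

def polylog_series(s, z, terms=200):
    """Li_s(z) = sum_{k>=1} z^k / k^s, valid for |z| < 1."""
    return sum(z ** k / k ** s for k in range(1, terms + 1))

def polylog_derivative_fd(s, z, h=1e-5):
    """Central finite-difference estimate of d/dz Li_s(z)."""
    return (polylog_series(s, z + h) - polylog_series(s, z - h)) / (2 * h)

z = 0.5
lhs = z * polylog_derivative_fd(2, z)   # z * d/dz Li_2(z)
rhs = polylog_series(1, z)              # Li_1(z) = -ln(1 - z)
```

The same check can be repeated for non-positive integer orders, where the polylogarithm reduces to a rational function of z.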
Appendix C Improving the Accuracy of Approximations for the Waiting Time Distributions
An improvement of the main results can be achieved when the population size of type can be better estimated. Recall the density function of the waiting time to the first type cell when starting from a single wild type cell (see Section A.2):
Hence the expectation for the size of type population (when starting from a single wild type cell) is given by
| C18 |
We performed a small mutation rate approximation of Eq. (C18) regarding and found that
Thus, a corresponding approximation for the expected population size of type when there are N initial cells is
| C19 |
Note that the limiting random variable of type admits (see (4) for its approximate Laplace transform)
This motivates us to estimate the population size of by
| C20 |
Let represent the probability of having at least one type n cell in a branching process that starts with a single type m cell. The distribution of the waiting time to type can be approximated by
| C21 |
Note that coincides with the cumulative distribution of the waiting time to the first type n cell when the process initially has a single type m cell.
In the case that , is well defined. Adapting the estimation of from the construction of the first auxiliary process gives us
| C22 |
Plugging (C22) into (C21) gives us the improved formula for approximating the waiting time to type . When , we obtain an explicit expression:
| C23 |
In general, for , there is no explicit solution for the integral
| C24 |
However, for small k, the integral (C24) can be evaluated explicitly. In this case, the approximation for the distribution function of for any k, i is available through expression (C21).
C.1 Explicit Approximations for an Evolutionary Pathway with a Single Neutral Type
We have introduced an approach in Section C for improving the approximations of the waiting time distributions through a more precise estimate of the population size of type cells. In particular, when there is only one neutral type, , the population size of type 1 cells can be approximated by (see (C20))
Next, Eq. (C23) gives us an approximation of the distribution function of (, )
| C25 |
where we have used the fact that . Specifically, the expression can be further simplified since and are small. Using the approximation when x is small, we obtain
| C26 |
Expression (C26) is used for fitting the CML prevalence curve in the main text (see Fig. 5). In the curve fitting, the two identifiable terms are the product and the growth rate . In particular, the parameters , and are not individually identifiable in the curve fitting. The approximation (C26) (or (7) in the main text) is in good agreement with the exact computer simulations (see Fig. 8).
Fig. 8.

Comparison of Eq. (C26) (or (7) in the main text) with exact Gillespie computer simulations. Parameter values for the computer simulations: per year. Number of realizations in computer simulation:
Next, to obtain the distribution for , we evaluate the integral (C24). The result reads
Hence, for , the improved waiting time distribution approximation is given by
| C27 |
C.2 Explicit Approximations for an Evolutionary Pathway with Two Neutral Types
When there are two neutral types in our model, i.e. , the improved approach for approximating waiting time distributions (see Section C) implies that (see (C20))
When , Eq. (C23) gives the approximation for the cumulative distribution function of :
| C28 |
To approximate the distribution function for each waiting time for , we first evaluate the integral (defined by (C24)). We found that
Next, plugging the integral into the expression (C21) gives us the distribution for each where . Specifically, the improved waiting time distribution approximation of is given by
| C29 |
The improved approximations (C28) or (C29) are in good agreement with exact computer simulations (Fig. 9).
C.3 Derivation of Approximate Expectation for Type k + 1
The goal in this section is to perform the approximation
| C30 |
Let
Then, the target expression can be rewritten into
Next, following the same derivations in Section B.1, we can get that
| C31 |
as . Lastly, since
and
we obtain that
Code availability
For access to Gillespie simulation code, please contact the authors.
Declarations
Conflict of interest
The authors declare that they have no conflict of interest.
References
- Ancarani LU, Gasaneo G (2009) Derivatives of any order of the Gaussian hypergeometric function 2F1(a, b, c; z) with respect to the parameters a, b and c. J Phys A Math Theor 42(39):395208. 10.1088/1751-8113/42/39/395208
- Armitage P, Doll R (1954) The age distribution of cancer and a multi-stage theory of carcinogenesis. Br J Cancer 8(1):1–12. 10.1038/bjc.1954.1
- Avanzini S, Antal T (2019) Cancer recurrence times from a branching process model. PLoS Comput Biol 15(11):e1007423. 10.1371/journal.pcbi.1007423
- Bozic I, Antal T, Ohtsuki H, Carter H, Kim D, Chen S, Karchin R, Kinzler KW, Vogelstein B, Nowak MA (2010) Accumulation of driver and passenger mutations during tumor progression. Proc Natl Acad Sci USA 107(43):18545–18550. 10.1073/pnas.1010978107
- Bozic I, Reiter JG, Allen B, Antal T, Chatterjee K, Shah P, Moon YS, Yaqubie A, Kelly N, Le DT, Lipson EJ, Chapman PB, Diaz LA Jr, Vogelstein B, Nowak MA (2013) Evolutionary dynamics of cancer in response to targeted combination therapy. Elife 2:e00747. 10.7554/eLife.00747
- Campbell F, Williams GT, Appleton MA, Dixon MF, Harris M, Williams ED (1996) Post-irradiation somatic mutation and clonal stabilisation time in the human colon. Gut 39
- Deininger MW, Goldman JM, Melo JV (2000) The molecular biology of chronic myeloid leukemia. Blood 96(10):3343–3356. 10.1182/blood.V96.10.3343
- DLMF (2022) NIST Digital Library of Mathematical Functions. Release 1.1.8 of 2022-12-15, http://dlmf.nist.gov/
- Durrett R, Foo J, Leder K, Mayberry J, Michor F (2011) Intratumor heterogeneity in evolutionary models of tumor progression. Genetics 188(2):461–477. 10.1534/genetics.110.125724
- Durrett R, Moseley S (2010) Evolution of resistance and progression to disease during clonal expansion of cancer. Theor Popul Biol 77(1):42–48. 10.1016/j.tpb.2009.10.008
- Fearon ER (2011) Molecular genetics of colorectal cancer. Annu Rev Pathol 6:479–507. 10.1146/annurev-pathol-011110-130235
- Foo J, Leder K, Zhu J (2014) Escape times for branching processes with random mutational fitness effects. Stoch Process Their Appl 124(11):3661–3697. 10.1016/j.spa.2014.06.003
- Komarova NL, Wodarz D (2005) Drug resistance in cancer: principles of emergence and prevention. Proc Natl Acad Sci USA 102(27):9714–9719. 10.1073/pnas.0501870102
- Lee-Six H, Øbro NF, Shepherd MS, Grossmann S, Dawson K, Belmonte M, Osborne RJ, Huntly BJ, Martincorena I, Anderson E, O’Neill L, Stratton MR, Laurenti E, Green AR, Kent DG, Campbell PJ (2018) Population dynamics of normal human blood inferred from somatic mutations. Nature 561
- Meza R, Jeon J, Moolgavkar SH, Luebeck EG (2008) Age-specific incidence of cancer: phases, transitions, and biological implications. Proc Natl Acad Sci USA 105(42):16284–16289. 10.1073/pnas.0801151105
- Michor F, Iwasa Y, Nowak MA (2006) The age incidence of chronic myeloid leukemia can be explained by a one-mutation model. Proc Natl Acad Sci USA 103(40):14931–14934. 10.1073/pnas.0607006103
- Mitchell E, Chapman MS, Williams N, Dawson KJ, Mende N, Calderbank EF, Jung H, Mitchell T, Coorens TH, Spencer DH, Machado H, Lee-Six H, Davies M, Hayler D, Fabre MA, Mahbubani K, Abascal F, Cagan A, Vassiliou GS, Baxter J, Martincorena I, Stratton MR, Kent DG, Chatterjee K, Parsy KS, Green AR, Nangalia J, Laurenti E, Campbell PJ (2022) Clonal dynamics of haematopoiesis across the human lifespan. Nature 606
- Morin PJ, Sparks AB, Korinek V, Barker N, Clevers H, Vogelstein B, Kinzler KW (1997) Activation of β-catenin-Tcf signaling in colon cancer by mutations in β-catenin or APC. Science 275(5307):1787–1790. 10.1126/science.275.5307.1787
- Nicholson AM, Olpe C, Hoyle A, Thorsen AS, Rus T, Colombé M, Brunton-Sim R, Kemp R, Marks K, Quirke P et al (2018) Fixation and spread of somatic mutations in adult human colonic epithelium. Cell Stem Cell 22(6):909–918. 10.1016/j.stem.2018.04.020
- Nicholson MD, Antal T (2019) Competing evolutionary paths in growing populations with applications to multidrug resistance. PLoS Comput Biol 15:1–25. 10.1371/journal.pcbi.1006866
- Nicholson MD, Cheek D, Antal T (2023) Sequential mutations in exponentially growing populations. PLoS Comput Biol 19(7):1–32. 10.1371/journal.pcbi.1011289
- Paterson C, Clevers H, Bozic I (2020) Mathematical model of colorectal cancer initiation. Proc Natl Acad Sci USA 117(34):20681–20688. 10.1073/pnas.2003771117
- Potten CS, Booth C, Hargreaves D (2003) The small intestine as a model for evaluating adult tissue stem cell drug targets. Cell Prolif. 10.1046/j.1365-2184.2003.00264.x
- Snippert HJ, Schepers AG, Van Es JH, Simons BD, Clevers H (2014) Biased competition between Lgr5 intestinal stem cells driven by oncogenic mutation induces clonal expansion. EMBO Rep 15(1):62–69. 10.1002/embr.201337799
- Tomasetti C, Marchionni L, Nowak MA, Parmigiani G, Vogelstein B (2015) Only three driver gene mutations are required for the development of lung and colorectal cancers. Proc Natl Acad Sci USA 112(1):118–123. 10.1073/pnas.1421839112
- Vogelstein B, Kinzler KW (2004) Cancer genes and the pathways they control. Nat Med 10(8):789–799. 10.1038/nm1087
- Vogelstein B, Papadopoulos N, Velculescu VE, Zhou S, Diaz LA Jr, Kinzler KW (2013) Cancer genome landscapes. Science 339(6127):1546–1558. 10.1126/science.1235122
- Wang Y, Boland CR, Goel A, Wodarz D, Komarova NL (2022) Aspirin’s effect on kinetic parameters of cells contributes to its role in reducing incidence of advanced colorectal adenomas, shown by a multiscale computational study. Elife 11:e71953. 10.7554/eLife.71953
- Zhang R, Ukogu OA, Bozic I (2023) Waiting times in a branching process model of colorectal cancer initiation. Theor Popul Biol 151:44–63. 10.1016/j.tpb.2023.04.001