Skip to main content
Nucleic Acids Research logoLink to Nucleic Acids Research
. 2013 Dec 25;42(5):2870–2878. doi: 10.1093/nar/gkt1320

A phylogenetic model for understanding the effect of gene duplication on cancer progression

Qin Ma 1, Jaxk H Reeves 2, David A Liberles 3, Lili Yu 4, Zheng Chang 5, Jing Zhao 2, Juan Cui 6, Ying Xu 1,7,8,*, Liang Liu 1,2,*
PMCID: PMC3950708  PMID: 24371277

Abstract

As biotechnology advances rapidly, a tremendous amount of cancer genetic data has become available, providing an unprecedented opportunity for understanding the genetic mechanisms of cancer. To understand the effects of duplications and deletions on cancer progression, two genomes (normal and tumor) were sequenced from each of five stomach cancer patients in different stages (I, II, III and IV). We developed a phylogenetic model for analyzing stomach cancer data. The model assumes that duplication and deletion occur in accordance with a continuous time Markov Chain along the branches of a phylogenetic tree attached with five extended branches leading to the tumor genomes. Moreover, coalescence times of the phylogenetic tree follow a coalescence process. The simulation study suggests that the maximum likelihood approach can accurately estimate parameters in the phylogenetic model. The phylogenetic model was applied to the stomach cancer data. We found that the expected number of changes (duplication and deletion) per gene for the tumor genomes is significantly higher than that for the normal genomes. The goodness-of-fit test suggests that the phylogenetic model with constant duplication and deletion rates can adequately fit the duplication data for the normal genomes. The analysis found nine duplicated genes that are significantly associated with stomach cancer.

INTRODUCTION

Cancer is one of the leading causes of death in Americans (1). Cancer research has led to a variety of effective treatments and diagnostic techniques for cancers. Yet, the fundamental genetic mechanisms that turn normal cells into tumors remain mysterious. Advances in the biotechnology field have provided an unprecedented opportunity for understanding the origin and progression of cancer (2–4). The availability of genetic data ignites the hope that we may discover the genetic mechanisms of cancer by examining the genetic differences between normal and cancer genomes (5). It is, however, a challenging task to effectively analyze such genetic data by modeling the genetic variation observed within and between the normal and cancer groups (6). Previous studies have demonstrated that cancer progression is an evolutionary process in which mutation and natural selection are two key factors (7,8). Mutation causes genetic variation among normal cells that can trigger cancer (9). On the other hand, selection plays an important role in therapeutic resistance (10–12) and in the birth and death process of cancer cells, as cancer cells vary and the fittest ones survive after competition (13).

In the last few decades, theory from evolution and ecology has been adapted in cancer studies to investigate the genetic mechanisms of cancer (14–18). Muto et al. (19) studied colon cancers and found that most colon cancers have evolved from adenomatous polyps known as polyp-cancer sequences. Evolutionary ideas have been explored in many cancer analyses (20–22). Nowell (23) proposed a landmark colonial evolution model for tumor progression, which assumes that most neoplasms originate from a single cell. Gillies et al. (24) proposed an evolutionary model for malignant cancers, in which the micro-environmental forces such as hypoxia can stimulate genetic instability and impose selection pressures on cancer cells. Recently, Wu (25) investigated the evolution of cancer cells after the primary tumor had spread to secondary sites (26). Ultimately, cancer evolution within an individual can be viewed genetically as adaptation to a new lifestyle and ecologically in the context of the other cell types and resources available in an individual.

Heterogeneity of cancer caused by genetic instability is the main challenge in the process of understanding cancer evolution and in the process of identifying driver genes (8). Due to this challenge, cancer data sets often lack signal regarding the evolutionary process of cancer. It is difficult to find genomic mutations/events that trigger cancer, especially those that trigger the early-stage cancer. High throughput technologies, particularly next generation sequences (NGS) provide researchers with new opportunities to understand the evolutionary process of cancer development at a single cell nucleotide level (16,27–29). NGS technology is able to identify alterations in the genome, e.g. chromosomal rearrangement and copy number variation, rather than point mutations; and can sequence genetic material from lower-frequency samples (30). Because of the advantages of NGS data, the NGS technology has been extensively used in cancer studies to examine genetic mechanisms that cause cancer progression.

Gene duplication is believed to play an important role in tumor progression (31). Duplicated genes have been frequently observed in the genomes of cancer patients. Waris and Ahsan (32) suggested that gene duplication and other changes in DNA may be involved in the initiation of various cancers. Previous studies found that there is a strong correlation between gene duplication and large tumor size, indicating that gene duplication may play a critical role in tumor progression (33). However, the information at early stages of cancer is usually unavailable and little is known about the relationship between gene duplication and early-stage cancer.

The primary goal of the study is to investigate the effects of gene duplication and deletion on the incidence and progression of cancers. Specifically, this study aims to estimate the duplication and deletion rates on normal and tumor genomes, and to identify duplicated genes that are highly associated with stomach cancer. We have developed a probabilistic model in the context of coalescent trees of normal genomes attached with five tumor genomes for understanding how gene duplication and deletion are related to different stages of cancer as cancer progresses. A maximum likelihood approach is adopted to estimate model parameters, including duplication and deletion rates. This approach can identify duplicated genes that are significantly associated with stomach cancer.

MATERIALS AND METHODS

Genome annotation and duplication data

The genomic data was obtained from five stomach cancer patients (34). Two samples (tumor and normal) were taken from each patient; tumor tissues were surgically removed from part of the patients’ stomachs, while blood samples were extracted as normal tissues from the same patients (34). Determination of pathologic stages of tumor tissues is based on the standards recommended by World Health Organization (WHO). Pathological examination suggested that two patients were in stage II of stomach cancer, while remaining three patients were in stages I, III and IV (Table 1). Stomach cancer has two subtypes in terms of the genome instability—micro satellite instability (35) and chromosome instability (36). However, the subtypes of the stomach cancer for five patients in this study are not available in (34). The genomes were sequenced for each of the two samples (normal and tumor). Both the normal and tumor genomes were compared with the human reference genome to identify duplicated genes. As the human reference genome is a haploid sequence, it may result in underestimation of duplication events. High-confidence duplication events were identified if a junction in the genomic data satisfied all of the following criteria: (i) at least 10 mate-pairs in cluster for its junction, (ii) successful de novo assembly of the junction, (iii) high mapping diversity with both left length and right length no less than 70 and (iv) absence of specific repeat sequences on left and right side of junction. As it is assumed that duplication events occur independently among genes, the junctions that covered more than two genes were excluded. With these criteria, we identified 210 genes on which duplication occurred for at least one of the 10 genomes (5 normal and 5 tumors). We use ‘1’ to denote duplication and ‘0’ to denote no duplication. Cui et al. (34) did not estimate the total number of genes in the genomes of five patients. We used the estimate from ENCODE (37) that the total number of genes in the human genome is 21 000. Because the most significant inferences are based on the relative duplication and deletion rates, uncertainty in the total number of human genes does not affect the major conclusions of the data analysis. In summary, the data matrix D has 21 000 rows and 10 columns; each row represents a gene and each column represents a genome (normal or tumor). The entries in the matrix D are either 0 (no duplication) or 1 (duplication).

Table 1.

The number of duplicated genes on the genomes of five stomach cancer patients

Subject Stage No of duplicated genes in tumor genomes
S1 I 64
S2 II 84
S3 II 57
S4 III 75
S5 IV 72

A phylogenetic model for gene duplication

The stochastic process of gene duplication and deletion

The process of gene duplication and deletion is a continuous time Markov process with two states 1 (duplication) and 0 (no duplication). Let d(t) denote the Markov process on states 0 and 1. We assume that transition probabilities Pij(t) are stationary and the infinitesimal duplication and deletion rates are a and b, respectively. The probability of a duplication event (and a deletion event) during a time period of duration Δt, as Δt → 0, is (38,39)

graphic file with name gkt1320m1.jpg (1)

The notation ot) indicates Inline graphic. The probability distribution P(t) of d(t) can be derived from theory of Markov process. Let T = (a + b)t and m = a/(a + b). The transition probabilities Pij(t) are given as follows:

graphic file with name gkt1320m2.jpg (2)

As Inline graphic,

graphic file with name gkt1320m3.jpg (3)

Thus the limiting distribution as Inline graphic is P(d(T) = 0) = 1 − m and P(d(T) = 1) = m. This model is time reversible, in the sense that P(d(t) = 0, d(t +T) = 1) = P(d(t) = 1, d(t + T) = 0).

The likelihood function under the phylogenetic model

A phylogenetic tree with five extended branches describes the history of 10 genomes of five cancer patients. The five normal genomes are at the tips of the tree, which are attached with five extended branches leading to the tumor genomes (Figure 1). The tree without the extended branches represents the history of five normal genomes, while the extended branches represent the duplication/deletion process that leads to the tumor genomes. Each tumor progression was treated as an independent process, even though the tumors progressed through the same stages phenotypically. As described above, the normal and tumor sequences are coded as binary data (0: no duplication or 1: duplication). Each site of the sequences contains duplication status of a gene across five patients. Parameters of the phylogenetic model include the topology of the tree, branch lengths Ti, and parameter m = a/(a + b). Let M be the number of branches in the tree and W be the number of extended branches. We assume that m is constant on the main branches of the tree, but the extended branches have variable (relative) duplication rates {Inline graphic, k = 1, … , W}. Given a phylogenetic tree S (topology, branch lengths T and parameter m) and the extended branches (branch lengths Te and me), the probability distribution of data matrix D can be derived from the transition probability function in (2). Let dij and d*ij be the duplication status of gene j at the two ends of branch i (with length Ti) in tree S with topology τ. It follows from (2) that the probability of dij and Inline graphic, given branch length Ti, parameter m, and tree topology τ, is

graphic file with name gkt1320m4.jpg (4)

Given the states of the internal nodes, the Markov processes on different branches of the phylogenetic tree are independent of one another. The probability distribution function of duplication events on gene j (denoted by Dj) is the product of the probabilities for individual branches in (4), i.e.,

graphic file with name gkt1320m5.jpg (5)

In (5), I denotes the duplication status at the internodes of the tree. The first term in (5) is the probability of duplication events in normal genomes, given the phylogenetic tree without the extended branches. The second term is the probability of duplication events in tumor genomes, given the extended branches, in which ckj and Inline graphic are the duplication status of gene j at the two ends of the extended branch k. The probability function P(Dj, I | T, m, τ, Te, Inline graphic) in (5) assumes that the duplication status at the internodes of the tree are given. Because in reality I is often not given, we calculate P(Dj | T, m, τ, Te, me), which is the sum over all possible realizations of I, i.e.,

graphic file with name gkt1320m6.jpg (6)

The probability distribution P(Dj | T, m, τ, Te, me) in (6) can be efficiently calculated by a peeling technique described by Felsenstein (40).

Figure 1.

Figure 1.

The tree in the phylogenetic model. The normal genomes (N) at the tips of the tree are attached with five extended branches leading to the tumor genomes (T). Ni and Ti are the normal and tumor genomes of patient i. The tree (above normal genomes) represents the history of five normal genomes, while the extended branches represent the process leading to the five tumor genomes. The normal genomes are the ancestral genomes of the tumor genomes.

In the context of population genetics, the phylogenetic tree of n individuals varies over different loci due to the coalescent. Let t = {tj, j = 2, … , n} be the waiting times until the next coalescence event. Let θ = 4uNe be the population size parameter, in which Ne is the effective population size and u is the change (duplication and deletion) rate per gene. According to the coalescent theory, the waiting times tj’s are independently distributed with the exponential density

graphic file with name gkt1320m7.jpg (7)

The expected coalescent time for a haploid genome from two individuals chosen at random from the human population is E(t2) = θ/2, which indicates that if the sequences of a gene are sampled from one of the genomes of two individuals chosen at random from the human population, the expected duplication probability E(P0,1(T)) is equal to the probability P0,1(T), averaging over coalescence time T, which has an exponential density described in (7), i.e.,

graphic file with name gkt1320m8.jpg (8)

Similarly, the expected deletion probability is Inline graphic Inline graphic. The expected number of changes per gene between two genomes chosen at random from the human population is

graphic file with name gkt1320m9.jpg (9)

in which x1 represents the duplication status of a gene in one of the two genomes, and x2 represents the duplication status of the same gene in the other genome.

The parameters θ and m are estimated by averaging over gene trees, in which branch lengths T are the sum of a set of coalescence waiting times t with a density function described in (7). The probability of observing certain duplication states (D for current individuals and I for their ancestors) of a gene is then equal to the likelihood in (5) (without the extended branches) averaging over coalescence waiting times t i.e.,

graphic file with name gkt1320um1.jpg

The probability P(Dj | m, θ) for a single locus is equal to the probability in (5) summing over all possible duplication states at the internal nodes of the phylogenetic tree,

graphic file with name gkt1320um2.jpg

Since probability P(Dj | m, θ) under the coalescent model is invariant to the order of the duplication states of individuals, the relevant random variable here is the number of duplications across individuals. When there are n individuals, the number of individuals who have duplication for a particular gene could be 0 up to n. Let {xi, i = 0, … , n} be the number of genes for which i individuals have duplications. The sum of xi’s (N) is the total number (21 000) of genes considered in this study. We use {pi, i = 0, … , n} to denote the probability of observing i individuals with duplication. Thus {xi, i = 0, … , n} follows a multinomial distribution, i.e.,

graphic file with name gkt1320um3.jpg

Because the multinomial coefficient does not involve model parameters, we delete this term and write the log-likelihood function as

graphic file with name gkt1320m10.jpg (10)

In this equation, pi is a function of m and θ, which will be derived as follows under the coalescent theory. Considering a simple case of two individuals, the coalescence time t has an exponential distribution with density 2e−2t/θ. Let y be the duplication state at the root, and z1 and z2 be the duplicate states of two individuals at the tips of the tree. As there are only two states for y, z1 and z2, the domain of y, z1 and z2 has only two values, 0 and 1. The goal is to derive the probabilities of (z1 = 0, z2 = 0), (z1 = 0, z2 = 1), (z1 = 1, z2 = 0) and (z1 = 1, z2 = 1). We assume that the states at the root have the equilibrium distribution with probability mass function P(y = 0) = 1 − m and P(y = 1) = m. The probability of (z1 = 0, z2 = 0) is

graphic file with name gkt1320um4.jpg

Similarly, P(z1 = 0, z2 = 1) = P(z1 = 1, z2 = 0) = θm(1 m)/(θ + 1), and P(z1 = 1, z2 = 1) = m2 + m(1 m)/(θ + 1). Thus we have p0 = (1 − m)2 + m(1 m)/(θ + 1), p1 = 2θm(1 m)/(θ + 1), and p2 = m2 + m(1 m)/(θ + 1). The log-likelihood function becomes

graphic file with name gkt1320m11.jpg (11)

For an arbitrary number of individuals, we use an iterative algorithm (Supplementary Material S1) to calculate probability pi. The maximum likelihood estimates of θ and m are obtained by using the L-BFGS-B algorithm (41) implemented in an R optimization function optim. In addition to the estimates of model parameters, function optim outputs the hessian matrix (also called observed Fisher information matrix) that can be used to calculate the variances of the estimates.

Additionally, equation (5) implies that duplication processes occurring on the extended branches are conditionally independent of those occurring on the other branches of the tree. Thus, parameters on the extended branches can be estimated separately. Let {ak; k = 1, … , W} and {bk; k = 1, … , W} be the duplication and deletion rates on the W extended branches. The ratio parameter is mk = ak/(ak + bk) and the branch length is Inline graphic = (ak + bk)tk. Parameters {mk, Inline graphic; k = 1, … , W} on the extended branches can be estimated from the empirical frequencies of observations 00, 01, 10 and 11 on the normal and tumor genomes of each patient. The two digits are the duplication status of the genes on the normal and tumor genomes, respectively, from the same patient. Let n00, n01, n10, n11 be the count of the genes with pattern 00, 01, 10 and 11, respectively. The count n01 has binomial distribution with Inline graphic and n0 = n00 + n01, in which Te is the length of the extended branch. Similarly, the count n10 has binomial distribution with probability Inline graphic Inline graphic and n1 = n10 + n11. The maximum likelihood estimate of p01 is n01/n0, i.e., Inline graphic and Inline graphic. Thus, the estimates of me and Te are given by Inline graphic and Inline graphic.

Simulation study

To evaluate the performance of the phylogenetic model developed in the previous section, duplication and deletion events were simulated from the phylogenetic model. The values of parameters (m, θ) were set to (0.01, 0.01), (0.01, 0.1), (0.3, 0.01), (0.3, 0.1), respectively. For each parameter setting, we simulated duplication and deletion events for 1000, 5000 and 10 000 genes. The simulated data were then used to estimate parameters (m, θ) in the phylogenetic model. Each simulation was repeated 10 times, and the square root of mean square error (RMSE) between the estimate and the true value of the model parameter was calculated. Let Inline graphic be the estimate of parameter θ. The RMSE is Inline graphic, where g is the number of simulations and Inline graphic is the estimate of θ for the i-th simulation. Overall, the results show that the RMSEs of parameters m and θ decrease as the number of genes increases for all parameter settings (Figure 2). The RMSE of m depends on not only the value of m, but also the value of θ and the number of genes. It appears that m has a smaller RMSE when θ is large (θ = 0.1) at 1000 genes (Figure 2a). But this pattern is reversed for m = 0.3 at 5000 and 10 000 genes. In contrast, θ consistently has a smaller RMSE when m is large (Figure 2b). This may be caused by the fact that a large m tends to generate more duplications in the simulated data. Thus it is relatively straightforward to estimate θ when m is large. For all parameter settings, the RMSEs of m and θ decrease to a reasonable level (<0.008), when the number of genes reaches 10 000.

Figure 2.

Figure 2.

Simulation results. The square root of the mean square error for (a) estimating parameter m, and for (b) estimating parameter θ.

RESULTS

In the stomach cancer data set, there are 210 genes on which duplication has occurred for at least one of the 10 genomes. The remaining 20 790 genes have the pattern of (0,0,0,0,0,0,0,0,0,0), given that the total number of genes on the human genomes is 21 000 (37). The number of duplicated genes in the tumor genomes varies across the stages of stomach cancer (Table 1). There is a high degree of individual variation in this data, although this might suggest that duplication and deletion rates may vary across different stages of stomach cancer.

We use the phylogenetic model to estimate and compare the overall changes (duplications and deletions) on the normal genomes and tumor genomes. We expect that the overall changes in the tumor genome are significantly higher than those in the normal genomes. We also investigate the pattern (increasing or decreasing) of the duplication and deletion rates across cancer stages. Finally, we identify significant duplication and deletion events associated with tumor genomes. As duplication and deletion rates depend on the total number of genes in the human genomes, the duplication and deletion rates estimated from the stomach cancer data set are relative rates. Moreover, the strategies of identifying duplication events on the genomes of five patients may underestimate the number of duplications. Because underestimation occurs for both normal and tumor genomes, it will not affect the gross conclusions based on the comparison of the relative duplication and deletion rates on normal and tumor genomes, or among the tumor genomes in different cancer stages.

We first assumed a fix tree for all genes (described in Figure 1), and used a Bayesian approach (Supplementary Material S2) to estimate the phylogenetic tree and the duplication and deletion rates. The Bayesian estimate of the phylogenetic tree is poorly supported with all posterior probabilities <0.4 (Supplementary Figure S3c). The low posterior probabilities for the nodes in the Bayesian tree, despite such a large number of observations, suggest that there is not a single tree generating the empirical data. Thus, we modeled the gene trees in the context of population genetics using the coalescent theory as described above, and calculated the maximum likelihood estimates of model parameters m and θ.

The maximum likelihood estimates of model parameters

The model parameters m and θ were estimated by maximizing the log-likelihood function described in equation (10). We used the L-BFGS-B algorithm (41) implemented in the R optimization function optim to maximize the log-likelihood function in equation (10). The estimate of θ for tumor genomes is twice as high as that for normal genomes (Table 2). The expected number of changes per gene for the tumor genomes is Inline graphic (see equation (9)), which is significantly higher than that (0.0012) for the normal genomes. This result suggests that the number of changes (duplications and deletions) in the tumor genomes is significantly greater than the number of changes in the normal genomes.

Table 2.

The maximum likelihood estimates of m and θ for normal and tumor genomes

m (SE) θ (SE)
Normal 0.0028 (0.0002) 0.7134 (0.1024)
Tumor 0.0037 (0.0003) 1.4750 (0.1881)

The values within parentheses are standard errors of the estimates.

The goodness of fit of the phylogenetic model was evaluated by the chi-square goodness-of-fit test implemented in an R function chisq.test. The observed counts of genes for which 0 up to 5 individuals have duplication were calculated for the normal and tumor genomes (Table 3). Moreover, the expected count of genes for which i individuals have duplication equals 21 000 × pi, in which the probability pi of observing i individuals with duplication was obtained from the L-BFGS-B algorithm described in the previous section. The probabilities {p0, p1, p2, p3, p4, p5} for the normal genomes are 0.9942, 0.0021, 0.0011, 0.0008, 0.0007, 0.0008, respectively. The chi-square test cannot reject the phylogenetic model for the normal genomes, with P-value = 0.828 (Table 3). In contrast, the probabilities {p0, p1, p2, p3, p4, p5} for the tumor genomes are 0.9903, 0.0049 0.0022, 0.0012, 0.0007, 0.0004, respectively. The phylogenetic model for tumor genomes is strongly rejected by the chi-square test, with P-value < 106 (Table 3). The phylogenetic model assumes constant duplication and deletion rates across branches of the tree. However, duplication and deletion rates may be highly variable in different stages of stomach cancer, and thus the assumption of constant duplication and deletion rates may be seriously violated when modeling tumor genomes in different stages of cancer. To take into account variable duplication and deletion rates, we separately fit the two-states duplication and deletion model to each of the external branches. As we expected, duplication and deletion rates vary across external branches (Table 4). Overall, the deletion rates are much higher than the duplication rates on the extended branches (Table 4), suggesting that deletion occurred more often than duplication in tumor genomes. The duplication and deletion rates appear not to have either an increasing or decreasing pattern associated with cancer stages.

Table 3.

The chi-square goodness-of-fit test for the phylogenetic model

No of patients with duplication Normal
Tumor
Observed counts Expected counts Observed counts Expected counts
0 20 885 20 880 20 803 20 798
1 46 44.6 129 103.2
2 23 24.1 26 46.4
3 18 17.8 16 26.9
4 10 15.6 7 16.4
5 18 17.6 19 8.9
P-value = 0.8283 P-value = 7.3e-07

Table 4.

The estimates of relative duplication and deletion rates on the extended branches

Duplication rate Deletion rate
T1 0.0013 0.2626
T2 0.0024 0.2079
T3 0.0007 0.2412
T4 0.0009 0.1443
T5 0.0017 0.2817

Identifying cancer-related duplicated genes

Let x denote the duplication status of a gene, with x = 1 referring to the cases where the gene collected from the tumor tissue is duplicated, while the gene collected from the normal tissue of the same patient is not duplicated. If duplication of the same gene is observed on a large number of tumor genomes, it is strong evidence that the duplicated gene is associated with tumor. We call this type of duplication ‘cancer-related duplication’ (occurring on the tumor genome, but not on the normal genome). Let y be the number of cancer-related duplications for a particular gene observed in the genomes of five patients. For the stomach cancer data, the value of y can be 0 up to 5. Under the null hypothesis that the duplication of a particular gene in the tumor genome is normal, we expect that the observed cancer-related duplication probability of a gene will be similar to the duplication probability in normal genomes. Thus, a duplicated gene is associated with cancer if the observed probability is significantly higher than the duplication probability in normal genomes. Given that m = 0.0028 and θ = 0.7134, the average duplication probability in normal genomes is p = mθ/(1 + θ) = 0.0012 (see equation (8)). Under the null hypothesis, the random variable y (number of duplications) has a binomial distribution with P = 0.0012 and n = 5. The null hypothesis was rejected for nine duplicated genes (CDH4, CLPS, CLSTN2, EML5, NPEPL1, SENP5, SPTB, VAMP7, XAGE-4), with the overall P-value < 0.05 adjusted by Bonferroni correction for multiple comparisons (Table 5). The same list of genes was identified when the duplication probability p was calculated from the two ends of the 95% confidence interval (mean ± 2SE) of m and θ. Similarly, the frequencies of deleted genes on the tumor genomes are compared with the deletion probability in normal genomes. If the observed frequency of deletions is significantly higher than the deletion probability in the normal genomes, we conclude that the deletion is significantly associated with cancer. We did not find any deletion that is significantly associate with cancer.

Table 5.

Identification of duplicated genes associated with cancer

Number of duplications Number of genes P-value Cumulative P-value Significance
0 20 853 1.0 1.0
1 109 0.006 >0.5
2 7 1.4e-05 1e-04 *
3 1 1.7e-8 1.7e-08 *
4 0 1e-11 2.4e-15 *
5 1 2.4e-15 2.4e-15 *

The significant genes are indicated by *. The cumulative P-value was adjusted with Bonferroni correction for multiple comparisons.

The functional annotation of nine significantly duplicated genes (Supplementary Table S1) was conducted by the DAVID web server (42). The analyses generated two significant annotation clusters (Supplementary Table S2). The first annotation cluster includes three genes (CDH4, CLSTN2 and NPEPL1), which are related to ion binding, specifically metal ion binding. Metal ion binding has been found to play an important role in the anticancer activity of UK-1 analogs (43). The four genes (CDH4, VAMP7, CLSTN2 and SPTB) in the second annotation cluster are mainly related to membrane or transmembrane proteins, which function as gateways to link inside and outside of a cell. Previous cancer studies suggest that membrane proteins are related to cancer progression (44), and transmembrane genes are usually quite important in drug design (45).

DISCUSSION

Genomic data have become one of the most valuable resources of information for understanding the genetic mechanisms of cancer (22). Due to the complexity of the genomic data, it is challenging to develop a probabilistic model that can effectively extract useful information from genomic data. The genome-wide association study (GWAS) is a powerful approach for identifying cancer-related genes, based on comparison of single-nucleotide polymorphisms (SNP) in the normal and cancer genomes (46–48). The phylogenetic model developed in this article is based on the same principle to identify cancer-related duplications by comparing the normal and tumor genomes. Additionally, the phylogenetic model adds a layer of biological realism to the analysis that was otherwise not present in the GWAS analysis.

The phylogenetic model developed in this article is designed for genome-wide duplication data analysis. It has been shown through simulation that the phylogenetic method can accurately estimate the model parameters, including duplication and deletion rates. Previous studies suggest that the mechanism of cancer is complex and may involve multiple biological processes (49). For those cases, the analysis based on the phylogenetic model in which only duplication and deletion events are considered may produce biased results. In the future, we will extend the current phylogenetic model by including more biological factors (see for example, 50 in an evolutionary context). In the phylogenetic model, we assume that genes are independent of each other. This assumption may not hold, because several genes might be in the same linkage block or under selection for functional purposes. Treating genes as independent samples, while they are not, may increase the effective sample size and thus produce an estimator with an artificially smaller variance. In addition, non-independent gene trees may bias the estimates of model parameters, especially when the recombination events are highly correlated with duplication and deletion events. The effect of non-independent gene trees depends on the recombination rate of human genomes. Non-independent gene trees have been modeled for a three-taxon case (51), but it is generally quite difficult to deal with non-independent gene trees due to linkage disequilibrium. Although we do not deal with non-independent gene trees in this article, this issue clearly needs more attention.

Despite the fact that genomic data from cancer patients will become increasingly available, the high cost of sequencing whole genomes significantly limits the size of such genomic data. The data set analyzed in this article contains genomes from only five patients, one or two patients for each stage of stomach cancer. We expect that the availability of multiple genomes from more patients (along with the actual number of gene copies for each gene) will significantly improve the estimation of model parameters and increase the power for testing relevant biological hypotheses about the mechanisms of cancer under the phylogenetic model.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

National Science Foundation Grant [DMS-1222745] to Dr Liu and National Science Foundation Grant [DMS-1222940] to Dr Liberles. Funding for open access charge: National Science Foundation Grant [DEB-0830024] and the DOE BioEnergy Science Center [contract no. DE-PS02-717 06ER64304] [DOE 4000063512].

Conflict of interest statement. None declared.

Supplementary Material

Supplementary Data

ACKNOWLEDGEMENTS

We thank Dr Guojun Li for the helpful discussions on the phylogenetic model. We thank Prof. Madan Babu and two anonymous reviewers for the thoughtful comments.

REFERENCES

  • 1.Siegel R, Ward E, Brawley O, Jemal A. Cancer statistics. A Cancer J. Clinicians. 2011;61:212–236. doi: 10.3322/caac.20121. [DOI] [PubMed] [Google Scholar]
  • 2.Brosnan JA, Iacobuzio-Donahue CA. A new branch on the tree: next-generation sequencing in the study of cancer evolution. Semin. Cell Dev. Biol. 2012;23:237–242. doi: 10.1016/j.semcdb.2011.12.008. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Iacobuzio-Donahue CA. Genetic evolution of pancreatic cancer: lessons learnt from the pancreatic cancer genome sequencing project. Gut. 2012;61:1085–1094. doi: 10.1136/gut.2010.236026. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Ma QC, Ennis CA, Aparicio S. Opening Pandora's Box—the new biology of driver mutations and clonal evolution in cancer as revealed by next generation sequencing. Curr. Opin. Genet. Dev. 2012;22:3–9. doi: 10.1016/j.gde.2012.01.008. [DOI] [PubMed] [Google Scholar]
  • 5.Jones S, Chen W-d, Parmigiani G, Diehl F, Beerenwinkel N, Antal T, Traulsen A, Nowak MA, Siegel C, Velculescu VE, et al. Comparative lesion sequencing provides insights into tumor evolution. Proc. Natl Acad. Sci. USA. 2008;105:4283–4288. doi: 10.1073/pnas.0712345105. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Prisman EZ, Gafni A, Finelli A. Testing the evolution process of prostate-specific antigen in early stage prostate cancer: what is the proper underlying model? Stat. Med. 2011;30:3038–3049. doi: 10.1002/sim.4329. [DOI] [PubMed] [Google Scholar]
  • 7.Ayala FJ. ‘Nothing in biology makes sense except in the light of evolution’: Theodosius Dobzhansky: 1900-1975. J. Hered. 1977;68:3–10. doi: 10.1093/oxfordjournals.jhered.a108767. [DOI] [PubMed] [Google Scholar]
  • 8.Greaves M, Maley CC. Clonal evolution in cancer. Nature. 2012;481:306–313. doi: 10.1038/nature10762. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Erren TC. On the origin of cancer: evolution and a mutation paradox. Med. Hypotheses. 2009;73:124–125. doi: 10.1016/j.mehy.2009.02.002. [DOI] [PubMed] [Google Scholar]
  • 10.Aktipis CA, Kwan VSY, Johnson KA, Neuberg SL, Maley CC. Overlooking evolution: a systematic analysis of cancer relapse and therapeutic resistance research. Plos One. 2011;6 doi: 10.1371/journal.pone.0026100. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Calcagno AM. Evolution of drug resistance in cancer: the emergence of unique mechanisms and novel techniques. Mol. Pharmaceutics. 2011;8:1993–1993. doi: 10.1021/mp200531m. [DOI] [PubMed] [Google Scholar]
  • 12.Gillies RJ, Verduzco D, Gatenby RA. Evolutionary dynamics of carcinogenesis and why targeted therapy does not work. Nat. Rev. Cancer. 2012;12:487–493. doi: 10.1038/nrc3298. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Goymer P. Natural selection: the evolution of cancer. Nature. 2008;454:1046–1048. doi: 10.1038/4541046a. [DOI] [PubMed] [Google Scholar]
  • 14.Allred DC, Wu Y, Mao S, Nagtegaal ID, Lee S, Perou CM, Mohsin SK, O'Connell P, Tsimelzon A, Medina D. Ductal carcinoma in situ and the emergence of diversity during breast cancer evolution. Clin. Cancer Res. 2008;14:370–378. doi: 10.1158/1078-0432.CCR-07-1127. [DOI] [PubMed] [Google Scholar]
  • 15.Ewald PW, Ewald HAS. Infection, mutation, and cancer evolution. J. Mol. Med. 2012;90:535–541. doi: 10.1007/s00109-012-0891-2. [DOI] [PubMed] [Google Scholar]
  • 16.Gerlinger M, Rowan AJ, Horswell S, Larkin J, Endesfelder D, Gronroos E, Martinez P, Matthews N, Stewart A, Tarpey P, et al. Intratumor heterogeneity and branched evolution revealed by multiregion sequencing. N. Engl. J. Med. 2012;366:883–892. doi: 10.1056/NEJMoa1113205. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Greaves M. Cancer: The Evolutionary Legacy. Oxford: Oxford University Press; 2000. [Google Scholar]
  • 18.Yu K-D, Shao Z-M. Initiation, evolution, phenotype and outcome of BRCA1 and BRCA2 mutation-associated breast cancer. Nat. Rev. Cancer. 2012;12:372–373. doi: 10.1038/nrc3181-c1. [DOI] [PubMed] [Google Scholar]
  • 19.Muto T, Bussey HJ, Morson BC. The evolution of cancer of the colon and rectum. Cancer. 1975;36:2251–2270. doi: 10.1002/cncr.2820360944. [DOI] [PubMed] [Google Scholar]
  • 20.Merlo LMF, Pepper JW, Reid BJ, Maley CC. Cancer as an evolutionary and ecological process. Nat. Rev. Cancer. 2006;6:924–935. doi: 10.1038/nrc2013. [DOI] [PubMed] [Google Scholar]
  • 21.Otsuka J. The large-scale evolution by generating new genes from gene duplication; similarity and difference between monoploid and diploid organisms. J. Theor. Biol. 2011;278:120–126. doi: 10.1016/j.jtbi.2011.03.006. [DOI] [PubMed] [Google Scholar]
  • 22.Podlaha O, Riester M, De S, Michor F. Evolution of the cancer genome. Trends Genet. 2012;28:155–163. doi: 10.1016/j.tig.2012.01.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Nowell PC. The clonal evolution of tumor cell populations. Science. 1976;194:23–28. doi: 10.1126/science.959840. [DOI] [PubMed] [Google Scholar]
  • 24.Gillies RJ, Verduzco D, Gatenby RA. Evolutionary dynamics of carcinogenesis and why targeted therapy does not work. Nat. Rev. Cancer. 2012;12:487–493. doi: 10.1038/nrc3298. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Wu X, Northcott PA, Dubuc A, Dupuy AJ, Shih DJH, Witt H, Croul S, Bouffet E, Fults DW, Eberhart CG, et al. Clonal selection drives genetic divergence of metastatic medulloblastoma. Nature. 2012;482:529–533. doi: 10.1038/nature10825. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Clifford SC. Cancer genetics: evolution after tumour spread. Nature. 2012;482:481–482. doi: 10.1038/nature10949. [DOI] [PubMed] [Google Scholar]
  • 27.Hou Y, Song L, Zhu P, Zhang B, Tao Y, Xu X, Li F, Wu K, Liang J, Shao D, et al. Single-cell exome sequencing and monoclonal evolution of a JAK2-negative myeloproliferative neoplasm. Cell. 2012;148:873–885. doi: 10.1016/j.cell.2012.02.028. [DOI] [PubMed] [Google Scholar]
  • 28.Xu X, Hou Y, Yin X, Bao L, Tang A, Song L, Li F, Tsang S, Wu K, Wu H, et al. Single-cell exome sequencing reveals single-nucleotide mutation characteristics of a kidney tumor. Cell. 2012;148:886–895. doi: 10.1016/j.cell.2012.02.025. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Caldas C. Cancer sequencing unravels clonal evolution. Nat. Biotechnol. 2012;30:408–410. doi: 10.1038/nbt.2213. [DOI] [PubMed] [Google Scholar]
  • 30.Thomas RK, Nickerson E, Simons JF, Janne PA, Tengs T, Yuza Y, Garraway LA, LaFramboise T, Lee JC, Shah K, et al. Sensitive mutation detection in heterogeneous cancer specimens by massively parallel picoliter reactor sequencing. Nat. Med. 2006;12:852–855. doi: 10.1038/nm1437. [DOI] [PubMed] [Google Scholar]
  • 31.Ciullo M, Debily MA, Rozier L, Autiero M, Billault A, Mayau V, El Marhomy S, Guardiola J, Bernheim A, Coullin P, et al. Initiation of the breakage-fusion-bridge mechanism through common fragile site activation in human breast cancer cells: the model of PIP gene duplication from a break at FRA7I. Hum. Mol. Genet. 2002;11:2887–2894. doi: 10.1093/hmg/11.23.2887. [DOI] [PubMed] [Google Scholar]
  • 32.Waris G, Ahsan H. Reactive oxygen species: role in the development of cancer and various chronic conditions. J. Carcinog. 2006;5:14. doi: 10.1186/1477-3163-5-14. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Moelans CB, de Weger RA, Monsuur HN, Vijzelaar R, van Diest PJ. Molecular profiling of invasive breast cancer by multiplex ligation-dependent probe amplification-based copy number analysis of tumor suppressor and oncogenes. Mod. Pathol. 2010;23:1029–1039. doi: 10.1038/modpathol.2010.84. [DOI] [PubMed] [Google Scholar]
  • 34.Cui J, Yin Y, Ma Q, Wang G, Olman V, Zhang Y, Chou W-C, Hong CS, Zhang C, Cao S, et al. Towards Understanding the Genomic Alterations in Human Gastric Cancer. PLoS Genet. 2013 doi: 10.1002/ijc.29352. under review. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Kim JY, Shin NR, Kim A, Lee HJ, Park WY, Lee CH, Huh GY, Park do Y. Microsatellite instability status in gastric cancer: a reappraisal of its clinical significance and relationship with mucin phenotypes. Korean J. Pathol. 2013;47:28–35. doi: 10.4132/KoreanJPathol.2013.47.1.28. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Ottini L, Falchetti M, Lupi R, Rizzolo P, Agnese V, Colucci G, Bazan V, Russo A. Patterns of genomic instability in gastric cancer: clinical implications and perspectives. Ann. Oncol. 2006;17(Suppl 7):vii97–vii102. doi: 10.1093/annonc/mdl960. [DOI] [PubMed] [Google Scholar]
  • 37.Pennisi E. Genomics. ENCODE project writes eulogy for junk DNA. Science. 2012;337:1159–1161. doi: 10.1126/science.337.6099.1159. [DOI] [PubMed] [Google Scholar]
  • 38.Hahn MW, De Bie T, Stajich JE, Nguyen C, Cristianini N. Estimating the tempo and mode of gene family evolution from comparative genomic data. Genome Res. 2005;15:1153–1160. doi: 10.1101/gr.3567505. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Liu L, Yu L, Kalavacharla V, Liu Z. A Bayesian model for gene family evolution. BMC Bioinformatics. 2011;12:426. doi: 10.1186/1471-2105-12-426. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 1981;17:368–376. doi: 10.1007/BF01734359. [DOI] [PubMed] [Google Scholar]
  • 41.Byrd RH, Lu PH, Nocedal J, Zhu CY. A limited memory algorithm for bound constrained optimization. Siam. J. Sci. Comput. 1995;16:1190–1208. [Google Scholar]
  • 42.Huang DW, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 2009;4:44–57. doi: 10.1038/nprot.2008.211. [DOI] [PubMed] [Google Scholar]
  • 43.McKee ML, Kerwin SM. Synthesis, metal ion binding, and biological evaluation of new anticancer 2-(2′-hydroxyphenyl)benzoxazole analogs of UK-1. Bioorganic Med. Chem. 2008;16:1775–1783. doi: 10.1016/j.bmc.2007.11.019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Kampen KR. Membrane proteins: the key players of a cancer cell. J. Membrane Biol. 2011;242:69–74. doi: 10.1007/s00232-011-9381-7. [DOI] [PubMed] [Google Scholar]
  • 45.Zaman GJ, Versantvoort CH, Smit JJ, Eijdems EW, de Haas M, Smith AJ, Broxterman HJ, Mulder NH, de Vries EG, Baas F, et al. Analysis of the expression of MRP, the gene for a new putative transmembrane drug transporter, in human multidrug resistant lung cancer cell lines. Cancer Res. 1993;53:1747–1750. [PubMed] [Google Scholar]
  • 46.Klein RJ, Zeiss C, Chew EY, Tsai JY, Sackler RS, Haynes C, Henning AK, SanGiovanni JP, Mane SM, Mayne ST, et al. Complement factor H polymorphism in age-related macular degeneration. Science. 2005;308:385–389. doi: 10.1126/science.1109557. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Lawrence MS, Stojanov P, Polak P, Kryukov GV, Cibulskis K, Sivachenko A, Carter SL, Stewart C, Mermel CH, Roberts SA, et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature. 2013;499:214–218. doi: 10.1038/nature12213. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Johnson AD, O'Donnell CJ. An open access database of genome-wide association results. BMC Med. Genet. 2009;10:6. doi: 10.1186/1471-2350-10-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Desmedt C, Haibe-Kains B, Wirapati P, Buyse M, Larsimont D, Bontempi G, Delorenzi M, Piccart M, Sotiriou C. Biological processes associated with breast cancer clinical outcome depend on the molecular subtypes. Clin. Cancer Res. 2008;14:5158–5165. doi: 10.1158/1078-0432.CCR-07-4756. [DOI] [PubMed] [Google Scholar]
  • 50.Konrad A, Teufel AI, Grahnen JA, Liberles DA. Toward a general model for the evolutionary dynamics of gene duplicates. Genome Biol. Evol. 2011;3:1197–1209. doi: 10.1093/gbe/evr093. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Slatkin M, Pollack JL. The concordance of gene trees and species trees at two linked loci. Genetics. 2006;172:1979–1984. doi: 10.1534/genetics.105.049593. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Data

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press

RESOURCES