Proceedings of the National Academy of Sciences of the United States of America. 1998 Dec 22;95(26):15452–15457. doi: 10.1073/pnas.95.26.15452

Using rare mutations to estimate population divergence times: A maximum likelihood approach

Giorgio Bertorelle, Bruce Rannala
PMCID: PMC28063  PMID: 9860989

Abstract

In this paper we propose a method to estimate by maximum likelihood the divergence time between two populations, specifically designed for the analysis of nonrecurrent rare mutations. Given the rapidly growing amount of data, rare disease mutations affecting humans seem the most suitable candidates for this method. The estimator RD, and its conditional version RDc, were derived, assuming that the population dynamics of rare alleles can be described by using a birth–death process approximation and that each mutation arose before the split of a common ancestral population into the two diverging populations. The RD estimator seems more suitable for large sample sizes and few alleles, whose age can be approximated, whereas the RDc estimator appears preferable when this is not the case. When applied to three cystic fibrosis mutations, the estimator RD could not exclude a very recent time of divergence among three Mediterranean populations. On the other hand, the divergence time between these populations and the Danish population was estimated to be, on the average, 4,500 or 15,000 years, assuming or not a selective advantage for cystic fibrosis carriers, respectively. Confidence intervals are large, however, and can probably be reduced only by analyzing more alleles or loci.


The amount of genetic divergence between two isolated populations tends to accumulate with time, following their subdivision from a common ancestral population. New alleles are independently generated in each descendent population by mutation, and the frequencies of the pre-subdivision alleles tend to diverge due to the random sampling of genes in each generation (genetic drift). These two processes therefore leave a signature, increasingly evident with time, on the genetic composition of the subdivided populations.

Several methods have been proposed to estimate the time in the past when two or more populations arose from a single ancestral population (i.e., the divergence time). Takahata and Nei (1) suggested that the net number of nucleotide substitutions d accumulated between two populations (also called dA in ref. 2) be used to estimate the time of their divergence. If the two populations have the same constant size, d is expected to increase linearly with the product μT (where μ is the mutation rate and T is the divergence time). The same linear increase with μT is predicted for the genetic distance (δμ)² (3), computed as the square of the difference between the average allele sizes observed at microsatellite markers in the diverging populations. In both cases, therefore, an estimate of T can be obtained simply if the mutation rate is known.

Another commonly used approach (see, for example, ref. 4) is to first estimate Wright’s Fst (5) from allele frequencies and then use this estimate to predict the divergence time. The relationship Fst = 1 − e^(−T/2N) (5, 6) can be used to estimate the divergence time T, given that the populations have the same constant and known size N. This estimator is based on a model of genetic drift without mutation, and it is therefore most suitable for populations that have recently separated. However, when the differences between alleles are taken into account by using equivalents of Fst that allow for mutation [such as φst (7) or Rst (8)], equivalent estimators become feasible for older population subdivision events (8).
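As a small illustration of the drift-only relationship just described, taken in its standard form Fst = 1 − exp(−T/2N), the sketch below inverts it to obtain T from an Fst estimate. The function names are illustrative and not from the paper, and the calculation assumes two populations of equal, constant, and known size N with no mutation.

```python
import math

def fst_from_divergence_time(t, n):
    """Expected Fst after t generations of drift in populations of size n."""
    return 1.0 - math.exp(-t / (2.0 * n))

def divergence_time_from_fst(fst, n):
    """Invert Fst = 1 - exp(-T / (2N)) to get T in generations."""
    if not 0.0 <= fst < 1.0:
        raise ValueError("fst must lie in [0, 1)")
    return -2.0 * n * math.log(1.0 - fst)

# Round trip: an assumed T of 200 generations with N = 5000 is recovered.
fst = fst_from_divergence_time(200, 5000)
t_hat = divergence_time_from_fst(fst, 5000)
```

Because the inversion depends on N, any error in the assumed population size carries over proportionally to the estimated divergence time.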

More recently, Nielsen et al. (9) have proposed a maximum likelihood estimator based on the coalescent process and assuming no mutation. When applied to simulated data from stable populations, this estimator appears to have less bias and lower variance than an Fst-based estimator (9).

In this paper we present another likelihood estimator of the time of divergence of two populations, specifically designed for the analysis of rare alleles that have arisen by nonrecurrent mutation. Rare alleles are becoming an important source of information on human populations as more disease mutations are mapped, more effort is focused on the study of population frequencies of disease mutations, and large-scale programs of genetic screening are becoming a realistic possibility. The method we present here is best suited for analyzing data on rare disease mutations, since it assumes that the number of copies of each mutant (at the same or different loci) can be modeled by using a stochastic birth–death process with sampling (10, 11). This assumption, which is satisfied if mutants are rare (12), allows many demographic factors, including selection and population growth, to be introduced into the model in a relatively simple way. This approach also greatly simplifies the analysis of multiallelic and multilocus data.

The Model

We consider a simple model of an ancestral population (labeled population 0) that separates into two descendent populations (labeled populations 1 and 2) at a time T generations in the past. The descendent populations experience no immigration. A rare allele is assumed to have arisen by nonrecurrent mutation at time T + t in the past. When the population split occurs, any copy of the allele in the ancestral population has a probability s and (1 − s) of joining one, or the other, descendent population.

Within each population, we assume that the demography of the mutant lineages can be described by using a birth–death process approximation to the coalescent model, valid when the allele has remained rare (12). Population sizes are allowed to change over time according to an exponential process of growth (or decline) with rates ξ0 (ancestral population), ξ1, and ξ2 (descendent populations).

The Likelihood

A mutant allele arises at time T + t in the past. The probability distribution of the total number of alleles descended from the mutant, k, that existed immediately prior to the population subdivision event T generations ago is

[Eq. 1: equation image not recovered]

where (see ref. 12) νt is defined as

[Eq. 2: equation image not recovered]

Let j1 be the number of copies of the mutant that enter descendent population 1 at the population subdivision event and let j2 be the number that enter descendent population 2. The joint probability distribution function (pdf) of j1 and j2, conditional on k = j1 + j2, is

$$\Pr(j_1, j_2 \mid k) = \binom{k}{j_1} s^{j_1} (1-s)^{j_2}, \qquad k = j_1 + j_2 \tag{3}$$
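The split of mutant copies at the subdivision event can be sketched numerically as follows. This assumes, as the model description suggests, that each of the k copies independently joins population 1 with probability s and population 2 with probability 1 − s; the function names are illustrative and not from the paper.

```python
from math import comb

def split_probability(j1, j2, s):
    """Probability that j1 copies enter population 1 and j2 enter population 2,
    given k = j1 + j2 copies and per-copy probability s of joining population 1."""
    k = j1 + j2
    return comb(k, j1) * s**j1 * (1.0 - s)**j2

def both_populations_receive_copy(k, s):
    """Probability that neither descendent population is left without a copy,
    given k copies at the split (an assumed helper for the j1 > 0, j2 > 0 condition)."""
    return 1.0 - s**k - (1.0 - s)**k

# The probabilities over all splits of k = 5 copies sum to one.
total = sum(split_probability(j1, 5 - j1, 0.3) for j1 in range(6))
```

The second helper only covers the conditioning for a fixed k; the paper's Eq. 8 additionally averages over the distribution of k.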

The joint probability generating function (pgf) of j1 and j2, conditional on k, is then

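$$E\left[z_1^{j_1} z_2^{j_2} \mid k\right] = \sum_{j_1=0}^{k} \binom{k}{j_1} (s z_1)^{j_1} \left[(1-s) z_2\right]^{k-j_1} = \left[s z_1 + (1-s) z_2\right]^{k} \tag{4}$$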

The unconditional joint pgf of j1 and j2 is

[Eq. 5: equation images not recovered]

Using standard techniques (see ref. 13), we obtain the probability of j1 and j2 by evaluating the j1th and j2th-order partial derivatives with respect to z1 and z2 as follows

[Eq. 6: equation image not recovered]

By examining successive terms of the pdf obtained from the pgf in this manner, a general formula for the unconditional pdf of j1 and j2 may be obtained

[Eq. 7: equation images not recovered]

We will assume that no additional mutations occur at the site of interest (this condition is satisfied when the product of divergence time T and the mutation rate is small), which implies that each descendent population must have contained at least one copy of the mutant allele after the divergence event so that we must condition on j1 > 0 and j2 > 0. It is easy to show that

[Eq. 8: equation images not recovered]

and the conditional pdf is then

[Eq. 9: equation images not recovered]

Let li be the number of copies of the mutant found in population i immediately after the divergence event at time T that leave one or more descendents in a present-day sample from population i, where i is either 1 or 2. The pdf of li is a binomial of the form

$$\Pr(l_i \mid j_i) = \binom{j_i}{l_i} Q_{Ti}^{\,l_i} \left(1 - Q_{Ti}\right)^{j_i - l_i} \tag{10}$$

where QTi is the probability that an allele in population i leaves one or more descendents at present and is given by (see ref. 12)

[Eq. 11: equation image not recovered]

where the sampling fraction fi is the probability that a chromosome in the present-day descendents of population i is sampled.

The pgf of li, conditional on ji, is

[Eq. 12: equation images not recovered]

Because the drift processes in descendent populations 1 and 2 after the subdivision event are independent, the joint pgf of l1 and l2, conditional on j1 and j2, is a product of the pgfs of l1 and l2,

[Eq. 13: equation image not recovered]

The unconditional joint pgf of l1 and l2 is then

[Eq. 14: equation image not recovered]

Using standard methods (see above), the pdf can be obtained from Eq. 14 and is

[Eq. 15: equation image not recovered]

We must once more condition on the fact that at least one copy of the mutant allele leaves descendents in each population sample. This conditional probability is

[Eq. 16: equation image not recovered]

The pdf of the number of copies of the mutant allele in a sample from population i, denoted as ni, conditional on li is (see, e.g., ref. 10)

[Eq. 17: equation image not recovered]

where we define (see ref. 12)

[Eq. 18: equation image not recovered]

Since the drift processes in the two populations are independent, the joint distribution of n1 and n2, given l1 and l2, is

[Eq. 19: equation images not recovered]

The unconditional pdf of n1 and n2 is then calculated, using Eq. 16 and Eq. 19, as

[Eq. 20: equation image not recovered]

To simplify the evaluation of the sum Eq. 20 we used the iterative relationship

[Eq. 21: equation images not recovered]

Eq. 20 is the basis for the likelihood function of T used in our analysis:

$$L(T, \psi) = \Pr(n_1, n_2 \mid T, \psi) \tag{22}$$

where ψ = {t, s, ξ0, ξ1, ξ2, f1, f2} is a vector of the additional unknown (nuisance) parameters.

The likelihood function for multiallelic and/or multilocus data can be obtained by multiplying the probability in Eq. 20 for each different rare allele. This is because the birth–death process determining the frequency of each allele is independent. In other words, if r is the number of rare alleles (from the same or from different loci), and ni,z is the number of copies of the zth allele in population i, the probability of observing a configuration n = {{n1,1, n2,1}, {n1,2, n2,2}, … {n1,r, n2,r}} is given by

$$\Pr(\mathbf{n}) = \prod_{z=1}^{r} \Pr(n_{1,z}, n_{2,z}) \tag{23}$$
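Because the alleles are treated as independent, the multiallelic log-likelihood is a sum of per-allele log-probabilities, and a simple grid search over T gives a maximum likelihood estimate. The sketch below illustrates this structure only. The argument `pair_probability` is a hypothetical user-supplied function standing in for the per-allele probability of Eq. 20, which is not implemented here, and the grid search is an assumed optimization strategy rather than the authors' procedure.

```python
import math

def log_likelihood(config, t, pair_probability):
    """Sum of per-allele log-probabilities for a configuration of (n1, n2) pairs."""
    return sum(math.log(pair_probability(n1, n2, t)) for n1, n2 in config)

def grid_mle(config, times, pair_probability):
    """Return the candidate divergence time with the highest log-likelihood."""
    return max(times, key=lambda t: log_likelihood(config, t, pair_probability))
```

In practice each allele may need its own nuisance parameters (for example its age), which could be handled by closing over them in a separate `pair_probability` per allele.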

In the sections that follow, we will briefly analyze the behavior of a maximum likelihood estimator of the divergence time T based on Eq. 23 when applied to hypothetical and to real data. Hereafter we will use the abbreviation RD to refer to the rare alleles based estimator of divergence time.

Properties of the Estimator

In this section we will examine several hypothetical data sets to illustrate qualitatively the effects of the parameters of the model on the shape of the likelihood function.

Effect of Population Growth.

Suppose first that the total number of copies n1 + n2 of a rare allele sampled in two divergent populations is 200, and that both populations have grown exponentially. As one might expect, recent and ancient divergence times will often result in similar and different numbers of copies of a rare allele in the descendent populations, respectively. In fact, the estimated divergence time obtained by maximizing Eq. 23 increases when the difference between n1 and n2 is increased (see Fig. 1a). This is, of course, the most important property of the estimator, and suggests that the number of copies of a rare allele contains some information about the time at which the population was subdivided. Fig. 1a also suggests that for very ancient divergence events this information about the divergence time is ultimately lost, since the different possible data configurations become equally probable.

Figure 1. (a) The log-likelihood of four different data sets as a function of the divergence time. In all cases (n1 + n2) = 200, f1 = f2 = 0.01, (T + t) = 1000, ξ0 = ξ1 = ξ2 = 0.02. The plotted numbers correspond to (n1 − n2). (b) The log-likelihood as a function of the divergence time when n1 = 40, (n1 + n2) = 200, f1 = f2 = 0.01, (T + t) = 1000. The plotted numbers correspond to the growth rate, assuming ξ0 = ξ1 = ξ2.

If the growth rate is decreased, the estimated divergence time tends to increase (Fig. 1b). This effect arises because the variance of the number of lineages in a birth–death process increases with the growth rate (14). After population subdivision, the same difference n1 − n2 is reached more rapidly in fast growing populations.

The influence of the growth rate on the estimated divergence time appears enormous when considering Fig. 1b. However, a variation of the exponential growth rate by a factor of 5 has to be regarded as enormous as well. For example, the final size of a population which started growing 500 generations ago from an initial size of 1,000 is expected to be either less than 15,000 or more than 250 million, depending on whether the growth rate is 0.005 or 0.025. Assuming that the growth rates of the populations can be at least roughly estimated, the variation of RD due to such uncertainties may be much smaller. An increase of the population growth rate from 0.015 to 0.025 in Fig. 1b (which is still a substantial increase), for example, results only in a decrease of the estimated divergence time from about 300 to 200 generations.
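The population sizes quoted in this example follow from continuous exponential growth, N(t) = N(0)·exp(ξt). The short check below reproduces the two bounds; the function name is illustrative.

```python
import math

def final_size(initial_size, rate, generations):
    """Population size after exponential growth at a constant per-generation rate."""
    return initial_size * math.exp(rate * generations)

slow = final_size(1_000, 0.005, 500)  # about 1.2e4, below 15,000
fast = final_size(1_000, 0.025, 500)  # about 2.7e8, above 250 million
```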

Effect of Mutation Age.

Another potential source of uncertainty for the estimator we propose arises from assumptions about the ages of mutations. Allele ages are, in general, not known and must be estimated from the data (for some human diseases it is conceivable that historical information can be used). Different methods exist for estimating allele age (12, 15, 16), but the confidence intervals of these estimates are typically large.

The likelihood curves for several extreme situations in terms of allele age serve to illustrate this point. In Fig. 2a, neither the maximum likelihood estimator RD nor its support interval (i.e., the radius of curvature of the likelihood) seems to be strongly affected by the mutation age. In other words, even if the assumed mutation age varies from 500 to 1500 generations, the estimated divergence time lies within the same region. It can also happen, however, that very different assumptions about the age of the allele result in very different estimates of the divergence time (Fig. 2b). In addition, as expected (and also shown by Fig. 2a), assuming different allele ages affects the absolute value of the likelihood. This behavior implies that, when several alleles are used simultaneously to compute RD, a single allele whose age is very poorly estimated could have a disproportionately strong effect on the likelihood estimate of T.

Figure 2. The log-likelihood as a function of the divergence time for different ages (T + t) of the mutant. In all cases, (n1 + n2) = 200, f1 = f2 = 0.01. (a) ξ0 = ξ1 = ξ2 = 0.02, n1 = 40. The dashed lines indicate the support intervals of 2 log-likelihood units. (b) ξ0 = ξ1 = ξ2 = 0.001, n1 = 85. The model we used assumes that the mutation arose before the split of the ancestral population into two descendent populations. For this reason, some curves terminate earlier than others.

One possible solution to this problem is to simultaneously estimate the divergence time and the allele ages. In this approach, however, each additional allele would introduce an additional parameter to estimate, thus reducing the power of the method. We therefore decided instead to analyze the behavior of a modified version of the estimator RD, hereafter called RDc. The estimator RDc is based on the probability of an observed configuration (n1, n2) conditioned on the sum (n1 + n2). This probability can be derived directly from the rule of conditional probability as:

$$\Pr(n_1, n_2 \mid n_1 + n_2) = \frac{\Pr(n_1, n_2)}{\sum_{i=1}^{n_1+n_2-1} \Pr(i,\, n_1 + n_2 - i)} \tag{24}$$

In principle, since the expected sum (n1 + n2) depends mainly on the allele age, conditioning the probability of a configuration on this sum should reduce the influence of the allele age on the likelihood of T.
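The conditioning step behind RDc can be sketched as a normalization over all configurations that share the same total. Here `joint_probability` is a hypothetical stand-in for the unconditional probability of a configuration (Eq. 20), and the sum is assumed to run only over configurations with at least one copy in each population, in line with the conditioning used in the derivation.

```python
def conditional_probability(n1, n2, joint_probability):
    """Probability of (n1, n2) given the total n1 + n2, by normalizing the joint
    probability over every split of the total with both counts positive."""
    total = n1 + n2
    norm = sum(joint_probability(i, total - i) for i in range(1, total))
    return joint_probability(n1, n2) / norm
```

Each evaluation needs the joint probability for every split of the total, which is consistent with the observation that RDc is more expensive to compute than RD.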

The likelihood function based on Eq. 24 for the data sets in Fig. 2 has less curvature because most of the information in the total number of copies is disregarded. The estimator RDc obtained by conditioning proved to be largely insensitive to the age of the mutation. For example, if RDc is used instead of RD, the three data sets used in Fig. 2a have only a single likelihood function, and the two data sets in Fig. 2b have very similar likelihood functions with the maximum value separated by only about 100 generations (results not shown).

Finally, we used Monte Carlo simulation to examine how uncertainty about the allele age might affect the estimators RD and RDc. Five hundred samples of either 10 or 50 alleles from two divergent populations were simulated, assuming moderately small growth rates (ξ0 = ξ1 = ξ2 = 0.005) and sampling fractions (f1 = f2 = 0.005). In the first set of simulations (upper part of Table 1), the age of each allele was fixed at 600 generations, whereas in the second set (lower part of Table 1) the age of each allele was randomly assigned from a uniform distribution with lower and upper limits equal to 400 and 800 generations, respectively. The simulated populations were assumed to have diverged T = 200 generations ago, and the number of copies of each allele was simulated by using the same birth–death model we used to derive the likelihood function. With these parameters, the average number of copies of each allele in each population was 6.8 and 8.6 for the first and second sets of simulations, respectively, and in both cases more than 10% of the alleles were present in a single copy in a population. These are, of course, unrealistically small numbers, but they allowed us to analyze a large number of samples. The calculation of the likelihood function for a single data set is quite computationally intensive, especially for the RDc estimator.

Table 1. Results of the simulations

Simulation set   r    Estimator   Mean    SD      Range
First            10   RD          188.3   121.6   0.0–600.0
First            10   RDc         196.1   135.5   0.0–600.0
First            50   RD          194.8   49.6    56.0–314.2
First            50   RDc         197.0   55.4    32.0–334.2
Second           10   RD          123.1   108.2   0.0–355.2
Second           10   RDc         212.7   154.0   0.0–800.0
Second           50   RD          130.7   82.9    0.0–246.2
Second           50   RDc         212.7   57.8    72.1–331.2

In the first set, all alleles have the same age of 600 generations, and the divergence time is estimated assuming that the allele age is known. In the second set, alleles can have any age between 400 and 800 generations with the same probability, and the divergence time is estimated assuming that every allele has the same age of 800 generations. ξ0 = ξ1 = ξ2 = 0.005; f1 = f2 = 0.005; r = number of alleles; actual divergence time T = 200.

The maximum likelihood estimate of the divergence time was obtained for each sample by using RD and RDc and assuming that the growth rates and the sampling fractions were known, as well as the age of each allele (600 generations) for the analysis of the first set of simulations. In the analysis of the second set of simulations, where the real age of the alleles varied between 400 and 800 generations, the RD and RDc were computed assuming the same fixed age (800 generations) for each allele.

The results (Table 1) show that if the age of the alleles is known, both RD and RDc are almost unbiased, with RD having a slightly lower standard deviation (SD) than does RDc. On the other hand, when the age of the alleles is not known, and estimates are calculated by assuming that all alleles have an (incorrect) age equal to that of the maximum possible age, RDc has consistently less bias than RD, and also lower SD when more rare alleles are considered. The higher SD of RDc when data sets of 10 alleles were considered is mainly due to the fact that for a small fraction of the samples the estimated divergence time was equal to the assumed mutation age. This kind of edge effect disappeared when more alleles were sampled.

Application to Cystic Fibrosis

As an example application of the method developed in this paper, we considered three cystic fibrosis (CF) mutations in four human populations from the paper by Estivill et al. (17). The three most frequent CF mutations—namely ΔF508, G542X, and N1303K—will be used to estimate the pairwise divergence times between Italy, Sardinia, Denmark, and Turkey. Since the model used to derive the estimators assumes isolated populations, we expect that gene flow processes would result in an underestimation of the divergence time. This effect, which is probably minor for European populations that experienced reasonably low migration rates (18), should of course be kept in mind.

The population data are shown in Table 2, and the results provided by RD when the ages of ΔF508, G542X, and N1303K were set to 50,000, 35,000 and 35,000 years, respectively (16, 17), and the generation time is 20 years are shown in Table 3. Due to the large number of ΔF508 copies, the computation of RDc using the complete data sets would be unreasonably slow. Therefore, we computed RDc assuming smaller sizes for ΔF508 samples in Italy and Denmark. RD and RDc provided similar estimates (RDc having larger confidence intervals), and we therefore report only the results for RD.
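Since the model measures time in generations, the assumed allele ages in years have to be converted using the stated generation time of 20 years. The conversion below is a minimal sketch; the dictionary and function names are illustrative.

```python
GENERATION_TIME_YEARS = 20

# Assumed allele ages in years, as stated in the text.
allele_age_years = {"ΔF508": 50_000, "G542X": 35_000, "N1303K": 35_000}

def years_to_generations(years, generation_time=GENERATION_TIME_YEARS):
    """Convert a time in years to generations."""
    return years / generation_time

allele_age_generations = {
    name: years_to_generations(age) for name, age in allele_age_years.items()
}
```

The same factor applies in reverse when estimates of T in generations are reported in thousands of years, as in Table 3.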

Table 2. CF data from four European populations

                          No. of CF chromosomes
Population   2N (×10^6)   Total   ΔF508   N1303K   G542X
Sardinia     3.3          141     82      4        8
Italy        114.8        3,524   1,795   156      156
Denmark      10.6         678     591     7        4
Turkey       112.9        141     49      9        4

N is the present-day population size. Total refers to the number of CF chromosomes in the sample. The numbers of CF chromosomes with specific mutations are given in the last three columns. 

Table 3. Application of RD to data from the four European populations

Divergence time estimates, thousands of years

Population   Statistic   Sardinia    Italy       Denmark
Italy        MLE         0.0
Italy        2SUI        0.0–11.4
Italy        3SUI        0.0–13.8
Denmark      MLE         10.8        16.4
Denmark      2SUI        5.6–17.7    12.5–22.1
Denmark      3SUI        4.5–19.7    11.6–23.8
Turkey       MLE         10.0        18.4        18.6
Turkey       2SUI        0.4–17.2    0.0–28.4    14.2–24.7
Turkey       3SUI        0.0–18.9    0.0–31.1    13.2–26.8

MLE is the maximum likelihood estimate of the divergence time. 2SUI is the interval of support computed as the divergence times at two units of support from the best supported value; this interval corresponds roughly to the 95% confidence interval. 3SUI is the interval of support computed as the divergence times at three units of support from the best supported value. 
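A support interval of this kind can be read off a log-likelihood curve evaluated on a grid of candidate divergence times. The sketch below assumes a single-peaked curve and a grid fine enough to resolve the interval ends; the function name and arguments are illustrative and do not describe the authors' program.

```python
def support_interval(times, log_liks, units=2.0):
    """Range of grid times whose log-likelihood lies within `units`
    of the maximum (2 for 2SUI, 3 for 3SUI)."""
    best = max(log_liks)
    inside = [t for t, ll in zip(times, log_liks) if ll >= best - units]
    return min(inside), max(inside)

# Toy curve with its maximum at T = 2.
times = [0, 1, 2, 3, 4]
log_liks = [-2.5, -1.0, 0.0, -1.5, -4.0]
two_unit = support_interval(times, log_liks, units=2.0)
three_unit = support_interval(times, log_liks, units=3.0)
```

When the curve is still within the threshold at T = 0, the lower limit is reported as 0, which matches the intervals in Table 3 that start at 0.0.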

Populations are assumed to grow exponentially with a rate of 0.005 per generation, which is compatible with an Upper Paleolithic demographic expansion often suggested for European populations (19–21). We also analyzed these data sets assuming a growth rate of 0.025, which is equivalent to assigning a selective advantage of 0.02 to CF carriers (lethal CF homozygotes are ignored because of their low frequency). Heterozygote advantage has long been proposed as an explanation for high prevalence of CF among Caucasoids (22–24). Recent analyses, however, were either unable to find evidence for such an effect (25, 26), or showed analytically that the high prevalence of some deleterious mutations is not unexpected in expanding populations (27).

The sample size, which is needed to compute the sampling fractions fi and fj for each pair of populations i and j, was estimated from the total number of CF mutants in the samples, assuming an incidence of the disease of 1 in 2,500 newborns (28) in each population. Finally, as we do not have any information about the ratio s of the divergent populations at the split, we assumed a ratio equal to the ratio between the present-day population sizes.
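One way to read this procedure is sketched below. The paper does not spell out the calculation, so the steps are assumptions: a disease incidence of 1 in 2,500 is taken to imply a CF allele frequency of about 0.02 under Hardy–Weinberg proportions, the number of chromosomes effectively screened is the number of CF chromosomes divided by that frequency, and s is taken as population 1's share of the combined present-day size. All function names are illustrative.

```python
import math

def sampling_fraction(cf_chromosomes, two_n, incidence=1 / 2500):
    """Assumed reading: chromosomes screened = CF chromosomes / allele frequency,
    and the sampling fraction is that number divided by 2N."""
    allele_frequency = math.sqrt(incidence)  # about 0.02 under Hardy-Weinberg
    chromosomes_screened = cf_chromosomes / allele_frequency
    return chromosomes_screened / two_n

def split_fraction(two_n_1, two_n_2):
    """Assumed reading of s: share of population 1 in the combined present-day size."""
    return two_n_1 / (two_n_1 + two_n_2)

# Sardinia (2N = 3.3 million, 141 CF chromosomes) against Italy (2N = 114.8 million).
f_sardinia = sampling_fraction(141, 3.3e6)
s_sardinia_italy = split_fraction(3.3e6, 114.8e6)
```

Under these assumptions the Sardinian sample corresponds to roughly 7,000 chromosomes, a sampling fraction of about 0.002.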

The results obtained when the population growth rate is fixed to ξ = 0.005 for all populations (thus excluding a selective advantage for CF carriers) suggest an Upper Paleolithic divergence (about 15,000 years ago, on the average) between Denmark and the Mediterranean populations (see Table 3). This result is compatible with several previous genetical analyses of European populations (29), suggesting a relatively recent divergence time even between very distant populations, and also suggesting no major impact in Northern Europe of Neolithic dispersion (30) from the Middle East.

The comparisons among the Mediterranean populations provide estimates of the divergence time ranging from 0 (Italy vs. Sardinia) to 18,000 years (Italy vs. Turkey). In contrast to the comparison with the Danish population, however, very short divergence times cannot be excluded for any pairs of Mediterranean populations due to the large confidence intervals.

Finally, we note that, among all the comparisons, Sardinia and Italy show the closest genetic relationship. Classical markers have often identified Sardinians as a genetical outlier in Europe (31), but the same pattern is not observed when the sequence of the mitochondrial control region is considered (32). In a previous analysis of the relative frequencies of CF alleles in Italy, Rendine et al. (33) found a certain level of divergence between Sardinians and other Italian regions, which was, however, mainly due to the private mutation T338I. All in all, it seems, therefore, that strong drift effects and the appearance of some new mutations after the relatively recent colonization of this island (around 9,000 years ago) might explain the peculiarity of the Sardinians. Their relationship with other Europeans, and especially with other Italians, is, however, still evident.

The results we obtained by setting the growth rate to 0.025 (thus assuming a positive selective effect of CF mutations in heterozygotes) gave a maximum likelihood estimate of the divergence time between 1/4 and 1/3 of the previous estimates (of course with the exception of the Italy–Sardinia comparison, which resulted again in a divergence time of 0). In other words, if the CF carriers had experienced a selective advantage of about 2% at any time in the past, our results would be consistent with a more recent (Neolithic) divergence also between populations as distant as Denmark and Turkey. Interestingly, these estimates would support the analysis of 4 microsatellite markers by Chikhi et al. (18), who found that the divergence time between pairs of European populations (estimated from the variance in the number of repeats) never exceeded 6,000 years. Only the analysis of other alleles or loci and more accurate estimates of the relevant demographic and selection parameters will clarify this point.

We also analyzed the CF data sets assuming younger ages for the three mutations. Using linkage disequilibrium patterns, Serre et al. (15) estimated the age of ΔF508 to be between 3,000 and 6,000 years, and Kaplan et al. (34) suggested an age of 17,000 years for the same mutation. We assigned the intermediate age of 10,000 years to each allele in our data set. All population comparisons provided a maximum likelihood estimate of the divergence time equal to the age of the mutation. In principle, these estimates cannot be excluded, and due to the large confidence intervals, they are not incompatible with the results we obtained assuming older allele ages. We note, however, that CF allele ages have been estimated assuming a single panmictic European population. Since the age of a nonrecurrent allele shared by two disjunct populations must necessarily be older than the age of the population subdivision event, it is possible that the estimated age for some CF mutations is too recent.

Discussion

In this paper, we have derived some theory for a birth–death process used to describe the population dynamics of rare alleles in a simple model of population divergence. This theory was used to derive two maximum likelihood estimators, RD and its conditional version RDc, of the time at which two recently isolated subpopulations diverged from a common ancestral population.§

The appropriate data for applying these estimators are the number of copies of one or more rare alleles at the same or different loci sampled from the descendent populations. It is assumed that each allele has arisen by nonrecurrent mutation in the ancestral population. Data on rare disease mutations in humans that are widespread in several populations therefore seem to be the most suitable for applying the method, since the amount of information on such mutations that is available for population studies is rapidly growing. For example, Estivill et al. (17) recently reported the geographic distribution of almost 30,000 CF chromosomes, and equivalent databases for other disease alleles are currently being developed.

In principle, the proposed method could also be applied to alleles that are not rare if the population sizes are not regulated by density dependence (e.g., in rapidly fluctuating island populations or in populations that are far from carrying capacity) (14). In this case, the main assumption of the birth–death process approximation (that individuals reproduce independently of one another) is still valid, since individual birth and death rates do not depend on population density (14).

Maximum likelihood estimators always require some knowledge about the parameters used in the model; the divergence time estimators proposed here are no exception. In particular, information on the demography of the populations and on the age of the alleles considered is needed to compute RD or RDc. One might therefore suspect that computationally easier methods, such as those based on Wright’s Fst (5) or on the net number of substitutions (1), should be preferred. These methods, however, rely on strong assumptions about the demography of the populations, and if these assumptions are violated (which is often the case), the results might be difficult to interpret. If reasonable approximations of the parameters of the model are available, it should often be the case that the estimators RD and RDc we propose here will provide more accurate estimates of the divergence time. Large-scale simulation studies are needed to evaluate the overall performance of the different estimators of population divergence times for a range of biologically reasonable demographic conditions.

Our qualitative analysis of the estimators suggests that the influence of the demographic parameters on RD or RDc is pronounced only if radically different demographic scenarios are considered. In humans at least, historical or archaeological data may often provide independent information about the demography of a population (35). Alternatively, genetical data can potentially be used to distinguish between stable and growing populations, and, if necessary, to estimate the rate of exponential growth of a population (36–40).

The ages of alleles can also affect estimates of the divergence time, even though our results indicate that this influence is probably not great. If the age of the analyzed alleles can be additionally estimated (e.g., see refs. 12, 15, and 16), the simpler and more powerful estimator RD is preferable. When, however, the age of an allele cannot be estimated, even approximately, and the data consist of several alleles, each with relatively small sample sizes, RDc should be preferred. RDc is computationally intensive, but this estimator proved to be less affected by incorrect estimates of allele age than was RD.

Finally, when RD was applied to three CF mutations in four European populations, we obtained results that appear consistent and compatible with previous studies. In particular, the divergence between Danish and Turkish populations was estimated to have occurred between 15,000 and 25,000 years ago. On the other hand, when three Mediterranean populations (Italy, Sardinia, and Turkey) were compared, the CF data set we analyzed was only able to exclude very old (>30,000 years) divergence times, with an average estimate of about 10,000 years. All these estimates, however, decrease substantially when a selective advantage for CF carriers (22–24) is introduced. This result, unfortunately, does not allow one to distinguish between several current hypotheses (see refs. 18, 41, and 42) on the relative contribution of Paleolithic and Neolithic genes to the present European gene pool. A better understanding of the selection process affecting the CF locus is therefore needed, as well as the analysis of additional rare mutations at additional loci. We expect the amount of this kind of data available for human populations to increase rapidly in the near future.

Acknowledgments

We thank Monty Slatkin for helpful discussions and for critical reading of this manuscript. G.B. was supported by National Institutes of Health Grant GM28428 to L. L. Cavalli-Sforza. B.R. was supported by National Institutes of Health Grants GM40282 to M. Slatkin and HG01988 to B.R.

ABBREVIATIONS

pdf, probability distribution function; pgf, probability generating function; CF, cystic fibrosis

Footnotes

§ A computer program to compute the divergence time estimators proposed in this paper can be downloaded from the WWW site at the URL http://mw511.biol.berkeley.edu.

References

  • 1. Takahata N, Nei M. Genetics. 1985;110:325–344. doi: 10.1093/genetics/110.2.325.
  • 2. Nei M. Molecular Evolutionary Genetics. New York: Columbia Univ. Press; 1987.
  • 3. Goldstein D B, Ruiz Linares A, Cavalli-Sforza L L, Feldman M W. Proc Natl Acad Sci USA. 1995;92:6723–6727. doi: 10.1073/pnas.92.15.6723.
  • 4. Poloni E S, Semino O, Passarino G, Santachiara-Benerecetti A S, Dupanloup I, Langaney A, Excoffier L. Am J Hum Genet. 1997;61:1015–1035. doi: 10.1086/301602.
  • 5. Wright S. Ann Eugen. 1951;15:323–354. doi: 10.1111/j.1469-1809.1949.tb02451.x.
  • 6. Cavalli-Sforza L L, Bodmer W F. The Genetics of Human Populations. San Francisco: Freeman; 1971.
  • 7. Excoffier L, Smouse P, Quattro J M. Genetics. 1992;131:479–491. doi: 10.1093/genetics/131.2.479.
  • 8. Slatkin M. Genetics. 1995;139:457–462. doi: 10.1093/genetics/139.1.457.
  • 9. Nielsen R, Mountain J L, Huelsenbeck J P, Slatkin M. Evolution. 1998;52:669–677. doi: 10.1111/j.1558-5646.1998.tb03692.x.
  • 10. Kendall D G. Ann Math Stat. 1948;19:1–15.
  • 11. Nee S, May R M, Harvey P H. Phil Trans R Soc Lond B. 1994;344:305–311. doi: 10.1098/rstb.1994.0068.
  • 12. Slatkin M, Rannala B. Am J Hum Genet. 1997;60:447–458.
  • 13. Medhi J. Stochastic Processes. 2nd Ed. New York: Wiley; 1994.
  • 14. Rannala B. Heredity. 1997;76:417–423. doi: 10.1038/hdy.1997.65.
  • 15. Serre J L, Simon-Bouy B, Mornet E, Jaume-Roig B, Balassopolous A, Schwartz M, Taillandier A, Boué J, Boué A. Hum Genet. 1990;84:449–454. doi: 10.1007/BF00195818.
  • 16. Morral N, Bertranpetit J, Estivill X, Nunes V, Casals T, Gimenez J, Reis A, Varon Mateeva R, Macek M, Jr, Kalaydjieva L, et al. Nat Genet. 1994;7:169–175. doi: 10.1038/ng0694-169.
  • 17. Estivill X, Bancells C, Ramos C, the Biomed CF Mutation Analysis Consortium. Hum Mutat. 1997;10:135–154. doi: 10.1002/(SICI)1098-1004(1997)10:2<135::AID-HUMU6>3.0.CO;2-J.
  • 18. Chikhi L, Destro-Bisol G, Bertorelle G, Pascali V, Barbujani G. Proc Natl Acad Sci USA. 1998;95:9053–9058. doi: 10.1073/pnas.95.15.9053.
  • 19. Harpending H C, Sherry S T, Rogers A R, Stoneking M. Curr Anthropol. 1993;34:483–496.
  • 20. Rogers A R, Jorde L B. Hum Biol. 1995;67:1–36.
  • 21. Comas D, Calafell F, Mateu E, Perez-Lezaun A, Bosch E, Bertranpetit J. Hum Genet. 1997;99:443–449. doi: 10.1007/s004390050386.
  • 22. Quinton P M. In: Fluid and Electrolyte Abnormalities in Exocrine Glands in Cystic Fibrosis. Quinton P M, Martinez R J, Hopfer U, editors. San Francisco: San Francisco Press; 1982. pp. 53–76.
  • 23. Romeo G, Devoto M, Galietta L J. Hum Genet. 1989;84:1–5. doi: 10.1007/BF00210660.
  • 24. Bertranpetit J, Calafell F. In: Variation in the Human Genome. Chadwick D, Cardew G, editors. Chichester, U.K.: Wiley; 1996. pp. 97–114.
  • 25. De Vries H G, Collée J M, de Walle H E K, van Veldhuizen M H R, Smit Sibinga C Th, Scheffer H, ten Kate L P. Hum Genet. 1997;99:74–79. doi: 10.1007/s004390050314.
  • 26. Bertorelle G, Barbujani G. Ann Hum Genet. 1997;61:532–533. doi: 10.1007/s004390050570.
  • 27. Thompson E A, Neel J V. Am J Hum Genet. 1997;60:197–204.
  • 28. Boat T F, Welsh M J, Beaudet A L. In: The Metabolic Basis of Inherited Disease. Scriver C R, Beaudet A L, Sly W S, Valle D, editors. New York: McGraw-Hill; 1989. pp. 2649–2680.
  • 29. Cavalli-Sforza L L, Menozzi P, Piazza A. The History and Geography of Human Genes. Princeton, NJ: Princeton Univ. Press; 1994.
  • 30. Ammerman A J, Cavalli-Sforza L L. The Neolithic Transition and the Genetics of Populations in Europe. Princeton, NJ: Princeton Univ. Press; 1984.
  • 31. Piazza A, Cappello N, Olivetti E, Rendine S. Ann Hum Genet. 1988;52:203–213. doi: 10.1111/j.1469-1809.1988.tb01098.x.
  • 32. Stenico M, Nigro L, Bertorelle G, Calafell F, Capitanio M, Corrain C, Barbujani G. Am J Hum Genet. 1996;59:1363–1375.
  • 33. Rendine S, Calafell F, Cappello N, Gagliardini R, Caramia G, Rigillo N, Silvetti M, Zanda M, Miano A, Battistini F, et al. Ann Hum Genet. 1997;61:411–424. doi: 10.1046/j.1469-1809.1997.6150411.x.
  • 34. Kaplan N L, Lewis P O, Weir B S. Nat Genet. 1994;8:216–218. doi: 10.1038/ng1194-216a.
  • 35. Mussi M. In: The World at 18,000 BP: High Latitudes. Soffer O, Gamble C, editors. London: Allen & Unwin; 1990. pp. 126–147.
  • 36. Rogers A R, Harpending H. Mol Biol Evol. 1992;9:552–569. doi: 10.1093/oxfordjournals.molbev.a040727.
  • 37. Slatkin M, Hudson R R. Genetics. 1991;129:555–562. doi: 10.1093/genetics/129.2.555.
  • 38. Griffiths R C, Tavaré S. Phil Trans R Soc Lond B. 1994;344:403–410. doi: 10.1098/rstb.1994.0079.
  • 39. Polanski A, Kimmel M, Chakraborty R. Proc Natl Acad Sci USA. 1998;95:5456–5461. doi: 10.1073/pnas.95.10.5456.
  • 40. Kuhner M K, Yamato J, Felsenstein J. Genetics. 1998;149:429–434. doi: 10.1093/genetics/149.1.429.
  • 41. Richards M, Corte-Real H, Forster P, Macaulay V, Wilkinson-Herbots H, Demaine A, Papiha S, Hedges R, Bandelt H-J, Sykes B. Am J Hum Genet. 1996;59:185–203.
  • 42. Barbujani G, Bertorelle G, Chikhi L. Am J Hum Genet. 1998;62:488–491. doi: 10.1086/301719.

Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences
