A Coalescent-Based Estimator of Admixture From DNA Sequences

Jinliang Wang

doi:10.1534/genetics.105.054130

. 2006 Jul;173(3):1679–1692. doi: 10.1534/genetics.105.054130

A Coalescent-Based Estimator of Admixture From DNA Sequences

Jinliang Wang ^1,¹

PMCID: PMC1526692 PMID: 16624918

Abstract

A variety of estimators have been developed to use genetic marker information in inferring the admixture proportions (parental contributions) of a hybrid population. The majority of these estimators used allele frequency data, ignored molecular information that is available in markers such as microsatellites and DNA sequences, and assumed that mutations are absent since the admixture event. As a result, these estimators may fail to deliver an estimate or give rather poor estimates when admixture is ancient and thus mutations are not negligible. A previous molecular estimator based its inference of admixture proportions on the average coalescent times between pairs of genes taken from within and between populations. In this article I propose an estimator that considers the entire genealogy of all of the sampled genes and infers admixture proportions from the numbers of segregating sites in DNA sequence samples. By considering the genealogy of all sequences rather than pairs of sequences, this new estimator also allows the joint estimation of other interesting parameters in the admixture model, such as admixture time, divergence time, population size, and mutation rate. Comparative analyses of simulated data indicate that the new coalescent estimator generally yields better estimates of admixture proportions than the previous molecular estimator, especially when the parental populations are not highly differentiated. It also gives reasonably accurate estimates of other admixture parameters. A human mtDNA sequence data set was analyzed to demonstrate the method, and the analysis results are discussed and compared with those from previous studies.

OVER the past 70 years, many statistical methods have been developed and applied to estimating the genetic compositions of admixed/hybrid populations, using genetic marker data (for recent reviews see Beaumont 2003; Choisy et al. 2004; Excoffier et al. 2005). The primary interest is to infer, from the amount and pattern of genetic variation revealed by markers, the proportional contributions of two or more potential parental populations to the gene pool of an admixed population (Chakraborty 1986). Estimating such admixture proportions helps in understanding the evolutionary history of populations (e.g., Chikhi et al. 2002; Wen et al. 2004), in genetic epidemiological investigations (Chakraborty and Weiss 1986, 1988), and in assessing the risk of diseases in human populations. In conservation biology, knowledge of admixture proportions helps in making informed management of endangered species in the wild.

Most methods available use allele frequency data to estimate admixture proportions, exploiting the genetic characteristic of an admixed population that its allele frequencies should be intermediate between those of the parental populations (Cavalli-Sforza and Bodmer 1971; Bertorelle and Excoffier 1998). The main differences among these methods are whether or not to take genetic drift into account and how to select (e.g., Chakraborty et al. 1992) and treat allele frequency data statistically. Traditional methods are usually moment estimators that ignore the genetic drift that occurred to the parental and hybrid populations since the admixture event (e.g., Glass and Li 1953; Roberts and Hiorns 1965; Elston 1971; Long 1991; Chakraborty et al. 1992), while recent ones are usually likelihood or Baysian estimators (e.g., Thompson 1973; Chikhi et al. 2001; Wang 2003), allowing the joint estimation of admixture proportions and genetic drift. A flexible method based on some summary statistics and approximate Bayesian computation (Beaumont et al. 2002; Marjoram et al. 2003) has been proposed recently, which estimates admixture proportion, genetic drift, and mutation parameters simultaneously from linked or unlinked microsatellite markers (Excoffier et al. 2005).

Molecular markers, such as DNA sequences and microsatellites that are now used widely, provide us not only allele frequency information, but also deep genealogical information revealed by the molecular diversity of sampled genes. Most of the above methods do not use such molecular information and assume that mutations are absent since the admixture event, causing two potential problems. One is that discarding molecular information may result in a loss of estimation precision, especially when the mutation rate is high for markers such as large DNA sequences. The other is that when admixture is ancient and thus mutations are not negligible, these methods may fail to deliver an estimate or give rather poor estimates. Realizing these problems, Bertorelle and Excoffier (1998) developed a novel estimator that uses both allele frequency and molecular information and explicitly takes mutations into account in estimating admixture proportions. The estimator was shown to be less biased and, in some situations, to yield more precise estimates of admixture proportions (Bertorelle and Excoffier 1998; Wang 2003). Later on, the estimator was extended by Dupanloup and Bertorelle (2001) to allow three or more parental populations contributing to the admixture.

In this article I develop a new molecular estimator under a well-defined admixture model (Bertorelle and Excoffier 1998; Wang 2003; Excoffier et al. 2005) and compare it with the previous molecular estimator by using both simulated and real data. While the previous molecular estimator bases its inference on the average coalescent times between two genes taken from within and between populations, the current one considers the entire genealogy of the sampled genes and infers admixture from the numbers of segregating sites in DNA sequence samples. By considering the genealogy of all sequences rather than pairs of sequences, this new estimator also allows the joint estimation of other interesting parameters such as admixture time, divergence time, population size, and mutation rate as well as admixture proportions.

METHODS

The admixture model:

Several recent studies adopt the admixture model proposed by Bertorelle and Excoffier (1998), with slight modifications (Chikhi et al. 2001; Wang 2003; Choisy et al. 2004; Excoffier et al. 2005). In this study, I use the admixture model of Excoffier et al. (2005), as illustrated in Figure 1. I assume an ancestral population, P₀, splits into two parental populations, P₁ and P₂, which evolve separately for T_D generations. At that point of time, a hybrid population, P_h, is instantaneously created by combining genes of proportions p₁ and p₂ = 1 − p₁ taken at random from parental populations P₁ and P₂, respectively. After the admixture event, the three populations evolve in isolation for a period of T_A generations, when a sample of individuals is taken from each population for examining some markers. I also assume, as is implicit in all previous admixture models, that neither direct nor indirect selection is associated with the markers surveyed, and the markers are from diploid and autosomal loci. With 2N (N, the effective population size) replaced by N, however, the method applies to haploid markers (such as mtDNA) as well.

Figure 1.— — The admixture model. It is assumed that an ancestral population, P₀, is split into two parental populations P₁ and P₂, which evolve independently for T_D generations before they contribute genes of proportions p₁ and 1 − p₁ to form the hybrid population, P_h. After the admixture event, P₁, P₂, and P_h evolve independently for T_A generations before a sample of individuals is taken from each of them to assay some markers. The effective sizes of populations P₀, P₁, P₂, and P_h are N₀, N₁, N₂, and N_h, respectively.

The above admixture model is characterized by seven parameters, which are the effective sizes of the ancestral (N₀), parental (N₁ and N₂), and admixed (N_h) populations; the times (in the unit of generations) of divergence (T_D) and admixture (T_A); and the admixture proportion p₁. The seven parameters are denoted by set ω = {N₀, N₁, N₂, N_h, T_A, T_D, p₁}. Without an external standard, however, it is impossible to estimate the absolute values of population size and time. They are therefore rescaled by the mutation rate (μ) of markers as θ_k = 4N_kμ (k = 0, 1, 2, h), τ_k = T_kμ (k = A, D), and the estimable parameters are denoted by Ω = {θ₀, θ₁, θ₂, θ_h, τ_A, τ_D, p₁}.

The mutation model:

To utilize molecular information and account for mutations explicitly in estimating admixture, a suitable mutation model must be specified to describe the mutational process of markers. Herein I use DNA sequences as markers, which are assumed to follow the infinite-site model of mutations (Kimura 1969). Under this model, a locus is composed of so many sites that no more than one mutation occurs at any site in the genealogy of the sampled sequences. I also assume a constant-rate neutral mutation process, in which each offspring sequence differs from its parental one by an average of μ mutations. Under these assumptions, the number of mutations in a sample of DNA sequences is identical to the number of nucleotides that are polymorphic (segregating) in the sample. Therefore, the expected number of segregating sites in a sample is simply the product of μ and the expected total branch length of the genealogy of the sample (see, e.g., Tajima 1983; Hudson 1990).

Expected numbers of segregating sites:

In this section, I derive the expected total branch length of the genealogy (ETBLG) (denoted by Δ) of a sample of DNA sequences given parameter set ω. The expected number of segregating sites in a sample is simply the product of Δ and mutation rate μ under the infinite-site model. In the next section, the observed numbers of segregating sites are fitted to these expected values to obtain least-squares estimates of Ω.

Suppose n₁, n₂, and n_h sequences of a given locus are sampled at random from the current P₁, P₂, and P_h populations, respectively. The n = n₁ + n₂ + n_h sequences can be arranged to constitute seven composite (artificial) samples. Samples 1, 2, and 3 contain sequences solely from populations P₁, P₂, and P_h, respectively. Samples 4, 5, and 6 are obtained by merging samples 1 and 2, 1 and 3, and 2 and 3, respectively. Sample 7 contains all of the n sequences. The sample sizes (number of sequences) for samples 1, 2, … , 7 are therefore n₁, n₂, n_h, n₁ + n₂, n₁ + n_h, n₂ + n_h, and n₁ + n₂ + n_h, respectively.

Expected total branch length of a genealogy:

For convenience in deriving the ETBLG of a sample of sequences, time is measured backward hereafter. The current time when the sample was taken is designated as generation zero and the time T generations ago is referred to as generation T. Consider the ETBLG of sample 1, Δ₁, conditional on parameters ω. The genealogy can be partitioned into two segments, the first being formed by the coalescent process in population P₁ during time interval [0, T_A + T_D], while the second is formed by the coalescent process in population P₀ during time interval [T_A + T_D, ∞]. The expected total branch length of segment one can be derived, as shown in appendix a, as

(1)

where i = n₁ is the initial number of sequences, N = N₁ is the effective population size, T = T_A + T_D is the time interval, and Inline graphic and are the rising and falling factorial functions, respectively.

For the second segment, the initial number of sequences, j (n₁ ≥ j ≥ 2), at time T_A + T_D is a random variable. The probability of an initial i sequences at time zero coalescing into j sequences at time T in a population with effective size N is

(2)

(Tavaré 1984). Inserting i = n₁, T = T_A + T_D, and N = N₁ into (2) gives the probability that j (n₁ ≥ j ≥ 2) sequences are left extant at time T_A + T_D in population 1. Given j at time T_A + T_D and the effective size of the ancestral population N₀, the expected total branch length of the second segment of genealogy (e.g., Hudson 1990) is Inline graphic . Summing over all possible values of j gives the expected total branch length of the second segment of genealogy

(3)

where i = n₁, N = N₁, T = T_A + T_D. The ETBLG is the sum of (1) and (3),

(4)

When Inline graphic , irrespective of the parameter values of , and N₀, as is expected because the most recent common ancestor (MRCA) must be found in the first segment of genealogy. In such a case, (4) reduces to , the expected total branch length of the genealogy of n₁ sequences from a population of a constant size N₁ (e.g., Hudson 1990). It can be shown that when N₀ = N₁, (4) also reduces to Inline graphic irrespective of T_A and T_D, as expected.

Similarly, the ETBLG of sample 2 conditional on parameter set ω, Δ₂, is calculated by the right side of (4), replacing N₁ by N₂ and n₁ by n₂, respectively.

The derivation for the ETBLG of sample 3 is much more complicated. The genealogy is again partitioned into two segments. The first segment is formed by the coalescent process of an initial n_h sequences in population P_h during interval [0, T_A]. The expected total branch length of this segment is calculated by (1) with i = n_h, N = N_h, and T = T_A. The second segment is formed by the coalescent processes in populations P₁ and/or P₂ during time interval [T_A, T_A + T_D] and then in ancestral population P₀ during time interval [T_A + T_D, ∞]. The probability of i = n_h sequences coalescing into j = m sequences at time T = T_A in population P_h with effective size N = N_h is calculated by (2). According to the sources of the m extant sequences at time T_A, three cases are distinguishable.

Case 1: The m extant sequences at time T_A come exclusively from population P₁. When this case occurs, with probability , the expected total branch length of the second segment of genealogy given m can be calculated by the sum of (1) and (3), replacing i, N, and T by m, N₁, and T_D, respectively.
Case 2: The m extant sequences at time T_A come exclusively from population P₂. When this case occurs, with probability where , the expected total length of the second segment of genealogy given m can be calculated by the sum of (1) and (3), replacing i, N, and T by m, N₂, and T_D, respectively.
Case 3: Among the m extant sequences at time T_A, m₁ (0 < m₁ < m) sequences come from population P₁ and m₂ = m − m₁ from P₂. When this case occurs, with the binomial probability of , the second segment of genealogy can be further partitioned into three subsegments. Subsegment 1 is formed by the coalescent process of the initial m₁ sequences at time T_A in population P₁ during interval [T_A, T_A + T_D]. The expected total branch length of this subsegment can be derived, shown in appendix a, as

(5)

where i = m₁ is the initial number of sequences, N = N₁ is the effective size, and T = T_D is the length of time. Subsegment 2 is formed by the coalescent process of the initial m₂ sequences at time T_A in population P₂ during interval [T_A, T_A + T_D]. The expected total branch length of this subsegment can be calculated similarly by (5) replacing i, N, and T by m₂, N₂, and T_D, respectively. The third subsegment is formed by the coalescent process in ancestral population P₀ during the time interval [T_A + T_D, ∞]. Suppose m₃ and m₄ sequences are extant at time T_A + T_D in populations P₁ and P₂, respectively, with probabilities Inline graphic and , respectively, calculated by (2). The expected total branch length of the segment of genealogy for the given initial m₃ + m₄ sequences in population P₀ during the time interval [T_A + T_D, ∞] is (e.g., Hudson 1990) . Considering all possible values of m₃ and m₄ leads to the expected total branch length of the third subsegment of genealogy, Inline graphic . Summing over the three subsegments gives the expected total branch length of the second segment of genealogy in case 3.

Summing over the three cases yields the expected total branch length of genealogy for segment 2. Adding the expected total branch lengths of segments 1 and 2 gives the ETBLG of sample 3,

(6)

where Inline graphic and . It can be shown that, when N_k = (k = 0, 1, 2, h) and n_h = 2, (6) reduces to , which is twice the expected coalescent time between a pair of sequences from the admixed haploid population derived by Bertorelle and Excoffier (1998).

Using an approach similar to the derivation of (6), I also obtained, as shown in appendix a, the equations for the ETBLGs of samples 4–7.

Expected number of segregating sites:

Under the infinite-site model assumed above, the expected number of segregating sites (ES) in a sample is simply the product of mutation rate μ and ETBLG of the sample. The expected number of segregating sites of the kth sample conditional on parameter set Ω, ES_k, can then be calculated by the equation for Δ_k by replacing N_j, T_A, and T_D with θ_j/4, τ_A, and τ_B, respectively, where k = 1, 2, … , 7 and j = 0, 1, 2, h.

Estimation of parameters:

Suppose, for a single locus with (unknown) mutation rate μ, the number of segregating sites in sample k (k = 1, 2, … , 7) is observed to be OS_k. Estimates of the parameters Ω = {θ₀, θ₁, θ₂, θ_h, τ_A, τ_D, p₁} can be obtained by fitting these observed to the expected numbers of segregating sites by a least-squares approach,

(7)

where ES_k is calculated as shown above. Since f(Ω) is a complicated function of the seven parameters and no closed form of solution is possible, some numerical methods have to be adopted for the estimation. The first derivatives of f(Ω) with respect to each of the seven parameters can be obtained and used in the multidimensional Newton–Raphson algorithm for the estimates of Ω from (7). However, the computation is intensive, especially when sample sizes are large, because both function (7) and its derivatives are not trivial to compute. Further, such an algorithm is sometimes fooled by a local rather than a global minimum of f(Ω). Having tried several methods, I finally choose to use Powell's quadratically convergent method (Press et al. 1996) with slight modifications. This algorithm does not require the computation of derivatives, and with the modification it updates only one of the parameters in most iterations so that only part of (7) needs to be recalculated. For example, updating θ₁ does not alter ES₂ but changes only parts of the calculations of ES_j for j = 3, 4, … , 7. Therefore, the algorithm coupled with storing/reusing the computational results for different parts of ES_j could reduce computational burden tremendously. To speed up computation, this algorithm occasionally updates multiple parameters simultaneously along an optimal direction determined by collecting and using information of previous iterations. Some comparative analyses of simulated data indicated that Powell's algorithm is less often stuck on a local minimum than the Newton–Raphson algorithm. In the results shown below, each simulated data set is analyzed in five independent replicates, each with a randomly chosen set of starting parameter values. The final estimates are those from the replicate with the minimum value of f(Ω). To analyze an empirical data set, more starting points can be used to obtain more reliable estimates.

The computational load of (7) increases rapidly with the numbers of sequences from the three populations. Furthermore, it is difficult to calculate (2) quickly and accurately because it is a series having terms of large values and alternating signs. To avoid large numerical errors in calculating (2) and thus (7) for large genealogies (a sample of ≥100 sequences), I conduct computations using high precision of hundreds of significant digits (depending on sample size). An alternative option is to adopt a Markov chain Monte Carlo method proposed by Griffiths and Tavaré (1994) as in O'Ryan et al. (1998).

Multiple loci:

For multiple independent loci, it is inappropriate to use simply the average numbers of segregating sites over loci as data in the estimation. Distinctive loci may have different mutation rates because of the differences in mutation rate per base pair and/or in the sequence length. Different loci may also have different sample sizes, resulting in different ETBLGs and thus different expected numbers of segregating sites. The above methodology can be extended to use multilocus data jointly in estimating the parameters Ω = {θ₀, θ₁, θ₂, θ_h, τ_A, τ_D, p₁}. In addition, the relative mutation rate of each locus can be estimated simultaneously.

Suppose a number of L unlinked loci have been surveyed, with locus l (l = 1, 2, … , L) having a mutation rate μ_l and a sample size of n_j_,l sequences from population P_j (j = 1, 2, h). Without loss of generality, I scale parameters N_j (j = 0, 1, 2, h), T_A, T_D, and μ_l (l = 2, 3, … , L) by μ₁, the mutation rate of the first locus. The set of parameters to be estimated now becomes Ω = {θ₀, θ₁, θ₂, θ_h, τ_A, τ_D, p₁, λ₂, λ₃, … , λ_L}, where θ_j = 4N_jμ₁ (j = 0, 1, 2, h), τ_A =T_Aμ₁, τ_D =T_Dμ₁, and λ_l = μ_l/μ₁ (l = 2, 3, … , L). The least-squares function for multiple loci becomes

(8)

where OS_j,l and ES_j,l are observed and expected numbers of segregating sites for locus l in composite sample j (j = 1, 2, … , 7). ES_j,1 is calculated as before, while ES_j,l for locus l with l ≥ 2 is calculated using parameters {θ₀, θ₁, θ₂, θ_h, τ_A, τ_D, p₁} and then multiplied by λ_l.

The value of p₁ and the relative values of θ_j (j = 0, 1, 2, h), τ_A, and τ_D obtained from (8) are independent of the locus chosen to scale the parameters. This is because mutation rate does not affect the genealogies and acts only as a multiplier with ETBLGs in determining the expected numbers of segregating sites. Using a locus with a smaller (larger) mutation rate to scale the parameters causes just a proportional decrease (increase) in the estimates of θ_j (j = 0, 1, 2, h), τ_A, and τ_D and has no affect on the estimates of p₁. This can be checked easily by simulations.

Simulations:

Monte Carlo simulations were run to generate data sets with known parameters. These data were then analyzed by the newly developed estimator to check its quality of estimates, to investigate its statistical properties, and to compare it with a previous molecular estimator. Although quite a few admixture estimators are available, only the one of Bertorelle and Excoffier (1998) is a molecular estimator designed to use molecular information and take mutations into account. Therefore I confine my comparison to this molecular estimator in this study.

Following the coalescence approach (Hudson 1990) and the admixture model (Figure 1), the genealogies of the n₁ + n₂ + n_h DNA sequences from the current three populations were reconstructed until the MRCA was found. Poisson-distributed mutations were then imposed on the reconstructed gene tree. Recombination was assumed to be absent and mutations were assumed to follow the infinite-site model. Data for different loci were generated independently, and monomorphic loci with no segregating sites were discarded. The sequence data were then processed to extract information for different estimators. For the current estimator, the number of segregating sites in each of the seven composite samples was obtained. For Bertorelle and Excoffier's molecular estimator, the mean coalescence time (scaled by mutation rate) was estimated by the mean number of site differences between pairs of sequences.

Several statistics are adopted to measure the quality of estimates from simulated data. First, the applicability (denoted as Appl%) of an estimator is calculated as the percentage of replicates in which admixture proportion estimates can be made successfully and the estimates are in the legitimate range of [0, 1] (Choisy et al. 2004). Second, the mean and root mean square errors (the square root of mean squared errors, denoted by RMSE) of estimates across replicates are calculated. Third, “factor 2” is calculated as the proportion of replicates in which the estimated value is within the interval bounded by values equal to 50 and 200% that of the true value (Excoffier et al. 2005). This measurement overlaps with RMSE in telling how close the estimates are to the true parameter value, but it is less affected by extreme outliers of the distribution of estimates. For most combinations of parameters, 1000 replicates were conducted.

Analysis of an empirical data set:

For demonstration, the estimator proposed herein was applied to the analysis of a published data set from McLean et al. (2003). They sequenced the hypervariable segments I (HVSI, 364 bp in length) and II (HVSII, 343 bp in length) of the mtDNA from 47 Sierra Leoneans, 12 European–Americans, 12 rural Gullah-speaking African–Americans, 12 urban African–Americans living in Charleston, South Carolina, and 12 Jamaicans. Assuming that African–American populations are admixtures by Europeans and Africans (e.g., Parra et al. 1998), the mtDNA data can be analyzed by the coalescent estimator to infer the European genetic contributions to the gene pool of each of the three African–American populations and the admixture time, divergence time, and genetic drift (population size) of each parental and admixed population involved. Sites in HVSI and HVSII sequences with missing or ambiguous information were eliminated, resulting in 295 and 275 unambiguous sites for HVSI and HVSII, respectively, utilized in the analysis. Due to the absence of recombination in human mtDNA, HVSI and HVSII are effectively a single locus. Sequences for the two loci are thus combined to form single-locus data before being analyzed by the two molecular estimators of admixture.

RESULTS

Simulations:

Many factors are important in determining the quality of admixture estimates, including the true parameters (defined by the genetic model in Figure 1) being estimated and the marker information content influenced by the number of loci, the number of individuals genotyped, and the polymorphism of each locus (e.g., Wang 2003; Choisy et al. 2004; Excoffier et al. 2005). The factor combinations are prohibitively too numerous to consider in a simulation study. Here I choose to present the estimation results in some hopefully typical scenarios.

The performances of the current and previous molecular estimators in estimating admixture proportions for the scenarios of a short or long divergence time (T_D = 2500 or 10,000 generations); short, intermediate, or long admixture time (T_A = 50, 500, or 5000 generations); and small or moderate admixture (p₁ = 0.05 or 0.20) are summarized in Table 1. Sample size for each population is assumed to be either 20 or 40, and the number of loci is assumed to be 1, 5, or 10 with the same mutation rate of 0.001 per DNA sequence per generation. When the divergence time is short (T_D = 2500) so that parental populations are not highly differentiated (T_D/N = 0.5) when admixture occurs, the Appl% of the m_Y estimator is only ∼70–90% for the case of a single locus, and 10–30% of p₁ estimates from this estimator are either smaller than zero or greater than one. Although the Appl% of the m_Y estimator improves with an increasing amount of marker information (mainly number of loci), it is still <90% for slight admixture even if 5 loci are used. Note that the increase in Appl% of the m_Y estimator with an increasing T_A is an artifact, because a large T_A results in the estimates of p₁ biased toward 0.5 and thus in fewer negative estimates. In contrast, the new estimator gives the estimates of p₁ that are always in the legitimate range of [0, 1]. Compared with the m_Y estimator, the new estimator is generally much less biased and has much smaller RMSE.

TABLE 1.

Estimates of admixture proportions from simulations

			T_D = 2,500					T_D = 10,000
New estimator		m_Y			New estimator		m_Y
S/L	p₁	T_A	Mean	RMSE	Mean	RMSE	Appl%	Mean	RMSE	Mean	RMSE	Appl%
20/1	0.05	50	0.075	0.138	0.126	0.343	71.5	0.081	0.128	0.071	0.085	86.6
500	0.102	0.186	0.178	0.387	76.5	0.088	0.146	0.087	0.125	86.8
5000	0.260	0.478	0.347	0.523	83.1	0.102	0.213	0.199	0.247	95.7
0.20	50	0.236	0.220	0.269	0.309	85.7	0.251	0.207	0.221	0.114	98.7
500	0.249	0.259	0.278	0.375	83.6	0.265	0.225	0.230	0.151	97.8
5000	0.336	0.415	0.379	0.470	84.0	0.257	0.293	0.295	0.261	96.8
20/5	0.05	50	0.054	0.065	0.129	0.150	88.4	0.059	0.047	0.068	0.042	97.6
500	0.067	0.097	0.167	0.189	90.6	0.064	0.056	0.080	0.058	97.3
5000	0.186	0.302	0.346	0.353	97.1	0.063	0.088	0.188	0.165	99.8
0.20	50	0.212	0.126	0.260	0.138	97.6	0.215	0.104	0.211	0.054	100
500	0.232	0.163	0.286	0.177	97.0	0.227	0.118	0.221	0.075	99.9
5000	0.285	0.324	0.389	0.268	98.2	0.215	0.160	0.292	0.153	99.9
20/10	0.05	50	0.051	0.048	0.127	0.117	93.4	0.054	0.031	0.067	0.033	99.2
500	0.063	0.076	0.172	0.160	95.5	0.059	0.037	0.080	0.047	99.2
5000	0.158	0.251	0.352	0.328	99.6	0.055	0.062	0.184	0.148	100
0.20	50	0.209	0.097	0.258	0.100	99.9	0.209	0.076	0.212	0.041	100
500	0.224	0.126	0.281	0.136	98.9	0.219	0.087	0.222	0.056	100
5000	0.258	0.275	0.400	0.236	99.8	0.204	0.116	0.289	0.124	100
40/1	0.05	50	0.064	0.101	0.100	0.258	72.5	0.073	0.099	0.061	0.072	88.4
500	0.096	0.167	0.139	0.365	74.3	0.089	0.133	0.076	0.103	84.9
5000	0.238	0.405	0.356	0.533	82.2	0.088	0.190	0.191	0.232	94.7
0.20	50	0.216	0.171	0.234	0.237	89.8	0.231	0.187	0.205	0.083	99.0
500	0.255	0.229	0.249	0.341	83.6	0.271	0.226	0.215	0.129	97.6
5000	0.324	0.410	0.408	0.482	84.9	0.254	0.291	0.298	0.267	95.7
40/5	0.05	50	0.054	0.048	0.098	0.108	86.8	0.055	0.038	0.059	0.030	98.0
500	0.070	0.086	0.145	0.174	86.6	0.064	0.055	0.075	0.051	97.2
5000	0.173	0.286	0.348	0.354	97.3	0.066	0.090	0.182	0.160	99.8
0.20	50	0.206	0.097	0.229	0.101	97.9	0.205	0.099	0.204	0.040	100
500	0.229	0.135	0.260	0.151	96.9	0.230	0.116	0.216	0.064	100
5000	0.277	0.308	0.396	0.269	97.8	0.222	0.160	0.289	0.152	99.8
40/10	0.05	50	0.052	0.034	0.099	0.083	94.0	0.053	0.025	0.059	0.022	99.6
500	0.066	0.064	0.146	0.136	93.1	0.059	0.035	0.074	0.039	99.5
5000	0.134	0.215	0.349	0.325	99.7	0.059	0.064	0.179	0.145	100
0.20	50	0.205	0.071	0.230	0.072	100	0.198	0.067	0.205	0.029	100
500	0.229	0.101	0.267	0.117	99.3	0.217	0.085	0.215	0.047	100
		5000	0.256	0.264	0.399	0.236	99.9	0.210	0.120	0.283	0.121	100

Open in a new tab

For each parameter combination, estimates of admixture proportion are obtained from 1000 replicates using the new estimator and Bertorelle and Excoffier's (1998) estimator m_Y. The parameters are T_D = 2500 or 10,000 generations; T_A = 50, 500, or 5000 generations; N_j = 5000 (j = 0, 1, 2, h); the sample size S = 20 or 40 for all three populations; the number of loci L = 1, 5, or 10; and the admixture proportion p₁ = 0.05 or 0.20. The applicability of the new estimator is always 100% irrespective the parameter combinations and is thus not listed.

When the divergence time is long (T_D = 10,000 generations) so that parental populations are highly differentiated (T_D/N = 2) before they contribute to the admixture, the performances of the two estimators become similar. The m_Y estimator has an Appl% close to 100%, except when a single locus is used in estimating small admixture proportions. The new estimator is less biased than the m_Y estimator, especially when admixture is small and T_A is large. The main merit of molecular estimators in comparison with traditional estimators is that mutations after the admixture events can be accounted for so that ancient admixture can be inferred accurately. The current estimator allows almost unbiased estimation of admixture proportions even if T_A = 5000 (T_A/N = 1) when the divergence time is long and multilocus data are available. In contrast, the m_Y estimator gives estimates of p₁ biased toward 0.5 when admixture is ancient.

In addition to admixture proportions, the new estimator can also provide estimates of other interesting parameters. Table 2 summarizes the properties of the estimates of θ₀, θ₁, θ₂, θ_h, τ_A, τ_D, and relative mutation rates (λ_l). It can be seen that θ₁, θ₂, and μ_l are very well estimated with small biases and RMSEs, while θ_h is the most difficult parameter to estimate. This is understandable because information about θ_h comes from the coalescent events in the hybrid population during time interval [0, T_A] only, and these events are too few when n_h and T_A/N_h are small to allow accurate estimates of θ_h. Indeed, the quality of θ_h-estimates increases with an increasing sample size and admixture time, as shown in Table 2. Similarly, the quality of τ_A estimates is dependent on the number of coalescent events in the three populations during interval [0, T_A] and thus increases with an increasing admixture time (T_A) and a decreasing population size (N₁, N₂, N_h). In contrast, τ_D is more accurately estimated with a decreasing admixture time (T_A). Small T_A means fewer coalescent events during interval [0, T_A] and more coalescent events during time interval [T_A, T_A + T_D] and thus more information about τ_D. For similar reasons, θ₀ is better estimated with a smaller value of T_A + T_D. The estimates of all parameters are improved substantially by increasing the number of loci and sample sizes.

TABLE 2.

Estimates of divergence and admixture times and population sizes from simulations

		One locus			Five loci
Parameters	True value	Mean	RMSE	Factor2	Mean	RMSE	Factor2
T_D = 10,000, T_A = 500, S = 40
θ₁	20	24.2	13.6	0.87	21.2	5.5	0.99
θ₂	20	23.1	10.7	0.91	21.1	5.4	0.99
θ_h	20	42.1	105.1	0.44	26.2	28.9	0.86
θ₀	20	12.1	19.1	0.31	15.4	12.9	0.60
τ_A	0.5	0.41	0.38	0.53	0.46	0.21	0.85
τ_D	10	15.6	11.2	0.71	15.4	8.0	0.78
λ_l	1	1.02	0.26	0.99
T_D = 2,500, T_A = 50, S = 40
θ₁	20	17.9	11.8	0.75	18.4	6.0	0.95
θ₂	20	17.1	9.1	0.80	18.2	5.3	0.97
θ_h	20	79.0	183.1	0.05	49.4	132.3	0.22
θ₀	20	24.4	14.2	0.75	23.1	7.5	0.97
τ_A	0.05	0.09	0.14	0.17	0.08	0.10	0.33
τ_D	2.5	2.7	1.5	0.78	2.6	0.86	0.96
λ_l	1	1.05	0.31	0.99
T_D = 10,000, T_A = 500, S = 100
θ₁	20	22.6	9.0	0.94	21.7	7.6	0.95
θ₂	20	22.8	9.1	0.95	22.1	7.7	0.96
θ_h	20	39.6	28.9	0.74	28.1	27.9	0.90
θ₀	20	12.0	17.3	0.33	18.3	13.7	0.61
τ_A	0.5	0.49	0.37	0.68	0.48	0.23	0.85
τ_D	10	17.2	14.0	0.63	13.6	7.5	0.76
λ_l	1	1.02	0.21	1.00
T_D = 2,500, T_A = 50, S = 100
θ₁	20	19.2	7.0	0.93	19.2	4.7	0.98
θ₂	20	18.5	5.9	0.94	19.2	4.9	0.99
θ_h	20	39.8	85.0	0.24	35.3	72.5	0.50
θ₀	20	23.5	11.7	0.83	22.2	7.1	0.99
τ_A	0.05	0.06	0.09	0.45	0.06	0.04	0.63
τ_D	2.5	2.8	1.6	0.78	2.8	1.0	0.94
λ_l	1				0.99	0.22	1.00

Open in a new tab

The parameters being estimated are θ_i = 4μ₁N_i (i = 0, 1, 2, h), τ_A = μ₁T_A, τ_D = μ₁T_D, and λ_l = μ_l/μ₁ (l = 2, 3, 4, 5). A total of 1000 simulated data sets are generated and analyzed, assuming N_i = 5000 (i = 0, 1, 2, h), p₁ = 0.2, μ_l = 0.001 (l = 1, … , 5), and T_D = 10,000 and T_A = 500 or T_D = 2500 and T_A = 50. The sample size S = 40 or 100 for all three populations, and the number of loci L = 1 or 5.

I adopt the parametric bootstrapping technique to assess the uncertainties of admixture estimates from the new estimator. This is rendered possible because all the parameters fully defining the admixture model in Figure 1 can be estimated by the new estimator. Parametric bootstrapping is more appropriate than nonparametric bootstrapping (Bertorelle and Excoffier 1998) because the latter tends to yield too conservative estimates of uncertainties when the number of resampling units (loci, sequences) is small. Due to the heavy computational burden of the current estimator, however, it is difficult to evaluate the performance of the parametric bootstrapping procedure using extensive simulations. Table 3 lists the uncertainty estimates, which are average upper and lower limits of 95% confidence intervals (C.I.95%) and coverage (frequency of the true parameter value being covered by the estimated C.I.95%), for the cases of a single locus and five loci. In each case, 100 replicate data sets are simulated, and each data set is analyzed for point and C.I.95% estimates using 500 bootstrapping samples. The parameter values used in generating the simulated data sets are θ_k = 20 (k = 0, 1, 2, h), τ_A = 0.5, τ_D = 10, μ = 0.001, and 20 sequences for each locus from each population. As can be seen, the true parameter value is included in the estimated 95% confidence intervals in ∼95% of the replicates for both the single-locus and the five-loci cases. The confidence intervals for five loci are much narrower than those for a single locus, as is expected. In accordance with the results listed in Table 2, θ_h is the most difficult parameter to estimate, as indicated by the extremely large confidence intervals.

TABLE 3.

Estimates of confidence intervals by parametric bootstrapping

		One locus			Five loci
Parameters	True value	Coverage	C.I.95_L	C.I.95_U	Coverage	C.I.95_L	C.I.95_U
θ₁	20	0.91	6.23	74.44	0.91	7.18	30.71
θ₂	20	0.93	5.68	42.87	0.92	8.07	30.07
θ_h	20	0.99	1.08	∞	0.98	2.71	∞
θ₀	20	0.99	0.41	94.17	0.99	0.55	78.12
τ_A	0.5	0.95	0.05	1.95	0.96	0.06	1.20
τ_D	10	0.93	3.62	36.62	0.94	4.43	34.75
p₁	0.2	0.92	0.03	0.75	0.95	0.07	0.49

Open in a new tab

A total of 100 data sets were simulated, assuming the parameter values listed in column 2, μ_l = 0.001 (l = 1, … , 5), and a sample size S = 20 for each population and locus. Each data set was analyzed by the coalescent estimator with 500 bootstrapping samples. C.I.95_L (C.I.95_U) is the average lower (upper) limit of the estimated 95% confidence intervals, and coverage gives the frequency that the true parameter value is included in the estimated 95% confidence intervals.

Admixture analysis of human populations:

The mtDNA sequence data from McLean et al. (2003) are analyzed by Bertorelle and Excoffier's (1998) m_Y estimator and the new estimator. Parametric bootstrapping and nonparametric bootstrapping are adopted for the new and m_Y estimators, respectively, to ascertain the uncertainties of the estimates using 1000 samples of size identical to the original samples. The estimates of the European contributions to each of the three admixed African–American and Jamaican populations are listed in Table 4.

TABLE 4.

Admixture analysis results of three human admixed populations

	Gullah		Charleston		Jamaican
Admixture parameters	Estimate	95% C.I.	Estimate	95% C.I.	Estimate	95% C.I.
τ_D	2.053	0.113, 5.844	1.883	0.003, 6.242	1.868	0.001, 6.057
τ_A	0.001	0.000, 0.848	0.114	0.000, 0.820	0.196	0.000, 1.050
θ₁	5.006	0.565, 22.931	4.175	0.034, 18.860	3.972	0.020, 18.992
θ₂	45.146	8.662, 106.283	32.207	0.152, 143.713	34.083	0.129, 137.628
θ_h	0.001	0.000, ∞	0.848	0.000, ∞	2.142	0.000, ∞
θ₀	16.492	1.927, 47.619	19.933	1.728, 49.476	21.787	0.847, 53.180
p₁	0.001	0.000, 0.546	0.064	0.000, 0.752	0.048	0.000, 0.870
p₁*	0.090	−0.343, 0.356	0.500	−0.008, 0.681	−0.063	−0.490, 0.330

Open in a new tab

The 95% confidence intervals are obtained from 1000 bootstrapping samples. The estimated parameters are θ_i = 4μN_i, τ_A = μT_A, τ_D = μT_D, where N_i is the female effective size of population i (i = 0, 1, 2, h for ancestral, European, African, and admixed populations, respectively). The European contribution to an admixed population was obtained from the new estimator (line headed by p₁) and the m_Y estimator (line headed by p₁*). The other seven parameters were estimated by the new estimator only.

The European contributions to the three admixed populations are estimated to be <7% from the new coalescent estimator and are in close agreement with previous estimates (Parra et al. 2001; McLean et al. 2003). McLean et al. (2003) calculated admixture proportions from the frequencies of haplotypes composed of three HVS restriction site polymorphisms (RSPs). These RSPs are chosen because they are substantially differentiated between African and European populations and are thus highly informative for admixture analysis. Furthermore, a large number of 1396 individuals from the same five populations as the mtDNA sequence data analyzed herein are assayed for the RSPs. The estimated European contributions from their analyses are 0.030, 0.069, and −0.027 for the Gullahs, Charleston African–Americans, and the Jamaicans, respectively. From large samples (∼100 sequences per population) of HVSI data, the European contributions to the three admixed populations were estimated to be ∼0% using the highly informative haplogroup H frequencies (Parra et al. 2001) and were estimated to be 0.065 and 0.129 for the Charleston and Jamaican populations, respectively, from both haplogroup H and L frequencies (Parra et al. 1998). In general, these estimates are much lower than those inferred from many informative nuclear markers (Parra et al. 1998), indicating that European females contributed little to the admixtures. The sex-biased admixtures, with European males contributing substantially greater than European females, were confirmed by analyzing the Y Alu polymorphic (YAP) informative markers (Parra et al. 1998). It is encouraging that with a sample size as small as 12 sequences, the new coalescent estimator yields similar results.

The m_Y estimator yields estimates of European contributions to the Gullah or Jamaican population that are low and roughly compatible with estimates from the other estimators and data, but estimates of European contribution to the Charleston population (0.5) that is much larger than other estimates and is even larger than the estimate from nuclear markers (0.12, Parra et al. 1998). The estimated European contribution to the Jamaican population was −0.063, which is not surprising given the simulation result that the m_Y estimator often yields negative estimates when admixture proportion is low (Table 1).

For both molecular estimators and all of the three admixed populations, the 95% confidence intervals for the admixture estimates determined by parametric and nonparametric bootstrapping are quite broad. This is perhaps not surprising because the data set is small with effectively a single locus and only 12 sequences from a population. More loci and larger sample sizes are required to obtain more precise admixture estimates.

In addition to admixture proportions, the new estimator also gives the estimates of divergence times, admixture times, population sizes, and relative mutation rates. These estimates are listed in Table 4. Because human mtDNA does not recombine (Pakendorf and Stoneking 2005), HVSI and -II sequences were obtained from the same individuals (McLean et al. 2003), and the sample sizes are very small, the genealogical information that can be extracted from the data set is quite limited and the analysis results need to be interpreted with caution. Divergence time between the Africans and Europeans is estimated to be 1.9 on average. In the literature, the estimates of mutation rate for HVSs are quite diverse, the median value being ∼0.1/site/million years (MY) (Pakendorf and Stoneking 2005; Santos et al. 2005). The total mutation rate for HVSI and -II is therefore ∼70/MY or 0.0021/generation if the generation interval is taken to be 30 years. The absolute divergence time is thus estimated to be 1.9/(70/MY) = 27,143 years, which is roughly in agreement with the estimate of <60,000 years from phylogenetic analysis of mtDNA control regions (e.g., Watson et al. 1997; Quintana-Murci et al. 1999). Subsequent migration after the split of Eurasian from African population could reduce the divergence and thus lead to an underestimation of divergence time (Wang 2003).

The point estimates of admixture time are quite variable from the three admixture analyses. Using the mutation rate estimate of 70/MY, the admixture time is estimated to be 6, 1628, and 2801 years ago for the Gullah, Charleston, and Jamaican populations, respectively. The first estimate is too small while the last two estimates are too high compared with the historical evidence that the American–African populations were formed 150 years ago (Parra et al. 1998). For all of the three admixed populations, however, the 95%C.I.'s are fairly consistent and well include the admixture time of 150 years.

The parental and ancestral population sizes (θ₁, θ₂, θ₀) are well estimated while the admixed population size (θ_h) is poorly estimated, as indicated by the corresponding widths of 95%C.I.'s. On average, the African population size is estimated to be 8 and 37 times larger than that of European and admixed populations, respectively, and the European population is 4 times larger than admixed populations. The results seem to be plausible and are at least qualitatively in agreement with previous studies.

DISCUSSION

In this article, I show that DNA sequence data can be utilized more efficiently in admixture inferences by considering the entire genealogy of all sampled sequences rather than the genealogy of pairs of sequences. In comparison with a previous molecular estimator (Bertorelle and Excoffier 1998), the new estimator provides better estimates of admixture proportions, which are always in the legitimate range of [0, 1] and have usually higher accuracy and precision, especially when divergence time is short and/or admixture time is long (Table 1). In addition, it allows reasonably good estimation of other important parameters of the admixture model, such as the divergence time, admixture time, and population sizes (Table 2). These parameters are scaled by the mutation rate of the markers, but their relative values are still meaningful in understanding the admixture events. When marker mutation rates are known, the absolute values of divergence and admixture times and population sizes can be easily calculated from the estimates. Other advantages of the new estimator are that it can use information from multiple loci with different mutation rates in estimating admixture and relative mutation rates jointly and that it automatically accounts for variable sample sizes both among and within loci because its inferences are based on the genealogy of an entire sample rather than pairs of sequences. The simulation results shown in Tables 1 and 2 assumed equal sample sizes among populations and loci and equal mutation rates among loci. When either or both of these two quantities vary, the new estimator is expected to perform even better than the previous molecular estimator (Bertorelle and Excoffier 1998). Furthermore, the previous molecular estimator assumes an equal size of all populations involved in the admixture. The assumption is now redundant in the new estimator and all population sizes can be estimated jointly with admixture proportions.

The differences between the current and previous molecular estimators of admixture are in close analogy with those between Watterson's and Tajima's estimators of Inline graphic from DNA sequence data. Watterson (1975) showed that, under the infinite-site mutation model, the product of mutation rate and expected total branch length of the genealogy of a sample of sequences gives the expected number of segregating sites of the sample. From this relationship, he derived the estimator Inline graphic , where OS is the observed number of segregating sites in a sample of n sequences. Tajima's estimator, θ_T, is given by the average number of nucleotide differences between two sequences (Tajima 1983). It is well known that both estimators are unbiased, but θ_W is generally more efficient than θ_T because it has a smaller variance, and the difference increases with sample size (Li 1997; Wang 2005). In the special case of a single Wright–Fisher population (say, parental population 1), the admixture model has just one parameter (θ) and my admixture estimator (7) reduces to θ_W, as is expected. The previous molecular estimator of admixture (Bertorelle and Excoffier 1998) uses the mean number of nucleotide differences between pairs of sequences as information and is thus quite similar in this respect to θ_T.

Some assumptions made in deriving the current estimator can be relaxed without affecting much of the validity of the estimator. Although the current estimator assumes diploid nuclear markers, it applies to maternally (mtDNA) or paternally (Y chromosome) inherited markers as well. The only difference is in the explanation of the parameter θ that corresponds to different effective sizes. It is assumed that there is no recombination at a locus. However, I use means rather than variances of the number of segregating sites as data in the estimator, and therefore it should apply to loci with recombination. Like Watterson's estimator, the current admixture estimator should actually give better estimates when there is recombination, although the estimated uncertainties might become too exaggerative. The current estimator also assumes two parental populations contributing to the admixture. It is straightforward to extend the method to the case of three or more parental populations. However, the computational burden increases very rapidly with the number of parental populations. Even with two parental populations as assumed in this study, the estimator's computational load increases so rapidly with sample sizes that it can cope only with samples of a few hundred sequences on an ordinary PC. Further refinements of the computational algorithms are necessary before the estimator is extended to more complicated situations such as three or more parental populations.

There is also room for methodological improvements of the current estimator. In (7), the expected numbers of segregating sites of the seven composite samples are obviously nonindependent, because each original sample of sequences appears in four of the seven composite samples. Ideally, their variance and covariance structure should be incorporated into a general least-squares framework to obtain estimates of the seven parameters. However, it is extremely difficult to derive this variance and covariance matrix analytically, and computation of the matrix numerically by simulations is too CPU demanding to be realistic. Although (7) ignored this variance and covariance structure, it should provide unbiased estimates as verified by simulations. The estimator's precision may be improved by the proper weighting based on the variance and covariance matrix.

A software package, Molecular Estimator of Admixture (MEAdmix), computing the admixture estimator and finding the confidence intervals by parametric bootstrapping, is available for free download from http://www.zoo.cam.ac.uk/ioz/software.htm.

Acknowledgments

I thank David C. McLean for sending me the human mtDNA sequence data that were analyzed by my new admixture estimator and Laurent Excoffier, Brigitte Pakendorf, Bruce Walsh, and two anonymous referees for critical reading and constructive comments on earlier versions of this manuscript.

APPENDIX A: THE EXPECTED TOTAL BRANCH LENGTH OF A GENEALOGY

Suppose i genes at a locus are sampled randomly from a Wright–Fisher population of N diploid individuals. Looking backward in time, the current time when the sample was taken is designated as generation zero and the time T generations ago is referred to as generation T. As T → ∞, these i genes will coalesce into their MRCA, and the ETBLG is Inline graphic generations (e.g., Hudson 1990). When T is a fixed definite number, however, the MRCA may or may not be found in the interval of [0, T] and j (1 ≤ j ≤ i) lineages may be left extant at generation T. The probability of j distinct lineages extant at time T, , is given by (2) (Tavaré 1984). Given i, j, N, and T, the expected time, ET_m, during which there are m (j ≤ m ≤ i) distinct lineages in a genealogy is derived as follows:

ET_m for m = i: Conditional on i, j < i, N, and T, the probability of the time interval (in generations), T_i, during which there are i distinct lineages is
ET_i is thus obtained by the integration
which, after some algebra, simplifies to
(A1)
where .
ET_m for j < m < i: Similarly, the expected time (in generations) during which there are m distinct lineages (j < m < i), given i, j < i − 1, N, and T, is obtained by integration:
It can be simplified to
(A2)
where when r = m and otherwise, and .
ET_m for m = j: The expected time during which there are m = j distinct lineages, given i, j ≤ i, N, and T, is obtained by integration,
which leads to
(A3)
where when k = j and otherwise, and . When no coalescent events occur in the interval of [0, T] (i.e., j = i), (A3) reduces to , as is expected.

The ETBLG in the interval [0, T] can be calculated using (A1–A3), in two separate cases. In the first case, the part of the genealogy after the MRCA is found, branch length T₁, is irrelevant and excluded. In the second case, T₁ is included in the ETBLG. Cases 2 and 1 apply when the MRCA lineage is and is not to be included in another genealogy involving other genes formed after T. The ETBLG conditional on i, j, N, T is

(A4)

if T₁ is excluded and is

(A5)

if T₁ is included.

Considering all possible values of j given i, N, and T, the ETBLG in case 1 is

which, after some algebra, reduces to (1) in the text. As T → ∞, (1) further reduces to Inline graphic (e.g., Hudson 1990), as is expected. Similarly, the ETBLG in case 2 is

which reduces to (5) in the text.

APPENDIX B: THE ETBLG OF SAMPLES 4–7

The expected total branch length of the genealogy (ETBLG) for sample 4 can be derived, using the approach adopted in deriving (6) in the text, as

(B1)

When n₁ = n₂ = 1 and N_k = N (k = 0, 1, 2), (B1) reduces to Inline graphic , which is twice the expected coalescent time between a sequence from parental population 1 and a sequence from parental population 2.

The ETBLG for sample 5 is

(B2)

where Inline graphic and . As is expected, (B2) reduces to , twice the average coalescent time between a sequence from parental population 1 and a sequence from the admixed population, when n₁ = n_h = 1 and N_k = N (k = 0, 1, 2, h). For sample 6, Δ₆ is calculated by the right side of (B2) by exchanging Inline graphic and .

The ETBLG of sample 7 is

(B3)

where Inline graphic and .

References

Beaumont, M. A., 2003. Conservation genetics, pp. 775–780 in Handbook of Statistical Genetics, Ed. 2, edited by D. J. Balding, M. Bishop and C. Cannings. John Wiley & Sons, Chichester, England.
Beaumont, M. A., W. Zhang and D. J. Balding, 2002. Approximate Bayesian computation in population genetics. Genetics 162: 2025–2035. [DOI] [PMC free article] [PubMed] [Google Scholar]
Bertorelle, G., and L. Excoffier, 1998. Inferring admixture proportions from molecular data. Mol. Biol. Evol. 15: 1298–1311. [DOI] [PubMed] [Google Scholar]
Cavalli-Sforza, L. L., and W. F. Bodmer, 1971. The Genetics of Human Populations. W. H. Freeman, San Francisco.
Chakraborty, R., 1986. Gene admixture in human populations: models and predictions. Yearb. Phys. Anthropol. 29: 1–43. [Google Scholar]
Chakraborty, R., and K. M. Weiss, 1986. Frequencies of complex diseases in hybrid populations. Am. J. Phys. Anthropol. 70: 489–503. [DOI] [PubMed] [Google Scholar]
Chakraborty, R., and K. M. Weiss, 1988. Admixture as a toll for finding linked genes and detecting that difference from allelic association between loci. Proc. Natl. Acad. Sci. USA 85: 9119–9123. [DOI] [PMC free article] [PubMed] [Google Scholar]
Chakraborty, R., M. I. Kamboh, M. Nwankwo and R. E. Ferrell, 1992. Caucasian genes in American blacks: new data. Am. J. Hum. Genet. 50: 145–155. [PMC free article] [PubMed] [Google Scholar]
Chikhi, L., M. W. Bruford and M. A. Beaumont, 2001. Estimation of admixture proportions: a likelihood-based approach using Markov chain Monte Carlo. Genetics 158: 1347–1362. [DOI] [PMC free article] [PubMed] [Google Scholar]
Chikhi, L., R. A. Nichols, G. Barbujani and M. A. Beaumont, 2002. Y genetic data support the Neolithic demic diffusion model. Proc. Natl. Acad. Sci. USA 99: 11008–11013. [DOI] [PMC free article] [PubMed] [Google Scholar]
Choisy, M., P. Franck and J. M. Cornuet, 2004. Estimating admixture proportions with microsatellites: comparison of methods based on simulated data. Mol. Ecol. 13: 955–968. [DOI] [PubMed] [Google Scholar]
Dupanloup, I., and G. Bertorelle, 2001. Inferring admixture proportions from molecular data: extension to any number of parental populations. Mol. Biol. Evol. 18: 672–675. [DOI] [PubMed] [Google Scholar]
Elston, R. C., 1971. The estimation of admixture in racial hybrids. Ann. Hum. Genet. 35: 9–17. [DOI] [PubMed] [Google Scholar]
Excoffier, L., A. Estoup and J. M. Cornuet, 2005. Bayesian analysis of an admixture model with mutations and arbitrarily linked markers. Genetics 169: 1727–1738. [DOI] [PMC free article] [PubMed] [Google Scholar]
Glass, B., and C. C. Li, 1953. The dynamics of racial intermixture—an analysis based on the American Negro. Am. J. Hum. Genet. 5: 1–19. [PMC free article] [PubMed] [Google Scholar]
Griffiths, R. C., and S. Tavaré, 1994. Simulating probability distributions in the coalescent. Theor. Popul. Biol. 46: 131–159. [Google Scholar]
Hudson, R. R., 1990. Gene genealogies and the coalescent process, pp. 1–44 in Oxford Surveys in Evolutionary Biology, edited by D. J. Futuyma and J. D. Antonovics. Oxford University Press, New York.
Kimura, M., 1969. The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics 61: 893–903. [DOI] [PMC free article] [PubMed] [Google Scholar]
Li, W. H., 1997. Molecular Evolution. Sinauer Associates, Sunderland, MA.
Long, J. C., 1991. The genetic structure of admixed populations. Genetics 127: 417–428. [DOI] [PMC free article] [PubMed] [Google Scholar]
Marjoram, P., J. Molitor, V. Plagnol and S. Tavare, 2003. Markov chain Monte Carlo without likelihoods. Proc. Natl. Acad. Sci. USA 100: 15324–15328. [DOI] [PMC free article] [PubMed] [Google Scholar]
McLean, D. C., I. Spruill, S. Gevao, E. Y. S. Morrison, O. S. Bernard et al., 2003. Three novel mtDNA restriction site polymorphisms allow exploration of population affinities of African Americans. Hum. Biol. 75: 147–161. [DOI] [PubMed] [Google Scholar]
O'Ryan, C., E. H. Harley, M. W. Bruford, M. A. Beaumont, R. K. Wayne et al., 1998. Microsatellite analysis of genetic diversity in fragmented South African buffalo populations. Anim. Conserv. 1: 85–94. [Google Scholar]
Pakendorf, B., and M. Stoneking, 2005. Mitochondrial DNA and human evolution. Annu. Rev. Genomics Hum. Genet. 6: 165–183. [DOI] [PubMed] [Google Scholar]
Parra, E. J., A. Marcini, J. Akey, J. Martinson, M. A. Batzer et al., 1998. Estimating African American admixture proportions by use of population-specific alleles. Am. J. Hum. Genet. 63: 1839–1851. [DOI] [PMC free article] [PubMed] [Google Scholar]
Parra, E. J., R. A. Kittles, G. Argyropoulos, C. L. Pfaff, K. Hiester et al., 2001. Ancestral proportions and admixture dynamics in geographically defined African Americans living in South Carolina. Am. J. Phys. Anthropol. 114: 18–29. [DOI] [PubMed] [Google Scholar]
Press, W. H., S. A. Teukolsky, W. T. Vetterling and B. P. Flannery, 1996. Numerical Recipes in Fortran 77, Ed. 2. Cambridge University Press, Cambridge, UK.
Quintana-Murci, L., O. Semino, H. J. Bandelt, G. Passarino, K. McElreavey et al., 1999. Genetic evidence of an early exit of Homo sapiens sapiens from Africa through eastern Africa. Nat. Genet. 23: 437–441. [DOI] [PubMed] [Google Scholar]
Roberts, D. F., and R. W. Hiorns, 1965. Methods of analysis of the genetic composition of a hybrid population. Hum. Biol. 37: 38–43. [PubMed] [Google Scholar]
Santos, C., R. Montiel, B. Sierra, C. Bettencourt, E. Fernandez et al., 2005. Understanding differences between phylogenetic and pedigree-derived mtDNA mutation rate: a model using families from the Azores Islands (Portugal). Mol. Biol. Evol. 22: 1490–1505. [DOI] [PubMed] [Google Scholar]
Tajima, F., 1983. Evolutionary relationship of DNA sequences in finite populations. Genetics 105: 437–460. [DOI] [PMC free article] [PubMed] [Google Scholar]
Tavaré, S., 1984. Lines of descent and genealogical processes, and their applications in population genetics models. Theor. Popul. Biol. 26: 119–164. [DOI] [PubMed] [Google Scholar]
Thompson, E. A., 1973. The Icelandic admixture problem. Ann. Hum. Genet. 37: 69–80. [DOI] [PubMed] [Google Scholar]
Wang, J., 2003. Maximum-likelihood estimation of admixture proportions from genetic data. Genetics 164: 747–765. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wang, J., 2005. Estimation of effective population sizes from data on genetic markers. Philos. Trans. R. Soc. B 360: 1395–1409. [DOI] [PMC free article] [PubMed] [Google Scholar]
Watson, E., P. Forster, M. Richards and H. J. Bandelt, 1997. Mitochondrial footprints of human expansions in Africa. Am. J. Hum. Genet. 61: 691–704. [DOI] [PMC free article] [PubMed] [Google Scholar]
Watterson, G. A., 1975. On the number of segregating sites in genetical models without recombination. Theor. Popul. Biol. 7: 256–276. [DOI] [PubMed] [Google Scholar]
Wen, B., H. Li, D. R. Lu, X. F. Song, F. Zhang et al., 2004. Genetic evidence supports demic diffusion of Han culture. Nature 431: 302–305. [DOI] [PubMed] [Google Scholar]

[bib1] Beaumont, M. A., 2003. Conservation genetics, pp. 775–780 in Handbook of Statistical Genetics, Ed. 2, edited by D. J. Balding, M. Bishop and C. Cannings. John Wiley & Sons, Chichester, England.

[bib2] Beaumont, M. A., W. Zhang and D. J. Balding, 2002. Approximate Bayesian computation in population genetics. Genetics 162: 2025–2035. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib3] Bertorelle, G., and L. Excoffier, 1998. Inferring admixture proportions from molecular data. Mol. Biol. Evol. 15: 1298–1311. [DOI] [PubMed] [Google Scholar]

[bib4] Cavalli-Sforza, L. L., and W. F. Bodmer, 1971. The Genetics of Human Populations. W. H. Freeman, San Francisco.

[bib5] Chakraborty, R., 1986. Gene admixture in human populations: models and predictions. Yearb. Phys. Anthropol. 29: 1–43. [Google Scholar]

[bib6] Chakraborty, R., and K. M. Weiss, 1986. Frequencies of complex diseases in hybrid populations. Am. J. Phys. Anthropol. 70: 489–503. [DOI] [PubMed] [Google Scholar]

[bib7] Chakraborty, R., and K. M. Weiss, 1988. Admixture as a toll for finding linked genes and detecting that difference from allelic association between loci. Proc. Natl. Acad. Sci. USA 85: 9119–9123. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib8] Chakraborty, R., M. I. Kamboh, M. Nwankwo and R. E. Ferrell, 1992. Caucasian genes in American blacks: new data. Am. J. Hum. Genet. 50: 145–155. [PMC free article] [PubMed] [Google Scholar]

[bib9] Chikhi, L., M. W. Bruford and M. A. Beaumont, 2001. Estimation of admixture proportions: a likelihood-based approach using Markov chain Monte Carlo. Genetics 158: 1347–1362. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib10] Chikhi, L., R. A. Nichols, G. Barbujani and M. A. Beaumont, 2002. Y genetic data support the Neolithic demic diffusion model. Proc. Natl. Acad. Sci. USA 99: 11008–11013. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib11] Choisy, M., P. Franck and J. M. Cornuet, 2004. Estimating admixture proportions with microsatellites: comparison of methods based on simulated data. Mol. Ecol. 13: 955–968. [DOI] [PubMed] [Google Scholar]

[bib12] Dupanloup, I., and G. Bertorelle, 2001. Inferring admixture proportions from molecular data: extension to any number of parental populations. Mol. Biol. Evol. 18: 672–675. [DOI] [PubMed] [Google Scholar]

[bib13] Elston, R. C., 1971. The estimation of admixture in racial hybrids. Ann. Hum. Genet. 35: 9–17. [DOI] [PubMed] [Google Scholar]

[bib14] Excoffier, L., A. Estoup and J. M. Cornuet, 2005. Bayesian analysis of an admixture model with mutations and arbitrarily linked markers. Genetics 169: 1727–1738. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib15] Glass, B., and C. C. Li, 1953. The dynamics of racial intermixture—an analysis based on the American Negro. Am. J. Hum. Genet. 5: 1–19. [PMC free article] [PubMed] [Google Scholar]

[bib16] Griffiths, R. C., and S. Tavaré, 1994. Simulating probability distributions in the coalescent. Theor. Popul. Biol. 46: 131–159. [Google Scholar]

[bib17] Hudson, R. R., 1990. Gene genealogies and the coalescent process, pp. 1–44 in Oxford Surveys in Evolutionary Biology, edited by D. J. Futuyma and J. D. Antonovics. Oxford University Press, New York.

[bib18] Kimura, M., 1969. The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics 61: 893–903. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib19] Li, W. H., 1997. Molecular Evolution. Sinauer Associates, Sunderland, MA.

[bib20] Long, J. C., 1991. The genetic structure of admixed populations. Genetics 127: 417–428. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib21] Marjoram, P., J. Molitor, V. Plagnol and S. Tavare, 2003. Markov chain Monte Carlo without likelihoods. Proc. Natl. Acad. Sci. USA 100: 15324–15328. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib22] McLean, D. C., I. Spruill, S. Gevao, E. Y. S. Morrison, O. S. Bernard et al., 2003. Three novel mtDNA restriction site polymorphisms allow exploration of population affinities of African Americans. Hum. Biol. 75: 147–161. [DOI] [PubMed] [Google Scholar]

[bib23] O'Ryan, C., E. H. Harley, M. W. Bruford, M. A. Beaumont, R. K. Wayne et al., 1998. Microsatellite analysis of genetic diversity in fragmented South African buffalo populations. Anim. Conserv. 1: 85–94. [Google Scholar]

[bib24] Pakendorf, B., and M. Stoneking, 2005. Mitochondrial DNA and human evolution. Annu. Rev. Genomics Hum. Genet. 6: 165–183. [DOI] [PubMed] [Google Scholar]

[bib25] Parra, E. J., A. Marcini, J. Akey, J. Martinson, M. A. Batzer et al., 1998. Estimating African American admixture proportions by use of population-specific alleles. Am. J. Hum. Genet. 63: 1839–1851. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib26] Parra, E. J., R. A. Kittles, G. Argyropoulos, C. L. Pfaff, K. Hiester et al., 2001. Ancestral proportions and admixture dynamics in geographically defined African Americans living in South Carolina. Am. J. Phys. Anthropol. 114: 18–29. [DOI] [PubMed] [Google Scholar]

[bib27] Press, W. H., S. A. Teukolsky, W. T. Vetterling and B. P. Flannery, 1996. Numerical Recipes in Fortran 77, Ed. 2. Cambridge University Press, Cambridge, UK.

[bib28] Quintana-Murci, L., O. Semino, H. J. Bandelt, G. Passarino, K. McElreavey et al., 1999. Genetic evidence of an early exit of Homo sapiens sapiens from Africa through eastern Africa. Nat. Genet. 23: 437–441. [DOI] [PubMed] [Google Scholar]

[bib29] Roberts, D. F., and R. W. Hiorns, 1965. Methods of analysis of the genetic composition of a hybrid population. Hum. Biol. 37: 38–43. [PubMed] [Google Scholar]

[bib30] Santos, C., R. Montiel, B. Sierra, C. Bettencourt, E. Fernandez et al., 2005. Understanding differences between phylogenetic and pedigree-derived mtDNA mutation rate: a model using families from the Azores Islands (Portugal). Mol. Biol. Evol. 22: 1490–1505. [DOI] [PubMed] [Google Scholar]

[bib31] Tajima, F., 1983. Evolutionary relationship of DNA sequences in finite populations. Genetics 105: 437–460. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib32] Tavaré, S., 1984. Lines of descent and genealogical processes, and their applications in population genetics models. Theor. Popul. Biol. 26: 119–164. [DOI] [PubMed] [Google Scholar]

[bib33] Thompson, E. A., 1973. The Icelandic admixture problem. Ann. Hum. Genet. 37: 69–80. [DOI] [PubMed] [Google Scholar]

[bib34] Wang, J., 2003. Maximum-likelihood estimation of admixture proportions from genetic data. Genetics 164: 747–765. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib35] Wang, J., 2005. Estimation of effective population sizes from data on genetic markers. Philos. Trans. R. Soc. B 360: 1395–1409. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib36] Watson, E., P. Forster, M. Richards and H. J. Bandelt, 1997. Mitochondrial footprints of human expansions in Africa. Am. J. Hum. Genet. 61: 691–704. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib37] Watterson, G. A., 1975. On the number of segregating sites in genetical models without recombination. Theor. Popul. Biol. 7: 256–276. [DOI] [PubMed] [Google Scholar]

[bib38] Wen, B., H. Li, D. R. Lu, X. F. Song, F. Zhang et al., 2004. Genetic evidence supports demic diffusion of Han culture. Nature 431: 302–305. [DOI] [PubMed] [Google Scholar]

PERMALINK

A Coalescent-Based Estimator of Admixture From DNA Sequences

Jinliang Wang

Abstract

METHODS

The admixture model:

Figure 1.—

The mutation model:

Expected numbers of segregating sites:

Expected total branch length of a genealogy:

Expected number of segregating sites:

Estimation of parameters:

Multiple loci:

Simulations:

Analysis of an empirical data set:

RESULTS

Simulations:

TABLE 1.

TABLE 2.

TABLE 3.

Admixture analysis of human populations:

TABLE 4.

DISCUSSION

Acknowledgments

APPENDIX A: THE EXPECTED TOTAL BRANCH LENGTH OF A GENEALOGY

APPENDIX B: THE ETBLG OF SAMPLES 4–7

References

ACTIONS

PERMALINK

RESOURCES

Cite

Add to Collections

PERMALINK

A Coalescent-Based Estimator of Admixture From DNA Sequences

Jinliang Wang

Abstract

METHODS

The admixture model:

Figure 1.—

The mutation model:

Expected numbers of segregating sites:

Expected total branch length of a genealogy:

Expected number of segregating sites:

Estimation of parameters:

Multiple loci:

Simulations:

Analysis of an empirical data set:

RESULTS

Simulations:

TABLE 1.

TABLE 2.

TABLE 3.

Admixture analysis of human populations:

TABLE 4.

DISCUSSION

Acknowledgments

APPENDIX A: THE EXPECTED TOTAL BRANCH LENGTH OF A GENEALOGY

APPENDIX B: THE ETBLG OF SAMPLES 4–7

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases