Abstract
The inference of population divergence times and branching patterns is of fundamental importance in many population genetic analyses. Many methods have been developed for estimating population divergence times, and recently, there has been particular attention towards genome-wide single-nucleotide polymorphisms (SNP) data. However, most SNP data have been affected by an ascertainment bias caused by the SNP selection and discovery protocols. Here, we present a modification of an existing maximum likelihood method that will allow approximately unbiased inferences when ascertainment is based on a set of outgroup populations. We also present a method for estimating trees from the asymmetric dissimilarity measures arising from pairwise divergence time estimation in population genetics. We evaluate the methods by simulations and by applying them to a large SNP data set of seven East Asian populations.
Keywords: ascertainment bias, maximum likelihood, phylogeny, population divergence
Introduction
With the fast advance in high throughput genotyping techniques, large single-nucleotide polymorphisms (SNP) data sets have become available for humans (Conrad et al. 2006; Jakobsson et al. 2008) and other organisms such as Drosophila (Chen et al. 2008), cattle (Decker et al. 2009; Gibbs et al. 2009), horse (Brooks et al. 2010), dog (Vonholdt et al. 2010) and pig (Uimari & Tapio 2011). These data are rich in information about past demographic history and are, therefore, widely used in evolutionary studies (Padhukasahasram et al. 2006; Lohmueller et al. 2009; Yin et al. 2009; Pavlidis et al. 2010), for example in the estimation of the history of population divergence, also called the population phylogeny or population tree. Many populations, including most human populations, cannot be described by only considering population divergence without gene flow (Schaffner et al. 2005; Hey 2010; Wang & Hey 2010). Population trees estimated assuming absence of gene flow among populations may be biased in one way or another when gene flow is truly present. However, even in these cases, the estimation of population trees is a useful abstraction for illustrating the relationship between populations and is the main topic of this study.
Genealogy-based methods have been developed for estimating population phylogenies from sequence data (Rannala & Yang 2003; Hey & Nielsen 2007; Heled & Drummond 2010). These methods calculate the likelihood/posterior probability of observed data by considering the underlying coalescent trees/genealogies that describe the ancestry of the sampled sequences. In the next step, a Markov Chain Monte Carlo (MCMC) approach is implemented to integrate over all possible coalescent trees to obtain estimates of demographic parameters and population phylogeny. A related method by Liu & Pearl (2007) implemented in the program BEST uses a sample of coalescent trees obtained using generic phylogenetic MCMC methods as input and then estimates demographic parameters using a type of importance sampling weighting of the sampled trees. Finally, a method described by Kubatko et al. (2009) implemented in the program STEM treats the sampled genealogies as if they were data and then proceeds to estimate the population history using maximum likelihood.
In general, the genealogy-based methods assume no recombination within each locus and free recombination between adjacent loci. However, in most organisms (vertebrates, insects and most plants), recombination rates and mutation rates are of similar magnitude. As a result, nearby SNPs are in linkage disequilibrium with each other but have typically experienced a few recombination events in their genealogical history. Therefore, for most organisms, it might not be appropriate to assume no recombination within loci containing more than one SNP. The evolutionary history at such loci is better described by an ancestral recombination graph (ARG) than by a simple coalescent tree. Unfortunately, full analysis of the ARG is usually not tractable for most problems and certainly not for large SNP data sets. Therefore, many studies on large SNP data sets use composite likelihood methods that treat sites as if they were independent (Garrigan 2009; Gutenkunst et al. 2009; Naduvilezhath et al. 2011). Such composite likelihood estimators often have desirable statistical properties such as consistency (Wiuf 2006).
Analyses of genome-wide SNP data are associated with additional challenges. Helyar et al. (2011) recently reviewed several common issues in analysing SNP data, one of which is the effect of ascertainment bias. Many studies have pointed out that SNP data are often subject to an ascertainment bias (Kuhner et al. 2000; Wakeley et al. 2001; Akey et al. 2003; Nielsen & Signorovitch 2003; Nielsen et al. 2004). This bias arises from the SNP discovery process. Single-nucleotide polymorphisms used in genotyping are generally discovered by resequencing samples in a small discovery panel. Single-nucleotide polymorphisms with high minor allele frequency (MAF) are more likely to appear polymorphic in the panel, and hence to be discovered, than those with low MAF. This introduces a bias towards common alleles. In addition, the samples used for discovering SNPs may not be representative of the samples that are later genotyped, leading to a further bias of the Site Frequency Spectrum (SFS) (Nielsen 2004; Albrechtsen et al. 2010). This bias in the distribution of allele frequencies will also translate into errors and biases when estimating population history and other demographic parameters (Schlotterer & Harr 2002; Marth et al. 2004; Morin et al. 2004; Novembre & Rosenblum 2007; Storz & Kelly 2008; Guillot & Foll 2009; Chen et al. 2010; Moragues et al. 2010). The magnitude and direction of the bias depend on the SNP discovery strategy, the number of chromosomes included in the discovery panel and the demographic history of the populations (Akey et al. 2003; Helyar et al. 2011).
Several maximum likelihood methods have been developed for estimating population branching patterns and divergence times from the joint SFS for multiple populations using composite likelihood methods (Nielsen et al. 1998; RoyChoudhury et al. 2008; Bryant et al. 2010). In the original implementation by Nielsen et al. (1998), it is assumed that no mutation occurs after population divergence and that genetic drift can be modelled by a neutral coalescent model (Kingman 1982). The likelihood function is calculated by first conditioning on the allelic configuration of the ancestral lineages at the internodes of the population tree and then summing over all possible ancestral configurations. The (composite) maximum likelihood estimate is then found by maximizing over all population divergence times and topologies of the population tree. One obvious drawback of this method is that the computational time increases dramatically as the number of populations increases. To speed up the likelihood calculation, RoyChoudhury et al. (2008) devised a two-stage pruning algorithm that enables efficient summation and optimization. Bryant et al. (2010) developed a similar method, which also takes into account the effect of mutation. Recently, Gutenkunst et al. (2009) implemented a diffusion approximation method for demographic history inference. Their method accommodates a parameter-rich model that incorporates mutation, population size change, selection, migration and admixture.
An important assumption for these methods is that the ancestral population has reached mutation-drift equilibrium. The ancestral SFS can then be calculated using appropriate mutation models (Wright 1931; Kimura & Crow 1964; Kimura 1969). However, in most cases the assumption of mutation-drift equilibrium in the ancestral population is probably not realistic. Moreover, as SNP data are often subjected to an ascertainment bias, the frequency spectrum in the ancestral population may deviate from the expectation. Some methods have implemented an ascertainment correction procedure to account for the ascertainment bias (RoyChoudhury et al. 2008; Bryant et al. 2010). However, such corrections require knowledge about demographic history and SNP selection scheme and are difficult to generalize among studies (Albrechtsen et al. 2010).
In this study, we modify the method described by Nielsen et al. (1998) using joint estimation of ancestral SFS and divergence time. We demonstrate that, when SNPs are discovered in one or more outgroup populations, our method provides accurate estimates of population divergence times from data with ascertainment bias and does not depend on the demography of the ancestral populations. We also present an algorithm for estimating a rooted population tree from the estimated pairwise divergence times.
Method and materials
Model
We consider here a model (Fig. 1) where two populations (Pop1 and Pop2) diverged from the ancestral population t generations ago, as in the methods described by Nielsen et al. (1998). Explanations of the symbols used in Fig. 1 are given in Table 1, with subscripts indicating population identity. We assume a standard neutral coalescent process and that loci are di-allelic. Throughout, we use superscript 1 or 2 to label the ancestral and derived alleles, respectively. The data at each SNP site are given by the counts of alleles, $\mathbf{n}_1 = (n_1^1, n_1^2)$ and $\mathbf{n}_2 = (n_2^1, n_2^2)$, from population 1 and 2, with sample sizes $n_1$ and $n_2$, respectively.
Table 1. Notation.
Symbol (i, j = 1 or 2) | Definition |
---|---|
$T_i$ | Scaled divergence time, $T_i = t/2N_i$ |
$n_i^j$ | Number of jth allele in ith population |
$\mathbf{n}_i = (n_i^1, n_i^2)$ | Allele count of samples from ith population |
$n_i$ | Number of samples from ith population, $n_i = n_i^1 + n_i^2$ |
$n$ | Total number of samples, $n = n_1 + n_2$ |
$r_i^j$ | Number of jth allele ancestral to ith population |
$\mathbf{r}_i = (r_i^1, r_i^2)$ | Allele count of lineages ancestral to ith population |
$r_i$ | Number of lineages ancestral to ith population, $r_i = r_i^1 + r_i^2$ |
$r^j$ | Number of jth allele ancestral to all samples, $r^j = r_1^j + r_2^j$ |
$\mathbf{r} = (r^1, r^2)$ | Allele count of all ancestral lineages |
$r$ | Number of all ancestral lineages, $r = r^1 + r^2 = r_1 + r_2$ |
A central assumption is that all SNPs represented in the sample are caused by mutations arising in the ancestral population before the time of population divergence. This assumption is reasonable if T = t/2N is small. It is also justified if the SNPs analysed have been ascertained in an outgroup set of populations that have shared no gene flow with the ingroup populations analysed. We are then interested in estimating the two parameters T1 and T2. There are two parameters to estimate as the effective population sizes may differ between the two populations. The resulting estimates of divergence times are, therefore, not symmetric. The likelihood function for a single SNP site can then be written as
(eqn 1) |
Here, $\mathbf{r}_1 = (r_1^1, r_1^2)$ and $\mathbf{r}_2 = (r_2^1, r_2^2)$ denote the allele counts in the ancestral lineages of the two populations, respectively, at the time of divergence, and $r^1$ and $r^2$ denote the total number of alleles of type 1 and 2, respectively, in the ancestral lineages. $r_1$ and $r_2$ are the numbers of ancestral lineages at the time of divergence in the two populations. The sum in eqn (1) is over all values of $\mathbf{r}_1$ and $\mathbf{r}_2$ and is, therefore, implicitly also a sum over all supported values of $r_1$ and $r_2$. The probability of observing the current samples, conditional on the configuration in the ancestral populations, is (Slatkin 1996):
(eqn 2) |
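For illustration, a probability of this kind can be evaluated with a Pólya-urn argument: forward in time, each new lineage copies a uniformly chosen existing lineage. The minimal Python sketch below assumes the resulting closed form $\binom{n_i^1-1}{r_i^1-1}\binom{n_i^2-1}{r_i^2-1}/\binom{n_i-1}{r_i-1}$; this form is an assumption derived from the urn model rather than a transcription of eqn (2), and the function name `polya_prob` is hypothetical.

```python
from math import comb

def polya_prob(n1, n2, r1, r2):
    """Pr(sample holds n1, n2 copies of alleles 1, 2 | ancestral lineages held r1, r2).

    ASSUMPTION: Polya-urn closed form, not copied from the paper's eqn (2).
    """
    # Without new mutation an allele can neither appear from nothing nor be lost,
    # and the number of copies can only grow forward in time.
    for n_j, r_j in ((n1, r1), (n2, r2)):
        if r_j > n_j or (r_j == 0) != (n_j == 0):
            return 0.0
    if r1 == 0 or r2 == 0:
        return 1.0  # only one allele is present, so the outcome is forced
    n, r = n1 + n2, r1 + r2
    return comb(n1 - 1, r1 - 1) * comb(n2 - 1, r2 - 1) / comb(n - 1, r - 1)
```

With one ancestral lineage of each type and four sampled chromosomes, the three possible sample configurations are equally likely under this form, which is the familiar uniform property of the urn.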
In addition, assuming that the probability a gene-copy ends up in one or the other population does not depend on allelic state
(eqn 3) |
Notice that this assumption will be violated if SNPs were discovered in a population sharing a most recent common ancestral population with one of the ingroup populations more recently than the divergence time of the two ingroup populations. For instance, if the ascertainment population is a sister-population to population one, a rare allele is much more likely to occur in population one than in population two.
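If allelic state does not influence which population a gene copy ends up in, the split of the ancestral alleles between the two populations is a draw without replacement. The sketch below assumes that eqn (3) therefore has the usual hypergeometric form; the function name `hypergeom_split` and its argument order are hypothetical.

```python
from math import comb

def hypergeom_split(r11, r12, tot1, tot2):
    """Probability that the lineages ancestral to population 1 carry
    r11 copies of allele 1 and r12 copies of allele 2, given that the
    pooled ancestral lineages carry tot1 and tot2 copies in total.

    ASSUMPTION: hypergeometric form, i.e. lineages are assigned to
    populations without regard to allelic state.
    """
    r_1 = r11 + r12  # number of lineages ancestral to population 1
    # math.comb returns 0 when k > n, so impossible splits get probability 0
    return comb(tot1, r11) * comb(tot2, r12) / comb(tot1 + tot2, r_1)
```

A sister-population ascertainment scheme of the kind described above would break this symmetry, in which case a state-dependent assignment probability would be needed instead.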
The probability of having r1 and r2 ancestral lineages at the time of divergence is (Tavare 1984):
(eqn 4) |
where a(k) = a (a + 1)…(a + k-1), and a[k] = a (a-1)…(a-k + 1).
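As a hedged illustration, the distribution of the number of ancestral lineages under the standard coalescent (Tavaré 1984) can be computed directly from the rising and falling factorials defined above. The sketch assumes time is measured in the scaled units of Table 1 and that eqn (4) is the product of this one-population formula over the two populations; the function names are hypothetical.

```python
from math import exp, factorial

def rising(a, k):
    """a(k) = a (a + 1) ... (a + k - 1)."""
    out = 1
    for i in range(k):
        out *= a + i
    return out

def falling(a, k):
    """a[k] = a (a - 1) ... (a - k + 1)."""
    out = 1
    for i in range(k):
        out *= a - i
    return out

def lineage_prob(r, n, T):
    """Pr(r ancestral lineages remain at scaled time T | n sampled lineages)."""
    if not 1 <= r <= n:
        return 0.0
    total = 0.0
    for k in range(r, n + 1):
        total += (exp(-k * (k - 1) * T / 2) * (2 * k - 1) * (-1) ** (k - r)
                  * rising(r, k - 1) * falling(n, k)
                  / (factorial(r) * factorial(k - r) * rising(n, k)))
    return total
```

For two sampled lineages the formula reduces to the familiar result that both lineages survive to time T with probability exp(-T). The alternating sum can lose precision for large n, so a production implementation would likely need a more stable evaluation.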
We can now calculate all terms in Equation (1) except $\Pr(r^1, r^2 \mid r_1, r_2)$. The modification we introduce here is to fully parameterize the ancestral SFS, allowing n parameters, $P_n = (p_0, p_1, \ldots, p_{n-1})$, to be jointly estimated from data from multiple loci. Notice that $p_n = 1 - p_0 - p_1 - \cdots - p_{n-1}$. We then obtain
(eqn 5) |
We consider here $P_n$ to be a parameter to be estimated and, therefore, we redefine the LHS of Equation (1) as $L(T_1, T_2, P_n \mid \mathbf{n}_1, \mathbf{n}_2)$. We refer to this as the fully parameterized method. The advantage of this parameterization is that no assumptions about the ancestral allele frequencies will affect the estimation of $T_1$ and $T_2$. We can take the product of this function over all SNPs and maximize it jointly for the parameters $T_1$, $T_2$ and $P_n$, using the BFGS algorithm (Press et al. 1992). If SNPs are located so far apart from each other that they are independent, the estimate is a maximum likelihood estimate. If not, it can be thought of as a composite maximum likelihood estimate.
To compare with the fully parameterized method, we consider two additional models. In the first model, we set the ancestral frequency spectrum to equal the stationary distribution of the Infinite-Sites model at equilibrium. We refer to this method as the IS-based method. In a second comparison, we assume the ancestral frequency spectrum follows a Beta-Binomial distribution:
where B (…) is the beta function and obtain joint maximum likelihood estimates (MLEs) of the divergence times and the parameters of the underlying Beta distribution (α and β). We refer to this method as the Beta-binomial method. The Beta distribution provides some flexibility (but less than what the fully parameterized method offers) in modelling the ancestral SFS. We estimate the divergence time using both the IS-based method and the Beta-binomial method as well, and compare the results with those estimated by the fully parameterized method.
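To make the Beta-binomial comparison concrete, the sketch below evaluates the textbook Beta-binomial probability $\binom{r}{k} B(k+\alpha, r-k+\beta)/B(\alpha, \beta)$ for k copies of one allele among r lineages. Whether this indexing (k and r) matches the exact parameterization used in the paper is an assumption, and the function names are hypothetical.

```python
from math import comb, exp, lgamma

def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binomial_pmf(k, r, alpha, beta):
    """Textbook Beta-binomial probability of k copies among r lineages.

    ASSUMPTION: k indexes the count of one allele in r ancestral lineages;
    the paper's own indexing may differ.
    """
    return comb(r, k) * exp(log_beta(k + alpha, r - k + beta) - log_beta(alpha, beta))
```

With alpha = beta = 1 the spectrum is uniform over k = 0, ..., r, while small alpha with larger beta shifts mass towards rare alleles, which is the kind of flexibility the two-parameter model offers.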
Simulations
To test the performance of our method, we simulated data using a standard neutral coalescent process for diverging populations. We assigned two outgroup populations as ascertainment populations. These two populations diverged from the focal populations (for which we are interested in estimating the divergence time) at 40 000 (tA) and 100 000 (tB) years (2000 and 5000 generations, respectively, assuming a generation time of 20 years), which roughly corresponds to the divergence time between Asians and Europeans, and Asians and Africans, respectively. The reason for choosing these times is that we will apply our method to HGDP data and the ascertainment panel of the HGDP is believed to largely consist of Europeans and Africans. Five sequences were simulated from each ascertainment population. These ten samples were pooled together as the SNP discovery panel. Mutations were simulated using the Infinite-Sites model with a uniform mutation rate of $10^{-6}$ mutations per locus per year (corresponding to $10^{-9}$ mutations per site per year and a locus length of 1000 bp). Sites showing no polymorphism in the panel were discarded. Data were first simulated as haplotypes and SNPs were then extracted from the haplotypes. As in real data, the SNPs simulated from this process were, therefore, not completely unlinked.
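The ascertainment step described above can be sketched as a simple filter: a site is retained only if it is polymorphic among the discovery-panel chromosomes. The data layout (one tuple of 0/1 alleles per site, one entry per chromosome) and the function name `ascertain` are assumptions made for illustration and do not describe the simulation software actually used.

```python
def ascertain(sites, panel_idx):
    """Keep only sites that are polymorphic in the discovery panel.

    sites     : list of tuples of 0/1 alleles, one entry per chromosome
    panel_idx : indices of the chromosomes that form the discovery panel
    (Both the layout and the name are hypothetical.)
    """
    kept = []
    for site in sites:
        derived = sum(site[i] for i in panel_idx)
        # Polymorphic in the panel: neither allele is fixed among panel chromosomes
        if 0 < derived < len(panel_idx):
            kept.append(site)
    return kept
```

Sites that vary only in the focal populations are dropped by this filter, which is why rare focal-population alleles are under-represented in the ascertained data.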
We simulated SNP data using different population sizes, divergence times and migration rates. For a combination of parameters, we simulated 1000 data sets. Each of these data sets includes 100 000 SNPs. Ten samples were simulated from each focal population. We also varied the numbers of SNPs and samples, to test their influence on the performance of our method.
Estimation of trees for asymmetric dissimilarity measures
Notice that the divergence time estimates provided by this method, and other similar methods are asymmetric; the scaled divergence time from population i to the common ancestral population of i and j, T̂ij, is not in general equal to the scaled divergence time from population j to the common ancestral population of i and j, T̂ji. The two divergence times are asymmetric because they are scaled with the effective population sizes and the effective population sizes may differ between the populations. As this will be true both for this method and for most other pairwise methods used in population genetics, it is worthwhile to consider how to estimate population trees from such measures. It may be reasonable to require such an algorithm to correctly estimate the tree with correctly estimated additive dissimilarity measures. For example, simply adding T̂ij and T̂ji together to form a distance and then applying the Neighbor-joining algorithm (Saitou & Nei 1987) will achieve this. However, we may also reasonably require the algorithm to take advantage of the information in the availability of two asymmetric measures. As the variance in the estimated distance typically increases with the mean, one might reasonably expect that considerable accuracy can be gained by using methods that take advantage of the information from both of the two dissimilarity measures, if the dissimilarity measures are highly asymmetric. In the following, we present one simple algorithm that has these properties:
Initialization
Define C to be the set of leaf nodes, one for each given population, and put L = C.
Iteration
Pick a pair i, j in L for which , is minimal.
Define a new node k and set T̂kl and T̂lk for all leaf nodes l (l not equal to i or j) in L as and , respectively.
Add k to L with edges of lengths T̂ij and T̂ji joining k to i and j, respectively.
Remove i and j from L.
Termination
When only one cluster remains.
Notice that the estimated tree is rooted. Also notice the similarity between this algorithm and several other algorithms such as the neighbor-joining algorithm.
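The steps above can be sketched in Python under two explicitly labelled assumptions, because the pair-selection criterion and the update formulas are not legible in the text as given. The sketch assumes (i) the pair minimizing T̂ij + T̂ji is joined, and (ii) the new measures follow from additivity, averaged over the two children: T̂kl = [(T̂il - T̂ij) + (T̂jl - T̂ji)]/2 and T̂lk = (T̂li + T̂lj)/2. The function name `asymmetric_tree` and the dictionary layout are hypothetical.

```python
def asymmetric_tree(T, names):
    """Build a rooted tree from asymmetric dissimilarities.

    T     : dict mapping (a, b) to the measure from a up to the
            common ancestor of a and b
    names : list of leaf labels
    Returns (root, edges) with edges as (parent, child, length).
    """
    T = dict(T)
    active = list(names)
    edges = []
    while len(active) > 1:
        # ASSUMPTION (i): join the pair with the smallest T[i,j] + T[j,i]
        pairs = ((a, b) for x, a in enumerate(active) for b in active[x + 1:])
        i, j = min(pairs, key=lambda p: T[p] + T[p[1], p[0]])
        k = (i, j)  # new internal node, labelled by its two children
        for l in active:
            if l in (i, j):
                continue
            # ASSUMPTION (ii): additive update, averaged over both children
            T[k, l] = ((T[i, l] - T[i, j]) + (T[j, l] - T[j, i])) / 2
            T[l, k] = (T[l, i] + T[l, j]) / 2
        # Edge lengths are the two asymmetric measures themselves
        edges += [(k, i, T[i, j]), (k, j, T[j, i])]
        active = [x for x in active if x not in (i, j)] + [k]
    return active[0], edges
```

On exactly additive input the sketch returns the generating branch lengths, and the last node created is the root. With noisy estimates the averaging in assumption (ii) is only one of several reasonable choices.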
East Asian SNP data
We downloaded the HGDP high-resolution genome-wide SNP data (Jakobsson et al. 2008) and extracted the SNP data of 75 individuals (150 chromosomes) from seven East Asian populations (Fig. S1). Our data include 10 Cambodian, eight Lahu, 10 Yi, 12 Han Chinese, 16 Japanese, 10 Daur and nine Mongolian individuals. We did not include individuals from Yakuts because of the worry of possible admixture between Yakuts and European populations (Fig. 1 in Li et al. 2008). We obtained ancestral information for each SNP from the dbSNP database (Sherry et al. 2001) and removed SNPs with unknown ancestral state. We also removed SNPs with more than two alleles and SNPs with missing data. In the end, we analysed more than 475 000 SNPs.
Results
Simulation studies
We first simulated data using the population history shown in Fig. 1b. We used the same population size (N1 = N2 = 937.5) for both focal populations but varied the divergence times (t) to be 3000, 6000, 10 000, 20 000 or 30 000 years. The corresponding scaled divergence times (T1 = T2) were 0.080, 0.160, 0.267, 0.533 and 0.800, respectively. We used an effective population size of 6250 for ancestral populations and ascertainment populations. Ten samples were simulated from each focal population. We applied our method to the simulated data and collected the MLEs. The average running time on a single data set was approximately 8.5 s on a 2.1 GHz processor. We then calculated the means and standard deviations of the 1000 MLEs (both scaled by the true value of T) and plotted them against the true value of T. We measure the accuracy of the method in terms of the size of the bias and the standard deviation (SD) of the estimates. For the method to perform successfully, we would want both the bias and the SD to be small (bias < 5% and SD < 10%, for example). As the curves for T1 and T2 are highly similar, we only show those for T1 (Fig. 2a). For all five divergence times, our method provides accurate estimates. In addition, we simulated data using unequal values of N1 and N2 and repeated the analysis. We found our estimates to be accurate under these circumstances as well (Fig. S2).
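The scaled divergence times quoted above follow from T = t/2N with t converted to generations, and the accuracy summaries are means and SDs of the MLEs divided by the true value. A minimal sketch of both calculations is given below; the function names are hypothetical, and the use of the sample (n - 1) standard deviation is an assumption, as the text does not say which estimator was used.

```python
from statistics import mean, stdev

def scaled_time(t_years, N, gen_time=20):
    """Scaled divergence time T = t / 2N, with t converted to generations."""
    return (t_years / gen_time) / (2 * N)

def scaled_bias_sd(estimates, true_T):
    """Relative bias and SD of estimates, both scaled by the true value.

    ASSUMPTION: sample standard deviation (n - 1 denominator).
    """
    scaled = [e / true_T for e in estimates]
    return mean(scaled) - 1, stdev(scaled)

# Reproduce the scaled times for N1 = N2 = 937.5
times = [round(scaled_time(t, 937.5), 3) for t in (3000, 6000, 10000, 20000, 30000)]
```

Under the success criterion suggested above, a set of estimates would pass if the first returned value is below 0.05 in absolute size and the second is below 0.10.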
Our method makes inferences by jointly estimating divergence times and the ancestral frequency spectrum (Pn). Including the ancestral frequency spectrum explicitly as a parameter in the inference procedure gives us the ability to account for ascertainment biases introduced when SNPs are detected in outgroup populations. As a comparison, we calculated the scaled average and variance of the 1000 MLEs estimated from the IS-based method and the Beta-binomial method (Fig. 2a). For the IS-based method, we noticed significant biases in most estimates, illustrating that the ascertainment bias does in fact lead to biased estimates of the divergence time if not accounted for. For small values of T, we observed positive biases. As the value of T increases, the bias decreases and finally becomes negative. In contrast, the estimates found by the Beta-binomial method show little bias (comparable with the fully parameterized method) for small values of T. However, a negative bias can be observed as the value of T increases and is comparable with that of the IS-based method.
Next, we examined to what extent the choice of sample size and number of SNPs influences the accuracy of our estimates. First, we kept the number of SNPs at 100 000 but simulated only two samples from each focal population. Instead of conducting extra simulation, we reused the data simulated above by randomly picking two samples from each focal population. Our simulation result shows that all three methods generate estimates with larger variances (Fig. 2b) as sample size decreases. This is expected as the data of smaller sample size contains less information. More importantly, we noticed a difference in the pattern of the biases. While the fully parameterized method is only minimally influenced by sample size, the small sample size introduces additional negative biases to estimates from both the IS-based method and the Beta-binomial method. In summary, we conclude that the fully parameterized method performs no worse than the other two methods for any divergence time and can provide accurate inferences for data with small sample size.
Next, we fixed the sample size at two chromosomes per focal population, but varied the number of SNPs to be 30 000, 10 000, 3000 and 1000. Once again, we reused the data simulated before by randomly sampling SNPs. We plotted the estimates of divergence time against the true values (Fig. S3) and compared them with results from the full data sets (Fig. 2a). We observed that in most cases, the mean scaled MLEs do not change with the number of SNPs. An exception is that the fully parameterized method tends to find estimates with a small negative bias for recent divergence (T = 0.08) from data sets with fewer SNPs. This bias becomes significant (>5%) for data sets of 1000 SNPs. We also observed that the variance in our estimates increases as we reduced the number of SNPs, similar to the pattern we saw when we reduced the sample size. Furthermore, we found that reducing the number of SNPs from 100 000 to 1000 leads to a bigger variance than reducing the sample size from ten to two. Consider the case where the true scaled divergence time is T = 0.08 as an example. Using the fully parameterized method, MLEs estimated from the data set containing two chromosomes per population and 100 000 SNPs have an SD of 12.1%, while those estimated from data sets of ten chromosomes per population with 10 000, 3000 and 1000 SNPs have SDs of approx. 10.6%, 20.0% and 40.3%, respectively.
In our model, we assumed no gene flow among populations. Such an assumption helps to simplify and facilitate the likelihood calculation. However, it also raises questions about the reliability of our method when gene flow is present. While strong gene flow will most likely strongly bias the estimates, we hope our method is less vulnerable to low levels of gene flow. We also want to understand the extent and direction of biases introduced by gene flow. To examine these problems, we simulated three sets of data, each with gene flow between a different pair of populations. We also varied the population migration rate M (= 2Nm). The value of M represents the expected number of migrants per generation.
In the first set of simulations, we included gene flow between the ancestral population (Pop. A) and an ascertainment population (Asc. 1). Such gene flow will change the ancestral SFS, but should have no effect on the postdivergence drift process. Accordingly, we did not observe any significant bias in our estimates (Fig. 3a). Next, we simulated data with gene flow between the two focal populations. The exchange of genetic material between these two populations will reduce the level of divergence. As a result, we found negative biases in the estimates of the divergence time. However, the extent of the bias is a function of the gene flow rate (Fig. 3b) and is not significant (<5%) for M < 0.15. In the last situation, we modelled gene flow between a focal population (Pop. 2), along with its ancestral population (Pop. A), and an ascertainment population (Asc. 1). Using data simulated from this model, we obtained estimates with biases in both directions (Fig. 3c). In fact, the divergence time along the branch of Pop. 1 (T1) is always overestimated, while that along the branch of Pop. 2 (T2) can be either overestimated or underestimated, depending on other demographic parameters (results not shown here). Similarly, the extent of the bias increases as the level of gene flow increases, but it is minimal (<5%) for small values of M (<0.15).
In many population studies, researchers are interested in the divergence process of many populations, in which case the population tree becomes the centre of interest. As optimization over large trees is extremely computationally demanding, methods for estimating trees based on pairwise comparisons become more attractive, and we will focus on one such method. The method is based on first estimating the asymmetric dissimilarity measure Tij for all i ≠ j. We use the name ‘asymmetric dissimilarity measure’ because Tij is not symmetric and is therefore not a distance according to standard mathematical definitions. Tij is also not explicitly proportional to the divergence time between population i and j. Instead, it represents the sum of the lengths of all branches that lead from population i to the most recent ancestral population of population i and population j, scaled by the population sizes associated with each branch. Subsequent to the estimation of Tij and Tji for all pairs of populations, the population tree can be estimated using standard algorithms based on distances formed as functions of the asymmetric dissimilarity measures. We also explored a new algorithm described in the Methods section that explicitly attempts to take advantage of the information regarding asymmetry in the dissimilarity measures.
To examine the properties of these methods, we simulated data using two population histories, each including seven focal populations and two outgroup ascertainment populations. In both cases, we used the same outgroup divergence times and population sizes as in the previous simulations. We set other population sizes and divergence times so that the two population trees were the same as those shown in Fig. 4(a,b), which represent the East Asian population trees estimated using two different algorithms (see section Population tree of seven East Asian populations). In each case, we examined the performance of our method in recovering the assumed ‘true’ East Asian population history. Each data set includes 100 000 SNPs, with five samples from each outgroup population and ten samples from each focal population. For each population history, 1000 data sets were simulated.
The estimates of the asymmetric dissimilarity measures were plotted against the true value in the simulations (Fig. S4). All points fall near the line x = y, indicating that the estimated asymmetric dissimilarity measures are close to the true values and can be relied upon for estimating population trees. We repeated the simulation for different sample sizes (two, four, six and eight) and estimated population trees from the dissimilarity measures, using both our newly developed algorithm and the Neighbor-Joining algorithm. We counted the number of correctly estimated trees for the two algorithms (Fig. 5). We found both algorithms work well at estimating the population tree in Fig. 4b. Even for small sample sizes, the true tree is recovered with >95% chance. However, the success rate in estimating the other population tree (Fig. 4a) appears to be dependent on the sample size. With a sample size of ten, the true tree is recovered 87.9% and 86.1% of the time by our algorithm and the Neighbor-joining algorithm, respectively. When the sample size drops to two, the percentages of the correctly estimated trees also drop to 32.2% and 27.8%, respectively. However, we notice that for all five sample sizes, the algorithm based on asymmetric measures performed slightly better than the Neighbor-joining algorithm.
Population tree of seven East Asian populations
We applied the methods to genome-wide SNP data of 75 individuals from seven East Asian populations (Jakobsson et al. 2008). We estimated the asymmetric dissimilarity measures between these seven populations (Table 2). We then estimated the population tree using the new algorithm. The resulting tree is drawn in Fig. 4a. We noticed that the population from the Indochinese peninsula (Cambodia) forms an outgroup to the other populations. Of the six remaining populations, Mongolian and Daur form a clade A, and the other four populations form another clade B. Lahu diverge first from clade B, followed by Yi, leaving Han Chinese and Japanese closely related to each other. We also noticed that some populations have shorter branches than others, possibly due to their larger effective population sizes. For example, Han Chinese has a relatively short branch, in agreement with the large size of the population. Mongolian also displays a short branch, which leads us to infer a large Mongolian effective population size. On the other hand, Lahu has a very long branch, indicating a possibly small effective population size. In addition, we estimated the unrooted tree using Neighbor-joining (Fig. 4b). The unrooted tree differs from our estimated rooted tree in the placement of the Daur-Mongolian clade, but the two trees show similar lengths for common branches. For comparison, we obtained trees of these seven East Asian populations based on the results of two other studies (Li et al. 2008; Tian et al. 2008). We estimated an unrooted Neighbor-joining tree (Fig. 4c) from the Fst statistics reported by Tian et al. (2008). We also compared with the phylogenetic tree (Fig. 4d; note that branch lengths in this figure are not to scale) estimated by Li et al. (2008) using the CONTML method in the PHYLIP package. Notice the similarity between the two Neighbor-joining trees. Also notice that the unrooted versions of the four trees differ in the placement of the Daur and Mongolian populations.
The differences between these trees might in part be exacerbated by the presence of gene flow between the populations.
Table 2. Estimated pairwise asymmetric dissimilarity measures between seven East Asian populations.
Population | Cambodian | Lahu | Yi | Han | Japanese | Daur | Mongolian |
---|---|---|---|---|---|---|---|
Cambodian | - | 0.000 | 0.002 | 0.002 | 0.004 | 0.014 | 0.018 |
Lahu | 0.047 | - | 0.038 | 0.039 | 0.041 | 0.052 | 0.056 |
Yi | 0.030 | 0.011 | - | 0.007 | 0.010 | 0.018 | 0.020 |
Han | 0.026 | 0.009 | 0.004 | - | 0.002 | 0.012 | 0.012 |
Japanese | 0.038 | 0.020 | 0.016 | 0.012 | - | 0.020 | 0.022 |
Daur | 0.029 | 0.013 | 0.006 | 0.002 | 0.000 | - | 0.007 |
Mongolian | 0.017 | 0.001 | 0.000 | 0.000 | 0.000 | 0.000 | - |
Discussion
In this study, we described a maximum likelihood method for estimating divergence times from SNP data with ascertainment bias. In our model, we assumed that the divergence time between the populations is small or that the polymorphic sites used in the focal populations are all detected in a set of outgroup populations. Given either of these assumptions, we argue that mutations occurring after divergence are of little or no influence compared with genetic drift and can be neglected.
In many cases, SNPs genotyped in large samples are first detected in a small discovery panel in which they appear to be polymorphic. Consequently, rare alleles are less likely to be included than common alleles, introducing an ascertainment bias. Ascertainment bias results in a skew in the SFS (Clark et al. 2005) and leads to biases in the estimation of genetic variation, population structure, population sizes and migration rates, and in population assignment (Morin et al. 2004; Bradbury et al. 2011). Common ways of measuring population divergence from SNP data, such as FST statistics and principal component analysis (PCA), are likely to be vulnerable to such biases (Albrechtsen et al. 2010). For example, Lewandowska-Sabat et al. (2010) genotyped 282 individuals from 31 Arabidopsis thaliana populations and found a high level of population subdivision (FST = 0.85 ± 0.007). However, as the 149 SNPs they genotyped were previously selected to show intermediate global population allele frequencies, their FST values are likely to be overestimated. In another study, Seeb et al. (2011) compared chum salmon collected at 114 locations, ranging from Korea and Japan to Alaska and Northern America, by genotyping 60 SNPs discovered in some early efforts that focused on Western Alaska. They saw an elevated level of diversity in Alaska populations, reflected in both allelic richness and heterozygosity, which may reflect intrinsic differences among salmon populations. But they also pointed out the possibility of this being an artefact of ascertainment bias in the SNP panel.
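The effect of a small discovery panel on the frequency spectrum can be illustrated numerically. The sketch below is not the ascertainment scheme analysed in this study, which assumes discovery in outgroup populations. It assumes, purely for illustration, that a panel of `d` chromosomes is drawn without replacement from the same `n` chromosomes that are later genotyped, and that the unascertained spectrum follows the standard neutral expectation (proportional to 1/i). All function names are hypothetical.

```python
from math import comb


def neutral_sfs(n):
    """Expected proportions of segregating sites with derived-allele count
    i = 1..n-1 under the standard neutral model (proportional to 1/i)."""
    weights = [1.0 / i for i in range(1, n)]
    total = sum(weights)
    return [w / total for w in weights]


def prob_polymorphic_in_panel(i, n, d):
    """Probability that a panel of d chromosomes, sampled without replacement
    from n chromosomes carrying i derived copies, contains both alleles."""
    all_ancestral = comb(n - i, d) / comb(n, d)
    all_derived = comb(i, d) / comb(n, d)
    return 1.0 - all_ancestral - all_derived


def ascertained_sfs(n, d):
    """Neutral SFS reweighted by the probability of discovery in the panel."""
    base = neutral_sfs(n)
    weights = [base[i - 1] * prob_polymorphic_in_panel(i, n, d)
               for i in range(1, n)]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Assumed illustrative sizes: 20 genotyped chromosomes, panel of 4.
    for i, (before, after) in enumerate(zip(neutral_sfs(20),
                                            ascertained_sfs(20, 4)), start=1):
        print(i, round(before, 4), round(after, 4))
```

With `n = 4` and `d = 2`, for example, the share of sites with two derived copies rises from 3/11 to 1/3, whereas the share of singletons drops from 6/11 to 1/2: rare alleles are under-represented after ascertainment.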
We considered the situation where SNPs are discovered in a set of outgroup populations. We demonstrated that, under these conditions, our method can make fast and accurate inference from data that are strongly affected by ascertainment biases. We also demonstrated that neglecting ascertainment bias (IS-based method) will introduce positive biases for small divergence times and negative biases for large divergence times. A possible explanation is that, for small divergence times, the SFS in both extant populations are very similar to that in the ancestral population. In the IS-based methods, the ancestral SFS is fixed to the expectation of a standard neutral model with infinite-sites mutation. When strong ascertainment biases exist, the true ancestral SFS deviates from the expected one, leading to an upward bias in the difference between the present SFS and the ancestral SFS, which in turn results in the overestimation of divergence times. However, for large divergence times, the SFS in the extant populations are more different from the ancestral SFS because of drift. In the presence of ascertainment biases, the ancestral SFS is enriched for sites with medium frequencies. As the divergence time increases, genetic drift acts to reduce the proportion of medium-frequency sites. Therefore, fixing the ancestral SFS to the expectation of the standard neutral model with infinite-sites mutation introduces a downward bias in the difference between the present SFS and the ancestral SFS, and leads to negative biases in divergence time estimates.
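The last step of this argument, that drift depletes intermediate frequencies as the divergence time grows, can be checked with a small exact calculation. The sketch assumes a textbook Wright-Fisher population without mutation; the population size (`two_n = 10` gene copies) and the starting frequency are arbitrary illustrative choices, and the function names are hypothetical rather than part of the program described here.

```python
from math import comb


def wright_fisher_matrix(two_n):
    """P[i][j] = probability of j allele copies in the next generation
    given i copies now (binomial sampling of two_n gene copies)."""
    matrix = []
    for i in range(two_n + 1):
        p = i / two_n
        matrix.append([comb(two_n, j) * p ** j * (1.0 - p) ** (two_n - j)
                       for j in range(two_n + 1)])
    return matrix


def drift(dist, matrix, generations):
    """Propagate a distribution over allele counts through pure drift."""
    size = len(dist)
    for _ in range(generations):
        dist = [sum(dist[i] * matrix[i][j] for i in range(size))
                for j in range(size)]
    return dist


def expected_heterozygosity(dist):
    """E[2p(1-p)], a summary dominated by medium-frequency sites."""
    two_n = len(dist) - 1
    return sum(prob * 2.0 * (k / two_n) * (1.0 - k / two_n)
               for k, prob in enumerate(dist))


def segregating_mass(dist):
    """Probability that the site is still polymorphic (not lost or fixed)."""
    return sum(dist[1:-1])


two_n = 10                      # assumed, deliberately tiny population
start = [0.0] * (two_n + 1)
start[5] = 1.0                  # every site starts at frequency 0.5
P = wright_fisher_matrix(two_n)
after_5 = drift(start, P, 5)
after_20 = drift(start, P, 20)
```

Under this model the expected heterozygosity shrinks by a factor of 1 - 1/`two_n` per generation, so the probability mass at intermediate frequencies declines steadily with time while the mean allele frequency stays unchanged.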
In addition to the IS-based method, we tested a Beta-binomial method that employs a Beta distribution as the ancestral SFS. We noticed that for more recent divergence times, this method is capable of correcting for ascertainment biases caused by SNP discovery in an outgroup sample. However, for large divergence times, the estimates are biased towards smaller values. Moreover, the method does not perform well for data with small sample sizes.
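For reference, the sampling distribution implied by a Beta-distributed allele frequency is the Beta-binomial. The sketch below covers only that building block, namely the probability of observing `k` copies of an allele among `n` sampled chromosomes when the population frequency follows Beta(`a`, `b`). It does not include the drift after divergence that the full method must model, and the parameter names are notation chosen for this illustration only.

```python
from math import exp, lgamma


def log_beta(x, y):
    """Logarithm of the Beta function B(x, y)."""
    return lgamma(x) + lgamma(y) - lgamma(x + y)


def beta_binomial_pmf(k, n, a, b):
    """Pr(k allele copies in n chromosomes) when the population frequency
    is Beta(a, b) distributed and sampling is binomial."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + log_beta(k + a, n - k + b) - log_beta(a, b))


if __name__ == "__main__":
    # Assumed shape parameters; a = b = 1 gives a uniform frequency prior.
    print([round(beta_binomial_pmf(k, 4, 1.0, 1.0), 3) for k in range(5)])
```

With `a = b = 1` every count from 0 to `n` is equally likely, which gives a quick sanity check; shape parameters above 1 shift mass towards intermediate counts, the direction of skew attributed to ascertainment in the text.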
Our method was developed for analysing genome-wide SNP data and may lose power when applied to smaller data sets. We examined the influence of sample size and number of loci on our estimates through simulation. Interestingly, we found that accurate inferences can still be made from a data set of 100 000 SNPs that includes only two chromosomes from each focal population. Such a result suggests that our method could be used for estimating divergence times from single individuals, a promising prospect with the emergence of individual full genome sequencing. However, the accuracy of our method deteriorates as the number of SNPs included in the analysis decreases. If the number of SNPs is reduced to 3000, the standard deviation increases to about 20% of the true value for very small divergence times. This suggests that our method should be applied to large data sets (>10 000 SNPs) when the populations to be studied diverged only recently. However, the standard deviation in the divergence time estimates decreases quickly as the divergence time increases. For example, as the divergence time increases to 0.16 times the effective population size, the standard deviation drops to about 12%. Moreover, we noticed that higher variance in divergence time estimates does not necessarily lead to worse population trees. The accuracy in estimating some population histories (for example, the one in Fig. 4b) is less sensitive to the standard deviation in the divergence time estimate, possibly because of the lack of short internal branches. Combining these observations, we suggest that our method may be applied to data sets with <10 000 SNPs if divergence times are relatively large compared with the effective population sizes, but the results should be taken with some caution.
In this study, we also described a new algorithm for estimating rooted population trees from asymmetric dissimilarity measures. The algorithm uses a distance-based method to reconcile the divergence times estimated from each pair of populations. The algorithm is, therefore, the first population/species distance-based method for estimating species/population history. We compared this algorithm with the Neighbor-Joining method based on simple addition of the dissimilarity measures. The new algorithm generally performs better than the Neighbor-Joining algorithm, suggesting that further research into algorithms for estimating trees from asymmetric dissimilarity measures is warranted. The algorithm we presented shares some advantages with the Neighbor-Joining method, such as provable consistency, but there may be other algorithms with better statistical properties for solving this problem. In addition, the use of Neighbor-Joining might be improved for these applications by using functions other than simple addition for converting asymmetric dissimilarity measures into distances.
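The Neighbor-Joining comparison mentioned here can be sketched as follows. This is a minimal illustration of the baseline only, not of the new rooted-tree algorithm: the two directed dissimilarities of each pair are added to form a symmetric distance, and a plain Saitou and Nei (1987) Neighbor-Joining pass is run on the result. The function names and the four-population input matrix are hypothetical and are not taken from Table 2.

```python
def symmetrize_by_addition(d):
    """Turn asymmetric dissimilarities into distances D[i][j] = d[i][j] + d[j][i]."""
    n = len(d)
    return [[0.0 if i == j else d[i][j] + d[j][i] for j in range(n)]
            for i in range(n)]


def neighbor_joining(names, dist):
    """Neighbor joining; returns joins as (node_a, node_b, length_a, length_b)."""
    nodes = list(names)
    D = [list(row) for row in dist]
    joins = []
    while len(nodes) > 2:
        n = len(nodes)
        r = [sum(row) for row in D]  # net divergence of each node
        # Pick the pair that minimises the Q criterion.
        _, i, j = min(((n - 2) * D[a][b] - r[a] - r[b], a, b)
                      for a in range(n) for b in range(a + 1, n))
        len_i = 0.5 * D[i][j] + (r[i] - r[j]) / (2 * (n - 2))
        len_j = D[i][j] - len_i
        joins.append((nodes[i], nodes[j], len_i, len_j))
        # Distances from the new internal node to every remaining node.
        keep = [k for k in range(n) if k not in (i, j)]
        new_row = [0.5 * (D[i][k] + D[j][k] - D[i][j]) for k in keep]
        D = [[D[a][b] for b in keep] + [new_row[x]] for x, a in enumerate(keep)]
        D.append(new_row + [0.0])
        nodes = [nodes[k] for k in keep] + ["(%s,%s)" % (nodes[i], nodes[j])]
    # The last two nodes are connected by a single branch.
    joins.append((nodes[0], nodes[1], D[0][1], 0.0))
    return joins


# Hypothetical asymmetric dissimilarities for four populations.
example_names = ["P1", "P2", "P3", "P4"]
example_d = [
    [0.000, 0.004, 0.020, 0.022],
    [0.010, 0.000, 0.026, 0.028],
    [0.015, 0.015, 0.000, 0.003],
    [0.030, 0.030, 0.018, 0.000],
]
example_tree = neighbor_joining(example_names,
                                symmetrize_by_addition(example_d))
```

Because addition discards the direction of each dissimilarity, the resulting tree is unrooted and cannot reflect that one population of a pair has drifted more than the other, which is one reason other conversion functions may be worth exploring.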
With the fast advances in new-generation sequencing technologies and high-throughput genotyping platforms, the volume of available genome-wide SNP data is rapidly increasing, in humans and many other organisms. Commercial high-density SNP chips are now available for chicken, cattle, dog, pig, sheep, horse, mouse and maize, all of which include at least 50 000 high-quality SNPs. Genome-wide surveys of SNP variation have also been performed in many plants (McNally et al. 2009; Hohenlohe et al. 2010; Geraldes et al. 2011). We foresee a growing trend of using large, genome-wide data sets for population genetic and phylogenetic analyses. For such data, the methods developed here should be of use.
Acknowledgments
This research was supported by the National Institutes of Health grant (GM078204) to Rasmus Nielsen and Jody Hey.
Footnotes
Y.W. works on developing and implementing statistical methods for estimating demographic history from large-scale genome data. R.N. focuses on statistical and computational aspects of evolutionary theory and genetics.
Data accessibility: The data set of 525 910 single-nucleotide polymorphisms from 485 individuals is available at http://neurogenetics.nia.nih.gov/paperdata/public/. The program described in this study is available from the authors upon request.
Supporting information: Additional supporting information may be found in the online version of this article.
Fig. S1 Geographic locations of the seven East Asian Populations.
Fig. S2 Data were simulated from population structure shown in Fig. 1b with population size (a) N1 = 937.5 and N2 = 3750 and (b) N1 = 937.5 and N2 = 312.5, and different divergence times ranging from 3000 years to 30 000 years.
Fig. S3 Data were simulated from population structure shown in Fig. 1b with population size N1 = N2 = 937.5, and different divergence times ranging from 3000 years to 30 000 years.
Fig. S4 Data were simulated using population tree shown in (a) Fig. 4(a,b).
References
- Akey JM, Zhang K, Xiong M, Jin L. The effect of single nucleotide polymorphism identification strategies on estimates of linkage disequilibrium. Molecular Biology and Evolution. 2003;20:232–242. doi: 10.1093/molbev/msg032.
- Albrechtsen A, Nielsen FC, Nielsen R. Research article: ascertainment biases in SNP chips affect measures of population divergence. Molecular Biology and Evolution. 2010;11:2534–2547. doi: 10.1093/molbev/msq148.
- Bradbury IR, Hubert S, Higgins B, et al. Evaluating SNP ascertainment bias and its impact on population assignment in Atlantic cod, Gadus morhua. Molecular Ecology Resources. 2011;11(Suppl 1):218–225. doi: 10.1111/j.1755-0998.2010.02949.x.
- Brooks SA, Gabreski N, Miller D, et al. Whole-genome SNP association in the horse: identification of a deletion in myosin Va responsible for Lavender Foal Syndrome. PLoS Genetics. 2010;6:e1000909. doi: 10.1371/journal.pgen.1000909.
- Bryant D, Bouchaert R, Rosenberg NA. Inferring species trees directly from SNP and AFLP data: full coalescent analysis without those pesky gene trees. 2010 doi: 10.1093/molbev/mss086. id: arXiv:0910.4193v1 Available at http://arxiv.org/
- Chen D, Ahlford A, Schnorrer F, et al. High-resolution, high-throughput SNP mapping in Drosophila melanogaster. Nature Methods. 2008;5:323–329. doi: 10.1038/nmeth.1191.
- Chen H, Patterson N, Reich D. Population differentiation as a test for selective sweeps. Genome Research. 2010;20:393–402. doi: 10.1101/gr.100545.109.
- Clark AG, Hubisz MJ, Bustamante CD, Williamson SH, Nielsen R. Ascertainment bias in studies of human genome-wide polymorphism. Genome Research. 2005;15:1496–1502. doi: 10.1101/gr.4107905.
- Conrad DF, Jakobsson M, Coop G, et al. A worldwide survey of haplotype variation and linkage disequilibrium in the human genome. Nature Genetics. 2006;38:1251–1260. doi: 10.1038/ng1911.
- Decker JE, Pires JC, Conant GC, et al. Resolving the evolution of extant and extinct ruminants with high-throughput phylogenomics. Proceedings of the National Academy of Sciences of the United States of America. 2009;106:18644–18649. doi: 10.1073/pnas.0904691106.
- Garrigan D. Composite likelihood estimation of demographic parameters. BMC Genetics. 2009;10:72. doi: 10.1186/1471-2156-10-72.
- Geraldes A, Pang J, Thiessen N, et al. SNP discovery in black cottonwood (Populus trichocarpa) by population transcriptome resequencing. Molecular Ecology Resources. 2011;11:81–92. doi: 10.1111/j.1755-0998.2010.02960.x.
- Gibbs RA, Taylor JF, Van Tassell CP, et al. Genome-wide survey of SNP variation uncovers the genetic structure of cattle breeds. Science. 2009;324:528–532. doi: 10.1126/science.1167936.
- Guillot G, Foll M. Correcting for ascertainment bias in the inference of population structure. Bioinformatics. 2009;25:552–554. doi: 10.1093/bioinformatics/btn665.
- Gutenkunst RN, Hernandez RD, Williamson SH, Bustamante CD. Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genetics. 2009;5:e1000695. doi: 10.1371/journal.pgen.1000695.
- Heled J, Drummond AJ. Bayesian inference of species trees from multilocus data. Molecular Biology and Evolution. 2010;27:570–580. doi: 10.1093/molbev/msp274.
- Helyar SJ, Hemmer-Hansen J, Bekkevold D, et al. Application of SNPs for population genetics of nonmodel organisms: new opportunities and challenges. Molecular Ecology Resources. 2011;11(Suppl 1):123–136. doi: 10.1111/j.1755-0998.2010.02943.x.
- Hey J. The Divergence of Chimpanzee species and subspecies as revealed in multipopulation isolation-with-migration analyses. Molecular Biology and Evolution. 2010;27:921–933. doi: 10.1093/molbev/msp298.
- Hey J, Nielsen R. Integration within the Felsenstein equation for improved Markov chain Monte Carlo methods in population genetics. Proceedings of the National Academy of Sciences of the United States of America. 2007;104:2785–2790. doi: 10.1073/pnas.0611164104.
- Hohenlohe PA, Bassham S, Etter PD, et al. Population genomics of parallel adaptation in threespine stickleback using sequenced RAD tags. PLoS Genetics. 2010;6:e1000862. doi: 10.1371/journal.pgen.1000862.
- Jakobsson M, Scholz SW, Scheet P, et al. Genotype, haplotype and copy-number variation in worldwide human populations. Nature. 2008;451:998–1003. doi: 10.1038/nature06742.
- Kimura M. The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics. 1969;61:893–903. doi: 10.1093/genetics/61.4.893.
- Kimura M, Crow JF. Number of alleles that can be maintained in finite population. Genetics. 1964;49:725. doi: 10.1093/genetics/49.4.725.
- Kingman JFC. The coalescent. Stochastic Processes and their Applications. 1982;13:235–248.
- Kubatko LS, Carstens BC, Knowles LL. STEM: species tree estimation using maximum likelihood for gene trees under coalescence. Bioinformatics. 2009;25:971–973. doi: 10.1093/bioinformatics/btp079.
- Kuhner MK, Beerli P, Yamato J, Felsenstein J. Usefulness of single nucleotide polymorphism data for estimating population parameters. Genetics. 2000;156:439–447. doi: 10.1093/genetics/156.1.439.
- Lewandowska-Sabat AM, Fjellheim S, Rognli OA. Extremely low genetic variability and highly structured local populations of Arabidopsis thaliana at higher latitudes. Molecular Ecology. 2010;19:4753–4764. doi: 10.1111/j.1365-294X.2010.04840.x.
- Li JZ, Absher DM, Tang H, et al. Worldwide human relationships inferred from genome-wide patterns of variation. Science. 2008;319:1100–1104. doi: 10.1126/science.1153717.
- Liu L, Pearl DK. Species trees from gene trees: reconstructing Bayesian posterior distributions of a species phylogeny using estimated gene tree distributions. Systematic Biology. 2007;56:504–514. doi: 10.1080/10635150701429982.
- Lohmueller KE, Bustamante CD, Clark AG. Methods for human demographic inference using haplotype patterns from genomewide single-nucleotide polymorphism data. Genetics. 2009;182:217–231. doi: 10.1534/genetics.108.099275.
- Marth GT, Czabarka E, Murvai J, Sherry ST. The allele frequency spectrum in genome-wide human variation data reveals signals of differential demographic history in three large world populations. Genetics. 2004;166:351–372. doi: 10.1534/genetics.166.1.351.
- McNally KL, Childs KL, Bohnert R, et al. Genomewide SNP variation reveals relationships among landraces and modern varieties of rice. Proceedings of the National Academy of Sciences of the United States of America. 2009;106:12273–12278. doi: 10.1073/pnas.0900992106.
- Moragues M, Comadran J, Waugh R, et al. Effects of ascertainment bias and marker number on estimations of barley diversity from high-throughput SNP genotype data. Theoretical and Applied Genetics. 2010;120:1525–1534. doi: 10.1007/s00122-010-1273-1.
- Morin PA, Luikart G, Wayne RK, Grp SW. SNPs in ecology, evolution and conservation. Trends in Ecology & Evolution. 2004;19:208–216.
- Naduvilezhath L, Rose LE, Metzler D. Jaatha: a fast composite-likelihood approach to estimate demographic parameters. Molecular Ecology. 2011;20:2709–2723. doi: 10.1111/j.1365-294X.2011.05131.x.
- Nielsen R. Population genetic analysis of ascertained SNP data. Human Genomics. 2004;1:218–224. doi: 10.1186/1479-7364-1-3-218.
- Nielsen R, Signorovitch J. Correcting for ascertainment biases when analyzing SNP data: applications to the estimation of linkage disequilibrium. Theoretical Population Biology. 2003;63:245–255. doi: 10.1016/s0040-5809(03)00005-4.
- Nielsen R, Mountain JL, Huelsenbeck JP, Slatkin M. Maximum-likelihood estimation of population divergence times and population phylogeny in models without mutation. Evolution. 1998;52:669–677. doi: 10.1111/j.1558-5646.1998.tb03692.x.
- Nielsen R, Hubisz MJ, Clark AG. Reconstituting the frequency spectrum of ascertained single-nucleotide polymorphism data. Genetics. 2004;168:2373–2382. doi: 10.1534/genetics.104.031039.
- Novembre J, Rosenblum EB. Ascertainment bias in spatially structured populations: a case study in the eastern fence lizard. Journal of Heredity. 2007;98:331–336. doi: 10.1093/jhered/esm031.
- Padhukasahasram B, Wall JD, Marjoram P, Nordborg M. Estimating recombination rates from single-nucleotide polymorphisms using summary statistics. Genetics. 2006;174:1517–1528. doi: 10.1534/genetics.106.060723.
- Pavlidis P, Jensen JD, Stephan W. Searching for footprints of positive selection in whole-genome SNP data from nonequilibrium populations. Genetics. 2010;185:907–922. doi: 10.1534/genetics.110.116459.
- Press WH, Teukolsky SA, Vetterling WT, Flannery BP. Numerical Recipes in C, the Art of Scientific Computing. 2nd. Cambridge University Press; Cambridge: 1992.
- Rannala B, Yang Z. Bayes estimation of species divergence times and ancestral population sizes using DNA sequences from multiple loci. Genetics. 2003;164:1645–1656. doi: 10.1093/genetics/164.4.1645.
- RoyChoudhury A, Felsenstein J, Thompson EA. A two-stage pruning algorithm for likelihood computation for a population tree. Genetics. 2008;180:1095–1105. doi: 10.1534/genetics.107.085753.
- Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution. 1987;4:406–425. doi: 10.1093/oxfordjournals.molbev.a040454.
- Schaffner SF, Foo C, Gabriel S, et al. Calibrating a coalescent simulation of human genome sequence variation. Genome Research. 2005;15:1576–1583. doi: 10.1101/gr.3709305.
- Schlotterer C, Harr B. Single nucleotide polymorphisms derived from ancestral populations show no evidence for biased diversity estimates in Drosophila melanogaster. Molecular Ecology. 2002;11:947–950. doi: 10.1046/j.1365-294x.2002.01491.x.
- Seeb LW, Templin WD, Sato S, et al. Single nucleotide polymorphisms across a species' range: implications for conservation studies of Pacific salmon. Molecular Ecology Resources. 2011;11(Suppl 1):195–217. doi: 10.1111/j.1755-0998.2010.02966.x.
- Sherry ST, Ward MH, Kholodov M, et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Research. 2001;29:308–311. doi: 10.1093/nar/29.1.308.
- Slatkin M. Gene genealogies within mutant allelic classes. Genetics. 1996;143:579–587. doi: 10.1093/genetics/143.1.579.
- Storz JF, Kelly JK. Effects of spatially varying selection on nucleotide diversity and linkage disequilibrium: insights from deer mouse globin genes. Genetics. 2008;180:367–379. doi: 10.1534/genetics.108.088732.
- Tavare S. Line-of-descent and genealogical processes, and their applications in population genetics models. Theoretical Population Biology. 1984;26:119–164. doi: 10.1016/0040-5809(84)90027-3.
- Tian C, Kosoy R, Lee A, et al. Analysis of East Asia genetic substructure using genome-wide SNP arrays. PLoS ONE. 2008;3:e3862. doi: 10.1371/journal.pone.0003862.
- Uimari P, Tapio M. Extent of linkage disequilibrium and effective population size in Finnish Landrace and Finnish Yorkshire pig breeds. Journal of Animal Science. 2011;89:609–614. doi: 10.2527/jas.2010-3249.
- Vonholdt BM, Pollinger JP, Lohmueller KE, et al. Genome-wide SNP and haplotype analyses reveal a rich history underlying dog domestication. Nature. 2010;464:898–902. doi: 10.1038/nature08837.
- Wakeley J, Nielsen R, Liu-Cordero SN, Ardlie K. The discovery of single-nucleotide polymorphisms–and inferences about human demographic history. American Journal of Human Genetics. 2001;69:1332–1347. doi: 10.1086/324521.
- Wang Y, Hey J. Estimating divergence parameters with small samples from a large number of loci. Genetics. 2010;184:363–379. doi: 10.1534/genetics.109.110528.
- Wiuf C. Consistency of estimators of population scaled parameters using composite likelihood. Journal of Mathematical Biology. 2006;53:821–841. doi: 10.1007/s00285-006-0031-0.
- Wright S. Evolution in Mendelian populations. Genetics. 1931;16:0097–0159. doi: 10.1093/genetics/16.2.97.
- Yin J, Jordan MI, Song YS. Joint estimation of gene conversion rates and mean conversion tract lengths from population SNP data. Bioinformatics. 2009;25:i231–i239. doi: 10.1093/bioinformatics/btp229.