Skip to main content
Genetics logoLink to Genetics
. 2013 Nov;195(3):979–992. doi: 10.1534/genetics.113.154740

Quantifying Population Genetic Differentiation from Next-Generation Sequencing Data

Matteo Fumagalli *,1, Filipe G Vieira *, Thorfinn Sand Korneliussen †,, Tyler Linderoth *, Emilia Huerta-Sánchez *, Anders Albrechtsen , Rasmus Nielsen *,‡,§
PMCID: PMC3813878  PMID: 23979584

Abstract

Over the past few years, new high-throughput DNA sequencing technologies have dramatically increased speed and reduced sequencing costs. However, the use of these sequencing technologies is often challenged by errors and biases associated with the bioinformatical methods used for analyzing the data. In particular, the use of naïve methods to identify polymorphic sites and infer genotypes can inflate downstream analyses. Recently, explicit modeling of genotype probability distributions has been proposed as a method for taking genotype call uncertainty into account. Based on this idea, we propose a novel method for quantifying population genetic differentiation from next-generation sequencing data. In addition, we present a strategy for investigating population structure via principal components analysis. Through extensive simulations, we compare the new method herein proposed to approaches based on genotype calling and demonstrate a marked improvement in estimation accuracy for a wide range of conditions. We apply the method to a large-scale genomic data set of domesticated and wild silkworms sequenced at low coverage. We find that we can infer the fine-scale genetic structure of the sampled individuals, suggesting that employing this new method is useful for investigating the genetic relationships of populations sampled at low coverage.

Keywords: next-generation sequencing, FST, principal components analysis


DETERMINING the level of genetic variation within and between species or populations is necessary to study the effects of mutation, natural selection, and genetic drift. In the past few years, faster and cheaper high-throughput DNA sequencing technologies have provided us with an unprecedented amount of large-scale genetic data. These next-generation sequencing (NGS) technologies are now commonly used in population genetic studies and provide us with the perfect opportunity to investigate the evolutionary forces affecting genetic variation.

Currently, available NGS technologies differ in their protocol design (reviewed in Metzker 2010) but all produce data with similar general features. Briefly, the sequencing output consists of relatively short stretches (e.g., currently about 50–100 bp for Illumina machines) of sequenced DNA, commonly called “reads.” These small segments of DNA are then aligned to a reference genome or assembled into scaffolds in de novo assembly when a reference genome is not available.

These technologies have greatly improved sequencing efforts in both model and nonmodel organisms, but they have also introduced new challenges because many of the data sets produced using these methods are sequenced at low coverage (a position in the genome is covered by only few sequencing reads), and raw sequencing error rates are often higher than observed using Sanger sequencing. In such circumstances, it is often difficult to distinguish between a variable site and a sequencing error, making the identification of variable sites in the sample (a procedure known as “SNP calling”) nontrivial and prone to error. Also, determining the genotype for each individual (“genotype calling”) can be unreliable due to uncertainty about whether both the parental chromosomes were sampled. Therefore, sequencing errors and uncertainty in the genotype calls may lead to a biased allele frequency distribution (Johnson and Slatkin 2008; Hellmann et al. 2008).

Accurate estimation of the site frequency spectrum (SFS), however, is important for population genetic inferences of demography, natural selection, and population structure. Indeed, many summary statistics for evolutionary inferences are functions of the sample allele frequencies (Nielsen 2005). Higher sequencing coverage lowers the uncertainty, but with a fixed budget, researchers need to choose between sequencing fewer samples at higher coverage or sequencing more samples at low to medium coverage. The latter was the preferred option for many recent large-scale sequencing population genetic studies (1000 Genomes Project Consortium 2010, 2012; Auton et al. 2012; Huang et al. 2012).

Naïve methods for estimating allele frequencies, which are primarily based on direct counting of sequencing reads, provide inaccurate estimates of local nucleotide diversity (Nielsen et al. 2011). Consequently, there have been numerous efforts to use statistical models to analyze NGS data to provide more accurate estimates of allele frequencies. To this end, maximum-likelihood (ML) methods and Bayesian methods have been developed for estimating the allele frequency at any given site (Lynch 2009; Keightley and Halligan 2011; Kim et al. 2011) or the entire distribution of allele frequencies jointly across multiple sites (Li 2011; Keightley and Halligan 2011; Nielsen et al. 2012). Bayesian methods incorporate base quality scores and statistical uncertainty to obtain posterior probabilities associated with each genotype. Recent studies incorporate this probabilistic approach to estimate population genetic parameters from NGS data (Yi et al. 2010; Gompert and Buerkle 2011; Kang and Marjoram 2011; Li 2011; Gompert et al. 2012).

Thanks to these approaches, genome-wide scans of positive selection have been possible in samples sequenced at moderate coverage. For example, in Yi et al. (2010), 50 Tibetan individuals were sequenced to identify the regions of the genome involved in the adaptation to high altitude. Species of rice (Xu et al. 2011), chicken (Rubin et al. 2010), and silkworm (Xia et al. 2009) have also been sequenced at low coverage to identify functional differences between domesticated and wild populations.

In genome-wide scans for selection, it is often informative to summarize genetic variation using population differentiation statistics, such as FST (Wright 1951), to identify particular regions of the genome that are highly differentiated relative to the rest of the genome. FST can also be informative about the divergence time between two populations. Another powerful tool for the analysis of genetic data are principal component analysis (PCA). This data reduction method is a convenient way to visualize the data, derive corrections for population stratification in association studies, and investigate specific features of population history and differentiation. Both PCA and FST have been used extensively for the past 30 years and continue to be valuable tools in summarizing genetic variation.

However, as we show, traditional methods for computing FST and performing PCA result in biases when applied to genotype calls from low or moderate coverage NGS data. Therefore, we propose a new method to estimate FST from NGS data that accounts for uncertainty in the genotype calls. Furthermore, we also show that population structure can be investigated with PCA under the proposed probabilistic framework that accounts for sequencing errors. These new methods outperform previous approaches, especially in the case of low-coverage sequencing data as determined from simulated sequences. Finally, we demonstrate the power of the proposed methods by applying it to a previously published data set of wild and domesticated samples of Bombyx mori (Xia et al. 2009).

The methods developed in this study contribute to the current toolkit for population genetic analyses of next-generation sequencing data and can be applied to both model and nonmodel organisms.

Materials and Methods

Measuring genetic differentiation between populations

FST is a measure of population genetic differentiation that quantifies the proportion of variance in allele frequencies among populations relative to the total variance (the sum of the variance within individuals, within populations, and between populations). Several estimators of FST have been proposed through the years (reviewed in Weir and Hill 2002; Holsinger and Weir 2009).

There is considerable debate about definitions of FST. Some researchers consider FST to be a model parameter (e.g., Balding and Nichols 1995; Nicholson et al. 2002; Holsinger et al. 2002), while others consider it to be a statistic (e.g., Reynolds et al. 1983; Weir and Cockerham 1984; Hudson et al. 1992). Even when considering FST as a parameter, there is considerable discussion about what model it is a parameter of and how it should be estimated (Marchini and Cardon 2002; Balding 2003). The objective of this article is not to compare these approaches, which differ both in what they estimate and in how the estimation procedure works. We remain agnostic with regard to the debate on interpretation and definition of FST, although we use the word “estimator” throughout. Instead, we show how some of the most commonly applied estimators of FST can be modified in the presence of low- and medium-coverage data to more accurately reflect what the original FST estimators were intended to capture; i.e., the objective will be to derive estimators applicable to NGS data that produce results similar to those that would have been obtained from the original estimator based on full genotype data without any errors. As a note, other estimators, not considered here, could potentially be modified in a similar fashion.

Method-of-moments estimation:

We start by considering the most simple method-of-moments estimators of FST. They do not rely on any assumptions about the shape of the sampling distribution, beyond the moments used to estimate the parameters, and they are easy to implement through simple algebraic expressions. For these reasons, method-of-moments estimators are popular and often used.

Our first aim is to extend the method-of-moments FST estimator proposed by Reynolds et al. (1983), as this is one of the most popular and well-motivated estimators of FST, to take into account genotyping uncertainty. Assuming a biallelic SNP, with nonreference allele at estimated frequencies of p^i, p^j, and p^ for population i, j, and pooled, the genetic variance between and within populations at site s is, respectively,

as=4ni(p^(i,s)p^s)2+4nj(p^(j,s)p^s)2bs2(2ninj/(ni+nj)) (1)

and

bs=niα(i,s)+njα(j,s)ni+nj1, (2)

where ni and nj are the number of sampled individuals per population, α(i,s)=2p^(i,s)(1p^(i,s)), and α(j,s)=2p^(j,s)(1p^(j,s)). Table 1 describes nomenclature used throughout this manuscript.

Table 1. Nomenclature used in the manuscript.
Notation Description
p(i,s), ps Population allele frequency in population i and pooled, respectively, at site s
panc,s Ancestral population allele frequency at site s
p^(i,s), p^s Estimated population allele frequency from allele counts at population i and pooled, respectively, at site s
ni, n No. of sampled individuals at population i, and pooled, respectively
m No. of sites
rs No. of sequencing reads at site s
vz,s Base at sequencing read z at site s
L(z,v,s) Likelihood of base v at read z and site s
G(w,s) Genotype at site s for individual w; G ∈ {0, 1, 2}
X(w,s) Data (sequencing reads) at site s for individual w
Y(i,s) Data (sequencing reads) at site s for population i
h(i,s)(k) Marginal likelihood of k nonreference alleles for population i at site s
π(i,s)(k), πs(k) Posterior probability of k nonreference alleles for population i and pooled, respectively, at site s
π(i,j,s)(k,z) Joint posterior probability of k and z nonreference alleles for population i and j, respectively, at site s
a(i,j)(k,z), b(i,j)(k,z), c(i,j)(k,z) Genetic variance between (a) and within (b) populations and total (c) assuming k and z nonreference alleles at population i and j, respectively
S(i,j)(k,z) Joint allele proportions for k and z nonreference alleles at population i and j, respectively
C(w,y) Normalized matrix for PCA for individual w and y
s Index for sites
k Index for samples (allele frequencies)
w, y Indexes for individuals
z Index for sequencing reads
P(⋅) Probability function
B(⋅) Beta-function

The estimate of FST for a single site is then

FST=asas+bs (3)

while for a locus of m sites it is

FST(locus)=s=1mass=1m(as+bs). (4)

Maximum-likelihood estimation:

ML methods for estimating FST require the specification of a sampling probability distribution. Once this distribution is defined, one can maximize a likelihood function to obtain ML estimators for the parameters of the distribution. ML estimators of FST have been very popular, particularly for detecting signatures of adaptive natural selection among populations (e.g., Beaumont and Balding 2004; Riebler et al. 2008; Foll and Gaggiotti 2008).

Assuming a biallelic site s with beta-distributed allele frequencies, the probability of the sample allele frequencies p^(i,s) at population i can be expressed as a beta-binomial distribution with parameters 2ni (sample size), FST, and panc,s, the ancestral population allele frequency. This parameterization assumes divergence from a common ancestral population and that the subsequent divergence is well modeled by the beta-distribution. The marginal sampling distribution in population i is then given by (Balding and Nichols 1995; Balding 2003)

P(p^(i,s)=k2ni|panc,s,FST)=(2nik)B(k+α,2nik+β)B(α,β), (5)

where k is the count of the nonreference (or derived) allele, B is the Beta-function,

α=panc,s(1FST)FST, (6)

and

β=(1panc,s)(1FST)FST. (7)

The full-likelihood function is the product of this sampling distribution for all populations, as the populations are independent conditional on panc,s. For two populations i and j, we have

P(p^(i,s)=k2ni,p^(j,s)=z2nj|panc,s,FST)=P(p^(i,s)=k2ni|panc,s,FST)×P(p^(j,s)=z2nj|panc,s,FST), (8)

where the subscripts on n and p^ indicate population identity. We numerically maximize Equation 8 using the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm (Fletcher 1987; Press et al. 2007).

Quantifying population genetic differentiation by calling genotypes

A naïve strategy for estimating sample allele frequencies and FST is to first call genotypes at each site, and then simply count the occurrence of nonreference or derived alleles among all individuals.

We first assessed the accuracy of several genotype-calling strategies (Supporting Information, File S1). These methods include approaches based on direct counts of read bases, on genotype likelihoods, and on genotype posterior probabilities. One promising approach is to use Bayesian methods to assign individual genotypes by computing genotype posterior probabilities P(G|X) from genotype likelihoods and a specific prior P(G) on genotype G. Bayes’ theorem is used to calculate P(G|X), the posterior probability of genotype G given the observed data X (1000 Genomes Project Consortium 2010). The prior can be defined using extraneous data, such as the reference sequence, sequences in a database, an estimate of the allele frequency, and/or inbreeding coefficients, etc. (e.g., 1000 Genomes Project Consortium 2010; Li 2011; Nielsen et al. 2012).

We calculate genotype posterior probabilities at site s for individual w, P(G(w,s)|X(w,s)) as

P(G(w,s)|X(w,s))=P(X(w,s)|G(w,s))P(G(w,s))G=02P(X(w,s)|G(w,s))P(G(w,s)), (9)

where P(X(w,s)|G(w,s)) are the genotype likelihoods and P(G(w,s)) is the prior probability of genotype G at site s under Hardy–Weinberg Equilibrium (HWE). The prior is calculated from estimates of the per-site population allele frequencies using the method described in Kim et al. (2011). To call genotypes, the genotype with the highest posterior probability was chosen for each individual.

Results show that calling genotypes from genotype posterior probabilities provides the most stable and accurate genotype and SNP-calling accuracy at almost all tested experimental scenarios (Table S1, Table S2, and Table S3). We adopted this strategy to call genotypes throughout the rest of the study. Specifically, we counted nonreference alleles from these called genotypes to infer allele frequencies and computed a method-of-moments estimator of FST, which we labeled F^ST.GC (Equations 10 and 11). We adopted this genotype calling strategy to compute a ML estimator of FST, F^ST.ML.GC (Equations 5 and 8).

An alternative strategy for computing FST is to avoid genotype calling altogether so that inference is based directly on the posterior probabilities (e.g., Yi et al. 2010; Nielsen et al. 2012). We describe such methods in the following sections.

Quantifying population genetic differentiation without calling genotypes

Here we propose using a Bayesian probabilistic framework to estimate FST from posterior probabilities of sample allele frequencies of each population at each site without calling specific genotypes. In our applications, we compute a maximum-likelihood estimate of the site frequency spectrum from genotype likelihoods, as previously proposed by Nielsen et al. (2012). Using this ML estimate of the SFS as a prior in an empirical Bayes approach, we estimate the posterior probability for all possible allele frequencies at each site (Nielsen et al. 2012).

Method-of-moments estimation:

Let πi(k)=P(pi^=k/(2ni)|Y(i,s)) be the posterior probability that a site in population i has derived sample allele frequency pi^=k/(2ni), in a sample of ni diploid individuals, given the read data Y(i,s). This probability can be calculated from the genotype probabilities using the algorithm in Nielsen et al. (2012). Allele labeling with respect to the derived allele is arbitrary and any other labeling of alleles could have been chosen if identification of the ancestral and derived state is not possible.

From these quantities, we compute the posterior expectation of the genetic variance between and within populations (see Equations 1 and 2) at site s as

E[as|Ys]=k=02niz=02nja(i,j)(k,z)π(i,j,s)(k,z) (10)

and

E[bs|Ys]=k=02niz=02njb(i,j)(k,z)π(i,j,s)(k,z), (11)

where a(i,j)(k,z) and b(i,j)(k,z) are genetic variances from Reynolds et al. (1983) formula, with k- and z-derived alleles in populations i and j, respectively, and Ys is the sequencing data at site s. The total expected variance, E[cs|Ys], at each site, is then E[cs|Ys] = E[as|Ys] + E[bs|Ys].

The estimate of FST for a single site is given by the ratio of E[as|Ys] to E[cs|Ys] (Equation 3). However, since the two variance components are not independent and this calculation involves the expectation of a ratio, we approximate it using the delta method (Rice 2008; Rice and Papadopoulos 2009) to obtain the following estimator of FST at site s,

F^ST.Ev=E[ascs|cs0,Ys]=E[as|Ys]E[cs|Ys]+u=1(1)uE[as|Ys]cu+a,cuE[c|Ys]i+1, (12)

where 〈cu〉 is the uth central moment of cs and 〈a, cu〉 is the mixed central moment, which can be calculated as

cu=E[(csE[cs])u|Ys]=k=02niz=02nj(c(i,j)(k,z)E[cs|Ys])uπ(i,j,s)(k,z) (13)

and

a,cu=E[(asE[as|Ys])(csE[cs|Ys])u|Ys]=k=02niz=02nj(a(i,j)(k,z)E[as|Ys])(c(i,j)(k,z)E[cs|Ys])uπ(i,j,s)(k,z), (14)

where c(i,j)(k,z) is the total genetic variance from Reynolds et al. (1983) formula, with k- and z-derived alleles in populations i and j, respectively. For computational purposes, we use only the first central and mixed central moments.

π(i,j,s)(k,z) can be calculated using maximum likelihood similarly to the method used for calculating π(i,s)(k) for a single population (Nielsen et al. 2012). However, this calculation may not be desirable due to the high variance associated with the estimation of so many parameters.

An alternative approach is to compute an estimate of the two-dimensional site frequency spectrum (2D-SFS), S(i,j)(k,z), as

S(i,j)(k,z)=1s=0mk=02niz=02nj(h(i,s)(k)h(j,s)(z))s=0mh(i,s)(k)h(j,s)(z), (15)

where h(i,s)(k) and h(j,s)(z) are the marginal likelihoods of observing k and z nonreference alleles at population i and j, respectively, at site s, as presented in Nielsen et al. (2012).

S(i,j)(k,z) is then used as a prior to compute the posterior probability of quantities of interest. For instance, the expectation of the genetic variance between populations (see Equation 10) can be computed as

E[as|Ys]=k=02niz=02nja(i,j)(k,z)h(i,s)(k)h(j,s)(z)S(i,j)(k,z). (16)

Finally, a method-of-moments estimator of FST over m sites is given by Equation 4. When analyzing multiple sites, we do not add the correction factor to the ratio of E[a|X] to E[c|X] at each site because, for a large number of sites, the error introduced by taking the ratio of two nonindependent expectations will be minimal. We also tested the performance of other methods to estimate FST from sequencing data derived from the expectations of sample allele frequencies (File S1).

These methods can be extended to nonpairwise definitions of FST (Weir 1996). These formulations require the estimation of a joint SFS among all populations, which can be estimated in a similar fashion as in Equation 15.

Maximum-likelihood estimation:

We also extend the procedure for ML estimation of FST and panc under the Beta-binomial distribution (Balding and Nichols 1995; Balding 2003) (Equation 8) to the case of unknown genotypes. These estimates, which we call FST.ML, are obtained by maximizing the likelihood function

P(Y(i,s),Y(j,s)|panc,s,FST)=k=02niz=02njP(Y(i,s)|p^(i,s)=k2ni)P(p^(i,s)=k2ni|panc,s,FST)×P(Y(j,s)|p^(j,s)=z2nj)P(p^(j,s)=z2nj|panc,s,FST)=k=02niz=02njh(i,s)(k)P(p^(i,s)=k2ni|panc,s,FST)h(j,s)(z)×P(p^(j,s)=z2nj|panc,s,FST), (17)

where Y(i,s) and Y(j,s) are the observed read data at site s for population i and j, respectively, and h(i,s)(k) and h(j,s)(z) are again the marginal likelihoods of the sample allele frequency for population i and j, computed as in Nielsen et al. (2012).

Principal Components Analysis

A similar approach to the one used for correcting estimates of FST can be used in PCA. The now-standard method for calculation PCA in population genetics is based on Patterson et al. (2006). For n individuals and m sites a normalized covariance matrix C is calculated as

C(w,y)=1ms=1m(G(w,s)2p^s)(G(y,s)2p^s)p^s(1p^s), (18)

where ps^ is the derived allele frequency at site s (the labeling is again arbitrary) and G(w,s) is the number of derived alleles for individual w at site s (G ∈ {0, 1, 2} in the diploid case). The denominator is inserted to account for genetic drift and normalizes the standardized allele frequencies to have the same variance (Patterson et al. 2006). However, other normalizations can be chosen. An eigenvector decomposition of C is then computed.

We propose computing an estimate of C(w,y) by integrating over the posterior genotype probabilities at site s for individual w, P(G(w,s)|X(w,s)), and y, P(G(y,s)|X(y,s)), which can both be calculated as in Equation 9. The prior is calculated using the sample allele frequencies p^s at site s as in Kim et al. (2011). Therefore, P(G(w,s)=2)=p^s2, P(G(w,s)=1)=2p^s(1p^s), P(G(w,s)=0)=(1p^s)2, where G(w,s) is the number of derived alleles for individual i at site s. Missing genotype data are then implicitly incorporated in a Bayesian manner using the prior from the sample allele frequencies.

Additionally, the C matrix is weighted by the probability of each site being variable. This is motivated by the fact that, at low to medium sequencing coverage, sites that have a small probability of being variable in the sample can have a small but nonnegligible contribution to the matrix C. As they are several orders of magnitude more invariable than variable sites, this can have a profound effect on the analyses, even when weighting with genotype probabilities. Instead of using an arbitrary discrete SNP calling, or minor allele frequency, cut-off, we propose weighting sites according to their probability of being variable.

We, therefore, estimate the matrix C as (for wy)

C(w,y)=1s=1mPvar,ss=1m(G(w,s)=02G(y,s)=02(G(w,s)2p^s)(G(y,s)2p^s)P(G(w,s)|X(w,s))P(G(y,s)|X(y,s)))Pvar,sp^s(1p^s), (19)

where the probability of site s being variable, Pvar,s, is computed as

Pvar,s=1(πs(0)+πs(2n)). (20)

We emphasize that this approach does not provide a form of Bayesian PCA analysis. Rather, it is a modification of the Patterson et al. (2006) approach for PCA analysis in the context of population genetics, modified to incorporate uncertainty in genotype calls by using an appropriate weighting of different genotypes using their respective posterior probabilities.

Note that we estimate the joint posterior of the genotype probabilities for the two individuals using the product of their marginal genotype probabilities, i.e., we estimate P(G(w,s), G(y,s)|X(w,s), X(y,s)) by P(G(w,s)|X(w,s))P(G(y,s)|X(y,s)). P(G(w,s)|X(w,s)) and P(G(y,s)|X(y,s)) are not independent as they are correlated through the underlying estimate of genotype frequencies affecting the prior. However, as these analyses are carried out conditional on an estimated allele frequency, the approximation is accurate, although it ignores the sampling variance in the estimate of the allele frequency. Conditional on the allele frequency, P(G(w,s)|X(w,s)) and P(G(y,s)|X(y,s)) are independent.

We also note that

E[G(w,s)=02G(y,s)=02(G(w,s)2p^s)(G(y,s)2p^s)P(G(w,s)|X(w,s))P(G(y,s)|X(y,s))]=0 (21)

for unrelated individuals under HWE assuming known allele frequencies and a HWE-derived prior for the genotype probabilities. This shows that the covariance function for unrelated individuals is in fact expected to be zero using this estimator, a necessary and desirable property for the method to perform well. Proof of Equation 21 is provided in the Appendix. As we argue, the resulting PCA is greatly improved over naïve methods using genotype calling under all the explored scenarios.

This approach could be extended to different strategies to perform PCA from a matrix of genotype posterior probabilities, for instance, ML methods that account for noise contributions of each variable (Wentzell et al. 1997) or Bayesian methods that use external information about the data (Nounou et al. 2002).

Simulating sequencing data for multiple populations

We performed simulations to compare the performance of these methods to estimate population genetic differentiation, as well as to quantify the genotyping and SNP calling accuracy, under a broad range of experimental conditions. As in previous studies (Kim et al. 2010, 2011), we simulated sequencing data rather than raw sequencing reads for computational efficiency. We treated sites as independent of each other and simulated genotypes for each individual assuming HWE and a specific population allele frequency. Specifically, we repeated the following procedure for each site.

First, for each site, we drew an ancestral allele frequency panc from a distribution in [5 × 10−3, 1 − (5 × 10−3)] with density proportional to 1/x. This distribution is the expected allele frequency distribution under a standard neutral infinite sites model, truncated at the boundaries corresponding to a population size of 200 individuals (see, e.g., Ewens 2004). We then simulated allele frequencies for two populations using the Balding–Nichols model (Balding and Nichols 1995) with mean equal to panc, as in previous studies (Pritchard and Donnelly 2001; Price et al. 2006). We simulated two independent samples, conditionally on FST and panc, from this distribution to obtain allele frequencies for two populations (see Equation 5). From these population allele frequencies, we assigned genotypes according to HWE for each individual.

To simulate data from three populations, we first drew population allele frequencies from the Balding–Nichols model for two populations as described above. We then assigned the first allele frequency to population 1 and used the second allele frequency as the ancestral allele frequency for populations 2 and 3. We then drew two population allele frequencies from the Balding–Nichols model for a different value of FST and assigned these allele frequencies to populations 2 and 3.

To simulate NGS data, the number of reads at each locus for each individual was simulated from a Poisson distribution as in Kim et al. (2010, 2011). Additionally, errors were randomly introduced uniformly among nucleotides at a rate of 0.0075. This value is comparable to error rates found in previous studies (1000 Genomes Project Consortium 2010; Li et al. 2010; Yi et al. 2010). The probability of a site being polymorphic, Pvar, was varied from 0.02 to 1.

We computed genotype likelihoods from simulated sequencing reads. Genotype likelihoods depend on both base calls and quality scores and are proportional to the probability, P(X|G), of the observed read data, X, at a site for each individual given a certain genotype G. In the simplest possible case, for read z at site s, we calculated the genotype likelihood of a particular base v, L(z,v,s) with v ∈ {A, C, G, T} as L(z,v,s) = (1 − e) if v is the observed base at read z, and L(z,v,s) = e/3 otherwise. Here e is the sequencing error used in the simulation setting. There are many other methods for estimating e, including methods for estimating it directly from the data (e.g., Kim et al. 2011). Genotype likelihoods at site s for individual w are then calculated by taking the product of the likelihoods over all r reads:

P(X(w,s)|G(w,s)=v1v2)=12rz=1r(L(z,v1,s)+L(z,v2,s)). (22)

Using this procedure, we computed genotype likelihoods for each individual at each site for all 10 possible genotypes. We then computed posterior probabilities of genotypes and sample allele frequencies, as previously described (see Equation 9).

When calling genotypes, we assigned genotypes with a posterior probability <0.90 as missing data. We removed sites where more than half of the individuals had missing genotypes. With this procedure, we filtered ∼25% of the total sites at 2× sequencing coverage. We computed FST only on nonmissing genotypes, while for PCA we imputed missing data with genotypes with the highest posterior probability.

To assess the accuracy of the per-site estimates of FST, we simulated two data sets of 10k and 1k sites for each experimental scenario to evaluate method-of-moments and ML estimates, respectively, with FST varying from 0.01 to 0.4, and with Pvar = 1. We verified convergence of optimization algorithms for ML estimators of FST and discarded sites where this condition was not met. We also simulated 1M sites by concatenating 100 sets of 10k simulated sites with FST values drawn from a Normal distribution N(0.2, 0.2) truncated at 0.02 and 0.90, and Pvar = 0.10 to assess the accuracy of multiple-sites estimates of FST. We simulated 20 individuals per population at low (2×), medium (6×), and high (20×) sequencing coverage.

To evaluate the performance of different methods for estimating FST, we calculated two measures of deviation from the true FST over m sites: the root-mean-square deviation (RMSD),

RMSD=1ms=1m(F^ST(s)FST(s))2 (23)

and mean bias

Meanbias=1ms=1m(F^ST(s)FST(s)), (24)

where FST(s) and F^ST(s) is the estimated FST at site s from the case of known genotypes and sequencing data, respectively.

To evaluate the accuracy of the PCA method, we simulated 10k sites for each scenario with values of FST ranging from 0.02 to 0.4 and with Pvar = 0.02, 0.1, or 1. We simulated three populations with 20 individuals each at 2×, 6×, and 20× sequencing coverage. We performed 10 distinct simulations for each experimental condition to assure robustness of our results. We assessed the accuracy of inferred PCA plots using Procrustes analysis (Wang et al. 2010). Briefly, we measured the deviation of PC1 and PC2 computed from the case of known genotypes and the case of unknown genotypes using sum-of-squares (SS), where SS values closer to 0 indicate better fits.

Applications to real data

We analyzed a data set of wild and domesticated species of silkworm, B. mori (Xia et al. 2009). The data consisted of 40 samples representing 29 domesticated lineages and 11 wild lineages. Domesticated lineages are phenotypically and geographically separated into subgroups while all wild lineages are from China. Samples were sequenced at an approximate mean per-site coverage of 3×. We analyzed chromosome 2 using the original genotype likelihoods by removing sites where we had no information for at least one individual. Details on the calculation of genotype likelihoods can be found in the original article (Xia et al. 2009). Approximately, 200,000 sites were analyzed in total.

We computed posterior probabilities of sample allele frequencies and genotypes using ANGSD software (available at http://www.popgen.dk/angsd). We then performed PCA and estimated FST using the new proposed methods implemented in a set of C/C++ programs (available at https://github.com/mfumagalli/ngstools). All statistical analyses were performed in the R environment (http://www.r-project.org).

Results

Quantifying population genetic differentiation from sequencing data

We performed extensive simulations to evaluate the accuracy of estimating FST using different methods and under different conditions. We first evaluated the accuracy of method-of-moments estimates of per-site FST based on called genotypes. Specifically, we assign genotypes for each individual based on the the highest genotype posterior probability (F^ST.GC) (see Materials and Methods). This approach is representative of strategies currently used for genotype calling, and it provides better genotype and SNP calling accuracies than other genotype calling strategies examined here (Table S1, Table S2, and Table S3).

We then obtain a method-of-moments estimator of FST from NGS data without calling genotypes by using posterior probabilities of sample allele frequencies, which allows us to compute expected genetic variance components between and within populations (see Materials and Methods). Here, we employ Equation 15 to estimate the 2D-SFS and use it as a prior as in Equation 16. We call this estimator F^ST.Ev.

Results show that this new method performs substantially better than the method based on genotype calling under the experimental conditions explored in this study, especially at low sequencing coverage (Figure 1). F^ST.Ev tends to underestimate the true value of FST at 2× coverage, but this bias is reduced at 6× coverage (Figure 1). We observe accuracy in our estimates that are comparable to that of methods based on genotype calling for high coverage sequencing data. We obtain similar results when using the true 2D-SFS as a prior (Figure S1). We also observe that at 2× coverage, F^ST.Ev is more accurate for estimating FST than estimators based on computing the expected allele frequency for each population (see File S1 and Table S4), which overestimates FST (Figure S2).

Figure 1.

Figure 1

RMSD (left) and mean bias (right) for method-of-moments estimates of FST under different sequencing coverage (2×, 6×, and 20×). We compared the accuracy of the new method, which does not rely on genotype calling (F^ST.Ev), and a method based on allele frequencies estimated from called genotypes (F^ST.GC) (see Materials and Methods). We simulated 20 individuals for each population and 10,000 sites for each scenario.

Next, we compared the accuracy of a ML estimator of FST from called genotypes under the Balding–Nichols model, FST.ML.GC, to the proposed estimator based on the full likelihood under the same model while taking genotype calling uncertainty into account, FST.ML (see Materials and Methods). The results show that FST.ML outperforms the method based on calling genotypes at 2× and 6× coverage (Figure 2). For higher sequencing coverage, both methods perform very similarly. We also observe that ML estimates of the ancestral population allele frequency are highly correlated with the true values (Figure S3).

Figure 2.

Figure 2

RMSD (left) and mean bias (right) for maximum-likelihood estimates of FST under different sequencing coverage (2×, 6×, and 20×). We compared accuracy of the new method, which does not rely on genotype calling (F^ST.ML), and the standard method applied to called genotypes. (F^ST.ML.GC) (see Materials and Methods). We simulated 20 individuals for each population and 1000 sites for each scenario.

We also test the accuracy of estimating multiple-sites FST on 10k sites from a larger set of 1M simulated positions where only 10% of the sites are variable in the population (see Materials and Methods). For this particular analysis we chose the method-of-moments estimator because of its natural extension to multiple-sites estimation (Equation 4). At 2× sequencing coverage we underestimate the true FST (Figure S4). This bias diminishes at 6× and disappears at 20×. When we use the true 2D-SFS as a prior at 2× sequencing coverage, we underestimate the true FST when this value is above the whole-region average (approximately equal to 0.25), while we overestimate the true FST when this value is below the whole-region average (Figure S4). This bias is derived from using the 2D-SFS estimated from the entire region as a prior. At 6× and 20× sequencing coverage we observed unbiased estimates using the true 2D-SFS as a prior (Figure S4).

Principal components analysis

In traditional PCA, genotypes are called at each site for each individual. We explore an alternative approach based on the genotype posterior probabilities for each individual at each site (see Materials and Methods).

At low sequencing coverage, the new method, which does not rely on SNP or genotype calling, produces PCA plot results that are essentially identical to those that use known genotypes (Figure 3). By contrast, direct genotype calling at low sequencing coverage generally leads to a loss in the ability to cluster individuals according to populations, which is a problem that may persist even after removing outlier individuals (Figure 3).

Figure 3.

Figure 3

PCA plots from known genotypes, called genotypes using genotype posterior probabilities with or without outlier individuals, and using the new method without calling genotypes (see Materials and Methods). We simulated three populations of 20 individuals each at 2× sequencing coverage. Colors are coded according to each simulated population. Purple and green/red populations are differentiated by an FST of 0.4 while green and red populations are differentiated by an FST of 0.15. We simulated 10,000 sites with 10% of sites being variable in the population.

We replicated these findings under many different experimental conditions and for multiple independent simulations and assessed the accuracy of PCA plots using SS values from PC1 and PC2 computed from known genotypes (see Materials and Methods). The new method provides better accuracy than the method based on genotype calling for all tested scenarios, even at medium sequencing coverage (Figure S5). Generally, we obtain lower SS values without normalization of the standardized allele frequencies (see Equation 18), and the new method still outperforms an approach based on called genotypes at low sequencing coverage (Figure S6). We next simulated only variable sites data at high sequencing coverage to produce an ideal scenario for genotype calling. As expected, procedures based on calling genotypes lead to accurate PCA results under these conditions (Figure S7).

Notably, weighting each site by its probability of being variable gives higher accuracy than simply weighing all sites equally, especially when there are only a few variable sites in the sample (Figure S8). This proposed method also performs better than an approach based on computing expected genotypes from genotype posterior probabilities (Skotte et al. 2012; Gompert et al. 2012) for low coverage data (Figure S9). We also simulated one population with no genetic structure but where half of the individuals were sequenced at low coverage (2×) while the rest were sequenced at high coverage (20×). We still observe an improvement in the accuracy of the inferred PCA plots (Figure S10).

Analysis of real data

To illustrate the performance of the herein proposed methods, we applied them to a data set of low-coverage sequencing data. Specifically, we investigate the population structure of wild and domesticated silkworm samples (Xia et al. 2009). Despite using only a single chromosome of the entire silkworm data set, we were able to detect fine-scale population genetic structure. Indeed, the first component of the PCA plot generated using the new method, which takes statistical uncertainty in genotype calling into account, shows a clear separation between wild and domesticated lineages (Figure 4A). Moreover, the second component divides the different lineages of domesticated silkworms into their subgroups (Figure 4A). The first two principal components explain 6.8 and 5.2% of the total genetic variation, respectively. Of note is that we achieve a better separation among the subgroups than in the original study using whole-genome sequence data, where several subgroups appear to be intermixed (Xia et al. 2009).

Figure 4.

Figure 4

(A) PCA plot for wild and domesticated B. mori samples using the method proposed in this study. (B) Distribution of FST between wild and domesticated B. mori samples over 20-kb genomic windows.

We then applied naïve strategies of performing PCA based on called genotypes using the maximum genotype likelihood or genotype posterior probability at each site for each individual. Results show several outlier individuals, which may be the effect of systematically misassigned heterozygous sites (Figure S11). However, when including only sites with estimated allele frequency greater or equal to two, and using genotype calling based on genotype posterior probabilities, we see a more accurate representation of the genetic structure. (Figure S11). A similar result using this allele frequency-filtered data set is obtained using the new proposed method that does not rely on genotype calling (Figure S11). Nonetheless, the new method applied to all of the data provides larger fractions of explained variance than the method based on genotype calling (Figure S12).

Finally, we estimated FST between wild and domesticated samples for 20-kb nonoverlapping genomic windows. We used the folded 2D-SFS due to uncertainty in assigning the ancestral and derived state of alleles. The distribution of the estimated FST values in 20-kb windows has mode around 0.4 (Figure 4B), which is larger than what was found in the original study (Xia et al. 2009).

Discussion

NGS technologies are now an essential tool for population genetic studies. However, genotyping uncertainty associated with low sequencing coverage and high sequencing error can drastically bias downstream analyses (Nielsen et al. 2011). A recent study assessed the power to detect selective events and infer demographic scenarios as a function of sequencing coverage and error (Crawford and Lazzaro 2012). The results of the study show that weak selective events are hardly detectable, and inferences of population size changes are systematically biased for low-coverage data (<10×) (Crawford and Lazzaro 2012). Interestingly, the authors determined that population genetic differentiation was underestimated, even at medium to high sequencing coverage, suggesting that multipopulation analyses are even more sensitive to inaccuracy of NGS data (Crawford and Lazzaro 2012).

In this study, we take full advantage of a recently proposed Bayesian approach to taking sequencing data uncertainty into account (Li 2011; Nielsen et al. 2012). This method involves computing posterior probabilities for each genotype and all possible sample allele frequencies from genotype likelihoods. Estimation of classic population genetic parameters within this new probabilistic framework has previously been suggested (Yi et al. 2010; Li 2011; Nielsen et al. 2012) and in some cases implemented (Yi et al. 2010; Gompert and Buerkle 2011; Kang and Marjoram 2011). For instance, Gompert and Buerkle (2011) proposed a hierarchical Bayes model for genomic population structure. Their method accounts for uncertainty in sampling sequencing reads and measured population differentiation in terms of haplotype distances. Also, Skotte et al. (2012) and Gompert et al. (2012) used genotype expectations rather than called genotypes for the analysis of population structure. Here, we developed new methods for quantifying population genetic differentiation in terms of FST without relying on SNP or genotype calling. We simulated NGS data to assess the accuracy of these new estimators under a wide range of experimental scenarios.

Herein proposed methods for computing method-of-moments FST estimators, based on computing the posterior expected genetic variance components (see Materials and Methods), offer a solution to the lack of accuracy for low coverage data and outperform other examined estimators under all tested conditions (Figure 1). While the improvement offered by the new method is greatest and most noticeable at low coverage, even at medium sequencing coverage, it results in less biased estimates of FST (Figure 1). Similarly, ML estimation of FST that accounts for uncertainty in genotype calls outperforms a method based on genotype calling at low and medium coverage (Figure 2). These findings suggest that the framework presented in this study can be easily extended to other FST estimators. Overall, these results highlight the importance of taking statistical uncertainty into account when computing population genetic differentiation from NGS data. The great improvement in accuracy for low coverage data can be explained by the fact that we do not call SNPs or genotypes. We can thus avoid introducing errors during these processes, which can be particularly problematic for downstream analyses.

Errors introduced by calling SNPs and genotypes for low coverage and quality data can be even more evident when investigating population structure with PCA. Simple genotype calling provides very little ability to accurately identify structure using PCA for low coverage data. However, the new method based on genotype posterior probabilities provides PCA plots that are almost identical to cases in which true genotypes are known (Figure 3). Accuracy in identifying population structure can be recovered when calling genotypes by removing outlier individuals, low-quality sites, and low-frequency variants, but at the price of losing potential important information. Skoglund and Jakobsson (2011) investigated population structure by randomly sampled one read from each individual at each position. In this way they could compare modern, high-quality data with the low-pass ancient data. A disadvantage of this method is the loss of information associated with using only a single read from each individual, especially in the presence of sequencing errors.

We applied methods proposed in this study to a data set comprising 40 silkworm samples sequenced at low coverage (Xia et al. 2009). We used only a single chromosome of the original data set and we did not apply any criteria for SNP calling. Despite this, we were able to obtain a fine-scale map of population genetic structure, clearly separating wild and domesticated lineages of silkworm samples (Figure 4A). The first principal component separates domesticated and wild varieties, while the second component accurately divides the domesticated lineage into subgroups. Genotype calling from genotype posterior probabilities can provide an overall similar representation of the genetic structure when using a conservative initial filtering of data.

Genotype calling using stringent data filtering and a conservative approach for SNP calling and rare variants removal may be sufficient to give an overall picture of the genetic population structure, for example, a reasonably representative PCA. Other analyses, such as estimation of FST, that rely on accurate estimates of allele frequencies may be more difficult to rescue by conservative filtering because a fixed cut-off for SNP calling cannot provide unbiased estimates of allele frequencies (e.g., Johnson and Slatkin 2008). Furthermore, the accuracy of genotype calling can be improved for human data by using imputation or haplotype-based genotype-calling methods (e.g., Zhi et al. 2012), although such approaches are not as easily applicable to most other species. The poor performance of PCA after calling genotypes may largely be a result of inaccuracies in SNP calling rather than a consequence of erroneous genotype calls at variable sites. However, when simulating sequences with a larger proportion of polymorphic sites the new method still outperforms traditional methods, even in the case of an uneven sequencing coverage among individuals (Figure S10). While more sophisticated approaches have been developed to perform accurate SNP calling (e.g., Kim et al. 2011), calling polymorphic sites using all individuals may result in ascertainment biases, which can influence estimates of population structure and divergence (Albrechtsen et al. 2010). Additionally, a stringent SNP-calling strategy implies that a large amount of data is discarded from the analyses, potentially leading to loss of important features of the data. For example, low-frequency variants, which are more likely to be removed in a conservative SNP calling strategy, can effectively distinguish closely related populations. Moreover, highly differentiated SNPs among populations, which may be related to genetic adaptation, might be lost in some analyses.

Like any other method for SNP-calling and allele-frequency estimation, the approach herein discussed is sensitive to the underlying base-calling algorithm and to the accuracy of quality scores. By improving accuracy and quality scores, current and future base callers can both reduce sequencing costs and increase accuracy of all downstream analyses of genetic variation. Furthermore, data filtering is a complex procedure when sequencing quality is low (e.g., Minoche et al. 2011). Many other protocols, other than the ones used in this article, can be adopted to minimize the genotypes assignment bias.

We implemented the new proposed methods for estimating FST and perform PCA from NGS data in a fast, portable, and memory-efficient set of C/C++ programs and distributed on a public repository for shared development. These programs are directly integrated with ANGSD (http://www.popgen.dk/angsd), a software for the analysis of NGS data and easily integrable with other common software such as SAMtools (Li et al. 2009) or GATK (Mckenna et al. 2010). The computational cost associated with the new methods is slightly higher than that of standard approaches (Table S5 and Table S6). However, the increased computational burden is mostly associated with the computation of sample allele frequency posterior probabilities, which can be used for additional analyses. Notably, the computational cost should not be prohibitive for any existing data sets.

As NGS technologies become more ubiquitous and affordable, the frequency of large-scale population genetic and quantitative studies will certainly increase. The methods presented in this article provide tools for investigating genetic variation for multiple populations at large scales directly from high-throughput sequencing data.

Supplementary Material

Supporting Information

Acknowledgments

We are grateful to Michael DeGiorgio and Gaston Sanchez at the University of California, Berkeley, CA, for helpful discussions. We thank Shiping Liu and Zengli Yan at Beijing Genomics Institute, Shenzhen, China, for testing previous versions of the programs, and three anonymous reviewers for insightful comments on the manuscript. M.F. is supported by EMBO Long-Term Post-doctoral Fellowship (ALTF 229-2011). T.L. is supported by National Institutes of Health (NIH) Genomics Training Grant (Grant T32HG000047-13). E.H.S. is supported by National Science Foundation grant DBI-0906065 and NIH grant 3R01HG03229-08S2. R.N. is supported by NIH grant 3R01HG03229-07.

Appendix

Let Gi and Xi be random variables representing the genotype and read data, respectively, from individual i. Likewise, let gi be a realization of Gi, and let xi be a realization of Xi, i = 1, 2. We then wish to prove that

EX1,X2[g1g2(g1E[G1])(g2E[G2])p(g1|x1)p(g2|x2)]=0, (A1)

where the sums, here and in the following, are over all supported values of the variable under the summation sign. We use a simplified notation so that expectation operators implicitly are taken with respect to the random variable(s) inside the argument of the expectation operator, except when otherwise indicated by the use of subscripts. Also, we use the short-hand notation p(gi) for the probability of the random variable Gi taken on the value gi.

First note that, for unrelated individuals, G1 and G2 are independent and that X1 and X2 are independent assuming a fixed known allele frequency and assuming random mating. Next also note that

EXi[gip(gi|xi)]=gip(gi) (A2)

and

EXi[p(gi|xi)]=p(gi). (A3)

Then

EX1,X2[g1g2(g1E[G1])(g2E[G2])p(g1|x1)p(g2|x2)]=g1g2(EX1,X2[g1g2p(g1|x1)p(g2|x2)]EX1,X2[g1E[G2]p(g1|x1)p(g2|x2)]+EX1,X2[g2E[G1]p(g1|x1)p(g2|x2)]+EX1,X2[g1g2p(g1|x1)p(g2|x2)])=g1g2(EX1[g1p(g1|x1)]EX2[g2p(g2|x2)]EX1[g1p(g1|x1)]EX2[E[G2]p(g2|x2)]EX2[g2p(g2|x2)]EX1[E[G1]p(g1|x1)]+EX1[g1p(g1|x1)]EX2[g2p(g2|x2)])=g1g2(g1g2p(g1)p(g2)g1p(g1)E[G2]p(g2)g2p(g2)E[G1]p(g1)+E[G1]E[G2]p(g1)p(g2))=E[G1]E[G2]E[G1]E[G2]E[G2]E[G1]+E[G1]E[G2]=0.

The interchange of summations in the first step is justified because all sums are finite. The second equality is true because of the independence assumption. The third equality is verified by substitution of the expressions in (A2) and (A3). The fourth equality follows from the independence assumption and the definition of expectation.

Footnotes

Communicating editor: M. A. Beaumont

Literature Cited

  1. 1000 Genomes Project Consortium , 2010.  A map of human genome variation from population-scale sequencing. Nature 467(7319): 1061–1073 [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. 1000 Genomes Project Consortium , 2012.  An integrated map of genetic variation from 1,092 human genomes. Nature 491(7422): 56–65 [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Albrechtsen A., Nielsen F. C., Nielsen R., 2010.  Ascertainment biases in snp chips affect measures of population divergence. Mol. Biol. Evol. 27(11): 2534–2547 [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Auton A., Fledel-Alon A., Pfeifer S., Venn O., Segurel L., et al. , 2012.  A fine-scale chimpanzee genetic map from population sequencing. Science 336(6078): 193–198 [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Balding D. J., 2003.  Likelihood-based inference for genetic correlation coefficients. Theor. Popul. Biol. 63(3): 221–230 [DOI] [PubMed] [Google Scholar]
  6. Balding D. J., Nichols R. A., 1995.  A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica 96(1–2): 3–12 [DOI] [PubMed] [Google Scholar]
  7. Beaumont M. A., Balding D. J., 2004.  Identifying adaptive genetic divergence among populations from genome scans. Mol. Ecol. 13(4): 969–980 [DOI] [PubMed] [Google Scholar]
  8. Crawford J. E., Lazzaro B. P.2012.  Assessing the accuracy and power of population genetic inference from low-pass next-generation sequencing data. Frontiers Genet. 3: 66. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Ewens, W., 2004 Mathematical Population Genetics: Theoretical Introduction Springer-Verlag, New York. [Google Scholar]
  10. Fletcher R., 1987.  Practical Methods of Optimization, Ed. 2 Wiley-Interscience, New York [Google Scholar]
  11. Foll M., Gaggiotti O., 2008.  A genome-scan method to identify selected loci appropriate for both dominant and codominant markers: a Bayesian perspective. Genetics 180: 977–993 [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Gompert Z., Buerkle C. A., 2011.  A hierarchical bayesian model for next-generation population genomics. Genetics 187: 903–917 [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Gompert Z., Lucas L. K., Nice C. C., Fordyce J. A., Forister M. L., et al. , 2012.  Genomic regions with a history of divergent selection affect fitness of hybrids between two butterfly species. Evolution 66(7): 2167–2181 [DOI] [PubMed] [Google Scholar]
  14. Hellmann I., Mang Y., Gu Z., Li P., de la Vega F. M., et al. , 2008.  Population genetic analysis of shotgun assemblies of genomic sequences from multiple individuals. Genome Res. 18(7): 1020–1029 [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Holsinger K. E., Weir B. S., 2009.  Genetics in geographically structured populations: defining, estimating and interpreting f(st). Nat. Rev. Genet. 10(9): 639–650 [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Holsinger K. E., Lewis P. O., Dey D. K., 2002.  A Bayesian approach to inferring population structure from dominant markers. Mol. Ecol. 11(7):1157–1164 [DOI] [PubMed] [Google Scholar]
  17. Huang X., Kurata N., Wei X., Wang Z. X., Wang A., et al. , 2012.  A map of rice genome variation reveals the origin of cultivated rice. Nature 490(7421): 497–501 [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Hudson R. R., Slatkin M., Maddison W. P., 1992.  Estimation of levels of gene flow from dna sequence data. Genetics 132: 583–589 [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Johnson P. L., Slatkin M., 2008.  Accounting for bias from sequencing error in population genetic estimates. Mol. Biol. Evol. 25(1): 199–206 [DOI] [PubMed] [Google Scholar]
  20. Kang C. J., Marjoram P., 2011.  Inference of population mutation rate and detection of segregating sites from next-generation sequence data. Genetics 189: 595–605 [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Keightley P. D., Halligan D. L., 2011.  Inference of site frequency spectra from high-throughput sequence data: quantification of selection on nonsynonymous and synonymous sites in humans. Genetics 188: 931–940 [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Kim S. Y., Li Y., Guo Y., Li R., Holmkvist J., et al. , 2010.  Design of association studies with pooled or un-pooled next-generation sequencing data. Genet. Epidemiol. 34(5): 479–491 [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Kim S. Y., Lohmueller K. E., Albrechtsen A., Li Y., Korneliussen T., et al. , 2011.  Estimation of allele frequency and association mapping using next-generation sequencing data. BMC Bioinformatics 12: 231. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Li H., 2011.  A statistical framework for snp calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27(21): 2987–2993 [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., et al. , 2009.  The sequence alignment/map format and samtools. Bioinformatics 25(16): 2078–2079 [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Li Y., Vinckenbosch N., Tian G., Huerta-Sanchez E., Jiang T., et al. , 2010.  Resequencing of 200 human exomes identifies an excess of low-frequency non-synonymous coding variants. Nat. Genet. 42(11): 969–972 [DOI] [PubMed] [Google Scholar]
  27. Lynch M., 2009.  Estimation of allele frequencies from high-coverage genome-sequencing projects. Genetics 182: 295–301 [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Marchini J. L., Cardon L., 2002.  Discussion on the meeting on statistical modelling and analysis of genetic data. J. R. Stat. Soc. Series B Stat. Methodol. 64(4): 737–775 [Google Scholar]
  29. McKenna A., Hanna M., Banks E., Sivachenko A., Cibulskis K., et al. , 2010.  The genome analysis toolkit: a mapreduce framework for analyzing next-generation dna sequencing data. Genome Res. 20(9): 1297–1303 [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Metzker M. L., 2010.  Sequencing technologies: the next generation. Nat. Rev. Genet. 11(1): 31–46 [DOI] [PubMed] [Google Scholar]
  31. Minoche A. E., Dohm J. C., Himmelbauer H., 2011.  Evaluation of genomic high-throughput sequencing data generated on illumina hiseq and genome analyzer systems. Genome Biol. 12(11): R112. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Nicholson G., Smith A. V., Jonsson F., Gustafsson O., Stefansson K., et al. , 2002.  Assessing population differentiation and isolation from single-nucleotide polymorphism data. J. R. Stat. Soc. Series B Stat. Methodol. 64: 695–715 [Google Scholar]
  33. Nielsen R., 2005.  Molecular signatures of natural selection. Annu. Rev. Genet. 39: 197–218 [DOI] [PubMed] [Google Scholar]
  34. Nielsen R., Paul J. S., Albrechtsen A., Song Y. S., 2011.  Genotype and snp calling from next-generation sequencing data. Nat. Rev. Genet. 12(6): 443–451 [DOI] [PMC free article] [PubMed] [Google Scholar]
  35. Nielsen R., Korneliussen T., Albrechtsen A., Li Y., Wang J., 2012.  Snp calling, genotype calling, and sample allele frequency estimation from new-generation sequencing data. PLoS ONE 7(7): e37558. [DOI] [PMC free article] [PubMed] [Google Scholar]
  36. Nounou M. N., Bakshi B. R., Goel P. K., Shen X., 2002.  Bayesian principal component analysis. J. Chemometr. 16: 576–595 [Google Scholar]
  37. Patterson N., Price A. L., Reich D., 2006.  Population structure and eigenanalysis. PLoS Genet. 2(12): e190. [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. Press W., Teukolsky S., Vetterling W., Flannery B., 2007.  Numerical Recipes: The Art of Scientific Computing, Ed. 3 Cambridge University Press, Cambridge, UK [Google Scholar]
  39. Price A. L., Patterson N. J., Plenge R. M., Weinblatt M. E., Shadick N. A., et al. , 2006.  Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38(8): 904–909 [DOI] [PubMed] [Google Scholar]
  40. Pritchard J. K., Donnelly P., 2001.  Case-control studies of association in structured or admixed populations. Theor. Popul. Biol. 60(3): 227–237 [DOI] [PubMed] [Google Scholar]
  41. Reynolds J., Weir B. S., Cockerham C. C., 1983.  Estimation of the coancestry coefficient: basis for a short-term genetic distance. Genetics 105: 767–779 [DOI] [PMC free article] [PubMed] [Google Scholar]
  42. Rice S. H., 2008.  A stochastic version of the price equation reveals the interplay of deterministic and stochastic processes in evolution. BMC Evol. Biol. 8: 262. [DOI] [PMC free article] [PubMed] [Google Scholar]
  43. Rice S. H., Papadopoulos A., 2009.  Evolution with stochastic fitness and stochastic migration. PLoS ONE 4(10): e7130. [DOI] [PMC free article] [PubMed] [Google Scholar]
  44. Riebler A., Held L., Stephan W., 2008.  Bayesian variable selection for detecting adaptive genomic differences among populations. Genetics 178: 1817–1829 [DOI] [PMC free article] [PubMed] [Google Scholar]
  45. Rubin C. J., Zody M. C., Eriksson J., Meadows J. R., Sherwood E., et al. , 2010.  Whole-genome resequencing reveals loci under selection during chicken domestication. Nature 464(7288): 587–591 [DOI] [PubMed] [Google Scholar]
  46. Skoglund P., Jakobsson M., 2011.  Archaic human ancestry in East Asia. Proc. Natl. Acad. Sci. USA 108(45): 18301–18306 [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. Skotte L., Korneliussen T. S., Albrechtsen A., 2012.  Association testing for next-generation sequencing data using score statistics. Genet. Epidemiol. 36(5): 430–437 [DOI] [PubMed] [Google Scholar]
  48. Wang C., Szpiech Z. A., Degnan J. H., Jakobsson M., Pemberton T. J., et al. , 2010.  Comparing spatial maps of human population-genetic variation using procrustes analysis. Stat. Appl. Genet. Mol. Biol. 9(1): 13. [DOI] [PMC free article] [PubMed] [Google Scholar]
  49. Weir, B. S., 1996 Genetic Data Analysis II Sinauer Associates, Sunderland, MA. [Google Scholar]
  50. Weir B. S., Cockerham C. C., 1984.  Estimating f-statistics for the analysis of population structure. Evolution 38: 1358–1370 [DOI] [PubMed] [Google Scholar]
  51. Weir B. S., Hill W. G., 2002.  Estimating f-statistics. Annu. Rev. Genet. 36: 721–750 [DOI] [PubMed] [Google Scholar]
  52. Wentzell P. D., Andrews D., Hamilton D. C., Faber F., Kowalski B. R., 1997.  Maximum likelihood principal component analysis. J. Chemometr. 11: 339–366 [Google Scholar]
  53. Wright S. 1951.  The genetical structure of populations. Ann. Eugenics 15: 323–354 [DOI] [PubMed] [Google Scholar]
  54. Xia Q., Guo Y., Zhang Z., Li D., Xuan Z., et al. , 2009.  Complete resequencing of 40 genomes reveals domestication events and genes in silkworm (bombyx). Science 326(5951): 433–436 [DOI] [PMC free article] [PubMed] [Google Scholar]
  55. Xu X., Liu X., Ge S., Jensen J. D., Hu F., et al. , 2011.  Resequencing 50 accessions of cultivated and wild rice yields markers for identifying agronomically important genes. Nat. Biotechnol. 30(1): 105–111 [DOI] [PubMed] [Google Scholar]
  56. Yi X., Liang Y., Huerta-Sanchez E., Jin X., Cuo Z. X., et al. , 2010.  Sequencing of 50 human exomes reveals adaptation to high altitude. Science 329(5987): 75–78 [DOI] [PMC free article] [PubMed] [Google Scholar]
  57. Zhi D., Wu J., Liu N., Zhang K., 2012.  Genotype calling from next-generation sequencing data using haplotype information of reads. Bioinformatics 28(7): 938–946 [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supporting Information

Articles from Genetics are provided here courtesy of Oxford University Press

RESOURCES