Author manuscript; available in PMC: 2009 Jul 20.
Published in final edited form as: J Am Stat Assoc. 2009 Mar 1;104(485):142–154. doi: 10.1198/jasa.2009.0010

A Bayesian hierarchical model for analysis of SNP diversity in multilocus, multipopulation samples

Feng Guo 1, Dipak K Dey 1, Kent E Holsinger 1
PMCID: PMC2713112  NIHMSID: NIHMS57100  PMID: 19623271

Abstract

The distribution of genetic variation among populations is conveniently measured by Wright’s FST, which is a scaled variance taking on values in [0,1]. For certain types of genetic markers, and for single-nucleotide polymorphisms (SNPs) in particular, it is reasonable to presume that allelic differences at most loci are selectively neutral. For such loci, the distribution of genetic variation among populations is determined by the size of local populations, the pattern and rate of migration among those populations, and the rate of mutation. Because the demographic parameters (population sizes and migration rates) are common across all autosomal loci, locus-specific estimates of FST will depart from a common distribution only for loci with unusually high or low rates of mutation or for loci that are closely associated with genomic regions having a relationship with fitness. Thus, loci that are statistical outliers showing significantly more among-population differentiation than others may mark genomic regions subject to diversifying selection among the sample populations. Similarly, statistical outliers showing significantly less differentiation among populations than others may mark genomic regions subject to stabilizing selection across the sample populations. We propose several Bayesian hierarchical models to estimate locus-specific effects on FST, and we apply these models to single nucleotide polymorphism data from the HapMap project. Because loci that are physically associated with one another are likely to show similar patterns of variation, we introduce conditional autoregressive models to incorporate the local correlation among loci for high-resolution genomic data. We estimate the posterior distributions of model parameters using Markov chain Monte Carlo (MCMC) simulations. 
Model comparison using several criteria, including DIC and LPML, reveals that a model with locus- and population-specific effects is superior to other models for the data used in the analysis. To detect statistical outliers we propose an approach that measures divergence between the posterior distributions of locus-specific effects and the common FST with the Kullback-Leibler divergence measure. We calibrate this measure by comparing values with those produced from the divergence between a biased and a fair coin. We conduct a simulation study to illustrate the performance of our approach for detecting loci subject to stabilizing/divergent selection, and we apply the proposed models to low- and high-resolution SNP data from the HapMap project. Model comparison using DIC and LPML reveals that CAR models are superior to alternative models for the high resolution data. For both low and high resolution data, we identify statistical outliers that are associated with known genes.

Keywords: Bayesian approach, Hierarchical model, SNP, Wright’s Fst, MCMC

1. INTRODUCTION

Human genetic diversity reflects our common evolutionary history. Differences among individuals belonging to the same group are smaller than those of individuals belonging to different groups. Moreover, differences among groups derived from the same broad geographical region are smaller than those derived from different geographical regions. For example, an analysis of microsatellite variation at 377 loci in 52 human populations (Rosenberg, Pritchard, Weber, Cann, Kidd, Zhivotovsky and Feldman 2002) identified five broad geographical clusters of populations: Africa, Eurasia, southeast Asia, Oceania, and the Americas. Approximately 75% of the among-population variation in allele frequency is associated with differences among these major geographical regions (Song, Dey and Holsinger 2006).

For loci that do not affect survival or reproduction, i.e., loci that are selectively neutral, both the amount of variation within populations and the extent of differentiation among populations are determined by: (1) the number of individuals in local populations, (2) the rates of migration among local populations, and (3) the mutation rates among alleles (see, for example, Crow and Kimura (1970), Fu, Gelfand and Holsinger (2003); Song et al. (2006)). In a typical population sample, individuals are genotyped at many loci, and all individuals are genotyped for the same set of loci. Thus, whatever the vagaries of demographic history - including population decline or expansion, population bottlenecks, asymmetric or variable migration rates among populations, etc. - all autosomal loci in a sample will share that history, and they should show similar patterns of within- and among-population variation. Nonetheless, previous surveys of single-nucleotide polymorphisms (SNPs) in the human genome have revealed substantial differences among loci in the amount of among-population variation (see, for example, Akey, Zhang, Zhang, Jin and Shriver (2002) and Weir, Cardon, Anderson, Nielsen and Hill (2005)). Such differences suggest either that the mutational process differs substantially from locus to locus or that allelic differences at those loci (or loci with which they are closely associated) contribute differently to survival and reproduction than do allelic differences at other loci. Mutation rates at different SNP loci within a sample are likely to be comparable (Chakraborty, Kimmel, Stivers, Davison and Deka 1997; Weber and Wong 1993) (but see Lercher and Hurst (2002) for a cautionary note). Thus, if a few SNP loci show substantially more among-population differentiation than the rest, these loci may mark regions of the genome at which there has been divergent selection across populations in the sample. 
Similarly, SNP loci showing substantially less differentiation may mark regions subject to stabilizing selection across populations.

Cavalli-Sforza (1966) may have been the first to suggest using measures of population divergence to detect natural selection, but Lewontin and Krakauer (1973) were the first to propose using Wright’s F-statistics for this purpose. Nei and Maruyama (1975) and Robertson (1975) quickly pointed out that comparing a point estimate of FST for a particular locus with a point estimate for the genomic background fails to account for the large variance in FST among loci expected as a result of genetic drift, variance that is intrinsic to the stochastic evolutionary process and that cannot be eliminated by increased sampling. Nonetheless, Beaumont and Balding (2004) showed that Bayesian p-values derived from locus-specific FST estimates could be used to detect statistical outliers that corresponded to loci under selection in their simulations. Recently, Riebler, Held and Stephan (2008) extended this approach by introducing binary indicator variables whose posterior can be used to identify statistical outliers.

We take a similar approach. To identify loci that show unusually large or unusually small amounts of differentiation at SNP loci, we develop hierarchical Bayesian models for analysis of multilocus, multipopulation SNP data, and we combine them with a novel approach to identify loci that are statistical outliers. Hierarchical models are natural in this context because the underlying patterns of biological diversity are hierarchical. In this paper, populations are predefined based on the geographic origin of samples. Conceptually, we assume that populations have diverged from a common ancestor, a hyperpopulation. Consequently, we assume the allele frequency at each SNP locus in each population is drawn from a common hyperpopulation in which allele frequencies vary across loci. Although all autosomal loci have the same expected value of FST, the population sample at each locus represents a different realization from a stochastic evolutionary process and the realized FST at each locus will be different. Thus, we assume that FST at any particular locus is drawn from a hyperdistribution. The variability of this hyperdistribution reflects the among-realization variability in the stochastic evolutionary process. Loci with substantially greater or substantially smaller amounts of among-population differentiation than are consistent with this hyperdistribution will be identified as outliers. Thus, our inference is based on comparing the posterior distribution of a parameter reflecting locus-specific effects on Wright’s FST (namely, θi, i = 1,...,I) with the posterior distribution of parameters reflecting a genome-wide distribution for FST.

Specifically, we characterize both the common posterior distribution that describes most loci in the sample and the posterior distribution of each θi. In the hierarchical structure we propose, the hyperprior for θi is given by a beta distribution with mean φ and variance φ(1 - φ)θL. Thus, a beta distribution with mean $\hat{\varphi}$ and variance $\hat{\varphi}(1-\hat{\varphi})\hat{\theta}_L$ is a suitable choice for the common posterior distribution, where $\hat{\varphi}$ and $\hat{\theta}_L$ refer to posterior means. The posterior distributions of the θi are unimodal and have support on [0, 1]. Thus, we approximate them with a beta distribution by matching the posterior means and variances. By using this approach, we have a closed form for the posterior density function, and we can use the well-accepted Kullback-Leibler divergence (KLD) measure to compare the posterior distribution of each θi with the common posterior distribution specified by $\hat{\varphi}$ and $\hat{\theta}_L$. We calibrate this divergence using the method proposed by Peng and Dey (1995).

There are 23 chromosomes in the human genome and approximately 3.2M SNP loci in the dataset from which our data set is derived. Adjacent loci in the complete data set are separated by an average of only 1000 nucleotides. Thus, some loci are in close physical proximity, and it is reasonable to expect that they will show similar patterns of variation as a result. We introduce a conditional autoregressive (CAR) model for high-resolution genomic data to incorporate the effects of physical proximity among the loci into the model. The proximity effects are brought into the model by constructing a CAR prior for θi. Thus, we consider four models in this paper: (1) a hierarchical model in which we assume that the θi are random samples from a hyperdistribution for FST, (2) a hierarchical model in which we use a product decomposition to distinguish locus- and population-specific effects on FST, (3) a CAR extension of model 1, and (4) a CAR extension of model 2. Because there are no closed form expressions for the posterior distribution of θi, we use a sampling based Markov chain Monte Carlo (MCMC) method to obtain the marginal posterior distributions.

The remainder of the paper is structured as follows: In section 2 we introduce the genetic data used in the analysis; section 3 provides a detailed description of the models; the method of detecting loci that are statistical outliers is described in section 4; in section 5, a simulation study is conducted to demonstrate the proposed models; the application and model comparison are presented in section 6; and section 7 provides a summary of the main results and discusses their implications.

2. THE SNP DATA

Whether we can identify certain loci as having unusually large or unusually small amounts of among-population differentiation depends on both the number of populations included in the sample and on the number of loci scored per individual. Because FST is directly proportional to the allele frequency variance among populations, the precision of FST estimates and our ability to detect outliers will obviously increase as the number of populations included in the sample increases. Moreover, the greater the number of populations included in the sample, the greater the chances that one or more of them have been subject to divergent selection leading to divergent allele frequencies. Similarly, the larger the number of markers included in a data set, the more precisely we are able to estimate the amount of among-locus variability in FST that is expected and the more power we have to detect loci that depart significantly from the common distribution.

We analyze publicly available data from the HapMap project (Consortium 2005), which provides data on circa 3.2M polymorphic SNPs. These data are derived from a relatively small number of individuals (270) and only four populations: Yoruba in Ibadan, Nigeria (YRI); Japanese in Tokyo, Japan (JPT); Han Chinese in Beijing, China (HCB); and Utah residents with ancestry from northern and western Europe (CEPH). Thus, the HapMap data provide an opportunity for high-resolution analysis of variation patterns across the human genome, but the small number of populations included in the sample will allow us to identify only those loci in which departures from the common distribution are especially large.

For the notation we assume individuals are sampled from K populations. By “population” we refer to sampling location. Both the Yoruba population sample (YRI) and the U.S. population sample (CEPH) consist of 30 trios: two parents and one offspring. To avoid modeling the dependence structure this sampling would induce, we analyze only parental genotypes in YRI and CEPH. For each individual, the genotype is determined at I SNP loci. Because nearly all SNP loci have only two alleles, we label alleles as A1 and A2 at each locus. As will become evident in the model description, inference on FST does not depend on the labeling of alleles. The data are aggregated to allele counts by locus and population. Denote xik as the sample size of allele A1 and Nik as the total number of alleles sampled at locus i in population k. Obviously, the sample size of allele A2 at locus i for population k is Nik - xik.

In order to implement the CAR model, we also need a proximity or adjacency matrix, W, in which element wij represents the spatial dependence between locus i and j. We consider distances measured in terms of the frequency of recombination between the markers. To calculate recombinational distances we used map positions (measured in centimorgans) as estimated by Peter Donnelly, Gil McVean, and Simon Myers in a dataset available for download from the HapMap site.

We focus our attention on SNP loci on human chromosome 7, for which 201,656 SNP loci were scored in the four populations included in the HapMap data set. The number of alleles in each population varies from around 70 to 120. Some populations are completely lacking genotypes at particular SNP loci. We included only those loci for which genotype counts were available for all populations. The pruned data set includes 177,374 loci. We also exclude a small number of loci in which all populations are fixed for one allele, i.e., loci for which the frequency of one allele is zero in all populations. At such loci there is no among-population variation in allele frequency to assess. Loci that are monomorphic in all populations may mark genomic regions subject to strong stabilizing selection. Thus, by excluding these loci from our analysis we reduce our ability to detect loci showing unusually small amounts of divergence among populations. To screen the whole chromosome while keeping the computational demands reasonable, we first perform a low-resolution scan by selecting loci throughout chromosome 7 but including only every 50th locus. The final analysis includes 3040 loci separated by 52,000 nucleotides on average. We then focus on a region marked by a strong statistical outlier in the low-resolution scan and perform a high-resolution scan that includes 3002 loci at intervals of approximately 860 nucleotides.

3. MODELS

3.1 Describing genetic structure

First consider one locus with multiple alleles. Let pm,k be the frequency of allele Am (m = 1, . . ., M) in population k (k = 1, . . ., K), and assume that alleles are associated randomly within individuals (i.e., genotypes are in Hardy-Weinberg proportions) so that the frequency of the ordered genotype (m ≤ n) AmAn in population k is given by

$$\gamma_{mn,k} = \begin{cases} p_{m,k}^2 & \text{for } m = n, \\ 2p_{m,k}\,p_{n,k} & \text{for } m \neq n. \end{cases}$$

Then the mean genotype frequency, γmn·, across the set of K populations is given by

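$$\gamma_{mn\cdot} = \frac{1}{K}\sum_{k=1}^{K}\gamma_{mn,k} = \begin{cases} p_{m\cdot}^2 + s_{p_m}^2 & \text{for } m = n, \\ 2p_{m\cdot}\,p_{n\cdot} + 2s_{p_m p_n} & \text{for } m \neq n, \end{cases} \tag{1}$$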

where $p_{m\cdot} = \frac{1}{K}\sum_{k=1}^{K} p_{m,k}$, $s_{p_m}^2 = \frac{1}{K}\sum_{k=1}^{K}(p_{m,k} - p_{m\cdot})^2$, and $s_{p_m p_n} = \frac{1}{K}\sum_{k=1}^{K}(p_{m,k} - p_{m\cdot})(p_{n,k} - p_{n\cdot})$ (see Li (1955)). If alleles are exchangeable in the underlying stochastic evolutionary process, the allele frequencies are identically distributed at stationarity under quite general conditions (Fu et al. 2003). Specifically, E(p) = π, $E(s_{p_m}^2) = \sigma_p^2$, and $E(s_{p_m p_n}) = \rho\sigma_p^2$, where π, $\sigma_p^2$, and ρ are the common values (Fu et al. 2003). Under these conditions the expectation of the γmn· can be written as

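$$E(\gamma_{mn\cdot}) = \begin{cases} \pi^2 + F_{st}\,\pi(1-\pi) & \text{for } m = n, \\ 2\pi(1-\pi)(1-F_{st}) & \text{for } m \neq n, \end{cases} \tag{2}$$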

where

$$F_{st} = \frac{\sigma_p^2}{\pi(1-\pi)}. \tag{3}$$
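The decomposition of mean genotype frequencies into mean allele frequencies plus variances and covariances can be checked numerically. The sketch below is only an illustration: the function names and the example frequencies are our own assumptions, and the covariance is used with its own sign (it is typically negative for distinct alleles).

```python
def mean_genotype_frequencies(p):
    """p[k][m] is the frequency of allele m in population k (each row sums to 1).

    Returns two dictionaries keyed by (m, n) with m <= n:
    - direct: Hardy-Weinberg genotype frequencies averaged over populations;
    - via_moments: the same quantities rebuilt from mean allele frequencies
      and the among-population variances and covariances (divisor K).
    """
    K, M = len(p), len(p[0])
    pbar = [sum(p[k][m] for k in range(K)) / K for m in range(M)]
    direct, via_moments = {}, {}
    for m in range(M):
        for n in range(m, M):
            # signed covariance between alleles m and n (a variance when m == n)
            s = sum((p[k][m] - pbar[m]) * (p[k][n] - pbar[n]) for k in range(K)) / K
            if m == n:
                direct[(m, n)] = sum(p[k][m] ** 2 for k in range(K)) / K
                via_moments[(m, n)] = pbar[m] ** 2 + s
            else:
                direct[(m, n)] = sum(2 * p[k][m] * p[k][n] for k in range(K)) / K
                via_moments[(m, n)] = 2 * pbar[m] * pbar[n] + 2 * s
    return direct, via_moments


def fst(pi, var_p):
    """Equation (3): F_ST as the allele-frequency variance scaled by pi(1 - pi)."""
    return var_p / (pi * (1.0 - pi))


# Illustrative allele frequencies for three populations and two alleles.
example = [[0.1, 0.9], [0.5, 0.5], [0.9, 0.1]]
direct, via_moments = mean_genotype_frequencies(example)
```

For these illustrative frequencies the two dictionaries agree entry by entry, and the mean genotype frequencies sum to one.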

Since the work of Wright (1951) and Malécot (1948), FST has been the most widely used statistic for summarizing patterns of among-population differentiation in population genetics.

Assume that we have a sample of allelic data from I loci. For notational simplicity we restrict our attention to the case where each locus has only two alleles. The models discussed in this paper can be relatively easily extended to multiple alleles, and an outline of the extension is introduced in Holsinger (1999). Let xik denote the count of allele A1 in the sample from locus i in population k, let Nik be the total number of alleles sampled at locus i in population k, and let pik be the “true” allele frequency at locus i of population k. Then the first-stage likelihood is a product binomial:

$$f(\mathbf{x} \mid \mathbf{p}) \propto \prod_{i=1}^{I}\prod_{k=1}^{K} p_{ik}^{x_{ik}}(1-p_{ik})^{N_{ik}-x_{ik}}. \tag{4}$$
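As a minimal sketch of the first-stage likelihood, the logarithm of the product-binomial kernel can be accumulated locus by locus and population by population. The function name and the nested-list data layout are assumptions made for illustration; binomial coefficients are omitted because they do not depend on the parameters.

```python
import math


def log_likelihood(x, N, p):
    """Log of the product-binomial kernel, up to an additive constant.

    x[i][k]: count of allele A1 at locus i in population k
    N[i][k]: total number of alleles sampled at locus i in population k
    p[i][k]: allele frequency, assumed to lie strictly between 0 and 1
    """
    total = 0.0
    for x_i, N_i, p_i in zip(x, N, p):
        for x_ik, N_ik, p_ik in zip(x_i, N_i, p_i):
            total += x_ik * math.log(p_ik) + (N_ik - x_ik) * math.log1p(-p_ik)
    return total
```

Swapping the allele labels (replacing each count by N - x and each frequency by 1 - p) leaves the value unchanged, which mirrors the remark in Section 2 that inference does not depend on the labeling of alleles.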

We assume that the distribution of allele frequencies among populations at locus i is a beta distribution with parameters ((1 - θx)/θx)πi and ((1 - θx)/θx)(1 - πi) and that the distribution of allele frequencies across loci is a beta distribution with parameters ((1 - θy)/θy)π and ((1-θy)/θy)(1-π). It is straightforward to show that this hierarchical structure produces a mean and covariance structure that matches (2) and (3) (Holsinger 2006; Song et al. 2006). While the stationary distribution of among-population allele frequencies follows a beta distribution in some evolutionary models (Crow and Kimura 1970), we make no explicit assumption about the underlying evolutionary process in using a beta distribution to describe variation in allele frequencies among populations. We adopt it simply because it is a flexible distribution suitable for many distributions on [0,1]. Indeed, in a dataset with samples from a large number of populations it may be desirable to consider a finite mixture of beta distributions to allow for multimodality in the allele frequency distribution. Placing vague, uniform priors on π, θx, and θy completes the specification of a Bayesian model and allows us to construct an MCMC sampler for inference on the parameters. In particular, θx(1 - θy) + θy is mathematically equivalent to FST as estimated in the random-effects model of Weir and Cockerham (1984). FST provides a convenient measure of genetic differentiation among populations, because it is interpretable as the proportion of genetic diversity due to allele frequency differences among populations. Different demographic histories, different local population sizes, and different patterns of migration will lead to different amounts of among-population differentiation and to correspondingly different values of FST, but all autosomal loci within an individual will be affected in the same way. 
Thus, all autosomal loci in a typical population sample will have values of FST drawn from the same distribution unless rates of mutation or patterns of selection differ substantially. As in Akey et al. (2002), Beaumont and Balding (2004), Storz, Payseur and Nachman (2004), and Weir et al. (2005), we shall use locus-specific estimates of FST to detect loci showing divergent patterns of variation.

All four models proposed in this paper are based on the framework introduced above. Figure 1 shows a directed acyclic graph (DAG) of the structure of each model.

Figure 1. DAG plot of the models.

3.2 Model 1

The first model proposed here extends the simple framework above by incorporating locus-specific estimates of FST. As discussed above, all loci have two alleles and the likelihood of the data is a product binomial distribution. We place a beta prior with parameters ((1 - θi)/θi)ψi and ((1 - θi)/θi)(1 - ψi) on the binomial parameter pik. It can be easily shown that the expectation of pik in the prior distribution is ψi and that its variance is θiψi(1 - ψi). Thus, θi corresponds directly with Wright’s FST for locus i. We adopt a fully Bayesian approach and place second- and third-level hierarchical priors on θi and ψi. The posterior distribution is as follows:

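$$\pi(\Theta \mid D) \propto f(\mathbf{x} \mid \mathbf{p})\,\pi(\mathbf{p} \mid \boldsymbol{\theta}, \boldsymbol{\psi})\,\pi(\boldsymbol{\theta} \mid \theta_L, \varphi)\,\pi(\boldsymbol{\psi} \mid \psi_H, \nu)\,\pi(\theta_L)\,\pi(\varphi)\,\pi(\psi_H)\,\pi(\nu),$$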

where Θ is the collection of all the model parameters; f(x|p) is the likelihood function as in (4). The first level prior for p is,

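$$\pi(\mathbf{p} \mid \boldsymbol{\theta}, \boldsymbol{\psi}) = \prod_{i=1}^{I}\prod_{k=1}^{K} \frac{\Gamma\left(\frac{1-\theta_i}{\theta_i}\right)}{\Gamma\left(\frac{1-\theta_i}{\theta_i}\psi_i\right)\Gamma\left(\frac{1-\theta_i}{\theta_i}(1-\psi_i)\right)}\; p_{ik}^{\frac{1-\theta_i}{\theta_i}\psi_i - 1}\,(1-p_{ik})^{\frac{1-\theta_i}{\theta_i}(1-\psi_i) - 1},$$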

which is a product of beta distributions with parameters $\left(\frac{1-\theta_i}{\theta_i}\psi_i,\ \frac{1-\theta_i}{\theta_i}(1-\psi_i)\right)$. The hyperparameter θi corresponds to FST at locus i and is the key parameter of interest. We place a second level of prior on θ as follows:

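$$\pi(\boldsymbol{\theta} \mid \theta_L, \varphi) = \prod_{i=1}^{I} \frac{\Gamma\left(\frac{1-\theta_L}{\theta_L}\right)}{\Gamma\left(\frac{1-\theta_L}{\theta_L}\varphi\right)\Gamma\left(\frac{1-\theta_L}{\theta_L}(1-\varphi)\right)}\; \theta_i^{\frac{1-\theta_L}{\theta_L}\varphi - 1}\,(1-\theta_i)^{\frac{1-\theta_L}{\theta_L}(1-\varphi) - 1}, \tag{5}$$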

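which is a beta prior with parameters $\left(\frac{1-\theta_L}{\theta_L}\varphi,\ \frac{1-\theta_L}{\theta_L}(1-\varphi)\right)$. The second level prior for ψ is a beta distribution with parameters $\left(\frac{1-\nu}{\nu}\psi_H,\ \frac{1-\nu}{\nu}(1-\psi_H)\right)$,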

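$$\pi(\boldsymbol{\psi} \mid \nu, \psi_H) = \prod_{i=1}^{I} \frac{\Gamma\left(\frac{1-\nu}{\nu}\right)}{\Gamma\left(\frac{1-\nu}{\nu}\psi_H\right)\Gamma\left(\frac{1-\nu}{\nu}(1-\psi_H)\right)}\; \psi_i^{\frac{1-\nu}{\nu}\psi_H - 1}\,(1-\psi_i)^{\frac{1-\nu}{\nu}(1-\psi_H) - 1}. \tag{6}$$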

At the third level, there is no preference for any particular value. We use a Uniform(0,1) prior for θL, ν, ψH, and φ.
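The claim that the first-level beta prior in Model 1 has mean ψi and variance θiψi(1 - ψi) can be verified directly from the standard beta moments. The following sketch does so for one illustrative pair of values; the function names and the numbers are assumptions chosen only for the check.

```python
def beta_params(theta, psi):
    """Shape parameters of the first-level beta prior for p_ik in Model 1."""
    c = (1.0 - theta) / theta
    return c * psi, c * (1.0 - psi)


def beta_mean_var(a, b):
    """Mean and variance of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return mean, var


theta_i, psi_i = 0.12, 0.3  # illustrative values, not estimates from the data
mean, var = beta_mean_var(*beta_params(theta_i, psi_i))
# mean equals psi_i and var equals theta_i * psi_i * (1 - psi_i), up to rounding
```

The check works because the two shape parameters sum to (1 - θi)/θi, so the factor 1/(a + b + 1) in the beta variance reduces to θi.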

3.3 Model 2

The second model proposed is an extension of the first model. We replace θi in Model 1 with θik for locus i and population k, where θik = 1 - (1 - θi)(1 - θk). In this formulation θi represents a locus-specific effect and θk represents a population-specific effect. As in Model 1, hierarchical beta priors are assigned to θi and θk respectively. The posterior distribution is as follows:

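$$\pi(\Theta \mid D) \propto f(\mathbf{x} \mid \mathbf{p})\,\pi(\mathbf{p} \mid \boldsymbol{\theta}, \boldsymbol{\theta}_k, \boldsymbol{\psi})\,\pi(\boldsymbol{\theta} \mid \theta_L, \varphi)\,\pi(\boldsymbol{\theta}_k \mid \theta_P, \phi)\,\pi(\boldsymbol{\psi} \mid \psi_H, \nu)\,\pi(\theta_L)\,\pi(\varphi)\,\pi(\theta_P)\,\pi(\phi)\,\pi(\psi_H)\,\pi(\nu)$$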

where $\boldsymbol{\theta}_k$ is the vector of population-specific effects θk, k = 1, ..., K. We further assume that the parameters θk come from a hyper-beta distribution of the following form:

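$$\pi(\boldsymbol{\theta}_k \mid \theta_P, \phi) = \prod_{k=1}^{K} \frac{\Gamma\left(\frac{1-\theta_P}{\theta_P}\right)}{\Gamma\left(\frac{1-\theta_P}{\theta_P}\phi\right)\Gamma\left(\frac{1-\theta_P}{\theta_P}(1-\phi)\right)}\; \theta_k^{\frac{1-\theta_P}{\theta_P}\phi - 1}\,(1-\theta_k)^{\frac{1-\theta_P}{\theta_P}(1-\phi) - 1}.$$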

The prior for θ, ψ is the same as in model 1, equations (5) and (6). Further, priors for θL, θP, ν, ϕ, φ, and ψH are assumed Uniform(0,1).
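The product decomposition in Model 2 is easy to explore numerically. The sketch below, with hypothetical function names and illustrative values, combines a locus-specific and a population-specific effect and returns the corresponding beta shape parameters for pik.

```python
def theta_ik(theta_i, theta_k):
    """Combined effect in Model 2: 1 - theta_ik = (1 - theta_i)(1 - theta_k)."""
    return 1.0 - (1.0 - theta_i) * (1.0 - theta_k)


def model2_beta_params(theta_i, theta_k, psi_i):
    """Beta shape parameters for p_ik when theta_i is replaced by theta_ik."""
    t = theta_ik(theta_i, theta_k)
    c = (1.0 - t) / t
    return c * psi_i, c * (1.0 - psi_i)


# Illustrative values: a locus effect of 0.10 and a population effect of 0.20.
combined = theta_ik(0.10, 0.20)
```

Under this decomposition the combined effect is never smaller than either component, and it reduces to the population effect alone when the locus effect is zero.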

3.4 CAR Model

The hierarchical models discussed above allow the θi to “borrow strength” from other sites in estimating their posterior distributions, but they treat variation at each locus as if it were independent of variation at all other loci. Analysis of locus-specific effects at high genomic resolution is essential if the results of our method are to provide experimentalists with a guide for selecting regions worthy of additional study. But at high resolutions the reduced probability of recombination among adjacent SNP markers is likely to lead to similar patterns of differentiation among populations, i.e., we expect the locus-specific effects of neighboring loci on FST to be similar. To account for this correlation, we adopt a common methodology used in the analysis of spatial variation in geographical models, namely a conditional autoregressive (CAR) model on random effects associated with each locus. The basic idea is that the loci close to each other are more likely to have similar amounts of among-population differentiation and thus similar posterior distributions for θi. Specifically, we incorporate the local correlation into the model through a CAR prior for θi in Model 1 and Model 2.

We incorporate the local correlation structure into hierarchical Models 1 and 2 by placing a prior distribution with CAR components on θi, and we use a logit transformation to extend the support of θi to the entire real line. In short, the model specification is as follows:

$$\log\left(\frac{\theta_i}{1-\theta_i}\right) = \mu + \epsilon_i, \quad i = 1, 2, \ldots, I. \tag{7}$$

Here μ captures the global mean and ∊i represents a random effect associated with locus i. We place a normal prior with mean zero and variance 1/τh on μ, i.e.,

$$\mu \sim N\left(0, \frac{1}{\tau_h}\right). \tag{8}$$

We place a CAR prior on ∊i to incorporate the local correlation among loci. The CAR prior reflects our expectation that at high genomic resolution ∊i and ∊i′ will be of similar sign and magnitude when the genetic distance between i and i′ is small. The CAR prior has the following conditional form:

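$$\epsilon_i \mid \epsilon_{(-i)} \sim N\left(\sum_{j \neq i}\frac{w_{ij}}{w_{i+}}\,\epsilon_j,\ \frac{1}{\tau_c w_{i+}}\right), \quad i = 1, 2, \ldots, I, \tag{9}$$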

where ∊(-i) is the collection of ∊j, ∀ j ≠ i, τc is a precision parameter, wij is the entry at row i and column j of proximity matrix W, and $w_{i+} = \sum_{j=1}^{I} w_{ij}$. W is an I × I proximity matrix in which entry wij indicates the spatial relationship between loci i and j. Several choices of W are possible. A simple choice would be to use 0 or 1 to indicate whether or not two loci are “close” to each other, where “close” is defined as being within a certain distance. Because we expect the statistical association among loci to be related to the recombinational distance between loci, we define wij as a function of the distance between loci,

$$w_{ij} = \begin{cases} c(d_{ij}) & \text{if loci } i \neq j, \\ 0 & \text{if loci } i = j, \end{cases}$$

where dij is the distance between loci i and j, and c(dij) is a function that describes how the covariance among loci depends on the distance between them. The function c(dij) is usually a decreasing function of dij, often a reciprocal or an exponential. Because the recombinational distance between some of our loci in the high-resolution scan is zero, we use the exponential function, c(dij) = c1 + c2 exp(-c3dij), where c1, c2, c3 are constants chosen for computational convenience and numerical stability. We chose the values of c1, c2, c3 so that (1) only the 20-100 nearest loci have a large influence; (2) the average value of wi+ is approximately 1; and (3) wi+ is greater than 0.5 to avoid numerical instability associated with small values of wi+.
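A proximity matrix of this exponential form can be sketched as follows. The map positions and the constants c1, c2, c3 below are hypothetical placeholders chosen for illustration; they are not the values used in the analysis, which the text does not report.

```python
import math


def proximity_matrix(positions, c1, c2, c3):
    """Build W with w_ij = c1 + c2 * exp(-c3 * d_ij) for i != j and w_ii = 0.

    positions: map positions of the loci (e.g., in centimorgans), so that
    d_ij = |positions[i] - positions[j]| is a recombinational distance.
    """
    I = len(positions)
    W = [[0.0] * I for _ in range(I)]
    for i in range(I):
        for j in range(i + 1, I):
            d = abs(positions[i] - positions[j])
            W[i][j] = W[j][i] = c1 + c2 * math.exp(-c3 * d)
    return W


def row_sums(W):
    """The quantities w_{i+} that scale the conditional mean and precision."""
    return [sum(row) for row in W]


# Hypothetical positions; the first two loci have zero recombinational distance.
positions = [0.0, 0.0, 0.01, 0.5]
W = proximity_matrix(positions, c1=0.001, c2=0.5, c3=50.0)
w_plus = row_sums(W)
```

Unlike a reciprocal weight, the exponential form stays finite when two loci have zero recombinational distance, which is the reason the text gives for preferring it.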

The joint prior distribution for the ∊i is

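$$\pi(\epsilon_1, \ldots, \epsilon_I) \propto \exp\left\{-\frac{\tau_c}{2}\sum_{i \neq j} w_{ij}(\epsilon_i - \epsilon_j)^2\right\}.$$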

Note that this is a pairwise difference model and is not a proper distribution (Banerjee, Carlin and Gelfand 2004). In particular, the ∊i are nonidentifiable. As usual in such models, the constraint $\sum_i \epsilon_i = 0$ is sufficient to guarantee identifiability. Here θi can be calculated from the inverse logit function

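$$\theta_i = \frac{e^{\mu + \epsilon_i}}{1 + e^{\mu + \epsilon_i}}, \quad \text{and} \quad \frac{1-\theta_i}{\theta_i} = e^{-(\mu + \epsilon_i)}.$$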

By replacing the θi in Model 1 with μ and ∊i, the prior for pik then reduces to

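$$\pi(p_{ik} \mid \epsilon_i, \mu, \psi_i) = \frac{1}{B\left(e^{-(\mu+\epsilon_i)}\psi_i,\ e^{-(\mu+\epsilon_i)}(1-\psi_i)\right)}\; p_{ik}^{e^{-(\mu+\epsilon_i)}\psi_i - 1}\,(1-p_{ik})^{e^{-(\mu+\epsilon_i)}(1-\psi_i) - 1},$$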

where B(·,·) denotes the beta function. The rest of the model specification is the same as Model 1.

The last model proposed uses the CAR prior for θi in Model 2. Again, we use a logit transformation for θi and a random effect model as in (7), (8), and (9). Then using the identity (1 - θik) = (1 - θi)(1 - θk) we have,

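$$\theta_{ik} = 1 - \frac{1-\theta_k}{1 + e^{\mu + \epsilon_i}}, \quad \text{and} \quad \frac{1-\theta_{ik}}{\theta_{ik}} = \frac{1-\theta_k}{e^{\mu + \epsilon_i} + \theta_k}.$$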

Thus the prior for pik is obtained as,

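$$\pi(p_{ik} \mid \mu, \epsilon_i, \theta_k, \psi_i) = \frac{1}{B\left(\frac{1-\theta_k}{\exp(\mu+\epsilon_i)+\theta_k}\psi_i,\ \frac{1-\theta_k}{\exp(\mu+\epsilon_i)+\theta_k}(1-\psi_i)\right)}\; p_{ik}^{\frac{1-\theta_k}{\exp(\mu+\epsilon_i)+\theta_k}\psi_i - 1}\,(1-p_{ik})^{\frac{1-\theta_k}{\exp(\mu+\epsilon_i)+\theta_k}(1-\psi_i) - 1}.$$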

The rest of the model specification is the same as in Model 2.
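The algebra linking the logit-scale parameters to the beta shape parameters in the two CAR models can be checked with a few lines of code. In this sketch the function names and numerical values are assumptions for illustration; each "concentration" is the factor (1 - θ)/θ that multiplies ψi and 1 - ψi in the prior for pik.

```python
import math


def inv_logit(z):
    """Inverse of the logit transformation used for theta_i."""
    return 1.0 / (1.0 + math.exp(-z))


def concentration_car1(mu, eps_i):
    """(1 - theta_i) / theta_i for the CAR extension of Model 1."""
    return math.exp(-(mu + eps_i))


def concentration_car2(mu, eps_i, theta_k):
    """(1 - theta_ik) / theta_ik for the CAR extension of Model 2."""
    return (1.0 - theta_k) / (math.exp(mu + eps_i) + theta_k)


# Illustrative values on the logit scale and for the population effect.
mu, eps_i, theta_k = -2.0, 0.3, 0.05
theta_i = inv_logit(mu + eps_i)
theta_ik = 1.0 - (1.0 - theta_i) * (1.0 - theta_k)
# Both closed forms agree with the ratios computed directly from theta_i, theta_ik.
direct_car1 = (1.0 - theta_i) / theta_i
direct_car2 = (1.0 - theta_ik) / theta_ik
```

For these values the direct ratios match `concentration_car1` and `concentration_car2` to rounding error, which confirms the two identities used to rewrite the priors.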

As recommended by Banerjee et al. (2004), we adopt the following prior distributions for τc and τh: τh ∼ Gamma(0.001, 0.001), and τc ∼ Gamma(0.1, 0.1).

4. DETECTING LOCI WITH UNUSUAL PATTERNS OF VARIATION

The overarching objective of our models is to allow us to identify loci that are “unusual,” i.e., loci for which the amount of among-population variation differs substantially from that at other loci. Statistically, this corresponds to identifying loci for which θi is either unusually large or unusually small. As Nei and Maruyama (1975) and Robertson (1975) pointed out more than 30 years ago, however, it is not sufficient to ask whether a particular θi is different from a common mean. Such a comparison would account only for the statistical uncertainty associated with parameter estimates. It would neglect the much larger uncertainty often associated with the underlying stochastic evolutionary process. In our approach, we assume that the θi are drawn independently from a common hyperdistribution. Thus, if all loci in the sample were selectively neutral, the variability among loci in θi captured by this hyperdistribution would reflect variability in outcomes associated with different realizations of the underlying stochastic evolutionary process. If mutation rates differ among loci, that variation will also be reflected in the variability of this hyperdistribution. Thus, to detect θi that are unusually large or unusually small, we must compare them with a common distribution rather than a common mean.

We propose the following steps to detect outliers: (1) Approximate the posterior distribution of locus-specific effect parameters, i.e., θi for the beta-hierarchical model and ∊i for the CAR models (see next paragraph for details). (2) Calculate the distance between the locus-specific effect and a “centering” distribution derived from the hyperdistribution describing among-locus variation in the locus-specific effect. (3) Compare the mean of the posterior distribution for loci identified as outliers with the mean of the “centering” distribution to identify loci with unusually large or unusually small amounts of among-population differentiation.

Our preliminary analysis shows that the posterior distribution of θi is unimodal. It is well known that any unimodal distribution with support on [0, 1] can be approximated by a beta distribution. Therefore, we approximate the posterior distribution of θi with a beta distribution whose first two moments match the first two moments of the posterior distribution for θi as estimated from the MCMC output. We compare the posterior distribution for each θi with the posterior of its hyperdistribution. For example, in Model 1, each θi is compared with

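$$\text{Beta}\left(\frac{1-\hat{\theta}_L}{\hat{\theta}_L}\hat{\varphi},\ \frac{1-\hat{\theta}_L}{\hat{\theta}_L}(1-\hat{\varphi})\right),$$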

where $\hat{\varphi}$ and $\hat{\theta}_L$ refer to posterior means. The loci for which the posterior of θi diverges substantially from this hyperdistribution are considered outliers.
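The moment-matching step can be sketched as below. The function names are hypothetical, and the input is assumed to be a list of MCMC draws of a single θi; the divisor-n variance is used here, although with long chains the choice of divisor makes little difference.

```python
from statistics import fmean, pvariance


def fit_beta_by_moments(samples):
    """Beta(a, b) whose mean and variance match those of the draws."""
    m = fmean(samples)
    v = pvariance(samples, mu=m)
    if not 0.0 < v < m * (1.0 - m):
        raise ValueError("sample moments are not attainable by a beta distribution")
    c = m * (1.0 - m) / v - 1.0  # c = a + b
    return m * c, (1.0 - m) * c


def centering_beta(theta_L_hat, phi_hat):
    """Shape parameters of the centering distribution for Model 1."""
    c = (1.0 - theta_L_hat) / theta_L_hat
    return c * phi_hat, c * (1.0 - phi_hat)


# Toy draws standing in for MCMC output (illustrative only).
a_i, b_i = fit_beta_by_moments([0.1, 0.2, 0.3, 0.2])
a_c, b_c = centering_beta(theta_L_hat=0.1, phi_hat=0.2)
```

The guard on the variance reflects the fact that a beta distribution with mean m cannot have variance as large as m(1 - m).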

We use the KLD to measure the divergence between the posterior of θi and its centering distribution. The KLD between two densities p(y) and q(y) is defined as

$$\mathrm{KLD}(p, q) = \int p(y)\log\left(\frac{p(y)}{q(y)}\right)dy.$$

The KLD between two beta distributions with parameters (α0, β0) and (α1, β1) is given by

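$$\mathrm{KLD} = \int_0^1 \frac{1}{B(\alpha_0, \beta_0)}\theta^{\alpha_0 - 1}(1-\theta)^{\beta_0 - 1}\log\frac{\frac{1}{B(\alpha_0, \beta_0)}\theta^{\alpha_0 - 1}(1-\theta)^{\beta_0 - 1}}{\frac{1}{B(\alpha_1, \beta_1)}\theta^{\alpha_1 - 1}(1-\theta)^{\beta_1 - 1}}\,d\theta.$$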

If X ∼ Beta(α, β) then 1-X ∼ Beta(β, α). Furthermore, if X ∼ Beta(α, β), then E[log X] = ψ(α) - ψ(α + β), where $\psi(\alpha) = \Gamma'(\alpha)/\Gamma(\alpha)$ is the digamma function. Thus, E[log(1 - X)] = ψ(β) - ψ(α + β), and the KLD between two beta distributions is

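$$\mathrm{KLD} = \log\frac{B(\alpha_1, \beta_1)}{B(\alpha_0, \beta_0)} + (\alpha_0 - \alpha_1)\left(\psi(\alpha_0) - \psi(\alpha_0 + \beta_0)\right) + (\beta_0 - \beta_1)\left(\psi(\beta_0) - \psi(\alpha_0 + \beta_0)\right),$$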

where ψ(·) is the digamma function.
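The closed form for the KLD between two beta distributions is straightforward to evaluate. The sketch below uses only the Python standard library, so the digamma function is approximated by a recurrence followed by a short asymptotic series; this helper is our own illustrative implementation (a library routine could be substituted), and its accuracy is roughly eight decimal places.

```python
import math


def digamma(x):
    """Approximate digamma for x > 0: shift x upward, then use an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x  # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # psi(x) ~ log(x) - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    result += math.log(x) - 0.5 * inv - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))
    return result


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def kld_beta(a0, b0, a1, b1):
    """KLD from Beta(a0, b0) to Beta(a1, b1), following the closed form above."""
    return (log_beta(a1, b1) - log_beta(a0, b0)
            + (a0 - a1) * (digamma(a0) - digamma(a0 + b0))
            + (b0 - b1) * (digamma(b0) - digamma(a0 + b0)))
```

As a sanity check, the divergence from a uniform Beta(1, 1) to Beta(2, 1) can be integrated by hand and equals 1 - log 2, which the function reproduces to within the accuracy of the digamma approximation.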

Similarly, we compare the posterior distribution of each locus-specific effect from the CAR models, ∊i, with the posterior of its corresponding hyperdistribution. The “centering” distribution can be calculated in two ways, corresponding to detecting loci with unusually large or unusually small amounts of differentiation either relative to near neighbors (“local outliers”) or relative to all loci in the sample (“global outliers”). The “centering” distribution for detecting local outliers is

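$$\hat{\epsilon}_i \mid \hat{\epsilon}_{(-i)} \sim N\left(\sum_{j}\frac{w_{ij}}{w_{i+}}\,\hat{\epsilon}_j,\ \frac{1}{\hat{\tau}_c w_{i+}}\right), \quad i = 1, 2, \ldots, I, \tag{10}$$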

where $\hat{\epsilon}_j$ and $\hat{\tau}_c$ refer to posterior means. Accordingly we define

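$$\mathrm{KLD}_{\mathrm{local}} = \mathrm{KLD}\left(\epsilon_i \mid D,\ \hat{\epsilon}_i \mid \hat{\epsilon}_{(-i)}\right), \tag{11}$$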

where ∊i|D is the marginal posterior distribution of ∊i. We approximate this distribution with a normal distribution by matching the first two moments. So KLDlocal provides a measure of the divergence between the posterior of ∊i and a locally smoothed estimate. A large KLDlocal indicates a locus differing substantially from its near neighbors.

Recall that for identifiability of the model we impose the constraint Σi ϕi = 0. Thus, a locus for which ϕi is substantially different from zero exhibits either substantially more or substantially less differentiation among populations than the average locus in the sample. In short, it also makes sense to compare ϕi | D with the marginal distribution, N(0, 1/(τ̂c wi+)). Accordingly we define

$$\mathrm{KLD}_{\mathrm{global}}=\mathrm{KLD}\left(\phi_i \mid D,\ N\left(0,\ \frac{1}{\hat{\tau}_c w_{i+}}\right)\right). \tag{12}$$

KLDglobal measures the divergence between the posterior distribution of θi and the mean among-population differentiation. A group of loci with large global KLD but small local KLD indicates a cluster of loci with substantially more or substantially less among-population differentiation than the average locus in the sample. It is straightforward to show that the KLD between two normal distributions is

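$$\mathrm{KLD}\bigl(N(\mu_0,\sigma_0^2),\,N(\mu_1,\sigma_1^2)\bigr)=\frac{1}{2}\left[\log\frac{\sigma_1^2}{\sigma_0^2}+\frac{\sigma_0^2}{\sigma_1^2}+\frac{(\mu_0-\mu_1)^2}{\sigma_1^2}-1\right].$$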
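The normal-to-normal KLD is a one-line computation. In the sketch below the function name is hypothetical, the first pair of arguments describes the normal approximation to the posterior, and the second pair describes the centering distribution; variances, not standard deviations, are passed.

```python
import math


def kld_normal(mu0, var0, mu1, var1):
    """KLD from N(mu0, var0) to N(mu1, var1), variances as arguments."""
    if var0 <= 0.0 or var1 <= 0.0:
        raise ValueError("variances must be positive")
    return 0.5 * (math.log(var1 / var0)
                  + var0 / var1
                  + (mu0 - mu1) ** 2 / var1
                  - 1.0)
```

For a global comparison one would call `kld_normal(post_mean, post_var, 0.0, center_var)`, where `center_var` is the variance of the zero-mean centering distribution; these argument names are illustrative only.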

We calibrate the KLD using the method proposed by Peng and Dey (1995). Consider flipping a “fair” coin with equal probability 0.5 for head and tail versus flipping a biased coin with probability θ for head. The larger |θ - 0.5| is, the more “extreme” the bias. The KLD measure between these two Bernoulli distributions is

$$L=\log(0.5)-0.5\log\bigl(\theta(1-\theta)\bigr).$$

For example, θ = 0.01 corresponds to a strong bias and a KLD value of 1.614.
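The calibration can be checked numerically. The sketch below evaluates the KLD between a fair coin and a coin with head probability p, written as log(0.5) − 0.5 log(p(1 − p)); the function name is hypothetical.

```python
import math


def calibrated_kld(p):
    """KLD between Bernoulli(0.5) and Bernoulli(p), used to translate a
    coin bias p into a critical KLD value."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(0.5) - 0.5 * math.log(p * (1.0 - p))
```

Evaluating the function at p = 0.01 gives 1.614 to three decimals, and at p = 0.05 it gives 0.830, the critical value quoted for the simulation study.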

The KLD value provides a measure of the distance between two distributions but no information about the relative locations of the centers of the two distributions. For example, two normal distributions, N(-1, 3) and N(1, 3), both have the same KLD relative to a standard normal distribution, N(0, 1). Outlier detection thus follows a two-stage procedure. First, we identify loci with a large KLD between the posterior of the locus-specific effect, θi or ϕi, and the corresponding centering distribution. Second, we compare the posterior means of θi or ϕi for loci identified as outliers with the means of the centering distribution to determine whether the locus shows unusually large or unusually small amounts of among-population differentiation.
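The two-stage procedure can be written as a small decision rule. This is only a sketch: the function name and labels are hypothetical, and treating a KLD exactly equal to the critical value as "not an outlier" is an assumption of the example.

```python
def classify_locus(kld, posterior_mean, centering_mean, critical_kld):
    """Stage 1: flag the locus if its KLD exceeds the critical value.
    Stage 2: use the direction of the mean shift to label the outlier."""
    if kld <= critical_kld:
        return "not an outlier"
    if posterior_mean > centering_mean:
        return "unusually large differentiation"
    return "unusually small differentiation"


# Illustrative calls with made-up numbers.
label_a = classify_locus(2.4, 0.45, 0.10, critical_kld=1.614)
label_b = classify_locus(0.3, 0.12, 0.10, critical_kld=1.614)
```

The first call returns "unusually large differentiation" and the second returns "not an outlier".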

5. SIMULATION STUDY

To determine whether outliers detected with our method correspond to loci subject to selection, we simulate allele frequencies under a Wright-Fisher model with migration, mutation, and drift, following Beaumont and Balding (2004). A small number of loci included in the simulation are also subject to natural selection. Specifically, we simulate a sample of allele frequencies drawn from four populations as in the SNP data from the HapMap project. We assume a constant population size of 250 individuals (500 chromosomes) for all populations, and we assume that all sampled loci are independently inherited. The migration rate into a population is chosen by sampling FST from a beta distribution with parameters (0.25, 2.25) and setting m = (1 - FST)/(2NFST), where N is the population size (see Beaumont and Balding (2004) for details). The chosen parameters result in a distribution of FST comparable to that observed in the HapMap data.
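The sampling of migration rates can be sketched as follows. This is not the simulation software used in the study; the function names are hypothetical, and discarding a draw of exactly zero is an assumption added to avoid division by zero. The text does not say how very small draws of FST, which imply m > 1, were handled, so the sketch leaves them unchanged.

```python
import random


def migration_rate(fst, n):
    """m = (1 - FST) / (2 N FST), as given in the text."""
    if not 0.0 < fst <= 1.0:
        raise ValueError("FST must lie in (0, 1]")
    return (1.0 - fst) / (2.0 * n * fst)


def sample_migration_rates(n_pops, n, a=0.25, b=2.25, seed=None):
    """Draw FST ~ Beta(a, b) for each population and convert it to m."""
    rng = random.Random(seed)
    rates = []
    while len(rates) < n_pops:
        fst = rng.betavariate(a, b)
        if fst > 0.0:  # guard against a draw that underflows to zero
            rates.append(migration_rate(fst, n))
    return rates


rates = sample_migration_rates(n_pops=4, n=250, seed=1)
```

A Beta(0.25, 2.25) distribution has mean 0.25 / 2.5 = 0.1, and FST = 0.1 with N = 250 corresponds to m = 0.018.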

The simulation allows for three types of loci: those subject to directional (divergent) selection, those subject to balancing (stabilizing) selection, and neutral loci. Allelic differences at neutral loci do not affect fitness. Levels of within- and among-population variation are determined entirely by migration, mutation, and genetic drift. We assume that the majority of loci are selectively neutral in our simulations. Thus, variation at these loci largely determines the distribution of FST across loci.

At a locus under directional selection, one allele enhances the fitness of individuals carrying it. When the allele enhancing fitness differs among populations, allele frequency differences among populations will be greater than at neutral loci. At a locus under balancing selection, heterozygous individuals are more likely to reproduce than individuals homozygous for either allele. In our simulations, the loci are unlinked and each is either neutral, subject to divergent selection, or subject to balancing selection. In the case of loci subject to directional selection, the relative fitness is 1 + s for the favored homozygote, 1 + s/2 for the heterozygote, and 1 for the disfavored homozygote. In the case of loci subject to balancing selection, the relative fitness of heterozygotes is 1 + s and the relative fitness of both homozygotes is 1.
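The fitness schemes can be illustrated with the standard deterministic one-locus, two-allele recursion under random mating. This sketch ignores drift, migration, and mutation, so it is a textbook simplification rather than the simulation used in the study; the function names are hypothetical.

```python
def genotype_fitness(model, s):
    """Relative fitness of genotypes (AA, Aa, aa); A is the favored
    allele under directional selection."""
    if model == "directional":
        return (1.0 + s, 1.0 + s / 2.0, 1.0)
    if model == "balancing":
        return (1.0, 1.0 + s, 1.0)
    if model == "neutral":
        return (1.0, 1.0, 1.0)
    raise ValueError("unknown selection model: " + model)


def next_allele_frequency(p, fitness):
    """Frequency of allele A after one generation of selection,
    assuming Hardy-Weinberg genotype proportions before selection."""
    w_hom_a, w_het, w_hom_b = fitness
    q = 1.0 - p
    mean_w = p * p * w_hom_a + 2.0 * p * q * w_het + q * q * w_hom_b
    return (p * p * w_hom_a + p * q * w_het) / mean_w
```

Under the directional scheme the favored allele increases in frequency each generation, whereas under the balancing scheme with equal homozygote fitnesses the frequency p = 0.5 is an equilibrium.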

We consider two different mutation models. In the two-locus model, the marker locus is completely linked to the locus that is under selection. The marker locus evolves according to an infinite allele model while the selected locus evolves according to a parent-independent K-allele model with three alleles. In the marker-selected model, the marker itself is subject to selection and evolves according to the parent-independent K-allele model with three alleles (see Beaumont and Balding (2004) for more details). The simulations were implemented using software provided by Mark Beaumont. The mutation rate at marker loci is μm = 0.00001 and at selected loci is μs = 0.0001. We generated 100 samples from each population after 50,000 generations of simulation under 11 different simulation scenarios (Table 1), corresponding to different mutational models, different strengths of selection, and different numbers of loci.

Table 1.

Simulation results

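| Selection model | Selection coefficient | Directional loci | Balancing loci | Neutral loci | Classified as* | True directional | True balancing | True neutral |
|---|---|---|---|---|---|---|---|---|
| 2-locus | 0.02 | 80 | 20 | 900 | Directional | 4% | 0% | 0% |
| | | | | | Balancing | 0% | 0% | 0% |
| | | | | | Neutral | 96% | 100% | 100% |
| 2-locus | 0.05 | 80 | 20 | 900 | Directional | 68% | 0% | 0% |
| | | | | | Balancing | 0% | 0% | 0% |
| | | | | | Neutral | 32% | 100% | 100% |
| 2-locus | 0.1 | 80 | 20 | 900 | Directional | 90% | 0% | 1% |
| | | | | | Balancing | 0% | 0% | 1% |
| | | | | | Neutral | 10% | 100% | 98% |
| 2-locus | 0.02 | 40 | 10 | 450 | Directional | 10% | 0% | 1% |
| | | | | | Balancing | 0% | 0% | 0% |
| | | | | | Neutral | 90% | 100% | 99% |
| 2-locus | 0.05 | 40 | 10 | 450 | Directional | 68% | 0% | 0% |
| | | | | | Balancing | 0% | 0% | 0% |
| | | | | | Neutral | 32% | 100% | 100% |
| 2-locus | 0.1 | 40 | 10 | 450 | Directional | 88% | 0% | 2% |
| | | | | | Balancing | 0% | 0% | 0% |
| | | | | | Neutral | 12% | 100% | 98% |
| 2-locus | 0.2 | 40 | 10 | 450 | Directional | 100% | 0% | 3% |
| | | | | | Balancing | 0% | 30% | 4% |
| | | | | | Neutral | 0% | 70% | 93% |
| Marker selected | 0.02 | 40 | 10 | 450 | Directional | 0% | 0% | 0% |
| | | | | | Balancing | 0% | 0% | 0% |
| | | | | | Neutral | 100% | 100% | 100% |
| Marker selected | 0.05 | 40 | 10 | 450 | Directional | 63% | 0% | 1% |
| | | | | | Balancing | 0% | 0% | 0% |
| | | | | | Neutral | 27% | 100% | 99% |
| Marker selected | 0.1 | 40 | 10 | 450 | Directional | 90% | 0% | 1% |
| | | | | | Balancing | 0% | 20% | 0% |
| | | | | | Neutral | 10% | 80% | 99% |
| Marker selected | 0.2 | 40 | 10 | 450 | Directional | 88% | 0% | 0% |
| | | | | | Balancing | 0% | 50% | 5% |
| | | | | | Neutral | 12% | 50% | 95% |

* Columns give the true scenario and rows give the classified scenario. Entries in which the classified scenario matches the true scenario are correct classifications (using critical KLD = 0.830, p = 0.05).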

We fit the simulated data using Model 1 and used the KLD criterion (p = 0.05) to identify outliers. A summary of the results is shown in Table 1. Several important features are apparent. First, neutral loci are rarely misclassified as being subject to selection. Only in one set of simulations was the false positive rate higher than 5%. Second, under the conditions of the simulation, loci subject to balancing selection are rarely detected. Only when the selective advantage of heterozygotes is very strong (s = 0.2) and the marker itself is subject to selection do we detect stabilizing selection in more than 30% of cases. Third, loci subject to divergent directional selection are quite readily detected when the selection coefficient is moderate to strong (s ≥ 0.05), regardless of whether selection acts directly on the marker or on a tightly linked locus.

Thus, if allelic variation at most loci in a sample is selectively neutral and if mutation rates at those loci are the same, loci we designate as statistical outliers correspond to a large fraction of loci that are subject to divergent selection pressures. The lack of power to detect balancing (stabilizing) selection is not surprising. Given our simulation conditions, FST at neutral loci is expected to be about 0.1. Detecting balancing selection would require us to detect loci at which FST < 0.1, which is very difficult given that FST is bounded below by 0. Detecting divergent selection, on the other hand, requires detection of loci at which FST > 0.1. Moreover, a reviewer pointed out that the stationary distribution of allele frequencies at such loci depends on Ns, where N is the effective size of local populations and s is the selection coefficient (Wright (1931), see also Holsinger (1999)). Thus, in a situation where local populations consist of 2500 individuals rather than 250, our approach may detect a large fraction of loci subject to divergent selection even when the selection coefficient is as small as 0.01.

6. APPLICATION AND MODEL COMPARISON

We apply the proposed models to two subsets of SNP data on human chromosome 7: (1) low-resolution data including 3040 loci separated by approximately 53k base pairs and (2) high-resolution data including 3002 loci separated by approximately 860 base pairs (see Section 2 for details). The high-resolution data are centered around SNP rs13239338, which has the largest KLD measure identified in the low-resolution analysis.

The proximity matrix is calculated from genetic map positions, as described earlier. Based on the three criteria introduced in section 3, we adopt the following proximity functions for low and high resolution data:

$$c(d_{ij})=\begin{cases}\dfrac{0.5}{3040}+0.0125\exp(-d_{ij}) & \text{low-resolution data}\\[4pt] \dfrac{0.5}{3002}+0.02\exp(-1000\,d_{ij}) & \text{high-resolution data,}\end{cases}$$

where dij is the distance (in centimorgans) between loci i and j.
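A sketch of the proximity function follows. It reads the typeset formula as a constant term of 0.5 divided by the number of loci (3040 or 3002) plus an exponentially decaying term in the map distance; this reading, the function name, and the `resolution` argument are assumptions of the example.

```python
import math


def proximity(d_ij, resolution):
    """Proximity weight for two loci separated by d_ij centimorgans,
    under the assumed reading of the proximity function."""
    if d_ij < 0.0:
        raise ValueError("distance must be non-negative")
    if resolution == "low":
        return 0.5 / 3040 + 0.0125 * math.exp(-d_ij)
    if resolution == "high":
        return 0.5 / 3002 + 0.02 * math.exp(-1000.0 * d_ij)
    raise ValueError("resolution must be 'low' or 'high'")
```

Under this reading the weight decays monotonically with distance toward a small positive floor, so distant loci retain a weak, uniform association.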

We fit the models using MCMC. Except for a few parameters that can be sampled directly from their conditional distributions, most parameters are sampled using Metropolis-Hastings (M-H) updates. Examination of the trace plots and standard convergence diagnostics (Geweke 1992) suggests that convergence has been achieved.

We use two criteria to compare models: the Deviance Information Criterion (DIC) and the Conditional Predictive Ordinate (CPO) based log of the Pseudomarginal likelihood (LPML). DIC assesses models on the marginal space and is defined as

$$\mathrm{DIC}=\bar{D}+p_D,$$

where D is the Bayesian deviance, D = −2 log p(y|θ) + 2 log f(y), and D̄ is the posterior mean of D. The penalty term is pD = D̄ − D(θ̄), where D(θ̄) is the Bayesian deviance evaluated at the posterior mean of the parameter θ.
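Given deviance values evaluated at each MCMC draw and the deviance evaluated at the posterior mean of the parameters, DIC follows from two subtractions. The sketch below uses hypothetical names and made-up numbers.

```python
from statistics import fmean


def dic(deviance_draws, deviance_at_posterior_mean):
    """Return (posterior mean deviance, pD, DIC)."""
    d_bar = fmean(deviance_draws)             # posterior mean of D
    p_d = d_bar - deviance_at_posterior_mean  # penalty term
    return d_bar, p_d, d_bar + p_d


# Illustrative values only.
d_bar, p_d, dic_value = dic([100.0, 102.0, 104.0], 99.0)
```

Here the mean deviance is 102, the penalty is 3, and DIC is 105; smaller DIC indicates the preferred model.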

The parameters we are interested in, θi and ϕi, are at the hyperparameter level. Thus, it is more appropriate to assess the model based on θi and ϕi than based on pik. In other words, θi and ϕi are the parameters of focus in the sense of Spiegelhalter, Best, Carlin and van der Linde (2002), and the pik can be considered nuisance parameters. In light of this, we integrate the pik out and calculate DIC based on θi and ψi, i.e.,

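$$\int f(x\mid p)\,\pi(p\mid\theta,\psi)\,dp=\prod_{i=1}^{I}\prod_{k=1}^{K}\binom{N_{ik}}{x_{ik}}\frac{B\left(x_{ik}+\frac{1-\theta_i}{\theta_i}\psi_i,\ N_{ik}-x_{ik}+\frac{1-\theta_i}{\theta_i}(1-\psi_i)\right)}{B\left(\frac{1-\theta_i}{\theta_i}\psi_i,\ \frac{1-\theta_i}{\theta_i}(1-\psi_i)\right)}. \tag{13}$$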
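Equation (13) is a product of beta-binomial probabilities, which is most safely evaluated on the log scale. The sketch below uses only the standard library; the function names and the nested-list data layout (`counts[i][k]`, `sizes[i][k]`) are assumptions of the example.

```python
import math


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_marginal_count(x, n, theta, psi):
    """Log beta-binomial probability of x alleles out of n, with the
    allele frequency integrated out given theta and psi."""
    scale = (1.0 - theta) / theta
    a, b = scale * psi, scale * (1.0 - psi)
    log_choose = (math.lgamma(n + 1) - math.lgamma(x + 1)
                  - math.lgamma(n - x + 1))
    return log_choose + log_beta(x + a, n - x + b) - log_beta(a, b)


def log_marginal_likelihood(counts, sizes, thetas, psis):
    """Sum of log beta-binomial terms over loci i and populations k."""
    total = 0.0
    for i, (row_x, row_n) in enumerate(zip(counts, sizes)):
        for x, n in zip(row_x, row_n):
            total += log_marginal_count(x, n, thetas[i], psis[i])
    return total
```

A quick sanity check is that the probabilities for x = 0, …, N sum to one and that the expected count is N ψi.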

CPO and LPML are model evaluation criteria based on the predictive space (Gelfand and Dey 1994; Gelfand, Dey and Chang 1992; Geisser 1993; Dey, Chen and Chang 1997). The CPO for xik, the allele count at locus i in population k, is defined as

$$\mathrm{CPO}_{ik}=f\left(x_{ik}\mid D_{(-ik)}\right)=\int f(x_{ik}\mid\theta)\,\pi\left(\theta\mid D_{(-ik)}\right)d\theta,$$

where D(-ik) denotes the data with observation xik deleted and π(θ|D(-ik)) is the posterior density of the model parameter θ based on the data D(-ik). LPML is the summation of the logarithm of the CPOs,

$$\mathrm{LPML}=\sum_{i=1}^{I}\sum_{k=1}^{K}\log(\mathrm{CPO}_{ik}).$$

CPO can be calculated using Monte Carlo approximation directly from the MCMC output,

$$\mathrm{CPO}_{ik}=\left(\frac{1}{B}\sum_{b=1}^{B}\frac{1}{f\left(x_{ik}\mid\theta^{(b)}\right)}\right)^{-1},$$

where {θ(b), b = 1,...,B} is the MCMC sample from π(θ|D) and D is the complete data. As with DIC, the CPO calculation can be based either on pik or on (θik and ψi). Again, we want to predict the allele counts xik given the parameters of interest (θi, θk, ψi | D(-ik)). The pik should be considered random effects rather than model parameters. Therefore, we integrate the pik out and use equation (13) to calculate CPO and LPML.
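The Monte Carlo estimate of CPO is the harmonic mean of the likelihood values across MCMC draws. Because reciprocals of small likelihoods can overflow, the sketch below works with log-likelihoods and a log-sum-exp step; that numerical device, the function names, and the input layout are assumptions of the example.

```python
import math


def log_cpo(log_lik_draws):
    """log CPO for one observation from log f(x_ik | theta^(b)),
    b = 1..B, computed as minus the log of the mean reciprocal."""
    n_draws = len(log_lik_draws)
    if n_draws == 0:
        raise ValueError("at least one MCMC draw is required")
    neg = [-v for v in log_lik_draws]
    shift = max(neg)  # log-sum-exp shift for numerical stability
    log_mean_inverse = (shift
                        + math.log(sum(math.exp(v - shift) for v in neg))
                        - math.log(n_draws))
    return -log_mean_inverse


def lpml(log_lik_by_observation):
    """Sum of log CPO over observations; each element is the list of
    log-likelihood values for one observation across MCMC draws."""
    return sum(log_cpo(draws) for draws in log_lik_by_observation)
```

For example, likelihood values of 0.5 and 0.25 across two draws have harmonic mean 1/3, so the log CPO is log(1/3). Larger LPML indicates better predictive performance.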

Table 2 summarizes DIC and LPML results. For low-resolution SNP data, Model 2 has the smallest DIC and is thus preferred to the alternative models. The ordering of models according to the LPML criterion is identical. Both CAR models are inferior to the non-spatial alternatives. Thus, spatial effects are not detectable at low resolution, but population-specific effects are important. The lack of spatial effect may not be too surprising in these data, because the average distance between adjacent markers is more than 52kb. Because of these results, outlier detection in the low-resolution data is based on Model 2.

Table 2.

Model comparison

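| Model | Low-resolution D̄ | Low-resolution pD | Low-resolution DIC | Low-resolution LPML | High-resolution D̄ | High-resolution pD | High-resolution DIC | High-resolution LPML |
|---|---|---|---|---|---|---|---|---|
| M1 | 81225 | 2809 | 84034 | -42695 | 71205 | 2579 | 73784 | -37423 |
| M2 | 80417 | 2776 | 83192 | -42356 | 71291 | 2541 | 73833 | -37436 |
| M1CAR | 81145 | 3141 | 84285 | -42690 | 70968 | 2813 | 73780 | -37389 |
| M2CAR | 80534 | 3166 | 83700 | -42368 | 71261 | 2557 | 73818 | -37398 |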

With the high-resolution data, the CAR models outperform models that fail to account for the statistical association expected between loci that are in close proximity. Once the effects of spatial proximity have been accounted for, however, we find no detectable effect associated with population. The lack of a population-specific effect in these data may not be surprising, because the high-resolution data cover only about 2% of chromosome 7 and are centered on a marker already known to exhibit much more among-population differentiation than the genomic average. Thus, we use Model 1 with a CAR prior to detect outliers in the high-resolution data.

To identify outliers in the HapMap data we chose a critical KLD value of 1.614 (p = 0.01). Using this criterion we identified 17 loci as outliers (Figure 2). In every case, the posterior distribution of θi is substantially shifted to the right, indicating that all of these loci mark regions of the genome showing substantially greater differentiation among populations than the average locus in our sample. Ten of the 17 loci we identify as outliers are located either within or close by a known gene or open-reading frame. The relationship between known genes and loci we identify as outliers is summarized in Table 3.

Figure 2.

Densities of posterior θi for low resolution scan

Table 3.

Outliers detected in low-resolution data

| SNP ID | Position | KLD (M2) | Candidate loci |
|---|---|---|---|
| rs7787411 | 14746124 | 2.35 | diacylglycerol kinase |
| rs10263500 | 30557791 | 2.20 | corticotropin releasing hormone receptor 2, indolethylamine N-methyltransferase |
| rs11771444 | 30797413 | 1.76 | growth hormone releasing receptor |
| rs12535578 | 54928795 | 3.51 | epidermal growth factor receptor |
| rs4521648 | 70374286 | 1.73 | UDP-GalNAc:polypeptide |
| rs2722963 | 82702679 | 1.70 | SEMA3E: semaphorin 3E |
| rs1990040 | 85957725 | 1.83 | glutamate receptor |
| rs17161695 | 98609357 | 2.29 | actin-related protein 2/3 complex subunit 1A, PDGFA associated protein 1 |
| rs11976018 | 98767088 | 1.87 | zinc finger protein 95 homolog (mouse) |
| rs1476471 | 108525466 | 1.65 | n.a.¹ |
| rs43083 | 111841138 | 1.72 | n.a.¹ |
| rs12531918 | 111885313 | 2.40 | n.a.¹ |
| rs2894673 | 112653596 | 1.79 | n.a.¹ |
| rs6466707 | 118915526 | 1.92 | n.a.¹ |
| rs13239338 | 126578324 | 7.21 | n.a.¹ |
| rs2671095 | 131284016 | 3.07 | n.a.¹ |
| rs4716934 | 155227668 | 2.17 | Homo sapiens sonic hedgehog homolog (Drosophila) |

¹ No known genes in vicinity of this SNP.

In Table 3, we notice that SNP rs13239338 has the highest KLD, even though it is not within a known gene. We use this locus as the center of our high-resolution scan, including 3001 SNP loci around it in the high-resolution data set. Using a critical KLD of 1.614 (p=0.01), we now identify 57 loci showing unusually large amounts of among-population differentiation. Figure 3 shows the genomic location of these θi values as well as the locations of known genes in this region. The 57 outliers fall within a smaller number of clusters. Perhaps the most striking cluster is the one involving 16 markers in the vicinity of LEP. A smaller number of markers are clustered around NYD-SP18/CALU and KIAA0828. The remaining markers are spread through a region including GRM8, LOC168850, GCC1 and FSCN3.

Figure 3.

High resolution scan outliers (M1CAR)

We summarize the relationships between known genes and markers identified as outliers in Table 4. It is interesting to observe that SNPs rs2278815 and rs4731426 are in an intron of LEP (leptin), which is the homolog of a gene contributing to obesity in mice. In the Yoruba population, 95% of chromosomes have nucleotide base G at both sites, while in other populations the frequency of G is only 22-40%. The protein encoded by this gene is secreted by white adipocytes. In mice, mutations in this gene are associated with severe obesity. The relationship between allelic differences at these SNP loci and obesity has also been confirmed in human population association studies (Mammès, Betoulle, Aubert, Herbeth, Siest and Fumeron 2000; Li, Reed, Lee, Xu, Kilker, Sodam and Price 1999; Mammès, Betoulle, Aubert, Giraud, Tuzet, Petiet, Colas Linhart and Fumeron 1998). Our results suggest that allelic differences at other loci involved in fat metabolism must compensate for the among-population allelic differences observed here.

Table 4.

High-resolution data outliers and known genes

| Gene & Location | SNP loci |
|---|---|
| GRM8: glutamate receptor (125672607,126477260) | rs7796270, rs7786541, rs17865314, rs6960871, rs4532535, rs2106149, rs2237808, rs10226369 |
| LOC168850: hypothetical protein (126604306:126626717) | rs951809 |
| GCC1: Golgi coiled-coil protein 1 (126819604,126814633) | rs989100 |
| FSCN3: fascin3 (126827639:126835793) | rs806214 |
| SND1: staphylococcal nuclease domain containing 1 (126886152:127326608) | rs7793281, rs712707, rs12672945, rs6969233, rs322821 |
| LEP: leptin precursor (obesity homolog, mouse) (127475281:127491631) | rs2021808, rs4731423, rs4731424, rs1349419, rs13245201, rs10487506, rs7799039, rs2278815, rs4731426, rs2071045, rs2060715, rs4731429, rs10954175, rs12537998, rs1466145, rs4728090 |
| NYD-SP18: testes development-related (127949393:127965611); CALU: calumenin precursor (127973386:128005477) | rs17164371, rs2402934, rs7780294, rs2060717 |
| KIAA0828: adenosylhomocysteinase 3 (128458814:128664001) | rs721691, rs4731568 |
| Loci with no known gene within 5k bps (around LOC168850) | rs1419391, rs975308, rs916598, rs12536774, rs13239338, rs9640842, rs2106177, rs10487482, rs4731365, rs12666432, rs12673058, rs10954158, rs1419410, rs12671806, rs12668127, rs11768389, rs1592365, rs11984364, rs17150996 |

7. DISCUSSION

In an earlier analysis of population differentiation using the HapMap data set, Weir et al. (2005) found substantial heterogeneity in locus-specific estimates of FST. Because their analysis used method-of-moment estimates (Weir and Cockerham 1984; Weir and Hill 2002), they were unable to provide a statistical criterion for recognizing particular loci as outliers, i.e., as exhibiting unusually large or unusually small amounts of differentiation among populations.

In this paper we extend an existing Bayesian framework for analysis of genetic differentiation among populations (Holsinger 1999; Holsinger 2006) to accommodate locus- and population-specific effects on FST. A novel aspect of our extension is the use of conditional autoregressive models to account for the correlation in patterns of among-population differentiation expected between loci that are in close physical proximity. A model that ignores associations among loci performs well on low-resolution data (one marker every 52kb on average), but including associations among closely linked loci is vital in analyses of the high-resolution data available in the full HapMap data set (one marker every kb on average). We compare estimated locus-specific effects with a hyperdistribution reflecting variation in FST across all loci in the sample. Strictly speaking, our approach only allows us to identify statistical outliers, but a small simulation study suggests that outliers detected by our approach correspond to loci under selection if most loci in the sample are neutral and share the same or comparable mutation rates.

In our analysis of data derived from the HapMap project, it seems reasonable to conclude that loci we identify as statistical outliers mark regions of the genome that have been subject to divergent selection among the populations included in the sample. For a set of neutral loci, allele frequencies are completely determined by the history of local population sizes they share, the history of migration among local populations they share, and the distribution of mutation rates among loci. If we summarize the amount of genetic differentiation among populations with FST, then the distribution of FST across loci will reflect both variation arising from the underlying stochastic evolutionary process and variation arising from differences among loci in mutation rates. We identified 17 loci in the low-resolution analysis and 57 loci in the high-resolution analysis that are outliers with respect to this hyperdistribution. Such outliers represent loci with levels of among-population differentiation that are substantially larger than would be consistent with the distribution of FST at the remaining 2900+ loci in our sample, and selection seems more likely than mutation to be responsible for such extreme departures from the genomic average.

We anticipate that our approach to outlier detection is less likely to detect loci that are subject to selection than approaches that directly model the demographic history of populations. Storz et al. (2004), for example, use a coalescent approach to estimate demographic parameters and construct expectations based on those parameter estimates. As discussed by Nielsen (2001; 2005), tests like these depend on strong assumptions about demography. We suspect that making such parametric assumptions makes the tests based on coalescent approaches more sensitive to departures from neutrality. In our approach the vagaries of demographic history are shared by all autosomal loci, and the hyperdistribution describing FST variation among loci encapsulates the uncertainty associated with the drift process. Thus, our method will be robust to a variety of demographic scenarios, although it is likely to be less powerful than methods designed to take those scenarios into account.

Human population geneticists have made a wealth of data available in recent years. The HapMap data set, of which we have analyzed only a small portion here, includes samples from only four geographically distinct populations, but the data from these populations are available at high genomic resolution, roughly every 1kb over the entire human genome. In contrast, the HGDP-CEPH microsatellite data set (Cann, de Toma, Cazes, Legrand, Morel, Piouffre, Bodmer, Bodmer, Bonne-Tamir, Cambon-Thomsen, Chen, Chu, Carcassi, Contu, Du, Excoffier, Friedlaender, Groot, Gurwitz, Herrera, Huang, Kidd, Kidd, Langaney, Lin, Mehdi, Parham, Piazza, Pistillo, Qian, Shu, Xu, Zhu, Weber, Greely, Feldman, Thomas, Dausset and Cavalli-Sforza 2002) provides data at low genomic resolution (377 loci or roughly one every 107kb), but it includes samples from 52 geographically defined populations in Africa, Eurasia, Oceania, and the Americas. In addition to providing a much wider geographic sampling of human diversity and thereby increasing the potential that populations have experienced divergent selection pressures, loci in the HGDP-CEPH microsatellite data set harbor many alleles, and the mutational dynamics of microsatellite loci are quite different from those of SNPs. Our future work will include both high-resolution analyses of the entire human genome using data derived from the HapMap project and the development of statistical models appropriate for similar analyses of the HGDP-CEPH microsatellite data.

Acknowledgments

This research was supported, in part, by a grant from the National Institutes of Health, National Institute of General Medical Sciences 1R01-GM068449-01A1.

REFERENCES

  1. Akey JM, Zhang G, Zhang K, Jin L, Shriver MD. Interrogating a High-Density SNP Map for Signatures of Natural Selection. Genome Res. 2002;12(12):1805–1814. doi: 10.1101/gr.631202.
  2. Banerjee S, Carlin BP, Gelfand AE. Hierarchical Modeling and Analysis for Spatial Data. Chapman & Hall/CRC; Boca Raton, FL: 2004.
  3. Beaumont MA, Balding DJ. Identifying adaptive genetic divergence among populations from genome scans. Molecular Ecology. 2004;13(4):969–980. doi: 10.1111/j.1365-294x.2004.02125.x.
  4. Cann HM, de Toma C, Cazes L, Legrand M-F, Morel V, Piouffre L, Bodmer J, Bodmer WF, Bonne-Tamir B, Cambon-Thomsen A, Chen Z, Chu J, Carcassi C, Contu L, Du R, Excoffier L, Friedlaender JS, Groot H, Gurwitz D, Herrera RJ, Huang X, Kidd J, Kidd KK, Langaney A, Lin AA, Mehdi SQ, Parham P, Piazza A, Pistillo MP, Qian Y, Shu Q, Xu J, Zhu S, Weber JL, Greely HT, Feldman MW, Thomas G, Dausset J, Cavalli-Sforza LL. A Human Genome Diversity Cell Line Panel. Science. 2002;296(5566):261b–262. doi: 10.1126/science.296.5566.261b.
  5. Cavalli-Sforza LL. Population Structure and Human Evolution. Royal Society of London Proceedings Series B. 1966;164:362–379. doi: 10.1098/rspb.1966.0038.
  6. Chakraborty R, Kimmel M, Stivers DN, Davison LJ, Deka R. Relative mutation rates at di-, tri-, and tetranucleotide microsatellite loci. Proceedings of the National Academy of Sciences USA. 1997;94:1041–1046. doi: 10.1073/pnas.94.3.1041.
  7. The International HapMap Consortium. A haplotype map of the human genome. Nature. 2005;437:1299–1320. doi: 10.1038/nature04226.
  8. Crow JF, Kimura M. An Introduction to Population Genetics Theory. Burgess Publishing Company; Minneapolis, Minn: 1970.
  9. Dey DK, Chen M-H, Chang H. Bayesian Approach for Nonlinear Random Effects Models. Biometrics. 1997;53(4):1239–1252.
  10. Fu R, Gelfand AE, Holsinger KE. Exact moment calculations for genetic models with migration, mutation, and drift. Theoretical Population Biology. 2003;63:231–243. doi: 10.1016/s0040-5809(03)00003-0.
  11. Geisser S. Predictive Inference: An Introduction. Chapman & Hall/CRC; Boca Raton, FL: 1993.
  12. Gelfand AE, Dey DK. Bayesian Model Choice: Asymptotics and Exact Calculations. Journal of the Royal Statistical Society. Series B (Methodological). 1994;56(3):501–514.
  13. Gelfand AE, Dey DK, Chang H. Model determination using predictive distributions with implementation via sampling-based methods. In: Bernardo JM, Berger JO, Dawid AP, Smith AFM, editors. Bayesian Statistics 4. Oxford University Press; Oxford: 1992. pp. 147–167.
  14. Geweke J. Evaluating the accuracy of sampling-based approaches to calculating posterior moments. In: Bernardo JM, Berger JO, Dawid AP, Smith AFM, editors. Bayesian Statistics 4. Oxford University Press; Oxford: 1992.
  15. Holsinger KE. Analysis of genetic diversity in geographically structured populations: a Bayesian perspective. Hereditas. 1999;130:245–255.
  16. Holsinger KE. Bayesian hierarchical models in geographical genetics. In: Clark JS, Gelfand AE, editors. Applications of Computational Statistics in the Environmental Sciences. Oxford University Press; New York, NY: 2006. pp. 25–37.
  17. Lercher MJ, Hurst LD. Human SNP variability and mutation rate are higher in regions of high recombination. Trends in Genetics. 2002;18:337–340. doi: 10.1016/s0168-9525(02)02669-0.
  18. Lewontin RC, Krakauer J. Distribution of Gene Frequency as a Test of the Theory of the Selective Neutrality of Polymorphisms. Genetics. 1973;74(1):175–195. doi: 10.1093/genetics/74.1.175.
  19. Li CC. Population Genetics. University of Chicago Press; Chicago, IL: 1955.
  20. Li WD, Reed DR, Lee JH, Xu W, Kilker RL, Sodam BR, Price RA. Sequence variants in the 5′ flanking region of the leptin gene are associated with obesity in women. Annals of Human Genetics. 1999;63:227–234. doi: 10.1046/j.1469-1809.1999.6330227.x.
  21. Malécot G. Les Mathématiques de l’Hérédité. Masson et Cie; Paris, France: 1948.
  22. Mammès O, Betoulle D, Aubert R, Giraud V, Tuzet S, Petiet A, Colas Linhart N, Fumeron F. Novel polymorphisms in the 5′ region of the LEP gene: association with leptin levels and response to low-calorie diet in human obesity. Diabetes. 1998;47:487–489. doi: 10.2337/diabetes.47.3.487.
  23. Mammès O, Betoulle D, Aubert R, Herbeth B, Siest G, Fumeron F. Association of the G-2548A polymorphism in the 5′ region of the LEP gene with overweight. Annals of Human Genetics. 2000;64:391–394. doi: 10.1017/s0003480000008277.
  24. Nei M, Maruyama T. Lewontin-Krakauer test for neutral genes. Genetics. 1975;80(2):395. doi: 10.1093/genetics/80.2.395.
  25. Nielsen R. Statistical tests of selective neutrality in the age of genomics. Heredity. 2001;86(6):641–647. doi: 10.1046/j.1365-2540.2001.00895.x.
  26. Nielsen R. Molecular signatures of natural selection. Annual Review of Genetics. 2005;39(1):197–218. doi: 10.1146/annurev.genet.39.073003.112420.
  27. Peng F, Dey DK. Bayesian analysis of outlier problems using divergence measures. Canadian Journal of Statistics. 1995;23:194–213.
  28. Riebler A, Held L, Stephan W. Bayesian variable selection for detecting genomic differences among populations. Genetics. 2008. doi: 10.1534/genetics.107.081281. In press.
  29. Robertson A. Gene Frequency Distributions as a Test of Selective Neutrality. Genetics. 1975;81(4):775–785. doi: 10.1093/genetics/81.4.775.
  30. Rosenberg NA, Pritchard JK, Weber JL, Cann HM, Kidd KK, Zhivotovsky LA, Feldman MW. Genetic Structure of Human Populations. Science. 2002;298(5602):2381–2385. doi: 10.1126/science.1078311.
  31. Song S, Dey DK, Holsinger KE. Differentiation among populations with migration, mutation, and drift: implications for genetic inference. Evolution. 2006;60:1–12.
  32. Spiegelhalter D, Best N, Carlin B, van der Linde A. Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B. 2002;64(4):583–639.
  33. Storz JF, Payseur BA, Nachman MW. Genome Scans of DNA Variability in Humans Reveal Evidence for Selective Sweeps Outside of Africa. Mol Biol Evol. 2004;21(9):1800–1811. doi: 10.1093/molbev/msh192.
  34. Weber JL, Wong C. Mutation in short tandem repeat polymorphisms. Human Molecular Genetics. 1993;2:1123–1128. doi: 10.1093/hmg/2.8.1123.
  35. Weir BS, Cardon LR, Anderson AD, Nielsen DM, Hill WG. Measures of human population structure show heterogeneity among genomic regions. Genome Res. 2005;15(11):1468–76. doi: 10.1101/gr.4398405.
  36. Weir BS, Cockerham CC. Estimating F-statistics for the analysis of population structure. Evolution. 1984;38:1358–1370. doi: 10.1111/j.1558-5646.1984.tb05657.x.
  37. Weir BS, Hill WG. Estimating F-statistics. Annual Reviews of Genetics. 2002;36:721–750. doi: 10.1146/annurev.genet.36.050802.093940.
  38. Wright S. Evolution in Mendelian populations. Genetics. 1931;16:97–159. doi: 10.1093/genetics/16.2.97.
  39. Wright S. The genetical structure of populations. Annals of Eugenics. 1951;15:323–354. doi: 10.1111/j.1469-1809.1949.tb02451.x.
