Author manuscript; available in PMC: 2019 Jul 1.
Published in final edited form as: Theor Popul Biol. 2018 Mar 21;122:78–87. doi: 10.1016/j.tpb.2018.03.004

Inference from the stationary distribution of allele frequencies in a family of Wright-Fisher models with two levels of genetic variability

Jake M Ferguson a, Erkan Ozge Buzbas b
PMCID: PMC6054576  NIHMSID: NIHMS922873  PMID: 29574050

Abstract

The distributions of allele frequencies obtained from diffusion approximations to Wright-Fisher models are useful in developing intuition about the population-level effects of evolutionary processes. The statistical properties of the stationary distributions of K-allele models have been extensively studied under neutrality or under selection. Here, we introduce a new family of Wright-Fisher models in which there are two hierarchical levels of genetic variability. The genotypes composed of alleles differing from each other at the selected level have fitness differences with respect to each other and evolve under selection. The genotypes composed of alleles differing from each other only at the neutral level have the same fitness and evolve under neutrality. We show that with an appropriate scaling of the mutation parameter with respect to the number of alleles at each level, the frequencies of alleles at the selected and the neutral level are conditionally independent of each other, conditional on knowing the number of alleles at all levels. This conditional independence allows us to simulate from the joint stationary distribution of the allele frequencies. We use these simulated frequencies to perform inference on parameters of the model with two levels of genetic variability using Approximate Bayesian Computation.

Keywords: Wright-Fisher model, K-allele models, diffusion approximation, balancing selection, Approximate Bayesian Computation

1. Introduction

Since the 1940s, diffusion approximations to Wright-Fisher models have been a fruitful mathematical approach for understanding the effects of evolutionary processes on the long-term behavior of idealized populations (Beatty 1986). Paul Joyce made numerous contributions to theoretical population genetics using these models. He enjoyed the mathematical challenges these models offered, such as finding stationary distributions of allele frequencies or establishing statistical properties of estimators. But more importantly, Paul thought that the probabilistic results obtained in diffusion approximations provided a way to hone our intuition about the effects of evolutionary phenomena in shaping the genetic variability in idealized populations. The novel model presented in this paper and the methods used to investigate the properties of the parameter estimates are a direct extension of his work on K-allele Wright-Fisher models with balancing selection.

A Wright-Fisher model is a discrete time stochastic population model in which the alleles at a genetic locus evolve under the effect of genetic and demographic processes. Examples of these processes include mutation, selection, recombination, drift, migration, and population structure. Two alleles evolving at a single locus under neutrality in a large population is a textbook example of the Wright-Fisher model (see e.g., Rice 2004, pp. 139–144) for introducing the mathematical basis of population-level evolution. Wright-Fisher models capture the mathematical definition of evolution in an idealized population as the change in allele frequencies over generations. They also allow precise probabilistic formulations of population allele frequency distributions in multi-locus genomes undergoing multiple processes. This flexibility to accommodate a wide range of processes has created a fruitful framework of idealized mathematical models for theoretical population geneticists to understand evolution. The quantities obtained in Wright-Fisher models can often be analytically manipulated to yield insightful results about various evolutionary consequences of genetic and demographic phenomena.

In this paper, we describe a new family of Wright-Fisher models in which there are two hierarchically structured levels of genetic variability. Genotypes composed of alleles defined at the first level of genetic variability have fitness differences and evolve under selection with respect to each other. We refer to an allele of the first level as a class of the second level. At the second level of genetic variability, each allele of the first level is further classified into a set of other alleles. The genotypes composed of alleles within a class have the same fitness and evolve neutrally with respect to each other (Figure 1). Throughout this paper, we work in the framework of K-allele Wright-Fisher models and develop statistical methods to estimate the parameters of the stationary distribution arising from the diffusion approximation to our two-level model. In some sections, we provide a generic treatment of the K-allele Wright-Fisher model but in others we focus on specific models based on symmetric mutation, which means that the rate of mutation from one allele to another is the same for all alleles. We show that conditional on knowing the number of alleles at all levels, the joint stationary distribution of the allele frequencies for the model with two levels of genetic variability factors as the product of the stationary distribution under selection and under neutrality. This factorization is due to an appropriate scaling and partitioning of the overall mutation rate at each of the two levels of genetic variability. This result implies that the allele frequencies for two classes are independent of each other conditional on knowing the number of alleles at all levels. In practice, this conditional independence is methodologically useful because it allows us to efficiently simulate the allele frequencies under neutrality and selection from their marginal distributions respectively. 
We use these simulated frequencies in the context of Approximate Bayesian Computation (ABC) to perform statistical inference on model parameters of the joint distribution of the model with two levels of genetic variability. Since our model is based on neutral and selected K-allele Wright-Fisher models, we review these models and some of their history in the following section.

Figure 1.


Alleles (shown within squares) and their population frequencies (notation used throughout the paper) in the hierarchical structure of levels of genetic variability. The level at which selection acts has K allelic types, and the ith allele can further be classified by its ki neutral variants. In all models, we assume that the number of allelic types represented by parameters K, ki, i = 1, 2, …, K are known. The population frequencies of alleles are normalized and they sum to 1 at the level that selection acts, and also within each neutral variant set of ki allelic types.

2. K-allele Wright-Fisher models

2.1. A brief history of K-allele Wright-Fisher models

The stationary distribution of allele frequencies at a single locus arising from the diffusion approximation to Wright-Fisher models can be traced back to the early days of the modern evolutionary synthesis. The two-allele models under generic selection were known to R.A. Fisher at least to some degree (Ewens 2004 [pp. 20-24]). However, Fisher did not explicitly write on these models and the exact form of the models he conceived is unclear. Sewall Wright was also familiar with the stationary distributions arising from diffusion approximations for two-allele models (Wright 1931). In their classical population genetics reference, Kimura and Crow (1970) provided a systematic study of the two-allele models under neutrality and various selection schemes. They presented results on allele frequencies in the limits of large population size and long evolutionary time.

For K > 2, the stationary distribution of allele frequencies in a fully parameterized K-allele model with selection and mutation has K mutation parameters, one for each allele, and K(K +1)/2 selection parameters, one for each genotype. Using the allele frequencies from a single locus for purposes of statistical inference on parameters such as the mutation rate or strength of selection leads to an overparameterized model since the number of parameters is greater than the number of alleles. Inference about the parameters of the fully-parameterized model can be performed if there are sufficiently many loci all evolving under identical evolutionary processes with the same parameter values, but this is a strong assumption on the evolution of a multi-locus genome.

For single-locus models, one approach to investigating the statistical properties of the system has been to focus on constrained models, which reduce the number of parameters by making assumptions about the selection and mutation mechanisms. A generic mutation model has K^2 parameters for K alleles. However, we assume a Wright-Fisher diffusion approximation with parent-independent mutation, where mutation from one allele to another is independent of the parent’s type, resulting in K mutation parameters for K alleles. An example of a constrained model is a model assuming symmetric mutation, which means that all alleles mutate at the same rate. Thus, instead of K mutation parameters, one for each allele as in the fully parameterized model, there is only one mutation parameter in a symmetric mutation model. Constrained models capture biologically interesting selection schemes. For example, the full selection matrix with K(K + 1)/2 parameters is constrained to a single-parameter symmetric balancing selection model if we assume that all heterozygotes have equal fitness, all homozygotes have equal fitness, and the fitness of heterozygotes is higher than that of homozygotes.

An early analysis of the stationary distribution of allele frequencies arising from K-allele Wright-Fisher models with selection is given by Wright (1949, p. 383), who investigated the symmetric balancing selection model. The same model was later studied by Kimura (1955, 1956), but a formal derivation of the stationary distribution was not given in these works. Watterson (1977) was the first to provide a systematic study of the stationary distribution under balancing selection. More recent work on estimating the strength of selection in a balancing selection model has maintained the simplifying assumption of symmetric balancing selection so that the estimability of parameters was not an issue (Maruyama and Nei 1981, Donnelly, Nordborg, and Joyce 2001). One of our examples in this paper is on estimating the strength of selection in the stationary distribution of the allele frequencies in the two-level model under symmetric balancing selection.

In the mathematical tradition of theoretical population genetics, the works of Wright, Kimura, and Watterson were expanded to investigate the properties of these models in depth. Generalizations and rigorous mathematical accounts using measure-theoretic approaches based on Fleming-Viot processes, which include Wright-Fisher models as a subset, are available (Ethier and Nagylaki 1989, Ethier and Kurtz 1993, Donnelly and Kurtz 1999, Ethier and Griffiths 1987), and properties of new models are investigated continually (see, for example, Spano, Jenkins, and Griffiths this issue). For a concise summary of many benchmark developments and results, Ewens (2004) is an excellent accessible reference.

When alleles evolve under selection, the stationary distributions often have computationally intractable likelihoods and studying the statistical properties of parameter estimators is challenging. However, analytical results have been obtained under specific selection patterns of biological interest, such as balancing selection (Joyce, Krone, and Kurtz 2003). A number of recent approaches taking advantage of novel numerical and computational methods have been used to develop estimators for the strength of selection, and the properties of these estimators have been investigated. Examples include a perfect sampling approach to simulate samples under non-neutral models (Fearnhead 2006), statistical inference methods to estimate the strength of balancing selection using an importance sampling approach (Donnelly, Nordborg, and Joyce 2001), Bayesian inference on the strength of balancing selection by Markov chain Monte Carlo (Joyce et al. 2012), and spectral methods to perform inference on the strength of selection using time series data (Steinrücken et al. 2014).

2.2. The diffusion approximation to K-allele Wright-Fisher models

For a population of N diploid individuals, we consider K allelic types denoted by A1, A2, …, AK at a locus. At each generation, the population is described by the allele frequencies x = (x1, x2, …, xK), subject to $\sum_{i=1}^{K} x_i = 1$, where xi denotes the frequency of allele Ai. Mutation from one allele type to another is independent of the parent’s type and occurs with rate $u = \sum_{i=1}^{K} u_i$ to one of the K types available in the population, including the parent’s own type. Under selection, a diploid individual with genotype AiAj has fitness 1 + sij, where the selection coefficient sij > 0 for all i, j except a reference genotype, say A1A1, for which s11 = 0. The neutral model is the special case of the model with selection in which sij = 0 for all i, j. The population at generation t + 1 is obtained by sampling N alleles independently of each other from the population at generation t, where each allele is sampled as follows: 1) sample a pair AiAj with probability proportional to its fitness, 2) choose an allele from the pair AiAj with equal probability, 3) with probability u replace the chosen allele with one of the K types in the population, where type Ai is chosen with probability proportional to ui.
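The three sampling steps can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function name `wright_fisher_generation` and its arguments are hypothetical, and it assumes that a genotype is drawn with probability proportional to its random-mating frequency times its fitness, which the text leaves implicit.

```python
import numpy as np

def wright_fisher_generation(x, s, u_vec, n_alleles, rng):
    """One generation of a K-allele Wright-Fisher model with selection
    and parent-independent mutation (illustrative sketch).

    x         : current allele frequencies, shape (K,), summing to 1
    s         : symmetric (K, K) array of selection coefficients s_ij
    u_vec     : mutation rates u_i, shape (K,); the total rate is u = sum(u_i)
    n_alleles : number of alleles sampled to form the next generation
    """
    u = u_vec.sum()
    # Assumption: genotype A_iA_j is drawn with probability proportional to
    # x_i * x_j (random mating) times its fitness 1 + s_ij.
    w = np.outer(x, x) * (1.0 + s)
    w = w / w.sum()
    # Steps 1 and 2: because w is symmetric, keeping the first allele of an
    # ordered pair equals choosing one allele of the pair with equal probability.
    p_allele = w.sum(axis=1)
    # Step 3: with probability u the allele is replaced, and the new type is
    # A_i with probability u_i / u, so the mutation term is u * (u_i / u) = u_i.
    p_next = (1.0 - u) * p_allele + u_vec
    # Independent draws of n_alleles alleles give multinomial counts.
    counts = rng.multinomial(n_alleles, p_next)
    return counts / n_alleles

rng = np.random.default_rng(0)
x0 = np.array([0.5, 0.3, 0.2])
s = np.zeros((3, 3))            # neutral special case, s_ij = 0
u_vec = np.full(3, 1e-3)
x1 = wright_fisher_generation(x0, s, u_vec, n_alleles=200, rng=rng)
```

Iterating this function over many generations gives a trajectory of allele frequencies whose long-run behavior the diffusion approximation describes.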

The long-term evolution of a population under the process described in the previous paragraph produces a stationary distribution of the allele frequencies. This stationary distribution is obtained by rescaling the mutation and selection parameters as θ = limN→∞ 4Nu and σij = limN→∞ 2Nsij, respectively. Using the coefficients ν = (ν1, ν2, …, νK) with $\sum_{i=1}^{K} \nu_i = 1$, νi > 0 for all i, and the matrix notation Σ = {σij} for the selection parameters, the stationary distribution is given by

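$$f(x \mid \theta, \Sigma, \nu) = C_x^{-1}(\theta, \Sigma, \nu)\, e^{x \Sigma x^{T}} \prod_{i=1}^{K} x_i^{\theta \nu_i - 1}, \qquad (1)$$

where

$$C_x(\theta, \Sigma, \nu) = \int_{X_K \times \cdots \times X_2 \times X_1} e^{x \Sigma x^{T}} \prod_{i=1}^{K} x_i^{\theta \nu_i - 1}\, dx_K\, dx_{K-1} \cdots dx_1 \qquad (2)$$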

is a parameter-dependent normalization constant that makes f(x | θ, Σ, ν) integrate to 1, and Xi is the space of frequencies for allele i.

Setting σij = 0, for all i, j, in equation 1 recovers the stationary distribution under the neutral K-allele model, and this is a Dirichlet distribution given by

$$f(y \mid \theta, \nu) = B_y^{-1}(\theta, \nu) \prod_{i=1}^{K} y_i^{\theta \nu_i - 1}, \qquad (3)$$

where for clarity, we denote the frequency of neutral alleles by yi, and the normalization constant is given by the beta function

$$B_y(\theta, \nu) = \left[\prod_{i=1}^{K} \Gamma(\theta \nu_i)\right] \Big/ \, \Gamma\!\left(\sum_{i=1}^{K} \theta \nu_i\right),$$

where $\Gamma(a) = \int_{0}^{\infty} x^{a-1} e^{-x}\, dx$ is the gamma function. In the following sections, we will assume that mutation is symmetric for the neutral K-allele model given in equation 3. This implies that νi = 1/K for all i, and all models become conditional on knowing the number of allelic types K. Furthermore, we will use the stationary distribution under selection and under neutrality in the same context, but we will allow the number of alleles to differ between these models. To emphasize this point, we introduce the parameter vector κ, which contains the numbers of alleles in the neutral model and in the model under selection. We use the notation Cx(θ, Σ, ν) = Cx(θ, Σ, κ) and By(θ, ν) = By(θ, κ) for the constants, and the notation f(x | θ, Σ, κ) and f(y | θ, κ) for the stationary distributions under selection and neutrality, respectively. Figure 2 plots some examples of stationary probability density functions arising from diffusion approximations to 3-allele Wright-Fisher models under neutrality and selection.
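As a small illustration of the neutral case, equation 3 with symmetric mutation is a Dirichlet distribution whose K parameters all equal θ/K, so draws can be obtained directly from a standard random number library. The sketch below uses NumPy; the function name `sample_neutral_frequencies` and the parameter values are hypothetical.

```python
import numpy as np

def sample_neutral_frequencies(theta, K, size, rng):
    """Draw allele-frequency vectors from equation 3 with nu_i = 1/K."""
    alpha = np.full(K, theta / K)      # Dirichlet parameters theta * nu_i
    return rng.dirichlet(alpha, size=size)

rng = np.random.default_rng(1)
y = sample_neutral_frequencies(theta=6.0, K=3, size=5000, rng=rng)
```

Each row of `y` is one vector of population allele frequencies. Under symmetric mutation every allele has expected frequency 1/K, which gives a quick sanity check on the simulated draws.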

Figure 2.


The stationary probability density functions for 3-allele Wright-Fisher models under neutrality and selection, as a function of the first (x1) and the second (x2) allele frequency. θi = θνi denotes the mutation parameter for allele Ai and σij is the selection parameter for genotype AiAj. In all plots, σij = 0 for all (i, j) not shown.

3. A family of Wright-Fisher models with two levels of genetic variability

The stationary distributions in equations 1 and 3 are obtained under the assumption that the K allelic types at a locus evolve either under selection or under neutrality, respectively. Typically, alleles at a locus are characterized using a genetic criterion such as: two alleles are of different types if their DNA sequences differ from each other by at least one non-synonymous mutation. At highly variable coding regions of the genome, however, the properties of the functional products of genes, such as proteins, might depend on multiple genetic criteria. In these cases, characterizations of allelic classes at multiple levels are required to capture the effects of evolutionary and demographic processes that maintain the genetic variability at a locus. As a concrete example, we consider the loci that make up the Major Histocompatibility Complex in the human genome, known as the Human Leukocyte Antigen (HLA) (see http://hla.alleles.org for a wealth of information). HLA genes code for proteins that help to distinguish the immune system proteins from pathogen proteins, and thus have a major role in the proper functioning of the immune system. There are about 200 HLA genes on chromosome 6, and more than 16,000 alleles have been identified in HLA loci. This staggering level of variability has changed the classification of HLA alleles radically in the last decade. A hierarchical definition of classes of allelic types has become the standard. Currently an HLA allele is uniquely characterized at five levels of variability: 1. its allelic group, 2. a non-synonymous difference in the coding region, 3. a synonymous difference in the coding region, 4. a difference in the non-coding region, and 5. the changes in expression (Marsh et al. 2010). At each level of this system, a different criterion of genetic variability is applied to an allele. 
Further, alleles belonging to a class that is characterized at a higher level (according to numbering 1-5), inherit the properties of alleles belonging to a class that is characterized at a lower level.

This hierarchical model of specifying the genetic variability and the following characterization of alleles motivates modeling multiple classes of alleles within a Wright-Fisher framework. Investigating a Wright-Fisher model with multiple classes of alleles is useful for at least two purposes. First, it will help us to understand how the allele frequencies at a locus are affected by genetic processes that might only operate on subsets of alleles or genotypes. For the HLA example above, we might want to model selection operating on genotypes that are made of two alleles that differ from each other at level 2 due to non-synonymous differences. On the other hand, genotypes that are made of two alleles that differ only at level 3 or higher will evolve neutrally with respect to each other (for models investigating overdominance in HLA loci see Black and Hedrick 1997, Stoffels and Spencer 2008).

Second, a Wright-Fisher model with multiple classes of alleles will make informative use of allele frequencies in parameter estimation by taking into account the fact that each class jointly informs the estimators of the parameters of relevant processes, thereby decreasing the uncertainty in estimators. Continuing with the HLA example, the allele frequencies characterized at level 2 will carry information on genetic drift, mutation, and selection, but the allele frequencies characterized at level 3 will carry information on genetic drift and mutation only since the genotypes that are made of alleles within a class at level 3 evolve neutrally with respect to each other. In the next section we build a Wright-Fisher model where there are two levels of genetic variability at a locus. At the first level, there is one class of alleles that evolves under selection. At the second level, for each of the alleles at the first level there is another class of alleles that evolves under neutrality. First, we define the stationary distribution of the joint allele frequencies for this model with two levels of genetic variability. Then we study the statistical properties of the estimators of the strength of selection and mutation parameters for a sample of allele frequencies from this distribution.

3.1. The joint stationary distribution of allele frequencies in the model with two levels of genetic variability

We first modify the standard notation A1, A2, …, AK for K-allele models that we introduced in section 2.2 to clearly denote the alleles in the model with two levels of genetic variability. We use $A_{ij}$ to denote the neutral allele j ∈ {1, 2, …, ki} within allele i ∈ {1, 2, …, K} evolving under selection. We keep Ai to denote allele i at the level at which selection operates. For example, genotypes $A_{ij}A_{ij'}$ have the same fitness with respect to each other for all pairs (j, j′) because the alleles that make up these genotypes belong to the same allele i at the level at which selection operates. On the other hand, the genotypes $A_{ij}A_{i'j'}$ and $A_{m\ell}A_{m'\ell'}$ will potentially have fitness differences if at least one of i ≠ m or i′ ≠ m′ holds. We have the parameter vector κ = (K, k1, k2, …, kK). We denote the frequency of allele $A_{ij}$ by $y_{ij}$, so that the vector of allele frequencies within allele i is $y_i = (y_{i1}, y_{i2}, \ldots, y_{ik_i})$, subject to $\sum_{j=1}^{k_i} y_{ij} = 1$, for all i = 1, 2, …, K. For the frequency of Ai, we use the notation from the K-allele model with selection: xi with x = (x1, x2, …, xK), subject to $\sum_{i=1}^{K} x_i = 1$. Our definitions of alleles and their frequencies may seem strange at first. One can also define the alleles at a single level with i × j allelic types A1, A2, …, Ai×j and work with their frequencies such that their frequencies sum to 1. However, using x and yi has the following advantage. The alleles that form genotypes that are neutral with respect to each other have frequencies normalized to 1 within each class at the neutral level, and the allele frequencies at the level at which selection operates are normalized to 1 as well. These definitions allow us to define the two-level model as a function of a Wright-Fisher model under selection and Wright-Fisher models under neutrality.

Our main technical result provided in Appendix 1 shows that the joint stationary distribution of the allele frequencies in the model with two levels of genetic variability is given by

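$$f(x, y \mid \theta, \Sigma, \kappa) = f(x \mid \theta_1, \theta_2, \ldots, \theta_K, \Sigma, \kappa) \prod_{i=1}^{K} f(y_i \mid \theta_{k_i}, \kappa). \qquad (4)$$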

Here, each $f(y_i \mid \theta_{k_i}, \kappa)$ is the stationary distribution under the neutral ki-allele model as in equation 3. Further, $f(x \mid \theta_1, \theta_2, \ldots, \theta_K, \Sigma, \kappa)$ is the stationary distribution under the K-allele model with selection, similar to equation 1 in its exponential part (which captures the effect of selection), but with mutation parameters (θ1, θ2, …, θK) that are not necessarily symmetric. In equation 4, $\theta_i$ and $\theta_{k_i}$ are appropriately scaled mutation parameters. Based on the assumption of symmetric mutation between alleles $A_{ij}$, they are given by $\theta_i = k_i(\theta/\tau - 1) + 1$ and $\theta_{k_i} = k_i(\theta/\tau)$, where $\tau = \sum_{i=1}^{K} k_i$.
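To make the scaling concrete, the short sketch below evaluates the two sets of scaled mutation parameters for given θ and (k1, …, kK). The function name `scaled_mutation_parameters` is hypothetical, and the values θ = 14 and k = (3, 2, 2) are chosen only so that the arithmetic is easy to follow (τ = 7 and θ/τ = 2).

```python
def scaled_mutation_parameters(theta, k):
    """Return (theta_sel, theta_neu): the lists of theta_i and theta_{k_i}."""
    tau = sum(k)                                    # total number of alleles
    theta_sel = [k_i * (theta / tau - 1.0) + 1.0 for k_i in k]
    theta_neu = [k_i * (theta / tau) for k_i in k]
    return theta_sel, theta_neu

theta_sel, theta_neu = scaled_mutation_parameters(theta=14.0, k=[3, 2, 2])
# theta_sel == [4.0, 3.0, 3.0] and theta_neu == [6.0, 4.0, 4.0]
```

Two features are visible in this example: the neutral parameters sum to θ, and $\theta_i$ is positive only when θ/τ > 1 − 1/ki, which a Dirichlet proposal built from these parameters would require.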

Our conclusion from equation 4 is that the joint stationary distribution factors as the product of the stationary distribution of allele frequencies under the K-allele model with selection and the stationary distributions of allele frequencies under the neutral ki-allele models, i = 1, 2, …, K.

A practical implication of this factorization is that to simulate draws from the stationary distribution of the model with two levels of genetic variability, the allele frequencies under the selection model and classes of allele frequencies under neutral models can be simulated independently of each other. Figure 3 shows examples of K-allele models under neutrality, selection, and the model with two levels of genetic variability. Simulating draws from the stationary distribution of the model with two levels of genetic variability allows us to perform inference on the parameters of this model by Approximate Bayesian Computation (ABC).

Figure 3.


Schematic descriptions of single-locus K-allele Wright-Fisher models: I) a neutral model with symmetric mutation. II) a model of symmetric balancing selection where heterozygotes have a fitness advantage over homozygotes. III) The model introduced in this paper with two levels of genetic variability. There are three alleles (A, B, C) at the first level, which evolve under symmetric balancing selection (selection matrix given). Each of these alleles is further classified into 3, 2, and 2 alleles, respectively, evolving under neutrality.

3.2. Approximate Bayesian Computation (ABC) for inference on the mutation parameter and the strength of selection

The stationary distribution obtained from the diffusion approximation to a Wright-Fisher model produces a model for the random population frequencies. In other words, the population frequencies are a random draw from the stationary distribution in equation 4, which generates a population of joint allele frequencies x and (y1, y2, …, yK) at a locus. The sample of allele counts is a multinomial random sample given the population frequencies, and it consists of the sample counts of alleles under the selection model, $n_x = (n_{x_1}, n_{x_2}, \ldots, n_{x_K})$, and the sample counts of alleles under the neutral model, $n_{y_i} = (n_{y_{i1}}, n_{y_{i2}}, \ldots, n_{y_{ik_i}})$, i = 1, 2, …, K, where $n_{x_i}$ and $n_{y_{ij}}$ are the numbers of copies of $A_i$ and $A_{ij}$ in the sample, respectively.

The likelihood of the sample allele frequencies is based on a multinomial sample in which the success probability for each allele type is the population allele frequency generated by the distribution in equation 4. By Bayes’ theorem, the joint posterior distribution of θ and Σ given the sample counts $n_x$, $n_{y_i}$, i = 1, 2, …, K is

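$$\pi(\theta, \Sigma \mid n_x, n_{y_1}, n_{y_2}, \ldots, n_{y_K}, \kappa) \propto P(n_x, n_{y_1}, n_{y_2}, \ldots, n_{y_K} \mid \theta, \Sigma, \kappa)\, \pi(\theta)\, \pi(\Sigma),$$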

where $P(n_x, n_{y_1}, n_{y_2}, \ldots, n_{y_K} \mid \theta, \Sigma, \kappa)$ is the joint likelihood of the observed allele counts given the parameters, and π(θ), π(Σ) are the prior distributions of the mutation parameter and the strength of selection, assuming prior independence of θ and Σ. Performing inference on the parameters θ and Σ is challenging. The parameter-dependent normalizing constant Cx(θ, Σ, κ), which is computationally intractable, appears in the joint population distribution of the allele frequencies, and thus the calculation of the likelihood $P(n_x, n_{y_1}, n_{y_2}, \ldots, n_{y_K} \mid \theta, \Sigma, \kappa)$ depends on it. Standard computational methods to sample a distribution, such as the rejection algorithm or Markov chain Monte Carlo (MCMC), require evaluating the likelihoods up to a normalizing constant independent of parameters, and therefore they are not directly applicable here. However, the result given in Appendix 1 and described in section 3.1 allows us to simulate random draws of population frequencies from the stationary distribution of the allele frequencies in a two-level model. This is achieved by simulating the allele frequencies from the model under selection and from the models under neutrality, all independently of each other. Using these frequencies as parameters of the multinomial distribution, we simulate samples $n_x, n_{y_1}, n_{y_2}, \ldots, n_{y_K}$ as random draws from the multinomial model. These simulated samples are key to performing inference by ABC (Beaumont et al. 2002, Tavaré et al. 1997), which allows us to sample the posterior distribution of the parameters.
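One way to see where the intractable constant enters is to write the likelihood as an integral over the unobserved population frequencies. The display below is a sketch under two assumptions taken from the sampling scheme of Algorithm 1: $n_x$ is multinomial with size n and probabilities x, and each $n_{y_i}$ is multinomial with size $n_{x_i}$ and probabilities $y_i$. The second line then follows from the factorization in equation 4.

```latex
\begin{aligned}
P(n_x, n_{y_1}, \ldots, n_{y_K} \mid \theta, \Sigma, \kappa)
 &= \int\!\!\int P(n_x \mid x) \prod_{i=1}^{K} P(n_{y_i} \mid n_{x_i}, y_i)\,
    f(x, y \mid \theta, \Sigma, \kappa)\, dy\, dx \\
 &= \left[ \int P(n_x \mid x)\,
    f(x \mid \theta_1, \ldots, \theta_K, \Sigma, \kappa)\, dx \right]
    \prod_{i=1}^{K} \left[ \int P(n_{y_i} \mid n_{x_i}, y_i)\,
    f(y_i \mid \theta_{k_i}, \kappa)\, dy_i \right]
\end{aligned}
```

Under these assumptions, each neutral factor is a multinomial probability averaged over a Dirichlet distribution and so has a closed form, whereas the first factor contains $C_x^{-1}$ and is the part that makes direct evaluation of the likelihood difficult.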

ABC is a simulation-based inference method to produce an approximate sample from the posterior distribution of parameters. ABC utilizes a large number of data sets simulated from the data generating process under different parameter values to approximate the likelihood of the observed data. When different parameter values are drawn from the prior distribution, the likelihood is weighted by the prior and ABC samples approximate draws from the posterior distribution. In practice, ABC is often implemented as follows. First, a reference table of parameter values, generated from their prior distribution, and the data sets (one per parameter set) simulated under those parameter values is built. Second, the dimensionality of all simulated data sets is reduced by summarizing each data set with summary statistics. These statistics are then compared to the same summary statistics calculated from the observed data set. The parameter values that produce summary statistics that are close to the summary statistics calculated from the observed data set are accepted as approximate samples from the posterior distribution. Closeness is made precise with a tolerance parameter and a metric to measure the distance between summary statistics as described in the next paragraph. There are two approximations in ABC. The first approximation is due to dimension reduction from the full data set to summary statistics. Often, the summary statistics are not sufficient for parameters of the model and thus there is information loss by this dimension reduction. The statistical consequence is to replace the likelihood of the data with the likelihood of the summary statistics. The second approximation arises in all but very small sample spaces. For large discrete or continuous data spaces it is not practical to expect to simulate data sets that produce summary statistics which exactly match the summary statistics calculated from the observed data set. 
Therefore, a distance metric such as Euclidean distance and a small tolerance ε are employed to quantify the approximation in the acceptance step. In the remainder of this section we discuss how to efficiently simulate a large number of data sets under the model with two levels of genetic variability as well as which errors are involved when we use ABC with our model.
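The accept/reject logic described above can be sketched generically as follows. This is a toy illustration and not the implementation used in this paper: the function `abc_rejection`, the uniform prior, the sample homozygosity summary statistic, and all numerical values are assumptions made for the example, which infers θ in a neutral 4-allele model.

```python
import numpy as np

def abc_rejection(observed_stats, prior_sampler, simulator, summarize,
                  n_sims, epsilon, rng):
    """Basic ABC rejection sampler using Euclidean distance."""
    accepted = []
    for _ in range(n_sims):
        params = prior_sampler(rng)                 # draw from the prior
        stats = summarize(simulator(params, rng))   # simulate, then reduce
        if np.linalg.norm(stats - observed_stats) <= epsilon:
            accepted.append(params)                 # within tolerance: keep
    return np.array(accepted)

K, n = 4, 100   # hypothetical number of alleles and sample size

def prior_sampler(rng):
    return rng.uniform(0.5, 20.0)                   # assumed prior on theta

def simulator(theta, rng):
    y = rng.dirichlet(np.full(K, theta / K))        # neutral frequencies
    return rng.multinomial(n, y)                    # sample allele counts

def summarize(counts):
    return np.array([np.sum((counts / n) ** 2)])    # sample homozygosity

rng = np.random.default_rng(2)
observed = summarize(simulator(5.0, rng))           # pseudo-observed data
posterior = abc_rejection(observed, prior_sampler, simulator, summarize,
                          n_sims=2000, epsilon=0.02, rng=rng)
```

The accepted values in `posterior` are approximate draws from the posterior of θ given the summary statistic. Shrinking `epsilon` reduces the approximation error at the cost of fewer accepted draws.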

Allele frequencies under the neutral K-allele model with parameters θνi, i = 1, 2, …, K are simulated from the Dirichlet distribution by standard methods. Simulation of allele frequencies under the K-allele model with selection can be performed by several methods. The efficiency of these methods depends on the complexity of the selection scheme, in the sense of the number of distinct parameters in the matrix of selection parameters, and on the magnitude of these parameters, that is, the strength of selection. Often, selection schemes in which the matrix Σ can be parameterized by a single parameter σ are of interest to population geneticists. Examples include symmetric balancing selection, in which all heterozygotes have the same advantage over all homozygotes; symmetric homozygote advantage, in which all homozygotes have the same advantage over all heterozygotes; or directional selection, where one genotype has higher fitness than all other genotypes. Efficient numerical methods using Fast Fourier Transforms to calculate the normalization constants Cx(θ, Σ, κ) exist for these models (see Joyce et al. 2012). Once these constants are calculated on a fine grid of parameter values, random allele frequencies are generated by first evaluating the cumulative distribution function (CDF) on the grid and then using the numerical inverse as a lookup table. This is the numerical inverse CDF method.
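The lookup-table idea can be illustrated in the simplest setting of two alleles, where the frequency of the first allele is a single number in (0, 1). The sketch below is only an illustration under stated assumptions: it evaluates the unnormalized density on a grid by brute force and does not use the FFT-based computation of Joyce et al. (2012). The function name `inverse_cdf_sampler`, the grid size, and the parameter values are hypothetical.

```python
import numpy as np

def inverse_cdf_sampler(log_density, n_grid, size, rng):
    """Sample from a density on (0, 1) by the numerical inverse CDF method."""
    # Cell midpoints avoid the endpoints 0 and 1, where the density
    # may be unbounded.
    grid = (np.arange(n_grid) + 0.5) / n_grid
    logf = log_density(grid)
    weights = np.exp(logf - logf.max())   # unnormalized cell probabilities
    cdf = np.cumsum(weights)
    cdf /= cdf[-1]                        # the normalizing constant cancels
    u = rng.uniform(size=size)
    idx = np.searchsorted(cdf, u)         # smallest index with cdf[idx] >= u
    idx = np.minimum(idx, n_grid - 1)     # guard against rounding at the top
    return grid[idx]

theta, sigma = 4.0, 10.0   # hypothetical mutation and selection parameters

def log_density(x):
    # Two-allele version of equation 1 with symmetric mutation and
    # Sigma = [[0, sigma], [sigma, 0]], so that x Sigma x^T = 2 sigma x (1 - x).
    return (2.0 * sigma * x * (1.0 - x)
            + (theta / 2.0 - 1.0) * (np.log(x) + np.log1p(-x)))

rng = np.random.default_rng(3)
x1_draws = inverse_cdf_sampler(log_density, n_grid=10_000, size=20_000, rng=rng)
```

Because this density is symmetric about 1/2, the sample mean of `x1_draws` should be close to 0.5. The resolution of the draws is limited by the grid spacing.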

If the selection matrix Σ is structurally complex in the sense that the number of distinct selection parameters is large, then rejection sampling is a good candidate for simulating the allele frequencies under selection. An example is using the Dirichlet distribution as the proposal distribution, which corresponds to the stationary distribution under neutrality, with the same mutation parameters as the distribution under selection, which is the target distribution. This approach works well when the selection parameters are small, but it does not work well when selection is strong. Strong selection implies large values of parameters in the selection matrix, and the population frequencies are far from what would be expected under neutrality. Hence, the frequencies proposed from the neutral model will often be rejected, making the rejection algorithm inefficient. In the examples provided in section 4 we use small values for the selection parameters and employ rejection sampling. Algorithm 1 shows our procedure to simulate samples from the joint distribution of the allele frequencies under the model with two levels of genetic variability given the parameters θ, Σ, κ = (K, ki, i = 1, 2, …, K), $\tau = \sum_{i=1}^{K} k_i$, and the total sample size n.

Algorithm 1.

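  1. Simulate a sample under selection:
    1. Simulate $x^* \sim \text{Dirichlet}(\theta_1, \theta_2, \ldots, \theta_K)$, where $\theta_i = k_i[(\theta/\tau) - 1] + 1$.
    2. Accept $x^*$ as a draw from the distribution under selection with probability proportional to $e^{x^{\top} \Sigma x}$.
    3. Simulate $n_x \sim \text{Multinomial}(n, x)$, where $n$ is the sample size from the population and $n_x = (n_{x_1}, n_{x_2}, \ldots, n_{x_K})$.
  2. Simulate samples under neutrality conditional on $n_x$:

    For $i = 1, \ldots, K$:

    1. Simulate $y_i \sim \text{Dirichlet}(\theta_{k_1}, \theta_{k_2}, \ldots, \theta_{k_i})$, where $\theta_{k_i} = k_i(\theta/\tau)$.

    2. Simulate $n_{y_i} \sim \text{Multinomial}(n_{x_i}, y_i)$.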
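The steps of Algorithm 1 can be sketched in a few lines of Python using only the standard library. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the acceptance probability is scaled by the bound Q(x) ≤ max σil (valid because the products xi xl are nonnegative and sum to 1), and each of the ki neutral frequencies within class i is given the Dirichlet parameter θ/τ, which is our reading of the total θki = ki(θ/τ) under symmetric mutation. The sketch raises an error when the formula for θi yields a nonpositive value, since the Dirichlet proposal is then undefined.

```python
import math
import random


def dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [v / s for v in g]


def multinomial(n, probs, rng):
    """Draw multinomial counts by tallying n categorical draws."""
    counts = [0] * len(probs)
    for idx in rng.choices(range(len(probs)), weights=probs, k=n):
        counts[idx] += 1
    return counts


def simulate_two_level(theta, Sigma, k, n, rng):
    """One draw of (x, n_x, y, n_y) following the steps of Algorithm 1."""
    K, tau = len(k), sum(k)
    # Step 1.1: proposal parameters theta_i = k_i[(theta/tau) - 1] + 1.
    alpha_x = [ki * (theta / tau - 1.0) + 1.0 for ki in k]
    if min(alpha_x) <= 0.0:
        raise ValueError("theta_i must be positive for the Dirichlet proposal")
    q_bound = max(max(row) for row in Sigma)  # upper bound on x' Sigma x
    while True:
        x = dirichlet(alpha_x, rng)
        q = sum(Sigma[i][l] * x[i] * x[l] for i in range(K) for l in range(K))
        # Step 1.2: accept with probability exp(Q(x) - q_bound) <= 1.
        if rng.random() < math.exp(q - q_bound):
            break
    # Step 1.3: sample counts at the selected level.
    n_x = multinomial(n, x, rng)
    # Step 2: neutral frequencies and counts within each selected class.
    y, n_y = [], []
    for i in range(K):
        y_i = dirichlet([theta / tau] * k[i], rng)  # assumed symmetric, total k_i*theta/tau
        y.append(y_i)
        n_y.append(multinomial(n_x[i], y_i, rng))
    return x, n_x, y, n_y


if __name__ == "__main__":
    Sigma = [[-3.0, 0.0], [0.0, -3.0]]  # balancing selection with sigma = 3
    print(simulate_two_level(6.0, Sigma, k=[2, 2], n=50, rng=random.Random(7)))
```

The loop over proposals is where strong selection becomes costly: the larger the entries of Σ, the smaller the typical acceptance probability.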

4. Inference on the mutation parameter and the strength of selection under the model with two levels of genetic variability

In this section, we present a simulation study designed to investigate the performance of estimates of the mutation parameter and the strength of selection by ABC. We consider a model in which the allele frequencies are available from multi-locus genomes where the alleles at each locus evolve under a model with two levels of genetic variability.

Equation 4 holds for any selection scheme, and simulating allele frequencies evolving under highly parameterized selection matrices is straightforward in principle. However, these models are rarely useful for developing intuition about broad selection schemes because they are too complex for the individual parameters to be interpreted meaningfully. Further, there are often not enough degrees of freedom in the allele frequencies to estimate all parameters of a fully parameterized selection matrix precisely. Thus we focus on three selection schemes (directional selection, directional genic selection, and balancing selection) and vary the following parameters.

  1. The mutation parameter and the strength of selection (θ, σ),

  2. The number of loci in the genome L,

  3. The number of alleles at each level at each locus, K and ki (i = 1, 2, …, K),

  4. The sample size from each locus (nx, ny).

For all models we assume symmetric recurrent mutation among alleles.

4.1. The selection models, the simulation parameters, and ABC

In the directional selection model, the matrix Σ has all elements equal to 0 except for σii = σ > 0 for a given i. That is, only one homozygote has higher fitness than the rest of the genotypes all of which have equal fitness. The directional genic selection model is the same as directional selection with the addition that σij = σii/2 for all j. This is the case where the heterozygotes AiAj carrying only one Ai allele have half the fitness of the homozygote AiAi, and genotypes carrying no Ai have 0 fitness.

Perhaps the most interesting selection scheme that we consider is symmetric balancing selection. This is the selection scheme that was studied by Wright (1949), Watterson (1977), Donnelly, Nordborg, and Joyce (2001), and Buzbas and Joyce (2009). Here, Σ is a diagonal matrix with elements σii = −σ, where σ > 0, on the diagonal and 0 elsewhere. This means that all heterozygotes have equal fitness, which is higher than the (equal) fitness of all homozygotes. In this model, precise joint estimation of the mutation and selection parameters is a statistically challenging task because mutation and selection both promote genetic variability.
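As an illustration, the three selection matrices can be written down directly. The following minimal sketch uses hypothetical function names and assumes that, in the genic scheme, the heterozygote entries are filled symmetrically (σij = σji = σ/2), since AiAj and AjAi denote the same genotype.

```python
def directional(K, sigma, i=0):
    """Sigma with the single nonzero entry sigma_ii = sigma."""
    S = [[0.0] * K for _ in range(K)]
    S[i][i] = sigma
    return S


def directional_genic(K, sigma, i=0):
    """Directional selection plus sigma_ij = sigma_ii / 2 for heterozygotes A_iA_j."""
    S = directional(K, sigma, i)
    for j in range(K):
        if j != i:
            S[i][j] = sigma / 2.0
            S[j][i] = sigma / 2.0  # assumed symmetric fill
    return S


def balancing(K, sigma):
    """Diagonal Sigma with sigma_ii = -sigma and zeros elsewhere."""
    return [[-sigma if i == j else 0.0 for j in range(K)] for i in range(K)]


def quadratic_form(x, S):
    """Q(x) = sum_i sum_l sigma_il * x_i * x_l."""
    K = len(x)
    return sum(S[i][l] * x[i] * x[l] for i in range(K) for l in range(K))


if __name__ == "__main__":
    x = [0.25, 0.25, 0.25, 0.25]
    for build in (directional, directional_genic, balancing):
        print(build.__name__, quadratic_form(x, build(4, 2.0)))
```

Under the symmetric fill assumed here, the genic quadratic form reduces to σ xi, because σ xi² + σ xi(1 − xi) = σ xi, while under balancing selection it equals −σ times the homozygosity.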

The data from simulations were obtained as follows. We first generated M = 1000 selection and mutation parameters (σ(i), θ(i)), i = 1, 2, …, M, where θ(i) and σ(i) were drawn independently of each other from a uniform distribution on [2, 8]. The true parameter values for the mutation rate and the selection strength were also generated independently of each other from the discrete uniform distribution on [2, 8]. For each of these M parameter sets, we simulated the population frequencies (x(i), y(i)) and then sampled the population multinomially to get nx(i), ny(i). The sample sizes nx(i), ny(i) were set to 50 and 100. For the number of loci L, we used 5 values (2, 5, 10, 20, and 40). For each locus we generated a random number of alleles under selection K and the neutral alleles ki, i = 1, 2, …, K independently of each other from a discrete uniform distribution on [2, 6].

We performed the statistical inference using ABC as described in section 3.2 and obtained a posterior sample from the joint posterior distribution of (θ, σ). For all of our models, we were able to eliminate the error in ABC that arises when the data likelihood is replaced by the likelihood of summary statistics that are not sufficient. This is because the stationary distributions of the Wright-Fisher models with selection belong to the well-studied exponential family of distributions, for which statistics jointly sufficient for (θ, σ) are well known. The first class of sufficient statistics applies to alleles at both levels in our model and is defined as the sum of the logarithms of the allele frequencies. We denote these statistics by $G_x = \sum_{i=1}^{K} \log x_i$ and $G_{y_i} = \sum_{j=1}^{k_i} \log y_{ij}$ for all i. The second class of sufficient statistics is related to the exponential function in the selection model. For example, in the directional selection model only $x_i^2$ is multiplied by a nonzero selection parameter σii, and so $x_i^2$ is a sufficient statistic. In directional genic selection, $x_i^2$ and $x_i x_j$ for all j are sufficient statistics. Finally, under the balancing selection model, the well-known homozygosity given by $\sum_{i=1}^{K} x_i^2$ is a sufficient statistic.
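A minimal sketch of how these statistics might be computed from one locus is given below. The function names and the dictionary layout are hypothetical; all frequencies are assumed to be strictly positive so that the logarithms are finite.

```python
import math


def log_sum(freqs):
    """Sum of the logarithms of a vector of allele frequencies."""
    return sum(math.log(v) for v in freqs)


def summary_statistics(x, y, scheme, i=0):
    """Sufficient statistics for one locus under the named selection scheme."""
    stats = {
        "G_x": log_sum(x),                    # selected level
        "G_y": [log_sum(y_i) for y_i in y],   # neutral level, one value per class
    }
    if scheme == "directional":
        stats["selection"] = [x[i] ** 2]
    elif scheme == "directional_genic":
        stats["selection"] = [x[i] ** 2] + [
            x[i] * x[j] for j in range(len(x)) if j != i
        ]
    elif scheme == "balancing":
        stats["selection"] = [sum(v * v for v in x)]  # homozygosity
    else:
        raise ValueError("unknown selection scheme: " + scheme)
    return stats


if __name__ == "__main__":
    x = [0.5, 0.25, 0.25]
    y = [[0.5, 0.5], [1.0], [0.2, 0.8]]
    print(summary_statistics(x, y, "balancing"))
```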

We implemented ABC by first sampling $10^6$ values from the prior distributions on θ and σ, assuming prior independence. We used uniform distributions for the prior distribution of (θ, σ), which were also the proposal distributions in our rejection method. Ideally, the support of a uniform prior distribution on parameters should be as large as possible so as not to assign zero probability to any parameter value. However, using large arbitrary bounds is computationally inefficient under selection, because simulating allele frequencies under strong selection (for large values of σ) requires a computationally infeasible number of proposals under neutrality. We wanted to keep the bounds of the prior distributions as large as possible while still being able to simulate the allele frequencies under the desired selection scheme. In the end, we settled on the following intervals. Under balancing selection we used a uniform prior on σ ∈ [0, 500], under directional selection we used σ ∈ [0, 14], and for directional genic selection we used σ ∈ [0, 20]. We then generated one set of population allele frequencies for each parameter pair (θ, σ) and kept the $10^3$ parameter sets that produced summary statistics closest to the summary statistics calculated from the observed data set, rejecting the rest. We applied the post-sampling linear regression adjustment (Beaumont et al. 2002) to these sampled points and obtained the final sample from the joint posterior distribution of parameters.
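The rejection step followed by a regression adjustment can be sketched on a deliberately simplified toy problem. The sketch below is not the analysis of this paper: it estimates a single parameter θ of a neutral symmetric K-allele model, uses as summary the average of Gx over loci, draws far fewer prior samples, and replaces the weighted local-linear regression of Beaumont et al. (2002) by an unweighted least-squares fit. All function names and the prior bounds are assumptions made for illustration.

```python
import math
import random


def dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [v / s for v in g]


def summary(theta, K, L, rng):
    """Average over L loci of G_x = sum_i log x_i under the neutral toy model."""
    total = 0.0
    for _ in range(L):
        x = dirichlet([theta / K] * K, rng)
        total += sum(math.log(v) for v in x)
    return total / L


def abc_regression(s_obs, K, L, n_sims, n_keep, prior, rng):
    """Rejection ABC with an unweighted linear regression adjustment."""
    lo, hi = prior
    sims = []
    for _ in range(n_sims):
        theta = rng.uniform(lo, hi)          # draw from the uniform prior
        sims.append((theta, summary(theta, K, L, rng)))
    # Keep the n_keep draws whose summaries are closest to the observed one.
    sims.sort(key=lambda ts: abs(ts[1] - s_obs))
    kept = sims[:n_keep]
    t = [theta for theta, _ in kept]
    d = [s - s_obs for _, s in kept]
    # Least-squares slope of theta on (s - s_obs) among the kept draws.
    d_bar, t_bar = sum(d) / n_keep, sum(t) / n_keep
    sxx = sum((v - d_bar) ** 2 for v in d)
    sxy = sum((v - d_bar) * (w - t_bar) for v, w in zip(d, t))
    slope = sxy / sxx
    # Project every kept draw to the observed summary.
    adjusted = [w - slope * v for v, w in zip(d, t)]
    return t, adjusted


if __name__ == "__main__":
    rng = random.Random(3)
    s_obs = summary(5.0, K=4, L=20, rng=rng)   # pseudo-observed data, theta = 5
    raw, adj = abc_regression(s_obs, 4, 20, 2000, 100, (2.0, 8.0), rng)
    print(sum(raw) / len(raw), sum(adj) / len(adj))
```

The adjustment shifts each retained draw along the fitted line to the observed summary, which typically tightens the retained sample relative to plain rejection.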

4.2. Results and discussion

We measured the performance of estimators using the mean-squared error (MSE) based on the posterior samples and the test values of the parameters used to generate the population frequencies (x, y) for each simulation. The selection scheme had a substantial effect on the precision of the selection and mutation parameter estimates (Figure 4). Under balancing selection, we obtained a larger MSE for the selection parameter than under either the directional genic selection or the directional selection model. This result has been observed before (Buzbas and Joyce 2009), and intuitively can be explained as follows. In the case of symmetric balancing selection with symmetric mutation, the mode of the stationary distribution for each allele is 1/K, indicating that both mutation and selection processes work toward decreasing variance in the allele frequencies (see Figure 2 top row, third plot). Thus, given a sample of alleles, it is statistically difficult to distinguish whether the elevated genetic variability in that sample is due to balancing selection or to recurrent mutation. In contrast, under directional or directional genic selection, only the frequencies of specific alleles, those that contribute to genotypes with high fitness, increase, rather than all allele frequencies. The estimates of the mutation parameter are fairly similar across selection schemes and their precision is high, although the estimates under the directional selection scheme had lower MSE than those under both balancing and genic selection.
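One plausible reading of an MSE "based on the posterior samples and the test values" is the average squared distance between the posterior draws and the value used to generate the data. The sketch below, with hypothetical function names, computes this quantity and its decomposition into squared bias and posterior variance.

```python
def posterior_mse(samples, true_value):
    """Mean squared distance between posterior draws and the true value."""
    return sum((s - true_value) ** 2 for s in samples) / len(samples)


def bias_and_variance(samples, true_value):
    """Squared bias of the posterior mean and variance of the posterior draws."""
    mean = sum(samples) / len(samples)
    variance = sum((s - mean) ** 2 for s in samples) / len(samples)
    return (mean - true_value) ** 2, variance


if __name__ == "__main__":
    draws = [4.0, 5.0, 6.0, 7.0]
    bias_sq, var = bias_and_variance(draws, 5.0)
    print(posterior_mse(draws, 5.0), bias_sq + var)
```

The two printed numbers agree, since the MSE defined this way equals squared bias plus variance; a large MSE can therefore reflect either a displaced or a diffuse posterior.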

Figure 4.

Comparison of mean-squared error (MSE) of estimators for the strength of selection σ and the mutation parameter θ under each selection scheme (balancing, directional genic, and directional) with respect to the genome size considered (as a function of the number of loci). Each locus has 100 alleles. The points denote the mean of the MSE of estimators, and the bars denote the standard deviation of MSE of estimators. The plots show the number of loci slightly offset, so that the estimates and bars do not overlap.

As the number of alleles increased, the MSE of the estimators decreased (Figure 5). This result shows that increasing the number of alleles in a model with symmetric mutation increases the information about the parameters. An interpretation of this result for the mutation parameter that relies on properties of the Dirichlet family of distributions is as follows. Observations from a symmetric Dirichlet distribution with a large number of classes are more informative about the parameter of the distribution than observations from a symmetric Dirichlet distribution with a small number of classes. Our results show that the same property holds for the selection parameter under the selection schemes that we investigated here.
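The Dirichlet interpretation can be checked with a short calculation, offered here as an illustration rather than as part of the original analysis. Assume a symmetric Dirichlet distribution with K classes and total parameter θ, so that each class has parameter θ/K. Differentiating the log density twice with respect to θ gives the Fisher information I(θ; K) = ψ′(θ/K)/K − ψ′(θ), where ψ′ is the trigamma function. The sketch evaluates this expression with a small trigamma routine written for the purpose.

```python
import math


def trigamma(z):
    """Trigamma function for z > 0 via recurrence and an asymptotic series."""
    acc = 0.0
    while z < 6.0:
        acc += 1.0 / (z * z)  # psi'(z) = psi'(z + 1) + 1 / z^2
        z += 1.0
    inv = 1.0 / z
    inv2 = inv * inv
    # psi'(z) ~ 1/z + 1/(2 z^2) + 1/(6 z^3) - 1/(30 z^5) + 1/(42 z^7)
    series = inv + 0.5 * inv2 + inv * inv2 * (1.0 / 6.0 - inv2 * (1.0 / 30.0 - inv2 / 42.0))
    return acc + series


def fisher_information(theta, K):
    """Information about theta in one draw from Dirichlet(theta/K, ..., theta/K)."""
    return trigamma(theta / K) / K - trigamma(theta)


if __name__ == "__main__":
    for K in (2, 5, 10, 20):
        print(K, fisher_information(4.0, K))
```

For fixed θ the information grows with K under this parameterization, in line with the interpretation given above.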

Figure 5.

Comparison of mean-squared error (MSE) for mutation parameter θ and the strength of selection σ in each type of selection model (balancing, directional genic, and directional) for a given number of loci sampled with a different number of alleles sampled. Points correspond to the mean of the estimator MSE, bars correspond to the standard deviation of the estimator MSE. The plots have the number of loci slightly offset, so that estimates and bars do not overlap. Sample size gives the number of alleles sampled.

Differences in the MSE of estimators between low (50) and high (100) sample sizes tended to be greater for the selection parameter than for the mutation parameter, and the MSE of the selection parameter estimator was much larger than 0 under all selection schemes. We note that the maximum number of loci considered in this paper, 40, is not high by today’s genomic standards, and that the sample size of 100 is small. In all cases the MSE of the mutation parameter estimator was close to zero even with a fairly moderate number of loci sampled.

In conclusion, for the stationary distribution of allele frequencies arising in Wright-Fisher models with two levels of genetic variability, our findings indicate low MSE in estimates of the mutation parameter and high MSE in estimates of the strength of selection. The results from this work extend one of the main findings in Paul Joyce’s work on K-allele Wright-Fisher models to a broader class of Wright-Fisher models with two levels of genetic variability. In joint estimation of the strength of balancing selection and the mutation parameter, the strength of selection is the more difficult parameter to estimate, as reflected in its high MSE.

Acknowledgments

We thank an anonymous reviewer for their careful reading of an earlier draft of this manuscript and for catching mistakes in the proof. Research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under Award Number P20GM104420. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Appendix 1

We denote the joint allele frequencies by (x, y), where x = (x1, x2, …, xK) is the vector of allele frequencies under the K-allele model with selection, subject to $\sum_{i=1}^{K} x_i = 1$, and y is a list of length K in which each element, denoted by yi, is a vector of length ki with $y_i = (y_{i1}, y_{i2}, \ldots, y_{ik_i})$, subject to $\sum_{j=1}^{k_i} y_{ij} = 1$ for all i = 1, 2, …, K. Our goal is to show that conditional on K and the partitioning of K by ki defined by the vector parameter κ = (K, ki, i = 1, 2, …, K), the joint stationary distribution of the allele frequencies (x, y) for the model with two levels of genetic variability is equal to the product of the stationary distribution of the allele frequencies x, the level at which selection operates, and the stationary distribution of the allele frequencies y, the level at which the evolution is neutral.

Since by definition (x, y) satisfy $\sum_{i=1}^{K} x_i = 1$ and $\sum_{j=1}^{k_i} y_{ij} = 1$, the frequency of the allele Aij in the population is the product of the frequency of the selected allele xi and the frequency of the neutral allele within its selected class yij, and it is equal to xiyij. We use this fact and the assumption of symmetric mutation among alleles to obtain our result. We denote the total number of yij alleles by $\tau = \sum_{i=1}^{K} k_i$. The joint stationary distribution of (x, y) that arises from the diffusion approximation to the Wright-Fisher model with full selection matrix Σ and mutation parameter θ is of the form

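$$f(x, y \mid \theta, \Sigma, \kappa) = C_{xy}^{-1}(\theta, \Sigma, \kappa)\, e^{Q(x, y, \Sigma)} \prod_{i=1}^{K} \prod_{j=1}^{k_i} (x_i y_{ij})^{\theta/\tau - 1}. \tag{5}$$

Here, Cxy(θ, Σ, κ) is the normalizing constant that makes f(x, y | θ, Σ, κ) integrate to 1, and it is given by

$$\int_{X \times Y} e^{Q(x, y, \Sigma)} \prod_{i=1}^{K} \prod_{j=1}^{k_i} (x_i y_{ij})^{\theta/\tau - 1}\, dx\, dy$$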

where we use a single integral sign for multiple integrals on X, the allele frequency space of x, and on $Y = Y_1 \times Y_2 \times \cdots \times Y_K$, with Yi denoting the allele frequency space of yi, i = 1, 2, …, K. We now focus on the function Q(x, y, Σ) because our proof relies on showing that this function depends only on the allele frequencies x and not on y. The full selection matrix is Σ = {σil}, and for genotype AijAlm, i = 1, 2, …, K, l = 1, 2, …, K, σil is the selection parameter. This selection parameter is the same for all j, m in a given pair of selected classes denoted by the indices i, l. The function Q(x, y, Σ) is defined by the stationary distribution of allele frequencies under the Wright-Fisher diffusion:

$$Q(x, y, \Sigma) = \sum_{i=1}^{K} \sum_{l=1}^{K} \sum_{j=1}^{k_i} \sum_{m=1}^{k_l} \sigma_{il}\, (x_i y_{ij})(x_l y_{lm}). \tag{6}$$

We write this equation as

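$$\begin{aligned}
Q(x, y, \Sigma) &= \sum_{i=1}^{K} \sum_{l=1}^{K} \sigma_{il}\, x_i x_l \sum_{j=1}^{k_i} y_{ij} \sum_{m=1}^{k_l} y_{lm} && (7) \\
&= \sum_{i=1}^{K} \sum_{l=1}^{K} \sigma_{il}\, x_i x_l \sum_{j=1}^{k_i} y_{ij} && (8) \\
&= \sum_{i=1}^{K} \sum_{l=1}^{K} \sigma_{il}\, x_i x_l. && (9)
\end{aligned}$$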

Equation 7 follows from noting that σil, xi, xl do not depend on j, m so they can be taken out of the two innermost sums, and yij does not depend on m so it can be taken out of the innermost sum. Equation 8 follows because the sum of allele frequencies ylm over index m is equal to 1, and equation 9 follows because the sum of allele frequencies yij over index j is equal to 1.
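The collapse of the four-fold sum to a quadratic form in x can also be checked numerically. The following sketch draws arbitrary frequencies and an arbitrary selection matrix and compares the two expressions for Q; the function names and the particular parameter values are illustrative assumptions.

```python
import random


def dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [v / s for v in g]


def q_full(x, y, Sigma):
    """Four-fold sum over i, l, j, m of sigma_il (x_i y_ij)(x_l y_lm)."""
    K = len(x)
    return sum(
        Sigma[i][l] * (x[i] * y[i][j]) * (x[l] * y[l][m])
        for i in range(K)
        for l in range(K)
        for j in range(len(y[i]))
        for m in range(len(y[l]))
    )


def q_reduced(x, Sigma):
    """Double sum over i, l of sigma_il x_i x_l, which does not involve y."""
    K = len(x)
    return sum(Sigma[i][l] * x[i] * x[l] for i in range(K) for l in range(K))


if __name__ == "__main__":
    rng = random.Random(11)
    k = [2, 3, 1]
    x = dirichlet([1.0] * len(k), rng)
    y = [dirichlet([1.0] * ki, rng) for ki in k]
    Sigma = [[rng.uniform(-5.0, 5.0) for _ in k] for _ in k]
    print(q_full(x, y, Sigma), q_reduced(x, Sigma))
```

The two values agree up to floating-point round-off for any choice of y, because each inner sum of neutral frequencies equals 1.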

Substituting Q(x, y, Σ) from equation 9 into equation 5 we get the joint distribution of (x, y) under full selection matrix as

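$$f(x, y \mid \theta, \Sigma, \kappa) = C_{xy}^{-1}(\theta, \Sigma, \kappa)\, e^{\sum_{i=1}^{K} \sum_{l=1}^{K} \sigma_{il} x_i x_l} \prod_{i=1}^{K} \prod_{j=1}^{k_i} (x_i y_{ij})^{(\theta/\tau) - 1}. \tag{10}$$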

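We now distribute the power (θ/τ) − 1 to xi and yij, factor xi out of the inner product, and add and subtract 1 in the power of xi to get

$$f(x, y \mid \theta, \Sigma, \kappa) = C_{xy}^{-1}(\theta, \Sigma, \kappa)\, e^{\sum_{i=1}^{K} \sum_{l=1}^{K} \sigma_{il} x_i x_l} \prod_{i=1}^{K} x_i^{k_i[(\theta/\tau) - 1] + 1 - 1} \prod_{j=1}^{k_i} y_{ij}^{(\theta/\tau) - 1}. \tag{11}$$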
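The exponent bookkeeping in this step can be spelled out for a single selected class i. The shorthand a = θ/τ − 1 is introduced here only for illustration:

```latex
\begin{aligned}
\prod_{j=1}^{k_i} (x_i y_{ij})^{a}
  &= x_i^{k_i a} \prod_{j=1}^{k_i} y_{ij}^{a},
  \qquad a = \theta/\tau - 1, \\
x_i^{k_i a}
  &= x_i^{(k_i a + 1) - 1}
   = x_i^{\theta_i - 1},
  \qquad \theta_i = k_i[(\theta/\tau) - 1] + 1 .
\end{aligned}
```

For example, with ki = 2 and θ/τ = 1.5 we have a = 0.5 and θi = 2, so xi enters with exponent θi − 1 = 1.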

We now rearrange this density function and write it as the product of the stationary distribution of the allele frequencies x under selection and of y under neutrality. The stationary distribution of x under selection and generic mutation is given by

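$$f(x \mid \theta, \Sigma, \kappa) = C_{x}^{-1}(\theta, \Sigma, \kappa)\, e^{\sum_{i=1}^{K} \sum_{l=1}^{K} \sigma_{il} x_i x_l} \prod_{i=1}^{K} x_i^{(\theta_i - 1)}, \tag{12}$$

where $\sum_{i=1}^{K} \theta_i = \theta$. On the other hand, the stationary distribution of the allele frequencies under neutrality for each yi with symmetric mutation is given by

$$f(y_i \mid \theta, \kappa) = B_{y_i}^{-1}(\theta, \kappa) \prod_{j=1}^{k_i} y_{ij}^{(\theta/k_i - 1)}. \tag{13}$$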

Thus, our goal is to write equation 11 as a function of equations 12 and 13. We note that for i = 1, 2, …, K the mutation parameters for yi are given by θki = ki(θ/τ), whereas the mutation parameters for x are given by θi = ki[(θ/τ) − 1] + 1. Using this notation and equations 12 and 13, we write the stationary distribution in equation 11 as

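$$f(x, y \mid \theta, \Sigma, \kappa) = C_{xy}^{-1}(\theta, \Sigma, \kappa)\, C_{x}(\theta, \Sigma, \kappa) \left[ \prod_{i=1}^{K} B_{y_i}(\theta_{k_i}, \kappa) \right] f(x \mid \theta, \Sigma, \kappa)\, f(y_i \mid \theta_{k_i}, \kappa) \tag{14}$$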

where Cx(θ, Σ, κ) is the normalizing constant under selection for the distribution of x. Now, f(x, y | θ, Σ, κ) is a probability density function, and therefore it satisfies

$$\int_{X \times Y} f(x, y \mid \theta, \Sigma, \kappa) = 1,$$

so we have

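$$\begin{aligned}
1 &= \int_{X \times Y} C_{xy}^{-1}(\theta, \Sigma, \kappa)\, C_{x}(\theta, \Sigma, \kappa) \left[ \prod_{i=1}^{K} B_{y_i}(\theta_{k_i}, \kappa) \right] f(x \mid \theta, \Sigma, \kappa)\, f(y_i \mid \theta_{k_i}, \kappa) && (15) \\
&= C_{xy}^{-1}(\theta, \Sigma, \kappa)\, C_{x}(\theta, \Sigma, \kappa) \left[ \prod_{i=1}^{K} B_{y_i}(\theta_{k_i}, \kappa) \right] \int_{X} f(x \mid \theta, \Sigma, \kappa) \int_{Y_i} f(y_i \mid \theta_{k_i}, \kappa) && (16) \\
&= C_{xy}^{-1}(\theta, \Sigma, \kappa)\, C_{x}(\theta, \Sigma, \kappa) \left[ \prod_{i=1}^{K} B_{y_i}(\theta_{k_i}, \kappa) \right], && (17)
\end{aligned}$$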

where the first line follows from the right-hand side of equation 14, the second line follows by rearranging the integrals, and the third line follows from the fact that f(x | θ, Σ, κ) and f(yi | θki, κ) are probability density functions and thus each integrates to 1. Therefore, we have

$$f(x, y \mid \theta, \Sigma, \kappa) = f(x \mid \theta, \Sigma, \kappa)\, f(y_i \mid \theta_{k_i}, \kappa), \tag{18}$$

which shows the desired result that conditional on κ = (K, ki, i = 1, 2, …, K), the joint stationary distribution of the allele frequencies for the model with two levels of genetic variability, is equal to the product of the stationary distribution of the allele frequencies x, the level at which selection operates, and the stationary distribution of the allele frequencies y, the level at which the evolution is neutral.

References

  1. Beaumont MA, Zhang W, Balding DJ. Approximate Bayesian computation in population genetics. Genetics. 2002;162:2025–2035. doi: 10.1093/genetics/162.4.2025.
  2. Beatty J. The synthesis and the synthetic theory. In: Bechter W, editor. Integrating Scientific Disciplines. Martinus Nijhoff Publishers; Dordrecht: 1986. pp. 125–127. (Genetics).
  3. Black FL, Hedrick PW. Strong balancing selection at hla loci: Evidence from segregation in south amerindian families. Proc Natl Acad Sci USA. 1997;94(23):12452–12456. doi: 10.1073/pnas.94.23.12452.
  4. Buzbas EO, Joyce P. Maximum likelihood estimates under k-allele models with selection can be numerically unstable. Annals of Applied Statistics. 2009;3:1147–1162.
  5. Buzbas EO, Joyce P, Abdo Z. Estimation of selection intensity under overdominance by Bayesian methods. Statistical Applications in Genetics and Molecular Biology. 2009;8. Article 32. doi: 10.2202/1544-6115.1466.
  6. Buzbas EO, Joyce P, Rosenberg NA. Inference on balancing selection for epistatically interacting loci. Theoretical Population Biology. 2011;79:102–113. doi: 10.1016/j.tpb.2011.01.002.
  7. Donnelly P, Kurtz T. A countable representation of the Fleming-Viot measure-valued diffusion. Ann Prob. 1999;24:743–760.
  8. Donnelly P, Nordborg M, Joyce P. Likelihood and simulation methods for a class of nonneutral population genetics models. Genetics. 2001;159:853–867. doi: 10.1093/genetics/159.2.853.
  9. Ethier SN, Griffiths RC. The infinitely-many-sites model as a measure-valued diffusion. Ann Prob. 1987;15:515–545.
  10. Ethier SN, Kurtz TG. Fleming-Viot processes in population genetics. SIAM Journal of Control Optimization. 1993;31(2):345–386.
  11. Ethier SN, Nagylaki T. Diffusion approximations of the two-locus Wright-Fisher model. Journal of Mathematical Biology. 1989;27:17–28. doi: 10.1007/BF00276078.
  12. Ewens W. Mathematical Population Genetics: I Theoretical Introduction. Second edition. Springer; London, UK: 2004.
  13. Fearnhead P. The stationary distribution of allele frequencies when selection acts at unlinked loci. Theor Popul Biol. 2006;70:376–386. doi: 10.1016/j.tpb.2006.02.001.
  14. Fisher RA. On the dominance ratio. Proc Roy Soc Edinburgh. 1922;42:321–341.
  15. Joyce P, Genz A, Buzbas EO. Efficient simulation and likelihood methods for a class of non-neutral multi-allele models. Journal of Computational Biology. 2012;19:650–661. doi: 10.1089/cmb.2012.0033.
  16. Joyce P, Krone SM, Kurtz TG. When can one detect over-dominant selection in the infinite-alleles model? Ann App Prob. 2003;13(1):181–212.
  17. Kimura M. Stochastic Processes and Distribution of Gene Frequencies under Natural Selection. Cold Spring Harbor Symp Quant Biol. 1955;20:33–55. doi: 10.1101/sqb.1955.020.01.006.
  18. Kimura M. Rules for testing stability of a selective polymorphism. Proc Natl Acad Sci USA. 1956;42:336–340. doi: 10.1073/pnas.42.6.336.
  19. Kimura M, Crow JF. An Introduction to Population Genetics Theory. Harper and Row, Publishers; NY: 1970.
  20. Marsh SGE, Albert ED, Bodmer WF, Bontrop RE, Dupont B, Erlich HA, Fernández-Viña M, Geraghty DE, Holdsworth R, Hurley CK, Lau M, Lee KW, Mach B, Maiers M, Mayr WR, Muller CR, Parham P, Petersdorf EW, Sasazuki T, Strominger JL, Svejgaard A, Terasaki PI, Tiercy JM, Trowsdale J. Nomenclature for factors of the HLA system, 2010. Tissue Antigens. 2010;75(4):291–455. doi: 10.1111/j.1399-0039.2010.01466.x.
  21. Maruyama T, Nei M. Genetic variability maintained by mutation and over-dominant selection in finite populations. Genetics. 1981;98:441–459. doi: 10.1093/genetics/98.2.441.
  22. Rice SH. Evolutionary theory: mathematical and conceptual foundations. Sinauer; Sunderland, MA, US: 2004.
  23. Steinrücken M, Bhaskar A, Song YS. A novel spectral method for inferring general selection from time series genetic data. Annals of Applied Statistics. 2014;8:2203–2222. doi: 10.1214/14-aoas764.
  24. Stoffels RJ, Spencer HG. An asymmetric model of overdominance at major histocompatibility complex genes: Degenerate pathogen recognition and intersection advantage. Genetics. 2008;178:1473–1489. doi: 10.1534/genetics.107.082131.
  25. Tavaré S, Balding DJ, Griffiths RC, Donnelly P. Inferring coalescence times from DNA sequence data. Genetics. 1997;145(2):505–518. doi: 10.1093/genetics/145.2.505.
  26. Watterson GA. Heterosis or neutrality? Genetics. 1977;85:789–814. doi: 10.1093/genetics/85.4.789.
  27. Wright S. Evolution in Mendelian Populations. Genetics. 1931;16:97–159. doi: 10.1093/genetics/16.2.97.
  28. Wright S. Adaptation and Selection. In: Jepson GL, Simpson GG, Mayr E, editors. Genetics, Paleontology, and Evolution. Princeton Univ. Press; Princeton, NJ: 1949. p. 383.