Abstract
We analyze a decoupled Moran model with haploid population size , a biallelic locus under mutation and drift with scaled forward and backward mutation rates and , and directional selection with scaled strength . With small scaled mutation rates and , which is appropriate for single nucleotide polymorphism data in highly recombining regions, we derive a simple approximate equilibrium distribution for polymorphic alleles with a constant of proportionality. We also put forth an even simpler model, where all mutations originate from monomorphic states. Using this model we derive the sojourn times, conditional on the ancestral and fixed allele, and under equilibrium the distributions of fixed and polymorphic alleles and fixation rates. Furthermore, we also derive the distribution of small samples in the diffusion limit and provide convenient recurrence relations for calculating this distribution. This enables us to give formulas analogous to the Ewens–Watterson estimator of for biased mutation rates and selection. We apply this theory to a polymorphism dataset of fourfold degenerate sites in Drosophila melanogaster.
Keywords: Moran model, Mutation bias, Genic selection, Small mutation rates, Small samples, Drosophila melanogaster
1. Introduction
In the limit of relatively high recombination and small mutation rates, each polymorphic site can be considered independent from the rest of the genome. The distribution of allele frequencies at a large number of such loci has been called the “allele-frequency spectrum” or “site-frequency spectrum”. In a classic paper, Wright (1931) introduced a biallelic equilibrium model and derived the equilibrium allele-frequency distribution, up to a constant of proportionality. Most recent treatments of similar models, however, assume irreversible mutations (e.g., Sawyer and Hartl, 1992; Hartl et al., 1994; Bustamante et al., 2001; Griffiths, 2003; Ewens, 2004; Evans et al., 2007). If mutation rates are low and an outgroup is available to infer the ancestral state, i.e., if states can be polarized, theory assuming irreversibility allows inference of selection coefficients for polymorphic sites. The quality of polarization, and thus the quality of inference under this model, depends on the relative age of outgroups: if outgroups are too closely related, polymorphism shared among species is problematic; if outgroups are too distantly related, double mutations may obscure patterns. Thus, for real data analysis, a model allowing for back mutations may be better suited. Furthermore, if mutation parameters are to be estimated in addition to the selection coefficient, an approach using reversible mutations is necessary. Relatively recently, McVean and Charlesworth (1999) reconnect to earlier work to derive some statistics for the allele-frequency spectrum and provide such an approach. Zeng and Charlesworth (2009, 2010) use the Wright–Fisher model and forward simulations to infer parameters using sequence data with a reversible mutation model (Shapiro et al., 2007).
In population genetics theory, the Wright–Fisher model (Fisher, 1930; Wright, 1931) and later the Moran model (Moran, 1958) have received the most attention among the explicit models moving forwards in time. Many classic results were derived using diffusion theory (Fisher, 1930; Wright, 1931; Kimura, 1955a,b). A key parameter in population genetics is the population size . In the limit of large (usually a reasonable assumption), results from different models and approaches converge. Diffusion theory can be seen either as a model in its own right or as an approximation to the explicit models in the limit of large . Since we are mostly interested in this limit, the mathematically most tractable approach has been used, usually diffusion theory (Ewens, 2004). The models and approaches discussed so far move forward in time. Since the 1980s, the coalescent (Kingman, 1982), an approach that looks backward in time, has been used to derive insights into the distribution of small samples and into the genealogical tree behind allelic distributions.
Using a Moran model, Muirhead and Wakeley (2009) showed that exact equilibrium solutions (up to a constant of proportionality) can be obtained relatively easily for population genetic models with mutation, drift, and frequency-dependent selection, both for infinitely many and a finite number of -alleles. Some of their results go beyond those readily available by diffusion theory. Baake and Bialowons (2008) and Etheridge and Griffiths (2009) use a Moran model where mutation, selection, and drift are decoupled. With this model, Etheridge and Griffiths (2009) derive formulas for mutation, drift, and genic selection and show that they converge to the usual diffusion derived formulas in the limit of large . Furthermore, boundary conditions are rather difficult to incorporate into diffusion theory (e.g., Evans et al., 2007). This argues for multiple approaches to population genetics problems, challenging the nearly exclusive focus on diffusion theory in forward models.
Starting from a decoupled Moran model (Baake and Bialowons, 2008; Etheridge and Griffiths, 2009), we concentrate particularly on small scaled mutation rates ( and ) with directional selection. We derive theory analogous to a model without selection and apply it to a dataset of Drosophila melanogaster introns and fourfold degenerate sites (Shapiro et al., 2007).
2. Small without selection
In this section, we re-derive known results for the case without selection, i.e., the mutation-drift model. We show how results derived for the infinite sites model follow from the general case for small scaled mutation rates, i.e., with and small and of order .
Without selection, the mutation-drift equilibrium distribution of a locus with two alleles is known to be beta-binomially distributed in the diffusion limit and also in the decoupled Moran model (Baake and Bialowons, 2008; Etheridge and Griffiths, 2009), which we will introduce in more detail below. The probability of finding copies of allele one in a small sample of size , with , is:
| (1) |
where is the rising factorial or Pochhammer function: and . For small , we have . Therefore, formula (1) becomes for :
| (2) |
Here, serves as an approximate constant of proportionality. For and , we have and , respectively. For a sample of loci, the expectation of the sum of all polymorphic loci then is to first order in :
| (3) |
This recapitulates formula (17) in RoyChoudhury and Wakeley (2010). It can be rearranged to give a method of moments estimator of polymorphism in a sample that extends the Ewens–Watterson estimator of molecular variation (Ewens, 1974; Watterson, 1975) to biased mutation rates. If the mutation rates are balanced, i.e., , formula (3) reduces to . This estimator has been derived with the infinite-sites model that assumes negligible scaled mutation rates .
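As a minimal sketch of the balanced-mutation special case mentioned above, the following computes the classical Ewens–Watterson estimate of the per-site scaled mutation rate from the number of polymorphic loci. It does not reproduce formula (3); the function names and the example counts are hypothetical, and the estimate is in the classical scaling, which may differ from the scaling used in this manuscript by a constant factor.

```python
def harmonic_number(n: int) -> float:
    """Return H_n = 1 + 1/2 + ... + 1/n, with H_0 = 0."""
    return sum(1.0 / k for k in range(1, n + 1))


def watterson_theta_per_site(num_polymorphic: int, num_sites: int,
                             sample_size: int) -> float:
    """Classical Ewens-Watterson estimator per site.

    num_polymorphic: loci that are polymorphic in the sample
    num_sites:       all loci surveyed (polymorphic and monomorphic)
    sample_size:     number of sampled haplotypes per locus
    Assumption: neutral sites, balanced mutation, infinite-sites regime.
    """
    if sample_size < 2:
        raise ValueError("at least two sampled haplotypes are required")
    if num_sites <= 0:
        raise ValueError("num_sites must be positive")
    return num_polymorphic / (num_sites * harmonic_number(sample_size - 1))


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(watterson_theta_per_site(num_polymorphic=30, num_sites=2000,
                                   sample_size=12))
```

The estimator divides the observed proportion of polymorphic loci by a harmonic number, which is the expected proportion per unit of scaled mutation rate in the small-mutation-rate regime.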
Obviously, the quality of the approximation depends on the amount of polymorphism: according to our simulations, should be below 0.05, or better 0.02 (compare also: Desai and Plotkin, 2008). In Fig. 1, we plot the exact versus the approximate probability of polymorphism in a sample of .
Fig. 1.
Comparison of the exact versus the approximate probability of polymorphism in a sample of size (solid line). The dashed line shows equality.
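A comparison of this kind can be sketched numerically as follows. The code evaluates the beta-binomial sample distribution through rising factorials and compares the exact probability of a polymorphic sample with its first-order approximation for small parameters. The parameter names `a` and `b` are assumptions: they stand for the two beta-binomial parameters, which we take to correspond to the scaled mutation rates towards the two alleles.

```python
from math import comb, exp, lgamma


def log_rising_factorial(x: float, n: int) -> float:
    """Log of the rising factorial (x)_n = x (x+1) ... (x+n-1), for x > 0."""
    return lgamma(x + n) - lgamma(x)


def beta_binomial_pmf(y: int, m: int, a: float, b: float) -> float:
    """P(y copies of allele one in a sample of size m) under a beta-binomial."""
    log_p = (log_rising_factorial(a, y) + log_rising_factorial(b, m - y)
             - log_rising_factorial(a + b, m))
    return comb(m, y) * exp(log_p)


def prob_polymorphic_exact(m: int, a: float, b: float) -> float:
    """One minus the probabilities of the two monomorphic samples."""
    return 1.0 - beta_binomial_pmf(0, m, a, b) - beta_binomial_pmf(m, m, a, b)


def prob_polymorphic_approx(m: int, a: float, b: float) -> float:
    """First-order approximation for small a and b.

    Each polymorphic class y has probability about
    a*b/(a+b) * m / (y*(m-y)); summing over y = 1..m-1 gives
    2*a*b/(a+b) * H_{m-1}.
    """
    harmonic = sum(1.0 / k for k in range(1, m))
    return 2.0 * a * b / (a + b) * harmonic


if __name__ == "__main__":
    for scale in (0.1, 0.02, 0.005):
        print(scale,
              prob_polymorphic_exact(10, scale, scale),
              prob_polymorphic_approx(10, scale, scale))
```

The discrepancy between the two columns grows with the parameters, which is in line with the recommendation above to keep the scaled mutation rates small.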
We note that in the case without selection, the same formulas also hold for , i.e., for the equilibrium distribution of the whole population with haploid individuals. With selection, the case of small and has not been explored extensively. It is not yet known whether formulas similar to (1)–(3) can also be derived.
3. The decoupled Moran model with mutation, selection, and drift
In this section, we re-derive the equilibrium distribution of the decoupled Moran model, up to a constant, by showing that this distribution satisfies detailed balance. Baake and Bialowons (2008) and Etheridge and Griffiths (2009) use the same modified Moran model for their derivations. For the case of small mutation rates , we will derive a simple constant of proportionality and the allele-frequency spectrum, sojourn times, and divergence rates conditional on the ancestral and fixed allele.
3.1. Basic model
With the Moran model, generations overlap. It moves from step to step ; between steps, exponentially distributed waiting times may be introduced. In the pure-drift case, a constant population of haploid individuals is assumed. In a birth/death event, a random individual dies and is replaced by the offspring of a randomly chosen individual . The process repeats indefinitely. The lifespan of an individual is geometrically distributed with a mean time of . It may thus be useful to re-scale time in units of , i.e., to set , to reflect the average lifespan of an individual, or to set , to reflect the usual diffusion theory scaling.
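The pure-drift birth/death event described above can be sketched as a short simulation. The function names are hypothetical. The dying individual and the parent are drawn independently and uniformly from the population, so the parent may coincide with the dying individual; this is one common convention and an assumption here.

```python
import random


def moran_drift_step(i: int, n: int, rng: random.Random) -> int:
    """One birth/death event in a population of n haploids, i of type one."""
    dies_is_type_one = rng.random() < i / n     # individual that dies
    parent_is_type_one = rng.random() < i / n   # individual that reproduces
    return i - int(dies_is_type_one) + int(parent_is_type_one)


def simulate_until_absorption(i0: int, n: int, rng: random.Random,
                              max_steps: int = 10**7):
    """Iterate events until the allele is lost (0) or fixed (n)."""
    i, steps = i0, 0
    while 0 < i < n and steps < max_steps:
        i = moran_drift_step(i, n, rng)
        steps += 1
    return i, steps


def fixation_fraction(i0: int, n: int, replicates: int, seed: int = 1) -> float:
    """Fraction of replicate runs that end with type one fixed."""
    rng = random.Random(seed)
    fixed = sum(simulate_until_absorption(i0, n, rng)[0] == n
                for _ in range(replicates))
    return fixed / replicates


if __name__ == "__main__":
    # Under pure drift, the fixation probability equals the starting frequency.
    print(fixation_fraction(i0=5, n=10, replicates=2000))
```

Without mutation the boundary states are absorbing, so every run ends in loss or fixation; mutation, introduced next, is what makes an equilibrium distribution possible.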
With the original Moran model, mutation and selection are tied to a birth/death event, such that the replacing gamete is assumed to mutate at a rate . Recently, Baake and Bialowons (2008) and Etheridge and Griffiths (2009) used a mathematically more convenient decoupled version, where all events are independent from each other. In this decoupled model, assume a mutation rate of , independent of the allele’s original state, towards the first allele and set (and and , respectively). Assume that allele “1” is favored over allele “0” by selection with an advantage . The favored allele reproduces at a rate , the disfavored only at unit rate.
With a biallelic locus, a transition from one step to the next involves just three possibilities for any interior state: the number of copies of the favored allele increases by one, decreases by one, or remains the same. In Eq. (4) in Box I, we list the probabilities of events depending on the three forces: drift, mutation, and selection.
Box I.
| (4) |
The transition probabilities between neighboring states are thus:
| (5) |
and
| (6) |
Under detailed balance, we have:
| (7) |
We obtain the stationary distribution of the Moran model, up to a constant of proportionality, by substituting the following function:
| (8) |
The model can easily be extended from a biallelic to a -allelic locus with parent independent mutation (Etheridge and Griffiths, 2009).
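The detailed-balance argument can be illustrated numerically for any chain that only moves between neighboring states. In the sketch below, the transition probabilities are an assumed parametrization of a decoupled model (drift, parent-independent mutation with per-event probabilities `mu1` and `mu0`, and a reproductive advantage `s` of the favored allele); they are not copied from Eq. (4), and all names are hypothetical.

```python
def assumed_decoupled_moran_rates(n: int, mu1: float, mu0: float, s: float):
    """ASSUMED up/down probabilities for states i = 0..n (illustration only)."""
    up = [(n - i) / n * ((1.0 + s) * i / n + mu1) for i in range(n + 1)]
    down = [i / n * ((n - i) / n + mu0) for i in range(n + 1)]
    return up, down


def stationary_by_detailed_balance(up, down):
    """Solve pi[i] * up[i] = pi[i+1] * down[i+1] and normalize."""
    weights = [1.0]
    for i in range(len(up) - 1):
        weights.append(weights[-1] * up[i] / down[i + 1])
    total = sum(weights)
    return [w / total for w in weights]


def global_balance_residual(pi, up, down) -> float:
    """Largest violation of pi = pi P for the tridiagonal chain."""
    n = len(pi) - 1
    worst = 0.0
    for j in range(n + 1):
        inflow = pi[j] * (1.0 - up[j] - down[j])   # stay in state j
        if j > 0:
            inflow += pi[j - 1] * up[j - 1]        # arrive from below
        if j < n:
            inflow += pi[j + 1] * down[j + 1]      # arrive from above
        worst = max(worst, abs(inflow - pi[j]))
    return worst


if __name__ == "__main__":
    up, down = assumed_decoupled_moran_rates(n=20, mu1=0.01, mu0=0.02, s=0.05)
    pi = stationary_by_detailed_balance(up, down)
    print(pi[0], pi[-1], global_balance_residual(pi, up, down))
```

Under these assumed rates and with `s = 0`, the successive ratios of the solution are those of a beta-binomial with parameters `n * mu1` and `n * mu0`, consistent with the mutation-drift case of Section 2.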
Limit of large . Set , and let and while , such that and, by a Taylor series expansion, . We then obtain, using Stirling’s approximation:
| (9) |
This result is identical to that from diffusion theory (Wright, 1931, 1949; Kimura, 1955b; Etheridge and Griffiths, 2009). The constant of proportionality can be obtained with the confluent hypergeometric or Kummer’s function (Moran, 1962).
3.2. Allele frequencies with selection; small scaled mutation rates
For small scaled mutation rates, i.e., and small and of order , the probability of polymorphic states, i.e., in Eq. (8) converges to:
| (10) |
Similarly, we have for the monomorphic states:
| (11) |
As can be seen from this equation, with small mutation rates only the boundary states and have probabilities of order one and in proportion
| (12) |
The ratio of the fixed, i.e., non-polymorphic allelic frequencies has been derived before in the diffusion limit, e.g., by Bulmer (1991) using arguments similar to ours.
The constant of proportionality is the inverse of the sum of all the states:
| (13) |
it can thus be approximated by (or for large and small ), for any finite and , as long as the mutation rates and and thus and are small enough. We note that there are actually two approximations involved, the first leading to formula (11) and the second to the constant, both depending on being small, such that only first order terms in may be retained. We note that, in the case without selection, the second approximation is not necessary, because there the constant of proportionality of the beta-binomial is available. Obviously, substituting into formula (13) provides the same result as formula (2) above.
For relatively large and small , the constant of proportionality can furthermore be approximated by using a Taylor series approximation. In the formulas below, it will become convenient to define the following constant: , or the form for relatively large and small : .
In Fig. 2, we show that for the approximation fits only moderately well, while it fits very well for scaled mutation rates that are an order of magnitude less (). This is similar to the results without selection in the previous section. In Fig. 2, the stippled line corresponds to only the first approximation, i.e., the constant was obtained by summing over formula (11) and the solid line to the first and second approximation, i.e., the constant was approximated by summing only the monomorphic states.
Fig. 2.
The exact probability (bars) and approximate probabilities (solid and stippled lines) for , (A) and (C) and , (B) and (D), respectively, for , (A) and (B), and , (C) and (D), respectively. For the stippled line, the constant was obtained by summing over formula (11); for the solid line the constant was approximated from only the monomorphic states (see the text for details).
3.2.1. A simplified equilibrium process with small
For small and , the proportion of mutations occurring in monomorphic states is of order one and of those in polymorphic states is only of order . By ignoring the latter, we can construct a simplified process, where mutations only occur from the monomorphic states. With this equilibrium process, there is a flow of new mutations from the non-polymorphic states at a rate of from to and from to , respectively.
The transition probabilities are, from any interior state to state : , and from to : . These transition probabilities are the same as those for the Moran model in Eq. (8), when flow that in equilibrium occurs with probability of order is ignored. Observing the amount of flow from 0 and , we can determine the constant of proportionality to obtain the following formula for :
| (14) |
This is the same equilibrium distribution as derived in the previous section up to order , when both the first and second approximation are used for the constant of proportionality (formulas (11) and (13)) and corresponds to the stippled line in Fig. 2. It can be seen that, with small , the allele-frequency spectrum is only influenced by but not by the scaled mutation rates and .
For large and small , using the Taylor series approximation, and setting , we obtain:
| (15) |
As far as we know, Eq. (15) has only been derived up to a constant of proportionality before, although it could be derived easily from the ratio of the fixed sites and the polarized flows below, which are both well known using the diffusion approach.
3.3. Polarizing the flow
Another motivation for constructing this simplified process is that, if an outgroup is available, it is possible to “polarize” the sample into ancestral alleles, i.e., alleles that are identified as already present in the ancestral sample by their existence in the outgroup, and derived alleles. The number of derived alleles in a polymorphic sample is usually called the “derived allele-frequency spectrum”. The information on the ancestral state only makes sense if the mutation rates and are small.
In equilibrium, the flow between the monomorphic states must balance. We can think of as quasi-stationary distribution for alleles originating from of the reduced process, in which and are made boundary conditions, and similarly for originating from . For , the boundary conditions are that the flow away from state 0 must be equal to that back into state 0 plus that into state on the other side:
| (16) |
The net flow in the interior must also balance:
| (17) |
The following corresponds to the equilibrium values (as can be seen by substituting into the formulas (16) and (17)):
| (18) |
or, if we again assume large and small such that , and set :
| (19) |
For the reverse direction, away from state , we have analogously:
| (20) |
or the continuous version:
| (21) |
The discrete versions of these equations are solutions to Eq. (2.143) in Ewens (2004) and seem new. The continuous version formulas are well-known and have been derived in the context of the infinite-sites model before, e.g., formula (9.23) in Ewens (2004) and formula (31) in Evans et al. (2007), although there the constant has a different interpretation as a model with irreversible mutations is considered. From these equations, we can get the conditional probabilities of mutations entering the process. The probability of origin and fixation at the unfavored state is:
| (22) |
those of entering and exiting at opposite states are:
| (23) |
and that of entering and exiting at the favored state is:
| (24) |
We note that for small scaled selection coefficients , already the first order terms differ from the model without selection for two of these probabilities: and , while the others only change with the second order terms: and . Summing all directions, we obtain for :
| (25) |
This is identical to Eq. (14).
Conditional flow. For determining the equilibrium distributions conditional on both origin and fixation, we need for each state the probability of fixation at the favored and unfavored states, respectively. By similar considerations to those above, we can determine that if the flow starts at (instead of or if the flow starts by mutation away from or , respectively), while the end states are still and , and we enter there at a rate per Moran time unit, the equilibrium probabilities of finding the population in state , with are:
| (26) |
Observing the flow out towards states 0 and , it follows that the probability of fixation of the favored allele, if the process started at frequency is:
| (27) |
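A fixation probability of this type can be sketched with the classical gambler's-ruin argument. The sketch assumes that, in the interior of the simplified process, every event that changes the allele count moves it up with odds (1 + s) : 1, where `s` is a hypothetical name for the reproductive advantage of the favored allele. Whether this matches Eq. (27) term by term depends on the parametrization used there.

```python
def fixation_probability(i: int, n: int, s: float) -> float:
    """Probability that the favored allele fixes when starting from i copies.

    Assumption: up-moves and down-moves occur with odds (1 + s) : 1 in every
    interior state, and the states 0 and n are absorbing.
    """
    if not 0 <= i <= n:
        raise ValueError("i must lie between 0 and n")
    if s == 0.0:
        return i / n                      # neutral case: starting frequency
    r = 1.0 / (1.0 + s)                   # odds of a down-move vs an up-move
    return (1.0 - r ** i) / (1.0 - r ** n)


if __name__ == "__main__":
    n, s = 100, 0.02
    for i in (1, 10, 50):
        print(i, fixation_probability(i, n, s), i / n)
```

The result solves the first-step equation (1 + s) h(i+1) + h(i-1) = (2 + s) h(i) with h(0) = 0 and h(n) = 1, and it tends to the neutral value i/n as `s` approaches zero.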
Multiplying with the probabilities conditional on the starting values, and , respectively, results in the probabilities conditional on both starting and fixation states, e.g., for :
| (28) |
For , we have:
| (29) |
And finally for , we have:
| (30) |
Again, these discrete equations seem new. The continuous versions could be determined easily from the average times spent in the different states multiplied by the rate of mutations. This seems to have been done only in the context of irreversible mutation models, e.g., Eq. (9.23) in Ewens (2004) and Eq. (31) in Evans et al. (2007), where the interpretation of the constant is different. In any case, earlier derivations of these equations are quite different from ours.
Alternatively, the conditional transition probabilities can be determined (Ewens, 2004, Chapter 2.12), e.g.,
| (31) |
where we used the notation in Ewens’ book. Other conditional transition probabilities follow analogously. It can then be shown that the conditional probabilities in Eqs. (28)–(30) are the equilibrium solutions to this equation given the boundary conditions and the conditional transition probabilities.
3.4. Average sojourn times
From Eq. (28), the conditional times in each state can be determined by dividing by the corresponding rate and summing, e.g., for the flow from unpreferred to preferred:
| (32) |
The last line follows for large , i.e., in the diffusion limit, and corresponds to earlier results (Kimura and Ohta, 1969), if time is not measured in Moran time-steps but in the usual diffusion scaling of generations. The conditional expected time to fixation from preferred to unpreferred is identical to the above equations. The two remaining times are:
| (33) |
and
| (34) |
In the diffusion limit, i.e., in the last line of the above equations, the conditional mean times do not change to first order in .
The average sojourn time can be obtained directly or by summing all times weighted by their proportions:
| (35) |
To first order in , this is not different from the case without selection. But we can use the convexity of the exponential function to show that this is faster than the average time without selection by noting that, in the interval and for , such that we have, for the continuous case:
| (36) |
Here we used the equality sign in the first line, because the result is exact for the diffusion approach. The last line corresponds to the result without selection. This result seems to be new.
3.5. The rate of divergence
In empirical population genetics, fixed differences between divergent populations or species are often used for inference. The rate of accumulation of such fixed differences, also called the rate of fixation of derived variants or divergence, per unit time in equilibrium is equal to the flow from the unfavored to the favored state and vice versa, i.e., the probability of being in state conditional on starting from state 0 () times the transition probability of fixation of the favored allele in the next Moran event conditional on being in state , i.e., and the same in the other direction:
| (37) |
If time is scaled per generation or per generations, this rate needs to be multiplied by or , respectively. Kimura (1962, 1969) derived the diffusion limit version of this equation.
4. Small sample properties
In this section, we derive the small sample properties, using the decoupled Moran model and the diffusion approximation. Usually in analyses of single nucleotide polymorphism data, we have a situation where a large number of loci are assumed to evolve independently according to the same model. A finite sample of size haplotypes is available from the population. In the first subsection, we will briefly describe the general results while, for the rest, we will consider the case of small scaled mutation rates.
4.1. A sample from the stationary distribution, not small
In population genetics, a population size of usually approximates results for reasonably well. Formula (8) with a population size of about 1000 may thus be taken as an approximation to the continuous distribution in formula (9). After sampling a number of alleles from the population from this distribution, a small sample of size may be obtained by sampling without replacement using the hypergeometric distribution, conditional on and with :
| (38) |
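The subsampling step can be sketched as follows: a distribution over the number of copies in the population is mixed with hypergeometric sampling probabilities to give the distribution in a sample drawn without replacement. All names are hypothetical; the beta-binomial helper is included only to illustrate the case without selection, with parameters `a` and `b` assumed to stand for the scaled mutation rates.

```python
from math import comb, exp, lgamma


def hypergeometric_pmf(y: int, m: int, i: int, n: int) -> float:
    """P(y copies in a sample of m | i copies among n), without replacement."""
    return comb(i, y) * comb(n - i, m - y) / comb(n, m)


def subsample_distribution(population_dist, m: int):
    """Mix the hypergeometric over a population distribution on 0..n."""
    n = len(population_dist) - 1
    return [sum(population_dist[i] * hypergeometric_pmf(y, m, i, n)
                for i in range(n + 1))
            for y in range(m + 1)]


def beta_binomial_dist(n: int, a: float, b: float):
    """Beta-binomial probabilities for 0..n copies (case without selection)."""
    def log_rising(x, k):
        return lgamma(x + k) - lgamma(x)
    return [comb(n, i) * exp(log_rising(a, i) + log_rising(b, n - i)
                             - log_rising(a + b, n))
            for i in range(n + 1)]


if __name__ == "__main__":
    population = beta_binomial_dist(200, 0.3, 0.7)
    print(subsample_distribution(population, 5))
    print(beta_binomial_dist(5, 0.3, 0.7))
```

In the case without selection the two printed distributions agree, because the beta-binomial family is closed under hypergeometric subsampling; with selection this shortcut is not available and the mixture has to be evaluated.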
In the case without selection, setting the sample size to actually gives results identical to those of this subsampling scheme. This is not true with selection, as there is obviously no selection if and selection is generally inefficient for extremely small . For , however, results from subsampling and results using are similar even for relatively large . As can be seen in Fig. 3, an even better approximation that fits well down to very small sample sizes is:
| (39) |
This formula was guessed at and then evaluated by simulations.
Fig. 3.
Likelihood of obtaining individuals of the favored type in a sample without replacement of size from a population of size (thick line), (thin line) and the approximation in formula (39) (stippled line) with . Note that the sample from is slightly less affected by selection and thus more symmetric.
The exact marginal distribution of given and can also be obtained as in the case without selection, but the resulting equation does not simplify as easily:
| (40) |
The conditional distribution of in a population of size given the results from a small sample of size is for :
| (41) |
This is distribution (8) with different parameters and .
4.2. A sample from the stationary distribution, small scaled mutation rates
With small scaled mutation rates, only a small proportion of sites is actually polymorphic; the rest are monomorphic. This situation is also considered in the Poisson-random-field (PRF) approach (Sawyer and Hartl, 1992; Hartl et al., 1994; Bustamante et al., 2001).
We note that the distributions of alleles of the favored type in an ordered sample of size are easier to derive than those of the practically more useful unordered samples. Furthermore, we will need the results for the ordered samples below for calculating the probability of polymorphic samples. Multiplication of the probabilities for ordered samples with gives those for the unordered sample. We will indicate the probabilities of ordered samples by an asterisk.
With small scaled mutation rates and for large population sizes , the probability of a monomorphic sample is of order one and and , respectively. For small mutation rates and in the diffusion limit, the probability of polymorphic samples is (for ):
| (42) |
A solution to this integral is Kummer’s function (Abramowitz and Stegun, 1970): :
| (43) |
Kummer’s function is a solution to the confluent hypergeometric equation and also denoted with (Abramowitz and Stegun, 1970). For small and moderate , this series converges relatively quickly. Below we will, however, provide recurrence relations that allow for quick calculation of all terms up to the sample size , since this is usually required when analyzing empirical frequency spectra.
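As a minimal sketch, Kummer's function can be evaluated by its power series and checked against the integral it solves, using the standard identity that the integral of x^(a-1) (1-x)^(b-1) e^(z x) over the unit interval equals B(a, b) M(a, a+b, z). The names `a`, `b`, and `z` are generic placeholders and are not tied to the parametrization of formula (42).

```python
from math import exp, gamma


def kummer_m(a: float, b: float, z: float,
             tol: float = 1e-15, max_terms: int = 10_000) -> float:
    """Series M(a, b, z) = sum_k (a)_k z^k / ((b)_k k!)."""
    term, total = 1.0, 1.0
    for k in range(max_terms):
        term *= (a + k) * z / ((b + k) * (k + 1))   # ratio of successive terms
        total += term
        if abs(term) <= tol * abs(total):
            break
    return total


def beta_function(a: float, b: float) -> float:
    return gamma(a) * gamma(b) / gamma(a + b)


def integral_midpoint(a: float, b: float, z: float, steps: int = 20_000) -> float:
    """Midpoint rule for the integral above; intended for a, b >= 1."""
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * h
        total += x ** (a - 1) * (1.0 - x) ** (b - 1) * exp(z * x)
    return h * total


if __name__ == "__main__":
    a, b, z = 2.0, 3.0, 2.5
    print(beta_function(a, b) * kummer_m(a, a + b, z))
    print(integral_midpoint(a, b, z))
```

For large negative arguments the series alternates and loses precision through cancellation; Kummer's transformation M(a, b, z) = e^z M(b-a, b, -z) can then be used to evaluate it with a positive argument instead.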
4.3. Recurrence relations
From applying the rules of integration by parts, we get:
| (44) |
We can use formula (44) to work out the following four cases by using Eq. (42). For the remainder of the section, we will always take the limit and assume small but, for the sake of brevity, leave away the limit notation and the symbol for the order in .
4.3.1. Case:
| (45) |
4.3.2. Case:
| (46) |
4.3.3. Case:
| (47) |
4.3.4. Case:
| (48) |
We note furthermore:
| (49) |
such that:
| (50) |
The recurrence relationships (44) and (50) are well known and correspond to formulas (13.4.4) and (13.4.3) in Abramowitz and Stegun (1970), respectively.
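The two contiguous relations for Kummer's function can be checked numerically with a short sketch. The forms below are the standard relations as we recall them; their identification with the handbook numbers (13.4.4) and (13.4.3) should be verified against the handbook, and the mapping of `a`, `b`, and `z` to the quantities in formulas (44) and (50) is not attempted here.

```python
def kummer_m(a: float, b: float, z: float,
             tol: float = 1e-15, max_terms: int = 10_000) -> float:
    """Series M(a, b, z) = sum_k (a)_k z^k / ((b)_k k!)."""
    term, total = 1.0, 1.0
    for k in range(max_terms):
        term *= (a + k) * z / ((b + k) * (k + 1))
        total += term
        if abs(term) <= tol * abs(total):
            break
    return total


def residual_lower_a_raise_b(a: float, b: float, z: float) -> float:
    """b M(a,b,z) - b M(a-1,b,z) - z M(a,b+1,z); should vanish."""
    return (b * kummer_m(a, b, z) - b * kummer_m(a - 1.0, b, z)
            - z * kummer_m(a, b + 1.0, z))


def residual_raise_a_lower_b(a: float, b: float, z: float) -> float:
    """(1+a-b) M(a,b,z) - a M(a+1,b,z) + (b-1) M(a,b-1,z); should vanish."""
    return ((1.0 + a - b) * kummer_m(a, b, z) - a * kummer_m(a + 1.0, b, z)
            + (b - 1.0) * kummer_m(a, b - 1.0, z))


if __name__ == "__main__":
    for a, b, z in [(1.5, 4.0, 2.0), (3.0, 7.0, -1.0), (0.5, 2.5, 5.0)]:
        print(residual_lower_a_raise_b(a, b, z),
              residual_raise_a_lower_b(a, b, z))
```

Both residuals are at the level of rounding error for moderate arguments. Running such recursions over many steps is a different matter, since rounding errors can be amplified from step to step.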
For the unordered case, formula (45) becomes: . The formulas corresponding to (46)–(50) can be obtained easily. For later use, we will provide the formula corresponding to (50):
| (51) |
4.4. Small
For small , the recursion relationships (46)–(48) are useless, as very small and very large quantities delicately cancel out. Even with a relatively large and a relatively small numerical instabilities are too large to tolerate with double precision calculations. We therefore provide an alternative way of calculating approximations to that do not suffer from these deficiencies (and analogously also to the unordered probabilities ).
4.4.1. Case:
We note that, in the limit of large converges to . We can substitute that into the recursion (47) and run it backwards. The recursion then becomes:
| (52) |
and we get for :
| (53) |
Carrying on, we get:
| (54) |
where is again the ascending factorial or Pochhammer function. The last line is the limit , i.e., the diffusion limit, and equivalent to , where is Kummer’s function. For and , we get:
| (55) |
which is identical with formula (45).
Formula (54) is much more useful in the limit of small and results are identical to the case without selection for . We suggest using this equation to calculate for the locus with the largest sample size, and then using formulas (50) or (51), respectively, to calculate all other values for sample sizes down to .
4.4.2. Case:
For later use, we will also derive the equivalent formulas for . By the same reasoning, we obtain the recursion:
| (56) |
We thus have:
| (57) |
The last line is the diffusion limit and equivalent to Kummer’s function times . For , we get:
| (58) |
This last formula is identical to (55), as it should be.
4.4.3. Sum of all polymorphic states
We now give the results for the sum over the polymorphic states (i.e., excluding the monomorphic states) for , again using recursion. We note that for , the probability of a polymorphic sample is:
| (59) |
We note from formula (51) that:
| (60) |
Therefore, we have:
| (61) |
This means that the probability of a polymorphic sample of size is the sum of the probabilities of ordered samples of the edges, and , from the sample size to .
We note that for small :
| (62) |
If we compare this result with the equivalent result without selection, we see that, if forward and backward mutation rates are equal and small and selection is also small, selection does not affect polymorphism on average. This is no longer true for unequal mutation rates (Eq. (3)), as has been derived earlier by RoyChoudhury and Wakeley (2010). If mutation and selection act in opposing directions, selection may actually increase polymorphism (Lawrie et al., 2011). If in a sample of loci, we observe polymorphic ones, we can solve Eq. (62) for the constant to obtain:
| (63) |
If mutation rates are equal and , this formula corresponds to the Ewens–Watterson estimator of per site scaled mutation rate (Ewens, 1974; Watterson, 1975). We believe that formulas (61)–(63) are new.
4.5. Estimating parameters from a sample of loci
Assume given data from a sample of loci with alleles of the first type in a sample of size for each locus . Each locus is assumed to have evolved independently to mutation-selection-drift equilibrium according to the process we described in this manuscript. We want to infer three parameters from this dataset. The parametrization we have considered so far is: the scaled mutation parameters and and the scaled selection parameter . With small , however, inference becomes more convenient through reparametrization. This has also been done with a slightly different model by Bustamante et al. (2001). We can estimate parameters from a sample of loci by separating three classes: the fixed alleles of the unpreferred type, the polymorphic alleles, and the fixed alleles of the preferred type. With small mutation rates, the number of polymorphic samples will be small, such that they correspond to the assumptions of a Poisson random field (PRF) (Sawyer and Hartl, 1992).
Without considering any other parameters concurrently, we can obtain a maximum likelihood estimate of the scaled selection rate from the polymorphic loci by a direct search or by more sophisticated numerical methods using the recurrence formulas or the closed form solutions in the previous subsection. From the proportion of polymorphic loci in the sample, we define as an estimate of via formula (63) conditional on an estimate of . This procedure corresponds to the Ewens–Watterson estimator of in the case of symmetric mutation rates and without selection. Let us parametrize the remaining parameter as that estimates , from the proportion of disfavored alleles among the fixed sites in the sample, again conditional on . We note that inference of and is independent conditional on . One can return to the original parameters by observing that and .
If we had a sample from another part of the genome that we believe to not be selected, more sophisticated inference methods would be possible.
We note that Zeng and Charlesworth (2009) use a Wright–Fisher model that they iterate to convergence to estimate the same parameters. Their way of inference is more cumbersome than ours, but should lead to nearly the same results in the limit of large . Furthermore, these authors extend the model to changes in effective population size. Our reduced model allows for efficient forward simulations and could thus also be used for similar simulation-based methods. Zeng (2010) extends this biallelic model to a codon model. Extension of our analytical results of the biallelic model to higher dimensions is only possible for parent-independent mutations. But if mutations are rare enough, the small scaled mutation rate approximation will be useful nevertheless.
Analysis of a Drosophila melanogaster dataset.
Fourfold degenerate sites in Drosophila have been shown to be under selection (e.g., Parsch et al., 2010; Zeng and Charlesworth, 2010). In most Drosophila species, including D. melanogaster, alleles with the nucleotides G and C seem to be preferred over those with A and T (Shields et al., 1988; Akashi, 1994; Carlini and Stephan, 2003), although a high fixation rate for AT in D. melanogaster suggests that this codon usage bias has been relaxed in this species (Akashi, 1996; Begun et al., 2007; Singh et al., 2009). Previous models for the analysis of fourfold degenerate site data in D. melanogaster have included directional selection, mutation, and drift in equilibrium and have also modeled a change in effective population size (e.g., Zeng and Charlesworth, 2010). A force towards GC creates asymmetry in the folded allele-frequency spectrum, with more AT low-frequency variants than expected under neutrality. The polarized allele-frequency spectra show, for alleles that mutate from GC to AT, an increase in AT low-frequency and a decrease in AT high-frequency alleles, and the reverse in the other direction. Scaled mutation rates are below 0.02 in Drosophila, and thus our model may be used. However, in D. melanogaster, fourfold degenerate sites are not in equilibrium, as indicated by the excess of fixed derived AT over GC alleles. Hence, application of our equilibrium theory to fourfold degenerate sites is problematic. On the other hand, short introns seem to be close to equilibrium, but a weak directional force towards GC can be detected there as well (see Table 1).
Table 1.
GC site-frequencies in a sample of 15 sequences from D. melanogaster; columns 1–14 are the site-frequencies (absolute numbers) for polymorphic sites, columns 0 and 15 are the numbers of fixed AT and fixed GC sites, respectively. Mutations were polarized with respect to the conservative outgroup of D. simulans, D. sechellia, D. mauritiana, D. erecta and D. yakuba. For the unpolarized spectrum, the state of the outgroups was ignored.
| GC-frequency | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4-fold deg. sites | | | | | | | | | | | | | | | | |
| Unpolarized | 7464 | 29 | 20 | 14 | 9 | 6 | 10 | 12 | 14 | 20 | 14 | 27 | 49 | 53 | 157 | 15 518 |
| AT → GC | 4810 | 15 | 6 | 3 | 2 | 1 | 1 | 0 | 2 | 1 | 1 | 1 | 1 | 4 | 3 | 106 |
| GC → AT | 464 | 4 | 6 | 7 | 6 | 3 | 7 | 10 | 8 | 11 | 8 | 19 | 36 | 40 | 127 | 12 608 |
| Short introns | | | | | | | | | | | | | | | | |
| Unpolarized | 3147 | 18 | 3 | 4 | 4 | 3 | 5 | 6 | 3 | 8 | 3 | 2 | 3 | 9 | 27 | 1602 |
| AT → GC | 2263 | | | | | | | | | | | | | | | 44 |
| GC → AT | 53 | | | | | | | | | | | | | | | 960 |
4.6. Materials and methods
We applied our theory to a large polymorphism dataset of D. melanogaster (Shapiro et al., 2007). The data consist of 419 autosomal loci; we only consider a sample of 15 African inbred isogenic lines. We extended the existing outgroups D. simulans, D. sechellia and D. mauritiana by sequence data from D. erecta and D. yakuba, which were downloaded from www.flybase.org and aligned with ClustalW version 2.0.12. From the data, we extracted fourfold degenerate sites and short introns (positions 8–30 of introns less than 66 base pairs long, which have been shown to be the least selectively constrained; Halligan and Keightley, 2006; Parsch et al., 2010). We furthermore required that all states be known in all lines and species considered for the analyses. For polarization, we used only sites that were monomorphic in all outgroups. This may introduce a bias, but guards against polarization errors. We present the results for the unpolarized, i.e., folded allele-frequency spectra for both the fourfold degenerate and intron sites. Because only relatively few sites are available for introns, we present the analysis of the unfolded, i.e., polarized data only for the fourfold degenerate sites.
For both the unpolarized and polarized site-frequency spectra we estimated the scaled selection coefficient via Eq. (14), and and . We performed a likelihood ratio test to determine if the maximum likelihood estimate of is significantly different from the null hypothesis of . Applying Eq. (63) and substituting an estimate for , we could then estimate absolute values for and (see Fig. 4).
Fig. 4.
Observed (bars) and inferred (lines) allele frequency spectra showing the GC-frequencies. A: unpolarized GC-frequencies at fourfold degenerate sites (). B: unpolarized GC-frequencies in introns (). C: polarized (AT to GC) spectrum of GC-frequencies at fourfold degenerate sites (). D: polarized (GC to AT) spectrum of GC-frequencies at fourfold degenerate sites ().
4.7. Results
Our model assumes selection–mutation-drift equilibrium. One of the equilibrium predictions is a balance of AT to GC fixed derived sites. At fourfold degenerate sites, the ratio for fixed derived sites is (AT to GC)/(GC to AT) = 106/464. The excess of GC to AT substitutions is highly significant, indicating non-equilibrium (). In introns, on the other hand, the ratio for fixed derived sites is (AT to GC)/(GC to AT) = 44/53. Thus, the numbers of substitutions are almost balancing and close to the expectation of an equilibrium ().
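The balance of fixed derived sites can be checked with a simple goodness-of-fit calculation. The sketch below is an illustration only: it assumes a chi-square test against a 1 : 1 expectation with one degree of freedom, which is not necessarily the test used for the values reported above, and the function name `balance_test` is hypothetical. The counts are taken from Table 1.

```python
import math

def balance_test(a, b):
    """Chi-square goodness-of-fit test of two counts against a 1:1 ratio.

    Assumption: one degree of freedom, no continuity correction.
    Returns the test statistic and its p-value.
    """
    expected = (a + b) / 2.0
    stat = (a - expected) ** 2 / expected + (b - expected) ** 2 / expected
    # Survival function of a chi-square variable with 1 df:
    # P(Z^2 > x) = erfc(sqrt(x / 2)) for a standard normal Z.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Fixed derived sites from Table 1: (AT to GC, GC to AT)
stat_4fold, p_4fold = balance_test(106, 464)   # fourfold degenerate sites
stat_intron, p_intron = balance_test(44, 53)   # short introns
print(stat_4fold, p_4fold)
print(stat_intron, p_intron)
```

Under these assumptions the fourfold degenerate counts deviate strongly from balance, while the intron counts do not, in line with the conclusions stated above.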
We estimated the scaled selection coefficient for short introns and fourfold degenerate sites from the site-frequency spectrum (see Fig. 4). The unpolarized spectrum of introns led to an estimate of , which was not significant (). Since the number of fourfold degenerate sites was higher than that of introns, we could estimate from the unpolarized and polarized site-frequencies. From the unpolarized, folded spectrum, was found to be 2.05 and highly significant (). Considering mutations from AT to GC and GC to AT, was estimated to be 2.58 (not significant, ) and 1.99 (significant, ), respectively. Note that the numbers of sites differed between the two directions (Table 1), which explains why the smaller estimate is significant while the larger one is not.
Biased gene conversion might contribute to the inferred selection in introns. In that case, we would expect a similar amount of biased gene conversion at fourfold degenerate sites as well. The estimates of the scaled selection coefficient at fourfold degenerate sites are higher, indicating selection towards GC beyond that observed in the introns.
We used the estimate of and the base composition at fourfold degenerate sites to estimate the scaled mutation rates and . The proportion of AT-sites at fourfold degenerate sites is . From the proportion of polymorphic sites, we calculated , which provides an estimate for . From this, we get and , respectively.
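The summary quantities entering this step can be read directly off Table 1. The following sketch computes the proportion of polymorphic sites and the mean AT frequency from the unpolarized fourfold degenerate spectrum. It is an illustration under assumptions: the function name `summarize` is hypothetical, and averaging the AT frequency over all sites is only one plausible way to define the proportion of AT-sites, not necessarily the definition used above.

```python
# Unpolarized spectrum at fourfold degenerate sites (Table 1).
# Index i is the number of GC alleles among the 15 sampled lines.
unpolarized_4fold = [7464, 29, 20, 14, 9, 6, 10, 12,
                     14, 20, 14, 27, 49, 53, 157, 15518]

def summarize(spectrum):
    """Return basic summaries of a site-frequency spectrum with fixed classes."""
    n = len(spectrum) - 1                  # sample size (15 here)
    total = sum(spectrum)                  # all sites
    polymorphic = sum(spectrum[1:n])       # classes 1 .. n-1
    # Mean AT frequency over all sites (assumed definition of "AT proportion").
    at_alleles = sum((n - i) * count for i, count in enumerate(spectrum))
    return {
        "total": total,
        "polymorphic": polymorphic,
        "prop_polymorphic": polymorphic / total,
        "mean_at_frequency": at_alleles / (n * total),
    }

summary = summarize(unpolarized_4fold)
print(summary)
```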
4.8. Comparison with earlier results
Qualitatively, our results are consistent with the findings of Zeng and Charlesworth (2010), who used the same dataset to disentangle the genetic forces on fourfold degenerate sites and introns. Their model uses unpolarized polymorphism to estimate directional selection, mutation bias and demographic parameters. Although their model allows for a change in effective population size, they found that an equilibrium model with directional selection towards GC () and constant population size fits autosomal synonymous polymorphism well. Crucially, these authors ignore outgroup information. Including this information, we must exclude an equilibrium model for fourfold degenerate sites, as the ratio of fixed diverged sites from GC to AT and vice versa deviates strongly from 1 : 1. As we find deviation from equilibrium for fourfold degenerate sites, neither their equilibrium model nor ours may be accurate. On the other hand, our analysis suggests that short introns are close to equilibrium and, thus, are more appropriate for our model. Interestingly, Zeng and Charlesworth (2010) had to assume a complex form of selection or non-equilibrium to explain the pattern of introns. This difference from our results may be ascribed to the inclusion of different length-classes of introns in their analysis. The length of introns has been found to correlate negatively with divergence, indicating selective constraints (Halligan and Keightley, 2006).
5. Discussion
We analyze a decoupled Moran model for a haploid population of finite size and a biallelic locus under mutation and drift, with scaled forward and backward mutation rates and directional selection of a given scaled strength. Small scaled mutation rates are appropriate for single nucleotide polymorphism data in highly recombining regions of higher organisms. For microbes and viruses, however, this approximation may not be useful (Desai and Plotkin, 2008). Without selection, the equilibrium distribution of a sample is beta-binomial. The infinite sites approximation corresponds to the limit of small scaled mutation rates, where each polymorphic site is assumed to be hit by a single mutation only. Many results using the infinite sites model are available for the case without selection (e.g., Ewens, 1974; Watterson, 1975; Ewens, 2004).
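To illustrate the neutral case, the sketch below evaluates a beta-binomial sample distribution and shows how, for small parameters, almost all probability mass sits on the two monomorphic sample configurations. The parameter names `a` and `b` are hypothetical stand-ins for the two scaled mutation rates; which rate corresponds to which parameter, and any scaling factor, is an assumption not taken from the text.

```python
import math

def log_beta(x, y):
    """Logarithm of the beta function B(x, y)."""
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)

def beta_binomial_pmf(n, a, b):
    """Probabilities of observing k = 0..n copies of an allele in a sample of n,
    when the population frequency follows a Beta(a, b) distribution."""
    pmf = []
    for k in range(n + 1):
        log_choose = (math.lgamma(n + 1) - math.lgamma(k + 1)
                      - math.lgamma(n - k + 1))
        log_p = log_choose + log_beta(k + a, n - k + b) - log_beta(a, b)
        pmf.append(math.exp(log_p))
    return pmf

# Hypothetical small, symmetric scaled mutation rates and a sample of 15.
pmf = beta_binomial_pmf(15, 0.01, 0.01)
monomorphic_mass = pmf[0] + pmf[-1]
print(monomorphic_mass)
```

With parameters this small, the polymorphic classes together carry only a few percent of the mass, which is the regime in which a single-mutation-per-site approximation becomes reasonable.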
With selection, the general process, without the restriction to small scaled mutation rates, was introduced by Wright (1931) and has been studied in more detail later (e.g., Moran, 1958, 1962; McVean and Charlesworth, 1999). The limiting case of small scaled mutation rates with selection has been studied mainly in the context of Poisson random field (PRF) models (e.g., Sawyer and Hartl, 1992; Hartl et al., 1994; Bustamante et al., 2001; Griffiths, 2003; Ewens, 2004; Evans et al., 2007). The PRF model, like the infinite sites model, assumes a single mutation per polymorphic site. In its present version, it considers only irreversible mutations; nevertheless, the source of mutations does not diminish, since the number of sites is assumed infinite. In finite samples, however, such a unidirectional process must lead to depletion, even if whole genomes are considered. Hence, applications of the PRF model to realistic datasets would conform to a quasi-equilibrium process.
Starting from a decoupled Moran model, we use the same approximations as in the case without selection to derive an approximation with small scaled mutation rates. In particular, we derive a simple approximation to the equilibrium distribution, with the constant of proportionality , for the distribution of polymorphic alleles in the population (formula (14)).
We then introduce another simplified process obtained by dropping all transitions that occur with a probability of order from the Moran process. Then only monomorphic states will mutate. In effect, this process consists of two quasi-equilibrium processes that are joined to balance. This simplified process is similar to those considered in Section 2.12 of Ewens (2004) and especially in Section 2 of Evans et al. (2007) in the PRF context. To first order in , this simplified process produces the same results as the biallelic mutation, drift, and selection Moran model. This coincidence is due to the fact that in both cases the proportion of polymorphism in the population is only of order , while that of the monomorphic states is of order one, such that the probability of a mutation occurring in an already polymorphic state is approaching zero. Zeng and Charlesworth (2009) seem to have had the same idea to join the two processes in a model, but did not further investigate the possibilities of this model. In fact, they use a Wright–Fisher model for their data analyses.
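A forward simulation of such a simplified process can be sketched in a few lines. The code below is an illustration under stated assumptions rather than the exact process defined in this article: the per-step transition probabilities, the parameter names `mu`, `nu` and `s`, and the function name are all hypothetical. New mutations enter only from the two monomorphic states, and polymorphic states change only through drift and genic selection.

```python
import random

def simulate_boundary_mutation_moran(N, mu, nu, s, steps, seed=0):
    """Simulate a Moran-type chain in which only monomorphic states mutate.

    State i is the number of copies of the focal allele (0..N).
    mu, nu: assumed per-step probabilities that a mutation appears in
            state 0 or state N, respectively.
    s:      assumed selective advantage of the focal allele (requires s <= 2
            so that the transition probabilities stay below one).
    Returns the number of steps spent in each state.
    """
    rng = random.Random(seed)
    visits = [0] * (N + 1)
    i = 0
    for _ in range(steps):
        visits[i] += 1
        u = rng.random()
        if i == 0:
            if u < mu:          # mutation creates a single focal copy
                i = 1
        elif i == N:
            if u < nu:          # back mutation creates a single other copy
                i = N - 1
        else:
            # Probability that parent and replaced individual differ,
            # for one of the two possible orders.
            drift = i * (N - i) / N ** 2
            if u < drift * (1 + s):      # focal allele replaces the other
                i += 1
            elif u < drift * (2 + s):    # other allele replaces the focal one
                i -= 1
    return visits

visits = simulate_boundary_mutation_moran(N=20, mu=0.001, nu=0.001,
                                          s=0.0, steps=100000, seed=1)
fraction_monomorphic = (visits[0] + visits[20]) / 100000
print(fraction_monomorphic)
```

Because mutation is rare relative to the time a new allele needs to be lost or fixed, the chain spends most steps in the two monomorphic states, which is the property that motivates dropping mutations in polymorphic states.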
Using this simpler model we derive the sojourn times, the equilibrium proportions of fixed alleles, and fixation rates conditional on origin and fixation. These formulas are discrete analogs to those derived earlier using diffusion theory (Kimura and Ohta, 1969; Bulmer, 1991; Ewens, 2004) and converge to those results for large . For practical applications, theory assuming an equilibrium distribution in a large population where small samples are taken seems most useful. Since these results hold for population sizes of according to our simulations, we assume that they will generally hold for all models in the diffusion limit, in particular for the Wright–Fisher model. For calculating this distribution, we provide convenient recurrence relations. This enables us to give formulas analogous to the Ewens–Watterson estimator of (Ewens, 1974; Watterson, 1975) for biased mutation rates and directional selection (formulas (61)–(63)) under equilibrium assumptions.
We apply this theory to a polymorphism dataset of fourfold degenerate sites in Drosophila melanogaster. Our results are qualitatively similar to those of Zeng and Charlesworth (2010), with the major quantitative differences arising not from different model choices, but from the use of polarized vs. non-polarized polymorphism, inclusion of all introns vs. only short introns, and inclusion vs. exclusion of outgroup information. While Zeng and Charlesworth (2010) used forward simulations with a Wright–Fisher model, we could more economically apply our analytical results.
Acknowledgments
We express our sincere thanks to the other members of the “Initiativkolleg Population Genetics” and the “Doktoratskolleg Populationsgenetik” and the members of the external advisory committee, especially Christian Schlötterer, Joachim Hermisson, Andreas Futschik, and Brian Charlesworth, for motivation, interesting discussions, and helpful suggestions and Ludwig Geroldinger for helping to improve mathematical rigor. Furthermore, we thank Peter Pfaffelhuber for discussions and suggestions. We acknowledge funding by the University of Veterinary Medicine Vienna (for the Initiativkolleg) and the FWF (for the Doktoratskolleg, W1225-B20), both headed by Christian Schlötterer. We thank the Editor and two reviewers for comments that helped improve the article.
References
- Abramowitz M., Stegun I., editors. Handbook of Mathematical Functions. ninth ed. Dover; 1970.
- Akashi H. Synonymous codon usage in Drosophila melanogaster: natural selection and translational accuracy. Genetics. 1994;136:927–935. doi: 10.1093/genetics/136.3.927.
- Akashi H. Molecular evolution between Drosophila melanogaster and D. simulans: reduced codon bias, faster rates of amino acid substitution, and larger proteins in D. melanogaster. Genetics. 1996;144:1297–1307. doi: 10.1093/genetics/144.3.1297.
- Baake E., Bialowons R. Ancestral processes with selection: Branching and Moran models. In: Miekisz J., editor. Stochastic Models in Biological Sciences. vol. 80. Institute of Mathematics, Polish Academy of Sciences; Warsaw, Poland: 2008. pp. 33–52. (Banach Center Publications).
- Begun D.J., Holloway A.K., Stevens K., Hillier L.W., Poh Y.P., Hahn M.W., Nista P.M., Jones C.D., Kern A.D., Dewey C.N., Pachter L., Myers E., Langley C.H. Population genomics: whole-genome analysis of polymorphism and divergence in Drosophila simulans. PLoS Biol. 2007;5:e310. doi: 10.1371/journal.pbio.0050310.
- Bulmer M. The selection–mutation-drift theory of synonymous codon usage. Genetics. 1991;129:897–907. doi: 10.1093/genetics/129.3.897.
- Bustamante C., Wakeley J., Sawyer S., Hartl D. Directional selection and the site-frequency spectrum. Genetics. 2001;159:1779–1788. doi: 10.1093/genetics/159.4.1779.
- Carlini D.B., Stephan W. In vivo introduction of unpreferred synonymous codons into the Drosophila Adh gene results in reduced levels of ADH protein. Genetics. 2003;163:239–243. doi: 10.1093/genetics/163.1.239.
- Desai M.M., Plotkin J.B. The polymorphism frequency spectrum of finitely many sites under selection. Genetics. 2008;180:2175–2191. doi: 10.1534/genetics.108.087361.
- Etheridge A., Griffiths R. A coalescent dual process in a Moran model with genic selection. Theor. Popul. Biol. 2009;75:320–330. doi: 10.1016/j.tpb.2009.03.004.
- Evans S., Shvets Y., Slatkin M. Non-equilibrium theory of the allele frequency spectrum. Theor. Popul. Biol. 2007;71:109–119. doi: 10.1016/j.tpb.2006.06.005.
- Ewens W. A note on the sampling theory for infinite alleles and infinite sites models. Theor. Popul. Biol. 1974;6:143–148. doi: 10.1016/0040-5809(74)90020-3.
- Ewens W. Mathematical Population Genetics. Springer; NY: 2004.
- Fisher R. The Genetical Theory of Natural Selection. Clarendon Press; Oxford: 1930.
- Griffiths R. The frequency spectrum of a mutation, and its age, in a general diffusion model. Theor. Popul. Biol. 2003;64:241–251. doi: 10.1016/s0040-5809(03)00075-3.
- Halligan D., Keightley P. Ubiquitous selective constraints in the Drosophila genome revealed by a genome-wide interspecies comparison. Genome Res. 2006;16:875–884. doi: 10.1101/gr.5022906.
- Hartl D., Moriyama E., Sawyer S. Selection intensity for codon bias. Genetics. 1994;138:227–234. doi: 10.1093/genetics/138.1.227.
- Kimura M. Solution of a process of random genetic drift with a continuous model. Proc. Natl. Acad. Sci. USA. 1955;41:144–150. doi: 10.1073/pnas.41.3.144.
- Kimura M. Stochastic processes and the distribution of gene frequencies under natural selection. Cold Spring Harbor Symp. Quant. Biol. 1955;20:33–53. doi: 10.1101/sqb.1955.020.01.006.
- Kimura M. On the probability of fixation of mutant genes in a population. Genetics. 1962;47:713–719. doi: 10.1093/genetics/47.6.713.
- Kimura M. The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics. 1969;61:893–903. doi: 10.1093/genetics/61.4.893.
- Kimura M., Ohta T. The average number of generations until fixation of a mutant gene in a finite population. Genetics. 1969;61:763–771. doi: 10.1093/genetics/61.3.763.
- Kingman J. On the genealogy of large populations. J. Appl. Probab. 1982;19A:27–43.
- Lawrie D., Petrov D., Messer P. Faster than neutral evolution of constrained sequences: the complex interplay of mutational biases and weak selection. Genome Biol. Evol. 2011;3:383–395. doi: 10.1093/gbe/evr032.
- McVean G., Charlesworth B. A population genetic model for the evolution of synonymous codon usage: patterns and predictions. Genet. Res. 1999;74:145–158.
- Moran P. Random processes in genetics. Proc. Camb. Phil. Soc. 1958;54:60–71.
- Moran P. Statistical Processes of Evolutionary Theory. Clarendon Press; Oxford: 1962.
- Muirhead C., Wakeley J. Modeling multi-allelic selection using a Moran model. Genetics. 2009;182:1141–1157. doi: 10.1534/genetics.108.089474.
- Parsch J., Novozhilov S., Saminadin-Peter S.S., Wong K.M., Andolfatto P. On the utility of short intron sequences as a reference for the detection of positive and negative selection in Drosophila. Mol. Biol. Evol. 2010;27:1226–1234. doi: 10.1093/molbev/msq046.
- RoyChoudhury A., Wakeley J. Sufficiency of the number of segregating sites in the limit under finite-sites mutation. Theor. Popul. Biol. 2010;78:118–122. doi: 10.1016/j.tpb.2010.05.003.
- Sawyer S., Hartl D. Population genetics of polymorphism and divergence. Genetics. 1992;132:1161–1176. doi: 10.1093/genetics/132.4.1161.
- Shapiro J., Huang W., Zhang C., Hubisz M., Turissini D., Fang S., Wang H.-Y., Hudson R., Nielsen R., Wu C.-I. Adaptive genic evolution in the Drosophila genomes. PNAS. 2007. doi: 10.1073/pnas.0610385104.
- Shields D.C., Sharp P.M., Higgins D.G., Wright F. “Silent” sites in Drosophila genes are not neutral: evidence of selection among synonymous codons. Mol. Biol. Evol. 1988;5:704–716. doi: 10.1093/oxfordjournals.molbev.a040525.
- Singh N., Arndt P., Clark A., Aquadro C. Strong evidence for lineage and sequence specificity of substitution rates and patterns in Drosophila. Mol. Biol. Evol. 2009;26:1591–1605. doi: 10.1093/molbev/msp071.
- Watterson G. On the number of segregating sites in genetical models without recombination. Theor. Popul. Biol. 1975;7:256–276. doi: 10.1016/0040-5809(75)90020-9.
- Wright S. Evolution in Mendelian populations. Genetics. 1931;16:97–159. doi: 10.1093/genetics/16.2.97.
- Wright S. Population structure in evolution. Proc. Am. Phil. Soc. 1949;93:471–478.
- Zeng K. A simple multiallele model and its application to preferred-unpreferred codons using polymorphism data. Mol. Biol. Evol. 2010;27:1327–1337. doi: 10.1093/molbev/msq023.
- Zeng K., Charlesworth B. Estimating selection intensity on synonymous codon usage in a non-equilibrium population. Genetics. 2009;183:651–662. doi: 10.1534/genetics.109.101782.
- Zeng K., Charlesworth B. Studying patterns of recent evolution at synonymous sites and intronic sites in Drosophila melanogaster. J. Mol. Evol. 2010;70:116–128. doi: 10.1007/s00239-009-9314-6.




