Molecular Biology and Evolution. 2019 Apr 19;36(8):1701–1710. doi: 10.1093/molbev/msz092

Applicability of the Mutation–Selection Balance Model to Population Genetics of Heterozygous Protein-Truncating Variants in Humans

Donate Weghorn 7,1,2,#, Daniel J Balick 1,2,#, Christopher Cassa 1,2, Jack A Kosmicki 3,4, Mark J Daly 3,4, David R Beier 5,6, Shamil R Sunyaev 1,2,
Editor: Ryan Hernandez
PMCID: PMC6738481  PMID: 31004148

Abstract

The fate of alleles in the human population is believed to be highly affected by the stochastic force of genetic drift. Estimation of the strength of natural selection in humans generally necessitates a careful modeling of drift including complex effects of the population history and structure. Protein-truncating variants (PTVs) are expected to evolve under strong purifying selection and to have a relatively high per-gene mutation rate. Thus, it is appealing to model the population genetics of PTVs under a simple deterministic mutation–selection balance, as has been proposed earlier (Cassa et al. 2017). Here, we investigated the limits of this approximation using both computer simulations and data-driven approaches. Our simulations rely on a model of demographic history estimated from 33,370 individual exomes of the Non-Finnish European subset of the ExAC data set (Lek et al. 2016). Additionally, we compared the African and European subset of the ExAC study and analyzed de novo PTVs. We show that the mutation–selection balance model is applicable to the majority of human genes, but not to genes under the weakest selection.

Keywords: protein-truncating variants, selection inference, genetic drift

Introduction

In well-adapted populations, the evolutionary dynamics of genes under purifying selection plays a prominent role. Early models describing this phenomenon were purely deterministic, that is they assumed infinite effective population size (Fisher 1930). In particular, the discussion centered around the concept of mutation–selection balance, when deleterious variants in the population are replenished by mutation against the constant purge of negative selection. Supported by the advent of large sequencing data sets and computer simulations, however, it became clear that the high amounts of nonsilent genetic variation observed in real populations cannot be fully explained by mutation–selection balance (Wright 1931; Lande 1976; Bürger et al. 1989). Instead, many mutations are only weakly selected against and many populations cannot be approximated to be infinitely large. Both of these factors emphasize the relative importance of stochastic effects, or genetic drift, compared with mutation and selection. Therefore, deterministic mutation–selection balance is not an adequate description of the evolutionary dynamics of deleterious alleles unless the selection strength is sufficiently high to dominate genetic drift. The full mutation–selection–drift balance has been extensively studied using the diffusion approximation (Kimura 1964). It is now widely appreciated that in humans the complexities of demographic history and changing population size must be explicitly modeled. This has been incorporated in many recent studies that estimated the intensity of selection (Williamson et al. 2005; Tennessen et al. 2012; Do et al. 2015).

One practically important case when effects of selection are strong in comparison to genetic drift is the dynamics of protein-truncating variants (PTVs). When assuming that most PTVs within a gene have similar fitness effects, they can be analyzed in aggregate. The cumulative number of new truncating mutations per gene is two orders of magnitude higher than the per-site expectation. It is thus appealing to apply a deterministic approximation to model the population genetics of PTVs, as has been proposed by us earlier (Cassa et al. 2017). The utility and limits of the deterministic approximation are a matter of debate (Charlesworth and Hill 2019; Cassa et al. 2019). Here, we investigated the applicability of this approximation to selection against heterozygous PTVs, shet, using simulations that explicitly incorporate drift. Further, we studied the effects of population stratification by comparing PTV allele frequencies between the African and Non-Finnish European (NFE) subpopulations of the ExAC data set. Lastly, we analyzed the estimates of selection strength vis-à-vis the fraction of de novo mutations in a recent pedigree sequencing data set, a direct measure of shet. Beyond the analysis of the influence of drift on the shet estimates, we also present a comparison of s^het with another measure of protein constraint, pLI (Lek et al. 2016).

Results

Impact of Genetic Drift on PTV Allele Count

We first analyzed how genetic drift affects the observed number of PTV mutations on a gene, k, and its variance in the population. Given an exome sequencing cohort, k is the sum of protein-truncating mutations over all sampled chromosomes in the cohort, n. The expected value of k, E[k], is determined by selection as well as by the local cumulative mutation rate, U. Provided PTVs are not nearly recessive, E[k] is well approximated by the deterministic expression nU/shet (Wright 1931). If mutation and selection are the dominant evolutionary forces, the approximate Poisson nature of the PTV count implies for the variance Var[k]=E[k]=nU/shet. Meanwhile, if PTV alleles remain in the population for a substantial period of time and thus become subject to the effects of genetic drift, this will tend to increase Var[k]. We can express the total variance of k as a sum of Poisson sampling variance and variance due to drift (Materials and Methods).

The impact of genetic drift shows in the population allele frequency spectrum. For given U and shet, this describes the distribution of the gene-specific cumulative frequency of PTVs in the population, X, and depends on the demographic history of the population. To gain intuition, we first considered a classic approximation for the allele frequency spectrum under the assumption of strong purifying selection (Nei 1968). At equilibrium, the variance of the cumulative PTV allele count k in a sample of n chromosomes is then given by Var[k] = (nU/shet)(1 + n/(4Neshet)). Notably, this expression approaches the deterministic approximation of nU/shet if the sample size, n, is much smaller than 4Neshet, that is, four times the product of the effective population size, Ne, and the selection coefficient.
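To get a feel for the size of this correction, the drift term n/(4Neshet) can be evaluated directly. The following is a minimal sketch, not part of the original analysis: the function names are hypothetical, and the example values (n = 66,740 chromosomes, that is, two per NFE individual, and Ne = 4.3 million) are assumptions plugged into the equilibrium formula, which does not capture nonequilibrium demography.

```python
def variance_inflation(n, Ne, s_het):
    """Ratio of Var[k] to the deterministic value nU/s_het under the
    equilibrium strong-selection approximation (Nei 1968)."""
    return 1.0 + n / (4.0 * Ne * s_het)


def count_variance(n, U, Ne, s_het):
    """Var[k] = (nU/s_het) * (1 + n / (4 * Ne * s_het))."""
    return (n * U / s_het) * variance_inflation(n, Ne, s_het)


# Illustrative values only (assumed, not taken from the fitted model).
for s_het in (0.001, 0.01, 0.06, 0.5):
    print(s_het, variance_inflation(66740, 4.3e6, s_het))
```

The inflation factor shrinks toward 1 as either Ne or shet grows, which is the qualitative behavior described above.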

Modeling Recent Population Expansion

Human populations appear to be far from evolutionary equilibrium, and most populations have undergone very rapid, recent growth (Keinan and Clark 2012). With our focus on rare deleterious PTVs (X^<0.001), these alleles are expected to be on average young, so the most relevant population size for our purposes is the recent effective population size, corresponding to the lifetime of an average very rare deleterious allele. Current literature estimates of the present day effective size in the European population span the wide range of 0.5–8 million individuals (Tennessen et al. 2012; Gao and Keinan 2014; Browning and Browning 2015; Harpak et al. 2016). Although the long-term effective size of the human population is smaller than the sample size of the ExAC data set, this number, driven by ancestral or bottleneck sizes, is less relevant than the epoch of recent growth for very young alleles under relatively strong natural selection.

To determine the deviation from the deterministic regime, we evaluated relevant features of the demographic history under which rare deleterious variants evolve. We focused on the NFE subset, representing a majority of the samples (N = 33,370) in the ExAC data. The NFE subset has a well-studied demographic history from a single ancestry. We used features of the Tennessen et al. (2012) model of European demography, including the initial size, bottleneck and initial exponential growth phases. For the final exponential growth phase, we matched properties of rare alleles in the ExAC NFE sample to gauge the final effective population size and corresponding growth rate. We simulated a dense range of exponential growth rates in the final demographic period, and matched neutral simulation results of the downsampled site frequency spectrum to the fraction of synonymous singleton alleles observed in the NFE subpopulation. We focused on non-CpG transversions for these purposes, as CpG and even non-CpG transitions are known to exhibit the effects of recurrent mutations in the ExAC sample due to elevated mutation rates relative to non-CpG transversions. For example, the fraction of synonymous singletons for non-CpG transversions in ExAC NFE is 63.3% (95% CI 62.1–64.6%; ExAC All 60.5%), whereas the fraction of synonymous singletons for CpG transitions in ExAC NFE is 39.8% (39.1–40.7%; ExAC All 23.9%). The analysis of singletons resulted in a demographic history with a recent population size of 4.3 million individuals, which is consistent with estimates based on the same data set provided in Harpak et al. (2016).

In figure 1a, we find the coefficient of variation, √Var[k]/E[k], to be in good agreement with the deterministic approximation for genes under strong selection (see also supplementary fig. 1, Supplementary Material online). For shet = 0.06, corresponding to the mean of the inferred genome-wide distribution of heterozygous selection coefficients in Cassa et al. (2017), the inflation of the coefficient of variation in our simulations did not exceed 4%. However, the variance did diverge from the deterministic approximation for genes under weaker selection (shet ≲ 0.02). Figure 1a also shows that the mean, E[k], is unaffected by genetic drift within the analyzed range of parameters.

Fig. 1.

Comparison of the deterministic mutation–selection balance model with the model that includes the effects of genetic drift, in the NFE demography. (a) Fold change in the coefficient of variation (squares) and the mean (crosses) of the number of PTV mutations, k, relative to the deterministic case, obtained from simulations of a realistic demography of the ExAC NFE sample for different values of heterozygous selection strength shet. (b) Heat map of gene-specific estimates for all 16,279 tested genes from the NFE sample, showing deterministic (x axis) and drift-inclusive (y axis) shet estimates. Note the double-logarithmic axes in both panels.

Incorporation of Genetic Drift in the Selection Inference

Having established the effects of genetic drift on the population PTV allele frequency, we could then address the resulting effects on selection inference. On the basis of the approach described in Cassa et al. (2017), we estimated the parameters θ of the distribution of heterozygous selection coefficients, P(shet;θ), from fitting the observed distribution of per-gene PTV counts,

$$P(k;\theta,n,U)=\int \mathrm{d}s_{\mathrm{het}}\,P(s_{\mathrm{het}};\theta)\,P(k\mid s_{\mathrm{het}};n,U), \tag{1}$$

where P(k|shet;n,U) denotes the conditional probability of observing count k given shet. At mutation–selection balance, P(k|shet;n,U)=Pois(k;nU/shet), with Pois(k; λ) denoting the Poisson distribution of k with parameter λ.

Here, we fully accounted for the effects of drift through incorporation of the allele frequency spectrum, φ(X;shet,U). In the deterministic case, X is assumed to be fixed at its expected value U/shet for a gene with mutation rate U and under selection shet. In contrast, we now explicitly include its variability:

$$P(k\mid s_{\mathrm{het}};n,U)=\sum_{X}\mathrm{Pois}(k;nX)\,\varphi(X;s_{\mathrm{het}},U). \tag{2}$$
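As an illustration of equation (2), the sketch below evaluates the count likelihood as a sum over a discretized frequency spectrum. It is a sketch under stated assumptions: the function names and the dictionary representation of φ are hypothetical choices, and in the actual analysis the spectrum comes from simulations rather than from a hand-written table.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of count k with mean lam."""
    if lam == 0.0:
        return 1.0 if k == 0 else 0.0
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def count_likelihood(k, n, spectrum):
    """P(k | s_het; n, U) as a sum over frequencies X, as in equation (2).

    `spectrum` maps each cumulative frequency X to its probability phi(X);
    the probabilities are assumed to sum to 1.
    """
    return sum(phi * poisson_pmf(k, n * X) for X, phi in spectrum.items())


def deterministic_spectrum(U, s_het):
    """Point mass at X = U / s_het (the mutation-selection balance limit)."""
    return {U / s_het: 1.0}
```

With a point-mass spectrum, `count_likelihood` reduces to Pois(k; nU/shet), the deterministic case; spreading the same mass over several values of X widens the distribution of k.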

Population allele frequency spectra φ(X;shet,U) under the described demography were produced from simulations for a dense grid of mutation rates and selection strengths (Materials and Methods). We maximized the likelihood of the data over parameters θ, fitting the model in equation (1) to the observed distribution of k in the ExAC NFE subsample (supplementary fig. 2, Supplementary Material online). As we found previously, the inverse Gaussian distribution provided the best parametric form for the fit to the distribution P(shet;θ). To obtain gene-specific estimates of selection, s^het, we derived the mean of the posterior distribution (Materials and Methods). Supplementary table 1, Supplementary Material online, contains the s^het values from the NFE subpopulation for the drift-inclusive and the deterministic case, including 95% posterior probability intervals.
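The posterior-mean step can be sketched numerically for the deterministic likelihood. This is an illustration, not the authors' implementation: the prior parameters (`prior_mean`, `prior_shape`), the integration grid, and all function names are assumptions chosen for the example, and the fitted values of θ are not reproduced here.

```python
import math


def log_inverse_gaussian_pdf(s, mean, shape):
    """Log density of an inverse Gaussian distribution at s > 0."""
    return (0.5 * math.log(shape / (2.0 * math.pi * s ** 3))
            - shape * (s - mean) ** 2 / (2.0 * mean ** 2 * s))


def log_poisson_pmf(k, lam):
    """Log Poisson probability of count k with mean lam > 0."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def posterior_mean_s_het(k, n, U, prior_mean=0.06, prior_shape=0.01,
                         grid_size=2000, s_min=1e-5, s_max=1.0):
    """Posterior mean of s_het on a log-spaced grid (deterministic likelihood).

    Prior parameters and grid bounds are placeholders, not fitted values.
    """
    step = (math.log(s_max) - math.log(s_min)) / (grid_size - 1)
    grid = [s_min * math.exp(i * step) for i in range(grid_size)]
    # log(prior * likelihood * s); the factor s is the Jacobian of the log grid.
    log_w = [log_inverse_gaussian_pdf(s, prior_mean, prior_shape)
             + log_poisson_pmf(k, n * U / s) + math.log(s) for s in grid]
    top = max(log_w)
    weights = [math.exp(v - top) for v in log_w]
    return sum(w * s for w, s in zip(weights, grid)) / sum(weights)
```

For a fixed n and U, a larger observed count k pulls the posterior mean toward smaller shet, as expected from E[k] = nU/shet.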

We used the resulting per-gene estimates to understand the impact of a full demographic model relative to the deterministic estimates of heterozygous selection coefficients obtained under mutation–selection balance. Figure 1b shows the direct comparison between the two scenarios for the NFE subset. Overall, the incorporation of drift in the model did not substantially change relative ranks of genes (Spearman rank correlation coefficient = 0.995; supplementary fig. 3, Supplementary Material online). Estimates for genes under moderate to strong selection (shet > 0.01, N = 10,744 genes, 66%) are very close to the deterministic estimates, only showing a slight downward shift on average. As variance due to genetic drift partly absorbs the variance of the prior distribution of selection coefficients, the latter decreased (by ∼18%). Genes under weaker selection (shet < 0.01, N = 5,535 genes, 34%) appeared with largely the same rank, but showed a monotonic increase in estimated values of heterozygous selection. The apparent convergence of selection coefficients to ∼0.004 for these genes can be explained by the effects of genetic drift on the conditional probability in equation (2). For weak negative selection, P(k|shet;n,U) becomes nearly flat in the regime of very small shet, moving the estimates closer to the genome-wide mean.

We conclude that even though human demographic history is complex, a realistic model of recent population expansion suggests that, owing to their deleteriousness, the evolution of PTVs can be largely described in a deterministic framework.

Test for Differences between Subpopulations

As an additional way to test the utility and limits of the deterministic approximation to estimate selection, we used a data-driven approach. This approach relied on a comparison of PTV counts between different human subpopulations, free from assumptions made in simulations, such as panmixia. For a given gene, if the same selection coefficient shet acts on heterozygous PTVs in two subpopulations, we expect E[X1] = E[X2] = U/shet. Here, Xi denotes the cumulative PTV allele frequency in subpopulation i. The PTV allele count in subsample i was modeled as ki ∼ Pois(niU/shet), where ni is the number of chromosomes sampled.

We tested whether this deterministic approximation was violated by comparing the NFE (i = 1) and African (AFR, i = 2) ExAC subsamples using the C-test (Przyborowski and Wilenski 1940). Given k1 + k2 = k, the distribution of k1 conditional on the total count k is binomial with success probability P = n1/(n1 + n2). We then computed the two-sided binomial P-value of the observed value of k1 for all genes in the set. Since the k1 are discrete random variables, their P-values are not uniformly distributed. Therefore, in order to account for multiple testing, we compared the observed genome-wide distribution of binomial P-values to the P-value distribution obtained from a simulation under the null assumption. We generated 500 instances of simulated binomial PTV counts for each gene and computed the false discovery rate (FDR) conservatively, as the fraction of genes that is expected by chance to have a P-value equal to or less than a certain threshold. We then measured the number of genes with FDR below 5%, xsig.
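The per-gene conditional test can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function names are hypothetical, and the two-sided P-value is defined here as the total probability of all outcomes no more likely than the observed one, which is one common convention and an assumption on our part. The FDR simulation step is not shown.

```python
import math


def binomial_pmf(i, k, p):
    """Probability of i successes in k trials with success probability p."""
    return math.comb(k, i) * p ** i * (1.0 - p) ** (k - i)


def c_test_p_value(k1, k2, n1, n2):
    """Two-sided conditional binomial test for equal Poisson rates.

    k1, k2: PTV allele counts in the two subsamples.
    n1, n2: numbers of sampled chromosomes in the two subsamples.
    """
    k = k1 + k2
    p = n1 / (n1 + n2)
    observed = binomial_pmf(k1, k, p)
    cutoff = observed * (1.0 + 1e-9)  # tolerate floating-point ties
    total = sum(prob for prob in (binomial_pmf(i, k, p) for i in range(k + 1))
                if prob <= cutoff)
    return min(1.0, total)
```

For example, `c_test_p_value(0, 10, 100, 100)` sums the two equally extreme outcomes (0 or 10 of the 10 alleles in the first subsample), each with probability 1/1,024.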

Table 1 shows the results for different intervals of the deterministic shet estimates, as well as the total number of genes in each interval, xtot, and the significant fraction. We found that a total of 870 out of 15,865 tested genes (5.5%) show a significant deviation from the assumption of Poisson distributed counts with equal expected values in the two subpopulations. Of all 870 significant genes, 49% have s^het ≤ 0.01. Unlike our simulations, this data-driven approach is not expected to be contingent on assumptions about demographic structure or aspects of the population history.

Table 1.

Fraction, xsig/xtot, of Genes with Significant, FDR-Corrected Two-Sided Binomial P-value According to the C-test across the NFE and AFR Subpopulations (out of N = 15,865 genes), in Intervals of Deterministic s^het Values Derived from the NFE Subpopulation of the ExAC Data Set.

s^het Interval    xsig    xtot     xsig/xtot
[0.000, 0.005]    220     1,690    0.124
[0.005, 0.010]    202     2,315    0.086
[0.010, 0.020]    148     2,744    0.052
[0.020, 0.050]    140     3,275    0.041
[0.050, 0.200]    102     4,050    0.029
[0.200, 0.500]    58      1,765    0.028
[0.500, 1.000]    0       26       0.000

Note.—FDR was controlled at 0.05.

Prediction of De Novo Fraction Using s^het

Some of the PTVs detected in a sample are de novo mutations rather than segregating alleles inherited from the parental generation. With increasing strength of negative selection, the population allele frequency, and thus the chance of inheriting a deleterious allele, is reduced and more of the observed deleterious mutations arise de novo. The fraction of de novo out of all PTVs equals shet for genes under negative selection in the deterministic limit. As shown in Materials and Methods, this result is also valid across a wide range of parameters at mutation–selection-drift balance (supplementary fig. 4, Supplementary Material online).
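The deterministic statement can be checked in one line. In a sample of n chromosomes, the expected number of newly arisen PTVs per generation is nU, whereas the expected total PTV count is E[k] = nU/shet, so that (writing f for the de novo fraction)

$$f=\frac{nU}{\mathrm{E}[k]}=\frac{nU}{nU/s_{\mathrm{het}}}=s_{\mathrm{het}}.$$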

We collected de novo and inherited PTVs in autism-spectrum disorder (ASD) probands from ∼4,000 parent–child trios (Kosmicki et al. 2017). For each gene, we computed the observed fraction of de novo PTVs, f^, and compared it to the deterministic estimate of the heterozygous selection coefficient, s^het. This analysis provides another independent and data-driven approach to test the validity of the shet estimates. Figure 2 shows the observed relation between s^het and f^. We find good agreement between s^het and f^ across a wide range of selection coefficients (s^het ≳ 0.002). We repeated this analysis for the deterministic shet estimates obtained from the entire ExAC data set (Cassa et al. 2017), which delivered comparable results (supplementary fig. 5, Supplementary Material online).

Fig. 2.

In the strong selection limit, shet is a predictor of the fraction of de novo PTVs, f. De novo fraction of PTV mutations was estimated for 6,203 (out of 16,279) genes with at least one PTV (de novo or transmitted) in an ASD cohort of ∼4,000 parent–child trios (y axis) and compared with the deterministic s^het derived from the NFE sample (x axis). Red dots denote individual genes (genes with f^ = 0 were assigned f^ = 2×10⁻⁴ for illustration purposes). Black squares connected by black lines denote the mean in bins along the x axis of logarithmic width Δlog[s^het] = 0.25 (number of genes per bin from left to right: {1, 10, 43, 148, 400, 811, 1,117, 1,158, 870, 597, 359, 360, 236, 90, 3}). Vertical error bars show the standard error of the mean per bin for f^. Corresponding error bars for s^het are smaller than the marker size. Gray line denotes the diagonal.

Comparison of s^het to Other Measures of Protein Constraint

Beyond the theoretical importance, evaluating selection on deleterious PTVs has practical applications in human genetics. Population-based measures of constraint, such as pLI (Lek et al. 2016) and RVIS (Petrovski et al. 2013), have been successfully used to prioritize genes in studies of neuropsychiatric and other diseases (Gussow et al. 2016). pLI measures the probability of a gene to be loss-of-function intolerant (Lek et al. 2016). This measure is based on a classification of genes into three categories and returns high values for genes under strong selection. By construction, this approach has limited resolution within this class of genes. Point estimation of shet characterizes the fitness loss beyond the binary classification of whether or not the gene is under constraint. Therefore, s^het has an advantage as a proxy for penetrance, disease age of onset and severity. This is illustrated in figure 3, together with a comparison of the predictive power of pLI of the fraction of de novo mutations (see also supplementary fig. 6, Supplementary Material online).

Fig. 3.

Comparison of per-gene selection estimates, s^het, with a measure of probability of loss-of-function intolerance, pLI (Lek et al. 2016). Shown is the correlation with independent measures of gene importance. (a–c) Data on disease severity, penetrance, and age of onset (x axes) for a set of 113 haploinsufficient disease-associated genes of high confidence (ClinGen Dosage Sensitivity Project) were compared with deterministic NFE s^het (top row) and pLI (bottom row) predictions (y axis). (d) The fraction of de novo PTVs (f^), shown in bins of width 0.25 for 5,930 genes with at least one transmitted or de novo PTV and a pLI annotation (x axis), was derived from an ASD trio-sequencing data set (top: NFE s^het, bottom: pLI). Note that due to the small number of PTVs per gene in the ASD cohort, the distribution of f^ on the range [0, 1] is not smooth. Red dots denote individual genes, gray boxes enclose the central quartiles of the distribution in each category, and black horizontal bars through gray boxes show the median. Note the logarithmic y axis in the top row, whereas the bottom row has a linear y axis.

Discussion

The existence of stochastic effects in finite populations has long been known (Wright 1931), predating diffusion theory approaches to the evolutionary dynamics of populations (Kimura 1964). However, the impact of genetic drift on the fate of a newly arising mutation was not always appreciated. Partially due to historically insufficient amounts of data, the deterministic forces of mutation and selection were often the focus of population genetics analyses, corresponding to working in the limit of infinite effective population size, Ne. The equilibrium state in that case is mutation–selection balance. Arguably, a locus evolving under strong negative selection may still be considered in mutation–selection balance even if the effective population size is finite (that is, if it is large enough to ensure 4Neshet ≫ 1). In our earlier work, we hypothesized that selection against heterozygous PTVs, scaled by Ne, would also be sufficiently strong to justify the assumption of mutation–selection balance (Cassa et al. 2017).

Here, we have investigated the full effects of drift using simulations of a realistic human demography. This approach models recurrent mutations, is gauged by the observed NFE data in the ExAC cohort, and does not make assumptions about equilibrium. As a result of our analysis, we found that the deterministic mutation–selection balance approximation for counts of rare deleterious PTVs is applicable for genes under strong to intermediate selection, including genes with the previously inferred global mean of s^het ≈ 0.06. Selection estimates for this class of genes are highly robust to the incorporation of drift effects. For genes under relatively weak selection, the deterministic shet estimates provide a stable ranking useful as prioritization scores for practical applications in human genetics (Cassa et al. 2017).

The ExAC data set used to estimate the deterministic shet given in Cassa et al. (2017) is composed of different subpopulations. Here, we focused on the largest subset, NFE, with its established demographic dynamics, to compare deterministic evolution of PTVs to a scenario under genetic drift. Population structure can cause an increase in the variance of allele frequency. In the case of lethal but highly recessive variants, the mean allele frequency can even be reduced below the deterministic expectation of the combined population (Wright 1937; Nei 1968; Glemin 2005). To address the effects of population structure on the deterministic shet estimates, we conducted two data-driven tests. First, we tested whether the PTV allele counts in the NFE and the African ExAC subsamples were compatible with being generated under the same deterministic Poisson model with identical selection strength. Second, we analyzed the relationship between our estimates of heterozygous selection coefficients based on the combined ExAC data set and the experimentally determined fractions of de novo mutations (supplementary fig. 5, Supplementary Material online). Both of these analyses are consistent with the simulation results, suggesting that the mutation–selection balance approximation is applicable to strongly to moderately selected genes.

Materials and Methods

PTVs in the NFE Subsample

In this analysis, we used the NFE subsample from the ExAC data set version 0.3 (Lek et al. 2016), a set of jointly called exomes from 33,370 individuals ascertained with no known severe, early-onset Mendelian disorders. The mean coverage depth was calculated for each gene (canonical transcript from Ensembl v75, GENCODE v19) in the ExAC data set (mean 57.75; SD 20.96). Genes with average coverage depth of at least 30× were used in further analyses (N = 17,199). Single nucleotide substitution variants annotated as PASS quality with predicted functional effects in the canonical transcript of “stop_gained,” “splice_donor,” or “splice_acceptor” (as annotated by Variant Effect Predictor) were included in the analysis. Variants such as indels, in-frame mutations, and frameshift variants were excluded from this analysis, as many of these variants may have annotation issues or may not be functionally impactful. Along the same lines, we are mindful that not all PTVs will result in the same effect on gene function, due, for example, to alternative transcripts or the position dependence of nonsense-mediated decay (Rivas et al. 2015). To address this, variants were filtered using LOFTEE and restricted to those predicted with high confidence to have consequences in the canonical transcript.

For each of the 17,199 genes, we have observable values for (k, n, U), where k denotes the total number of observed PTV alleles in the NFE population sample of n chromosomes covered in the gene, and U the PTV mutation rate across the canonical gene transcript from a mutational model (Samocha et al. 2014; Francioli et al. 2015). Values of U for each gene from Samocha et al. (2014) were used along with the number of well-covered chromosomes n in each gene to generate the null mutational expectation of neutral evolution, nU. Incorrectly specified values from this mutational model could alter estimates of selection for individual genes, as higher estimates of selection are made in genes with greater depletions from the null expectation model. Our inference of selection coefficients relies on the assumption that the cumulative population frequency of PTV mutations, X, is small owing to strong negative selection, so genes with X^=k/n>0.001 are omitted from the analysis, leaving 16,279 genes in the NFE subsample. On these genes, there are 115,651 NFE PTVs in total with a mean PTV count of 7.1, whereas the maximum is 66. Supplementary table 1, Supplementary Material online, contains the values k, n, and U for all 16,279 tested genes in the NFE subsample.

De Novo and Transmitted PTVs in an ASD Trio Sequencing Cohort

For the derivation of the de novo fraction, we downloaded 5,856 de novo variants from 3,982 individuals ascertained for ASD from Kosmicki et al. (2017). We also incorporated 72 published, validated ASD de novo variants from Krumm et al. (2015) and 119 ASD de novo coding variants from Werling et al. (2018), bringing the total number of de novo variants to 6,047 for ascertained individuals with ASD. One pair of the published ASD probands was a duplicate (10C112515 and 14621.p1), so we removed one of them (10C112515), which reduced the total number of individuals ascertained for ASD from 3,982 to 3,981 and their total number of de novo variants to 6,044. Lastly, following the protocol from The Deciphering Developmental Disorders Study (2017), we restricted the number of de novo variants to one variant per person, per gene, prioritizing the de novo variant with the most severe consequence, which removed 64 ASD de novo variants, bringing the final total to 5,980 ASD de novo variants. To restrict to PTVs, we included all variants with annotations “stop_gained,” “splice_donor_variant,” or “splice_acceptor_variant” (331 variants).

For the transmission analysis of segregating variants, we used all PTV variants with PASS quality in at least 95% of all 4,319 sequenced trios from the Autism Sequencing Consortium freeze v13 (see Kosmicki et al. 2017). This delivered a total of 22,666 distinct variants found in the parents, which were either transmitted or untransmitted to the probands ascertained for ASD. Because of the difference of 338 tested trios between the de novo and transmission data sets, we weighted each PTV category correspondingly for the estimation of the de novo fraction f^. For the NFE data set with 16,279 genes with X^ = k/n ≤ 0.001, the de novo PTVs were located on 265 genes, whereas the parental segregating mutations were distributed across 8,092 genes. We used the set of 6,203 genes with at least one PTV (transmitted or de novo) in the comparison with the shet estimates. In the case of shet estimated from the entire ExAC data set, of the 15,998 genes with X^ = k/n ≤ 0.001 (Cassa et al. 2017), 261 had de novo PTVs and 7,861 had segregating variants in the parents, and 6,000 genes with at least one mutation (transmitted or de novo) were tested.
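One plausible reading of the weighting step is to convert each PTV category into a per-trio rate before forming the fraction. The sketch below follows that reading; it is an assumption, not a description of the exact procedure used, and the function name and argument names are hypothetical. The default trio counts (3,981 and 4,319) are the cohort sizes stated above.

```python
def de_novo_fraction(de_novo_count, transmitted_count,
                     de_novo_trios=3981, transmission_trios=4319):
    """Per-gene de novo fraction with each category scaled to a per-trio rate.

    The per-trio normalization is an assumed interpretation of the
    weighting described in the text.
    """
    de_novo_rate = de_novo_count / de_novo_trios
    transmitted_rate = transmitted_count / transmission_trios
    total = de_novo_rate + transmitted_rate
    if total == 0:
        raise ValueError("gene has no PTVs in either category")
    return de_novo_rate / total
```

Under this normalization, a gene with one de novo and one transmitted PTV receives a fraction slightly above 0.5, because the de novo data set contains fewer trios.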

Analytical Derivation of PTV Count Variance

For most genes, protein-truncating alleles are both individually and collectively rare. We define the cumulative PTV allele frequency of a gene in the population, X = Σj xj, where the sum is over all PTV sites j on the gene with respective allele frequencies xj. This is motivated by the simplifying assumption of identical selection coefficients for all PTVs within a gene and the observation that the frequency of the vast majority of PTVs is extremely low, such that the occurrence of multiple variable sites within a gene on a single haplotype is also extremely low (2nxjxk < 1 for sample size n).

The frequency X is governed by the demographic dynamics of the population. For heterozygous selection coefficient shet and cumulative mutation rate U, it follows the allele frequency distribution φ(X;shet,U). Given frequency X in the population, we expect to see on average E[k] = nX mutations in a sample of n chromosomes, hence we arrive at equation (2),

$$P(k\mid s_{\mathrm{het}};n,U)=\sum_{X}P(k\mid X;n)\,P(X\mid s_{\mathrm{het}};U)=\sum_{X}\mathrm{Pois}(k;nX)\,\varphi(X;s_{\mathrm{het}},U).$$

Here, Poisson sampling with parameter nX again represents the limit of binomial sampling for small success probabilities xj and large sample size n. In particular, we are interested in how genetic drift affects the mean and variance of the PTV count k. For a given selective effect shet we obtain the expected value of k from equation (2),

$$\mathrm{E}[k]=n\,\mathrm{E}[X], \tag{3}$$

while the variance is given by

$$\mathrm{Var}[k]=n\,\mathrm{E}[X]+n^{2}\,\mathrm{Var}[X]. \tag{4}$$
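Equation (4) follows from the law of total variance. Conditional on X, the count k is Poisson with mean and variance both equal to nX, so that

$$\mathrm{Var}[k]=\mathrm{E}\big[\mathrm{Var}[k\mid X]\big]+\mathrm{Var}\big[\mathrm{E}[k\mid X]\big]=\mathrm{E}[nX]+\mathrm{Var}[nX]=n\,\mathrm{E}[X]+n^{2}\,\mathrm{Var}[X].$$

The first term is the Poisson sampling variance and the second is the contribution of drift through the variability of X.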

Analytical Approximation for Strong Selection on Heterozygous Sites

Both E[X] and Var[X] depend on the evolutionary dynamics of the population, but it is instructive to compare to the equilibrium case of strong purifying selection on heterozygous variation described by Nei (1968). Nei showed that in that case the theoretical allele frequency distribution can be approximated by a gamma distribution with shape parameter 4NeU and scale parameter 1/(4Neshet), where Ne again denotes the effective population size. Under this assumption, equations (3) and (4) become

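$$\mathrm{E}_{\mathrm{Nei}}[k]=\frac{nU}{s_{\mathrm{het}}} \tag{5}$$

and

$$\mathrm{Var}_{\mathrm{Nei}}[k]=\frac{nU}{s_{\mathrm{het}}}+n^{2}\left(\frac{U}{4N_{e}s_{\mathrm{het}}^{2}}\right)=\frac{nU}{s_{\mathrm{het}}}\left(1+\frac{n}{4N_{e}s_{\mathrm{het}}}\right). \tag{6}$$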

When we compare this to the analogous expressions for mean and variance under the deterministic assumption of mutation–selection balance,

$$\mathrm{E}_{\mathrm{det}}[k]=\mathrm{Var}_{\mathrm{det}}[k]=\frac{nU}{s_{\mathrm{het}}}, \tag{7}$$

we find the same expected value. For the excess variance we consequently expect to see a suppression with increasing effective size of the population from which the sample was drawn, as well as with increasing selection strength.
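Equations (5) and (6) can be checked numerically by drawing X from the gamma distribution described above and then Poisson-sampling k. The sketch below is illustrative only: the function names, the Knuth-style Poisson sampler, and the parameter values used in the usage note are assumptions chosen to make the excess variance easy to see, not values from the analysis.

```python
import math
import random


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for the modest means used here."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_counts(n, U, Ne, s_het, draws=50000, seed=1):
    """Draw X ~ Gamma(shape=4*Ne*U, scale=1/(4*Ne*s_het)), then k ~ Pois(n*X)."""
    rng = random.Random(seed)
    shape = 4.0 * Ne * U
    scale = 1.0 / (4.0 * Ne * s_het)
    return [sample_poisson(n * rng.gammavariate(shape, scale), rng)
            for _ in range(draws)]


def mean_and_variance(values):
    """Sample mean and unbiased sample variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return mean, var
```

With the hypothetical values n = 2,000, U = 10⁻⁵, Ne = 10,000, and shet = 0.01, equations (5) and (6) give E[k] = 2 and Var[k] = 2 × (1 + 5) = 12, and the simulated moments from `mean_and_variance(simulate_counts(2000, 1e-5, 10000, 0.01))` should land close to these values.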

Population Genetics Simulations

To understand the interaction between mutation, selection, genetic drift, and population sampling, we wrote a custom Wright–Fisher simulator appropriate for the parameter regimes of interest. We assumed infinite recombination, such that all sites evolve independently, in a panmictic population with a single value of genic (additive) natural selection per simulation, and a fixed mutation rate with no back mutations. We modeled genes as biallelic sites with a mutation rate associated with the sum of all PTV targets in the gene, $U$. For each simulated biallelic site, recurrent mutations were Poisson sampled with mean $AU$, where $A$ is the number of ancestral, that is, unmutated, chromosomes. For a given mutation rate and selection coefficient we generated 100,000 independent realizations, providing a distribution of biallelic frequencies. We simulated a dense grid of mutation rates ($U \in [10^{-7}, 7.16 \times 10^{-5}]$) associated with the PTV-specific mutation rate estimates provided in Samocha et al. (2014). For each biallelic mutation rate, we simulated a range of 12 selection coefficients ($s_{het} \in [5 \times 10^{-6}, 1]$).
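The ingredients listed above (genic selection, binomial drift, Poisson recurrent mutation with mean $AU$, no back mutation, independent biallelic sites) can be sketched in a few lines. This is not the authors' simulator: it is a minimal constant-size illustration that assumes NumPy, applies selection to chromosomes directly, and uses hypothetical names and toy parameter values.

```python
import numpy as np

def wright_fisher_frequencies(n_chrom, U, s_het, generations, replicates, rng):
    """Return derived-allele frequencies of independent biallelic sites."""
    derived = np.zeros(replicates, dtype=np.int64)  # start monomorphic
    for _ in range(generations):
        x = derived / n_chrom
        # Genic selection: derived copies have relative fitness 1 - s_het.
        x_sel = x * (1.0 - s_het) / (1.0 - s_het * x)
        # Genetic drift: binomial resampling of n_chrom chromosomes.
        derived = rng.binomial(n_chrom, x_sel)
        # Recurrent mutation without back mutation: Poisson with mean A*U,
        # where A is the number of ancestral (unmutated) chromosomes.
        ancestral = n_chrom - derived
        new_mutations = rng.poisson(ancestral * U)
        derived = derived + np.minimum(new_mutations, ancestral)
    return derived / n_chrom

rng = np.random.default_rng(1)
# Toy values (assumptions) chosen so that the run finishes quickly.
X = wright_fisher_frequencies(n_chrom=2_000, U=1e-4, s_het=0.1,
                              generations=300, replicates=5_000, rng=rng)
print("mean frequency:", X.mean(), "deterministic expectation U/s_het:", 1e-4 / 0.1)
```

For strong selection the mean simulated frequency should lie close to $U/s_{het}$, while the spread across replicates reflects genetic drift. A changing population size could be added by making `n_chrom` depend on the generation.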

We equilibrated simulations at the human ancestral population size of $N_e$ = 14,474 for a burn-in period of $10N_e$ = 144,740 generations. At this point, we simulated 5,921 generations of European demographic history, following the parameters specified in Tennessen et al. (2012). This corresponds to a constant population size for 3,880 generations, then a 1,120 generation bottleneck down to $2N_e$ = 3,722 individuals, then a 716 generation mild exponential growth phase with a growth rate of 0.00307, followed by a final exponential growth epoch of 205 generations. The final exponential growth phase was modeled three times with three distinct growth rates: growth = {0.0195, 0.024, 0.030}. The first two growth rates were chosen based on published values of the final European effective population size: 0.0195 from Tennessen et al. (2012) and 0.024 roughly corresponding to Gao and Keinan (2014), yielding final population sizes of around $N_e$ = 0.5 million and $N_e$ = 1.25 million, respectively. We estimated the third growth rate by simulating neutral variation through a dense range of growth rates and comparing to the observed ratio of singletons to all segregating sites for synonymous sites in the ExAC NFE sample. This comparison was performed using a mutation rate of $3.8 \times 10^{-9}$ for non-CpG transversions (Kong et al. 2012), and comparing to the fraction of non-CpG transversion singletons in the NFE sample. CpG and non-CpG transitions were explicitly ignored for these purposes, as they are known to exhibit the effects of recurrent mutations in the ExAC sample due to elevated mutation rates relative to non-CpG transversions. An effective population size growth rate of 0.03 per generation in the last exponential epoch provided a fraction of singletons of 0.636, which is highly consistent with the observed fraction of singletons in ExAC NFE of 0.633. Unsurprisingly, the final population size of $N_e$ = 4.3 million found by matching the fraction of singletons is consistent with a recent population size inference performed on the same data set (Harpak et al. 2016). We are mindful that we cannot exclude the effects of selection at linked sites affecting synonymous variants in the ExAC NFE sample. We simulated all mutation rate and selection strength combinations through all three European demographies, and found a distribution of biallelic frequencies (including the number of monomorphic sites) for each. The distributions corresponding to the demography matching the ExAC NFE data set (final growth rate of 0.03) were used in all subsequent analyses.
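The three final population sizes quoted above are mutually consistent if the last epoch is read as continuous exponential growth, $N(t) = N_0 e^{rt}$, over the same 205 generations. The sketch below checks this arithmetic, taking the rounded value of 0.5 million for the growth rate of 0.0195 as the reference. The exponential parameterization and all names are assumptions made for illustration.

```python
import math

FINAL_EPOCH_GENERATIONS = 205
REFERENCE_RATE = 0.0195        # growth rate from Tennessen et al. (2012)
REFERENCE_FINAL_SIZE = 0.5e6   # rounded final N_e quoted in the text

def final_size(growth_rate):
    """Final N_e implied by another growth rate over the same final epoch."""
    extra = (growth_rate - REFERENCE_RATE) * FINAL_EPOCH_GENERATIONS
    return REFERENCE_FINAL_SIZE * math.exp(extra)

sizes = {rate: final_size(rate) for rate in (0.0195, 0.024, 0.030)}
for rate, size in sizes.items():
    print(f"growth rate {rate}: final N_e of about {size / 1e6:.2f} million")
```

Under this reading, rates of 0.024 and 0.030 give roughly 1.26 million and 4.30 million, in line with the values of 1.25 million and 4.3 million stated in the text.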

Assessment of Variance in Comparison to Poisson Sampling Variance

To analyze the deviation from Poisson variance, we downsampled the final population of the simulation to the sample size of the NFE individuals in ExAC. The variance was computed over independent sites and compared with the theoretical mean of the Poisson distribution to assess the effects of genetic drift on the variance for a gene relative to the Poisson variance expected at mutation–selection balance.

Figure 1a shows the deviation of the coefficient of variation from the Poisson expectation due to the effects of a realistic demographic history (growth rate of 0.030, consistent with ExAC NFE observations). Simulations are shown for the non-CpG transversion mutation rate of $3.8 \times 10^{-9}$, as the inflation relative to the Poisson expectation should be independent of the mutation rate. Supplementary figure 1, Supplementary Material online, shows the same deviation in all three demographic models, suggesting that the slowest growth model (Tennessen et al. 2012) inferred from 1,351 European American exomes deviates at higher selection strengths than the demographic model associated with the much larger ExAC data set. However, all three demographies are qualitatively consistent in their behavior, with the departure from the Poisson assumption (variance to mean ratio > 1.5) occurring at roughly $s_{het}$ = 0.02, 0.03, and 0.06 for growth rates of 0.030, 0.024, and 0.0195, respectively.

Hierarchical Model to Estimate $P(s_{het};\theta)$

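From equations (1) and (2) we derive the expression for the PTV count distribution when $s_{het}$ is drawn from $P(s_{het};\theta)$ with parameter vector $\theta$:

$$P(k; \theta, n, U) = \int ds_{het}\, P(s_{het}; \theta) \sum_X \mathrm{Pois}(k; nX)\, \varphi(X; s_{het}, U). \tag{8}$$

At mutation–selection balance, $X_{eq} = U/s_{het}$ and $\varphi(X; s_{het}, U) = \delta_{X,\,U/s_{het}}$, where $\delta$ denotes the Kronecker delta, yielding

$$P(k; \theta, n, U) = \int ds_{het}\, P(s_{het}; \theta)\, \mathrm{Pois}(k; nU/s_{het}), \tag{9}$$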

which exactly reproduces equation (4) of Cassa et al. (2017).
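In practice the integral over $s_{het}$ in equation (9) can be replaced by a weighted sum over a discrete grid. The sketch below shows this for the mutation–selection balance case. The coarse four-point grid, its weights, and the values of $n$ and $U$ are hypothetical and serve only to illustrate the structure of the mixture.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability, evaluated in log space to avoid overflow."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def count_pmf_balance(k, n, U, s_grid, weights):
    """Equation (9) with the integral over s_het replaced by a weighted sum."""
    return sum(w * poisson_pmf(k, n * U / s) for s, w in zip(s_grid, weights))

s_grid = [0.001, 0.01, 0.1, 1.0]   # hypothetical coarse grid of s_het values
weights = [0.1, 0.3, 0.4, 0.2]     # hypothetical probability masses, summing to 1
n, U = 66_740, 1e-6                # illustrative sample size and mutation rate

pmf = [count_pmf_balance(k, n, U, s_grid, weights) for k in range(400)]
mean_k = sum(k * p for k, p in enumerate(pmf))
print("total probability:", sum(pmf))
print("expected PTV count:", mean_k)
```

The expected count is the weighted average of $nU/s_{het}$ over the grid, so genes under weak selection dominate the upper tail of the count distribution.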

Fit to the PTV Count Distribution Using Simulations of a Realistic Demography

To evaluate the differences between the selection inference under the original mutation–selection balance assumption, equation (9), and the full model incorporating genetic drift, we applied equation (8) to fit the observed distribution of per-gene PTV counts. Because of the complex demographic histories of the different subpopulations constituting the ExAC data set, particularly substantial recent population expansions, the simple equilibrium approximation by Nei (1968) is not sufficient to substitute for $\varphi(X; s_{het}, U)$. Instead, we used the population allele frequency spectrum obtained from forward simulations described above, based on published demographies for the largest subpopulation, NFEs. Since $\varphi(X; s_{het}, U)$ depends on $s_{het}$, we used the simulated allele frequency spectra for each of 12 discrete values spanning six orders of magnitude: $s_{het} = \{5 \times 10^{-6}, 1.6 \times 10^{-5}, 5 \times 10^{-5}, 1.6 \times 10^{-4}, 5 \times 10^{-4}, 1.6 \times 10^{-3}, 5 \times 10^{-3}, 1.6 \times 10^{-2}, 5 \times 10^{-2}, 1.6 \times 10^{-1}, 5 \times 10^{-1}, 1\}$, and rewrote the integral in equation (8) as a sum. Genes were binned by their cumulative mutation rate $U$, and for each $s_{het}$ we simulated $\varphi(X; s_{het}, U)$ for $U$-bins of width $10^{-7}$ if $U < 4 \times 10^{-6}$, of width $2 \times 10^{-7}$ for $4 \times 10^{-6} < U < 1.4 \times 10^{-5}$, and for exact $U$ values above. Each spectrum contained 100,000 random instances of simulated frequencies $X$.

Parameters of the discretized and renormalized $P(s_{het};\theta)$ were estimated by maximizing the likelihood ($L$), as described in Cassa et al. (2017). We compared the fits from three different two-parameter functional forms for $P(s_{het};\theta)$ ($\log L$ in brackets): inverse Gaussian (−48417.4), gamma (−48419.0), and inverse gamma distribution (−48477.3). As for the deterministic scenario, we found that the highest likelihood model is the inverse Gaussian distribution,

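$$\mathrm{IG}(s_{het}; \theta = (\alpha, \beta)) = \sqrt{\frac{\beta}{2\pi s_{het}^3}}\; e^{-\beta (s_{het} - \alpha)^2 / (2 s_{het} \alpha^2)}, \tag{10}$$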

with mean $\alpha$ and shape parameter $\beta$. For comparison, we also give the inferred parameters of the gamma distribution, $\mathrm{Gamma}(s_{het}; a, b) = \frac{1}{b^a\,\Gamma(a)}\, e^{-s_{het}/b}\, s_{het}^{a-1}$, with estimated shape parameter $a = 0.406$ and scale parameter $b = 0.099$. Because the tested distributions are defined on the positive real axis, but the heterozygous selection coefficient $s_{het} \le 1$, $P(s_{het};\theta)$ was truncated at $s_{max} = 2.5$, corresponding to a probability density of <0.01. We find a slight dependence of the inferred parameters $\hat{\theta}$ on $s_{max}$ ($\bar{\hat{\alpha}} = 0.057 \pm 0.004$, $\bar{\hat{\beta}} = 0.0070 \pm 0.0005$; $s_{max} \in \{1, 1.5, 2, 2.5\}$), while the rank correlation changes by <0.1%. Supplementary figure 2, Supplementary Material online, shows the resulting fit, $P(s_{het};\hat{\theta})$, in comparison to the fit obtained under the mutation–selection balance assumption. As expected, the incorporation of genetic drift entails a moderate shrinkage of the variance (by 18%), as under the deterministic assumption its effects on PTV count variance are partly absorbed by $P(s_{het};\hat{\theta})$. Per-gene estimates were derived as the mean of the posterior distribution:

$$\hat{s}_{het} = \frac{1}{P(k; \hat{\theta}, n, U)} \int ds_{het}\; s_{het}\, P(s_{het}; \hat{\theta}) \sum_X \mathrm{Pois}(k; nX)\, \varphi(X; s_{het}, U). \tag{11}$$

Supplementary table 1, Supplementary Material online, provides the $s_{het}$ estimates as well as the 95% credible intervals derived from the posterior distribution. In the case of the drift-inclusive estimates, these intervals are constrained to the 12 logarithmically discretized values of $s_{het}$ used in the inference.
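The posterior mean in equation (11) can be illustrated on the 12-value grid with the inverse Gaussian of equation (10) as the prior. The sketch below makes two simplifying assumptions that differ from the full analysis: the simulated spectrum $\varphi$ is replaced by a point mass at $U/s_{het}$ (the deterministic limit), and the prior mass at each grid point is taken proportional to the density times $s_{het}$, as for equal-width bins in $\log s_{het}$. The parameter values are the averages quoted in the text and are used purely as example inputs; all names are hypothetical.

```python
import math

# The 12 discrete s_het values quoted in the text.
S_GRID = [5e-6, 1.6e-5, 5e-5, 1.6e-4, 5e-4, 1.6e-3,
          5e-3, 1.6e-2, 5e-2, 1.6e-1, 5e-1, 1.0]

def log_inverse_gaussian(s, alpha, beta):
    """Logarithm of equation (10), with mean alpha and shape beta."""
    return (0.5 * (math.log(beta) - math.log(2.0 * math.pi) - 3.0 * math.log(s))
            - beta * (s - alpha) ** 2 / (2.0 * s * alpha ** 2))

def log_poisson(k, lam):
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def posterior_mean_s_het(k, n, U, alpha, beta):
    """Discretized equation (11), with phi replaced by a point mass at U/s_het."""
    log_post = [log_inverse_gaussian(s, alpha, beta) + math.log(s)  # prior mass
                + log_poisson(k, n * U / s)                         # likelihood
                for s in S_GRID]
    shift = max(log_post)                     # stabilize the exponentials
    post = [math.exp(lp - shift) for lp in log_post]
    total = sum(post)
    return sum(s * p for s, p in zip(S_GRID, post)) / total

n, U = 66_740, 1e-5            # illustrative sample size and mutation rate
alpha, beta = 0.057, 0.0070    # example inputs taken from the text
s_hat_k0 = posterior_mean_s_het(0, n, U, alpha, beta)
s_hat_k50 = posterior_mean_s_het(50, n, U, alpha, beta)
print("k = 0:", s_hat_k0, " k = 50:", s_hat_k50)
```

A gene with no observed PTVs despite a sizable mutational target receives a high estimate, whereas a gene with many observed PTVs is pulled toward the grid value whose expected count $nU/s_{het}$ is closest to the observation.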

Relationship between $s_{het}$ and the De Novo Fraction of PTVs

In the limit of strong selection and at mutation–selection balance, the expectation for the cumulative PTV population allele frequency on a gene is $U/s_{het}$. The relative fitness of an individual with a segregating PTV mutation in the current generation is $1 - s_{het}$, rendering the probability for a transmitted PTV in the next generation

$$\Pr(\text{transmitted PTV}) \approx 2\,\frac{U}{s_{het}}\,(1 - s_{het}),$$

where we assumed $U \approx 10^{-6} \ll 1$ and the strong selection approximation $U/s_{het} < 10^{-3} \ll 1$. The probability for a de novo PTV mutation to occur is

$$\Pr(\text{de novo PTV}) \approx 2U. \tag{12}$$

From this, we compute the fraction of de novo mutations among all PTVs in the limit of strong selection:

$$f = \frac{\Pr(\text{de novo PTV})}{\Pr(\text{de novo PTV}) + \Pr(\text{transmitted PTV})} \approx s_{het}. \tag{13}$$

More generally, we can derive the fraction of de novo mutations from the equilibrium allele frequency spectrum for any value of $s_{het}$. At mutation–selection–drift balance and in the limit of small mutation rates, the derived allele frequency spectrum is approximately given by

$$\varphi_{eq}(X; s_{het}, U, N_e) = \frac{1 - e^{4N_e s_{het}(1-X)}}{1 - e^{4N_e s_{het}}}\; \frac{4N_eU}{X(1-X)}. \tag{14}$$

Hence, we obtain for the probability of inheritance

$$\Pr(\text{transmitted PTV}) \approx (1 - s_{het}) \int_0^1 2X(1-X)\, \varphi_{eq}(X; s_{het}, U, N_e)\, dX = (1 - s_{het}) \left( \frac{2U}{s_{het}} + \frac{8N_eU}{1 - e^{4N_e s_{het}}} \right). \tag{15}$$

Supplementary figure 4, Supplementary Material online, shows the resulting expression for the de novo fraction as a function of $s_{het}$. Our estimates of the recent effective population size, which is the most relevant for rare deleterious mutations, are on the order of millions. We therefore expect the one-to-one correspondence between $s_{het}$ and $f$ to hold across the largest part of the selection regime. This renders the de novo fraction of PTV alleles an excellent cross-check for the $s_{het}$ estimates.
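The dependence of the de novo fraction on $s_{het}$ can be evaluated directly from equations (12), (13), and (15). The sketch below reads the exponent in equation (15) as $+4N_e s_{het}$, which is the sign that recovers the strong-selection limit $2U/s_{het}$, and rewrites the drift term algebraically so that large values of $4N_e s_{het}$ do not overflow. Function names and input values are illustrative assumptions.

```python
import math

def transmitted_probability(s_het, U, N_e):
    """Equation (15), rearranged to stay finite for large 4*N_e*s_het."""
    a = 4.0 * N_e * s_het
    # 8*N_e*U / (1 - e^{a}) is rewritten as -8*N_e*U * e^{-a} / (1 - e^{-a}).
    drift_term = -8.0 * N_e * U * math.exp(-a) / (-math.expm1(-a))
    return (1.0 - s_het) * (2.0 * U / s_het + drift_term)

def de_novo_fraction(s_het, U, N_e):
    """Fraction of de novo PTVs among all PTVs, following equation (13)."""
    p_de_novo = 2.0 * U  # equation (12)
    return p_de_novo / (p_de_novo + transmitted_probability(s_het, U, N_e))

U = 1e-6  # illustrative cumulative PTV mutation rate
f_strong = de_novo_fraction(0.1, U, N_e=1_000_000)
f_weak = de_novo_fraction(1e-6, U, N_e=10_000)
print("strong selection, large N_e:", f_strong)
print("weak selection, small N_e:", f_weak)
```

For an effective population size in the millions and $s_{het} = 0.1$, the fraction equals $s_{het}$ to numerical precision. When $4N_e s_{het}$ is small, $f$ departs from $s_{het}$, which is the regime where the one-to-one relationship breaks down.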

Supplementary Material

Supplementary data are available at Molecular Biology and Evolution online.

Acknowledgments

We kindly thank Brian Charlesworth and William Hill for their interest in our work and for inspiring this study. We would like to thank Jeremy Berg, Guy Sella, Molly Przeworski, and Alexey Kondrashov for their helpful suggestions and feedback. This work was supported by the National Institutes of Health (GM127131, MH101244, HG009088).

References

  1. Browning SR, Browning BL. 2015. Accurate non-parametric estimation of recent effective population size from segments of identity by descent. Am J Hum Genet. 97(3):404–418.
  2. Bürger R, Wagner GP, Stettinger F. 1989. How much heritable variation can be maintained in finite populations by mutation–selection balance? Evolution 43(8):1748–1766.
  3. Cassa CA, Weghorn D, Balick DJ, Jordan DM, Nusinow D, Samocha KE, O’Donnell-Luria A, MacArthur DG, Daly MJ, Beier DR, et al. 2017. Estimating the selective effects of heterozygous protein-truncating variants from human exome data. Nat Genet. 49(5):806.
  4. Cassa CA, Weghorn D, Balick DJ, Jordan DM, Nusinow D, Samocha KE, O’Donnell-Luria A, MacArthur DG, Daly MJ, Beier DR, et al. 2019. Reply to selective effects of heterozygous protein-truncating variants. Nat Genet. 51(1):3.
  5. Charlesworth B, Hill WG. 2019. Selective effects of heterozygous protein-truncating variants. Nat Genet. 51(1):2.
  6. Do R, Balick D, Li H, Adzhubei I, Sunyaev S, Reich D. 2015. No evidence that selection has been less effective at removing deleterious mutations in Europeans than in Africans. Nat Genet. 47(2):126.
  7. Fisher RA. 1930. The genetical theory of natural selection: a complete variorum edition. Clarendon Press, Oxford: Oxford University Press.
  8. Francioli LC, Polak PP, Koren A, Menelaou A, Chun S, Renkens I, Van Duijn CM, Swertz M, Wijmenga C, Van Ommen G, et al. 2015. Genome-wide patterns and properties of de novo mutations in humans. Nat Genet. 47(7):822.
  9. Gao F, Keinan A. 2014. High burden of private mutations due to explosive human population growth and purifying selection. BMC Genomics 15(4):S3.
  10. Glemin S. 2005. Lethals in subdivided populations. Genet Res. 86(1):41–51.
  11. Gussow AB, Petrovski S, Wang Q, Allen AS, Goldstein DB. 2016. The intolerance to functional genetic variation of protein domains predicts the localization of pathogenic mutations within genes. Genome Biol. 17(1):9.
  12. Harpak A, Bhaskar A, Pritchard JK. 2016. Mutation rate variation is a primary determinant of the distribution of allele frequencies in humans. PLoS Genet. 12(12):e1006489.
  13. Keinan A, Clark AG. 2012. Recent explosive human population growth has resulted in an excess of rare genetic variants. Science 336(6082):740–743.
  14. Kimura M. 1964. Diffusion models in population genetics. J Appl Probab. 1(2):177–232.
  15. Kong A, Frigge ML, Masson G, Besenbacher S, Sulem P, Magnusson G, Gudjonsson SA, Sigurdsson A, Jonasdottir A, Jonasdottir A, et al. 2012. Rate of de novo mutations and the importance of father's age to disease risk. Nature 488(7412):471.
  16. Kosmicki JA, Samocha KE, Howrigan DP, Sanders SJ, Slowikowski K, Lek M, Karczewski KJ, Cutler DJ, Devlin B, Roeder K, et al. 2017. Refining the role of de novo protein-truncating variants in neurodevelopmental disorders by using population reference samples. Nat Genet. 49(4):504.
  17. Krumm N, Turner TN, Baker C, Vives L, Mohajeri K, Witherspoon K, Raja A, Coe BP, Stessman HA, He Z-X, et al. 2015. Excess of rare, inherited truncating mutations in autism. Nat Genet. 47(6):582.
  18. Lande R. 1976. Natural selection and random genetic drift in phenotypic evolution. Evolution 30(2):314–334.
  19. Lek M, Karczewski KJ, Minikel EV, Samocha KE, Banks E, Fennell T, O’Donnell-Luria AH, Ware JS, Hill AJ, Cummings BB, et al. 2016. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536(7616):285.
  20. Nei M. 1968. The frequency distribution of lethal chromosomes in finite populations. Proc Natl Acad Sci USA. 60(2):517–524.
  21. Petrovski S, Wang Q, Heinzen EL, Allen AS, Goldstein DB. 2013. Genic intolerance to functional variation and the interpretation of personal genomes. PLoS Genet. 9(8):e1003709.
  22. Przyborowski J, Wilenski H. 1940. Homogeneity of results in testing samples from Poisson series. Biometrika 31(3/4):313–323.
  23. Rivas MA, Pirinen M, Conrad DF, Lek M, Tsang EK, Karczewski KJ, Maller JB, Kukurba KR, DeLuca DS, Fromer M, et al. 2015. Effect of predicted protein-truncating genetic variants on the human transcriptome. Science 348(6235):666–669.
  24. Samocha KE, Robinson EB, Sanders SJ, Stevens C, Sabo A, McGrath LM, Kosmicki JA, Rehnström K, Mallick S, Kirby A, et al. 2014. A framework for the interpretation of de novo mutation in human disease. Nat Genet. 46(9):944.
  25. Tennessen JA, Bigham AW, O’Connor TD, Fu W, Kenny EE, Gravel S, McGee S, Do R, Liu X, Jun G, et al. 2012. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science 337(6090):64–69.
  26. The Deciphering Developmental Disorders Study. 2017. Prevalence and architecture of de novo mutations in developmental disorders. Nature 542(7642):433.
  27. Werling DM, Brand H, An J-Y, Stone MR, Zhu L, Glessner JT, Collins RL, Dong S, Layer RM, Markenscoff-Papadimitriou E, et al. 2018. An analytical framework for whole-genome sequence association studies and its implications for autism spectrum disorder. Nat Genet. 50(5):727–736.
  28. Williamson SH, Hernandez R, Fledel-Alon A, Zhu L, Nielsen R, Bustamante CD. 2005. Simultaneous inference of selection and population growth from patterns of variation in the human genome. Proc Natl Acad Sci USA. 102(22):7882–7887.
  29. Wright S. 1931. Evolution in Mendelian populations. Genetics 16(2):97.
  30. Wright S. 1937. The distribution of gene frequencies in populations. Proc Natl Acad Sci USA. 23(6):307–320.

Articles from Molecular Biology and Evolution are provided here courtesy of Oxford University Press
