Genetics. 2022 Jul 14;221(4):iyac097. doi: 10.1093/genetics/iyac097

Local fitness and epistatic effects lead to distinct patterns of linkage disequilibrium in protein-coding genes

Aaron P Ragsdale
Editor: A Agrawal
PMCID: PMC9339331  PMID: 35736370

Abstract

Selected mutations interfere and interact with evolutionary processes at nearby loci, distorting allele frequency trajectories and creating correlations between pairs of mutations. Recent studies have used patterns of linkage disequilibrium between selected variants to test for selective interference and epistatic interactions, with some disagreement over interpreting observations from data. Interpretation is hindered by a lack of analytic or even numerical expectations for patterns of variation between pairs of loci under the combined effects of selection, dominance, epistasis, and demography. Here, I develop a numerical approach to compute the expected two-locus sampling distribution under diploid selection with arbitrary epistasis and dominance, recombination, and variable population size. I use this to explore how epistasis and dominance affect expected signed linkage disequilibrium, including for nonsteady-state demography relevant to human populations. Using whole-genome sequencing data from humans, I explore genome-wide patterns of linkage disequilibrium within protein-coding genes. I show that positive linkage disequilibrium between missense mutations within genes is driven by strong positive allele-frequency correlations between mutations that fall within the same annotated conserved domain, pointing to compensatory mutations or antagonistic epistasis as the prevailing mode of interaction within conserved genic elements. Linkage disequilibrium between missense mutations is reduced outside of conserved domains, as expected under Hill–Robertson interference. This variation in both mutational fitness effects and selective interactions within protein-coding genes calls for more refined inferences of the joint distribution of fitness and interactive effects, and the methods presented here should prove useful in that pursuit.

Keywords: epistasis, interference, linkage disequilibrium, diffusion approximation, distribution of fitness effects

Introduction

Most new mutations that affect fitness are deleterious and will be eliminated from a population. The amount of time that a deleterious mutation segregates depends on the strength of selection against genomes that carry it, with very damaging mutations kept at low frequencies and purged relatively rapidly. In the time between mutation and fixation or loss, selected variants can dramatically impact patterns of variation in nearby regions (Smith and Haigh 1974; Charlesworth et al. 1995; Kim and Stephan 2000). This distortion away from neutral expectations has been empirically documented using sequencing data from an ever-growing set of study systems (e.g. Novembre and Di Rienzo 2009; Cutter and Payseur 2013; Comeron 2014), but questions remain about the primary mode of interactions between selected variants and their joint effects on genome-wide patterns of diversity.

In their foundational paper, Hill and Robertson (1966) recognized that linked selected variants reciprocally impede the efficacy of selection at each locus, a process known as selective interference. Linked selection reduces the fixation probability of advantageous mutations and increases that of deleterious mutations compared to expectations under single-locus models (Birky and Walsh 1988). Allele frequency dynamics and correlations of linked selected variants are also predicted to deviate from single-locus expectations. Under a multiplicative fitness model, where the fitness reduction of a genome carrying multiple deleterious variants is equal to the product of the fitness reduction of each independent mutation, net linkage disequilibrium (LD) is expected to be zero for unlinked sites (Kondrashov 1995). However, mutations at linked loci are expected to segregate on different haplotypes more often than together, leading to negative, or repulsion, LD, and the magnitude of LD depends nontrivially on the strength of selection and the probability of recombination separating loci (Hill and Robertson 1966; McVean and Charlesworth 2000).

Nonadditive effects, including dominance (i.e. interactions within a locus) and epistasis (interactions between loci), further complicate our evolutionary models. A large fraction of nonsynonymous coding mutations are thought to be at least partially recessive (Agrawal and Whitlock 2011; Huber et al. 2018), with average levels of dominance correlating with strength of selection (Kacser and Burns 1981), and dominance plays an important role in shaping expected equilibrium allele frequencies and the mutation load of strongly damaging disease mutations (Clark 1998). On the other hand, epistasis differentially impacts the deleterious load in asexually and sexually reproducing organisms (Kimura and Maruyama 1966; Kondrashov 1995), has been invoked as an explanation for the evolutionary advantage of sex (Kondrashov 1982; Charlesworth 1990; Barton and Charlesworth 1998), and can drive incompatibilities that lead to postzygotic isolation during the process of speciation (Turelli and Orr 2000). Within populations, epistasis is known to cause signed LD to deviate dramatically from zero (Charlesworth 1990; Kondrashov 1995). However, despite appreciation of the effect of dominance on linked variation (Turelli and Orr 2000; Zhao and Charlesworth 2016) and the evolutionary importance of epistatic interactions, we currently lack models for predicting patterns of correlations between linked mutations under general selection models.

Here I develop a numerical approach to solve for the two-locus sampling distribution under a general diploid selection model with variable recombination and single-population size history. I use this model to describe how epistasis and dominance shape expected patterns of signed LD, under both steady-state and nonequilibrium demography, that have been used to test for interference and epistasis in population genomic data. I then turn to human sequencing data and compare patterns of LD for synonymous, missense, and loss-of-function mutations in protein-coding genes and annotated conserved domains. I show that while synonymous and missense variants display similar slightly positive average LD within genes, for missense mutations this signal is driven by correlations between pairs of mutations within, but not between or outside of, conserved annotated elements. This suggests an importance for antagonistic epistasis or a prevalence of compensatory nonsynonymous mutations within conserved gene domains.

Empirical observations

The most direct way to test for interactions between linked selected variants is through deep mutation scanning experiments, in which many distinct mutations are introduced within a target gene and then organismal fitness or protein function is experimentally measured (Romero and Arnold 2009; Bank et al. 2015; Puchta et al. 2016; Steinberg and Ostermeier 2016). Using the model system of the TEM-1 β-lactamase gene in E. coli, Bershtein et al. (2006) found evidence for synergistic epistasis, in which the combined effect of multiple deleterious mutations on individual fitness was larger than would be expected from multiplying the independent observed effects of each individual mutation. The scale of mutation scanning experiments continues to increase, promising greater resolution of the fitness landscape in such model systems that can be compared to evolutionary theory (Otwinowski et al. 2018).

Directed mutational studies are not possible in most natural populations, and we must turn to population genetic approaches to infer selective interactions between observed segregating polymorphisms. Motivated by theoretical predictions that linked negatively selected mutations will display negative LD due to interference (Hill and Robertson 1966) and that epistasis will drive expected LD away from zero, multiple recent studies have used patterns of LD within classes of putatively selected variants to infer modes of selective interactions. Callahan et al. (2011) observed that pairs of tightly linked nonsynonymous mutations cluster together more than expected along lineages in the Drosophilid species complex, and that those clustered mutations tend to preserve the charge of the protein and were in positive LD compared to pairs of synonymous mutations at the same distance. From this, they proposed that compensatory nonsynonymous variants regularly arise and are maintained. More recently, Taverner et al. (2020) replicated this finding across a diverse set of genera, showing that such epistatic interactions are important in protein evolution.

Sohail et al. (2017) observed negative LD between loss-of-function variants in protein-coding genes (loss-of-function mutations include stop gains and losses, frameshifts, and other nonsense mutations) in both human and fruit fly populations. This was interpreted as evidence for widespread synergistic epistasis between these mutations, in which the fitness reduction of multiple mutations is greater than the product of that of each individual mutation independently. Both Sandler et al. (2021) and Garcia and Lohmueller (2021) recently reevaluated patterns of LD between coding variants in humans, fruit flies, and Capsella grandiflora, suggesting interference and dominance could be driving patterns of LD (Garcia and Lohmueller 2021) and questioning whether LD between loss-of-function variants is significantly different from zero (Sandler et al. 2021).

Many factors impede our interpretation of patterns of signed LD between coding variants. First, for strongly deleterious or loss-of-function mutations, their low allele frequencies mean that measurements of LD and other diversity statistics are very noisy. Second, comparisons are based on theory with limiting assumptions, such as steady-state demography, simple selection and interaction models, or unlinked loci. To generate predictions under more complex models, we rely on expensive forward simulations. Such simulations can help build intuition and be used to test inference methods, but they do not efficiently provide expectations for quantities of interest across the range of relevant parameters. Analytical and numerical methods for expected haplotype frequencies and LD under general selective interaction models are thus crucial for interpreting patterns of variation observed in data.

Methods

Existing theory and numerical methods

Many well-known properties of LD come from early work on the multilocus diffusion approximation (Kimura 1955; Hill and Robertson 1968; Ohta and Kimura 1969, 1971). This includes the result that genome-wide averages of signed LD are expected to be zero under neutrality. Under a two-locus biallelic model, where the left locus allows alleles A and a and the right locus allows alleles B and b, the standard covariance measure of LD is defined as $D = f_{AB} - f_A f_B$, where $f_{AB}$ is the haplotype frequency of types carrying both A and B, and $f_A$ and $f_B$ are the marginal frequencies of those alleles at each locus. This covariance decays due to both drift and recombination at a rate proportional to the inverse of the effective population size and the distance separating loci (Hill and Robertson 1968):

$$E[D]_{t+1} = \left(1 - \frac{1}{2N_e(t)} - r\right) E[D]_t.$$

While $E[D] = 0$, the variance of $D$ is nonzero, and Ohta and Kimura (1971) found that the variance of $D$ under neutrality and steady-state demography, normalized by the joint heterozygosity of the two loci, is

$$\sigma_d^2 = \frac{E[D^2]}{E[f_A(1-f_A)f_B(1-f_B)]} = \frac{5 + \rho/2}{11 + 13\rho/2 + \rho^2/2}, \tag{1}$$

where $\rho = 4N_e r$.
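As a quick numerical illustration of the two neutral results above, the following minimal sketch evaluates one generation of the decay of $E[D]$ and the Ohta and Kimura expectation in Equation (1). The function names are hypothetical and chosen here for illustration; they are not part of any published software.

```python
def expected_D_next(D_t, N_e, r):
    """One generation of decay of E[D] through drift (1/(2*N_e)) and recombination (r)."""
    return (1.0 - 1.0 / (2.0 * N_e) - r) * D_t


def sigma_d2_neutral(rho):
    """Neutral, steady-state expectation of sigma_d^2 (Equation 1), with rho = 4*N_e*r."""
    return (5.0 + rho / 2.0) / (11.0 + 13.0 * rho / 2.0 + rho ** 2 / 2.0)


if __name__ == "__main__":
    # sigma_d^2 starts at 5/11 for fully linked loci and decays as rho grows.
    for rho in (0.0, 1.0, 10.0, 100.0):
        print(f"rho = {rho:6.1f}  sigma_d^2 = {sigma_d2_neutral(rho):.4f}")
```

For fully linked loci ($\rho = 0$) the expression reduces to 5/11, and for large $\rho$ it falls off approximately as $1/\rho$.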

Analytic progress beyond these results has come haltingly. In the 1980s, recursions were developed to compute the two-locus sampling distribution under neutrality (Golding 1984), and this approach later formed the foundation for the inference of local recombination rates from population genetic data (Hudson 2001; McVean et al. 2004). More recently, Song and Song (2007) computed $E[r^2]$ using a diffusion approximation approach, although their solution involves the summation of infinitely many terms and is restricted to neutrality and steady-state demography.

To include selection, there have been relatively few advances beyond the Monte Carlo simulation approach taken by Hill and Robertson (1966), albeit now with more powerful computational resources and sophisticated software for performing flexible forward simulation (e.g. Haller and Messer 2019; Thornton 2019). Analytic results for two-locus distributions under selection are notoriously difficult, with a few notable flashes of progress. For example, McVean (2007) considered the effect of a recent sweep on patterns of LD between neutral loci near the locus under selection, and in a recent paper, Good (2022) presented analytic solutions for patterns of LD between rare mutations under additive selection with epistasis. Nonetheless, such approaches are typically confined to steady-state demography and constrained selection models.

Numerical methods inhabit the space between expensive discrete simulations and limited analytic solutions, providing a more efficient and practical method to compute expectations of two-locus diversity measures under a wider range of parameters and demographic scenarios. Ragsdale and Gutenkunst (2017) used a finite differences approach to numerically solve the two-locus diffusion equation with additive selection at either locus, and Ragsdale and Gravel (2019) more recently extended the Hill and Robertson (1968) system for $E[D^2]$ to compute arbitrary moments of the distribution of $D$ for any number of populations connected by gene flow. They also showed that such a moment construction can be used to solve for the two-locus sampling distribution within a single population, though it requires a moment-closure approximation for nonzero recombination and selection. Below, I extend this approach to model arbitrary diploid selection, which encompasses dominance, epistasis, and other forms of selective interactions between two loci. In a study concurrent with this paper, Friedlander and Steinrücken (2022) developed numerical solutions to the same moment system, which they used to describe selected haplotype trajectories and the distortion of neutral diversity at loci variably linked to beneficial alleles that sweep to high frequencies under non-equilibrium demography.

The 2-locus sampling distribution with arbitrary selection

The two-locus sampling distribution is the direct analog to the single-locus site-frequency spectrum (SFS) of a given sample size (Fig. 1). The two-locus distribution $\Psi_n$ stores the density or number of pairs of loci with observed haplotype counts in a sample of size $n$, so that $\Psi_n(i,j,k)$ is the number of pairs for which we observe $i$ copies of the AB haplotype, $j$ of type Ab, $k$ of type aB, and $n-i-j-k$ of type ab. The size of $\Psi_n$ grows rapidly, with $O(n^3)$ entries, which practically limits computational approaches to moderate sample sizes $n \lesssim 100$ and a single population.
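To make the $O(n^3)$ growth concrete, the sketch below enumerates every haplotype-count configuration $(i, j, k)$ indexed by $\Psi_n$ and compares the count to the closed form $\binom{n+3}{3}$. The function names are hypothetical, and the enumeration includes all configurations (also those in which one or both loci are monomorphic), which is an assumption about bookkeeping rather than a statement about any particular implementation.

```python
def two_locus_configurations(n):
    """Yield (i, j, k): counts of AB, Ab, and aB haplotypes in a sample of size n.

    The count of ab haplotypes is implied as n - i - j - k.
    """
    for i in range(n + 1):
        for j in range(n + 1 - i):
            for k in range(n + 1 - i - j):
                yield (i, j, k)


def num_entries(n):
    """Closed-form number of configurations, C(n + 3, 3), which grows as O(n^3)."""
    return (n + 1) * (n + 2) * (n + 3) // 6


if __name__ == "__main__":
    for n in (10, 30, 80, 100):
        print(n, num_entries(n))
```

At $n = 100$ this gives 176,851 entries for a single pair of loci in a single population, which illustrates why larger samples or additional populations quickly become impractical.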

Fig. 1.

Sampling distributions and their summaries. Low-order summaries of sampling distributions are commonly computed for allele frequencies (a, the SFS) and two-locus haplotype distributions (b, LD). Demographic and selective processes affect both the SFS and LD, and observations of nonzero values of Tajima’s D or signed LD ($\sigma_d^1$) are often taken as evidence for selection or interactions between loci, respectively. c) The full two-locus haplotype sampling distribution is a three-dimensional object, making it difficult to visualize. We can instead visualize conditional distributions of the full sampling distribution, e.g. conditioned on observing $n_A$ copies of the A allele at the left locus and $n_B$ copies of B at the right locus (e.g. Hudson 2001). d) $\sigma_d^2$, which is closely related to $r^2$, decays with increasing recombination distance between loci (Ohta and Kimura 1969). Selection distorts squared LD away from neutral expectations. e) $\sigma_d^1$ (signed LD) is zero for pairs of neutral mutations (Hill and Robertson 1968). Interference between linked selected mutations causes negative signed LD (Hill and Robertson 1966), and other forms of interactions between selected mutations can cause large negative or positive signed LD.

Under neutrality, many approaches exist to compute $\Psi_n$, including the recursion due to Golding (1984) and Ethier and Griffiths (1990) and recent numerical approaches (Kamm et al. 2016; Ragsdale and Gutenkunst 2017). Selection is most easily included using the forward-in-time diffusion equation (Kimura 1955; Hill and Robertson 1966), where a standard approach is to first solve for the continuous distribution $\psi$ of the density of two-locus haplotype configurations in the full population, and then integrate $\psi$ against the multinomial sampling function to obtain $\Psi_n$. Alternatively, Ragsdale and Gravel (2019) showed that there exists a system of ordinary differential equations directly on the entries of $\Psi_n$. I briefly summarize this general approach below, but refer readers to that paper for detailed derivations of the drift, recombination, and mutation terms and the moment-closure approximation. Instead, here I focus on generalizing the selection operator to include epistasis, dominance, and other forms of two-locus interactions.

Moment equation for $\Psi_n$

The system of linear ordinary differential equations for the entries of $\Psi_n$ takes the form

$$\Psi_n^{t+1}(i,j,k;t) = D_{N(t)}\Psi_n^{t} + R_r\Psi_{n+1}^{t} + U_u\Psi_n^{t} + S_{s,h}\Psi_{n+2}^{t}. \tag{2}$$

$D_{N(t)}$ is a sparse linear operator accounting for drift with population size $N(t)$, $R$ accounts for recombination with per-generation recombination probability $r$ between loci, $U$ accounts for mutation, either under an infinite sites or biallelic reversible mutation model, and $S$ accounts for selection.

The moment system for $\Psi_n$ can be derived directly from the diffusion approximation, or it can be found through a more intuitive process of tracking the dynamics of allelic states of a sample of size $n$ from the full population. We assume $n \ll N_e$, and that $r$ and $s$ are $O(1/N_e)$, so that multiple coalescence, recombination, or selective events within the $n$ lineages are rare in any given generation (Supplementary Material; Jouganous et al. 2017; Ragsdale and Gravel 2019). In typical diffusion approximation fashion, we multiply through by $2N_{\mathrm{ref}}$ so that time is measured in units of $2N_{\mathrm{ref}}$ generations, and we consider scaled parameters $\rho = 4Nr$, $\theta = 4Nu$, and $\gamma = 2Ns$.
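The scaling described above can be written out in a few lines. This is a minimal sketch with hypothetical names; it assumes that the $N$ appearing in the scaled parameters is the same reference size $N_{\mathrm{ref}}$ used to rescale time, and the numerical values in the usage example are arbitrary illustrations rather than values from this study.

```python
def scaled_parameters(N_ref, r, u, s):
    """Convert per-generation rates into diffusion-scaled parameters.

    Assumes the N in rho = 4Nr, theta = 4Nu, gamma = 2Ns is the reference size N_ref.
    """
    return {
        "rho": 4.0 * N_ref * r,    # scaled recombination rate between the two loci
        "theta": 4.0 * N_ref * u,  # scaled mutation rate
        "gamma": 2.0 * N_ref * s,  # scaled selection coefficient
    }


def generations_to_diffusion_time(generations, N_ref):
    """Express elapsed time in units of 2 * N_ref generations."""
    return generations / (2.0 * N_ref)


if __name__ == "__main__":
    # Illustrative values only.
    print(scaled_parameters(N_ref=10_000, r=1e-4, u=1e-8, s=-5e-5))
    print(generations_to_diffusion_time(1_000, N_ref=10_000))
```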

Moment closure

In the absence of selection and for fully linked loci ($\rho = 0$), the system closes and can be solved exactly. However, for nonzero recombination or selection, the entries of $\Psi_n$ rely on the slightly larger sampling distributions with sample sizes $n+1$ (for recombination and additive selection) or $n+2$ (for nonadditive selection). This is because if a recombination event occurs within one of the $n$ lineages being tracked by $\Psi_n$, we need to draw an additional lineage from the full population to recombine with that chosen lineage, thus requiring $\Psi_{n+1}^{t}$ to find $\Psi_n^{t+1}$. Selection events similarly require extra lineages from the full population, which replace a chosen lineage that fails to reproduce with probability proportional to its relative fitness.

This requirement of extra lineages for recombination and selection means that the system in (2) is not closed, so that we need a moment-closure approximation to solve for $\Psi_n$. As in Ragsdale and Gravel (2019), a jackknife approximation is used to estimate $\Psi_{n+l}$, for $l = 1$ or 2, from $\Psi_n$ (following the single-locus closure introduced in Jouganous et al. 2017), so that $\hat{\Psi}_{n+l}(i,j,k) = J_{n,l}\Psi_n$, although other accurate closure approximations are possible (Friedlander and Steinrücken 2022). This yields a closed approximate system,

$$\dot{\hat{\Psi}}_n(i,j,k;t) = D_{\nu(t)}\hat{\Psi}_n(t) + R_{\rho}J_{n,1}\hat{\Psi}_n(t) + U_{\theta}\hat{\Psi}_n(t) + S_{\gamma,h}J_{n,2}\hat{\Psi}_n(t). \tag{3}$$

The jackknife approximation, which approximates an entry $\Psi_{n+l}(i,j,k)$ using nearby entries in $\Psi_n$, is more accurate for larger sample sizes, creating a tension between efficiency and accuracy: larger sample sizes result in more accurate solutions, as error in the jackknife is diminished, but computational complexity also grows rapidly in the number of entries of $\Psi_n$, which is $O(n^3)$ (Supplementary Fig. 1). In the results presented in this article, sample sizes between $n = 30$ and $n = 80$ are used. Derivations for the drift, recombination, and mutation operators and the jackknife moment-closure approximation can be found in section S1.3 of Ragsdale and Gravel (2019), and I repeat the main results in the Supplementary Material of this article.

Selection models with epistasis and dominance

To include selection, we consider a model where we draw lineages uniformly from the previous generation, but keep lineages with probability proportional to their fitness. In the absence of dominance, selection reduces to a haploid model, with acceptance and rejection probabilities depending on the fitnesses of each haploid copy, where haplotype Ab has fitness $1 + s_A$, aB has fitness $1 + s_B$, and AB has fitness $1 + s_{AB}$. We assume the doubly ancestral haplotype ab has fitness 1, so fitnesses are relative to that of ab haplotypes. The standard multiplicative fitness function assumes that $s_{AB} \approx s_A + s_B$ (assuming $s^2 \approx 0$), and a model for epistasis can be written as

$$s_{AB} = (s_A + s_B)(1 + \epsilon),$$

so that $\epsilon > 0$ gives synergistic epistasis and $\epsilon < 0$ gives antagonistic epistasis.
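The haploid epistasis model above can be sketched directly. The function name and the example selection coefficients are hypothetical; the code only restates the fitness assignments given in the text and checks how closely the additive approximation $s_A + s_B$ tracks the exact multiplicative fitness.

```python
def haplotype_fitnesses(s_A, s_B, epsilon=0.0):
    """Relative fitnesses of the four haplotypes, with ab as the reference (fitness 1).

    epsilon > 0 gives synergistic epistasis, epsilon < 0 antagonistic epistasis.
    """
    s_AB = (s_A + s_B) * (1.0 + epsilon)
    return {"ab": 1.0, "Ab": 1.0 + s_A, "aB": 1.0 + s_B, "AB": 1.0 + s_AB}


if __name__ == "__main__":
    s = -0.01  # illustrative deleterious selection coefficient at each locus
    for eps in (-0.5, 0.0, 0.5):
        print(eps, haplotype_fitnesses(s, s, eps)["AB"])

    # Without epistasis, the additive form differs from the exact multiplicative
    # fitness (1 + s_A)(1 + s_B) only by the second-order term s_A * s_B.
    exact = (1.0 + s) * (1.0 + s)
    approx = haplotype_fitnesses(s, s)["AB"]
    print(exact - approx)  # equals s * s up to floating-point rounding
```

With deleterious mutations ($s < 0$), $\epsilon = 1/2$ lowers the fitness of the double mutant below the nonepistatic value, and $\epsilon = -1/2$ raises it.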

To obtain the recursion equation under selection, we consider drawing $n$ lineages from generation $t$, which has an expected sampling distribution of haplotype counts given by $\Psi_n^{t}$. However, assuming $s \le 0$ for each derived haplotype, each of those sampled lineages has probability of being rejected equal to the absolute value of the selection coefficient assigned to its haplotype state. If a lineage is rejected, a replacement is drawn from the full population. Under the assumption that $n|s| \ll 1$, the probability that more than one selection event occurs in any given generation is negligibly small, so that the case of multiple simultaneous rejections can be ignored. Then $\Psi_n^{t+1}$ relies only on $\Psi_n^{t}$ and $\Psi_{n+1}^{t}$ for additive selection. The full selection operator $S$ for additive selection is given in the Supplementary Material.

To account for dominance, or other general forms of two-locus selection, the selection operator no longer reduces to individual haplotypes, but instead we need to know the state of two-locus genotypes. For example, the fitness of an individual carrying an Ab haplotype depends on whether their second haplotype is ab, Ab, aB, or AB. We can therefore assign a selection coefficient to each possible diploid configuration, $s_{Ab/ab}$, $s_{Ab/Ab}$, and so on. Assuming that the doubly homozygous ancestral ab/ab genotype has relative fitness 1, this gives nine possible unique selection coefficients in the most general two-locus selection model. Doubly heterozygous AB/ab and Ab/aB genotypes need not have the same selection coefficient (Supplementary Table 1).
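The count of nine selection coefficients follows from enumerating unordered pairs of the four haplotypes and setting aside the ab/ab reference genotype. The sketch below does this enumeration; the genotype labels and function names are hypothetical conventions chosen for illustration.

```python
from itertools import combinations_with_replacement

HAPLOTYPES = ("AB", "Ab", "aB", "ab")


def diploid_genotypes():
    """All unordered pairs of two-locus haplotypes (10 diploid genotypes)."""
    return [f"{h1}/{h2}" for h1, h2 in combinations_with_replacement(HAPLOTYPES, 2)]


def free_selection_coefficients():
    """Genotypes that carry their own selection coefficient; ab/ab is fixed at fitness 1."""
    return [g for g in diploid_genotypes() if g != "ab/ab"]


if __name__ == "__main__":
    genotypes = free_selection_coefficients()
    print(len(genotypes), genotypes)
```

The two double heterozygotes, AB/ab and Ab/aB, appear as separate entries, which is what allows them to be assigned different selection coefficients.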

The general selection operator follows the same approach as the haploid selection operator with epistasis described above. Now, in the case of a selection event rejecting a lineage within our tracked samples, we need to draw not only the replacement lineage from the full population but also a second haplotype from the full population to form the diploid genotype, as this determines the probability that we reject the focal haplotype. We again assume that $n|s| \ll 1$ for all genotype selection coefficients, so that we may assume at most a single selection event occurs in any given generation. This means that to find $\Psi_n^{t+1}$ under a general two-locus selection model, we need $\Psi_{n+2}^{t}$. Again, a full derivation and expressions for the general selection operator are given in the Supplementary Material.

Low-order summaries of the sampling distribution

From $\Psi_n$, expectations for any two-locus statistic can be found by downsampling to the appropriate sample size. For example, to compute $E[D]$, the sum is taken over all haplotype configurations $\mathbf{n} = (n_{AB}, n_{Ab}, n_{aB}, n_{ab})$, weighted by the density $\Psi_n$ for that configuration:

$$E[D] = \sum_{\mathbf{n}} \Psi_n(\mathbf{n}) \frac{n_{AB}n_{ab} - n_{Ab}n_{aB}}{n(n-1)}. \tag{4}$$

For large sample sizes, this is approximately equal to computing $D$ by taking the maximum likelihood estimate for each allele frequency $f_i = n_i/n$, but the maximum likelihood-based estimate will be noticeably biased for small to moderate sample sizes. Other low-order two-locus statistics can be computed using the same approach, as implemented in moments following Ragsdale and Gravel (2020), which can be compared across sample sizes and between estimates from phased or unphased data. In this article, I focus on $\sigma_d^2 = E[D^2]/E[p(1-p)q(1-q)]$ and $\sigma_d^1 = E[D]/E[p(1-p)q(1-q)]$, which can be averaged over pairs of variants at all frequencies. Allele-frequency conditioned statistics (such as keeping only loci below some frequency threshold as in Good [2022]) can be considered using this same approach.
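The difference between the estimator inside Equation (4) and the plug-in (maximum likelihood) estimate can be seen with a few lines of code. This is a minimal sketch, not the moments implementation: the function names are hypothetical, and `psi` is assumed here to be a plain dictionary mapping haplotype-count tuples to weights.

```python
def D_unbiased(n_AB, n_Ab, n_aB, n_ab):
    """Per-configuration term of Equation (4)."""
    n = n_AB + n_Ab + n_aB + n_ab
    return (n_AB * n_ab - n_Ab * n_aB) / (n * (n - 1))


def D_plugin(n_AB, n_Ab, n_aB, n_ab):
    """Plug-in estimate f_AB - f_A * f_B using sample frequencies."""
    n = n_AB + n_Ab + n_aB + n_ab
    f_AB = n_AB / n
    f_A = (n_AB + n_Ab) / n
    f_B = (n_AB + n_aB) / n
    return f_AB - f_A * f_B


def expected_D(psi):
    """Sum of Equation (4) over a dict {(n_AB, n_Ab, n_aB, n_ab): weight}."""
    return sum(weight * D_unbiased(*config) for config, weight in psi.items())


if __name__ == "__main__":
    config = (2, 0, 0, 2)  # a small sample of n = 4 haplotypes
    print(D_unbiased(*config), D_plugin(*config))
```

Algebraically, the plug-in estimate equals $(n_{AB}n_{ab} - n_{Ab}n_{aB})/n^2$, so it is smaller in magnitude than the term in Equation (4) by a factor of $(n-1)/n$. That factor approaches 1 for large $n$, consistent with the two agreeing for large samples.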

Simulations of nonsteady-state demography

I consider four variable population size histories, two simple toy models and two inferred from human populations in Africa and Europe using Relate (Speidel et al. 2019). For each size history scenario, I track the evolution of $\Psi_n(t)$ for varying selection models, plotting the trajectories of $\sigma_d^1$ and $\sigma_d^2$ over time (Figs. 6 and 7 and Supplementary Figs. 11–16). The selection strength at both loci is fixed at either $\gamma = -1$ or $-10$ for the models with epistasis, or $\gamma = -2$ for models with dominance, and recombination is set to zero.

Fig. 6.

The effects of demography on signed LD. a) Simulations under histories of an instantaneous size expansion and a 5-fold reduction and recovery. For two selection strengths ($\gamma = -1$ and $\gamma = -10$) and three cases of interactions (nonepistatic interference [$\epsilon = 0$, b], synergistic epistasis [$\epsilon = 1/2$, c], and antagonistic epistasis [$\epsilon = -1/2$, d], each with $\rho = 0$), a sudden decrease in population size can cause large changes in signed LD, often in the opposite direction from the more subtle shifts due to instantaneous expansion events. e) Signed LD between pairs of rare alleles (both $n_A, n_B \le 4$, with sample size $n = 50$) is more sensitive to population size changes. f) Signed LD between pairs of common alleles (both $n_A, n_B \ge 5$) is comparatively more stable over time. Colors are matched to conditions shown in (b–d). Additional comparisons, including for dominance models and showing $\sigma_d^2$, are shown in Supplementary Figs. 11–13. Dashed lines indicate neutral expectations.

Fig. 7.

Signed LD under inferred models of human population-size history. Piecewise-constant population size histories inferred by Relate applied to 1000 Genomes Project Consortium et al. (2015) phase 3 data were used to simulate time series of two-locus statistics, as in Fig. 6. a) The CEU are inferred to have a stronger bottleneck than the YRI 10–100 ka, reflecting the out-of-Africa event. For fully linked loci ($\rho = 0$) and $\gamma = -2$ at both loci, I compare the effects of b) standard interference, c) site-wise dominance, and d) gene-based dominance on $\sigma_d^1$. As with epistasis, more severe bottlenecks have larger effects on signed LD. e, f) LD among common variants is more stable than among pairs of uncommon variants. Colors are matched to conditions shown in (b–d). Additional comparisons with epistasis and showing $\sigma_d^2$ are in Supplementary Figs. 14–16. Dashed lines indicate neutral expectations.

The simple size change models both have ancestral $N_e$ = 10,000, with one a 3-fold population expansion that occurs 3,000 generations ago, and the other a 5-fold reduction 2,000 generations ago followed by a recovery to its initial size 1,000 generations ago. The size histories for YRI and CEU are inferred using Relate (Speidel et al. 2019) applied to the phase 3 haplotype-phased autosomal data from 1000 Genomes Project Consortium et al. (2015), using default parameters as recommended in the Relate online tutorial, assuming a mutation rate of $1.25 \times 10^{-8}$ per-bp per-meiosis and a human generation time of 29 years. Relate returns estimates of coalescence rates within specified time bins, and population sizes are estimated as their inverses. Estimates using Relate for population sizes in the very recent past (<3,000 years, or 100 generations) diverge, so I truncate the history over this time period and assume a constant size from the most recent nondiverged bin.
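The two toy size histories can be written as simple piecewise-constant functions of time before the present. This is a sketch with hypothetical function names; which side of each boundary belongs to which epoch is an assumption made here, since the text only gives the times of the size changes.

```python
def expansion_size(generations_ago, N_anc=10_000):
    """3-fold expansion 3,000 generations ago (boundary convention assumed)."""
    return 3 * N_anc if generations_ago < 3_000 else N_anc


def bottleneck_size(generations_ago, N_anc=10_000):
    """5-fold reduction 2,000 generations ago, recovery 1,000 generations ago."""
    if 1_000 <= generations_ago < 2_000:
        return N_anc // 5
    return N_anc


def years_to_generations(years, generation_time=29):
    """Convert calendar years to generations using the 29-year generation time."""
    return years / generation_time


if __name__ == "__main__":
    for t in (0, 500, 1_500, 2_500, 3_500):
        print(t, expansion_size(t), bottleneck_size(t))
    # 3,000 years is a little over 100 generations at 29 years per generation.
    print(years_to_generations(3_000))
```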

Analysis of human genomic data

Using the annotated variant call format (VCF) files from the phase 3 1000 Genomes Project Consortium et al. (2015) (Thousand Genomes) data release, I subset the genotype VCFs to autosomal variants that are annotated as either synonymous or nonsynonymous, including both missense mutations and more damaging “high impact” loss-of-function mutations. Loss-of-function annotations include frameshifts, splice acceptor, splice donor, start loss, stop gain, stop loss, and transcript ablation variants. I further subset to samples within each nonadmixed population in the African, European, and East Asian continental groups (five populations each, Supplementary Table 3). Signed LD is sensitive to ancestral state misidentification, so I only keep sites for which ancestral alleles were estimated with high confidence in both the VCF info field and the Thousand Genomes human ancestor reconstructed from a phylogeny of six primates.

In addition to ancestral state misidentification, measured LD is sensitive to phasing error, so I compute LD statistics using unphased genotypes following Ragsdale and Gravel (2020). This approach provides unbiased estimates for pairwise LD, under the assumption that individuals are not inbred. I consider pairs of mutations within the same mutation class (synonymous, missense, and loss of function), either within the same gene, or inside or outside of annotated domains within the same protein-coding gene. I use a dataset of annotated protein domains mapped to the hg19 human reference build compiled by Stanek et al. (2020) to determine if a given mutation falls within or outside a conserved domain.

Results

Expected signed LD under steady-state demography

In the Methods, I expand on the moment system developed in Ragsdale and Gravel (2019) to compute the expected sampling distribution of two-locus haplotypes ($\Psi_n$, Fig. 1) under a general model of selective interactions. This sampling distribution stores the expected density or observed counts of pairs of biallelic loci with each possible haplotype configuration in a sample of size $n$. Below, we compute expectations for $\Psi_n$ under varying scenarios of selection and interaction between pairs of mutations. It is not possible in this framework to include the effects of additional linked selected mutations, such as background selection, and individual-based forward simulations are still needed for such scenarios (e.g. Supplementary Figs. 7–10).

In many cases, it is simpler to visualize summaries such as the expectation or variance of $D$ (Fig. 1, d and e) or conditional slices of the distribution (Fig. 1c) instead of the full three-dimensional distribution $\Psi_n$. For pairs of biallelic loci, with alleles labeled A/a at the left locus and B/b at the right locus, $D = f_{AB} - f_A f_B$ is the standard covariance measure of LD, where $f_{AB}$ is the frequency of haplotypes carrying both A and B, and $f_A$ and $f_B$ are the marginal allele frequencies of A and B. Here I focus on low-order LD statistics and their decay with recombination distance, as these are statistics that are commonly used to test for interactions between loci. Instead of $E[D^2]$ and $E[D]$, I consider expectations for $\sigma_d^2 = E[D^2]/E[f_A(1-f_A)f_B(1-f_B)]$ and $\sigma_d^1 = E[D]/E[f_A(1-f_A)f_B(1-f_B)]$. Normalized statistics have two benefits: (1) the mutation rate cancels so that expectations are robust to assumptions about the per-base mutation rate, and (2) we can compare to analytic expectations for these quantities under neutrality and constant population size (Ohta and Kimura 1971).
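Both normalized statistics are ratios of averages over pairs of loci, not averages of per-pair ratios. The sketch below makes that explicit for a set of pairs described by known population frequencies; the function name and input format are hypothetical, and no finite-sample correction is applied (estimation from sampled genotypes would need the unbiased estimators described in the Methods).

```python
def ld_summaries(pairs):
    """Return (sigma_d^1, sigma_d^2) for an iterable of (f_AB, f_A, f_B) tuples.

    Numerators and the shared denominator are each summed over pairs before
    taking the ratio, so any common scaling of the number of pairs cancels.
    """
    sum_D = 0.0
    sum_D2 = 0.0
    sum_het = 0.0
    for f_AB, f_A, f_B in pairs:
        D = f_AB - f_A * f_B
        sum_D += D
        sum_D2 += D * D
        sum_het += f_A * (1.0 - f_A) * f_B * (1.0 - f_B)
    return sum_D / sum_het, sum_D2 / sum_het


if __name__ == "__main__":
    # One pair in perfect coupling and one in perfect repulsion, all alleles at 50%.
    pairs = [(0.5, 0.5, 0.5), (0.0, 0.5, 0.5)]
    print(ld_summaries(pairs))
```

In this toy example the coupling and repulsion pairs cancel in signed LD, while squared LD is unaffected by sign.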

Below, I first consider the case of additive selection, Hill–Robertson interference, and epistasis. I then explore the effect of dominance acting within loci but without epistasis, and then describe a general diploid selection model and consider gene-based dominance effects. I present results for equal strengths of selection and dominance at each locus, but note that the methods presented here allow for arbitrary and unequal selection and dominance at the two loci. I also focus primarily on weak to moderate negative selection ($|2Ns|$ of roughly 1 to 20), since this range of selection leads to the strongest signals of interference (Fig. 2) and is the parameter regime for which the numerical approach is most accurate.

Fig. 2.

Hill–Robertson interference. Interference between pairs of selected mutations causes negative signed LD. a) The expected normalized variance of D (σd2) decreases below neutral expectations as the strength of negative selection increases. b, c) For tightly linked loci (4Nr ≪ 1), interference is most noticeable for pairs of mutations with s ≈ 1/N. At larger recombination distances (4Nr > 1), signed LD is most negative for somewhat stronger selection coefficients. Dashed lines show neutral expectations.

Additive selection and epistasis

For mutations under additive selection (h = 1/2) and no epistasis, we recover the well-known Hill and Robertson (1966) interference result of negative LD between selected mutations, which is strongest for pairs of mutations that have selection coefficients $\gamma = 2N_e s = O(1)$, or $s \approx 1/(2N_e)$ (Fig. 2). For strongly deleterious mutations, LD is close to zero even with tight linkage, as they almost always segregate at low enough frequencies that they are unlikely to interfere with each other (McVean and Charlesworth 2000).

With epistasis, mean signed LD is large for both weakly and strongly selected variants, with sign depending on the direction of epistatic interactions (Fig. 3). Synergistic epistasis (in which the effect of two mutations together is larger than the product of each individual mutation’s effect) results in negative LD while antagonistic epistasis (in which the combined effect is less than the product of independent effects) results in positive LD, and large nonzero LD can occur even when epistasis is relatively weak. Epistasis-induced LD can extend over long distances, especially for strongly deleterious mutations. Even moderately deleterious mutations with population-size-scaled selection coefficients of γ=10 show large mean LD that extends to values of ρ much greater than 1 (Fig. 3f; for humans, assuming roughly 1 cM/Mb, this is on the order of 100 kb or more). More strongly deleterious interacting mutations are expected to show large signed LD over much larger recombination distances.
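The conversion between scaled recombination distance and physical distance quoted above can be checked with a short calculation. The sketch below assumes ρ = 4Nₑr, a recombination rate of 1 cM/Mb (10⁻⁸ per base pair per generation, as in the text), and an effective population size of 10⁴; the value of Nₑ and the function name are assumptions for illustration and are not taken from the text.

```python
def rho_to_bp(rho, Ne=1e4, r_per_bp=1e-8):
    """Physical distance (bp) at which 4 * Ne * r_per_bp * distance equals rho.

    Ne = 1e4 is an assumed effective size; r_per_bp = 1e-8 is 1 cM/Mb.
    """
    return rho / (4 * Ne * r_per_bp)

dist_rho_1 = rho_to_bp(1.0)    # about 2.5 kb under these assumptions
dist_rho_40 = rho_to_bp(40.0)  # about 100 kb under these assumptions
```

Under these assumed values, ρ = 1 corresponds to only a few kilobases, so distances on the order of 100 kb correspond to ρ in the tens, consistent with LD that extends to ρ much greater than 1.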

Fig. 3.

Additive selection and epistasis. Left panels (a and c) show expectations for the decay of σd2 with recombination distance, and right panels (b and d) show expectations for the decay of σd1. Dashed lines show neutral expectations. For both weak (s=1/2N, a and b) and moderate (s=10/2N, c and d) selection, antagonistic epistasis (ϵ<0) causes positive signed LD and increased σd2 over a multiplicative model (ϵ = 0), and synergistic epistasis (ϵ>0) results in negative signed LD beyond that of Hill–Robertson interference and decreased σd2.

Dominance

The effect of nonadditive selection on correlations between mutations has received increased attention recently. For example, Garcia and Lohmueller (2021) used large-scale forward simulations to explore how dominance impacts patterns of LD, showing that LD depends nonlinearly on the magnitudes of both selection and dominance. Roze (2021) found an analytic expression for LD between pairs of strongly deleterious mutations under steady-state demography, showing that LD can be either positive or negative depending on the strength of dominance.

The combined effect of the strengths of selection and dominance on interference is indeed nontrivial (Fig. 4a). Some parameter regimes can cause large negative LD between negatively selected variants, with moderately selected recessive variants showing stronger signals of interference than additive selection (Fig. 4b). Unlike epistatic interactions, signed LD decays rapidly with increasing distance between loci and is roughly zero for ρ ≫ 1. For weakly selected mutations (|γ| ≲ 1), there is no monotonic effect of the level of dominance on negative LD, with both recessive and dominant pairs of mutations having more negative LD than additive mutations.

Fig. 4.

The effect of dominance on LD. a, b) The strengths of selection and dominance interact in a nonlinear way to shape expected signed LD. For weakly to moderately selected mutations, as shown here, signed LD can be large and negative for tightly linked loci (e.g. γ=5, h =0, and ρ<1). However, this large signed LD decays with recombination distance faster under a model of recessivity than does signed LD under a model of additive selection and epistasis (Fig. 3). c) Interference effects are most pronounced for recessive deleterious variants. d, e) Recessive strongly deleterious mutations can have positive signed LD, as recently shown by Roze (2021). However, the dominance threshold at which σd1 switches from positive to negative depends on the strength of selection, and weakly selected mutations can show nonmonotonic behavior as h varies. Selection parameters of sh=0.1 imply extremely strong selection (h =0.05 results in s=0.2 and γ=400 for homozygous diploids at a single locus). The numerical approach for Ψn cannot handle such strong selection.

Discrete simulations and the moment approach confirm the result from Roze (2021), that positive LD can occur for strong negative selection and small values of h (Fig. 4, d and e). However, while the analytic formula in Roze (2021) predicts that LD should be positive for h < 0.25 and negative for h > 0.25, this appears to only hold in the limit of strong selection (compare to Figure 1A in Roze 2021). For moderate to moderately strong selection, this threshold of h can be less than 0.25, and LD is negative for all 0 ≤ h ≤ 1 for weakly deleterious mutations (Fig. 4c).

Arbitrary two-locus selection models

Beyond standard models of epistasis and dominance, a large family of selection models can be specified by assigning unique fitness effects to each possible diploid pair of haplotypes. Assuming the diploid genotype homozygous for the ancestral alleles (ab/ab) has relative fitness 1, there are nine other diploid two-locus genotypes that could be given unique fitnesses (Supplementary Table 1), noting that AB/ab and Ab/aB genotypes can have differing selection coefficients.
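The count of diploid genotypes can be verified by enumerating unordered pairs of the four haplotypes. This is a minimal sketch; the string labels are chosen for illustration and are written with the haplotypes in a fixed order, so the genotype the text calls AB/ab appears here as "ab/AB".

```python
from itertools import combinations_with_replacement

# The four two-locus haplotypes (lowercase = ancestral, uppercase = derived).
haplotypes = ["ab", "Ab", "aB", "AB"]

# Unordered pairs of haplotypes give the distinct diploid genotypes.
genotypes = ["/".join(pair)
             for pair in combinations_with_replacement(haplotypes, 2)]

# Everything except the ancestral homozygote can carry its own fitness.
non_reference = [g for g in genotypes if g != "ab/ab"]

# The two double heterozygotes differ only in phase.
double_hets = [g for g in genotypes if g in ("ab/AB", "Ab/aB")]
```

There are ten genotypes in total and nine besides ab/ab, with the two double heterozygotes listed separately because they may be assigned different selection coefficients.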

The case with $s_{AB/ab} \neq s_{Ab/aB}$ may arise in a scenario where a mutation at either locus within a haplotype impacts some functional element, but a diploid individual carrying at least one copy that is free of mutations has minimal fitness loss. In this “gene-based dominance” scenario (e.g. Sanjak et al. 2017), an AB/ab genotype has higher fitness than an Ab/aB type (Supplementary Table 1). Such a gene-based fitness model gives positive signed LD, similar to the model of antagonistic epistasis (Fig. 5), although the interpretation of those two models can differ. With a highly parameterized space of possible general diploid selection models, multiple models with different biological interpretations can give similar patterns of expected signed LD.
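One way to make the gene-based dominance idea concrete is the sketch below. It is a hypothetical parameterization, not the one in Supplementary Table 1: a haplotype copy counts as damaged if it carries a derived allele at either locus, an individual with one damaged copy pays a cost h·s, and an individual with two damaged copies pays the full cost s. The function name, the default values, and the use of a positive cost s are all assumptions for illustration.

```python
def gene_based_fitness(hap1, hap2, s=0.01, h=0.0):
    """Relative fitness when selection acts on each haplotype copy as a unit.

    hap1, hap2: two-character strings such as "ab" or "Ab"; uppercase marks
    a derived allele. s is a positive fitness cost and h its dominance
    (both hypothetical defaults).
    """
    damaged = sum(
        any(allele.isupper() for allele in hap) for hap in (hap1, hap2)
    )
    if damaged == 0:
        return 1.0            # both copies free of mutations
    if damaged == 1:
        return 1.0 - h * s    # one intact copy remains
    return 1.0 - s            # no intact copy

w_cis = gene_based_fitness("AB", "ab")    # both mutations on one haplotype
w_trans = gene_based_fitness("Ab", "aB")  # mutations on different haplotypes
```

With these defaults the cis double heterozygote (AB/ab) keeps one intact copy and has higher fitness than the trans double heterozygote (Ab/aB), which is the asymmetry described above.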

Fig. 5.

Multiple modes of interactions can lead to large positive signed LD. Both antagonistic epistasis (a and c, which includes compensatory mutation models) and gene-based dominance (b and d) lead to large positive signed LD. Compensatory mutations (ϵ ≈ −1) also cause increased σd2 compared to neutral expectations (dashed black lines), while weaker antagonistic epistasis does not increase σd2 above neutral expectations (compare to Fig. 3c). Gene-based dominance instead causes lower σd2 than neutral expectations. While signed LD may be similar between different interaction models, other two-locus summaries of the data may help to distinguish between interaction models.

The effect of population size changes on signed LD

The moment system for Ψn readily incorporates variable population size. I explore two simple size history models (Fig. 6a), one with an instantaneous expansion and another with a bottleneck followed by recovery. I also consider two demographic histories inferred using genome-wide gene genealogy reconstruction (Speidel et al. 2019) applied to the 1000 Genomes Project Consortium et al. (2015) dataset, and focus on size histories for the Yoruba from Ibadan, Nigeria (YRI), and Utahns of North and West European ancestry (CEU) (Fig. 7a).

For each of the four histories, Figs. 6 and 7 and Supplementary Figs. 11–16 show the dynamics of σd1 and σd2 for a given parameterization of two-locus selection, including synergistic and antagonistic epistasis, dominance within loci, and gene-based dominance. In general across each selection model, population size expansions do not strongly affect σd1, whether that expansion occurs deeper in the past as in the simple expansion model or rapid expansion more recently, as for YRI. On the other hand, population size reductions tend to push signed LD to more extreme values and subsequent recoveries or expansion again reduce the magnitude of LD. Under no selection condition tested here do population size changes cause expected LD to change sign, showing that while the magnitude of deviation of LD from zero is sensitive to population size history, interpretations of the observed sign of LD should be robust to population size history.

Signed LD within protein-coding genes

Here, I examine patterns of signed LD between mutations in human protein-coding genes partitioned by functional annotations. Synonymous and missense mutations show similar levels of slightly positive signed LD when considering pairs of mutations within the same gene averaged over all autosomal chromosomes. Loss-of-function mutations have more negative LD, possibly due to differing modes of selective interactions for loss-of-function and missense mutations (Fig. 8, a and b). Within each population, measurement noise gives 95% confidence intervals that overlap with zero in each mutation class, although the observed patterns are remarkably consistent across African, European, and East Asian populations in the Thousand Genomes dataset. Comparing mean LD across populations, LD in Eurasian populations is somewhat larger on average, that is, more positive for missense mutations and more negative for loss-of-function mutations. This is in agreement with differences in expectations between populations that have or have not gone through a bottleneck in their recent history (Figs. 6 and 7).

Fig. 8.

LD in human protein-coding genes and annotated domains. a) Gene-wide averages of signed LD are slightly positive for both missense and synonymous mutations, considering pairs of mutations at matching distances. This positive, equal σd1 is also observed when conditioning on allele frequencies or considering only common variants with minor allele frequencies ≥ 0.1 (Supplementary Tables 6–11). b) While there are relatively fewer pairs of loss-of-function mutations within genes, causing larger measurement uncertainty, they have negative average LD. Confidence intervals for each class of mutations overlap with zero and with each other, making it difficult to draw firm conclusions on the patterns of interactions occurring gene-wide. c) Partitioning pairs of mutations as falling within or outside of conserved domains reveals opposing patterns of signed LD, with σd1 between missense mutations larger than that of synonymous mutations within domains. Outside of conserved domains, missense mutations have reduced LD compared to synonymous mutations. Distances of pairs outside of domains were matched to within-domain mutation pair distances. e–f) Rare and uncommon variants (with frequencies <0.1) have positive LD, with synonymous LD exceeding missense LD in many populations. However, common missense variants within domains have large positive LD and are responsible for the pattern seen in (c).

Positive LD between pairs of missense mutations in conserved domains

The similarity in signed LD between missense and synonymous mutations might suggest that interference between missense mutations is minimal, or at least no stronger than interference between synonymous mutations. However, interactive effects differ dramatically between pairs of mutations found in different intragenic regions. Due to the rarity of loss-of-function mutations, I only compare synonymous and missense mutations when looking at finer partitions of mutations within genes.

Annotated conserved domains in protein-coding genes drive signals of positive LD between missense variants. Such protein-coding domains are conserved elements of genes, often associated with some known functional or structural feature of a protein (Stanek et al. 2020). Purifying selection is expected to be stronger within conserved domains than within the same gene but outside of those domains. Indeed, the SFS is skewed to lower frequencies for both missense and loss-of-function mutations within domains when compared to the same classes of mutations outside of domains, with much more negative values of Tajima’s D within domains (Supplementary Tables 4 and 5). On the other hand, no difference is observed for synonymous mutations whether within or outside domains, suggesting roughly equivalent effects of selection (either direct or linked) on synonymous variation.

Missense mutations within the same functional domain have large positive LD that is elevated above that of synonymous mutations within the same domain (Fig. 8c and Supplementary Figs. 18–22). This difference between missense and synonymous variants within domains is especially pronounced for linked pairs within a few hundred base pairs of each other (Supplementary Fig. 18).

Selection is stronger against missense mutations within domains than outside domains, leading to an excess of rare missense mutations within conserved domains (Supplementary Table 4). LD is known to be sensitive to allele frequencies, with rare mutations showing large positive signed LD (Good 2022). To test whether the signal of increased LD between missense mutations within domains is driven by rare variants, I considered subsets of pairs of mutations based on their derived allele frequencies (Fig. 8, d–f and Supplementary Figs. 20–22). Rare and uncommon variants show large average LD for each class of mutations. However, common variants recapitulate the opposing patterns of LD that are seen when averaging over pairs at all frequencies, and the SFS for common missense and synonymous variants are similar (Supplementary Fig. 24), so that this signal is unlikely to be driven by subtle differences in allele frequencies between classes of mutations.

Reduced LD between pairs of missense mutations outside of conserved domains

The large positive signal of LD for missense mutations within the same domain does not extend to pairs of missense mutations that span different domains. Missense and synonymous mutations show nearly equal levels of LD close to zero across domains, with missense mutations slightly more negative than synonymous mutations (Supplementary Fig. 23). The interactive effect driving large LD in domains is therefore likely domain specific. However, the average distance between mutations within domains is much smaller than between domains, so this observation may be primarily driven by the higher recombination distances between mutations across distinct domains.

Mutations that fall outside of annotated domains have the opposite pattern of signed LD to mutations within the same domain. For pairs of mutations outside of domains but with distances matched to those within domains, synonymous mutations have larger positive LD than missense mutations. More distant pairs of mutations outside of domains, matched to the same distances as the between-domain comparison, each have LD roughly equal to zero (Supplementary Fig. 23).
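Distance matching of this kind can be sketched as a binned subsampling step. The code below is one simple way to do it and is not necessarily the procedure used for the analyses here: pairs from a candidate set (for example, pairs outside domains) are subsampled so that their counts per distance bin do not exceed those of a target set (for example, pairs within domains). The function name, bin edges, and toy distances are hypothetical.

```python
import numpy as np

def match_distances(target_dists, candidate_dists, bin_edges, seed=0):
    """Return indices of candidate pairs subsampled to the target's
    per-bin distance counts. Distances outside bin_edges are ignored."""
    rng = np.random.default_rng(seed)
    target_bins = np.digitize(np.asarray(target_dists), bin_edges) - 1
    cand_bins = np.digitize(np.asarray(candidate_dists), bin_edges) - 1
    keep = []
    for b in range(len(bin_edges) - 1):
        n_target = int(np.sum(target_bins == b))
        idx = np.flatnonzero(cand_bins == b)
        # Cannot take more candidates than exist in this bin.
        n_take = min(n_target, idx.size)
        if n_take > 0:
            keep.extend(rng.choice(idx, size=n_take, replace=False))
    return np.sort(np.array(keep, dtype=int))

# Hypothetical toy distances in base pairs.
within = [10, 20, 150, 160, 170]
outside = np.array([5, 15, 25, 35, 120, 130, 400, 500])
edges = [0, 100, 200, 300]
matched = match_distances(within, outside, edges)
matched_dists = outside[matched]
```

When a bin holds fewer candidate pairs than target pairs, as in the second bin of this toy example, the match is only approximate, and any comparison of LD between the two sets would need to account for that.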

The role that tightly linked variants have in driving these opposing signals can be seen in the decay of signed LD with distance between mutations (Supplementary Figs. 17–19). Both synonymous and missense mutation pairs at distances greater than a few hundred bases have average LD fluctuating around zero. However, for mutations outside domains, synonymous variants separated by short distances have large positive LD, while missense mutations have lower LD (Supplementary Fig. 19). In contrast, for mutations within the same domains, missense mutations have more positive LD at short distances than synonymous mutations (Supplementary Fig. 18).

Discussion

Broad genome- and gene-wide surveys of LD across functional classes of mutations miss heterogeneous patterns of interactive effects occurring within genes. From gene-wide averages, LD between missense mutations does not appear to differ from synonymous variants, while LD between loss-of-function variants is more negative. Hill–Robertson effects are expected to be strongest for slightly to moderately deleterious variants (with $|s| \approx 1/N_e$), as strongly deleterious mutations are not expected to interfere with one another (McVean and Charlesworth 2000). However, inference of the distribution of fitness effects (DFE) for new loss-of-function variants shows that a large majority are strongly deleterious (Supplementary Material), so most will not strongly interfere with each other.

Instead, negative synergistic epistasis between strongly deleterious mutations does produce large negative deviations of mean LD. Weakly deleterious recessive mutations can also produce this pattern, but strongly deleterious recessive mutations lead to slightly positive LD (Roze 2021). While most loss-of-function mutations are strongly deleterious, those that rise to appreciable frequency are likely more benign and σd1 may be driven by patterns of weakly deleterious loss-of-function mutations. The difficulty in distinguishing these effects is compounded by the large measurement noise for E[D], especially for loss-of-function variants for which only a few hundred within-gene pairs exist in the human population data analyzed here and which are separated by larger distances on average than neighboring missense and synonymous mutations (Supplementary Fig. 27).

In addition to LD varying by distance, LD can also vary due to differences in allele frequencies among classes of mutations. Matching both distances between pairs and allele frequencies between classes of mutations reduces these concerns. It is further possible that recombination rates can vary between annotated regions, resulting in differing patterns of background selection, which can affect both allele frequencies and LD (e.g. Supplementary Figs. 7–10). I did not condition on local recombination rates or inferred levels of background selection here.

Non-uniform interactions between selected mutations within genes

Positive average LD between both missense and synonymous mutations has been reported in humans, Drosophila, and other species (Sohail et al. 2017; Sandler et al. 2021), while others have found that nonsynonymous mutations show lower LD than synonymous mutations (Garcia and Lohmueller 2021). The similarity of gene-wide LD between missense and synonymous mutations observed in this study might suggest that interference or interactions between missense mutations are minimal. However, averaging over all observed pairs of mutations within a gene masks element-specific interactive effects that drive LD in opposite directions. Nonsynonymous mutations found within conserved protein domains are more strongly selected against on average, but also have increased signed LD over synonymous mutations at the same distances within domains. Missense mutations outside of domains but at the same distances as those within domains have more negative LD, both compared to distance-matched synonymous mutations outside of domains and to mutations within domains. Neither dominance effects (aside from very strongly selected recessive mutations [Roze 2021]), synergistic epistasis, nor Hill–Robertson interference is expected to result in positive LD, so some other interactive effect should be driving this signal of positive LD within conserved domains.

Different interaction scenarios can result in positive signed LD between tightly linked loci. One possibility is a prevalence of pairs of compensatory mutations that are tolerated to co-segregate at high frequencies within conserved domains (Yeang and Haussler 2007; Ivankov et al. 2014). Callahan et al. (2011) and Taverner et al. (2020) have proposed such a mechanism to explain observed clusters of nonsynonymous substitutions in Drosophila and other species. Another possibility is a model of antagonistic, or diminishing returns, epistasis, in which a single amino acid-changing mutation within a domain damages the functionality of that subunit, but additional mutations within that same domain reduce fitness by a factor less than the first mutation. A third possibility, related to antagonistic epistasis, is that selection acts on the functional domain as a unit instead of on mutations within the domain individually (such as under a model of gene-based dominance). In this scenario, double heterozygotes have different fitnesses depending on whether the mutations are found on the same haplotype or on different haplotypes.

There is no increase in LD between pairs of missense mutations found in different annotated domains or outside of domains. Rather, those missense variants have considerably lower LD than synonymous variants, and this difference between synonymous and missense variants disappears for SNPs separated by more than a few hundred base pairs (Fig. 8, e and f). This suggests that Hill–Robertson interference is the primary mode of interaction between missense mutations falling outside of domains, in agreement with Garcia and Lohmueller (2021), as epistasis is expected to impact LD over larger distances than what is observed. Importantly, however, the strength of epistasis is also likely to be a function of distance between mutations, complicating this interpretation.

Taken together, selection and interactions in protein-coding genes are nonuniform, depending on mutation type and location within the gene. Typical approaches for inferring DFEs from population genomic data average over such differences by aggregating all observed nonsynonymous mutations (Boyko et al. 2008; Kim et al. 2017). It would be straightforward to adapt DFE-inference methods to infer a more detailed representation of the heterogeneous effects of new mutations within genes by partitioning by missense and loss-of-function classes as well as by annotated domains. Additionally, standard models and simulation approaches may be too simple to capture evolutionary trajectories and patterns of diversity that differ due to region-specific interactions.

Positive LD between synonymous mutations

In a nonstructured randomly mating population, neutral mutations are expected to have average signed LD of zero, but across all populations analyzed here, LD between synonymous mutations is positive. While selection on some subset of synonymous variants is possible, it is likely weaker on average than between missense mutations, and any interference between selected synonymous variants should lead to negative LD. Spatial population structure may be responsible for the increase of LD observed between synonymous variants. While neutral evolution alone in structured populations cannot cause positive LD, Sohail et al. (2017) and Sandler et al. (2021) used forward simulations to show that population structure combined with selection at linked sites can induce positive LD between neutral mutations.

Nonrandom mutational processes, in which clusters of mutations occur simultaneously in the same mutational event, can also lead to positive LD between neutral variants separated by short distances. Such multinucleotide mutations have been shown to affect patterns of LD on the order of 10s to 100s of base pairs (Harris and Nielsen 2014), and clustered mutational events may be common in humans (Besenbacher et al. 2016). Indeed, a simple exponential model in which the fraction of mutations causing a multinucleotide mutation event decays with distance fits the observed patterns of σd1 between synonymous variants (Supplementary Fig. 29), with the best-fit model requiring only a small fraction of mutations to involve multinucleotide mutation events.

We may therefore treat positive LD as the baseline expectation for tightly linked variants due to either spatial structure or clustered mutations, so that subsequent selective processes and interactions cause LD to deviate from that expectation. It may then be more appropriate to compare the negative LD observed between linked loss-of-function variants to that positive baseline expectation instead of zero, which would imply that they are more recessive or have stronger synergistic epistasis than from inferences assuming a neutral expectation of zero. Additional analyses and simulations will be required to tease apart the effects of population structure, selective interference at linked sites, and clustered mutational events on patterns of LD.

Challenges to distinguishing modes of selective interactions from LD

When partitioning measurements of LD by mutation classes or regions within genes, the decreasing number of pairwise comparisons leads to large measurement noise. Within each population, confidence intervals of observed σd1 often overlap with zero or overlap with that of other classes of mutations. While observed patterns are remarkably consistent across the 15 populations considered here, their joint evolutionary histories make formal testing of significance difficult due to shared variation, as they cannot be treated as independent measurements. Detailed simulations will likely be needed to more thoroughly assess significance.

From a modeling perspective, the space of plausible selection scenarios becomes large as we relax the strict assumptions of additivity and multiplicative interactions. This makes performing forward simulations that span the range of all such selective interaction scenarios burdensome. Instead, closed numerical approaches allow us to efficiently explore this highly parameterized space of models and to perform likelihood-based inference using signed LD or other two-locus summaries of the data. For example, inferring the joint distribution of dominance and selection is underpowered using the SFS alone, but because signed LD is sensitive to the levels of dominance (Fig. 4), inferring the DFE with dominance may be feasible using the joint distribution of allele frequencies and LD. The results presented here do not cover the space of all possible two-locus models, and other unexplored models may result in similar patterns of signed LD. Comparisons to empirical data should therefore be treated with caution.

Using a single low-order summary of signed LD, such as E[D] or σd1, is likely insufficient to confidently discriminate modes of selective interactions that produce similar LD patterns. Among all interaction models, the extent of LD and rate of its decay also depends on the underlying distribution of selection coefficients among a class of mutations, which are unknown for a given pair of mutations, so that we must integrate over a DFE. This DFE, however, will have been inferred under a simple set of assumptions, such as additivity and interchangeability between sites within a gene, potentially biasing any inference using previously inferred DFEs to learn about patterns of interactions. Again, this underlines the need to jointly infer strengths and interactions of selected variants and to consider patterns of variation at finer genomic scales.

Finally, expectations for a large family of informative two-locus statistics can be computed directly from the full two-locus sampling distribution, which can be compared to empirical observations from either phased or unphased data (Ragsdale and Gravel 2020). Exploring additional patterns of correlations between mutations should uncover overlooked statistics that will improve our ability to distinguish between modes of selective interactions.

Data and software availability

All data and software used in this article are publicly available and open source. I downloaded the Thousand Genomes annotations and genotypes VCFs from the ftp server at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/, and the Thousand Genomes human ancestor fasta file from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/supporting/ancestral_alignments/. Protein domain information from Stanek et al. (2020) was downloaded from http://prot2hg.com/dbdownload.php.

Moment equations to compute expectations for two-locus and LD statistics are implemented in Python using NumPy (Harris et al. 2020) and sparse matrix solvers in SciPy (Virtanen et al. 2020). These methods are packaged within moments, and analyses here were performed using moments version 1.1.10, available from https://bitbucket.org/simongravel/moments and via conda, with extensive documentation at https://moments.readthedocs.org. Scripts to run all analyses, recreate figures, and compile this manuscript are available at https://github.com/apragsdale/two_locus_selection. Each URL was last accessed July 5, 2022.

Supplemental material is available at GENETICS online.

Acknowledgments

I thank Alex Diaz-Papkovich, Eric Friedlander, Simon Gravel, Mashaal Sohail, Matthias Steinrücken, and Kevin Thornton for helpful discussions and feedback on earlier versions of this manuscript.

Funding

Support for this research was provided by the Office of the Vice Chancellor for Research and Graduate Education at the University of Wisconsin–Madison with funding from the Wisconsin Alumni Research Foundation.

Conflicts of interest

None declared.

Literature cited

  1. 1000 Genomes Project Consortium, Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR, et al. A global reference for human genetic variation. Nature. 2015;526(7571):68–74. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Agrawal AF, Whitlock MC.. Inferences about the distribution of dominance drawn from yeast gene knockout data. Genetics. 2011;187(2):553–566. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Bank C, Hietpas RT, Jensen JD, Bolon DNA.. A systematic survey of an intragenic epistatic landscape. Mol Biol Evol. 2015;32(1):229–238. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Barton NH, Charlesworth B.. Why sex and recombination? Science. 1998;281(5385):1986–1990. [PubMed] [Google Scholar]
  5. Bershtein S, Segal M, Bekerman R, Tokuriki N, Tawfik DS.. Robustness–epistasis link shapes the fitness landscape of a randomly drifting protein. Nature. 2006;444(7121):929–932. [DOI] [PubMed] [Google Scholar]
  6. Besenbacher S, Sulem P, Helgason A, Helgason H, Kristjansson H, Jonasdottir A, Jonasdottir A, Magnusson OT, Thorsteinsdottir U, Masson G, et al. Multi-nucleotide de novo mutations in humans. PLoS Genet. 2016;12(11):e1006315. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Birky CW Jr, Walsh JB.. Effects of linkage on rates of molecular evolution. Proc Natl Acad Sci USA. 1988;85(17):6414–6418. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Boyko AR, Williamson SH, Indap AR, Degenhardt JD, Hernandez RD, Lohmueller KE, Adams MD, Schmidt S, Sninsky JJ, Sunyaev SR, et al. Assessing the evolutionary impact of amino acid mutations in the human genome. PLoS Genet. 2008;4(5):e1000083. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Callahan B, Neher RA, Bachtrog D, Andolfatto P, Shraiman BI.. Correlated evolution of nearby residues in Drosophilid proteins. PLoS Genet. 2011;7(2):e1001315. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Charlesworth B. Mutation-selection balance and the evolutionary advantage of sex and recombination. Genet Res. 1990;55(3):199–221. [DOI] [PubMed] [Google Scholar]
  11. Charlesworth D, Charlesworth B, Morgan MT.. The pattern of neutral molecular variation under the background selection model. Genetics. 1995;141(4):1619–1632. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Clark AG. Mutation-selection balance with multiple alleles. Genetica. 1998;102–103(1–6):41–47. [PubMed] [Google Scholar]
  13. Comeron JM. Background selection as baseline for nucleotide variation across the Drosophila genome. PLoS Genet. 2014;10(6):e1004434.
  14. Cutter AD, Payseur BA. Genomic signatures of selection at linked sites: unifying the disparity among species. Nat Rev Genet. 2013;14(4):262–274.
  15. Ethier SN, Griffiths RC. On the two-locus sampling distribution. J Math Biol. 1990;29(2):131–159.
  16. Friedlander E, Steinrücken M. A numerical framework for genetic hitchhiking in populations of variable size. Genetics. 2022;220(3):iyac012.
  17. Garcia JA, Lohmueller KE. Negative linkage disequilibrium between amino acid changing variants reveals interference among deleterious mutations in the human genome. PLoS Genet. 2021;17(7):e1009676.
  18. Golding GB. The sampling distribution of linkage disequilibrium. Genetics. 1984;108(1):257–274.
  19. Good BH. Linkage disequilibrium between rare mutations. Genetics. 2022;220(4):iyac004.
  20. Haller BC, Messer PW. SLiM 3: forward genetic simulations beyond the Wright–Fisher model. Mol Biol Evol. 2019;36(3):632–637.
  21. Harris CR, Millman KJ, van der Walt SJ, Gommers R, Virtanen P, Cournapeau D, Wieser E, Taylor J, Berg S, Smith NJ, et al. Array programming with NumPy. Nature. 2020;585(7825):357–362.
  22. Harris K, Nielsen R. Error-prone polymerase activity causes multinucleotide mutations in humans. Genome Res. 2014;24(9):1445–1454.
  23. Hill WG, Robertson A. The effect of linkage on limits to artificial selection. Genet Res. 1966;8(3):269–294.
  24. Hill WG, Robertson A. Linkage disequilibrium in finite populations. Theor Appl Genet. 1968;38(6):226–231.
  25. Huber CD, Durvasula A, Hancock AM, Lohmueller KE. Gene expression drives the evolution of dominance. Nat Commun. 2018;9(1):2750.
  26. Hudson RR. Two-locus sampling distributions and their application. Genetics. 2001;159(4):1805–1817.
  27. Ivankov DN, Finkelstein AV, Kondrashov FA. A structural perspective of compensatory evolution. Curr Opin Struct Biol. 2014;26:104–112.
  28. Jouganous J, Long W, Ragsdale AP, Gravel S. Inferring the joint demographic history of multiple populations: beyond the diffusion approximation. Genetics. 2017;206(3):1549–1567.
  29. Kacser H, Burns JA. The molecular basis of dominance. Genetics. 1981;97(3–4):639–666.
  30. Kamm JA, Spence JP, Chan J, Song YS. Two-locus likelihoods under variable population size and fine-scale recombination rate estimation. Genetics. 2016;203(3):1381–1399.
  31. Kim BY, Huber CD, Lohmueller KE. Inference of the distribution of selection coefficients for new nonsynonymous mutations using large samples. Genetics. 2017;206(1):345–361.
  32. Kim Y, Stephan W. Joint effects of genetic hitchhiking and background selection on neutral variation. Genetics. 2000;155(3):1415–1427.
  33. Kimura M. Random genetic drift in multi-allelic locus. Evolution. 1955;9(4):419–435.
  34. Kimura M, Maruyama T. The mutational load with epistatic gene interactions in fitness. Genetics. 1966;54(6):1337–1351.
  35. Kondrashov AS. Selection against harmful mutations in large sexual and asexual populations. Genet Res. 1982;40(3):325–332.
  36. Kondrashov AS. Dynamics of unconditionally deleterious mutations: Gaussian approximation and soft selection. Genet Res. 1995;65(2):113–121.
  37. McVean G. The structure of linkage disequilibrium around a selective sweep. Genetics. 2007;175(3):1395–1406.
  38. McVean GA, Charlesworth B. The effects of Hill-Robertson interference between weakly selected mutations on patterns of molecular evolution and variation. Genetics. 2000;155(2):929–944.
  39. McVean GAT, Myers SR, Hunt S, Deloukas P, Bentley DR, Donnelly P. The fine-scale structure of recombination rate variation in the human genome. Science. 2004;304(5670):581–584.
  40. Novembre J, Di Rienzo A. Spatial patterns of variation due to natural selection in humans. Nat Rev Genet. 2009;10(11):745–755.
  41. Ohta T, Kimura M. Linkage disequilibrium at steady state determined by random genetic drift and recurrent mutation. Genetics. 1969;63(1):229–238.
  42. Ohta T, Kimura M. Linkage disequilibrium between two segregating nucleotide sites under the steady flux of mutations in a finite population. Genetics. 1971;68(4):571–580.
  43. Otwinowski J, McCandlish DM, Plotkin JB. Inferring the shape of global epistasis. Proc Natl Acad Sci USA. 2018;115(32):E7550–E7558.
  44. Puchta O, Cseke B, Czaja H, Tollervey D, Sanguinetti G, Kudla G. Network of epistatic interactions within a yeast snoRNA. Science. 2016;352(6287):840–844.
  45. Ragsdale AP, Gravel S. Models of archaic admixture and recent history from two-locus statistics. PLoS Genet. 2019;15(6):e1008204.
  46. Ragsdale AP, Gravel S. Unbiased estimation of linkage disequilibrium from unphased data. Mol Biol Evol. 2020;37(3):923–932.
  47. Ragsdale AP, Gutenkunst RN. Inferring demographic history using two-locus statistics. Genetics. 2017;206(2):1037–1048.
  48. Romero PA, Arnold FH. Exploring protein fitness landscapes by directed evolution. Nat Rev Mol Cell Biol. 2009;10(12):866–876.
  49. Roze D. A simple expression for the strength of selection on recombination generated by interference among mutations. Proc Natl Acad Sci USA. 2021;118:e2022805118.
  50. Sandler G, Wright SI, Agrawal AF. Patterns and causes of signed linkage disequilibria in flies and plants. Mol Biol Evol. 2021;38(10):4310–4321.
  51. Sanjak JS, Long AD, Thornton KR. A model of compound heterozygous, loss-of-function alleles is broadly consistent with observations from complex-disease GWAS datasets. PLoS Genet. 2017;13(1):e1006573.
  52. Smith JM, Haigh J. The hitch-hiking effect of a favourable gene. Genet Res. 1974;23(1):23–35.
  53. Sohail M, Vakhrusheva OA, Sul JH, Pulit SL, Francioli LC, van den Berg LH, Veldink JH, de Bakker PIW, Bazykin GA, Kondrashov AS, et al.; Alzheimer’s Disease Neuroimaging Initiative. Negative selection in humans and fruit flies involves synergistic epistasis. Science. 2017;356(6337):539–542.
  54. Song YS, Song JS. Analytic computation of the expectation of the linkage disequilibrium coefficient r². Theor Popul Biol. 2007;71(1):49–60.
  55. Speidel L, Forest M, Shi S, Myers SR. A method for genome-wide genealogy estimation for thousands of samples. Nat Genet. 2019;51(9):1321–1329.
  56. Stanek D, Bis-Brewer DM, Saghira C, Danzi MC, Seeman P, Lassuthova P, Zuchner S. Prot2HG: a database of protein domains mapped to the human genome. Database. 2020;2020:baz161.
  57. Steinberg B, Ostermeier M. Shifting fitness and epistatic landscapes reflect trade-offs along an evolutionary pathway. J Mol Biol. 2016;428(13):2730–2743.
  58. Taverner AM, Blaine LJ, Andolfatto P. Epistasis and physico-chemical constraints contribute to spatial clustering of amino acid substitutions in protein evolution. BioRxiv. 2020; doi:10.1101/2020.08.05.237594.
  59. Thornton KR. Polygenic adaptation to an environmental shift: temporal dynamics of variation under Gaussian stabilizing selection and additive effects on a single trait. Genetics. 2019;213(4):1513–1530.
  60. Turelli M, Orr HA. Dominance, epistasis and the genetics of postzygotic isolation. Genetics. 2000;154(4):1663–1679.
  61. Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, Burovski E, Peterson P, Weckesser W, Bright J, et al.; SciPy 1.0 Contributors. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods. 2020;17(3):261–272.
  62. Yeang C-H, Haussler D. Detecting coevolution in and among protein domains. PLoS Comput Biol. 2007;3(11):e211.
  63. Zhao L, Charlesworth B. Resolving the conflict between associative overdominance and background selection. Genetics. 2016;203(3):1315–1334.

Data Availability Statement

All data and software used in this article are publicly available and open source. I downloaded the Thousand Genomes annotation and genotype VCFs from the FTP server at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/, and the Thousand Genomes human ancestor FASTA file from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/supporting/ancestral_alignments/. Protein domain information from Stanek et al. (2020) was downloaded from http://prot2hg.com/dbdownload.php.

The moment equations used to compute expectations for two-locus and LD statistics are implemented in Python using NumPy (Harris et al. 2020) and sparse matrix solvers in SciPy (Virtanen et al. 2020). These methods are packaged within moments, and analyses here were performed using moments version 1.1.10, available from https://bitbucket.org/simongravel/moments and via conda, with extensive documentation at https://moments.readthedocs.org. Scripts to run all analyses, recreate figures, and compile this manuscript are available at https://github.com/apragsdale/two_locus_selection. Each URL was last accessed July 5, 2022.
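
To illustrate the general pattern of solving a linear moment system with SciPy's sparse solvers, the sketch below integrates a much simpler system than the two-locus one used in this article: the neutral, single-locus sample frequency spectrum under drift and infinite-sites mutation. It does not reproduce the code in moments. The function names `drift_matrix` and `integrate_sfs`, the Crank-Nicolson scheme, the time scaling (units of 2N generations), and the mutation input of nθ/2 into the singleton class are all assumptions chosen for this illustration.

```python
import numpy as np
from scipy.sparse import diags, identity
from scipy.sparse.linalg import factorized


def drift_matrix(n):
    """Sparse drift operator on the segregating SFS entries i = 1, ..., n-1."""
    i = np.arange(1, n, dtype=float)
    g = 0.5 * i * (n - i)  # drift rate attached to frequency class i
    # Row i gains g(i-1) from class i-1 and g(i+1) from class i+1, and loses 2*g(i).
    return diags([g[:-1], -2.0 * g, g[1:]], offsets=[-1, 0, 1], format="csc")


def integrate_sfs(n, theta, T, dt=0.01):
    """Crank-Nicolson integration of d(phi)/dt = D phi + b, starting from phi = 0."""
    D = drift_matrix(n)
    eye = identity(n - 1, format="csc")
    b = np.zeros(n - 1)
    b[0] = 0.5 * n * theta  # new mutations enter as singletons (assumed scaling)
    # Factor the implicit matrix once and reuse the factorization at every step.
    solve = factorized((eye - 0.5 * dt * D).tocsc())
    explicit = (eye + 0.5 * dt * D).tocsr()
    phi = np.zeros(n - 1)
    for _ in range(int(round(T / dt))):
        phi = solve(explicit @ phi + dt * b)
    return phi


if __name__ == "__main__":
    # After a long integration, entry i should approach the neutral value theta / i.
    print(integrate_sfs(n=10, theta=1.0, T=30.0))
```

Factoring the implicit matrix once is a reasonable choice when the population size is constant over the integration interval; with changing population size the matrix would need to be rebuilt and refactored whenever the drift term is rescaled.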

Supplemental material is available at GENETICS online.


Articles from Genetics are provided here courtesy of Oxford University Press
