PLOS Pathogens. 2021 Jun 21;17(6):e1009669. doi: 10.1371/journal.ppat.1009669

An evolution-based high-fidelity method of epistasis measurement: Theory and application to influenza

Gabriele Pedruzzi 1, Igor M Rouzine 1,*
Editor: Santiago F Elena2
PMCID: PMC8248644  PMID: 34153082

Abstract

Linkage effects in a multi-locus population strongly influence its evolution. The models based on the traveling wave approach enable us to predict the average speed of evolution and the statistics of phylogeny. However, predicting statistically the evolution of specific sites and pairs of sites in the multi-locus context remains a mathematical challenge. In particular, the effects of epistasis, the interaction of gene regions contributing to phenotype, are difficult to predict theoretically and detect experimentally in sequence data. A large number of false-positive interactions arise from stochastic linkage effects and indirect interactions, which mask true epistatic interactions. Here we develop a proof-of-principle method to filter out false-positive interactions. We start by demonstrating that the averaging of haplotype frequencies over multiple independent populations is necessary but not sufficient for epistatic detection, because it still leaves high numbers of false-positive interactions. To compensate for the residual stochastic noise, we develop a three-way haplotype method isolating true interactions. The fidelity of the method is confirmed analytically and on simulated genetic sequences evolved with a known epistatic network. The method is then applied to a large sequence database of neuraminidase protein of influenza A H1N1 obtained from various geographic locations to infer the epistatic network responsible for the difference between the pre-pandemic virus and the pandemic strain of 2009. These results present a simple and reliable technique to measure epistatic interactions of any sign from sequence data.

Author summary

Interactions between genomic sites create a fitness landscape. The knowledge of topology and strength of interactions is vital for predicting the escape of viruses from drugs and immune response and their passing through fitness valleys. Many efforts have been invested into measuring these interactions from DNA sequence sets. Unfortunately, reproducibility of the results remains low due partly to a very small fraction of interaction pairs and partly to stochastic linkage noise masking true interactions. Here we propose a method to separate stochastic linkage and indirect interactions from epistatic interactions and apply it to influenza virus sequence data.

Introduction

About a century ago, it was realized that the evolution of a population is strongly affected by the fact that the fates of alleles at different loci are linked unless separated by recombination. These linkage effects include clonal interference [1,2], background selection, genetic hitchhiking [3], enhanced accumulation of deleterious mutations (Muller’s ratchet) [4], and the increase of genetic drift at one locus due to selection at another [5]. Linkage decreases the speed of adaptation and creates random associations between pairs of mutations occurring on the same branch of the ancestral tree.

These effects have been taken into account in early mathematical models considering two loci [6] and, more recently, in the traveling wave approach, which describes an arbitrarily large number of linked sites [7–12]. These models describe the dynamics of fitness classes and include the factors of selection, mutation, random genetic drift and recombination [13–16]. All these models predict a narrow fitness distribution traveling in the fitness space in a direction depending on the initial conditions and parameters [17,18]. This "traveling wave" consists of the deterministic bulk and the leading stochastic edge, where the generation and establishment of rare beneficial mutations limit the adaptation rate. Alternatively, the distribution may move backwards accumulating more and more deleterious alleles (Muller’s ratchet). These models are able to express, in the general form, important observable quantities in terms of model parameters, such as the population size, mutation rate, and the distribution of selection coefficients over loci. The observable quantities include the adaptation rate [7–10], Muller ratchet rate [7,9], the conditions of full equilibrium [7,19], fixation probability of an allele, and the most probable selection coefficient [12]. The same general approach was used to predict the statistical properties of the ancestral tree [15,16,20–22].

Despite all the progress, prediction of the evolution of specific sites in the multi-site context remains an open question. How do allelic frequencies at each site change in time when the system is adapting? Although the dependence of allelic frequencies on time is stochastic due to the combined effects of selection, random drift, and linkage, what can be said about the average allelic frequency of a given site with a given fitness effect of mutation? Also, what can we say about the evolution of site pairs, especially in the presence of epistatic interaction?

Epistasis defined as the interaction of genes and gene regions contributing to phenotype is an omnipresent phenomenon [23]. Gene interactions are reported to be responsible for a considerable fraction of the organism’s genetic inheritance [24]. They create fitness valleys in the evolutionary path [25]. In pathogens, epistasis facilitates the development of drug resistance and immune escape and impedes reversion of drug-resistant mutations [26–31]. Most of HIV variation in untreated patients has been argued to arise from mutations compensating early immune escape mutations [32].

Pairwise epistasis can be measured from binding free energy [33] and measuring fitness gains [34]. A large number of approaches have been proposed to measure epistasis from genomic data [35–37]. The simplest methods are based on pairwise allelic correlations [5,38]. The problem with all these approaches is that linkage and indirect interactions create strong inter-site associations even between non-interacting locus pairs, and these false-positive pairs are much more numerous than the true epistatic pairs. Stochastic effects are well-recognized as the most serious obstacle to the detection of epistatic effects [39]. In a single asexual population, stochastic linkage completely overshadows the epistatic footprint, except in a narrow range of times and parameters [40]. The same limitation exists for the tree-based methods of detection [41,42].

A method to eliminate false-positive links arising due to indirect epistatic interactions in the absence of linkage has been developed and successfully applied to protein sequences isolated from different species [43,44]. A similar technique has been applied to the fitness landscape of antibody-binding regions of HIV protein gp120 [45]. However, none of these methods enable reliable measurement of epistasis in asexual populations or in sexual populations at close loci from the same species [39]. Any attempt to detect epistasis, whether by using covariance measures (D′, r², mutual entropy, universal footprint of epistasis (UFE) [46]) or the tree-based methods [42], faces the same problem: the overwhelming linkage effects. The existing methods are based on the approximation of quasi-linkage equilibrium, which neglects linkage effects by assuming the limit of strong recombination (see [47] for review).

As we have shown in [40], the increasingly dominant effect of linkage over epistasis in allelic associations results from the random divergence of independent populations in time, so that all sequences are similar to their most recent common ancestors, and the common ancestors move away from the origin and from each other along stochastic trajectories. As a result, any measure of co-variance, or even the use of the entire tree, produces only strong noise of random sign. Co-variation due to random linkage completely masks the epistasis signature in a population. The only way to resolve this issue is to average the haplotype frequencies over many independent populations with similar parameters under similar conditions. Without sampling multiple populations, it is not possible to infer epistasis in principle, due to the stochastic nature of the phylogenetic relation of sequences. This fundamental limitation resulting from random phylogeny [40] cannot be resolved by any existing or future method. Furthermore, as we show below, even 200 populations may not be enough to eliminate the false-positives in a system of 40 loci. The contribution of the present work is not to overcome this fundamental limitation, which is not possible in principle, but to demonstrate the existence of a large number of residual false-positive interactions left after averaging over an ensemble of 20 to 200 populations, and to propose a new method to eliminate these false links.

The new technique is based on the use of three-way haplotypes. The basic idea behind this method is that demanding a majority allele at a neighbor site of a measured pair of sites interrupts a “detour”, i.e., a path along interacting sites that creates a false-positive interaction for the pair of interest. In other terms, the additional condition splits the genome into independent blocks.

The high fidelity of this detection technique is demonstrated below by using two parallel methods: analytic derivation for a simple network topology and Monte-Carlo simulation. The analytic derivation makes no assumptions about the range of E or the sign of epistasis. The simulation example considers compensatory epistasis, 0<E<1, but the technique applies at any value of E, including negative epistasis.

Then, we apply the method to real virus sequences from an adapting population. Influenza virus evolving in a human population can be mapped onto the traveling wave theory with an effective selection pressure caused by accumulating memory cells [48,49]. Therefore, it is expected to be amenable to our method. Using more than 8000 influenza sequences obtained from various geographic locations, both before and after the pandemic of 2009, we use our method to predict the epistatic interactions among alleles from their observed associations by isolating them from linkage effects in a surface protein, neuraminidase. We chose this specific protein because it underwent strong changes when the old strain was replaced by a new strain in 2009. The old strain and the pandemic strain of H1N1 share 80% homology. This indicates that, at some point in the past, the two strains had a common ancestor.

We infer the primary and compensatory mutations that allowed the new strain to outcompete the old strain. The method cannot infer the order in which these compensatory mutations have emerged but only the resulting network. In a similar fashion, epistatic networks in some common proteins are estimated from the comparison between their sequences in long-diverged species neglecting linkage effects [43,44]. The difference in our case is that our approach does not make such an assumption and eliminates linkage effects as well.

Thus, the aim of our theoretical paper is to offer a new method of epistatic detection based on an analytic derivation, test it on simulated sequence data, apply it to a real sequence set, and make a testable prediction for a protein network. We do not aim to describe the evolution of H1N1 influenza strain in detail leaving it for other projects.

Results

Simulation model to generate sequences for the test

We start by simulating the evolution of a haploid asexual population using a Wright-Fisher process including the factors of random mutation, random genetic drift, and constant directional selection (Fig 1A) (Methods). We assume two possible alleles at each locus (site), 0 and 1, where 1 stands for an allele that decreases the logarithm of fitness by a fixed value s≪1 when it does not interact with any other sites. The effects of variable selection coefficients are addressed elsewhere [12,50]. The binary simplification provides a major reduction in the computational cost and is accurate for relatively conserved sequences. We let some pairs of sites interact epistatically, with a mutual degree of compensation between deleterious alleles chosen to be E = 0.75.
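The simulation described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the paper: the exact fitness function is given in the Methods, and here we assume that each allele 1 costs s in log fitness and that a fraction E of the joint cost 2s is restored when both sites of an interacting pair carry allele 1. Symmetric per-site mutation with probability μL/L and the function names `log_fitness` and `wright_fisher` are likewise assumptions made for the example.

```python
import math
import random

def log_fitness(genome, s, E, pairs):
    """Log fitness of a binary genome (assumed form, see lead-in).

    Each allele 1 costs s; if both sites of an interacting pair carry 1,
    a fraction E of their joint cost 2s is restored.
    """
    cost = s * sum(genome)
    for i, j in pairs:
        if genome[i] == 1 and genome[j] == 1:
            cost -= 2.0 * s * E
    return -cost

def wright_fisher(N, L, s, E, pairs, mu_L, f0, generations, rng):
    """Evolve N haploid asexual genomes of L binary sites; return the final population."""
    # Random initial population with allele-1 frequency f0 at every site.
    pop = [[1 if rng.random() < f0 else 0 for _ in range(L)] for _ in range(N)]
    mu_site = mu_L / L  # per-site mutation probability per generation
    for _ in range(generations):
        # Selection and drift: draw N parents with probability proportional to fitness.
        weights = [math.exp(log_fitness(g, s, E, pairs)) for g in pop]
        parents = rng.choices(pop, weights=weights, k=N)
        # Mutation: flip each site independently (symmetric mutation is an assumption).
        pop = [[a ^ 1 if rng.random() < mu_site else a for a in g] for g in parents]
    return pop
```

Calling `wright_fisher` repeatedly with different seeds of `random.Random` would give the ensemble of independent populations over which haplotype frequencies are later averaged.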

Fig 1. Schematic diagram of the method and its testing on simulated sequence data.

A. The computer model of asexual evolution includes the factors of random mutation, selection, epistasis, and random genetic drift. Pairwise haplotype frequencies fij are averaged over simulation runs (independent populations). The pairwise correlation measure UFEij is calculated from Eq 1. The indirect links and the residual linkage are detected and filtered out by using the tri-way correlation measure, UFEij0, from Eq 2. B. Pre-set epistatic network for 50 sites. Green curves: real epistatic interactions. Red lines: indirect interactions. Blue lines: examples of stochastic linkage effects. C-D. The network of strong (UFEij > 0.5) candidate epistatic interactions predicted (C) from a single population and (D) after averaging over 200 populations. E. Scatter plot of the three-way haplotype, min(UFEij0) shown against UFEij for the pairs identified in (D). The dashed sector corresponds to the direct interactions. The upper dashed line is the diagonal, UFEij = UFEij0, and the lower dashed line separates direct and indirect interactions. F. Predicted network accurately recapitulates the pre-set epistatic network. Parameters: initial allele frequency f0 = 0.45, mutation rate per genome μL = 0.07, fixed selection coefficient s = 0.1, N = 1000, L = 40, epistatic strength E = 0.75.

We consider a simple epistatic network consisting of double arches (Fig 1B). The network has three types of correlations: direct interactions, indirect interactions, and linkage, as shown by different colors. If the first and the second loci interact, and the second and the third loci interact, this leads to correlations between the allelic composition at the first and third loci, even though they do not interact. We term this an "indirect interaction". In addition, any pair of loci correlates due to linkage, i.e., having a common ancestor. A more complex network is discussed later on.

First step: Averaging over populations

Genome sequences produced by the simulation demonstrate the presence of strong pairwise correlations between the allelic composition of different loci originating from three sources: direct epistatic interaction, indirect interaction, and stochastic linkage effect. Our first task is to detect all potential epistatic interactions using correlation analysis. As we mentioned, their detection is masked by strong stochastic linkage arising from common ancestors [40]. To decrease linkage effects, for each pair of sites (i, j), we calculate pairwise haplotype frequencies, fαβij, where α,β = 0 or 1, and then we average them over multiple evolutionary-independent populations of the same size. Then, we calculate a metric termed “universal footprint of epistasis” (UFE)

$$\mathrm{UFE}_{ij} = 1 - \frac{\log(f_{11}/f_{00})}{\log(f_{01}f_{10}/f_{00}^{2})} \qquad (1)$$

where f00, f10, f01, f11 are the haplotype frequencies averaged over the ensemble of populations (we dropped indices i, j). More traditional correlation measures, such as D′ and the Pearson coefficient r², have been shown to generate similar stochastic noise [40]. As compared to these measures, UFEij has the unique advantage of directly measuring the degree of mutual compensation of two alleles for infinite averaging, UFEij = E, provided the interacting pair is epistatically isolated from the other sites [46]. Because the logarithms in Eq 1 diverge when one of the four haplotype frequencies is zero, we consider only site pairs such that all four fij are larger than fcut, where fcut ≪ 1 is a low cutoff, which is set below at fcut = 0.05.
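As an illustration of Eq 1, the following Python sketch averages pairwise haplotype frequencies over a list of populations and evaluates UFE with the cutoff fcut. The function names and the representation of a population as a list of 0/1 lists are assumptions made for this example. Populations are taken to be of equal size, so a plain mean over populations is used.

```python
import math

HAPLOTYPES = [(0, 0), (0, 1), (1, 0), (1, 1)]

def pair_frequencies(population, i, j):
    """Frequencies of the four haplotypes at sites (i, j) in one population."""
    counts = dict.fromkeys(HAPLOTYPES, 0)
    for genome in population:
        counts[(genome[i], genome[j])] += 1
    return {h: c / len(population) for h, c in counts.items()}

def averaged_pair_frequencies(populations, i, j):
    """Mean of the pairwise haplotype frequencies over independent populations."""
    per_pop = [pair_frequencies(p, i, j) for p in populations]
    return {h: sum(f[h] for f in per_pop) / len(per_pop) for h in HAPLOTYPES}

def ufe(f, f_cut=0.05):
    """UFE of Eq 1; returns None if any haplotype frequency is not above f_cut."""
    if min(f.values()) <= f_cut:
        return None
    denom = math.log(f[(0, 1)] * f[(1, 0)] / f[(0, 0)] ** 2)
    if denom == 0.0:
        return None  # the ratio in Eq 1 is undefined
    return 1.0 - math.log(f[(1, 1)] / f[(0, 0)]) / denom
```

For instance, frequencies proportional to 1, x, x, and x^(2(1−E)) for haplotypes 00, 01, 10, and 11 return UFE = E for any 0 < x < 1, which is the property stated in the text for an isolated pair.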

Next, we keep only pairs with sufficiently high correlation, UFEij > 0.6 (recall that we set E at 0.75). For a single population, the raw graph of inferred pairs is extremely complex and completely hides true epistatic interactions (in this case, 24 interactions) (Fig 1C). For a genome longer than the L = 40 used here, it would be even worse. A significant reduction of the number of false-positive interactions is obtained by averaging fij over 200 independent populations (Fig 1D). However, the vast majority of the remaining links are still false-positive, because their number is still much higher than the number of actual interactions (Fig 1B, green arches).

Step 2: Three-way correlation

To clean the network from residual false-positive links caused by incomplete averaging over ensemble, we can either try to average over tens of thousands of independent populations, which are never available in real life, or use a trick, as follows.

The idea of the procedure is based on the fact that false-positive links are created by "detours" around pairs of sites, such as chains of indirect epistatic interactions or sites linked due to the common phylogenetic origin (red and blue curves in Fig 1B). To break up the association created by detours, we demand that a neighbor site of the site pair of interest is 0 (better-fit, wild-type) and recalculate the correlation. If that neighbor site happens to be at the most important detour, this condition will break up or, at least, decrease the indirect correlation. Direct interactions are affected by such an additional condition to a smaller extent (if at all). Thus, for each connected pair i, j, we calculate the three-way measure

$$\mathrm{UFE}_{ij0} = 1 - \frac{\log(f_{110}/f_{000})}{\log(f_{010}f_{100}/f_{000}^{2})} \qquad (2)$$

where 0 in the third position selects the sequences with the consensus allele 0 at a chosen site adjacent to one site of the tested pair. We consider all possible connected sites, one by one, as the 0-node and calculate the minimum value of UFEij0 over all possible 0-nodes, min(UFEij0). Finding the minimum not only detects a detour but also finds the most important detour if there is more than one. Thus, we can identify and remove false-positive links as those with a low ratio min(UFEij0)/UFEij.
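The three-way test can be sketched as follows. This is an illustrative Python fragment under stated assumptions: populations are lists of 0/1 lists of equal size, `neighbors` is a hypothetical list of sites connected to i or j in the candidate network, and three-way frequencies only need to be positive (the text does not specify a separate cutoff for them).

```python
import math

def three_way_ufe(populations, i, j, k):
    """UFE of Eq 2 for pair (i, j), using only haplotypes with allele 0 at site k."""
    f = {(a, b): 0.0 for a in (0, 1) for b in (0, 1)}
    for pop in populations:
        for g in pop:
            if g[k] == 0:
                # Frequency of haplotype (a, b, 0) summed over populations; the
                # common factor 1/len(populations) cancels in the ratios of Eq 2.
                f[(g[i], g[j])] += 1.0 / len(pop)
    if min(f.values()) <= 0.0:
        return None  # a logarithm would diverge
    denom = math.log(f[(0, 1)] * f[(1, 0)] / f[(0, 0)] ** 2)
    if denom == 0.0:
        return None
    return 1.0 - math.log(f[(1, 1)] / f[(0, 0)]) / denom

def min_three_way_ufe(populations, i, j, neighbors):
    """Minimum of the three-way UFE over all candidate 0-nodes."""
    values = []
    for k in neighbors:
        if k in (i, j):
            continue
        value = three_way_ufe(populations, i, j, k)
        if value is not None:
            values.append(value)
    return min(values) if values else None
```

A link would then be flagged as a false positive when the ratio of `min_three_way_ufe` to the pairwise UFE is low.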

For every potential link between sites i and j detected in Fig 1D, we calculate min(UFEij0) (Fig 1A, bottom). The scatter plot in Fig 1E demonstrates that, for the false-positive pairs, min(UFEij0) is several-fold smaller than UFEij (red dots in Fig 1E). For true links (green dots in Fig 1B and 1E), the two correlation measures are nearly the same. The choice of the threshold in min(UFEij0)/UFEij (lower dashed line in Fig 1E) is not crucial, as long as we average the haplotype frequencies over at least ~20 populations (for our parameter choice). In this case, the two groups, false-positive and true interactions, remain distinct. The end result is perfect detection of the pre-set network (Fig 1B). As a bonus, we obtain accurate estimates for the compensation strength: UFE ≈ E within 15% accuracy (Fig 1F, numbers at green links).

In our previous paper [40], we showed that averaging over dozens of populations is required to isolate epistatic links. In our example in Fig 1, we demonstrate that, at L = 50 sites and moderate population sizes, half of the false-positive links remain even after 200 populations. The present method provides 100% fidelity at L = 40 sites or less, for the number of replicate populations between 20 and 200, and epistatic strength E = 0.5−0.75. Performance drops sharply at E < 0.25.

Analytic results

In the Methods, we show analytically that min(UFEij0) ≪ min(UFEij) for indirect interactions, as given by

$$\mathrm{UFE}_{ij} \approx \frac{1}{4(1-E)}$$
$$\mathrm{UFE}_{ij0} \approx 0$$

and that this condition can be used to eliminate indirect interactions analytically, in the general form. In the analytic derivation, we assume that the system is under directed selection and in the multiple-mutation regime (the traveling wave regime), which takes place if log(NU) ≫ log(s/U) [12]. In this regime, selection sweeps at many sites overlap in time and interfere with each other [7]. We also assume that the population is far from mutation-selection-drift equilibrium, so that deleterious mutation events are negligible. The derivation given in the Methods applies at negative values of E as well, but we focus on positive E, a case termed "diminishing returns epistasis". The reason for this choice is the strong effects of epistasis and strong indirect interactions in this region. The analytic derivation applies at any E < 1, i.e., below the full compensation point, which covers all basic types of epistasis.
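The contrast between the pairwise and the three-way measure for an indirect pair can be checked numerically on a toy model. The sketch below is not the derivation from the Methods. It assumes a three-site chain (sites 0 and 1 interact, sites 1 and 2 interact, sites 0 and 2 do not) and assumes that haplotype frequencies are proportional to exp(−βs·cost), where cost counts deleterious alleles and subtracts 2E for each interacting pair with both alleles present. The parameter `beta_s` is a hypothetical strength of this weighting. In the limit of large `beta_s` and E > 1/2, this toy model gives a pairwise UFE of 1/(4(1−E)) for the end sites, and exactly 0 once the middle site is fixed at 0.

```python
import math
from itertools import product

def chain_weights(E, beta_s):
    """Assumed quasi-equilibrium weights of the 8 haplotypes of a three-site chain."""
    weights = {}
    for n in product((0, 1), repeat=3):
        # Cost in units of s: one per allele 1, minus 2E per interacting pair (0-1, 1-2).
        cost = sum(n) - 2.0 * E * (n[0] * n[1] + n[1] * n[2])
        weights[n] = math.exp(-beta_s * cost)
    return weights

def ufe_end_sites(weights, middle=None):
    """UFE for the non-interacting end sites 0 and 2.

    If middle is given, only haplotypes with that allele at site 1 are used.
    """
    f = {(a, b): 0.0 for a in (0, 1) for b in (0, 1)}
    for (n0, n1, n2), w in weights.items():
        if middle is None or n1 == middle:
            f[(n0, n2)] += w
    denom = math.log(f[(0, 1)] * f[(1, 0)] / f[(0, 0)] ** 2)
    return 1.0 - math.log(f[(1, 1)] / f[(0, 0)]) / denom

weights = chain_weights(E=0.6, beta_s=50.0)
pairwise = ufe_end_sites(weights)             # close to 1 / (4 * (1 - 0.6)) = 0.625
three_way = ufe_end_sites(weights, middle=0)  # close to 0
```

In this toy setting the indirect pair shows a pairwise signal comparable to that of a direct link, and the signal vanishes after conditioning on the consensus allele at the intermediate site.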

We also repeated this derivation for a more complex topology of closed squares (S1A Fig). Here indirect interaction occurs between the opposite corners of the square, and direct interaction between the sites of one side. This topology is more complex, because it has a loop, and there are two paths connecting the opposite corners. The results show that at E > 1/3, the magnitude of direct and indirect correlations is the same. The 3-way correlation method decreases indirect correlation to a larger degree, which can be used to tell the direct and indirect correlations apart (S1B Fig and S1 Table). Thus, the 3-way method is robust with respect to topology. The difference from the double-arch case is that indirect correlation does not disappear completely and remains of the same order of magnitude as the direct interaction. This is especially true if E is close to the full compensation point (in this case, E = 1/2). Because, in real biological systems, the value of E varies broadly across pairs, such a difference may not be enough for the reliable detection of the true links demonstrated above for a loop-less topology. The natural way to address this issue is to add another 0 and measure a 4-way correlation to interrupt both equal detours connecting the two points. This trick, indeed, removes indirect interactions completely in the entire interval of E, as given by $\mathrm{UFE}^{\mathrm{ind}}_{00} \approx 0$ (S2 Fig and S1 Table).

For the general topology with many loops, the number of the additional zeros required to kill an indirect interaction completely is equal to the number of directions in which a detour can occur (the connectivity parameter). For example, if a chosen site of interest has six neighbor sites with strong correlation (direct or indirect, we do not know), and three of them create a distinct detour to the other site of interest, we would need to add three zeros and calculate the minimum over all possible combinations. Therefore, to decipher a very complex network, one needs to add extra zeros iteratively around sites with many neighbors and see if anything has changed at each point. In the example with virus data below, a loopless network emerges already after the three-way test.
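The extension to several zeros can be sketched by taking the minimum over all combinations of a given number of neighbor sites. This Python fragment is an illustration under assumptions: `neighbors` and `n_zeros` are hypothetical inputs chosen by the user, populations are lists of 0/1 lists, and conditioned frequencies only need to be positive.

```python
import math
from itertools import combinations

def conditioned_ufe(populations, i, j, zero_sites):
    """UFE for pair (i, j) using only haplotypes with allele 0 at every site in zero_sites."""
    f = {(a, b): 0.0 for a in (0, 1) for b in (0, 1)}
    for pop in populations:
        for g in pop:
            if all(g[k] == 0 for k in zero_sites):
                # The overall normalization cancels in the UFE ratios.
                f[(g[i], g[j])] += 1.0 / len(pop)
    if min(f.values()) <= 0.0:
        return None
    denom = math.log(f[(0, 1)] * f[(1, 0)] / f[(0, 0)] ** 2)
    if denom == 0.0:
        return None
    return 1.0 - math.log(f[(1, 1)] / f[(0, 0)]) / denom

def min_multiway_ufe(populations, i, j, neighbors, n_zeros):
    """Minimum conditioned UFE over all combinations of n_zeros neighbor sites."""
    candidates = [k for k in neighbors if k not in (i, j)]
    values = [conditioned_ufe(populations, i, j, combo)
              for combo in combinations(candidates, n_zeros)]
    values = [v for v in values if v is not None]
    return min(values) if values else None
```

With `n_zeros=1` this reduces to the three-way test. Note that each added zero discards more sequences, so the number of zeros that can be used in practice is limited by the sample size.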

Application to influenza A virus

After testing our method analytically and on simulated sequences, for the sake of demonstration, we now infer an epistatic network for an evolving viral population. Our choice is the surface protein of Influenza A H1N1, neuraminidase (NA), important for virus infectivity and an important target of drug therapy and immune response. This protein is one of two proteins that control the virus entry into a host cell (the other is hemagglutinin).

Our aim is to identify mutations and their interactions that allowed the pandemic strain of 2009 to outcompete the pre-2009 strain. For this aim, we compared the sequences of the first strain to the sequences of the second strain, both sampled worldwide from dozens of locations. In order to be able to apply our method, we assumed that various local populations are nearly independently evolving, and migration between them is very slow.

Although they are not actually independent, but represent parts of one metapopulation, this is the best approximation one can obtain in any study working with world pandemics data.

We have used about 8000 sequences found in the cited database for the period 2000–2010 from various geographic locations. Our goal was to understand the difference between the strains before and after the pandemic of 2009, which have 80% mutual homology. We wished to infer only the epistatic sub-network related to that difference. To this end, we compared worldwide samples of sequences from the two strains. We randomly sampled similar numbers of sequences from the first and second strains, and re-sampled them several hundred times. We also checked the robustness of the results to the exact sampling size (Fig 1F). We have observed that the old and the new strains are both diverse. The two strains evolved together over several years, and the new strain gradually replaced the old strain. In terms of the traveling wave, this implies that we have two traveling waves with overlapping fitness distributions, one gradually waning over several years and being replaced by the other.

To simplify our task, we binarized the sequences by setting each consensus allele to 0 and each non-consensus ("mutant") allele to 1. This simplification is adequate for the aim of detection of interactions, unless several amino acid variants are present at a site at similar frequencies, which we found to be a rare occurrence. We considered only the sites that were strongly polymorphic (>5%) and observed a bimodal distribution of sequences in the mutant allele frequency per genome with two separate maxima of different height, at f = 0.05 and 0.2. The low-frequency peak was taller. The bimodal distribution reflects the mixture of two strains, the old and the new, with 80% homology. Thus, the old and the new strain differed in NA in about 100 sites, which is 20% of the length.
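The binarization step can be illustrated with a short Python sketch. It assumes an alignment given as equal-length strings, takes the most common residue in each column as the consensus, and interprets "strongly polymorphic (>5%)" as a mutant-allele frequency above 5% at the site. The function names are hypothetical.

```python
from collections import Counter

def binarize(alignment):
    """Map each aligned sequence to 0/1: 0 for the consensus residue, 1 otherwise."""
    length = len(alignment[0])
    consensus = [Counter(seq[k] for seq in alignment).most_common(1)[0][0]
                 for k in range(length)]
    binary = [[0 if seq[k] == consensus[k] else 1 for k in range(length)]
              for seq in alignment]
    return consensus, binary

def polymorphic_sites(binary, min_freq=0.05):
    """Sites whose mutant-allele frequency exceeds min_freq (assumed reading of '>5%')."""
    n = len(binary)
    return [k for k in range(len(binary[0]))
            if sum(g[k] for g in binary) / n > min_freq]

def mutant_frequency_per_genome(binary):
    """Fraction of mutant alleles in each sequence, whose histogram was bimodal in the data."""
    return [sum(g) / len(g) for g in binary]
```

If two residues are equally common in a column, `Counter.most_common` returns the one encountered first, so ties would need a deliberate rule in a real analysis.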

In order to compensate for unequal sampling from the pre-pandemic and pandemic strain, the more abundant sequences with mutation frequency per genome less than a preset value, f < dv, were randomly sampled and down-weighted by a coefficient, Dw, ranging from 5% to 50%. This procedure was done to balance the number of sequences sampled between the two strains. To obtain the average pairwise haplotype frequencies, fij, we repeated the resampling 200 times.
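One possible reading of this resampling procedure is sketched below in Python. The interpretation is an assumption: sequences with mutant frequency per genome below dv are treated as the over-represented group, a random fraction Dw of them is kept in each resample, all other sequences are kept, and pairwise haplotype frequencies are averaged over the resamples. The function names are hypothetical.

```python
import random

def downweighted_sample(binary, dv, Dw, rng):
    """Keep all sequences with mutant frequency >= dv and a random fraction Dw of the rest."""
    low = [g for g in binary if sum(g) / len(g) < dv]
    high = [g for g in binary if sum(g) / len(g) >= dv]
    keep = round(Dw * len(low))
    return high + rng.sample(low, keep)

def resampled_pair_frequencies(binary, i, j, dv, Dw, n_resamples=200, seed=0):
    """Pairwise haplotype frequencies at sites (i, j), averaged over repeated resamples."""
    rng = random.Random(seed)
    total = {(a, b): 0.0 for a in (0, 1) for b in (0, 1)}
    for _ in range(n_resamples):
        sample = downweighted_sample(binary, dv, Dw, rng)
        for g in sample:
            total[(g[i], g[j])] += 1.0 / (len(sample) * n_resamples)
    return total
```

The resulting averaged frequencies can then be passed to the UFE formulas of Eqs 1 and 2.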

Next, we followed the procedure described above in Fig 1C–1F and calculated the two-way and three-way association UFE, Eqs 1 and 2, to infer the intra-protein network of interactions (Fig 2). To avoid divergence, we have considered only pairs for which all four haplotypes were present in excess of a cutoff, fij > 0.05. The dependence of the results on (dv, Dw), across which between 15 and 22 compensatory sites are inferred, originates from the unequal representation of the two strains in the database. We observe that slightly different weighting gives similar results within a plateau region in Fig 2F. We choose cases C, D, E based on the robustness with respect to the two sampling parameters seen as the broad plateau in the 2D diagram, see panels C, D, E. These cases correspond to a roughly equal amount of each strain. If one strain is strongly overrepresented, the network disappears (Fig 2A and 2B). Below, we choose the network variant shown in Fig 2D as the "golden middle" of the set.

Fig 2. Epistatic network predicted from sequence data on surface protein sequences of Influenza A H1N1 obtained between years 2005 and 2010.

The circular diagrams show the network of interaction between variable amino acid sites in the neuraminidase protein. Sequences with homology to the consensus less than dv were randomly re-sampled 200 times, with their number downweighted by coefficient Dw. (F) 2D heatmap showing the total number of links as a function of dv (X-axis) and Dw (Y-axis). Different versions of wheels in A-E correspond to different choices of Dw and dv shown by crosses in F. All links have estimated E > 0.5. (A-E) Colors correspond to different locations in the protein.

A primary mutation and compensatory sites

We find that site 248 in NA represents the primary site connected to multiple compensatory sites (Fig 2D). Thus, our three-way method, tested in simulation and analytically, shows that the new strain outcompeted the old strain only because it had a primary mutation at site 248 and many compensatory mutations. The network has the typical appearance of a fitness valley network observed, for example, in HIV for drug-resistant mutations.

One might ask whether these inferred mutations in Fig 2 are simply a driving mutation with hitchhiker mutations present in the invading strain. It is easy to see that, in this case, our method would give zero signal instead of the network inferred (Fig 2). To make this fact obvious, we consider an extreme example, where the population represents a mix of the uniform old and uniform new strain. We focus on the part of the genome where the two strains differ, for example, 101100… and 010011…, respectively. These differences can be either due to driving or hitchhiking mutations. Note that, for any pair of these sites, only two haplotypes fij out of the four are present. For example, for the first and second sites, we have haplotypes 10 and 01. Hence, when calculating correlation UFEij and UFEij0 in Eqs 1 and 2, these pairs are excluded from analysis due to the cutoff condition, fij > 0.05. Thus, by design, our method does not measure the difference between strains, but only a specific type of association between the fluctuations of alleles that is caused by epistasis.
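The extreme example above is easy to reproduce. The Python sketch below builds a mixture of two uniform strains using the six differing sites quoted in the text (the 70:30 mixing ratio is an arbitrary assumption) and lists the site pairs that pass the cutoff fij > 0.05. The function names are hypothetical.

```python
def pair_frequencies(sequences, i, j):
    """Frequencies of the four haplotypes at sites (i, j)."""
    f = {(a, b): 0.0 for a in (0, 1) for b in (0, 1)}
    for g in sequences:
        f[(g[i], g[j])] += 1.0 / len(sequences)
    return f

def pairs_passing_cutoff(sequences, f_cut=0.05):
    """Site pairs for which all four haplotype frequencies exceed f_cut."""
    length = len(sequences[0])
    return [(i, j) for i in range(length) for j in range(i + 1, length)
            if min(pair_frequencies(sequences, i, j).values()) > f_cut]

old_strain = [1, 0, 1, 1, 0, 0]
new_strain = [0, 1, 0, 0, 1, 1]
mixture = [old_strain] * 700 + [new_strain] * 300  # hypothetical 70:30 mix

passing = pairs_passing_cutoff(mixture)  # empty list: no pair enters the analysis
```

Every pair of sites shows only two of the four haplotypes, so no pair passes the cutoff and fixed differences between strains contribute nothing to the inferred network.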

Structural interpretation

It is instructive to place the inferred epistatic sites on a three-dimensional protein structure (Fig 3). The active pocket of NA (purple) serves to bind sialic acid on the target cell surface. We observe that the inferred primary mutation at residue 248 is located near the active pocket. The inferred compensatory mutations (Fig 2D) helping the mutant strain of NA to improve its fitness are all located on the protein surface in α-helices connecting and determining the mutual orientation of β-sheets. Inferred primary mutation 248 was previously shown to enhance the low-pH stability of NA [51]. It is ubiquitous in all influenza A H1N1 variants isolated after the 2009 pandemic, regardless of geographic location [52–54].

Fig 3. Structural location of the predicted epistasis network for the neuraminidase of influenza virus.

The figure shows the three-dimensional structure of Influenza A H1N1 neuraminidase (PDB ID code 4QVZ). Colored spheres represent predicted epistatic residues from Fig 2D. Red sphere: Predicted primary mutation (residue 248 in Fig 2D). Orange spheres: Compensatory residues from Fig 2D.

Unlike the epistatic links, the false positives are not linked to any specific biology or proximity to 248. Linkage is indiscriminate in this sense, as it is a simple consequence of stochastic phylogeny. Almost every pair of diverse sites in the database is a potential false-positive interaction, see Fig 1C as an illustration.

Discussion

In the present work, we propose an efficient evolution-based method to tell apart co-variance caused by epistasis from co-variance caused by stochastic linkage effects due to common inheritance and indirect interactions. First, we average the observed haplotype frequencies over independent populations, then we select the links with a high co-variance, and then we apply a tri-way haplotype test for each candidate link to eliminate the residual false-positives. We validate the tri-way haplotype method using a simple analytical model (Methods) assuming a quasi-equilibrium state created by a slowly-moving traveling wave. The existence of quasi-equilibrium has been tested previously by simulation in a broad parameter range [46,50]. Intuitively, the distribution of alleles between sites has sufficient time to attain the most probable state, i.e., the state with the largest number of possible sequences given fitness.

To demonstrate the high fidelity of the method in a controlled environment, we used a simulated sequence set evolved in a Wright-Fisher population with a known epistatic network. In the case of a simple network topology and 40 loci, the method eliminated all false-positive interactions.

To illustrate the application of our method, we averaged haplotype frequencies over influenza H1N1 sequences obtained from a large number of geographic locations. We identified primary and compensatory mutations responsible for the post-2009 strain. We did not address the origins or the history of the strain. We note that influenza virus has been shown to map to the traveling wave theory [48,49], which justifies the use of our method assuming directional selection and the quasi-equilibrium assumption. Our results infer a single primary site and 15–20 strong compensatory mutations, a number in the same general range as the number of compensatory mutations observed, for example, in drug-resistant strains of HIV. The inferred primary mutation, 248, has been observed in all influenza A H1N1 strains after the 2009 pandemic in various geographic locations [52–54]. It was shown to affect virus infectivity [51].

We did not find any empirical evidence in the literature for or against the inferred network of compensatory mutations. Hence, we make a new testable prediction for a future experimental test. Primary site 248 can be predicted by simple alignment and is well studied experimentally. Its compensatory sites, however, are impossible to detect in vivo without our method, due to strong linkage noise. It would be very useful to compare these predictions with the results of deep mutational scanning [55,56], which we hope will be available in the future.

It is worth noting that influenza sequences represent a meta-population with many connected islands, neither a single well-mixed population nor a set of completely independent populations. The approximation used in our work is that averaging over different geographic locations provides at least a partial average over the ensemble of independent populations. It is unclear to what extent the lack of the full ensemble average affects the estimate, but we are at least able to compensate for the residual linkage errors within that partial ensemble, something that has not been done before.

As compared to the existing techniques of elimination of indirect interactions, developed for different animal species that diverged millions of years ago [43,44], our method is designed for recently diverged populations (thousands of generations or less) of the same species and for recently emerged mutations. Furthermore, our method is capable of eliminating stochastic linkage, which is of less importance when different species are compared. Where both methods can potentially be applied, such as calculating the fitness landscape of HIV Ab-binding regions [45], our method is much faster computationally, because it is local in the genome. Indeed, we can consider one pair of loci at a time without the need for simultaneous optimization of the L²/2 parameters of the full interaction matrix. It also helps to avoid the situation in which the number of fitting parameters is too large and the system is over-parameterized.

The main limitation of the proposed approach is that it assumes constant directional selection, as opposed to balancing selection or time-dependent selection, such as occurs under changing external conditions [57]. While the evolution of influenza in a population under the selection pressure of accumulating immune memory B cells has been mapped to the case with constant selection [48,49], the case of virus evolution under the CD8 T cell response [58] or the case of a virus co-evolving with its defective interference particle [59,60] have no such connection and require separate investigation.

Our aim here was to propose the first method that can, in principle, isolate linkage from epistasis. Our analytic derivation, under the stated conditions, demonstrates that the method works in a broad range. The future task will be to find the full region of the applicability of our method in terms of N, s, U, and the number of sites L. We hope to address the upper limit on L in the future work.

The application of our technique to the metapopulation of influenza is based on the assumption that migration between local populations (“demes”) is sufficiently slow, so that their evolution along stochastic trajectories remains mostly independent. The standard criterion of independence is that, for the genomic region of interest, the migration rate from directly connected demes is much smaller than the mutation rate per region. In the opposite limit when migration is very fast, the entire metapopulation is well-mixed, and the method becomes useless. There exists a large intermediate interval of migration rates when the neighboring connected demes form well-mixed clusters, and the population represents a large number of such independent clusters, so the method still applies, except the linkage noise is increasing as their number is decreasing. When migration is so fast that a new best-fit genome arising in the metapopulation spreads faster to any deme than the local production of best-fit genomes by mutation, we have the panmixia case and the method ceases to apply.

To summarize, we proposed a technique to infer the epistatic effect on the evolution of locus pairs and tease it out from stochastic linkage effects. We hope that our approach and further development of this technique will prove useful for all researchers interested in finding fitness landscapes of various organisms from genetic samples.

Methods

Model

We simulate the evolution of a haploid asexual population of N binary sequences. In an individual genome, each locus (site, nucleotide position, amino acid position) numbered i = 1,2,…,L is occupied by one of two alleles, either the wild-type allele, denoted ai = 0, or the mutant allele, ai = 1. We use a discrete generation scheme in the absence of generation overlap (Wright-Fisher model). The evolutionary factors included in the model are random mutation with rate μL per genome, constant directional selection, and random genetic drift due to random sampling of progeny. Selection includes an epistatic network with a set strength and topology. Recombination is absent. A previous modeling study shows that moderate levels of recombination can enhance epistatic detection [40]. We use the standard model of fitness landscape with pairwise interaction. The logarithm of the average progeny number of an individual genome, W, depends on sequence [ai], as given by

$$W[a_i] = \sum_{i=1}^{L} s_i a_i + \sum_{i<j}^{L} s_{ij} a_i a_j \tag{3}$$
$$s_{ij} = -E_{ij}\,(s_i + s_j)\,T_{ij} \tag{4}$$

where Tij = 0 or 1 is the binary matrix that shows interacting pairs. Here the selection coefficients si and sj denote the individual fitness costs of two deleterious mutations that are partially compensated by each other. By the definition, Eij is the degree of compensation of deleterious alleles at sites i and j. Values E = 0 and 1 represent no epistasis and full compensation, respectively.
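The fitness model of Eqs 3 and 4 can be evaluated directly for a given binary genome. The snippet below is a minimal sketch, not the authors' code: the function name `log_fitness` and the nested-list layout of `E` and `T` are assumptions made for illustration, and the pairwise term is taken with a minus sign so that compensation reduces the summed cost, as required for consistency with Eq 5.

```python
from itertools import combinations

def log_fitness(a, s, E, T):
    """Evaluate W of Eq 3 for a binary genome a.

    a: list of 0/1 alleles; s: per-site selection coefficients;
    E: matrix of compensation degrees; T: 0/1 matrix of interacting pairs.
    The pairwise coefficient follows Eq 4 with a minus sign (assumed convention).
    """
    L = len(a)
    additive = sum(s[i] * a[i] for i in range(L))
    pairwise = 0.0
    for i, j in combinations(range(L), 2):  # all pairs with i < j
        if T[i][j] and a[i] and a[j]:
            pairwise += -E[i][j] * (s[i] + s[j])
    return additive + pairwise

# Illustrative three-site chain 0-1-2 with s = 0.1 and E = 0.75.
T = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
E = [[0.75] * 3 for _ in range(3)]
s = [0.1] * 3
W_triplet = log_fitness([1, 1, 1], s, E, T)
W_pair = log_fitness([1, 1, 0], s, E, T)
W_ends = log_fitness([1, 0, 1], s, E, T)
```

In this toy chain the two end sites carry no direct term (`T[0][2] = 0`), so the genome `[1, 0, 1]` simply pays two independent costs.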

The formalism applies at negative values of E as well, but we focus on positive E, which case is termed "diminishing returns epistasis". The reason for this choice is strong effects of epistasis and strong indirect interactions in this region. The analytic derivation in the Methods applies at any E < 1, i.e., below the full compensation point, which covers most basic types of epistasis.

In our simulation example in Fig 1, we consider a haploid population with the initial frequency of deleterious alleles, f0 = 0.45, beneficial mutation rate, U = 0.07, fixed selection coefficient, s = 0.1, population of N = 1000 individuals, L = 40 sites, and fixed epistatic strength E = 0.75. The core Monte-Carlo simulation code for Fig 1 is written in MATLAB and deposited at site https://github.com/rbatorsky/hiv-recombination. It can be modified for different types of epistasis and recombination. The code for Fig 2 (data analysis) has been uploaded to https://github.com/irouzine/Pedruzzi.
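For readers without MATLAB, the simulation scheme described above can be sketched in a few lines of Python. This is a simplified illustration and not the deposited code. Several choices are assumptions: the topology is taken to be consecutive three-site chains, W from Eq 3 is treated as a total cost so that offspring are sampled with weight exp(-W), and each genome receives at most one beneficial (1 to 0) mutation per generation with probability U.

```python
import math
import random

def wright_fisher(N=1000, L=40, s=0.1, E=0.75, f0=0.45, U=0.07,
                  generations=50, seed=0):
    """Evolve N binary genomes of length L with discrete generations."""
    rng = random.Random(seed)
    # Assumed topology: links (0,1),(1,2),(3,4),(4,5),... inside triplets.
    links = [(i, i + 1) for i in range(L - 1) if i % 3 != 2]

    def cost(g):
        # Eq 3 with fixed s and E; each active link compensates 2*E*s.
        c = s * sum(g)
        c -= sum(2 * E * s for i, j in links if g[i] and g[j])
        return c

    # Initial population: each site is deleterious with probability f0.
    pop = [[1 if rng.random() < f0 else 0 for _ in range(L)]
           for _ in range(N)]
    for _ in range(generations):
        # Selection and drift: resample N parents with weight exp(-cost).
        weights = [math.exp(-cost(g)) for g in pop]
        parents = rng.choices(pop, weights=weights, k=N)
        pop = []
        for g in parents:
            child = g[:]
            # Beneficial mutation: revert one random deleterious allele.
            if rng.random() < U:
                ones = [i for i, x in enumerate(child) if x]
                if ones:
                    child[rng.choice(ones)] = 0
            pop.append(child)
    return pop
```

Calling `wright_fisher(seed=k)` for many values of `k` would produce the ensemble of independent populations over which haplotype frequencies are averaged.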

Main approximations

We assume that the system is under directed selection and in the multiple-mutation regime (traveling wave regime), which takes place if log(NU) ≫ log(s/U) [12]. In this case, we have interfering selection sweeps occurring at many sites at once. We are far from mutation-selection-drift equilibrium, so that reverse (deleterious) mutations are negligible. In a broad parameter range, an adapting population can be represented by a slowly-moving, narrow peak in fitness coordinate [7,9,12,14,16]. Evolution is slow, because the limiting factor is the addition of a rare beneficial mutation established within a highly-fit genetic background [7,12]. Because the fitness distribution moves slowly, the entropy (the log number of possible sequences given fitness) of the mutation distribution over genomes has enough time to reach its current maximum, restricted by the current average fitness of the population. This situation is called "quasi-equilibrium". At each moment, each fitness class has enough time to reach the most probable, most chaotic state given its fitness. Previously, we verified the validity of quasi-equilibrium in a broad range of parameters and initial conditions after time ~ 1/<s> [46].
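As a quick arithmetic check, the condition log(NU) ≫ log(s/U) can be evaluated for the parameter values quoted for the simulation example (N = 1000, U = 0.07, s = 0.1). The helper name below is hypothetical, natural logarithms are assumed, and what counts as "much larger" is left to the reader.

```python
import math

def traveling_wave_logs(N, U, s):
    """Return (log(N*U), log(s/U)) so the two sides can be compared."""
    return math.log(N * U), math.log(s / U)

lhs, rhs = traveling_wave_logs(N=1000, U=0.07, s=0.1)
```

For these values the left-hand side exceeds the right-hand side by roughly an order of magnitude, which is the sense in which the simulation sits in the multiple-mutation regime.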

Linkage measure

We will use a binary measure of allelic correlations defined in Eq 1, where f00, f10, f01, f11 are the haplotype frequencies averaged over the ensemble of populations [46]. UFE performs similarly to more traditionally used measures, such as Lewontin’s D’, the Pearson correlation coefficient, r² [40], or mutual information [43,44]. As compared to these measures, UFE has the unique advantage of directly measuring the degree of mutual compensation of two alleles, E, provided they do not interact with other sites. If a pair of loci does not interact with the other loci in the genome, we have UFE = E. If they are a part of a network, this measure overestimates E [46]. In the main text, we calculate UFE for every pair of sites (Fig 1A). We retain only those pairs where UFE exceeds a set threshold of 0.6.
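The pairwise screening step can be sketched as follows. Eq 1 is defined in the main text and is not reproduced in this section, so the functional form used in `ufe` below is an assumption: it is chosen because it returns UFE = E for an isolated pair and reproduces Eq 10 from Eq 9. All function names are hypothetical.

```python
import math

def averaged_pair_freqs(populations, i, j):
    """Average the four haplotype frequencies of sites i, j over populations.

    populations: list of populations, each a list of 0/1 sequences.
    Keys are (allele at i, allele at j).
    """
    freqs = {(0, 0): 0.0, (1, 0): 0.0, (0, 1): 0.0, (1, 1): 0.0}
    for seqs in populations:
        for g in seqs:
            freqs[(g[i], g[j])] += 1.0 / (len(seqs) * len(populations))
    return freqs

def ufe(f00, f10, f01, f11):
    """Assumed form of Eq 1: 1 - log(f11/f00) / log(f10*f01/f00**2)."""
    if min(f00, f10, f01, f11) <= 0.0:
        return None  # undefined when a haplotype is absent
    denom = math.log(f10 * f01 / f00 ** 2)
    if denom == 0.0:
        return None
    return 1.0 - math.log(f11 / f00) / denom

def candidate_links(populations, L, threshold=0.6):
    """Keep the pairs whose UFE exceeds the threshold."""
    links = []
    for i in range(L):
        for j in range(i + 1, L):
            f = averaged_pair_freqs(populations, i, j)
            u = ufe(f[(0, 0)], f[(1, 0)], f[(0, 1)], f[(1, 1)])
            if u is not None and u > threshold:
                links.append((i, j, u))
    return links
```

Returning `None` for pairs with a missing haplotype is a design choice of this sketch; such pairs carry too little data for a logarithmic measure.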

Tri-way linkage measure

To test whether a detected correlation for a pair of sites i,j is due to direct interaction rather than linkage or indirect correlation, we also calculate the three-way measure, Eq 2, where 0 in the third position selects only for the sequences with the consensus allele 0 at a chosen site connected to one site of the tested pair. We consider all possible connected sites as 0-nodes and calculate the minimum value of UFEij0 over all possible 0-nodes. Finding the minimum not only detects a detour but also finds the most important detour if there is more than one. Thus, we can identify and remove false-positive links as those with a low ratio min(UFEij0)/UFEij.
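The three-way filter can be illustrated on a toy data set. Eq 2 is not reproduced in this section; the sketch assumes that UFEij0 is the same pairwise measure evaluated only on sequences carrying allele 0 at the chosen third site, and it reuses an assumed form of Eq 1. Function names and the example counts are hypothetical.

```python
import math

def ufe_from_counts(n00, n10, n01, n11):
    """Assumed form of Eq 1 written for counts (normalization cancels)."""
    if min(n00, n10, n01, n11) == 0:
        return None
    denom = math.log(n10 * n01 / n00 ** 2)
    if denom == 0.0:
        return None
    return 1.0 - math.log(n11 / n00) / denom

def pair_counts(seqs, i, j, zero_site=None):
    """Count haplotypes of sites i, j, optionally requiring 0 at zero_site."""
    c = {(0, 0): 0, (1, 0): 0, (0, 1): 0, (1, 1): 0}
    for g in seqs:
        if zero_site is not None and g[zero_site] != 0:
            continue
        c[(g[i], g[j])] += 1
    return c[(0, 0)], c[(1, 0)], c[(0, 1)], c[(1, 1)]

def three_way_ratio(seqs, i, j, zero_nodes):
    """Return min over 0-nodes of UFE_ij0, divided by the pairwise UFE_ij."""
    pair = ufe_from_counts(*pair_counts(seqs, i, j))
    if pair is None or pair <= 0:
        return None
    conditioned = [ufe_from_counts(*pair_counts(seqs, i, j, k))
                   for k in zero_nodes if k not in (i, j)]
    conditioned = [u for u in conditioned if u is not None]
    if not conditioned:
        return None
    return min(conditioned) / pair

# Toy data: sites 0 and 2 co-occur only when the middle site 1 is mutant.
seqs = ([(0, 0, 0)] * 81 + [(1, 0, 0)] * 9 + [(0, 0, 1)] * 9
        + [(1, 0, 1)] * 1 + [(1, 1, 1)] * 20)
pair_ufe = ufe_from_counts(*pair_counts(seqs, 0, 2))
ratio = three_way_ratio(seqs, 0, 2, zero_nodes=[1])
```

In this example the pair (0, 2) passes the pairwise threshold of 0.6, yet the ratio collapses once site 1 is fixed at 0, so the link would be discarded as indirect.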

Analytic test of the method

To demonstrate, in the general form, that the above method works on indirect interactions, we consider a simplified case of the fitness landscape model in Eqs 3 and 4. We assume a fixed selection coefficient si = s0 and a fixed epistatic strength Eij = E (Eqs 3 and 4). Then, we can fully characterize a genome by the numbers of interacting allelic clusters of different size, i. Each such cluster comprises i directly interacting alleles. Let ki denote the number of clusters with i alleles and bi interactions. Then, from Eqs 3 and 4, we can express log fitness W as a sum over clusters of different size (Fig 4)

$$W \equiv s_0 f_0 L = s_0 \sum_{i=1}^{i_{\max}} k_i \left(i - 2E\, b_i\right) \tag{5}$$

New notation f0 has the meaning of the effective frequency of non-interacting alleles that would have the same total fitness, W. The number of interactions, bi at i ≥ 3, depends on the topology of the epistatic network. Here b1 = 0, b2 = 1 for any topology. For the chosen network of double arches, we have clusters of single, double, and triple alleles, and b3 = 2 (Fig 4).
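Eq 5 is straightforward to evaluate once the cluster counts are known. The sketch below uses the double-arch values b1 = 0, b2 = 1, b3 = 2 from the text; the function names and the example cluster counts are hypothetical.

```python
def cluster_log_fitness(k, b, s0, E):
    """W of Eq 5: s0 * sum_i k_i * (i - 2*E*b_i).

    k: dict mapping cluster size i to the number of such clusters;
    b: dict mapping cluster size i to its number of interactions.
    """
    return s0 * sum(k[i] * (i - 2 * E * b[i]) for i in k)

def effective_frequency(W, s0, L):
    """f0 of Eq 5: frequency of non-interacting alleles with the same W."""
    return W / (s0 * L)

b = {1: 0, 2: 1, 3: 2}   # double-arch topology (Fig 4)
k = {1: 2, 2: 1, 3: 1}   # illustrative genome: two singles, one double, one triple
W = cluster_log_fitness(k, b, s0=0.1, E=0.75)
f0 = effective_frequency(W, s0=0.1, L=40)
```

With E = 0.75 a full triplet contributes 3 - 4E = 0, which illustrates why triplets become numerically dominant at strong compensation.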

Fig 4. An epistatic network made of double arches used for simulation and analytic derivation.

Numbers ki denote the numbers of clusters with i connected sites in a genome. Dots show deleterious alleles. Interactions between the existing deleterious alleles are shown in red.

As mentioned above, we assume quasi-equilibrium, as determined by the current fitness. At each moment of time, numbers ki are determined by the condition that the entropy of the system is maximal given its fitness, Eq 5. The evolving population reaches the maximum-entropy state at the given fitness level with respect to the polymorphic sites. Entropy S is defined as the log number of possible sequence configurations (for example, 0111100101)

$$S = \log\left[\prod_{i=1}^{i_{\max}} C_{L_i}^{k_i}\,(n_i)^{k_i}\right] \tag{6}$$

where Li is the number of all possible locations for a cluster of size i, and ni is the number of each cluster’s configurations (shapes). The values of Li and ni depend on the network topology.
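Reading C in Eq 6 as a binomial coefficient (the number of ways to place ki clusters among Li locations), the entropy is a sum of log-binomial and shape terms. The snippet is a sketch under that reading; the values of `L_loc` and `n_shapes` in the example are placeholders, since the true values depend on the network topology.

```python
import math

def configuration_entropy(k, L_loc, n_shapes):
    """S of Eq 6: sum_i [ log C(L_i, k_i) + k_i * log n_i ]."""
    return sum(math.log(math.comb(L_loc[i], k[i]))
               + k[i] * math.log(n_shapes[i])
               for i in k)

# Placeholder inputs: two single alleles among 40 locations,
# one double among 26 locations, one shape per cluster.
S = configuration_entropy(k={1: 2, 2: 1},
                          L_loc={1: 40, 2: 26},
                          n_shapes={1: 1, 2: 1})
```

Maximizing this quantity over the ki at fixed W (Eq 5) is the constrained problem that the quasi-equilibrium argument solves.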

Previously, we applied this argument for several topologies to derive the numbers of clusters of different size [46]. We showed, for the topology in Fig 4, that the frequencies of clusters of size i = 1,2 and 3, denoted fi = ki/L, are related as

$$f_2 = \frac{1}{3} f_1^{\,2-2E}, \qquad f_3 = \frac{2}{3} f_1^{\,3-4E}, \qquad E < \frac{3}{4} \tag{7}$$

The 1st and 3rd site in each triplet in Fig 1B do not interact directly, but only indirectly through site 2. For these two sites, the haplotype frequencies are

$$f_{11} = 3 f_3 + f_1^2, \qquad f_{10} = \frac{3}{2} f_2 + f_1 \tag{8}$$

When epistatic interaction is sufficiently strong, as given by the condition E > 1/2, triplets dominate numerically over single alleles and doubles, as given by f1 ≪ f2 ≪ f3 (Eq 7). From Eqs 7 and 8, using these strong inequalities, we can approximate the haplotype frequencies as

$$f_{11} \approx 2 f_1^{\,3-4E}, \qquad f_{10} \approx \frac{1}{2} f_1^{\,2-2E}, \qquad \frac{1}{2} < E < \frac{3}{4} \tag{9}$$

Using covariance measure UFEij defined in Eq 1, we obtain

$$\mathrm{UFE}_{ij} \approx \frac{1}{4(1-E)} \tag{10}$$

For directly interacting sites 1 and 2 (Fig 1), we previously obtained $f_{11}^{\mathrm{dir}} \approx f_1^{\,3-4E}$, $f_{10}^{\mathrm{dir}} \approx \frac{1}{3} f_1^{\,2-2E}$ [see [46], Supplement, Eqs (3.29) and (3.30)]. This gives the same result, Eq 10. We observe that the indirect covariance between sites 1 and 3 is as strong as for directly interacting sites.
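The step from Eq 9 to Eq 10 can be written out explicitly. The lines below are a sketch under stated assumptions: Eq 1 (defined in the main text and not reproduced in this section) is taken to have the form UFE = 1 − log(f11/f00)/log(f10 f01/f00²), the wild-type haplotype dominates so that f00 ≈ 1, and f01 = f10 by symmetry of the two end sites. Under these assumptions the numerical prefactors in Eq 9 contribute only subleading terms when |log f1| ≫ 1.

```latex
\begin{align}
\mathrm{UFE}_{ij}
  &\approx 1 - \frac{\log f_{11}}{2\log f_{10}}
   = 1 - \frac{(3-4E)\log f_1 + \log 2}{2\left[(2-2E)\log f_1 - \log 2\right]} \\
  &\approx 1 - \frac{3-4E}{4(1-E)}
   = \frac{4(1-E) - (3-4E)}{4(1-E)}
   = \frac{1}{4(1-E)} .
\end{align}
```

The same algebra applied to the direct-pair frequencies gives an identical leading term, because only the exponents 3 − 4E and 2 − 2E survive in the large-logarithm limit.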

However, if we calculate three-way measure UFEij0 in Eq 2 instead of UFEij by including only the sequences with majority allele 0 at site 2, then, instead of Eq 8 and Eq 9, we obtain

$$f_{101} \approx f_1^2, \qquad f_{100} \approx f_1 \tag{11}$$
$$\mathrm{UFE}_{ij0} \approx 0.$$

Thus, the phantom covariance disappears when we select only the sequences with a majority allele inserted between the two tested sites. This result is intuitively clear: by the definition of fitness (Eq 3), only minority alleles interact with each other, while majority alleles form a neutral background. The same method turns out to be extremely effective for eliminating false-positive interactions created by linkage (compare Fig 1B with Fig 1E).
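For completeness, the vanishing of the three-way measure follows in one line from Eq 11. As a hedged sketch, assume that Eq 2 (not reproduced in this section) has the same ratio-of-logarithms form as Eq 1 with the third site fixed at 0, and that the all-wild-type haplotype frequency is close to 1:

```latex
\begin{equation}
\mathrm{UFE}_{ij0}
  \approx 1 - \frac{\log f_{101}}{2\log f_{100}}
  \approx 1 - \frac{2\log f_1}{2\log f_1} = 0 .
\end{equation}
```

In other words, once the middle site carries the majority allele, the two end sites behave as independent single alleles, for which the measure is zero.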

We also repeated the same derivation for a more complex topology of closed squares (S1 Text and S1 Fig). The results are discussed in the main text.

Sequence preparation

We have applied the three-way test to influenza virus. We performed a multiple progressive alignment for amino acid sequences of Neuraminidase protein of Influenza A virus strain H1N1 obtained from public database https://www.fludb.org. We focused on NA because of the massive amount of sequence data and because strong changes in NA are responsible for the higher infectivity of the pandemic strain.

The amount of data must be sufficiently high to ensure that each three-site haplotype entering UFE is represented by a large number of sequences; the relative error is the inverse square root of that number. In our case, this condition was fulfilled by setting the cutoff at allelic frequency f > 5%. We downloaded 8440 sequences of NA from a public database (https://www.fludb.org). They were collected worldwide, from different geographic locations, from year 2005 and year 2010, and included both pre-pandemic and post-pandemic strains. All sequences were aligned. We considered only the sites that were strongly polymorphic (>5%) at some time within the 5-year window, during which they contribute to the calculated correlation measures. Then, we found the common consensus (majority) allele for each amino acid position. Note that we have chosen the common consensus of the entire set as the reference sequence for calculating allele frequencies. The choice of the reference sequence does not matter for the site-site correlations and the inferred network.

Pairwise distances between sequences were computed using pairwise alignment. The obtained consensus, defined as the most frequent variant in the population, served as a universal reference to binarize the data sequences. Before applying the detection algorithm, the protein sequences were binarized by direct comparison of each sequence to the consensus. Each amino-acid residue was set to 0 if it matched the consensus and to 1 otherwise. Although combining all amino acid variants per site ignores the specific biochemistry of substitutions, this approach greatly reduces the number of haplotype combinations and also increases the sensitivity by effectively increasing the haplotype frequencies.
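The consensus and binarization steps can be sketched as follows. This is an illustration rather than the deposited analysis code: the function names are hypothetical, the input is assumed to be a list of equal-length aligned strings, alignment gaps are treated as one more non-consensus variant, and the 5% cutoff is applied to the pooled set rather than within a time window.

```python
from collections import Counter

def consensus(aligned):
    """Most frequent residue at each aligned position."""
    return "".join(Counter(col).most_common(1)[0][0]
                   for col in zip(*aligned))

def binarize(aligned, ref):
    """0 where a residue matches the reference, 1 otherwise."""
    return [[0 if aa == r else 1 for aa, r in zip(seq, ref)]
            for seq in aligned]

def polymorphic_sites(binary, cutoff=0.05):
    """Indices of sites whose non-consensus frequency exceeds the cutoff."""
    n = len(binary)
    return [i for i, col in enumerate(zip(*binary))
            if sum(col) / n > cutoff]

# Tiny made-up alignment for illustration only.
aligned = ["MKT", "MKT", "MRT", "MKS"]
ref = consensus(aligned)
binary = binarize(aligned, ref)
sites = polymorphic_sites(binary)
```

The resulting 0/1 matrix is the input expected by the pairwise and three-way haplotype counts.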

Next, we measured the mutational frequency for each sequence along sequences and for each site across sequences. The subset of low-diversity sequences with allelic frequency below a cut-off dv was randomly sampled and down-weighted according to a set coefficient, Dw. Then, we determined the average pairwise and three-way haplotype frequencies for all pairs and triplets of sites, as described in the Results section.

Supporting information

S1 Text. Derivation of UFE for the closed square topology.

(PDF)

S1 Table. Direct and indirect UFE values for the square topology.

Index 0 indicates a 3-way measure.

(TIFF)

S1 Fig. Square topology UFE calculation.

A) Possible configurations and their symmetry. B) Indirect interaction pairwise. C) Indirect interaction three-way, with a fixed zero at a node. D) Indirect interaction with two fixed zeros. E) Direct interaction pairwise. F) Direct interaction three-way.

(TIFF)

S2 Fig. Dependence of 5 different types of linkage measure (S1B–S1F Fig) on epistatic strength predicted for square topology.

(TIFF)

Acknowledgments

We thank Martin Weigt and Alessandra Carbone for useful comments.

Data Availability

Influenza sequence data are taken from the public database https://www.fludb.org. Our software is deposited at https://github.com/rbatorsky/hiv-recombination and https://github.com/irouzine/Pedruzzi. To characterize the three-dimensional network of epistatic interaction in Fig 3, we used the software package ChimeraX from https://www.rbvi.ucsf.edu/chimerax/.

Funding Statement

This research has been funded by l'Agence Nationale de la Recherche, grant J16R389 to IMR, http://www.agence-nationale-recherche.fr/. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Fisher RA. The genetical theory of natural selection. Oxford, United Kingdom: Clarendon Press, 1958; 1930.
  • 2. Muller HJ. Some genetic aspects of sex. Am Nat. 1932;66:118–28.
  • 3. Rice SH. Evolutionary Theory: Mathematical And Conceptual Foundations. Sinauer Associates; 2004.
  • 4. Felsenstein J. The evolutionary advantage of recombination. Genetics. 1974;78(2):737–56. Epub 1974/10/01. PubMed Central PMCID: PMC1213231.
  • 5. Hill WG, Robertson A. The effect of linkage on limits to artificial selection. Genet Res. 1966;8(3):269–94. Epub 1966/12/01.
  • 6. Kimura M. Population genetics, molecular evolution, and the neutral theory. Selected papers. Takahata N, editor. Chicago: The University of Chicago Press; 1994.
  • 7. Rouzine IM, Wakeley J, Coffin JM. The solitary wave of asexual evolution. Proc Natl Acad Sci U S A. 2003;100(2):587–92. Epub 2003/01/15. doi: 10.1073/pnas.242719299; PubMed Central PMCID: PMC141040.
  • 8. Desai MM, Fisher DS. Beneficial mutation selection balance and the effect of linkage on positive selection. Genetics. 2007;176(3):1759–98. Epub 2007/05/08. doi: 10.1534/genetics.106.067678; PubMed Central PMCID: PMC1931526.
  • 9. Rouzine IM, Brunet E, Wilke CO. The traveling-wave approach to asexual evolution: Muller’s ratchet and speed of adaptation. Theor Popul Biol. 2008;73(1):24–46. doi: 10.1016/j.tpb.2007.10.004; PubMed Central PMCID: PMC2246079.
  • 10. Brunet E, Rouzine IM, Wilke CO. The stochastic edge in adaptive evolution. Genetics. 2008;179(1):603–20. Epub 2008/05/22. doi: 10.1534/genetics.107.079319; PubMed Central PMCID: PMC2390637.
  • 11. Hallatschek O. The noisy edge of traveling waves. Proc Natl Acad Sci U S A. 2011;108(5):1783–7. Epub 2010/12/29. doi: 10.1073/pnas.1013529108; PubMed Central PMCID: PMC3033244.
  • 12. Good BH, Rouzine IM, Balick DJ, Hallatschek O, Desai MM. Distribution of fixed beneficial mutations and the rate of adaptation in asexual populations. Proc Natl Acad Sci U S A. 2012;109(13):4950–5. Epub 2012/03/01. doi: 10.1073/pnas.1119910109; PubMed Central PMCID: PMC3323973.
  • 13. Neher RA, Shraiman BI, Fisher DS. Rate of adaptation in large sexual populations. Genetics. 2010;184(2):467–81. doi: 10.1534/genetics.109.109009.
  • 14. Rouzine IM, Coffin JM. Evolution of human immunodeficiency virus under selection and weak recombination. Genetics. 2005;170(1):7–18. doi: 10.1534/genetics.104.029926.
  • 15. Rouzine IM, Coffin JM. Highly fit ancestors of a partly sexual haploid population. Theor Popul Biol. 2007;71(2):239–50. doi: 10.1016/j.tpb.2006.09.002.
  • 16. Rouzine IM, Coffin JM. Multi-site adaptation in the presence of infrequent recombination. Theor Popul Biol. 2010;77(3):189–204. doi: 10.1016/j.tpb.2010.02.001.
  • 17. Rouzine I, Weinberger LS. The quantitative theory of within-host viral evolution. J Stat Mech: Theory and Experiment. 2013;2013:P01009.
  • 18. Rouzine IM. Mathematical modelling of evolution: Volume 1: One-locus and multi-locus theory and recombination. Boston/Berlin: De Gruyter; 2020. 181 p.
  • 19. Goyal S, Balick DJ, Jerison ER, Neher RA, Shraiman BI, Desai MM. Dynamic mutation-selection balance as an evolutionary attractor. Genetics. 2012;191(4):1309–19. doi: 10.1534/genetics.112.141291; PubMed Central PMCID: PMC3416009.
  • 20. Brunet E, Derrida B, Mueller AH, Munier S. Effect of selection on ancestry: An exactly soluble case and its phenomenological generalization. Physical Review E. 2007;76:041104–1. doi: 10.1103/PhysRevE.76.041104.
  • 21. Walczak AM, Nicolaisen LE, Plotkin JB, Desai MM. The structure of genealogies in the presence of purifying selection: a fitness-class coalescent. Genetics. 2012;190(2):753–79. doi: 10.1534/genetics.111.134544; PubMed Central PMCID: PMC3276618.
  • 22. Neher RA, Hallatschek O. Genealogies of rapidly adapting populations. Proc Natl Acad Sci U S A. 2013;110(2):437–42. Epub 2012/12/28. doi: 10.1073/pnas.1213113110; PubMed Central PMCID: PMC3545819.
  • 23. Carlborg O, Jacobsson L, Ahgren P, Siegel P, Andersson L. Epistasis and the release of genetic variation during long-term selection. Nat Genet. 2006;38(4):418–20. Epub 2006/03/15. doi: 10.1038/ng1761.
  • 24. Zuk O, Hechter E, Sunyaev SR, Lander ES. The mystery of missing heritability: Genetic interactions create phantom heritability. Proc Natl Acad Sci U S A. 2012;109(4):1193–8. Epub 2012/01/10. doi: 10.1073/pnas.1119675109; PubMed Central PMCID: PMC3268279.
  • 25. Weissman DB, Desai MM, Fisher DS, Feldman MW. The rate at which asexual populations cross fitness valleys. Theor Popul Biol. 2009;75(4):286–300. Epub 2009/03/17. doi: 10.1016/j.tpb.2009.02.006; PubMed Central PMCID: PMC2992471.
  • 26. Gonzalez-Ortega E, Ballana E, Badia R, Clotet B, Este JA. Compensatory mutations rescue the virus replicative capacity of VIRIP-resistant HIV-1. Antiviral Res. 2011;92(3):479–83. Epub 2011/10/27. doi: 10.1016/j.antiviral.2011.10.010.
  • 27. Handel A, Regoes RR, Antia R. The role of compensatory mutations in the emergence of drug resistance. PLoS Comput Biol. 2006;2(10):e137. Epub 2006/10/17. doi: 10.1371/journal.pcbi.0020137; PubMed Central PMCID: PMC1599768.
  • 28. Levin BR, Perrot V, Walker N. Compensatory mutations, antibiotic resistance and the population genetics of adaptive evolution in bacteria. Genetics. 2000;154(3):985–97. Epub 2000/04/11. PubMed Central PMCID: PMC1460977.
  • 29. Nijhuis M, Schuurman R, de Jong D, Erickson J, Gustchina E, Albert J, et al. Increased fitness of drug resistant HIV-1 protease as a result of acquisition of compensatory mutations during suboptimal therapy. AIDS. 1999;13(17):2349–59. Epub 1999/12/22. doi: 10.1097/00002030-199912030-00006.
  • 30. Piana S, Carloni P, Rothlisberger U. Drug resistance in HIV-1 protease: Flexibility-assisted mechanism of compensatory mutations. Protein Sci. 2002;11(10):2393–402. Epub 2002/09/19. doi: 10.1110/ps.0206702; PubMed Central PMCID: PMC2384161.
  • 31. Wu NC, Thompson AJ, Xie J, Lin CW, Nycholat CM, Zhu X, et al. A complex epistatic network limits the mutational reversibility in the influenza hemagglutinin receptor-binding site. Nat Commun. 2018;9(1):1264. doi: 10.1038/s41467-018-03663-5; PubMed Central PMCID: PMC5871881.
  • 32. Rouzine IM, Coffin JM. Search for the mechanism of genetic variation in the pro gene of human immunodeficiency virus. J Virol. 1999;73(10):8167–78. doi: 10.1128/JVI.73.10.8167-8178.1999; PubMed Central PMCID: PMC112834.
  • 33. Adams RM, Kinney JB, Walczak AM, Mora T. Epistasis in a Fitness Landscape Defined by Antibody-Antigen Binding Free Energy. Cell Syst. 2019;8(1):86–93 e3. Epub 2019/01/07. doi: 10.1016/j.cels.2018.12.004; PubMed Central PMCID: PMC6487650.
  • 34. Lyons DM, Zou Z, Xu H, Zhang J. Idiosyncratic epistasis creates universals in mutational effects and evolutionary trajectories. Nat Ecol Evol. 2020;4(12):1685–93. Epub 2020/09/09. doi: 10.1038/s41559-020-01286-y; PubMed Central PMCID: PMC7710555.
  • 35. Chen CC, Schwender H, Keith J, Nunkesser R, Mengersen K, Macrossan P. Methods for identifying SNP interactions: a review on variations of Logic Regression, Random Forest and Bayesian logistic regression. IEEE/ACM Trans Comput Biol Bioinform. 2011;8(6):1580–91. Epub 2011/03/09. doi: 10.1109/TCBB.2011.46.
  • 36. Cordell HJ. Detecting gene-gene interactions that underlie human diseases. Nature reviews Genetics. 2009;10(6):392–404. doi: 10.1038/nrg2579; PubMed Central PMCID: PMC2872761.
  • 37. Ueki M, Cordell HJ. Improved statistics for genome-wide interaction analysis. PLoS Genet. 2012;8(4):e1002625. Epub 2012/04/13. doi: 10.1371/journal.pgen.1002625; PubMed Central PMCID: PMC3320596.
  • 38. Barton NH. Linkage and the limits to natural selection. Genetics. 1995;140(2):821–41. Epub 1995/06/01. PubMed Central PMCID: PMC1206655.
  • 39. Wei W-H, Hemani G, Haley CS. Detecting epistasis in human complex traits. Nature Reviews Genetics. 2014;15:722. doi: 10.1038/nrg3747.
  • 40. Pedruzzi G, Rouzine IM. Epistasis detectably alters correlations between genomic sites in a narrow parameter window. PLoS One. 2019;14(5):e0214036. Epub 2019/06/01. doi: 10.1371/journal.pone.0214036; PubMed Central PMCID: PMC6544209.
  • 41. Kryazhimskiy S, Dushoff J, Bazykin GA, Plotkin JB. Prevalence of epistasis in the evolution of influenza A surface proteins. PLoS Genet. 2011;7(2):e1001301. doi: 10.1371/journal.pgen.1001301; PubMed Central PMCID: PMC3040651.
  • 42. Neverov AD, Kryazhimskiy S, Plotkin JB, Bazykin GA. Coordinated Evolution of Influenza A Surface Proteins. PLoS Genet. 2015;11(8):e1005404. doi: 10.1371/journal.pgen.1005404; PubMed Central PMCID: PMC4527594.
  • 43. Cocco S, Feinauer C, Figliuzzi M, Monasson R, Weigt M. Inverse statistical physics of protein sequences: a key issues review. Rep Prog Phys. 2018;81(3):032601. doi: 10.1088/1361-6633/aa9965.
  • 44. Weigt M, White RA, Szurmant H, Hoch JA, Hwa T. Identification of direct residue contacts in protein-protein interaction by message passing. Proc Natl Acad Sci U S A. 2009;106(1):67–72. doi: 10.1073/pnas.0805923106; PubMed Central PMCID: PMC2629192.
  • 45. Louie RHY, Kaczorowski KJ, Barton JP, Chakraborty AK, McKay MR. Fitness landscape of the human immunodeficiency virus envelope protein that is targeted by antibodies. Proc Natl Acad Sci U S A. 2018;115(4):E564–E73. doi: 10.1073/pnas.1717765115; PubMed Central PMCID: PMC5789945.
  • 46. Pedruzzi G, Barlukova A, Rouzine IM. Evolutionary footprint of epistasis. PLoS Comput Biol. 2018;14(9):e1006426. doi: 10.1371/journal.pcbi.1006426; PubMed Central PMCID: PMC6177197.
  • 47. Neher RA, Shraiman BI. Statistical genetics and evolution of quantitative traits. Rev of Modern Physics. 2011;83:1283–300.
  • 48. Rouzine IM, Rozhnova G. Antigenic evolution of viruses in host populations. PLoS Pathog. 2018;14(9):e1007291. doi: 10.1371/journal.ppat.1007291; PubMed Central PMCID: PMC6173453.
  • 49. Yan L, Neher RA, Shraiman BI. Phylodynamic theory of persistence, extinction and speciation of rapidly adapting pathogens. Elife. 2019;8. doi: 10.7554/eLife.44205; PubMed Central PMCID: PMC6809594.
  • 50. Barlukova A, Rouzine IM. The evolutionary origin of the universal distribution of mutation fitness effect. PLoS Comput Biol. 2021;17(3):e1008822. Epub 2021/03/09. doi: 10.1371/journal.pcbi.1008822.
  • 51. Takahashi T, Song J, Suzuki T, Kawaoka Y. Mutations in NA that induced low pH-stability and enhanced the replication of pandemic (H1N1) 2009 influenza A virus at an early stage of the pandemic. PLoS One. 2013;8(5):e64439. doi: 10.1371/journal.pone.0064439; PubMed Central PMCID: PMC3655982.
  • 52. Byarugaba DK, Erima B, Millard M, Kibuuka H, Lkwago L, Bwogi J, et al. Whole-genome analysis of influenza A(H1N1)pdm09 viruses isolated in Uganda from 2009 to 2011. Influenza Other Respir Viruses. 2016;10(6):486–92. doi: 10.1111/irv.12401; PubMed Central PMCID: PMC5059949.
  • 53. Otte A, Marriott AC, Dreier C, Dove B, Mooren K, Klingen TR, et al. Evolution of 2009 H1N1 influenza viruses during the pandemic correlates with increased viral pathogenicity and transmissibility in the ferret model. Sci Rep. 2016;6:28583. doi: 10.1038/srep28583; PubMed Central PMCID: PMC4919623.
  • 54. Seibert CW, Rahmat S, Krammer F, Palese P, Bouvier NM. Efficient transmission of pandemic H1N1 influenza viruses with high-level oseltamivir resistance. J Virol. 2012;86(9):5386–9. doi: 10.1128/JVI.00151-12; PubMed Central PMCID: PMC3347392.
  • 55. Hom N, Gentles L, Bloom JD, Lee KK. Deep Mutational Scan of the Highly Conserved Influenza A Virus M1 Matrix Protein Reveals Substantial Intrinsic Mutational Tolerance. J Virol. 2019;93(13). doi: 10.1128/JVI.00161-19; PubMed Central PMCID: PMC6580950.
  • 56. Lee JM, Huddleston J, Doud MB, Hooper KA, Wu NC, Bedford T, et al. Deep mutational scanning of hemagglutinin helps predict evolutionary fates of human H3N2 influenza variants. Proc Natl Acad Sci U S A. 2018;115(35):E8276–E85. doi: 10.1073/pnas.1806133115; PubMed Central PMCID: PMC6126756.
  • 57. Schiffels S, Szollosi GJ, Mustonen V, Lassig M. Emergent neutrality in adaptive asexual evolution. Genetics. 2011;189(4):1361–75. Epub 2011/09/20. doi: 10.1534/genetics.111.132027; PubMed Central PMCID: PMC3241435.
  • 58. Batorsky R, Sergeev RA, Rouzine IM. The route of HIV escape from immune response targeting multiple sites is determined by the cost-benefit tradeoff of escape mutations. PLoS Comput Biol. 2014;10(10):e1003878. doi: 10.1371/journal.pcbi.1003878; PubMed Central PMCID: PMC4214571.
  • 59. Rast LI, Rouzine IM, Rozhnova G, Bishop L, Weinberger AD, Weinberger LS. Conflicting Selection Pressures Will Constrain Viral Escape from Interfering Particles: Principles for Designing Resistance-Proof Antivirals. PLoS Comput Biol. 2016;12(5):e1004799. doi: 10.1371/journal.pcbi.1004799; PubMed Central PMCID: PMC4859541.
  • 60. Rouzine IM, Weinberger LS. Design requirements for interfering particles to maintain coadaptive stability with HIV-1. J Virol. 2013;87(4):2081–93. Epub 2012/12/12. doi: 10.1128/JVI.02741-12; PubMed Central PMCID: PMC3571494.

Decision Letter 0

Santiago F Elena, Raul Andino

7 Feb 2021

Dear Dr. Rouzine,

Thank you very much for submitting your manuscript "An evolution-based high-fidelity method of epistasis measurement: theory and application to influenza" for consideration at PLOS Pathogens. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Santiago F. Elena, PhD

Guest Editor

PLOS Pathogens

Raul Andino

Section Editor

PLOS Pathogens

Kasturi Haldar

Editor-in-Chief

PLOS Pathogens

orcid.org/0000-0001-5065-158X

Michael Malim

Editor-in-Chief

PLOS Pathogens

orcid.org/0000-0002-7699-2064

***********************

Dear Dr. Rouzine,

Thanks for giving us the opportunity to review your work, which is tackling a most relevant question in evolutionary virology. As you will see, the three reviewers find merits in your work but also detect some important weaknesses that need to be addressed before the manuscript can be accepted in PLoS Pathogens.

You will notice that two reviewers agree that you haven't provided strong enough evidence that your method works, nor indicated in which parameter regimes it is expected to perform best. Also, an effort must be made to improve clarity, specifically with a readership of experimental virologists in mind. Related to this criticism, making your code available for others to use is considered essential.

All reviewers also point to many typos and places where the manuscript needs to be polished. Please, pay close attention to them.

Looking forward to seeing your revised version.

Santiago Elena

Reviewer's Responses to Questions

Part I - Summary

Please use this section to discuss strengths/weaknesses of study, novelty/significance, general execution and scholarship.

Reviewer #1: This study is valuable in its approach of using population genetics theory to devise a method of detecting epistasis among sequences sites. The approach is rigorous and well defended.

Reviewer #2: The authors propose a version of the linkage-disequilibrium test for detecting pairwise epistasis in sequence data. A major obstacle in detecting epistatic pairs is that many other pairs are also at linkage disequilibrium, due to hitchhiking or due to indirect interactions. The authors propose a procedure for reducing this noise which relies on estimating conditional LD, conditioned on a potential confounding site being at its wildtype state. They apply this approach to identifying epistatic networks in influenza H1N1 neuraminidase data and find that a previously known site 248 near the active pocket is involved in many interactions.

The authors ask an important question. Their core idea makes sense and is promising. However, I see two major issues with this manuscript, which I will discuss in detail below. The first problem is that the authors have not demonstrated how well their method works. In what regimes is it expected to work? When does it fail? The second problem is the organization of the manuscript and a lack of clarity in many places, which make reading it difficult and at times frustrating, even for a specialist.
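The conditioning idea summarized by Reviewer #2 above can be illustrated with a minimal sketch. It uses the classical linkage-disequilibrium coefficient D = f_11 - f_i * f_j as a stand-in; this is an assumption for illustration and is not the authors' UFE statistic, whose definition is not given in this letter. The function names and the toy data are hypothetical.

```python
from typing import Sequence


def ld_coefficient(seqs: Sequence[Sequence[int]], i: int, j: int) -> float:
    """Classical LD coefficient D = f_11 - f_i * f_j for binary sites i and j.

    NOTE: D is a stand-in chosen for illustration, not the UFE statistic.
    """
    n = len(seqs)
    if n == 0:
        raise ValueError("no sequences to evaluate")
    f_i = sum(s[i] for s in seqs) / n
    f_j = sum(s[j] for s in seqs) / n
    f_11 = sum(1 for s in seqs if s[i] == 1 and s[j] == 1) / n
    return f_11 - f_i * f_j


def conditional_ld(seqs: Sequence[Sequence[int]], i: int, j: int, k: int) -> float:
    """D for sites i and j, restricted to sequences where site k is wild type (0)."""
    subset = [s for s in seqs if s[k] == 0]
    return ld_coefficient(subset, i, j)


# Hypothetical toy data, columns are sites (i, j, k).
# When k is mutant, i and j are always mutant (both are tied to k).
# When k is wild type, i and j vary independently of each other.
seqs = [
    (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1),
    (0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 0),
]

raw = ld_coefficient(seqs, 0, 1)              # nonzero: association induced through site k
conditioned = conditional_ld(seqs, 0, 1, 2)   # zero: association vanishes once k is fixed
```

In this toy example the pair (i, j) shows a nonzero association in the full sample only because both sites covary with site k; restricting the sample to sequences with wild-type k removes it. Note that the conditioned subset can become small or monomorphic in real data, which is the sampling concern raised elsewhere in this review.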

Reviewer #3: attached

**********

Part II – Major Issues: Key Experiments Required for Acceptance

Please use this section to detail the key new experiments or modifications of existing experiments that should be absolutely required to validate study conclusions.

Generally, there should be no more than 3 such required experiments or major modifications for a "Major Revision" recommendation. If more than 3 experiments are necessary to validate the study conclusions, then you are encouraged to recommend "Reject".

Reviewer #1: (No Response)

Reviewer #2: 1. INSUFFICIENT EVIDENCE. The idea to detect epistasis using linkage disequilibrium is not new. It is clear why it works in sexual populations or in a system at mutation-selection balance. Here, the authors argue that essentially the same idea can also work in rapidly adapting asexual populations, such as the influenza virus. But why and when it works is unclear. Figure 2 convinces me that it works sometimes. But I think for this paper to have impact, the authors need to provide at least some guidance as to when the method works and when it fails.

For this method to work, the epistatically interacting sites must be at minimum simultaneously polymorphic. The triple-site method requires three sites to be polymorphic at the same time. The genetic diversity of a population depends on the population genetic parameters, as the authors are well aware. For example, if the population evolves in the successional mutation regime, the UFE statistic cannot be evaluated and this method will not work. But it is not clear to me whether simply being in the concurrent mutation regime is sufficient. The authors heavily rely on their earlier paper (Pedruzzi et al, PLoS Comp Biol 2018), but I could not find a clear delineation of different parameter regimes. Ideally, I would like to see a phase diagram in the U, N and s space for some simple epistatic model like the “double arches” showing where the UFE statistic between a specific pair of sites correlates well with the true value of E, where it does not correlate and where it cannot be evaluated.

The second issue that I see is that the ability to detect epistasis using this method may depend on the structure of the epistatic network. Figure 2 shows that it works for the simplest possible structure, the “double arches”. We do not have a great picture of what epistatic networks look like within actual proteins. But I think evaluating the performance of this method for some more complex networks would be necessary.

The third issue that I am concerned about is how the amount of data and the sampling protocol might affect the method’s performance. Most sites that carry beneficial mutations, e.g., in influenza, are polymorphic only transiently. This presumably means that the proposed test has to be applied to sequences sampled within a certain time window. So, how should one choose this window? And how sensitive is the method to this choice? I am guessing that the downweighting that the authors did in their analysis of the influenza data set addresses this issue. But that was not clearly explained and seems ad hoc. In any case, this seems like an important issue and should be addressed directly in the paper. Also, how much data is actually needed in practice to reliably estimate the conditional UFE?

2. CLARITY. There are several major clarity issues. Minor issues are listed below.

I think the authors made a poor choice in how to introduce their approach. First, the second half of the intro where the authors essentially summarize part of their previous paper (Pedruzzi et al, 2018) is not understandable on its own without actually reading the supplement of that paper. Second, this theory is only relevant in so far as it provides sort of an explanation why the LD-based method might work in adapting asexual populations. But it is actually unnecessary for understanding the basic idea behind the UFE statistic, which is just a way to detect LD. Third, this explanation is not even satisfactory because it remains unclear when the assumptions made in their model actually hold (see above). So, the reader becomes very confused and skeptical before getting to the Results. Fourth, the UFE statistics are not introduced in the Results section. I think they should be introduced, since it is the central contribution of this methodological paper.

I found the analytical results for the double-arches model quite helpful. I think the authors should consider discussing these results in the main text.

The authors stress multiple times in the text that it is important to average the UFE statistic across multiple populations to reduce linkage noise. But they do not do this explicitly in their analysis of the flu data. This has to be addressed.

As mentioned above, the downsampling procedure is not part of their method, at least as described, and so its purpose has to be clearly explained.

Finally, I don’t quite understand how the authors estimate allele and haplotype frequencies from their data, given the fact that the sequences were sampled over multiple years. The authors appear to estimate frequencies from the entire alignment, which seems wrong because it does not represent a snapshot of a population. I also don’t understand how the authors are confident that the mutation at site 248 that they detected actually genetically interacts with other sites, given that this mutation is present in all post-2009 N1 variants? My understanding (but I may be wrong) is that pre-2009 H1N1 lineage was outcompeted by the 2009 “swine flu” variant. So, couldn’t it be then that the mutation at site 248 was simply present in the 2009 founder strain and is now shared by all of its descendants, without epistatically interacting with any other sites? What’s the evidence that it is actually epistatic?

Reviewer #3: attached

**********

Part III – Minor Issues: Editorial and Data Presentation Modifications

Please use this section for editorial suggestions as well as relatively minor modifications of existing data that would enhance clarity.

Reviewer #1: p. 11 - The explanation for using the triple-site haplotype method should be made clearer at this point, even though it is explained in the Methods section.

p. 12 - Although it seems a reasonable assumption, there should be some justification for assuming an epistatic network for NA. Surely there is some empirical evidence for this.

p. 15 - Is there any empirical evidence that the 15-20 compensatory mutations are actually compensatory? Also, the identification of the primary site could presumably be made by simply comparing aligned sequences from before and after the pandemic. It would be useful to comment on this.

More generally, the method has been tested using compensatory epistasis, but is there any evidence it works with other forms of epistasis?

This is an editorial comment. There are minor errors of sentence structure or typographical errors throughout the manuscript. These seem typical of writers that are not native English speakers and could be easily corrected by a native speaker or by the journal.

Reviewer #2: (No Response)

Reviewer #3: attached

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: Yes: Christina L. Burch

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example see here on PLOS Biology: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see http://journals.plos.org/plospathogens/s/submission-guidelines#loc-materials-and-methods

Attachment

Submitted filename: Rouzine 2021.docx

Decision Letter 1

Santiago F Elena, Raul Andino

11 May 2021

Dear Dr. Rouzine,

Thank you very much for submitting your manuscript "An evolution-based high-fidelity method of epistasis measurement: theory and application to influenza" for consideration at PLOS Pathogens. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we may consider this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

Let us apologize for the long time it has taken to reach a decision. I'm afraid to say that the reviewers are still quite disparate in their opinions, from rejection to minor revision. After our own reading, like the three reviewers, we found your study very important and of interest for a very broad audience. However, we do concur with the most critical reviewer that the manuscript is still far from being clear enough for the (mostly) non-technical audience of PLoS Pathogens. Actually, some of us found PLoS Computational Biology a much better venue for this particular study. But this is just a personal preference.

PLoS Pathogens very rarely gives second opportunities; however, given the interest of the manuscript, we'd like to consider a new, largely revised version that tackles the concerns of the most critical reviewer in a very constructive manner, not simply ignoring them. If you decide to do so, then we'll most likely reach a decision without sending the manuscript out for a third round of review.

Sorry we can't be more positive this time.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Santiago F. Elena, PhD

Guest Editor

PLOS Pathogens

Raul Andino

Section Editor

PLOS Pathogens

Kasturi Haldar

Editor-in-Chief

PLOS Pathogens

orcid.org/0000-0001-5065-158X

Michael Malim

Editor-in-Chief

PLOS Pathogens

orcid.org/0000-0002-7699-2064

***********************

Dear Dr. Rouzine,

Let me apologize for the long time it has taken to reach a decision. I'm afraid to say that the reviewers are still quite disparate in their opinions, from rejection to minor revision. After my own reading, like the three reviewers, I found your study very important and of interest for a very broad audience. However, I do concur with the most critical reviewer that the manuscript is still far from being clear enough for the (mostly) non-technical audience of PLoS Pathogens. Actually, I find PLoS Computational Biology a much better venue for this particular study. But this is just my personal feeling.

PLoS Pathogens very rarely gives second opportunities; however, given the interest of the manuscript, I'd like to consider a new, largely revised version that tackles the concerns of the most critical reviewer in a very constructive manner, not simply ignoring them. If you decide to do so, then I'll most likely reach a decision without sending the manuscript out for a third round of review.

Sorry I can't be more positive this time.

Reviewer Comments (if any, and for reference):

Reviewer's Responses to Questions

Part I - Summary

Please use this section to discuss strengths/weaknesses of study, novelty/significance, general execution and scholarship.

Reviewer #1: The authors appear to have done a thorough job of responding to the reviewers' comments.

Reviewer #2: The authors mostly addressed my concerns related to the structure of the paper. I found the current manuscript much more readable. The additional analytical work now provided in the supplement is also a step in the right direction. However, two major concerns remain. First, the present manuscript still does not give the reader a good sense of when the method works and how well, when it is expected to fail and how. Second, the claims regarding the influenza analysis remain unsubstantiated and the analysis remains unclear. In addition, my minor concerns from the first round of review were apparently not conveyed to the authors. I reiterate several of them below.

Reviewer #3: This revised manuscript is greatly improved in terms of clarity. Most of the issues I raised have been resolved and I am most appreciative of the care taken to provide greater intuition for the methods choices. I now have only one major concern, one moderate suggestion, and a few copy editing suggestions. Once the major concern has been addressed, I think the manuscript will make a strong contribution to PLoS Pathogens.

**********

Part II – Major Issues: Key Experiments Required for Acceptance

Please use this section to detail the key new experiments or modifications of existing experiments that should be absolutely required to validate study conclusions.

Generally, there should be no more than 3 such required experiments or major modifications for a "Major Revision" recommendation. If more than 3 experiments are necessary to validate the study conclusions, then you are encouraged to recommend "Reject".

Reviewer #1: No major issues.

Reviewer #2: 1. ROBUSTNESS OF THE METHOD. PLoS Pathogens is a journal that is read by a fairly wide audience, including practitioners who might be interested in applying the method proposed here to other viruses. So, in my opinion, a paper published in this journal needs to go a bit beyond a mere “proof of principle”. Doing so is not actually too laborious.

Specifically, I think that a typical PLoS Pathogens reader would want to have at least some sense of what performance to expect of the method presented here. In fact, the authors’ own analysis of influenza exposes the need for a sensitivity analysis. To address this concern, maybe the authors can use some quantitative metrics of performance, e.g., the rate of false positive and false negative epistatic pairs detected by their method, and report how their method performs with respect to these metrics as evolutionary parameters vary, at least in their double arches topology shown in Figure 1B. This is fairly straightforward to do. The authors know better than me which parameters are most important to vary, but I’d certainly be interested to know at least how the strength of epistasis E and the number of independent replicates affect the performance.
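As a minimal sketch of the performance metrics requested here, the false-positive and false-negative rates can be computed by comparing the detected pairs against the known network used in a simulation. The function name, the unordered-pair representation and the toy numbers below are hypothetical and are not taken from the manuscript.

```python
from itertools import combinations
from typing import Iterable, Tuple


def detection_rates(
    true_pairs: Iterable[Tuple[int, int]],
    detected_pairs: Iterable[Tuple[int, int]],
    n_sites: int,
) -> Tuple[float, float]:
    """Return (false-positive rate, false-negative rate) over all site pairs.

    Pairs are treated as unordered, so (1, 0) and (0, 1) are the same pair.
    """
    all_pairs = {frozenset(p) for p in combinations(range(n_sites), 2)}
    true_set = {frozenset(p) for p in true_pairs}
    detected = {frozenset(p) for p in detected_pairs}

    negatives = all_pairs - true_set      # pairs with no true interaction
    false_pos = detected & negatives      # detected but not truly interacting
    false_neg = true_set - detected       # truly interacting but missed

    fpr = len(false_pos) / len(negatives) if negatives else 0.0
    fnr = len(false_neg) / len(true_set) if true_set else 0.0
    return fpr, fnr


# Hypothetical example: 5 sites (10 pairs), 2 true interactions,
# of which one is recovered and one spurious pair is reported.
fpr, fnr = detection_rates(
    true_pairs=[(0, 1), (2, 3)],
    detected_pairs=[(1, 0), (1, 2)],
    n_sites=5,
)
```

Repeating such a calculation while varying one parameter at a time, for example the epistatic strength E or the number of independent replicate populations, would yield the sensitivity curves the reviewer asks for.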

2. INFLUENZA CLAIMS. This part of the paper has the following two major issues.

2A. The description of the analysis of the influenza data remains unclear in several places, its organization could be streamlined, and some key pieces of information are missing. All of this makes it very difficult to understand what has been done, and makes this analysis not reproducible as described.

I think the following two specific points need to be addressed before I’d be comfortable recommending this paper for publication.

2A-i. Describe the analysis in more detail, provide key pieces of information in the text, make data and code available, as described in (a) through (h) below.

(a) The authors say that the sequences were split into several subsets by geographic location (LL 281-287). Into how many geographic groups have the authors actually split the data? How coarse-grained are these locations? How many sequences of each type (pre-pandemic and pandemic) are in each location? As far as I can tell, this information is currently not provided anywhere. Please provide it, e.g., as a data table, and report some key summary statistics in the text.

(b) Was the frequency cutoff of 5% applied to each location separately or is it 5% of the total number of sequences, i.e., 8440 * 0.05 = 422? I suspect it is the latter. In this case, please state this explicitly. But then I wonder how is the UFE statistic evaluated for sites that are not polymorphic in some sub-populations? Or does this never happen? Please discuss this.

(c) Please provide the estimated consensus sequence as well as the binarized individual sequences as supplementary data files. Or at the very least provide the code and the raw sequence data file that you used that would generate all these intermediate data files.

(d) Please provide the estimated individual allele frequencies, bi-allele frequencies and the conditional allele frequencies for each geographic location.

(e) Please show the distribution of mutant allele frequencies as a figure and please provide an estimate of the fraction of sequences around each peak.

(f) I also would very much like to see a phylogeny showing how the pre-pandemic and pandemic clades are related to each other. Specifically, I would like to see how many mutations separate the pandemic clade from the rest of the pre-pandemic variants. This information is critical for understanding how the authors arrive at their epistasis network for site 248 later on.

(g) Rather than saying “multiple times”, please specify exactly how many times the resampling was performed (L 314).

(h) Please provide the estimated values of the UFE and conditional UFE statistics for each location for each pair of sites in a supplementary data file.
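To make the distinction raised in point (b) concrete, the following minimal sketch contrasts a 5% cutoff applied to the pooled sample with the same cutoff applied to each location separately. Only the total of 8440 sequences comes from the review; the location labels, the split of sequences among locations and the mutant counts are hypothetical.

```python
from typing import Dict, List


def passes_global_cutoff(
    mutant_counts: Dict[str, int], totals: Dict[str, int], cutoff: float = 0.05
) -> bool:
    """True if the pooled mutant count reaches the cutoff fraction of all sequences."""
    return sum(mutant_counts.values()) >= cutoff * sum(totals.values())


def locations_passing_cutoff(
    mutant_counts: Dict[str, int], totals: Dict[str, int], cutoff: float = 0.05
) -> List[str]:
    """Locations in which the mutant count reaches the cutoff fraction locally."""
    return [
        loc for loc in totals
        if mutant_counts.get(loc, 0) >= cutoff * totals[loc]
    ]


# Hypothetical split of the 8440 sequences among three locations.
totals = {"A": 5000, "B": 3000, "C": 440}
# Hypothetical mutant counts at one site.
mutant_counts = {"A": 430, "B": 0, "C": 10}

pooled_ok = passes_global_cutoff(mutant_counts, totals)      # 440 of 8440 sequences
local_ok = locations_passing_cutoff(mutant_counts, totals)   # only location "A"
```

In this hypothetical case the site passes the pooled cutoff yet is monomorphic in location B and rare in location C, which is exactly the situation in which a per-location statistic could not be evaluated.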

2A-ii. Streamline this part of the paper.

LL 281-287. I don’t understand what the purpose of this paragraph is. Is it a description of what is being done or an introduction into what will be described next? The next paragraph describes the processing of the data, so I infer that the current paragraph is introductory. Then it is unnecessary and confusing. The previous paragraph gives enough background information, and diving right into a brief description of what is being done would be preferable. The current paragraph also mentions how sequences were divided among geographic locations, which is not explained in more detail anywhere else, as far as I could tell. I suggest removing the redundant parts of the paragraph and providing missing information in the Methods section and as supplementary data, as suggested above.

LL 288-300. IMO, this paragraph largely belongs to the Methods section and also requires more details, as discussed above.

2B. There is currently not enough support for the main result that the mutation at site 248 genetically interacts with other sites. I once again ask the authors to explain how they reject the following hypothesis. Suppose that all pandemic N1 variants carry 15 to 22 mutations (including a mutation at site 248) that differentiate all these variants from all pre-pandemic variants. In this case, there is simply no information about the epistatic interactions between these 15 to 22 sites. All these mutations could be hitchhikers. We only know that they form a single haplotype which is sweeping to fixation. The argument that “our three-way method, tested in simulation and analytically, shows…” is not sufficient because neither the simulations nor the analytical calculations have explored a scenario where a new strain with dozens of new mutations invades a population. This is not a quasi-equilibrium travelling wave scenario that was at least to some extent explored in the first part of the paper. So, the fact that the conditional UFE method reveals some significant pairs in this analysis is not sufficient to convince me that site 248 is actually involved in any epistatic interactions, let alone the claim that “the new strain that has outcompeted the old strain, only because it had a primary mutation 248 and many compensatory mutations.”

A smaller but still important issue is that the phase diagram in Figure 4F shows that the number of detected interacting site pairs is quite sensitive to the choice of the resampling parameters. But the authors do not specify how these parameters should be chosen. They simply declare by fiat that choices A and B are incorrect whereas choices C, D, E are correct. I don’t understand what justifies this choice.

Reviewer #3: Attached

**********

Part III – Minor Issues: Editorial and Data Presentation Modifications

Please use this section for editorial suggestions as well as relatively minor modifications of existing data that would enhance clarity.

Reviewer #1: The authors, in their response to my comment, state:

(p. 12 - Although it seems a reasonable assumption, there should be some justification for assuming an epistatic network for NA.)

“We do not assume. We infer it. We detect true interactions and measure their magnitude. I made it clear now:”

Epistasis is defined as interactions among alleles that affect the phenotype. You are inferring associations among alleles, not epistasis. Also, you do not “detect true interactions” if you are inferring them. It would be more convincing if you could provide independent evidence of epistasis. This is a theoretical proof-of-concept after all.

The authors state on p. 6:

“Co-variation due to random linkage completely masks the epistasis signature in a population. The only way to resolve this issue is to average the haplotype frequencies over many independent populations with similar parameters under similar conditions. Without sampling multiple populations, it is not possible to infer epistasis in principle, due to the stochastic nature of phylogenetic relation of sequences. This fundamental limitation cannot be resolved by any existing or future method.”

Why can’t this limitation be resolved by any existing or future method? Why can’t multiple independent populations be sampled?

Reviewer #2: Throughout the text. It is stylistically preferable to avoid loose jargon such as “detour”, “kills indirect interactions” and others. “Detour” is used particularly often. I suggest that the authors either define what they mean by “detour” more precisely or simply replace it with something more precise like “indirect interaction” or “spurious interaction”.

LL 190. How is the UFE_{ij} statistic (or the UFE_{ij0}) evaluated if either f_{01} or f_{10} are zero?

LL 237-248. Please provide in the Results section the expressions for the UFE_{ij} and UFE_{ij0} statistics for epistatic and non-epistatic pairs obtained in the analytical calculations. I think it’s fine to relegate the calculations themselves to the Methods section. But the reader needs to see some substantiation for the statements made in the paragraph.

LL 245. It seems that calling this type of epistasis “diminishing returns” is not quite accurate (“diminishing returns epistasis” usually refers to beneficial mutations), and adding this label doesn’t add any value to this discussion. I suggest removing it.

LL 357-359. This statement “We note that Influenza virus has been shown to map to the traveling wave theory …” is somewhat misleading. Yes, antigenic drift in influenza is reasonably well described by the travelling wave theory. However, this is not the scenario that the authors tackle here. The pandemic 2009 flu is a relatively distant strain of H1N1 that invaded the resident H1N1 strain. So, it’s not just a single travelling wave situation, as mentioned above. Please rephrase.

L 427. Should there be a + sign in front of the second sum? Otherwise, I do not see how E = 1 represents compensation.

L 441. What exactly is f0?

L 441. Do U and µL represent the same quantity?

L 446. What does “epistatically isolated” mean?

LL 454-457. I think this statement requires some qualifications. I think the authors mean that the evolving population reaches the maximum-entropy state at the given fitness level with respect to the polymorphic sites. There could be other genotypes that have the same level of fitness but would require flipping some sites that are fixed in the population.

Section “Analytic test of the method” (LL 479-523). I am confused. Are these calculations for the “double arches” topology shown in Figure 1 or for the “triple arches” topology shown in Figure 4? These paragraphs refer to both figures. See my related comment to Figure 4 below. Please resolve this confusion.

L 489. “interaction s” remove the space.

LL 510-511. I still don’t follow how the expression for UFEij is derived from Eq (9). It seems some approximations are made. Please state them explicitly.

LL 513-514. Please provide an expression for the UFE for a truly epistatic pair so that the expressions for UFE for epistatic and non-epistatic pairs can be directly compared.

LL 517-518. And the same for the UFE_{ij0}. Direct comparison of conditional UFE for both types of pairs would also be helpful.

L 541-544: What is a guide tree and how is it used? It seems that it is used to generate the consensus sequence, which is then used to binarize each amino acid residue in each sequence. But how the consensus sequence is obtained is not clear from this description.

Figure 4.

(1) This looks like “triple arches” to me, not “double arches”, so it is a different topology than the one that the authors explore numerically in Figure 1. I pointed this out in my first review, but this has not been addressed. Please resolve this.

(2) What does “active interaction” mean?

Reviewer #3: Attached

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

References:

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Attachment

Submitted filename: Rouzine 2021 Revision.pdf

Decision Letter 2

Santiago F Elena, Raul Andino

25 May 2021

Dear Dr. Rouzine,

We are pleased to inform you that your manuscript 'An evolution-based high-fidelity method of epistasis measurement: theory and application to influenza' has been provisionally accepted for publication in PLOS Pathogens.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Pathogens.

Best regards,

Santiago F. Elena, PhD

Guest Editor

PLOS Pathogens

Raul Andino

Section Editor

PLOS Pathogens

Kasturi Haldar

Editor-in-Chief

PLOS Pathogens

orcid.org/0000-0001-5065-158X

Michael Malim

Editor-in-Chief

PLOS Pathogens

orcid.org/0000-0002-7699-2064

***********************************************************

Dear Dr. Rouzine,

Thank you for the time and effort put into responding to the reviewers' comments in a positive manner. I find this revised version much improved and I am glad to recommend its acceptance. I'm sure this will be a very influential contribution to the field.

Best regards,

Reviewer Comments (if any, and for reference):

Acceptance letter

Santiago F Elena, Raul Andino

15 Jun 2021

Dear Dr. Rouzine,

We are delighted to inform you that your manuscript, "An evolution-based high-fidelity method of epistasis measurement: theory and application to influenza," has been formally accepted for publication in PLOS Pathogens.

We have now passed your article on to the PLOS Production Department, who will complete the rest of the pre-publication process. All authors will receive a confirmation email upon publication.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any scientific or type-setting errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript. Note: Proofs for Front Matter articles (Pearls, Reviews, Opinions, etc...) are generated on a different schedule and may not be made available as quickly.

Soon after your final files are uploaded, the early version of your manuscript, if you opted to have an early version of your article, will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting open-access publishing; we are looking forward to publishing your work in PLOS Pathogens.

Best regards,

Kasturi Haldar

Editor-in-Chief

PLOS Pathogens

orcid.org/0000-0001-5065-158X

Michael Malim

Editor-in-Chief

PLOS Pathogens

orcid.org/0000-0002-7699-2064

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Text. Derivation of UFE for the closed square topology.

    (PDF)

    S1 Table. Direct and indirect UFE values for the square topology.

    Index 0 indicates a 3-way measure.

    (TIFF)

    S1 Fig. Square topology UFE calculation.

    A) Possible configurations and their symmetry. B) Indirect interaction pairwise. C) Indirect interaction three-way, with a fixed zero at a node. D) Indirect interaction with two fixed zeros. E) Direct interaction pairwise. F) Direct interaction three-way.

    (TIFF)

    S2 Fig. Dependence of 5 different types of linkage measure (S1B–S1F Fig) on epistatic strength predicted for square topology.

    (TIFF)

    Attachment

    Submitted filename: Rouzine 2021.docx

    Attachment

    Submitted filename: Rebuttal.pdf

    Attachment

    Submitted filename: Rouzine 2021 Revision.pdf

    Attachment

    Submitted filename: Rebuttal2.pdf

    Data Availability Statement

    Influenza sequence data are taken from the public database https://www.fludb.org. Our software is deposited at https://github.com/rbatorsky/hiv-recombination and https://github.com/irouzine/Pedruzzi. To characterize the three-dimensional network of epistatic interaction in Fig 4, we used the software package ChimeraX from https://www.rbvi.ucsf.edu/chimerax/.


    Articles from PLoS Pathogens are provided here courtesy of PLOS
