Efficient epistasis inference via higher-order covariance matrix factorization

Kai S Shimagaki; John P Barton

doi:10.1101/2024.10.14.618287

This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2024 Oct 14:2024.10.14.618287. [Version 1] doi: 10.1101/2024.10.14.618287

Efficient epistasis inference via higher-order covariance matrix factorization

Kai S Shimagaki ^1,², John P Barton ^1,^2,^†

PMCID: PMC11507688 PMID: 39464126

Abstract

Epistasis can profoundly influence evolutionary dynamics. Temporal genetic data, consisting of sequences sampled repeatedly from a population over time, provides a unique resource to understand how epistasis shapes evolution. However, detecting epistatic interactions from sequence data is technically challenging. Existing methods for identifying epistasis are computationally demanding, limiting their applicability to real-world data. Here, we present a novel computational method for inferring epistasis that significantly reduces computational costs without sacrificing accuracy. We validated our approach in simulations and applied it to study HIV-1 evolution over multiple years in a data set of 16 individuals. There we observed a strong excess of negative epistatic interactions between beneficial mutations, especially mutations involved in immune escape. Our method is general and could be used to characterize epistasis in other large data sets.

Introduction

Epistasis is common in nature and plays an important role in evolution ^1,2. In the presence of epistasis, the fitness effects of mutations are contingent on the genetic background in which they appear, making the relationship between sequence and function complex ^3–5. More accurate estimates of epistasis could improve our ability to predict evolution, both at the level of genetic sequences and phenotypes ^6–8.

Enormous amounts of time-resolved sequence data have been generated in recent years, opening up the possibility of inferring epistasis from observations of evolution. Naively, we anticipate that sets of mutations that improve fitness will be found together in the same genetic sequence more often than expected by chance, while sets of deleterious mutations will be observed less frequently. However, phenomena such as genetic hitchhiking ⁹ and clonal interference ¹⁰ can also generate correlations between mutations that are unrelated to function. At present, a few methods exist to estimate pairwise epistatic interactions from temporal data, but computational constraints limit their applicability to small numbers of loci ^11–13.

Here we propose an efficient method for inferring epistatic fitness that extends and vastly improves the computational efficiency of an approach developed by Sohail and collaborators ¹³. With this new approach, the required memory and computational complexity scale only quadratically with the number of loci. These improvements are due to an efficient higher-order covariance matrix factorization (HCMF) method, which allows us to analyze much larger data sets than in previous analyses.

After validating our method in simulations, we apply it to study epistasis in within-host human immunodeficiency virus (HIV)-1 evolution in a cohort of 16 individuals. Several past studies have highlighted the role of epistasis in viral evolution. Early experimental work found evidence for both synergistic ¹⁴ and negative ¹⁵ epistasis in different viruses. Epistasis has been observed in influenza and in SARS-CoV-2, especially in the context of immune evasion ^16–21. In HIV-1, epistasis has been observed between mutations involved in drug resistance ^22–24 and immune escape ^25,26. Here we found a consistent pattern of negative epistasis in HIV-1, with an interaction strength that typically scales along with the fitness effects of the individual mutations. Overall, our HCMF method enables the estimation of epistasis in large data sets, and our analysis contributes to the quantification of epistasis in viral evolution.

Epistasis inference framework

As in related work ^13,27, our modeling framework is based on the Wright-Fisher (WF) model ^28–30. We write the number of haploid individuals with genotype $α$ at time $t$ in a population as $n_{α} (t)$ . Given a total population size of $N$ , we write the genotype frequency as $z_{α} (t) = n_{α} (t) / N$ . The state of the population is described by frequencies $z = (z_{1}, z_{2}, \dots, z_{M}) = {(z_{α} (t))}_{α}$ for each of the $M$ possible genotypes. Its evolution is influenced by several factors, including mutation, recombination, and the fitness of each genotype. To model the fitness effects of individual mutations and pairwise epistasis, we assume a fitness function

f^{α} = f (g^{α}) = 1 + \sum_{i} s_{i} g_{i}^{α} + \sum_{i < j} s_{i j} g_{i}^{α} g_{j}^{α} .

(1)

In the expression above, each locus $i$ has a corresponding selection coefficient $s_{i}$ that quantifies the fitness effect of the mutant allele at that locus and epistatic interactions $s_{i j}$ with mutant alleles at all other loci $j$ . For simplicity, we’ve used a binary model where each allele is either wild-type (WT) or mutant, but this can easily be extended to realistic sequence models (see Methods). Following this binary assumption, there are $M = 2^{L}$ possible genotypes for sequences with $L$ loci. The $g_{i}^{α}$ above are indicator functions, with a value equal to one if genotype $α$ has a mutant allele at locus $i$ and zero otherwise. Ultimately, our goal will be to infer the underlying fitness parameters ${(s_{i})}_{i = 1}^{L}$ and ${(s_{i j})}_{i < j}$ from temporal genetic data.

Under the WF model, the probability of obtaining a certain distribution of genotype frequencies in the next generation $z (t + 1)$ , given the current distribution $z (t)$ , is multinomial:

P (z (t + 1) | z (t); N, θ) = N! \prod_{α = 1}^{M} \frac{p_{α} {(z (t))}^{N_{z_{α}} (t + 1)}}{[N_{z_{α}} (t + 1)]!} .

(2)

We use $θ$ as a shorthand for all evolutionary parameters, including parameters describing selection and rates of mutation and recombination (Methods).

While inferring fitness parameters directly from (2) is challenging, when generation-to-generation changes in genotype frequencies are small, we can apply the simplified diffusion approximation of the WF model ^31,32. Through the diffusion approximation, we can obtain an analytically tractable expression for the probability of an evolutionary trajectory $L ({(z (t_{k}))}_{k = 1}^{K})$ , which we refer to as the path likelihood ^13,27 (Methods). This allows us to compute the fitness parameters (including individual selection coefficients $s_{i}$ and pairwise epistatic interactions $s_{i j}$ ) that best fit a data set of sequences collected over time.

To express the results, it’s useful to define a new vector $s = (s_{1}, s_{2}, \dots, s_{L}, s_{1, 2}, s_{1, 3}, \dots, s_{L - 1, L}) = {(s_{e})}_{e}$ . This vector combines both selection coefficients for individual mutations and pairwise epistatic interactions, with a generalized index $e$ that runs over both single loci $(1, 2, \dots, L)$ and pairs of loci $((1, 2), (1, 3), \dots, (L - 1, L))$ . Similarly, we can define a mutant allele frequency vector $x = (x_{1}, \dots, x_{L}, x_{1, 2}, \dots, x_{L - 1, L})$ , where

x_{i} = \sum_{α} g_{i}^{α} z_{α} x_{i, j} = \sum_{α} g_{i}^{α} g_{j}^{α} z_{α} .

(3)

Higher order mutant allele frequencies (e.g., $x_{i, j, k}$ ) are defined similarly.

The fitness parameters $\hat{s}$ that maximize the path likelihood, together with a Gaussian prior distribution (or equivalently, ridge regression penalty) for the selection coefficients and epistatic interactions, are then given by ^13,27

\hat{s} = {(C^{int} + γ I)}^{- 1} (Δ x^{int} - u^{int} - v^{int}) .

(4)

In this expression, the observed net allele frequency change over the observation period is

Δ x^{int} = \sum_{k = 0}^{K - 1} Δ x (t_{k}) = \sum_{k = 0}^{K - 1} (x (t_{k + 1}) - x (t_{K})) = x (t_{K}) - x (t_{0}),

while $u^{int}$ and $v^{int}$ represent expected cumulative frequency changes due to mutation and recombination, respectively. Explicit derivations and definitions for these terms are given in Methods. $C^{int}$ is the allele frequency covariance matrix integrated over time, and $γ$ quantifies the width of the prior distribution for the selection coefficients and epistatic interactions.

Although (4) is complicated, it can be interpreted intuitively. Essentially, (4) states that allele frequency change that is not explained by the forces of mutation or recombination is evidence of selection. The sign and magnitude of inferred selection depend on the net change in allele frequencies (how much, quantified by $Δ x^{int}$ , and how fast, quantified by the diagonal part of $C^{int}$ ) as well as the effects of genetic background (quantified by the off-diagonal terms of $C^{int}$ ).

Results

Factorization of higher-order integrated covariance matrix and efficient inference framework

While (4) provides a powerful expression to simultaneously estimate the fitness effects of mutations and pairwise epistasis from temporal genetic data, it faces a serious computational limitation. The covariance matrix $C^{int}$ is $D = q L (q L + 1) / 2 -$ dimensional, where $q$ is the number of alleles at each locus. The computational complexity of inverting the covariance matrix thus scales as $O ({(q L)}^{6})$ , with memory costs related to the storage of covariance matrix entries scaling as $O ({(q L)}^{4})$ . For data sets with hundreds or thousands of loci, simply storing the covariance matrix in memory becomes challenging.

We developed an efficient and generic method to resolve the major computational bottleneck hindering the application of this approach to larger data sets. The key idea of our approach is to exploit the regular structure of the covariance matrix, allowing us to factorize the matrix and perform calculations in a lower-dimensional space without any loss of information. Specifically, we can write the integrated covariance matrix in terms of a rectangular matrix with dimensions $D \times d$ , which depends on the number of unique sequences in the data set. Writing the number of unique sequences in the data set at time $t_{k}$ as $d_{k}$ , we have $d = \sum_{k = 0}^{K} d_{k} + K - 1$ . As we will show below, $d$ may be multiple orders of magnitude smaller than $D$ for relevant data sets of interest, allowing this factorization to dramatically speed up analyses.

In our analysis, we integrate the covariance matrix over the evolution using linear interpolation between sampling times, which mitigates periods of sparse sampling ^13,27. The integrated covariance matrix with linear interpolation can be factorized as follows (see Methods for details):

C^{int} = \sum_{k = 0}^{K - 1} Δ t_{k} \frac{C (t_{k}) + C (t_{k + 1})}{2} + \sum_{k = 1}^{K - 1} Δ t_{k} \frac{Δ x (t_{k}) Δ x {(t_{k})}^{⊤}}{6} = \sum_{k = 0}^{K} \sum_{α} ξ_{α} (t_{k}) ξ_{α} {(t_{k})}^{⊤} + \sum_{k = 1}^{K - 1} ξ (t_{k}) ξ {(t_{k})}^{⊤} = : Ξ Ξ^{⊤} .

(5)

The $ξ$ vectors are defined as

ξ_{α} (t_{k}) = \{\begin{array}{l} \sqrt{\frac{z_{α} (t_{0}) Δ t_{0}}{2}} (σ^{α} - x (t_{0})) & k = 0 \\ \sqrt{\frac{z_{α} (t_{k}) (Δ t_{k} + Δ t_{k - 1})}{2}} (σ^{α} - x (t_{k})) & 0 < k < K \\ \sqrt{\frac{z_{α} (t_{K}) Δ t_{K - 1}}{2}} (σ^{α} - x (t_{K})) & k = K, \end{array} ξ (t_{k}) = \sqrt{\frac{Δ t_{k}}{6}} Δ x (t_{k}),

(6)

where $Δ t_{k} = t_{k + 1} - t_{k}$ , $z_{α} (t_{k})$ is the frequency of genotype $α$ in the data at time $t_{k}$ , and $σ^{α}$ is a $D$ -dimensional vector with entries

σ_{e}^{α} = \{\begin{array}{l} g_{i}^{α} & for e \leq L \\ g_{i}^{α} g_{j}^{α} & for e > L . \end{array}

(7)

More complex covariance interpolation using spline curves ³³ can also be expressed in a form similar to (5).

Using the factorized $Ξ$ matrix (5), we can rewrite the equation for the estimated selection coefficients and epistatic interactions (4) as

\hat{s} = γ^{- 1} (Δ {\tilde{x}}^{int} - Ξ Δ η),

(8)

with

Δ {\tilde{x}}^{int} = Δ x^{int} - u^{int} - v^{int}, Δ η = {(Ξ^{⊤} Ξ + γ I)}^{- 1} (Ξ^{⊤} Δ {\tilde{x}}^{int}) .

(9)

Critically, computing (8) is far less computationally intensive than (4) when $D ≫ d$ , as the matrix to be inverted in (8) is only $d \times d$ . In total, the computational complexities of calculations in (8) are: matrix-vector products of $Ξ Δ η$ and $Ξ^{⊤} Δ {\tilde{x}}^{int}$ take $O (d D)$ ; matrix-matrix product of $Ξ^{⊤} Ξ$ requires $O (d^{2} D)$ ; solving the equation for $Δ η$ without directly solving its inverse is smaller than $O (d^{2 + ω})$ , with $ω$ a small positive number $0 < ω \leq 1$ , depending on linear optimization solvers.

This substantial computational reduction was achieved by implicitly computing the integrated covariance matrix without ever storing the covariance matrix itself. Therefore, our epistasis inference scheme is more efficient and scalable as the computational complexity scales only linearly with $D$ (and thus quadratically with $L$ ). In comparison, even naive selection inference without epistasis scales as $D^{3}$ . The expression for the selection coefficients in (8) uses no approximations. Thus, its solution is exact in the diffusion limit ¹³.

For simplicity, we initially assumed the same regularization $γ$ for selection and epistasis. However, we have also generalized our approach so that the regularization values $γ_{e}$ can differ and implemented this in our code (Methods). While our analysis considers only pairwise epistatic interactions, one could further extend the fitness function to consider even higher-order interactions. For $p$ -way epistatic interactions, the computational complexity would become $O (d D)$ with $D = \sum_{l = 1}^{p} q^{l} (\begin{array}{l} L \\ l \end{array})$ .

HCMF substantially reduces computational costs

To assess the efficiency of HCMF, we simulated population evolution under the WF model using different numbers of loci, ranging from $L = 50 t o 1600$ . We used a constant population size of $N = 10^{3}$ , a mutation rate of $μ = 10^{- 3}$ , and a recombination rate of $r = 10^{- 4}$ per site per generation. Our simulations ranged over 2000 generations, with virtual samples collected for inference every 10 generations. We used a fitness landscape in which 25% of mutations were beneficial $(s_{i} = 0.03)$ , 25% were deleterious $(s_{i} = - 0.03)$ , and 50% were neutral $(s_{i} = 0)$ . Similarly, 25% of all pairs of sites were randomly selected to have positive/negative epistatic interactions ( $s_{i j} = 0.03 or - 0.03$ , respectively), with the remaining 50% of the possible epistatic interactions set to zero. To ensure sufficient sampling to measure typical results, we performed 500 simulations for each condition.

Since the size of the covariance matrix increases quadratically with sequence length, the required memory size of the naive approach increases as $O (L^{4})$ . However, memory requirements only scale as $O (L^{2})$ for the HCMF method (Fig. 1a). HCMF also dramatically reduces the run time of the inference, scaling as $O (L^{2})$ compared to $O (L^{6})$ for the naive approach (Fig. 1b). For example, for $L = 400$ , HCMF is 10⁴ times faster than the naive approach. This computational advantage should further increase for larger sequence lengths. As noted above, since the HCMF approach involves no approximations, the selection coefficients and epistatic interactions inferred by this approach match the ones from the naive method exactly within machine precision.

Fig. 1. — a, Required memory size versus number of loci $L$ (measured in 100 bp). The required memory size of the naive method scales as $O (L^{4})$ , while our method reduces it to $O (L^{2})$ . For the naive method, we did not consider $L > 400$ due to computational constraints. b, Required computational time (in seconds) versus number of loci. As anticipated, the computational time of the naive method and HCMF scale by $O (L^{6})$ and $O (L^{2})$ , respectively.

Importance of higher-order covariance information for inferring epistasis

One of the main barriers to inferring epistatic interactions via (4) is computing and inverting the integrated covariance matrix. The HCMF approach we have developed offers one solution to this problem. However, one could also simplify (4) by neglecting the off-diagonal terms of the covariance matrix. This greatly reduces the computational burden of the problem, but neglects important information about linkage disequilibrium that could inform the inference of selection coefficients and epistatic interactions. We refer to this approximation as the independent model (analogous to the single locus model of ref. ²⁷).

We performed additional simulations to compare the accuracy of the HCMF method, which includes higher-order covariance information, and the independent model, which does not. These simulations were performed with the same parameters as in the previous section, using $L = 50$ loci and sparser epistatic interactions. Here we chose a random set of $L / 2 = 25$ pairs of sites to have positive epistatic interactions $(s_{i j} = 0.03)$ , $L / 2$ pairs with negative interactions $(s_{i j} = - 0.03)$ , and set the remaining epistatic interactions to zero.

The inferred epistatic interactions using the full model with HCMF are much closer to the true ones than those inferred with the independent model (Fig. 2a–b). For the full model, the distribution of inferred positive/neutral/negative epistatic interactions is roughly normal, with peaks that can easily be distinguished from one another. In contrast, the epistatic coefficients inferred using the independent model are distributed much more broadly and irregularly. While positive and negative interactions can, on average, still be distinguished from one another using the independent model, it is more difficult to do so. To quantify this difference, we computed the receiver operating characteristic (ROC) curve and area under the curve (AUC) for identifying positive (Fig. 2c) and negative (Fig. 2d) epistatic interactions using the full and independent models. Thus, by all metrics we find that the inclusion of higher-order covariance information improves the ability to identify epistatic interactions from data.

Fig. 2. — Distribution of inferred epistasis using the full model inferred via HCMF (a), which includes higher-order covariance information, and using the independent model (b), in which off-diagonal terms of the covariance matrix are set to zero. c, Receiver operating characteristic (ROC) curve for identifying positive epistasis. The area under the curve (AUC) value is 0.90 for the full model, while it drops to 0.81 for the independent model. d, Analogous ROC and AUC values for identifying negative epistasis. The AUC values are 0.90 and 0.69 for full and independent models, respectively.

Modeling epistasis improves the inference of selection coefficients

An alternative approach to reducing the computational costs of (4) is to use a simpler fitness landscape, such as a purely additive one with all $s_{i j} = 0$ . This assumption may be especially appropriate for analyzing highly similar sequences. However, in models with substantial, strong epistatic interactions, omitting epistasis could skew fitness estimates (Fig. 3a–b). In these conditions, inference using models with epistasis yields more accurate estimates of individual selection coefficients and improves the detection of beneficial and deleterious mutations (Fig. 3c–d).

Fig. 3. — Distribution of selection coefficients inferred with the full model via HCMF (a) and a simpler model with no epistatic interactions (b). When epistasis is present, including it in the model also improves estimates of selection coefficients. c, ROC curves and their AUC values for identifying positive selection coefficients. The AUC value of the full model is 0.94, while the AUC value of the model without epistasis drops to 0.87. d, Analogous ROC and AUC values for identifying deleterious selection coefficients. The AUC values are 0.88 and 0.74 for the full model and the model without epistasis, respectively. Simulation parameters are the same as in Fig. 2.

Joint inference from multiple replicates

Some evolution experiments, such as deep mutational scanning studies ^34,35, have multiple independent replicates collected under the same conditions. Using our approach, we can estimate selection coefficients and epistatic interactions that best explain the data across all replicates, as shown in prior work ^13,36,37. To demonstrate this phenomenon in a challenging setting for inference, we increased the density of epistatic interactions to 50%, with half set to be positive $(s_{i j} = 3 %)$ and half negative $(s_{i j} = 3 %)$ . In this case, epistatic effects dominate the fitness function. With a single replicate, the AUC values were 0.82 (0.81) for identifying beneficial (deleterious) selection coefficients and 0.74 (0.72) for positive (negative) epistasis. Combining data from two replicates raised the AUC to 0.93 (0.89) for selection coefficients and 0.86 (0.85) for epistasis, with further improvements as the number of replicates increases (Supplementary Fig. 1).

Epistasis in intrahost HIV-1 evolution

As a practical application of our approach, we studied within-host HIV-1 evolution in 16 individuals who were not treated with antiretroviral drugs during the sampling time. This data set included individuals enrolled in the CHAVI 001 and CAPRISA 002 studies in the United States, Malawi, and South Africa ^38,39. Each individual was identified shortly after HIV-1 infection, and the viral population within each individual was sampled frequently for several months to years afterward. For most individuals, the 3′ and 5′ halves of the HIV-1 genome were sequenced separately using single genome amplification methods, preserving information about linkage disequilibrium between mutations even at long distances. For two individuals, denoted CH505 and CAP256, only the HIV-1 surface protein Env was sequenced. Most data sets consisted of around 50–100 HIV-1 sequences in total for each sequencing region, collected over 5–8 time points, with several hundred polymorphic loci (see Supplementary Table 1). However, the viral population was also sequenced more deeply in a few individuals, featuring as many as 1205 HIV-1 sequences collected at 31 time points over roughly 5 years.

Using this data, we inferred selection coefficients and epistatic interactions between HIV-1 mutations for each individual and sampling region. We used prior estimates to set the mutation ⁴⁰ and effective recombination rates ⁴¹ in (8). By convention, we set the selection coefficients and epistatic interactions for the transmitted/founder (TF) sequence, the natural analog of WT, to zero (Methods). Thus, fitness effects are expressed relative to the strain of the virus that originally infected each individual. In general, the ability to transform the model parameters (i.e., selection coefficients and epistatic interactions) without affecting the dynamics of the model is referred to as a gauge freedom. Choosing a specific convention for the parameters is important for comparing fitness effects in different contexts and for improving the interpretability of the model ^37,42–44.

Here we focused specifically on epistatic interactions between nearby sites (separated by <50 bp), with distant epistatic interactions suppressed by strong regularization. There were two reasons for our focus on short-range interactions. First, due to the high effective recombination rate in HIV-1, the size of the sequencing region, and some large time gaps between samples, the expected change in correlations between mutant alleles due to recombination may violate the mathematical assumptions of the diffusion approximation, biasing our inferences for these sites. Second, short-range epistatic interactions may be of particular biological interest in HIV-1 evolution.

The accumulation of mutations within cytotoxic T lymphocyte (CTL) epitopes – linear peptides roughly 10 amino acids in length that are recognized by cytotoxic T cells – allows mutant viruses to escape from the immune system. Past work has shown that T cells are especially important in controlling HIV-1 replication ⁴⁵, and that the virus faces significant selective pressure to escape from CTLs ^{27,39,45–47}. However, because the recognition of CTL epitopes is highly specific, even one nonsynonymous mutation within the epitope can be sufficient to confer escape ^48–50. We anticipate that this phenomenon could lead to negative epistasis between CTL escape mutations, as the fitness benefit of multiple mutations within the epitope should be lower than expected based on the beneficial effect of each individual escape mutation.

While most of the epistatic interactions we inferred were very close to zero, a few were significantly negative (Fig. 4). We observed a general trend where negative epistatic interactions were more common between beneficial mutations, especially CTL escape mutations (Fig. 4). This pattern of negative epistasis between beneficial mutations, including CTL escape mutations, was robustly observed across all individuals and sequencing regions that we studied (Supplementary Fig. 2).

Consistency with prior estimates of selection in HIV-1

Past work has studied HIV-1 evolution in part of this data set with different modeling choices, including a model with purely additive selection ²⁷ and one that includes specific terms for CTL escape ⁵¹. Neither of these models includes pairwise epistatic interactions. Thus, we compared the selection coefficients inferred in our analysis with those from previous models to understand how the inclusion of pairwise epistasis affects the interpretation of the fitness effects of individual mutations.

Figure 5a shows a typical example of the inferred selection coefficients, ${\hat{s}}_{i}$ , with and without the inclusion of epistasis. Overall, we find that the inferred selection coefficients are similar to those in past models (mean Pearson’s $R = 0.94$ ). In particular, all models find very strong selection for CTL escape mutations ^27,51. As in previous work, the great majority of inferred selection coefficients are very close to zero (Fig. 5b). However, the model without epistasis also features heavier tails in the distribution of inferred selection coefficients, with more mutations inferred to have either very beneficial or very deleterious individual effects.

Fig. 5. — a, Comparison of inferred selection coefficients in models with and without epistasis in a typical case (700010077-3; see Supplementary Fig. 3 for all individuals). While the exact values differ, there is excellent general agreement between the mutations that are inferred to strongly affect fitness and those that are inferred to be nearly neutral. b, Distribution of selection coefficients across all individuals. Both distributions are peaked near zero, but the tails of the distributions in the full model are longer.

Discussion

Epistasis is prevalent in nature, and has been observed to influence viral evolution ^1,15,52–54. However, inferring epistasis from data is technically and computationally challenging. Here, we developed a new approach for the path likelihood inference framework ^{13,27,37,51,55} that greatly reduces computational costs for many data sets of interest, especially for inferring epistasis. Our key innovation was the efficient factorization of the higher-order covariance matrix, which allows us to analytically estimate selection coefficients and epistatic interactions from data without ever explicitly computing the covariance matrix or its inverse. For this reason, we referred to our method as higher-order matrix factorization (HCMF). The HCMF approach is general and can be applied under different assumptions about the structure of the fitness landscape. HCMF does not introduce any new approximations, so it suffers no loss in accuracy compared to prior approaches.

After validating our approach in simulations, we applied HCMF to study HIV-1 evolution within 16 individuals. The fitness effects of mutations that we inferred were consistent with past computational results ^27,36,47,51 and with experimental findings. In particular, we found strong selection for mutations that allow the virus to escape from the host immune system, in agreement with a large body of experimental work and clinical observations ^{25,39,45,46,56}.

In this HIV-1 data set, the distribution of epistatic interactions that we inferred was peaked near zero, but with a substantial tail of strong negative epistasis. Patterns of negative epistasis have also been observed in other viruses ¹⁵. Negative epistasis was especially common between CTL escape mutations, consistent with the finding that single mutations within an epitope typically already disrupt T cell recognition ^48–50. We also observed negative epistasis between pairs of beneficial mutations more generally. This finding is consistent with more general studies that have observed decreasing effect sizes of beneficial mutations over time ^6,57–59.

Our approach to inferring epistasis from temporal data differs from some prior methods, which used statistical models to explain correlations in protein sequences collected from many individuals or species ^{22,23,26,60–64}. These models treat sequence data as samples from a static, equilibrium distribution, and interpret correlations between mutations as possible evidence for epistasis. Only a handful of methods allow for the possibility that linkage disequilibrium may arise from an underlying phylogenetic structure to the sequence data ^65,66, or simply by chance. Nonetheless, these approaches have also been successful at tasks such as predicting the fitness effects of HIV-1 mutations in experiments ^62,63 and the dynamics of immune escape within individual patients ²⁶. In contrast to the present work, these models offer a “global” view of epistasis averaged across many related sequences.

The HCMF approach that we have developed is general. While our study focused on HIV-1, future work could be applied to other populations, including viruses like influenza and SARS-CoV-2 (ref. ⁵⁵), experimental evolution ³⁷, or bacteria ^67,68. As one example, recent studies have suggested that epistasis plays an important role in maintaining fitness among SARS-CoV-2 Spike mutations that escape from antibodies and control receptor binding ^19,20. More systematic studies could reveal the importance of epistasis in different aspects of SARS-CoV-2 evolution. More generally, a deeper understanding of epistasis may also improve our ability to understand and predict viral evolution.

Supplementary Material

Supplement 1

NIHPP2024.10.14.618287v1-supplement-1.pdf^{(960KB, pdf)}

ACKNOWLEDGEMENTS

The work of K.S.S. and J.P.B. reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under Award Number R35GM138233.

References

1.Phillips P. C. Epistasis—the essential role of gene interactions in the structure and evolution of genetic systems. Nature Reviews Genetics 9, 855–867 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]
2.Mackay T. F. Epistasis and quantitative traits: using model organisms to study gene–gene interactions. Nature Reviews Genetics 15, 22–33 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
3.Whitlock M. C., Phillips P. C., Moore F. B.-G. & Tonsor S. J. Multiple fitness peaks and epistasis. Annual review of ecology and systematics 601–629 (1995). [Google Scholar]
4.Szendro I. G., Schenk M. F., Franke J., Krug J. & De Visser J. A. G. Quantitative analyses of empirical fitness landscapes. Journal of Statistical Mechanics: Theory and Experiment 2013, P01005 (2013). [Google Scholar]
5.Bank C. Epistasis and adaptation on fitness landscapes. Annual Review of Ecology, Evolution, and Systematics 53, 457–479 (2022). [Google Scholar]
6.Kryazhimskiy S., Rice D. P., Jerison E. R. & Desai M. M. Global epistasis makes adaptation predictable despite sequence-level stochasticity. Science 344, 1519–1522 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
7.Park Y., Metzger B. P. & Thornton J. W. Epistatic drift causes gradual decay of predictability in protein evolution. Science 376, 823–830 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
8.Sailer Z. R. & Harms M. J. High-order epistasis shapes evolutionary trajectories. PLoS computational biology 13, e1005541 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
9.Smith J. M. & Haigh J. The hitch-hiking effect of a favourable gene. Genetics Research 23, 23–35 (1974). [PubMed] [Google Scholar]
10.Hill W. G. & Robertson A. The effect of linkage on limits to artificial selection. Genetics Research 8, 269–294 (1966). [PubMed] [Google Scholar]
11.Illingworth C. J. Fitness inference from short-read data: within-host evolution of a reassortant h5n1 influenza virus. Molecular Biology and Evolution 32, 3012–3026 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
12.Pedruzzi G. & Rouzine I. M. An evolution-based high-fidelity method of epistasis measurement: Theory and application to influenza. PLoS Pathogens 17, e1009669 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
13.Sohail M. S., Louie R. H., Hong Z., Barton J. P. & McKay M. R. Inferring epistasis from genetic time-series data. Molecular Biology and Evolution 39, msac199 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
14.Burch C. L. & Chao L. Epistasis and its relationship to canalization in the rna virus φ6. Genetics 167, 559–567 (2004). [DOI] [PMC free article] [PubMed] [Google Scholar]
15.Sanjuán R., Moya A. & Elena S. F. The contribution of epistasis to the architecture of fitness in an rna virus. Proceedings of the National Academy of Sciences 101, 15376–15379 (2004). [DOI] [PMC free article] [PubMed] [Google Scholar]
16.Lyons D. M. & Lauring A. S. Mutation and epistasis in influenza virus evolution. Viruses 10, 407 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
17.Starr T. N. et al. Shifting mutational constraints in the sars-cov-2 receptor-binding domain during viral evolution. Science 377, 420–424 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
18.Starr T. N. et al. Deep mutational scans for ace2 binding, rbd expression, and antibody escape in the sars-cov-2 omicron ba. 1 and ba. 2 receptor-binding domains. PLoS pathogens 18, e1010951 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
19.Moulana A. et al. Compensatory epistasis maintains ace2 affinity in sars-cov-2 omicron ba. 1. Nature Communications 13, 7011 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
20.Moulana A. et al. The landscape of antibody binding affinity in sars-cov-2 omicron ba. 1 evolution. Elife 12, e83442 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
21.Witte L. et al. Epistasis lowers the genetic barrier to sars-cov-2 neutralizing antibody escape. Nature Communications 14, 302 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
22.Flynn W. F. et al. Deep sequencing of protease inhibitor resistant hiv patient isolates reveals patterns of correlated mutations in gag and protease. PLoS computational biology 11, e1004249 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
23.Butler T. C., Barton J. P., Kardar M. & Chakraborty A. K. Identification of drug resistance mutations in hiv from constraints on natural evolution. Physical Review E 93, 022412 (2016). [DOI] [PubMed] [Google Scholar]
24.Zhang T. h. et al. Predominance of positive epistasis among drug resistance-associated mutations in hiv-1 protease. PLoS genetics 16, e1009009 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
25.Brockman M. A. et al. Escape and compensation from early hla-b57-mediated cytotoxic t-lymphocyte pressure on human immunodeficiency virus type 1 gag alter capsid interactions with cyclophilin a. Journal of virology 81, 12608–12618 (2007). [DOI] [PMC free article] [PubMed] [Google Scholar]
26.Barton J. P. et al. Relative rate and location of intra-host hiv evolution to evade cellular immunity are predictable. Nature communications 7, 11660 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
27.Sohail M. S., Louie R. H., McKay M. R. & Barton J. P. Mpl resolves genetic linkage in fitness inference from complex evolutionary histories. Nature biotechnology 39, 472–479 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
28.Fisher R. A. The genetical theory of natural selection: a complete variorum edition (Oxford University Press, 1999). [Google Scholar]
29.Wright S. Evolution in mendelian populations. Genetics 16, 97 (1931). [DOI] [PMC free article] [PubMed] [Google Scholar]
30.Ewens W. J. Mathematical population genetics: theoretical introduction, vol. 27 (Springer, 2004). [Google Scholar]
31.Kimura M. Diffusion models in population genetics. Journal of Applied Probability 1, 177–232 (1964). [Google Scholar]
32.Crow J. F. An introduction to population genetics theory (Scientific Publishers, 2017). [Google Scholar]
33.Shimagaki K. & Barton J. P. Bézier interpolation improves the inference of dynamical models from data. Physical Review E 107, 024116 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
34.Fowler D. M. et al. High-resolution mapping of protein sequence-function relationships. Nature methods 7, 741–746 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
35.Gasperini M., Starita L. & Shendure J. The power of multiplexed functional analysis of genetic variants. Nature protocols 11, 1782–1787 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
36.Shimagaki K. S., Lynch R. M. & Barton J. P. Parallel hiv-1 evolutionary dynamics in humans and rhesus macaques who develop broadly neutralizing antibodies. bioRxiv 2024–07 (2024). [Google Scholar]
37.Hong Z. & Barton J. P. popdms infers mutation effects from deep mutational scanning data. bioRxiv (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
38.Tomaras G. D. et al. Initial b-cell responses to transmitted human immunodeficiency virus type 1: virion-binding immunoglobulin m (igm) and igg antibodies followed by plasma anti-gp41 antibodies with ineffective control of initial viremia. Journal of virology 82, 12449–12463 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]
39.Liu M. K. et al. Vertical t cell immunodominance and epitope entropy determine hiv-1 escape. The Journal of clinical investigation 123 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
40.Zanini F., Puller V., Brodin J., Albert J. & Neher R. A. In vivo mutation rates and the landscape of fitness costs of hiv-1. Virus evolution 3, vex003 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
41.Neher R. A. & Leitner T. Recombination rate and selection strength in hiv intra-patient evolution. PLoS computational biology 6, e1000660 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
42.Shimagaki K. & Weigt M. Selection of sequence motifs and generative hopfield-potts models for protein families. Physical Review E 100, 032128 (2019). [DOI] [PubMed] [Google Scholar]
43.Posfai A., Zhou J., McCandlish D. M. & Kinney J. B. Gauge fixing for sequence-function relationships. bioRxiv (2024). [Google Scholar]
44.Posfai A., McCandlish D. M. & Kinney J. B. Symmetry, gauge freedoms, and the interpretability of sequence-function relationships. bioRxiv (2024). [Google Scholar]
45.McMichael A. J., Borrow P., Tomaras G. D., Goonetilleke N. & Haynes B. F. The immune response during acute hiv-1 infection: clues for vaccine development. Nature Reviews Immunology 10, 11–23 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
46.Allen T. M. et al. Selective escape from cd8+ t-cell responses represents a major driving force of human immunodeficiency virus type 1 (hiv-1) sequence diversity and reveals constraints on hiv-1 evolution. Journal of virology 79, 13239–13249 (2005). [DOI] [PMC free article] [PubMed] [Google Scholar]
47.Zanini F. et al. Population genomics of intrapatient hiv-1 evolution. Elife 4, e11282 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
48.Lee J. K. et al. T cell cross-reactivity and conformational changes during tcr engagement. The Journal of experimental medicine 200, 1455–1466 (2004). [DOI] [PMC free article] [PubMed] [Google Scholar]
49.Huseby E. S. et al. How the t cell repertoire becomes peptide and mhc specific. Cell 122, 247–260 (2005). [DOI] [PubMed] [Google Scholar]
50.Huseby E. S., Crawford F., White J., Marrack P. & Kappler J. W. Interface-disrupting amino acids establish specificity between t cell receptors and complexes of major histocompatibility complex and peptide. Nature immunology 7, 1191–1199 (2006). [DOI] [PubMed] [Google Scholar]
51.Gao Y. & Barton J. P. A binary trait model reveals the fitness effects of hiv-1 escape from t cell responses. bioRxiv (2024). [Google Scholar]
52.Sanjuán R., Cuevas J. M., Moya A. & Elena S. F. Epistasis and the adaptability of an rna virus. Genetics 170, 1001–1008 (2005). [DOI] [PMC free article] [PubMed] [Google Scholar]
53.Elena S. F., Solé R. V. & Sardanyés J. Simple genomes, complex interactions: epistasis in rna virus. Chaos: An Interdisciplinary Journal of Nonlinear Science 20 (2010). [DOI] [PubMed] [Google Scholar]
54.Polster R., Petropoulos C. J., Bonhoeffer S. & Guillaume F. Epistasis and pleiotropy affect the modularity of the genotype–phenotype map of cross-resistance in hiv-1. Molecular biology and evolution 33, 3213–3225 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
55.Lee B. et al. Inferring effects of mutations on sars-cov-2 transmission from genomic surveillance data. MedRxiv 2021. –12 (2022). [Google Scholar]
56.Liu Y. et al. Selection on the human immunodeficiency virus type 1 proteome following primary infection. Journal of Virology 80, 9519–9529 (2006). [DOI] [PMC free article] [PubMed] [Google Scholar]
57.Khan A. I., Dinh D. M., Schneider D., Lenski R. E. & Cooper T. F. Negative epistasis between beneficial mutations in an evolving bacterial population. Science 332, 1193–1196 (2011). [DOI] [PubMed] [Google Scholar]
58.Chou H.-H., Chiu H.-C., Delaney N. F., Segrè D. & Marx C. J. Diminishing returns epistasis among beneficial mutations decelerates adaptation. Science 332, 1190–1192 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
59.Johnson M. S., Reddy G. & Desai M. M. Epistasis and evolution: recent advances and an outlook for prediction. BMC biology 21, 120 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
60.Weigt M., White R. A., Szurmant H., Hoch J. A. & Hwa T. Identification of direct residue contacts in protein–protein interaction by message passing. Proceedings of the National Academy of Sciences 106, 67–72 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
61.Morcos F. et al. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proceedings of the National Academy of Sciences 108, E1293–E1301 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
62.Ferguson A. L. et al. Translating hiv sequences into quantitative fitness landscapes predicts viral vulnerabilities for rational immunogen design. Immunity 38, 606–617 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
63.Mann J. K. et al. The fitness landscape of hiv-1 gag: advanced modeling approaches and validation of model predictions by in vitro testing. PLoS computational biology 10, e1003776 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
64.Biswas A., Haldane A., Arnold E. & Levy R. M. Epistasis and entrenchment of drug resistance in hiv-1 subtype b. Elife 8, e50524 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
65.Rodriguez Horta E., Barrat-Charlaix P. & Weigt M. Toward inferring potts models for phylogenetically correlated sequence data. Entropy 21, 1090 (2019). [Google Scholar]
66.Colavin A., Atolia E., Bitbol A.-F. & Huang K. C. Extracting phylogenetic dimensions of coevolution reveals hidden functional signals. Scientific Reports 12, 820 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
67.Li Y. & Barton J. P. Estimating linkage disequilibrium and selection from allele frequency trajectories. Genetics 223, iyac189 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
68.Li Y. & Barton J. P. Correlated allele frequency changes reveal clonal structure and selection in temporal genetic data. Molecular Biology and Evolution 41, msae060 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplement 1

NIHPP2024.10.14.618287v1-supplement-1.pdf^{(960KB, pdf)}

[R1] 1.Phillips P. C. Epistasis—the essential role of gene interactions in the structure and evolution of genetic systems. Nature Reviews Genetics 9, 855–867 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R2] 2.Mackay T. F. Epistasis and quantitative traits: using model organisms to study gene–gene interactions. Nature Reviews Genetics 15, 22–33 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R3] 3.Whitlock M. C., Phillips P. C., Moore F. B.-G. & Tonsor S. J. Multiple fitness peaks and epistasis. Annual review of ecology and systematics 601–629 (1995). [Google Scholar]

[R4] 4.Szendro I. G., Schenk M. F., Franke J., Krug J. & De Visser J. A. G. Quantitative analyses of empirical fitness landscapes. Journal of Statistical Mechanics: Theory and Experiment 2013, P01005 (2013). [Google Scholar]

[R5] 5.Bank C. Epistasis and adaptation on fitness landscapes. Annual Review of Ecology, Evolution, and Systematics 53, 457–479 (2022). [Google Scholar]

[R6] 6.Kryazhimskiy S., Rice D. P., Jerison E. R. & Desai M. M. Global epistasis makes adaptation predictable despite sequence-level stochasticity. Science 344, 1519–1522 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R7] 7.Park Y., Metzger B. P. & Thornton J. W. Epistatic drift causes gradual decay of predictability in protein evolution. Science 376, 823–830 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R8] 8.Sailer Z. R. & Harms M. J. High-order epistasis shapes evolutionary trajectories. PLoS computational biology 13, e1005541 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R9] 9.Smith J. M. & Haigh J. The hitch-hiking effect of a favourable gene. Genetics Research 23, 23–35 (1974). [PubMed] [Google Scholar]

[R10] 10.Hill W. G. & Robertson A. The effect of linkage on limits to artificial selection. Genetics Research 8, 269–294 (1966). [PubMed] [Google Scholar]

[R11] 11.Illingworth C. J. Fitness inference from short-read data: within-host evolution of a reassortant h5n1 influenza virus. Molecular Biology and Evolution 32, 3012–3026 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R12] 12.Pedruzzi G. & Rouzine I. M. An evolution-based high-fidelity method of epistasis measurement: Theory and application to influenza. PLoS Pathogens 17, e1009669 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R13] 13.Sohail M. S., Louie R. H., Hong Z., Barton J. P. & McKay M. R. Inferring epistasis from genetic time-series data. Molecular Biology and Evolution 39, msac199 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R14] 14.Burch C. L. & Chao L. Epistasis and its relationship to canalization in the rna virus φ6. Genetics 167, 559–567 (2004). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R15] 15.Sanjuán R., Moya A. & Elena S. F. The contribution of epistasis to the architecture of fitness in an rna virus. Proceedings of the National Academy of Sciences 101, 15376–15379 (2004). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R16] 16.Lyons D. M. & Lauring A. S. Mutation and epistasis in influenza virus evolution. Viruses 10, 407 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R17] 17.Starr T. N. et al. Shifting mutational constraints in the sars-cov-2 receptor-binding domain during viral evolution. Science 377, 420–424 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R18] 18.Starr T. N. et al. Deep mutational scans for ace2 binding, rbd expression, and antibody escape in the sars-cov-2 omicron ba. 1 and ba. 2 receptor-binding domains. PLoS pathogens 18, e1010951 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R19] 19.Moulana A. et al. Compensatory epistasis maintains ace2 affinity in sars-cov-2 omicron ba. 1. Nature Communications 13, 7011 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R20] 20.Moulana A. et al. The landscape of antibody binding affinity in sars-cov-2 omicron ba. 1 evolution. Elife 12, e83442 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R21] 21.Witte L. et al. Epistasis lowers the genetic barrier to sars-cov-2 neutralizing antibody escape. Nature Communications 14, 302 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R22] 22.Flynn W. F. et al. Deep sequencing of protease inhibitor resistant hiv patient isolates reveals patterns of correlated mutations in gag and protease. PLoS computational biology 11, e1004249 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R23] 23.Butler T. C., Barton J. P., Kardar M. & Chakraborty A. K. Identification of drug resistance mutations in hiv from constraints on natural evolution. Physical Review E 93, 022412 (2016). [DOI] [PubMed] [Google Scholar]

[R24] 24.Zhang T. h. et al. Predominance of positive epistasis among drug resistance-associated mutations in hiv-1 protease. PLoS genetics 16, e1009009 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R25] 25.Brockman M. A. et al. Escape and compensation from early hla-b57-mediated cytotoxic t-lymphocyte pressure on human immunodeficiency virus type 1 gag alter capsid interactions with cyclophilin a. Journal of virology 81, 12608–12618 (2007). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R26] 26.Barton J. P. et al. Relative rate and location of intra-host hiv evolution to evade cellular immunity are predictable. Nature communications 7, 11660 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R27] 27.Sohail M. S., Louie R. H., McKay M. R. & Barton J. P. Mpl resolves genetic linkage in fitness inference from complex evolutionary histories. Nature biotechnology 39, 472–479 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R28] 28.Fisher R. A. The genetical theory of natural selection: a complete variorum edition (Oxford University Press, 1999). [Google Scholar]

[R29] 29.Wright S. Evolution in mendelian populations. Genetics 16, 97 (1931). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R30] 30.Ewens W. J. Mathematical population genetics: theoretical introduction, vol. 27 (Springer, 2004). [Google Scholar]

[R31] 31.Kimura M. Diffusion models in population genetics. Journal of Applied Probability 1, 177–232 (1964). [Google Scholar]

[R32] 32.Crow J. F. An introduction to population genetics theory (Scientific Publishers, 2017). [Google Scholar]

[R33] 33.Shimagaki K. & Barton J. P. Bézier interpolation improves the inference of dynamical models from data. Physical Review E 107, 024116 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R34] 34.Fowler D. M. et al. High-resolution mapping of protein sequence-function relationships. Nature methods 7, 741–746 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R35] 35.Gasperini M., Starita L. & Shendure J. The power of multiplexed functional analysis of genetic variants. Nature protocols 11, 1782–1787 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R36] 36.Shimagaki K. S., Lynch R. M. & Barton J. P. Parallel hiv-1 evolutionary dynamics in humans and rhesus macaques who develop broadly neutralizing antibodies. bioRxiv 2024–07 (2024). [Google Scholar]

[R37] 37.Hong Z. & Barton J. P. popdms infers mutation effects from deep mutational scanning data. bioRxiv (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R38] 38.Tomaras G. D. et al. Initial b-cell responses to transmitted human immunodeficiency virus type 1: virion-binding immunoglobulin m (igm) and igg antibodies followed by plasma anti-gp41 antibodies with ineffective control of initial viremia. Journal of virology 82, 12449–12463 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R39] 39.Liu M. K. et al. Vertical t cell immunodominance and epitope entropy determine hiv-1 escape. The Journal of clinical investigation 123 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R40] 40.Zanini F., Puller V., Brodin J., Albert J. & Neher R. A. In vivo mutation rates and the landscape of fitness costs of hiv-1. Virus evolution 3, vex003 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R41] 41.Neher R. A. & Leitner T. Recombination rate and selection strength in hiv intra-patient evolution. PLoS computational biology 6, e1000660 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R42] 42.Shimagaki K. & Weigt M. Selection of sequence motifs and generative hopfield-potts models for protein families. Physical Review E 100, 032128 (2019). [DOI] [PubMed] [Google Scholar]

[R43] 43.Posfai A., Zhou J., McCandlish D. M. & Kinney J. B. Gauge fixing for sequence-function relationships. bioRxiv (2024). [Google Scholar]

[R44] 44.Posfai A., McCandlish D. M. & Kinney J. B. Symmetry, gauge freedoms, and the interpretability of sequence-function relationships. bioRxiv (2024). [Google Scholar]

[R45] 45.McMichael A. J., Borrow P., Tomaras G. D., Goonetilleke N. & Haynes B. F. The immune response during acute hiv-1 infection: clues for vaccine development. Nature Reviews Immunology 10, 11–23 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R46] 46.Allen T. M. et al. Selective escape from cd8+ t-cell responses represents a major driving force of human immunodeficiency virus type 1 (hiv-1) sequence diversity and reveals constraints on hiv-1 evolution. Journal of virology 79, 13239–13249 (2005). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R47] 47.Zanini F. et al. Population genomics of intrapatient hiv-1 evolution. Elife 4, e11282 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R48] 48.Lee J. K. et al. T cell cross-reactivity and conformational changes during tcr engagement. The Journal of experimental medicine 200, 1455–1466 (2004). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R49] 49.Huseby E. S. et al. How the t cell repertoire becomes peptide and mhc specific. Cell 122, 247–260 (2005). [DOI] [PubMed] [Google Scholar]

[R50] 50.Huseby E. S., Crawford F., White J., Marrack P. & Kappler J. W. Interface-disrupting amino acids establish specificity between t cell receptors and complexes of major histocompatibility complex and peptide. Nature immunology 7, 1191–1199 (2006). [DOI] [PubMed] [Google Scholar]

[R51] 51.Gao Y. & Barton J. P. A binary trait model reveals the fitness effects of hiv-1 escape from t cell responses. bioRxiv (2024). [Google Scholar]

[R52] 52.Sanjuán R., Cuevas J. M., Moya A. & Elena S. F. Epistasis and the adaptability of an rna virus. Genetics 170, 1001–1008 (2005). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R53] 53.Elena S. F., Solé R. V. & Sardanyés J. Simple genomes, complex interactions: epistasis in rna virus. Chaos: An Interdisciplinary Journal of Nonlinear Science 20 (2010). [DOI] [PubMed] [Google Scholar]

[R54] 54.Polster R., Petropoulos C. J., Bonhoeffer S. & Guillaume F. Epistasis and pleiotropy affect the modularity of the genotype–phenotype map of cross-resistance in hiv-1. Molecular biology and evolution 33, 3213–3225 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R55] 55.Lee B. et al. Inferring effects of mutations on sars-cov-2 transmission from genomic surveillance data. MedRxiv 2021. –12 (2022). [Google Scholar]

[R56] 56.Liu Y. et al. Selection on the human immunodeficiency virus type 1 proteome following primary infection. Journal of Virology 80, 9519–9529 (2006). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R57] 57.Khan A. I., Dinh D. M., Schneider D., Lenski R. E. & Cooper T. F. Negative epistasis between beneficial mutations in an evolving bacterial population. Science 332, 1193–1196 (2011). [DOI] [PubMed] [Google Scholar]

[R58] 58.Chou H.-H., Chiu H.-C., Delaney N. F., Segrè D. & Marx C. J. Diminishing returns epistasis among beneficial mutations decelerates adaptation. Science 332, 1190–1192 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R59] 59.Johnson M. S., Reddy G. & Desai M. M. Epistasis and evolution: recent advances and an outlook for prediction. BMC biology 21, 120 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R60] 60.Weigt M., White R. A., Szurmant H., Hoch J. A. & Hwa T. Identification of direct residue contacts in protein–protein interaction by message passing. Proceedings of the National Academy of Sciences 106, 67–72 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R61] 61.Morcos F. et al. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proceedings of the National Academy of Sciences 108, E1293–E1301 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R62] 62.Ferguson A. L. et al. Translating hiv sequences into quantitative fitness landscapes predicts viral vulnerabilities for rational immunogen design. Immunity 38, 606–617 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R63] 63.Mann J. K. et al. The fitness landscape of hiv-1 gag: advanced modeling approaches and validation of model predictions by in vitro testing. PLoS computational biology 10, e1003776 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R64] 64.Biswas A., Haldane A., Arnold E. & Levy R. M. Epistasis and entrenchment of drug resistance in hiv-1 subtype b. Elife 8, e50524 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R65] 65.Rodriguez Horta E., Barrat-Charlaix P. & Weigt M. Toward inferring potts models for phylogenetically correlated sequence data. Entropy 21, 1090 (2019). [Google Scholar]

[R66] 66.Colavin A., Atolia E., Bitbol A.-F. & Huang K. C. Extracting phylogenetic dimensions of coevolution reveals hidden functional signals. Scientific Reports 12, 820 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R67] 67.Li Y. & Barton J. P. Estimating linkage disequilibrium and selection from allele frequency trajectories. Genetics 223, iyac189 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R68] 68.Li Y. & Barton J. P. Correlated allele frequency changes reveal clonal structure and selection in temporal genetic data. Molecular Biology and Evolution 41, msae060 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

This is a preprint.

Efficient epistasis inference via higher-order covariance matrix factorization

Kai S Shimagaki

John P Barton

Abstract

Introduction

Epistasis inference framework

Results

Factorization of higher-order integrated covariance matrix and efficient inference framework

HCMF substantially reduces computational costs

Fig. 1. HCMF substantially reduces the required memory size and computational time.

Importance of higher-order covariance information for inferring epistasis

Fig. 2. Higher-order covariance information improves the inference of epistasis.

Modeling epistasis improves the inference of selection coefficients

Fig. 3. Modeling epistasis improves the inference of additive selection coefficients.

Joint inference from multiple replicates

Epistasis in intrahost HIV-1 evolution

Fig. 4. Predominance of negative epistasis between beneficial HIV-1 mutations.

Consistency with prior estimates of selection in HIV-1

Fig. 5. Consistency of inferred selection coefficients in models with and without epistasis.

Discussion

Supplementary Material

ACKNOWLEDGEMENTS

References

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Cite

Add to Collections

PERMALINK

This is a preprint.

Efficient epistasis inference via higher-order covariance matrix factorization

Kai S Shimagaki

John P Barton

Abstract

Introduction

Epistasis inference framework

Results

Factorization of higher-order integrated covariance matrix and efficient inference framework

HCMF substantially reduces computational costs

Fig. 1. HCMF substantially reduces the required memory size and computational time.

Importance of higher-order covariance information for inferring epistasis

Fig. 2. Higher-order covariance information improves the inference of epistasis.

Modeling epistasis improves the inference of selection coefficients

Fig. 3. Modeling epistasis improves the inference of additive selection coefficients.

Joint inference from multiple replicates

Epistasis in intrahost HIV-1 evolution

Fig. 4. Predominance of negative epistasis between beneficial HIV-1 mutations.

Consistency with prior estimates of selection in HIV-1

Fig. 5. Consistency of inferred selection coefficients in models with and without epistasis.

Discussion

Supplementary Material

ACKNOWLEDGEMENTS

References

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases