Author manuscript; available in PMC: 2017 Aug 1.
Published in final edited form as: Nat Biotechnol. 2017 Jan 16;35(2):128–135. doi: 10.1038/nbt.3769

Mutation effects predicted from sequence co-variation

Thomas A Hopf 1,2,3,*, John B Ingraham 1,*, Frank J Poelwijk 4, Charlotta PI Schärfe 1,5, Michael Springer 1, Chris Sander 2,4, Debora S Marks 1,§
PMCID: PMC5383098  NIHMSID: NIHMS852897  PMID: 28092658

Abstract

Increasing interest in determining the effects of genetic variation for bioengineering, human health and basic biological research has propelled the development of technologies for high-throughput mutagenesis and selection. However, since designing functional assays is challenging and systematic testing of combinations of mutations is intractable, there is a parallel need to develop more accurate computational predictions. Most computational methods have relied significantly on the signal of evolutionary conservation, but do not account for dependencies between positions in a sequence. We present an unsupervised method for predicting the effects of mutations (EVmutation) that explicitly captures residue dependencies between positions. We find that it improves prediction accuracy across a comprehensive collection of recent high-throughput experimental fitness landscapes, biochemical measurements and human disease mutations. We suggest EVmutation may be useful to assess the quantitative effects of mutations in genes of any organism and provide precomputed predictions for ~7000 human proteins.

Introduction

Understanding the phenotypic effects of genetic variation is a central challenge for bioengineering and basic biology. Molecular technologists strive to engineer biologics that are safe and effective1, design new genomes2, 3, and develop ‘smart’ libraries of synthesized molecules4, 5. Biologists seek to pinpoint the genetic changes that underlie organism phenotypes or complex diseases6 from the ever-expanding catalog of variation in humans and model organisms. Propelled by these needs, new technologies have emerged that simultaneously assess the effects of thousands of mutations in parallel4, 7–27 (sometimes referred to as “deep mutational scans”28 or ‘MAVEs’29). In these assays, the measured attributes or processes vary, ranging from ligand binding, splicing and catalysis7, 11, 14, 16, 24, 30, 26 to cellular or organismal fitness under selection pressure8–10, 12, 15, 17, 20, 22, 23.

However, although high-throughput mutational scans have improved our understanding of the consequences of genetic variation, their relevance to organism fitness and physiology critically depends on the choice of the measured phenotype29, 31. It is reasonable that the relationship between a biochemical phenotype and organism fitness may be non-linear32 and even non-monotonic31. For the foreseeable future, scans are also limited in their scalability which limits researchers to focus either on a modest number of positions or alleles. Nature has not been subject to these limitations however, and proteins diverged by hundreds of positions can often functionally replace one another33.

Statistical models of natural sequence variation can complement experimental approaches. These models take advantage of the fact that evolution has been performing its own massively parallel mutagenesis and selection experiments over time. Most computational methods, including SIFT, PolyPhen-2, and CADD, exploit evolutionary conservation to predict the effects of mutations34–36, but their sequence analysis does not explicitly consider genetic interactions between mutations and the sequence background, despite widespread evidence for non-independence of the effects of mutations, known as epistasis37, 38.

Recent work has applied unsupervised statistical models based on the co-evolution of natural sequences to accurately predict 3D contacts in protein and RNA structures39–45, suggesting that dependencies between residues can be systematically captured from natural sequences. Although related models have been applied to predict the effect of mutations on a few proteins46–48, these methods have not been tested systematically across many proteins or against human disease mutations.

We present a method, EVmutation, which accounts for epistasis by explicitly modeling interactions between all the pairs of residues in proteins (and bases in RNAs) and which can then quantify the effects of mutations, including multiple mutations, simultaneously. We compare predictions generated by our models with measurements from 34 mutagenesis experiments and in classification of human disease mutations showing that the modeled mutational landscapes are in better agreement with experimental data than existing non-epistatic methods.

Results

Probabilistic model captures residue dependencies

Extant proteins and RNAs exhibit strong signatures of selection throughout their evolution. It is not uncommon for genes separated by hundreds of millions of years, exhibiting negligible sequence identity, to nevertheless exhibit remarkable conservation of their structures and functions. Here, we aimed to build statistical models that can recover some of the dominant constraints that define families of homologous sequences.

We model the evolutionary process that has produced each family as a sequence generator at equilibrium that produces a sequence σ with probability P(σ) as

$$P(\sigma) = \frac{1}{Z} \exp\{E(\sigma)\} \qquad \text{(Eq. 1)}$$

Different parametric forms of the “energy” function $E(\sigma)$ enable the model to capture different types of constraints on the sequences (epistatic or not), and $Z$ normalizes the distribution to sum to one over all possible sequences of a fixed length. $E(\sigma)$ may be thought of as a (negative) energy of a model from statistical physics or as proportional to the scaled fitness $N_e F$ in toy, equilibrium models of population genetics49. We use an energy function $E(\sigma)$ with two types of constraints: pairwise constraints that describe co-dependencies in combinations of amino acids or nucleotides for each pair of sites, and site-specific constraints reflecting bias towards or away from specific amino acids or nucleotides at each position. The total energy $E(\sigma)$ of a specific sequence is the sum of coupling terms $J_{ij}$ between every pair of residues and a sum of site-wise bias terms $h_i$ (fields),

$$E(\sigma) = \sum_{i} h_i(\sigma_i) + \sum_{i<j} J_{ij}(\sigma_i, \sigma_j) \qquad \text{(Eq. 2)}$$

Combining the sequence generator model (Equation 1) with our energy function (Equation 2) produces a model that is known as a pairwise undirected graphical model in computer science and a Potts model in statistical physics. When fit to data, these models explain the global correlations observed between variables in a system in terms of direct pairwise interactions Jij that are typically simpler and more localized. Determining these interactions involves simultaneous accounting for all possible couplings between all pairs of positions, which is not possible with local measures of correlation such as Mutual Information50. When models of this form are fit to natural sequence families, the magnitudes of the Jij terms have consistently predicted contacts in the 3D structures of proteins, with sufficient accuracy to predict the folds of proteins41, 43, 51, 52.
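To make Equations 1 and 2 concrete, the following is a minimal sketch that evaluates the statistical energy of a sequence and, for a toy model small enough to enumerate, its normalized probability. The data layout (`h` as a list of dictionaries, `J` as a dictionary keyed by position pairs), the function names, and all parameter values are illustrative assumptions, not the authors' implementation.

```python
import math
from itertools import combinations, product

def statistical_energy(seq, h, J):
    """E(sigma) = sum_i h_i(sigma_i) + sum_{i<j} J_ij(sigma_i, sigma_j).

    h[i][a] is the field for symbol a at position i; J[(i, j)][(a, b)] is the
    coupling for symbols a, b at positions i < j. Missing entries count as
    zero (an assumption made for brevity).
    """
    fields = sum(h[i].get(a, 0.0) for i, a in enumerate(seq))
    couplings = sum(
        J.get((i, j), {}).get((seq[i], seq[j]), 0.0)
        for i, j in combinations(range(len(seq)), 2)
    )
    return fields + couplings

def probability(seq, h, J, alphabet):
    """P(sigma) = exp(E(sigma)) / Z, with Z computed by brute-force enumeration.

    Enumeration is only feasible for toy models; it is used here purely to
    illustrate the role of the partition function Z.
    """
    Z = sum(
        math.exp(statistical_energy(s, h, J))
        for s in product(alphabet, repeat=len(seq))
    )
    return math.exp(statistical_energy(seq, h, J)) / Z

# Toy two-letter alphabet with three positions; all values are made up.
alphabet = "AB"
h = [{"A": 0.5, "B": -0.5}, {"A": 0.0, "B": 0.0}, {"A": -0.2, "B": 0.2}]
J = {(0, 2): {("A", "A"): 1.0, ("B", "B"): 1.0,
              ("A", "B"): -1.0, ("B", "A"): -1.0}}

energy_AAA = statistical_energy("AAA", h, J)
total_probability = sum(
    probability(s, h, J, alphabet) for s in product(alphabet, repeat=3)
)
```

In this toy model positions 0 and 2 prefer to carry the same symbol, so the coupling term raises the energy of "AAA" above what the fields alone would give.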

We apply these models to make sequence-specific predictions about the relative favorability of mutations. Starting from a multiple alignment of a sequence family, we estimate the site and coupling parameters h and J using regularized maximum pseudolikelihood40, 53–56. After the parameters are inferred, we quantify the effects of single or higher-order substitutions on a particular sequence background with the log-odds ratio of sequence probabilities between the wild-type and mutant sequences (Fig. 1, Supplementary Fig. 1):

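$$\Delta E(\sigma^{\mathrm{mut}}, \sigma^{\mathrm{wt}}) = \log \frac{P(\sigma^{\mathrm{mut}})}{P(\sigma^{\mathrm{wt}})} = E(\sigma^{\mathrm{mut}}) - E(\sigma^{\mathrm{wt}}) \qquad \text{(Eq. 3)}$$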

Figure 1. Inferring context-dependent effects of mutations from sequences.

Evolution has generated diverse families of proteins and RNAs with varied sequences that perform a common function. An unsupervised probabilistic model trained to generate the natural diversity in a multiple sequence alignment of a family can be used to predict the relative favorability of unseen mutations. Left: Existing models describe functional constraints on each position i in a sequence σ independently, averaging over the effect of background positions j. This can lead to incorrect predictions of neutrality. Right: Our approach infers a global probability model with pairwise interactions between positions i and j (Jij, see Methods) as well as background biases at single positions (hi). For a more detailed graphical schematic of the calculation, see Supplementary Fig. 1.

The summation over coupling terms Jij between all pairs of positions in the evolutionary statistical energy E directly incorporates sequence context, i.e. the effects of pairwise epistasis, into the calculation of mutation effects. We refer to this model, which is the basis of EVmutation, as the “epistatic model”.
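The way sequence context enters the score can be illustrated with a small sketch of Equation 3. Because the partition function cancels in the log-odds ratio, only the two energies are needed. The toy couplings below are made-up values chosen so that two single substitutions are each unfavorable while the corresponding double substitution is neutral; the function names and data layout are assumptions for illustration.

```python
from itertools import combinations

def energy(seq, h, J):
    """Pairwise statistical energy; entries absent from h or J count as zero."""
    fields = sum(h[i].get(a, 0.0) for i, a in enumerate(seq))
    couplings = sum(
        J.get((i, j), {}).get((seq[i], seq[j]), 0.0)
        for i, j in combinations(range(len(seq)), 2)
    )
    return fields + couplings

def delta_E(wild_type, mutations, h, J):
    """E(mutant) - E(wild type) for a list of (position, new_symbol) substitutions."""
    mutant = list(wild_type)
    for position, symbol in mutations:
        mutant[position] = symbol
    return energy(mutant, h, J) - energy(wild_type, h, J)

# Toy model: no site biases, one coupling that rewards matching symbols
# at positions 0 and 2 (values are illustrative only).
h = [{}, {}, {}]
J = {(0, 2): {("A", "A"): 1.0, ("B", "B"): 1.0,
              ("A", "B"): -1.0, ("B", "A"): -1.0}}

single_0 = delta_E("AAA", [(0, "B")], h, J)            # breaks the matched pair
single_2 = delta_E("AAA", [(2, "B")], h, J)            # breaks the matched pair
double = delta_E("AAA", [(0, "B"), (2, "B")], h, J)    # restores a matched pair
```

Here the double mutant is not the sum of the two single effects, which is the kind of non-additivity a model without coupling terms cannot represent.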

Model captures experimental fitness landscapes

We assessed the extent to which the statistical energy landscapes computed using EVmutation corresponded with experimentally measured changes of phenotypes. We collected data from both saturation mutagenesis experiments of proteins and RNA genes as well as focused low-throughput studies15, 46, 57–60, resulting in 34 datasets from 29 non-redundant experiments (21 proteins, and a tRNA gene27; Supplementary Table 1) that include all currently available mutation experiments where the protein or RNA had a sufficiently large and diverse alignment (Online Methods). For each protein or RNA tested, we generated a multiple sequence alignment of the sequence family, inferred the parameters h and J of the epistatic model, and compared the change in statistical energy (ΔE) to the corresponding experimental measurements of the effects of the mutations (Fig. 2; Supplementary Table 2). Since relationships between protein function and organismal fitness are not expected to be linear31, we focused on reporting rank correlations as the primary metric for evaluating predictive performance, but our results are robust to a variety of measures (Supplementary Table 3).
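As an illustration of why a rank correlation suits a non-linear relationship between prediction and measurement, here is a minimal, dependency-free sketch of Spearman's ρ (the Pearson correlation of average ranks). The predicted and measured values are invented for the example and do not come from any dataset in this study.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in order[start:end + 1]:
            ranks[k] = mean_rank
        start = end + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical predicted ΔE values and measured fitness for five mutants.
predicted = [-6.0, -4.5, -1.0, -0.2, 0.1]
measured = [0.01, 0.05, 0.40, 0.95, 0.90]

rho = spearman(predicted, measured)
# A monotonic but strongly non-linear relationship still yields rho = 1.
rho_monotonic = spearman([1, 2, 3, 4], [1, 10, 100, 1000])
```

Only the ordering of the values matters, so any monotonic rescaling of the experimental readout leaves ρ unchanged.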

Figure 2. Saturation mutagenesis experiments provide a quantitative test of context-dependent predictions.

The computed ΔE mutational landscape of the DNA methyltransferase M.HaeIII (left, colour range from 5th percentile to 0) agrees quantitatively with experimental measurements of M.HaeIII fitness under selection for cleaving activity of a restriction enzyme (right, ρ=0.69, N=1634; marginal distributions in orange). The average mutational sensitivity per position shows improved correlation beyond individual effects (ρ=0.80, N=304).

We found significant correlation between the computed ΔE of the epistatic model and the experimental measurements of phenotype and fitness for all high-throughput experimental datasets (Spearman’s ρ 0.4–0.7; p-values from <10^−300 to <10^−27 and ρ=0.2, p<10^−12 for one of the BRCA1 experiments not expected to correlate well (see below)). We found that the agreement between ΔE and experimental data is higher if the assayed phenotype is closely linked to an essential process, and that agreement depends on the strength of purifying selection applied in the experiment (Fig. 3; Supplementary Table 3, Supplementary Fig. 2). For instance, some of the strongest correlations between ΔE and experimental data were for the enzymatic activity of a methyltransferase that protects DNA from degradation17, and of a β-glucosidase that hydrolyzes biomass4. Conversely, ΔE was more weakly correlated with data from experiments where the laboratory selection pressures may not match those in the natural environment; here the distribution of experimental mutation effects tends to be skewed towards mostly neutral effects (for example BRCA1 binding to BARD1) or towards mostly very deleterious effects (a bacterial kinase subject to high doses of kanamycin) (Supplementary Table 4, Supplementary Fig. 3). The dependence on the strength of selection is exemplified by the increasing correlation of our predictions with the experiments on β-lactamase and kanamycin kinase as the antibiotic doses are titrated to reveal the full dynamic range of effects (Supplementary Figs. 4 and 5).

Figure 3. ΔE captures experimental fitness landscapes and identifies deleterious human variants.

(a) Computed effects of specific mutations (difference in evolutionary statistical energy ΔE) based on the epistatic model agree with diverse experimental measurements of fitness and molecular function for 34 experiments for 20 proteins, a protein complex and an RNA molecule (underlined) as measured by Spearman’s rank correlation coefficient ρ (for equivalent site average plot, see Supplementary Figure 2; for correlations across all different assays tested in the experiments, see Supplementary Fig. 4). (b) Evolutionary statistical energies ΔE distinguish human disease-associated variants from common alleles in the population. This separation increases with the minimum allele frequency (AF) of the variants assumed to be neutral (area under the ROC curve (AUC)=0.92 for AF≥0.1, AUC=0.94 for AF≥0.25, AUC=0.96 for AF≥0.5). (c) The epistatic model shows stronger agreement with experiments than the established methods SIFT and PolyPhen-2, a baseline model based on the BLOSUM62 substitution matrix, and a corresponding independent model without pairwise interactions (differences of ρ of more than 0.6 were included in the bin at 0.6).

The epistatic model also successfully captured the effects of mutations in low-throughput experiments on thermostability of Trypsin58 (ρ=0.77, N=23) and SH357 (ρ=0.69, N=48) and the catalytic efficiency of β-lactamase15 (ρ=0.87, N=30), including double and triple substitutions. Although these experiments consist of a biased choice of mutants that may benefit our correlation, in principle, the results suggest EVmutation could be used to design stabilizing or catalytically stronger proteins.

To assess whether ΔE distinguishes human variants associated with diseases from putatively neutral variants, we compared predictions for 9008 variants (1553 proteins) annotated as pathogenic in ClinVar61, to variants in the 60,000 human exome sequencing collection (ExAC6) at increasing allele frequencies (AF; 2190 proteins; Online Methods). The ΔE scores for the disease variants differ significantly from those of common alleles in the ExAC collection, which are presumed to be mostly neutral (area under the ROC curve AUC=0.92–0.96, p-values <10^−300, 2-sided sample KS tests; Fig. 3b).
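The area under the ROC curve used for this comparison can be read as the probability that a randomly chosen pathogenic variant scores lower than a randomly chosen neutral one. The sketch below computes it by direct pair counting, assuming the convention that a more negative ΔE is more damaging; the scores are invented and the function name is hypothetical.

```python
def auc_lower_is_pathogenic(pathogenic_scores, neutral_scores):
    """AUC as the fraction of (pathogenic, neutral) pairs ranked correctly.

    A pair counts as correct when the pathogenic variant has the lower score;
    ties count as half. This pair-counting form is equivalent to the area
    under the ROC curve but scales quadratically, so it suits small examples.
    """
    correct = 0.0
    for p in pathogenic_scores:
        for n in neutral_scores:
            if p < n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pathogenic_scores) * len(neutral_scores))

# Hypothetical ΔE scores, for illustration only.
pathogenic = [-8.1, -5.0, -2.0]
neutral = [-3.0, -1.0, 0.2]

auc = auc_lower_is_pathogenic(pathogenic, neutral)
```

An AUC of 0.5 corresponds to scores that carry no information about the label, and 1.0 to perfect separation.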

Comparison of epistatic model with published methods

We sought to evaluate the utility of adding epistatic interactions to mutation prediction by comparing with commonly applied tools on both the assembled experimental data as well as human genetic variants. For the experimental data, we compared predictions from EVmutation with SIFT34 and PolyPhen-262, along with the substitution matrix BLOSUM62. The rank correlations of ΔE (epistatic model) are higher than the other methods for the majority of the analyzed mutation experiments (Fig. 3c; Supplementary Table 5). The average increase in ρ is 0.30 over the BLOSUM62 matrix (EVmutation better on 30/31 datasets) and 0.18 over SIFT (better on 29/31 datasets) and 0.18 over PolyPhen-2 (better on 26/29 datasets). Whilst this work was in review, another study used an epistatic model to predict effects of mutations in β-lactamase 48, but using different statistical inference and alignment generation protocols. EVmutation predictions are mildly more accurate when compared directly across the four proteins tested (average increase in correlation: 0.04) and on a par with a computationally expensive approach for inference for a small number of HIV protein mutations46 (EVmutation ρ=0.81 vs. ρ=0.80; Supplementary Table 6).

We compared predictions from the unsupervised EVmutation method with PolyPhen-2, a supervised mutation predictor, on human disease variants. Despite no training on clinical variants, the performance of ΔE predictions is comparable with that of PolyPhen-2 and SIFT on the HumVar benchmark set62–64 (ΔE: AUC=0.89; PolyPhen-2: AUC=0.88; SIFT: AUC=0.85) and better than PolyPhen-2 with the subset of variants that are predicted differently by PolyPhen-2 and SIFT (AUC 0.71 versus AUC 0.61; Supplementary Fig. 6). It has been reported that prediction methods, such as PolyPhen-2, that use supervised machine learning, can have inflated measures of accuracy due to multiple routes of circularity in the training and test sets16, 64. Therefore it is surprising that our unsupervised model performs as well as, or better than, supervised methods that have been directly trained to segregate deleterious human variants.

Interaction terms underlie improved predictions

In order to understand whether the improved predictions of our epistatic model result from the added pairwise interactions J, we used a control, non-epistatic, “independent” model that is identical to ΔE but lacks interactions between sites. We fit this model to the same alignments as the epistatic model (Methods). The independent model is closely related to the log-frequency, Position-Weight Matrices and other scores of conservation 16, 34, 36, 65 that are commonly used by existing methods. The ability of ΔE (epistatic model) to capture mutational effects is better (difference in ρ ≥ 0.05) than the independent model across 23 of 34 of the analyzed experiments (Fig. 3c; Supplementary Table 5; average increase of ρ: 0.14, including high-throughput fitness experiments with well-defined selective pressures such as β-lactamase10, 15, 18, 48, kanamycin kinase13 and DNA methyltransferase M.HaeIII)17 and is also more accurate for human disease mutations (AUC 0.92 versus 0.88 for AF≥0.1), despite the ascertainment bias of these mutations64.
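The blind spot of a site-independent score can be shown with a small sketch. It uses smoothed per-column log-frequencies as a stand-in for the independent model; this is an assumption made for illustration (the text only states that the independent model is closely related to log-frequency scores), and the pseudocount, function names and toy alignment are likewise hypothetical.

```python
import math
from collections import Counter

def site_log_frequencies(alignment, alphabet, pseudocount=1.0):
    """Per-column log-frequencies with additive smoothing.

    The pseudocount keeps unobserved symbols finite; its value is an
    illustrative choice, not a parameter taken from the paper.
    """
    total = len(alignment) + pseudocount * len(alphabet)
    fields = []
    for i in range(len(alignment[0])):
        counts = Counter(seq[i] for seq in alignment)
        fields.append(
            {a: math.log((counts[a] + pseudocount) / total) for a in alphabet}
        )
    return fields

def delta_E_independent(wild_type, position, new_symbol, fields):
    """Site-wise score: log-frequency of the mutant minus that of the wild type."""
    return fields[position][new_symbol] - fields[position][wild_type[position]]

# Toy alignment in which the two columns co-vary perfectly.
alignment = ["AA", "AA", "BB", "BB"]
fields = site_log_frequencies(alignment, alphabet="AB")

# Mutating position 0 of "AA" to B yields "BA", a combination never observed.
score = delta_E_independent("AA", 0, "B", fields)
observed = "BA" in alignment
```

Because A and B are equally frequent in each column, the site-wise score calls the substitution neutral even though the resulting combination is absent from the family, which is the failure mode that pairwise terms are meant to address.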

One particularly striking contrast between the epistatic and independent models is for the toxin-antitoxin complex ParED25, especially when the two proteins are modeled jointly as a protein interaction 40. In this case the epistatic model captured the experimental effects with much higher correlation (ρ=0.51 for singles, doubles and triples, N=3335 and ρ=0.43, N=9194 for all mutations) than the independent model, which hardly correlated at all (ρ=−0.05 for all up to triplets). Interestingly, if ΔE is computed on couplings from ParD alone, the match to experiment is lower (ρ=0.33) and when computed only on couplings between (but not within) the complex subunits, the predictions are comparable to the full model (ρ=0.53 vs ρ=0.51). Hence the interactions learned in the epistatic model support accurate prediction of the interaction specificity of the complex25.

Experiments where the epistatic model performs comparably to the independent model (11/34 sets) tend to be where the correlations are below average for either method (Supplementary Table 5). This includes three of the four viral proteins and may be a consequence of the limited diversity of the sequence alignments or alternatively, a discrepancy between the proxy for viral fitness in the laboratory and the in vivo fitness of the virus22.

We next asked how the results for the epistatic and independent models depend on the evolutionary depth of the alignments used for inference. As sequence alignments become narrower around the mutated protein, i.e. evolutionarily distant sequences are progressively excluded, the epistatic model and independent model make similar predictions that both perform less well against experimental data. For specific alignment depths of some proteins, the independent model can capture the experimental data as accurately as the epistatic model (Supplementary Table 7). However, since we do not know which alignment depth to choose a priori, the ability of the epistatic model to capture constraints without prior knowledge of the optimal alignment depth may be an advantage. The choice of alignment in our method is blind to the mutation experiments and is based on the algorithm used for alignment choice when computing 3D structure contacts using EVfold41 (Methods).

Where is the epistatic model better?

To investigate where inclusion of epistatic interactions leads to improved predictions, we compared the epistatic and independent models on a mutation-by-mutation basis. Direct comparison of residuals can be misleading when the functional relationship between predictions and experimental measurements is unknown and/or nonlinear, so we computed deviations between the epistatic and independent models after mapping through a quantile-quantile transformation. This allowed us to identify specific mutations that have the largest reduction of prediction error when using the epistatic model. For instance, in an RNA-recognition motif of the poly(A)-binding protein, the 3 mutations with the largest reduction in error using the epistatic model are H172T, N127R and S154D. The epistatic model agrees with experimental data showing deleterious effects of these mutations, but the independent model suggests that the same mutations are neutral or even beneficial (Fig. 4a, Supplementary Table 8). Although these substitutions are frequently observed in other sequences in the protein family alignment, only the epistatic model captures that they are not acceptable in the background of the target sequence. These three residues form a network of spatially proximal residues that contact RNA (Fig. 4a), and are highly coupled when their total coupling is summarized across all amino acid combinations (Supplementary Table 9). The predictions that differ most between the epistatic and independent models (defined as >2σ of error change distribution) are in 8 positions that are within 6 Å of the bound RNA, which includes the positions identified above (Fig. 4b).

Figure 4. Improvements of the epistatic model for functional sites.

(a) Left: The RNA-binding residue H172 of PolyA-binding protein (PABP) is strongly coupled to other residues in the binding interface that are close in 3D. Right: The epistatic coupling leads to strong constraints on acceptable amino acids in position 172, as observed in an experimental mutation scan of PABP. Only the epistatic model correctly identifies these co-constraints, while a model without sequence context (independent model) suggests many more substitutions would be acceptable (range of experimental preferences scaled to range of predicted preferences based on full set of mutants for entire domain). (b) Positions in PABP for which prediction accuracy improves the most by considering epistasis (≥2σ difference in root mean squared prediction error, spheres) cluster around the RNA ligand (yellow sticks, PDB: 4f02). (c) For seven high-throughput datasets where the correlation ρ of the epistatic and independent models differs by more than 0.05, the epistatic model is more accurate overall (1st column), specifically for the effects of mutations of residues in interaction and ligand-binding sites (2nd column), where the residue mutation is rare versus frequent in the evolutionary sequence alignment (3rd and 4th columns), and where the residue change is damaging versus neutral in the experiment (5th and 6th columns) (Methods and Supplementary Table 8).

We then explored whether these observations generalize to other proteins by applying the above analysis for all proteins with single substitution high-throughput scans in which the epistatic and independent models differed (ρ(ΔE(epistatic)) − ρ(ΔE(independent)) > 0.05) but still showed reasonable correlation to the experimental data (ρ>0.50; Fig. 4c, Supplementary Fig. 7). Comparison of predictions made by the two models suggests that the advantages of the epistatic model stem at least in part from more accurate modeling of positions that facilitate interactions with ligands or other proteins, and show strong deleterious effects in experiments (Fig. 4c; Supplementary Fig. 7).

Discussion

We report here that a prediction method built on natural sequence variation can partially capture the experimental effects of mutations in a variety of biological contexts. This can be applied to predict or interpret the effects of genetic variation in any species of interest, and for higher-order mutations that change multiple positions at the same time.

Limitations of the model include biases that arise from evolutionarily-younger families of limited diversity, non-uniform selective constraints across a sequence family and higher order epistasis. Although incorporating epistatic terms into the model results in a practical improvement over other methods, there are remaining challenges regarding the interpretation and inference of the model parameters. For interpretation, it is difficult to distinguish those couplings in the model that may be due to subfamily-specific selection from those that may be due to more universal epistatic constraints across the whole family66 and future methods may begin to address these issues by consideration of phylogeny. For inference of the parameters, the statistical challenge remains that the typical number of free parameters (~10^6–10^8) vastly exceeds the number of available sequences (~10^3–10^5) and since evolutionary sampling is highly correlated and consequently highly redundant, this may lead to significant challenges in extrapolating a reasonable space of possible sequences.
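The scale of the parameter count follows from simple arithmetic on the model form: one field per position and symbol, plus one coupling per pair of positions and pair of symbols. The sketch below is a raw count that ignores any redundancy between parameters and assumes a 20-letter alphabet; the two sequence lengths are arbitrary examples.

```python
def potts_parameter_count(length, alphabet_size=20):
    """Raw number of fields and couplings in a pairwise model.

    fields:    length * q
    couplings: (length choose 2) * q^2
    """
    fields = length * alphabet_size
    couplings = length * (length - 1) // 2 * alphabet_size ** 2
    return fields + couplings

small_domain = potts_parameter_count(100)   # a 100-residue domain
large_protein = potts_parameter_count(700)  # a 700-residue protein
```

For these two lengths the counts are roughly 2 × 10^6 and 1 × 10^8, which brackets the range quoted above and shows that the coupling terms dominate the total.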

The success of this and other models based on sequence variation at recapitulating high-throughput mutation experiments depends in part on the extent to which experimental assays capture phenotypes that are under direct, long-term selection (Fig. 5). For example, thermostability, activity, or binding energetics of a protein will generally not all contribute to fitness in the same way, so even a perfect model and perfect measurements might have imperfect correlations. The excellent correlation between model and experimental data for β-lactamase plausibly reflects how survival ‘in the wild’ in the presence of β-lactam antibiotics depends directly on that specific protein. For other assays, such as nonessential peripheral enzymes or signaling proteins, the property being tested in the lab may have only indirect, context-dependent impact on the organism. So the evaluation of agreement between model and experiment is two-way, reliant on whether the model is effective as a method and, when multiple kinds of measurements are available for the same protein, to what extent the lab assay captures evolutionarily conserved properties.

Figure 5. Computational predictions complement experimental measurements.

Various molecular phenotypes (center) such as structure, thermostability, activity, and ligand-binding affinity are determined by genotype and contribute to fitness in a complicated manner that is not known a priori. However, the distribution of contemporary genotypes (left) provides a record of historical fitness values (right) which can roughly be inferred by computational methods. Identifying those phenotypes that connect to inferred fitness may shed light on which molecular phenotypes have historically been the most relevant to the organism.

Nevertheless, our probabilistic approach is readily applicable for analysis of specific variants and combinations thereof for numerous families of proteins and RNAs and the interactions between them. We anticipate that analyses of genetic variation and mechanisms of evolution will benefit from global probability models of sequence families that explicitly incorporate interactions between positions, such as the model presented here. The consistency of our predictions with prior biological knowledge on functional sites highlights how inclusion of epistatic interactions will facilitate more accurate assessment of mutations and aid the design of libraries of protein sequences. To enable the community to analyze ΔE for their proteins of interest, we make software (Supplementary Code) and pre-computed mutation effects for ~7000 human proteins available at evmutation.org.

ONLINE METHODS

Generation of multiple sequence alignments

For each analyzed protein (target sequence), multiple sequence alignments of the corresponding protein family were obtained by the default five search iterations of the profile HMM homology search tool jackhmmer67 against the UniRef100 database of non-redundant protein sequences68 (release 11/2015). To control for comparable evolutionary depth across different families, we used length-normalized bit scores to threshold sequence similarity rather than E-values40. A default bit score of 0.5 bits/residue was used as a threshold for inclusion unless the alignment yielded < 80% coverage of the length of the target domain or if there were not enough sequences (redundancy-reduced number of sequences ≥10L); in the first case, the threshold was increased in steps of 0.05 bits/residue until sufficient coverage was obtained; in the second case, the threshold was decreased until there were sufficient sequences (≥10L). If these two objectives were conflicting, precedence was given to maintaining more than 10L sequences. Since the sequence diversity of viral protein families is typically much lower than that of bacterial and eukaryotic families, the alignment depth for viral proteins was chosen as a default of 0.5 bits/residue even if the redundancy-reduced number of sequences was lower than 10L. The alignments were post-processed to exclude positions with more than 30% gaps and to exclude sequence fragments that align to less than 50% of the length of the target sequence. For the ParE-ParD toxin-antitoxin interaction, a joint sequence alignment with matched homologs of both interaction partners was generated using our previously described approach EVcomplex40. Alignments for RNA sequence families were obtained from the Rfam database69 and redundancy-reduced at the same 80% identity cutoff as proteins. The tRNA alignment was filtered to contain only sequences with a CCU anticodon.
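The two post-processing filters described above can be sketched as follows. The sketch assumes a target-anchored alignment (one column per target position), a `-` gap character, and that fragments are removed before column gap fractions are computed; the order of the two steps is not stated in the text, so that ordering, the function name and the toy sequences are all assumptions.

```python
def filter_alignment(alignment, max_column_gaps=0.3, min_coverage=0.5, gap="-"):
    """Drop fragmentary sequences, then drop gap-rich columns.

    Returns the filtered sequences and the indices of the retained columns.
    """
    length = len(alignment[0])
    # Keep sequences that align to at least min_coverage of the target length.
    kept = [
        seq for seq in alignment
        if sum(c != gap for c in seq) / length >= min_coverage
    ]
    if not kept:
        return [], []
    # Keep columns whose gap fraction does not exceed max_column_gaps.
    columns = [
        i for i in range(length)
        if sum(seq[i] == gap for seq in kept) / len(kept) <= max_column_gaps
    ]
    filtered = ["".join(seq[i] for i in columns) for seq in kept]
    return filtered, columns

# Toy alignment: "M---" covers only 25% of the target and is removed;
# the last column is then 50% gaps and is removed as well.
alignment = ["MKVL", "MKV-", "M---", "MRV-", "MKIL"]
filtered, columns = filter_alignment(alignment)
```

Returning the retained column indices makes it possible to map model positions back to target-sequence numbering after filtering.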

Inference of epistatic models of biological sequences

Model

Each family is modeled as a distribution over the space of all possible sequences that is parameterized by two types of constraints: site-specific constraints on the biases for specific amino acids or nucleotides at each position, and pairwise constraints for combinations of amino acids or nucleotides for each pair of sites. The use of such a global probabilistic approach is motivated by the idea that the induced correlations observed in a multiple sequence alignment can be explained by a simpler set of underlying couplings between positions. Local measures of coupling, such as the log ratios that define Mutual Information, cannot deconvolve these transitive correlations between positions50.

The form of the chosen distribution can be thought of as the least-structured (i.e. maximum entropy) distribution over sequence space that is consistent with the single-site and pairwise marginal distributions of amino acids or nucleotides observed in the alignment43, 50, 55, 70. Under the model, the probability of a sequence σ of length N is defined as

$$P(\sigma) = \frac{1}{Z} \exp\{E(\sigma)\}$$

The partition function Z normalizes the distribution by summing over the relative weights (Boltzmann factors) of all possible sequences σ′. The strength of the Boltzmann factors for each sequence σ is defined by the evolutionary statistical energy E(σ) as the sum of all its pairwise coupling constraints Jij and single-site constraints hi (fields), where i and j are positions along the sequence:

$$E(\sigma) = \sum_{i} h_i(\sigma_i) + \sum_{i<j} J_{ij}(\sigma_i, \sigma_j)$$

In a maximum likelihood fit without regularization, the site-specific parameters $h_i(\sigma_i)$ and pair-specific parameters $J_{ij}(\sigma_i, \sigma_j)$ implement the constraints that the marginal probability distributions of the model agree with the empirical marginal frequencies $f_i(\sigma_i)$ and $f_{ij}(\sigma_i, \sigma_j)$ in the sequence data. This model is also known as a Markov random field, or as a Potts model in statistical physics.

Insertions and deletions

While the model readily describes fixed-length sequences such as strings of 20 amino acids or 4 nucleotides, it is unclear how to meaningfully represent insertions and deletions in a way that reasonably reflects the generative process occurring in evolution. Traditionally, statistical approaches have tended to model indels either as (i) an extra character or (ii) missing data. Both of these approaches are conceptually problematic, since the former partitions a single deletion event into many separate ‘gap characters’ while the latter forces the model to impute a hidden variable in gapped positions that can affect other variables in the system despite knowledge that no such coding variable exists. We used an alternative approach for modeling indels that does not invoke a gap character and does not give missing data the ability to influence the system. We regard the indel process as a separate, observed process and instead model the conditional distribution of the amino acids given an indel pattern. If $z_i$ are Bernoulli indicators that are 1 when a site is coding and 0 when gapped, the conditional energy function for coding regions is:

$$E(\sigma; z) = \sum_i z_i h_i(\sigma_i) + \sum_{i<j} z_i z_j J_{ij}(\sigma_i, \sigma_j)$$

Given an observed set of gaps, the marginal distribution of the amino acids at the ungapped positions defines a conditional model for a sequence. In principle, this approach could be made fully generative by combining it with a model for gaps, but for the purposes of this study we only fit the conditional distribution.

Sample reweighting

Natural sequences share common ancestry and consequently may be highly correlated. Moreover, uneven sampling of the family by both sequencing projects and natural processes may result in an over-abundance of sequences from some parts of the phylogeny (e.g. model organisms) and an under-abundance of others. To partially account for this redundancy, we use the established approach of reweighting sequences by a measure of their uniqueness43, 44. Briefly, we define the weight $\pi_s$ of a given sequence s as $\pi_s = \left(\sum_t I[D_H(\sigma^s, \sigma^t) < \theta]\right)^{-1}$, where $D_H(\sigma^s, \sigma^t)$ is the normalized Hamming distance (i.e. 1.0 − fractional sequence identity) between sequences s and t, and θ is the divergence threshold below which two sequences are classified as similar. We used an 80% identity cutoff (θ = 0.2) for all proteins except viral proteins, for which we used a 99% identity cutoff (θ = 0.01).
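
The reweighting scheme can be sketched as follows. The example assumes the alignment has been encoded as an integer matrix (sequences × positions) and that gaps are treated like any other symbol when computing distances; both are simplifying assumptions, and the all-against-all comparison is only practical for small alignments.

```python
import numpy as np


def sequence_weights(aln_idx, theta=0.2):
    """Weight of each sequence = 1 / (number of sequences within distance theta).

    aln_idx: integer array of shape (M, N), one encoded sequence per row.
    """
    aln_idx = np.asarray(aln_idx)
    # Normalized Hamming distance between all pairs of sequences, shape (M, M).
    dist = (aln_idx[:, None, :] != aln_idx[None, :, :]).mean(axis=2)
    # Each sequence counts itself as a neighbor (distance 0 < theta).
    n_neighbors = (dist < theta).sum(axis=1)
    return 1.0 / n_neighbors


def effective_number(weights):
    """Redundancy-reduced number of sequences (sum of the weights)."""
    return float(np.sum(weights))
```

The sum of the weights is the redundancy-reduced number of sequences used elsewhere in the Methods as a measure of alignment depth.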

Inference

Given observed sequence data (with potential reweighting for redundancy), the model parameters $h_i$ and $J_{ij}$ could in principle be estimated by maximum likelihood, i.e. by finding the parameters that maximize the probability of observing the sequence data. This approach, however, is computationally intractable because computing the partition function Z requires enumerating all $4^N$ (RNA) or $20^N$ (protein) sequences in sequence space. As a replacement for the full likelihood, we used a site-factored pseudolikelihood approximation53.

Regularization

The number of parameters for the pairwise model, which for typical protein families of 50–500 amino acids ranges from $10^5$ to $10^7$, outnumbers the typical number of sequences available ($10^2$ to $10^5$), even for the largest families, by several orders of magnitude. In this under-sampled regime, standard maximum likelihood estimation is highly prone to overfitting the sample data. We penalized model complexity with ℓ2-regularization55, 56, which may also be interpreted as maximum a posteriori (MAP) inference under zero-mean Gaussian priors on the parameters. The strength of ℓ2-regularization was set to $\lambda_h = 0.01$ for the single-site constraints $h_i$ and $\lambda_J = 0.01 \cdot q(N-1)$ for the pairwise coupling constraints $J_{ij}$, where N is the number of sites in the model and q corresponds to the number of possible states (q=20 for proteins, q=4 for RNA). Combining the pseudolikelihood approximation, sequence reweighting, treatment of indels, and regularization, we estimate the parameters by maximizing the objective

$$\{h, J\} = \arg\max_{h, J} \sum_{s,i} \pi_s \log P_i^s(\sigma) - \lambda_h \sum_{i,a} h_i(a)^2 - \lambda_J \sum_{i,j,a,b} J_{ij}(a,b)^2$$

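where the log conditional likelihood $\log P_i^s(\sigma)$ is defined, whenever position i in sequence s is ungapped, through

$$P_i^s(\sigma) = \frac{\exp\left(h_i(\sigma_i^s) + \sum_{j \neq i} z_j^s J_{ij}(\sigma_i^s, \sigma_j^s)\right)}{\sum_a \exp\left(h_i(a) + \sum_{j \neq i} z_j^s J_{ij}(a, \sigma_j^s)\right)}$$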

and is 0 otherwise. We solve this optimization problem with a quasi-Newton method (L-BFGS), and make our C software implementing this available at github.com/debbiemarkslab/plmc.
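
To make the structure of the objective concrete, the sketch below evaluates its negative for given parameters using plain Python loops. It is a slow reference illustration and not the authors' C implementation; the array layout, the function name, and the choice to count each pair i < j once in the coupling penalty are assumptions of this example.

```python
import numpy as np


def neg_objective(h, J, aln_idx, z, weights, lambda_h, lambda_J):
    """Negative of the regularized, reweighted log-pseudolikelihood.

    h: (N, q); J: (N, N, q, q) with J[i, j, a, b] == J[j, i, b, a]
    aln_idx: (M, N) integer states; z: (M, N) booleans, True where ungapped
    weights: (M,) sequence weights pi_s
    """
    M, N = aln_idx.shape
    total = 0.0
    for s in range(M):
        seq, mask = aln_idx[s], z[s]
        for i in range(N):
            if not mask[i]:
                continue  # gapped positions contribute 0
            # logits[a] = h_i(a) + sum_{j != i} z_j * J_ij(a, sigma_j)
            logits = h[i].copy()
            for j in range(N):
                if j != i and mask[j]:
                    logits += J[i, j, :, seq[j]]
            log_p = logits[seq[i]] - np.logaddexp.reduce(logits)
            total += weights[s] * log_p

    # l2 penalties; couplings counted once per pair i < j (assumption).
    iu = np.triu_indices(N, k=1)
    penalty = lambda_h * np.sum(h ** 2) + lambda_J * np.sum(J[iu] ** 2)
    return -(total - penalty)
```

With all parameters at zero, every ungapped position contributes log(1/q), which gives a simple sanity check before handing such a function to a quasi-Newton optimizer.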

Calculation of context-dependent mutation effects (evolutionary statistical energy)

Using the inferred probability models, one can then quantify the effect of single or higher-order substitutions on a particular sequence background by computing the log-odds ratio of probabilities between the mutant sequence $\sigma^{\mathrm{mut}}$ and the wild-type sequence $\sigma^{\mathrm{wt}}$. This log-odds ratio is simply the difference of the evolutionary statistical energies of the two sequences, which is

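$$\Delta E(\sigma^{\mathrm{mut}}, \sigma^{\mathrm{wt}}) = \sum_i \left(h_i(\sigma_i^{\mathrm{mut}}) - h_i(\sigma_i^{\mathrm{wt}})\right) + \sum_{i<j} \left(J_{ij}(\sigma_i^{\mathrm{mut}}, \sigma_j^{\mathrm{mut}}) - J_{ij}(\sigma_i^{\mathrm{wt}}, \sigma_j^{\mathrm{wt}})\right).$$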

For a given mutant, the evolutionary statistical energy difference ΔE will be the sum of differences of the single-site constraints hi for all substituted sites, plus the sum of differences of the coupling parameters for all pairs of positions involving at least one mutated site. By evaluating the change of couplings to other sites, the sequence context and therefore epistatic effects are explicitly incorporated into the computed mutation effects. Values of ΔE above 0 correspond to more probable mutant sequences (putatively beneficial), values below 0 to less probable mutant sequences (putatively deleterious) and values equal to 0 to equally probable sequences (putatively neutral).
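
The observation that only substituted sites and the pairs involving them contribute can be sketched as below. The example assumes `J` is stored for both index orders with `J[i, j, a, b] == J[j, i, b, a]`; the helper `energy` is included only as a cross-check against the full definition. Function names and data layout are illustrative assumptions.

```python
import numpy as np


def energy(seq, h, J):
    """Full statistical energy E(sigma), used here only as a cross-check."""
    seq = np.asarray(seq)
    N = len(seq)
    e = h[np.arange(N), seq].sum()
    for i in range(N):
        for j in range(i + 1, N):
            e += J[i, j, seq[i], seq[j]]
    return e


def delta_E(wt, substitutions, h, J):
    """Energy difference of a mutant relative to wild type.

    substitutions: dict {position: new state index}
    Only mutated sites and pairs involving at least one of them are visited.
    """
    wt = np.asarray(wt)
    mut = wt.copy()
    for pos, state in substitutions.items():
        mut[pos] = state

    dE = 0.0
    for i in substitutions:
        dE += h[i, mut[i]] - h[i, wt[i]]
        for j in range(len(wt)):
            if j == i:
                continue
            if j in substitutions and j < i:
                continue  # pair of two mutated sites: count it only once
            dE += J[i, j, mut[i], mut[j]] - J[i, j, wt[i], wt[j]]
    return dE
```

For a single substitution this touches N − 1 couplings instead of all N(N − 1)/2, which matters when scoring every possible substitution in a protein.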

Throughout the manuscript, mutation effects computed using this model are referred to as statistical energy differences from the epistatic model.

Calculation of evolutionary couplings

Between any two positions i and j in the protein family, the coupling matrix $J_{ij}$ describes the co-constraint on all possible $20^2$ amino acid or $4^2$ nucleotide combinations. To quantify the total epistatic constraint between pairs of sites across all sequences in the alignment, the Frobenius norm was used to summarize each matrix $J_{ij}$ into a single number that is proportional to the standard deviation of the (zero-mean) matrix56. After computing the norm scores for every pair of sites, we remove background coupling that is presumed to be caused by limited sampling and phylogenetic relationships between sequences by applying the average product correction (APC)71. This is equivalent to setting the dominant singular value or eigenvalue of the coupling matrix to zero. Significantly constrained pairs (evolutionary couplings, ECs) were then selected by a mixture model-based strategy quantifying the probability of any pair to belong to the high-scoring tail of the score distribution rather than the background noise distribution40, 72.
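
A minimal sketch of the Frobenius norm summary followed by the average product correction is given below. It assumes that the coupling matrices have already been shifted to zero mean and that means are taken over off-diagonal pairs only; the function name and array layout are illustrative.

```python
import numpy as np


def coupling_scores(J):
    """Frobenius norm per pair of sites, then average product correction.

    J: array of shape (N, N, q, q), assumed symmetric in (i, j) and zero-mean.
    Returns (fn, cn): raw norm scores and APC-corrected scores, both (N, N).
    """
    N = J.shape[0]
    fn = np.sqrt((J ** 2).sum(axis=(2, 3)))
    np.fill_diagonal(fn, 0.0)

    # APC: subtract (mean of row i) * (mean of column j) / (overall mean).
    site_mean = fn.sum(axis=1) / (N - 1)
    overall_mean = fn.sum() / (N * (N - 1))
    cn = fn - np.outer(site_mean, site_mean) / overall_mean
    np.fill_diagonal(cn, 0.0)
    return fn, cn
```

If every pair has the same raw score, the correction removes it entirely, which illustrates that only coupling above the per-site background survives.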

Inference of independent statistical models

To assess the contribution of epistatic interactions to ΔE, additional maximum entropy models were inferred that describe protein sequences using only site-specific amino acid constraints hi, without considering explicit inter-dependencies between sites (i.e., predict mutation effects independently of the sequence context). Since the parameters of each independent model were inferred from the same sequence alignment as the epistatic model, it serves as a direct control for the contribution of epistatic interactions. The probability of any amino acid sequence σ under this “independent” model is given by

$$P(\sigma) = \frac{1}{Z} \exp\left\{\sum_i h_i(\sigma_i)\right\}$$

Consistent with the regularization applied to the epistatic model, the strength of the ℓ2 penalty was set to $\lambda_h = 0.01$ when estimating the model parameters $h_i$. The statistical energy difference between the mutant and the wild-type sequences is computed analogously and again corresponds to the log-odds ratio of their probabilities. This formalism is closely related to the log-odds conservation scores used in many methods to predict mutation effects from sequence36,73.
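
Written out under the independent model, the partition function cancels in the ratio and only the substituted sites contribute, since all other terms are identical in mutant and wild type. The subscript "ind" is a label introduced here for illustration.

```latex
\Delta E_{\mathrm{ind}}(\sigma^{\mathrm{mut}}, \sigma^{\mathrm{wt}})
  = \log \frac{P(\sigma^{\mathrm{mut}})}{P(\sigma^{\mathrm{wt}})}
  = \sum_i \left( h_i(\sigma_i^{\mathrm{mut}}) - h_i(\sigma_i^{\mathrm{wt}}) \right)
  = \sum_{i \,:\, \sigma_i^{\mathrm{mut}} \neq \sigma_i^{\mathrm{wt}}}
      \left( h_i(\sigma_i^{\mathrm{mut}}) - h_i(\sigma_i^{\mathrm{wt}}) \right)
```

The effect of a multiple mutant is therefore the sum of its single-mutant effects under this model, which is what makes it a control for the epistatic terms.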

Mutational landscape datasets

Mutation effect datasets were identified by a comprehensive literature search for quantitative high-throughput mutagenesis experiments of entire proteins, protein domains or RNA molecules. All experiments that targeted proteins or RNAs with insufficient sequence diversity (redundancy-reduced number of sequences <10L, where L=length of protein or domain), covered only small subregions, or tested synthetic wild-type proteins were excluded from the final compilation of datasets (Supplementary Table 1). For further comparisons, the dataset was extended with low-throughput measurements of molecular phenotypes (stability, catalytic activity, binding), with a focus on sequence co-evolution studies. If a dataset reported more than one measurement for the wild-type sequence (as was the case for 11 of the high-throughput scans with one wild-type value per position in the sequence), we averaged these redundant experimental measurements into a single value (Supplementary Table 2).

Classification of experimental mutation effects

Eighteen of the experimental mutation scans had visible bimodality in their effect distribution. Since this can lead to biased Pearson and Spearman correlations, we tested these with a binary classification measure, the Matthews correlation coefficient, in addition to the other analyses. We classified mutations as damaging or neutral by (i) fitting a two-component Gaussian mixture model to each data set, and (ii) assigning individual mutations to the mixture model component returning the higher posterior probability. Mutation effects (enrichment ratios of sequencing reads before and after functional selection) were transformed into log-space where the experiment was reported in linear space.
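
The two-step classification can be sketched with a small expectation-maximization loop for a one-dimensional, two-component Gaussian mixture. This hand-written EM stands in for whichever mixture-model implementation was actually used; the initialization, the iteration count, and the assumption that the component with the lower mean is the damaging one are choices made for this example.

```python
import numpy as np


def _responsibilities(x, pi, mu, var):
    """Posterior probability of each component for each data point."""
    dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return dens / dens.sum(axis=1, keepdims=True)


def fit_gmm_1d(x, n_iter=200):
    """Fit a two-component 1D Gaussian mixture with plain EM."""
    x = np.asarray(x, dtype=float)
    mu = np.percentile(x, [25, 75])       # start at lower/upper quartile
    var = np.array([x.var(), x.var()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        resp = _responsibilities(x, pi, mu, var)                 # E-step
        nk = resp.sum(axis=0)                                    # M-step
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-9
    return pi, mu, var


def classify_damaging(x):
    """True where the lower-mean component has the higher posterior."""
    x = np.asarray(x, dtype=float)
    pi, mu, var = fit_gmm_1d(x)
    resp = _responsibilities(x, pi, mu, var)
    return resp.argmax(axis=1) == np.argmin(mu)
```

For data reported as linear enrichment ratios, the values would be log-transformed before being passed to `classify_damaging`.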

Correspondence between computed and experimental mutational landscapes

The agreement between computational and experimental mutation effects was evaluated using standard metrics of bivariate dependence. Due to strongly skewed experimental effect distributions and the expected non-linear relationships between protein function and organism fitness, we focused on Spearman’s rank correlation coefficient with dense ranking as our main evaluation metric. To test the robustness of our results, we additionally evaluated all relationships using distance correlations74 and, where applicable, Pearson correlation coefficients and the corresponding linear regression $R^2$ values. For datasets showing bimodal effect distributions, we also tested prediction quality in a binary setting using the Matthews correlation coefficient75. Quantitative ΔE values from the epistatic and independent models were classified as damaging if they were below the median of the respective effect distribution and as neutral if they were above it.
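
The main metric and the binary evaluation can be sketched as follows. The example interprets "Spearman with dense ranking" as the Pearson correlation of dense ranks, which is an assumption about the exact procedure; function names are illustrative.

```python
import numpy as np
from scipy.stats import rankdata


def spearman_dense(pred, exp):
    """Pearson correlation of dense ranks (ties share a rank, no gaps)."""
    rp = rankdata(pred, method="dense")
    re = rankdata(exp, method="dense")
    return np.corrcoef(rp, re)[0, 1]


def classify_by_median(delta_e):
    """Damaging (True) if the predicted effect is below the median."""
    delta_e = np.asarray(delta_e, dtype=float)
    return delta_e < np.median(delta_e)


def matthews_cc(pred_damaging, true_damaging):
    """Matthews correlation coefficient from two boolean label vectors."""
    p = np.asarray(pred_damaging, dtype=bool)
    t = np.asarray(true_damaging, dtype=bool)
    tp, tn = np.sum(p & t), np.sum(~p & ~t)
    fp, fn = np.sum(p & ~t), np.sum(~p & t)
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return float(tp * tn - fp * fn) / denom if denom > 0 else 0.0
```

Because only ranks enter the correlation, any monotonic but non-linear relationship between prediction and experiment still scores 1.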

Analysis of properties of distributions

Systematic biases in experimental or computed effect distributions can influence which correlation measures are applicable, and what bivariate relationship between the two can be expected. To describe the overall shape of distributions, in particular deviations from normality and strong biases towards damaging or neutral effects, for each distribution we calculated its skewness (scipy.stats.skew) and the $R^2$ of the sample data against a normal distribution in a probability plot (scipy.stats.probplot). The latter quantity corresponds to the test statistic of the Shapiro-Francia test for normality.
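
Both shape descriptors can be obtained from the two SciPy functions named above, as in this short sketch; the wrapper function and its name are illustrative.

```python
import numpy as np
from scipy import stats


def distribution_shape(values):
    """Return (skewness, R^2 of the normal probability plot)."""
    values = np.asarray(values, dtype=float)
    skewness = stats.skew(values)
    # probplot returns ((theoretical quantiles, ordered values),
    #                   (slope, intercept, r)) for the least-squares fit.
    _, (slope, intercept, r) = stats.probplot(values, dist="norm")
    return skewness, r ** 2
```

A value of $R^2$ close to 1 indicates that the ordered data fall nearly on a straight line against normal quantiles; strongly bimodal or skewed effect distributions give lower values.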

Comparison to existing mutation effect prediction methods

For the comparison of ΔE to existing approaches for the prediction of mutation effects, we computed mutational landscapes using local installations of SIFT and PolyPhen-2 applied to the same input sequences as for our sequence alignments. Quantitative effect scores and binary classifications (neutral/damaging) were obtained from the respective columns in the output files. Mutation effects computed using the BLOSUM62 matrix correspond to the respective matrix entry of wild-type and substituted residue. Correlations were calculated on the joint set of variants that could be predicted by all methods.

Human gene and disease variant analysis

Besides the evaluation on quantitative mutation effect data, we assessed the ability of ΔE to discriminate between known disease mutations and neutral amino acid variants in a binary classification setting. For this evaluation, we computed quantitative effect scores and tested how well both types of variants are separated by ΔE. Unlike the existing method PolyPhen-2 which trains on known disease variants, our mutation effects are purely based on sequences without learning on the outcome variable.

Alignments for protein sequences with disease and/or neutral variants were generated by identifying Pfam domains in the respective sequence using hmmscan from HMMER and running our own alignment protocol for each domain region. If possible, regions were extended by 10 residues on either side to correct for the lack of N- and C-terminal coverage often observed in Pfam. Following the same protocol as for mutational scans, alignments were inferred at a bitscore threshold of 0.5 bits/residue. Families where the number of sequences in the alignment exceeded our computational resources were run at a bitscore of 0.8 bits/residue. Alignments were filtered at a 95% sequence identity cutoff to reduce computation time. Epistatic and independent models were then inferred for all alignments and used to compute the effect of amino acid substitutions. In the evaluation, we used all variants in domains that had alignments with Meff/L ≥ 1 and could be unambiguously mapped to UniProt sequences.

To assess if ΔE separates disease and neutral variants, we derived a dataset based on high-confidence disease and neutral variants. We obtained 15405 amino acid variants in 2387 proteins from ClinVar61 that were unambiguously annotated with clinical significance “pathogenic”. Similarly, we derived sets of amino acid variants assumed to be neutral because of their high allele frequency in the ExAC exome sequence dataset6, at increasing levels of stringency (allele frequency (AF)≥0.1: 13643 variants/6993 proteins; AF≥0.25: 8595 variants/5193 proteins; AF≥0.5: 4700 variants/3282 proteins). Of the pathogenic ClinVar variants, 10556 were covered by an alignment (in 1848 proteins), and 9008 (1553 proteins) had sufficient sequences (Meff/L ≥1) and were used for evaluation. Of the ExAC variants, 3514 variants (2190 proteins) remained at AF≥0.1, 2193 variants (1524 proteins) at AF≥0.25 and 1182 variants (937 proteins) at AF ≥0.5.

For comparison to the existing mutation effect classifiers SIFT and PolyPhen-2, we chose the HumVar dataset (as provided by Grimm et al.)62, 64 as this is commonly used as a benchmark. However, using this benchmark set in the comparison to our unsupervised method is conservative, since most current supervised methods, including PolyPhen-2, are trained on these known variants and may tend to overestimate their own accuracy64. Ideally, one would construct a truly unbiased test set by excluding these and related variants, but since this leads to a strong reduction in the number of variants that can be used for evaluation, we left the benchmark set as used by others64. We started from 40389 variants (9231 proteins) in the dataset, and after using the above domain identification and alignment protocol, arrived at a set of 21915 variants in 5067 proteins jointly predicted by all methods. Of these, 18001 variants in 3912 proteins had Meff/L≥1 and were used for evaluation. In addition to the full set of variants with sufficient alignment depth, we also evaluated performance on “difficult” examples where the predictions between SIFT and PolyPhen-2 disagreed (3126 variants in 1459 proteins).

The statistical energy distributions between neutral and disease variants were compared using two-sample Kolmogorov-Smirnov tests (two-sided; function scipy.stats.ks_2samp). To assess the discrimination between both types of variants, the area under the receiver operating characteristic curve (AUC) was calculated using scikit-learn (function sklearn.metrics.roc_auc_score) and compared between the different methods. We also evaluated the discovery of damaging variants with high specificity by computing the partial AUC up to a false positive rate of 20% using the pROC R package76.
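
A sketch of the two comparisons is given below. To keep the example dependent on SciPy only, the AUC is computed through the rank-sum (Mann-Whitney) identity rather than with sklearn.metrics.roc_auc_score; for a binary problem the two definitions agree. Treating disease variants as the positive class and −ΔE as the score is an assumption of this example.

```python
import numpy as np
from scipy.stats import ks_2samp, rankdata


def separation_stats(delta_e_disease, delta_e_neutral):
    """Return (KS statistic, KS p-value, AUC) for two sets of ΔE values."""
    d = np.asarray(delta_e_disease, dtype=float)
    n = np.asarray(delta_e_neutral, dtype=float)

    ks = ks_2samp(d, n)  # two-sided by default

    # Lower ΔE means more likely damaging, so use -ΔE as the score.
    scores = -np.concatenate([d, n])
    ranks = rankdata(scores)  # average ranks for ties
    n_pos, n_neg = len(d), len(n)
    auc = (ranks[:n_pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    return ks.statistic, ks.pvalue, auc
```

The AUC computed this way is the fraction of (disease, neutral) pairs in which the disease variant has the lower ΔE, with ties counted as one half.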

Pre-computed mutation landscapes for human proteins

In the course of this analysis, we calculated epistatic models for all human proteins that have variants annotated as pathogenic, ExAC variants with allele frequency ≥ 0.1, and others. This results in a resource of models for 6955 unique human proteins (9957 alignment regions, Meff/L≥1). For these proteins, we provide single substitution landscapes, sequence alignments, and evolutionary couplings (summarized epistatic constraint for pairs of positions) at evmutation.org. Mutational landscapes can be explored using interactive visualization. Together with the provided software (Supplementary Code), the downloadable files allow users to reproduce our analyses and compute higher-order mutation landscapes for any of these proteins.

Error analysis of evolutionary statistical energy predictions

Besides the overall correlation analysis between prediction and experiment, we wanted to understand how large the prediction error is for individual mutations, and if particular mutations were predicted with lower error by either the epistatic or the independent model. To calculate individual error terms, we need to compare the predicted value to the respective experimental value. In our case, however, this is complicated by the fact that ΔE and the experimental data are on different scales and the relationships between both are often non-linear, with no particular expectation on the shape of the relationship. This problem prevents the use of standard regression approaches to calculate error terms.

Instead, we chose to transform predicted values into the space of the experiment by a quantile mapping strategy (a perfect prediction would mean that the normalized ranks of each variant in the predicted and experimental distribution agree). For any substitution S, let ΔE(S) correspond to the p-quantile of the predicted distribution. Then we transform a prediction ΔE(S) into experimental space by mapping it onto the respective p-quantile of the experimental distribution. We denote this mapping by Q(ΔE(S)). For robustness against a small number of experimental outliers, Q assigns the respective 0.01/0.99 experimental quantile to any value below/above the predicted 0.01/0.99 quantile. By defining a quantile-quantile mapping for each of the models, we can now calculate an error term ε(ΔE(S)) for each variant S by comparing the mapped prediction Q(ΔE(S)) to the actual experimental value y(S) of the variant S: ε(ΔE(S)) = Q(ΔE(S)) − y(S). The error term ε(ΔE(S)) is normalized by the range of the full experimental distribution between the 0.01 and 0.99 quantiles to make it comparable between different experiments.
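
The quantile mapping and the normalized error term can be sketched as below. The example takes the p-quantile of a prediction to be its normalized rank among all predictions and breaks ties in sort order; both are assumptions, as are the function and argument names.

```python
import numpy as np


def quantile_map_errors(pred, exp, lo=0.01, hi=0.99):
    """Normalized error (Q(pred) - exp) per variant via quantile mapping.

    pred, exp: predicted and experimental effects for the same variants.
    """
    pred = np.asarray(pred, dtype=float)
    exp = np.asarray(exp, dtype=float)
    n = len(pred)

    # Normalized rank of each prediction within the predicted distribution.
    order = pred.argsort()
    p = np.empty(n)
    p[order] = np.arange(n) / (n - 1)

    # Values beyond the 0.01/0.99 quantiles map to those quantiles.
    p = np.clip(p, lo, hi)

    mapped = np.quantile(exp, p)  # Q(pred): same quantile in experiment
    scale = np.quantile(exp, hi) - np.quantile(exp, lo)
    return (mapped - exp) / scale
```

When predictions and experiment are related by any strictly increasing function, the errors vanish for all variants except those at the clipped extremes.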

To evaluate if the use of epistatic interactions improves or decreases the error of individual variant predictions (i.e., variants that are predicted with different log-odds scores ΔE(S) compared to the independent model rather than similar log-odds scores), we had to modify the above approach because differently predicted, but wrong variants might bias the quantile-quantile mappings in favor of either method. Instead, we defined quantile-quantile mappings Q′ not on the full distributions but only on those variants that are predicted similarly between both models (|ΔE(epistatic) − ΔE(independent)| ≤ 2, threshold chosen so that there are enough variants to define a smooth curve). Q′ maps any arbitrary score based on the nearest quantile that was used during calculation of the mapping. Comparing the absolute values of the respective error terms ε(ΔE(S)) = Q′(ΔE(S)) − y(S) between the epistatic and the independent model makes it possible to assess, for any substitution S, whether the use of epistatic interactions increases or decreases the error compared to the experiment.

The computation of error terms for individual variants makes it possible to assess if certain groups of variants are systematically predicted more or less accurately using epistatic interactions. For this analysis, we compared the root mean square errors (RMSE, summing the individual error terms ε(ΔE(S))) across all individual substitutions within a particular group between the epistatic and the independent model. Subgroups were defined as follows: (1) High/low experimental effect: Gaussian mixture model of effect distribution as described above; (2) Frequent/Rare substitution: frequency of substitution in respective column of alignment ≥ 0.01 or < 0.001; (3) Ligand binding/protein interaction: all substitutions to residues within 4Å minimum atom distance of ligand or interaction partner, but excluding evolutionarily conserved cofactors; (4) Buried/Exposed: Relative solvent accessibility calculated using DSSP < 0.1 or > 0.25; (5) Conserved/Variable: Column conservation of position in alignment > 0.4 or < 0.2; (6) Strongly/weakly coupled: At least 4/no coupled pairs in list of N top-ranking long-range (|i − j|>5) evolutionary couplings.

Analysis of structural features

Evolutionary couplings calculated from multiple sequence alignments were compared to experimental protein 3D structures from the PDB77 to assess if the identified epistatic constraints correspond to structural contacts. Structures and mappings to the target sequence were obtained using jackhmmer-based searches against the PDB (one search iteration), and residue pair distances were calculated for up to 10 of the most significant hits with a normalized bit score of at least 0.5 bits/residue to the target sequence. Two residues were considered to be in contact if any of their atoms are closer than 5 Å in any of the identified structures; a distance threshold of 4 Å was applied to interactions between amino acid residues and ligands.
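
The contact criterion amounts to a minimum over all atom pairs and over all structures, as in this sketch. It assumes atom coordinates (in Å) have already been extracted for each residue; parsing PDB files and mapping residues to the target sequence are outside its scope, and the function names are illustrative.

```python
import numpy as np


def min_atom_distance(atoms_a, atoms_b):
    """Smallest distance between any atom of residue A and any atom of B.

    atoms_a, atoms_b: arrays of shape (n_atoms, 3) with coordinates in Å.
    """
    a = np.asarray(atoms_a, dtype=float)
    b = np.asarray(atoms_b, dtype=float)
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return dist.min()


def in_contact(structures, threshold=5.0):
    """True if the pair is closer than the threshold in any structure.

    structures: iterable of (atoms_a, atoms_b) pairs, one per structure.
    """
    return any(min_atom_distance(a, b) < threshold for a, b in structures)
```

For residue-ligand pairs, the same function would be called with `threshold=4.0`.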

Data analysis and method availability

All data analysis was conducted using Jupyter notebooks78 and the scientific Python stack79, 80. Supplemental Web Data, human protein predictions and code (Supplementary Code) to calculate statistical models and mutation effects from sequence alignments are available at evmutation.org.

Editor’s summary

The global effects of epistasis on protein and RNA function are revealed by an unsupervised model of amino acid co-conservation in evolutionary sequence variation.

Acknowledgments

The authors would like to thank A. Lapedes, B. Rost and members of the Marks lab for scientific discussion, and J. Reeb for help with existing mutation prediction software. C.S. was funded by NIGMS (R01GM106303). D.S.M. and T.A.H. were funded by NIGMS (R01GM106303) and the Raymond and Beverley Sackler Foundation. J.B.I. was funded by an NSF Graduate Research Fellowship.

Footnotes

Competing Financial Interests Statement

The authors declare no competing financial interests.

References

  • 1.Miersch S, Sidhu SS. Intracellular targeting with engineered proteins. F1000Res. 2016;5 doi: 10.12688/f1000research.8915.1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Boeke JD, et al. GENOME ENGINEERING. The Genome Project-Write. Science. 2016;353:126–127. doi: 10.1126/science.aaf6850. [DOI] [PubMed] [Google Scholar]
  • 3.Ostrov N, et al. Design, synthesis, and testing toward a 57-codon genome. Science. 2016;353:819–822. doi: 10.1126/science.aaf3639. [DOI] [PubMed] [Google Scholar]
  • 4.Romero PA, Tran TM, Abate AR. Dissecting enzyme function with microfluidic-based deep mutational scanning. Proceedings of the National Academy of Sciences of the United States of America. 2015;112:7159–7164. doi: 10.1073/pnas.1422285112. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Currin A, Swainston N, Day PJ, Kell DB. Synthetic biology for the directed evolution of protein biocatalysts: navigating sequence space intelligently. Chem Soc Rev. 2015;44:1172–1239. doi: 10.1039/c4cs00351a. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Lek M, et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature. 2016;536:285–291. doi: 10.1038/nature19057. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Roscoe BP, Bolon DN. Systematic exploration of ubiquitin sequence, E1 activation efficiency, and experimental fitness in yeast. Journal of molecular biology. 2014;426:2854–2870. doi: 10.1016/j.jmb.2014.05.019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Roscoe BP, Thayer KM, Zeldovich KB, Fushman D, Bolon DN. Analyses of the effects of all ubiquitin point mutants on yeast growth rate. Journal of molecular biology. 2013;425:1363–1377. doi: 10.1016/j.jmb.2013.01.032. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Melamed D, Young DL, Gamble CE, Miller CR, Fields S. Deep mutational scanning of an RRM domain of the Saccharomyces cerevisiae poly(A)-binding protein. Rna. 2013;19:1537–1551. doi: 10.1261/rna.040709.113. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Stiffler MA, Hekstra DR, Ranganathan R. Evolvability as a Function of Purifying Selection in TEM-1 beta-Lactamase. Cell. 2015;160:882–892. doi: 10.1016/j.cell.2015.01.035. [DOI] [PubMed] [Google Scholar]
  • 11.McLaughlin RN, Jr, Poelwijk FJ, Raman A, Gosal WS, Ranganathan R. The spatial architecture of protein function and adaptation. Nature. 2012;491:138–142. doi: 10.1038/nature11500. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Kitzman JO, Starita LM, Lo RS, Fields S, Shendure J. Massively parallel single-amino-acid mutagenesis. Nature methods. 2015;12:203–206. doi: 10.1038/nmeth.3223. 204 p following 206. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Melnikov A, Rogov P, Wang L, Gnirke A, Mikkelsen TS. Comprehensive mutational scanning of a kinase in vivo reveals substrate-dependent fitness landscapes. Nucleic acids research. 2014;42:e112. doi: 10.1093/nar/gku511. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Araya CL, et al. A fundamental protein property, thermodynamic stability, revealed solely from large-scale measurements of protein function. Proceedings of the National Academy of Sciences of the United States of America. 2012;109:16858–16863. doi: 10.1073/pnas.1209751109. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Firnberg E, Labonte JW, Gray JJ, Ostermeier M. A comprehensive, high-resolution map of a gene’s fitness landscape. Mol Biol Evol. 2014;31:1581–1592. doi: 10.1093/molbev/msu081. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Starita LM, et al. Massively Parallel Functional Analysis of BRCA1 RING Domain Variants. Genetics. 2015 doi: 10.1534/genetics.115.175802. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Rockah-Shmuel L, Toth-Petroczy A, Tawfik DS. Systematic Mapping of Protein Mutational Space by Prolonged Drift Reveals the Deleterious Effects of Seemingly Neutral Mutations. PLoS Comput Biol. 2015;11:e1004421. doi: 10.1371/journal.pcbi.1004421. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Jacquier H, et al. Capturing the mutational landscape of the beta-lactamase TEM-1. Proceedings of the National Academy of Sciences of the United States of America. 2013;110:13067–13072. doi: 10.1073/pnas.1215206110. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Qi H, et al. A quantitative high-resolution genetic profile rapidly identifies sequence determinants of hepatitis C viral fitness and drug sensitivity. PLoS Pathog. 2014;10:e1004064. doi: 10.1371/journal.ppat.1004064. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Wu NC, et al. Functional Constraint Profiling of a Viral Protein Reveals Discordance of Evolutionary Conservation and Functionality. PLoS genetics. 2015;11:e1005310. doi: 10.1371/journal.pgen.1005310. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Mishra P, Flynn JM, Starr TN, Bolon DN. Systematic Mutant Analyses Elucidate General and Client-Specific Aspects of Hsp90 Function. Cell Rep. 2016;15:588–598. doi: 10.1016/j.celrep.2016.03.046. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Doud MB, Bloom JD. Accurate measurement of the effects of all amino-acid mutations to influenza hemagglutinin. bioRxiv. 2016 doi: 10.3390/v8060155. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Deng Z, et al. Deep sequencing of systematic combinatorial libraries reveals beta-lactamase sequence constraints at high resolution. Journal of molecular biology. 2012;424:150–167. doi: 10.1016/j.jmb.2012.09.014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Starita LM, et al. Activity-enhancing mutations in an E3 ubiquitin ligase identified by high-throughput mutagenesis. Proceedings of the National Academy of Sciences of the United States of America. 2013;110:E1263–1272. doi: 10.1073/pnas.1303309110. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Aakre CD, et al. Evolving new protein-protein interaction specificity through promiscuous intermediates. Cell. 2015;163:594–606. doi: 10.1016/j.cell.2015.09.055. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Julien P, Minana B, Baeza-Centurion P, Valcarcel J, Lehner B. The complete local genotype-phenotype landscape for the alternative splicing of a human exon. Nat Commun. 2016;7:11558. doi: 10.1038/ncomms11558. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Li C, Qian W, Maclean CJ, Zhang J. The fitness landscape of a tRNA gene. Science. 2016 doi: 10.1126/science.aae0568. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Fowler DM, Fields S. Deep mutational scanning: a new style of protein science. Nature methods. 2014;11:801–807. doi: 10.1038/nmeth.3027. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Gasperini M, Starita L, Shendure J. The power of multiplexed functional analysis of genetic variants. Nat Protoc. 2016;11:1782–1787. doi: 10.1038/nprot.2016.135.
  • 30.Sarkisyan KS, et al. Local fitness landscape of the green fluorescent protein. Nature. 2016;533:397–401. doi: 10.1038/nature17995.
  • 31.Boucher JI, Bolon DN, Tawfik DS. Quantifying and understanding the fitness effects of protein mutations: Laboratory versus nature. Protein Sci. 2016 doi: 10.1002/pro.2928.
  • 32.Gong LI, Suchard MA, Bloom JD. Stability-mediated epistasis constrains the evolution of an influenza protein. eLife. 2013;2:e00631. doi: 10.7554/eLife.00631.
  • 33.Kachroo AH, et al. Systematic humanization of yeast genes reveals conserved functions and genetic modularity. Science. 2015;348:921–925. doi: 10.1126/science.aaa0769.
  • 34.Sim NL, et al. SIFT web server: predicting effects of amino acid substitutions on proteins. Nucleic acids research. 2012;40:W452–457. doi: 10.1093/nar/gks539.
  • 35.Kircher M, et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet. 2014;46:310–315. doi: 10.1038/ng.2892.
  • 36.Adzhubei I, Jordan DM, Sunyaev SR. Predicting functional effect of human missense mutations using PolyPhen-2. Current protocols in human genetics. 2013;Chapter 7:Unit 7.20. doi: 10.1002/0471142905.hg0720s76.
  • 37.Breen MS, Kemena C, Vlasov PK, Notredame C, Kondrashov FA. Epistasis as the primary factor in molecular evolution. Nature. 2012;490:535–538. doi: 10.1038/nature11510.
  • 38.McCandlish DM, Shah P, Plotkin JB. Epistasis and the Dynamics of Reversion in Molecular Evolution. Genetics. 2016;203:1335–1351. doi: 10.1534/genetics.116.188961.
  • 39.Ovchinnikov S, Kamisetty H, Baker D. Robust and accurate prediction of residue-residue interactions across protein interfaces using evolutionary information. eLife. 2014;3:e02030. doi: 10.7554/eLife.02030.
  • 40.Hopf TA, et al. Sequence co-evolution gives 3D contacts and structures of protein complexes. eLife. 2014;3:e03430. doi: 10.7554/eLife.03430.
  • 41.Hopf TA, et al. Three-dimensional structures of membrane proteins from genomic sequencing. Cell. 2012;149:1607–1621. doi: 10.1016/j.cell.2012.04.012.
  • 42.Marks DS, Hopf TA, Sander C. Protein structure prediction from sequence variation. Nature biotechnology. 2012;30:1072–1080. doi: 10.1038/nbt.2419.
  • 43.Marks DS, et al. Protein 3D structure computed from evolutionary sequence variation. PLoS One. 2011;6:e28766. doi: 10.1371/journal.pone.0028766.
  • 44.Morcos F, et al. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proceedings of the National Academy of Sciences of the United States of America. 2011;108:E1293–1301. doi: 10.1073/pnas.1111471108.
  • 45.Jones DT, Buchan DW, Cozzetto D, Pontil M. PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments. Bioinformatics. 2012;28:184–190. doi: 10.1093/bioinformatics/btr638.
  • 46.Mann JK, et al. The fitness landscape of HIV-1 gag: advanced modeling approaches and validation of model predictions by in vitro testing. PLoS Comput Biol. 2014;10:e1003776. doi: 10.1371/journal.pcbi.1003776.
  • 47.Lapedes A, Giraud B, Jarzynski C. Using sequence alignments to predict protein structure and stability with high accuracy. 2012 arXiv preprint arXiv:1207.2484.
  • 48.Figliuzzi M, Jacquier H, Schug A, Tenaillon O, Weigt M. Coevolutionary Landscape Inference and the Context-Dependence of Mutations in Beta-Lactamase TEM-1. Mol Biol Evol. 2016;33:268–280. doi: 10.1093/molbev/msv211.
  • 49.Sella G, Hirsh AE. The application of statistical physics to evolutionary biology. Proceedings of the National Academy of Sciences of the United States of America. 2005;102:9541–9546. doi: 10.1073/pnas.0501865102.
  • 50.Giraud BG, Heumann JM, Lapedes AS. Superadditive correlation. Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Topics. 1999;59:4983–4991. doi: 10.1103/physreve.59.4983.
  • 51.Ovchinnikov S, et al. Large-scale determination of previously unsolved protein structures using evolutionary information. eLife. 2015;4:e09248. doi: 10.7554/eLife.09248.
  • 52.Kosciolek T, Jones DT. De novo structure prediction of globular proteins aided by sequence variation-derived contacts. PLoS One. 2014;9:e92197. doi: 10.1371/journal.pone.0092197.
  • 53.Besag J. Statistical analysis of non-lattice data. The statistician. 1975:179–195.
  • 54.Balakrishnan S, Kamisetty H, Carbonell JG, Lee SI, Langmead CJ. Learning generative models for protein fold families. Proteins. 2011;79:1061–1078. doi: 10.1002/prot.22934.
  • 55.Kamisetty H, Ovchinnikov S, Baker D. Assessing the utility of coevolution-based residue-residue contact predictions in a sequence- and structure-rich era. Proceedings of the National Academy of Sciences of the United States of America. 2013;110:15674–15679. doi: 10.1073/pnas.1314045110.
  • 56.Ekeberg M, Lovkvist C, Lan Y, Weigt M, Aurell E. Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models. Physical review. E, Statistical, nonlinear, and soft matter physics. 2013;87:012707. doi: 10.1103/PhysRevE.87.012707.
  • 57.Di Nardo AA, Larson SM, Davidson AR. The relationship between conservation, thermodynamic stability, and function in the SH3 domain hydrophobic core. Journal of molecular biology. 2003;333:641–655. doi: 10.1016/j.jmb.2003.08.035.
  • 58.Halabi N, Rivoire O, Leibler S, Ranganathan R. Protein sectors: evolutionary units of three-dimensional structure. Cell. 2009;138:774–786. doi: 10.1016/j.cell.2009.07.038.
  • 59.Philip AF, Kumauchi M, Hoff WD. Robustness and evolvability in the functional anatomy of a PER-ARNT-SIM (PAS) domain. Proceedings of the National Academy of Sciences of the United States of America. 2010;107:17986–17991. doi: 10.1073/pnas.1004823107.
  • 60.Bershtein S, Mu W, Shakhnovich EI. Soluble oligomerization provides a beneficial fitness effect on destabilizing mutations. Proceedings of the National Academy of Sciences of the United States of America. 2012;109:4857–4862. doi: 10.1073/pnas.1118157109.
  • 61.Landrum MJ, et al. ClinVar: public archive of interpretations of clinically relevant variants. Nucleic acids research. 2016;44:D862–868. doi: 10.1093/nar/gkv1222.
  • 62.Adzhubei IA, et al. A method and server for predicting damaging missense mutations. Nature methods. 2010;7:248–249. doi: 10.1038/nmeth0410-248.
  • 63.Capriotti E, Calabrese R, Casadio R. Predicting the insurgence of human genetic diseases associated to single point protein mutations with support vector machines and evolutionary information. Bioinformatics. 2006;22:2729–2734. doi: 10.1093/bioinformatics/btl423.
  • 64.Grimm DG, et al. The evaluation of tools used to predict the impact of missense variants is hindered by two types of circularity. Human mutation. 2015;36:513–523. doi: 10.1002/humu.22768.
  • 65.Bromberg Y, Yachdav G, Rost B. SNAP predicts effect of mutations on protein function. Bioinformatics. 2008;24:2397–2398. doi: 10.1093/bioinformatics/btn435.
  • 66.van Nimwegen E. Inferring Contacting Residues within and between Proteins: What Do the Probabilities Mean? PLoS Comput Biol. 2016;12:e1004726. doi: 10.1371/journal.pcbi.1004726.
  • 67.Eddy SR. Accelerated Profile HMM Searches. PLoS Comput Biol. 2011;7:e1002195. doi: 10.1371/journal.pcbi.1002195.
  • 68.Suzek BE, et al. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics. 2015;31:926–932. doi: 10.1093/bioinformatics/btu739.
  • 69.Nawrocki EP, et al. Rfam 12.0: updates to the RNA families database. Nucleic acids research. 2015;43:D130–137. doi: 10.1093/nar/gku1063.
  • 70.Jaynes ET. Information theory and statistical mechanics. Physical review. 1957;106:620.
  • 71.Dunn SD, Wahl LM, Gloor GB. Mutual information without the influence of phylogeny or entropy dramatically improves residue contact prediction. Bioinformatics. 2008;24:333–340. doi: 10.1093/bioinformatics/btm604.
  • 72.Toth-Petroczy A, et al. Structured States of Disordered Proteins from Genomic Sequences. Cell. 2016;167:158–170.e112. doi: 10.1016/j.cell.2016.09.010.
  • 73.Reva B, Antipin Y, Sander C. Determinants of protein function revealed by combinatorial entropy optimization. Genome Biol. 2007;8:R232. doi: 10.1186/gb-2007-8-11-r232.
  • 74.Székely GJ, Rizzo ML. Brownian distance covariance. The annals of applied statistics. 2009;3:1236–1265. doi: 10.1214/09-AOAS312.
  • 75.Matthews BW. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta. 1975;405:442–451. doi: 10.1016/0005-2795(75)90109-9.
  • 76.Robin X, et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC bioinformatics. 2011;12:77. doi: 10.1186/1471-2105-12-77.
  • 77.Berman HM, et al. The Protein Data Bank. Nucleic acids research. 2000;28:235–242. doi: 10.1093/nar/28.1.235.
  • 78.Pérez F, Granger BE. IPython: a system for interactive scientific computing. Computing in Science & Engineering. 2007;9:21–29.
  • 79.Van Der Walt S, Colbert SC, Varoquaux G. The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering. 2011;13:22–30.
  • 80.Hunter JD. Matplotlib: A 2D graphics environment. Computing in science and engineering. 2007;9:90–95.

Associated Data

Supplementary Materials

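Integrated Supplementary Figures
Supplementary Table 1
Supplementary Table 2
Supplementary Table 3
Supplementary Table 4
Supplementary Table 5
Supplementary Table 6
Supplementary Table 7
Supplementary Table 8
Supplementary Table 9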
