Evolution Rapidly Optimizes Stability and Aggregation in Lattice Proteins Despite Pervasive Landscape Valleys and Mazes

Jason Bertram; Joanna Masel

doi:10.1534/genetics.120.302815

Genetics. 2020 Feb 27;214(4):1047–1057. doi: 10.1534/genetics.120.302815

PMCID: PMC7153934 PMID: 32107278

Keywords: fitness landscape, sign epistasis, pleiotropy, computational complexity, protein folding, hydrophobic zipping

Abstract

The “fitness” landscapes of genetic sequences are characterized by high dimensionality and “ruggedness” due to sign epistasis. Ascending from low to high fitness on such landscapes can be difficult because adaptive trajectories get stuck at low-fitness local peaks. Compounding matters, recent theoretical arguments have proposed that extremely long, winding adaptive paths may be required to reach even local peaks: a “maze-like” landscape topography. The extent to which peaks and mazes shape the mode and tempo of evolution is poorly understood, due to empirical limitations and the abstractness of many landscape models. We explore the prevalence, scale, and evolutionary consequences of landscape mazes in a biophysically grounded computational model of protein evolution that captures the “frustration” between “stability” and aggregation propensity. Our stability-aggregation landscape exhibits extensive sign epistasis and local peaks galore. Although this frequently obstructs adaptive ascent to high fitness and virtually eliminates reproducibility of evolutionary outcomes, many adaptive paths do successfully complete the ascent from low to high fitness, with hydrophobicity a critical mediator of success. These successful paths exhibit maze-like properties on a global landscape scale, in which taking an indirect path helps to avoid low-fitness local peaks. This delicate balance of “hard but possible” adaptation could occur more broadly in other biological settings where competing interactions and frustration are important.

The effect of an allele on organismal fitness can depend on its genetic context, a phenomenon known as epistasis. Epistasis is often visualized in terms of a “landscape” representing the mapping from high-dimensional genotype space to fitness (Wright 1932; de Visser and Krug 2014; Kondrashov and Kondrashov 2015). In the absence of epistasis, new mutations have the same effect regardless of the order in which they occur, and adaptive evolution on any genetic background leads to the same peak genotype(s). The latter still applies if epistasis only affects the magnitude, not the sign, of mutational fitness effects. However, if there is sign epistasis (mutations that are beneficial on some genetic backgrounds are deleterious on others), fitness landscapes become rugged; many or all of the mutational paths between two genotypes pass through valleys of lower fitness even if those genotypes only differ by a few mutations (Weinreich et al. 2005).

Sign epistasis is pervasive, particularly at the level of individual proteins (Starr and Thornton 2016). It is therefore important to understand the evolutionary consequences of landscape ruggedness. It is well known that rugged landscapes can be permeated with local fitness peaks where all available mutations are deleterious (Whitlock et al. 1995; de Visser and Krug 2014; Kondrashov and Kondrashov 2015; Marchi et al. 2019), halting adaptive evolution unless valley crossing occurs (Weissman et al. 2009). Sign epistasis can also cause the reversion of previously adaptive mutations along adaptive paths (DePristo et al. 2007; Pollock et al. 2012; Palmer et al. 2015; Wu et al. 2016; Starr et al. 2018), creating “indirect” adaptive paths (Wu et al. 2016) with maze-like landscape features (Kaznatcheev 2019).

Compared with peaks and valleys, the drivers and consequences of maze-like behavior are far less well understood. In the extreme case where reversions occur repeatedly across many loci, computational complexity arguments show that astronomically long indirect paths could theoretically be required to reach even local fitness peaks (Kaznatcheev 2019). This previously unappreciated challenge for adaptive evolution could be in addition to the need to cross valleys once local peaks have been reached. Alternatively, indirect paths could facilitate adaptation by providing a way to navigate around inferior local peaks (Wu et al. 2016). Thus, it is not apparent a priori whether indirect paths should be regarded as a help or a hindrance.

It is also not apparent what the scale of maze-like features might be on biologically plausible landscapes. Computational complexity arguments show that long indirect paths do exist in model landscapes (Kaznatcheev 2019), but without any indication of whether they are likely to occur in a typical ensemble of such landscapes (which are themselves of uncertain biological plausibility). Intuitively, it would seem that long indirect paths are hard to achieve because the sign epistasis necessary for mazes also creates peaks, and by definition peaks obstruct adaptive paths (Li 2018). Long indirect paths seem to require a careful balance where adaptation is hard but possible: hard because the landscape topography precludes rapid direct ascent to high fitness but possible because high fitness is still attainable without valley crossing. Thus, we could imagine that maze-like features might be largely confined to short local detours rather than the deeper source of landscape complexity proposed by Kaznatcheev (2019).

Experiments have had limited success clarifying these issues because most empirical landscape studies, including the few studies that have demonstrated indirect paths (DePristo et al. 2007; Palmer et al. 2015; Wu et al. 2016), are confined to the local scale (e.g., in the neighborhood of some reference genotype; de Visser and Krug 2014; Sarkisyan et al. 2016). Thus, the overall prevalence of “backtracking” along adaptive paths on biological fitness landscapes is not known, nor is it known whether backtracking extends beyond local detours to create larger (potentially global) scale mazes.

However, there is indirect evidence that is consistent with the view that fitness landscape mazes might be biologically important. First, long-term microbial evolution experiments famously exhibit power-law fitness accumulation where beneficial mutations show no sign of depleting after tens of thousands of generations (Wiser et al. 2013); this is a possible signature of long indirect paths (Kaznatcheev 2019). Second, viral protein phylogenetic analyses (Popova et al. 2019) have documented a “senescence” phenomenon whereby older mutations are more likely to be lost (the opposite of “entrenchment”; Ashenberg et al. 2013; Shah et al. 2015; McCandlish et al. 2016; Flynn et al. 2017). But while this could represent epistatic backtracking, the phenomenon’s association with validated antigenic sites makes the slow evolution of host immunity a viable alternative explanation (Popova et al. 2019). Third, mammalian protein age estimates suggest the existence of long-term evolutionary trends (Neme and Tautz 2013), specifically an increase in hydrophobicity and a decrease in the clustering of hydrophobic (H) amino acids along the primary sequence (Foy et al. 2019). Given the long timescales involved, the patterns observed by Foy et al. (2019) likely depend on the global features of protein fitness landscapes: the consistent direction of evolution combined with its failure to settle at an optimal value for these protein properties within such long timescales could conceivably be explained by slow valley-crossing processes (pervasive peaks) (Guo et al. 2019), long adaptive paths (mazes), entropic barriers of neutral meandering (van Nimwegen and Crutchfield 2000), or a combination of these. While these empirical observations fall far short of establishing the existence of mazes, they highlight the need to improve our understanding of how epistasis impacts evolution over long evolutionary trajectories, with the role of evolutionary mazes standing out as one key unresolved issue.

While the high dimensionality of biological landscapes likely precludes detailed empirical mapping beyond the local scale (the number of possible genotypes in the landscape of even a single protein is exponentially large in the number of residues), theoretical models could, in principle, provide global insight (Fragata et al. 2019). One influential approach is to analyze randomly constructed landscapes (Stadler and Happel 1999). These “random field models” have been used to investigate the distribution of local peaks (Hwang et al. 2018), connectivity via adaptive paths (Hwang et al. 2018; Li 2018), and the special features of landscape sections traversed by adaptive evolution (Agarwala and Fisher 2019). A strength of random field approaches is allowing general statements to be made about ensembles of landscapes characterized by a few statistical parameters (Agarwala and Fisher 2019). However, the applicability of random field models to biological landscapes is not clear. In particular, it is not clear whether random representations of sign epistasis are able to capture the hard but possible aspect of evolutionary mazes; indeed, random field models more typically exhibit sudden transitions between highly connected and highly disconnected topographies when ruggedness is increased (Li 2018).

Here, we adopt an alternative approach by focusing on the fitness landscapes of biophysical protein models (Serohijos and Shakhnovich 2014; Bastolla et al. 2017), sacrificing some of the generality of random field approaches in favor of greater biological realism. Protein fitness landscapes are particularly interesting from the perspective of understanding evolutionary mazes because there are biophysical reasons to suspect that hard but possible evolutionary dynamics might arise. Contacts between H residues are responsible not only for “good” intramolecular bonding (stable folding), but also for “bad” intramolecular (misfolding) and intermolecular bonding (aggregation), an example of frustration caused by competing interactions (Wolf et al. 2018). This can severely constrain adaptive paths because mutations between H and polar (P) residues will often cause both stability and misfolding/“aggregation potential” to increase, or decrease, together (Gershenson et al. 2014). On the other hand, if adaptive evolution is nevertheless possible in the face of this constraint, structural changes will modify each residue’s role, creating ideal conditions for reversions and maze-like behavior. For instance, previously shielded H residues could become exposed on the protein surface and present an aggregation risk, while previously peripheral P residues selected for aggregation avoidance might become situated in more structurally important positions where an H residue would be better for stability.

A major theme of the molecular fitness landscape literature has been the evolutionary accessibility of different protein structures, i.e., the mutational connectivity between high-fitness folds. The simplest way to address accessibility questions is to reduce fitness to a binary of “functional” and “nonfunctional” genotypes, creating a neutral network of functional genotypes connected by mutation (Maynard Smith 1970). Global landscape structure then becomes a matter of diffusion via neutral mutations among equivalently high-fitness genotypes (Lipman and Wilbur 1991; Wilke 2001) [see Gavrilets (1999) for a similar approach applied at the organismal level]. But this approach does not yield information about the global topography of macromolecular fitness differences in the crucial landscape regions below high fitness peaks or plateaus (Kondrashov and Kondrashov 2015), where global-scale indirect paths may be required to reach high fitness to begin with. Approaches that do focus on suboptimal genotypes tend to do so in the immediate vicinity of high fitness peaks rather than along evolutionary trajectories, e.g., studies of phenomena such as marginal fold stability and mutational robustness (Govindarajan and Goldstein 1997; Fontana and Schuster 1998; van Nimwegen et al. 1999; Ancel and Fontana 2000; Chan and Bornberg-Bauer 2002).

We investigate the global fitness landscape of an evolving protein using a computational lattice model to evaluate the mapping from amino acid sequence to molecular fitness. We evaluate thousands of adaptive trajectories to explore the difficulty of reaching high fitness, the diversity of high-fitness local peaks reached, the extent to which local peaks obstruct adaptation, and the prevalence and scale of maze-like behavior. In response to the empirical results of Foy et al. (2019), we also explore whether our model is capable of generating long-term directional changes in sequence properties like hydrophobicity and H clustering.

Following the classic approach of Lau and Dill (1989), our model groups amino acids as either H or P, and maps protein structure onto a square lattice. Previous applications of HP lattice models to protein evolution have focused on functional amino acid sequences that have unique lowest-free-energy “native” conformations (Lipman and Wilbur 1991; Govindarajan and Goldstein 1997; Chan and Bornberg-Bauer 2002). Our approach differs in two respects.

First, we do not define fitness in terms of native conformations as is usually done (Chan and Bornberg-Bauer 2002), because this would preclude the overwhelming majority of lower-fitness sequences. Instead, we implement an “H zipping” algorithm (Dill et al. 1993) that samples the molecular products of any sequence by simulating the kinetics of protein folding. Although other sampling algorithms are available for searching through the conformational space of a sequence, these are heuristics focused on computational efficiency (Zhao 2008). The H zipping algorithm provides a kinetic model for the folding of sequences without a native fold, while also being efficient enough to explore the landscapes of proteins many tens of residues long.

Second, to fully account for the sign-epistatic role of bad H contacts, our fitness metric is not based solely on intramolecular H–H contacts (Chan and Bornberg-Bauer 2002), but also factors in the risk of intermolecular aggregation, an important aspect of how proteins affect organismal fitness (DePristo et al. 2005; Levy et al. 2012).

We find that evolution on our stability-aggregation fitness landscape is characterized by an abundance of local peaks and widespread maze-like behavior, causing strong path dependence, largely eliminating reproducibility of outcomes at the sequence level, and frequently obstructing adaptive ascent to high fitness. Despite these obstacles, evolution does often manage to reach high fitness. Maze-like adaptive paths are created by pervasive sign epistasis, but are only of order L steps long, much shorter than the extremely long $O (2^{L})$ paths that are hypothetically possible on sign-epistatic landscapes (Kaznatcheev 2019). Attaining higher fitness is associated with longer adaptive path lengths and staggered reversions throughout the adaptive process, indicative of a global-scale maze that bypasses inferior local peaks. Biophysically, adaptive paths that reach high fitness in our model tend to involve uncommon innovations that do better than naively expected from the intrinsic frustration between stability and aggregation.

Methods

H zipper model of protein stability and aggregation

We implement a two-dimensional (2D) HP lattice protein-folding model. The amino acid heteropolymer is modeled as “beads on a string,” where each amino acid “bead” can be either H or P. The allowable conformations of the heteropolymer are self-avoiding walks on a 2D square lattice (beads that are sequential neighbors must be one lattice step apart, and each lattice position holds at most one bead).

To be biologically useful, folded heteropolymers must be both thermodynamically stable and unlikely to aggregate with the other heteropolymers in the cellular environment. Following the classic HP model (Lau and Dill 1989), the thermodynamic stability of a conformation (free energy of folding) is assumed to be proportional to the number of contacts between H monomers that are not sequential neighbors; we refer to this number as stability S. Similarly, we define the aggregation potential A of a conformation as the number of potential H contacts that the conformation leaves exposed to bind to other molecules (Figure 1). Combining these, we assign an overall stability/aggregation score $F = S - α A$ to each conformation, where $α \geq 0$ is a parameter representing the relative importance of avoiding aggregation. The frustration between stability and aggregation (see the Introduction) implies that S and A are strongly pleiotropic.
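As a concrete illustration of this scoring rule, the following minimal Python sketch computes S, A, and F for a single conformation given as lattice coordinates. It assumes that A counts the empty lattice sites adjacent to H monomers, which is one reading of "potential H contacts left exposed"; the counting rule in the authors' source code may differ. The function name `score_conformation` and its arguments are hypothetical.

```python
def score_conformation(seq, coords, alpha=1.0):
    """Return (S, A, F) for one conformation on the 2D square lattice.

    seq    : string of 'H' and 'P' monomers
    coords : list of (x, y) lattice positions, one per monomer
    alpha  : relative weight of aggregation avoidance (alpha >= 0)

    Assumption: A is the number of empty lattice sites adjacent to H monomers.
    """
    if len(seq) != len(coords):
        raise ValueError("seq and coords must have the same length")
    occupied = {pos: i for i, pos in enumerate(coords)}
    if len(occupied) != len(coords):
        raise ValueError("conformation is not self-avoiding")

    stability = 0    # S: H-H contacts between non-sequential neighbors
    aggregation = 0  # A: exposed potential H contacts (assumed definition)
    for i, (x, y) in enumerate(coords):
        if seq[i] != "H":
            continue
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            j = occupied.get((x + dx, y + dy))
            if j is None:
                aggregation += 1
            elif seq[j] == "H" and j > i + 1:
                # j > i + 1 skips sequential neighbors and counts each pair once
                stability += 1
    return stability, aggregation, stability - alpha * aggregation
```

For the four-monomer sequence HPPH folded around a unit square, this gives S = 1, A = 4, and F = −3 when α = 1.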

Figure 1. (A) Hydrophobic zipping of the 10-monomer sequence HPPHPHHPPH (H: solid circles, P: hollow circles). Folding starts at the leftmost possible H–H contact (nucleation). Depending on the order in which H–H contacts form (a stochastic process), it is possible that the zipper gets stuck before much of the sequence is folded (left branch). Such partially folded structures are assigned S = 0. Orange stars denote potential H contacts contributing to the aggregation potential A. For fully folded structures, S is equal to the number of nonsequential H–H contacts (dotted lines). (B) In the absence of the fitness penalty for failing to form a complete fold, evolution gets stuck at sequences that only fold locally (unfolded disordered regions shown with dashed line). Multiple nucleations shown for illustration; we only present results for a single nucleation. H, hydrophobic; P, polar.

We model the kinetic process taking the amino acid heteropolymer from a disordered string to a folded globule as H zipping (Dill et al. 1993). Our implementation follows the algorithm outlined in Dill and Fiebig (1994). The zipping process starts with an initial nucleation in which a pair of H monomers that are sequentially nearby, three residues apart in our square lattice model, form an H–H contact (Figure 1A). We assume that nucleation occurs between the leftmost such pair in the sequence (e.g., because folding begins before translation is complete; we return to this assumption in the Discussion). The formation of the nucleating contact brings other H monomers closer together, facilitating the formation of more H–H contacts. Each new H–H contact is chosen randomly from among the set of candidate H pairs that are possible given the current partially zipped topology. To identify candidate H pairs, Dijkstra’s shortest path algorithm is applied to identify three-step shortest paths connecting H residues on the graph of amino acid beads (nodes) connected by sequence adjacency and existing H–H contacts (i.e., both solid and dotted lines in Figure 1A are graph edges). A checking procedure excludes candidate pairs incompatible with the topology of the existing zipped structure. Zipping proceeds in this way until no further contacts are possible.
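The candidate-pair search can be sketched in a few lines of Python. Because every edge of the contact graph has unit weight, the sketch uses breadth-first search, which returns the same shortest-path distances as Dijkstra's algorithm on such a graph. It covers only the graph-distance step; the subsequent check for compatibility with the existing lattice topology is omitted. The function name `candidate_pairs` is hypothetical.

```python
from collections import deque

def candidate_pairs(seq, contacts):
    """Return sorted H-H pairs (i, j), i < j, at graph distance exactly 3.

    seq      : string of 'H' and 'P' monomers
    contacts : iterable of (i, j) index pairs for H-H contacts already formed

    Edges are sequence adjacency plus existing contacts. The lattice-geometry
    compatibility check described in the text is NOT performed here.
    """
    n = len(seq)
    adjacency = {i: set() for i in range(n)}
    for i in range(n - 1):
        adjacency[i].add(i + 1)
        adjacency[i + 1].add(i)
    for i, j in contacts:
        adjacency[i].add(j)
        adjacency[j].add(i)

    pairs = set()
    for source in range(n):
        if seq[source] != "H":
            continue
        # Breadth-first search, truncated at depth 3
        distance = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            if distance[u] == 3:
                continue
            for v in adjacency[u]:
                if v not in distance:
                    distance[v] = distance[u] + 1
                    queue.append(v)
        for v, d in distance.items():
            if d == 3 and seq[v] == "H" and v > source:
                pairs.add((source, v))
    return sorted(pairs)
```

For the Figure 1A sequence HPPHPHHPPH with no contacts formed, the candidates are (0, 3), (3, 6), and (6, 9) in zero-based indexing. Once the nucleating contact (0, 3) exists, the pair (0, 5) becomes a candidate through the path 0–3–4–5, illustrating how each contact brings further H monomers within reach.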

H zipping is a stochastic process because a given HP sequence x may zip to different conformations depending on the order in which H–H contacts form (Figure 1A). Consequently, $F (x)$ is a random variable for each sequence x. We define the overall fitness $\bar{F} (x)$ of x as the expectation of $F (x)$ . Although $\bar{F} (x)$ is not a direct measure of organismal or protein performance, and can even be negative, we use the term fitness for simplicity since $\bar{F} (x)$ defines our fitness landscape. We estimate $\bar{F} (x)$ numerically by computing $F (x)$ for a sample of 1000 zipped conformations and taking the sample mean.

Biophysically, H zipping represents a rapid initial phase of conformational entropy loss (Dill et al. 1993). Zipping sometimes finds the lowest free energy conformations possible for a given sequence, but in general reaching the lowest free energy conformations requires breaking H–H contacts formed during zipping, a process that would presumably require longer timescales than the initial rapid collapse (Dill et al. 1993). Zipped conformations nevertheless approximate the lowest free energy conformations, and are produced with sample frequencies that reflect kinetic accessibility. With respect to aggregation, much of the aggregation risk associated with the expression of a sequence could be attributable to these rapidly formed conformations.

In contrast to the traditional approach of enumerating the low-energy conformations of sequences of a fixed length subject to compactness constraints (Chan and Bornberg-Bauer 2002), H zipping is more flexible in its handling of protein length. A given sequence of length L may terminate zipping in a partially folded state, implying a domain length shorter than L. Indeed, in the absence of an additional length constraint, adaptive evolution will almost always reach local peaks whereby a small ( $\sim 10$ residues), highly stable structure has formed (subunits in Figure 1B). Allowing for multiple nucleation points does not rectify this problem, merely leading to the formation of multiple such structures separated by disordered regions (Figure 1B). These local peaks effectively block evolution from grappling with the relevant biological problem of finding a good fold for the entire length L domain. Therefore, we impose an additional length constraint: partially folded zipped structures are assumed to have $S = 0$ . As an added bonus, our zipping algorithm then does not need to process further nucleation events after the initial one, since multinucleated partially folded structures would still have $S = 0$ . Biologically, our length constraint can be interpreted as a rudimentary function requirement: structures that do not fold to the requisite length are unable to serve a useful purpose but still pose an aggregation risk.

Sequence evolution

We simulate sequence evolution as an origin-fixation process. At each substitutional step, we compute the fitness $\bar{F} (x)$ of the current sequence x as well as the fitnesses $\bar{F} (x')$ of all possible mutants $x'$ of x. We allow only single H $\to$ P or P $\to$ H mutations. There are L such mutants where L is sequence length. If fitter mutants are found to exist, one of them is chosen to replace x. We implement one of two alternative decision rules: choosing the fittest mutant or choosing a random fitter mutant (“greedy” and “random” adaptive walks, respectively). If all mutants are found to have lower fitness than x, we reestimate $\bar{F} (x)$ with a larger sample size and again check if the previously estimated mutant fitnesses exceed $\bar{F} (x)$ . If no fitter mutants are found (a local peak), evolution is stopped; otherwise the above is repeated on the chosen fitter mutant. The reestimation of $\bar{F} (x)$ helps to ensure that we do not unintentionally miss any beneficial mutations due to sampling error in $\bar{F} (x)$ .

We preclude mutations that create runs of three or more P monomers, since these would jam the zipper before all L amino acids could be incorporated in the fold. We also initialize evolution on sequences that start and end on H monomers, and do not have runs of three or more P monomers, but are otherwise randomly generated. We do not specify precise initial hydrophobicity; instead, we specify different odds of H vs. P monomers to generate a broad range of initial hydrophobicities. We only consider sequences of length $L = 60$ , which we found to be a comfortable length for simulating large numbers of trajectories, while still being long enough to plausibly constitute a functional protein domain (further comments in Discussion).
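One way to generate initial sequences satisfying these constraints is sketched below in Python. The text does not specify the generation procedure beyond the constraints, so the approach here (drawing monomers left to right and forcing an H after two consecutive P monomers) is an assumption, and it shifts the realized hydrophobicity slightly above the nominal H probability. The names `random_initial_sequence` and `has_p_run` are hypothetical.

```python
import random

def has_p_run(seq, run=3):
    """True if seq contains a run of `run` or more consecutive P monomers."""
    return "P" * run in seq

def random_initial_sequence(length=60, p_h=0.5, rng=None):
    """Draw an HP sequence that starts and ends with H and has no PPP run.

    p_h is the per-site probability of H at unconstrained positions.
    Assumption: an H is forced whenever the previous two monomers are P.
    """
    if length < 2:
        raise ValueError("length must be at least 2")
    rng = rng or random.Random()
    beads = ["H"]
    for _ in range(1, length - 1):
        if beads[-2:] == ["P", "P"]:
            beads.append("H")
        else:
            beads.append("H" if rng.random() < p_h else "P")
    beads.append("H")
    return "".join(beads)
```

The same `has_p_run` predicate can be used to reject mutants during evolution, since a single H $\to$ P mutation is the only way a new run of three P monomers can arise.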

H clustering

To connect our findings to the H clustering results of Foy et al. (2019) we use the same H clustering metric, defined as follows. Split an HP sequence of total length L into blocks of length l. In each block, subtract the number of P residues from the number of H residues to obtain a block score n. H clustering is then given by $Ψ = σ^{2} (n) / K$ , where $σ^{2} (n)$ is the variance in n among blocks and $K = l \frac{L^{2} (1 - {\bar{n}}^{2} / l^{2})}{L^{2} - L} (1 - \frac{l}{L})$ where $\bar{n}$ is the block mean n (Irbäck et al. 1996; Irbäck and Sandelin 2000). Intuitively, $Ψ$ is a modified dispersion index for the block scores n that accounts for finite sequence length L (similar to Bessel’s correction for estimating variance from a finite sample). The expectation of $Ψ$ for completely random HP sequences with any L and any hydrophobicity is equal to 1 (Irbäck et al. 1996; Irbäck and Sandelin 2000). We use a block length of $l = 3$ and report the average $Ψ$ over the three possible block phases.
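The metric can be sketched in Python as below. Several details are assumptions not stated in the text: the variance is taken as the population variance of block scores, each phase drops any incomplete trailing block, and L in the normalization K is taken as the length covered by complete blocks in that phase. The function name `h_clustering` is hypothetical.

```python
from statistics import mean, pvariance

def h_clustering(seq, block=3):
    """Return the H clustering index, averaged over all block phases.

    Assumptions: population variance of block scores; incomplete trailing
    blocks are dropped; L in the normalization K is the covered length.
    """
    if len(seq) < 3 * block:
        raise ValueError("sequence too short for this block length")
    values = []
    for phase in range(block):
        n_blocks = (len(seq) - phase) // block
        scores = []
        for b in range(n_blocks):
            chunk = seq[phase + b * block: phase + (b + 1) * block]
            scores.append(chunk.count("H") - chunk.count("P"))
        covered = n_blocks * block
        n_bar = mean(scores)
        # K = l * L^2 (1 - n_bar^2 / l^2) / (L^2 - L) * (1 - l / L)
        k = (block * covered ** 2 * (1 - n_bar ** 2 / block ** 2)
             / (covered ** 2 - covered) * (1 - block / covered))
        if k == 0:
            return float("nan")  # all-H or all-P: index is undefined
        values.append(pvariance(scores) / k)
    return mean(values)
```

A sequence with long alternating runs such as HHHHHHPPPPPP repeated five times scores above 1 (clustered), whereas strictly alternating HP repeated thirty times scores below 1 (more evenly dispersed than random).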

Data availability

Data and source code for simulations and figures can be accessed at https://github.com/jasonbertram/HPzipper_folding_vs_aggregation. Supplemental material available at figshare: https://doi.org/10.25386/genetics.11904921.

Results

Adaptive ascent to high fitness

In this section, we examine the global topography of the fitness landscape $\bar{F} (x)$ by simulating adaptive trajectories starting from randomly generated initial sequences (the random generation algorithm is described in Sequence evolution). This gives a general sense of how likely it is to ascend to high fitness, and how adaptive evolution changes properties such as hydrophobicity and H clustering. We assume a greedy adaptive walk scenario, such that the fitness landscape is ascended at the greatest possible rate. We also assume $α = 1$ , such that stability and aggregation contribute equally to F. We discuss other choices for α below.

We classify the evolved sequences at terminal peaks of adaptive paths into three qualitatively distinct groups. The highest fitness sequences always form a complete fold, and zip to only one or a few conformations; we call these sequences “reliable.” The lowest fitness sequences never zip to a complete conformation (in a finite sample of 1000 zipped conformations) and are called “unfoldable.” The remaining “unreliable” sequences only sometimes form a complete fold.

Random initial sequences are usually unfoldable, and adaptive evolution from these initial sequences frequently terminates at unfoldable, low-fitness local peaks (blue points in Figure 2, A–D). Evolution does nevertheless manage to reach high-fitness reliable sequences, and it is immediately apparent that maze-like paths are likely involved, since overall adaptive path lengths in Figure 2B approach a value of L, the maximum possible Hamming distance between HP sequences.

Figure 2. (A–F) Adaptive origin-fixation evolution to a local peak on the fitness landscape $\bar{F} (x)$ with $α = 1$ . Colors indicate the type of sequence ultimately reached: reliable sequences in red, unfoldable sequences in blue, and unreliable sequences in orange. When sample points exactly overlap, the plotted circle is scaled proportional to the number of overlapping samples. H clustering is defined in the H clustering section of Methods. In total, 1491 sequences of length $L = 60$ were evolved, selecting the fittest beneficial mutation for each substitution.

It can be reasoned that the attained fitnesses are high in a global sense as follows. On a square 2D lattice, each nonterminal residue can be one of the partners in at most two bonds, implying a maximum possible stability of $S \approx L$ . Assuming a maximally compact folded molecule with square dimensions, the most favorable scenario for forming H–H contacts and avoiding aggregation, the exposed exterior is $\approx 4 \sqrt{L}$ residues long. Thus, if all exterior residues were H, F would be reduced by $\approx 4 \sqrt{L}$ . The same result applies more generally: replacing $H \to P$ at an exterior residue would lower A by one, but also implies the loss of one H–H bond from S. Therefore, global maximum fitness is approximately $(L - 4 \sqrt{L}) / L$ per residue, which is $\approx 0.5$ for $L = 60$ , close to the highest fitness values attained computationally.
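The arithmetic behind the $\approx 0.5$ figure can be checked directly. The short Python sketch below evaluates the approximate bound $(L - 4 \sqrt{L}) / L$ for the $α = 1$ case; the function name is hypothetical, and the result inherits the approximations of the argument above (a square, maximally compact fold).

```python
import math

def approx_max_fitness_per_residue(length):
    """Approximate upper bound on fitness per residue for alpha = 1.

    Uses S ~ L for a maximally compact fold and a penalty of ~4*sqrt(L)
    for the exposed exterior, following the argument in the text.
    """
    exterior = 4 * math.sqrt(length)
    return (length - exterior) / length
```

For $L = 60$ the exterior term is about 31 residues, leaving roughly 29 of a possible 60, or about 0.48 per residue.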

Initial sequences that evolve reliability do not start with higher fitness relative to other random initial sequences (Figure 2A). Rather, it is the specific combination of H and P residues that determines whether high fitness is attainable. Local peaks that are reliable generally take more adaptive steps to reach than local peaks that are unreliable or unfoldable, but there is considerable overlap in the adaptive path lengths (Figure 2B). This means that some adaptive paths experience long ascents in which beneficial mutations do not run out before reaching high fitness, whereas others attain high fitness via relatively quick shortcuts. Adaptive paths starting at low hydrophobicity rarely reach a reliable local peak, while high initial hydrophobicity improves the odds of reaching a reliable sequence peak (Figure 2C). However, many trajectories with high initial hydrophobicity evolve to peaks short of reliability, indicating that reaching high fitness is also sensitive to the exact sequence of H and P residues.

Despite the sensitivity of fitness outcomes to genetic background, HP sequence evolution does exhibit consistent patterns. The highest fitness sequences all have hydrophobicity of $\approx 0.6$ (Figure 2D). Hydrophobicity tends to decline over an adaptive trajectory, although a significant minority of unreliable peak sequences gain H residues (Figure 2E). Adaptive evolution also tends to reduce the within-sequence clustering of H residues (Figure 2F; H clustering).

The preceding results are not sensitive to the severity of aggregation. In simulations where aggregation is either more ( $α = 2.0$ ; Figure S1) or less ( $α = 0.5$ ; Figure S2) severe, reliability is frequently attained, with the same optimal hydrophobicity of $\approx 0.6$ and with path lengths approaching the maximum Hamming distance of L, suggestive of mazes. These findings are consistent with our suggestion in the Introduction that protein evolution to increase stability and avoid aggregation is broadly conducive to producing mazes.

In fact, even in the total absence of aggregation ( $α = 0$ ), a similar conflict between good and bad contacts exists, but bad contacts then solely occur in the form of intramolecular misfolding. Nevertheless, since aggregation risk is known to be an essential aspect of protein “fitness” in the crowded cellular environment (DePristo et al. 2005; Levy et al. 2012), we account for both misfolding and aggregation to ensure that we have captured the full sign-switching repercussions of evolving different conformations. The $α = 0$ case in our model is in any case not ideal for studying maze-like adaptive walks because neutral meandering is common on $α = 0$ landscapes. Such neutral meandering has historically been highlighted as a key source of landscape connectivity in lattice protein models without aggregation risk (Lipman and Wilbur 1991), but this ignores the role of aggregation as a major potential source of landscape ruggedness. Thus, we henceforth focus our attention on the symmetric nonzero case $α = 1$ .

Evolutionary mazes

Having established that adaptive evolution sometimes reaches globally high fitness, we now focus on the possible adaptive paths that an initial sequence capable of attaining high fitness might follow. To achieve this, we repeatedly initialize evolution from a single initial sequence that was found to evolve reliability in Adaptive ascent to high fitness, but now follow a random adaptive walk rather than a greedy one.

The landscape $\bar{F} (x)$ exhibits strong path dependence. Despite starting from the same sequence, almost every adaptive path reaches a different peak spanning a wide range of fitness values, depending on the order and identity of beneficial mutations (Figure 3A). Adaptive paths also exhibit substantial backtracking, with total adaptive path lengths significantly exceeding the number of HP differences between the initial and final sequences (Figure 3B). Taken together, these imply a topography that is densely peaked, with many poor path choices possibly leading to low-fitness dead-ends, but also maze-like, with meandering adaptive walks sometimes needed to reach high fitness (Figure 4 shows an example). The maze does not simply involve short local detours; while some reversions may occur soon after the preceding substitution, some reversions come at a much later time along the adaptive trajectory. In this respect, HP lattice mazes have a global scale, involving a complex relationship between genetic background and the sign of mutational effects, and spanning globally low to globally high fitness.
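The comparison between total path length and the number of HP differences can be made concrete with a short sketch. The function and field names below are illustrative assumptions; the only inputs are the successive HP sequences along one adaptive path.

```python
from collections import Counter
from typing import Dict, List


def hamming(a: str, b: str) -> int:
    """Number of sites at which two equal-length HP sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def path_summary(path: List[str]) -> Dict[str, object]:
    """Compare the realized path length with the direct (Hamming) distance."""
    flips = Counter()
    for prev, curr in zip(path, path[1:]):
        for site, (x, y) in enumerate(zip(prev, curr)):
            if x != y:
                flips[site] += 1
    steps = sum(flips.values())
    direct = hamming(path[0], path[-1])
    return {
        "steps": steps,                # total substitutions along the path
        "direct": direct,              # HP differences between start and end
        "excess": steps - direct,      # substitutions spent on backtracking
        "revisited_sites": sorted(s for s, n in flips.items() if n >= 2),
    }


# A four-residue example in which site 1 mutates H -> P and later reverts.
example = ["HHPP", "HPPP", "HPPH", "HHPH"]
summary = path_summary(example)
```

A site substituted k times adds k to the path length but only k mod 2 to the Hamming distance, so the excess is always an even number and is zero exactly when no site is revisited.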

Figure 3. Indirect paths in adaptive HP lattice protein evolution. Starting from the same initial sequence, 987 sequences were evolved by picking a random beneficial mutation at each step. (A) Many different evolutionary outcomes are possible. (B) The length of indirect paths is of order L (only reliable sequences shown for clarity; circle sizes are proportional to the number of sequences, and the line indicates the length of direct paths). HP, hydrophobic–polar.

Figure 4. Example of a long ( $> L$ ) adaptive path ultimately leading to reliability. The evolving HP sequence shows repeated reversions, *i.e.*, backtracking (top left panel). These reversions occur when the preferred state of a residue changes due to substitutions elsewhere (see residue highlighted between white lines in top right). Beneficial mutations with $Δ \bar{F} \geq 1$ are yellow, deleterious mutations with $Δ \bar{F} \leq - 1$ are dark aubergine, and close-to-neutral mutations with $| Δ \bar{F} | < 1$ (whose scoring as beneficial/deleterious is more sensitive to sampling noise) are olive green. Red lines connect the series of mutations that occurred, each shown as a small red dot. Sign epistasis can be seen from the presence of yellow $\leftrightarrow$ dark aubergine changes without an immediately preceding red dot at that residue. Residues with extended stretches of dark aubergine continuing to the last substitution have become entrenched. HP, hydrophobic–polar.
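The color coding and the sign-epistasis criterion described in this legend can be expressed as a small sketch. The data layout is an assumption made for illustration: `effects[t][i]` holds the fitness effect $Δ \bar{F}$ of substituting residue i on the genetic background present at step t, and `mutated_sites[t]` is the residue that was substituted between steps t and t + 1.

```python
from typing import List, Tuple


def classify(delta_f: float, threshold: float = 1.0) -> str:
    """Bin a fitness effect using the thresholds given in the legend."""
    if delta_f >= threshold:
        return "beneficial"      # drawn in yellow
    if delta_f <= -threshold:
        return "deleterious"     # drawn in dark aubergine
    return "near-neutral"        # drawn in olive green


def sign_epistasis_events(effects: List[List[float]],
                          mutated_sites: List[int]) -> List[Tuple[int, int]]:
    """Return (step, residue) pairs whose effect switched sign between steps
    because of a substitution at a different residue."""
    events = []
    for t, site in enumerate(mutated_sites):
        for i in range(len(effects[t])):
            if i == site:
                continue  # the substituted residue changes sign trivially
            pair = {classify(effects[t][i]), classify(effects[t + 1][i])}
            if pair == {"beneficial", "deleterious"}:
                events.append((t + 1, i))
    return events


# Hypothetical numbers: residue 0 is substituted, after which residue 1
# switches from deleterious to beneficial while residue 2 stays near-neutral.
effects = [[2.0, -3.0, 0.5],
           [-2.0, 1.5, 0.2]]
events = sign_epistasis_events(effects, mutated_sites=[0])
```

Near-neutral effects are excluded from the switch test on purpose, since their sign is the part most sensitive to sampling noise.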

Adaptive paths that terminate at unfoldable peaks are distinguished by the rapid removal of H residues (Figure 5A). This represents a “lazy” strategy of reducing aggregation risk by simply reducing overall hydrophobicity (we use the word strategy loosely to describe the mutations that happen to fix; the trajectories underlying Figure 5 are all evolving using the same random adaptive walk algorithm). By contrast, adaptive paths that eventually attain reliability maintain H residues for longer, or even transiently accumulate them. In spite of this, in the early stages of adaptation, reductions in aggregation propensity A may still be the main source of gains in fitness F (Figure 4). Therefore, these paths represent an “industrious” strategy of tinkering with the fold to both reduce aggregation propensity and increase stability. Interestingly, initial fitness gains are comparable in both cases (Figure 5B), but the purging of H residues results in the depletion of beneficial mutations (Figure 5C). Moreover, although the distribution of beneficial fitness effects for mutations fixed along adaptive paths has a similar median, this distribution has a substantially longer tail for adaptive paths resulting in reliability (Figure 5C). This extended tail reflects the fact that although most fixed mutations either need to increase aggregation risk to improve stability or do the converse, the industrious strategy is more effective at finding the rare mutations that improve both (Figure 5D).
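The distinction between mutations that merely improve the balance of stability and aggregation and the rare mutations that improve both can be illustrated with a short sketch. It assumes, purely for illustration, that fitness combines the two traits linearly as $F = S - α A$ ; this form and every name in the code are assumptions of the example rather than definitions taken from this section.

```python
from typing import NamedTuple


class Effect(NamedTuple):
    d_stability: float    # change in stability S caused by the mutation
    d_aggregation: float  # change in aggregation propensity A


def delta_fitness(effect: Effect, alpha: float = 1.0) -> float:
    """Fitness change under the ASSUMED linear form F = S - alpha * A."""
    return effect.d_stability - alpha * effect.d_aggregation


def classify_pleiotropy(effect: Effect, alpha: float = 1.0) -> str:
    """Separate net-beneficial mutations by how they achieve their gain."""
    if delta_fitness(effect, alpha) <= 0:
        return "not beneficial"
    if effect.d_stability > 0 and effect.d_aggregation < 0:
        return "improves both"         # escapes the negative pleiotropy
    # Everything else that is net beneficial, including mutations that
    # leave one of the two traits unchanged.
    return "improves balance only"


# Hypothetical effects: stability gained at an aggregation cost, aggregation
# shed at a stability cost, and a mutation that improves both traits.
examples = [Effect(3, 2), Effect(-1, -2), Effect(1, -1)]
labels = [classify_pleiotropy(e) for e in examples]
```

Under this assumed form, raising α makes aggregation costs weigh more heavily, so a mutation such as `Effect(3, 2)` that is beneficial at α = 1 is no longer beneficial at α = 2.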

Figure 5. The fate of an adaptive path depends strongly on the evolution of its hydrophobicity. (A) Adaptive paths ending at unfoldable peaks (blue) remove more hydrophobic residues to reduce aggregation propensity compared with those that end at reliable peaks (red). (B and C) Initial fitness gains are similar for paths terminating at reliable and unfoldable peaks, but the latter run out of beneficial mutations sooner, and are less successful at breaking the negative pleiotropy between S and A. Lines show the mean (A and B) and median [(C) to compensate for skew] from the same 987 sequences as in Figure 3. Shaded areas show 10th and 90th percentiles. (D) The two-dimensional distribution of fitness effects for mutations that fixed at the 10th and 30th substitution.

Discussion

We have shown that an HP lattice protein stability-aggregation fitness landscape exhibits pervasive sign epistasis, resulting in a high density of local peaks, evolutionary mazes, and strong path dependence of evolutionary outcomes. Despite the ruggedness of the stability-aggregation landscape, evolution from low-fitness random initial sequences frequently reaches high fitness without crossing valleys. These high-fitness adaptive paths often require back-and-forth H $\leftrightarrow$ P mutations at some loci, and may consequently be longer than the Hamming distance between initial and final HP sequences, and sometimes even longer than L. Therefore, we confirm, in a biologically grounded model landscape, the importance of mazes posited by Kaznatcheev (2019) on the basis of more abstract landscape models.

However, all of our adaptive paths are of order L, much shorter than the $O (2^{L})$ paths that are hypothetically possible under sign epistasis (Kaznatcheev 2019). Intuitively, the latter exponential paths would require an amount of conformational rearrangement that is not consistent with gradual improvement of stability or aggregation potential at each adaptive step in our model, because even along maze-like paths, some alleles become structurally critical for maintaining accumulated fitness gains. Further mutations at the corresponding loci would be highly deleterious (right panel in Figure 4). Such entrenchment (Ashenberg et al. 2013; Shah et al. 2015; McCandlish et al. 2016; Flynn et al. 2017) appears to make exponential-length adaptive paths unlikely, at least at the level of individual proteins.
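For a sense of scale, take the sequence length $L = 60$ considered here and ignore constant factors. An exponential-length path would then exceed an order-L path by a factor of roughly

$$ \frac{2^{L}}{L} = \frac{2^{60}}{60} \approx \frac{1.15 \times 10^{18}}{60} \approx 1.9 \times 10^{16} . $$

This is only an order-of-magnitude comparison, but it shows how far the observed path lengths lie from the hypothetical worst case.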

One remarkable biological feature of our stability-aggregation landscape, which is quite different from random field landscapes, is the presence of strong anisotropy: the hydrophobicity “dimension” is a major determinant of how far (Hamming distance) and how high (fitness) it is possible to move along adaptive paths. Reliable evolved sequences tend to start at higher hydrophobicity (Figure 2C) and maintain high hydrophobicity for longer (Figure 5A) than unfoldable sequences. Since the early stages in the evolution of a random sequence will generally entail high conformational diversity, many H residues will not have consistent structural roles and will therefore contribute little to expected stability $\bar{S}$ . What we have termed the lazy strategy is to simply remove these H residues to reduce aggregation risk. However, since H residues are the raw material needed to fold, this risks closing off future avenues for innovation. By contrast, what we term the industrious strategy requires putting the existing H residues to better structural use without removing them (a distinct mechanism from adding “surplus” stability to enable a stability-reducing innovation; Bloom et al. 2006), retaining conformational options and resulting in a longer tail in the distribution of beneficial mutations.

Similar observations apply to H clustering. Despite the fact that the clustering metric is constructed to remove the direct effect of hydrophobicity, the two are highly correlated. This is because H residues are best situated in particular, anticlustered configurations, and the more H residues there are, the more such anticlustered regions of sequence there are. For this reason, we cannot separately parse out the evolutionary determinants of clustering from those of hydrophobicity in our model. Both are driven by the dynamics of adaptive walks from primitive to more sophisticated solutions to the folding problem.

Remarkably, although evolutionary outcomes in our model are highly nonreproducible, even at the coarse level of fitness $\bar{F}$ (Figure 3), both hydrophobicity and H clustering ultimately decline for almost all reliable sequences (Figure 2, E and F). This appears to be a generic consequence of improving the chance to form a complete fold; increases in H clustering primarily occur in unreliable sequences, which have high final hydrophobicity and, consequently, high conformational diversity.

A decline in H clustering over time along the most successful adaptive paths is compatible with the empirical finding of lower clustering in older genes, but our decline in hydrophobicity seems to contradict the empirical finding of higher hydrophobicity in older genes (Foy et al. 2019). It is possible that the empirical trends are driven not by adaptive paths of descent with modification, but by the differential retention of higher fitness peaks. In other words, the lazy strategy might generate new intrinsically disordered local peaks, scored as $S = 0$ by us, but these are dispensable and replaceable. Conditional on long-term persistence, the highest fitness peaks corresponding to our reliable sequences have relatively high hydrophobicity. Selection at this higher level of organization, i.e., among protein folds rather than among alleles for the same protein fold, might explain why the empirical trends are present over such extraordinarily long time periods.

Importantly, protein length L also evolves in real proteins, with older proteins tending to be longer. A decreasing surface:volume ratio is an alternative hypothesis for why hydrophobicity might increase over time. The H zipping algorithm is well suited to handling variable-length sequences, but substantial extensions would need to be made to the simple single-residue substitution mutation model presented here to account for insertions/deletions or stop codon loss.

Our HP zipping model shares the usual disclaimer of all HP lattice models: though grossly simplifying reality, they nevertheless capture essential aspects of how protein folding is driven by contact between H residues subject to conformational constraints (Dill and Fiebig 1994; Serohijos and Shakhnovich 2014). Our zipping model makes additional assumptions that add further caveats. It seems unlikely that our assumption of a single, fixed nucleation site applies to real proteins. In principle, our findings could be sensitive to this assumption since it closes off adaptive paths whereby separately nucleated “subdomains” separated by disordered regions combine into larger folded structures. However, we find that this scenario almost never occurs: in simulations where multiple nucleations are allowed, adaptive evolution almost always reaches a local peak consisting of a fragmented string of small ( $\sim 10$ residues), highly stable, independently nucleated structures (Figure 1B). A related caveat is that we preclude H–H contacts from breaking once they have formed; allowing such breakage in a more sophisticated kinetic model might provide ways for separately nucleated substructures to merge. While we cannot rule out such possibilities, it seems plausible that the outcome of the rapid initial phase of folding represented by zipping is a good indicator of the stability and aggregation properties of a sequence. Finally, we only considered sequences of one length, $L = 60$ . The structure of the $\bar{F} (x)$ landscape is highly sensitive to length for much shorter sequences ( $L < 20$ ), for which the 10-residue structures shown in Figure 1B largely determine global landscape structure. We avoided this short-sequence scenario since such structures are too small to represent a functional protein domain and are also sensitive to the assumed lattice topology (a square lattice in our case). Longer sequences ( $L > 20$ ) are far less sensitive to this artifact, displaying a much wider variety of high-fitness folds, and we did not observe any substantive length dependence of adaptive walk properties for $20 < L \leq 60$ . We cannot rule out the possibility that sequences much longer than $L = 60$ might evolve differently (e.g., reaching high fitness might become increasingly difficult), although it is not clear what aspect of our model would be sensitive to length in this way, and our findings encompass sequences long enough to plausibly constitute a protein domain.

Frustration between stability and aggregation/misfolding creates an evolutionary quandary in which mutations that increase (decrease) a positive attribute (stability) often also increase (decrease) a negative attribute (aggregation/misfolding propensity). Beneficial mutations will consequently often only constitute a small proportion of all possible mutations (Figure 4). Moreover, akin to a tradeoff, many of these beneficial mutations only improve the balance between stability and aggregation, not both independently (e.g., by gaining more in stability than aggregation potential; Figure 5D). Given these constraints on adaptation, how are adaptive trajectories leading from globally low to globally high fitness possible? The answer is that the quandary between stability and aggregation is not a strict tradeoff: there exist (rare) mutations that are able to escape the status quo, both gaining stability and lowering aggregation potential.

It seems plausible that fitness landscapes with similar hard but possible properties could occur more widely than the protein folding landscape studied here. Frustration seems to be a universal feature of biological macromolecules (Ferreiro et al. 2014, 2018), where conflict occurs between the tendency to engage in beneficial interactions and the tendency to engage in deleterious interactions, i.e., between affinity and specificity. Conflict between affinity and specificity also constrains evolution at the cellular level of protein–protein interaction networks; similar to our findings, evolutionary adaptation can still occur by adjusting the fitness consequences of this inescapable conflict (Heo et al. 2011). More broadly, frustration effects occur across scales of biological organization (Wolf et al. 2018). The prevalence of constraint-breaking mutations is less apparent, but environmental change, which is also ubiquitous, can act as a transient constraint breaker (de Vos et al. 2015). Our protein fitness landscape model can thus be viewed as a case study of the evolutionary effects of frustration, with lessons that may extend well beyond protein evolution.

Acknowledgments

We thank David McCandlish for comments on the manuscript, Artem Kaznatcheev for discussions, and the Indiana University Pervasive Technology Institute for providing high performance computer Karst resources that have contributed to the research results reported within this paper. J.B. and J.M. were funded by the John Templeton Foundation (60814), and J.B. was funded by the Environmental Resilience Institute at Indiana University. This research was supported in part by Lilly Endowment, Inc., through its support for the Indiana University Pervasive Technology Institute, and in part by the Indiana METACyt Initiative. The Indiana METACyt Initiative at IU was also supported in part by Lilly Endowment, Inc.

Footnotes

Supplemental material available at figshare: https://doi.org/10.25386/genetics.11904921.

Communicating editor: L. Wahl

Literature Cited

Agarwala A., and Fisher D. S., 2019. Adaptive walks on high-dimensional fitness landscapes and seascapes with distance-dependent statistics. Theor. Popul. Biol. 130: 13–49. 10.1016/j.tpb.2019.09.011
Ancel L. W., and Fontana W., 2000. Plasticity, evolvability, and modularity in RNA. J. Exp. Zool. 288: 242–283.
Ashenberg O., Gong L. I., and Bloom J. D., 2013. Mutational effects on stability are largely conserved during protein evolution. Proc. Natl. Acad. Sci. USA 110: 21071–21076. 10.1073/pnas.1314781111
Bastolla U., Dehouck Y., and Echave J., 2017. What evolution tells us about protein physics, and protein physics tells us about evolution. Curr. Opin. Struct. Biol. 42: 59–66. 10.1016/j.sbi.2016.10.020
Bloom J. D., Labthavikul S. T., Otey C. R., and Arnold F. H., 2006. Protein stability promotes evolvability. Proc. Natl. Acad. Sci. USA 103: 5869–5874. 10.1073/pnas.0510098103
Chan H. S., and Bornberg-Bauer E., 2002. Perspectives on protein evolution from simple exact models. Appl. Bioinformatics 50: 121–144.
DePristo M. A., Weinreich D. M., and Hartl D. L., 2005. Missense meanderings in sequence space: a biophysical view of protein evolution. Nat. Rev. Genet. 6: 678–687. 10.1038/nrg1672
DePristo M. A., Hartl D. L., and Weinreich D. M., 2007. Mutational reversions during adaptive protein evolution. Mol. Biol. Evol. 24: 1608–1610. 10.1093/molbev/msm118
de Visser J. A., and Krug J., 2014. Empirical fitness landscapes and the predictability of evolution. Nat. Rev. Genet. 15: 480–490. 10.1038/nrg3744
de Vos M. G., Dawid A., Sunderlikova V., and Tans S. J., 2015. Breaking evolutionary constraint with a tradeoff ratchet. Proc. Natl. Acad. Sci. USA 112: 14906–14911. 10.1073/pnas.1510282112
Dill K. A., and Fiebig K. M., 1994. Hydrophobic zippers: a conformational search strategy for proteins, pp. 109–113 in Statistical Mechanics, Protein Structure, and Protein Substrate Interactions. Springer Science+Business Media, New York. 10.1007/978-1-4899-1349-4_11
Dill K. A., Fiebig K. M., and Chan H. S., 1993. Cooperativity in protein-folding kinetics. Proc. Natl. Acad. Sci. USA 90: 1942–1946. 10.1073/pnas.90.5.1942
Ferreiro D. U., Komives E. A., and Wolynes P. G., 2014. Frustration in biomolecules. Q. Rev. Biophys. 47: 285–363. 10.1017/S0033583514000092
Ferreiro D. U., Komives E. A., and Wolynes P. G., 2018. Frustration, function and folding. Curr. Opin. Struct. Biol. 48: 68–73. 10.1016/j.sbi.2017.09.006
Flynn W. F., Haldane A., Torbett B. E., and Levy R. M., 2017. Inference of epistatic effects leading to entrenchment and drug resistance in HIV-1 protease. Mol. Biol. Evol. 34: 1291–1306. 10.1093/molbev/msx095
Fontana W., and Schuster P., 1998. Continuity in evolution: on the nature of transitions. Science 280: 1451–1455. 10.1126/science.280.5368.1451
Foy S. G., Wilson B. A., Bertram J., Cordes M. H., and Masel J., 2019. A shift in aggregation avoidance strategy marks a long-term direction to protein evolution. Genetics 211: 1345–1355. 10.1534/genetics.118.301719
Fragata I., Blanckaert A., Louro M. A. D., Liberles D. A., and Bank C., 2019. Evolution in the light of fitness landscape theory. Trends Ecol. Evol. 34: 69–82. 10.1016/j.tree.2018.10.009
Gavrilets S., 1999. A dynamical theory of speciation on holey adaptive landscapes. Am. Nat. 154: 1–22. 10.1086/303217
Gershenson A., Gierasch L. M., Pastore A., and Radford S. E., 2014. Energy landscapes of functional proteins are inherently risky. Nat. Chem. Biol. 10: 884–891. 10.1038/nchembio.1670
Govindarajan S., and Goldstein R. A., 1997. The foldability landscape of model proteins. Biopolymers 42: 427–438.
Guo Y., Vucelja M., and Amir A., 2019. Stochastic tunneling across fitness valleys can give rise to a logarithmic long-term fitness trajectory. Sci. Adv. 5: eaav3842. 10.1126/sciadv.aav3842
Heo M., Maslov S., and Shakhnovich E., 2011. Topology of protein interaction network shapes protein abundances and strengths of their functional and nonspecific interactions. Proc. Natl. Acad. Sci. USA 108: 4258–4263. 10.1073/pnas.1009392108
Hwang S., Schmiegelt B., Ferretti L., and Krug J., 2018. Universality classes of interaction structures for NK fitness landscapes. J. Stat. Phys. 172: 226–278. 10.1007/s10955-018-1979-z
Irbäck A., and Sandelin E., 2000. On hydrophobicity correlations in protein chains. Biophys. J. 79: 2252–2258. 10.1016/S0006-3495(00)76472-1
Irbäck A., Peterson C., and Potthast F., 1996. Evidence for nonrandom hydrophobicity structures in protein chains. Proc. Natl. Acad. Sci. USA 93: 9533–9538. 10.1073/pnas.93.18.9533
Kaznatcheev A., 2019. Computational complexity as an ultimate constraint on evolution. Genetics 212: 245–265. 10.1534/genetics.119.302000
Kondrashov D. A., and Kondrashov F. A., 2015. Topological features of rugged fitness landscapes in sequence space. Trends Genet. 31: 24–33. 10.1016/j.tig.2014.09.009
Lau K. F., and Dill K. A., 1989. A lattice statistical mechanics model of the conformational and sequence spaces of proteins. Macromolecules 22: 3986–3997. 10.1021/ma00200a030
Levy E. D., De S., and Teichmann S. A., 2012. Cellular crowding imposes global constraints on the chemistry and evolution of proteomes. Proc. Natl. Acad. Sci. USA 109: 20461–20466. 10.1073/pnas.1209312109
Li L., 2018. Phase transition for accessibility percolation on hypercubes. J. Theor. Probab. 31: 2072–2111. 10.1007/s10959-017-0769-x
Lipman D. J., and Wilbur W. J., 1991. Modelling neutral and selective evolution of protein folding. Proc. Biol. Sci. 245: 7–11. 10.1098/rspb.1991.0081
Marchi J., Galpern E. A., Espada R., Ferreiro D. U., Walczak A. M. et al., 2019. Size and structure of the sequence space of repeat proteins. PLoS Comput. Biol. 15: e1007282. 10.1371/journal.pcbi.1007282
Maynard Smith J., 1970. Natural selection and the concept of a protein space. Nature 225: 563–564. 10.1038/225563a0
McCandlish D. M., Shah P., and Plotkin J. B., 2016. Epistasis and the dynamics of reversion in molecular evolution. Genetics 203: 1335–1351. 10.1534/genetics.116.188961
Neme R., and Tautz D., 2013. Phylogenetic patterns of emergence of new genes support a model of frequent de novo evolution. BMC Genomics 14: 117. 10.1186/1471-2164-14-117
Palmer A. C., Toprak E., Baym M., Kim S., Veres A. et al., 2015. Delayed commitment to evolutionary fate in antibiotic resistance fitness landscapes. Nat. Commun. 6: 7385. 10.1038/ncomms8385
Pollock D. D., Thiltgen G., and Goldstein R. A., 2012. Amino acid coevolution induces an evolutionary Stokes shift. Proc. Natl. Acad. Sci. USA 109: E1352–E1359. 10.1073/pnas.1120084109
Popova A. V., Safina K. R., Ptushenko V. V., Stolyarova A. V., Favorov A. V. et al., 2019. Allele-specific nonstationarity in evolution of influenza A virus surface proteins. Proc. Natl. Acad. Sci. USA 116: 21104–21112. 10.1073/pnas.1904246116
Sarkisyan K. S., Bolotin D. A., Meer M. V., Usmanova D. R., Mishin A. S. et al., 2016. Local fitness landscape of the green fluorescent protein. Nature 533: 397–401. 10.1038/nature17995
Serohijos A. W., and Shakhnovich E. I., 2014. Merging molecular mechanism and evolution: theory and computation at the interface of biophysics and evolutionary population genetics. Curr. Opin. Struct. Biol. 26: 84–91. 10.1016/j.sbi.2014.05.005
Shah P., McCandlish D. M., and Plotkin J. B., 2015. Contingency and entrenchment in protein evolution under purifying selection. Proc. Natl. Acad. Sci. USA 112: E3226–E3235. 10.1073/pnas.1412933112
Stadler P. F., and Happel R., 1999. Random field models for fitness landscapes. J. Math. Biol. 38: 435–478. 10.1007/s002850050156
Starr T. N., and Thornton J. W., 2016. Epistasis in protein evolution. Protein Sci. 25: 1204–1218. 10.1002/pro.2897
Starr T. N., Flynn J. M., Mishra P., Bolon D. N., and Thornton J. W., 2018. Pervasive contingency and entrenchment in a billion years of Hsp90 evolution. Proc. Natl. Acad. Sci. USA 115: 4453–4458. 10.1073/pnas.1718133115
van Nimwegen E., and Crutchfield J. P., 2000. Metastable evolutionary dynamics: crossing fitness barriers or escaping via neutral paths? Bull. Math. Biol. 62: 799–848. 10.1006/bulm.2000.0180
van Nimwegen E., Crutchfield J. P., and Huynen M., 1999. Neutral evolution of mutational robustness. Proc. Natl. Acad. Sci. USA 96: 9716–9720. 10.1073/pnas.96.17.9716
Weinreich D. M., Watson R. A., and Chao L., 2005. Perspective: sign epistasis and genetic constraint on evolutionary trajectories. Evolution 59: 1165–1174.
Weissman D. B., Desai M. M., Fisher D. S., and Feldman M. W., 2009. The rate at which asexual populations cross fitness valleys. Theor. Popul. Biol. 75: 286–300. 10.1016/j.tpb.2009.02.006
Whitlock M. C., Phillips P. C., Moore F. B.-G., and Tonsor S. J., 1995. Multiple fitness peaks and epistasis. Annu. Rev. Ecol. Syst. 26: 601–629. 10.1146/annurev.es.26.110195.003125
Wilke C. O., 2001. Adaptive evolution on neutral networks. Bull. Math. Biol. 63: 715–730. 10.1006/bulm.2001.0244
Wiser M. J., Ribeck N., and Lenski R. E., 2013. Long-term dynamics of adaptation in asexual populations. Science 342: 1364–1367. 10.1126/science.1243357
Wolf Y. I., Katsnelson M. I., and Koonin E. V., 2018. Physical foundations of biological complexity. Proc. Natl. Acad. Sci. USA 115: E8678–E8687. 10.1073/pnas.1807890115
Wright S., 1932. The roles of mutation, inbreeding, crossbreeding, and selection in evolution, pp. 356–366 in Proceedings of the Sixth International Congress on Genetics, Vol. 1, edited by D. F. Jones. Brooklyn Botanic Garden, New York.
Wu N. C., Dai L., Olson C. A., Lloyd-Smith J. O., and Sun R., 2016. Adaptation in protein fitness landscapes is facilitated by indirect paths. Elife 5: e16965. 10.7554/eLife.16965
Zhao X., 2008. Advances on protein folding simulations based on the lattice HP models with natural computing. Appl. Soft Comput. 8: 1029–1040. 10.1016/j.asoc.2007.03.012

Data Availability Statement

[bib1] Agarwala A., and Fisher D. S., 2019. Adaptive walks on high-dimensional fitness landscapes and seascapes with distance-dependent statistics. Theor. Popul. Biol. 130: 13–49. 10.1016/j.tpb.2019.09.011 [DOI] [PubMed] [Google Scholar]

[bib2] Ancel L. W., and Fontana W., 2000. Plasticity, evolvability, and modularity in RNA. J. Exp. Zool. 288: 242–283. [DOI] [PubMed] [Google Scholar]

[bib3] Ashenberg O., Gong L. I., and Bloom J. D., 2013. Mutational effects on stability are largely conserved during protein evolution. Proc. Natl. Acad. Sci. USA 110: 21071–21076. 10.1073/pnas.1314781111 [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib4] Bastolla U., Dehouck Y., and Echave J., 2017. What evolution tells us about protein physics, and protein physics tells us about evolution. Curr. Opin. Struct. Biol. 42: 59–66. 10.1016/j.sbi.2016.10.020 [DOI] [PubMed] [Google Scholar]

[bib5] Bloom J. D., Labthavikul S. T., Otey C. R., and Arnold F. H., 2006. Protein stability promotes evolvability. Proc. Natl. Acad. Sci. USA 103: 5869–5874. 10.1073/pnas.0510098103 [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib6] Chan H. S., and Bornberg-Bauer E., 2002. Perspectives on protein evolution from simple exact models. Appl. Bioinformatics 50: 121–144. [PubMed] [Google Scholar]

[bib7] DePristo M. A., Weinreich D. M., and Hartl D. L., 2005. Missense meanderings in sequence space: a biophysical view of protein evolution. Nat. Rev. Genet. 6: 678–687. 10.1038/nrg1672 [DOI] [PubMed] [Google Scholar]

[bib8] DePristo M. A., Hartl D. L., and Weinreich D. M., 2007. Mutational reversions during adaptive protein evolution. Mol. Biol. Evol. 24: 1608–1610. 10.1093/molbev/msm118 [DOI] [PubMed] [Google Scholar]

[bib9] de Visser J. A., and Krug J., 2014. Empirical fitness landscapes and the predictability of evolution. Nat. Rev. Genet. 15: 480–490. 10.1038/nrg3744 [DOI] [PubMed] [Google Scholar]

[bib10] de Vos M. G., Dawid A., Sunderlikova V., and Tans S. J., 2015. Breaking evolutionary constraint with a tradeoff ratchet. Proc. Natl. Acad. Sci. USA 112: 14906–14911. 10.1073/pnas.1510282112 [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib11] Dill K. A., and Fiebig K. M., 1994. Hydrophobic zippers: a conformational search strategy for proteins, pp. 109–113 in Statistical Mechanics, Protein Structure, and Protein Substrate Interactions. Springer Science+Business Media, New York: 10.1007/978-1-4899-1349-4_11 [DOI] [Google Scholar]

[bib12] Dill K. A., Fiebig K. M., and Chan H. S., 1993. Cooperativity in protein-folding kinetics. Proc. Natl. Acad. Sci. USA 90: 1942–1946. 10.1073/pnas.90.5.1942 [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib13] Ferreiro D. U., Komives E. A., and Wolynes P. G., 2014. Frustration in biomolecules. Q. Rev. Biophys. 47: 285–363. 10.1017/S0033583514000092 [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib14] Ferreiro D. U., Komives E. A., and Wolynes P. G., 2018. Frustration, function and folding. Curr. Opin. Struct. Biol. 48: 68–73. 10.1016/j.sbi.2017.09.006 [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib15] Flynn W. F., Haldane A., Torbett B. E., and Levy R. M., 2017. Inference of epistatic effects leading to entrenchment and drug resistance in hiv-1 protease. Mol. Biol. Evol. 34: 1291–1306. 10.1093/molbev/msx095 [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib16] Fontana W., and Schuster P., 1998. Continuity in evolution: on the nature of transitions. Science 280: 1451–1455. 10.1126/science.280.5368.1451 [DOI] [PubMed] [Google Scholar]

[bib17] Foy S. G., Wilson B. A., Bertram J., Cordes M. H., and Masel J., 2019. A shift in aggregation avoidance strategy marks a long-term direction to protein evolution. Genetics 211: 1345–1355. 10.1534/genetics.118.301719 [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib18] Fragata I., Blanckaert A., Louro M. A. D., Liberles D. A., and Bank C., 2019. Evolution in the light of fitness landscape theory. Trends Ecol. Evol. 34: 69–82. 10.1016/j.tree.2018.10.009 [DOI] [PubMed] [Google Scholar]

[bib19] Gavrilets S., 1999. A dynamical theory of speciation on holey adaptive landscapes. Am. Nat. 154: 1–22. 10.1086/303217

[bib20] Gershenson A., Gierasch L. M., Pastore A., and Radford S. E., 2014. Energy landscapes of functional proteins are inherently risky. Nat. Chem. Biol. 10: 884–891. 10.1038/nchembio.1670

[bib21] Govindarajan S., and Goldstein R. A., 1997. The foldability landscape of model proteins. Biopolymers 42: 427–438.

[bib22] Guo Y., Vucelja M., and Amir A., 2019. Stochastic tunneling across fitness valleys can give rise to a logarithmic long-term fitness trajectory. Sci. Adv. 5: eaav3842. 10.1126/sciadv.aav3842

[bib23] Heo M., Maslov S., and Shakhnovich E., 2011. Topology of protein interaction network shapes protein abundances and strengths of their functional and nonspecific interactions. Proc. Natl. Acad. Sci. USA 108: 4258–4263. 10.1073/pnas.1009392108

[bib24] Hwang S., Schmiegelt B., Ferretti L., and Krug J., 2018. Universality classes of interaction structures for NK fitness landscapes. J. Stat. Phys. 172: 226–278. 10.1007/s10955-018-1979-z

[bib25] Irbäck A., and Sandelin E., 2000. On hydrophobicity correlations in protein chains. Biophys. J. 79: 2252–2258. 10.1016/S0006-3495(00)76472-1

[bib26] Irbäck A., Peterson C., and Potthast F., 1996. Evidence for nonrandom hydrophobicity structures in protein chains. Proc. Natl. Acad. Sci. USA 93: 9533–9538. 10.1073/pnas.93.18.9533

[bib27] Kaznatcheev A., 2019. Computational complexity as an ultimate constraint on evolution. Genetics 212: 245–265. 10.1534/genetics.119.302000

[bib28] Kondrashov D. A., and Kondrashov F. A., 2015. Topological features of rugged fitness landscapes in sequence space. Trends Genet. 31: 24–33. 10.1016/j.tig.2014.09.009

[bib29] Lau K. F., and Dill K. A., 1989. A lattice statistical mechanics model of the conformational and sequence spaces of proteins. Macromolecules 22: 3986–3997. 10.1021/ma00200a030

[bib30] Levy E. D., De S., and Teichmann S. A., 2012. Cellular crowding imposes global constraints on the chemistry and evolution of proteomes. Proc. Natl. Acad. Sci. USA 109: 20461–20466. 10.1073/pnas.1209312109

[bib31] Li L., 2018. Phase transition for accessibility percolation on hypercubes. J. Theor. Probab. 31: 2072–2111. 10.1007/s10959-017-0769-x

[bib32] Lipman D. J., and Wilbur W. J., 1991. Modelling neutral and selective evolution of protein folding. Proc. Biol. Sci. 245: 7–11. 10.1098/rspb.1991.0081

[bib33] Marchi J., Galpern E. A., Espada R., Ferreiro D. U., Walczak A. M. et al., 2019. Size and structure of the sequence space of repeat proteins. PLoS Comput. Biol. 15: e1007282. 10.1371/journal.pcbi.1007282

[bib34] Maynard Smith J., 1970. Natural selection and the concept of a protein space. Nature 225: 563–564. 10.1038/225563a0

[bib35] McCandlish D. M., Shah P., and Plotkin J. B., 2016. Epistasis and the dynamics of reversion in molecular evolution. Genetics 203: 1335–1351. 10.1534/genetics.116.188961

[bib36] Neme R., and Tautz D., 2013. Phylogenetic patterns of emergence of new genes support a model of frequent de novo evolution. BMC Genomics 14: 117. 10.1186/1471-2164-14-117

[bib37] Palmer A. C., Toprak E., Baym M., Kim S., Veres A. et al., 2015. Delayed commitment to evolutionary fate in antibiotic resistance fitness landscapes. Nat. Commun. 6: 7385. 10.1038/ncomms8385

[bib38] Pollock D. D., Thiltgen G., and Goldstein R. A., 2012. Amino acid coevolution induces an evolutionary Stokes shift. Proc. Natl. Acad. Sci. USA 109: E1352–E1359. 10.1073/pnas.1120084109

[bib39] Popova A. V., Safina K. R., Ptushenko V. V., Stolyarova A. V., Favorov A. V. et al., 2019. Allele-specific nonstationarity in evolution of influenza A virus surface proteins. Proc. Natl. Acad. Sci. USA 116: 21104–21112. 10.1073/pnas.1904246116

[bib40] Sarkisyan K. S., Bolotin D. A., Meer M. V., Usmanova D. R., Mishin A. S. et al., 2016. Local fitness landscape of the green fluorescent protein. Nature 533: 397–401. 10.1038/nature17995

[bib41] Serohijos A. W., and Shakhnovich E. I., 2014. Merging molecular mechanism and evolution: theory and computation at the interface of biophysics and evolutionary population genetics. Curr. Opin. Struct. Biol. 26: 84–91. 10.1016/j.sbi.2014.05.005

[bib42] Shah P., McCandlish D. M., and Plotkin J. B., 2015. Contingency and entrenchment in protein evolution under purifying selection. Proc. Natl. Acad. Sci. USA 112: E3226–E3235. 10.1073/pnas.1412933112

[bib43] Stadler P. F., and Happel R., 1999. Random field models for fitness landscapes. J. Math. Biol. 38: 435–478. 10.1007/s002850050156

[bib44] Starr T. N., and Thornton J. W., 2016. Epistasis in protein evolution. Protein Sci. 25: 1204–1218. 10.1002/pro.2897

[bib45] Starr T. N., Flynn J. M., Mishra P., Bolon D. N., and Thornton J. W., 2018. Pervasive contingency and entrenchment in a billion years of Hsp90 evolution. Proc. Natl. Acad. Sci. USA 115: 4453–4458. 10.1073/pnas.1718133115

[bib46] van Nimwegen E., and Crutchfield J. P., 2000. Metastable evolutionary dynamics: crossing fitness barriers or escaping via neutral paths? Bull. Math. Biol. 62: 799–848. 10.1006/bulm.2000.0180

[bib47] van Nimwegen E., Crutchfield J. P., and Huynen M., 1999. Neutral evolution of mutational robustness. Proc. Natl. Acad. Sci. USA 96: 9716–9720. 10.1073/pnas.96.17.9716

[bib48] Weinreich D. M., Watson R. A., and Chao L., 2005. Perspective: sign epistasis and genetic constraint on evolutionary trajectories. Evolution 59: 1165–1174.

[bib49] Weissman D. B., Desai M. M., Fisher D. S., and Feldman M. W., 2009. The rate at which asexual populations cross fitness valleys. Theor. Popul. Biol. 75: 286–300. 10.1016/j.tpb.2009.02.006

[bib50] Whitlock M. C., Phillips P. C., Moore F. B.-G., and Tonsor S. J., 1995. Multiple fitness peaks and epistasis. Annu. Rev. Ecol. Syst. 26: 601–629. 10.1146/annurev.es.26.110195.003125

[bib51] Wilke C. O., 2001. Adaptive evolution on neutral networks. Bull. Math. Biol. 63: 715–730. 10.1006/bulm.2001.0244

[bib52] Wiser M. J., Ribeck N., and Lenski R. E., 2013. Long-term dynamics of adaptation in asexual populations. Science 342: 1364–1367. 10.1126/science.1243357

[bib53] Wolf Y. I., Katsnelson M. I., and Koonin E. V., 2018. Physical foundations of biological complexity. Proc. Natl. Acad. Sci. USA 115: E8678–E8687. 10.1073/pnas.1807890115

[bib54] Wright S., 1932. The roles of mutation, inbreeding, crossbreeding, and selection in evolution, pp. 356–366 in Proceedings of the Sixth International Congress on Genetics, Vol. 1, edited by D. F. Jones. Brooklyn Botanic Garden, New York.

[bib55] Wu N. C., Dai L., Olson C. A., Lloyd-Smith J. O., and Sun R., 2016. Adaptation in protein fitness landscapes is facilitated by indirect paths. Elife 5: e16965. 10.7554/eLife.16965

[bib56] Zhao X., 2008. Advances on protein folding simulations based on the lattice HP models with natural computing. Appl. Soft Comput. 8: 1029–1040. 10.1016/j.asoc.2007.03.012
