Author manuscript; available in PMC: 2017 May 1.
Published in final edited form as: Biochim Biophys Acta. 2015 Oct 20;1857(5):485–492. doi: 10.1016/j.bbabio.2015.10.002

Fast, Cheap and Out of Control - Insights into Thermodynamic and Informatic Constraints on Natural Protein Sequences from de novo Protein Design

J M Brisendine, Ronald L Koder*
PMCID: PMC4856154  NIHMSID: NIHMS778190  PMID: 26498191

Abstract

The accumulated results of thirty years of rational and computational de novo protein design have taught us important lessons about the stability, information content, and evolution of natural proteins. First, de novo protein design has complicated the assertion that biological function is equivalent to biological structure, demonstrating the capacity to abstract active sites from natural contexts and paste them into non-native topologies without loss of function. The structure-function relationship has thus been revealed to be either a generality or strictly true only in a local sense. Second, the simplification to "maquette" topologies carried out by rational protein design has also demonstrated that even sophisticated functions such as conformational switching, cooperative ligand binding, and light-activated electron transfer can be achieved with low-information design approaches. This is because for simple topologies the functional footprint in sequence space is enormous and easily exceeds the number of structures which could possibly have existed in the history of life on Earth. Finally, the pervasiveness of extraordinary stability in designed proteins challenges accepted models for the "marginal stability" of natural proteins, suggesting that there must be a selection pressure against highly stable proteins. This can be explained using recent theories which relate non-equilibrium thermodynamics and self-replication.

Introduction

The thermodynamics of folding

The central tenets of the theory of biopolymer folding are implied by the "Anfinsen principle" [1]: the information required to reach the native state is contained in the primary sequence of amino acids. The well-known experimental proof of this principle lies in the reversible nature of the folding process in dilute aqueous conditions: if information from an outside source were needed to specify the native state, then the protein would not spontaneously refold in isolation. This simple principle places strong thermodynamic constraints on the nature of the folding process and provided a starting point for a more comprehensive thermodynamic picture of folding and the development of the lattice models of folding [2]. It has also been known for some time, however, that this simple picture of folding only strictly holds for small proteins and protein subunits, that some proteins require chaperones to reach their native state, and furthermore that some proteins are intrinsically disordered in their native state and thus do not "fold" at all in the sense defined by the theory [3].

The clearest implication of the Anfinsen principle is that protein folding is spontaneous in the appropriate environment and can be regarded as a phase transition from random-coil states with continuous energy spectra to an ordered or semi-ordered set of states characterized by discrete, well separated energy levels. The free energy change on folding must therefore be negative and large enough in magnitude, relative to thermal noise, for the folded state to be populated a sufficient fraction of the time to perform its biological function (a timescale which itself varies considerably for different proteins) [4].

Computational models of folding

The first computational evidence for the present framework was provided by lattice models [5], in which a polymer chain is folded onto a 2D or 3D lattice and the energy of a structure is calculated directly from the energy of all contact pairs:

$$E = \sum_{i<j} E_{ij}\,\Delta(r_i - r_j) \qquad (1)$$

where the delta function counts the non-adjacent pairs i, j in contact and $E_{ij}$ is the energy of amino acid i in contact with amino acid j. Folding results from contact pairings in the native state both between the residues of the chain and between the surface residues and solvent. The simplest "HP model", in which each chain element is simply either Hydrophobic (H) or Polar (P), requires assigning only three potentials corresponding to H-H, P-P, and H-P contacts between the chain elements [6]. More detailed models with all 20 amino acids make frequent use of the Miyazawa-Jernigan matrix of contact potentials, which utilizes experimentally derived energies for all 20x20 pairwise interactions of the amino acids [7]. With a set of contact potentials the energy of any sequence in any structure can be readily computed, and there is consensus that these simple models confirm the predictions of the "thermodynamic hypothesis" and, further, generically reproduce many other features of natural proteins such as native states with high degrees of symmetry and fast folding kinetics.
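As a minimal sketch of how equation 1 is evaluated, the following code sums contact energies for an HP sequence placed on a square lattice. The energy values (only H-H contacts favorable), the function name `contact_energy`, and the example conformations are assumptions made for illustration; they are not taken from the cited lattice studies.

```python
from itertools import combinations

# Illustrative contact energies (arbitrary units). These values are an
# assumption: only H-H contacts are treated as favorable.
CONTACT_ENERGY = {("H", "H"): -1.0, ("H", "P"): 0.0, ("P", "P"): 0.0}


def contact_energy(sequence, coords, energies=CONTACT_ENERGY):
    """Sum E_ij over residue pairs that touch on the lattice but are not chain neighbors."""
    if len(sequence) != len(coords):
        raise ValueError("sequence and coords must have the same length")
    if len(set(coords)) != len(coords):
        raise ValueError("conformation is not self-avoiding")
    total = 0.0
    for i, j in combinations(range(len(sequence)), 2):
        if j - i == 1:
            continue  # bonded neighbors are not counted as contacts
        # A Manhattan distance of 1 means the two residues sit on adjacent sites.
        distance = sum(abs(a - b) for a, b in zip(coords[i], coords[j]))
        if distance == 1:
            pair = tuple(sorted((sequence[i], sequence[j])))
            total += energies[pair]
    return total


# A 2x2 square: residues 0 and 3 are the only non-bonded contact.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
# A 2x3 hairpin: contacts between residues (0, 5) and (1, 4).
hairpin = [(0, 0), (1, 0), (2, 0), (2, 1), (1, 1), (0, 1)]

print(contact_energy("HPPH", square))
print(contact_energy("HPPP", square))
print(contact_energy("HHHHHH", hairpin))
```

Swapping in a 20x20 table for `CONTACT_ENERGY` would give the more detailed model described above without changing the summation.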

One important concept to emerge from the analysis of these models is the notion of "designability", defined for a particular structure S as the number of sequences ($N_s$) which have that structure as their native state [8]. The distribution of designabilities of different structures was found to vary significantly from the expectation for a Poisson distribution, with many structures having designabilities orders of magnitude larger than the mean of the distribution. The implications of these findings for protein evolution have been discussed and debated extensively [9, 10]. Highly designable structures are, by definition, more tolerant of mutation and require less sequence information per amino acid to encode, suggesting a number of reasons that natural selection would have favored more designable structures [11]. Additionally, in a purely random search of sequence space with no biasing on the part of the environment whatsoever, the probability of finding a structure is simply $N_s/A^N$, the designability of that structure divided by the total number of sequences of equal length (length N drawn from an alphabet of size A). If we now note that the requirement for reversible folding is equivalent, in Shannon's information theory terms [12], to the claim that the uncertainty about the structure given the sequence is zero ($H(\mathrm{str} \mid \mathrm{seq}) = 0$), then we see that reversible folding is a "noiseless channel" [13]. It follows that the mutual information between the sequence and structure of an Anfinsen folder can be simply written in terms of the designability $N_s$:

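$$I(\mathrm{seq}:\mathrm{str}) = H(\mathrm{seq}) - H(\mathrm{seq} \mid \mathrm{str}) = 4.32N - \log_2(N_s) = -\log_2\left(N_s/20^N\right) \qquad (2)$$

Here $4.32 \approx \log_2(20)$ is the number of bits per residue for a 20-letter alphabet, and $H(\mathrm{seq} \mid \mathrm{str}) = \log_2(N_s)$ is the entropy of the $N_s$ sequences, taken as equally likely, that share the structure. Equation 2 shows that the mutual information between sequence and structure is simply the negative logarithm of the probability of selecting an acceptable sequence out of all possible sequences of equal length.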
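The two limits of equation 2 can be checked numerically. The sketch below assumes all sequences are equally likely, takes the designability as a base-2 logarithm so that very large values stay representable, and uses a hypothetical chain length of 100; the function name is illustrative.

```python
import math


def mutual_information_bits(n_residues, log2_designability, alphabet_size=20):
    """I(seq:str) = N*log2(A) - log2(Ns), in bits, assuming equally likely sequences."""
    max_bits = n_residues * math.log2(alphabet_size)
    if not 0.0 <= log2_designability <= max_bits:
        raise ValueError("designability must lie between 1 and A**N")
    return max_bits - log2_designability


n = 100  # hypothetical chain length, for illustration only

print(math.log2(20))                                  # bits per residue, about 4.32
print(mutual_information_bits(n, 0.0))                # Ns = 1: every residue is specified
print(mutual_information_bits(n, n * math.log2(20)))  # Ns = A**N: no information
```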

Binary Patterning

The explanatory power of these simple models was corroborated by early efforts in rational de novo protein design which began around this same time. The concept of designing proteins simply by "binary patterning" of polar and hydrophobic amino acids represented the first experimental tests of the predictions of these models in an environment completely divorced from natural selection [14, 15]. As in the lattice models, the results of these efforts have confirmed that even an approach that is as "low information" as binary patterning can successfully produce proteins with stable native topologies and even sophisticated function with a modest rate of success [16–18]. A number of authors have expressed differing opinions on the role of complexity in natural proteins, but analysis of the differences between natural proteins and their designed counterparts, for which the information content can be exactly calculated, demonstrates that at least some of the complexity of natural proteins is excrescent: design efforts have reproduced sophisticated functions in much simpler scaffolds (21). Szostak's definition of fitness based on activity rather than structure is an intriguing proposal in this light (22). A sober review of the difference between the information requirements placed on natural proteins inside a cell and the information requirements placed on a protein which only has to function in dilute solution, however, leaves plenty of room for caution against the notion that much of natural complexity is useless. Most natural proteins are adapted to their cellular environment in many more ways than their immediate biological function: they spatially localize, bind and unbind other molecules, and have their production and degradation tightly controlled via signaling systems (23).

The protein design field has seen tremendous growth in the intervening decades and synthetic biology has become a subject of both scientific and cultural interest. Along with the rapid growth of the field have come a large number of increasingly sophisticated computational approaches to redesigning natural proteins and engineering entirely new ones [19]. However, the success of the binary patterning approach has also inspired a generation of rational design based on informatics and physical-chemical intuition concerning amino acids. The rationale for such a strategy is two-fold. First, one major utility of de novo proteins arises from their role as maquettes: drastically simplified models of their natural counterparts that enable us to ask testable questions about the engineering principles which underlie protein function [20, 21]. Second, design through the combination of informatics and intuition allows one to be explicit about the information content of the design process, and this matters because our understanding of the notion of reversible folding depends on the underlying assumption that no information from outside the sequence is required to specify the structure. In turn, this suggests clear reasons why natural selection may have preferred structures that can be encoded more efficiently (with less sequence information).

I. The fold doesn't matter - divorcing fold and function

One thing made clear by these design efforts is that structure and function are not simply interchangeable concepts in biology, as a lazy version of biological dogma might assert. Some of the earliest experiments in biocatalyst design involved the creation of catalytic antibodies which catalyzed reactions originally catalyzed by enzymes with substantially different folds than that of an antibody [22]. We have implanted the oxygen transport function into a four alpha helix bundle fold [23–25], a function previously associated only with the globin fold (Figure 1). Both light-activated electron transfer [26, 27] and ligand-activated conformational switching [28] have similarly been implanted into four alpha helix bundles, functions only seen before in much more complex structures. Hecht and coworkers have screened large libraries of binary patterned helical bundles for function and found many catalytic activities never observed in this fold [29, 30]. Similar function has recently been realized with a very small set of amino acids positioned appropriately in short catalytic peptides: function without a fold at all [31].

Figure 1.

Comparison of functionally equivalent natural (left) and artificial (right) oxygen transport proteins. Both structures bind heme using histidine coordination and both reversibly bind molecular oxygen to a ferrous heme iron. The de novo structure is less topologically complex yet the 4-helix bundle fold is not associated with any known natural oxygen transport proteins.

In a similar vein, the Baker group has explicitly demonstrated the capacity to place the same active site within multiple folds [32]. In one design series of de novo enzymes with retro-aldol activity, they observe catalytic rate enhancements with two distinct active site designs in five different scaffolds, creating 32 functional enzymes in all [33]. Certainly not all active sites are compatible with any fold, but the notion that any one function is restricted to only "the perfect fold" for that function is an idealization of actual biological evolution. It is clear that there are many possible topologies that might perform, for example, the task of sequestering an active site from water while allowing substrate and product exchange with solvent, which are likely to be sufficient engineering principles for a large number of biological functions [34].

It is often observed that there are far fewer folds than seem possible given the size of sequence space [35]; nevertheless, the achievements of protein design indicate that even these few thousand natural folds may be far more than necessary. Although this might sound ludicrous to an intuition trained in the study of complex natural structures, it might be possible to recapitulate nearly all biological function within a handful of simplified topologies such as helical bundles and beta-barrels. Perhaps instead of asking why there are so few folds found in nature, one should be asking why there are so many!

II. The functional footprint in sequence space is large, making both protein design and evolution easier than one might suspect

A guiding principle of maquette-based rational design is the creation of natural protein function without natural protein complexity [36]. The core insight of patterning approaches is that, while an active or binding site may be highly constrained with respect to the amino acids capable of carrying out the intended function, specifying the folded state depends mainly on just correctly assigning different average polarities to different regions of the sequence and respecting known secondary structure formation rules. In turn it follows that both the evolution and design of functional proteins may be simpler than previously estimated. If, aside from the active site, a primary sequence is only required to collapse to a state placing the active residues in the necessary orientation within a hydrophobic core, then the number of possible sequences which should impart function becomes impossibly large with only a modest increase in protein length, and quickly exceeds the number of seconds which have elapsed since the big bang. Clearly, it is impossible that evolution sampled any significant fraction of these possibilities. More importantly, this train of thought demonstrates that evolution did not need to exhaustively search this space in order to achieve functional success, and the conceptual difficulties associated with reconciling biological efficiency with a blind search process disappear. In the realm of de novo protein design, these ideas indicate that a successful design strategy need not be computationally intensive because once the unnecessary constraints associated with the long evolutionary history and complex environment of natural proteins are removed, it is seen that the functional region of sequence space for most individual chemical tasks is massive and does not require explicit identification of more than a few residues. 
This has been demonstrated in a few cases for natural proteins, particularly by Harbury and coworkers, who used complementation experiments coupled to random mutagenesis of the essential enzyme triose phosphate isomerase to show that only a few select residues were necessary for function in the alpha/beta barrel enzyme [37].
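The comparison with the number of seconds since the big bang can be made concrete with a short counting sketch. It assumes an age of the universe of roughly 1.38e10 years and, hypothetically, that each position outside the active site tolerates a fixed number of residue choices; the specific values of 2, 5 and 10 choices are illustrative, not measured tolerances.

```python
SECONDS_PER_YEAR = 3.156e7                          # approximate
SECONDS_SINCE_BIG_BANG = 1.38e10 * SECONDS_PER_YEAR  # about 4.4e17 s, approximate


def free_positions_to_exceed(target, choices_per_position):
    """Smallest n for which choices_per_position**n is larger than target."""
    n = 0
    count = 1
    while count <= target:
        count *= choices_per_position
        n += 1
    return n


# Hypothetical numbers of tolerated residues at each unconstrained position.
for choices in (2, 5, 10):
    print(choices, free_positions_to_exceed(SECONDS_SINCE_BIG_BANG, choices))
```

Even with only two acceptable residues per position, fewer than 60 unconstrained positions are needed under these assumptions, which is shorter than most folded domains.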

This is a significant departure from the prevailing attitude with regard to protein evolution, in part because standard biochemical complementation experiments are quite adept at demonstrating the opposite phenomenon: the identification of one or a few key residues which are conserved in evolution and necessary for function and survival. These conserved residues are the high information sites in the Shannon formalism, since the probability of finding particular amino acids at these sites is very high. The emphasis on conserved positions, however, is incapable of assessing the degree of variability vs. constraint present in a whole sequence because it is only interested in mutations which change or abolish function. Rational design allows for quantification of the size of functional sequence space through an iterative design methodology which begins with an active site [38] and the most naive possible set of design rules which allow for the random assignment of remaining residue identities [39]. One then proceeds to randomly sample members of the resulting distribution, screen for function, and make rational changes to the distribution based on observation. This modifies the distribution and the process is repeated until the success rate of the design constraints is deemed acceptable.

It is important to recognize that establishing a set of "design rules" is mathematically equivalent to selecting a probability distribution for the frequency of amino acids at each position. It is then straightforward to establish the information content of those design rules, which can be thought of as quantifying the volume of "sequence space" occupied by functional sequences or, alternatively, as how much a protein sequence has to deviate from a completely random assignment of amino acids in order to satisfy the specified constraints. We have demonstrated the applicability of such a design algorithm in creating heme binding four helix bundles using binary patterning in combination with a bioinformatically derived heme binding consensus sequence [40], and estimated from these results that there are on the order of $10^{60}$ 104-residue-long sequences capable of folding into four-helix bundles that internally bind heme and/or porphyrin cofactors [41]. This is a large designability indeed. Note, however, that in comparison to the size of sequence space itself (approximately $10^{130}$ sequences of equal length), this number is still vanishingly small. In any case, it is clear that $10^{130}$ and $10^{60}$ are each so large that it is impossible that evolution has had time to search this space [42]. Some authors have argued based on this observation that the enormous number of possible sequences is a "red herring" [43], and that closer inspection of the constraints on protein sequences makes the accessible size of sequence space much smaller. While we agree that the space of possible sequences is much too large to have been searched, we also note that the size of the available sequence space places upper bounds on the information content of protein sequences, and that the vastness of sequence space thus does play a significant role in protein evolution. If the space of possible sequences were more limited, the variability of natural sequences and thus the capacity of proteins to adapt to their environment would in turn be more restricted. The information perspective helps one to see that such large numbers are in fact not uncommon at all in nearly any combinatoric situation. Inserting these numbers into equation (2), given that we assigned probabilities (some =1) to 104 positions, shows that

$$I = 4.32 \times 104 - \log_2(10^{60}) \approx 250\ \text{bits} \qquad (3)$$

Here the bit is a quantitative indicator of how far the design rules vary from true randomness: in the case of 250 bits of information, a truly randomly generated protein sequence would satisfy these design rules 1 in $2^{250}$ times. This is a relatively small amount of information. Note, however, that all that this outlook does is tame these enormous combinatoric possibilities by taking a logarithm, and if 250 bits suddenly seems like a reasonable number then so was the ratio $N_s/A^N$.

A second point we stress is that the success of patterning approaches demonstrates that function and structure can be factored into distinct design constraints which imply distinct information contents. Indeed, one can assign distinct probability distributions at every position, and the total information remains simply additive. Our design approach utilizes a consensus sequence that requires specifying only 5 positions out of 26 on a helix [40], with the remaining positions fixed by a randomized patterning probability distribution intended to produce four helix bundles [41]. Considering that it is certain that there are folds other than four helix bundles which can bind heme (many such examples are found in nature already), the functional footprint for sequences that bind heme irrespective of structure is larger still than $10^{60}$. Again this underscores the fact that the vastness of sequence space is no impediment to the search for functional sequences, and one does not need to deny this vastness in order to explain the existence of evolved enzymes with high selectivity and activity.
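The additivity of per-position information can be illustrated with a small sketch. Each design rule is represented as a probability distribution over the 20 amino acids, and its information is the difference between the maximum entropy, log2(20), and the entropy of the rule. The residue sets used for "hydrophobic" and "polar" and the seven-position pattern are hypothetical; they are not the published heme-binding consensus or patterning rules.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def uniform_over(residues):
    """Design rule: the position is drawn uniformly from the given residues."""
    return {aa: 1.0 / len(residues) for aa in residues}


def position_information_bits(distribution, alphabet=AMINO_ACIDS):
    """log2(A) minus the Shannon entropy of the distribution at one position."""
    entropy = -sum(p * math.log2(p) for p in distribution.values() if p > 0)
    return math.log2(len(alphabet)) - entropy


def design_information_bits(rules):
    """Total information of independent per-position rules (a simple sum)."""
    return sum(position_information_bits(rule) for rule in rules)


# Hypothetical rule classes, for illustration only.
FIXED_HIS = uniform_over("H")          # a fully specified site
HYDROPHOBIC = uniform_over("AFILMVW")  # assumed nonpolar set
POLAR = uniform_over("DEKNQRST")       # assumed polar set
ANY = uniform_over(AMINO_ACIDS)        # unconstrained position

# A made-up seven-position pattern.
rules = [POLAR, HYDROPHOBIC, POLAR, FIXED_HIS, HYDROPHOBIC, POLAR, ANY]

for rule in (FIXED_HIS, HYDROPHOBIC, POLAR, ANY):
    print(round(position_information_bits(rule), 2))
print(round(design_information_bits(rules), 2))
```

A fully specified position costs about 4.32 bits, while a binary-patterned position costs far less, which is why patterning is a low-information strategy in this accounting.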

Finally, note that patterning methodologies quantify the information content of chosen design rules and not the size of the functional fold space directly; there exists as yet no satisfactory method of estimating this latter quantity outside the context of lattice models. But the higher the success rate of the chosen design rules, the more the entropy of those rules approaches the true entropy of the functional sequence space for a given fold. An accurate estimation of this entropy would in turn lead directly to a quantification of how difficult it is to randomly find an acceptable solution to a particular chemical problem and, once found, how much work would be required for natural selection to maintain that function in the face of randomizing mutations. Accurate estimations of the size of functional sequence space should then be of immense value to questions pertaining to the earliest evolving chemical organizations on Earth and perhaps even the statistical likelihood of finding life elsewhere in the universe.

Limiting values of information and adaptability

The designability inserted into equation 2 must lie between 1 and $A^N$, corresponding to maximum and zero mutual information between sequence and structure, with $N\log(A)$ setting the maximum value. Given that the mutual information is bounded above and below, it is worthwhile considering the limiting values, and the relationship of these bounds to the distribution of natural protein lengths. It is easy to see that a structure with $N_s = 1$ has, by definition, zero mutational tolerance. A structure where $N_s$ begins to approach some appreciable fraction of $A^N$, corresponding to almost no information content, is implausible due to the rate at which $A^N$ grows. However, if such structures were possible, they would place strong constraints on the available diversity of protein structures, due to the fact that each such structure covers a significant fraction of the sequence space. The vastness of sequence space does indeed then play an important role in making protein design and evolution a feasible search: it allows the designability of structures to grow to arbitrarily large values while still leaving plenty of room for the accumulation of new structural diversity.

A second point to be noted is that, while increasing the alphabet size is a good way to increase the size of sequence space, a much easier method is to merely increase the chain length. Thus, through these simple general assumptions, we can rationalize the need for protein sequences, or independently folding domains, to reach a minimum length such that the designability of structures can grow large without threatening to limit the diversity of possible structures, a limit easily achieved due to the exponential growth of sequence space with N.

This demonstrates the relationship between the designability of a structure and the mutual information between structure and sequence (the information in the sequence which specifies the structure), but it does not address the presence of information in the sequence which is unrelated to the structure. Again, a dogmatic approach to the structure-function relationship would suggest that the only information necessary in a primary sequence is the mutual information between sequence and structure, but this is not the case.

One well-known example of such non-structural information is the coding which uses the N-terminal amino acid of a protein to control protease affinity and thereby protein half-lives [44]. Indeed, kinetic regulation of any sort cannot be achieved without some transmission of information, and this information must be shared between the primary sequence of the protein and its native environment. Spatial localization of proteins is another example of information which is potentially non-structural, involving recognition motifs that provide nascent proteins with a "shipping label" for the motor proteins that actively transport cellular contents along the cytoskeleton [45]. All of this mutual information between sequence and environment is indeed structural, just not merely on the scale of the protein's tertiary structure. This underscores the care that must be taken when asserting that biological structure yields biological function.

Understanding the concept of information in statistical physics requires recognizing that all information is stored in states of matter, which is equivalent to saying that this information is stored in structures. Physical structures exist at different size and energy scales, however, and proteins themselves have three clearly distinguished levels of structure. Thus, the mutual information between sequence and "structure" is more accurately described as the mutual information between the primary and tertiary structural levels of the protein. This can lead to misunderstandings if one then imagines that the mutual information between the primary sequence and the protein's fold reflects the entirety of the mutual information between the primary sequence and the native environment. In other words, natural proteins are more adapted to their environment than would be the case if adopting a particular fold were their only functional requirement. Thus, given that natural proteins are highly adapted to the cellular environment and may play many distinct functional roles, it follows that some structural components of native sequences are necessary for certain functions and others are not. The immense unlikelihood of any reasonably long protein having either the maximum or minimum value for the mutual information between sequence and structure should make it clear that not every position in a primary sequence is completely determined with respect to any particular function and, likewise, that there are no particular functions which require specifying the entire sequence.

This interplay between specifying structural information while leaving room for additional information related to different functions is at the heart of the Darwinian principle of multiple utility [46]. The rational design literature has often cited this principle as an explanation for the complexity of natural protein structures and connected it to the notion of Muller's ratchet, the accumulation of contingent mutual information which has become necessary to an organism because of later selective changes that depend on the contingent conditions under which they were discovered [47, 48]. The history of maquette based design has largely vindicated this view in the sense that it has consistently recapitulated complex functions in much simpler structures than the natural versions of the same proteins. We stress that in performing these simplifications one is necessarily tossing out any additional information the structure carries about its cellular environment. Thus one trades a complex structure capable of many kinds of interactions with its environment for a simplified structure which (hopefully) does only what it was designed to do. Once the minimum structural or informational requirements for a function are understood, however, the design of de novo proteins which interact with their environment as robustly and variously as their natural counterparts should also become possible.

III. Protein stability is limited by evolution

Extraordinary stability in designed proteins

One outstanding feature of the accumulated results of protein design efforts, either in the case of de novo folds or redesign of natural sequences, is the preponderance of designed proteins which are much more stable than their natural counterparts [49, 50]. Redesigned natural proteins are often more than 10 kcal/mol more stable than their wild-type counterparts [51], and in some cases these extremely stable proteins have been produced through very simple design procedures [52]. In the context of designing enzymes or structural proteins for practical applications in which the molecules must function outside of a biological context, such stabilization is highly desirable: one simply wants an enzyme or molecule that lasts as long as possible before being irreversibly damaged. For example, helical bundles have recently been reported with extrapolated stabilities in excess of 60 kcal/mol [53]. This is a validation of our current understanding of the molecular forces determining protein stability, but it raises serious questions about our grasp of the evolutionary forces which govern the intrinsic stabilities of uniquely folded proteins.

The majority of natural proteins are "marginally stable", with an average native state stability of 5–10 kcal/mol [54]. The predominant theory which explains this is that marginal stability is a simple consequence of stochastic drift in the "neutral network" of sequences which confer a given function or structure [9], namely that the majority of mutations which do not affect function either have no effect on protein stability or a destabilizing one, and only when successive destabilizing mutations affect function do they undergo adverse selection [55]. Thus, the standard model states that for a fixed fitness landscape marginal stability is a natural consequence of drift in the neutral network, and furthermore that designability and stability are positively correlated [56, 57]. The latter can be intuitively understood as relating the size of a structure's functional footprint in sequence space to the depth of the potential well corresponding to the net stability of the interactions which constitute the structure. Structures designed by more sequences are more tolerant of mutation, and this is made possible if they begin with some stability to spare [58]. What recent results have made clear, however, is just how deep the stability well for highly designable structures is. Given the apparent depth of this well (at minimum 60 kcal/mol in the case of helical bundle proteins), the fractional sequence space of highly designable structures that should exhibit greater than marginal stability now appears to be much larger than previously appreciated. This, together with the fact that proteins with higher than marginal stability are so rare, makes it clear that there must be a selection process against high protein stability, suggesting that the functional sequence footprint in sequence space is even larger than originally thought (see Figure 3).

Figure 3.

The functional footprint in sequence space is even larger than originally thought, making both protein design and evolution easier than one might suspect. (A) The great majority of natural proteins have marginal folding stabilities of −10 kcal/mol or less. However, the preponderance of exceptionally stable proteins, with stabilities as large as −60 kcal/mol, suggests that in fact functional sequence space is much larger (B), and that one or more selection mechanisms selects against these highly stable sequences. Even after adverse selection, the resultant truncated distribution of stable sequences (C) would be significantly larger than that depicted in (A).

Cellular protein concentrations and protein stability

The recent work on proteostasis by Kelly and coworkers gives one possible reason for protein stability to be evolutionarily limited [59]. They have shown that it is possible to manipulate steady-state protein concentrations inside the cell by creating a series of small molecules which bind either to the folded states of helpful cellular proteins, stabilizing them, or to the unfolded states of deleterious proteins, destabilizing them, thereby demonstrating clear relationships between protein production and degradation rates, the free energy of folding, and cellular protein homeostasis (Scheme 1). In the absence of binding interactions which preferentially stabilize either the folded or unfolded states of the protein, the steady-state population of the native state N is:

$$[N] = \frac{k_{\mathrm{synthesis}}}{k_{\mathrm{degradation}}}\, e^{-\Delta G_{\mathrm{folding}}/RT} \qquad (3)$$

Scheme 1.

Simply substituting a stability of 60 kcal/mol (that is, a folding free energy of −60 kcal/mol) into this equation results in the prediction that cellular protein concentration should exceed the total mass of the cell, and indeed that of the human body! While one could make an efficiency argument that longer-lived proteins require less energy to maintain at their functional cellular concentration, this back-of-the-envelope calculation makes it clear that as stabilities increase, protein accumulation rapidly reaches a counterproductive level.
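The size of the exponential term in equation 3 can be checked with a few lines of Python. This is a minimal sketch under stated assumptions: a temperature of 298 K, stabilities entered as positive magnitudes of the folding free energy, and the ratio of synthesis to degradation rate constants left as an unspecified prefactor. The function name is illustrative and not part of the original analysis.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)

def boltzmann_factor(stability_kcal_per_mol, temperature_k=298.0):
    """Return exp(stability / RT), the factor that multiplies
    k_synthesis / k_degradation in equation 3 when the folding free
    energy is -stability (assumed sign convention)."""
    return math.exp(stability_kcal_per_mol / (R_KCAL * temperature_k))

for stability in (5.0, 10.0, 60.0):
    factor = boltzmann_factor(stability)
    print(f"{stability:5.1f} kcal/mol -> factor of about 10^{math.log10(factor):.1f}")
```

With these assumptions the factor rises from a few thousand at 5 kcal/mol to roughly 10^44 at 60 kcal/mol, so no plausible ratio of synthesis to degradation rates could keep the steady-state concentration within physical bounds.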

Stability considerations on folding kinetics

Another possible selection mechanism concerns the rates of folding and unfolding of these proteins: for a two-state, reversibly folding protein at a fixed rate of unfolding, fast folding is achieved by increasing the magnitude of ΔG. However, at limiting folding speeds, increasing the unfolding rate for purposes of kinetic regulation can only be achieved by reducing the stability of the fold. This argument hinges on the often overlooked fact that there is a purifying selection bias against proteins that take too long to fold [60]. This selection requirement is as unavoidable as the necessity of the unfolding energy being above some minimal cutoff that would ensure stability. However, placing restrictions on both the free energy of folding and the folding rate determines the spontaneous unfolding rate, since for a two-state protein the ratio of the folding and unfolding rate constants equals the folding equilibrium constant. Consider, for instance, that a reasonable minimum value of the folding rate for biological relevance is approximately 1 s−1. If additionally one requires that the stability of the fold be 15 kcal/mol, then at room temperature this fixes the timescale of spontaneous unfolding to be on the order of 10,000 years! At 60 kcal/mol, maintaining a folding rate of 1 s−1 implies an unfolding time longer than the age of the universe.
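The unfolding timescales quoted above follow from the two-state relation between the rate constants and the equilibrium constant, k_unfold = k_fold · exp(−stability/RT). The sketch below works through that arithmetic; it assumes a temperature of 298 K, stabilities given as positive magnitudes, and approximate values for the length of a year and the age of the universe. The function and constant names are illustrative.

```python
import math

R_KCAL = 1.987204e-3             # gas constant in kcal/(mol*K)
SECONDS_PER_YEAR = 3.156e7       # approximate
AGE_OF_UNIVERSE_YEARS = 1.38e10  # approximate

def unfolding_time_years(stability_kcal_per_mol, folding_rate_per_s=1.0,
                         temperature_k=298.0):
    """Mean spontaneous unfolding time (years) of a two-state protein.

    Uses K = k_fold / k_unfold = exp(stability / RT), so that
    k_unfold = k_fold * exp(-stability / RT).
    """
    rt = R_KCAL * temperature_k
    k_unfold = folding_rate_per_s * math.exp(-stability_kcal_per_mol / rt)
    return 1.0 / k_unfold / SECONDS_PER_YEAR

for stability in (15.0, 60.0):
    years = unfolding_time_years(stability)
    print(f"{stability:4.1f} kcal/mol -> unfolding time of about {years:.1e} years")
```

Under these assumptions a stability of 15 kcal/mol gives an unfolding time of a few thousand years, within an order of magnitude of the figure quoted above (the exact value is sensitive to the temperature chosen), while 60 kcal/mol gives a time that exceeds the age of the universe by many orders of magnitude.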

Folding energies and cellular replication rates

It may be objected that the spontaneous unfolding rate is irrelevant, since nearly all cellular proteins are actively degraded by degradative metabolic pathways. However, it is thermodynamically forbidden for such a transition to take place without paying the cost of the free energy of unfolding, and this holds no matter what mechanism is used to degrade the protein. Thus, the fact that the basic conditions of the proteasome make the unfolding of a protein "spontaneous" at a given pH does not allow the cell to evade paying this unfolding cost, since the formation of a pH gradient implies an increase in the internal entropy change of the cell, and more stable proteins will necessitate a larger, more costly pH gradient. So whether the cell pays the cost of unfolding directly through ATP-driven enzymatic activity or indirectly through creation of local non-equilibrium environments, the strictures of thermodynamics make it impossible that this cost not be paid in order to actively regulate protein degradation.

Selection against stability thus makes sense in the organizational context of cellular growth and replication. Connecting the simple kinetic model of proteostasis to Crooks' fluctuation theorem [61] implies relationships between protein production and degradation rates, the free energy of folding, and cellular growth rates. England has recently applied Crooks' non-equilibrium extension of the second law of thermodynamics to the context of organismal self-replication [62], demonstrating that the overall cellular growth rate is limited by the overall free energy change of the replication process and the degradation rate of the structures formed:

$$g = \frac{e^{-\Delta G/RT}}{d} \qquad (4)$$

where g is the growth rate and d is the durability of the cell – more durable cellular structures slow replication. Thus, all external energy sources being equal, one way in which any organism can outcompete a population of similar self-replicators is by reducing the stability of any protein whose degradation must be actively controlled until it achieves marginal stability.

The neutral theory of evolution is one of the more influential ideas of 20th century biology, and it plays an indispensable role in our contemporary understanding of how evolution produces diversity [63]. Another well appreciated fact, however, is that it can be notoriously difficult to distinguish between selective mechanisms in real environments. Much of this difficulty is due to the complexity of information accumulation in natural selection alluded to above. Indeed, as evolutionary biologists are aware, detecting neutral diversity in static or controlled environments is a much more straightforward task than presenting evidence for either stabilizing or purifying selection in a natural setting [64]. Natural ecosystems present fitness landscapes for both molecules and organisms that vary in time, often in highly stochastic fashion, and are likely to be very "high-dimensional" in accord with Darwin's principle of multiple utility. In other words, during natural evolution stability was far from the only parameter undergoing selection via changes to protein sequences.

A point which should be emphasized is that the stability landscape over sequence space and the fitness landscape over sequence space are not the same thing. The stability landscape is at most one slice through the much higher-dimensional fitness landscape. Lattice models intrinsically study only folding, and thus have nothing to tell us about how other selective pressures, such as those coming from the higher-level organizational constraints of cellular growth, replication, environmental responsivity and "evolvability", might impact the relationship between stability and fitness. In this light, it is interesting to note results of digital evolution on lattice models expanded to include a "cofactor binding site" [65]. It was observed that stability provided a fitness advantage on a static fitness landscape (in this case a fixed cofactor geometry), but that the "evolvability" of sequences, measured by their capacity to bind a new cofactor geometry, was higher for sequences with less stable folds.

Another caveat to this discussion is that a number of reasons advanced for selective pressure against stability have not been borne out by rational design efforts. In particular, the supposed inverse relationship between stability and flexibility, which was thought to assist in catalytic activity, does not appear to hold in many cases that have been studied [66]. This example illustrates the complexity of the reasoning involved in trying to answer the question "why are natural proteins marginally stable?" [67]. Granting that neutral network drift is a sufficient mechanism for generating marginal stability by no means precludes the possibility that there are also selective pressures acting on fold stability. If neutral drift also tends to maintain proteins within these critical stability limits, then that merely indicates that natural selection's job in regard to keeping stability within bounds was not particularly difficult. Nevertheless, in the several examples we have discussed, rationally designed proteins, even those generated by very simple rules with low information content, are often several kcal/mol more stable than average natural proteins of comparable size, weakening the thesis that the majority of natural structures are marginally stable simply by virtue of stochastic drift. Protein folding models which do not include the effects of the complex kinetic requirements imposed on protein concentrations therefore cannot evaluate the extent to which those constraints impose selective pressure on protein sequences.

Conclusion

While the "protein folding problem" remains formally intractable [68], rational protein design has demonstrated that the basic molecular driving forces governing folding are well understood. Indeed, the rapidly growing number of designed functional proteins demonstrates that the "inverse folding problem" – the design of sequences that fold into target structures [69] – has for the most part been solved. The success of these design efforts has in turn thrown certain aspects of natural proteins into greater relief. First, the functional footprint in sequence space is extremely large – identical functions can exist on radically different folds, and each fold itself has a multitude of functional sequences, making protein evolution easier and thus much more rapid (fast) than originally appreciated.

Second, natural proteins are both less stable and more complex than their designed counterparts. Marginal stability has enough adaptive value that natural proteins will tend toward it even if it is a relatively rare property of amino acid sequences: as long as the folding energy exceeds the minimum cutoff, the lower the folding energy, the less work is required when it inevitably becomes time for the cell to dispose of the protein (cheap). In our view, their additional complexity results both from contingent, historical aspects of natural evolution embodied by Muller's ratchet and from the polyvalence of information content that results from Darwinian multiple utility. That is, natural proteins are complex both because they are more adapted to their environment than proteins designed for a single purpose and because evolution has no foresight. It is in this spirit that a statistical physicist views protein evolution as a random walk through sequence space (out of control).

Taken together, we believe that the cumulative results of rational protein design offer powerful empirical vindication for the claim, corroborated by non-equilibrium thermodynamics, that the secret of biological evolution is to be "fast, cheap, and out of control" by design.

Figure 2.

Distribution of designabilities derived from an HP lattice study of all 3×3×3 compact structures formed by all 2^27 possible HP sequences. The expectation for a Poisson distribution (dotted line) is shown for comparison. Figure reproduced with permission from reference 6.

Acknowledgments

RLK gratefully acknowledges support via National Institutes of Health grant 1R01-GM111932, as well as program and infrastructure support from the National Institutes of Health National Center for Research Resources to the City College of New York (5G12-RR03060). RLK is a member of the New York Structural Biology Center (NYSBC). NMR data collected at NYSBC was made possible by a grant from NYSTAR.

References

  • 1. Anfinsen CB. Principles that govern the folding of protein chains. Science. 1973;181:223–230. doi: 10.1126/science.181.4096.223.
  • 2. Dill KA. Theory for the folding and stability of globular proteins. Biochemistry. 1985;24:1501–1509. doi: 10.1021/bi00327a032.
  • 3. Dunker AK, Oldfield CJ, Meng J, Romero P, Yang JY, Chen JW, Vacic V, Obradovic Z, Uversky VN. The unfoldomics decade: an update on intrinsically disordered proteins. BMC Genomics. 2008;9(Suppl 2):S1. doi: 10.1186/1471-2164-9-S2-S1.
  • 4. Dill KA, Bromberg S. Molecular Driving Forces. Garland Science; New York, N.Y: 2002.
  • 5. Sali A, Shakhnovich E, Karplus M. Kinetics of protein folding. A lattice model study of the requirements for folding to the native state. J Mol Biol. 1994;235:1614–1636. doi: 10.1006/jmbi.1994.1110.
  • 6. Helling R, Li H, Melin R, Miller J, Wingreen N, Zeng C, Tang C. The designability of protein structures. Journal of Molecular Graphics & Modelling. 2001;19:157–167. doi: 10.1016/s1093-3263(00)00137-6.
  • 7. Miyazawa S, Jernigan RL. Residue-residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading. J Mol Biol. 1996;256:623–644. doi: 10.1006/jmbi.1996.0114.
  • 8. Li H, Helling R, Tang C, Wingreen N. Emergence of Preferred Structures in a Simple Model of Protein Folding. Science. 1996;273:666–669. doi: 10.1126/science.273.5275.666.
  • 9. Bloom JD, Raval A, Wilke CO. Thermodynamics of neutral protein evolution. Genetics. 2007;175:255–266. doi: 10.1534/genetics.106.061754.
  • 10. Noirel J, Simonson T. Neutral evolution of proteins: The superfunnel in sequence space and its relation to mutational robustness. J Chem Phys. 2008;129:185104. doi: 10.1063/1.2992853.
  • 11. Kussell E. The designability hypothesis and protein evolution. Protein Pept Lett. 2005;12:111–116. doi: 10.2174/0929866053005881.
  • 12. Shannon CE. A mathematical theory of communication. ACM SIGMOBILE Mobile Computing and Communications Review. 2001;5:3–55.
  • 13. Dewey TG. Algorithmic complexity and thermodynamics of sequence-structure relationships in proteins. Physical Review E. 1997;56:4545–4552.
  • 14. Kamtekar S, Schiffer JM, Xiong H, Babik JM, Hecht MH. Protein design by binary patterning of polar and nonpolar amino acids. Science. 1993;262:1680–1685. doi: 10.1126/science.8259512.
  • 15. Hecht MH, Das A, Go A, Bradley LH, Wei YN. De novo proteins from designed combinatorial libraries. Protein Science. 2004;13:1711–1723. doi: 10.1110/ps.04690804.
  • 16. Bradley LH, Thumfort PP, Hecht MH. De novo proteins from binary-patterned combinatorial libraries. Methods Mol Biol. 2006;340:53–69. doi: 10.1385/1-59745-116-9:53.
  • 17. Bradley LH. High-quality combinatorial protein libraries using the binary patterning approach. Methods Mol Biol. 2014;1216:117–128. doi: 10.1007/978-1-4939-1486-9_6.
  • 18. Wei Y, Liu T, Sazinsky SL, Moffet DA, Pelczer I, Hecht MH. Stably folded de novo proteins from a designed combinatorial library. Protein Sci. 2003;12:92–102. doi: 10.1110/ps.0228003.
  • 19. Kuhlman B, Dantas G, Ireton GC, Varani G, Stoddard BL, Baker D. Design of a novel globular protein fold with atomic-level accuracy. Science. 2003;302:1364–1368. doi: 10.1126/science.1089427.
  • 20. Reddi AR, Reedy CJ, Mui S, Gibney BR. Thermodynamic investigation into the mechanisms of proton-coupled electron transfer events in heme protein maquettes. Biochemistry. 2007;46:291–305. doi: 10.1021/bi061607g.
  • 21. Anderson JLR, Koder RL, Moser CC, Dutton PL. Controlling complexity and water penetration in functional de novo protein design. Biochem Soc Trans. 2008;36:1106–1111. doi: 10.1042/BST0361106.
  • 22. Hilvert D. Critical analysis of antibody catalysis. Annual Review Of Biochemistry. 2000;69:751–793. doi: 10.1146/annurev.biochem.69.1.751.
  • 23. Koder RL, Anderson JLR, Solomon LA, Reddy KS, Moser CC, Dutton PL. Design and engineering of an O2 transport protein. Nature. 2009;458:305–309. doi: 10.1038/nature07841.
  • 24. Zhang L, Anderson JLR, Ahmed I, Norman JA, Negron C, Mutter AC, Dutton PL, Koder RL. Manipulating Cofactor Binding Thermodynamics in an Artificial Oxygen Transport Protein. Biochemistry. 2011;50:10254–10261. doi: 10.1021/bi201242a.
  • 25. Zhang L, Andersen EME, Khajo A, Maggliozzo RS, Koder RL. Dynamic factors affecting gaseous ligand binding in an artificial oxygen transport protein. Biochemistry. 2013;52:447–455. doi: 10.1021/bi301066z.
  • 26. Anderson JLR, Armstrong CT, Kodali G, Lichtenstein BR, Watkins DW, Mancini JA, Boyle AL, Farid TA, Crump MP, Moser CC, Dutton PL. Constructing a man-made c-type cytochrome maquette in vivo: electron transfer, oxygen transport and conversion to a photoactive light harvesting maquette. Chemical Science. 2014;5:3659–3659. doi: 10.1039/C3SC52019F. vol 5, pg 507, 2014.
  • 27. Farid TA, Kodali G, Solomon LA, Lichtenstein BR, Sheehan MM, Fry BA, Bialas C, Ennist NM, Siedlecki JA, Zhao Z, Stetz MA, Valentine KG, Anderson JLR, Wand AJ, Discher BM, Moser CC, Dutton PL. Elementary tetrahelical protein design for diverse oxidoreductase functions. Nature Chemical Biology. 2014;10:164–164. doi: 10.1038/nchembio.1362. vol 9, pg 826, 2013.
  • 28. Grosset AM, Gibney BR, Rabanal F, Moser CC, Dutton PL. Proof of principle in a de novo designed protein maquette: An allosterically regulated, charge-activated conformational switch in a tetra-alpha-helix bundle. Biochemistry. 2001;40:5474–5487. doi: 10.1021/bi002504f.
  • 29. Wei YN, Hecht MH. Enzyme-like proteins from an unselected library of designed amino acid sequences. Protein Eng Des Sel. 2004;17:67–75. doi: 10.1093/protein/gzh007.
  • 30. Patel SC, Bradley LH, Jinadasa SP, Hecht MH. Cofactor binding and enzymatic activity in an unevolved superfamily of de novo designed 4-helix bundle proteins. Protein Science. 2009;18:1388–1400. doi: 10.1002/pro.147.
  • 31. Maeda Y, Javid N, Duncan K, Birchall L, Gibson KF, Cannon D, Kanetsuki Y, Knapp C, Tuttle T, Ulijn RV, Matsui H. Discovery of Catalytic Phages by Biocatalytic Self-Assembly. J Am Chem Soc. 2014;136:15893–15896. doi: 10.1021/ja509393p.
  • 32. Rothlisberger D, Khersonsky O, Wollacott AM, Jiang L, DeChancie J, Betker J, Gallaher JL, Althoff EA, Zanghellini A, Dym O, Albeck S, Houk KN, Tawfik DS, Baker D. Kemp elimination catalysts by computational enzyme design. Nature. 2008;453:190–U194. doi: 10.1038/nature06879.
  • 33. Jiang L, Althoff EA, Clemente FR, Doyle L, Rothlisberger D, Zanghellini A, Gallaher JL, Betker JL, Tanaka F, Barbas CF, Hilvert D, Houk KN, Stoddard BL, Baker D. De novo computational design of retro-aldol enzymes. Science. 2008;319:1387–1391. doi: 10.1126/science.1152692.
  • 34. Anderson JLR, Koder RL, Moser CC, Dutton PL. Controlling complexity and water penetration in functional de novo protein design. Biochem Soc Trans. 2008;36:1106–1111. doi: 10.1042/BST0361106.
  • 35. Wolf YI, Grishin NV, Koonin EV. Estimating the number of protein folds and families from complete genome data. J Mol Biol. 2000;299:897–905. doi: 10.1006/jmbi.2000.3786.
  • 36. Koder RL, Dutton PL. Intelligent design: the de novo engineering of proteins with specified functions. Dalton Trans. 2006;25:3045–3051. doi: 10.1039/b514972j.
  • 37. Silverman JA, Balakrishnan R, Harbury PB. Reverse engineering the (beta/alpha)(8) barrel fold. Proc Natl Acad Sci U S A. 2001;98:3092–3097. doi: 10.1073/pnas.041613598.
  • 38. Summa CM, Lombardi A, Lewis M, DeGrado WF. Tertiary templates for the design of diiron proteins. Curr Opin Struct Biol. 1999;9:500–508. doi: 10.1016/S0959-440X(99)80071-2.
  • 39. Gibney BR, Rabanal F, Skalicky JJ, Wand AJ, Dutton PL. Iterative protein redesign. J Am Chem Soc. 1999;121:4952–4960.
  • 40. Negron C, Fufezan C, Koder RL. Helical Templates for Porphyrin Binding in Designed Proteins. Proteins. 2009;74:400–416. doi: 10.1002/prot.22143.
  • 41. Everson BH, French CA, Mutter AC, Nanda V, Koder RL. Hemoprotein Design using Minimal Sequence Information. Biophys J. 2013;104:661A–661A.
  • 42. Koonin EV, Wolf YI, Karev GP. The structure of the protein universe and genome evolution. Nature. 2002;420:218–223. doi: 10.1038/nature01256.
  • 43. Dryden DT, Thomson AR, White JH. How much of protein sequence space has been explored by life on Earth? J R Soc Interface. 2008:953–956. doi: 10.1098/rsif.2008.0085.
  • 44. Mogk A, Schmidt R, Bukau B. The N-end rule pathway for regulated proteolysis: prokaryotic and eukaryotic strategies. Trends Cell Biol. 2007;17:165–172. doi: 10.1016/j.tcb.2007.02.001.
  • 45. Braun AC, Olayioye MA. Rho regulation: DLC proteins in space and time. Cell Signal. 2015;27:1643–1651. doi: 10.1016/j.cellsig.2015.04.003.
  • 46. Moser CC, Page CC, Dutton PL. Darwin at the molecular scale: selection and variance in electron tunnelling proteins including cytochrome c oxidase. Philos Trans R Soc Lond B Biol Sci. 2006;361:1295–1305. doi: 10.1098/rstb.2006.1868.
  • 47. Koonin EV. The Turbulent Network Dynamics of Microbial Evolution and the Statistical Tree of Life. J Mol Evol. 2015;80:244–250. doi: 10.1007/s00239-015-9679-7.
  • 48. Muller HJ. The relation of recombination to mutational advance. Mutat Res. 1964;106:2–9. doi: 10.1016/0027-5107(64)90047-8.
  • 49. Dantas G, Corrent C, Reichow SL, Havranek JJ, Eletr ZM, Isern NG, Kuhlman B, Varani G, Merritt EA, Baker D. High-resolution Structural and Thermodynamic Analysis of Extreme Stabilization of Human Procarboxypeptidase by Computational Protein Design. J Mol Biol. 2007:1209–1221. doi: 10.1016/j.jmb.2006.11.080.
  • 50. Korkegian A, Black ME, Baker D, Stoddard BL. Computational thermostabilization of an enzyme. Science. 2005;308:857–860. doi: 10.1126/science.1107387.
  • 51. Marshall SA, Mayo SL. Achieving stability and conformational specificity in designed proteins via binary patterning. J Mol Biol. 2001;305:619–631. doi: 10.1006/jmbi.2000.4319.
  • 52. Shifman JM, Moser CC, Kalsbeck WA, Bocian DF, Dutton PL. Functionalized de novo designed proteins: Mechanism of proton coupling to oxidation/reduction in heme protein maquettes. Biochemistry. 1998;37:16815–16827. doi: 10.1021/bi9816857.
  • 53. Huang PS, Oberdorfer G, Xu C, Pei XY, Nannenga BL, Rogers JM, DiMaio F, Gonen T, Luisi B, Baker D. High thermodynamic stability of parametrically designed helical bundles. Science. 2014;346:481–485. doi: 10.1126/science.1257481.
  • 54. Makhatadze GI, Privalov PL. Energetics of protein structure. Advances in Protein Chemistry. 1995;47:307–425. doi: 10.1016/s0065-3233(08)60548-3.
  • 55. Oleksyk TK, Smith MW, O'Brien SJ. Genome-wide scans for footprints of natural selection. Philos Trans R Soc Lond B Biol Sci. 2010:185–205. doi: 10.1098/rstb.2009.0219.
  • 56. Emberly EG, Wingreen NS, Tang C. Designability of α-helical proteins. Proc Natl Acad Sci U S A. 2002:11163–11168. doi: 10.1073/pnas.162105999.
  • 57. Yang JY, Yu ZG, Anh V. Correlations between designability and various structural characteristics of protein lattice models. J Chem Phys. 2007;126:195101. doi: 10.1063/1.2737042.
  • 58. Helling R, Li H, Melin R, Miller J, Wingreen N, Zeng C, Tang C. The designability of protein structures. Journal of Molecular Graphics & Modelling. 2001;19:157–167. doi: 10.1016/s1093-3263(00)00137-6.
  • 59. Balch WE, Morimoto RI, Dillin A, Kelly JW. Adapting proteostasis for disease intervention. Science. 2008;319:916–919. doi: 10.1126/science.1141448.
  • 60. Kubelka J, Hofrichter J, Eaton WA. The protein folding speed limit. Curr Opin Struct Biol. 2004;14:76–88. doi: 10.1016/j.sbi.2004.01.013.
  • 61. Crooks GE. Entropy production fluctuation theorem and the nonequilibrium work relation for free energy differences. Phys Rev E. 1999;60:2721–2726. doi: 10.1103/physreve.60.2721.
  • 62. England JL. Statistical physics of self-replication. J Chem Phys. 2013;139:121923. doi: 10.1063/1.4818538.
  • 63. Kimura M. The neutral theory of molecular evolution: a review of recent evidence. Jpn J Genet. 1991;66:367–386. doi: 10.1266/jjg.66.367.
  • 64. Schrider DR, Mendes FK, Hahn MW, Kern AD. Soft shoulders ahead: spurious signatures of soft and partial selective sweeps result from linked hard sweeps. Genetics. 2015;200:267–284. doi: 10.1534/genetics.115.174912.
  • 65. Bloom JD, Wilke CO, Arnold FH, Adami C. Stability and the evolvability of function in a model protein. Biophys J. 2004;86:2758–2764. doi: 10.1016/S0006-3495(04)74329-5.
  • 66. Teilum K, Olsen JG, Kragelund BB. Protein stability, flexibility and function. Biochim Biophys Acta. 2011;1814:969–976. doi: 10.1016/j.bbapap.2010.11.005.
  • 67. Taverna DM, Goldstein RA. Why are proteins marginally stable? Proteins. 2002;46:105–109. doi: 10.1002/prot.10016.
  • 68. Dill KA, MacCallum JL. The Protein-Folding Problem, 50 Years On. Science. 2012;338:1042–1046. doi: 10.1126/science.1219021.
  • 69. Chiu TL, Goldstein RA. Optimizing potentials for the inverse protein folding problem. Protein Engineering. 1998;11:749–752. doi: 10.1093/protein/11.9.749.