Abstract
Knowing the determinants of conformational specificity is essential for understanding protein structure, stability, and fold evolution. To address this issue, a novel statistical measure of energetic compatibility between sequence and structure was developed, using an experimentally validated model of the energetics of the native state ensemble. This approach successfully matched sequences from a diverse subset of the human proteome to their respective folds. Unexpectedly, significant energetic compatibility between ostensibly unrelated sequences and structures was also observed. Interrogation of these matches revealed a general framework for understanding the origins of conformational specificity within a proteome: specificity is a complex function of both the ability of a sequence to adopt folds other than the native, and the ability of a fold to accommodate sequences other than the native. The regional variation in energetic compatibility indicates that the compatibility is dominated by incompatibility of sequence for alternative fold segments, suggesting that evolution of protein sequences has involved substantial negative selection, with certain segments serving as “gatekeepers” that presumably prevent alternative structures. Beyond these global trends, a size dependence exists in the degree to which the energetic compatibility is determined by negative selection, with smaller proteins displaying more negative selection. This partially explains how short sequences can adopt unique folds, despite the higher probability in shorter proteins for small numbers of mutations to increase compatibility with other folds. In providing evolutionary ground rules for the thermodynamic relationship between sequence and fold, this framework imparts valuable insight for rational design of unique folds or fold switches.
Keywords: Thermodynamic Environments, Gapless Threading, Metamorphic Proteins, Rational Design, Fold Recognition
Introduction
Why does an amino acid sequence adopt one particular unique fold and not one of the few thousand alternatives? How do new folds arise and change during evolution of the proteome? Insight into these essential biological questions will be obtained by understanding the determinants of conformational specificity, the well-known ability of structured proteins to retain a finite population of native fold even under destabilizing conditions. One particularly interesting aspect of this problem is revealed by the repeated observation of “chameleon sequences” [1–4], which can adopt different folds, and the emerging discovery of “metamorphic proteins” [5, 6], which change fold as part of their function. Such extremes of conformational specificity, which have already been shown to be amenable to protein engineering [7–11], may prove to be an evolutionarily important mechanism for both fold change [9, 12–15] and functional versatility (as a prominent sub-class of “moonlighting proteins” [16, 17]). However, current bioinformatics tools and molecular dynamics simulations, using sequence or structure information, fail to reliably identify chameleon or metamorphic proteins [8, 18–20]. Novel information, not entirely based on either sequence or structure alone, may facilitate development of a more effective compatibility measurement between sequence and structure.
Our approach to addressing the problem is rooted in the ensemble nature of proteins [21], leveraging the long-standing realization that fold stability and conformational specificity are both thermodynamic in origin, and partially separable [22, 23]. Proteins in solution sample myriad conformations according to a Boltzmann distribution. Even when a single folded conformation is dominant, alternative structures could be transiently populated, albeit at vanishingly small amounts. Indeed, if protein sequences do obey Boltzmann statistics, each sequence has some probability of adopting every fold. Thus, the question of conformational specificity may be more tractable if rephrased: what is the difference in stability between one sequence adopting each of two alternative folds? Answering this question requires knowledge of the energetics of the compatibility between amino acid sequence and protein structure.
In this work, conformational specificity [22] is addressed from such a thermodynamic standpoint by development of a statistical framework for measurement of the compatibility between sequence and structure. An ensemble-based description of protein thermodynamics [24] is applied to a diverse database of protein folds, for which the positional thermodynamic stability of every residue is estimated. Importantly, this computational stability has been experimentally demonstrated [21, 24–28] to largely capture the cooperativity imparted by both local and global interactions, and from both enthalpic and entropic contributions. Thus, every residue in a protein can be described, not by residue letter or structural type, but instead by a so-called “thermodynamic environment” [29, 30].
Using a previously validated threading algorithm [30, 31], the energetic compatibility of amino acid sequence fragments adopting varied thermodynamic environment contexts was exhaustively computed, exploring general principles for conformational specificity and the organization of protein fold space. The results indicate that there is substantial energetic compatibility between ostensibly unrelated proteins, composed of energetically compatible and incompatible contributions that are heterogeneously distributed throughout the sequence. We find that conformational specificity, operationally defined as the high energetic compatibility of one sequence with one fold, is a function of both sequence and fold, and that evolution of one fold from another may not be energetically improbable.
Furthermore, because energetic compatibility is correlated with the number of incompatible contributions, negative design appears to be important for conformational specificity, particularly for small, single domain proteins. This finding suggests that negative selection could be an evolutionary strategy to minimize the effects of metamorphic structure, as small proteins are expected to be more susceptible to fold-switching [11].
Materials and Methods
Ensemble-Based Thermodynamic Database of Diverse Human Proteins
This database has been described and used in previous analyses; of note is the presence of diverse secondary structural classes and fold types as curated by the SCOP database [32]. Briefly, 122 H. sapiens proteins of known structure (Table S1) were taken from the Protein Data Bank (PDB) [33] and native state Boltzmann-weighted thermodynamic ensembles were generated using the COREX/BEST algorithm [21, 34]. A summary of the computational procedure used to generate this database is given in Figure 4, below. When present in the PDB coordinates, selenomethionine residues were manually edited to methionine to permit execution of the algorithm. Parameters for the algorithm were: window size of 5 residues, minimum window size of 4 residues, simulated temperature of 25 °C, entropy weighting of 0.5, and Monte Carlo sampling of at least 10,000 microstates per partition. Clustering of the COREX/BEST thermodynamic parameters ΔG, ΔHap, ΔHpol, TΔSconf to obtain eight thermodynamic environments was performed by partitioning-around-medoids, implemented in S-PLUS 6 (Insightful Corporation, Seattle, WA), as previously described [30, 31]. Thermodynamic parameters for the 17,801 residues in this database are given in Table S2. Log-odds scores (Figure 1), quantifying the observed to expected ratios of amino acids in thermodynamic environments, were computed from this database as previously described [29–31]. Secondary structure elements were assigned to each residue using STRIDE [35] and are listed in Table S2.
Exhaustive Gapless Scoring Between Sequence and Thermodynamic Environments
All amino acid sequences in the database were quantitatively compared with all proteins’ thermodynamic environments. This was performed twice, first using complete sequence strings compared with complete environment strings, and second after dividing complete strings into overlapping 13 residue fragments starting at all possible registers. The second procedure was deliberately chosen for three reasons: to reveal regional contributions to energetic compatibility, to avoid possible length-dependent artifacts, and to keep the total number of computations tractable. (Fragments of lengths 6 and 25 were also explored, with little qualitative change in results; data not shown.) Each comparison of sequence to environments used gapless scoring, popularly referred to as gapless “threading” [36] of an amino acid string against an environment string. A comparison was simply defined as a sum of the log-odds scores given by each residue/environment pair in Figure 1, using custom scripts written in Mathematica 9.0 (Wolfram Research, Champaign, IL). For example, the 13 residue amino acid sequence fragment starting at position 155 in the PDB coordinate file 1BYQ is NDDEQYAWESSAG. The threading score of this sequence fragment compared with the 13 residue thermodynamic environments fragment from the PDB file 1GP0 starting at position 1538, i.e. 5742211111248, is calculated as the sum of all 13 log-odds scores corresponding to each amino acid/environment pair, as listed in Figure 1. For this example, the sum would be 0.07 − 0.61 − 0.42 − 0.07 − 0.25 − 0.93 + 0.45 − 0.90 − 0.37 − 0.03 + 0.05 + 0.02 − 1.28 = −4.27. These computations were repeated until all sequence fragments were scored against all environments fragments. For comparisons of full-length proteins, the shorter protein was matched in all possible registers against the longer protein, such that the number of terms in the sum for each register was identical to the length of the shorter protein. Then, the maximum score over all registers was taken to be the single final score for that protein pair.
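The worked example above can be reproduced with a short sketch. The lookup table below holds only the thirteen residue/environment log-odds values quoted in that example, not the full Figure 1 matrix, and the function names are illustrative; they are not taken from the authors' Mathematica scripts.

```python
# Partial log-odds table: only the 13 (amino acid, environment) pairs quoted
# in the worked example; the complete matrix is the one shown in Figure 1.
LOG_ODDS = {
    ("N", "5"): 0.07, ("D", "7"): -0.61, ("D", "4"): -0.42,
    ("E", "2"): -0.07, ("Q", "2"): -0.25, ("Y", "1"): -0.93,
    ("A", "1"): 0.45, ("W", "1"): -0.90, ("E", "1"): -0.37,
    ("S", "1"): -0.03, ("S", "2"): 0.05, ("A", "4"): 0.02,
    ("G", "8"): -1.28,
}


def gapless_score(sequence, environments, table=LOG_ODDS):
    """Sum the log-odds score of each aligned residue/environment pair."""
    if len(sequence) != len(environments):
        raise ValueError("gapless scoring needs equal-length strings")
    return sum(table[(aa, env)] for aa, env in zip(sequence, environments))


def best_register_score(sequence, environments, table=LOG_ODDS):
    """Slide the shorter string along the longer one and keep the maximum."""
    n = min(len(sequence), len(environments))
    scores = []
    if len(sequence) <= len(environments):
        for start in range(len(environments) - n + 1):
            scores.append(gapless_score(sequence, environments[start:start + n], table))
    else:
        for start in range(len(sequence) - n + 1):
            scores.append(gapless_score(sequence[start:start + n], environments, table))
    return max(scores)


fragment_score = gapless_score("NDDEQYAWESSAG", "5742211111248")
print(round(fragment_score, 2))  # -4.27
```

Because the scoring is gapless, each register is a plain sum with no alignment search, which is what keeps the exhaustive all-vs-all comparison tractable.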
Parameterization of Probability Distributions: Significance of Energetic Compatibility
To assess the quality of these raw summed scores, a mathematical model was developed to estimate the expected chance occurrence of any particular raw score. Proteins of random composition and varying length were created by randomly choosing amino acids according to background frequencies in the Table S1 database (which were similar to background frequencies of amino acids seen in large sequence databases). These random sequences of amino acids were compared to identical length random sequences of similarly chosen thermodynamic environments and the total raw scores computed as described above. 120,000 such random proteins were scored at each chain length to obtain the reported histograms and curve fits (Fig. S1). Random protein creation, scoring, curve and distribution fitting were performed in Mathematica using custom scripts.
Empirical distributions of random gapless summed scores between amino acid sequences and thermodynamic environments were discovered to be statistically Gaussian for all lengths tested (Fig. S1). This result allowed the parameterization of a useful probability model for a gapless match of any length protein (Fig. S1b). In this model, as the length of a gapless match increased, it became progressively less likely to obtain a positive log-odds score (Fig. S1a); in other words, a randomly chosen sequence was expected to be energetically incompatible with a randomly chosen structure. In contrast, an extremely high positive score is uncommon in the model, and thus a significantly high score would be consistent with an empirical observation of “conformational specificity”: defined here as the extreme case where one amino acid sequence is energetically compatible with only one unique structure (Fig. S1a).
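The random-score model can be sketched as follows. This is a minimal illustration under stated assumptions: the log-odds table is a random placeholder (the real values are in Figure 1), amino acids and environments are drawn uniformly rather than from the database background frequencies, and the sample size is far smaller than the 120,000 used in the study.

```python
import random
from statistics import NormalDist

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
ENVIRONMENTS = "12345678"  # the eight thermodynamic environments


def random_score(length, table, rng):
    """Gapless score of one random sequence against one random environment string."""
    return sum(
        table[(rng.choice(AMINO_ACIDS), rng.choice(ENVIRONMENTS))]
        for _ in range(length)
    )


def fit_null(length, table, rng, n_samples=5000):
    """Fit a Gaussian to the scores of n_samples random comparisons."""
    scores = [random_score(length, table, rng) for _ in range(n_samples)]
    return NormalDist.from_samples(scores)


rng = random.Random(0)
# Placeholder table (assumption): random values with a slightly negative mean.
toy_table = {
    (aa, env): rng.gauss(-0.1, 0.5)
    for aa in AMINO_ACIDS
    for env in ENVIRONMENTS
}

null_13 = fit_null(13, toy_table, rng)
# Probability of a random length-13 score at most as large as an observed one.
p_at_most = null_13.cdf(-4.27)
```

With a table whose mean entry is negative, the fitted mean shifts further below zero as the match length grows, which is the behavior described for Fig. S1a.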
Computing Compatibility Index of Significant Matches and Principal Components Analysis
The 122 database proteins were exploded into 16,337 overlapping fragments of length 13 residues. Exhaustive all-vs-all comparisons of these 16,337 fragments resulted in greater than 266 million raw scores. Each raw score was then treated as a limit of integration in the length 13 Gaussian random score distribution, and the probability of obtaining a score of at most the observed raw score was computed using custom scripts. This list of p-values was filtered such that the best (most positive) and worst (most negative) of all comparisons, defined as those exhibiting p < 0.01 or p > 0.99, were retained. The resulting filtered comparisons were then mapped back onto the positions of amino acid sequence or thermodynamic environments in the full-length proteins from which they originally came. Counts at each position were tabulated to produce a density of significant best, or worst, comparisons with regard to either sequence or structure. Thus, this analysis resulted in a total of four new attributes measured at every position in every protein: most significant matches of amino acid sequence against all other thermodynamic environments, least significant matches of amino acid sequence against all other thermodynamic environments, most significant matches of thermodynamic environments against all other amino acid sequences, and least significant matches of thermodynamic environments against all other amino acid sequences. These four attributes were, respectively, named “positive compatibility index (PCI with respect to sequence)”, “negative compatibility index (NCI with respect to sequence)”, “positive compatibility index (PCI with respect to structure)”, and “negative compatibility index (NCI with respect to structure)” throughout the rest of this paper. To minimize possible end effects, the N-terminal 12 and C-terminal 13 values for each protein were ignored, resulting in a total of 14,751 residue positions, with four density counts at each position. These data were treated as a four-dimensional space and were subjected to standard eigenvalue decomposition [37] using an in-house C program (Figure 2). “Aggregate Negative Compatibility Indices” with respect to sequence or structure of an individual protein were defined as the integrated area along the entire protein of these respective densities (i.e. the area under the blue curves in Figs. 7a and 7b, respectively).
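The filtering and density steps can be illustrated with a small sketch. Several details are assumptions made for illustration: the null parameters and raw scores are invented, a retained fragment is taken to increment every residue position it covers (the text does not specify how counts were assigned to positions), and the aggregate index is approximated by a plain sum of the density.

```python
from statistics import NormalDist

FRAGMENT_LENGTH = 13


def tail_label(raw_score, null, low=0.01, high=0.99):
    """Label a raw score by its cumulative probability under the null model."""
    p = null.cdf(raw_score)  # probability of a score at most raw_score
    if p > high:
        return "positive"  # unusually compatible
    if p < low:
        return "negative"  # unusually incompatible
    return None  # not retained


def density(protein_length, fragment_starts, fragment_length=FRAGMENT_LENGTH):
    """Count, per residue position, how many retained fragments cover it."""
    counts = [0] * protein_length
    for start in fragment_starts:
        for pos in range(start, start + fragment_length):
            counts[pos] += 1
    return counts


null = NormalDist(mu=-1.3, sigma=2.0)  # hypothetical length-13 null parameters
hits = [(0, 4.5), (1, -7.0), (2, -1.0), (3, -6.5)]  # (fragment start, raw score)

negative_starts = [s for s, score in hits if tail_label(score, null) == "negative"]
positive_starts = [s for s, score in hits if tail_label(score, null) == "positive"]

nci = density(20, negative_starts)  # negative compatibility index per position
pci = density(20, positive_starts)  # positive compatibility index per position
aggregate_nci = sum(nci)            # discrete stand-in for the integrated area
```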
Provisional Classification of Energetic Compatibility: Susceptibility to Fold Switch
The median PCI and NCI within each protein were used to classify residue positions according to the following definitions. Figure 3 is a visual representation of this classification that may be referenced when the various categories are discussed later in the text. “Gatekeeper” positions exhibited an NCI greater than median and a PCI less than median; the term “Gatekeeper” was meant to capture the intuitive notion of a protein fragment being energetically unlikely to adopt any known conformation. “Permissive” positions exhibited an NCI less than median and a PCI greater than median; the term “Permissive” was meant to capture the intuitive notion of a protein fragment being energetically likely to adopt many conformations. “Selective” positions exhibited NCI and PCI both greater than median; the term “Selective” was meant to capture the intuitive notion of a protein fragment being energetically likely to adopt multiple conformations but simultaneously being unlikely to adopt others. In other words, “Selective” positions could indicate regions of a protein more susceptible to fold switching. “Inactive” positions exhibited NCI and PCI both less than median; the term “Inactive” was meant to capture the intuitive notion of no strong conformational preference. Since NCI and PCI were separate attributes of both sequence and structure, each residue position was assigned two classifications, one in terms of sequence and one in terms of structure. These classifications are listed in Table S2 for the proteins analyzed in this work.
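The four-way classification reduces to two median comparisons per position, as in the sketch below. The handling of values exactly equal to the median is an assumption (they are treated as not greater than the median), since the definitions above cover only the strictly greater and strictly less cases; the input values are invented.

```python
from statistics import median


def classify_positions(pci, nci):
    """Assign each position one of four labels from its PCI and NCI."""
    pci_median, nci_median = median(pci), median(nci)
    labels = []
    for p, n in zip(pci, nci):
        high_pci = p > pci_median  # ties count as "not greater" (assumption)
        high_nci = n > nci_median
        if high_nci and not high_pci:
            labels.append("Gatekeeper")
        elif high_pci and not high_nci:
            labels.append("Permissive")
        elif high_pci and high_nci:
            labels.append("Selective")
        else:
            labels.append("Inactive")
    return labels


# Toy indices for a five-residue stretch.
toy_pci = [0, 5, 5, 0, 2]
toy_nci = [9, 1, 9, 1, 4]
toy_labels = classify_positions(toy_pci, toy_nci)
```

Running the same function once on the sequence-based indices and once on the structure-based indices gives the two classifications per residue described above.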
Results
Proteins Represented in Energetic Terms
Previous work has established that proteins can be represented in energetic rather than in structural terms [30]. The conceptual basis of this energetic representation is that the positional thermodynamic stability of a folded protein can be computationally estimated, by treating the protein as a Boltzmann-weighted ensemble of partially folded microstates [24]. This process, algorithmically named COREX/BEST [34], can be summarized as follows (Figure 4). The experimental coordinates (i.e. crystallographic or NMR structure) are the input for COREX/BEST (Fig. 4, Step 1). A large number, typically millions, of partially folded microstates involving all regions of the protein are generated based on the input (Fig. 4, Step 2); the key simplifications here are the assumptions that any folded conformation is native-like and any unfolded conformation is expressed by average amounts of newly exposed polar and apolar surface area, relative to the PDB structure [21, 24, 38]. Each microstate is assigned a Gibbs free energy from a surface-area based function, and statistical weights and populations are calculated for every microstate in the ensemble (Fig. 4, Steps 2 & 3). For every residue position j in the protein, the entire ensemble is partitioned into sub-ensembles in which the position is either in a folded conformation or an unfolded conformation (Fig. 4, Step 4), thus defining a position-specific equilibrium constant, κf,j, between folded and unfolded. This equilibrium constant can be converted (Fig. 4, Step 5) to a position-specific stability, ΔGj, which quantitatively matches experimental position-specific stabilities measured from hydrogen exchange (Fig. 4, Step 6). Statistical analysis of the COREX/BEST output from a large number of diverse proteins results in a meaningful simplification of all position-specific stabilities into a small number (i.e. eight [30]) of clusters that share similar average values of stability.
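The partitioning step can be made concrete with a toy ensemble. This sketch is not the COREX/BEST implementation: the microstate free energies are invented rather than computed from surface areas, the ensemble has four states instead of millions, and the sign convention (stability reported as RT ln κf,j, so that larger values mean the folded sub-ensemble is favored) is an assumption.

```python
import math

R_KCAL = 1.987e-3     # gas constant in kcal/(mol K)
TEMPERATURE = 298.15  # 25 degrees C in kelvin


def residue_stabilities(microstates, temperature=TEMPERATURE):
    """Per-residue stability from a Boltzmann-weighted ensemble.

    microstates: list of (folded_flags, delta_g), where folded_flags marks
    each residue as folded (True) or unfolded (False) and delta_g is the
    microstate free energy in kcal/mol relative to the fully folded state.
    """
    rt = R_KCAL * temperature
    weights = [math.exp(-dg / rt) for _, dg in microstates]
    partition = sum(weights)
    probabilities = [w / partition for w in weights]

    n_residues = len(microstates[0][0])
    stabilities = []
    for j in range(n_residues):
        folded = sum(p for (flags, _), p in zip(microstates, probabilities) if flags[j])
        unfolded = sum(p for (flags, _), p in zip(microstates, probabilities) if not flags[j])
        kappa = folded / unfolded  # requires some microstate with residue j unfolded
        stabilities.append(rt * math.log(kappa))
    return stabilities


# Toy three-residue ensemble with invented free energies (kcal/mol).
toy_ensemble = [
    ((True, True, True), 0.0),     # fully folded reference state
    ((False, True, True), 2.0),    # residue 0 locally unfolded
    ((True, True, False), 1.0),    # residue 2 locally unfolded
    ((False, False, False), 5.0),  # fully unfolded
]
toy_stabilities = residue_stabilities(toy_ensemble)
```

In this toy case the middle residue is unfolded only in the fully unfolded state, so it comes out as the most stable position, while the residue whose local unfolding is cheapest is the least stable.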
Using our structure-based model of the native state ensemble (i.e. COREX/BEST) [21], it has been shown that these eight different “thermodynamic environments” [29] exist within any protein [30]. Furthermore, the propensities of amino acids to appear in these environments could be used as the basis of a fold recognition algorithm, much in the same way that helical sequences can be predicted from known helix propensities. Figure 5 shows an example protein color-coded according to the ensemble-based thermodynamic description of proteins, which is represented as eight color-coded environments [30]. Each energetic environment has a characteristic average stability resulting from enthalpic and entropic contributions associated with the computed change in solvent accessible polar and apolar surface upon locally unfolding each segment (Fig. 5, bottom) [21]. Importantly, these environments report on the energetics observed at a particular position rather than the contribution of the individual amino acid occupying that position, thus revealing how homologous proteins with marginal sequence identity can nonetheless share common thermodynamic signatures, and thus identical folds [29, 39]. As demonstrated, this representation has recapitulated numerous experimental observations that ground-state structures of proteins have regions of relatively high and low thermodynamic stability, and that these regions are not always intuitive upon visual inspection of the structure [28, 40].
As noted previously, several key features of this representation are exemplified in the Hsp90 protein (Fig. 5). First, the most stable regions are often in the core of the protein, which is true of this Hsp90. Second, elements of secondary structure, even those located in the core, are not uniformly stable: it is often observed that the middle residue positions of elements are more stable than the termini [41]. Third, although the most unstable regions are loops and turns, not all loops and turns are necessarily unstable, a counterintuitive result that has been borne out by experiment [42]. Although there are at least two low stability turns in this example (purple or blue), there is a prominent higher stability (orange) turn between strands 4 and 5 (upper left, Fig. 5), and the apparently coil-like linker (dark red) between strand 3 and helix 3 is among the highest stability regions of any protein in the database.
This energetic representation of proteins alone has formed the basis of an effective fold recognition algorithm, whereby sequences could be matched with their respective folds [29–31], even if the secondary structure information of the fold was not present in the training set [43]. This last result, that the energetic information of entirely alpha-helical proteins permitted recognition of entirely beta sheet proteins, compellingly established the universality of this energetic representation with regard to protein structure classification [44].
Quantifying Energetic Compatibility between Homologous and Non-Homologous Proteins
To test whether structured full-length proteins exhibit significant energetic compatibility with their respective sequences using the probability model described in Methods, we applied the model to the scores of all amino acid sequences in the database against all sets of thermodynamic environments (Fig. 6). Because the log-odds scores (Fig. 1) are dependent on both amino acid and thermodynamic environment, an all-vs.-all plot is necessarily separated into scoring of sequences against a structure (rows in Fig. 6), and structures against a sequence (columns in Fig. 6). Unlike scoring derived from symmetric amino acid substitution matrices, this analysis is not symmetric and thus may reveal differential scoring contributions from either a sequence or a structure perspective.
There are several noteworthy observations in Fig. 6. First, the diagonal of this plot, representing “self” matches of an amino acid sequence to its known correct fold, was clearly populated by substantial and significant scores, indicating that the algorithm works. These correct matches were highly specific: except for known homologous proteins (as classified in the SCOP database), no non-self match exhibited a p-value more significant than approximately 0.001. Although expected conformational specificities were thus recapitulated by the significant energetic compatibilities, no obvious relationship was observed that differentiated conformational specificity with respect to sequence or environments (the median correlation coefficient between rows and columns of Fig. 6 was r = +0.5, data not shown). Also not observed was any general pattern between fold type (e.g. all-alpha or all-beta, Fig. 6 braces) and energetic compatibility. For example, the mixed alpha + beta proteins 1BYQ and 1MWP did not exhibit increased energetic compatibility to other mixed alpha + beta proteins (boxed vertical columns in Fig. 6).
Unexpectedly, however, there was a large amount of marginal, yet significant, energetic compatibility between otherwise unrelated proteins: more than half of the non-self matches were significant at the 0.001 < p < 0.01 level (blue dots in Fig. 6). To investigate the source of this unexpected observation, the energetic compatibility between regions of individual proteins and the rest of the sequence or fold space was quantified.
Negative Contributions Dominate Energetic Compatibility between Sequence and Structure
The most statistically significant best and worst matches of 13 residue fragments were mapped to their locations on the full-length protein, and the densities of the matches were tabulated, as described in Methods. These densities were recorded in two ways: 1) mapping structure fragments to the full-length sequence, and 2) mapping sequence fragments to full-length structure. Thus, the highs and lows of density approximated the average energetic compatibility of a protein’s sequence or structure with a representative sample of the entire sequence or structure space. Since these densities were composed of the most extreme energetically compatible and incompatible matches between arbitrary sequences and arbitrary structures of globular proteins, they are referred to as “positive” and “negative” compatibility indices, respectively. In short, the fragment matches revealed regions of full-length proteins likely (or unlikely) to exhibit non-self conformational specificity, due to energetic characteristics shared between other globular proteins.
One example of these compatibility indices is displayed from the perspective of sequence (i.e. how a sequence scored in other fold fragments, Fig. 7a) and from the perspective of structure (i.e. how other sequence fragments scored in its fold, Fig. 7b). The variability of indices within an individual protein suggests that energetic compatibility is not uniformly distributed. Also clear is that sequence and structural compatibility indices are asymmetric. In other words, at a given position within an individual protein, the amino acid sequence at that position could have a very different compatibility for other environments than do the environments at that position for other sequences. For example, in labeled regions A, B, and C (Fig. 7), the negative compatibility index between the 1BYQ structure and all other sequences was relatively high, while the negative compatibility index between the sequence at this position and all other structures was low. This means that while the structure at that position does not accommodate many sequences, the sequence that is there is compatible with many folds. A third observation is that the magnitude of the negative compatibility index is, in general, much greater than the magnitude of the positive compatibility index. In other words, the blue curves in Fig. 7, and in most other proteins, are larger in magnitude than the red curves, consistent with the higher likelihood of obtaining negative random scores in the probability model. No obvious relations between fold type, secondary structure type, location of secondary structure, and the compatibility indices were seen.
It was hypothesized that these indices contained detailed information about energetic compatibility with multiple structures, and thus would provide insight into conformational specificity. To explore this hypothesis, eigenvalue decomposition (principal components analysis) was used to simplify these four-dimensional compatibility indices (Fig. 2). As expected, the first two principal components of the decomposition were dominated by the sequence and structure negative compatibility indices (red circles in Fig. 2), and constituted almost the entire information content (60% + 35% = 95%). Unexpectedly, the decomposition also revealed a secondary, but substantial, correlation in the patterns of positive and negative compatibility indices, as the coefficients of these quantities are of the same sign and order of magnitude (Fig. 2). Thus, the locations of the largest negative compatibility indices with respect to structure are also often the locations of the largest positive compatibility indices with respect to structure. Examples of this phenomenon can be seen in Fig. 7b, boxes A and B, where the peaks and valleys of both red and blue curves (positive and negative indices, respectively) roughly track each other. In summary, 95% of the information about positive and negative energetic compatibility could be retained by considering only the first two principal components, which are largely due to negative compatibility. Therefore, despite the necessity of a high positive score for one sequence to be conformationally specific for one structure, thermodynamically incompatible regions of sequence and structure largely organize the energetic compatibility, and thus possibly the conformational specificity, of this representative sample of protein fold space.
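The decomposition of the four compatibility indices can be sketched with a small eigenvalue routine. This is an illustration only: it uses power iteration with deflation on the sample covariance matrix, which is a swapped-in technique and not the authors' in-house C program, and the four-column toy data are invented so that the two NCI columns carry most of the variance.

```python
import math


def covariance_matrix(rows):
    """Sample covariance of equal-length numeric rows (one row per residue)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(d)]
    return [
        [sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / (n - 1)
         for b in range(d)]
        for a in range(d)
    ]


def top_eigenpair(matrix, iterations=500):
    """Largest eigenvalue and unit eigenvector of a symmetric PSD matrix."""
    d = len(matrix)
    v = [1.0 / (k + 1) for k in range(d)]  # arbitrary non-zero start vector
    for _ in range(iterations):
        w = [sum(matrix[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0, v
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * matrix[a][b] * v[b] for a in range(d) for b in range(d))
    return eigenvalue, v


def principal_components(rows, n_components=2):
    """Return (explained variance fraction, loadings) for the leading components."""
    cov = covariance_matrix(rows)
    d = len(cov)
    total_variance = sum(cov[k][k] for k in range(d))
    components = []
    for _ in range(n_components):
        lam, vec = top_eigenpair(cov)
        components.append((lam / total_variance, vec))
        # Deflate: subtract the component just found before the next search.
        cov = [[cov[a][b] - lam * vec[a] * vec[b] for b in range(d)] for a in range(d)]
    return components


# Toy rows: (PCI seq, NCI seq, PCI struct, NCI struct); values are invented.
toy_indices = [(1, 20, 1, 5), (2, 0, 1, 5), (1, 10, 2, 10), (2, 10, 2, 0)]
toy_components = principal_components(toy_indices)
```

Each returned pair gives the fraction of total variance captured by a component and its coefficients on the four indices, which are the quantities summarized in Fig. 2.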
The trends in Figure 7 were used as the basis for provisionally classifying the susceptibility of a sequence to switch fold (Fig. 7a) or the ability of a fold to accommodate other sequences (Fig. 7b). Four types of sequence segments were defined (Fig. 3): “permissive”, “selective”, “inactive”, and “gatekeeper” (Fig. 7a and b, upper bar). Permissive sequence, which accounts for 15% of the total sequence space, is so named because it is highly compatible with other folds but is rarely highly incompatible with other folds. In other words, these sequence segments may contribute to stabilizing a fold, but do little to select against other folds. Selective sequences, which at 35% of sequence space constitute one of the largest fractions, are those that score very highly in, and are thus highly compatible with, many folds, but are also highly incompatible with other folds. These sequence segments contribute to stabilizing the native fold, but also significantly select against other folds. Inactive sequences are those that appear to not contribute significantly to determining any particular fold and do little to select against any fold. Finally, there are so-called “gatekeeper” sequences that are not compatible with most other folds, and indeed significantly select against many folds; these comprise approximately 15% of sequence space. A similar analysis was performed to categorize the compatibility of fold segments; fractions of gatekeeper and permissive structure were each found to be approximately 11% and fractions of inactive and selective structure were each found to be approximately 39%.
Importantly, all proteins in this representative subset of the human proteome contained variable-sized segments of each type of sequence (Fig. 7c) and fold (Fig. 7d), revealing an overall architecture in which sequence and fold contributions to energetic compatibility are heterogeneously distributed throughout individual proteins. Indeed, the relatively large fractions of inactive sequence and fold segments suggest that the specific folds that some sequence segments adopt may be context dependent, lacking significant intrinsic propensity. Ideas such as context-dependent sequence propensities have been discussed for the particular case of beta strands [7], although for these proteins we find no significant correlation between beta sheet and inactive sequence (data not shown).
Protein Size Dependence of Negative Energetic Compatibility
Although the magnitudes of negative (i.e. incompatible) and positive (i.e. compatible) scoring sequences were comparable, the amount of the incompatibilities was observed to be significantly higher than the amount of compatibilities. The dominance of energetic incompatibility is consistent with the idea of “negative selection” [23, 45–48], i.e. that through evolution most other competing folds become thermodynamically incompatible with a particular amino acid sequence. It was hypothesized that the total amount of negative energetic compatibility (i.e. the integrated area of the blue curves in Fig. 7) exhibited by a protein, from either sequence or structure, could be related to the widespread non-self energetic compatibility seen in Fig. 6. In other words, does the amount of negative selection scale with the overall ability of a sequence to adopt other folds? To address this question, the p-values of the optimal non-self scores in Fig. 6 were plotted against the aggregate energetic incompatibility of each protein (Fig. 8). Significant, though modest, correlations were indeed observed between aggregate negative compatibility and the energetic compatibility between sequence and structure, suggesting that negative selection exerts a significant influence on conformational specificity.
However, an unexpected pattern was observed in these correlations: the relationship between aggregate negative compatibility and energetic compatibility changed sign as a function of protein size (Fig. 9). Longer proteins, such as the 228-residue Hsp90 1BYQ, exhibited a negative correlation between negative compatibility and energetic compatibility (Fig. 8a), while shorter proteins, such as the 96-residue N-terminal domain of amyloid precursor protein 1MWP, exhibited a positive correlation (Fig. 8b). In other words, shorter proteins exhibited increased conformational specificity towards an alternative fold when that alternative fold exhibited increased energetic incompatibility with fragments from all other proteins. The positive correlation reached a maximum value at a protein size of approximately 100 residues (Fig. 9). The fact that the correlation changed sign indicates that both compatibility and incompatibility influence the specificity of proteins of all sizes, but that the relative contribution of incompatibility monotonically decreases with protein length. In other words, the requirement for negative selection appears to be released as sequence length increases.
Discussion
Energetic Incompatibility Influences Protein Conformational Specificity
Two significant insights emerge from these studies, accepting the hypothesis that energetic compatibility is a measure of the degree of conformational specificity. First, conformational specificity of a representative sample of the human proteome, and presumably the entire protein fold space, appears to be organized by energetic incompatibility. This key insight could not be obtained by inspection of the amino acid sequences or ground state structures. Second, protein conformational specificity is a complex function of both the sequence and the fold, with both positive and negative contributions. Just as not all sequence and structure segments contribute equally to protein stability, neither do they contribute equally to conformational specificity. Importantly, although a weak trend exists for more stable regions to exhibit a higher negative compatibility index with respect to structure, the correlation between stability and specificity is imperfect; the most stable regions are not always the most specific, nor are the least stable regions always the least specific. Thus, gatekeeper residues located in high-stability regions may be informative determinants of conformational specificity and potential targets for fold-switch engineering. Conversely, mutation or removal of permissive residues may permit increased specificity for a desired fold.
For all proteins, regardless of size, “designing-in” favorable interactions is important for adopting stable structure, a conclusion that can be drawn directly from the highly significant diagonal scores in Fig. 6. Indeed, structure-guided protein engineering has repeatedly employed this idea of “positive design” with much success [49–51]. However, the present analysis suggests that in natural proteins extremely unfavorable interactions in alternative folds (energetic incompatibility) dominate conformational specificity. Furthermore, the intriguing sign switch observed in Fig. 9 reveals that negative selection has an even more pronounced influence for proteins of small size.
Local Structure and Sequence Contributions to Negative Selection
The regions of highest NCI are enriched in proline (Pro) and glycine (Gly) residues (with respect to sequence, Fig. S2a) and in high-stability environments (with respect to structure, Fig. S2c). These enrichments suggest that the mechanisms for negative selection can be localized to individual positions of a protein’s structure and sequence: high-stability environments, and the amino acids Gly and Pro. Examples include Box A of Fig. 7b, which is a high-stability region of 1BYQ that exhibits high NCI with respect to structure, and Box D of Fig. 7a, a region enriched in Gly and Pro that exhibits high NCI with respect to sequence. The conformational restriction and conformational freedom afforded by Pro and Gly side chains, respectively, are likely to be two physical mechanisms for mediating negative selection [52].
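One simple way to quantify an enrichment of this kind is a log-odds ratio between the amino acid frequency in the highest-NCI positions and the frequency over the whole sequence. The following Python sketch illustrates the idea only; the function name, the `top_fraction` cutoff, and the toy sequence and NCI values are hypothetical and are not claimed to reproduce the procedure behind Fig. S2.

```python
import math
from collections import Counter

def enrichment_log_odds(sequence, nci, top_fraction=0.2):
    """Log-odds of each amino acid in the highest-NCI positions relative to
    its background frequency in the whole sequence (natural log).
    Only amino acids that occur in the selected positions are returned."""
    if len(sequence) != len(nci) or not sequence:
        raise ValueError("sequence and nci must be non-empty and equal in length")
    n_top = max(1, round(top_fraction * len(sequence)))
    # Indices sorted from highest to lowest NCI; keep the top n_top positions.
    ranked = sorted(range(len(sequence)), key=lambda i: nci[i], reverse=True)
    selected = Counter(sequence[i] for i in ranked[:n_top])
    background = Counter(sequence)
    return {
        aa: math.log((count / n_top) / (background[aa] / len(sequence)))
        for aa, count in selected.items()
    }

# Hypothetical toy example: Gly and Pro sit at the highest-NCI positions.
toy_sequence = "AGPAAAGPAA"
toy_nci = [0.1, 0.9, 0.8, 0.1, 0.2, 0.1, 0.7, 0.95, 0.0, 0.1]
toy_enrichment = enrichment_log_odds(toy_sequence, toy_nci, top_fraction=0.4)
print(toy_enrichment)
```

A positive value indicates that the residue type is over-represented among the high-NCI positions relative to the sequence as a whole.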
Localization of negative selection could guide protein engineering efforts to promote desired, or alternative, structure using targeted Pro or Gly substitutions [53] and core destabilization. The similarity in environmental propensities of negative compatibility and gatekeeper positions (Fig. S2c&d) suggests co-localization of gatekeeper positions, high structural stability, and negative selection. Bearing in mind that mutational effects of sequence and stability changes could be opposing (e.g. Fig. 1 indicates that a Pro substitution in a high-stability region, intended to increase NCI with respect to sequence, is unfavorable and could be destabilizing with respect to structure), such changes might afford a crude tool to introduce or remove specificity. In any event, the analysis presented here provides the locations on each protein where such efforts should be targeted to increase chances for success.
Negative Selection Mediates Protein Domain Evolution
The average domain size of structured proteins is approximately 100 residues [54], and 90% of all known domains are less than 200 residues [55], spanning the size range of proteins sampled here. One implication for protein evolution is that single domain proteins, usually thought of as modular “building blocks” in the organization of larger proteins [56], might be particularly susceptible to sampling alternative structures. Fold-switching is expected to be more prevalent in smaller proteins [11] and less prevalent in larger proteins [15]. These expectations are supported by strong positive correlations between experimental thermodynamic stability and size [38, 57], as well as by lattice models that exhibit a larger fraction of alternative minimum energy compact structures as the chain length decreases [58, 59]. Thus, increased negative design would be important for small proteins to preserve structure and function when faced with the constraints of low stability and large numbers of alternative folds.
Consistent with this conjecture is the SCOP classification of “small proteins”, whose membership consists of proteins that explicitly require disulfide bonds or metal ions for increased stability [32]. In this scenario, primordial small proteins would have had a tendency for metamorphic behavior, and negative evolutionary selection would thus have been a necessary adaptation for dependable metabolic processes mediated by such molecules. Figure 9 suggests that as proteins increase in length they gradually lose the requirement for negative selection. Perhaps this implies the conformational space for a large protein is so vast that preservation of fold is energetically “easier”, as long as aggregation is avoided [60]. Alternatively, larger proteins, which are sometimes composed of several smaller domains, contain functionally important inter-domain interfaces that alter the energetic landscape relative to the individual domains. The role of negative design in larger multi-domain proteins remains to be investigated.
Protein Design Strategy Based on Negative Selection
Aside from the effects of negative selection for small proteins, we believe that thermodynamic environments data (Fig. 5) could be practically used as a template for fold design. The log-odds scores (Fig. 1) may be used to generate reasonable amino acid choices for site-directed mutagenesis and/or de novo design of full-length proteins. A possible advantage of using thermodynamic environments as a design template, as opposed to structural coordinates, is that environments avoid the difficulty of a “frozen approximation” of the backbone [51]. Instead, thermodynamic environments intrinsically incorporate a range of small conformational adjustments that are approximately isoenergetic within the average stability, enthalpy, and entropy of the environment. One possible design strategy, leveraging both the theory and techniques in this paper, would be to simultaneously maximize the positive score of sequence choices for a desired fold target using the log-odds scores while maximizing the negative score of the same sequence against a large library of alternative folds. This strategy could mimic the “energy gap” between desired and alternative structures that has been demonstrated to be useful in protein design [61]. Such a strategy would be computationally fast to implement using the thermodynamic environments data in Table S2. The pursuit of such avenues is currently under way [62].
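The design strategy outlined above can be sketched in a few lines of Python. The sketch scores a candidate sequence against a fold by summing per-position log-odds values (a gapless threading) and ranks candidates by the gap between the target score and the best-scoring alternative fold. Everything here is an assumption for illustration: the environment labels `HS` and `LS`, the toy log-odds values, the function names, and the use of "target minus best alternative" as the objective. The values in Fig. 1 and Table S2 are not reproduced.

```python
def thread_score(sequence, environments, log_odds):
    """Gapless threading score: sum of log-odds values for placing each
    residue of `sequence` in the environment at the same position."""
    if len(sequence) != len(environments):
        raise ValueError("sequence and environment profile must have equal length")
    return sum(log_odds[env][aa] for aa, env in zip(sequence, environments))

def specificity_gap(sequence, target_envs, alternative_envs, log_odds):
    """Score for the target fold minus the best score over all alternatives.
    A large positive gap means the target is favored and alternatives are not."""
    target = thread_score(sequence, target_envs, log_odds)
    best_alternative = max(
        thread_score(sequence, envs, log_odds) for envs in alternative_envs
    )
    return target - best_alternative

def best_candidate(candidates, target_envs, alternative_envs, log_odds):
    """Candidate sequence with the largest specificity gap."""
    return max(
        candidates,
        key=lambda seq: specificity_gap(seq, target_envs, alternative_envs, log_odds),
    )

# Hypothetical toy log-odds table: two environments, three residue types.
toy_log_odds = {
    "HS": {"L": 1.0, "G": -1.0, "P": -2.0},  # assumed high-stability environment
    "LS": {"L": -0.5, "G": 0.8, "P": 0.6},   # assumed low-stability environment
}
toy_target = ["HS", "LS", "HS"]
toy_alternatives = [["LS", "LS", "LS"], ["HS", "HS", "LS"]]
toy_candidates = ["LLL", "LGL", "GGG"]

chosen = best_candidate(toy_candidates, toy_target, toy_alternatives, toy_log_odds)
print("selected candidate:", chosen)
```

Using the best-scoring alternative, rather than the average over the library, is a deliberately conservative choice in this sketch: a single highly compatible competing fold is enough to reduce the gap.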
Supplementary Material
Acknowledgments
This work was supported by NIH grant R01-GM63747 and NSF grant MCB0446050. Additional support from the Woodrow Wilson Fellowship of Johns Hopkins University to J. H. is gratefully acknowledged. The authors wish to thank Alex Chin for critical reading of the manuscript and for helpful discussions.
References
- 1. Kabsch W, Sander C. On the use of sequence homologies to predict protein structure: identical pentapeptides can have completely different conformations. Proceedings of the National Academy of Sciences of the United States of America. 1984;81:1075–1078. doi: 10.1073/pnas.81.4.1075.
- 2. Sudarsanam S. Structural diversity of sequentially identical subsequences of proteins: identical octapeptides can have different conformations. Proteins: Structure, Function, and Genetics. 1998;30:228–231. doi: 10.1002/(sici)1097-0134(19980215)30:3<228::aid-prot2>3.0.co;2-g.
- 3. Guo JT, Jaromczyk JW, Xu Y. Analysis of chameleon sequences and their implications in biological processes. Proteins: Structure, Function, and Bioinformatics. 2007;67:548–558. doi: 10.1002/prot.21285.
- 4. Li W, et al. ChSeq: A database of chameleon sequences. Protein Science. 2015;24(7):1075–1086. doi: 10.1002/pro.2689.
- 5. Murzin AG. Metamorphic proteins. Science. 2008;320:1725–1726. doi: 10.1126/science.1158868.
- 6. Chang YG, et al. Circadian rhythms. A protein fold switch joins the circadian oscillator to clock output in cyanobacteria. Science. 2015;349(6245):324–8. doi: 10.1126/science.1260031.
- 7. Minor DL, Jr, Kim PS. Context-dependent secondary structure formation of a designed protein sequence. Nature. 1996;380(6576):730–4. doi: 10.1038/380730a0.
- 8. Alexander PA, et al. From the Cover: A minimal sequence code for switching protein structure and function. Proc Natl Acad Sci U S A. 2009;106(50):21149–54. doi: 10.1073/pnas.0906408106.
- 9. Cordes MH, et al. Evolution of a protein fold in vitro. Science. 1999;284(5412):325–8. doi: 10.1126/science.284.5412.325.
- 10. Chen SH, Meller J, Elber R. Comprehensive analysis of sequences of a protein switch. Protein Science. 2015. doi: 10.1002/pro.2723. Epub ahead of print.
- 11. Porter LL, et al. Subdomain interactions foster the design of two protein pairs with ~80% sequence identity but different folds. Biophysical Journal. 2015;108(1):154–162. doi: 10.1016/j.bpj.2014.10.073.
- 12. Bryan PN, Orban J. Implications of protein fold switching. Current Opinion in Structural Biology. 2013;23:314–316. doi: 10.1016/j.sbi.2013.03.001.
- 13. He Y, et al. Mutational tipping points for switching protein folds and functions. Structure. 2012;20(2):283–291. doi: 10.1016/j.str.2011.11.018.
- 14. Eaton KV, et al. Studying protein evolution with hybrids of differently folded homologs. Protein Engineering, Design, and Selection. 2015;28(8):241–250. doi: 10.1093/protein/gzv027.
- 15. Meyerguz L, Kleinberg J, Elber R. The network of sequence flow between protein structures. Proceedings of the National Academy of Sciences USA. 2007;104(28):11627–11632. doi: 10.1073/pnas.0701393104.
- 16. Jeffery CJ. Why study moonlighting proteins? Front Genet. 2015;6:211. doi: 10.3389/fgene.2015.00211.
- 17. Copley SD. Moonlighting is mainstream: paradigm adjustment required. Bioessays. 2012;34:578–588. doi: 10.1002/bies.201100191.
- 18. Cao B, Elber R. Computational exploration of the network of sequence flow between protein structures. Proteins: Structure, Function, and Bioinformatics. 2010;78:985–1003. doi: 10.1002/prot.22622.
- 19. Allison JR, et al. Current computer modeling cannot explain why two highly similar sequences fold into different structures. Biochemistry. 2011;50(50):10965–10973. doi: 10.1021/bi2015663.
- 20. Shen Y, et al. De novo structure generation using chemical shifts for proteins with high-sequence identity but different folds. Protein Science. 2010;19(2):349–56. doi: 10.1002/pro.303.
- 21. Hilser VJ, et al. A statistical thermodynamic model of the protein ensemble. Chem Rev. 2006;106(5):1545–58. doi: 10.1021/cr040423+.
- 22. Lattman EE, Rose GD. Protein folding - what’s the question? Proceedings of the National Academy of Sciences, USA. 1993;90:439–441. doi: 10.1073/pnas.90.2.439.
- 23. Bolon DN, et al. Specificity vs. stability in computational protein design. Proceedings of the National Academy of Sciences, USA. 2005;102(36):12724–12729. doi: 10.1073/pnas.0506124102.
- 24. Hilser VJ, Freire E. Structure-based calculation of the equilibrium folding pathway of proteins. Correlation with hydrogen exchange protection factors. J Mol Biol. 1996;262(5):756–72. doi: 10.1006/jmbi.1996.0550.
- 25. Pan H, Lee JC, Hilser VJ. Binding sites in Escherichia coli dihydrofolate reductase communicate by modulating the conformational ensemble. Proc Natl Acad Sci U S A. 2000;97(22):12020–5. doi: 10.1073/pnas.220240297.
- 26. Babu CR, Hilser VJ, Wand AJ. Direct access to the cooperative substructure of proteins and the protein ensemble via cold denaturation. Nat Struct Mol Biol. 2004;11(4):352–7. doi: 10.1038/nsmb739.
- 27. Whitten ST, Garcia-Moreno EB, Hilser VJ. Local conformational fluctuations can modulate the coupling between proton binding and global structural transitions in proteins. Proc Natl Acad Sci U S A. 2005;102(12):4282–7. doi: 10.1073/pnas.0407499102.
- 28. Liu T, et al. Quantitative assessment of protein structural models by comparison of H/D exchange MS data with exchange behavior accurately predicted by DXCOREX. Journal of the American Society for Mass Spectrometry. 2012;23:43–56. doi: 10.1007/s13361-011-0267-9.
- 29. Wrabl JO, Larson SA, Hilser VJ. Thermodynamic environments in proteins: fundamental determinants of fold specificity. Protein Sci. 2002;11(8):1945–57. doi: 10.1110/ps.0203202.
- 30. Larson SA, Hilser VJ. Analysis of the “thermodynamic information content” of a Homo sapiens structural database reveals hierarchical thermodynamic organization. Protein Sci. 2004;13(7):1787–801. doi: 10.1110/ps.04706204.
- 31. Wang S, et al. Denatured-state energy landscapes of a protein structural database reveal the energetic determinants of a framework model for folding. J Mol Biol. 2008;381(5):1184–201. doi: 10.1016/j.jmb.2008.06.046.
- 32. Andreeva A, et al. Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res. 2008;36(Database issue):D419–25. doi: 10.1093/nar/gkm993.
- 33. Berman HM, et al. The Protein Data Bank. Nucleic Acids Res. 2000;28(1):235–42. doi: 10.1093/nar/28.1.235.
- 34. Vertrees J, et al. COREX/BEST server: a web browser-based program that calculates regional stability variations within protein structures. Bioinformatics. 2005;21(15):3318–9. doi: 10.1093/bioinformatics/bti520.
- 35. Frishman D, Argos P. Knowledge-based protein secondary structure assignment. Proteins. 1995;23(4):566–79. doi: 10.1002/prot.340230412.
- 36. Reva BA, Finkelstein AV, Skolnick J. What is the probability of a chance prediction of a protein structure with an rmsd of 6 Å? Folding and Design. 1998;3:141–147. doi: 10.1016/s1359-0278(98)00019-4.
- 37. Press WH, et al. Numerical recipes in C: the art of scientific computing. 2nd ed. New York: Cambridge University Press; 1992.
- 38. Robertson AD, Murphy KP. Protein structure and the energetics of protein stability. Chemical Reviews. 1997;97:1251–1267. doi: 10.1021/cr960383c.
- 39. Wrabl JO, Hilser VJ. Investigating homology between proteins using energetic profiles. PLoS Comput Biol. 2010;6(3):e1000722. doi: 10.1371/journal.pcbi.1000722.
- 40. Bai Y, et al. Thermodynamic parameters from hydrogen exchange measurements. Methods Enzymol. 1995;259:344–56. doi: 10.1016/0076-6879(95)59051-x.
- 41. Munoz V, Serrano L. Helix design, prediction, and stability. Current Opinion in Biotechnology. 1995;6:382–386. doi: 10.1016/0958-1669(95)80066-2.
- 42. Wang Y, Shortle D. Residual helical and turn structure in the denatured state of staphylococcal nuclease: analysis of peptide fragments. Folding and Design. 1997;2(2):93–100. doi: 10.1016/S1359-0278(97)00013-8.
- 43. Wrabl JO, Larson SA, Hilser VJ. Thermodynamic propensities of amino acids in the native state ensemble: implications for fold recognition. Protein Sci. 2001;10(5):1032–45. doi: 10.1110/ps.01601.
- 44. Vertrees J, Wrabl JO, Hilser VJ. An energetic representation of protein architecture that is independent of primary and secondary structure. Biophys J. 2009;97(5):1461–70. doi: 10.1016/j.bpj.2009.06.020.
- 45. Leaver-Fay A, et al. A generic program for multistate protein design. PLoS One. 2011;6(7):e20937. doi: 10.1371/journal.pone.0020937.
- 46. Minning J, Porto M, Bastolla U. Detecting selection for negative design in proteins through an improved model of the misfolded state. Proteins: Structure, Function, and Bioinformatics. 2013;81(7):1102–1112. doi: 10.1002/prot.24244.
- 47. Noivirt-Brik O, Horovitz A, Unger R. Trade-off between positive and negative design of protein stability: from lattice models to real proteins. PLoS Computational Biology. 2009;5(12):e1000592. doi: 10.1371/journal.pcbi.1000592.
- 48. Berezovsky I, Zeldovich KB, Shakhnovich EI. Positive and negative design in stability and thermal adaptation of natural proteins. PLoS Computational Biology. 2007;3(3):e52. doi: 10.1371/journal.pcbi.0030052.
- 49. Kuhlman B, et al. Design of a novel globular protein fold with atomic-level accuracy. Science. 2003;302(5649):1364–1368. doi: 10.1126/science.1089427.
- 50. Dahiyat BI, Mayo SL. De novo protein design: fully automated sequence selection. Science. 1997;278(5335):82–87. doi: 10.1126/science.278.5335.82.
- 51. Murphy GS, et al. Increasing sequence diversity with flexible backbone protein design: the complete redesign of a protein hydrophobic core. Structure. 2012;20(6):1086–1096. doi: 10.1016/j.str.2012.03.026.
- 52. Creighton TE. Proteins: Structures and Molecular Properties. 2nd ed. New York: W.H. Freeman and Company; 1993.
- 53. Matthews BW, Nicholson H, Becktel WJ. Enhanced protein thermostability from site-directed mutations that decrease the entropy of unfolding. Proceedings of the National Academy of Sciences of the United States of America. 1987;84(19):6663–6667. doi: 10.1073/pnas.84.19.6663.
- 54. Wheelan SJ, Marchler-Bauer A, Bryant SH. Domain size distributions can predict domain boundaries. Bioinformatics. 2000;16(7):613–618. doi: 10.1093/bioinformatics/16.7.613.
- 55. Islam SA, Luo J, Sternberg MJ. Identification and analysis of domains in proteins. Protein Engineering. 1995;8(6):513–525. doi: 10.1093/protein/8.6.513.
- 56. Cesareni G, et al., editors. Modular Protein Domains. Wiley-VCH; Weinheim, FRG: 2005.
- 57. Ghosh K, Dill KA. Computing protein stabilities from their chain lengths. Proceedings of the National Academy of Sciences, USA. 2009;106(26):10649–10654. doi: 10.1073/pnas.0903995106.
- 58. Camacho CJ, Thirumalai D. Minimum energy compact structures of random sequences of heteropolymers. Physical Review Letters. 1993;71(15):2505–2508. doi: 10.1103/PhysRevLett.71.2505.
- 59. Dill KA, et al. Principles of protein folding - a perspective from simple exact models. Protein Science. 1995;4:561–602. doi: 10.1002/pro.5560040401.
- 60. Doye JP, Louis AA, Vendruscolo M. Inhibition of protein crystallization by negative design. Physical Biology. 2004;1(1–2):9–13. doi: 10.1088/1478-3967/1/1/P02.
- 61. Shakhnovich E. Protein folding thermodynamics and dynamics: where physics, chemistry, and biology meet. Chemical Reviews. 2006;106(6):1559–1588. doi: 10.1021/cr040425u.
- 62. Hoffmann J, Wrabl JO, Hilser VJ. Towards the design of metamorphic proteins using ensemble-based energetic information. Biophysical Journal: 2013 Biophysical Society Meeting Abstracts. 2013;(Supplement):2897-Pos.