Protein Science: A Publication of the Protein Society
2015 Jul 1;25(1):135–146. doi: 10.1002/pro.2723

Comprehensive analysis of sequences of a protein switch

Szu‐Hua Chen 1,2, Jaroslaw Meller 3,4,5,6,7, Ron Elber 2,8
Editors: Carol B Post, Charles L Brooks III
PMCID: PMC4815306  PMID: 26073558

Abstract

Switches form a special class of proteins that dramatically change their three‐dimensional structures upon a small perturbation. One possible perturbation that we explore is that of a single point mutation. Building on the pioneering experimental work of Alexander et al. (Alexander et al. PNAS, 2007; 104,11963–11968) that determines switch sequences between α and α+β folds we conduct a comprehensive sequence sampling by a Markov Chain with multiple fitness criteria to identify new switches given the experimental folds. We screen for switch sequences using a combination of contact potential, secondary structure prediction, and finally molecular dynamics simulations. Statistical properties of switch sequences are discussed and illustrated to be most sensitive to mutation at the N‐ and C‐ termini of the switch protein. Based on this analysis, a particularly stable putative switch pair is identified and proposed for further experimental analysis.

Keywords: protein folds, mutations, structural flips, molecular dynamics, secondary structure prediction, contact maps

Introduction

The mechanism of protein evolution has attracted considerable attention in molecular biology and is a prime challenge in this field. A plausible model of protein evolution exploits point mutations.1 In this model the amino acid sequence of a protein is altered, one residue at a time, and is accepted or rejected based on fitness criteria. The fitness is determined by the condition that the new protein must function properly in the biological environment, and is frequently modeled by thermal stability.2, 3, 4, 5, 6 The series of point mutations create a Markov chain of events that may modify the structure and function of the original protein. The process is Markovian since the acceptance or rejection of the new protein depends only on the current and newly generated sequences. The Markovian modeling of a series of point mutations is the approach we adopt in this study. The focus of this manuscript is on a sub‐problem, which is the evolution of protein structures.

In the last few years we have constructed a contact‐based computational model that explores structural evolution of proteins following point mutations.7, 8 This model is appealing but is clearly incomplete. The evolutionary processes in biology allow for more complex steps in sequence and structure space, while the model is more restrictive. We did not include, for example, the possibility of domain swaps9 that enables the replacement of large protein segments in one mutational event. Moreover, errors and limitations of the fitness function are making the model more permissive. It is therefore not possible to put clear error bounds on the simulations. Despite the limitations, the model suggests a useful framework to consider the relationships between changes in sequence space and their impact on structure evolution.

Our past studies were purely computational and globally examined the space of known protein folds. We considered a representative set of all experimentally determined protein structures as reported in the Protein Data Bank (PDB10) and placed them as nodes in a network. We called it “the network of sequence flow”.7 By structurally aligning all folds in the PDB and removing repeated folds such that no two folds in the set have a TM‐score11 greater than 0.8, we obtained a set of around 2,000 structures that represent known folds in the PDB. We designed a Markov chain to estimate the number of sequences, N_A, that fold into a particular structure, A. This number is the sequence capacity of fold A. We also estimated the absolute number of sequences, N_AB, that change structures with a point mutation (“flip”) from fold A to another fold, B. The number of flips divided by the number of sequences in fold A, N_AB/N_A, was assigned as the weight of the edge connecting A and B. Note that the graph is asymmetric and therefore directional. If a sequence S_A, which folds to structure A, mutates to a sequence S_B and flips to structure B, then the final sequence can flip back to structure A using the same point mutation in reverse. Hence N_AB = N_BA. However, in general N_A ≠ N_B, and therefore the edge weight differs between the A‐to‐B and B‐to‐A directions. The capacities and the edge weights are average quantities, and thus we could estimate them accurately with a Markov chain and with statistics significantly smaller than exhaustive (a few million sequences per structure).7, 8 The sampling was not always sufficient to obtain more detailed information. For example, we were not able to quantitatively study the probability of finding each of the 20 amino acids in particular positions at network edges.
While we have conducted exhaustive exploration of the network of sequence flow for a model system,6 it is not obvious that the conclusions from the model are applicable to experimental protein structures. In the present study we comprehensively investigate transient sequences for a single edge of the network for which detailed experimental information is available.

After our first study of the network of sequence flow was completed,7 a series of experimental studies was published in which a cleverly designed pair of protein folds (an edge in the network) was analyzed.12, 13, 14, 15 Structures and thermodynamic stabilities were determined for 31 synthetic sequences of the two folds, that is, that of protein GA (fold 3α) and protein GB (fold 4β+α). The analysis includes four sequences that flip between the folds upon a point mutation and other proteins with high sequence identities but different structures. From a theoretical perspective, the experiments make it possible to verify the network model. In a recent paper16 we tested and improved our fitness function by exploiting the experimental data for the α and α+β switch. We illustrated that our previous and adjusted protocols can predict the presence of a switch with high confidence. The present manuscript is a continuation and elaboration of the previous studies. We use the method illustrated previously to investigate comprehensively the properties of the sequences at the switching point, that is, the interface in sequence space that divides the sequences belonging to each of the two folds. A sequence is said to be at the interface between the two folds if a point mutation can be found that changes its stable structure from one fold to the other. In practice, we consider a change in the stability ranking of the two folds using our scoring protocols.16 The connection of this edge to the experiment makes this particular flip between folds special and of particular interest for simulations.

Results and Discussion

We follow the binary flip protocol of the experimentalists to create switch sequences. There are 47 non‐identical aligned amino acids of a total of 56 when comparing the wild‐type GA and GB sequences. Therefore, the total number of sequences in the space of binary permutations is 2^47 ≈ 10^14. The number of sequences that we sampled during the generation of the Markov chains is only 10^7 × 20 (20 Markov chains and 10^7 permutation attempts per Markov chain), which is clearly not exhaustive. Nevertheless, our sample provides converged properties for some functions of interest, as we illustrate below.

Consider the complete set of sequences that we sampled. This set is divided into two: the set of switch sequence pairs and the set of non‐switch sequence pairs. Each sequence pair differs by a point mutation at one site. We define a transition probability P_j, which is the probability that a point mutation at site j will flip the protein fold. To distinguish between mutations that cause a switch and those that do not, we define P_{j,switch} and P_{j,non‐switch}. We estimate P_j numerically by averaging over the independent Markov chains k = 1, ..., L, where L is equal to 10.

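$$P_{j,\mathrm{switch}} = \frac{1}{L}\sum_{k=1}^{L}\frac{N_{\mathrm{switch},j,k}}{N_{\mathrm{switch},k}}, \qquad P_{j,\mathrm{non\text{-}switch}} = \frac{1}{L}\sum_{k=1}^{L}\frac{N_{\mathrm{non\text{-}switch},j,k}}{N_{\mathrm{non\text{-}switch},k}} \tag{1}$$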

N_{switch,k} is the total number of switch pairs found in Markov chain k, and N_{switch,j,k} is the number of switch pairs in Markov chain k where the point mutation is found at the non‐identical position j. Similar definitions follow for the non‐switch pairs, N_{non‐switch,k} and N_{non‐switch,j,k}.
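The averaging in Eq. (1) can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function name `transition_probabilities` and the input layout (one list of mutated positions per Markov chain) are assumptions made for the example, and the toy position lists are invented.

```python
from collections import Counter

def transition_probabilities(chains, positions):
    """Per-site fraction of pairs, averaged over independent chains (Eq. 1).

    chains: list of lists; each inner list holds the mutated position j of
            every pair (switch or non-switch) recorded in one Markov chain.
    positions: iterable of site indices to report.
    """
    if not chains or any(len(chain) == 0 for chain in chains):
        raise ValueError("every chain must contain at least one pair")
    totals = {j: 0.0 for j in positions}
    for chain in chains:
        counts = Counter(chain)      # N_{.,j,k} for this chain
        n_k = len(chain)             # N_{.,k} for this chain
        for j in totals:
            totals[j] += counts[j] / n_k
    n_chains = len(chains)           # L in Eq. (1)
    return {j: total / n_chains for j, total in totals.items()}

# Invented toy data: two chains, four recorded switch pairs each.
toy_chains = [[3, 3, 54, 5], [54, 54, 3, 49]]
print(transition_probabilities(toy_chains, [3, 5, 49, 54]))
```

Averaging the per-chain fractions, rather than pooling the counts, matches the form of Eq. (1) and gives each chain equal weight regardless of how many pairs it produced.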

In Figure 1 we show the computed transition probabilities. The probabilities computed with different sampling seeds are highly similar. Furthermore, the average probabilities computed with two different initial sequences overlap exceptionally well. We conclude that for the function of interest, P j, the sampling converged.

Figure 1.

The top two plots are probability profiles of accepted sequence pairs as a function of the amino acid site j when the binary permutation occurs at each of the 47 non‐identical positions after the three filters. The nine sites with probability of zero are the positions with identical amino acids. “*” marks the probabilities of a Markov chain starting from the wild‐type GA sequence, G1; “o” marks that from the wild‐type GB sequence, G29. The top plot shows the average probability calculated from the non‐switch pairs generated in the 10 Markov chains using the HL energy function. A statistical error bar for each average is included. The middle plot shows (black curves) the average probability calculated from the switch pairs generated in the 10 Markov chains using the HL energy function (an error bar for each average is also included), (green curves) the probabilities calculated from the switch pairs that pass the combined scoring function screening, and (blue curves) the probabilities calculated from the switch pairs after homology modeling by MODELLER. The bottom plot is the absolute difference of HL contacts (N) between the two wild‐type structures at each position. There are five positions that have the same number of HL contacts in the two folds.

If the sampling is uniform at every non‐identical position, the average probability at these positions should be equal to 1/47 ≈ 0.021. Indeed, in Figure 1 the probability profile of the non‐switch pairs (red curves) fluctuates only slightly around 0.021. The zero values are for the positions along the sequences with identical amino acids. Although the non‐switch pairs do not include all sequence pairs, they are the overwhelming majority, which explains why the distribution is uniform.

At variance with the non‐switch pairs, the probability of the switch pairs deviates significantly from the uniform distribution. As a single Markov chain is already converged, we further analyzed the switch pairs in only the first Markov chain initiated from each of the two wild‐type sequences. We re‐scored all switch pairs using the second filter, the combination of contact energy and secondary structure divergence score, Eq. (5), and reassigned structures to these switch pairs based on the new scores. A significant fraction of the switch pair candidates, 27.43% and 27.53%, passed the second filter in the Markov chains initiated from G1 and G29, respectively (Table 1). The probability profiles for switch pairs that passed the second filter are shown in Figure 1. After the second filter the probability of switch pairs has peaks at roughly the same positions as before. However, peak heights increase at the previously most probable sites, positions 3, 5, 49, 52, and 54 along the sequence, which are located near the N‐ and C‐termini. We screened these switch pairs one more time with the third filter. Homology models for each of the sequences were built for the two folds using MODELLER, and the models were re‐scored by the HL energy function. Of the switch pairs that passed the second filter, 30.33% and 30.48% also passed the third filter in the Markov chains starting from G1 and G29, respectively (Table 1). The probability profiles after the third filter are shown in Figure 1. We observe a further increase in the probability of the switch pairs at the peak positions already observed in filters 1 and 2.

Table 1.

The Filtering Scheme from the Stochastic Sampling to Homology Modeling for the Sequences Sampled from the First Trajectory Starting from G1 and from G29

| Filtering scheme | Starts from G1 | Starts from G29 |
|---|---|---|
| Number of trial sequences | 10,000,000 | 10,000,000 |
| Number of accepted sequences using the first filter (HL energy) | 8,337,761 | 8,340,093 |
| Number of candidates for switch pairs selected by the first filter | 298,065 | 298,158 |
| Number and percentage of switch pairs that pass the second filter (HL+0.65DKL) | 81,768 (27.43%) | 82,096 (27.53%) |
| Number and percentage of switch pairs that pass the third filter (MODELLER 9.13) | 24,803 (30.33%) | 25,024 (30.48%) |

See text for more details.
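The percentages in Table 1 are each taken relative to the number of pairs that survived the preceding filter. The short Python check below recomputes them from the counts in the table; the dictionary layout and the helper name `pass_rate` are choices made for this illustration only.

```python
# Counts copied from Table 1 (first trajectory from each wild-type start).
table1 = {
    "G1": {"trials": 10_000_000, "accepted": 8_337_761,
           "filter1": 298_065, "filter2": 81_768, "filter3": 24_803},
    "G29": {"trials": 10_000_000, "accepted": 8_340_093,
            "filter1": 298_158, "filter2": 82_096, "filter3": 25_024},
}

def pass_rate(passed, candidates):
    """Percentage of candidates that survive a filter, rounded to 2 decimals."""
    return round(100.0 * passed / candidates, 2)

for start, n in table1.items():
    second = pass_rate(n["filter2"], n["filter1"])   # relative to filter 1
    third = pass_rate(n["filter3"], n["filter2"])    # relative to filter 2
    print(f"{start}: second filter {second}%, third filter {third}%")
```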

The residues of the N‐ and C‐termini have different contact patterns than the rest of the protein. In fold A the two ends form flexible coils, while in fold B they fold into β sheets in the core. Therefore, the difference in the number of contacts between the two templates is greater at the ends, making these regions more sensitive to point mutation. The absolute difference of the HL contacts between the two wild‐type structures at each position is also shown in Figure 1. There are ten positions that have the absolute difference of HL contacts greater than five. They are positions 3, 5, 19, 20, 30, 42, 49, 52, 54, and 56. Positions 3, 5, 49, 52, and 54 are also the peaks in the probability profiles for switch pairs.

The number of candidates for switch pairs that we sampled is too large for detailed experimental studies. We therefore wish to select the most promising candidates for an extended analysis. For this selection we relied on the energy gap ΔE, using Eq. (6). After the homology‐modeling filter, we ranked the 24,803 and 25,024 switch pairs from the first Markov chains initiated from G1 and G29, respectively, by their ΔE values. The higher the ΔE, the more stable the two switch sequences are in their native folds. From the Markov chain starting from each of the two wild‐type sequences, we selected the 25 switch pairs with the largest ΔE values for the last filter, a 100‐ps MD simulation. Eighteen of the 50 selected pairs passed the filter.
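The selection step amounts to keeping the top 25 pairs per starting sequence by ΔE. A minimal Python sketch follows; the function name `top_pairs_by_gap`, the `(pair_id, delta_e)` tuple layout, and the example values are hypothetical, and ΔE itself is assumed to have been computed elsewhere according to Eq. (6).

```python
import heapq

def top_pairs_by_gap(pairs, n=25):
    """Return the n switch pairs with the largest energy gap, largest first.

    pairs: iterable of (pair_id, delta_e) tuples; delta_e is assumed to be
           precomputed for each switch pair.
    """
    return heapq.nlargest(n, pairs, key=lambda pair: pair[1])

# Hypothetical energy gaps, for illustration only.
example = [("pair-1", 4.2), ("pair-2", 13.6), ("pair-3", 7.9), ("pair-4", 1.0)]
print(top_pairs_by_gap(example, n=2))
```

Using `heapq.nlargest` avoids sorting all of the roughly 25,000 candidates when only a small number is kept, although a full sort would be just as acceptable at this scale.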

To explore further the most stable switch pair after this series of filters, we re‐ranked the 18 switch pairs by ΔE after the 100‐ps MD simulations. The sequence pair with the largest ΔE value, 13.5693, is one of the nine pairs starting from G29. The critical point mutation causing the structural flip of this switch pair is at position 54. When the valine, the GB amino acid at position 54, is replaced by a proline, the GA amino acid at the same position, the structure flips from fold B to fold A. In Figure 2 we show the pairwise sequence alignments of the switch sequences with their corresponding wild‐type sequences, and in Figure 3 we illustrate each sequence of the switch pair in the predicted native structure as compared to the corresponding wild‐type structure.

Figure 2.

The pairwise sequence alignments between the predicted switch pair that has the largest ΔE value (S1 and S2) and the corresponding wild‐type sequences (G29 and G1). The alignments were generated by clustalw‐2.1 (Ref. 17) using the BLOSUM matrices with a gap opening penalty of 10 and a gap extension penalty of 0.5. The identical positions are highlighted in red.

Figure 3.

The predicted native structures of the predicted switch pair at the end of the 100‐ps production run, sequences S1 (blue) and S2 (raspberry), aligned with G29 and G1 (gray), respectively. The critical point mutation is at position 54, at which valine is replaced by proline and the structure switches from the α+β fold to the α fold. The side chain of the amino acid at position 54 is displayed in stick mode. This figure was generated in PyMOL (The PyMOL Molecular Graphics System, Version 1.3, Schrödinger, LLC).

To further assess the quality of the predicted switch pair with respect to the experimental proteins, we calculate the average Cα‐RMSD and TM‐scores between the MD‐sampled structures of the predicted sequences (S1 and S2) and the MD‐sampled structures of the experimental wild‐type (G1 and G29) and switch (G14, G15, G30, and G31) proteins in their two native structures (the first NMR model of 2FS1 and the crystal structure of 1PGA). The RMSD and TM‐score comparisons are computed between all pairs of structures from the 100‐ps runs of the predicted switch pair and the experimental proteins. Hence, we calculate 100^2 = 10^4 comparisons to obtain a single average. The results of the comparison are listed in Table 2. We have two switch sequences and two plausible folds. We run each of the sequences in each of the folds, giving a total of four runs for the two switch sequences (these are the rows denoted by S1‐2FS1, S1‐1PGA, S2‐2FS1, and S2‐1PGA). We then assign the result to a particular experimental fold by computing the average RMSD (or TM‐score) between the MD‐sampled structures of our predicted sequences and those of the experimental sequences. The experimental sequences are listed in the first row: G1‐2FS1, G14‐2FS1, G30‐2FS1, G29‐1PGA, G15‐1PGA, and G31‐1PGA. The sequence–structure pair with the lowest RMSD or the highest TM‐score is the predicted match. S1 and S2 in the corresponding predicted native folds, S1‐1PGA and S2‐2FS1, have the lowest average RMSD and the largest TM‐score when compared to the experimental sequences in the same correct fold. Note, however, that the signal assigning some sequences to 1PGA is rather weak.

Table 2.

Average RMSD and TM Scores Between the Predicted Switch Sequences (S1 and S2) and the Experimental Wild‐Type Sequences (G1 and G29) and Experimental Switch Sequences (G14, G15, G30, and G31) in the Two Native Structures

| Predicted sequence and fold | G1‐2FS1 | G14‐2FS1 | G30‐2FS1 | G29‐1PGA | G15‐1PGA | G31‐1PGA |
|---|---|---|---|---|---|---|
| RMSD (Å) | | | | | | |
| S1‐2FS1 | 2.450 ± 0.222 | 2.491 ± 0.198 | 2.188 ± 0.174 | 3.042 ± 0.219 | 3.048 ± 0.200 | 3.031 ± 0.256 |
| S1‐1PGA | 2.824 ± 0.168 | 3.109 ± 0.158 | 3.247 ± 0.172 | 1.468 ± 0.151 | 1.843 ± 0.220 | 1.378 ± 0.152 |
| S2‐2FS1 | 2.129 ± 0.188 | 2.324 ± 0.224 | 1.869 ± 0.177 | 2.753 ± 0.155 | 2.798 ± 0.170 | 2.684 ± 0.152 |
| S2‐1PGA | 2.864 ± 0.171 | 3.022 ± 0.190 | 3.234 ± 0.183 | 1.547 ± 0.232 | 1.890 ± 0.216 | 1.571 ± 0.250 |
| TM score | | | | | | |
| S1‐2FS1 | 0.606 ± 0.046 | 0.594 ± 0.026 | 0.635 ± 0.030 | 0.349 ± 0.013 | 0.360 ± 0.015 | 0.348 ± 0.014 |
| S1‐1PGA | 0.348 ± 0.017 | 0.353 ± 0.019 | 0.355 ± 0.019 | 0.799 ± 0.022 | 0.739 ± 0.033 | 0.796 ± 0.033 |
| S2‐2FS1 | 0.680 ± 0.034 | 0.641 ± 0.033 | 0.702 ± 0.027 | 0.399 ± 0.013 | 0.404 ± 0.014 | 0.414 ± 0.015 |
| S2‐1PGA | 0.384 ± 0.019 | 0.377 ± 0.022 | 0.381 ± 0.025 | 0.773 ± 0.039 | 0.732 ± 0.038 | 0.763 ± 0.047 |

S1‐1PGA and S2‐2FS1 form the predicted highly stable switch pair. The entries corresponding to optimal RMSD or TM scores are highlighted in gray. In each column, the predicted correct combination has the lowest average RMSD value and the largest TM score.
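The matching rule described for Table 2 (lowest average RMSD, highest average TM score) can be checked directly against the tabulated means. The Python sketch below uses the averages from Table 2 without the error bars; the helper name `best_match` and the data layout are assumptions made for this illustration.

```python
# Mean values copied from Table 2. Rows are predicted sequence-fold
# combinations; columns are experimental sequences in their native folds.
COLUMNS = ["G1-2FS1", "G14-2FS1", "G30-2FS1", "G29-1PGA", "G15-1PGA", "G31-1PGA"]
RMSD = {
    "S1-2FS1": [2.450, 2.491, 2.188, 3.042, 3.048, 3.031],
    "S1-1PGA": [2.824, 3.109, 3.247, 1.468, 1.843, 1.378],
    "S2-2FS1": [2.129, 2.324, 1.869, 2.753, 2.798, 2.684],
    "S2-1PGA": [2.864, 3.022, 3.234, 1.547, 1.890, 1.571],
}
TM = {
    "S1-2FS1": [0.606, 0.594, 0.635, 0.349, 0.360, 0.348],
    "S1-1PGA": [0.348, 0.353, 0.355, 0.799, 0.739, 0.796],
    "S2-2FS1": [0.680, 0.641, 0.702, 0.399, 0.404, 0.414],
    "S2-1PGA": [0.384, 0.377, 0.381, 0.773, 0.732, 0.763],
}

def best_match(table, column, lower_is_better):
    """Return the row label with the optimal value in the given column."""
    idx = COLUMNS.index(column)
    pick = min if lower_is_better else max
    return pick(table, key=lambda row: table[row][idx])

for col in COLUMNS:
    print(col, best_match(RMSD, col, True), best_match(TM, col, False))
```

For every 2FS1 column the optimal row is S2‐2FS1, and for every 1PGA column it is S1‐1PGA, under both measures. The margin over the second‐best row is small in some 1PGA columns (for example 1.468 versus 1.547 Å for G29‐1PGA), which is consistent with the weak signal noted in the text.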

To further assess the structural flexibility of the predicted switch, we ran a 40‐ns MD simulation for each sequence of the switch pair in each of the two native folds and calculated the Cα‐RMSD between the initial structure of the simulation and the structures sampled each picosecond in the simulation. We show the RMSD values over time in Figure 4. In each fold, the switch sequence for which that fold is native shows a lower RMSD than the other switch sequence, for which the fold is the incorrect one.

Figure 4.

RMSD of the predicted switch sequences in the two native folds. The black curve is for the predicted correct sequence‐structure combination, while the red curve is for the incorrect combination.

Methods

This section is divided into five major parts. In Stochastic Sampling of Sequences by a Markov Chain section we describe the sampling algorithm that provides comprehensive sampling of switch or flip sequences. In sections “Secondary Structure Prediction,” “A Linear Combination of the HL Energy and the Secondary Structure Prediction Score Learned and Tested on GA and GB Sequence Database,” “Homology Modeling by MODELLER,” and “Final Filter by MD simulations,” we consider other measures that further probe and verify switch sequences using secondary structure matching and molecular dynamics (MD) simulations.

Stochastic sampling of sequences by a Markov chain

Our earlier study of the flip of the protein GA (fold 3α) to GB (fold 4β+α)16 was focused on sequences that had been analyzed experimentally.12, 13, 14, 15 These experimental observations are critical for assessing the accuracy of the sequence assignment to a fold and for further refinement of the computational model. In the current investigation we build on the enhanced accuracy of the model and sample a large number of sequences at the interface between the two folds. The larger sample size makes it possible to conduct significant statistical analysis of these rare and special sequences.

We denote a sequence by S and a structure by X. S_A is a sequence that folds to structure X_A. A similar notation is used for the B sequences and structures. We further denote the amino acid at position j (a_j) that belongs to fold A or B by a_{A,j} or a_{B,j}, respectively.

The sequence space that we consider, following the original experimental procedure,12 is the binary permutations of the aligned sequences of the two wild‐type proteins (GAWT/G1 and GBWT/G29). This means that any site j along a sequence can have one of the two amino acids, a_{A,j} or a_{B,j}. The alignment is trivial in this case since the proteins are of the same length with 56 amino acids and one‐to‐one matching is used (Table 3). The total number of possible sequence variants is 2^47 ≈ 10^14 because only 47 of the 56 amino acids are different. The total number of binary permutations is much smaller than 20^47 for all twenty amino acid types but is still large enough to present a significant challenge for exhaustive enumeration. We illustrate below that efficient sampling can be conducted which provides statistically converged results for some observables.

Table 3.

Alignment of the Two Wild‐type Sequences of the Protein Switch

Position index 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56
GAWT/G1 M E A V D A N S L A Q A K E A A I K E L K Q Y G I G D Y Y I K L I N N A K T V E G V E S L K N E I L K A L P T E
GBWT/G29 M T Y K L I L N G K T L K G E T T T E A V D A A T A E K V F K Q Y A N D N G V D G E W T Y D D A T K T F T V T E

The 47 non‐identical positions are in black, while the rest of the 9 identical positions are in red.
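The counts quoted for the sequence space follow directly from the alignment in Table 3. The Python snippet below, included as an illustrative check, lists the identical positions (1‐based) and evaluates the number of binary permutations; the two sequence strings are the rows of Table 3 with the spaces removed.

```python
# Wild-type sequences from Table 3, written without spaces.
G1 = "MEAVDANSLAQAKEAAIKELKQYGIGDYYIKLINNAKTVEGVESLKNEILKALPTE"
G29 = "MTYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTE"
assert len(G1) == len(G29) == 56

# 1-based positions where the two wild types share the same amino acid.
identical = [j for j, (a, b) in enumerate(zip(G1, G29), start=1) if a == b]
# 1-based positions where they differ; only these can be permuted.
variable = [j for j, (a, b) in enumerate(zip(G1, G29), start=1) if a != b]

n_binary = 2 ** len(variable)      # size of the binary-permutation space
n_sampled = 20 * 10 ** 7           # 20 chains x 10^7 trial sequences

print(len(variable), identical)
print(f"{n_binary:.2e} binary sequences, sampled fraction {n_sampled / n_binary:.1e}")
```

The trial sequences therefore cover only about one part in 10^6 of the binary space, which is why convergence has to be established per observable rather than assumed.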

We denote the current sequence by S_C, and the current fold by X_C. S_N denotes the newly sampled sequence, and X_N is the associated fold. C and N can be either fold A or B. The length of the Markov chain of sampled sequences is l, where l ∈ [0, 10^7]. The sampling begins with one of the wild‐type sequences. A non‐identical position j along the sequence alignment is chosen uniformly at random, and the amino acid currently at that position, a_{A,j} (or a_{B,j}), is replaced by a_{B,j} (or a_{A,j}) in a new trial sequence. The new sequence is then placed in the two folds and scored by the Hinds and Levitt (HL) energy function.18 According to the scheme below, the sequence may or may not be assigned to one of the two structures.

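$$\begin{aligned}
&\text{If } \mathrm{Score}(S_N, X_A) - \mathrm{Score}(S_N, X_B) < 0 \ \wedge\ \mathrm{Score}(S_N, X_A) < \mathrm{threshold}_{X_A}, \text{ then } S_N \in X_A \\
&\text{Else if } \mathrm{Score}(S_N, X_A) - \mathrm{Score}(S_N, X_B) > 0 \ \wedge\ \mathrm{Score}(S_N, X_B) < \mathrm{threshold}_{X_B}, \text{ then } S_N \in X_B \\
&\text{Else } S_N \notin X_A \cup X_B,\ S_N = S_C
\end{aligned} \tag{2}$$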

Equation (2) forms the basis for the four sequential filters that we use to identify switch sequences. In three of the four filters, including the sampling phase, Score(S, X) is the HL contact energy, E_HL. The second filter uses a Score(S, X) that includes secondary structure information and is discussed in detail in the “A Linear Combination of the HL Energy and the Secondary Structure Prediction Score Learned and Tested on GA and GB Sequence Database” section.

The thresholds, threshold_{X_A} and threshold_{X_B}, add absolute measures in addition to the relative fitness score. They are the E_HL values of the two wild‐type sequences in their native folds, −22.55 in fold A and −37.05 in fold B. Based on our previous study of the 31 experimental sequences, all 14 experimental GA sequences and 15 experimental GB sequences are lower in energy than GAWT and GBWT, respectively (data not shown). Accordingly, this selection scheme samples sequences such that their energy is lower than the energy of the wild‐type sequences in their native folds. We have used similar cutoffs also in the construction of the “network of sequence flow.”7

If the HL energy of S_N in one of the two target folds is lower than in the other, and if that energy is also lower than the threshold of the lower‐energy fold, then S_N is accepted and becomes the new S_C. If the designated fold of S_N is the same as the fold of the previous S_C, it is accepted as a foldable non‐switch sequence. Otherwise, the two sequences, S_N and the previous S_C, form a switch pair. The series of trial sequences generates a Markov chain of switch and non‐switch sequences that is further filtered as described in sections “Secondary Structure Prediction,” “A Linear Combination of the HL Energy and the Secondary Structure Prediction Score Learned and Tested on GA and GB Sequence Database,” “Homology Modeling by MODELLER,” and “Final Filter by MD simulations.”
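The acceptance rule of Eq. (2) and the sampling loop it drives can be sketched as follows. This is an illustrative Python outline under stated assumptions, not the implementation used in the study: the names `assign_fold` and `sample_chain` are invented here, and `score` is a caller‐supplied placeholder standing in for the HL contact energy of a sequence threaded onto a fold, which is not reproduced.

```python
import random

def assign_fold(score_a, score_b, threshold_a, threshold_b):
    """Apply the rule of Eq. (2): return "A", "B", or None (rejected)."""
    if score_a - score_b < 0 and score_a < threshold_a:
        return "A"
    if score_a - score_b > 0 and score_b < threshold_b:
        return "B"
    return None

def sample_chain(seq_a, seq_b, start, start_fold, score, thresholds,
                 n_steps, seed=0):
    """Binary-permutation Markov chain between two aligned sequences.

    score(sequence, fold) is a placeholder for the real energy function.
    Returns (switch_pairs, non_switch_pairs); each entry is a tuple
    (1-based position, current sequence, new sequence).
    """
    rng = random.Random(seed)
    variable = [j for j, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]
    current, fold = list(start), start_fold
    switch, non_switch = [], []
    for _ in range(n_steps):
        j = rng.choice(variable)
        trial = current.copy()
        # Swap in the residue of the other wild type at site j.
        trial[j] = seq_b[j] if current[j] == seq_a[j] else seq_a[j]
        new_fold = assign_fold(score(trial, "A"), score(trial, "B"),
                               thresholds["A"], thresholds["B"])
        if new_fold is None:
            continue  # rejected: the current sequence is kept
        record = (j + 1, "".join(current), "".join(trial))
        (non_switch if new_fold == fold else switch).append(record)
        current, fold = trial, new_fold
    return switch, non_switch

# Thresholds quoted in the text: wild-type E_HL values in folds A and B.
print(assign_fold(-30.0, -20.0, -22.55, -37.05))
```

In this sketch a rejected trial leaves the chain at the current sequence, which is how the last branch of Eq. (2) is read here.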

To test the statistical convergence of the stochastic sampling, we expect that properties we compute will not depend on the initiation of the Markov chain. We consider the probability that a particular site along the sequence be a position for a switch. That is, the probability that the site is identified in a switch pair regardless of the state of the rest of the positions in the sequence. This profile of a switch sequence is useful in identification of “hot spots” and is therefore of considerable interest. To examine the convergence of the profile we repeat the sampling using two starting points. One initiator of the Markov chain is the wild‐type GA sequence in fold A, G1, and the other initiator is the wild‐type GB sequence in fold B, G29 (see also Table 3). For each starting sequence, we conduct ten independent runs where different sampling seeds were used to generate alternative Markov chains.

As we illustrate in the Results section, the probability distributions of structural flips generated with different initial sequences are very similar, supporting our assessment of statistical convergence.

Secondary structure prediction

In our previous paper we found that the contact potential alone was not sufficient to predict the folds of the experimental sequences with high accuracy (only 26 of 31 sequences are correctly predicted by the HL energy). However, when an additional filter was added on top of the contact potential, the accuracy increased to 30 of 31 sequences. The additional filter enriches structures in the neighborhood of the initial models by relatively short (100 ps) MD simulations and re‐scores the structures sampled by MD with the contact potential. The larger sample size reduces statistical errors and provides more accurate structural assignments, as recommended in Ref. 19. However, the MD simulations are too expensive to be conducted for all the millions of sequences that we sample here. Therefore, we implement an intermediate rapid filter using a linear combination of the HL energy and the secondary structure prediction score calculated by SABLE.20

Complementary to the contact energy, which relies mainly on the 3D structure information of the target sequence, the secondary structure prediction is based on information derived from evolutionary profiles of protein families. Specifically, SABLE (as well as Psi‐Pred21) predictions use Psi‐BLAST22 generated Position Specific Scoring Matrices (PSSMs) derived from iteratively improved alignments of multiple related sequences of the target protein in the non‐redundant database (here we use NCBI's nr database). The use of evolutionary profiles, coupled with the inherently statistical nature of the predictions, which assign a probability of observing an amino acid in each of the (typically) three different states, that is, helix, strand, or coil, allows one to estimate the propensity for each secondary structure state. As a result, regions prone to undergoing a transition between two different states, for example, helix versus strand, can be predicted by comparing the propensities for the states. In fact, prediction of conformationally labile regions and local switches based on secondary structure predictions has been proposed in the past.23, 24 Here, we test the hypothesis that propensities obtained in this way can also be used for the more challenging prediction of the overall fold change. While this evolutionary signal can provide insights into the source of fold instability and switching, it should be noted that the disadvantage of the secondary structure function is that there is no simple physical interpretation of its functional form.

To construct standard templates for secondary structure prediction per site we consider the two wild‐type sequences, G1 and G29, as the target sequences. We use “targeted” Position Specific Scoring Matrices (targeted PSSMs) of G1 and G29 to verify the secondary structures of the 31 experimental sequences and to predict the secondary structures for all sampled switch sequence pairs in the first Markov chains from both native sequences. PSSM is a T‐by‐L matrix, where T is the number of amino acid types and gaps (T = 21), and L is the length of the sequence of interest (L = 56). Each element in the matrix corresponds to the log‐odds score of the amino acid substitution at that position in the sequence. The log‐odds scores are calculated from the multiple sequence alignment of the related sequences pulled out by PSI‐BLAST from nr. Instead of re‐constructing PSSM for every sequence (an approach that would be computationally expensive as PSSM would need to be recomputed for each query sequence, and it might introduce instabilities in multiple alignments for intermediate sequences equally distant from both folds), we use the PSSMs of the two wild‐type sequences as the templates. We name this method as “targeted PSSM.” The rationale behind targeted PSSM is similar to threading. In threading, the sequence of interest is accommodated in the 3D structure of the template using a contact map of the native structure. In targeted PSSM the sequence of interest is accommodated in the secondary structure of the native template. In this regard, it should be noted that SABLE was designed to accommodate changes in the amino acid sequence on the background of (near) constant PSSM by incorporating single sequence (in addition to PSSM‐derived) features into the input of the method.20 Such design enabled successful identification of conformational changes induced by single amino acid substitutions before.25, 26

The results of targeted PSSM are expressed as a probability distribution over the three common secondary structure classes at each site: the probabilities of being in an α helix (H), in a β sheet (S), and in a coil (C). We calculate a secondary structure prediction score of a sequence S_x by targeted PSSM using the Kullback–Leibler divergence, D_KL.

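$$D_{KL}(S_x \parallel \mathrm{G1}) = \sum_{j=1}^{56}\ \sum_{i=H,S,C} P_{S_x,i}(j)\,\log\frac{P_{S_x,i}(j)}{P_{\mathrm{G1},i}(j)}, \qquad D_{KL}(S_x \parallel \mathrm{G29}) = \sum_{j=1}^{56}\ \sum_{i=H,S,C} P_{S_x,i}(j)\,\log\frac{P_{S_x,i}(j)}{P_{\mathrm{G29},i}(j)} \tag{3}$$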

D_KL(S_x ‖ G1) expresses the deviation between the probability distribution of a template, G1, and that of the sequence of interest, S_x. D_KL(S_x ‖ G29) is a similar expression where G29 serves as the template. The probabilities P_{S_x,i}(j), P_{G1,i}(j), and P_{G29,i}(j) are for a_j in S_x, G1, and G29, respectively, to adopt a secondary structure type i (either H, S, or C). At each position j, the sum of the probabilities of H, S, and C is normalized to 1. D_KL is always greater than or equal to zero. It is zero when the distribution of the target sequence is identical to the distribution of the template. Hence, the smaller the value of the divergence, the better is the prediction.
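Equation (3) can be evaluated with a short Python function once the per‐site (H, S, C) probabilities are available. The sketch below is illustrative only: the name `kl_divergence` is invented, the natural logarithm is assumed because the base is not stated, and the small floor `eps` on template probabilities is an added safeguard against division by zero rather than part of the original definition.

```python
import math

def kl_divergence(p_query, p_template, eps=1e-12):
    """Sum over sites and states of P_query * log(P_query / P_template).

    p_query, p_template: sequences of (P_H, P_S, P_C) tuples, one per site,
    each normalized to 1. Natural logarithm assumed.
    """
    if len(p_query) != len(p_template):
        raise ValueError("profiles must have the same number of sites")
    total = 0.0
    for site_q, site_t in zip(p_query, p_template):
        for q, t in zip(site_q, site_t):
            if q > 0.0:  # terms with q = 0 contribute nothing
                total += q * math.log(q / max(t, eps))
    return total

# Invented one-site example: helix-leaning query, strand-leaning template.
query = [(0.5, 0.25, 0.25)]
template = [(0.25, 0.5, 0.25)]
print(kl_divergence(query, template))
```

Because the divergence is not symmetric, the order of the arguments matters: the sequence of interest comes first and the wild‐type template second, as in Eq. (3).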

To predict a fold for each of the 31 experimental sequences we use Eq. (2) except that we replace the HL energy, EHL, by the divergence measure, DKL. In addition, we do not use the absolute threshold.

The results are summarized in Table 4. DKL correctly assigns 29 out of 31 experimental sequences to their native structures, which matches the success rate of the HL energy on the refined models. The two sequences on which DKL fails are G15 and G17, which are both GB sequences. As the sequences on which the HL energy fails are all GA sequences, we expect DKL to complement the HL energy in our task. This observation motivates a combination of the two scores.

Table 4.

Fold Prediction on the 31 Experimental Sequences Using DKL or EHL

Columns 1 to 31 are the indices i of the experimental sequence variants, Gi. The last column is the number of correct predictions.

Method                                   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31   Correct
Secondary structure score, DKL           a  a  a  a  a  a  a  a  a  a  a  a  a  a  a  b  a  b  b  b  b  b  b  b  b  b  b  b  b  a  b   29
EHL (1) Direct sequence replacement      a  a  a  a  a  a  a  a  a  a  b  b  b  b  b  b  b  b  b  b  b  b  b  b  b  b  b  b  b  b  b   26
EHL (2) Model refinement by MODELLER     a  a  a  a  a  a  a  a  a  a  a  a  a  b  b  b  b  b  b  b  b  b  b  b  b  b  b  b  b  b  b   29
EHL (3) MOIL ensemble of structures      a  a  a  a  a  a  a  a  a  a  a  a  b  a  b  b  b  b  b  b  b  b  b  b  b  b  b  b  b  a  b   30

The sequence indices that fold into B are highlighted in gray. The wild‐type sequences are colored. The wild‐type GA and GB sequences are 1 and 29, respectively. (1) Models are built by sequence replacement in the first NMR model of G1 and the crystal structure of G29 and scored by EHL using the contact maps of the native proteins. (2) The models are refined by MODELLER28 and new contact maps are constructed and scored by EHL. (3) Structures are generated by 100‐ps MD simulations using the MOIL program17 and scored by EHL. “a” and “b” denote the 3α and 4β+α folds, respectively. The characters in red are incorrect predictions. Overall, the accuracy of the predictions using DKL is comparable to that obtained by EHL for the refined models. Moreover, the two measures complement each other: DKL performs better on GA sequences, while EHL does not err on GB sequences.

A linear combination of the HL energy and the secondary structure prediction score learned and tested on GA and GB sequence database

The HL energy and the secondary structure fitness score capture different structural and evolutionary signals. The former uses contacts between amino acids that are far apart along the sequence, that is, patterns of tertiary structure. The latter predicts local information, such as the values of the backbone dihedral angles or patterns of short‐range hydrogen bonding. Hence, the two scores are likely to be complementary when combined.

Appropriate training and test sets are required to optimize and validate the weights (adaptive parameters) of such a consensus model. The 31 experimental sequences form a sample that is too small for an effective multiple sequence alignment and for verification by cross‐validation. To enlarge the training and test sets used to validate a combined score, we use G1 and G29 as queries to find homologous sequences in nr with PSI‐BLAST and with a reversed BLAST search against domain profiles from the Conserved Domain Database (CDD).27 The homologous sequences identified in this way are aligned to the corresponding query sequence with the pairwise sequence alignment option of the clustalw‐2.1 program.17 The alignment uses one of the BLOSUM matrices (80, 62, 45, and 30); the program chooses the matrix internally, depending on the evolutionary distance between the aligned sequences. A gap opening penalty of 10 and a gap extension penalty of 0.5 are used. We remove redundant sequences and those with uncertain homology‐based assignment or inconclusive functional annotations. As a result, we obtain 10 GA sequences and 63 GB sequences (listed in Table 1 in Supporting Information), of which 2 GA and 37 GB sequences have resolved structures and are thus experimentally validated. The remaining 8 GA and 26 GB sequences in our expanded database are putative, classified as such based on their sequence similarity and on the UniProt and Ensembl functional annotations of the proteins from which they are derived. These sequences are used for training and validation in addition to the original 15 GA and 16 GB experimental sequences. The 31 experimental sequences are listed in Table 2 in Supporting Information.

To improve the accuracy of the scoring function we consider a simple linear combination of the HL energy (EHL) and DKL with a single parameter λ to be determined. The functional form is

$$E_{\mathrm{comb}} = E_{\mathrm{HL}} + \lambda D_{\mathrm{KL}} \qquad (4)$$

We determine λ by optimizing the scoring accuracy as a function of this one variable with a halving search. The results are reported in Table 5.
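The one‐parameter optimization of Eq. (4) can be sketched as follows. This is one possible reading of a halving search, not necessarily the procedure used here. The objective is the sum of absolute differences of the combined scores of misassigned sequences in the two folds. The assumptions are that a sequence is assigned to the fold with the lower combined score, that the search interval is [0, 2], and that the objective is well behaved on that interval. The names `Sample`, `error`, and `halving_search` and all numbers are hypothetical.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    fold: str   # known fold, "a" or "b"
    e_a: float  # E_HL of the sequence embedded in fold A
    e_b: float  # E_HL of the sequence embedded in fold B
    d_a: float  # D_KL of the sequence against the fold-A template
    d_b: float  # D_KL of the sequence against the fold-B template


def combined_scores(sample, lam):
    """E_comb = E_HL + lam * D_KL in folds A and B."""
    return sample.e_a + lam * sample.d_a, sample.e_b + lam * sample.d_b


def error(samples, lam):
    """Sum of |E_comb(A) - E_comb(B)| over misassigned sequences."""
    total = 0.0
    for s in samples:
        score_a, score_b = combined_scores(s, lam)
        # Assumption: the fold with the lower combined score is predicted.
        predicted = "a" if score_a < score_b else "b"
        if predicted != s.fold:
            total += abs(score_a - score_b)
    return total


def halving_search(samples, lo=0.0, hi=2.0, rounds=30):
    """Halve [lo, hi] each round, keeping the half whose center has lower error."""
    for _ in range(rounds):
        quarter = (hi - lo) / 4.0
        left, right = lo + quarter, hi - quarter
        if error(samples, left) <= error(samples, right):
            hi = (lo + hi) / 2.0
        else:
            lo = (lo + hi) / 2.0
    return (lo + hi) / 2.0


# Toy training set (hypothetical numbers): E_HL alone misassigns the first one.
training = [
    Sample("a", e_a=-1.0, e_b=-1.2, d_a=0.1, d_b=1.1),
    Sample("b", e_a=-1.0, e_b=-2.0, d_a=0.2, d_b=0.6),
]
best_lambda = halving_search(training)
```

Because an error count of this kind is piecewise in λ and need not have a single minimum, a coarse scan of the interval before halving would be a sensible safeguard in practice.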

Table 5.

Structural Prediction Using a Linear Combination of EHL and DKL, Training on Half (15) of the Experimental Sequences and Half of the Extended Database Chosen at Random, and Testing on the Remaining 16 Experimental Sequences and the Other Half of the Extended Database

Training (λ = 0.65)
15 experimental sequences. Predicted a: 1, 2, 5, 7, 8, 12, 14. Predicted b: 15, 16, 18, 22, 24, 27, 28, 29.
5 GA sequences in the extended database. Predicted a: 3, 4, 6, 8, 9.
31 GB sequences in the extended database. Predicted b: 2, 3, 4, 6, 7, 9, 12, 17, 22, 25, 26, 27, 29, 32, 34, 36, 37, 38, 41, 42, 43, 44, 45, 46, 52, 53, 54, 60, 61, 62, 63.

Testing (λ = 0.65)
16 experimental sequences. Predicted a: 3, 4, 6, 9, 10, 11, 13. Predicted b: 17, 19, 20, 21, 23, 25, 26, 30, 31.
Error: 4.46
5 GA sequences in the extended database. Predicted a: 1, 2, 5, 7, 10.
32 GB sequences in the extended database. Predicted b: 1, 5, 8, 10, 11, 13, 14, 15, 16, 18, 19, 20, 21, 23, 24, 28, 30, 31, 33, 35, 39, 40, 47, 48, 49, 50, 51, 55, 56, 57, 58, 59.

The single incorrect prediction in the test set is shown in red. Error is the sum of the absolute differences of the combined scores of incorrect predicted sequences embedded in the two folds.

The new scoring function with optimal linear coefficient λ for the training set is:

$$E_{\mathrm{comb}} = E_{\mathrm{HL}} + 0.65\, D_{\mathrm{KL}} \qquad (5)$$

We note that since the PSSM is computed only for the two target sequences, while single‐sequence features can be computed in constant time, this approach is very efficient and can be applied to large‐scale sampling of the network of sequence flow. Here, Eq. (5) is used to filter the predictions made by the initial sampling, that is, the 298,065 switch pairs obtained starting from G1 and the 298,158 switch pairs obtained starting from G29.

Sequences are selected in this filtering step as in Eq. (2), except that no absolute threshold is used. The scoring function $\mathrm{Score}_{S,X}$ is here the combined score Ecomb. We focus on the switch sequences only: sequences that remain switch pairs after this filter are forwarded to the next filtering step, in which we refine the atomic models with the program MODELLER28 and adjust the contact maps according to the new structures.

Homology modeling by MODELLER

Structural models are built using MODELLER (version 9.13)28 for the switch sequence pairs that pass the second filter (A Linear Combination of the HL Energy and the Secondary Structure Prediction Score Learned and Tested on GA and GB Sequence Database section). New contact maps are calculated and the structures are re‐scored with the HL energy function. The structural assignment of the switch pairs is based on Eq. (2), where the scoring function $\mathrm{Score}_{S,X}$ is EHL.

If a pair of sequences is no longer identified as a switch, it is removed from the set. We wish to select the top switch sequences for more detailed analysis, which we do by examining the stability of each sequence in its corresponding fold; that is, we want both sequences of a switch to be maximally stable. We calculate the sum of the absolute values of the energy differences, ΔE, of the pair in the two folds. These differences represent the energy gaps between the two folds, which we wish to maximize:

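$$\Delta E = \left| E_{\mathrm{HL},S_1,X_A} - E_{\mathrm{HL},S_1,X_B} \right| + \left| E_{\mathrm{HL},S_2,X_A} - E_{\mathrm{HL},S_2,X_B} \right| \qquad (6)$$

$E_{\mathrm{HL},S_1,X_A}$ and $E_{\mathrm{HL},S_2,X_A}$ are the HL energies of the switch sequence pair S1 and S2, respectively, in fold A after MODELLER. Similarly, $E_{\mathrm{HL},S_1,X_B}$ and $E_{\mathrm{HL},S_2,X_B}$ are the HL energies of the pair in fold B. The top 25 switch pairs that have the largest ΔE values are further analyzed using MD simulations.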
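The ranking by energy gap in Eq. (6) can be sketched as below. The container `SwitchPair`, the function names, and the energies are hypothetical and serve only to illustrate the arithmetic.

```python
from typing import NamedTuple


class SwitchPair(NamedTuple):
    name: str
    e_s1_a: float  # E_HL of sequence S1 in fold A
    e_s1_b: float  # E_HL of sequence S1 in fold B
    e_s2_a: float  # E_HL of sequence S2 in fold A
    e_s2_b: float  # E_HL of sequence S2 in fold B


def energy_gap(pair):
    """Delta E of Eq. (6): sum of the absolute fold-A/fold-B energy differences."""
    return abs(pair.e_s1_a - pair.e_s1_b) + abs(pair.e_s2_a - pair.e_s2_b)


def top_pairs(pairs, k=25):
    """Return the k pairs with the largest energy gap, largest first."""
    return sorted(pairs, key=energy_gap, reverse=True)[:k]


# Toy energies (hypothetical values in arbitrary units).
candidates = [
    SwitchPair("pair1", -10.0, -4.0, -3.0, -9.0),
    SwitchPair("pair2", -5.0, -4.0, -4.0, -6.0),
    SwitchPair("pair3", -8.0, -2.0, -1.0, -5.0),
]
selected = top_pairs(candidates, k=2)
```

The absolute values make the gap insensitive to which fold each sequence prefers, so the switch assignment itself has to be checked separately, as is done here with Eq. (2).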

Final filter by MD simulations

We use the same MD simulation protocol as in our previous study.16 The protein is embedded in a periodic box of 65 Å³, the Ewald sum29 is used to compute electrostatic interactions, and the real‐space distance cutoff is 9.5 Å. The water model is TIP3P.30 The score of a sequence of a switch pair in one fold is the average HL energy of the 100 configurations that are sampled every 1 ps at 300 K during a 100‐ps run. We then calculate the energy gap, ΔE, using Eq. (6). The switch pair with the highest energy gap is subjected to a prolonged production run, a 40‐ns MD simulation. We collect 40,000 configurations and measure the Cα‐RMSD between the structure at the beginning of the production run and the structures sampled during the run. The length of the simulations is clearly insufficient to detect switching events. However, it is sufficient to probe the local stability and fluctuations of the proteins in the neighborhood of their assumed native states.
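The two quantities extracted from the trajectories, the snapshot‐averaged score and the Cα‐RMSD relative to the first frame, reduce to simple arithmetic. The sketch below assumes that the frames have already been superimposed on the reference, since the superposition procedure is not specified here. The function names and the toy coordinates are hypothetical.

```python
import math


def mean_energy(energies):
    """Average score over sampled configurations (e.g., 100 snapshots)."""
    if not energies:
        raise ValueError("no configurations were sampled")
    return sum(energies) / len(energies)


def ca_rmsd(reference, frame):
    """RMSD between two lists of (x, y, z) C-alpha coordinates.

    Assumption: both structures are already expressed in a common frame,
    so no rotation or translation is fitted here.
    """
    if len(reference) != len(frame) or not reference:
        raise ValueError("coordinate sets must be non-empty and equal in size")
    squared = 0.0
    for (rx, ry, rz), (x, y, z) in zip(reference, frame):
        squared += (x - rx) ** 2 + (y - ry) ** 2 + (z - rz) ** 2
    return math.sqrt(squared / len(reference))


def rmsd_series(trajectory):
    """RMSD of every frame relative to the first frame of the trajectory."""
    reference = trajectory[0]
    return [ca_rmsd(reference, frame) for frame in trajectory]


# Toy two-atom, two-frame trajectory (hypothetical coordinates in angstroms).
toy_trajectory = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)],
]
series = rmsd_series(toy_trajectory)
average = mean_energy([-3.0, -5.0])
```

For the production run the series would contain 40,000 values, and a plateau in that series would indicate local stability around the starting structure.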

Conclusions

In this manuscript we present a comprehensive sampling of sequence binary space for a structural switch. We illustrate that an important function (the sequence probability profile of the switch) is statistically converged and we analyze in detail the switch property. The structural portions critical for switching are at the N‐ and C‐ termini. Perhaps the remaining greatest challenge is to increase the accuracy of prediction that is subject to subtle changes in interactions.

Supporting information


References

  1. Maynard Smith J (1970) Natural selection and the concept of a protein space. Nature 225:563–564.
  2. Saven JG, Wolynes PG (1997) Statistical mechanics of the combinatorial synthesis and analysis of folding macromolecules. J Phys Chem B 101:8375–8389.
  3. Shakhnovich EI (1998) Protein design: a perspective from simple tractable models. Fold Des 3:R45–R58.
  4. Meyerguz L, Grasso C, Kleinberg J, Elber R (2004) Computational analysis of sequence selection mechanisms. Structure 12:547–557.
  5. Chan HS, Dill KA (1991) Sequence space soup of proteins and copolymers. J Chem Phys 95:3775–3787.
  6. Burke S, Elber R (2012) Super folds, networks, and barriers. Proteins 80:463–470.
  7. Meyerguz L, Kleinberg J, Elber R (2007) The network of sequence flow between protein structures. Proc Natl Acad Sci USA 104:11627–11632.
  8. Cao BQ, Elber R (2010) Computational exploration of the network of sequence flow between protein structures. Proteins 78:985–1003.
  9. Bennett MJ, Eisenberg D (2004) The evolving role of 3D domain swapping in proteins. Structure 12:1339–1341.
  10. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE (2000) The Protein Data Bank. Nucleic Acids Res 28:235–242.
  11. Zhang Y, Skolnick J (2005) TM‐align: a protein structure alignment algorithm based on the TM‐score. Nucleic Acids Res 33:2302–2309.
  12. Alexander P, He Y, Chen Y, Orban J, Bryan P (2007) The design and characterization of two proteins with 88% sequence identity but different structure and function. Proc Natl Acad Sci USA 104:11963–11968.
  13. Alexander PA, He YA, Chen YH, Orban J, Bryan PN (2009) A minimal sequence code for switching protein structure and function. Proc Natl Acad Sci USA 106:21149–21154.
  14. Bryan PN, Orban J (2010) Proteins that switch folds. Curr Opin Struct Biol 20:482–488.
  15. He YA, Chen YH, Alexander PA, Bryan PN, Orban J (2012) Mutational tipping points for switching protein folds and functions. Structure 20:283–291.
  16. Chen SH, Elber R (2014) The energy landscape of a protein switch. Phys Chem Chem Phys 16:6407–6421.
  17. Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD, Gibson TJ, Higgins DG (2007) Clustal W and clustal X version 2.0. Bioinformatics 23:2947–2948.
  18. Hinds DA, Levitt M (1994) Exploring conformational space with a simple lattice model for protein‐structure. J Mol Biol 243:668–682.
  19. Cossio P, Granata D, Laio A, Seno F, Trovato A (2012) A simple and efficient statistical potential for scoring ensembles of protein structures. Scientific Rep 2:351.
  20. Adamczak R, Porollo A, Meller J (2005) Combining prediction of secondary structure and solvent accessibility in proteins. Proteins 59:467–475.
  21. Jones DT (1999) Protein secondary structure prediction based on position‐specific scoring matrices. J Mol Biol 292:195–202.
  22. Altschul SF, Madden TL, Schaffer AA, Zhang JH, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI‐BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402.
  23. Young M, Kirshenbaum K, Dill KA, Highsmith S (1999) Predicting conformational switches in proteins. Protein Sci 8:1752–1764.
  24. Rost B (2001) Review: protein secondary structure prediction continues to rise. J Struct Biol 134:204–218.
  25. Howarth JW, Meller J, Solaro RJ, Trewhella J, Rosevear PR (2007) Phosphorylation‐dependent conformational transition of the cardiac specific N‐extension of troponin I in cardiac troponin. J Mol Biol 373:706–722.
  26. Takatori A, Geh E, Chen L, Zhang L, Meller J, Xia Y (2008) Differential transmission of MEKK1 morphogenetic signals by JNK1 and JNK2. Development 135:23–32.
  27. Marchler‐Bauer A, Lu SN, Anderson JB, Chitsaz F, Derbyshire MK, DeWeese‐Scott C, Fong JH, Geer LY, Geer RC, Gonzales NR, Gwadz M, Hurwitz DI, Jackson JD, Ke ZX, Lanczycki CJ, Lu F, Marchler GH, Mullokandov M, Omelchenko MV, Robertson CL, Song JS, Thanki N, Yamashita RA, Zhang DC, Zhang NG, Zheng CJ, Bryant SH (2011) CDD: a Conserved Domain Database for the functional annotation of proteins. Nucleic Acids Res 39:D225–D229.
  28. Eswar N, Marti‐Renom MA, Webb B, Madhusudhan MS, Eramian D, Shen M, Pieper U, Sali A (2006) Comparative protein structure modeling with modeller. Curr Protoc Bioinform 5.6:5.6.1–5.6.30.
  29. Essmann U, Perera L, Berkowitz ML, Darden T, Lee H, Pedersen LG (1995) A smooth particle mesh Ewald method. J Chem Phys 103:8577–8593.
  30. Jorgensen WL, Chandrasekhar J, Madura JD, Impey RW, Klein ML (1983) Comparison of simple potential functions for simulating liquid water. J Chem Phys 79:926–935.
