Modeling proteins using a super-secondary structure library and NMR chemical shift information

Vilas Menon; Brinda Vallat; Joseph M Dybas; Andras Fiser

doi:10.1016/j.str.2013.04.012

Author manuscript; available in PMC: 2014 Jun 4.

Published in final edited form as: Structure. 2013 May 16;21(6):891–899. doi: 10.1016/j.str.2013.04.012

PMCID: PMC3703203 NIHMSID: NIHMS482228 PMID: 23685209

Summary

A remaining challenge in protein modeling is to predict structures for sequences that do not share recognizable sequence similarity to any experimentally solved structure. This challenge can be addressed by hybrid algorithms that utilize easily obtainable experimental data carrying a limited amount of indirect structural information. Based on earlier observations, the library of protein super-secondary structure motifs (Smotifs) saturated about a decade ago, and new folds discovered since then are novel combinations of existing Smotifs. This observation suggests that it should be possible to build any structure, of either a known or yet to be discovered fold, from a combination of existing Smotifs derived from already known structures. In the absence of any sequence similarity signal, limited experimental data can be used to relate the backbone conformations of Smotifs between target proteins and known experimental structures. Here we present a modeling algorithm that relies on an exhaustive Smotif library and on NMR chemical shift patterns without any input of primary sequence information. In a test of 102 proteins with unique folds, the algorithm delivered 90 homology model quality models, among them 24 high quality ones, and a topologically correct solution for almost all cases. Detailed analysis of the method’s performance suggests that further improvement can be achieved by improving sampling algorithms and developing more precise tools that predict dihedral angle preferences from chemical shift assignments. The current approach opens an avenue to address the modeling of larger protein structures for which chemical shifts are available.

Keywords: Ab initio modeling, protein structure modeling, Smotif, NMR chemical shift

Introduction

Knowledge of the three-dimensional model of a protein can provide essential insight into its function. This insight can range from low-resolution descriptions, such as confirming the fold and inferring a general functional role (Zhan et al., 2005), to high-resolution descriptions, such as understanding ligand specificities (Schwede et al., 2009). Protein modeling methods that complement experimental approaches to obtain three dimensional models have been extensively developed over the last decades (Bonneau et al., 2002b; Marti-Renom et al., 2000; Pillardy et al., 2001). The remaining challenges in this area are to address the modeling of very high quality models (refinement) and modeling those proteins where templates cannot be used due to the lack of detectable signal on the sequence level. Despite significant progress in ab initio protein structure prediction, benchmarks indicate that “template-free” techniques still cannot get the overall fold correct for the majority of targets (Kinch et al., 2011a). This is a major limitation as most structurally related proteins share low sequence identity that is indistinguishable from noise (Rost, 1997).

One possible avenue to make a significant breakthrough in structure modeling is to take advantage of the rapid advances in experimental techniques. The next generation of modeling approaches should be able to incorporate indirect structural data from high throughput experiments (Fiser, 2004). In this spirit, a growing number of methods incorporate a variety of easily obtainable NMR data as restraints to guide protein structure modeling or simulation. Many of these methods focus on backbone NMR chemical shift (CS) assignments. Obtaining CS is a necessary first step in the classical NMR structure determination process. Backbone CS data are easier to obtain than side chain resonance assignments or large numbers of interproton distances (NOEs).

A number of programs exist that use NMR CS data to predict secondary structure conformations (Hung and Samudrala, 2003; Shen et al., 2009a; Wishart and Sykes, 1994). Within the framework of developing the TALOS program, it was shown that CS data can guide the selection of tripeptide segments with similar conformations and provide preferences/restraints for main-chain dihedral angles (Cornilescu et al., 1999; Shen et al., 2009a). Recently, TALOS was extended to specifically address CS-based dihedral angle predictions in loop segments (Shen and Bax, 2012). The highly successful Rosetta ab initio fragment assembly program (Bonneau et al., 2002b) was combined with chemical shift data and sparse NOE restraints (~1 per residue) to steer the selection and filtering of 3 and 9 residue fragments, besides taking into account sequence similarity measures of these fragments (Bowers et al., 2000). In a similar approach by Rose et al. (2007), experimentally determined CS and sequence patterns were used to search the protein database for consecutively overlapping six residue long backbone fragments, which then were “stitched” together using Monte Carlo simulation (Gong et al., 2007). In more recent applications, CS-Rosetta was shown to be successful in delivering high quality models when using CS data in combination with sequence information (Shen et al., 2008; Shen et al., 2009b). Subsequently, the robustness of the approach as a function of CS assignment completeness was also assessed (Shen et al., 2009b). The advantage of using CS in structure modeling proved particularly useful when it was challenged on a pair of proteins with high sequence identity but exhibiting different folds (Shen et al., 2010).

The applicability of CS-Rosetta was recently extended to larger molecules (>12 kDa) through a combined approach that uses sequence information of short 3 and 9 residue segments, NMR CS, and residual dipolar coupling data (Raman et al., 2010). Similar ideas are implemented in the CHESHIRE method, which first predicts secondary structures of 3 and 9 residue fragments using CS data and then combines these fragments into larger ones by matching sequence information, secondary structures and CS patterns (Cavalli et al., 2007). In an elegant approach from the same group, NMR CS data were converted into forces in molecular dynamics simulations and were successfully used to fold short polypeptide chains or to refine partially unfolded structures (Robustelli et al., 2009; Robustelli et al., 2010). An important advance for that work was the development of the CamShift method (Kohlhoff et al., 2009), which quickly predicts CS values from structures by approximating CS with a polynomial function of interatomic distances. This results in a function that is readily differentiable with respect to the atomic coordinates and is therefore suitable for use as a restraint in molecular dynamics simulations. Besides CamShift, several other approaches are available that calculate theoretical CS values for a given structure, such as SHIFTX2 (Han et al., 2011), SPARTA+ (Shen and Bax, 2010) and PROSHIFT (Meiler, 2003). GENMR (Berjanskii et al., 2009) is a very fast modeling implementation that combines homology models with CS and/or NOE data. The component of GENMR that relies on structure calculation using CS and sequence information without NOE data is CS23D (Wishart et al., 2008). CS23D incorporates various other methods, such as threading, homology modeling or small fragment assembly using the Rosetta program.

Recently, we have explored the limits of applicability of our previously developed fragment-based loop modeling approach (Fernandez-Fuentes et al., 2006a; Fernandez-Fuentes et al., 2006b) and observed that the protein structure universe seems to have saturated on the level of super-secondary motifs (Fernandez-Fuentes and Fiser, 2006). We define super-secondary structure motifs (Smotifs) systematically as two secondary structures with a connecting loop. We have built a library containing clusters of Smotifs with similar internal geometry and observed that new folds discovered during the last decade did not require the emergence of new Smotifs, but are simply a consequence of novel combinations of existing Smotifs (Fernandez-Fuentes et al., 2010). This observation suggests the hypothesis that it should be possible to build any new structure, of a known or yet to be discovered fold, by combining existing Smotifs from already known structures. The library of Smotifs is a backbone-only, geometrically defined fragment library, which means that for practical modeling applications, a relation needs to be made between the target protein and specific fragments in the library. In this work, this connection is made via the use of NMR CS data. We present a computational approach in which the structure of a protein can be modeled from NMR CS assignments alone, without any sequence information as input. When tested on a set of 102 different fold topologies, the method returned a homology model quality solution for about 90% of cases and at least a topologically correct fold for almost all of them. As the current approach employs large chunks of super-secondary structures, it is well suited to model larger proteins.

Results

In NMR structural biology, CSs are mainly used to characterize regular secondary structures. However, the underlying hypothesis in our approach is that a CS pattern characterizing the connecting loop region in an Smotif will determine the relative orientation of flanking secondary structures. The notion for this hypothesis emerges from the success of an inverse application in which conformations of loops were successfully modeled by the fit of the corresponding Smotif, specifically the flanking secondary structure residues, in the relevant structural environments in the template and target structures (Fernandez-Fuentes et al., 2006a; Fernandez-Fuentes et al., 2005). Here, we introduce a structure modeling algorithm (SmotifCS) that does not use any sequence similarity information at all but takes advantage of the indirect structural information conveyed by the CSs of loop residues and our exhaustive Smotif library. The method can be divided into three stages: selection of candidate Smotifs, sampling Smotif combinations, and scoring these combinations to generate compact folds (Fig. 1) (for details see Experimental Methods).

Figure 1. Flowchart of the modeling algorithm. Inset: unit vector representation of Smotifs. The axis of the largest moment of inertia is shown as a red arrow and runs the length of the corresponding secondary structure, while the normalized unit vector has a blue cap. See also Figure S5.

Benchmarking the algorithm

We implemented our prediction method on a dataset of 102 proteins obtained from the BMRB (Ulrich et al., 2008) database (Table S1). The test set is currently the largest non-redundant dataset of experimentally known structures for which CS data are publicly available and where all structures represent a different SCOP Fold category (Andreeva et al., 2008). This selection ensures that the largest possible variety of proteins is tested with respect to secondary structure composition and topology. The results are presented as a distribution of GDT_TS scores (Zemla, 2003) of the superposed backbone atoms for the entire lengths of the experimental structure and the top ranked model (Fig. 2). The top ranked models have GDT_TS scores in the range of 20–80%. The number of proteins where the best sampled models have GDT_TS >= 50% is 47 (Fig. 2). This means that for about half of the cases a high quality homology model is generated and, for almost all cases, at least a topologically correct fold is produced. The 102 proteins can be broken down into different SCOP Classes, with slight differences in performance. The best performing Classes in terms of median GDT_TS scores are the all-α Class (44%), followed by the α/β Class (40%), while the all-β Class (37%) and α+β Class (36%) lag behind slightly, and the Class of small proteins is in the middle (39%). The only two designed proteins in our set perform the best, although the statistics are very limited. We also employed a smaller, separate set of 10 proteins for exploring some of the computationally intensive aspects of the method.
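
For readers unfamiliar with the metric, the following is a minimal sketch of a simplified GDT_TS calculation. It assumes that the model and the experimental structure are already superposed and that each residue is represented by one point (for example Cα); the function name `gdt_ts` and its arguments are illustrative and not part of the published method. The full GDT procedure (Zemla, 2003) additionally searches over many superpositions to maximize the number of residues under each cutoff, which this sketch does not do.

```python
import math

def gdt_ts(model, native, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS (in percent) for two already-superposed traces.

    model, native: equal-length lists of (x, y, z) tuples in Angstroms,
    one point per residue. No superposition search is performed here.
    """
    if len(model) != len(native) or not native:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    # Per-residue distance between model and experimental positions.
    dists = [math.dist(m, n) for m, n in zip(model, native)]
    # Fraction of residues within each distance cutoff.
    fractions = [sum(d <= c for d in dists) / len(dists) for c in cutoffs]
    # GDT_TS is the mean of the four fractions, expressed as a percentage.
    return 100.0 * sum(fractions) / len(cutoffs)
```

Because only one fixed superposition is used, the value returned is a lower bound on the score that a full GDT search would report for the same pair of structures.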

Figure 2. Distribution of GDT_TS scores in the test dataset as a function of secondary structure assignment accuracy from CS data. The entire dataset contains 102 proteins (black columns). This dataset is split into two subsets: in 50 proteins at least one secondary structure is incorrectly assigned (light gray), while in the other 52 all Smotifs are captured correctly (dark gray). See also Figures S1, S2, S3, S4 and Tables S1, S2, S3, S4.

We also compared SmotifCS to CSRosetta (http://www.csrosetta.org) for a randomly selected subset of 15 proteins from our test set (Table S1). Such a comparison is not entirely like-for-like, since our approach does not use any sequence information at all: Smotifs are used with their backbone geometries and we generate backbone-only models, while Rosetta relies on fragments collected from sequentially related structures. In order to establish comparable conditions, we purged from the Rosetta fragment database all homologous PDB templates that were detected for a target protein using HHblits (Remmert et al., 2012) and Psi-Blast (Altschul and Koonin, 1998). This eliminated on average 0.82% of the 3 residue fragments and 1.43% of the 9 residue fragments that CSRosetta could use in modeling. In a head-to-head comparison on the 15 randomly picked test cases, the two methods show competitive performance, with average GDT_TS scores of 52.07 ± 3.08 and 55.07 ± 3.16 (mean ± standard error of the mean) for SmotifCS and CSRosetta, respectively. Fig. S1 shows that CSRosetta outperforms SmotifCS in 5 cases and SmotifCS outperforms CSRosetta in 7 cases, with both of them performing comparably in 3 cases. In terms of required computational time, CSRosetta takes about an order of magnitude longer to perform the calculations for the same proteins, and this difference increases rapidly with protein size. The fact that SmotifCS, due to the large chunks of super-secondary structures it uses, does not scale exponentially with increasing protein size makes it a promising approach to model larger proteins for which CS data can be collected.

Individual modeling cases

We explored the modeling of a designed protein (PDB code 2kl8 (Koga et al., 2012)). The advantage of this case is that it presents no bias with respect to other already known experimental structures or topologies for sampling Smotifs. For 2kl8 we obtain a high quality model with a GDT_TS score of 50.33 (Fig. 3a). By definition, since 2kl8 is a unique fold, all the Smotifs used to build this model come from unrelated proteins but, in addition, all five Smotifs come from five unique folds. Pairwise superpositions of the sampled and experimental Smotifs give RMSDs in the range of 0.66–2.55 Å, but in the assembled model the C-terminal motif flips around the core of the protein, which limits the GDT_TS score to 50.33. Designed proteins are usually engineered intentionally with short loops and compact structures, which could also have contributed to the overall good result.

Figure 3. Structural superpositions of the top ranked models (in pink) with the solution structures (in blue) for (a) 2kl8, (b) 1khm and (c) 2jya are shown in the center, with the overall GDT_TS score indicated in brackets. The templates from which the Smotifs are sampled are shown in gray, with the Smotifs themselves colored according to their secondary structures (helix=red; loop=green; strand=yellow). The PDB code, chain and residues contributing to the Smotif template, the SCOP identifier of the template (if available) and the RMSD between the template and the native Smotif are shown.

One of the better performances for a fold with a mixed composition of β-strands and α-helices is observed in the case of 1khm (Baber et al., 1999), with an overall GDT_TS score of 68.57 (Fig. 3b). Folds with a mixed composition of secondary structure types usually pose a more difficult challenge. The five Smotifs (with pairwise RMSD accuracies in the range of 0.55–2.58 Å) that the algorithm identified for this modeling case come from a diverse set of SCOP Superfamilies. The general tendency that Smotifs are typically sampled from a range of unrelated folds underlines the algorithmic concept, where large modular building blocks are identified that are shared between unrelated folds that do not necessarily show any overall structural homology.

It has been observed in similar studies that proteins with long loops or disordered segments pose the most difficult challenge for modeling. We also show that Smotifs with long loops are less well-sampled and, in general, harder to match well when comparing CS loop “fingerprints”. When modeling 2jya (Fig. 3c), which has particularly long loops (longest loop length 22 residues, total loop content is 72%), it is clear that while the core of the protein made of regular secondary structures is well captured, the two long loops are poorly modeled resulting in an overall GDT_TS score of 39.53. If we calculate the RMSD for the whole model we obtain 9.46 Å but if we calculate the RMSD of the structured core only it is 1.50 Å.

There are various reasons for the method delivering a mediocre performance for some cases. It could be due to incorrect Smotif definitions resulting from errors in prediction of dihedral angles from TALOS, insufficient sampling, rotation or rigid body shifts of Smotifs during enumeration or inadequate scoring. More often than not, the method generates a good model, only to not identify it as the top scoring one (Table S2). Sometimes, one incorrect Smotif is sampled and that leads to a wrong model. This is observed more often in proteins with Smotifs with long loops (Fig. 3c). A 180 degree rotation of a single Smotif during enumeration is sufficient to drastically lower the GDT_TS scores as was described in the case of 2kl8 (Fig. 3a).

Accuracy of sampling Smotifs

We assessed how effectively the CS fingerprint of a loop region can be used to identify relatively large protein fragments such as Smotifs. We ran the Smotif identification and CS pattern matching algorithm on known Smotifs from the fragment library, using the corresponding SPARTA-generated theoretical CSs as inputs. In all cases, the query Smotif was eliminated from the pool of potential matches. For Smotifs with loops shorter than 9 residues (~77% of all Smotifs in the library) the algorithm identifies candidate Smotifs whose overall backbone RMSD is within 2 Å of the query (Fig. 4). For longer loops, the accuracy gradually decreases, partly due to the sparsity of Smotifs with longer loops in the database itself. Even the theoretical best match (based on overall backbone RMSD) to the query Smotifs has increasingly larger RMSD values as the loops get longer. However, it was estimated in the past that about 85% of loop segments are shorter than 12 residues (Fiser et al., 2000) and this estimate is confirmed by our current test set (Fig. S2). We explored the effect of loop length on the accuracy of the final models and observed that proteins with Smotifs containing long loops had lower overall model quality, which worsened further for proteins containing several long loops (Fig. S3a and S3b), in accordance with observations from other similar studies (Raman et al., 2010). Although most Smotifs have short loops (Fig. S2), we find that there are 57 proteins in our dataset of 102 proteins with at least one Smotif with a loop length equal to or longer than 9 residues. This indicates that although long loops are indeed rare within the whole Smotif space, they can affect many different proteins, since one such long loop often exists in a given structure. We also analyzed how the contact order of proteins affects our prediction quality. It has been shown that lower contact order proteins fold faster and are predicted more accurately by ab initio structure prediction methods (Bonneau et al., 2002a). This phenomenon can be observed in the current application as well (Fig. S3d).

Figure 4. (a) Accuracy of identifying Smotifs from CS data using structural fingerprints. Accuracy of selection (in RMSD) is shown as a function of loop length for the helix-helix sub-type. The best available Smotifs present in the library (theoretical limit), the best Smotif selected by CS matching and the average of the top 8 Smotifs selected are shown in green, blue and red, respectively. Standard deviations are shown. (b) Illustration of pre-calculated structural weights for each type of CS. For each combination of residue type, preceding residue type and atom type, the secondary structural preferences are obtained (helical, strand and coil in blue, red and green, respectively). The largest relative frequency is reduced by the second largest value for each normalized chemical shift value to obtain a relative weight (in black), which correlates with the information content carried by the normalized chemical shift value. The example shown here corresponds to the C atom of the Ala-Met dipeptide.

Performance of scoring functions

We explored other scoring functions to rank the sampled models, but we did not find a clear advantage over the chosen one. The number of proteins (out of the 102 in the dataset) where the top scoring model has GDT_TS >= 50 when ranked by our scoring function SmotifCS, Rosetta (Bonneau et al., 2002b), Prosa (Sippl, 1993) and Dfire (Yang and Zhou, 2008a, b) is very similar: 21, 19, 21 and 21, respectively (Fig. S4). Consensus approaches have become very popular and powerful in protein structure prediction over the last decade, utilizing the simple idea of signal to noise improvement through averaging (Kurowski and Bujnicki, 2003). Indeed, if we average the rankings of models by the different scoring functions we get an improved performance, ranking 24 proteins with top models above GDT_TS 50%, and 90 above GDT_TS 30% (Fig. S4). The GDT_TS (Zemla, 2003) based accuracies of the best models generated for the 10 proteins in our smaller testing set with four alternative energy functions (SmotifCS, Prosa, Dfire and Rosetta) are given in Table S2. Earlier studies indicated that a practical discriminator between ab initio and homology model quality models is around GDT_TS 30%, while a score above 50% signals high quality homology models (Kinch et al., 2011b). Although all 10 proteins have high quality homology models sampled with GDT_TS scores >= 50 (column 3 in Table S2), only 7 out of 10 cases were identified as such by the SmotifCS energy function (column 4 in Table S2). Similar success rates are observed for Prosa, Dfire and Rosetta (columns 5, 6 & 7 in Table S2), with a slight edge when using consensus scoring. This suggests that scoring functions deliver comparable performance despite the large differences in their complexity and style (Deng et al., 2012; Rykunov and Fiser, 2010).
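
The consensus idea of averaging rankings can be illustrated with a short sketch. It assumes that every scoring function is energy-like (lower is better), that each function scores the same list of models, and that ties are broken by model index; the function name `consensus_rank` and the dictionary layout are hypothetical and are not taken from the published implementation.

```python
def consensus_rank(scores_by_function):
    """Order models by their mean rank across several scoring functions.

    scores_by_function: dict mapping a scoring-function name to a list of
    scores, one per model, where a lower score means a better model.
    Returns model indices sorted from best to worst consensus rank.
    """
    n_models = len(next(iter(scores_by_function.values())))
    rank_sums = [0.0] * n_models
    for scores in scores_by_function.values():
        # Rank 0 goes to the model with the lowest (best) score.
        order = sorted(range(n_models), key=lambda i: scores[i])
        for rank, idx in enumerate(order):
            rank_sums[idx] += rank
    n_funcs = len(scores_by_function)
    mean_ranks = [s / n_funcs for s in rank_sums]
    return sorted(range(n_models), key=lambda i: mean_ranks[i])
```

Averaging ranks rather than raw scores sidesteps the fact that different scoring functions report values on different scales.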

Sampling full models

While it appears that we are operating at the edge of performance of current leading scoring functions, the prospects of improving performance on the sampling side turn out to be more promising. Sampling is designed with practicality in mind, to generate about a million full models for ranking that can be built and scored within a reasonable time. We explored the accuracy of the method if we generate ~0.5 and ~1.5 million conformations and do not change anything else in the modeling process. The test was run on only 10 selected proteins due to the intensity of computation, and it shows a tendency that one can indeed deliver more accurate models with enhanced sampling, with average GDT_TS scores of 50.02, 54.19 and 57.11 for 0.5M, 1.0M and 1.5M sampled conformations, respectively (Table S3). This is especially important in the case of larger proteins with more Smotifs, where there is more limited sampling per Smotif to generate the same total number of full model decoys. This trend is clear if one plots the accuracy of models as a function of the number of Smotifs they are composed of (Fig. S3c).
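
The trade-off between the number of Smotifs and the sampling depth per Smotif can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that every Smotif position receives the same number of candidates n and that all combinations are enumerated, so that n raised to the number of Smotifs must not exceed the model budget; the actual allocation used by the method may differ, and the function name is hypothetical.

```python
def candidates_per_smotif(total_models, n_smotifs):
    """Largest n such that n ** n_smotifs <= total_models.

    Assumes an equal number of candidates at every Smotif position and
    full enumeration of their combinations (an illustrative assumption).
    """
    if total_models < 1 or n_smotifs < 1:
        raise ValueError("both arguments must be positive integers")
    n = 1
    # Integer search avoids floating-point error in fractional powers.
    while (n + 1) ** n_smotifs <= total_models:
        n += 1
    return n
```

Under this assumption, a budget of one million models allows 1000 candidates per position for a protein with 2 Smotifs but only 5 per position for a protein with 8 Smotifs.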

Another issue is how the availability of Smotifs limits the quality of the resulting models. Out of 455 Smotifs that the algorithm had to identify for our test set of 102 proteins, we located an Smotif within 1.0 Å of the best available Smotif in our library for 338 cases. Meanwhile we confirmed again that the Smotif library itself is robust. Only 32 of the 455 Smotifs do not have a template in the library within 4.0 Å and these 32 difficult Smotifs come from 25 different proteins in the dataset. These discrepancies may be addressed in the future with a successful refinement method, although this remains a challenging task.

Accuracy of CS-based secondary structure prediction

While the idea of using Smotif elements to assemble full structures seems powerful, it is also vulnerable to the accuracy with which secondary structures are defined from CS values. An error in this first step of the algorithm is hard to correct in subsequent steps and can lead to low quality models. We currently use TALOS+ to determine secondary structure locations within the query protein sequences, and in 50 out of 102 proteins we get at least one major Smotif definition incorrect in terms of the number of Smotifs (where we predict more or fewer Smotifs than exist in the native structure) or the type of Smotif (where we predict an incorrect secondary structure and Smotif type, such as a helix-helix Smotif instead of a strand-helix). If we include minor discrepancies such as the Smotif starting position and the length of secondary structures or loops (within a four residue margin), then 64 out of the 102 proteins have at least one incorrect Smotif definition. Fig. 2 shows the GDT_TS distributions of cases where we get the major Smotif definitions correct (52 of the 102 proteins) and incorrect (50 of the 102 proteins). Incorrect definitions clearly affect the resulting model quality, with median GDT_TS scores for the correctly and incorrectly assigned proteins of 43% and 33%, respectively. When exploring alternative approaches, we used a non-CS based secondary structure prediction method (PSIPRED (Jones, 1999)) for assignment, and we obtained a slightly worse but statistically indistinguishable overall performance.

Effect of incomplete CS assignment

Since the algorithm depends entirely on CS information, we explored how incorrect CS assignments or missing data affect our results. It has been estimated that more than 20% of the proteins in the BMRB are improperly referenced and that about 1% of all chemical shifts are mis-assigned (Wang et al., 2010). During the selection of our test set we required that the BMRB and PDB entries match, therefore no CS data were missing in the loop regions that we focus on. We randomly picked 5%–30% of loop residues and of the three flanking residues from the bracing secondary structures and replaced their CS values with random coil values. In our algorithm these positions are disregarded, as they do not differ from the reference random coil distribution. The simulation on 10 test proteins shows that the accuracy of the results does not change until at least 15% of residues have missing CS data; beyond that, the results degrade proportionately (Table S4). This amount of error is unusual in chemical shift assignments and therefore should not be a bottleneck in our approach.
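
The masking experiment described above can be sketched as follows. The sketch assumes that shifts are stored in normalized form (observed value minus random coil value), so that resetting a residue to random coil means setting its entries to zero; the data layout, the function name `mask_shifts` and the fixed random seed are illustrative choices, not details of the published protocol.

```python
import random

def mask_shifts(norm_shifts, fraction, seed=0):
    """Reset a random fraction of residues to random coil values.

    norm_shifts: dict residue_index -> dict atom_name -> normalized shift
    (observed minus random coil), so 0.0 corresponds to random coil.
    Returns (masked_copy, set_of_masked_residue_indices).
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    residues = sorted(norm_shifts)
    n_masked = round(fraction * len(residues))
    masked = set(rng.sample(residues, n_masked))
    # Build a new structure so the input data are left untouched.
    result = {
        res: {atom: (0.0 if res in masked else value)
              for atom, value in atoms.items()}
        for res, atoms in norm_shifts.items()
    }
    return result, masked
```

A masked position contributes nothing that distinguishes it from random coil, which matches the statement that such positions are effectively disregarded.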

Role of refinement to identify near native models

We carried out a short refinement with the Rosetta structure prediction program, using the 200 best scored models obtained from the set of ten proteins, to see if initial models can be improved (column 8 in Table S2). We also carried out a Monte Carlo refinement of the top five models obtained from SmotifCS with the goal of further sampling the structural space found to exist within an individual cluster of Smotifs (column 9 in Table S2). We reached similar conclusions in both attempts: no systematic pattern emerged, and model accuracy improved or declined randomly.

Discussion

Chemical shift data is often the most easily obtainable type of data from NMR experiments and is a necessary first step in the classical NMR pipeline. This has prompted several groups to research ways to incorporate this information in modeling protein structures. Unlike other forms of NMR data such as RDC and NOE constraints, chemical shifts only provide information about local structure, and existing methods have so far always incorporated sequence information of very short segments along with the chemical shifts to model protein structures. In our fragment-based approach, the selection of building blocks solely depends on the CS patterns of secondary structure spanning loop regions, without relying on any sequence similarity information. Here, we have shown that by taking into account the predicted secondary structure of an unknown protein, we can sample large fragments of super secondary structures (Smotifs). This results in reduced combinatorial complexity and thus complete models can be obtained by full enumeration, leading to the possibility of tractably modeling larger proteins. We find that the method works best in proteins with relatively short loops (less than 9 residues). As we venture into proteins with longer loops, the method becomes less reliable and requires further improvement so that it can be applied to a wide variety of protein structures. The success of this algorithm strengthens the hypothesis that the space of ordered proteins consists of a limited set of already observed Smotifs, and suggests that the Smotif library is a useful tool for protein structure prediction as well as other applications such as protein design.

Experimental Methods

The structure modeling algorithm introduced here (SmotifCS) can be divided into three stages: selection of candidate Smotifs through pattern matching, sampling Smotif combinations, and scoring these combinations to generate compact folds (Fig. 1). The method requires two databases, one that organizes Smotif fragments and another pre-calculated database that contains the relative weights of structural information conveyed by a given normalized chemical shift value.

Building the Smotif database

The Smotif database currently consists of 466,939 Smotifs obtained from 28,012 sequentially non-redundant protein structures (culled using PISCES (Wang and Dunbrack, 2003); dataset is non-redundant at 99% identity; only X-ray; R-factor < 0.3; resolution < 3.0 Å; length 40–10000 residues) obtained from the Protein Data Bank (Berman et al., 2007). The Smotifs are classified into subtypes according to their bracing secondary structures: helix-helix, helix-strand, strand-helix and strand-strand. Within each subtype, structurally similar Smotifs are clustered based on RMSD measurements. In order to make structural comparison of Smotifs with different lengths of bracing secondary structures uniform, a unit vector is used to represent each secondary structure at its largest moment of inertia, and RMSD is calculated between corresponding unit vectors (Fig. 1). This representation also ensures that otherwise well superposing but out-of-register helix-loop junction points will not dominate the quality of RMSD calculation. All backbone atom (N, H^N, Hα, Cα, Cβ, C’) chemical shifts are pre-calculated for all Smotifs in our library using SPARTA+ (Shen and Bax, 2010).
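
One way to reduce a secondary structure element to a unit vector is sketched below. It takes the axis along which the residue coordinates spread the most, which is our reading of the axis that, according to the legend of Fig. 1, runs the length of the element; this interpretation, the use of power iteration on the coordinate scatter matrix, and the function name `principal_axis` are all assumptions made for illustration rather than the published implementation.

```python
import math

def principal_axis(coords, iterations=100):
    """Unit vector along the direction of greatest spread of `coords`.

    coords: list of (x, y, z) tuples ordered from N- to C-terminus.
    The returned vector points from the first residue toward the last.
    """
    if len(coords) < 2:
        raise ValueError("need at least two points")
    n = len(coords)
    center = [sum(p[k] for p in coords) / n for k in range(3)]
    centered = [[p[k] - center[k] for k in range(3)] for p in coords]
    # 3x3 scatter matrix of the centered coordinates.
    scatter = [[sum(p[i] * p[j] for p in centered) for j in range(3)]
               for i in range(3)]
    # Power iteration for the dominant eigenvector, started from the
    # end-to-end direction of the element.
    end_to_end = [coords[-1][k] - coords[0][k] for k in range(3)]
    v = end_to_end[:]
    for _ in range(iterations):
        w = [sum(scatter[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("degenerate coordinates")
        v = [x / norm for x in w]
    # Fix the sign so that the axis follows the chain direction.
    if sum(a * b for a, b in zip(v, end_to_end)) < 0:
        v = [-x for x in v]
    return v
```

Because the result has unit length regardless of how many residues the element contains, two Smotifs with helices or strands of different lengths can be compared on the same footing.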

Database of relative weights for all chemical shift types

The structure prediction algorithm relies on another pre-calculated database that contains the relative weights of structural information conveyed by a given normalized chemical shift. Predicted CS values aggregated from all library Smotifs were divided into groups based on atom type, residue type, and preceding residue type, resulting in 6 × 20 × 20 = 2400 categories. For each category, CS values were normalized by subtracting the random coil value. Since each CS corresponds to a residue in one of three possible structural subtypes (helical, strand, and ‘other’), the statistical propensity of a given CS value for each of these structural subtypes (within each of the 2400 shift-type categories) is calculated. The relative weight of structural information conveyed by a given CS (categorized by atom type, residue type, and preceding residue type) is calculated as the difference between the statistical propensities of the 'most favored' and 'second-most favored' secondary structural conformations (Fig. 4).
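
The weight calculation for a single shift-type category can be sketched as follows. The sketch assumes that normalized shifts are grouped into fixed-width bins and that the propensity of a structural state within a bin is its relative frequency there; the bin width, the state labels 'H', 'E' and 'O', and the function name `relative_weights` are hypothetical choices for illustration.

```python
import math
from collections import Counter, defaultdict

def relative_weights(observations, bin_width=0.5):
    """Relative structural weight per normalized-shift bin for one category.

    observations: iterable of (normalized_shift, state) pairs, where state
    is 'H' (helical), 'E' (strand) or 'O' (other).
    Returns dict bin_index -> (most_favored_state, weight), where weight is
    the most favored relative frequency minus the second most favored one.
    """
    counts = defaultdict(Counter)
    for shift, state in observations:
        counts[math.floor(shift / bin_width)][state] += 1
    weights = {}
    for bin_index, state_counts in counts.items():
        total = sum(state_counts.values())
        freqs = sorted(((state_counts[s] / total, s) for s in ("H", "E", "O")),
                       reverse=True)
        weights[bin_index] = (freqs[0][1], freqs[0][0] - freqs[1][0])
    return weights
```

A bin in which one state dominates receives a weight close to 1, whereas a bin in which two states are equally likely receives a weight of 0 and therefore carries no structural information. In the scheme described above, one such table would be built for each of the 2400 categories.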

Test dataset of experimentally known proteins

Entries that had either been deposited simultaneously with a corresponding PDB entry or had a corresponding solution NMR PDB entry with a BMRB “comparison score” less than or equal to 9 were extracted from the BMRB database (out of a current total of 7881 entries). Entries with identical sequence to the corresponding PDB file and with complete CS data were retained. In order to select the widest possible range of protein architectures, all entries were cross-referenced with SCOP (Andreeva et al., 2008). From the remaining set we selected 102 proteins that did not generate errors when running TALOS+ (Table S1). In terms of SCOP Class definition, the test set contained 42 all-α, 8 all-β, 3 α/β, 33 α+β, 14 Small proteins and 2 designed proteins, all belonging to a unique Fold category. The length of the proteins ranged between 56 and 130 residues with a median length of 88 residues, and these proteins comprise 2–8 Smotifs. Due to the intensity of the computation, the algorithm itself was parameterized, trained and developed on a smaller set of 10 proteins solved by NMR and disjoint from the 102-member test set described above: 2kl8, 2kcl, 2kd1, 2kpo, 2kys, 2jmo, 2jua, 2jve, 2jvf and 2l2n.

Selection of Smotifs via pattern matching

CS “fingerprints” of loop regions

In order to identify encoded information about the relative orientation of regular secondary structures within each predicted Smotif of the query sequence, we compared CS patterns of the loop segments and the three flanking secondary structure residues on each side, between the experimental CSs of the query Smotif and the theoretical CSs of available Smotifs in our library. The goal is to transfer the structure information from the identified library Smotif to the local region of the query sequence. The complete CS information of the regular secondary structures does not carry much information for this purpose. Further, the TALOS+ predicted Φ/ψ angles (also obtained from the experimental CSs) were used to assign each loop residue of the query Smotif to one of the 11 possible locations within the Ramachandran map (Fernandez-Fuentes et al., 2006a). The string of Ramachandran map sublocations constituted the “fingerprint” of a loop segment, which was compared to similar fingerprints derived from the Smotifs in our library. The best matching Smotif fingerprints were then ranked by their CS match “score”, obtained as the sum of weighted squared differences between the chemical shifts of the query and library Smotifs, considering only the loop and three flanking residues in each case.
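The fingerprint comparison can be pictured as a position-by-position string match. The sketch below is a simplified illustration: the one-letter region codes, the Smotif identifiers and the function names are hypothetical, and only library loops of exactly the same length as the query are compared. The threshold relaxation mirrors the selection step described under the algorithmic steps.

```python
def fingerprint_matches(query_fp, library_fp):
    """Count loop positions assigned to the same Ramachandran region.

    Fingerprints are strings with one (hypothetical) letter per residue.
    """
    if len(query_fp) != len(library_fp):
        raise ValueError("fingerprints must have equal length")
    return sum(1 for q, l in zip(query_fp, library_fp) if q == l)

def select_by_fingerprint(query_fp, library, min_hits=20):
    """Return library Smotifs with at least N_s matching positions,
    relaxing N_s from the best observed value until `min_hits`
    candidates are found (or N_s reaches zero).
    """
    scored = [
        (fingerprint_matches(query_fp, fp), name)
        for name, fp in library.items()
        if len(fp) == len(query_fp)  # simplification: equal lengths only
    ]
    if not scored:
        return []
    n_s = max(score for score, _ in scored)
    while n_s > 0:
        hits = [name for score, name in scored if score >= n_s]
        if len(hits) >= min_hits:
            return hits
        n_s -= 1
    return [name for _, name in scored]

# Hypothetical five-residue loop fingerprints.
library = {"m1": "ABBGA", "m2": "ABBGE", "m3": "AEEGA", "m4": "EEEEE"}
hits = select_by_fingerprint("ABBGA", library, min_hits=2)
```

In this toy example the required number of matches is relaxed from 5 to 4 before two candidates are available.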

Algorithmic steps

Given the pre-determined databases of theoretical CSs for each Smotif in our library and a statistical set of weights for the CS values, potential building block candidates for modeling a query protein with known experimental CSs are identified from the Smotif library using the following steps:

(a) Secondary structure of the residues and putative Smotifs in the query protein are identified with TALOS+ using the experimental CS values (Shen et al., 2009a). Each loop residue is assigned to one of the 11 possible structural categories on the Ramachandran map (Fernandez-Fuentes et al., 2006a; Fiser et al., 2000), based on its TALOS+ predicted Φ/ψ angles, and the loop segment is represented by this string of conformational categories as a structural “fingerprint”.
(b) For each Smotif in the query protein, a list of candidates from the Smotif library is obtained by filtering according to its type (helix-helix, helix-strand, strand-helix, or strand-strand), loop length (which can tolerate a one-residue difference) and secondary structure length (which tolerates a two-residue difference). Within the set of Smotif candidates, redundant Smotifs are removed at 100% sequence identity.
(c) The best matching Smotif (the one with the largest number of matching residue-level Φ/ψ structural categories) is found from the library using the Smotif fingerprints. Then the required number of residue matches with the query fingerprint (N_s) is relaxed until at least 20 Smotifs are selected.
(d) The chemical shift difference (ΔCS) between the two Smotifs, considering the loop and 3 flanking residues, is calculated as a sum of the weighted squared difference between the experimental CS values of the query Smotif and the SPARTA+ calculated theoretical CS values of the library Smotif. The Smotif candidates are then ranked by their overall chemical shift difference.
$\Delta CS = \frac{1}{N} \sum_{p} \sum_{a} \left| CS_{p,a}^{\mathrm{query}} - CS_{p,a}^{\mathrm{lib}} \right| \times w_{p,a}^{d}$
where CS is the normalized chemical shift value, p is the residue position, d is the preceding residue type to position p, a is the atom type compared, N is the total number of residues compared (loop + six flanking secondary structure residues), and w is one of the 2400 pre-calculated CS weights, specific to the residue type, preceding residue type and atom type.
(e) Once a list of ranked Smotifs is obtained, we employ two different clustering methods to further refine the Smotif selection procedure:
1. Sampling pre-calculated structural clusters from the Smotif library. Each of the Smotif candidates obtained from steps (a)–(d) belongs to a pre-defined cluster of structurally similar Smotifs in the library. We sample as many diverse clusters (between 4 and 10) among these as possible, irrespective of the frequency of occurrence of Smotifs among the top hits. The number of picked motifs is limited by our computational capacity and therefore depends on the size of the protein. For larger proteins with more Smotifs we pick a smaller number of candidates and vice versa, so that we end up sampling about a million different combinations overall. The number of Smotifs chosen from the library varies between 10 and 4 for query proteins with a predicted number of Smotifs varying from 2 to 8, respectively. From each cluster, the Smotif with the lowest ΔCS is chosen.
2. Sampling dynamically calculated clusters. The clusters in the library were generated using the RMSDs between the unit vector representations of the corresponding secondary structures (Fig. 1). Since the previous clustering method ignores the details of the loop regions within the Smotifs, here we use an alternative approach that takes the loop residues of the Smotifs into consideration. We start with the top 200 Smotifs selected from steps (a)–(d) described above, but we skip the highly restrictive step (c) since it often provides fewer than 200 Smotifs. We then calculate the all-vs.-all backbone atom RMSD, including loop and secondary structure residues, for all 200 Smotifs and use Phylip (Retief, 2000) to carry out hierarchical UPGMA clustering at a 2.0 Å cutoff. We then calculate a score for each Smotif as a function of its cluster size, as follows: Score = (200 − current rank) + Cluster Size, where current rank is based on ΔCS as described in step (d) above and cluster size is the size of the cluster to which the Smotif belongs. Smotifs are re-ranked based on this score and then the top 4–10 Smotifs are picked depending on the size of the protein, as before. This approach is expected to facilitate identifying Smotifs that are structurally similar (including in the loop region) and hence cluster together, without compromising on the CS ranking. It also increases the preference for structurally similar Smotifs that occur more frequently in the top 200 than the others.
(f) We pool all non-redundant Smotifs from both clustering methods in step (e) to obtain a set of 8–20 Smotifs per query.
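The ranking by ΔCS and the re-ranking by cluster size can be sketched as follows. This is a minimal illustration under stated assumptions: ΔCS follows the equation as printed (weighted absolute differences averaged over N residue positions), the weights are assumed to be already looked up for each (position, atom) pair, ranks are assumed to start at 1, and cluster memberships are supplied by the caller instead of being computed with Phylip. All names and numbers are hypothetical.

```python
def delta_cs(query_cs, library_cs, weights):
    """Weighted chemical shift difference between two Smotifs.

    query_cs, library_cs: dict (position, atom) -> normalized CS value
    weights: dict (position, atom) -> pre-looked-up weight (assumption)
    N is taken as the number of residue positions compared.
    """
    shared = [key for key in query_cs if key in library_cs]
    positions = {p for p, _ in shared}
    if not positions:
        raise ValueError("no common chemical shifts to compare")
    total = sum(abs(query_cs[k] - library_cs[k]) * weights[k] for k in shared)
    return total / len(positions)

def rerank_by_cluster(ranked_names, cluster_of, pool_size=200):
    """Re-rank Smotifs with Score = (pool_size - rank) + cluster size.

    ranked_names: Smotif ids sorted by increasing delta CS (best first)
    cluster_of: dict Smotif id -> cluster label (supplied by the caller)
    """
    top = ranked_names[:pool_size]
    sizes = {}
    for name in top:
        sizes[cluster_of[name]] = sizes.get(cluster_of[name], 0) + 1
    scores = {
        name: (pool_size - rank) + sizes[cluster_of[name]]
        for rank, name in enumerate(top, start=1)
    }
    # sorted() is stable, so ties keep their delta CS order.
    return sorted(top, key=lambda name: scores[name], reverse=True)

# Hypothetical shifts for two residues and two atom types.
query = {(1, "CA"): 1.0, (1, "N"): -2.0, (2, "CA"): 0.5}
lib = {(1, "CA"): 0.5, (1, "N"): -1.0, (2, "CA"): 0.5}
wts = {(1, "CA"): 0.4, (1, "N"): 0.2, (2, "CA"): 0.4}
dcs = delta_cs(query, lib, wts)

# "a" has the best delta CS but is a singleton; "b", "c", "d" cluster together.
reranked = rerank_by_cluster(["a", "b", "c", "d"], {"a": 0, "b": 1, "c": 1, "d": 1})
```

In the toy re-ranking, the second-ranked Smotif overtakes the top hit because it belongs to a cluster of three, which is the intended effect of adding the cluster size to the rank-based term.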

Sampling and scoring of Smotif combinations

After a suitable set of candidates has been selected for each putative Smotif in the query structure, a full enumeration of the structures is carried out by joining every possible combination of these Smotifs. Successive motifs are joined by optimally superposing their overlapping secondary structures. The lengths of the secondary structures of the sampled Smotifs are extended or shortened to fit the query sequence. In the process of joining Smotifs, a limited number of steric clashes (equal to the total number of Smotifs in the structure) is allowed. The candidate structures resulting from the full enumeration are evaluated using a two-pass linear scoring function with the following components:

Radius of gyration using Cα carbons
A distance-dependent statistical potential function (Rykunov and Fiser, 2007, 2010; Rykunov et al., 2009)
An implicit solvation potential (Lazaridis and Karplus, 1999)
A knowledge-based long-range backbone hydrogen-bonding potential (Morozov and Kortemme, 2005)
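A two-pass linear scoring scheme over z-scored components can be sketched as below. This is an illustration only: the component names, raw values and weights are hypothetical, lower combined scores are assumed to be better, and z-scores are assumed to be computed over the set of models being ranked in each pass.

```python
import statistics

def z_scores(values):
    """Convert raw component values into z-scores over the given set."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0] * len(values)
    return [(v - mean) / sd for v in values]

def linear_score(components, weights):
    """Weighted sum of z-scored components, one score per model.

    components: dict name -> list of raw values (one per model)
    weights: dict name -> weight (hypothetical values)
    """
    names = list(weights)
    z = {name: z_scores(components[name]) for name in names}
    n_models = len(components[names[0]])
    return [sum(weights[n] * z[n][i] for n in names) for i in range(n_models)]

def two_pass_rank(components, coarse_w, refined_w, keep=5000):
    """Rank all models with the coarse weights, keep the best `keep`,
    then re-rank that subset with the refined weights.
    Returns model indices, best first (lower score assumed better).
    """
    coarse = linear_score(components, coarse_w)
    order = sorted(range(len(coarse)), key=lambda i: coarse[i])[:keep]
    subset = {n: [vals[i] for i in order] for n, vals in components.items()}
    refined = linear_score(subset, refined_w)
    return [order[j] for j in sorted(range(len(order)), key=lambda j: refined[j])]

# Four hypothetical models with two of the four components.
components = {"rg": [10.0, 12.0, 15.0, 11.0], "pair": [-5.0, -9.0, -1.0, -7.0]}
ranking = two_pass_rank(
    components,
    coarse_w={"rg": 1.0, "pair": 0.0},
    refined_w={"rg": 0.0, "pair": 1.0},
    keep=3,
)
```

In the toy example the least compact model (index 2) is discarded in the first pass, and the three survivors are then ordered by the pair term alone.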

All components were converted into statistical z-scores before combining them. The weights for the linear scoring function were optimized on a set of decoy structures obtained from 5 proteins of varying sizes and secondary structure composition (1ptf, 1m7t, 1zlm, 2lis, 2dc3), all of which were disjoint from the proteins used to test this algorithm. Importantly, the decoy sets were organized into two subsets: the first included only ‘distant’ structures (>3 Å RMSD from the native structures) and generated the weights for the ‘coarse’ grained scoring function, and the second consisted of near-native structures (<3 Å RMSD from the native structures) and yielded the weights for the ‘refined’ scoring function (Fig. S5). All enumerated models were scored with the coarse scoring function, then the top 5000 structures were re-ranked using the refined scoring function. The best 200 structures from this re-ranking were relaxed using MODELLER (Fiser and Sali, 2003) to resolve steric clashes and maintain stereochemistry. The dominant components in the coarse grained scoring function turned out to be the radius of gyration and the implicit solvation potential. At this stage it is expected that these terms efficiently select compact sampled conformations with a native-like proportion of buried/exposed hydrophobic/hydrophilic residues. Meanwhile, in the fine-grained scoring function, the distance-dependent statistical pair potential and explicit backbone hydrogen bond potential terms dominate. This makes sense because at this latter stage the selected conformations are all reasonably compact and have a good solvation profile; therefore, the quality of internal contacts and the correct register of the hydrogen bonds are expected to play a more prominent role. The accuracy of the final models was evaluated using root mean square deviations (RMSD) and GDT_TS scores (Zemla, 2003) with respect to the experimental solution structure, considering the entire protein. The GDT_TS score is the average percentage of structurally equivalent pairs of residues at 1, 2, 4 and 8 Å cutoff values upon optimal superposition of the experimental solution structure and the computational model.
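The GDT_TS definition can be made concrete with a short sketch. It is a simplification: the per-residue Cα distances are assumed to come from a single superposition that has already been performed, whereas the LGA program searches for the superposition that maximizes the count at each cutoff. The distances and the function name are hypothetical.

```python
def gdt_ts(distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Average percentage of residues within each distance cutoff.

    distances: per-residue deviations (in Angstrom) between model and
    experimental structure after superposition (assumed precomputed).
    """
    if not distances:
        raise ValueError("no residue distances supplied")
    n = len(distances)
    fractions = [sum(1 for d in distances if d <= c) / n for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)

# Five hypothetical residues: 20%, 40%, 60% and 80% fall within
# the 1, 2, 4 and 8 Angstrom cutoffs, respectively.
score = gdt_ts([0.5, 1.5, 3.0, 6.0, 10.0])
```

Unlike RMSD, a single badly placed residue (10 Å in this example) lowers the score only by its share at each cutoff and does not dominate the measure.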

Refining near native models

The internal angles of the loop residues of each Smotif in the structure were allowed to vary randomly within ±30 degrees of the original Smotif configuration at each Monte Carlo step, with the condition that no additional steric clashes are created. Perturbed structures were accepted using the Metropolis criterion, and the algorithm was run for 3000 iterations. Alternatively, a short Rosetta refinement was carried out using the top 100 models obtained.
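A Monte Carlo refinement of this kind can be sketched as follows. The sketch is a toy under stated assumptions: the energy function, the temperature, the clash test and the target angles are hypothetical placeholders, and a real implementation would rebuild the coordinates and evaluate the scoring function at every step.

```python
import math
import random

def metropolis_accept(delta_e, temperature, rng):
    """Metropolis criterion: always accept downhill moves, accept uphill
    moves with probability exp(-delta_e / temperature)."""
    if delta_e <= 0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)

def refine(angles0, energy, has_new_clash, n_steps=3000,
           max_dev=30.0, temperature=1.0, seed=0):
    """Perturb loop angles within +/- max_dev degrees of the original
    values, rejecting moves that create additional clashes."""
    rng = random.Random(seed)
    current = list(angles0)
    e_cur = energy(current)
    best, e_best = list(current), e_cur
    for _ in range(n_steps):
        # Each trial stays within max_dev of the ORIGINAL configuration.
        trial = [a0 + rng.uniform(-max_dev, max_dev) for a0 in angles0]
        if has_new_clash(trial):
            continue
        e_trial = energy(trial)
        if metropolis_accept(e_trial - e_cur, temperature, rng):
            current, e_cur = trial, e_trial
            if e_cur < e_best:
                best, e_best = list(current), e_cur
    return best, e_best

# Toy problem: three loop angles and a hypothetical quadratic energy.
original = [-60.0, 140.0, 60.0]
target = [-75.0, 150.0, 45.0]

def toy_energy(angles):
    return sum((a - t) ** 2 for a, t in zip(angles, target))

best_angles, best_energy = refine(original, toy_energy, lambda angles: False)
```

Drawing every trial around the original angles keeps the refinement local to the selected Smotif geometry, which matches the ±30 degree restriction described above.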

Supplementary Material

NIHMS482228-supplement-01.docx (604.1 KB, docx)

Acknowledgements

We thank Jerry Karp for contributing to the test set selection and Dr. David Cowburn for critical reading of the manuscript. This work was supported by grants NIH R01 GM096041 and NIH 5U54GM094662.


References

Altschul SF, Koonin EV. Iterated profile searches with PSI-BLAST--a tool for discovery in protein databases. Trends Biochem.Sci. 1998;23:444. doi: 10.1016/s0968-0004(98)01298-5. [DOI] [PubMed] [Google Scholar]
Andreeva A, Howorth D, Chandonia JM, Brenner SE, Hubbard TJ, Chothia C, Murzin AG. Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res. 2008;36:D419–D425. doi: 10.1093/nar/gkm993. [DOI] [PMC free article] [PubMed] [Google Scholar]
Baber JL, Libutti D, Levens D, Tjandra N. High precision solution structure of the C-terminal KH domain of heterogeneous nuclear ribonucleoprotein K, a c-myc transcription factor. Journal of Molecular Biology. 1999;289:949–962. doi: 10.1006/jmbi.1999.2818. [DOI] [PubMed] [Google Scholar]
Berjanskii M, Tang P, Liang J, Cruz JA, Zhou J, Zhou Y, Bassett E, MacDonell C, Lu P, Lin G, Wishart DS. GeNMR: a web server for rapid NMR-based protein structure determination. Nucleic Acids Res. 2009;37:W670–W677. doi: 10.1093/nar/gkp280. [DOI] [PMC free article] [PubMed] [Google Scholar]
Berman H, Henrick K, Nakamura H, Markley JL. The worldwide Protein Data Bank (wwPDB): ensuring a single, uniform archive of PDB data. Nucleic Acids Res. 2007;35:D301–D303. doi: 10.1093/nar/gkl971. [DOI] [PMC free article] [PubMed] [Google Scholar]
Bonneau R, Ruczinski I, Tsai J, Baker D. Contact order and ab initio protein structure prediction. Protein Sci. 2002a;11:1937–1944. doi: 10.1110/ps.3790102. [DOI] [PMC free article] [PubMed] [Google Scholar]
Bonneau R, Strauss CE, Rohl CA, Chivian D, Bradley P, Malmstrom L, Robertson T, Baker D. De novo prediction of three-dimensional structures for major protein families. J Mol Biol. 2002b;322:65–78. doi: 10.1016/s0022-2836(02)00698-8. [DOI] [PubMed] [Google Scholar]
Bowers PM, Strauss CE, Baker D. De novo protein structure determination using sparse NMR data. J.Biomol.NMR. 2000;18:311. doi: 10.1023/a:1026744431105. [DOI] [PubMed] [Google Scholar]
Cavalli A, Salvatella X, Dobson CM, Vendruscolo M. Protein structure determination from NMR chemical shifts. Proc Natl Acad Sci U S A. 2007;104:9615–9620. doi: 10.1073/pnas.0610313104. [DOI] [PMC free article] [PubMed] [Google Scholar]
Cornilescu G, Delaglio F, Bax A. Protein backbone angle restraints from searching a database for chemical shift and sequence homology. J Biomol NMR. 1999;13:289–302. doi: 10.1023/a:1008392405740. [DOI] [PubMed] [Google Scholar]
Deng H, Jia Y, Wei Y, Zhang Y. What is the best reference state for designing statistical atomic potentials in protein structure prediction? Proteins. 2012;80:2311–2322. doi: 10.1002/prot.24121. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fernandez-Fuentes N, Dybas JM, Fiser A. Structural Characteristics of Novel Protein Folds. Plos Computational Biology. 2010;6 doi: 10.1371/journal.pcbi.1000750. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fernandez-Fuentes N, Fiser A. Saturating representation of loop conformational fragments in structure databanks. BMC Struct Biol. 2006;6:15. doi: 10.1186/1472-6807-6-15. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fernandez-Fuentes N, Oliva B, Fiser A. A supersecondary structure library and search algorithm for modeling loops in protein structures. Nucleic Acids Res. 2006a;34:2085–2097. doi: 10.1093/nar/gkl156. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fernandez-Fuentes N, Querol E, Aviles FX, Sternberg MJ, Oliva B. Prediction of the conformation and geometry of loops in globular proteins: testing ArchDB, a structural classification of loops. Proteins. 2005;60:746–757. doi: 10.1002/prot.20516. [DOI] [PubMed] [Google Scholar]
Fernandez-Fuentes N, Zhai J, Fiser A. ArchPRED: a template based loop structure prediction server. Nucleic Acids Res. 2006b;34:W173–W176. doi: 10.1093/nar/gkl113. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fiser A. Protein structure modeling in the proteomics era. Expert Rev Proteomics. 2004;1:97–110. doi: 10.1586/14789450.1.1.97. [DOI] [PubMed] [Google Scholar]
Fiser A, Do RK, Sali A. Modeling of loops in protein structures. Protein Sci. 2000;9:1753. doi: 10.1110/ps.9.9.1753. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fiser A, Sali A. Modeller: generation and refinement of homology-based protein structure models. Methods Enzymol. 2003;374:461. doi: 10.1016/S0076-6879(03)74020-8. [DOI] [PubMed] [Google Scholar]
Gong H, Shen Y, Rose GD. Building native protein conformation from NMR backbone chemical shifts using Monte Carlo fragment assembly. Protein Sci. 2007;16:1515–1521. doi: 10.1110/ps.072988407. [DOI] [PMC free article] [PubMed] [Google Scholar]
Han B, Liu YF, Ginzinger SW, Wishart DS. SHIFTX2: significantly improved protein chemical shift prediction. Journal of Biomolecular Nmr. 2011;50:43–57. doi: 10.1007/s10858-011-9478-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hung LH, Samudrala R. Accurate and automated classification of protein secondary structure with PsiCSI. Protein Sci. 2003;12:288–295. doi: 10.1110/ps.0222303. [DOI] [PMC free article] [PubMed] [Google Scholar]
Jones DT. Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol. 1999;292:195–202. doi: 10.1006/jmbi.1999.3091. [DOI] [PubMed] [Google Scholar]
Kinch L, Shi SY, Cong Q, Cheng H, Liao YX, Grishin NV. CASP9 assessment of free modeling target predictions. Proteins-Structure Function and Bioinformatics. 2011a;79:59–73. doi: 10.1002/prot.23181. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kinch LN, Shi S, Cheng H, Cong Q, Pei J, Mariani V, Schwede T, Grishin NV. CASP9 target classification. Proteins. 2011b;79(Suppl 10):21–36. doi: 10.1002/prot.23190. [DOI] [PMC free article] [PubMed] [Google Scholar]
Koga N, Tatsumi-Koga R, Liu GH, Xiao R, Acton TB, Montelione GT, Baker D. Principles for designing ideal protein structures. Nature. 2012;491:222. doi: 10.1038/nature11600. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kohlhoff KJ, Robustelli P, Cavalli A, Salvatella X, Vendruscolo M. Fast and accurate predictions of protein NMR chemical shifts from interatomic distances. J Am Chem Soc. 2009;131:13894–13895. doi: 10.1021/ja903772t. [DOI] [PubMed] [Google Scholar]
Kurowski MA, Bujnicki JM. GeneSilico protein structure prediction meta-server. Nucleic Acids Res. 2003;31:3305. doi: 10.1093/nar/gkg557. [DOI] [PMC free article] [PubMed] [Google Scholar]
Lazaridis T, Karplus M. Effective energy function for proteins in solution. Proteins. 1999;35:133. doi: 10.1002/(sici)1097-0134(19990501)35:2<133::aid-prot1>3.0.co;2-n. [DOI] [PubMed] [Google Scholar]
Marti-Renom MA, Stuart AC, Fiser A, Sanchez R, Melo F, Sali A. Comparative protein structure modeling of genes and genomes. Annu.Rev.Biophys.Biomol.Struct. 2000;29:291. doi: 10.1146/annurev.biophys.29.1.291. [DOI] [PubMed] [Google Scholar]
Meiler J. PROSHIFT: protein chemical shift prediction using artificial neural networks. J Biomol NMR. 2003;26:25–37. doi: 10.1023/a:1023060720156. [DOI] [PubMed] [Google Scholar]
Morozov AV, Kortemme T. Potential functions for hydrogen bonds in protein structure prediction and design. Adv Protein Chem. 2005;72:1–38. doi: 10.1016/S0065-3233(05)72001-5. [DOI] [PubMed] [Google Scholar]
Pillardy J, Czaplewski C, Liwo A, Lee J, Ripoll DR, Kazmierkiewicz R, Oldziej S, Wedemeyer WJ, Gibson KD, Arnautova YA, et al. Recent improvements in prediction of protein structure by global optimization of a potential energy function. Proc.Natl.Acad.Sci.U.S.A. 2001;98:2329. doi: 10.1073/pnas.041609598. [DOI] [PMC free article] [PubMed] [Google Scholar]
Raman S, Lange OF, Rossi P, Tyka M, Wang X, Aramini J, Liu GH, Ramelot TA, Eletsky A, Szyperski T, et al. NMR Structure Determination for Larger Proteins Using Backbone-Only Data. Science. 2010;327:1014–1018. doi: 10.1126/science.1183649. [DOI] [PMC free article] [PubMed] [Google Scholar]
Remmert M, Biegert A, Hauser A, Soding J. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nature Methods. 2012;9:173–175. doi: 10.1038/nmeth.1818. [DOI] [PubMed] [Google Scholar]
Retief JD. Phylogenetic analysis using PHYLIP. Methods Mol.Biol. 2000;132:243. doi: 10.1385/1-59259-192-2:243. [DOI] [PubMed] [Google Scholar]
Robustelli P, Cavalli A, Dobson CM, Vendruscolo M, Salvatella X. Folding of small proteins by Monte Carlo simulations with chemical shift restraints without the use of molecular fragment replacement or structural homology. J Phys Chem B. 2009;113:7890–7896. doi: 10.1021/jp900780b. [DOI] [PubMed] [Google Scholar]
Robustelli P, Kohlhoff K, Cavalli A, Vendruscolo M. Using NMR chemical shifts as structural restraints in molecular dynamics simulations of proteins. Structure. 2010;18:923–933. doi: 10.1016/j.str.2010.04.016. [DOI] [PubMed] [Google Scholar]
Rost B. Protein structures sustain evolutionary drift. Fold.Des. 1997;2:S19. doi: 10.1016/s1359-0278(97)00059-x. [DOI] [PubMed] [Google Scholar]
Rykunov D, Fiser A. Effects of amino acid composition, finite size of proteins, and sparse statistics on distance-dependent statistical pair potentials. Proteins. 2007;67:559–568. doi: 10.1002/prot.21279. [DOI] [PubMed] [Google Scholar]
Rykunov D, Fiser A. New statistical potential for quality assessment of protein models and a survey of energy functions. BMC Bioinformatics. 2010;11:128. doi: 10.1186/1471-2105-11-128. [DOI] [PMC free article] [PubMed] [Google Scholar]
Rykunov D, Steinberger E, Madrid-Aliste CJ, Fiser A. Improved scoring function for comparative modeling using the M4T method. J Struct Funct Genomics. 2009;10:95–99. doi: 10.1007/s10969-008-9044-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
Schwede T, Sali A, Honig B, Levitt M, Berman HM, Jones D, Brenner SE, Burley SK, Das R, Dokholyan NV, et al. Outcome of a workshop on applications of protein models in biomedical research. Structure. 2009;17:151–159. doi: 10.1016/j.str.2008.12.014. [DOI] [PMC free article] [PubMed] [Google Scholar]
Shen Y, Bax A. SPARTA plus : a modest improvement in empirical NMR chemical shift prediction by means of an artificial neural network. Journal of Biomolecular Nmr. 2010;48:13–22. doi: 10.1007/s10858-010-9433-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
Shen Y, Bax A. Identification of helix capping and beta-turn motifs from NMR chemical shifts. Journal of Biomolecular Nmr. 2012;52:211–232. doi: 10.1007/s10858-012-9602-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
Shen Y, Bryan PN, He YN, Orban J, Baker D, Bax A. De novo structure generation using chemical shifts for proteins with high-sequence identity but different folds. Protein Science. 2010;19:349–356. doi: 10.1002/pro.303. [DOI] [PMC free article] [PubMed] [Google Scholar]
Shen Y, Delaglio F, Cornilescu G, Bax A. TALOS+: a hybrid method for predicting protein backbone torsion angles from NMR chemical shifts. J Biomol NMR. 2009a;44:213–223. doi: 10.1007/s10858-009-9333-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
Shen Y, Lange O, Delaglio F, Rossi P, Aramini JM, Liu G, Eletsky A, Wu Y, Singarapu KK, Lemak A, et al. Consistent blind protein structure generation from NMR chemical shift data. Proc Natl Acad Sci U S A. 2008;105:4685–4690. doi: 10.1073/pnas.0800256105. [DOI] [PMC free article] [PubMed] [Google Scholar]
Shen Y, Vernon R, Baker D, Bax A. De novo protein structure generation from incomplete chemical shift assignments. J Biomol NMR. 2009b;43:63–78. doi: 10.1007/s10858-008-9288-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
Sippl MJ. Recognition of errors in three-dimensional structures of proteins. Proteins. 1993;17:355. doi: 10.1002/prot.340170404. [DOI] [PubMed] [Google Scholar]
Ulrich EL, Akutsu H, Doreleijers JF, Harano Y, Ioannidis YE, Lin J, Livny M, Mading S, Maziuk D, Miller Z, et al. BioMagResBank. Nucleic Acids Res. 2008;36:D402–D408. doi: 10.1093/nar/gkm957. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wang B, Wang Y, Wishart DS. A probabilistic approach for validating protein NMR chemical shift assignments. J Biomol NMR. 2010;47:85–99. doi: 10.1007/s10858-010-9407-y. [DOI] [PubMed] [Google Scholar]
Wang G, Dunbrack RL., Jr PISCES: a protein sequence culling server. Bioinformatics. 2003;19:1589. doi: 10.1093/bioinformatics/btg224. [DOI] [PubMed] [Google Scholar]
Wishart DS, Arndt D, Berjanskii M, Tang P, Zhou J, Jr, Lin G. CS23D: a web server for rapid protein structure generation using NMR chemical shifts and sequence data. Nucleic Acids Res. 2008;36:W496–W502. doi: 10.1093/nar/gkn305. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wishart DS, Sykes BD. The 13C chemical-shift index: a simple method for the identification of protein secondary structure using 13C chemical-shift data. J Biomol NMR. 1994;4:171–180. doi: 10.1007/BF00175245. [DOI] [PubMed] [Google Scholar]
Yang Y, Zhou Y. Ab initio folding of terminal segments with secondary structures reveals the fine difference between two closely related all-atom statistical energy functions. Protein Sci. 2008a;17:1212–1219. doi: 10.1110/ps.033480.107. [DOI] [PMC free article] [PubMed] [Google Scholar]
Yang Y, Zhou Y. Specific interactions for ab initio folding of protein terminal regions with secondary structures. Proteins. 2008b;72:793–803. doi: 10.1002/prot.21968. [DOI] [PubMed] [Google Scholar]
Zemla A. LGA: A method for finding 3D similarities in protein structures. Nucleic Acids Res. 2003;31:3370–3374. doi: 10.1093/nar/gkg571. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zhan C, Fedorov EV, Shi W, Ramagopal UA, Thirumuruhan R, Manjasetty BA, Almo SC, Fiser A, Chance MR, Fedorov AA. The ybeY protein from Escherichia coli is a metalloprotein. Acta Crystallogr Sect F Struct Biol Cryst Commun. 2005;61:959–963. doi: 10.1107/S1744309105031131. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

NIHMS482228-supplement-01.docx^{(604.1KB, docx)}

[R1] Altschul SF, Koonin EV. Iterated profile searches with PSI-BLAST--a tool for discovery in protein databases. Trends Biochem.Sci. 1998;23:444. doi: 10.1016/s0968-0004(98)01298-5. [DOI] [PubMed] [Google Scholar]

[R2] Andreeva A, Howorth D, Chandonia JM, Brenner SE, Hubbard TJ, Chothia C, Murzin AG. Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res. 2008;36:D419–D425. doi: 10.1093/nar/gkm993. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R3] Baber JL, Libutti D, Levens D, Tjandra N. High precision solution structure of the C-terminal KH domain of heterogeneous nuclear ribonucleoprotein K, a c-myc transcription factor. Journal of Molecular Biology. 1999;289:949–962. doi: 10.1006/jmbi.1999.2818. [DOI] [PubMed] [Google Scholar]

[R4] Berjanskii M, Tang P, Liang J, Cruz JA, Zhou J, Zhou Y, Bassett E, MacDonell C, Lu P, Lin G, Wishart DS. GeNMR: a web server for rapid NMR-based protein structure determination. Nucleic Acids Res. 2009;37:W670–W677. doi: 10.1093/nar/gkp280. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R5] Berman H, Henrick K, Nakamura H, Markley JL. The worldwide Protein Data Bank (wwPDB): ensuring a single, uniform archive of PDB data. Nucleic Acids Res. 2007;35:D301–D303. doi: 10.1093/nar/gkl971. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R6] Bonneau R, Ruczinski I, Tsai J, Baker D. Contact order and ab initio protein structure prediction. Protein Sci. 2002a;11:1937–1944. doi: 10.1110/ps.3790102. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R7] Bonneau R, Strauss CE, Rohl CA, Chivian D, Bradley P, Malmstrom L, Robertson T, Baker D. De novo prediction of three-dimensional structures for major protein families. J Mol Biol. 2002b;322:65–78. doi: 10.1016/s0022-2836(02)00698-8. [DOI] [PubMed] [Google Scholar]

[R8] Bowers PM, Strauss CE, Baker D. De novo protein structure determination using sparse NMR data. J.Biomol.NMR. 2000;18:311. doi: 10.1023/a:1026744431105. [DOI] [PubMed] [Google Scholar]

[R9] Cavalli A, Salvatella X, Dobson CM, Vendruscolo M. Protein structure determination from NMR chemical shifts. Proc Natl Acad Sci U S A. 2007;104:9615–9620. doi: 10.1073/pnas.0610313104. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R10] Cornilescu G, Delaglio F, Bax A. Protein backbone angle restraints from searching a database for chemical shift and sequence homology. J Biomol NMR. 1999;13:289–302. doi: 10.1023/a:1008392405740. [DOI] [PubMed] [Google Scholar]

[R11] Deng H, Jia Y, Wei Y, Zhang Y. What is the best reference state for designing statistical atomic potentials in protein structure prediction? Proteins. 2012;80:2311–2322. doi: 10.1002/prot.24121. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R12] Fernandez-Fuentes N, Dybas JM, Fiser A. Structural Characteristics of Novel Protein Folds. Plos Computational Biology. 2010;6 doi: 10.1371/journal.pcbi.1000750. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R13] Fernandez-Fuentes N, Fiser A. Saturating representation of loop conformational fragments in structure databanks. BMC Struct Biol. 2006;6:15. doi: 10.1186/1472-6807-6-15. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R14] Fernandez-Fuentes N, Oliva B, Fiser A. A supersecondary structure library and search algorithm for modeling loops in protein structures. Nucleic Acids Res. 2006a;34:2085–2097. doi: 10.1093/nar/gkl156. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R15] Fernandez-Fuentes N, Querol E, Aviles FX, Sternberg MJ, Oliva B. Prediction of the conformation and geometry of loops in globular proteins: testing ArchDB, a structural classification of loops. Proteins. 2005;60:746–757. doi: 10.1002/prot.20516. [DOI] [PubMed] [Google Scholar]

[R16] Fernandez-Fuentes N, Zhai J, Fiser A. ArchPRED: a template based loop structure prediction server. Nucleic Acids Res. 2006b;34:W173–W176. doi: 10.1093/nar/gkl113. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R17] Fiser A. Protein structure modeling in the proteomics era. Expert Rev Proteomics. 2004;1:97–110. doi: 10.1586/14789450.1.1.97. [DOI] [PubMed] [Google Scholar]

[R18] Fiser A, Do RK, Sali A. Modeling of loops in protein structures. Protein Sci. 2000;9:1753. doi: 10.1110/ps.9.9.1753. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R19] Fiser A, Sali A. Modeller: generation and refinement of homology-based protein structure models. Methods Enzymol. 2003;374:461. doi: 10.1016/S0076-6879(03)74020-8. [DOI] [PubMed] [Google Scholar]

[R20] Gong H, Shen Y, Rose GD. Building native protein conformation from NMR backbone chemical shifts using Monte Carlo fragment assembly. Protein Sci. 2007;16:1515–1521. doi: 10.1110/ps.072988407. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R21] Han B, Liu YF, Ginzinger SW, Wishart DS. SHIFTX2: significantly improved protein chemical shift prediction. Journal of Biomolecular Nmr. 2011;50:43–57. doi: 10.1007/s10858-011-9478-4. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R22] Hung LH, Samudrala R. Accurate and automated classification of protein secondary structure with PsiCSI. Protein Sci. 2003;12:288–295. doi: 10.1110/ps.0222303. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R23] Jones DT. Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol. 1999;292:195–202. doi: 10.1006/jmbi.1999.3091. [DOI] [PubMed] [Google Scholar]

[R24] Kinch L, Shi SY, Cong Q, Cheng H, Liao YX, Grishin NV. CASP9 assessment of free modeling target predictions. Proteins-Structure Function and Bioinformatics. 2011a;79:59–73. doi: 10.1002/prot.23181. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R25] Kinch LN, Shi S, Cheng H, Cong Q, Pei J, Mariani V, Schwede T, Grishin NV. CASP9 target classification. Proteins. 2011b;79(Suppl 10):21–36. doi: 10.1002/prot.23190. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R26] Koga N, Tatsumi-Koga R, Liu GH, Xiao R, Acton TB, Montelione GT, Baker D. Principles for designing ideal protein structures. Nature. 2012;491:222. doi: 10.1038/nature11600. -+. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R27] Kohlhoff KJ, Robustelli P, Cavalli A, Salvatella X, Vendruscolo M. Fast and accurate predictions of protein NMR chemical shifts from interatomic distances. J Am Chem Soc. 2009;131:13894–13895. doi: 10.1021/ja903772t.

[R28] Kurowski MA, Bujnicki JM. GeneSilico protein structure prediction meta-server. Nucleic Acids Res. 2003;31:3305. doi: 10.1093/nar/gkg557.

[R29] Lazaridis T, Karplus M. Effective energy function for proteins in solution. Proteins. 1999;35:133. doi: 10.1002/(sici)1097-0134(19990501)35:2<133::aid-prot1>3.0.co;2-n.

[R30] Marti-Renom MA, Stuart AC, Fiser A, Sanchez R, Melo F, Sali A. Comparative protein structure modeling of genes and genomes. Annu Rev Biophys Biomol Struct. 2000;29:291. doi: 10.1146/annurev.biophys.29.1.291.

[R31] Meiler J. PROSHIFT: protein chemical shift prediction using artificial neural networks. J Biomol NMR. 2003;26:25–37. doi: 10.1023/a:1023060720156.

[R32] Morozov AV, Kortemme T. Potential functions for hydrogen bonds in protein structure prediction and design. Adv Protein Chem. 2005;72:1–38. doi: 10.1016/S0065-3233(05)72001-5.

[R33] Pillardy J, Czaplewski C, Liwo A, Lee J, Ripoll DR, Kazmierkiewicz R, Oldziej S, Wedemeyer WJ, Gibson KD, Arnautova YA, et al. Recent improvements in prediction of protein structure by global optimization of a potential energy function. Proc Natl Acad Sci U S A. 2001;98:2329. doi: 10.1073/pnas.041609598.

[R34] Raman S, Lange OF, Rossi P, Tyka M, Wang X, Aramini J, Liu GH, Ramelot TA, Eletsky A, Szyperski T, et al. NMR Structure Determination for Larger Proteins Using Backbone-Only Data. Science. 2010;327:1014–1018. doi: 10.1126/science.1183649.

[R35] Remmert M, Biegert A, Hauser A, Soding J. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat Methods. 2012;9:173–175. doi: 10.1038/nmeth.1818.

[R36] Retief JD. Phylogenetic analysis using PHYLIP. Methods Mol Biol. 2000;132:243. doi: 10.1385/1-59259-192-2:243.

[R37] Robustelli P, Cavalli A, Dobson CM, Vendruscolo M, Salvatella X. Folding of small proteins by Monte Carlo simulations with chemical shift restraints without the use of molecular fragment replacement or structural homology. J Phys Chem B. 2009;113:7890–7896. doi: 10.1021/jp900780b.

[R38] Robustelli P, Kohlhoff K, Cavalli A, Vendruscolo M. Using NMR chemical shifts as structural restraints in molecular dynamics simulations of proteins. Structure. 2010;18:923–933. doi: 10.1016/j.str.2010.04.016.

[R39] Rost B. Protein structures sustain evolutionary drift. Fold Des. 1997;2:S19. doi: 10.1016/s1359-0278(97)00059-x.

[R40] Rykunov D, Fiser A. Effects of amino acid composition, finite size of proteins, and sparse statistics on distance-dependent statistical pair potentials. Proteins. 2007;67:559–568. doi: 10.1002/prot.21279.

[R41] Rykunov D, Fiser A. New statistical potential for quality assessment of protein models and a survey of energy functions. BMC Bioinformatics. 2010;11:128. doi: 10.1186/1471-2105-11-128.

[R42] Rykunov D, Steinberger E, Madrid-Aliste CJ, Fiser A. Improved scoring function for comparative modeling using the M4T method. J Struct Funct Genomics. 2009;10:95–99. doi: 10.1007/s10969-008-9044-9.

[R43] Schwede T, Sali A, Honig B, Levitt M, Berman HM, Jones D, Brenner SE, Burley SK, Das R, Dokholyan NV, et al. Outcome of a workshop on applications of protein models in biomedical research. Structure. 2009;17:151–159. doi: 10.1016/j.str.2008.12.014.

[R44] Shen Y, Bax A. SPARTA+: a modest improvement in empirical NMR chemical shift prediction by means of an artificial neural network. J Biomol NMR. 2010;48:13–22. doi: 10.1007/s10858-010-9433-9.

[R45] Shen Y, Bax A. Identification of helix capping and beta-turn motifs from NMR chemical shifts. J Biomol NMR. 2012;52:211–232. doi: 10.1007/s10858-012-9602-0.

[R46] Shen Y, Bryan PN, He YN, Orban J, Baker D, Bax A. De novo structure generation using chemical shifts for proteins with high-sequence identity but different folds. Protein Sci. 2010;19:349–356. doi: 10.1002/pro.303.

[R47] Shen Y, Delaglio F, Cornilescu G, Bax A. TALOS+: a hybrid method for predicting protein backbone torsion angles from NMR chemical shifts. J Biomol NMR. 2009a;44:213–223. doi: 10.1007/s10858-009-9333-z.

[R48] Shen Y, Lange O, Delaglio F, Rossi P, Aramini JM, Liu G, Eletsky A, Wu Y, Singarapu KK, Lemak A, et al. Consistent blind protein structure generation from NMR chemical shift data. Proc Natl Acad Sci U S A. 2008;105:4685–4690. doi: 10.1073/pnas.0800256105.

[R49] Shen Y, Vernon R, Baker D, Bax A. De novo protein structure generation from incomplete chemical shift assignments. J Biomol NMR. 2009b;43:63–78. doi: 10.1007/s10858-008-9288-5.

[R50] Sippl MJ. Recognition of errors in three-dimensional structures of proteins. Proteins. 1993;17:355. doi: 10.1002/prot.340170404.

[R51] Ulrich EL, Akutsu H, Doreleijers JF, Harano Y, Ioannidis YE, Lin J, Livny M, Mading S, Maziuk D, Miller Z, et al. BioMagResBank. Nucleic Acids Res. 2008;36:D402–D408. doi: 10.1093/nar/gkm957.

[R52] Wang B, Wang Y, Wishart DS. A probabilistic approach for validating protein NMR chemical shift assignments. J Biomol NMR. 2010;47:85–99. doi: 10.1007/s10858-010-9407-y.

[R53] Wang G, Dunbrack RL Jr. PISCES: a protein sequence culling server. Bioinformatics. 2003;19:1589. doi: 10.1093/bioinformatics/btg224.

[R54] Wishart DS, Arndt D, Berjanskii M, Tang P, Zhou J, Jr, Lin G. CS23D: a web server for rapid protein structure generation using NMR chemical shifts and sequence data. Nucleic Acids Res. 2008;36:W496–W502. doi: 10.1093/nar/gkn305.

[R55] Wishart DS, Sykes BD. The 13C chemical-shift index: a simple method for the identification of protein secondary structure using 13C chemical-shift data. J Biomol NMR. 1994;4:171–180. doi: 10.1007/BF00175245.

[R56] Yang Y, Zhou Y. Ab initio folding of terminal segments with secondary structures reveals the fine difference between two closely related all-atom statistical energy functions. Protein Sci. 2008a;17:1212–1219. doi: 10.1110/ps.033480.107.

[R57] Yang Y, Zhou Y. Specific interactions for ab initio folding of protein terminal regions with secondary structures. Proteins. 2008b;72:793–803. doi: 10.1002/prot.21968.

[R58] Zemla A. LGA: A method for finding 3D similarities in protein structures. Nucleic Acids Res. 2003;31:3370–3374. doi: 10.1093/nar/gkg571.

[R59] Zhan C, Fedorov EV, Shi W, Ramagopal UA, Thirumuruhan R, Manjasetty BA, Almo SC, Fiser A, Chance MR, Fedorov AA. The ybeY protein from Escherichia coli is a metalloprotein. Acta Crystallogr Sect F Struct Biol Cryst Commun. 2005;61:959–963. doi: 10.1107/S1744309105031131.
