Abstract
Structural annotation of proteins that have no homologs of known 3D structure detectable by sequence-search methods remains a major challenge. We propose an original method that computes the conditional probabilities for the amino acid sequence of a protein to fit known protein 3D structures using a structural alphabet known as “Protein Blocks” (PBs). PBs constitute a library of 16 local structural prototypes that approximate every part of protein backbone structures. They are used to encode 3D protein structures into 1D PB sequences and to capture sequence-to-structure relationships. Our method relies on amino acid occurrence matrices, one for each PB, to score global and local threading of query amino acid sequences onto protein folds encoded as PB sequences. It does not use any information from residue contacts or sequence-search methods, nor does it explicitly incorporate the hydrophobic effect. The performance of the method was assessed with independent test datasets derived from SCOP 1.75A. With a Z-score cutoff that achieved 95% specificity (i.e., less than 5% false positives), global and local threading showed sensitivities of 64.1% and 34.2%, respectively. We further tested its performance on 57 difficult CASP10 targets that had no known homologs in PDB: 38 compatible templates were identified by our approach and 66% of these hits yielded correctly predicted structures. This method scales up well and offers promising perspectives for structural annotation at the genomic level. It has been implemented in the form of a web server that is freely available at http://www.bo-protscience.fr/forsa.
Keywords: protein structures, structural alphabet, fold recognition, protein domains, threading, sequence–structure relationship, structural annotation, protein blocks
Introduction
Reliable structural annotations of proteins are of crucial importance in the post-genomic era. In that respect, structural genomics initiatives to determine 3D structures of carefully chosen protein targets are expanding the known fold space.1–6 Computational structural genomics approaches provide alternative strategies to fill the gap between sequence and structure space.2,7 Structural inferences gathered from such methods are very helpful toward a comprehensive understanding of protein sequence–structure–function relationships and evolution.8 One common feature of many of these computational approaches is to search the Protein Data Bank (PDB)9 for related proteins of known 3D structure. Simple sequence search methods such as PSI-BLAST,10 HMMER,11 and IMPALA12 have been successful in recognizing closely related protein structures in the PDB, leading to the generation of 3D models by comparative modeling on the basis of one or more related structures.13 When these classical sequence search methods fail to identify templates in the PDB, fold recognition methods have been used.7
Most of the fold recognition methods rely on combination of sequence search methods and sequence–structure threading approaches.14–16 Such methods incorporate contact potentials and other knowledge-based potentials in conjunction with sequence–structure relationships in the threading algorithms.17 Fold recognition using fragment-assembly based method, such as ROBETTA,18 has been shown to be one of the best methods in CASP.19,20 Some fold recognition algorithms use consensus from multiple fold recognition approaches to identify the correct fold of the query sequence.14,21,22 Such meta-predictors have been shown to work better than methods using single fold recognition approach in CASP. The availability of diverse fold recognition algorithms is hence a key aspect in structure prediction and it provides added value to structure annotation methods.22
It has been shown that structural alphabets can be used to establish sequence–structure relationships.23–25 In particular, one such structural alphabet is known as “Protein Blocks” (PBs) which is a library of 16 local structural prototypes named a to p based on sliding window of pentapeptides to encode existing folds into PB sequences.24 PBs are the most widely used structural alphabet in terms of applications.25,26 The description of protein structures in terms of PBs may be applied to protein structure analysis, comparison, and mining.27–31
To date, there is no method that can directly score the compatibility of an amino acid sequence with known structures using sequence–structure threading approaches where the structures are described in terms of a structural alphabet. Here, we propose a method that investigates the potential of PBs to address this question. The general scheme of the proposed approach is illustrated in Figure 1. Sequence–structure relationships were extracted in the form of amino acid occurrence matrices for each of the 16 PBs. These 16 matrices were further used for calculating the conditional probabilities of an amino acid query sequence to fit PB-encoded structures from a fold library. No other information such as residue contact potentials, solvent accessibility, amino acid physico-chemical properties, or evolutionary data was introduced in our model. We evaluated the efficiency of these occurrence matrices using both global and local threading schemes to identify compatible folds on large sets of query protein sequences. The method was further benchmarked on a blind protein dataset from CASP10 that had no known close homologs in the PDB. The application of PBs to fold recognition, FoRSA (FOld Recognition using Structural Alphabet), is further discussed. A web server that implements the method has been developed and is freely available at http://www.bo-protscience.fr/forsa/.
Figure 1.

General scheme of method for finding compatible folds. Our algorithm uses sequence–structure relationships for threading of the input amino acid sequence over all PB sequences in the fold library. Threading scores are based on calculation of conditional probabilities for the input sequence to fit to the different folds in fold library using sequence–structure relationships between non-redundant amino acid sequences and their corresponding PB sequences.
Results
The general objective of this work is to investigate how a structural alphabet can be used to identify the compatible folds for a set of protein sequences with unknown structure. It relies on the availability of a fold library. Here, we used a library composed of 4142 known folds (hereafter called FAMREP), one from each protein family in SCOP (see Materials and Methods section). To test our approach, we used two datasets: TEST40 and TEST70 are composed of 816 and 1837 query protein sequences, respectively. These datasets were threaded against FAMREP fold library. They share, respectively, at most 40% and 70% sequence identity with any protein in the FAMREP fold library. We implemented a method that allows both global and local threading using dynamic programming algorithm which integrates the possibility of introducing gaps. The performance of the algorithm was assessed using different gap penalties for global and local threading approaches. The calculated percent accuracies for top-5 hits for TEST40 dataset are given in Table I. Gap penalty of 4 and 25 provided the maximum accuracy of 67.77% and 61.03% for top-5 hits using global and local threading approaches, respectively (Table I).
Table I.
Effect of Gap Penalty on Top-5 Accuracy of Global and Local Threading Approach Applied on TEST40 Dataset
| Gap penalty | Accuracy of global threading approach | Accuracy of local threading approach |
|---|---|---|
| 3 | 67.40% | — |
| 4 | **67.77%** | — |
| 5 | 67.40% | 43.26% |
| 7 | 64.34% | 46.94% |
| 10 | 57.35% | 52.70% |
| 15 | 51.23% | 57.11% |
| 20 | — | 59.56% |
| 25 | — | **61.03%** |
| 30 | — | 60.17% |
Maximum accuracies are shown in bold.
With these optimized gap penalties, we tested the performance of the method to find native folds of the protein sequences with known structures. Overall, the global threading algorithm was able to pick up the correct fold at top-1 hit in 52.57% and 61.68% of the cases for TEST40 and TEST70 datasets, respectively (Table II). The multi-domain proteins class showed the highest top-1 accuracy of 78.95% and 85.71% for TEST40 and TEST70 datasets, respectively. The domains in α/β class showed the second highest top-1 accuracy (60.68% and 70.31%, respectively). Regarding local threading, the method could pick the correct fold as top-1 hit in 49.14% and 55.96% of the cases for TEST40 and TEST70 datasets, respectively. Multi-domain proteins again showed the highest top-1 accuracy but with higher percentages than for global threading (Table II): 94.74% versus 78.95% and 95.24% versus 85.71% for TEST40 and TEST70, respectively. Confusion matrices for both TEST40 and TEST70 datasets using global and local threading approaches are provided as Supporting Information Tables SI and SII, respectively. The small protein class is confused with other protein classes and in particular with all β proteins.
Table II.
Assessment of Local and Global Threading Approaches When TEST40 and TEST70 Datasets Were Queried Against FAMREP Fold Library
| Dataset | SCOP class | Local threading, Top-5 | Local threading, Top-1 | Global threading, Top-5 | Global threading, Top-1 | Total number of domains |
|---|---|---|---|---|---|---|
| TEST40 | All α proteins | 56.30% | 49.58% | 43.70% | 34.45% | 119 |
| TEST40 | All β proteins | 51.59% | 42.04% | 65.61% | 47.77% | 157 |
| TEST40 | α/β proteins | 68.14% | 51.19% | 78.31% | 60.68% | 295 |
| TEST40 | α+β proteins | 57.01% | 47.51% | 66.52% | 52.94% | 221 |
| TEST40 | Multi-domain proteins | 100.00% | 94.74% | 84.21% | 78.95% | 19 |
| TEST40 | Small proteins | 80.00% | 40.00% | 80.00% | 40.00% | 5 |
| TEST40 | Mean | 61.03% | 49.14% | 67.77% | 52.57% | 816 |
| TEST70 | All α proteins | 59.11% | 48.80% | 59.11% | 45.36% | 291 |
| TEST70 | All β proteins | 61.79% | 51.49% | 76.42% | 60.98% | 369 |
| TEST70 | α/β proteins | 76.22% | 63.37% | 85.24% | 70.31% | 576 |
| TEST70 | α+β proteins | 65.63% | 54.10% | 73.24% | 61.52% | 512 |
| TEST70 | Multi-domain proteins | 100.00% | 95.24% | 90.48% | 85.71% | 42 |
| TEST70 | Small proteins | 44.68% | 29.79% | 65.96% | 42.55% | 47 |
| TEST70 | Mean | 67.39% | 55.96% | 75.61% | 61.68% | 1837 |
Shown are the percentage of query domains for which the method picked up the correct fold within Top-1 and Top-5 hits.
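The Top-1 and Top-5 percentages can be computed from ranked hit lists as sketched below. This is a minimal illustration, not the authors' evaluation script: the function name `top_k_accuracy`, the dictionary layout, and the toy fold labels are all hypothetical.

```python
def top_k_accuracy(ranked_hits, true_folds, k):
    """Percentage of queries whose true fold is among the first k ranked hits.

    ranked_hits: {query_id: [fold labels sorted by decreasing score]}
    true_folds:  {query_id: correct fold label}
    """
    correct = sum(
        1 for query, hits in ranked_hits.items()
        if true_folds[query] in hits[:k]
    )
    return 100.0 * correct / len(ranked_hits)


# Hypothetical toy data: four queries ranked against three folds.
ranked_hits = {
    "q1": ["foldA", "foldB", "foldC"],
    "q2": ["foldC", "foldA", "foldB"],
    "q3": ["foldB", "foldC", "foldA"],
    "q4": ["foldA", "foldC", "foldB"],
}
true_folds = {"q1": "foldA", "q2": "foldA", "q3": "foldA", "q4": "foldB"}

top1 = top_k_accuracy(ranked_hits, true_folds, k=1)
top2 = top_k_accuracy(ranked_hits, true_folds, k=2)
```

With this toy input only one query has its true fold ranked first, while two have it within the first two hits, which mirrors why Top-5 values always equal or exceed Top-1 values.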
We also calculated sensitivity and specificity of global and local threading approaches for both TEST40 and TEST70 datasets at different Z-score cutoffs. Sensitivity and specificity plots are provided in Supporting Information Figure S1. The corresponding ROC curves are provided as Figure 2(a,b). Area under curve (AUC) for ROC of global threading in TEST40 is 0.86 and local threading is 0.75. As for TEST70 dataset, AUC of global threading is 0.89 and local threading is 0.78 which are very similar to those obtained for TEST40. The Z-score cutoffs that were found to yield 95% specificity at top-1 hit for global and local threading are shown in Table III. At these Z-score cutoffs we obtained sensitivities of 64.10% and 34.16% for global and local threading approaches, respectively, with TEST40. Regarding TEST70 dataset, the sensitivities were, respectively, 66.28% and 36.48%. With such thresholds, a maximum of 5% false positives are expected but the chance to identify true positives is approximately twofold higher in global threading when compared to local threading. In our subsequent analysis, we used a Z-score of 7.4 and 4.5 as thresholds for local and global threading, respectively.
Figure 2.

ROC curves for sensitivity and specificity plots (Supporting Information Fig. S1) for global and local threading approaches. (a) ROC on TEST40 dataset and (b) ROC on TEST70 dataset. Area under curve (AUC) for global threading in TEST40 is 0.86 and local threading is 0.75. For TEST70 dataset, AUC of global threading is 0.89 and local threading is 0.78.
Table III.
Analysis of Specificities and Sensitivities (see Supporting Information Fig. 2(a,b))
| Threading approach | TEST40 Z-score | TEST40 sensitivity | TEST70 Z-score | TEST70 sensitivity |
|---|---|---|---|---|
| Local | 7.6 | 34.16% | 7.4 | 36.48% |
| Global | 4.5 | 64.10% | 4.5 | 66.28% |
Shown are the Z-score cut-off values that yield on average less than 5% of false positives (95% specificity) and their corresponding sensitivities.
We further checked the ability of our method to pick up appropriate structural templates for predicting the structures of difficult queries with no known homologs in PDB. We benchmarked our global threading approach on 57 targets from CASP10. We queried these targets against 45,964 structures encoded in PB sequences from I-TASSER14 fold library using global threading. We obtained significant hits for 38 targets using a Z-score cutoff of 4.5. Homology models were constructed for these 38 targets by an automated script that uses MODELLER32 and were evaluated based on different criteria which were also used in CASP10. Out of these 38 models, 25 (66%) models were comparable to models generated by top performing methods such as I-TASSER,14 ROBETTA,18 and Phyre2,33 from CASP10 (Supporting Information Table SIII). Structural alignments of the recently released structures of these 25 CASP10 targets with their predicted structures using the templates from our method are shown in Supporting Information Figure S2. For example, the model of the target T0662 (PDB ID: 2LTE) was generated using the structure of an acyl-carrier protein (PDB ID: 3GZL) as a template which is also the top template in CASP10. Template and target sequences share identity of 25.6%. Structural alignment of the target T0662 and our model using DALI had an RMSD of 2 Å and Z-score of 8.8 where 74 out of 76 residues were aligned.
We further implemented the algorithm in the form of a web server. Both global and local threading are available to users. Users can query their targets against fold libraries derived from SCOP filtered at different sequence identity cutoff values. I-TASSER fold library14 and full PDB filtered at 100% sequence identity cutoff are also provided. When a query sequence that contains two domains is threaded against a domain fold library (like SCOP), the server will report potential domains as two separate hits (see Fig. 3 for an example). The server is freely available at http://www.bo-protscience.fr/forsa.
Figure 3.

Local threading of a protein sequence that contains two domains against SCOP dataset filtered at 70% sequence identity. This query is a 700 residues long glycosyl hydrolase that contains a catalytic domain and a carbohydrate binding module. Our method reports the catalytic domain (with Z-score = 13.9) and the concanavalin A-like lectin domain (with Z-score = 8.3) correctly as first two hits. Both Z-scores were above the threshold of 7.4 which was defined for 95% specificity.
Discussion
Our study exploits available protein structure resources and a structural alphabet for recognizing compatible folds for protein sequences. It shows that the protein blocks can be used to capture sequence to local structure relationship in the form of occurrence matrices. We show that these occurrence matrices can be applied to define a conditional probability based scoring function which addresses the question of protein sequence to global fold compatibility. As a consequence, these occurrence matrices and overall approach that we have developed form the basis of a PB-based fold recognition algorithm. It distinguishes itself from the approach developed by Suresh et al.34 which relies on knowledge-based PB prediction23 followed by PB alignment.
Surprisingly, this algorithm is able to pick up correct folds for the queries without using any information from residue–residue contacts, homology, or explicit incorporation of hydrophobic effect which are generally used in the fold recognition methods. This shows that the information in sequential amino acid residues and their corresponding local structures is crucial in determining protein folding. Importance of sequence–structure relationships in protein folding over residue contact potentials also has been shown by Crooks et al.35 The use of the simple scoring and threading scheme also makes our algorithm computationally competitive and it can scale-up well.
The efficiency of our approach and more generally of fold recognition approaches depends on the definition and scope of the fold libraries. Fold libraries based on protein domains are appropriate when the queries are single-domain proteins, and such libraries may provide useful structural and functional data for annotating such query sequences. Even so, such libraries can still pick up the correct folds among the top-ranking hits when proteins with two (or more) domains are queried (Fig. 3) using our local threading algorithm. We have used in this study a highly restricted domain-based fold library (FAMREP) which contains only one domain representative per SCOP family. Despite this limitation, our approach gives overall good performance on blind test datasets that share at most 40% and 70% sequence identity with the fold library (Table II). Length and sequence variations in SCOP families and super-families47 also create more challenges for fold recognition using only one domain per family. For these reasons, our method is sensitive to length thresholds and gap penalties. It is hence expected that the sensitivity of the method will increase when more comprehensive fold libraries are used, but this would be at the cost of computing time. Keeping this in mind, we used the more comprehensive I-TASSER fold library for querying CASP10 targets. Despite the unavailability of close homologs for these targets in the I-TASSER fold library, our method was able to pick up appropriate templates for comparative modeling (Supporting Information Table SIII).
The PB-based occurrence matrices can be easily integrated into scoring schemes of the other fold recognition methods. We believe that sequence–structure relationships established using protein blocks can provide added value to the other fold recognition methods and meta-predictors. FoRSA algorithm can be improved by combining the threading approach with sequence-search methods and by incorporating residue–residue contact potentials. We also intend to investigate the use of multiple occurrence matrices per PB, based on different structural protein classes for fold recognition as it has been shown to increase the accuracy of PB prediction.24 We further suggest that local threading approach in FoRSA algorithm can also be integrated into fragment-assembly based fold recognition methods.
Materials and Methods
The proposed method has three important aspects: fold library, scoring function, and threading approach. These aspects and the datasets used for testing the efficiency of the approach are described below.
Protein blocks
PB is a structural alphabet which approximates protein local structures. PBs were designed based on clustering of dihedral angles of sliding window of pentapeptides from non-redundant set of proteins into 16 unique clusters named a to p.24 Assignment of PBs to 3D protein structures is based on lowest angular root mean square deviation (rmsda) of φ and ψ values of sliding window of pentapeptides to the standard set of φ and ψ values for each of the 16 PBs. The abstraction of 3D structures to 1D strings of PBs has been used in several applications23,26 from ligand binding site detection36,37 to protein structural database mining for similar structures.27,28,30
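The assignment rule (lowest angular RMSD to a prototype) can be sketched as follows. This is an illustration only: the two prototypes are hypothetical placeholders and are NOT the published PB dihedral values, the vector length of 8 angles is an illustrative choice, and the function names are ours.

```python
import math


def angular_diff(a, b):
    """Smallest signed difference between two angles in degrees, in [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0


def rmsda(angles, prototype):
    """Angular root mean square deviation between two equal-length dihedral vectors."""
    if len(angles) != len(prototype):
        raise ValueError("dihedral vectors must have the same length")
    squared = sum(angular_diff(a, p) ** 2 for a, p in zip(angles, prototype))
    return math.sqrt(squared / len(angles))


def assign_block(angles, prototypes):
    """Return the name of the prototype with the lowest rmsda to the window."""
    return min(prototypes, key=lambda name: rmsda(angles, prototypes[name]))


# Hypothetical prototypes for illustration; real PBs a to p use published values.
TOY_PROTOTYPES = {
    "helix_like": [-60.0, -45.0] * 4,
    "strand_like": [-120.0, 130.0] * 4,
}

window = [-62.0, -43.0] * 4  # dihedral angles of one pentapeptide window
label = assign_block(window, TOY_PROTOTYPES)
```

The wrap-around in `angular_diff` matters because dihedral angles are periodic: 179° and -179° differ by 2°, not 358°.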
Fold library
A fold library, FAMREP, was constructed using the Astral datasets38 of 4142 representative domains from 4142 SCOP 1.75A39 families except membrane and trans-membrane protein class. These 4142 domain structures were encoded into PB sequences using in-house scripts. This small fold library of PB sequences was used for the primary testing of the algorithm. For further benchmarking of the method we used I-TASSER40 structural library encoded into PB sequences as fold library. I-TASSER fold library is composed of 45,964 structural entries filtered at 70% sequence identity cutoff. These 45,964 structural entries were converted into PB sequences and used as a fold library.
Test datasets
For primary testing of the algorithm, we used two datasets that we named TEST40 and TEST70. TEST40 and TEST70 are blind datasets which consist of 816 and 1837 SCOP domains with less than 40% and 70% sequence identity, respectively, to any domain in FAMREP fold library.
To further test the ability of the algorithm to pick up correct structural templates, we used 57 server-only targets from CASP10. These 57 targets were queried against I-TASSER40 fold library. Using the templates that were identified by our method, comparative models for these targets were built using the standalone version of MODELLER.32 Quality of models was determined using different methods such as structural alignment of model to the actual target structure using DALI,41 model structure validation using ProSA42 and MolProbity.43
Scoring function
This method relies on sequence–structure relationships which were captured in the form of 16 occurrence matrices, one for each PB. These matrices were used to calculate the conditional probability of a window of 15 residues to have a local structure corresponding to a particular PB. To generate these occurrence matrices we used a non-redundant Astral-40 dataset,38 which is a dataset of SCOP 1.75A domains39 filtered at a 40% sequence identity cutoff. All the domain structures from the Astral-40 dataset were converted to PB sequences using PBE-T.31 For the Astral-40 dataset, we counted the occurrences of each amino acid at each position of the 15-residue windows whose middle position (i.e., position 8) corresponds to a given PB. Sixteen such 20 × 15 (20 amino acids × 15 positions in a window) occurrence matrices were generated, one for each of the 16 PBs. These matrices were normalized using the following formula adapted from Johnson and Overington.44
[Equation (1); formula image not available]

where the count term is the number of occurrences of an amino acid y at position z in windows of 15 residues corresponding to a PB x, and the normalized term is the log-odds score for the occurrence of an amino acid y at position z in windows of 15 residues corresponding to a PB x.
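The counting step that produces one 20 × 15 matrix per PB can be sketched as below. This is a minimal illustration under stated assumptions: the function name `count_occurrences`, the dictionary layout, the skipping of non-standard residues, and the skipping of positions closer than 7 residues to either terminus are our choices, not details confirmed by the text. The log-odds normalization is not shown.

```python
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
WINDOW = 15
HALF = WINDOW // 2  # 7 residues on each side of the central position


def count_occurrences(pairs):
    """Count amino acids per window position, grouped by the central PB.

    pairs: iterable of (amino_acid_sequence, pb_sequence) of equal length.
    Returns {pb: matrix}, where matrix[y][z] is the count of amino acid
    AMINO_ACIDS[y] at window position z (0-based, centre at index 7).
    """
    row = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    counts = defaultdict(lambda: [[0] * WINDOW for _ in AMINO_ACIDS])
    for aa_seq, pb_seq in pairs:
        if len(aa_seq) != len(pb_seq):
            raise ValueError("sequence and PB string must have equal length")
        # Only centres with a full 15-residue window are used (assumption).
        for centre in range(HALF, len(aa_seq) - HALF):
            matrix = counts[pb_seq[centre]]
            for z in range(WINDOW):
                aa = aa_seq[centre - HALF + z]
                if aa in row:  # skip non-standard residues (assumption)
                    matrix[row[aa]][z] += 1
    return dict(counts)


# Hypothetical toy input: one 16-residue chain with an arbitrary PB string.
toy = [("ACDEFGHIKLMNPQRS", "abcdefghijklmnop")]
matrices = count_occurrences(toy)
```

In this toy case only two positions have a complete window, so exactly two PBs (`h` and `i`) receive counts.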
Conditional probability of an input protein sequence to fit to a particular fold, represented as a PB sequence, from the fold library is calculated using these normalized occurrence matrices. The formula used to calculate the score (S) of a sliding window of 15 amino acid residues given a sliding window of PB sequence from the fold library is as follows:
[Equation (2); formula image not available]

where AAy is an amino acid at position y in a sliding window from the input sequence, PBx is a PB at position x in the given sliding window of the PB sequence from the fold library, and the summed term is the log-odds score for the occurrence of an amino acid AAy at position z in the occurrence matrix for a PBx. The first part of this equation uses sliding windows of 5 amino acid residues (Fig. 4) and the second part uses the whole window of 15 amino acid residues (Fig. 5) to calculate the local compatibility of the amino acid sequence to the PB sequence.
Figure 4.
Threading of an amino acid sequence on a protein fold encoded in terms of protein blocks (PBs). The first component (S1) of the scoring function used in this work calculates the conditional probability of a 15 residues window to match the middle PB in the equivalent window in the PB sequence.
Figure 5.
Threading of an amino acid sequence on a protein fold encoded in terms of protein blocks (PBs). The second component (S2) of the scoring function used in this work calculates the conditional probabilities of all possible overlapping five-residue windows to match the middle PB in the corresponding windows in the PB sequence. Hence, 11 probability values are obtained and multiplied with each other to calculate the S2 score.
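The count of 11 follows from sliding a five-residue window over 15 residues (15 − 5 + 1 = 11). The sketch below enumerates those windows and combines per-window probabilities. The function names are hypothetical, and combining probabilities in log space is our assumption for numerical convenience, not a stated detail of the method.

```python
import math


def overlapping_windows(length=15, width=5):
    """Start and end indices (0-based, end exclusive) of all overlapping sub-windows."""
    return [(start, start + width) for start in range(length - width + 1)]


def combined_log_probability(probabilities):
    """Product of probabilities, computed as a sum of logarithms."""
    return sum(math.log(p) for p in probabilities)


windows = overlapping_windows()
# Index of the middle residue of each five-residue window.
centres = [start + 2 for start, _ in windows]

# Hypothetical per-window probabilities, one for each of the 11 windows.
toy_probabilities = [0.5] * len(windows)
log_s2 = combined_log_probability(toy_probabilities)
```

Working with sums of logarithms avoids numerical underflow when many small probabilities are multiplied.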
Threading approach
We have used dynamic programming for threading an input amino acid sequence over PB sequences from the fold library. Instead of the amino acid substitution matrix used in sequence alignment approaches,45,46 we use the scores derived from Eq. (2) to build a scoring matrix in the dynamic programming algorithm. We have implemented both global and local threading approaches for fold recognition. A range of gap penalties was used for threading of TEST40 dataset sequences over the FAMREP fold library to determine the optimum gap penalties (Table I). The gap penalties that gave the highest accuracies for global and local threading approaches were chosen as defaults for benchmarking of our method using the CASP10 dataset.
Each input sequence from the test datasets was threaded over a PB sequence from fold library if the percent difference in their lengths was not less than a pre-defined length cutoff. Length cutoff was fixed at 50% for global threading and 20% for the local threading approach.
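The dynamic programming step can be sketched with a generic pairwise scoring callback, as below. This is a sketch under assumptions, not the FoRSA implementation: a linear gap penalty is assumed, the function names are hypothetical, and `toy_score` (match +2, mismatch −1 between two strings) merely stands in for the Eq. (2) compatibility score between a residue window and a PB.

```python
def thread(score, n, m, gap, local=False):
    """Best alignment score of a query (length n) against a template (length m).

    score(i, j): compatibility of query position i with template position j.
    gap: linear gap penalty (assumption). local=True gives local threading.
    """
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    if not local:
        # Global threading: leading gaps are penalised.
        for i in range(1, n + 1):
            H[i][0] = -gap * i
        for j in range(1, m + 1):
            H[0][j] = -gap * j
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cell = max(
                H[i - 1][j - 1] + score(i - 1, j - 1),  # align i with j
                H[i - 1][j] - gap,                      # gap in template
                H[i][j - 1] - gap,                      # gap in query
            )
            if local:
                cell = max(cell, 0.0)  # local threading may restart anywhere
                best = max(best, cell)
            H[i][j] = cell
    return best if local else H[n][m]


def toy_score(query, template):
    """Hypothetical stand-in for the sequence-to-PB compatibility score."""
    return lambda i, j: 2.0 if query[i] == template[j] else -1.0


global_score = thread(toy_score("abcd", "abd"), 4, 3, gap=1.0)
local_score = thread(toy_score("xxabcdxx", "abcd"), 8, 4, gap=4.0, local=True)
```

Global threading forces the full query onto the full template, whereas local threading reports the best-scoring sub-region, which is why it can recover a single domain from a longer query.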
Significance of threading scores
Bootstrap strategy was used to assess the significance of threading scores and to provide normalized Z-scores that are independent of query length. Each of the 4142 amino acid sequences in the FAMREP dataset was randomly shuffled 100 times and threaded over its corresponding PB sequence. The threading scores of these randomly shuffled sequences were grouped according to their sequence lengths. Mean (μ) and standard deviation (σ) were calculated for each of these groups of threading scores. Two equations were obtained for the distribution of μ and σ against the sequence length. Z-Scores for the input sequences were obtained using the following formula:
$$Z = \frac{T - \mu_{\mathrm{rand}}}{\sigma_{\mathrm{rand}}} \tag{3}$$
where Z is the Z-score for the input sequence; T is the threading score for the input sequence; μrand is the mean of the threading scores for randomly shuffled sequences of the same length as input sequence and σrand is the standard deviation of the threading scores for randomly shuffled sequences of the same length as input sequence.
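The shuffling procedure and the Z-score can be sketched as follows. This is a simplified illustration: it estimates the mean and standard deviation directly from shuffles of one sequence, whereas the text describes fitting them against sequence length. The scoring callback, the toy template, and the function names are hypothetical.

```python
import random
import statistics


def shuffled_score_stats(sequence, score_fn, n_shuffles=100, seed=0):
    """Mean and standard deviation of scores for randomly shuffled sequences."""
    rng = random.Random(seed)  # fixed seed so that the sketch is reproducible
    residues = list(sequence)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        scores.append(score_fn("".join(residues)))
    return statistics.mean(scores), statistics.stdev(scores)


def z_score(t, mu_rand, sigma_rand):
    """Z-score of a threading score t relative to the shuffled background."""
    return (t - mu_rand) / sigma_rand


# Hypothetical stand-in for a threading score: identical positions to a template.
template = "AAAACCCC"


def toy_score(seq):
    return sum(a == b for a, b in zip(seq, template))


native = "AAAACCCC"
mu, sigma = shuffled_score_stats(native, toy_score)
z = z_score(toy_score(native), mu, sigma)
```

Shuffling preserves amino acid composition and length, so the Z-score measures how much of the threading score is due to residue order rather than composition.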
Threading scores of each input sequence from all the test datasets were sorted according to their Z-scores. Sensitivity and specificity were calculated for global and local threading approaches based on the ability to pick up the correct fold (as per SCOP definition) for the TEST40 and TEST70 datasets at different Z-score cutoffs. The equations for sensitivity and specificity are as follows:
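$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \tag{4}$$

$$\mathrm{Specificity} = \frac{TN}{TN + FP} \tag{5}$$

where TP, FN, TN, and FP are the numbers of true positives, false negatives, true negatives, and false positives, respectively.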
Z-score cutoffs that provided 95% specificity for both datasets using global and local threading approaches were identified as default Z-score thresholds to define significant hits.
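Selecting a cutoff for a target specificity can be sketched as below. The sketch uses the standard definitions of sensitivity and specificity and assumes that a hit counts as a predicted positive when its Z-score is at or above the cutoff. The function names and the toy hit list are hypothetical.

```python
def sensitivity_specificity(hits, cutoff):
    """hits: list of (z_score, is_correct_fold) pairs, one per top-1 hit."""
    tp = sum(1 for z, ok in hits if z >= cutoff and ok)
    fn = sum(1 for z, ok in hits if z < cutoff and ok)
    tn = sum(1 for z, ok in hits if z < cutoff and not ok)
    fp = sum(1 for z, ok in hits if z >= cutoff and not ok)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity


def lowest_cutoff_for_specificity(hits, target=0.95):
    """Smallest observed Z-score that reaches the target specificity, or None."""
    for cutoff in sorted({z for z, _ in hits}):
        if sensitivity_specificity(hits, cutoff)[1] >= target:
            return cutoff
    return None


# Hypothetical toy hits: (Z-score, whether the fold is correct).
toy_hits = [
    (9.0, True), (8.0, True), (6.0, False), (5.0, True),
    (3.0, False), (2.0, False), (1.0, False),
]
cutoff = lowest_cutoff_for_specificity(toy_hits)
sens_at_cutoff, spec_at_cutoff = sensitivity_specificity(toy_hits, cutoff)
```

Raising the cutoff trades sensitivity for specificity: in the toy data, the single incorrect hit at Z = 6.0 forces the cutoff above one correct hit at Z = 5.0.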
Acknowledgments
Technical assistance for setting up of the webserver was kindly provided by Peaccel, Inc.
Supporting Information
Additional Supporting Information may be found in the online version of this article.
References
- 1.Montelione GT. The protein structure initiative: achievements and visions for the future. F1000 Biol Rep. 2012;4:7. doi: 10.3410/B4-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Watson JD, Todd AE, Bray J, Laskowski RA, Edwards A, Joachimiak A, Orengo CA, Thornton JM. Target selection and determination of function in structural genomics. IUBMB Life. 2003;55:249–255. doi: 10.1080/1521654031000123385. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Bray JE. Target selection for structural genomics based on combining fold recognition and crystallisation prediction methods: application to the human proteome. J Struct Funct Genomics. 2012;13:37–46. doi: 10.1007/s10969-012-9130-x. [DOI] [PubMed] [Google Scholar]
- 4.Goldsmith-Fischman S, Honig B. Structural genomics: computational methods for structure analysis. Protein Sci. 2003;12:1813–1821. doi: 10.1110/ps.0242903. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Berman HM, Westbrook JD, Gabanyi MJ, Tao W, Shah R, Kouranov A, Schwede T, Arnold K, Kiefer F, Bordoli L, Kopp J, Podvinec M, Adams PD, Carter LG, Minor W, Nair R, La Baer J. The protein structure initiative structural genomics knowledgebase. Nucleic Acids Res. 2009;37:D365–D368. doi: 10.1093/nar/gkn790. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Montelione GT, Arrowsmith C, Girvin ME, Kennedy MA, Markley JL, Powers R, Prestegard JH, Szyperski T. Unique opportunities for NMR methods in structural genomics. J Struct Funct Genomics. 2009;10:101–106. doi: 10.1007/s10969-009-9064-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Drew K, Winters P, Butterfoss GL, Berstis V, Uplinger K, Armstrong J, Riffle M, Schweighofer E, Bovermann B, Goodlett DR, Davis TN, Shasha D, Malmström L, Bonneau R. The proteome folding project: proteome-scale prediction of structure and function. Genome Res. 2011;21:1981–1994. doi: 10.1101/gr.121475.111. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Furnham N, Sillitoe I, Holliday GL, Cuff AL, Laskowski RA, Orengo CA, Thornton JM. Exploring the evolution of novel enzyme functions within structurally defined protein superfamilies. PLoS Comput Biol. 2012;8:e1002403. doi: 10.1371/journal.pcbi.1002403. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The protein data bank. Nucleic Acids Res. 2000;28:235–242. doi: 10.1093/nar/28.1.235. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Eddy SR. Accelerated profile HMM searches. PLoS Comput Biol. 2011;7:e1002195. doi: 10.1371/journal.pcbi.1002195. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Schäffer AA, Wolf YI, Ponting CP, Koonin EV, Aravind L, Altschul SF. IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices. Bioinformatics. 1999;15:1000–1011. doi: 10.1093/bioinformatics/15.12.1000. [DOI] [PubMed] [Google Scholar]
- 13.Pieper U, Webb BM, Dong GQ, Schneidman-Duhovny D, Fan H, Kim SJ, Khuri N, Spill YG, Weinkam P, Hammel M, Tainer JA, Nilges M, Sali A. ModBase, a database of annotated comparative protein structure models and associated resources. Nucleic Acids Res. 2014;42:D336–D346. doi: 10.1093/nar/gkt1144.
- 14.Roy A, Kucukural A, Zhang Y. I-TASSER: a unified platform for automated protein structure and function prediction. Nat Protoc. 2010;5:725–738. doi: 10.1038/nprot.2010.5.
- 15.McGuffin LJ, Jones DT. Improvement of the GenTHREADER method for genomic fold recognition. Bioinformatics. 2003;19:874–881. doi: 10.1093/bioinformatics/btg097.
- 16.Söding J, Biegert A, Lupas AN. The HHpred interactive server for protein homology detection and structure prediction. Nucleic Acids Res. 2005;33:W244–W248. doi: 10.1093/nar/gki408.
- 17.Ma J, Peng J, Wang S, Xu J. A conditional neural fields model for protein threading. Bioinformatics. 2012;28:i59–i66. doi: 10.1093/bioinformatics/bts213.
- 18.Chivian D, Kim DE, Malmström L, Schonbrun J, Rohl CA, Baker D. Prediction of CASP6 structures using automated Robetta protocols. Proteins. 2005;61:157–166. doi: 10.1002/prot.20733.
- 19.Kryshtafovych A, Barbato A, Fidelis K, Monastyrskyy B, Schwede T, Tramontano A. Assessment of the assessment: evaluation of the model quality estimates in CASP10. Proteins. 2014;82:112–126. doi: 10.1002/prot.24347.
- 20.Moult J. A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction. Curr Opin Struct Biol. 2005;15:285–289. doi: 10.1016/j.sbi.2005.05.011.
- 21.Lundström J, Rychlewski L, Bujnicki J, Elofsson A. Pcons: a neural-network-based consensus predictor that improves fold recognition. Protein Sci. 2001;10:2354–2362. doi: 10.1110/ps.08501.
- 22.Ginalski K, Elofsson A, Fischer D, Rychlewski L. 3D-Jury: a simple approach to improve protein structure predictions. Bioinformatics. 2003;19:1015–1018. doi: 10.1093/bioinformatics/btg124.
- 23.Offmann B, Tyagi M, de Brevern AG. Local protein structures. Curr Bioinform. 2007;2:165–202.
- 24.de Brevern AG, Etchebest C, Hazout S. Bayesian probabilistic approach for predicting backbone structures in terms of protein blocks. Proteins. 2000;41:271–287. doi: 10.1002/1097-0134(20001115)41:3<271::aid-prot10>3.0.co;2-z.
- 25.Karchin R, Cline M, Mandel-Gutfreund Y, Karplus K. Hidden Markov models that use predicted local structure for fold recognition: alphabets of backbone geometry. Proteins. 2003;51:504–514. doi: 10.1002/prot.10369.
- 26.Joseph AP, Agarwal G, Mahajan S, Gelly J-C, Swapna LS, Offmann B, Cadet F, Bornot A, Tyagi M, Valadié H, Schneider B, Etchebest C, Srinivasan N, De Brevern AG. A short survey on protein blocks. Biophys Rev. 2010;2:137–147. doi: 10.1007/s12551-010-0036-1.
- 27.Tyagi M, de Brevern AG, Srinivasan N, Offmann B. Protein structure mining using a structural alphabet. Proteins. 2008;71:920–937. doi: 10.1002/prot.21776.
- 28.Joseph AP, Srinivasan N, de Brevern AG. Improvement of protein structure comparison using a structural alphabet. Biochimie. 2011;93:1434–1445. doi: 10.1016/j.biochi.2011.04.010.
- 29.Agarwal G, Mahajan S, Srinivasan N, de Brevern AG. Identification of local conformational similarity in structurally variable regions of homologous proteins using protein blocks. PLoS One. 2011;6:e17826. doi: 10.1371/journal.pone.0017826.
- 30.Tyagi M, Gowri VS, Srinivasan N, de Brevern AG, Offmann B. A substitution matrix for structural alphabet based on structural alignment of homologous proteins and its applications. Proteins. 2006;65:32–39. doi: 10.1002/prot.21087.
- 31.Tyagi M, Sharma P, Swamy CS, Cadet F, Srinivasan N, de Brevern AG, Offmann B. Protein Block Expert (PBE): a web-based protein structure analysis server using a structural alphabet. Nucleic Acids Res. 2006;34:W119–W123. doi: 10.1093/nar/gkl199.
- 32.Eswar N, Webb B, Marti-Renom MA, Madhusudhan MS, Eramian D, Shen M-Y, Pieper U, Sali A. Comparative protein structure modeling using Modeller. Curr Protoc Bioinformatics. 2006;Chapter 5:Unit 5.6. doi: 10.1002/0471250953.bi0506s15.
- 33.Kelley LA, Sternberg MJE. Protein structure prediction on the Web: a case study using the Phyre server. Nat Protoc. 2009;4:363–371. doi: 10.1038/nprot.2009.2.
- 34.Suresh V, Ganesan K, Parthasarathy S. A protein block based fold recognition method for the annotation of twilight zone sequences. Protein Pept Lett. 2013;20:249–254. doi: 10.2174/0929866511320030003.
- 35.Crooks GE, Wolfe J, Brenner SE. Measurements of protein sequence-structure correlations. Proteins. 2004;57:804–810. doi: 10.1002/prot.20262.
- 36.Dudev M, Lim C. Discovering structural motifs using a structural alphabet: application to magnesium-binding sites. BMC Bioinformatics. 2007;8:106. doi: 10.1186/1471-2105-8-106.
- 37.Wu CY, Chen YC, Lim C. A structural-alphabet-based strategy for finding structural motifs across protein families. Nucleic Acids Res. 2010;38:e150. doi: 10.1093/nar/gkq478.
- 38.Chandonia J-M, Hon G, Walker NS, Lo Conte L, Koehl P, Levitt M, Brenner SE. The ASTRAL Compendium in 2004. Nucleic Acids Res. 2004;32:D189–D192. doi: 10.1093/nar/gkh034.
- 39.Murzin AG, Brenner SE, Hubbard T, Chothia C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol. 1995;247:536–540. doi: 10.1006/jmbi.1995.0159.
- 40.Zhang Y. I-TASSER server for protein 3D structure prediction. BMC Bioinformatics. 2008;9:40. doi: 10.1186/1471-2105-9-40.
- 41.Holm L, Park J. DaliLite workbench for protein structure comparison. Bioinformatics. 2000;16:566–567. doi: 10.1093/bioinformatics/16.6.566.
- 42.Wiederstein M, Sippl MJ. ProSA-web: interactive web service for the recognition of errors in three-dimensional structures of proteins. Nucleic Acids Res. 2007;35:W407–W410. doi: 10.1093/nar/gkm290.
- 43.Chen VB, Arendall WB, Headd JJ, Keedy DA, Immormino RM, Kapral GJ, Murray LW, Richardson JS, Richardson DC. MolProbity: all-atom structure validation for macromolecular crystallography. Acta Crystallogr D. 2010;66:12–21. doi: 10.1107/S0907444909042073.
- 44.Johnson MS, Overington JP. A structural basis for sequence comparisons. An evaluation of scoring methodologies. J Mol Biol. 1993;233:716–738. doi: 10.1006/jmbi.1993.1548.
- 45.Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970;48:443–453. doi: 10.1016/0022-2836(70)90057-4.
- 46.Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol. 1981;147:195–197. doi: 10.1016/0022-2836(81)90087-5.
- 47.Sandhya S, Rani SS, Pankaj B, Govind MK, Offmann B, Srinivasan N, Sowdhamini R. Length variations amongst protein domain superfamilies and consequences on structure and function. PLoS One. 2009;4:e4981. doi: 10.1371/journal.pone.0004981.