Abstract
In this study, we address the problem of local quality assessment in homology models. As a prerequisite for the evaluation of methods for predicting local model quality, we first examine the problem of measuring local structural similarities between a model and the corresponding native structure. Several local geometric similarity measures are evaluated. Two methods based on structural superposition are found to best reproduce local model quality assessments by human experts. We then examine the performance of state-of-the-art statistical potentials in predicting local model quality on three qualitatively distinct data sets. The best statistical potential, DFIRE, is shown to perform on par with the best current structure-based method in the literature, ProQres. A combination of different statistical potentials and structural features using support vector machines is shown to provide somewhat improved performance over published methods.
Keywords: homology modeling, protein model, model evaluation, similarity measure, statistical potential, support vector machine, protein structure prediction
Over recent years, template-based modeling has become an important tool in modern molecular biology as well as in pharmaceutical applications (Jacobson and Sali 2004; Petrey and Honig 2005). While the overall structures of models are often correct and may provide important biological insights, the models are generally not accurate enough to be used with confidence in applications that depend on atomic detail. In addition, it is very difficult to know to what extent models are reliable, further limiting their usefulness in practice. The assessment of model quality is therefore a very important step in the practical application of template-based modeling. In addition, quality assessment is a critical step in the construction of models both in providing a means of selecting among different possible models and in defining regions of the model that need to be refined.
A large number of methods have been developed for selecting a native-like structure from a set of models. Scoring functions have been based on both molecular mechanics energy functions (Lazaridis and Karplus 1999; Petrey and Honig 2000; Feig and Brooks 2002; Felts et al. 2002; Lee and Duan 2004), statistical potentials (Sippl 1995; Melo and Feytmans 1998; Samudrala and Moult 1998; Rojnuckarin and Subramaniam 1999; Lu and Skolnick 2001; Wallqvist et al. 2002; Zhou and Zhou 2002), residue environments (Luthy et al. 1992; Eisenberg et al. 1997; Park et al. 1997; Summa et al. 2005), local side-chain and backbone interactions (Fang and Shortle 2005), orientation-dependent properties (Buchete et al. 2004a,b; Hamelryck 2005), packing estimates (Berglund et al. 2004), solvation energy (Petrey and Honig 2000; McConkey et al. 2003; Wallner and Elofsson 2003; Berglund et al. 2004), hydrogen bonding (Kortemme et al. 2003), and geometric properties (Colovos and Yeates 1993; Kleywegt 2000; Lovell et al. 2003; Mihalek et al. 2003). A number of methods combine different potentials into a global score, usually using a linear combination of terms (Kortemme et al. 2003; Tosatto 2005), or with the help of machine learning techniques, such as neural networks (Wallner and Elofsson 2003) and support vector machines (SVM) (Eramian et al. 2006). Comparisons of different global model quality assessment programs can be found in recent papers by Pettitt et al. (2005), Tosatto (2005), and Eramian et al. (2006).
Less work has been reported on the local quality assessment of models. Local scores are important in the context of modeling because they can give an estimate of the reliability of different regions of a predicted structure. This information can be used in turn to determine which regions should be refined, which should be considered for modeling by multiple templates, and which should be predicted ab initio. Information on local model quality could also be used to reduce the combinatorial problem when considering alternative alignments; for example, by scoring different local models separately, fewer models would have to be built (assuming that the interactions between the separate regions are negligible or can be estimated separately).
One of the most widely used local scoring methods is Verify3D (Luthy et al. 1992; Eisenberg et al. 1997), which combines secondary structure, solvent accessibility, and polarity of residue environments. ProsaII (Sippl 1993), which is based on a combination of a pairwise statistical potential and a solvation term, is also applied extensively in model evaluation. Other methods include the Errat program (Colovos and Yeates 1993), which considers distributions of nonbonded atoms according to atom type and distance, and the energy strain method (Maiorov and Abagyan 1998), which uses differences from average residue energies in different environments to indicate which parts of a protein structure might be problematic. Melo and Feytmans (1998) use an atomic pairwise potential and a surface-based solvation potential (both knowledge-based) to evaluate protein structures. Apart from the energy strain method, which is a semiempirical approach based on the ECEPP3 force field (Nemethy et al. 1992), all of the local methods listed above are based on statistical potentials. A conceptually distinct approach is the ProQres method, which was very recently introduced by Wallner and Elofsson (2006). ProQres is based on a neural network that combines structural features to distinguish correct from incorrect regions. ProQres was shown to outperform earlier methodologies based on statistical approaches (Verify3D, ProsaII, and Errat). The data presented in Wallner and Elofsson's study suggests that their machine-learning approach based on structural features is indeed superior to statistics-based methods. However, the knowledge-based methods examined in their work, Verify3D (Luthy et al. 1992; Eisenberg et al. 1997), Prosa (Sippl 1993), and Errat (Colovos and Yeates 1993), are not based on newer statistical potentials.
In this study, we apply a combination of state-of-the-art statistical potentials and some structural features with machine-learning methods to predict local errors in homology models. We rely heavily on the DFIRE potential (Zhou and Zhou 2002), which has performed with some success in applications to fold recognition (Zhou and Zhou 2005), global model quality assessment (Eramian et al. 2006), and structure refinement (Zhu et al. 2006).
Defining an appropriate measure of local structural similarity is clearly a prerequisite for an evaluation of different methods for the assessment of local model quality. Previous studies on local model evaluation have used manual inspection of models (Luthy et al. 1992; Maiorov and Abagyan 1998), distances between corresponding atoms in model and native structure (Colovos and Yeates 1993; Melo and Feytmans 1998), contact area distance (CAD) (Abagyan and Totrov 1997), and backbone angle differences (Melo and Feytmans 1998). The recent study by Wallner and Elofsson (2006) evaluated local model quality by comparing the sequence-based alignments used to construct the models with structural alignments. Wallner and Elofsson also consider averages of a scaled RMSD measure to evaluate the overall performance of predictors of local model quality. In addition, they use a variant of the S-score (see Materials and Methods) for the training of their neural networks (Wallner and Elofsson 2006).
To the best of our knowledge, different local similarity measures have not yet been compared in a systematic way. Since visual inspection is not practical for a large-scale study, we evaluate a number of local similarity measures by validating them against expert assessment of local model quality. We find that two local geometric similarity measures, the S- and LS-scores (see Materials and Methods), best approximate manual local model quality scores by experts. Using these scores, we evaluate the performance of individual statistical potentials and their combination with structural measures of model quality using support vector machines (SVM). The SVM we develop offers improved performance over the best methods reported in the literature, at least on the data sets studied here.
Results
Evaluation of local structural similarity measures
A number of our lab members who participated in CASP were asked to visually compare models and native structures taken from the CASP5 (Moult et al. 2003) and CASP6 (Moult et al. 2005) experiments. Each residue was assigned to one of three discrete categories: “good,” “fair,” and “incorrect” (see Materials and Methods). Ten structures were assessed by a single human assessor; for two structures there were two expert evaluations, and for one structure there were three manual assessments. Different geometric similarity measures were plotted alongside the manual scores (Supplemental Fig. S1) and visually inspected to explore systematic differences. As a numerical measure of how well the scores coincide, the mutual information between the manual and the automatic scores was calculated (Table 1). In summary, S-scores and LS-scores give the best agreement with the human scores.
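For concreteness, the mutual information between two discrete per-residue score series can be computed directly from their joint distribution. The sketch below is a minimal illustration; the discretization cutoffs used to turn a continuous S-score into three categories are our own assumptions, not the binning used for Table 1.

```python
import numpy as np

def mutual_information(x, y):
    """Mutual information (in bits) between two discrete label sequences."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    joint = {}
    for a, b in zip(x, y):
        joint[(a, b)] = joint.get((a, b), 0) + 1
    px = {a: np.mean(x == a) for a in set(x)}
    py = {b: np.mean(y == b) for b in set(y)}
    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n
        mi += p_ab * np.log2(p_ab / (px[a] * py[b]))
    return mi

# Toy example: manual categories vs. S-scores discretized into three bins
# (the 0.75/0.4 cutoffs are illustrative, not those used in this study).
manual = ["good", "good", "fair", "incorrect", "fair", "good"]
s_scores = [0.92, 0.85, 0.60, 0.20, 0.55, 0.88]
auto = ["good" if s > 0.75 else "fair" if s > 0.4 else "incorrect" for s in s_scores]
print(mutual_information(manual, auto))
```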
Table 1.
Mutual information (MI) between manual model quality scores and local geometric similarity scores
S-scores (see Materials and Methods) based on different global structural superposition programs were tested. The methods include TMscore (Zhang and Skolnick 2004), ska (Petrey et al. 2003), CE (Shindyalov and Bourne 1998), and lga (Zemla 2003). The resulting S-scores differ because the alignments produced by the different programs can differ from one another. Among these methods, the S-score based on the TMscore program showed the highest correlation with the human scores, although the results obtained from ska were comparable (Table 1). The TMscore program finds a global structural superposition that optimizes the overlap of similar regions of two protein structures, based on a known residue equivalence. This is achieved by finding the superposition that maximizes the sum of the $s_i$-values, as defined in Equation 1, over all residues $i$ in a structure. (The value of $d_0$ for the TMscore depends on protein size.) Thus, it is not surprising that the TMscore program shows a better correspondence to manual assessment than the other structural alignment methods.
Of the various methods listed in Table 1, the LS-score has the highest mutual information value with manual scores, although the S-scores derived from global alignments with TMscore and ska also correlate well with the manual assessments. Among the contact-based scores, the Holm-Sander score best corresponds to the expert scores. The regions defined as incorrect by this score are very similar to those identified by the LS-score. However, an inspection of plots of the Holm-Sander score along with manual local scores reveals that it is less sensitive in distinguishing “fair” and “good” regions of a model. The power distance is extremely sensitive to large differences in the position of neighboring residues. For cases where even a small number of residues in the local neighborhood of a residue are displaced by a large distance, the score drops to low values. This is not useful in the present context, since intuitively the quality of a local environment should not depend on how far away misplaced residues are. The CAD score is very sensitive to local side chain positioning. This makes it hard to use the CAD as a stand-alone measure, since it is difficult to distinguish cases where there are serious errors in the backbone from cases where the error is restricted to the side chain positions.
The LS-score and contact-based scores are generally consistent with the S-scores. However, closer inspection of regions of models where the two types of functions disagree reveals systematic differences: The LS-score requires not only that a residue is in the correct position with respect to its neighborhood, but also that the neighborhood of the residue is similar in both structures. The advantage of the S-score is that global superposition uses a larger set of reference atoms for the comparison. The S-score becomes less useful as structural differences grow larger, since finding an optimal global superposition then becomes a difficult and sometimes ill-defined task. For example, if there is a hinge-based structural change, a comparison based on global structural superposition will label one part of the model as correct, and the part that is shifted (usually the smaller part) as incorrect, even if there are local similarities in the latter substructure. The S-score is closely related to commonly used global quality measures, such as the TMscore or MaxSub (Siew et al. 2000); regions with a high S-score are the regions that contribute most to the global measures.
The LS-score and contact-based measures have the advantage of not relying on a global superposition, which allows them to identify local similarities that might be missed by the S-score. However, a measure based on local superposition will be more sensitive to the number of atoms in the neighborhood. If a part of the structure is relatively isolated—e.g., an exposed loop—such a measure will essentially describe the differences in the local shape along the backbone but will be insensitive to the relative orientation of the loop with respect to the rest of the protein, which might or might not be a desirable property depending on the context. The two approaches therefore offer complementary definitions of local errors: The S-score measures how well a region fits within the context of the overall structure, whereas the LS and contact-based scores measure purely local similarities.
Analysis of errors in data sets
The three data sets used here to evaluate model quality, SCOP, Moulder, and CASP5&6, have different distributions of overall model quality (Fig. 1). The models in the Moulder data set are typically of lower quality, the SCOP data set contains more models in the very high-quality range (TMscores 0.7–1.0), whereas the models in the CASP5&6 data set are more uniformly spread over the whole allowed range from 0.5 to 1.0. The three data sets also differ in terms of local model quality (Supplemental Fig. S2): Models in the Moulder data set on average have more residues with low S- and LS-scores. In addition, the distribution of local errors in the models varies with secondary structure type, with the largest number of problematic regions in models concentrated in coil regions, as expected (Supplemental Fig. S2).
Figure 1.
Distribution of TMscores for different data sets. (Blue) SCOP, (green) CASP5&6, (red) Moulder. Only models with TMscores >0.5 were considered. For the histograms, bins of size 0.1 were used.
Evaluation of local scoring functions
As an illustrative example, Figure 2 shows plots of different local scoring functions for a model for CASP target T0188 (PDB code 1o13, classified in the MTH1175-like SCOP family). Figure 2A displays the model along with the native structure, with areas of major differences highlighted in different color codes. The panels in Figure 2B display the values of the different scoring functions for the native and model structure along with the S-score. The score for the native structure is displayed to illustrate the propensity of different local scoring functions to label correct regions in a model as incorrect. Since all residues in the native structure are correct by definition, a perfect local scoring function should label all residues of the native structure as correct.
Figure 2.
Illustration of local scoring for CASP5 target T0188. (A) Structural superposition of model and native structure. Regions where model and native superimpose to within 2.5 Å are colored in green (model) and dark blue (native), respectively. Regions where the differences are >2.5 Å are colored in yellow (model) and light blue (native), respectively. (B) Per-residue scores for different local scoring functions for a model and the native structure. Each panel shows the score for the model (red) and native (green) structures, and the S-score (blue) (1: perfect, 0: very different). In some plots, the S-score is rescaled to (−1,1) to match the range of the local scoring function. From left to right and top to bottom, the subpanels represent data for DFIRE, the torsion potential, the SVM method, ProQres, Prosa, and Verify3D. Note: For the energy-based functions, DFIRE, the torsion potential, and Prosa, higher scores indicate lower model quality (i.e., the correlation with the local similarity function is negative).
Most of the local scoring functions correctly predict a problem for the model in the regions around residues 7–16, 46–55, 71–74, and 91–96, which correspond to the incorrectly modeled loop regions highlighted in Figure 2A. However, the height of the peaks in the values of the scoring functions does not correlate very well with the degree of structural deviation. Most of the scores also incorrectly assign a poor score to the native structure in the region from residues 46–55. This is most likely due to the fact that residues 46–55 form an exposed loop region. Exposed loops inherently have fewer close-range interactions with the protein body and thus tend to have higher energy values for the potentials considered in this study than do core residues. For the example in Figure 2, the SVM method and ProQres show the best overall visual correlation to the S-score. It is also noteworthy that the SVM score most consistently yields very good local scores for the native structure, which is not the case for most other methods. Similar trends can be observed in plots for other targets, such as T0142 and T0234, which are shown in Supplemental Figures S5 and S6.
In order to examine how well the observations in the examples above generalize to large sets of data, we performed correlation and ROC analyses on the three model data sets (see Materials and Methods). Table 2 reports linear correlation coefficients between different scoring functions and the S-score. The correlations are calculated over the complete set of all residues in all structures. Note that this is different from and more stringent than calculating averages of individual correlation coefficients for each model (i.e., between the red [local scoring function] and blue [local similarity score] curves on plots shown in Fig. 2), since it requires consistency across different models. As can be seen in the table, the SVM yields correlation coefficients that are significantly better than those of the other methods, with the results for the SCOP data set being significantly better than for the Moulder and CASP data sets. Note that the proportion of good SCOP and CASP models is about the same, but that there are many more bad CASP models. The advantage in performance of the SVM over the other methods was found to be statistically significant at a 95% confidence level (see Materials and Methods) on all data sets, with the exception of ProQres on the CASP5&6 data set, for which the performance advantage of the SVM was statistically significant only at a 90% level. Among the remaining methods, the correlation coefficients for ProQres and DFIRE were higher than those of the other four methods at the 95% confidence level.
Table 2.
Linear correlation coefficients between S-score and local evaluation scores
Linear correlation coefficients give a good indication of the degree to which different scoring functions reproduce the local similarity measures. However, while corresponding well to assessments by human experts, the functional forms of the S- and LS-scores necessarily reflect a set of assumptions and preferences, as discussed in the first part of the Results section. Depending on the application, other local similarity measures might be preferable. In general, we would expect a monotonic relationship between alternative local similarity measures as well as local scoring functions, but there is no reason to expect a linear relationship. This is especially true for functions representing energies, such as DFIRE. Since nonlinear functional relationships are not captured well by linear correlation coefficients, we also compare the performance of different scoring functions for local model assessment using a ROC analysis, which is a more general approach.
In practice, we are often mainly interested in how well a scoring function performs at predicting whether a region in a model is “correct” or “incorrect,” which can be evaluated using a ROC analysis. Since the ROC analysis is based on thresholds, it does not depend on the exact nature of the functional forms. The ROC analyses in this study were carried out on the various local scoring functions studied here, using both the S-score (based on the TM global alignment) and the LS-score to evaluate local geometric similarity. Residues in models were classified as correct or incorrect according to whether the S- or LS-score is above or below a threshold. In order to smooth local fluctuations, the S- and LS-scores were averaged over a window of three residues.
Table 3 and Figure 3 report a ROC analysis in which an S-score threshold of 0.66 was chosen (e.g., see Fig. 2); inverting Equation 1 shows that this corresponds to a distance of 2.5 Å. When the S-score comparing the model and the native structure for a particular residue is above this threshold, the residue is considered correctly predicted. Thresholds of 1.5 and 3.5 Å were tested as well, but the performance characteristics reported here were fairly insensitive to the value used.
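A minimal sketch of this kind of analysis on synthetic per-residue data, using scikit-learn; the toy data and the sign convention (higher energy taken as evidence for an incorrect residue) are our assumptions:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(0)

# Toy per-residue data: S-scores in [0, 1] and an energy-like score where
# higher values indicate worse quality (as for DFIRE or the torsion potential).
s_score = rng.uniform(0.0, 1.0, 5000)
energy = -s_score + rng.normal(0.0, 0.3, 5000)   # anticorrelated, with noise

# An S-score of 0.66 corresponds to a 2.5 A displacement (d0 = 3.5 A).
incorrect = (s_score < 0.66).astype(int)          # positives = incorrect residues

# The energy is used directly as the decision score: higher energy -> "incorrect".
fpr, tpr, thresholds = roc_curve(incorrect, energy)
print("area under ROC curve:", auc(fpr, tpr))
```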
Table 3.
Area under ROC curves for S- and LS-scores
Figure 3.
ROC curves based on S-score for different local scoring functions for different data sets. The threshold used to classify residues into correct and incorrect categories corresponds to 2.5 Å. (A) SCOP data set, (B) Moulder data set, (C) CASP5&6 data set. In all panels, the scoring functions are represented as follows: DFIRE (dark blue with “x” markers), SVM (green with “x” markers), contact potential (purple with “o” markers), torsion potential (cyan with “o” markers), Verify3D (red with diamond markers), Prosa (orange with triangular markers), and ProQres (black with “o” markers).
ROC curves are reported in Figure 3, and the corresponding areas under the curves in Table 3 for all three data sets. As is evident from the table and figure, the SVM consistently yields the best performance, while overall, DFIRE and ProQres also perform quite well and yield comparable results. The other scoring functions do not do as well. As was the case for the correlation coefficients, the increased performance of the SVM over other methods in terms of the areas under ROC curves was found to be statistically significant at a 95% confidence level. Similarly, DFIRE and ProQres outperformed the remaining four methods at a 95% level. The only exception was the prediction of the LS-score on the Moulder data set, for which the confidence level for the difference in performance was only 90%. For all scoring functions, the results are somewhat better for the SCOP models than they are for the other two data sets, suggesting that error detection is easier in high-quality than in low-quality models.
Since the S-score was used in the training of the SVM, it is important to have an alternative independent reference function. Such an independent verification of the performance of the SVM method is provided by the LS-score, which was not used in SVM training. All correlation and ROC analyses of the different methods were carried out based on both the S- and LS-scores as local similarity functions. However, the relative performance of the top local scoring methods is not significantly different between the analyses based on the S- or LS-scores. In fact, the SVM approach performs slightly better relative to the other local scoring functions in the LS-score-based analysis than in the analysis based on S-scores (Table 3; Supplemental materials). For the remaining tables and figures, we only include results for the S-score. The corresponding data for the LS-score can be found in the Supplemental materials.
Error rates
While Figure 3 reveals overall performance characteristics, it does not reveal the extent to which a region in a particular model can be viewed as reliable given the values of a local scoring function in that region. Table 4 relates the true-positive and false-positive rates for the top three methods, SVM, ProQres, and DFIRE, to specific threshold values for each method. The values in the table can be used as a guideline to select an appropriate cutoff for the scoring function when interpreting plots such as the ones shown in Figure 2B to find incorrectly modeled regions. The appropriate tradeoff between true-positive and false-positive rates will depend on the application.
Table 4.
False-positive (FP) and true-positive (TP) rates for prediction of incorrect residues on CASP5&6 data set at 2.5 Å threshold, grouped by model quality
To determine if performances vary with model quality, we separated the models in the CASP5&6 data set according to model quality, using a TMscore cutoff value of 0.75. A separate ROC analysis on the two subsets, low-quality (TMscore 0.5–0.75) and high-quality (TMscore 0.75–1.0), shows that the relative performances of DFIRE and SVM compared with ProQres are better for high-quality models, whereas the relative performance of ProQres is better on the lower quality subset (Table 4).
False-positive and false-negative rates for a given threshold also vary systematically with secondary structure type (Table 5). False-positive rates for a given threshold value used to predict incorrect residues are much lower for residues in helical or strand regions than for residues in coil regions; this effect is especially pronounced for strand regions. It is most likely due to the uneven distribution of errors among the secondary structure types, with coil regions having a higher likelihood of being incorrect than helix or strand regions. For machine-learning methods, an uneven distribution of errors among different types of secondary structure in the training set can create problems when learning a target function.
Table 5.
False-positive (FP) and true-positive (TP) rates for prediction of incorrect residues at 2.5 Å threshold, grouped by secondary structure
Discussion
In this paper we have tested a number of measures of local structural similarity through comparisons with manual assessment and have used these measures to evaluate scoring functions for local model quality. We find that the S- and LS-score measures are the most effective in determining regions of models that deviate from the native structure. The S-score was used in a recent study by Wallner and Elofsson (2006), and we confirm its utility in this work, as well. In addition to the application described in this work, the S- and LS-scores can also be used in the analysis of a set of templates with the goal of evaluating which regions are constant in structure and which regions are variable. This in turn can provide useful information in determining which regions of a particular model based on these templates deserve extra attention.
The evaluation of different scoring functions based on comparison to the structural similarity measures used here is compromised in part by the fact that geometric similarity is based on comparisons of Cα atoms alone, while model quality is evaluated with all-atom functions, such as the one used in DFIRE. Thus, for example, an unfavorable van der Waals contact between side chains detected by DFIRE may not be reflected in a structural similarity measure based on Cα atoms. This limits the extent of the correlation we can obtain between geometric similarity and scoring function. In principle, the geometric similarity measures can be generalized to include side chain atoms, which may be of value in identifying problematic regions in high-quality homology models.
The ROC analyses presented above, as well as the data on correlation coefficients, show that the SVM method is consistently the best or among the best methods for all data sets. Among the individual statistical potentials, DFIRE consistently outperforms the others examined, especially for higher quality models. The most likely explanation is that DFIRE is an all-atom function, which allows it to capture local errors such as steric clashes. In contrast, contact and torsion potentials evaluate structure at a much coarser level. We would expect the more detailed representation used by DFIRE to be less of an advantage for lower quality models, which is consistent with our results. ProQres also performs quite well; despite the fact that it uses a neural network, however, its performance is essentially equivalent to that of DFIRE. ProQres does not include an all-atom pairwise potential, which may in part limit its performance. In agreement with the results of Wallner and Elofsson (2006), we find that Verify3D and Prosa, which were seminal developments in the field, do not perform as well as other more recently developed methods.
The SVM method outperforms the other methods on the SCOP data set, especially in terms of the linear correlation coefficients. This advantage is to be expected, as the data set used for SVM training and optimization was constructed with the same procedure used to set up the SCOP testing data set. In contrast to the significant difference on the SCOP data set, the performances of the two machine-learning methods, the SVM method and ProQres, are comparable on the CASP5&6 data set and only marginally better than that of the DFIRE potential. We speculate that the two machine-learning methods share a problem that stems from the data sets used for training. The training data sets in both studies were built from structural alignments of SCOP domains, which can be qualitatively different from models based on sequence alignment, as is the case, for example, for CASP models. This might explain the significantly reduced performance obtained on the CASP5&6 data set.
Among the different secondary structure types, the SVM and, to some extent, DFIRE show some advantage over ProQres in loop regions. For strand regions in particular, the SVM detects far fewer errors at a given threshold than for other types of secondary structure. This is probably an effect of the low incidence of errors in strand regions in the training set and will be examined in more detail in future work. Since loop regions tend to be more variable than helical or strand regions, and since loop regions are typically difficult to model, a large fraction of the incorrect regions in all data sets falls into these regions. Performing well on loop regions is therefore an important factor in overall performance. Since the SVM approach performs better than DFIRE at identifying incorrect residues in loop regions, it might be valuable to use the method as a scoring function for loop prediction.
It is important to emphasize that even the most successful methods reported here have significant limitations. False-positive rates are fairly high if thresholds of a scoring function are chosen such that a large fraction of incorrect residues is to be detected. It is therefore important to select a threshold appropriate for the application at hand. For the example of local structural refinement, we could in principle measure the probability that a refinement method improves incorrect regions as well as the probability that it disturbs correct regions. These probabilities could then be used together with the true- and false-positive rates to determine an optimal threshold for the local scoring function when deciding which regions to refine. Another approach to identifying such regions can be based on sequence information, as was done in the study of Wallner and Elofsson (2006). In their work, the sequence-based assessment method, ProQprof, was further combined with the structure-based assessment method, ProQres. The resulting method, ProQlocal, was more effective than either individual method, although the performance difference was not large. Nevertheless, combining information about local alignment quality with information about local model quality may provide a direction that is particularly useful for homology modeling.
Materials and Methods
Local structural similarity measures
We evaluate two main categories of local structural similarity functions. The first category is based on superposition, the second on comparing distances between atoms.
Superposition-based measures
Superposition-based measures calculate positional differences of corresponding atoms after optimal structural superposition of two sets of atoms. If an RMSD measure is used, there is no upper limit to the contribution of an individual atom to the overall score, and the RMSD is therefore often dominated by outliers. For this reason, the functional form
\[
s_i = \frac{1}{1 + (d_i/d_0)^2} \tag{1}
\]
is widely used in global structure similarity measures such as TMscore (Zhang and Skolnick 2004) or MaxSub (Siew et al. 2000). Here, $d_i$ is the distance between corresponding atoms $i$ in the model and in the native structure, and $d_0$ is a constant (set in this work to 3.5 Å).
The S-score for a residue is a normalized sum (average) over a set of atoms of the $s_i$-values defined in Equation 1:

\[
S_i = \frac{1}{|\mathrm{SET}_i|} \sum_{j \in \mathrm{SET}_i} s_j \tag{2}
\]

Here, the distances $d_j$ are measured using a global structural superposition. $\mathrm{SET}_i$ includes atoms of the center residue and its sequence neighbors within a window; for this study, we used Cα atoms within a three-residue window. A variation of the S-score was used as a target function by Wallner and Elofsson (2006) for the training of the neural network in ProQres, although not for evaluation purposes.
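In code, Equations 1 and 2 reduce to a windowed average of the distance-based $s_i$ terms. A minimal sketch, assuming the Cα–Cα distances after global superposition are already available:

```python
import numpy as np

D0 = 3.5  # constant in Equation 1, in Angstroms

def s_values(distances, d0=D0):
    """Per-atom s_i = 1 / (1 + (d_i/d0)^2) from Equation 1."""
    d = np.asarray(distances, dtype=float)
    return 1.0 / (1.0 + (d / d0) ** 2)

def s_score(distances, window=1, d0=D0):
    """Per-residue S-score: average of s_i over the residue and its sequence
    neighbors within +/- `window` (window=1 gives the three-residue window
    used in this study); the window is truncated at the chain ends."""
    s = s_values(distances, d0)
    n = len(s)
    out = np.empty(n)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        out[i] = s[lo:hi].mean()
    return out

# Example: Ca-Ca distances (in Angstroms) after global superposition.
print(s_score([0.5, 0.8, 2.5, 6.0, 1.0]))
```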
The LS-score is a new measure developed for this study. Its value for a residue $i$ is based on local superposition. Local neighborhoods, defined as corresponding sets of atoms within a certain distance of the center residue in either the model or the native structure, are superposed. If the sets of neighboring atoms differ between the two structures, the union of the two sets is used. The algorithm used for the superposition is similar to the one used in the MaxSub global score (Siew et al. 2000): In a first step, fragments centered at residue $i$ are optimally superimposed. After the superposition, all residues in the neighborhood of residue $i$ for which the distance between the model and native structure is below the threshold value $d_0$ are collected. A new optimal superposition is calculated using this updated set of residues. This process is iterated until the set of superimposed residues converges. The LS-score is a product of two terms:

\[
LS_i = s_{\mathrm{local},i} \cdot \frac{1}{|\mathrm{NB}_i|} \sum_{j \in \mathrm{NB}_i} s_{\mathrm{local},j} \tag{3}
\]

where $\mathrm{NB}_i$ denotes the local neighborhood of residue $i$.
The first term in Equation 3, $s_{\mathrm{local},i}$, measures the similarity of the position of the center residue with respect to its local neighborhood; it is defined as in Equation 1, using the local structural superposition to calculate the distances $d_i$. Comparing positions with respect to a local neighborhood makes sense only if the neighborhoods are similar. For example, imagine a case in which the model and native structure have very different conformations for a local region, such as a long loop. After superposition, a residue in the model and native structures may by accident occupy a closely corresponding position in space, resulting in a high $s_{\mathrm{local},i}$ value. However, this does not mean the model is correct for the residue in question. This is the rationale for including the average of the $s_{\mathrm{local},j}$ values as the second term of Equation 3: It measures the degree of overlap of the local neighborhoods and reduces the LS-score if there is a poor local match in the region.
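The iterative local superposition can be sketched with the Kabsch algorithm; the seeding, neighborhood construction, and convergence details below are schematic stand-ins rather than the exact implementation used in this work:

```python
import numpy as np

def kabsch(P, Q):
    """Optimal rotation R and translation t mapping point set Q onto P."""
    p0, q0 = P.mean(axis=0), Q.mean(axis=0)
    H = (Q - q0).T @ (P - p0)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, p0 - R @ q0

def local_s_values(nat, mod, seed, d0=3.5, max_iter=25):
    """MaxSub-style iteration: superpose the model neighborhood on the native
    one using a seed fragment, then repeatedly re-superpose on the residues
    currently matching within d0, until the matched set converges.
    Returns the s_local values of Equation 1."""
    idx = np.asarray(seed)
    for _ in range(max_iter):
        R, t = kabsch(nat[idx], mod[idx])
        moved = mod @ R.T + t
        new_idx = np.flatnonzero(np.linalg.norm(moved - nat, axis=1) < d0)
        if len(new_idx) < 3 or np.array_equal(new_idx, idx):
            break
        idx = new_idx
    d = np.linalg.norm(moved - nat, axis=1)
    return 1.0 / (1.0 + (d / d0) ** 2)

# LS-score for the center residue: s_local,i times the neighborhood average.
# nat/mod hold toy Ca coordinates of the local neighborhood of residue i.
nat = np.random.default_rng(1).normal(size=(12, 3))
mod = nat + np.random.default_rng(2).normal(scale=0.5, size=(12, 3))
s = local_s_values(nat, mod, seed=range(4, 9))
i = 6  # index of the center residue within the neighborhood arrays
print(s[i] * s.mean())
```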
A related score based on local superposition was described recently by Lema and Echave (2005). This score is based on calculating the average RMSD of residues in spherical regions around a center residue. However, since, unlike the $s_i$-values given by Equation 1, the average RMSD is dominated by the residues with the largest distances, this score is less appropriate in the context of local model evaluation.
Contact-based measures
These measures are based on differences in distances between atoms. Here, we evaluate three different measures: the power distance, in which contributions of residues weaken polynomially with distance (Wallin et al. 2003); the Holm-Sander score, which uses an exponential term to reduce the influence of distant residues (Wallin et al. 2003); and the contact area distance (CAD) score, introduced by Abagyan and Totrov (1997), which measures differences in contact surfaces between residues to detect local differences between two structures.
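For illustration only, the sketch below shows a generic contact-based score that compares the distances from a residue to all others in the two structures, with exponential down-weighting of long-range pairs in the spirit of the Holm-Sander measure; the exact functional forms and parameters of the published scores (Wallin et al. 2003; Abagyan and Totrov 1997) differ:

```python
import numpy as np

def contact_score(nat, mod, i, d0=20.0):
    """Schematic Holm-Sander-style score for residue i: compare the distances
    from residue i to all other residues in native vs. model, down-weighting
    pairs that are far apart in the native structure. Illustrative only; see
    Wallin et al. (2003) for the published parameterization."""
    dn = np.linalg.norm(nat - nat[i], axis=1)     # native distances from i
    dm = np.linalg.norm(mod - mod[i], axis=1)     # model distances from i
    mask = np.arange(len(nat)) != i               # exclude residue i itself
    w = np.exp(-(dn[mask] / d0) ** 2)             # exponential damping
    diff = np.abs(dn[mask] - dm[mask]) / (dn[mask] + dm[mask] + 1e-9)
    return 1.0 - np.sum(w * diff) / np.sum(w)     # 1 = identical environments
```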
Manual assessment of structure
In order to evaluate the usefulness of the different local geometric similarity measures discussed above, we manually scored a number of models from CASP5 (Moult et al. 2003) and CASP6 (Moult et al. 2005). This was done by human experts visually comparing model structures to the correct native structure and then assigning scores for each residue. Residues in the models were assigned to one of three categories: “good,” “fair,” and “incorrect.” In order to assign a rough numerical scale to these classifications, “good” was defined to correspond to an accuracy of ∼1.5 Å or better, “fair” better than ∼3.5 Å, and “incorrect” anything worse than 3.5 Å.
Statistical potentials
We used three potentials: DFIRE (Zhou and Zhou 2002), a solvation potential based on the number of atoms within 10 Å of a center Cα atom (Melo and Feytmans 1998), and a simple backbone torsion potential. The torsion potential was derived by dividing the Ramachandran plot into bins of 10 by 10 degrees for each of the 20 residue types (no distinction was made between different secondary structure types) (Zhou and Zhou 2004).
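A potential of this kind is typically obtained by Boltzmann inversion of the binned (ϕ,ψ) distribution. The sketch below is illustrative; the pseudo-count handling and the uniform reference state are our assumptions, not necessarily those of the potential used here:

```python
import numpy as np

def torsion_potential(phi, psi, res_types, n_bins=36, pseudo=1.0):
    """Derive E(type, phi_bin, psi_bin) = -ln(p_obs / p_ref) from a database
    of (phi, psi, residue type) observations; 10-degree bins, 20 residue types."""
    counts = np.full((20, n_bins, n_bins), pseudo)
    # Map angles in [-180, 180) to bin indices 0..35.
    bp = ((np.asarray(phi) + 180.0) / 10.0).astype(int) % n_bins
    bs = ((np.asarray(psi) + 180.0) / 10.0).astype(int) % n_bins
    for t, i, j in zip(res_types, bp, bs):
        counts[t, i, j] += 1
    p_obs = counts / counts.sum(axis=(1, 2), keepdims=True)
    p_ref = 1.0 / (n_bins * n_bins)   # uniform reference state (assumption)
    return -np.log(p_obs / p_ref)
```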
Statistical potentials were converted to a per-residue score using a normalization procedure similar to that described by Maiorov and Abagyan (1998): All energy terms $E_{i,j}$ for a given residue $i$ were summed. Means $\mu_{T(i)}$ and standard deviations $\sigma_{T(i)}$ for each of the 20 amino acid types $T(i)$ were calculated over a database of reference structures as described in Zhu et al. (2006). These values were used to calculate a per-residue energy z-score using the standard formula:

\[
E_{z,i} = \left( \sum_{j \in \mathrm{RES}_i} E_{i,j} - \mu_{T(i)} \right) / \sigma_{T(i)}
\]
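The normalization itself is a straightforward per-type standardization; a minimal sketch, with the array layout as our assumption:

```python
import numpy as np

def residue_zscores(pair_energies, res_types, mu, sigma):
    """E_z,i = (sum_j E_ij - mu_T(i)) / sigma_T(i).

    pair_energies: (N x N) array of energy terms between residues,
    res_types:     length-N integer residue types (0..19),
    mu, sigma:     length-20 arrays of reference means/std devs per type.
    """
    e_res = pair_energies.sum(axis=1)   # sum of all energy terms for residue i
    t = np.asarray(res_types)
    return (e_res - np.asarray(mu)[t]) / np.asarray(sigma)[t]
```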
In order to reduce fluctuations in the local potential profiles, values of the potentials were averaged by using triangular smoothing:
\[
\bar{E}_i = \frac{\sum_{k=-w}^{w} a_k E_{i+k}}{\sum_{k=-w}^{w} a_k} \tag{4}
\]
The numbers $(a_{-w}, \ldots, a_w)$ are weight factors between 0 and 1 that vary in a triangular pattern in a window of size $2w + 1$, with the maximum value $a_0 = 1$ at the center. The value of an individual $a_k$ is given by $a_k = 1 - |k|/(w + 1)$, with $k \in [-w, w]$. We evaluated the performance of the different potentials for different values of $w$. For all potentials tested, performance either peaked or started to flatten out around a total window size of nine residues ($w = 4$). Accordingly, this window size was chosen for smoothing for all potentials. (Note: The triangular smoothing was not applied to scoring functions that have their own internal smoothing mechanism [SVM, ProQres, Verify3D].)
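The smoothing is a normalized weighted moving average; a minimal sketch of Equation 4, where the truncation and renormalization of the window at the chain ends are our assumptions:

```python
import numpy as np

def triangular_smooth(values, w=4):
    """Triangular smoothing over a window of size 2w + 1 (Equation 4);
    near the chain ends the window is truncated and the weights renormalized."""
    v = np.asarray(values, dtype=float)
    n = len(v)
    # Weights a_k = 1 - |k|/(w+1) for offsets k = -w..w, peaking at the center.
    a = np.array([1.0 - abs(k) / (w + 1.0) for k in range(-w, w + 1)])
    out = np.empty(n)
    for i in range(n):
        lo, hi = max(0, i - w), min(n, i + w + 1)
        weights = a[lo - i + w : hi - i + w]
        out[i] = np.dot(weights, v[lo:hi]) / weights.sum()
    return out

print(triangular_smooth([1.0, 2.0, 8.0, 2.0, 1.0, 1.0, 1.0, 5.0, 1.0]))
```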
Support vector machines
Different statistical potentials were combined using the libsvm support vector machine package (Fan et al. 2005). The SVMs were trained in regression mode to predict the S-score, using a radial basis function (RBF) kernel.
The input features used for the SVM consist of the statistical potentials described above (DFIRE, contact, and torsion potentials) as well as information about the local environment of a residue. The residue environment was described by the following attributes: secondary structure (calculated using DSSP [Kabsch and Sander 1983]), predicted secondary structure (psipred [Jones 1999]), solvent-accessible surface area and fraction of polar surface (determined using the environment program from Verify3D), and contact number (for a 10 Å radius around the Cα atom). Additionally, the average overall DFIRE energy, as an estimator of overall model quality, was included as an input feature. For all input features except the overall DFIRE energy and residue type, the SVM used the values for all residues falling within a nine-residue sequence window around the center residue, resulting in a total of 129 features per residue.
SVM parameters were optimized with respect to the linear correlation with the S-score as defined above. During the optimization of the SVMs, different window sizes for the input features were evaluated. For each window size, the SVM kernel parameters were optimized using a grid search. The optimal window size was chosen by using fivefold cross-validation on the training set. For the cross-validation, the training set was separated into five parts with no overlap at the SCOP family level. The performance of the SVMs increased sharply up to a window size of about seven residues. The correlation to the S-score started to plateau around nine residues (see Supplemental Fig. S4). Based on these data, we selected a nine-residue window. Larger window sizes showed no significant improvement in the cross-validation analysis, and machine-learning methods typically show better generalization properties on other data sets if the number of input parameters is kept as small as possible.
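For orientation, the training protocol described above can be sketched with scikit-learn's SVR, which wraps libsvm. The feature matrix, grid values, and family labels below are toy stand-ins, not the actual training data or the exact grid used in this work:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 129))       # 129 windowed input features per residue
y = rng.uniform(0.0, 1.0, 1000)        # regression target: per-residue S-score
families = rng.integers(0, 50, 1000)   # SCOP family label of each residue's model

# RBF-kernel support vector regression; the (C, gamma) grid search uses
# fivefold cross-validation with folds split at the SCOP family level,
# so that no family appears in both the training and validation parts.
grid = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid={"C": [1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]},
    cv=GroupKFold(n_splits=5),
)
grid.fit(X, y, groups=families)
print(grid.best_params_, grid.best_score_)
```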
Data sets for evaluation of scoring functions
SCOP
Models in this database are built from target–template pairs from the same SCOP family. At most five target–template pairs were selected at random from a given SCOP family. All target structures and templates have a resolution better than 3 Å and an R-factor below 0.3. These threshold values were chosen as a compromise between structure quality and the number of models in the database. Alignments from a structural superposition with the ska program (Petrey et al. 2003) were used to build models using two programs: nest (Petrey et al. 2003) and Modeller (Sali and Blundell 1993). Unaligned termini longer than five residues were removed from the final models. The SCOP data set was divided into two parts, a training set used for SVM training and a test set. Models were distributed between the two sets such that there was no overlap at the SCOP family level. In order to guarantee high-quality models, only models with a TMscore (Zhang and Skolnick 2004) >0.5 were considered. Also, to ensure protein-like structures, only models with >70 residues were included in the data set. Errors in models in the SCOP data set are mostly due to differences between target and template structure, since the alignments used are structural alignments. The test data set contained 1736 models (310,107 residues), whereas the training set contained 2565 models (460,296 residues).
Moulder
The Moulder data set was used in a recent study of global model evaluation methods (Eramian et al. 2006). The data set is publicly available from the Web site of the Sali Lab at http://salilab.org/decoys/. It contains models for 20 different target–template pairs based on 300 different alignments created using the genetic algorithm protocol MOULDER (John and Sali 2003). In the original data set, 0%–100% of the residues are aligned correctly with respect to a CE structure-based alignment (cf. http://salilab.org/john_decoys.html). For the purpose of this study, only models with a TMscore >0.5 were retained, since for models of lower quality, the distinction between locally correct and incorrect regions starts to become ill-defined. Local errors in this data set are due mostly to alignment errors, since for each target–template pair only one alignment is the optimal alignment corresponding to the structural alignment. The Moulder data set contains 2413 structures (416,033 residues).
CASP5&6
This data set consists of models from the CASP5 (Moult et al. 2003) and CASP6 (Moult et al. 2005) structure prediction experiments. For each target, the top 50 models in the CASP GDT_TS (Zemla 2003) ranking were considered. Of these, we retained complete models with TMscores >0.5 and >70 residues, leaving 2318 structures for 73 targets with a total of 443,712 residues.
The CASP5&6 data set is the most interesting in terms of practical application: It consists of the best models constructed by top research groups in the homology modeling field, often with manual intervention of experts to improve the models. The CASP5&6 data set also differs from the SCOP and Moulder data sets in that it is much more eclectic in terms of the methods used to obtain the models. Thus, models in the CASP5&6 data set should cover a much broader range of types of local errors.
Assessment of statistical significance
In order to assess the statistical significance of the difference in performance of two scoring functions, we used a parametric Student's t-test at a 95% confidence level (Casella and Berger 1990; Marti-Renom et al. 2002). Means and standard deviations were estimated by partitioning the data sets into subsets. Since models for the same target are not independent, the models were partitioned so that there was no overlap at the target level between the partitions. In addition, we also performed a bootstrap analysis to obtain a 95% bootstrap confidence interval for the mean of the difference in performance of two methods (Efron and Tibshirani 1986). The difference between two methods was considered significant if zero was not part of the bootstrap confidence interval. As in the approach using partitioning of the data, the bootstrap resampling was done at the target level.
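The target-level bootstrap can be implemented directly; a minimal sketch, where the per-target performance differences between two methods (here, differences in a correlation coefficient) are assumed to be precomputed:

```python
import numpy as np

def bootstrap_ci(per_target_diff, n_boot=10000, alpha=0.05, seed=0):
    """95% bootstrap confidence interval for the mean performance difference
    between two methods; resampling is done at the target level so that models
    of the same target are never split between resamples."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(per_target_diff, dtype=float)
    n = len(diffs)
    means = np.array([
        diffs[rng.integers(0, n, n)].mean() for _ in range(n_boot)
    ])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return lo, hi   # the difference is significant if the CI excludes zero

# Example: per-target differences in correlation coefficient (method A - B).
print(bootstrap_ci([0.05, 0.02, -0.01, 0.08, 0.03, 0.04, 0.00, 0.06]))
```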
Electronic supplemental material
The Supplemental material includes: examples of plots of different local geometric similarity functions alongside human expert scores; correlation coefficients between different local scoring functions and the LS-score; ROC curves based on the LS-score; a plot of SVM performance as a function of window size for input features; and two additional examples of local scoring functions on CASP targets. (Data included in MS Word document LocalQualityAssessment_ESM.doc.)
Acknowledgments
We thank Christina Leslie for helpful discussions regarding support vector machines, Lucy Forrest and Mickey Kosloff for help with the manual scoring of the models, Lucy Forrest for careful reading of the manuscript, and Björn Wallner for providing us with a stand-alone version of the ProQres program. This work was supported in part by NIH grant GM-30518.
Footnotes
Supplemental material: see www.proteinscience.org
Reprint requests to: Barry Honig, Howard Hughes Medical Institute at Columbia University, Department of Biochemistry and Molecular Biophysics, Center for Computational Biology and Bioinformatics, Columbia University, 1130 St. Nicholas Avenue, Room 815, New York, NY 10032, USA; e-mail: bh6@columbia.edu; fax: (212) 851-4650.
Article published online ahead of print. Article and publication date are at http://www.proteinscience.org/cgi/doi/10.1110/ps.072856307.
References
- Abagyan R.A. and Totrov, M.M. 1997. Contact area difference (CAD): A robust measure to evaluate accuracy of protein models. J. Mol. Biol. 268: 678–685.
- Berglund A., Head, R.D., Welsh, E.A., and Marshall, G.R. 2004. ProVal: A protein-scoring function for the selection of native and near-native folds. Proteins 54: 289–302.
- Buchete N.V., Straub, J.E., and Thirumalai, D. 2004a. Orientation-dependent coarse-grained potentials derived by statistical analysis of molecular structural databases. Polymer 45: 597–608.
- Buchete N.V., Straub, J.E., and Thirumalai, D. 2004b. Orientational potentials extracted from protein structures improve native fold recognition. Protein Sci. 13: 862–874.
- Casella G. and Berger, R.L. 1990. Statistical inference. Duxbury Press, Belmont, California.
- Colovos C. and Yeates, T.O. 1993. Verification of protein structures—Patterns of nonbonded atomic interactions. Protein Sci. 2: 1511–1519.
- Efron B. and Tibshirani, R. 1986. Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Stat. Sci. 1: 54–75.
- Eisenberg D., Luthy, R., and Bowie, J.U. 1997. VERIFY3D: Assessment of protein models with three-dimensional profiles. Methods Enzymol. 277: 396–404.
- Eramian D., Shen, M.Y., Devos, D., Melo, F., Sali, A., and Marti-Renom, M.A. 2006. A composite score for predicting errors in protein structure models. Protein Sci. 15: 1653–1666.
- Fan R.E., Chen, P.H., and Lin, C.J. 2005. Working set selection using the second order information for training SVM. J. Mach. Learn. Res. 6: 1889–1918.
- Fang Q.J. and Shortle, D. 2005. A consistent set of statistical potentials for quantifying local side-chain and backbone interactions. Proteins 60: 90–96.
- Feig M. and Brooks, C.L. 2002. Evaluating CASP4 predictions with physical energy functions. Proteins 49: 232–245.
- Felts A.K., Gallicchio, E., Wallqvist, A., and Levy, R.M. 2002. Distinguishing native conformations of proteins from decoys with an effective free energy estimator based on the OPLS all-atom force field and the surface generalized born solvent model. Proteins 48: 404–422.
- Hamelryck T. 2005. An amino acid has two sides: A new 2D measure provides a different view of solvent exposure. Proteins 59: 38–48.
- Jacobson M. and Sali, A. 2004. Comparative protein structure modeling and its applications to drug discovery. In Annual reports in medicinal chemistry, Vol. 39, pp. 259–276. Elsevier Academic Press, San Diego, CA.
- John B. and Sali, A. 2003. Comparative protein structure modeling by iterative alignment, model building, and model assessment. Nucleic Acids Res. 31: 3982–3992.
- Jones D.T. 1999. Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292: 195–202.
- Kabsch W. and Sander, C. 1983. Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22: 2577–2637.
- Kleywegt G.J. 2000. Validation of protein crystal structures. Acta Crystallogr. Sect. D Biol. Crystallogr. 56: 249–265.
- Kortemme T., Morozov, A.V., and Baker, D. 2003. An orientation-dependent hydrogen bonding potential improves prediction of specificity and structure for proteins and protein–protein complexes. J. Mol. Biol. 326: 1239–1259.
- Lazaridis T. and Karplus, M. 1999. Discrimination of the native from misfolded protein models with an energy function including implicit solvation. J. Mol. Biol. 288: 477–487.
- Lee M.C. and Duan, Y. 2004. Distinguish protein decoys by using a scoring function based on a new AMBER force field, short molecular dynamics simulations, and the generalized born solvent model. Proteins 55: 620–634.
- Lema M.A. and Echave, J. 2005. Assessing local structural perturbations in proteins. BMC Bioinformatics doi: 10.1186/1471-2105-6-226.
- Lovell S.C., Davis, I.W., Adrendall, W.B., de Bakker, P.I.W., Word, J.M., Prisant, M.G., Richardson, J.S., and Richardson, D.C. 2003. Structure validation by Cα geometry: ϕ,ψ, and Cβ deviation. Proteins 50: 437–450.
- Lu H. and Skolnick, J. 2001. A distance-dependent atomic knowledge-based potential for improved protein structure selection. Proteins 44: 223–232.
- Luthy R., Bowie, J.U., and Eisenberg, D. 1992. Assessment of protein models with 3-dimensional profiles. Nature 356: 83–85.
- Maiorov V. and Abagyan, R. 1998. Energy strain in three-dimensional protein structures. Folding Des. 3: 259–269.
- Marti-Renom M.A., Madhusudhan, M.S., Fiser, A., Rost, B., and Sali, A. 2002. Reliability of assessment of protein structure prediction methods. Structure 10: 435–440.
- McConkey B.J., Sobolev, V., and Edelman, M. 2003. Discrimination of native protein structures using atom–atom contact scoring. Proc. Natl. Acad. Sci. 100: 3215–3220.
- Melo F. and Feytmans, E. 1998. Assessing protein structures with a non-local atomic interaction energy. J. Mol. Biol. 277: 1141–1152.
- Mihalek I., Res, I., Yao, H., and Lichtarge, O. 2003. Combining inference from evolution and geometric probability in protein structure evaluation. J. Mol. Biol. 331: 263–279.
- Moult J., Fidelis, K., Zemla, A., and Hubbard, T. 2003. Critical assessment of methods of protein structure prediction (CASP)—Round V. Proteins 53: 334–339.
- Moult J., Fidelis, K., Rost, B., Hubbard, T., and Tramontano, A. 2005. Critical assessment of methods of protein structure prediction (CASP)—Round 6. Proteins 61: 3–7.
- Nemethy G., Gibson, K.D., Palmer, K.A., Yoon, C.N., Paterlini, G., Zagari, A., Rumsey, S., and Scheraga, H.A. 1992. Energy parameters in polypeptides. 10. Improved geometrical parameters and nonbonded interactions for use in the ECEPP/3 algorithm, with application to proline-containing peptides. J. Phys. Chem. 96: 6472–6484.
- Park B.H., Huang, E.S., and Levitt, M. 1997. Factors affecting the ability of energy functions to discriminate correct from incorrect folds. J. Mol. Biol. 266: 831–846.
- Petrey D. and Honig, B. 2000. Free energy determinants of tertiary structure and the evaluation of protein models. Protein Sci. 9: 2181–2191.
- Petrey D. and Honig, B. 2005. Protein structure prediction: Inroads to biology. Mol. Cell 20: 811–819.
- Petrey D., Xiang, Z.X., Tang, C.L., Xie, L., Gimpelev, M., Mitros, T., Soto, C.S., Goldsmith-Fischman, S., Kernytsky, A., Schlessinger, A., et al. 2003. Using multiple structure alignments, fast model building, and energetic analysis in fold recognition and homology modeling. Proteins 53: 430–435.
- Pettitt C.S., McGuffin, L.J., and Jones, D.T. 2005. Improving sequence-based fold recognition by using 3D model quality assessment. Bioinformatics 21: 3509–3515.
- Rojnuckarin A. and Subramaniam, S. 1999. Knowledge-based interaction potentials for proteins. Proteins 36: 54–67.
- Sali A. and Blundell, T.L. 1993. Comparative modelling by satisfaction of spatial constraints. J. Mol. Biol. 234: 779–815.
- Samudrala R. and Moult, J. 1998. An all-atom distance-dependent conditional probability discriminatory function for protein structure prediction. J. Mol. Biol. 275: 895–916.
- Shindyalov I.N. and Bourne, P.E. 1998. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 11: 739–747.
- Siew N., Elofsson, A., Rychlewski, L., and Fischer, D. 2000. MaxSub: An automated measure for the assessment of protein structure prediction quality. Bioinformatics 16: 776–785.
- Sippl M.J. 1993. Recognition of errors in 3-dimensional structures of proteins. Proteins 17: 355–362.
- Sippl M.J. 1995. Knowledge-based potentials for proteins. Curr. Opin. Struct. Biol. 5: 229–235.
- Summa C.M., Levitt, M., and DeGrado, W.F. 2005. An atomic environment potential for use in protein structure prediction. J. Mol. Biol. 352: 986–1001.
- Tosatto S.C.E. 2005. The Victor/FRST function for model quality estimation. J. Comput. Biol. 12: 1316–1327.
- Wallin S., Farwer, J., and Bastolla, U. 2003. Testing similarity measures with continuous and discrete protein models. Proteins 50: 144–157.
- Wallner B. and Elofsson, A. 2003. Can correct protein models be identified? Protein Sci. 12: 1073–1086.
- Wallner B. and Elofsson, A. 2006. Identification of correct regions in protein models using structural, alignment and consensus information. Protein Sci. 15: 900–913.
- Wallqvist A., Gallicchio, E., Felts, A.K., and Levy, R.M. 2002. Detecting native protein folds among large decoy sets with the OPLS all-atom potential and the surface generalized born solvent model. In Computational methods for protein folding, pp. 459–486. John Wiley & Sons, New York.
- Zemla A. 2003. LGA: A method for finding 3D similarities in protein structures. Nucleic Acids Res. 31: 3370–3374.
- Zhang Y. and Skolnick, J. 2004. Scoring function for automated assessment of protein structure template quality. Proteins 57: 702–710.
- Zhou H.Y. and Zhou, Y.Q. 2002. Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction. Protein Sci. 11: 2714–2726.
- Zhou H.Y. and Zhou, Y.Q. 2004. Single-body residue-level knowledge-based energy score combined with sequence-profile and secondary structure information for fold recognition. Proteins 55: 1005–1013.
- Zhou H.Y. and Zhou, Y.Q. 2005. Fold recognition by combining sequence profiles derived from evolution and from depth-dependent structural alignment of fragments. Proteins 58: 321–328.
- Zhu J., Xie, L., and Honig, B. 2006. Structural refinement of protein segments containing secondary structure elements: Local sampling, knowledge-based potentials, and clustering. Proteins 65: 463–479.