Improving a Consensus Approach for Protein Structure Selection by Removing Redundancy

Qingguo Wang; Yi Shang; Dong Xu

doi:10.1109/TCBB.2011.75

Author manuscript; available in PMC: 2026 Apr 22.

Published in final edited form as: IEEE/ACM Trans Comput Biol Bioinform. 2011 Nov-Dec;8(6):1708–1715. doi: 10.1109/TCBB.2011.75


PMCID: PMC13098764 NIHMSID: NIHMS2165873 PMID: 21519117

Abstract

In protein tertiary structure prediction, a crucial step is to select near-native structures from a large number of predicted structural models. Over the years, extensive research has been conducted for the protein structure selection problem with most approaches focusing on developing more accurate energy or scoring functions. Despite significant advances in this area, the discerning power of current approaches is still unsatisfactory. In this paper, we propose a novel consensus-based algorithm for the selection of predicted protein structures. Given a set of predicted models, our method first removes redundant structures to derive a subset of reference models. Then, a structure is ranked based on its average pairwise similarity to the reference models. Using the CASP8 data set containing a large collection of predicted models for 122 targets, we compared our method with the best CASP8 quality assessment (QA) servers, which are all consensus based, and showed that our QA scores correlate better with the GDT-TSs than those of the CASP8 QA servers. We also compared our method with the state-of-the-art scoring functions and showed its improved performance for near-native model selection. The GDT-TSs of the top models picked by our method are on average more than 8 percent better than the ones selected by the best performing scoring function.

Index Terms: Protein tertiary structure, protein structure selection, quality assessment, consensus approach, metapredictor, critical assessment of protein structure prediction

1. Introduction

Large-scale genome projects have produced massive numbers of protein sequences. To fully understand the functions of these proteins, knowledge of their tertiary structures is indispensable. The availability of high-quality protein structures would not only enable us to understand the biological roles of proteins, but also enhance our capacity for structure-based drug design [2]. Traditionally, protein structures are determined by X-ray crystallography or NMR methods, which are time consuming and cannot keep up with protein sequencing efforts. To complement current experimental efforts to determine protein tertiary structures, the computational approach of inferring tertiary structures directly from 1D amino acid sequences has been an important research area for two decades and has provided essential tools for many protein-related research problems.

The computational approach for protein structure prediction generally consists of two phases: generating good candidate structures and selecting the best candidates. Many structural models are typically generated in the first phase. In ab initio protein structure prediction, candidate structures are constructed either by assembling different protein tertiary structure pieces aligned to the query sequence or by sampling and searching in different regions of the conformational space [30]. The result is a huge number of candidate structures, generated so that good ones close to the native structure are likely to be included.

In the second phase, a variety of approaches have been proposed to distinguish high-quality candidates from poor ones. The most popular approach is to use scoring functions [10], [24], [47], [32], which calculate a score as a quality estimate for a model. Although many scoring functions have been developed over the years, their ability to identify near-native structures still needs considerable improvement. Another approach is clustering based [33], [42], [36]. The clustering approach groups similar structures in the generated pool together and picks the most representative one, or centroid, from each group. Clustering-based methods are more robust than scoring functions, though it is not a trivial task to extract a good representative from a cluster. Finally, the third selection approach uses consensus information [30], [15], [35], [38]. The average similarity of each candidate structure to the other structures is computed and the one with the highest average similarity is selected. Unlike scoring functions, neither the consensus approach nor the clustering approach is applicable to the evaluation of a single protein model. However, when a large set of diverse protein structures is provided, it has been shown that the consensus approach performs significantly better than all other approaches for near-native structure selection [12], [14]. Although the consensus approach has been successful, we found that there was significant room for improvement.

In this paper, we investigate the redundancy that exists among predicted protein structures and derive a novel consensus-based algorithm that ranks each structure based on its average pairwise similarity to a set of reference models, which are extracted from the predicted structures by discarding redundant models. The evaluation of our algorithm on a test set of 122 proteins, the targets in CASP8 (the 8th Critical Assessment of protein Structure Prediction [13]), shows that the removal of structural redundancy is effective in increasing the correlation of the quality assessment (QA) score of a predicted model with the actual structural similarity between the model and the native structure. When applied to selecting near-native models from candidate structures, our algorithm picked candidates significantly better than those picked by the state-of-the-art scoring functions.

The rest of the paper is organized as follows: In the next section, we review the related work. In Section 3, we present the basics of a consensus-based selection approach and formally define the problem of protein structure selection. In Section 4, we present our new consensus-based algorithm and implementation details. In Section 5, we show experimental results. Finally, we summarize the paper in Section 6.

2. Related Work

A major approach for evaluating protein structures uses energy or scoring functions, which fall into two main categories: physics-based energy functions [10] and knowledge-based statistical potentials [18], [34], [29], [28], [24], [47], [32]. Physics-based energy functions are derived from physical laws by directly evaluating the true energy of a protein tertiary structure. Knowledge-based statistical potentials are generally based on the inverse Boltzmann equation. These potentials assume that protein structures are in thermodynamic equilibrium and exploit data from known protein structures to derive energy functions that reflect the distribution of observed structures. In comparison with physics-based energy functions, knowledge-based potentials are fast, simple to construct, and have been widely used for structure quality assessment. However, though a variety of scoring functions have been proposed over the years, including some recent ones that are very complicated and incorporate information from dozens of sources, none of them can consistently and reliably identify near-native structures, owing to the errors of protein folding energy [21] and the noisy statistical nature of the scoring functions.

Another approach for protein structure selection is clustering based [33], [6], [20], [17], [36]. The idea of using clustering to discriminate protein structures rests on the observation that near-native structures have more structural neighbors than poor structures [33]. Some clustering approaches, e.g., SCAR [6], applied a k-means clustering algorithm and used cluster centroids as representative models for clusters. A centroid is the average of all the structures in a cluster and is constructed by minimizing the distance constraint between each pair of residues. Other clustering approaches, e.g., SPICKER [42] and its variant [36], found clusters by looking for one or more structures with the largest number of neighbors within a certain clustering radius. A cluster is formed around the structure with the most neighbors, and the cluster center is constructed. SPICKER also used cluster centroids as the candidates for near-native conformations. Though a centroid is more robust than an existing structure in a cluster, it often contains significant atomic clashes.

The third approach for the selection of protein structures utilizes consensus information [30], [15], [35], [38]. Like clustering methods, the consensus-based approach exploits structure-dependent features for selection. But instead of building cluster centroids, the consensus approach employs the pairwise similarity between structures to rank predicted models. The idea behind the consensus approach is that a correctly folded structure is more likely to be similar to the other predicted structures for the same protein target than an incorrectly folded structure, and that the one with the highest average similarity score to the other structures is likely to be the closest to the native structure. The consensus approach has been very successful in the recent Critical Assessment of protein Structure Prediction (CASP) competition [12], [25], [11], [22], [5], [1]. In the model quality assessment category in CASP8, the accuracy of structure evaluation achieved by the consensus-based approach was consistently better than that of other methods [14].

Finally, it is worth mentioning that existing consensus methods usually combine multiple selection techniques. For example, the CASP8 QA server SAM-T08-MQAC [1] blended scoring functions with several consensus terms, including a median global distance test total score (GDT-TS) and a median TM-score, while the servers QMEANclust [5] and MULTICOM [11] used the QMEAN scoring function [4] and a consensus score [11], respectively, to select reference models against which the predicted protein structures were evaluated. All three servers used a small portion of the predicted structures as reference models. Interestingly, the performance of these three servers is slightly inferior to that of the CASP8 servers Pcons [22] and ModFOLDclust [25], [26], [27], which simply took all or most predicted structures as references. The new consensus algorithm presented in this paper does not rely on any additional selection techniques. It obtains improved discerning power over existing methods by employing a novel redundancy removal approach for the extraction of reference models.

3. Problem Formulation

In this section, we introduce the main idea behind the consensus-based approach and define the problem of protein structure selection.

3.1. Protein Structure Similarity

A major step of the consensus approach is to calculate pairwise similarity among protein structures. A variety of metrics have been proposed to measure the similarity between two 3D structures of a protein. The most widely used ones include the pairwise root-mean-square deviation (RMSD), GDT-TS [46], and Q score [3], [27]. As GDT-TS is the most widely used for assessing protein structure prediction and is adopted as the main measurement in CASP evaluations, in this work we use the pairwise GDT-TS between two structures to indicate how similar they are to each other. The pairwise GDT-TS represents the percentage of aligned regions between two protein structures. Specifically, it is defined as follows:

$$\text{GDT-TS}(s_{i}, s_{j}) = (P_{1} + P_{2} + P_{4} + P_{8}) / 4, \qquad (1)$$

where $s_{i}$ and $s_{j}$ are two 3D structures of a protein and $P_{d}$ is the percentage of residues from $s_{i}$ that can be superimposed with corresponding residues from $s_{j}$ under the selected distance cutoff $d$ (in Å), $d \in \{1, 2, 4, 8\}$ [46].

With each $P_{d}$ expressed as a fraction, Equation (1) implies that the GDT-TS takes values in the range of 0 to 1. The larger the GDT-TS between two structures, the more similar the two structures are considered to be. In this paper, the GDT-TS is computed using the TM-score software [43].
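As a rough illustration of (1), the following Python sketch computes the score from a list of per-residue distances. It assumes the two structures have already been superimposed once and that `distances` (a hypothetical input name) holds the distance in Å between each pair of corresponding residues. Actual GDT-TS programs search over superpositions to maximize each $P_{d}$, which this sketch does not attempt, and the use of `<=` at the cutoff is also an assumption.

```python
from typing import Sequence


def gdt_ts_from_distances(
    distances: Sequence[float],
    cutoffs: Sequence[float] = (1.0, 2.0, 4.0, 8.0),
) -> float:
    """Average, over the cutoffs, of the fraction of residues within each cutoff.

    Assumption: one fixed superposition produced `distances`; real tools
    optimize the superposition separately for every cutoff.
    """
    if not distances:
        raise ValueError("need at least one residue distance")
    n = len(distances)
    # P_d as a fraction in [0, 1] for each cutoff d
    fractions = [sum(1 for x in distances if x <= d) / n for d in cutoffs]
    return sum(fractions) / len(cutoffs)


# Toy distances (not real data): one residue within 1 A, two within 2 A, and so on.
example = [0.5, 1.5, 3.0, 6.0, 10.0]
score = gdt_ts_from_distances(example)
```

For the toy input, the four fractions are 0.2, 0.4, 0.6, and 0.8, so the score is their mean, 0.5.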

3.2. A Basic Consensus-Based Algorithm

The analysis of server predictions of CASP targets suggested that in a set of predicted structures of a protein, which are generated by multiple independent servers or for which the different regions of the conformational space of the protein are randomly and evenly sampled, a correctly folded structure is more likely to be similar to the other predicted structures for the same protein target than an incorrectly folded structure [41], [12]. Therefore, the similarity score between a predicted structure and other candidate structures can be used as an indicator of quality of this structure. This is the main idea behind the consensus-based approach.

As an illustrative example, Fig. 1 plots, for the CASP8 target T0388, the average pairwise similarity (measured by the pairwise GDT-TS) of each predicted structure to all other structures against its GDT-TS to the native structure. In Fig. 1, each point corresponds to one structure, and there are a total of 299 predicted structures generated independently by around 70 servers. The figure indicates that the average pairwise similarities of structures correlate strongly with their GDT-TSs to native and that the similarity among structures can be used to rank the predicted structures of proteins.

Based on this observation, we derive a basic consensus-based algorithm, RefAll, which calculates the consensus of predicted models by comparing a predicted structure to all other structures of the same protein. In the pseudocode below, let $S$ be a set of predicted structures of a protein and $X$ be a set of quality assessment scores for $S$ . The structure $s_{i}, s_{i} \in S$ , which has the highest QA score $x_{i}, x_{i} \in X$ , will be selected as the best candidate in $S$ .

Algorithm RefAll(S)

1. Compute the pairwise GDT-TS score between each pair of structures $s_{i}$ and $s_{j}$ in $S$. Then calculate the average pairwise GDT-TS score $x_{i}$ for each structure $s_{i} \in S$:
$x_{i} = \frac{1}{|S|} \sum_{s_{j} \in S} \text{GDT-TS}(s_{i}, s_{j})$
2. Output the quality assessment (QA) scores $X$, where
$X = \{x_{i}, 1 \leq i \leq n\}.$
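The RefAll computation can be sketched in Python as follows. The sketch assumes the pairwise GDT-TS values are already available as a symmetric matrix `sim` (a hypothetical input; filling it would require a structure-comparison tool). Following the formula as written, the sum runs over every $s_{j} \in S$, so the self-similarity on the diagonal is included.

```python
from typing import List


def ref_all(sim: List[List[float]]) -> List[float]:
    """QA scores x_i = (1/|S|) * sum_j sim[i][j], one per structure."""
    n = len(sim)
    if n == 0:
        raise ValueError("empty structure set")
    return [sum(row) / n for row in sim]


# Toy pairwise GDT-TS matrix for three structures (illustrative values only).
sim = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.3],
    [0.2, 0.3, 1.0],
]
scores = ref_all(sim)
# Index of the structure with the highest QA score, i.e., the selected model.
best = max(range(len(scores)), key=scores.__getitem__)
```

In this toy case the second structure (index 1) has the highest average similarity, about 0.7, and would be selected.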

Though simple, RefAll performs very well on the recent CASP data set, as demonstrated in Section 5. However, one problem with RefAll is that it is biased toward structures with multiple duplicate copies. As a result, a near-native structure has little chance of being ranked at the top by RefAll if it has fewer duplicates than other models. This is especially true for CASP data, because CASP allows a predictor to register multiple servers and each server may submit up to five models per target, which may be identical or highly similar for some targets. Another problem with RefAll is that it uses all the predicted structures of a protein as references against which a structure is compared, even though some poorly predicted structures bear no resemblance to near-native ones. Using poor predictions as reference models lowers the QA scores of high-quality structures and hence reduces the discerning power of RefAll.

The ineffectiveness of using all predicted models as references is illustrated in Fig. 2. In Fig. 2, CASP8 targets are grouped into four bins based on the GDT-TS of the best predicted structure of each target. The $x$-axis denotes the percentage of the predicted structures used as references, namely the models having the best GDT-TSs to native. When $x = 100$ percent, the results are exactly the same as those of RefAll. The $y$-axis is the average correlation of all targets within a bin. The red curve presents the correlation averaged over all targets. As shown in the right half of Fig. 2, as more and more poorly predicted models are removed from the reference set, i.e., as the $x$-axis value changes from 100 to 40 percent, the correlation between the QA scores and GDT-TSs increases in all categories of targets. This indicates that with a refined reference set a consensus algorithm can outperform RefAll.

3.3. Problem Statement

The problem of protein structure selection to be addressed in this paper is defined as follows: Let $N$ be the native structure of a protein and $S, S = \{s_{i}, 1 \leq i \leq n\}$ , be a set of predicted structures for this protein. For each structure $s_{i} \in S$ , we calculate a quality assessment score $x_{i}, 0 \leq x_{i} \leq 1$ , without using any information of $N$ . The structure with the highest QA score will be selected.

Let $y_{i}$ be the GDT-TS of $s_{i}$ to the native structure $N$. To evaluate the QA scores $X = \{x_{i}\}$, we compute the correlation coefficient $ρ$ between $X$ and the GDT-TSs $Y = \{y_{i}\}$:

$$\rho = \frac{\sum_{i = 1}^{n} (x_{i} - \overline{x}) (y_{i} - \overline{y})}{\sqrt{\sum_{i = 1}^{n} (x_{i} - \overline{x})^{2}} \sqrt{\sum_{i = 1}^{n} (y_{i} - \overline{y})^{2}}}. \qquad (2)$$

Our objective is to generate QA scores $X$ such that the correlation between $X$ and $Y$ is maximized:

$$\underset{X}{\operatorname{argmax}} \; \rho. \qquad (3)$$

This problem formulation is based on an evaluation criterion used in CASP where the QA scores predicted for a set of structures are submitted and the Pearson correlation between the QA scores and the structural GDT-TSs is the score of the team.
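The correlation coefficient in (2) is the standard Pearson correlation. A minimal Python sketch is given below; the function name and the toy score lists are illustrative assumptions, not data from the paper.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between QA scores x and GDT-TSs y, as in (2)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sy = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


# Toy QA scores and GDT-TSs to native for three models (illustrative only).
qa = [0.70, 0.67, 0.50]
gdt = [0.60, 0.55, 0.20]
rho = pearson(qa, gdt)
```

Because the toy QA scores rank the models in the same order as their GDT-TSs and are nearly linear in them, `rho` comes out close to 1.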

4. A New Consensus-Based Algorithm

Fig. 2 in the previous section indicates the need to find a good reference set, which preferably consists mainly of near-native structures. Unfortunately, the selection of high-quality models is not a trivial task without knowledge of the native structures. Various approaches have been introduced to determine reference structures. For example, MULTICOM [11], one of the top QA servers in CASP8, utilized a consensus score to pick reference models. However, due to the noisy nature of the score, the correlation of the QA scores by MULTICOM with the GDT-TSs to native is even inferior to that of RefAll, as shown by the experimental results in Section 5.

The inability of existing selection methods to pick good reference models motivated our investigation into a new approach: extracting reference models from a set of predicted structures by discarding redundant structures. In the following text, we present our new consensus-based algorithm for the computation of QA scores and our novel redundancy removal approach for the extraction of reference models.

4.1. A New Consensus Algorithm RefSelect

Let $R$ be a set of reference models extracted from a set of predicted structures $S$ . The algorithm RefSelect below calculates QA scores $X$ for $S$ by comparing each structure in $S$ with reference models in $R$ .

Algorithm RefSelect(S, R)

1. Calculate the average pairwise GDT-TS score $x_{i}$ for each predicted structure $s_{i} \in S$:
$x_{i} = \frac{1}{|R|} \sum_{s_{j} \in R} \text{GDT-TS}(s_{i}, s_{j})$
2. Output $X$, where $X = \{x_{i}, 1 \leq i \leq n\}$.

The only difference between the algorithms RefSelect and RefAll is that each structure $s_{i} \in S$ is compared to all the structures in $R$ instead of to all the structures in $S$ . Here, $R$ is a set of reference structures that can be extracted from $S$ using the algorithm RmRedundant to be presented next.
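A small Python sketch of RefSelect is given below, under the assumption that the pairwise GDT-TS matrix `sim` has been precomputed and that the reference set is passed as a list of indices into $S$ (`ref_idx`, a hypothetical representation). Passing all indices reproduces RefAll.

```python
from typing import List, Sequence


def ref_select(sim: List[List[float]], ref_idx: Sequence[int]) -> List[float]:
    """QA scores x_i = (1/|R|) * sum over j in R of sim[i][j]."""
    if not ref_idx:
        raise ValueError("the reference set must not be empty")
    return [sum(row[j] for j in ref_idx) / len(ref_idx) for row in sim]


# Toy matrix: structures 0-2 are near-duplicates, structure 3 is dissimilar.
sim = [
    [1.0, 0.9, 0.9, 0.3],
    [0.9, 1.0, 0.9, 0.3],
    [0.9, 0.9, 1.0, 0.3],
    [0.3, 0.3, 0.3, 1.0],
]
all_refs = ref_select(sim, [0, 1, 2, 3])  # R = S, identical to RefAll
reduced = ref_select(sim, [0, 3])         # a smaller, hand-picked reference set
```

Every structure in $S$ still receives a score; only the set it is compared against changes.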

4.2. New Technique for Finding Reference Structures

To present our new approach for the extraction of reference models, we first introduce the concept of a redundant model. For a pair of 3D structures $s_{i}$ and $s_{j}$ of a protein, we say that either $s_{i}$ or $s_{j}$ is redundant if their pairwise GDT-TS, i.e., $\text{GDT-TS}(s_{i}, s_{j})$, is greater than a threshold $Z$. From this definition, it follows that, for a given set of predicted structures $S$, a higher value of $Z$ corresponds to fewer redundant structures in $S$. If $Z = 1$, no structure in $S$ is redundant, because the pairwise GDT-TS between any pair of structures is no greater than 1. When $Z = 0$, however, all the structures in $S$ can be redundant, since the pairwise GDT-TS is always greater than 0. Hence, by changing the value of $Z$, we can adjust the number of redundant structures in $S$.

A simple way to remove the redundancy from a set of predicted structures is to discard, arbitrarily, one of each pair of structures $s_{i}$ and $s_{j}$ for which $\text{GDT-TS}(s_{i}, s_{j}) > Z$. The following procedure, RmRedundant, implements this idea and returns a refined reference set $R$.

Algorithm RmRedundant(S, Z)

1. Let $R = S$.
2. For $i$ from 1 to $n - 1$, iteratively remove from $R$ those structures $s_{j}$, $j > i$, satisfying $\text{GDT-TS}(s_{i}, s_{j}) > Z$.
3. Return $R$.
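One possible reading of RmRedundant in Python is sketched below. It again assumes a precomputed pairwise GDT-TS matrix `sim` and returns the indices of the retained structures. Two details are assumptions on our part rather than statements from the algorithm text: a structure that has already been removed is not used to remove others, and of each redundant pair the lower-indexed structure is the one kept.

```python
from typing import List


def rm_redundant(sim: List[List[float]], z: float) -> List[int]:
    """Return indices of the reference set R after discarding redundant models."""
    n = len(sim)
    keep = [True] * n
    for i in range(n - 1):
        if not keep[i]:
            continue  # assumption: removed structures no longer act as filters
        for j in range(i + 1, n):
            if keep[j] and sim[i][j] > z:
                keep[j] = False  # s_j is redundant with respect to s_i
    return [i for i in range(n) if keep[i]]


# Toy matrix: structures 0-2 are near-duplicates, structure 3 is dissimilar.
sim = [
    [1.0, 0.9, 0.9, 0.3],
    [0.9, 1.0, 0.9, 0.3],
    [0.9, 0.9, 1.0, 0.3],
    [0.3, 0.3, 0.3, 1.0],
]
refs_strict = rm_redundant(sim, 1.0)  # nothing exceeds 1, so nothing is removed
refs_mid = rm_redundant(sim, 0.8)     # the duplicates of structure 0 are dropped
refs_zero = rm_redundant(sim, 0.0)    # every later structure is redundant
```

On the toy matrix, a threshold of 0.8 collapses the three near-duplicates into a single representative while keeping the dissimilar structure.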

The threshold $Z$ is an important parameter of RmRedundant. It determines the quality of the resulting reference set $R$ and hence the correlation between the QA scores $X$ and the GDT-TSs $Y$. Fig. 3 takes the server predictions of the CASP8 target T0496 as an example to illustrate the influence of $Z$ on the QA scores $X$. In Fig. 3, the $x$-axis represents the threshold $Z$ and the $y$-axis is the correlation between the QA scores $X$ and the structural GDT-TSs $Y$. For each $Z$, the algorithm RmRedundant is run on the server predictions of T0496 to find a reference set $R$, with which the algorithm RefSelect is called to compute QA scores $X$. Then, the correlation $ρ$ between $X$ and $Y$ is calculated and the corresponding point $(Z, ρ)$ is plotted.

As shown in Fig. 3, as the threshold $Z$ decreases from 1 to 0.46, i.e., as more and more predicted structures of T0496 are defined as redundant and removed from $R$, the correlation between the QA scores $X$ and the GDT-TSs $Y$ increases significantly from 0.72 to 0.86. However, if we decrease $Z$ further, the correlation starts to drop from the peak. This indicates that a proper value of $Z$ is critical for the quality of the QA scores $X$.
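The procedure behind a curve like the one in Fig. 3 can be sketched as a loop over candidate thresholds. The Python sketch below is self-contained and uses toy numbers rather than the T0496 data; the helper names, the greedy reading of RmRedundant, and the choice of thresholds are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rm_redundant(sim: List[List[float]], z: float) -> List[int]:
    """Greedy redundancy removal; keeps the lower-indexed model of a pair."""
    keep = [True] * len(sim)
    for i in range(len(sim) - 1):
        if keep[i]:
            for j in range(i + 1, len(sim)):
                if sim[i][j] > z:
                    keep[j] = False
    return [i for i, kept in enumerate(keep) if kept]


def ref_select(sim: List[List[float]], refs: Sequence[int]) -> List[float]:
    """Average similarity of every structure to the reference set."""
    return [sum(row[j] for j in refs) / len(refs) for row in sim]


def scan_thresholds(
    sim: List[List[float]], y: Sequence[float], thresholds: Sequence[float]
) -> List[Tuple[float, float, int]]:
    """Return (Z, rho, |R|) for each candidate threshold Z."""
    curve = []
    for z in thresholds:
        refs = rm_redundant(sim, z)
        x = ref_select(sim, refs)
        curve.append((z, pearson(x, y), len(refs)))
    return curve


# Toy pairwise matrix and toy GDT-TSs to native (illustrative values only).
sim = [
    [1.0, 0.9, 0.6, 0.3],
    [0.9, 1.0, 0.7, 0.3],
    [0.6, 0.7, 1.0, 0.4],
    [0.3, 0.3, 0.4, 1.0],
]
gdt_to_native = [0.50, 0.55, 0.60, 0.20]
curve = scan_thresholds(sim, gdt_to_native, [1.0, 0.8, 0.5, 0.0])
best_z, best_rho, best_size = max(curve, key=lambda point: point[1])
```

Note that such a scan needs the GDT-TSs to native, so it is an analysis tool for targets with known structures rather than something available at prediction time.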

The protein T0496 is a hard target, for which it is very difficult to predict the 3D structure from the amino acid sequence. To make sure that RmRedundant is a general approach capable of extracting a reference set better than that of RefAll for most proteins, rather than for a few specific targets, we further evaluated RmRedundant on all CASP8 targets. As in Fig. 2, we categorize CASP8 targets into four bins based on the GDT-TS of the best predicted model of each target. For each target, we adjust $Z$ to generate a reference set $R$ that maximizes the correlation between the QA scores (calculated using the algorithm RefSelect) and the GDT-TSs. Then, we output this maximal correlation, which represents the highest correlation that RefSelect can achieve for this target when RmRedundant is used to remove redundant models. Due to space limitations, here we only present the mean of the maximal correlations averaged over all targets within each bin. As shown in Table 1, the average correlations of RefSelect are consistently higher than those of RefAll in all categories of targets. This demonstrates that, for most proteins, RmRedundant is able to generate reference sets better than the one that uses all predicted structures as references.

TABLE 1.

The Highest Correlations RefSelect Can Reach on 122 CASP8 Targets when the Algorithm RmRedundant Is Used to Remove Redundant Models

Best-model GDT-TS bin	RefAll	RefSelect (using optimal Z)
GDT-TS < 0.3	0.7990	0.8574
0.3 ≤ GDT-TS < 0.5	0.8454	0.8946
0.5 ≤ GDT-TS < 0.7	0.9482	0.9705
0.7 ≤ GDT-TS	0.9577	0.9804
All	0.9290	0.9579


4.3. Parameter Learning

We need a parameter $Z$ to run the procedure RmRedundant. For CASP data, each target consists of hundreds of 3D structures predicted by different servers. The mean of the pairwise GDT-TSs of easy targets is close to 1, while that of hard targets can be as low as 0.1. A single fixed value of the threshold $Z$ therefore cannot work for all targets; otherwise, we would risk discarding most structures for easy targets while preserving too many redundant models for hard ones. Thus, $Z$ has to be dynamically adjusted for different targets. The procedure to find a proper value for $Z$ is provided in Fig. 4.

As shown in Fig. 4, for a set of predicted structures $S$ of a protein, we first set an initial value to the threshold $Z$ . Next, the algorithm RmRedundant is run to find a set of reference structures $R$ by removing redundant structures under the threshold $Z$ . If the size of $R$ is too small or too large, then we go through an iterative process to adjust the value of $Z$ so as to increase or decrease the size of $R$ . After a proper size of $R$ is found, the final value of $Z$ is returned.
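The iterative adjustment of $Z$ can be sketched as a bounded search. The text does not spell out the update rule used in Fig. 4, so the bisection below is an assumption, as are the function name, the initial value 0.5, and the iteration cap. The cap matters because the greedy removal is not guaranteed to change $| R |$ monotonically with $Z$.

```python
from typing import List, Tuple


def rm_redundant(sim: List[List[float]], z: float) -> List[int]:
    """Greedy redundancy removal; keeps the lower-indexed model of a pair."""
    keep = [True] * len(sim)
    for i in range(len(sim) - 1):
        if keep[i]:
            for j in range(i + 1, len(sim)):
                if sim[i][j] > z:
                    keep[j] = False
    return [i for i, kept in enumerate(keep) if kept]


def find_threshold(
    sim: List[List[float]],
    min_size: int,
    max_size: int,
    z_init: float = 0.5,  # hypothetical starting value
    max_iter: int = 30,   # hypothetical cap on the number of adjustments
) -> Tuple[float, List[int]]:
    """Adjust Z by bisection until min_size <= |R| <= max_size (or give up)."""
    lo, hi = 0.0, 1.0
    z = z_init
    for _ in range(max_iter):
        refs = rm_redundant(sim, z)
        if len(refs) < min_size:
            lo = z  # R too small: raise Z so fewer structures count as redundant
        elif len(refs) > max_size:
            hi = z  # R too large: lower Z so more structures count as redundant
        else:
            return z, refs
        z = (lo + hi) / 2
    return z, rm_redundant(sim, z)  # best effort after max_iter adjustments


# Toy pairwise GDT-TS matrix (illustrative values only).
sim = [
    [1.0, 0.9, 0.6, 0.3],
    [0.9, 1.0, 0.7, 0.3],
    [0.6, 0.7, 1.0, 0.4],
    [0.3, 0.3, 0.4, 1.0],
]
z_final, refs = find_threshold(sim, min_size=3, max_size=3)
```

With the toy matrix, the initial threshold 0.5 leaves only two references, so the search raises $Z$ to 0.75, at which point three references remain and the search stops.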

Now our objective is to determine $| R |$ , the size of the reference set $R$ . We use the server predictions of 93 CASP7 targets as training data to learn $| R |$ . Each CASP7 target consists of around 250 structures predicted by different servers. On the training data set, the problem of finding an optimal $| R |$ can be formulated as the following maximization problem:

$$\underset{|R|}{\operatorname{argmax}} \; \rho \quad \text{s.t.} \quad R = \text{RmRedundant}(S, Z), \quad C_{1} < |R| < C_{2}, \qquad (4)$$

where $ρ$ is the correlation coefficient defined in (2), and $C_{1}$ and $C_{2}$ are the lower and upper bounds of $| R |$, respectively.

Problem (4) can be solved by searching through the entire solution space, in which solutions for $| R |$ are not unique. Since $| R |$ approaches $| S |$ (the whole set of predicted structures) for easy targets, for simplicity we let $| R | = | S |$ for easy targets. Here, easy targets are distinguished from hard ones based on the mean of the pairwise GDT-TSs between predicted structures. The threshold used to classify a target as easy or hard is learned from CASP7 data. Let $u$ be the ratio of $| R |$ to $| S |$, i.e., $| R | = u \times | S |$. By averaging the results of (4) for $u$, we obtain $u = 0.27$ for hard targets.
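The resulting rule for the target size of $R$ can be written down in a few lines. In the sketch below, the value $u = 0.27$ comes from the text, whereas the easy/hard cutoff is left as a required argument because its learned value is not stated here; the 0.5 used in the example call is purely hypothetical. Averaging over off-diagonal pairs and rounding to the nearest integer are also assumptions.

```python
from typing import List


def target_reference_size(
    sim: List[List[float]], easy_cutoff: float, u_hard: float = 0.27
) -> int:
    """Desired |R|: all structures for easy targets, u * |S| for hard ones."""
    n = len(sim)
    if n < 2:
        raise ValueError("need at least two structures")
    # Mean pairwise GDT-TS over distinct pairs (diagonal excluded by assumption).
    pairs = [sim[i][j] for i in range(n) for j in range(i + 1, n)]
    mean_similarity = sum(pairs) / len(pairs)
    if mean_similarity >= easy_cutoff:
        return n  # easy target: |R| = |S|
    return max(1, round(u_hard * n))  # hard target: |R| = u * |S|


def constant_matrix(n: int, value: float) -> List[List[float]]:
    """Toy matrix with 1.0 on the diagonal and `value` everywhere else."""
    return [[1.0 if i == j else value for j in range(n)] for i in range(n)]


# About 300 models per target, as in CASP8; the similarity values are toy numbers.
hard_size = target_reference_size(constant_matrix(300, 0.2), easy_cutoff=0.5)
easy_size = target_reference_size(constant_matrix(300, 0.9), easy_cutoff=0.5)
```

For a hard target with 300 predicted structures, this gives a reference set of about $0.27 \times 300 = 81$ models.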

5. Experimental Results

5.1. Data Set

Our method was evaluated on the set of CASP8 targets. CASP8 provided 128 targets/proteins, from T0387 to T0514. For each target, its primary sequence, with length ranging from 55 to 829 amino acids, was provided to computer servers for 3D structure prediction. About 300 predicted structures were submitted for each target by around 70 servers. After the tertiary structure prediction period expired, these predicted structures were made available from the CASP8 website, and groups submitting quality assessment predictions were then asked to evaluate the server predictions and assign each predicted structure a number that predicts its quality.

The CASP8 quality assessment was evaluated based on 122 targets. This is the test set we used in our experiments. The difficulty of these targets for predictions varies. Based on the average of the top 10 GDT-TSs for the first server models, these targets can be partitioned in the order from hard to easy into five categories:

free modeling (FM),
fold recognition (FR),
comparative modeling—hard (CM_H),
comparative modeling—medium (CM_M), and
comparative modeling—easy (CM_E) [48].

A detailed description of the domain categories is provided in [48]. In the following text, the performance of our method is evaluated on each of these categories.

5.2. Comparing with Existing QA Servers

First, our algorithm RefSelect was compared to the state-of-the-art QA servers, e.g., MULTICOM [11], Pcons [22], SAM-T08-MQAC [1], ModFOLDclust [25], [26], [27], and QMEANclust [5], which are all consensus based and were the best performing QA servers in the CASP8 competition. The quality assessment results of these servers were taken directly from the official website of CASP8. We assigned QA scores $X$ to the predicted structures of each target and calculated the Pearson correlation between $X$ and the structural GDT-TSs $Y$. Then, the average correlation over all the targets within each category was computed; the results are given in Table 2a. As indicated in Table 2a, the overall correlation between the QA scores generated by RefSelect and the structural GDT-TSs is about 2 percent higher than that of the best CASP8 QA server.

TABLE 2.

Comparison of Our Algorithm RefSelect to the Best Performing CASP8 QA Servers on the Test Data Set of 122 CASP8 Targets

(a)
	MULTICOM	Pcons	SAM-T08-MQAC	ModFOLDclust	QMEANclust	RefAll	RefSelect
FM	0.7387	0.7806	0.8150	0.7280	0.8304	0.7981	0.8386
FR	0.7980	0.8473	0.8433	0.8549	0.7890	0.8657	0.8730
CM_H	0.8653	0.8623	0.8541	0.8699	0.8580	0.8806	0.8880
CM_M	0.9558	0.9599	0.9579	0.9629	0.9521	0.9726	0.9732
CM_E	0.9730	0.9845	0.9749	0.9763	0.9738	0.9856	0.9855
Overall	0.9029	0.9168	0.9144	0.9156	0.9021	0.9290	0.9343

(b)
	MULTICOM	Pcons	SAM-T08-MQAC	ModFOLDclust	QMEANclust	RefAll	RefSelect
FM	0.2394	0.2048	0.2337	0.2307	0.2248	0.1951	0.2319
FR	0.3561	0.3600	0.3581	0.3737	0.3384	0.3551	0.3743
CM_H	0.5926	0.5784	0.5909	0.5745	0.5778	0.5711	0.5861
CM_M	0.7158	0.7139	0.7136	0.7126	0.7097	0.7172	0.7169
CM_E	0.8464	0.8560	0.8500	0.8526	0.8481	0.8558	0.8543
Overall	0.6338	0.6285	0.6337	0.6306	0.6203	0.6267	0.6347


(a) The average correlation between the QA scores and structural GDT-TSs. (b) The average GDT-TSs of the top one structures selected by QA servers.

To verify whether the correlation improvement of our method is statistically significant, we paired our method with each of MULTICOM, Pcons, SAM-T08-MQAC, ModFOLDclust, and QMEANclust for paired t-tests [39], [49] over the 122 CASP8 targets. The corresponding two-tailed P values for the difference between our method and each of the other five methods are 0.0001, 0.0001, 0.0001, 0.0001, and 0.0006, respectively, indicating that the improvement of RefSelect over the top CASP8 QA servers is highly significant.
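For readers unfamiliar with the paired t-test, the statistic can be computed from the per-target differences as sketched below. The per-target correlations in the example are made-up toy values, not CASP8 results. The sketch stops at the t statistic and its degrees of freedom; converting these to a two-tailed P value requires the Student t distribution, which a statistics library such as SciPy provides (e.g., `scipy.stats.ttest_rel`).

```python
import math
from typing import Sequence, Tuple


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Return (t, degrees of freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("a and b must have the same length (at least 2)")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    # Unbiased sample variance of the per-target differences
    var = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    if var == 0:
        raise ValueError("t is undefined when all differences are identical")
    return mean_diff / math.sqrt(var / n), n - 1


# Toy per-target correlations for two methods on four targets (not real data).
method_a = [0.84, 0.87, 0.89, 0.97]
method_b = [0.80, 0.85, 0.86, 0.96]
t_value, dof = paired_t_statistic(method_a, method_b)
```

Pairing by target is what makes the test sensitive here: target difficulty varies widely, but it affects both methods on the same target in a similar way and largely cancels in the differences.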

Table 2b evaluates the ability of our method to select near-native structures. The model with the best QA score was picked for each target. The GDT-TSs of these top picks were then averaged for each category. The results in Table 2b show that the top-ranked models selected by RefSelect are comparable to those selected by the top QA servers in CASP8.
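The top-one evaluation can be sketched as follows: rank the models of a target by QA score, then report the GDT-TS to native of the top-ranked model. The helper below generalizes this to the best model among the top $k$, which is an assumption about how a top-five measure would be computed; the scores in the example are toy values.

```python
from typing import Sequence


def best_of_top_k(qa: Sequence[float], gdt: Sequence[float], k: int = 1) -> float:
    """GDT-TS of the best model among the k models with the highest QA scores."""
    if len(qa) != len(gdt) or not qa:
        raise ValueError("qa and gdt must be non-empty and of equal length")
    if k < 1:
        raise ValueError("k must be at least 1")
    ranked = sorted(range(len(qa)), key=lambda i: qa[i], reverse=True)
    return max(gdt[i] for i in ranked[:k])


# Toy QA scores and GDT-TSs to native for four models of one target.
qa_scores = [0.70, 0.67, 0.50, 0.66]
gdt_native = [0.55, 0.60, 0.20, 0.58]
top1 = best_of_top_k(qa_scores, gdt_native, k=1)  # model ranked first by QA
top2 = best_of_top_k(qa_scores, gdt_native, k=2)  # best of the two top-ranked
perfect = max(gdt_native)                         # upper bound: perfect selection
```

In the toy case the top-ranked model has a GDT-TS of 0.55 although a model with 0.60 exists, which shows how a selection can fall short of perfect selection.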

Table 2 also provides the results of RefAll on CASP8. The overall correlation of the QA scores by RefAll is 0.9290, lower than the 0.9343 achieved by RefSelect. Moreover, for the targets in the free modeling category, which are the most difficult ones for 3D structure prediction and evaluation, the correlation improved the most, from 0.7981 to 0.8386. Overall, Table 2 indicates that the discerning power of our method is improved by removing redundant structures from the reference set.

5.3. Comparing with Scoring Functions

Our method was compared with the state-of-the-art scoring functions. The scoring functions we compared with include

OPUS-Ca, a knowledge-based function that is formed based on seven major representative molecular interactions in proteins [40];
ModelEvaluator, a scoring function that evaluates protein model quality using support vector machines and 1D and 2D structural features [37];
DFIRE, a knowledge-based potential that is based on a novel reference state called the distance-scaled, finite ideal-gas reference (DFIRE) state [47];
RAPDF, a residue-specific all-atom probability discriminatory function [31];
OPUS-PSP, an all-atom potential derived from side-chain packing [23];
DFIRE2.0, an energy function that has been applied to ab initio refolding of fully unfolded terminal segments [44]; and
DOPE, a knowledge-based scoring function based on an improved reference [32].

First, Table 3a shows the correlation of the scoring functions on CASP8. It indicates that the QA scores of RefSelect correlate better with the structural GDT-TSs than the scoring functions.

TABLE 3.

Comparison of RefSelect to the Existing Scoring Functions on the Test Data Set of 122 CASP8 Targets

(a)
	Top1	OPUS-Ca	ModelEvaluator	DFIRE	RAPDF	OPUS-PSP	DFIRE2.0	DOPE	RefAll	RefSelect
FM	N/A	0.4701	0.2396	0.3267	0.1873	−0.0458	0.3118	0.3926	0.7981	0.8386
FR	N/A	0.3770	0.4926	0.3109	0.2452	0.1868	0.4145	0.4405	0.8657	0.8730
CM_H	N/A	0.5159	0.6420	0.4808	0.4523	0.4319	0.6039	0.6206	0.8806	0.8880
CM_M	N/A	0.6643	0.7042	0.5831	0.5853	0.4611	0.6933	0.7299	0.9726	0.9732
CM_E	N/A	0.7275	0.7988	0.6592	0.6652	0.4664	0.7134	0.7830	0.9856	0.9855
Overall	N/A	0.5878	0.6497	0.5173	0.4940	0.3763	0.6077	0.6493	0.9290	0.9343

(b)
	Top1	OPUS-Ca	ModelEvaluator	DFIRE	RAPDF	OPUS-PSP	DFIRE2.0	DOPE	RefAll	RefSelect
FM	0.2830	0.1536	0.1218	0.1724	0.1996	0.1772	0.1841	0.1841	0.1951	0.2319
FR	0.4329	0.2567	0.2712	0.3048	0.2988	0.3064	0.2765	0.2912	0.3551	0.3743
CM_H	0.6524	0.4713	0.5285	0.5423	0.5265	0.3945	0.4633	0.4282	0.5711	0.5861
CM_M	0.7502	0.5807	0.6480	0.6240	0.6649	0.5762	0.5722	0.5477	0.7172	0.7169
CM_E	0.8884	0.6748	0.7946	0.7846	0.8330	0.7968	0.8102	0.8025	0.8558	0.8543
Overall	0.6799	0.4985	0.5613	0.5627	0.5856	0.5239	0.5331	0.5196	0.6267	0.6347

(c)
	Top1	OPUS-Ca	ModelEvaluator	DFIRE	RAPDF	OPUS-PSP	DFIRE2.0	DOPE	RefAll	RefSelect
FM	0.2830	0.1905	0.2161	0.2083	0.2342	0.2104	0.2184	0.2111	0.2130	0.2488
FR	0.4329	0.3081	0.3480	0.3521	0.3516	0.3370	0.3473	0.3318	0.3671	0.3945
CM_H	0.6524	0.5646	0.5802	0.6046	0.5877	0.4909	0.5587	0.5339	0.5871	0.5938
CM_M	0.7502	0.6950	0.7031	0.7162	0.7154	0.6879	0.6901	0.6856	0.7266	0.7278
CM_E	0.8884	0.8144	0.8460	0.8597	0.8603	0.8511	0.8527	0.8505	0.8647	0.8641
Overall	0.6799	0.5989	0.6212	0.6336	0.6318	0.5990	0.6147	0.6049	0.6382	0.6468

(a) The average correlation between the scores generated by the scoring functions and the GDT-TSs of the structures. (b) The average GDT-TSs of the top-one structures selected by the scoring functions. (c) The average GDT-TSs of the best of the top five structures picked by the scoring functions.

We also applied each scoring function to pick the top-one structure for each target. Table 3b shows that the average GDT-TS of the structures picked by our method is 0.6347, about 8 percent higher than 0.5856, the average GDT-TS achieved by the best performing scoring function, RAPDF. Table 3b also provides the selection results of RefAll. By removing redundant structures, RefSelect improved the GDT-TS over RefAll by 19 and 5 percent for targets in the FM and FR categories, respectively. Moreover, the column Top1 in Table 3b gives the average GDT-TS of the best structure of each target in each category; i.e., it corresponds to the result of perfect selection. The comparison with Top1 indicates how much room remains for improving our algorithm.
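
The percentages quoted here follow directly from the entries of Table 3b. The short sketch below reproduces the arithmetic; the helper name relative_gain is introduced only for illustration, and the last entry (the remaining gap to perfect selection) is an extra quantity derived from the same table.

```python
def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


# Average GDT-TS values copied from Table 3b.
gains = {
    "FM, RefSelect vs. RefAll": relative_gain(0.2319, 0.1951),
    "FR, RefSelect vs. RefAll": relative_gain(0.3743, 0.3551),
    "Overall, RefSelect vs. RAPDF": relative_gain(0.6347, 0.5856),
    "Overall, Top1 vs. RefSelect": relative_gain(0.6799, 0.6347),
}

for label, gain in gains.items():
    print(f"{label}: {gain:.1f}%")
# FM, RefSelect vs. RefAll: 18.9%
# FR, RefSelect vs. RefAll: 5.4%
# Overall, RefSelect vs. RAPDF: 8.4%
# Overall, Top1 vs. RefSelect: 7.1%
```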

Next, we selected five structures instead of one from each target and computed the average GDT-TS of the best of five. The results in Table 3c again show that the removal of redundant structures improves the discerning power of our selection method and that our algorithm outperforms other selection methods.
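
The best-of-five measure can be sketched as follows, assuming that a higher QA score indicates a better model. The function name best_of_top_k and the toy numbers are hypothetical.

```python
from typing import Sequence


def best_of_top_k(scores: Sequence[float], gdt_ts: Sequence[float], k: int) -> float:
    """Return the highest GDT-TS among the k models with the best QA scores."""
    if len(scores) != len(gdt_ts):
        raise ValueError("scores and gdt_ts must have the same length")
    if k < 1 or not scores:
        raise ValueError("need k >= 1 and at least one model")
    # Rank model indices by predicted score, best first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return max(gdt_ts[i] for i in ranked[:k])


# Hypothetical QA scores and true GDT-TSs for six models of one target.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
gdt_ts = [0.50, 0.62, 0.48, 0.55, 0.40, 0.70]

top_one = best_of_top_k(scores, gdt_ts, k=1)       # 0.50
best_of_five = best_of_top_k(scores, gdt_ts, k=5)  # 0.62
```

Averaging these per-target values over all targets of a category gives the kind of entry reported in Tables 3b (k = 1) and 3c (k = 5).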

5.4. Performance on Rosetta Models

Finally, we tested our algorithm on Rosetta models provided by the CASP8 server MUFOLD-MD [45] for CASP8 sequences. MUFOLD-MD applied Rosetta [19] (version 2.2.0) for ab initio generation of models. To avoid evaluating poor predictions, we excluded from our test set the targets for which Rosetta failed to provide high-quality structures, i.e., those targets for which the GDT-TS of the top predicted model is less than 0.5. After this filtering step, our test data consist of 22 targets, as shown in Table 4.

TABLE 4.

The GDT-TS of the Top Selection and the Correlation of the QA Scores of Our Method and Existing Scoring Functions on a Test Set of Rosetta Models Generated for 22 CASP8 Targets

		GDT-TS of the top1 selection							Correlation
target	top1	opus	dfire	rapdf	DOPE	Rosetta¹	RefAll	RefSelect	opus	dfire	rapdf	DOPE	Rosetta¹	RefAll	RefSelect
T0392	0.4716	0.3247	0.3041	0.3041	0.2552	0.2526	0.2809	0.2809	0.1782	0.1053	0.0619	0.2210	0.1772	0.4624	0.4666
T0408	0.6011	0.4920	0.4176	0.4176	0.3910	0.4761	0.4707	0.4707	0.4198	0.1209	0.2366	0.1988	0.6086	0.7589	0.7184
T0411	0.4271	0.3375	0.3271	0.3271	0.3375	0.2958	0.3625	0.3625	0.1167	0.0912	−0.0245	0.0916	0.2745	0.8028	0.7776
T0415	0.5275	0.3005	0.4518	0.4518	0.4037	0.4037	0.4495	0.4495	0.3964	0.2901	0.4167	0.1803	0.7190	0.8433	0.8433
T0433	0.4083	0.2977	0.2751	0.2751	0.3405	0.2977	0.3869	0.3329	0.4302	0.2012	0.0984	0.4212	0.6068	0.8938	0.8780
T0437	0.4899	0.3561	0.3409	0.3409	0.2929	0.3561	0.3232	0.3232	0.3392	0.1482	0.1145	0.1515	0.1787	0.1980	0.2387
T0459	0.5505	0.4977	0.3739	0.3739	0.2959	0.3739	0.4977	0.4977	0.2240	0.1194	0.2484	0.1890	0.1553	0.8542	0.8200
T0469	0.6308	0.4000	0.4692	0.4692	0.4846	0.5038	0.5000	0.5000	0.0746	0.1287	−0.0051	0.2268	0.0506	0.6445	0.6445
T0473	0.6103	0.5662	0.5037	0.5037	0.4522	0.4926	0.5037	0.5037	0.2418	0.0877	0.2274	0.2450	0.1011	0.5851	0.4893
T0492	0.6438	0.5651	0.3425	0.3425	0.4658	0.3014	0.5822	0.5822	0.2548	0.1042	0.1094	0.2235	0.1617	0.5306	0.5944
T0502	0.4821	0.2168	0.3061	0.3061	0.2372	0.4821	0.4082	0.4082	0.0880	0.0615	0.0947	0.0255	0.2045	0.6259	0.6259
T0396	0.7696	0.2647	0.6397	0.6397	0.7696	0.2525	0.6618	0.6912	0.1977	0.3780	0.3529	0.4717	0.4312	0.9154	0.8537
T0400	0.5301	0.4668	0.3892	0.3892	0.5063	0.3892	0.3782	0.4652	0.4766	0.0788	0.0928	0.3110	0.5113	0.8659	0.8549
T0404	0.5549	0.3323	0.3720	0.3720	0.4695	0.3811	0.3506	0.3506	0.1920	0.1055	0.0420	0.1536	0.2956	0.5576	0.5232
T0432	0.4673	0.3327	0.3865	0.3865	0.2615	0.2346	0.3327	0.3327	0.2589	0.1538	0.1404	0.2014	0.1487	0.5771	0.5498
T0453	0.5431	0.3534	0.2701	0.2701	0.3793	0.3362	0.3276	0.4713	0.1688	−0.0411	0.1682	0.0622	0.1695	0.2497	0.2802
T0458	0.5253	0.3354	0.3956	0.3956	0.3418	0.3924	0.4272	0.4272	0.1427	0.0500	0.1577	0.0319	0.2858	0.3748	0.4008
T0476	0.4425	0.2701	0.2701	0.2701	0.3822	0.3305	0.3017	0.3017	0.1090	−0.0932	−0.0783	0.1075	0.3690	0.4361	0.5040
T0479	0.6700	0.6320	0.4880	0.4880	0.5400	0.4360	0.6700	0.4780	0.4229	0.2676	0.2917	0.4851	0.5886	0.8236	0.4985
T0488	0.7684	0.7684	0.7684	0.7684	0.4789	0.6447	0.6447	0.6447	0.6049	0.3139	0.3026	0.5495	0.6385	0.9474	0.9347
T0491	0.6020	0.3929	0.5153	0.5153	0.4133	0.5332	0.5510	0.5510	0.3517	0.2465	0.3424	0.4215	0.6216	0.8665	0.7533
T0499	0.7679	0.4777	0.4732	0.4732	0.5000	0.4955	0.5446	0.5446	0.2416	0.0925	0.0591	−0.1162	0.5064	0.3346	0.3346
Mean	0.5675	0.4082	0.4127	0.4127	0.4090	0.3937	0.4525	0.4532	0.2696	0.1369	0.1568	0.2206	0.3547	0.6431	0.6175

Note: Rosetta¹ denotes the results of the Rosetta scoring function.

For each CASP8 sequence, MUFOLD-MD generated 10,000 ab initio models. The current version of our algorithm cannot handle data at this scale because of its long computing time, so we randomly sampled 300 of the 10,000 models for each target. We then applied our algorithm and the other scoring functions to these smaller sets of 300 models. Part of our results is provided in Table 4, which indicates that RefSelect performs well even on Rosetta models: in near-native model selection it outperforms all other scoring functions, including the Rosetta scoring function (a mean top-one GDT-TS of 0.4532 versus at most 0.4127 for the scoring functions).
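
The subsampling step can be sketched as below. The sketch draws models uniformly at random without replacement and also counts the number of model pairs, on the assumption that the computing time is dominated by pairwise structure comparisons. The model identifiers, the function names, and the fixed seed are hypothetical.

```python
import math
import random
from typing import List, Optional, Sequence


def sample_models(model_ids: Sequence[str], n: int, seed: Optional[int] = None) -> List[str]:
    """Draw n distinct models uniformly at random (all of them if n is too large)."""
    rng = random.Random(seed)
    if n >= len(model_ids):
        return list(model_ids)
    return rng.sample(list(model_ids), n)


def pair_count(n: int) -> int:
    """Number of unordered model pairs among n models."""
    return math.comb(n, 2)


all_models = [f"model_{i:05d}" for i in range(10_000)]  # hypothetical identifiers
subset = sample_models(all_models, 300, seed=0)

print(pair_count(len(all_models)))  # 49995000
print(pair_count(len(subset)))      # 44850
```

Under that assumption, sampling 300 models reduces the number of pairwise comparisons per target by roughly three orders of magnitude.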

6. Summary

In this paper, we have presented a new consensus-based method that exploits the redundancy existing in predicted protein structures. In extensive experiments, we demonstrated that removing redundant structures from the set of reference models is effective in increasing the discerning power of our method. On a test set of server predictions for 122 CASP8 targets, the correlation between our QA scores and the GDT-TSs is significantly higher than those of the top QA servers in CASP8, which are all consensus based. When applied to the selection of near-native structures, our method picked top models whose GDT-TSs are on average more than 8 percent better than those selected by the best performing state-of-the-art scoring function. Moreover, our algorithm also performs well on Rosetta models.

Acknowledgments

This work has been supported by NIH Grant R33GM078601. We are also thankful to Dr. Ioan Kosztin and Dr. Bogdan Barz for providing the Rosetta predictions of CASP8 targets to us.

References

[1] Archie JG, Paluszewski M, and Karplus K, “Applying Undertaker to Quality Assessment,” Proteins: Structure, Function, and Bioinformatics, vol. 77, pp. 191–195, 2009.
[2] Baker D and Sali A, “Protein Structure Prediction and Structural Genomics,” Science, vol. 294, pp. 93–96, 2001.
[3] Ben-David M, Noivirt-Brik O, Paz A, Prilusky J, Sussman JL, and Levy Y, “Assessment of CASP8 Structure Predictions for Template Free Targets,” Proteins: Structure, Function, and Bioinformatics, vol. 77, no. suppl 9, pp. 50–65, 2009.
[4] Benkert P, Tosatto SCE, and Schomburg D, “QMEAN: A Comprehensive Scoring Function for Model Quality Assessment,” Proteins: Structure, Function, and Bioinformatics, vol. 71, pp. 261–277, 2008.
[5] Benkert P, Tosatto SCE, and Schwede T, “Global and Local Model Quality Estimation at CASP8 Using the Scoring Functions QMEAN and QMEANclust,” Proteins: Structure, Function, and Bioinformatics, vol. 77, pp. 173–180, 2009.
[6] Betancourt MR and Skolnick J, “Finding the Needle in a Haystack: Educing Protein Native Folds from Ambiguous ab initio Folding Predictions,” J. Computational Chemistry, vol. 22, pp. 339–353, 2001.
[7] Bondugula R and Xu D, “MUPRED: A Tool for Bridging the Gap between Template Based Methods and Sequence Profile Based Methods for Protein Secondary Structure Prediction,” Proteins: Structure, Function, and Bioinformatics, vol. 66, pp. 664–670, 2007.
[8] Bondugula R, Xu D, and Shang Y, “A Fast Algorithm for Low-Resolution Protein Structure Prediction,” Proc. Ann. Int’l Conf. IEEE Eng. in Medicine and Biology Soc, pp. 5826–5829, July 2006.
[9] Borg I and Groenen P, Modern Multidimensional Scaling, Theory and Applications. Springer-Verlag, 1997.
[10] Brooks BR, Bruccoleri RE, Olafson BD, States DJ, Swaminathan S, and Karplus M, “CHARMM: A Program for Macromolecular Energy Minimization and Dynamic Calculations,” J. Computational Chemistry, vol. 4, pp. 187–217, 1983.
[11] Cheng J, Wang Z, Tegge AN, and Eickholt J, “Prediction of Global and Local Quality of CASP8 Models by MULTICOM Series,” Proteins: Structure, Function, and Bioinformatics, vol. 77, pp. 181–184, 2009.
[12] Cozzetto D, Kryshtafovych A, Ceriani M, and Tramontano A, “Assessment of Predictions in the Model Quality Assessment Category,” Proteins: Structure, Function, and Bioinformatics, vol. 69, pp. 175–183, 2007.
[13] Cozzetto D, Kryshtafovych A, and Tramontano A, “Critical Assessment of Methods of Protein Structure Prediction-Round VIII,” Proteins: Structure, Function, and Bioinformatics, vol. 77, pp. 1–4, 2009.
[14] Cozzetto D, Kryshtafovych A, and Tramontano A, “Evaluation of CASP8 Model Quality Predictions,” Proteins: Structure, Function, and Bioinformatics, vol. 77, pp. 157–66, 2009.
[15] Ginalski K, Elofsson A, Fischer D, and Rychlewski L, “3D-Jury: A Simple Approach to Improve Protein Structure Predictions,” Bioinformatics, vol. 19, pp. 1015–1018, 2003.
[16] Goulden CH, Methods of Statistical Analysis, second ed., pp. 50–55. Wiley, 1956.
[17] Gront D, Hansmann UHE, and Kolinski A, “Exploring Protein Energy Landscapes with Hierarchical Clustering,” Int’l J. Quantum Chemistry, vol. 105, pp. 826–830, 2005.
[18] Kalman M and Ben-Tal N, “Quality Assessment of Protein Model-Structures Using Evolutionary Conservation,” Bioinformatics, vol. 26, no. 10, pp. 1299–1307, 2010.
[19] Kim DE, Chivian D, and Baker D, “Protein Structure Prediction and Analysis Using the Robetta Server,” Nucleic Acids Research, vol. 32, pp. 526–531, 2004.
[20] Kozakov D, Clodfelter KH, Vajda S, and Camacho CJ, “Optimal Clustering for Detecting near-Native Conformations in Protein Docking,” Biophysical J., vol. 89, pp. 867–875, 2005.
[21] Lazaridis T and Karplus M, “New View of Protein Folding Reconciled with the Old through Multiple Unfolding Simulations,” Science, vol. 278, pp. 1928–1931, 1997.
[22] Larsson P, Skwark MJ, Wallner B, and Elofsson A, “Assessment of Global and Local Model Quality in CASP8 Using Pcons and ProQ,” Proteins: Structure, Function, and Bioinformatics, vol. 77, pp. 167–172, 2009.
[23] Lu M, Dousis AD, and Ma J, “OPUS-PSP: An Orientation-Dependent Statistical All-Atom Potential Derived from Side-Chain Packing,” J. Molecular Biology, vol. 376, pp. 288–301, 2008.
[24] Lu H and Skolnick J, “A Distance-Dependent Atomic Knowledge-Based Potential for Improved Protein Structure Selection,” Proteins: Structure, Function, and Genetics, vol. 44, pp. 223–232, 2001.
[25] McGuffin LJ, “Prediction of Global and Local Model Quality in CASP8 Using the ModFOLD Server,” Proteins: Structure, Function, and Bioinformatics, vol. 77, no. suppl 9, pp. 185–190, 2009.
[26] McGuffin LJ, “The ModFOLD Server for the Quality Assessment of Protein Structural Models,” Bioinformatics, vol. 24, no. 4, pp. 586–587, 2008.
[27] McGuffin LJ and Roche DB, “Rapid Model Quality Assessment for Protein Structure Predictions Using the Comparison of Multiple Models without Structural Alignments,” Bioinformatics, vol. 26, no. 2, pp. 182–188, 2010.
[28] Moult J, “Comparison of Database Potentials and Molecular Mechanics Force Fields,” Current Opinion in Structural Biology, vol. 7, pp. 194–199, 1997.
[29] Nobeli I, Mitchell JBO, Alex A, and Thornton J, “Evaluation of a Knowledge-Based Potential of Mean Force for Scoring Docked Protein-Ligand Complexes,” J. Computational Chemistry, vol. 22, pp. 673–688, 2001.
[30] Qui J, Sheffler W, Baker D, and Noble WS, “Ranking Predicted Protein Structures with Support Vector Regression,” Proteins: Structure, Function, and Bioinformatics, vol. 71, pp. 1175–1182, May 2008.
[31] Samudrala R and Moult J, “An All-Atom Distance-Dependent Conditional Probability Discriminatory Function for Protein Structure Prediction,” J. Molecular Biology, vol. 275, pp. 895–916, 1998.
[32] Shen M and Sali A, “Statistical Potential for Assessment and Prediction of Protein Structures,” Protein Science, vol. 15, pp. 2507–2524, 2006.
[33] Shortle D, Simons KT, and Baker D, “Clustering of Low-Energy Conformations near the Native Structures of Small Proteins,” Biophysics, vol. 95, pp. 11158–11162, 1998.
[34] Sippl M, “Knowledge-Based Potentials for Proteins,” Current Opinion in Structural Biology, vol. 5, pp. 229–235, 1995.
[35] Wang K, Fain B, Levitt M, and Samudrala R, “Improved Protein Structure Selection Using Decoy-Dependent Discriminatory Functions,” BMC Structural Biology, vol. 4, p. 8, 2004.
[36] Wang Q, Shang Y, and Xu D, “A New Clustering-Based Method for Protein Structure Selection,” Proc. Int’l Joint Conf. Neural Networks (IJCNN ‘08), pp. 2891–2898, 2008.
[37] Wang Z, Tegge AN, and Cheng J, “Evaluating the Absolute Quality of a Single Protein Model Using Support Vector Machines and Structural Features,” Proteins, vol. 75, no. 3, pp. 638–647, 2009.
[38] Wallner B and Elofsson A, “Pcons5: Combining Consensus, Structural Evaluation and Fold Recognition Scores,” Bioinformatics, vol. 21, pp. 4248–4254, 2005.
[39] Weisstein EW, “Paired t-Test,” From MathWorld-A Wolfram Web Resource, http://mathworld.wolfram.com/Pairedt-Test.html, 2011.
[40] Wu Y, Lu M, Chen M, Li J, and Ma J, “OPUS-Ca: A Knowledge-Based Potential Function Requiring Only Ca Positions,” Protein Science, vol. 16, pp. 1449–1463, 2007.
[41] Venclovas C and Margelevičius M, “Comparative Modeling in CASP6 Using Consensus Approach to Template Selection, Sequence-Structure Alignment, and Structure Assessment,” Proteins: Structure, Function, and Bioinformatics, vol. 7, pp. 99–105, 2005.
[42] Zhang Y and Skolnick J, “SPICKER: A Clustering Approach to Identify near-Native Protein Folds,” J. Computational Chemistry, vol. 25, pp. 865–871, 2004.
[43] Zhang Y and Skolnick J, “Scoring Function for Automated Assessment of Protein Structure Template Quality,” Proteins, vol. 57, pp. 702–710, June 2004.
[44] Yang Y and Zhou Y, “Ab initio Folding of Terminal Segments with Secondary Structures Reveals the Fine Difference between Two Closely Related All-Atom Statistical Energy Functions,” Protein Science, vol. 17, pp. 1212–1219, 2008.
[45] Zhang J, Wang Q, Barz B, He Z, Kosztin I, Shang Y, and Xu D, “MUFOLD: A New Solution for Protein 3D Structure Prediction,” Proteins: Structure, Function, and Bioinformatics, vol. 78, pp. 1137–1152, 2009.
[46] Zemla A, “LGA: A Method for Finding 3D Similarities in Protein Structures,” Nucleic Acids Research, vol. 31, no. 13, pp. 3370–3374, 2003.
[47] Zhou H and Zhou Y, “Distance-Scaled, Finite Ideal-Gas Reference State Improves Structure-Derived Potentials of Mean Force for Structure Selection and Stability Prediction,” Protein Science, vol. 11, pp. 2714–2726, 2002.
[48] http://prodata.swmed.edu/CASP8/evaluation/Categories.htm, 2011.
[49] http://www.graphpad.com/quickcalcs/ttest1.cfm?Format=C, 2011.
