A New Generation of Statistical Potentials for Proteins

Y Dehouck; D Gilis; M Rooman

doi:10.1529/biophysj.105.079434

Biophys J. 2006 Mar 13;90(11):4010–4017.

PMCID: PMC1459517 PMID: 16533849

Abstract

We propose a novel and flexible derivation scheme of statistical, database-derived, potentials, which allows one to take simultaneously into account specific correlations between several sequence and structure descriptors. This scheme leads to the decomposition of the total folding free energy of a protein into a sum of lower order terms, thereby giving the possibility to analyze independently each contribution and clarify its significance and importance, to avoid overcounting certain contributions, and to deal more efficiently with the limited size of the database. In addition, this derivation scheme appears as quite general, for many previously developed potentials can be expressed as particular cases of our formalism. We use this formalism as a framework to generate different residue-based energy functions, whose performances are assessed on the basis of their ability to discriminate genuine proteins from decoy models. The optimal potential is generated as a combination of several coupling terms, measuring correlations between residue types, backbone torsion angles, solvent accessibilities, relative positions along the sequence, and interresidue distances. This potential outperforms all tested residue-based potentials, and even several atom-based potentials. Its incorporation in algorithms aiming at predicting protein structure and stability should therefore substantially improve their performances.

INTRODUCTION

Somewhere between the time-consuming semiempirical force fields (1–3) and the oversimplified Gō-like potentials (4–7), statistical energy functions, extracted from databases of known protein structures, are prime tools for the in silico study of proteins (8–12). They present the advantage of being easily adaptable to any level of simplification of protein representation, and have been successfully used in many applications, ranging from structure prediction to sequence design. Though there has been a considerable increase in the number of resolved protein structures since the first approaches of this type were described, no major improvement in predictive power could be drawn from the larger size of the databases (13–15). Indeed, increasing the database size beyond a few hundred proteins appears to yield no significant advantage in the case of the simple potentials that are still very commonly used nowadays, which are based on a limited number of sequence and structure descriptors.

In the last few years, a number of more complex potentials have been designed with the aim of exploiting more efficiently the large amount of available structural data and dealing with couplings between different structural features. Among those, let us cite distance or contact potentials that depend on the solvent accessibility of the residues (16,17), on the conformation of their main chain (18), or on the relative orientation of their side chains (19–21). On the other hand, potentials describing the propensities of the different amino acid types to adopt certain backbone conformations, which simultaneously take into account the nature and/or conformation of several neighboring residues, have also been developed (16,22,23). A major difficulty that frequently arises in such studies is related to the fact that the number of proteins in the database becomes rapidly too small when increasing the complexity of a potential. One faces a delicate choice: the use of a more complex potential can be quite advantageous for common values of the sequence and structure descriptors (e.g., Ala-Ala pair associated with α-helical conformations), and pretty disastrous in other cases (e.g., Trp-Trp pair associated with some rare turn conformations). The usual answer to this dilemma consists in drastic limitations of the description of the conformational space, for example by restricting the backbone to three possible conformations, the solvent accessibility to two different bins, or by deriving contact potentials rather than distance-dependent ones.

We present here a general derivation scheme that allows one to bypass this issue, and to build statistical energy functions based simultaneously on several sequence and structure descriptors without altering the efficiency of the elementary contributions when the values taken by these descriptors are not frequent enough in the database of known protein structures. We apply our procedure to generate statistical potentials based on the correlations among amino acid types, backbone conformations, and solvent accessibilities of residues close to each other in the sequence and/or in space. The resulting energy function displays a strongly improved ability to discriminate genuine proteins from decoy models. All potentials presented in this article are freely available at http://babylone.ulb.ac.be/StatPots.

METHODS

Sequence and structure descriptors

The backbone conformation of the residue at position i, t_i, is defined by the values of the torsion angles (φ, ψ, ω). These values are grouped in seven domains corresponding to distinct regions on the Ramachandran map (22,24). The solvent accessibility of the residue at position i, a_i, is defined as the ratio of its solvent-accessible surface in the considered structure (as computed by DSSP (25)) to that in an extended tripeptide Gly-X-Gly (26). These values are grouped in five discrete domains: a_i ≤ 5%, 5% < a_i ≤ 15%, 15% < a_i ≤ 30%, 30% < a_i ≤ 50%, and 50% < a_i. The interresidue distance d_ij is computed between the average side-chain centroids, noted C^μ, of the residues at positions i and j. The C^μ corresponds to the geometric center of heavy side-chain atoms of a given amino acid type, averaged over all side-chain conformations in a data set of known structures (16). The distances d_ij between 3 Å and 8 Å are grouped into 25 bins of 0.2 Å width; two additional bins describe distances smaller than 3 Å and larger than 8 Å, respectively. Finally, the sequence descriptor s_i corresponds to the nature (1 of 20 amino acids) of the residue at position i.
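The discretization of the accessibility and distance descriptors can be sketched as follows. This is only an illustration: the function names are hypothetical, and the treatment of values falling exactly on a distance bin boundary is an assumption, since the text does not specify it.

```python
import bisect

# Upper bounds (in %) of the first four accessibility domains; the fifth is a_i > 50%.
ACCESSIBILITY_EDGES = [5.0, 15.0, 30.0, 50.0]


def accessibility_bin(a_percent: float) -> int:
    """Return 0..4 for a_i <= 5%, (5, 15], (15, 30], (30, 50], and > 50%."""
    # bisect_left places a value equal to an edge in the lower domain (a_i <= edge).
    return bisect.bisect_left(ACCESSIBILITY_EDGES, a_percent)


def distance_bin(d: float, d_min: float = 3.0, d_max: float = 8.0,
                 width: float = 0.2) -> int:
    """Return 0 for d < 3 A, 1..25 for the 0.2 A bins, and 26 for d >= 8 A.

    Assumption: each regular bin is closed on the left, open on the right.
    """
    if d < d_min:
        return 0
    if d >= d_max:
        return 26
    # The small epsilon guards against floating-point error near bin boundaries.
    return 1 + int((d - d_min) / width + 1e-9)
```

With these conventions there are 5 accessibility classes and 27 distance classes (25 regular bins plus the two open-ended ones).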

Protein structure data set

An initial set of 1522 high-resolution (≤2 Å) x-ray structures of protein chains with <20% pairwise sequence identity was extracted in October 2003 from the website “Culling the PDB by Resolution and Sequence Identity” (27) (http://dunbrack.fccc.edu/Guoli/pisces_download.php). All structures containing more than 5% heteroatoms or nonnatural residues were excluded. This led to a final set of 1403 protein chains. Furthermore, to ensure that the data set used to derive the potentials includes the proper, active, quaternary conformations of the selected proteins, the coordinates were taken from the “Protein Quaternary Structure” server (28) (http://pqs.ebi.ac.uk).

Correction for sparse data

All database-derived potentials and coupling terms presented here can be generically written as ΔW = −kT ln (n^obs/n^exp), where n^obs is the number of observations of a given association of sequence and structure descriptors in the data set of known protein structures, and n^exp is the corresponding number expected in a reference state. To deal with the limited size of the data set, a correction for sparse data (29) is applied: (n^obs/n^exp) → ((σ + n^obs)/(σ + n^exp)), where σ is an adjustable parameter, taken equal to 20 for local potentials, and 10 for distance potentials (see Results for the definition of local and distance potentials). This correction ensures that the potentials tend to 0 when the number of observations in the data set is too small.
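A minimal sketch of this correction is given below. The function name is hypothetical, and energies are expressed in units of kT (kT = 1 by default), which is an assumption made for the illustration.

```python
import math


def corrected_potential(n_obs: float, n_exp: float, sigma: float,
                        kT: float = 1.0) -> float:
    """Return -kT ln((sigma + n_obs) / (sigma + n_exp)).

    sigma is the adjustable parameter of the sparse-data correction
    (20 for local potentials and 10 for distance potentials in the text).
    """
    return -kT * math.log((sigma + n_obs) / (sigma + n_exp))


# With few observations the corrected value is pulled toward 0,
# while with many observations it approaches -kT ln(n_obs / n_exp).
sparse = corrected_potential(2, 1, sigma=20)       # close to 0
dense = corrected_potential(2000, 1000, sigma=20)  # close to -ln 2
```

In both calls the ratio of observed to expected counts is 2, but only the well-populated case yields a value near −kT ln 2.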

Decoy sets

To assess the performances of the potentials, we evaluate their ability to single out correct sequence-structure matches from sets of decoy models. Three groups of decoy sets are considered. The first group includes 25 proteins (30,31), each associated with hundreds of alternative structures generated by different modeling methods (4state_reduced (32): 1ctf, 1r69, 1sn3, 2cro, 4pti, and 4rxn; fisa (33): 1fc2-c, 1hdd-c, 2cro; fisa_casp3 (33): 1bg8-a, 1bl0, 1jwe; lattice-ssfit (31): 1ctf, 1dkt-a, 1fca, 1nlk, 1pgb, 1trl-a; lmds (34): 1ctf, 1dtk, 1fc2-c, 1igd, 1shf-a, 2cro, 2ovo). The second group includes 25 proteins (35), each associated with ∼2000 alternative structures generated by the Rosetta structure prediction method (1a32, 1ail, 1am3, 1cc5, 1cei, 1hyp, 1flb, 1mzm, 1r69, 1utg, 1ctf, 1dol, 1orc, 1pgx, 1ptq, 1tif, 1vcc, 2fxb, 5icb, 1bq9, 1csp, 1msi, 1tuc, 1vif, 5pti). The third group, noted D_seq, includes 50 proteins (1ptq, 1d0d, 2igd, 1g2b, 1orc, 1hz6, 1i27, 1hoe, 1luz, 1ugi, 1aba, 1cy5, 1lpl, 1mk0, 1h7m, 1bm8, 1l8r, 1lyq, 1o13, 1gmx, 1cew, 1hxi, 1nyc, 1by2, 1lsl, 1o7i, 1gnu, 1fc3, 1mai, 1dzo, 1lwb, 1huf, 1nwz, 3nul, 1cuo, 1jf8, 1p0z, 1mdc, 1vsr, 1gmi, 1eca, 1j9b, 1kmt, 1mzg, 1oz9, 1h6h, 1l2h, 1srv, 2hbg, 1amx), each associated with 1000 decoys obtained by maintaining the structure and randomizing the amino acid sequence with fixed amino acid composition. To render the test more challenging, only a fraction of the sequence was modified. This fraction was chosen randomly between 25% and 100%, independently for each decoy.
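One way to generate sequence decoys of the third kind is sketched below. The exact randomization procedure is not described in the text, so this is an assumption: a random subset of positions is selected and the residues at those positions are shuffled among themselves, which preserves the amino acid composition. The function name is hypothetical.

```python
import random


def sequence_decoy(sequence: str, rng: random.Random) -> str:
    """Shuffle the residues of a random 25-100% subset of positions.

    Residues are only permuted, never replaced, so the composition is fixed.
    A shuffled residue may land on an identical one, so the chosen fraction
    is an upper bound on the fraction of positions that actually change.
    """
    n = len(sequence)
    fraction = rng.uniform(0.25, 1.0)
    k = min(n, max(2, round(fraction * n)))  # at least two positions to permute
    positions = rng.sample(range(n), k)
    residues = [sequence[p] for p in positions]
    rng.shuffle(residues)
    decoy = list(sequence)
    for position, residue in zip(positions, residues):
        decoy[position] = residue
    return "".join(decoy)


rng = random.Random(0)
native = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
decoys = [sequence_decoy(native, rng) for _ in range(1000)]
```

The example sequence is arbitrary and serves only to exercise the function.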

To avoid any bias toward the native structure or wild-type sequence that might result from the presence of similar proteins in the data set, an extended jackknife procedure is applied: we remove the target protein, as well as all proteins sharing more than 20% sequence identity with the target, from the database before deriving the potentials.

Performance measures

We use five different measures to evaluate the ability of the potentials to discriminate the native structure from the decoys:

The success rate S₁ is the percentage of proteins, in each group of decoys, for which the free energy of the correct sequence-structure association is smaller than the free energies computed for all decoys.
〈Z〉 is the average Z-score over all proteins in a group of decoys. The Z-score is defined as Z = (ΔW_c − 〈ΔW〉)/σ_ΔW, where ΔW_c is the free energy of the correct sequence-structure association, 〈ΔW〉 is the average free energy of all sequence-structure associations, and σ_ΔW is the associated standard deviation. Energy functions that discriminate the genuine protein well from the decoys are characterized by a very negative Z-score.
S₋₁ is the percentage of proteins with a Z-score lower than −1 (19). This measure may be more useful than S₁ when the test is challenging, for instance when the decoys and the native structures or sequences are very similar.
〈Z^x〉 evaluates the ability of the potentials to select, among the complete decoy set, the decoys that are closest to the native. Z^x is defined as (〈ΔW〉_5% − 〈ΔW〉)/σ_ΔW, where 〈ΔW〉_5% is the average free energy computed on a subset including 5% of the decoys (19). This subset contains the decoys with the lowest root mean-square deviation from the native structure, or the decoys with the largest sequence identity with the wild-type in the case of decoys generated by sequence randomization.
The fifth measure is the percentage of proteins for which Z^x is lower than −1 (19).
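The five measures above can be sketched as follows. Several details are assumptions made for the illustration: the average and standard deviation in Z are taken over the native plus all decoys, those in Z^x over the decoys only, the population standard deviation is used, and all function names are hypothetical.

```python
from statistics import mean, pstdev


def z_score(native_energy, decoy_energies):
    """Z = (dW_c - <dW>) / sigma, computed over the native and all decoys."""
    energies = [native_energy, *decoy_energies]
    return (native_energy - mean(energies)) / pstdev(energies)


def z_x(decoy_energies, decoy_rmsds, fraction=0.05):
    """Z^x = (<dW>_5% - <dW>) / sigma for the decoys closest to the native."""
    order = sorted(range(len(decoy_energies)), key=lambda i: decoy_rmsds[i])
    n_close = max(1, round(fraction * len(order)))
    close_mean = mean(decoy_energies[i] for i in order[:n_close])
    return (close_mean - mean(decoy_energies)) / pstdev(decoy_energies)


def success_rate(cases):
    """S_1: percentage of proteins whose native energy is below every decoy."""
    hits = sum(native < min(decoys) for native, decoys in cases)
    return 100.0 * hits / len(cases)


def rate_below_minus_one(z_values):
    """Percentage of proteins with a Z (or Z^x) value lower than -1."""
    return 100.0 * sum(z < -1.0 for z in z_values) / len(z_values)
```

For decoys generated by sequence randomization, `decoy_rmsds` would be replaced by a dissimilarity such as one minus the sequence identity with the wild-type, so that the same sorting selects the closest 5%.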

RESULTS

General derivation scheme

A form commonly used for statistical potentials derived from a set of protein structures is

$$\Delta W(c_1,c_2) = -kT \ln \frac{P(c_1,c_2)}{P(c_1)\,P(c_2)} \tag{1}$$

where c₁ is an amino acid type and c₂ a structure descriptor (e.g., a torsion angle or solvent accessibility domain) of the same or a neighboring residue, and P are their relative frequencies of occurrence in the structure data set. Similarly, considering two sequence descriptors c₁ and c₂ and one structure descriptor c₃, we have

(2)

where, for example, c₁ and c₂ are amino acid types at positions i and j along the sequence and c₃ is the spatial distance between them.

This form can easily be generalized. First, c₁, c₂, and c₃ can be any sequence or structure descriptor. For example, all three can correspond to torsion angle domains, or c₁ can correspond to an amino acid type, c₂ to a solvent accessibility domain, and c₃ to a torsion angle domain. A second way to generalize this form is to consider higher order potentials involving n sequence and structure descriptors. We then get

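$$\Delta W(c_1,c_2,\ldots,c_n) = -kT \ln \frac{P(c_1,c_2,\ldots,c_n)}{\prod_{i=1}^{n} P(c_i)} \tag{3}$$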

Increasing n reduces the number of observations of each combination of the c_i's in the data set and the statistical significance of the frequencies P(c₁,c₂,…,c_n). When the number of observations is too small, the correction for sparse data (see Methods) becomes important and the potential tends to zero, leading to a complete loss of information. A straightforward solution to this problem involves decomposing the potential into different coupling terms, denoted Δ, and applying the correction for sparse data to each of them separately. In particular, for n = 3:

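$$\Delta W(c_1,c_2,c_3) = \Delta(c_1,c_2) + \Delta(c_2,c_3) + \Delta(c_1,c_3) + \Delta(c_1,c_2,c_3) \tag{4}$$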

where the n = 2 coupling terms coincide with the ordinary potentials, Δ(c₁,c₂) = ΔW(c₁,c₂), and the n = 3 coupling term is defined as

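$$\Delta(c_1,c_2,c_3) = -kT \ln \frac{P(c_1,c_2,c_3)\,P(c_1)\,P(c_2)\,P(c_3)}{P(c_1,c_2)\,P(c_2,c_3)\,P(c_1,c_3)} \tag{5}$$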

This n = 3 coupling term measures the correlation between the three sequence and structure descriptors c₁, c₂, and c₃, independently of the correlations between c₁ and c₂, c₂ and c₃, and c₃ and c₁. More generally, we can define n-potentials ΔW in terms of all k ≤ n coupling terms Δ:

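$$\Delta W(c_1,\ldots,c_n) = \sum_{k=2}^{n} \; \sum_{i_1<\cdots<i_k} \Delta(c_{i_1},\ldots,c_{i_k}) \tag{6}$$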

where the n-coupling terms describing correlations between n descriptors are defined as

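$$\Delta(c_1,\ldots,c_n) = -kT \ln \prod_{k=1}^{n} \; \prod_{i_1<\cdots<i_k} P(c_{i_1},\ldots,c_{i_k})^{(-1)^{n-k}} \tag{7}$$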

To ensure that each contribution is counted only once, the total free energy of a protein of sequence S and structure C, ΔW(C,S), is defined as the sum of the total contributions of all coupling terms of order k ≤ n:

(8)

where the third sum goes over all combinations of the (c_i₁,c_i₂,…,c_{i_k}) descriptors present in the protein. The value chosen for n depends on the structural descriptors and the level of detail that one wishes to take into account, and also on the limitations arising from the finite size of the database.
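The decomposition into coupling terms can be checked numerically on a toy joint distribution. The sketch below assumes that an n-potential is defined relative to a reference state in which all descriptors are independent, and it obtains each coupling term by subtracting all lower-order coupling terms from the potential. The function names, the dictionary representation of the joint distribution, and kT = 1 are illustrative choices.

```python
import itertools
import math
import random


def random_joint(sizes, rng):
    """Random joint probability table over descriptors with the given sizes."""
    states = list(itertools.product(*(range(s) for s in sizes)))
    weights = [rng.random() + 0.1 for _ in states]
    total = sum(weights)
    return {s: w / total for s, w in zip(states, weights)}


def prob(joint, axes, state):
    """Marginal probability that the descriptors in `axes` take the values in `state`."""
    return sum(p for s, p in joint.items() if all(s[a] == state[a] for a in axes))


def potential(joint, axes, state, kT=1.0):
    """-kT ln [ P(c_axes) / prod_i P(c_i) ] (independent reference state)."""
    independent = math.prod(prob(joint, (a,), state) for a in axes)
    return -kT * math.log(prob(joint, axes, state) / independent)


def coupling(joint, axes, state, kT=1.0):
    """Potential minus all lower-order coupling terms among the same descriptors."""
    lower = sum(
        coupling(joint, subset, state, kT)
        for k in range(2, len(axes))
        for subset in itertools.combinations(axes, k)
    )
    return potential(joint, axes, state, kT) - lower


rng = random.Random(0)
joint = random_joint((3, 2, 4), rng)
state = (1, 0, 2)
p = lambda axes: prob(joint, axes, state)
# Closed form of the n = 3 coupling term written with probabilities only.
closed_form = -math.log(
    p((0, 1, 2)) * p((0,)) * p((1,)) * p((2,))
    / (p((0, 1)) * p((1, 2)) * p((0, 2)))
)
```

Under these assumptions, `coupling(joint, (0, 1, 2), state)` agrees with `closed_form` up to rounding, and for two descriptors the coupling term reduces to the ordinary pair potential.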

Note that it is not always necessary or advantageous to fully decompose the potential functions as in Eqs. 4 and 6. In particular, the coupling terms of the type Δ(s₁,s₂), with s₁ and s₂ being single residues, may reasonably be overlooked. For example, a relevant and commonly used distance potential ΔW′(s₁,s₂,d₁₂) may be defined as

$$\Delta W'(s_1,s_2,d_{12}) = -kT \ln \frac{P(s_1,s_2,d_{12})}{P(s_1,s_2)\,P(d_{12})} \tag{9}$$

More generally, we denote by ΔW′ potentials comprising only some of the couplings included in ΔW.

Local potentials and couplings

A first application of our general derivation scheme consists in defining local potentials reflecting the correlations among characteristics of residues that are close to each other along the sequence. We focus here on three different residue characteristics: its type s, its backbone conformation t, and its solvent accessibility a (see Methods).

Among the local n = 2 coupling terms of the type Δ(c₁,c₂) defined in Eqs. 1 and 7, let us consider first Δ_ts(t_i,s_j), where c₁ is taken to be the backbone conformation of the residue at position i (t_i) and c₂ the type of the residue at position j (s_j). We assume that this effective energy depends only on the relative positions of the residues along the sequence (i–j), and not on the precise positions i and j. The total free energy of a given sequence S in a structure C, according to this potential, is computed by summing Δ_ts(t_i,s_j) over all pairs of positions i and j in S that satisfy the condition |i–j| ≤ F_LOC, where F_LOC is an adjustable parameter taken here equal to 2. This energy function is similar to previously described backbone torsion potentials (16,22,23,36). We also compute all other n = 2 coupling terms (except Δ_ss(s_i,s_j), which depends only on the sequence), i.e., Δ_as(a_i,s_j), Δ_at(a_i,t_j), Δ_aa(a_i,a_j), and Δ_tt(t_i,t_j). Note that when c₁ and c₂ correspond to the same structure or sequence descriptor, the condition |i–j| ≤ F_LOC becomes 1 ≤ i–j ≤ F_LOC.
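The summation of the torsion-sequence coupling over a protein can be sketched as follows. The lookup table keyed by (relative position, torsion domain, residue type) is a hypothetical data layout chosen for the illustration, and combinations absent from the table are assumed to contribute zero.

```python
def local_ts_energy(torsions, residues, delta_ts, f_loc=2):
    """Sum the t-s coupling over all position pairs with |i - j| <= f_loc.

    torsions[i] is the torsion domain of residue i, residues[j] the amino
    acid type of residue j. The pair i == j is included, since it satisfies
    the condition |i - j| <= f_loc.
    """
    n = len(residues)
    total = 0.0
    for i in range(n):
        for j in range(max(0, i - f_loc), min(n, i + f_loc + 1)):
            # The coupling depends only on the relative position j - i.
            total += delta_ts.get((j - i, torsions[i], residues[j]), 0.0)
    return total


# Toy table: one favorable entry for torsion domain "A" with Gly at the same position.
toy_table = {(0, "A", "G"): -1.0}
toy_energy = local_ts_energy(["A", "A"], ["G", "G"], toy_table)
```

For couplings between two descriptors of the same kind, the inner loop would instead run over 1 ≤ i–j ≤ F_LOC so that each pair is counted once.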

We would like to stress that summing the energy contributions of all pairs (c₁,c₂) yields only an approximation of the total free energy of a protein. Indeed, the contributions Δ_ts(t_i,s_j) and Δ_ts(t_i,s_k) are in general not independent. Moreover, using simultaneously Δ_ts(t_i,s_j) and Δ_as(a_i,s_j) can be advantageous but introduces some redundancy since the solvent accessibility of a residue is related to its backbone conformation. To overcome these dependencies, we must add the n = 3 coupling terms Δ_tts(t_i,t_j,s_k), Δ_tss(t_i,s_j,s_k), Δ_ttt(t_i,t_j,t_k), Δ_aas(a_i,a_j,s_k), Δ_ass(a_i,s_j,s_k), Δ_aaa(a_i,a_j,a_k), Δ_aat(a_i,a_j,t_k), Δ_att(a_i,t_j,t_k), and Δ_ats(a_i,t_j,s_k). They are defined on the basis of Eq. 5 so as to be additive to, and exclusive of, the lower order coupling terms (Eq. 4). The interdependence of the different n = 3 coupling terms can, in turn, be corrected by the use of n = 4 coupling terms.

We assessed the predictive power of the different n = (2,3,4) coupling terms, independently and in combination, on the three groups of decoy sets described in Methods. The performance measures obtained are given in Table 1 for the basic potentials Δ_ts and Δ_as and for the most efficient linear combination of the local coupling terms, named ΔW′_LOC:

(10)

Overall, the predictive power of ΔW′_LOC is quite impressive: each performance measure indicates a markedly better discrimination of the correct sequence-structure association than with the basic potentials. The only exception is the percentage of proteins with Z^x lower than −1, which slightly decreases in the D_seq set (98%, against 100% for Δ_ts).

TABLE 1.

Performances of local and distance potentials and couplings

Decoy group	Potential	〈Z〉	S₁	S₋₁	〈Z^x〉	Proteins with Z^x < −1
First group	Δ_ts	−2.69	40%	80%	−0.34	4%
	Δ_as	−2.40	44%	80%	−0.45	16%
	Δ_ts + Δ_as	−3.44	64%	88%	−0.53	24%
	ΔW′_LOC	−4.16	76%	92%	−0.57	28%
	Δ_sd + Δ_sds	−3.27	72%	84%	−0.66	28%
	ΔW′_DIST	−4.65	80%	88%	−0.73	28%
	ΔW′_LOC + ΔW′_DIST	−5.25	84%	88%	−0.79	36%
Second group (Rosetta)	Δ_ts	−1.45	8%	68%	−0.27	0%
	Δ_as	−0.60	0%	44%	−0.26	0%
	Δ_ts + Δ_as	−1.84	20%	72%	−0.41	0%
	ΔW′_LOC	−2.06	20%	88%	−0.49	12%
	Δ_sd + Δ_sds	−1.80	16%	76%	−0.33	0%
	ΔW′_DIST	−2.32	28%	88%	−0.50	12%
	ΔW′_LOC + ΔW′_DIST	−2.65	36%	92%	−0.59	24%
D_seq	Δ_ts	−2.21	22%	100%	−1.54	100%
	Δ_as	−2.29	50%	100%	−1.58	96%
	Δ_ts + Δ_as	−2.22	26%	100%	−1.54	100%
	ΔW′_LOC	−2.57	80%	100%	−1.71	98%
	Δ_sd + Δ_sds	−2.75	64%	100%	−1.90	100%
	ΔW′_DIST	−2.64	48%	100%	−1.81	100%
	ΔW′_LOC + ΔW′_DIST	−2.74	84%	100%	−1.87	100%

The predictive power of the basic potentials and of the different combinations of coupling terms is evaluated on three groups of decoy sets, with five different measures (see Methods). The sequence-independent terms are not taken into account when D_seq is considered.

Strikingly, ΔW′_LOC includes almost all n = 2 and n = 3 coupling terms. The only exception is Δ_aa, which systematically drags down the predictive power when included in a combination of coupling terms. This follows from the fact that Δ_aa strongly favors situations in which residues close to each other in the sequence have similar solvent accessibilities, and therefore awards very negative energies to (partially) unfolded proteins. The best combination also incorporates several n = 4 coupling terms: Δ_ttts, Δ_aaas, Δ_attt, Δ_aatt, and Δ_aaat. The other n = 4 coupling terms have a negative impact on the predictive power. This is most probably due to the limited size of the data set, which does not allow one to compute precisely enough the probabilities of observing simultaneously four sequence and/or structure descriptors. Also note that there are 20 types of sequence elements (s), but only 7 torsion (t) and 5 accessibility (a) domains. Coupling terms involving several sequence elements, such as Δ_tsss or Δ_asss, do not appear in ΔW′_LOC as they require larger data sets to extract reliable statistics.

In principle, our derivation scheme does not give any reason to under- or overweight some coupling terms with respect to others. However, some contributions may be of little or no relevance and should therefore not be included, for example because of the limited size of the data set (e.g., Δ_tsss, Δ_asss,…), the overstabilization of the unfolded state (e.g., Δ_aa), or the uselessness of purely sequence terms (e.g., Δ_ss). Furthermore, sequence-independent terms can be expected to yield interesting results when discriminating among nonprotein-like structures, and to be quite useless in applications such as threading experiments. Testing the potentials on decoy sets can reasonably be considered as an intermediate case, which probably explains why we observed that underweighting these contributions by a ½ factor, in Eq. 10, is advantageous in terms of predictive power.

Distance potentials and couplings

A very popular category of statistical potentials is derived from the spatial distance distribution between residue types (e.g., 16,17,29,37). They are complementary to the local potentials presented above. It has been previously noted that such potentials do not represent the “true” energy of interaction between two residues (or two atoms) as if they were in a vacuum, but rather an effective energy including the influence of a mean protein and solvent environment (38,39). As a consequence, these potentials may depend on some characteristics of the proteins from which they are derived, such as their size (40–42) or their content in secondary structures (14,42–44). The idea of being more precise on the definition of the environment that is actually “felt” by the two interacting residues is not new (16–18), and can have a positive impact on the performances of the potentials. We show that the formalism presented in this article can be applied to define residue pair distance potentials that take appropriately into account the influence of the specific environment in which the two residues are located. This environment is here represented by backbone conformations and solvent accessibilities.

The n = 2 coupling term Δ_sd(s_i,d_ij) is a “one-body” distance potential that reflects the preferences of each type of residue to be located more or less close to other residues, whatever their type, and is therefore dominated by the hydrophobic effect. For residues close to each other along the sequence, i.e., |i–j| ≤ F_DIS (taken here equal to 8), the frequencies and potentials are computed separately, whereas they are merged in a single class when |i–j| > F_DIS. The total contribution to the free energy of a given sequence S in a structure C is computed by summing Δ_sd(s_i,d_ij) over all pairs of positions i and j in S that satisfy the condition |i–j| > 1.
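The handling of sequence separation and the summation of this one-body term can be sketched as follows. The table layout, keyed by (separation class, residue type, distance bin), and the precomputed matrix of distance bins are hypothetical choices made for the illustration.

```python
def separation_class(i, j, f_dis=8):
    """Each separation up to f_dis has its own class; larger ones are merged."""
    separation = abs(i - j)
    return separation if separation <= f_dis else f_dis + 1


def one_body_distance_energy(residues, dist_bins, delta_sd, f_dis=8):
    """Sum the s-d coupling over all ordered pairs (i, j) with |i - j| > 1.

    dist_bins[i][j] is the discretized distance between residues i and j.
    Both (i, j) and (j, i) contribute, because the term depends on the
    type of residue i only.
    """
    n = len(residues)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if abs(i - j) > 1:
                key = (separation_class(i, j, f_dis), residues[i], dist_bins[i][j])
                total += delta_sd.get(key, 0.0)
    return total


toy_bins = [[1] * 3 for _ in range(3)]
toy_energy_sd = one_body_distance_energy(["A", "G", "A"], toy_bins, {(2, "A", 1): -0.5})
```

In the three-residue toy example only the pairs (0, 2) and (2, 0) satisfy the condition, and each contributes −0.5.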

On its own, Δ_sds(s_i,d_ij,s_j) is a two-body distance potential that excludes the one-body contributions reflecting the individual preferences of the two amino acids s_i and s_j. Such a potential has been presented previously and shown to describe more accurately the electrostatic interactions (42). In this case, by reason of symmetry, the condition |i–j| > 1 becomes i–j > 1 when computing the total free energy of a protein. Coupling Δ_sd(s_i,d_ij) with Δ_sds(s_i,d_ij,s_j) yields the common distance potential given in Eq. 9.

In a similar way, it is possible to define sequence-independent distance potentials involving the backbone torsion angles, Δ_td and Δ_tdt, or the solvent accessibilities, Δ_ad and Δ_ada. The concomitant use of these three types of potentials is hazardous since the backbone conformation and solvent accessibility of a residue are clearly dependent on its amino acid type, and some contributions are therefore overcounted. To deal with this problem, we have to define higher order coupling terms. The highest order coupling term is in this case the n = 7 term Δ_atsdats(a_i,t_i,s_i,d_ij,a_j,t_j,s_j). Considering all the lower level coupling terms would lead to a very large number of energetic functions and hamper any intuitive understanding of their significance. Among these, we choose to disregard all distance-independent terms, as they are redundant with the local potentials defined in the previous section for |i–j| ≤ F_LOC, and the contributions for other i and j may reasonably be assumed to be negligible. Moreover, to avoid overloading the notations, two-body asymmetrical terms, such as Δ_ads(a_i,d_ij,s_j) or Δ_asds(a_i,s_i,d_ij,s_j), are not considered independently but grouped with the closest symmetrical coupling term, here Δ_asdas(a_i,s_i,d_ij,a_j,s_j). We thus define ΔŴ_asdas(a_i,s_i,d_ij,a_j,s_j) as the sum of Δ_asdas(a_i,s_i,d_ij,a_j,s_j) and all the lower order asymmetrical two-body terms. Note finally that, given the limited size of the database, Δ_atsd(a_i,t_i,s_i,d_ij) and ΔŴ_atsdats(a_i,t_i,s_i,d_ij,a_j,t_j,s_j) are computed as contact potentials, where d_ij takes only two possible values: lower or larger than 8 Å.

Overall, according to our performance test on the three groups of decoy sets, the best combination of distance potentials and coupling terms is ΔW′_DIST, defined as

(11)

where the terms Δ_ad and Δ_asd are only included for short-range interactions (SR), that is, when the considered residues are separated by no more than F_DIS positions along the sequence. As shown in Table 1, the improvement of the predictive power with respect to the basic distance potential Δ_sd + Δ_sds is substantial in the two groups of decoy sets based on structural modifications. However, it appears that Δ_sd + Δ_sds performs slightly better than ΔW′_DIST in the third group, D_seq. Since these decoys are obtained by modifications of the sequence, the sequence-independent terms (Δ_td, Δ_tdt, Δ_ad,…) are not taken into account in the evaluation of the energies, which may limit the necessity of using coupling terms such as Δ_tsd, Δ_asd, or Δ_tsdts.

Interestingly, as with ΔW′_LOC, almost all coupling terms are included in the best performing combination, ΔW′_DIST. This strongly supports the legitimacy of our derivation procedure. The only exceptions are Δ_ada and ΔŴ_atsdats. The former strongly favors situations where residues close in space have similar solvent accessibilities, which is a characteristic of both folded and unfolded states. The relevance of the latter is obviously compromised by the limited size of the data set. On the other hand, the terms Δ_ad and Δ_asd are only included for short-range interactions. Indeed, for long-range interactions, the separation in sequence is not explicitly taken into account, and Δ_ad merely reflects a trivial correlation: residues with a higher solvent accessibility have fewer contacts with other residues. For those residue pairs that do not benefit from the Δ_ad term, it also appears that Δ_asd is unnecessary, as its aim is to uncouple Δ_ad and Δ_sd.

Combination of local and distance potentials

The combination of the best performing local and distance potentials, ΔW′_LOC and ΔW′_DIST, improves their individual scores, as seen in Table 1. We did not address explicitly the issue of possible redundancies between these two types of potentials. However, in itself, the use of distance coupling terms significantly limits this problem. For example, a relatively strong correlation is observed between Δ_as and Δ_sd, but Δ_as and (Δ_sd + Δ_asd) are only weakly correlated. Overall, the performances of the combination ΔW′_LOC + ΔW′_DIST are very impressive, as exemplified by average Z-scores of −5.25, −2.65, and −2.74 on the three groups of decoy sets.

Comparison with other statistical potentials

A large number of knowledge-based potentials reflecting the preferences of the different amino acids (or of short stretches of amino acids) to adopt particular local conformations (16,22,23,36), to be more or less accessible to the solvent (16,17,45,46), or to be separated by a given spatial distance (16,17,29,30,37) have been described in the literature. However, to our knowledge, our approach is the first to integrate all these different types of contributions in a single energetic function while taking special care of their couplings. Moreover, on the local level, the nonadditivity of contributions related to pairs of residues, such as Δ_ts(t_i,s_j) and Δ_ts(t_i,s_k), is taken care of by the use of higher order coupling terms (Δ_tss(t_i,s_j,s_k), Δ_tts(t_i,t_j,s_k),…).

Among the local potentials based on backbone torsion angles that have been described earlier, let us cite the residue-to-torsion (22) and the torsion-to-residue (16) potentials, developed by one of us. As seen in Table 2, (a) and (b), both potentials can be expressed as simple combinations of the coupling terms Δ_ts, Δ_tss, and Δ_tts. Miyazawa and Jernigan designed a more complex torsion potential (23), based on a reference state that is quite different from ours and on different values of the structural descriptors. A rigorous comparison of the two approaches is therefore difficult. However, a common feature is the expression of the energetic function as a sum of basic potentials and of higher order coupling terms defined so as to exclude the more basic contributions. In this sense, their potential can be compared to the combination of coupling terms given in Table 2 (c).

TABLE 2.

Correspondence with other statistical potentials

	Potential	Corresponding combination of our coupling terms
(a)	Residue-to-torsion (22)	Δ_ts + Δ_tss/(2 F_LOC + 1)
(b)	Torsion-to-residue (16)	Δ_ts + Δ_tts/(2 F_LOC + 1)
(c)	E^sec (23)	Δ_ts + Δ_tt +Δ_tts + Δ_ttt + Δ_ttts
(d)	C^μ-C^μ core/surface (16)	Δ_sd + Δ_asd + Δ_sds + ΔŴ_asdas
(e)	−log (P(sequence|structure)/P(sequence)) (17)	Δ_as + Δ_ass + Δ_asas + Δ_sds + Δ_asds + Δ_asdas
(f)	ERCE (18)	Δ_sd + Δ_td + Δ_tsd + Δ_sds + Δ_tdt + ΔŴ_tsdts
(g)	Distance potentials (only α- or only β-subsets) (14,42–44)	Δ_sd + Δ_tsd + Δ_sds + ΔŴ_tsdts

The generality of our approach is demonstrated by the fact that several previously described potentials can be expressed as a linear combination of some of our coupling terms. Note that, in some cases, the values taken by the structural descriptors and the formalism used to define the reference state are quite different from ours. As a consequence, these potentials are generally not identical to the corresponding combination of our coupling terms, but rather describe the same contributions in a slightly different way.

Most commonly used distance and contact potentials (16,29,37) can be written as a simple sum of Δ coupling terms, as described in Eq. 9, sometimes with a different reference state. In addition, more sophisticated distance potentials that take into account the solvent accessibilities or the conformations of the residues also appear as particular cases of our formalism. A first example is the “C^μ-C^μ core/surface” potential of Kocher et al. (16), which is derived separately for residue pairs that are buried or on the surface of the protein (Table 2 (d)). Along the same lines, the energy function presented by Simons et al. (17) is composed of an environment term, comparable to Δ_as (with F_LOC = 0), and a pair term that is based on the spatial distance separating two residues in specific environments and is designed to avoid redundancy with the environment term. This energy function is equivalent to the combination given in Table 2 (e), where Δ_ass and Δ_asas are distance-independent contributions included in the distance potential; they do not correspond to local potentials, since the sequence separation i–j is not taken into account. Furthermore, Zhang and Kim estimated contact energies between residue pairs as a function of the conformations of their main chain (ERCE: environment-dependent residue contact energies) (18). To do this, they combined the 20 amino acid types with 3 structural states (α-helix, β-sheet, and turn) to define an extended 60-residue alphabet. This approach can easily be translated into a combination of Δ coupling terms, as described in Table 2 (f). Finally, several authors derived distance potentials from data sets containing only α- or only β-proteins (14,42–44). The basic potential defined in Eq. 9, when derived separately on a subset of the database (α- or β-proteins), becomes −kT ln[P(s_i,s_j,d_ij|t_i,t_j) / (P(s_i,s_j|t_i,t_j) P(d_ij|t_i,t_j))], where (t_i,t_j) refers to the global secondary structure content of the protein. With such a definition, this distance potential is equivalent to the combination given in Table 2 (g).
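As an illustration of the class-conditional form above, the following minimal sketch estimates −kT ln[P(s_i,s_j,d_ij|t) / (P(s_i,s_j|t) P(d_ij|t))] from raw counts, separately for each structural class t. It is not the derivation used in this work: the function name, the toy observations, the use of a single class label t in place of (t_i,t_j), and the absence of any correction for sparse data are all assumptions made for the example.

```python
import math
from collections import Counter

KT = 1.0  # energies in units of kT (hypothetical choice for this sketch)


def class_conditional_distance_potential(observations, kT=KT):
    """Return {(t, s_i, s_j, d): energy} from (t, s_i, s_j, d) observations.

    t is the structural class of the protein (e.g. "alpha" or "beta") and
    stands in for (t_i, t_j) in the text; d is a distance-bin index.
    No sparse-data correction is applied here.
    """
    n_t = Counter()    # residue pairs observed per class
    n_ss = Counter()   # (class, residue pair) counts
    n_d = Counter()    # (class, distance bin) counts
    n_ssd = Counter()  # (class, residue pair, distance bin) counts
    for t, s_i, s_j, d in observations:
        n_t[t] += 1
        n_ss[t, s_i, s_j] += 1
        n_d[t, d] += 1
        n_ssd[t, s_i, s_j, d] += 1

    potential = {}
    for (t, s_i, s_j, d), n in n_ssd.items():
        # P(s_i,s_j,d|t) / (P(s_i,s_j|t) * P(d|t)), written with raw counts.
        ratio = n * n_t[t] / (n_ss[t, s_i, s_j] * n_d[t, d])
        potential[t, s_i, s_j, d] = -kT * math.log(ratio)
    return potential


# Toy data: in "alpha" proteins, A-L pairs favor bin 0 and G-G pairs favor bin 1.
toy = (
    [("alpha", "A", "L", 0)] * 3 + [("alpha", "A", "L", 1)] * 1
    + [("alpha", "G", "G", 0)] * 1 + [("alpha", "G", "G", 1)] * 3
)
potential = class_conditional_distance_potential(toy)
print(potential["alpha", "A", "L", 0])  # negative: over-represented combination
```

Because every probability is conditioned on the class, the same residue pair at the same distance can receive different energies in α- and β-proteins, which is what makes this form equivalent to a combination of coupling terms involving t.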

Regarding the gain in performance provided by our new derivation scheme, the results summarized in Table 1 are unambiguous: ΔW′_LOC, ΔW′_DIST, and especially ΔW′_LOC + ΔW′_DIST are superior to common distance and local potentials such as Δ_sd + Δ_sds, Δ_as, and Δ_ts. This comparison can be considered fair, given that all these potentials are derived from the same data set, using the same type of reference state, structural descriptors, and adjustable parameters. Another way to assess the performance of the potentials is to look at previously published tests on the same groups of decoy sets. This comparison nevertheless has the drawback that the effects of the derivation scheme, reference state, and other parameters are mixed.

Several potentials have been tested on one group of decoy sets (30,47); the results are summarized in Table 3. According to this test, our distance potential ΔW′_DIST is clearly superior to every other residue-based distance or contact potential given in Table 3, as indicated by all available measures except S₋₁ in the case of TE-13 and DFIRE-B. This difference is even more manifest when we consider the combination ΔW′_LOC + ΔW′_DIST. Table 3 also suggests that atom-based potentials perform better on average than potentials considering only one interaction center per residue. Even so, the residue-based combination ΔW′_DIST appears markedly more efficient than the RAPDF and KBP potentials. The good performance of the potentials DFIRE-A and DFIRE-B seems to result from the use of a particular reference state, defined in such a way that the effective energy associated with a pair of atoms (or residues) tends to zero when the distance separating them approaches 15 Å (47). Let us also note that another statistical potential, based on a detailed (atomic) representation of protein structures and designed to describe H-bonds as precisely as possible, has recently been tested on a group of decoy sets (19). The results were slightly better than with our potentials (〈Z〉 = −3.34 and S₋₁ = 92%, whereas 〈Z〉 = −2.65 and S₋₁ = 92% are obtained with ΔW′_LOC + ΔW′_DIST). It is not surprising that better predictive capabilities can be obtained with potentials based on a more detailed structural representation, but it should be stressed that a higher level of detail inevitably restricts the range of possible applications.

TABLE 3.

Comparison with the performances of other statistical potentials

Category	Potential	〈Z〉	S₁	S₋₁
Our potentials (residue-based)	ΔW′_LOC	−4.16	76%	92%
	ΔW′_DIST	−4.65	80%	88%
	ΔW′_LOC + ΔW′_DIST	−5.25	84%	88%
Other distance or contact potentials (residue-based)	TE-13 (30)	−3.53	56%	100%
	MJ (13,30)	−2.82	44%	88%
	GKS (30,43)	−2.36	36%	80%
	BT (30,48)	−2.65	36%	84%
	HL (30,49)	−2.67	32%	88%
	BJ (30,37)	−2.75	60%	76%
	DFIRE-B (47)	−4.21	76%	96%
Other distance potentials (atom-based)	RAPDF (47,50)	−3.18	72%	84%
	KBP (47,51)	−2.91	60%	84%
	DFIRE-A (47)	−4.84	92%	92%


Results were obtained on the group of decoy sets of (30,47); data concerning the potentials derived by other groups were taken from the literature. TE-13, MJ, GKS, BT, HL, BJ (initials correspond to the authors' names), and DFIRE-B (distance-scaled, finite ideal-gas reference state) are contact or distance potentials between pairs of residues. RAPDF (residue-specific all-atom conditional probability discriminatory function), KBP (knowledge-based mean force interaction potential), and DFIRE-A are distance potentials between pairs of atoms (167 atom types are considered, according to the type of the residue to which the atom belongs).
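For readers who wish to reproduce this kind of comparison, the sketch below computes a Z-score for the native structure within one decoy set and averages it over several sets, as a stand-in for 〈Z〉. It assumes the usual convention, Z = (E_native − mean of decoy energies) / standard deviation of decoy energies, with the population standard deviation; the exact definitions of 〈Z〉, S₁, and S₋₁ are those given in Methods and are not reproduced here. The function names and the toy energies are hypothetical.

```python
import math
import statistics


def z_score(native_energy, decoy_energies):
    """Z-score of the native structure relative to the decoys of one set.

    Assumption: the population standard deviation of the decoy energies is
    used; a more negative value means better discrimination of the native.
    """
    mean = statistics.mean(decoy_energies)
    std = statistics.pstdev(decoy_energies)
    return (native_energy - mean) / std


def mean_z(decoy_sets):
    """Average Z-score over (native_energy, decoy_energies) pairs."""
    return statistics.mean(z_score(native, decoys) for native, decoys in decoy_sets)


# Toy energies (arbitrary units), one tuple per decoy set.
toy_sets = [
    (-10.0, [-4.0, -6.0, -8.0, -2.0]),
    (-3.0, [-1.0, -2.0, 0.0, 1.0]),
]
print(mean_z(toy_sets))
```

Averaging Z-scores over decoy sets gives every set the same weight regardless of how many decoys it contains, which is one reasonable choice among several.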

DISCUSSION

The most exciting result of this study is the definition of a general derivation scheme that allows one to define statistical potentials taking into account the interdependence of correlations among several different sequence or structure descriptors. To demonstrate its usefulness, we applied this formalism and generated combinations of local and distance potentials that perform strikingly well in discriminating genuine proteins from decoy models.

Our derivation scheme is mainly based on the decomposition of a complex potential into a sum of lower order terms, through the expression of products of probabilities. This decomposition makes it possible to analyze each contribution independently and to clarify its significance and importance. It also offers several valuable advantages in terms of predictive power. First, depending on the choice of sequence and structure descriptors, the decomposition may be absolutely necessary to avoid overcounting certain contributions. To clarify this point, let us focus on the correlations between one residue type, s, and two backbone conformations, t. In this particular case, the correct contribution to the total free energy of a protein is given by Eq. 8: Δ_tts(C,S) = Σ_i,j Δ_ts(t_i,s_j) + Σ_i,j Δ_tt(t_i,t_j) + Σ_i,j,k Δ_tts(t_i,t_j,s_k). In contrast, if the potential function Δ_tts(t_i,t_j,s_k) were not decomposed and were summed over all triplets of positions (i,j,k), each Δ_ts and Δ_tt contribution would be counted several times.
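The overcounting argument can be checked numerically on a toy example. The sketch below assumes that the undecomposed three-descriptor potential has the form −kT ln[P(t_i,t_j,s_k) / (P(t_i) P(t_j) P(s_k))]; under that assumption it splits exactly into three pairwise terms plus one genuine three-body term. The definitions actually used in this work are those of Eqs. 3 and 8, which are not reproduced here, and the random joint distribution, the function names, and the chain length N are hypothetical.

```python
import itertools
import math
import random

T_STATES = ("helix", "strand")
RESIDUES = ("A", "G", "L")

# Toy joint distribution P(t_i, t_j, s_k): random positive weights, normalized.
rng = random.Random(0)
raw = {key: rng.uniform(0.5, 1.5)
       for key in itertools.product(T_STATES, T_STATES, RESIDUES)}
total = sum(raw.values())
joint = {key: w / total for key, w in raw.items()}


def marginal(axes, key):
    """Probability of the values key[a] for a in axes, summed over the rest."""
    return sum(p for k, p in joint.items()
               if all(k[a] == key[a] for a in axes))


def decompose(key, kT=1.0):
    """Return (undecomposed energy, pairwise terms, three-body term)."""
    p1 = [marginal((a,), key) for a in range(3)]
    pairs = [(0, 1), (0, 2), (1, 2)]
    p2 = {ab: marginal(ab, key) for ab in pairs}
    two_body = {ab: -kT * math.log(p2[ab] / (p1[ab[0]] * p1[ab[1]]))
                for ab in pairs}
    three_body = -kT * math.log(
        joint[key] * p1[0] * p1[1] * p1[2] / (p2[0, 1] * p2[0, 2] * p2[1, 2]))
    full = -kT * math.log(joint[key] / (p1[0] * p1[1] * p1[2]))
    return full, two_body, three_body


full, two_body, three_body = decompose(("helix", "strand", "L"))
residual = full - (sum(two_body.values()) + three_body)  # zero up to rounding

# Summing the undecomposed term over all triplets (i, j, k) of N positions
# revisits the same (i, k) pair once for every choice of j.
N = 5
times_pair_seen = sum(1 for i, j, k in itertools.product(range(N), repeat=3)
                      if (i, k) == (0, 1))
print(residual, times_pair_seen)
```

Since the pairwise terms are hidden inside the undecomposed energy, summing that energy over triplets adds each pairwise contribution once per third position, whereas the decomposed sum counts it once.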

Second, the decomposition we propose deals much more efficiently with the limited size of the database, since the correction for sparse data (see Methods) is applied to each coupling term rather than to the whole energy function. For example, the distance potential ΔW_atsdats(a_i,t_i,s_i,d_ij,a_j,t_j,s_j) can be expressed as a sum of many n-coupling terms, ranging from n = 2 to n = 7, or computed directly from Eq. 3. If the database is large enough, these two possibilities are equivalent. But if the number of observations of a given combination of values of (a_i,t_i,s_i,d_ij,a_j,t_j,s_j) is too small, the correction for sparse data will make the coupling term Δ_atsdats(a_i,t_i,s_i,d_ij,a_j,t_j,s_j) tend to zero, but not ΔW_atsdats(a_i,t_i,s_i,d_ij,a_j,t_j,s_j), unless the latter is computed directly through Eq. 3. In that case, the fact that the database is too small to reliably extract the higher order couplings leads to a considerable loss of valuable information about the lower order contributions. Finally, the decomposition makes it possible to modulate the reference state by excluding some contributions (such as Δ_aa, Δ_ada,…) that do not appear to be relevant and decrease the overall predictive power.
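The effect of applying the sparse-data correction term by term can be illustrated with a toy calculation. The mixing rule used below, which pulls an observed probability ratio toward 1 when the number of observations is small, is only an assumption in the spirit of commonly used corrections; it is not the correction defined in Methods. The weight SIGMA, the ratios, and the counts are hypothetical.

```python
import math

SIGMA = 0.02  # hypothetical weight given to each observation


def corrected_energy(ratio, n_obs, sigma=SIGMA, kT=1.0):
    """Energy from an observed/expected ratio, damped when n_obs is small.

    With few observations the corrected ratio tends to 1, so the energy
    tends to zero; with many observations the raw ratio is recovered.
    """
    weight = n_obs * sigma / (1.0 + n_obs * sigma)
    return -kT * math.log((1.0 - weight) + weight * ratio)


# A well-sampled low-order coupling and a poorly sampled high-order coupling.
low_ratio, low_n = 2.0, 1000
high_ratio, high_n = 5.0, 2

# Decomposed: each coupling term is corrected with its own number of observations.
decomposed = (corrected_energy(low_ratio, low_n)
              + corrected_energy(high_ratio, high_n))

# Direct: the full ratio is the product of both, but the full combination of
# descriptor values is observed only as often as the rarest term.
direct = corrected_energy(low_ratio * high_ratio, high_n)

print(decomposed, direct)
```

In this toy case the decomposed energy keeps almost all of the well-sampled low-order contribution, whereas the directly computed energy is damped as a whole because the complete combination of descriptor values is rare.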

The comparison with other potentials described in the literature underlines the generality of our approach, for previous potentials based on several sequence or structure descriptors can be expressed as particular cases of our formalism. This comparison also shows that we significantly raised the expectations regarding the predictive power of residue-based potentials. Indeed, our energy functions even outperform some potentials that are based on a more detailed representation of protein structures at the atomic level.

Several improvements may still be envisaged. Indeed, our derivation scheme can easily be adapted to develop energy functions dealing with a more detailed representation of protein structures, or based on another, possibly more relevant, reference state. It is also straightforward to include additional structural descriptors, reflecting, for example, the relative orientations of interacting side chains or the relative positions of triplets of residues.

Acknowledgments

We thank J. M. Kwasigroch for his help with computers and web servers, and acknowledge support from the Communauté Française de Belgique through the Action de Recherche Concertée No. 02/07-289, and from the European Community through the Concerted Action Quality of Life 2001-3-8.4.

M.R. is research director at the Belgian National Fund for Scientific Research.

References

1. Brooks, B. R., R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan, and M. Karplus. 1983. CHARMM: a program for macromolecular energy, minimization, and dynamics calculations. J. Comput. Chem. 4:187–217.
2. Halgren, T. A. 1995. Potential energy functions. Curr. Opin. Struct. Biol. 5:205–210.
3. Mackerell, A. D., Jr. 2004. Empirical force fields for biological macromolecules: overview and issues. J. Comput. Chem. 25:1584–1604.
4. Gō, N. 1983. Theoretical studies of protein folding. Annu. Rev. Biophys. Bioeng. 12:183–210.
5. Galzitskaya, O. V., and A. V. Finkelstein. 1999. A theoretical search for folding/unfolding nuclei in three-dimensional protein structures. Proc. Natl. Acad. Sci. USA. 96:11299–11304.
6. Alm, E., and D. Baker. 1999. Prediction of protein-folding mechanisms from free-energy landscapes derived from native structures. Proc. Natl. Acad. Sci. USA. 96:11305–11310.
7. Munoz, V., and W. A. Eaton. 1999. A simple model for calculating the kinetics of protein folding from three-dimensional structures. Proc. Natl. Acad. Sci. USA. 96:11311–11316.
8. Wodak, S., and M. Rooman. 1993. Generating and testing protein folds. Curr. Opin. Struct. Biol. 3:249–259.
9. Sippl, M. J. 1995. Knowledge-based potentials for proteins. Curr. Opin. Struct. Biol. 5:229–235.
10. Jernigan, R. L., and I. Bahar. 1996. Structure-derived potentials and protein simulations. Curr. Opin. Struct. Biol. 6:195–209.
11. Moult, J. 1997. Comparison of database potentials and molecular mechanics force fields. Curr. Opin. Struct. Biol. 7:194–199.
12. Russ, W. P., and R. Ranganathan. 2002. Knowledge-based potential functions in protein design. Curr. Opin. Struct. Biol. 12:447–452.
13. Miyazawa, S., and R. L. Jernigan. 1996. Residue-residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading. J. Mol. Biol. 256:623–644.
14. Furuichi, E., and P. Koehl. 1998. Influence of protein structure databases on the predictive power of statistical pair potentials. Proteins. 31:139–149.
15. Melo, F., R. Sanchez, and A. Sali. 2002. Statistical potentials for fold assessment. Protein Sci. 11:430–448.
16. Kocher, J.-P., M. J. Rooman, and S. J. Wodak. 1994. Factors influencing the ability of knowledge-based potentials to identify native sequence-structure matches. J. Mol. Biol. 235:1598–1613.
17. Simons, K. T., C. Kooperberg, E. Huang, and D. Baker. 1997. Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J. Mol. Biol. 268:209–225.
18. Zhang, C., and S.-H. Kim. 2000. Environment-dependent residue contact energies for proteins. Proc. Natl. Acad. Sci. USA. 97:2550–2555.
19. Kortemme, T., A. V. Morozov, and D. Baker. 2003. An orientation-dependent hydrogen bonding potential improves prediction of specificity and structure for proteins and protein-protein complexes. J. Mol. Biol. 326:1239–1259.
20. Buchete, N. V., J. E. Straub, and D. Thirumalai. 2004. Orientational potentials extracted from protein structures improve native fold recognition. Protein Sci. 13:862–874.
21. Miyazawa, S., and R. L. Jernigan. 2005. How effective for fold recognition is a potential of mean force that includes relative orientations between contacting residues in proteins. J. Chem. Phys. 122:24901–24918.
22. Rooman, M. J., J.-P. A. Kocher, and S. J. Wodak. 1991. Prediction of backbone conformation based on seven structure assignments. Influence of local interactions. J. Mol. Biol. 221:961–979.
23. Miyazawa, S., and R. L. Jernigan. 1999. Evaluation of short-range interactions as secondary structure energies for protein fold and sequence recognition. Proteins. 36:347–356.
24. Ramachandran, G., and V. Sasisekharan. 1968. Conformation of peptides and proteins. Adv. Protein Chem. 23:283–438.
25. Kabsch, W., and C. Sander. 1983. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 22:2577–2637.
26. Rose, G. D., A. R. Geselowitz, G. J. Lesser, R. H. Lee, and M. H. Zehfus. 1985. Hydrophobicity of amino acid residues in globular proteins. Science. 229:834–838.
27. Wang, G., and R. Dunbrack. 2003. PISCES: a protein sequence culling server. Bioinformatics. 19:1589–1591.
28. Henrick, K., and J. M. Thornton. 1998. PQS: a protein quaternary structure file server. Trends Biochem. Sci. 23:358–361.
29. Sippl, M. J. 1990. Calculation of conformational ensemble from potentials of mean force. An approach to the knowledge-based prediction of local structures in globular proteins. J. Mol. Biol. 213:859–883.
30. Tobi, D., and R. Elber. 2000. Distance-dependent, pair potential for protein folding: results from linear optimization. Proteins. 41:40–46.
31. Samudrala, R., and M. Levitt. 2000. Decoys‘R’Us: a database of incorrect conformations to improve protein structure prediction. Protein Sci. 9:1399–1401.
32. Park, B., and M. Levitt. 1996. Energy functions that discriminate X-ray and near native folds from well-constructed decoys. J. Mol. Biol. 258:367–392.
33. Simons, K. T., I. Ruczinski, C. Kooperberg, B. A. Fox, C. Bystroff, and D. Baker. 1999. Improved recognition of native-like protein structures using a combination of sequence-dependent and sequence-independent features of proteins. Proteins. 34:82–95.
34. Keasar, C., and M. Levitt. 2003. A novel approach to decoy set generation: designing a physical energy function having local minima with native structure characteristics. J. Mol. Biol. 329:159–174.
35. Tsai, J., R. Bonneau, A. V. Morozov, B. Kuhlman, C. A. Rohl, and D. Baker. 2003. An improved protein decoy set for testing energy functions for protein structure prediction. Proteins. 53:76–87.
36. Kang, H. S., A. Kurochkina, and B. Lee. 1993. Estimation and use of protein backbone angle probabilities. J. Mol. Biol. 229:448–460.
37. Bahar, I., and R. L. Jernigan. 1997. Inter-residue potentials in globular proteins and the dominance of highly specific hydrophilic interactions at close separation. J. Mol. Biol. 266:195–214.
38. Zhang, L., and J. Skolnick. 1996. How do potentials derived from structural databases relate to “true” potentials. Protein Sci. 7:1201–1207.
39. Shan, Y., and H.-X. Zhou. 2000. Correspondence of potentials of mean force in proteins and in liquids. J. Chem. Phys. 113:457–469.
40. Thomas, P. D., and K. A. Dill. 1996. Statistical potentials extracted from protein structures: how accurate are they? J. Mol. Biol. 257:457–469.
41. Dehouck, Y., D. Gilis, and M. Rooman. 2004. Database-derived potentials dependent on protein size for in silico folding and design. Biophys. J. 87:171–181.
42. Rooman, M., and D. Gilis. 1998. Different derivations of knowledge-based potentials and analysis of their robustness and context-dependent predictive power. Eur. J. Biochem. 254:135–143.
43. Godzik, A., A. Kolinski, and J. Skolnick. 1995. Are proteins ideal mixtures of amino acids? Analysis of energy parameter sets. Protein Sci. 4:2107–2117.
44. Zhang, C., S. Liu, H. Zhou, and Y. Zhou. 2004. The dependence of all-atom statistical potentials on structural training database. Biophys. J. 86:3349–3358.
45. Bowie, J. U., R. Luthy, and D. Eisenberg. 1991. A method to identify protein sequences that fold into a known three-dimensional structure. Science. 253:164–170.
46. Summa, C. M., M. Levitt, and W. F. DeGrado. 2005. An atomic environment potential for use in protein structure prediction. J. Mol. Biol. 352:986–1001.
47. Zhou, H., and Y. Zhou. 2002. Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction. Protein Sci. 11:2714–2726.
48. Betancourt, M. R., and D. Thirumalai. 1999. Pair potentials for protein folding: choice of reference states and sensitivity of predicted native states to variations in the interaction schemes. Protein Sci. 8:361–369.
49. Hinds, D. A., and M. Levitt. 1992. A lattice model for protein structure prediction at low resolution. Proc. Natl. Acad. Sci. USA. 89:2536–2540.
50. Samudrala, R., and J. Moult. 1998. An all-atom distance-dependent conditional discriminatory function for protein structure prediction. J. Mol. Biol. 275:895–916.
51. Lu, H., and J. Skolnick. 2001. A distance-dependent atomic knowledge-based potential for improved protein structure selection. Proteins. 44:223–232.
