Abstract
The energy function is the key component of protein modeling methodology. This work presents a semianalytical approach to the development of contact potentials for protein structure modeling. Residue-residue and atom-atom contact energies were derived by maximizing the probability of observing native sequences in a nonredundant set of protein structures. The optimization task was formulated as an inverse statistical mechanics problem applied to the Potts model. Its solution by pseudolikelihood maximization provides consistent estimates of coupling constants at atomic and residue levels. The best performance was achieved when interacting atoms were grouped according to their physicochemical properties. For individual protein structures, the performance of the contact potentials in distinguishing near-native structures from the decoys is similar to the top-performing scoring functions. The potentials also yielded significant improvement in the protein docking success rates. The potentials recapitulated experimentally determined protein stability changes upon point mutations and protein-protein binding affinities. The approach offers a different perspective on knowledge-based potentials and may serve as the basis for their further development.
Introduction
Computer simulations are essential for studying biological macromolecules, including proteins. Along with the search space sampling, the energy function is the key component of modeling. The energy function can be either derived from general physical principles (like a number of popular force fields (1, 2, 3, 4)) or based on diverse sets of known protein structures (various knowledge-based or statistical potentials (5, 6, 7, 8)). Statistical potentials provide a balance between accuracy and computational efficiency. Thus, they are successfully applied to many problems, such as discrimination of the native structure from decoys (9, 10), fold recognition (11), structure prediction (12), protein docking (13, 14, 15) and design (16, 17), and prediction of protein stability and affinity (18, 19, 20). Simplified energy models provide insight into general principles of protein folding and binding (21, 22, 23, 24).
One of the common approaches to developing statistical potentials is to calculate the probability of various structural features observed in a set of experimental protein structures relative to a reference state (7, 25). The probability is subsequently converted into energy using the inverse Boltzmann relation (7). However, the choice of the reference state, which serves as an imaginary protein model without interactions, is not a well-defined problem, and a number of approximations have been proposed. Among them are averaging (26), finite ideal-gas (10), spherical noninteracting (27), atom-shuffled (28), random-walk chain (29), and quasichemical (6) approximations. A different strategy for deriving statistical potentials is based on optimization of the energy parameters to maximize recognition of the native structure from a set of decoys (30, 31, 32). Despite the success of statistical potentials in various applications, their physical interpretation is not quite clear (33, 34, 35). Thus, derivation of the potential that provides fundamental and transparent insight is highly desirable.
Many problems that require describing direct (microscopic) interactions of objects (atoms, particles, etc.) from observed microscopic configurations of the system of these objects can be successfully tackled by inverse statistical mechanics approaches (see (36, 37, 38) and references therein). In particular, the Ising (39) and Potts (40) models were used to study the collective behavior of neurons (41, 42), infer gene-interaction networks from experimentally observed transcription profiles (43), predict residue-residue contacts from multiple sequence alignments (37, 44, 45), study protein fitness landscapes (46, 47), and infer epistatic effects from fitness (48). Interaction parameters in these models are often recovered using the maximum entropy principle (49), resulting in the least structured (i.e., most generic) model that is still consistent with the experimental data. A closely related approach has already been utilized to derive residue-residue statistical contact potentials (50, 51, 52). However, these studies have not gained much attention, most likely because they were discussed from a different perspective, i.e., protein evolution and design, and no detailed analysis of the performance of the constructed potentials in protein structural modeling has been reported. In this article, we bridge this gap and show that the inverse Potts inference can be applied to construct simple but effective residue-residue and atom-atom contact potentials, with the latter performing on par with the best existing statistical potentials and outperforming most existing energy functions in a number of tests.
The effectiveness of the potentials is attributed to 1) the consistent estimation of the energy parameters by the pseudolikelihood maximization approach and 2) explicit treatment of one-body energies at the learning stage.
Materials and Methods
Contact energies
Graph representation of protein structure
Noncovalent interactions in a protein can be modeled by a simple contact potential, in which two structural elements (called here interaction centers) that are closer in space than a cutoff distance dmax contribute a distance-independent value to the total energy. The interaction center can be the center of mass of a residue (for residue-residue potentials) or a single heavy atom (for atom-atom potentials). The number of distinct types of interaction centers (hereafter denoted as q) can vary depending on the level of generalization. In this study, we consider one residue-residue (RRCE20) and three atom-atom (AACE18, AACE20, and AACE167) contact potentials (RRCE and AACE are residue-residue and atom-atom contact energies, respectively; the summary description of the potentials is in Table 1). In the AACE18 potential, the atoms are grouped according to their physicochemical properties (19), yielding q = 18 distinct atom types. For the AACE20 potential, all heavy atoms in a residue are assigned the same type, resulting in q = 20 atom types. In the most detailed AACE167 potential, each heavy atom in the 20 residue types is considered separately, yielding q = 167 atom types. Hydrogen atoms and different protonation states of titratable amino acids were not considered.
Table 1.
Four Types Of Contact Potentials
Potential | Interaction centers | Number of Interaction Center Types, q | Description of Types | Number of Parameters |
---|---|---|---|---|
RRCE20 | residue centroids | 20 | 20 standard amino acids | 230 |
AACE18 | heavy atoms | 18 | 18 atom types from (19) | 189 |
AACE20 | heavy atoms | 20 | 20 standard amino acids | 230 |
AACE167 | heavy atoms | 167 | all heavy atoms in 20 standard amino acids | 14,195 |
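The parameter counts in Table 1 can be reproduced by a simple tally. The sketch below assumes (this breakdown is our reading of the table, not a statement from the text) that each potential has q one-body terms plus the independent elements of a symmetric q × q coupling matrix; the function name `n_parameters` is hypothetical.

```python
def n_parameters(q: int) -> int:
    """Assumed count: q self-energies plus q*(q+1)/2 independent
    entries (upper triangle with diagonal) of a symmetric q x q matrix."""
    return q + q * (q + 1) // 2


for name, q in [("RRCE20", 20), ("AACE18", 18), ("AACE20", 20), ("AACE167", 167)]:
    print(name, q, n_parameters(q))
```

Under this assumption, the tally matches the last column of Table 1 for all four potentials.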
For the applications discussed in this work, it is sufficient to represent a single protein structure by an undirected graph Gp(Vp,Ep) (Fig. 1). In such a graph, the set of nodes Vp = {vi} includes all interaction centers for a protein p. The set of edges Ep = {eij} comprises connections between interaction centers i and j, which are 1) closer in space than a cutoff distance dmax and 2) separated by at least kmin residues in the protein sequence (Fig. 1 A). For a given protein, the number of nodes L is fixed, but the number of edges may vary with the protein conformation. The only free parameters are dmax and kmin, whose optimal values are determined by benchmarking. Besides kmin, there are no other assumptions on the protein topology (e.g., information on the intraresidue connectivity is not used).
Figure 1.
Graphical model of protein 3D structure. (A) A cartoon representation of the 58-residue bovine pancreatic trypsin inhibitor mutant 1G6X is shown. Residue centroids are shown by small spheres. Blue parts are an example of residue neighborhood, showing all residues with centroids within dmax = 7 Å from Phe22 (in red). Residues in yellow are within kmin = 3 positions in sequence from Phe22 and are not included in its neighborhood when calculating the parameters of the potential. (B) The graph Gp(Vp,Ep) is a simplified representation of the BPTI mutant structure at dmax = 7 Å and kmin = 3. Phe22 and its neighbors have red and blue borders, respectively. Nodes of the graph are color-coded according to the amino acid type to indicate their state.
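The graph construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `contact_edges` and the toy coordinates are hypothetical, and the sequence-separation rule is assumed to be |res_i − res_j| ≥ kmin (the text does not state whether the inequality is strict).

```python
import math


def contact_edges(coords, residue_index, d_max, k_min):
    """Return edges (i, j), i < j, between interaction centers that are
    closer than d_max and at least k_min residues apart in sequence
    (assumed rule: |res_i - res_j| >= k_min)."""
    edges = []
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            if abs(residue_index[i] - residue_index[j]) < k_min:
                continue  # too close in sequence
            if math.dist(coords[i], coords[j]) < d_max:
                edges.append((i, j))
    return edges


# Toy example: five residue centroids (coordinates in angstroms, made up).
coords = [(0, 0, 0), (3, 0, 0), (6, 0, 0), (3, 4, 0), (20, 0, 0)]
residue_index = [1, 2, 3, 5, 9]
print(contact_edges(coords, residue_index, d_max=7.0, k_min=3))
```

The double loop is quadratic in the number of interaction centers; for all-atom graphs over thousands of chains, a cell list or k-d tree would be the usual replacement.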
Energy of protein
Each node in the graph Gp(Vp,Ep) can adopt one of q possible states (q is determined solely by the type of the potential; in this study, q = 18, 20, or 167; see Table 1). For clarity, a state of graph Gp(Vp,Ep) is denoted by the same letters as the graph vertices {vi}, giving a vector
$$\mathbf{v} = (v_1, v_2, \ldots, v_L), \tag{1}$$
which is composed of integer numbers vi ∈ {1,...,k,...,q}, associated with atom (residue) types of the graph nodes. To derive the contact potentials, we introduce one- and two-body energy terms to account for self-energies and energies of contacting atom (residue) pairs within the protein. Thus, each graph node i of type vi can be associated with one of q possible numbers from vector
$$\mathbf{h} = (h_1, \ldots, h_k, \ldots, h_q). \tag{2}$$
In turn, each edge, eij, can be attributed to one of q × q values from a symmetric matrix
$$\mathbf{J} = \{J_{kk'}\}, \quad J_{kk'} = J_{k'k}, \quad k, k' = 1, \ldots, q, \tag{3}$$
depending on types vi and vj of nodes i and j, respectively. For every type of potential, there are unique sets of h and J parameters shared by all nodes and edges of the graph. However, each protein has a unique graph Gp(Vp,Ep) and atom (residue) type assignment vector v, which are solely determined by the conformation and amino acid composition of that protein.
Summation over the nodes and edges of the graph Gp(Vp,Ep) yields an expression for the energy of the protein
$$U(\mathbf{v}) = \sum_{i \in V_p} h_{v_i} + \sum_{(i,j) \in E_p} J_{v_i v_j}. \tag{4}$$
It is similar to the expression for the energy (Hamiltonian) of the q-state generalized Potts model in statistical physics (40), in which pairwise interactions depend on the states of the interacting sites and the local fields (or self-energies) act on the single sites of the system.
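As an illustration of the summation over nodes and edges, the sketch below evaluates a Potts-like energy for a toy graph. It assumes the energy is the plain sum of one-body terms over nodes and two-body terms over edges (lower energy being more favorable); the function name `potts_energy`, the 0-based type indices, and all numerical values are hypothetical.

```python
def potts_energy(types, edges, h, J):
    """Energy of a labeled graph: sum of h over nodes plus sum of J over edges.
    types[i] is the 0-based type of node i; J is a symmetric q x q matrix."""
    one_body = sum(h[t] for t in types)
    two_body = sum(J[types[i]][types[j]] for i, j in edges)
    return one_body + two_body


# Toy system with q = 2 types, three nodes, and two edges (values made up).
h = [0.5, -0.2]
J = [[0.0, -1.0],
     [-1.0, 0.3]]
types = [0, 1, 1]
edges = [(0, 1), (1, 2)]
energy = potts_energy(types, edges, h, J)
print(round(energy, 6))
```

Note that the one-body part depends only on the composition of the node types, not on which edges are present.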
For a fixed graph Gp(Vp,Ep) (i.e., fixed protein conformation), the probability of state v is given by the Boltzmann (Gibbs) distribution
$$P(\mathbf{v}) = \frac{1}{Z} \exp\left[-\beta U(\mathbf{v})\right], \tag{5}$$
where β is a scaling factor (which, in statistical physics, means the inverse energy of thermal fluctuations at temperature T, β = 1/RT), and Z is the statistical sum over the set of all possible system states v′:
$$Z = \sum_{\mathbf{v}'} \exp\left[-\beta U(\mathbf{v}')\right]. \tag{6}$$
Parameters h and J (Eqs. 2 and 3) are not known a priori but can be inferred from a large set of known protein structures (see below). The probability distribution in Eq. 5 is also known as a Markov random field (53) on graph Gp(Vp,Ep).
Pseudolikelihood approximation
Because native protein sequences are close to optimal for their three-dimensional structures (54), for a given structure of a protein, an accurate energy model should assign the highest probability to the native sequence compared to any non-native one. This concept, for example, helps in protein design when one tries to find a sequence that best fits a given protein fold (55). In terms of the energy function (Eq. 4), the task can be formulated as an optimization problem of finding values of h (Eq. 2) and J (Eq. 3) that maximize the probability of observing the native state
$$\{\mathbf{h}, \mathbf{J}\} = \underset{\mathbf{h},\, \mathbf{J}}{\arg\max}\; P(\mathbf{v}). \tag{7}$$
However, the optimization problem (Eq. 7) cannot be solved directly because of the combinatorial complexity of the partition function (Eq. 6). To make the problem tractable, the probability function (Eqs. 5 and 6) for the native state (sequence) is approximated by a product of local conditional probabilities (pseudolikelihoods)
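$$P(\mathbf{v}) \approx \prod_{i} \frac{\exp\left[-\beta U(v_i, v_i)\right]}{\sum_{k=1}^{q} \exp\left[-\beta U(v_i, k)\right]}, \tag{8}$$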
where multiplication is performed over all atoms (residues) and summation in the denominator is over all possible q states of a single interaction center. U(vi,k) is the “energy” of a single interaction center, or Gp(Vp,Ep) node, in state k
$$U(v_i, k) = h_k + \sum_{j:\, e_{ij} \in E_p} J_{k v_j}, \tag{9}$$
where summation is performed over all other nodes in Gp(Vp,Ep) connected to node i by an edge. The temperature factor β = 1 is used throughout the work. In the pseudolikelihood approximation, all nodes are in the native states, and only the state of a current node varies to calculate the “pseudo”-statistical sum (denominator in Eq. 8). The pseudolikelihoods are known to provide asymptotically consistent estimates of parameters h and J (56) and are successfully applied to large sample size problems in physics and biology (37, 51, 57).
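The pseudolikelihood described above can be evaluated directly for a small graph. The following is a minimal sketch under stated assumptions: the conditional probability of a node state is taken proportional to exp(−βU), the single-node energy is the local field plus the couplings to the (fixed) neighbor states, and the function names and toy values are hypothetical.

```python
import math


def build_neighbors(n_nodes, edges):
    """Adjacency lists of an undirected graph."""
    nbrs = [[] for _ in range(n_nodes)]
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    return nbrs


def neg_log_pseudolikelihood(types, edges, h, J, beta=1.0):
    """Sum over nodes of -log P(native state of node | states of its neighbors)."""
    q = len(h)
    nbrs = build_neighbors(len(types), edges)
    total = 0.0
    for i, v_i in enumerate(types):
        # Energy of node i in each of the q states, neighbors kept native.
        u = [h[k] + sum(J[k][types[j]] for j in nbrs[i]) for k in range(q)]
        log_z_i = math.log(sum(math.exp(-beta * u_k) for u_k in u))
        total += beta * u[v_i] + log_z_i
    return total


types = [0, 1]
edges = [(0, 1)]
flat = neg_log_pseudolikelihood(types, edges, [0.0, 0.0], [[0.0, 0.0], [0.0, 0.0]])
tuned = neg_log_pseudolikelihood(types, edges, [0.0, 0.0], [[0.0, -1.0], [-1.0, 0.0]])
print(flat, tuned)
```

With all parameters at zero, every state is equally likely and the value is L·ln(q); a coupling that favors the observed contact lowers it, which is the direction the optimization moves in.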
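In the above formalism, the optimization problem (Eq. 7) is reduced to solving the system of equations obtained by setting the derivatives of the pseudolikelihood (Eq. 8) with respect to all parameters to zero
$$\frac{\partial \ln P(\mathbf{v})}{\partial h_k} = 0, \qquad \frac{\partial \ln P(\mathbf{v})}{\partial J_{kk'}} = 0, \qquad k, k' = 1, \ldots, q. \tag{10}$$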
For convenience, in the analytical deduction of the derivatives in Eq. 10, we used negative pseudo-log-likelihoods. More details of the pseudolikelihood optimization are in the Supporting Materials and Methods.
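The derivatives with respect to the local fields have a compact form that is easy to check numerically. The sketch below is our own derivation under the assumptions β = 1 and conditional probabilities proportional to exp(−U); the authors' expressions are in their Supporting Materials and Methods and may be arranged differently. All function names and toy values are hypothetical.

```python
import math


def build_neighbors(n_nodes, edges):
    nbrs = [[] for _ in range(n_nodes)]
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    return nbrs


def node_probs(i, types, nbrs, h, J):
    """Conditional probabilities of the q states of node i (neighbors fixed)."""
    weights = [math.exp(-(h[k] + sum(J[k][types[j]] for j in nbrs[i])))
               for k in range(len(h))]
    z_i = sum(weights)
    return [w / z_i for w in weights]


def neg_log_pl(types, edges, h, J):
    nbrs = build_neighbors(len(types), edges)
    return -sum(math.log(node_probs(i, types, nbrs, h, J)[v])
                for i, v in enumerate(types))


def grad_h(types, edges, h, J):
    """d(neg_log_pl)/dh_k = sum_i [delta(v_i, k) - P_i(k)]."""
    nbrs = build_neighbors(len(types), edges)
    grad = [0.0] * len(h)
    for i, v in enumerate(types):
        p = node_probs(i, types, nbrs, h, J)
        for k in range(len(h)):
            grad[k] += (1.0 if k == v else 0.0) - p[k]
    return grad


# Toy system (q = 3, four nodes); compare with central finite differences.
types = [0, 1, 2, 1]
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
h = [0.1, -0.3, 0.2]
J = [[0.0, -0.5, 0.2], [-0.5, 0.1, -0.4], [0.2, -0.4, 0.0]]
analytic = grad_h(types, edges, h, J)
eps = 1e-6
numeric = []
for k in range(len(h)):
    h_plus, h_minus = list(h), list(h)
    h_plus[k] += eps
    h_minus[k] -= eps
    numeric.append((neg_log_pl(types, edges, h_plus, J)
                    - neg_log_pl(types, edges, h_minus, J)) / (2 * eps))
print(analytic, numeric)
```

In this form, the stationarity condition says that the observed number of nodes of each type equals the sum of the model's conditional probabilities for that type. The gradient components sum to zero, reflecting that a constant shift of all local fields leaves the probabilities unchanged.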
The solution to the system of equations (Eq. 10) within the graph Gp(Vp,Ep) would provide h and J specific only for one protein. To obtain generic potentials, we solved the system of equations (Eq. 10) for a composite graph G(V,E) constructed by joining graphs Gp(Vp,Ep) for all individual proteins in a large set of protein structures (details on this “training” set are in Materials and Methods). The product in Eq. 8 runs over all nodes in that composite graph, whereas all other considerations remain the same.
Once h and J values are obtained for the training set, only the J matrix, which constitutes the contact potential, is applied to several problems in protein modeling (see Results). The role of self-energies is to provide an accurate estimation of J (also discussed in the Results).
It is worth noting that in terms of the graph representation, the development of “classical” statistical potentials is usually limited to collecting information on the number of nodes in the composite graph in state k and the number of edges connecting nodes in states k and k′ without solving Eq. 10.
Training set of protein structures
To calculate the Potts parameters h and J, a nonredundant training set of 6338 protein chains was collected by the Protein Sequence Culling Server (58). Only x-ray structures with resolution ≤2.0 Å, R-factor ≤0.25, and ≥40 residues per chain were selected. Redundancy was removed at 25% sequence identity cutoff. Individual chains were extracted from Protein Data Bank (PDB) asymmetric units, and missing heavy atoms were restored by the PDB2PQR software (59) using CHARMM topology parameters (1). Alternative residue conformations, if present in the original PDB structure, were removed by the same PDB2PQR program. Nineteen chains from the initial pool of 6338 structures could not be processed either because of multiple models in the original PDB file or a large number of missing heavy atoms (>10%). These structures were left out from the consideration, yielding the final set of 6319 chains from 6092 different PDB entries.
Parameters of the contact potential
At each value of the distance cutoff dmax (from 4 to 15 Å, with 0.1 Å step) and sequence separation kmin (from 1 to 10, with step 1), we built the graph G(V,E) for the 6319 single protein structures from the training set. Minimization of the objective function (the negative logarithm of Eq. 8) with derivatives (Eq. 10) was performed by an in-house C program specifically designed for this purpose. The GNU Scientific Library (http://www.gnu.org/software/gsl/) implementation of the quasi-Newton Broyden-Fletcher-Goldfarb-Shanno method (60) (the bfgs2 minimizer of the GNU Scientific Library) was used. Minimization started with all the target parameters set to zero and proceeded iteratively until the norm of the gradient reached the absolute tolerance of 10^-3.
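The parameter scan described above amounts to a regular grid. A short sketch of how such a grid might be enumerated (variable names are hypothetical) is given below; stepping through integers and dividing by 10 avoids the drift that repeated addition of 0.1 would accumulate in floating point.

```python
# d_max from 4.0 to 15.0 angstroms in 0.1 steps, k_min from 1 to 10.
d_max_values = [n / 10 for n in range(40, 151)]
k_min_values = list(range(1, 11))
grid = [(d_max, k_min) for d_max in d_max_values for k_min in k_min_values]

print(len(d_max_values), len(k_min_values), len(grid))
```

The grid has 111 × 10 = 1110 points, each of which corresponds to one independent derivation of a potential.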
CASP decoys
Decoys for near-native structure detection were compiled from all tertiary structure predictions submitted to Critical Assessment of Structure Prediction (CASP) rounds X and XI (61, 62). Following the CASP practice, the models were analyzed at the level of evaluation units, or domains, assigned by the assessors. To make the energy estimates consistent, partial models with incomplete chains were removed from the pool of decoys. Overall, 224 domains from 172 protein chains were selected for testing. PDB files of models and the tables with models’ parameters and ranking according to global distance test, total score (GDT_TS) (63) were downloaded from the CASP repository (http://predictioncenter.org/download_area/).
Scoring of CASP decoys
Performance of different energy functions on the CASP decoys was assessed in terms of the Z-score, defined as the distance (measured in standard deviations) between the energy of the best (highest GDT_TS score) model Ub (or the native structure) calculated by the tested function and the mean energy of all decoy structures
$$Z = \frac{\langle U \rangle - U_b}{\sigma}, \tag{11}$$
where ⟨U⟩ is the mean energy of the decoys and σ is the standard deviation of energy U (given by the tested function) for all decoys in the set. In addition to the Z-score (Eq. 11), we also used the Pearson correlation coefficient between energies and GDT_TS scores of the models, as well as the normalized rank of the best-energy model
$$R = \frac{R_{\mathrm{GDT\_TS}}}{N_{\mathrm{tot}}}, \tag{12}$$
where RGDT_TS is the rank by the GDT_TS score of the best-energy model, and Ntot is the total number of the models in the set. The form 1 − R was used to transform the normalized rank to the increasing function of the scoring method effectiveness.
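The three assessment scores can be illustrated on a toy decoy set. The sketch below rests on several assumptions that the text does not spell out: the Z-score is taken as (mean energy − energy of the best model)/σ so that larger is better, σ is the population standard deviation over all decoys including the best model, the best-energy model is the one with the lowest energy, and rank 1 corresponds to the highest GDT_TS. Function names and numbers are hypothetical.

```python
import math
import statistics


def z_score(energies, best_index):
    """(mean energy - energy of the chosen model) / standard deviation."""
    mean = statistics.fmean(energies)
    sigma = statistics.pstdev(energies)
    return (mean - energies[best_index]) / sigma


def one_minus_normalized_rank(energies, gdt_ts):
    """1 - R, where R is the GDT_TS rank of the lowest-energy model / N."""
    best_energy = min(range(len(energies)), key=energies.__getitem__)
    rank = 1 + sum(1 for g in gdt_ts if g > gdt_ts[best_energy])
    return 1.0 - rank / len(energies)


def pearson(x, y):
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy decoy set: four models with made-up energies and GDT_TS scores.
energies = [-10.0, -8.0, -6.0, -4.0]
gdt_ts = [80.0, 60.0, 70.0, 40.0]
best_by_gdt = max(range(len(gdt_ts)), key=gdt_ts.__getitem__)

print(z_score(energies, best_by_gdt))
print(one_minus_normalized_rank(energies, gdt_ts))
print(pearson(energies, gdt_ts))
```

With lower energy meaning a better model, the raw correlation between energy and GDT_TS is negative for a well-behaved scoring function; whether a reported coefficient refers to its magnitude or to the sign-flipped energy is a convention that should be checked against the figure in question.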
For the assessment of the potentials, the energy of a protein model was calculated by simple summation of the inferred couplings J over all pairs (i,j) of interaction centers that are consistent with distance cutoff dmax used to derive the potential
$$U = \sum_{i<j:\; d_{ij} < d_{\max}} J_{v_i v_j}. \tag{13}$$
The sum over the self-energies (local fields) was omitted in Eq. 13 because it does not depend on the protein conformation and thus does not affect ranking of the model structures. Subscripts vi, vj enumerate types of atoms (residues) i and j, respectively (vi ∈ 1,2,...,q, where q = 18, 20, or 167 depending on the contact potential type; see Table 1).
Data set of point mutations
To evaluate applicability of the contact potentials to prediction of the change in the folding free energy ΔΔG upon mutations, we used a data set of 2648 point mutations for 131 globular proteins with experimentally resolved x-ray or NMR structure, for which mutation-induced change in protein stability was determined experimentally (64). The set is derived from the ProTherm database (65). 235 mutations in the set originate from NMR structures (12 distinct PDB IDs) with the number of models from 5 to 46. Because the data set contains experimentally resolved structures for the wild-type proteins (those deposited in the PDB) only, the structure of a mutant was obtained by manual replacement of the corresponding side chain. The replacement was followed by the SCWRL4 (66) repacking of the residues within 6 Å distance to the mutated residue (the residue-residue distance defined as any atom to any atom of the two residues). To compensate for possible biases introduced by SCWRL4, the same relaxation procedure was applied to the same residues of the wild-type structure. For each mutated residue X, relative solvent exposure was calculated as the ratio between the absolute solvent-accessible surface area (SASA) of this residue in the wild-type structure and the reference SASA for this type of residue in the Gly-X-Gly tripeptide (67). A residue was considered to be at the surface if >20% of its SASA was exposed.
Assuming that the folding free energy of a protein is proportional to its internal energy in the folded state, ΔΔGcalc was approximated by the difference in internal energies (Eq. 13) of the mutant and the wild-type:
$$\Delta\Delta G_{\mathrm{calc}} \approx U_{\mathrm{mut}} - U_{\mathrm{wt}}. \tag{14}$$
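The stability estimate can be sketched as a difference of two contact-energy sums. The example below assumes the sign convention mutant minus wild type and, for simplicity, an unchanged contact graph with only the type of the mutated center altered; in the procedure described above, side-chain replacement and repacking would generally change the contacts as well. Names and values are hypothetical.

```python
def contact_energy(types, edges, J):
    """Sum of couplings over all contacting pairs (no one-body terms)."""
    return sum(J[types[i]][types[j]] for i, j in edges)


def ddg_calc(wt_types, wt_edges, mut_types, mut_edges, J):
    """Assumed convention: energy of the mutant minus energy of the wild type."""
    return contact_energy(mut_types, mut_edges, J) - contact_energy(wt_types, wt_edges, J)


# Toy example with q = 2 types; node 1 changes from type 0 to type 1.
J = [[-1.0, 0.5],
     [0.5, -0.2]]
edges = [(0, 1), (1, 2)]
wt_types = [0, 0, 1]
mut_types = [0, 1, 1]
print(round(ddg_calc(wt_types, edges, mut_types, edges, J), 6))
```

A positive value in this toy case indicates that the mutation replaces a favorable contact with less favorable ones.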
Docking decoys
The initial set of 1020 binary protein-protein complexes, for which the structures of the complex (in PDB biological assembly) and the structure of both unbound components are available, was generated by the ProPairs tool (68) run locally on a PDB snapshot with the default parameters. The set was postprocessed to retain only pairs with a high similarity of bound and unbound partners (sequence identity >96% and coverage >80% (69)), which reduced the set size to 427. Additional purging of structurally similar (template modeling-score (70) >0.8) and large (>2500 residues per interactor) complexes yielded the final set of 396 complexes (Dockground Benchmark 4.0 (71) http://dockground.compbio.ku.edu). For comparison, we also used 230 protein-protein complexes in the docking Benchmark 5.0 from Weng’s group (69).
The unbound proteins from both benchmarks were docked by the fast Fourier transform rigid-body docking program GRAMM (72, 73) at low resolution, with 3.5 Å grid step and 10° angular interval. The top 100,000 matches per complex, ranked solely by the shape complementarity, were compared to the reference complex obtained by structural superposition of the unbound monomers onto corresponding proteins in the co-crystallized complex. The quality of the docking models was assessed by the Critical Assessment of Predicted Interactions (CAPRI) criteria (74) (Table S1). Docking success rate was defined as the fraction of complexes for which at least one successful prediction (defined at different accuracy categories) was in the top n predictions. The docking predictions were further reranked by the energy of the proteins A and B interface UAB, calculated (similarly to individual proteins; see Eq. 13) by summing up the couplings J over all pairs (i ∈ A, j ∈ B) of the interchain contacts closer in space than dmax:
$$U_{AB} = \sum_{i \in A,\; j \in B:\; d_{ij} < d_{\max}} J_{v_i v_j}. \tag{15}$$
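The interface energy used for reranking can be sketched as a double loop over the two chains. This is an illustration only: the function name `interface_energy`, the toy coordinates, and the coupling values are hypothetical, and the strict inequality for the distance cutoff is an assumption.

```python
import math


def interface_energy(coords_a, types_a, coords_b, types_b, J, d_max):
    """Sum of couplings over interchain pairs (i in A, j in B) closer than d_max."""
    total = 0.0
    for xyz_a, t_a in zip(coords_a, types_a):
        for xyz_b, t_b in zip(coords_b, types_b):
            if math.dist(xyz_a, xyz_b) < d_max:
                total += J[t_a][t_b]
    return total


# Toy complex: two interaction centers per chain (coordinates in angstroms).
J = [[-1.0, 0.5],
     [0.5, -0.2]]
coords_a, types_a = [(0, 0, 0), (10, 0, 0)], [0, 1]
coords_b, types_b = [(3, 0, 0), (30, 0, 0)], [1, 1]
u_ab = interface_energy(coords_a, types_a, coords_b, types_b, J, d_max=6.9)
print(u_ab)
```

Only intermolecular pairs enter the sum, so no sequence-separation filter is needed here. Reranking then amounts to sorting the docking matches by this value in ascending order.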
The predicted matches were clustered to identify the most probable hit within each putative docking funnel. Only one lowest energy prediction from each cluster was selected.
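The docking success rate defined above reduces to a simple count once each prediction has been labeled by quality. In the sketch below, the labeling itself (e.g., by the CAPRI criteria) is assumed to have been done elsewhere; the function name and the toy data are hypothetical.

```python
def success_rate(ranked_hits, n):
    """Fraction of complexes with at least one successful prediction in the top n.

    ranked_hits: one list per complex, ordered from best- to worst-ranked,
    with True marking a prediction of the required accuracy category.
    """
    successes = sum(1 for hits in ranked_hits if any(hits[:n]))
    return successes / len(ranked_hits)


# Toy benchmark of four complexes.
ranked_hits = [
    [False, True, False],
    [False, False, True],
    [False, False, False],
    [True],
]
for n in (1, 2, 3):
    print(n, success_rate(ranked_hits, n))
```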
Affinity benchmark
To assess how UAB (Eq. 15) correlates with the protein-protein binding affinities, the set of 92 protein-protein complexes with known co-crystallized structures and experimentally determined binding affinities ΔGexp was selected from the Affinity Benchmark version 2.0 (69). We considered only the rigid-body cases, i.e., those without significant conformational changes upon binding. Such cases were defined as bound/unbound interface root mean-square distance < 1.5 Å and fraction of non-native contacts (the number of non-native residue-residue contacts in the predicted complex divided by the total number of contacts in that complex (74)) < 0.4.
Energy functions
For comparison, we tested the following knowledge-based energy functions: discrete optimized protein energy (DOPE) (27), distance-scaled, finite ideal-gas reference (DFIRE) (10), dipolar DFIRE (dDFIRE) (75, 76), random walk (RW) and RWplus (29), generalized orientation-dependent all-atom potential (GOAP) (77), OPUS-PSP (78), RF-HA-SRS (28), and RF-CB-SRS-OD (79). DOPE, DFIRE, RW, and RF-HA-SRS are all-heavy-atom distance-dependent potentials, whereas RWplus, dDFIRE, and GOAP have an additional orientation-dependent term. OPUS-PSP is an orientation-dependent contact potential defined for blocks of side-chain atoms. RF-CB-SRS-OD is a residue-level distance- and orientation-dependent energy function. In addition, a simple residue-residue contact potential by Miyazawa and Jernigan MJ3h (80) was also tested because of its best performance in scoring of protein docking decoys (81, 82).
Results and Discussion
Parameters of the contact potentials
Different distance cutoffs dmax and sequence separations kmin may result in a different graph model for the protein structure (Fig. 1) and thus in different sets of local fields h and couplings J that maximize the likelihood function (Eq. 8). To find the optimal dmax and kmin values, we derived our four potentials RRCE20, AACE20, AACE167, and AACE18 (Table 1) using 1110 different combinations of dmax and kmin (see Materials and Methods). The performance of the potentials was evaluated by discriminating best models from the CASP decoys (Figs. 2 and S1).
Figure 2.
Performance of residue-residue and atom-atom contact potentials in best-model structure recognition from CASP decoys. Statistical potentials derived at different values of sequence separation kmin and distance cutoff dmax were used to score models of 224 protein domains submitted to CASP rounds X and XI. Performance is measured as Z-score of the highest GDT_TS score model averaged over all 224 evaluation units. The solid line shows distance dependence of the Z-scores (bottom horizontal axis) at the optimal sequence separation kmin = 3. Performance of the potentials derived without local fields is shown by the dashed lines. Thin gray lines with circles show the Z-score dependence on the sequence separation kmin (top horizontal axis) at dmax = 8.0 Å (RRCE20, AACE20) and dmax = 6.9 Å (AACE167, AACE18). The plots are cross-sections of the heat maps in Fig. S1 at specific values of sequence separation kmin and distance cutoff dmax.
All four potentials performed poorly when contacts between residues adjacent in the sequence were considered in the derivation of the contact energies. Residues that are close in sequence are close in space primarily because of the covalent bonds. Thus, taking such contacts into account obscures the treatment of nonbonded interactions, especially at smaller dmax (83). As dmax increases, more interacting pairs contribute to the potentials, and the relative contribution of sequence-adjacent residues declines rapidly. However, with complete exclusion of the sequence-adjacent residues, some portion of the nonbonded interaction energy remains unaccounted for, causing a drop, albeit slight, in the performance. The optimal performance was observed at kmin = 3. Thus, the potentials derived at this kmin value were used in the further analysis unless stated otherwise. Despite high correlation of the RRCE20 contact energies and the well-known Miyazawa-Jernigan matrix MJ3h (80) (R = 0.90), the former still shows better decoy discrimination by all three measures (Fig. 4).
Figure 4.
Performance of various energy functions in the best-model structure recognition from CASP decoys. The best model’s Z-score, its normalized rank 1 − R, and Pearson’s correlation coefficient r of the energy score and GDT_TS score of models, all averaged over 224 CASP decoy sets, are shown for different scoring functions on scatter plots (A)–(C) (one plot per each combination of the above three assessment scores). For comparison, average Z-scores and normalized ranks for the native structure are shown on plot (D).
The RRCE20 and AACE20 potentials showed very similar trends with varying dmax. This suggests that if all atoms within one residue are assigned to one type, the way of calculating contacts (either between residue centroids or between residue heavy atoms) has a negligible effect on the energy function. The optimal performance for these potentials was achieved within a broad dmax interval (6–11 Å). If the protein heavy atoms were split into 167 different types, the trend remained similar. However, the best performance was observed at lower dmax = 5–8 Å, and the Z-score increased significantly, from 0.89 and 0.88 for RRCE20 and AACE20, respectively, to 1.09 for AACE167.
The distinct feature of the AACE18 potential is the two sharp peaks of enhanced performance (in terms of Z-score, quantitatively similar to the performance of the much more complex AACE167) at dmax ∼ 6.9 and ∼12.6 Å (Fig. 2 D). The exact reason for the two-peak distribution is not clear because the statistical potential parameters are derived from a self-consistent solution of the system of equations (Eq. 10), in which elucidating the effect of a particular factor is nontrivial, if possible at all. However, the peaks do not appear when the potential is derived using a reduced model (leaving out the local field terms h in Eqs. 4 and 9; dashed lines in Fig. 2 D). This points to an important interconnectivity between one- and two-body energies, which substantially elevates the efficiency of the potential at certain dmax. Generally, the reduced model yields significantly less effective contact potentials (the dashed lines in Fig. 2 are all below the solid lines). AACE18 correlates poorly (R = 0.43) with the original atomic contact energies from (19).
The convergence analysis of the inferred energy parameters showed that they are already close to optimal. Thus, further increase in the number of structures deposited to PDB can only marginally improve the potentials (Figs. S2 and S3).
Local fields
After the local fields h and couplings J are learned by solving the system of equations (Eq. 10), one-body terms (or self-energies) can be omitted in protein structure energy estimates and scoring applications (Eqs. 13, 14, and 15). Nevertheless, the local fields are essential internal parameters of the model at the learning stage (Eqs. 7, 8, and 9), boosting the effectiveness of the resulting two-body energies J (dashed and solid lines in Fig. 2). Below, we provide a detailed analysis of how the trained hk parameters are related to the basic features of individual interaction center types.
Empirically, we found that the one-body energies can be described by a linear combination of two interaction-center features (Fig. 3 B, rightmost panel)
$$h_k \approx a\,p_k + b\,n_k + c, \tag{16}$$
where a, b, and c are coefficients of the least-squares fit, and pk and nk are the propensity (frequency of occurrence) and the average coordination number (in graph terminology, the average number of node neighbors in the graph G(V,E) connected by an edge) for the interaction centers of type k in the training set, respectively. The coordination number is inversely related to the exposure of the interaction center to the solvent because atoms or residues at the protein surface have fewer contacts with other interaction centers than those in the protein core.
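Both features can be read directly off the labeled contact graph. The sketch below assumes that the propensity is the fraction of nodes of a given type and that the coordination number is the mean node degree within that type; the function name and the toy graph are hypothetical.

```python
from collections import Counter


def propensity_and_coordination(types, edges, q):
    """Return (p, n): per-type node fraction and per-type mean node degree."""
    counts = Counter(types)
    degree = [0] * len(types)
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    degree_sum = [0] * q
    for t, d in zip(types, degree):
        degree_sum[t] += d
    p = [counts[k] / len(types) for k in range(q)]
    # Types absent from the graph get a coordination number of 0.0 by convention.
    n = [degree_sum[k] / counts[k] if counts[k] else 0.0 for k in range(q)]
    return p, n


# Toy graph: four nodes of two types, node 1 is a hub.
types = [0, 1, 1, 0]
edges = [(0, 1), (1, 2), (1, 3)]
p, n = propensity_and_coordination(types, edges, q=2)
print(p, n)
```

Because the coordination number is computed from the same graph as the potential, it changes with dmax and kmin, whereas the propensity does not.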
Figure 3.
Properties of the local fields hk. (A) The contribution (relative importance) of atom propensities pk and average coordination numbers nk to local fields for the least-squares linear model (Eq. 16) with varying cutoff distance dmax is shown for the RRCE20, AACE20, AACE167, and AACE18 potentials. The data were obtained by the averaging-over-orderings method by Lindeman et al. (94) as implemented in the relaimpo package (95) for the R statistical computing language. (B) As an example, hk correlations with atom propensities pk and coordination numbers nk and their linear combination (Eq. 16) are shown for the AACE167 potential derived at dmax = 6.9 Å and kmin = 3.
The two parameters pk and nk contribute ≥90% to the fields for all potentials and dmax (Fig. 3, A–D). For the RRCE20 and AACE20 potentials (Fig. 3, A and B), the propensities contribute significantly more at smaller distances, whereas at larger dmax, the coordination numbers become more important. The AACE167 potential showed qualitatively similar behavior. The main difference was that contribution of the propensities became larger at significantly smaller dmax (a steep dark-gray peak on the left-hand side of Fig. 3 A). The AACE18 potential showed distinctly different patterns (Fig. 3 A, rightmost panel). For this potential, the relative importance of pk and nk weakly depends on dmax, and the propensities generally have a higher contribution to local fields than the coordination numbers (e.g., at optimal dmax = 6.9 Å, nk-values contribute ∼65 and 36% to the local fields for the AACE167 and AACE18 potentials, respectively).
Discrimination of protein near-native structures
The quality of knowledge-based energy functions is often assessed by their ability to recognize the native structure or the best model in a set of decoys (9, 84, 85). Models submitted to the CASP competition (86) are believed to be the most challenging (87) and have recently been used by others to benchmark their statistical potentials (79, 88). Thus, we tested our four potentials on the CASP decoys from rounds X and XI of the competition (61, 62). Identifying the best model from the decoys is generally more challenging than identifying the native structure (79, 87). This is also the case for our potentials: Z-scores for the best model are on average in the 0.9–1.1 range, whereas corresponding numbers for the native structure are 0.6–0.8 Z-score units higher (Fig. S4). Although this difficulty substantially complicates comparison of different scoring functions, best-model discrimination is more relevant to the real-case modeling scenario, in which only the models, but not the native structure, are available. Thus, unlike a number of other studies on scoring functions, we focused our analysis on the ability of the potentials to discriminate the best model and compared their performance to 10 state-of-the-art knowledge-based energy functions (Fig. 4).
In terms of the Z-score (Eq. 11), the AACE18 and AACE167 potentials are among the top ones, behind only GOAP (77). Assessment by the normalized rank (Eq. 12) also puts AACE18 and AACE167 near the top of the list (1 − R = 0.809 and 0.808, respectively), only slightly behind GOAP (1 − R = 0.821). High correlation of energy and structural accuracy scores of the decoys indicates good scoring (77). In this respect, AACE18 shows the best performance (average correlation coefficient 0.606), followed by GOAP (0.587) and AACE167 (0.585) (Fig. 4). These values are similar to correlations reported elsewhere (e.g., (77)). Statistical analysis reveals, however, that the differences between AACE18, AACE167, and GOAP are marginal and that all three potentials have comparable performance, significantly better than the other tested energy functions (Table S2). The other two assessment scores (Z-score and normalized rank) are not as discriminative (see the corresponding p-values in Table S2). However, they still place AACE18 and AACE167 among the top ones, only slightly behind GOAP. How to establish an exact ranking of the 14 potentials is not obvious. However, the consistency between the three assessment scores indicates that GOAP, AACE18, and AACE167 are the three best-performing energy functions in the best-model discrimination test. In the native-structure discrimination test (Fig. 4 D), the performance order is slightly different, with OPUS-PSP and RF-HA-SRS on top of the list by the Z-score, followed by GOAP and AACE167. This suggests that some energy functions are more tuned for high-resolution decoys but are less successful in discriminating models of moderate accuracy. In this respect, AACE167 is more sensitive in selecting the native structure than AACE18 (Fig. 4 D).
The residue-level RRCE20 and AACE20 potentials perform poorly by all measures, suggesting the need for atomic details for effective contact potentials. On the other hand, our contact potentials AACE18 and AACE167, which are quite simple (e.g., no distance or orientation dependency), are sufficient for capturing most structural details of the protein models, which usually is achieved by much more complex energy functions.
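The assessment measures discussed above can be illustrated with a minimal sketch. It assumes the conventional Z-score definition (the distance of the selected decoy's energy below the mean of the decoy set, in units of standard deviation) and a plain Pearson correlation between negated energies and a structural accuracy score; the exact forms of Eqs. 11 and 12 may differ in detail. The function names, energies, and accuracy values are hypothetical toy inputs, not data from this study.

```python
from math import sqrt
from statistics import mean, pstdev


def z_score(energies, target_index):
    """Standard deviations by which the target's energy lies below the set mean.

    Assumes lower energy means a better model, so a positive value
    indicates that the target is favored by the energy function.
    """
    return (mean(energies) - energies[target_index]) / pstdev(energies)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical decoy set: energies (arbitrary units) and accuracy scores in [0, 1].
energies = [-120.0, -95.0, -80.0, -60.0, -45.0]
accuracy = [0.85, 0.70, 0.55, 0.40, 0.30]

# The "best model" is the decoy with the highest structural accuracy.
best = max(range(len(accuracy)), key=accuracy.__getitem__)
z_best = z_score(energies, best)

# Energies are negated so that a good potential gives a positive correlation
# (a sign convention assumed here for illustration).
r = pearson_r([-e for e in energies], accuracy)

print(f"Z-score of best model: {z_best:.3f}, correlation: {r:.3f}")
```

In this toy set the most accurate decoy also has the lowest energy, so both measures are favorable; for real decoy sets the two measures need not agree, which is why several assessment scores are reported together.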
Protein stability changes upon point mutations
The top-performing AACE18 and AACE167 potentials were further tested for their ability to predict the change in protein stability ΔΔG upon single mutation (see Materials and Methods). The testing was done on the benchmark set of 2648 point mutations (64) in terms of Pearson’s r for correlation of the calculated ΔΔGcalc (Eq. 14) and experimental ΔΔGexp, separately for the buried and the exposed residues at various distance cutoffs dmax (Fig. 5). Correlations were calculated after removal of outliers, defined as all points whose deviations from the least-squares linear fit fall outside the 2.5–97.5 percentile range. Surface residues are generally more susceptible to structural variations because their side chains are less constrained by the neighbors. In addition, there is a significant solvent contribution to their energetics. These effects are especially hard to account for by any energy function. Indeed, both potentials had a significant drop in performance (by ∼50% in terms of r) for the surface residues compared to the buried residues (Fig. 5, A and B). Overall, the AACE167 performance almost saturates at r ∼0.45 for dmax > 5 Å. However, the predictions for the surface residues are much less accurate than the predictions for the buried ones (r ∼0.2 and 0.5, respectively). For dmax < 5 Å, the performance drops almost to zero regardless of the residue exposure to solvent.
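The outlier treatment described above can be sketched as follows. This is a minimal illustration, assuming that outliers are the points whose residuals from an ordinary least-squares line lie outside the 2.5–97.5 percentile range of all residuals, with percentiles obtained by linear interpolation; the interpolation rule, the function names, and the synthetic data are assumptions made for the example.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept of y on x."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


def percentile(ordered, q):
    """Value at quantile q (0..1) of a sorted list, by linear interpolation."""
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac


def trimmed_pearson(xs, ys, lower=0.025, upper=0.975):
    """Pearson r after dropping points with extreme residuals from the fit."""
    slope, intercept = linear_fit(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    ordered = sorted(residuals)
    lo_cut, hi_cut = percentile(ordered, lower), percentile(ordered, upper)
    kept = [
        (x, y) for x, y, res in zip(xs, ys, residuals) if lo_cut <= res <= hi_cut
    ]
    kept_x, kept_y = zip(*kept)
    return pearson_r(kept_x, kept_y), len(kept)


# Synthetic "experimental" (xs) and "calculated" (ys) values with two gross outliers.
xs = [0.1 * i for i in range(100)]
ys = [x + 0.2 * ((7 * i) % 5 - 2) for i, x in enumerate(xs)]
ys[10] += 8.0
ys[90] -= 8.0

r_raw = pearson_r(xs, ys)
r_trimmed, n_kept = trimmed_pearson(xs, ys)
print(f"raw r = {r_raw:.3f}, trimmed r = {r_trimmed:.3f}, points kept = {n_kept}")
```

Because the cutoffs are percentiles of the residuals, roughly 5% of the points are always discarded, whether or not they are gross errors; this should be kept in mind when comparing correlations computed with and without trimming.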
Figure 5.
Prediction of protein stability changes upon point mutations by the AACE167 and AACE18 potentials. Experimentally determined ΔΔG values for 2648 point mutations from 131 proteins are correlated with the ones calculated by AACE167 (A) and AACE18 (B) potentials at different cutoff distances dmax. As an example, correlations for the AACE18 potential at dmax = 6.9 Å and kmin = 3 are shown separately for (C) buried (relative SASA ≤ 0.2, 1429 residues) and (D) exposed residues (relative SASA > 0.2, 1219 residues). Light gray circles correspond to the x-ray structures. Dark gray squares are based on multimodel NMR structures and show calculated ΔΔG values averaged over all states, with the error bars showing standard deviations. All points with deviations from the least-squares linear fit (solid black lines) falling outside the (0.025, 0.975) quantile range were treated as outliers, shown by open circles/squares.
The AACE18 energy function also has generally better predictions for the buried residues than for the exposed ones. However, the performance is not constant at dmax > 5 Å (Fig. 5 B). Similar to the best-model recognition from the CASP decoys (Fig. 2 D), the elevated r values are observed for the buried residues at two dmax values, 7 and 12 Å. For the surface residues, however, such peaks are not observed, and the best recapitulation of the experimental energies is achieved at dmax ∼ 8–11 Å (Fig. 5 B). Interestingly, this dmax region coincides with the region of lower performance on the buried residues. This distance range roughly corresponds to the water-mediated interactions (89), which indicates that solvent effects are better treated by the simpler AACE18 rather than by the more complicated AACE167 potential.
An example of correlation between ΔΔGexp and ΔΔGcalc calculated by the AACE18 potential for buried and exposed residues (Fig. 5, C and D) indicates that NMR structures yield slightly more accurate ΔΔGcalc estimates than the x-ray structures (r = 0.65 vs. 0.58 for buried and 0.38 vs. 0.34 for exposed residues, correspondingly). This might be related to a more adequate environment of the NMR models and to averaging over the ensemble of all models in the PDB entry. However, a direct comparison is problematic because the sets of NMR and x-ray structures consist of different proteins.
In comparison with other energy functions, the AACE18 potential ranks sixth, with rAACE18 = 0.554 compared to rdDFIRE = 0.591 for the top-performing dDFIRE (Fig. S6 A, one-sided p-value = 0.023 at 95% confidence). However, the difference in correlations between the first five energy functions is not statistically significant (p-value = 0.189 between the first, dDFIRE, and the fifth, DOPE). The performance of the other three potentials, AACE167, AACE20, and RRCE20, is significantly worse (Fig. S6 A).
Contact potentials in protein docking
Predicting protein-protein complexes from the structures of the individual monomers (protein docking) remains a challenging problem in computational structural biology because of a fine balance between different factors (shape complementarity, solvent and electrostatic effects, conformational changes, etc.) that enable specific binding but are hard to accurately account for. Thus, the protein docking problem is often addressed by coarse-grained approaches, at least at the initial modeling stages (90).
We tested how well our contact potentials score the low-resolution docking predictions by GRAMM (72, 73) (Figs. 6 and S5). Models of complexes were assessed according to the CAPRI criteria (Table S1). Predictions of acceptable or better quality were considered successful. Similar to the CASP decoys (Fig. S1), the best discrimination of the near-native models is achieved at the distance cutoffs dmax = 6.9 Å for the AACE167 and AACE18 potentials and dmax = 8.0 Å for RRCE20 and AACE20. However, the best performance is achieved at a larger sequence separation, kmin = 5. All four energy functions, in most cases, outperform the Miyazawa-Jernigan MJ3h statistical potential (80), which has recently been shown to be one of the top-performing scoring functions in protein docking (81). The largest improvement over MJ3h is achieved by the atom contact potentials AACE167 and AACE18. AACE18 also proved effective in discriminating near-native docking matches in a recent joint CASP/CAPRI round of the CASP12 competition (91).
Figure 6.
Scoring of low-resolution docking decoys. The top 100,000 matches per complex with the highest shape complementarity score from GRAMM were evaluated by the RRCE20, AACE20, AACE167, and AACE18 contact potentials, followed by clustering based on the ligand root-mean-square deviation (L-RMSD) with a 10 Å radius, for (A) Dockground Benchmark 4 and (B) Weng’s Benchmark 5. The lowest-energy model from each cluster was further assessed by the CAPRI criteria (see Table S1). The docking success rate for the 1, 10, and 100 best-scored clusters was calculated (bars). Dashed lines are the baselines of the success rates when models are ranked according to the raw shape complementarity. Solid lines are docking success rates attained by the Miyazawa-Jernigan MJ3h statistical potential.
Interestingly, the reference structure often has a higher (worse) energy score than the top near-native docking clusters (Fig. S7). This is likely caused by atom clashes in the reference structure obtained by simple structural superimposition of the unbound monomers onto corresponding bound conformations. Our docking protocol, albeit low resolution, is able to find better-scoring near-native matches. The true native conformation is generally scored higher in the case of AACE18 and in particular AACE167 potentials (Fig. S7), suggesting that taking into account protein flexibility might further improve the ranking. However, as Fig. S7 shows, selection of the native conformation from the docking decoys is still difficult. Similar success rates were reported previously for popular protein-protein docking scoring functions ZRANK (92) and integration of residue- and atom-based potentials for docking (93).
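The success-rate measure used in this section can be made concrete with a short sketch. It assumes that, for each complex, the cluster representatives are already sorted from best to worst energy and labeled with a CAPRI quality category, and that a complex counts as a success at rank N if any of its top N clusters is of acceptable or better quality. The complex identifiers and labels below are hypothetical.

```python
# CAPRI categories counted as successful predictions (acceptable or better).
SUCCESSFUL = {"acceptable", "medium", "high"}


def success_rate(ranked_labels, top_n):
    """Fraction of complexes with a successful model among the top_n clusters.

    ranked_labels maps a complex identifier to the list of CAPRI quality
    labels of its cluster representatives, ordered best energy first.
    """
    hits = sum(
        1
        for labels in ranked_labels.values()
        if any(label in SUCCESSFUL for label in labels[:top_n])
    )
    return hits / len(ranked_labels)


# Hypothetical benchmark of three complexes.
ranked_labels = {
    "complex_A": ["acceptable", "incorrect", "incorrect"],
    "complex_B": ["incorrect", "incorrect", "medium"],
    "complex_C": ["incorrect", "incorrect", "incorrect", "incorrect"],
}

top1 = success_rate(ranked_labels, 1)
top3 = success_rate(ranked_labels, 3)
top100 = success_rate(ranked_labels, 100)
print(f"top 1: {top1:.2f}, top 3: {top3:.2f}, top 100: {top100:.2f}")
```

By construction the success rate cannot decrease as N grows, so the values for the top 1, 10, and 100 clusters are best read together: the top-1 value reflects ranking quality, whereas the top-100 value is bounded by whether a near-native cluster was sampled at all.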
Correlation with protein binding affinity
Finally, we analyzed correlation of the interchain energy UAB (Eq. 15) or ΔGcalc calculated by the atom AACE18 and AACE167 potentials at various distance cutoffs dmax for the protein complexes in the affinity benchmark (69), with the experimentally determined binding affinities ΔGexp (Figs. 7 and S6 B). The experimental binding free energies were recapitulated significantly worse by the more complex AACE167 potential than by the simpler AACE18 (Fig. 7 A). Even a naïve ΔG predictor, which approximates binding free energy by the change in the SASA (ΔSASA) upon complex formation, outperformed AACE167 at almost all dmax (except small dmax ∼ 4 Å). At the same time, AACE18 again performed significantly better at dmax = 8–11 Å than at other distances, which correlates with the data on the point mutations for the exposed residues (Fig. 5 B). In this dmax range, the AACE18 energies tend to be highly correlated with ΔSASA (Fig. 7 B), which is not the case for the more complex AACE167. This indicates that desolvation effects are largely captured by the AACE18 potential, albeit in a simple form ΔGdesolvation ∼ ΔSASA. However, AACE18 performance is still superior to the naïve ΔSASA predictor (Fig. 7 A) as well as all other energy functions tested on the affinity benchmark (rAACE18 = 0.508, followed by the DFIRE potential with rDFIRE = 0.445; Fig. S6 B). In comparison, specialized affinity prediction algorithms still have a better performance, with correlations up to r = 0.53 for the full set and r = 0.75 for rigid-body cases (69).
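To illustrate how a contact potential of this kind yields an interchain energy and why the result depends on dmax, the sketch below sums pairwise contact energies over all atom pairs from two chains that lie within the cutoff. This is an assumed, simplified form; the actual Eq. 15 may include additional terms. The atom type labels, coordinates, and energy values are hypothetical and are not parameters of the AACE18 or AACE167 potentials.

```python
from math import dist


def interchain_energy(chain_a, chain_b, pair_energy, d_max):
    """Sum of contact energies over interchain atom pairs within d_max.

    chain_a and chain_b are lists of (atom_type, (x, y, z)) tuples.
    pair_energy maps an alphabetically sorted pair of atom types to an energy.
    A pair is counted as a contact if its distance does not exceed d_max
    (an inclusive cutoff is assumed here).
    """
    total = 0.0
    for type_a, xyz_a in chain_a:
        for type_b, xyz_b in chain_b:
            if dist(xyz_a, xyz_b) <= d_max:
                key = tuple(sorted((type_a, type_b)))
                total += pair_energy[key]
    return total


# Hypothetical two-atom chains (coordinates in angstroms).
chain_a = [("C", (0.0, 0.0, 0.0)), ("O", (3.0, 0.0, 0.0))]
chain_b = [("C", (0.0, 0.0, 5.0)), ("N", (3.0, 0.0, 12.0))]

# Hypothetical symmetric contact energies in arbitrary units.
pair_energy = {
    ("C", "C"): -0.5,
    ("C", "O"): 0.2,
    ("C", "N"): -0.1,
    ("N", "O"): -0.4,
}

u_short = interchain_energy(chain_a, chain_b, pair_energy, d_max=6.9)
u_long = interchain_energy(chain_a, chain_b, pair_energy, d_max=12.5)
print(f"U_AB at 6.9 A: {u_short:.2f}, U_AB at 12.5 A: {u_long:.2f}")
```

With the shorter cutoff only two of the four atom pairs are in contact; the longer cutoff adds the remaining two pairs and changes the total, which is the mechanism behind the dmax dependence of the correlations discussed above.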
Figure 7.
Prediction of protein binding affinities by the AACE167 and AACE18 potentials. (A) Experimentally determined binding free energies (ΔGexp) for 92 rigid-body complexes from affinity benchmark 2 are correlated with the ones calculated (ΔGcalc) by the AACE167 and AACE18 potentials with varying cutoff distances dmax. The dashed line at 0.375 shows the performance of a naïve predictor, which approximates binding free energy by the change in solvent-accessible surface area (ΔSASA) upon complex formation. (B) The correlation of calculated binding free energies ΔGcalc and ΔSASA values is shown.
Conclusions
In summary, we presented a framework for generating semiempirical general-purpose contact potentials for protein structure modeling. The potentials are derived from the Potts model by solving the inverse statistical physics problem. The model contains only two adjustable parameters: the interaction distance cutoff dmax and the sequence separation kmin for the interacting units (residues or atoms). No other assumptions on the protein topology or information on intraresidue connectivity were used. Unlike many other statistical potentials, our derivation scheme explicitly includes one-body energy terms, which are shown to be a significant component of the model, boosting the effectiveness of the derived potentials.
The potentials were derived purely from the structural data in the PDB and are completely independent of any reference state. The results showed that they are successful not only in recognizing near-native models of individual proteins but also in scoring protein docking decoys, recapitulating the experimental binding energies, and predicting stability changes upon point mutations. Such transferability of atomic potentials is strongly dependent on the assignment of the atom types. Among the three considered assignment schemes, the grouping of atoms according to their physicochemical properties yielded the consistently top-performing AACE18 potential. Interestingly, despite the effectiveness of the most detailed AACE167 potential in scoring CASP decoys, it was much less effective at recapitulating experimental free energies. The large number of atom types enables fine-tuning of the AACE167 potential to achieve a high decoy discrimination rate. However, at the same time, it reduces its transferability to other applications.
It should also be noted that even for the most effective AACE18 potential, it is hard, if possible at all, to use one optimal contact distance dmax for all applications. For example, for discrimination of the structural decoys (for both the individual proteins and the protein complexes), the optimal dmax value was 6.9 Å. However, dmax = 8.0 Å yielded better correlation with the experimentally determined binding free energies. This discrepancy was shown to be at least partially related to the solvent effects, which are not explicitly taken into account by our model.
Overall, it is quite remarkable that in a wide range of protein structure modeling applications, simple contact potentials with no distance or orientation dependencies are sufficient for the same or better performance than much more complex knowledge-based energy functions used in the field. However, such simplicity may also limit the potentials’ applicability, e.g., in structure refinement, because of the lack of sensitivity to small changes in atom-atom distances inherent to contact potentials. Complementing contact potentials with other scoring terms (e.g., the extent of clashes, surface area, etc.) is one way to overcome this problem, which we explored in the CASP-CAPRI competition (91).
In the future, we plan further development of the statistical potentials by incorporating distance dependence and solvent effects as well as exploring higher-order interactions (e.g., including three-body terms in Eq. 4). We will also explore different atom types, including hydrogen atoms and different protonation states of the titratable residues. On the learning side, more thorough selection of the training set, as well as inclusion of the interchain contacts from biological assemblies in PDB, could also lead to better contact energy estimates. All these questions can be addressed within the approach presented in this work.
The potentials are available at http://vakser.compbio.ku.edu/main/resources.php
Author Contributions
I.A. was responsible for concept, design, acquisition, analysis and interpretation of data, and writing of the manuscript. P.J.K. was responsible for supervision, analysis and interpretation of data, and writing of the article. I.A.V. was responsible for supervision, analysis and interpretation of data, and writing of the article. All authors read and approved the final manuscript.
Acknowledgments
This study was supported by National Institutes of Health grant R01GM074255 and National Science Foundation grants DBI1262621 and DBI1565107.
Editor: Amedeo Caflisch.
Footnotes
Ivan Anishchenko’s present address is Department of Biochemistry, University of Washington, Seattle, Washington.
Supporting Materials and Methods, seven figures, and two tables are available at http://www.biophysj.org/biophysj/supplemental/S0006-3495(18)30923-8.
Contributor Information
Petras J. Kundrotas, Email: pkundro@ku.edu.
Ilya A. Vakser, Email: vakser@ku.edu.
References
- 1. MacKerell A.D., Bashford D., Karplus M. All-atom empirical potential for molecular modeling and dynamics studies of proteins. J. Phys. Chem. B. 1998;102:3586–3616. doi: 10.1021/jp973084f.
- 2. Cornell W.D., Cieplak P., Kollman P.A. A second generation force field for the simulation of proteins, nucleic acids, and organic molecules. J. Am. Chem. Soc. 1996;118:2309–2309.
- 3. Jorgensen W.L., Maxwell D.S., Tirado-Rives J. Development and testing of the OPLS all-atom force field on conformational energetics and properties of organic liquids. J. Am. Chem. Soc. 1996;118:11225–11236.
- 4. Oostenbrink C., Villa A., van Gunsteren W.F. A biomolecular force field based on the free enthalpy of hydration and solvation: the GROMOS force-field parameter sets 53A5 and 53A6. J. Comput. Chem. 2004;25:1656–1676. doi: 10.1002/jcc.20090.
- 5. Tanaka S., Scheraga H.A. Medium- and long-range interaction parameters between amino acids for predicting three-dimensional structures of proteins. Macromolecules. 1976;9:945–950. doi: 10.1021/ma60054a013.
- 6. Miyazawa S., Jernigan R.L. Estimation of effective interresidue contact energies from protein crystal structures: quasi-chemical approximation. Macromolecules. 1985;18:534–552.
- 7. Sippl M.J. Calculation of conformational ensembles from potentials of mean force. An approach to the knowledge-based prediction of local structures in globular proteins. J. Mol. Biol. 1990;213:859–883. doi: 10.1016/s0022-2836(05)80269-4.
- 8. Lazaridis T., Karplus M. Effective energy functions for protein structure prediction. Curr. Opin. Struct. Biol. 2000;10:139–145. doi: 10.1016/s0959-440x(00)00063-4.
- 9. Park B., Levitt M. Energy functions that discriminate X-ray and near native folds from well-constructed decoys. J. Mol. Biol. 1996;258:367–392. doi: 10.1006/jmbi.1996.0256.
- 10. Zhou H., Zhou Y. Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction. Protein Sci. 2002;11:2714–2726. doi: 10.1110/ps.0217002.
- 11. Buchete N.V., Straub J.E., Thirumalai D. Development of novel statistical potentials for protein fold recognition. Curr. Opin. Struct. Biol. 2004;14:225–232. doi: 10.1016/j.sbi.2004.03.002.
- 12. Skolnick J. In quest of an empirical potential for protein structure prediction. Curr. Opin. Struct. Biol. 2006;16:166–171. doi: 10.1016/j.sbi.2006.02.004.
- 13. Mintseris J., Weng Z. Atomic contact vectors in protein-protein recognition. Proteins. 2003;53:629–639. doi: 10.1002/prot.10432.
- 14. Chuang G.Y., Kozakov D., Vajda S. DARS (decoys as the reference state) potentials for protein-protein docking. Biophys. J. 2008;95:4217–4227. doi: 10.1529/biophysj.108.135814.
- 15. Liu S., Vakser I.A. DECK: distance and environment-dependent, coarse-grained, knowledge-based potentials for protein-protein docking. BMC Bioinformatics. 2011;12:280. doi: 10.1186/1471-2105-12-280.
- 16. Hu C., Li X., Liang J. Developing optimal non-linear scoring function for protein design. Bioinformatics. 2004;20:3080–3098. doi: 10.1093/bioinformatics/bth369.
- 17. Boas F.E., Harbury P.B. Potential energy functions for protein design. Curr. Opin. Struct. Biol. 2007;17:199–204. doi: 10.1016/j.sbi.2007.03.006.
- 18. Guerois R., Nielsen J.E., Serrano L. Predicting changes in the stability of proteins and protein complexes: a study of more than 1000 mutations. J. Mol. Biol. 2002;320:369–387. doi: 10.1016/S0022-2836(02)00442-4.
- 19. Zhang C., Vasmatzis G., DeLisi C. Determination of atomic desolvation energies from the structures of crystallized proteins. J. Mol. Biol. 1997;267:707–726. doi: 10.1006/jmbi.1996.0859.
- 20. Bordner A.J., Abagyan R.A. Large-scale prediction of protein geometry and stability changes for arbitrary single point mutations. Proteins. 2004;57:400–413. doi: 10.1002/prot.20185.
- 21. Bryngelson J.D., Wolynes P.G. Spin glasses and the statistical mechanics of protein folding. Proc. Natl. Acad. Sci. USA. 1987;84:7524–7528. doi: 10.1073/pnas.84.21.7524.
- 22. Li H., Tang C., Wingreen N.S. Nature of driving force for protein folding: a result from analyzing the statistical potential. Phys. Rev. Lett. 1997;79:765–768.
- 23. Mirny L., Shakhnovich E. Protein folding theory: from lattice to all-atom models. Annu. Rev. Biophys. Biomol. Struct. 2001;30:361–396. doi: 10.1146/annurev.biophys.30.1.361.
- 24. Pokarowski P., Kloczkowski A., Kolinski A. Inferring ideal amino acid interaction forms from statistical protein contact potentials. Proteins. 2005;59:49–57. doi: 10.1002/prot.20380.
- 25. Betancourt M.R., Thirumalai D. Pair potentials for protein folding: choice of reference states and sensitivity of predicted native states to variations in the interaction schemes. Protein Sci. 1999;8:361–369. doi: 10.1110/ps.8.2.361.
- 26. Samudrala R., Moult J. An all-atom distance-dependent conditional probability discriminatory function for protein structure prediction. J. Mol. Biol. 1998;275:895–916. doi: 10.1006/jmbi.1997.1479.
- 27. Shen M.Y., Sali A. Statistical potential for assessment and prediction of protein structures. Protein Sci. 2006;15:2507–2524. doi: 10.1110/ps.062416606.
- 28. Rykunov D., Fiser A. Effects of amino acid composition, finite size of proteins, and sparse statistics on distance-dependent statistical pair potentials. Proteins. 2007;67:559–568. doi: 10.1002/prot.21279.
- 29. Zhang J., Zhang Y. A novel side-chain orientation dependent potential derived from random-walk reference state for protein fold selection and structure prediction. PLoS One. 2010;5:e15386. doi: 10.1371/journal.pone.0015386.
- 30. Goldstein R.A., Luthey-Schulten Z.A., Wolynes P.G. Protein tertiary structure recognition using optimized Hamiltonians with local interactions. Proc. Natl. Acad. Sci. USA. 1992;89:9029–9033. doi: 10.1073/pnas.89.19.9029.
- 31. Maiorov V.N., Crippen G.M. Contact potential that recognizes the correct folding of globular proteins. J. Mol. Biol. 1992;227:876–888. doi: 10.1016/0022-2836(92)90228-c.
- 32. Thomas P.D., Dill K.A. An iterative method for extracting energy-like quantities from protein structures. Proc. Natl. Acad. Sci. USA. 1996;93:11628–11633. doi: 10.1073/pnas.93.21.11628.
- 33. Thomas P.D., Dill K.A. Statistical potentials extracted from protein structures: how accurate are they? J. Mol. Biol. 1996;257:457–469. doi: 10.1006/jmbi.1996.0175.
- 34. Ben-Naim A. Statistical potentials extracted from protein structures: are these meaningful potentials? J. Chem. Phys. 1997;107:3698–3706.
- 35. Hamelryck T., Borg M., Ferkinghoff-Borg J. Potentials of mean force for protein structure prediction vindicated, formalized and generalized. PLoS One. 2010;5:e13714. doi: 10.1371/journal.pone.0013714.
- 36. Habeck M. Bayesian approach to inverse statistical mechanics. Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 2014;89:052113. doi: 10.1103/PhysRevE.89.052113.
- 37. Ekeberg M., Lövkvist C., Aurell E. Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models. Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 2013;87:012707. doi: 10.1103/PhysRevE.87.012707.
- 38. Levy R.M., Haldane A., Flynn W.F. Potts Hamiltonian models of protein co-variation, free energy landscapes, and evolutionary fitness. Curr. Opin. Struct. Biol. 2017;43:55–62. doi: 10.1016/j.sbi.2016.11.004.
- 39. Brush S.G. History of the Lenz-Ising model. Rev. Mod. Phys. 1967;39:883–893.
- 40. Wu F.Y. The Potts model. Rev. Mod. Phys. 1982;54:235–268.
- 41. Schneidman E., Berry M.J., II, Bialek W. Weak pairwise correlations imply strongly correlated network states in a neural population. Nature. 2006;440:1007–1012. doi: 10.1038/nature04701.
- 42. Cocco S., Leibler S., Monasson R. Neuronal couplings between retinal ganglion cells inferred by efficient inverse statistical physics methods. Proc. Natl. Acad. Sci. USA. 2009;106:14058–14062. doi: 10.1073/pnas.0906705106.
- 43. Lezon T.R., Banavar J.R., Fedoroff N.V. Using the principle of entropy maximization to infer genetic interaction networks from gene expression patterns. Proc. Natl. Acad. Sci. USA. 2006;103:19033–19038. doi: 10.1073/pnas.0609152103.
- 44. Marks D.S., Colwell L.J., Sander C. Protein 3D structure computed from evolutionary sequence variation. PLoS One. 2011;6:e28766. doi: 10.1371/journal.pone.0028766.
- 45. Balakrishnan S., Kamisetty H., Langmead C.J. Learning generative models for protein fold families. Proteins. 2011;79:1061–1078. doi: 10.1002/prot.22934.
- 46. Figliuzzi M., Jacquier H., Weigt M. Coevolutionary landscape inference and the context-dependence of mutations in beta-lactamase TEM-1. Mol. Biol. Evol. 2016;33:268–280. doi: 10.1093/molbev/msv211.
- 47. Shekhar K., Ruberman C.F., Chakraborty A.K. Spin models inferred from patient-derived viral sequence data faithfully describe HIV fitness landscapes. Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 2013;88:062705. doi: 10.1103/PhysRevE.88.062705.
- 48. Flynn W.F., Haldane A., Levy R.M. Inference of epistatic effects leading to entrenchment and drug resistance in HIV-1 protease. Mol. Biol. Evol. 2017;34:1291–1306. doi: 10.1093/molbev/msx095.
- 49. Jaynes E.T. Information theory and statistical mechanics. Phys. Rev. 1957;106:620–630.
- 50. Bonnard C., Kleinman C.L., Lartillot N. Fast optimization of statistical potentials for structurally constrained phylogenetic models. BMC Evol. Biol. 2009;9:227. doi: 10.1186/1471-2148-9-227.
- 51. Kleinman C.L., Rodrigue N., Lartillot N. A maximum likelihood framework for protein design. BMC Bioinformatics. 2006;7:326. doi: 10.1186/1471-2105-7-326.
- 52. Zhou X., Schmidler S.C. Bayesian Parameter Estimation in Ising and Potts Models: A Comparative Study with Applications to Protein Modeling. Department of Statistical Science, Duke University; Durham, NC: 2009.
- 53. Kindermann R., Snell J.L. Markov Random Fields and Their Applications. American Mathematical Society; Providence, RI: 1980.
- 54. Kuhlman B., Baker D. Native protein sequences are close to optimal for their structures. Proc. Natl. Acad. Sci. USA. 2000;97:10383–10388. doi: 10.1073/pnas.97.19.10383.
- 55. Kuhlman B., Dantas G., Baker D. Design of a novel globular protein fold with atomic-level accuracy. Science. 2003;302:1364–1368. doi: 10.1126/science.1089427.
- 56. Gidas B. Consistency of maximum likelihood and pseudo-likelihood estimators for Gibbs distributions. In: Fleming W., Lions P.-L., editors. Stochastic Differential Systems, Stochastic Control Theory and Applications. Springer; 1988. pp. 129–145.
- 57. Aurell E., Ekeberg M. Inverse Ising inference using all the data. Phys. Rev. Lett. 2012;108:090201. doi: 10.1103/PhysRevLett.108.090201.
- 58. Wang G., Dunbrack R.L., Jr. PISCES: recent improvements to a PDB sequence culling server. Nucleic Acids Res. 2005;33:W94–W98. doi: 10.1093/nar/gki402.
- 59. Dolinsky T.J., Nielsen J.E., Baker N.A. PDB2PQR: an automated pipeline for the setup of Poisson-Boltzmann electrostatics calculations. Nucleic Acids Res. 2004;32:W665–W667. doi: 10.1093/nar/gkh381.
- 60. Fletcher R. Structure of methods. In: Practical Methods of Optimization. John Wiley & Sons, Ltd.; 2000. pp. 12–43.
- 61. Moult J., Fidelis K., Tramontano A. Critical assessment of methods of protein structure prediction: progress and new directions in round XI. Proteins. 2016;84(Suppl 1):4–14. doi: 10.1002/prot.25064.
- 62. Moult J., Fidelis K., Tramontano A. Critical assessment of methods of protein structure prediction (CASP)--round x. Proteins. 2014;82(Suppl 2):1–6. doi: 10.1002/prot.24452.
- 63. Zemla A. LGA: a method for finding 3D similarities in protein structures. Nucleic Acids Res. 2003;31:3370–3374. doi: 10.1093/nar/gkg571.
- 64. Dehouck Y., Grosfils A., Rooman M. Fast and accurate predictions of protein stability changes upon mutations using statistical potentials and neural networks: PoPMuSiC-2.0. Bioinformatics. 2009;25:2537–2543. doi: 10.1093/bioinformatics/btp445.
- 65. Bava K.A., Gromiha M.M., Sarai A. ProTherm, version 4.0: thermodynamic database for proteins and mutants. Nucleic Acids Res. 2004;32:D120–D121. doi: 10.1093/nar/gkh082.
- 66. Krivov G.G., Shapovalov M.V., Dunbrack R.L., Jr. Improved prediction of protein side-chain conformations with SCWRL4. Proteins. 2009;77:778–795. doi: 10.1002/prot.22488.
- 67. Rose G.D., Geselowitz A.R., Zehfus M.H. Hydrophobicity of amino acid residues in globular proteins. Science. 1985;229:834–838. doi: 10.1126/science.4023714.
- 68. Krull F., Korff G., Knapp E.W. ProPairs: a data set for protein-protein docking. J. Chem. Inf. Model. 2015;55:1495–1507. doi: 10.1021/acs.jcim.5b00082.
- 69. Vreven T., Moal I.H., Weng Z. Updates to the integrated protein-protein interaction benchmarks: docking benchmark version 5 and affinity benchmark version 2. J. Mol. Biol. 2015;427:3031–3041. doi: 10.1016/j.jmb.2015.07.016.
- 70. Zhang Y., Skolnick J. Scoring function for automated assessment of protein structure template quality. Proteins. 2004;57:702–710. doi: 10.1002/prot.20264.
- 71. Kundrotas P.J., Anishchenko I., Vakser I.A. Dockground: a comprehensive data resource for modeling of protein complexes. Protein Sci. 2018;27:172–181. doi: 10.1002/pro.3295.
- 72. Katchalski-Katzir E., Shariv I., Vakser I.A. Molecular surface recognition: determination of geometric fit between proteins and their ligands by correlation techniques. Proc. Natl. Acad. Sci. USA. 1992;89:2195–2199. doi: 10.1073/pnas.89.6.2195.
- 73. Vakser I.A. Protein docking for low-resolution structures. Protein Eng. 1995;8:371–377. doi: 10.1093/protein/8.4.371.
- 74. Méndez R., Leplae R., Wodak S.J. Assessment of blind predictions of protein-protein interactions: current status of docking methods. Proteins. 2003;52:51–67. doi: 10.1002/prot.10393.
- 75. Yang Y., Zhou Y. Specific interactions for ab initio folding of protein terminal regions with secondary structures. Proteins. 2008;72:793–803. doi: 10.1002/prot.21968.
- 76. Yang Y., Zhou Y. Ab initio folding of terminal segments with secondary structures reveals the fine difference between two closely related all-atom statistical energy functions. Protein Sci. 2008;17:1212–1219. doi: 10.1110/ps.033480.107.
- 77. Zhou H., Skolnick J. GOAP: a generalized orientation-dependent, all-atom statistical potential for protein structure prediction. Biophys. J. 2011;101:2043–2052. doi: 10.1016/j.bpj.2011.09.012.
- 78. Lu M., Dousis A.D., Ma J. OPUS-PSP: an orientation-dependent statistical all-atom potential derived from side-chain packing. J. Mol. Biol. 2008;376:288–301. doi: 10.1016/j.jmb.2007.11.033.
- 79.Rykunov D., Fiser A. New statistical potential for quality assessment of protein models and a survey of energy functions. BMC Bioinformatics. 2010;11:128. doi: 10.1186/1471-2105-11-128. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 80.Miyazawa S., Jernigan R.L. Self-consistent estimation of inter-residue protein contact energies based on an equilibrium mixture approximation of residues. Proteins. 1999;34:49–68. doi: 10.1002/(sici)1097-0134(19990101)34:1<49::aid-prot5>3.0.co;2-l. [DOI] [PubMed] [Google Scholar]
- 81.Moal I.H., Torchala M., Fernández-Recio J. The scoring of poses in protein-protein docking: current capabilities and future directions. BMC Bioinformatics. 2013;14:286. doi: 10.1186/1471-2105-14-286. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 82.Anishchenko I., Kundrotas P.J., Vakser I.A. Modeling complexes of modeled proteins. Proteins. 2017;85:470–478. doi: 10.1002/prot.25183. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 83.Melo F., Sánchez R., Sali A. Statistical potentials for fold assessment. Protein Sci. 2002;11:430–448. doi: 10.1002/pro.110430. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 84.Samudrala R., Levitt M. Decoys ‘R’ Us: a database of incorrect conformations to improve protein structure prediction. Protein Sci. 2000;9:1399–1401. doi: 10.1110/ps.9.7.1399. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 85.Simons K.T., Kooperberg C., Baker D. Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J. Mol. Biol. 1997;268:209–225. doi: 10.1006/jmbi.1997.0959. [DOI] [PubMed] [Google Scholar]
- 86. Moult J., Pedersen J.T., Fidelis K. A large-scale experiment to assess protein structure prediction methods. Proteins. 1995;23:ii–v. doi: 10.1002/prot.340230303.
- 87. Handl J., Knowles J., Lovell S.C. Artefacts and biases affecting the evaluation of scoring functions on decoy sets for protein structure prediction. Bioinformatics. 2009;25:1271–1279. doi: 10.1093/bioinformatics/btp150.
- 88. Cossio P., Granata D., Trovato A. A simple and efficient statistical potential for scoring ensembles of protein structures. Sci. Rep. 2012;2:351.
- 89. Levy Y., Onuchic J.N. Water mediation in protein folding and molecular recognition. Annu. Rev. Biophys. Biomol. Struct. 2006;35:389–415. doi: 10.1146/annurev.biophys.35.040405.102134.
- 90. Vakser I.A. Low-resolution structural modeling of protein interactome. Curr. Opin. Struct. Biol. 2013;23:198–205. doi: 10.1016/j.sbi.2012.12.003.
- 91. Kundrotas P.J., Anishchenko I., Vakser I.A. Modeling CAPRI targets 110-120 by template-based and free docking using contact potential and combined scoring function. Proteins. 2018;86(Suppl 1):302–310. doi: 10.1002/prot.25380.
- 92. Pierce B., Weng Z. ZRANK: reranking protein docking predictions with an optimized energy function. Proteins. 2007;67:1078–1086. doi: 10.1002/prot.21373.
- 93. Vreven T., Hwang H., Weng Z. Integrating atom-based and residue-based scoring functions for protein-protein docking. Protein Sci. 2011;20:1576–1586. doi: 10.1002/pro.687.
- 94. Lindeman R.H., Merenda P.F., Gold R.Z. Introduction to bivariate and multivariate analysis. Scott, Foresman and Company; Glenview, IL: 1980.
- 95. Grömping U. Relative importance for linear regression in R: the package relaimpo. J. Stat. Softw. 2006;17:1–27.