The Journal of Chemical Physics. 2011 Aug 12;135(6):065102. doi: 10.1063/1.3615722

Energy design for protein-protein interactions

D V S Ravikant 1, Ron Elber 2,a)
PMCID: PMC3170394  PMID: 21842951

Abstract

Proteins bind to other proteins efficiently and specifically to carry out many cell functions such as signaling, activation, transport, enzymatic reactions, and more. To determine the geometry and strength of binding of a protein pair, an energy function is required. An algorithm to design an optimal energy function, based on empirical data of protein complexes, is proposed and applied. Emphasis is placed on negative design, in which incorrect geometries are presented to the algorithm, which learns to avoid them. For the docking problem the search for plausible geometries can be performed exhaustively. The possible geometries of the complex are generated on a grid with the help of a fast Fourier transform algorithm. A novel formulation of negative design makes it possible to investigate iteratively hundreds of millions of negative examples while monotonically improving the quality of the potential. Experimental structures for 640 protein complexes are used to generate positive and negative examples for learning parameters. The algorithm designed in this work finds the correct binding structure as the lowest energy minimum in 318 cases of the 640 examples. Further benchmarks on independent sets confirm the significant capacity of the scoring function to recognize correct modes of interactions.

INTRODUCTION

Protein-protein interactions and their associated structures are of fundamental importance in cellular biology. These interactions are used for signaling, creating biochemically active complexes, inhibiting enzymes, and more.1, 2 In many cases they are also constantly dynamic. They form, dissociate, and rebind as required by cell functions. Modeling complex formation is therefore particularly challenging and in many cases requires accurate prediction of weak physical forces and marginal binding.

In the present paper we focus on the prediction of the correct geometry of a protein pair that is known to bind. For the task at hand, the prediction of the absolute binding energy is less critical and we focus instead on ranking. We determine the complex geometry that will have the lowest (free) energy compared to all other docking alternatives. This geometry should be in agreement with experiment. The mesoscopic size of proteins and their complexes, and their rough energy landscape make the prediction of the geometry of the complexes challenging.

We differentiate between two cases: (i) bound and (ii) unbound docking. In bound docking we consider a complex of two protein chains with a known structure. We separate the complex into two chains and attempt to re-assemble them. Since the two chains are taken directly from the complex, there exists at least a single docked complex in which the fitted geometry is excellent. The second case, (ii) unbound docking, is more complicated. We are given the structures of the two isolated chains and are told that these proteins form a complex. However, the structures at hand are approximate. The atomic positions, taken from the experimental structures of the separated chains (or homologous structures), are not necessarily the same as in the complex. Side chain geometries and tertiary conformations adjust during complex formation and can cause significant deviation from the initial structure. Therefore bound docking (case (i)), for which rigid modeling of the individual chains is exact, is considered easier than unbound docking (case (ii)).

In actual applications we do not have the structure of the complex (if we had it, we would not need to predict it) and only unbound docking is relevant. Bound docking is used to assess new algorithms and learn energy parameters by presenting to a program cases that carry unusually strong signals. For an algorithm to be successful it must (as a minimum) solve these easy cases. Despite the significant differences in difficulty, docking of type (ii) is handled in a similar way to case (i). We dock rigid models of the proteins, allowing for larger errors during the process for unbound docking, with the hope that the differences between the bound and the unbound structures are not so large as to diminish the signal completely. Adjustments of complexes of type (ii) to more relaxed and chemically sound structures are done for a small number of candidates identified earlier.

Both bound and unbound docking require two separate computational tasks: (a) search for plausible docked conformations and (b) assessment of alternative complexes and ranking. It is useful to compare these two tasks to another problem in structural biology, the problem of protein structure prediction. Docking (determination of protein complexes) is simpler since the number of degrees of freedom is much smaller. It can be as small as six (three rotations and three translations) if the structures of the individual protein chains are assumed rigid. In our searches, which are exhaustive, we examine a million translations and 54 000 rotations. The total number of complexes we examine is therefore on the order of $10^{10}$. This not-so-small set is a uniform sampling of docking space. Exhaustive sampling is unlikely in the general protein-folding problem. In protein folding, the number of conformations is exponentially large in the protein length $L$: about $z^L$, where $z$ is a number of order 10 and $L \sim 100$. As a result, exhaustive search is not feasible for folding, and conformational sampling is done stochastically and heuristically, which affects the design of appropriate energy functions. In contrast, the option of a comprehensive search of docked conformations makes it possible to solve for potential parameters that are optimal for the learning set, that is, parameters that minimize the error in docking calculations. Even if the problem is infeasible and there is no set of parameters that recognizes all the correctly docked conformations, it is possible to find a parameter set that minimizes the extent of mispredictions.

In contrast to protein folding, rigid docking has only six dimensions (the relative translations and rotations of one rigid protein with respect to the second protein). The smaller dimensionality makes exact enumeration of discrete space possible. In Appendix A we analyze the errors of a discrete space representation, illustrate that they are bounded, and provide a practical grid spacing. Based on the analysis presented in Appendix A, and for a pair of proteins with radii of about 40 Å, we estimate a translational grid spacing of 1.6 Å and 68 760 rotations. This estimate provides ∼$10^{10}$ alternative docked conformations.
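The size estimates quoted above are simple products, and a short sketch makes the orders of magnitude explicit. The function and variable names are illustrative assumptions and do not come from the authors' code; only the counts (one million translations, 54 000 or 68 760 rotations, and the folding estimate with z of order 10 and L of order 100) are taken from the text.

```python
import math

def docked_conformations(n_translations: int, n_rotations: int) -> int:
    """Number of rigid-body poses on a translation-by-rotation grid."""
    return n_translations * n_rotations

# Counts quoted in the text; the names are illustrative, not the authors' code.
N_TRANSLATIONS = 10**6
for n_rot in (54_000, 68_760):
    total = docked_conformations(N_TRANSLATIONS, n_rot)
    print(f"{n_rot} rotations: {total:.2e} poses, log10 = {math.log10(total):.2f}")

# Folding estimate z**L with z ~ 10 and L ~ 100, for comparison.
print(f"folding: log10(z**L) = {100 * math.log10(10):.0f}")
```

Both rotation counts give a few times 10^10 poses, which is why the docking search can be enumerated while the folding search cannot.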

In the present paper we do not introduce a new sampling algorithm for docking and use instead approaches that were employed successfully in the past. These techniques are based on fast Fourier transforms (FFT) (Refs. 3, 11, 27, 28, 29, 30, 35) of translational space and grid based searches of rotational space. The contribution of the present paper is in the calculation of the energy. This brings us to the second step of determining a scoring function.

Besides exhaustive search of conformations, we also need to score (or compute the energies of) the structures of the complexes or folds. Obviously an exact energy surface for solvated proteins should work for both folding and docking. Physically, however, while protein folding emphasizes hydrophobicity, docking may include more subtle polar interactions. The use of different potentials for each case allows emphasizing physical interactions that better fit the problem we study. Hence, it makes sense to design an energy function specifically tailored to docking, as is done in the present paper.

While the sampling of conformational space is significantly easier in docking compared to folding, the design of energy functions for docking and folding is comparable in complexity. There are a number of reasons for the additional complexity of energy design for protein-protein interactions. First, the energies of complex formation are small and are sometimes as low as a few kT (the thermal energy). The demands on ab initio or physics-based energies are therefore very high. Physics-based energies are usually not accurate enough to separate wrong and correct structures in protein folding calculations. Reproducing the smaller energy differences in docking is even more difficult. Note that free energy differences, including changes in solvent reorganization, are required for this estimate. Hence, not only is the accuracy insufficient, but the significant computational cost also forbids large scale examination of docked alternatives. Due to computational costs most estimates of solvation effects are implicit and approximate (such as the GBSA model of solvation4).

Second, not only the overall binding energy but also the number of individual pair interactions in protein complexes is small compared to the number of interactions of folded proteins. The stability of complexes, supported by only a few contacts, is marginal and leaves little room for errors (a contact is set between two residues or atoms if the distance between the two objects is smaller than a critical value).

Third, the statistics of empirical complexes is rather poor. This observation has important consequences for machine learning approaches to potential design. Determining parameters of energies learned directly from the structures of the complexes, the so-called knowledge-based potentials, depends on the availability of ample empirical data. In protein folding a large number of correct folds is available (about 72 000 in the Protein Data Bank (PDB) (Ref. 5)). Each of these folds contributes (in principle) residue-residue contacts to the statistics. In our fold database6 we have about 18 000 independent structures. The statistics for protein complexes is significantly smaller. There are fewer than 1000 structures of independent protein-protein complexes in the PDB, and each of these complexes has fewer contacts at the interface than the typical number of contacts in folded proteins. The smaller statistics of correct complexes makes the estimation of parameters for knowledge-based potentials more difficult.

One of the more popular models for deriving knowledge-based potentials for protein folding is the log-odds ratio, or statistical potential.7, 8 In this design the structures of the proteins are examined and probability densities for contacts are computed. The energy of interaction is given by a sum of pair interactions, $U = \sum_{i>j} W_{\alpha(i),\beta(j)}(r_{ij})$. The indices i, j denote positions of the amino acids (or other particles) along the chain, and α, β are the indices of particle types. If the geometric center of an amino acid side chain of type α is $\mathbf{r}^g_\alpha$, the potential of mean force between a pair of amino acids is estimated as

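$$W_{\alpha\beta}(r_{\alpha\beta}) = -\log\frac{P_{\alpha\beta}(r_{\alpha\beta})}{P_{\alpha\beta,\mathrm{ref}}(r_{\alpha\beta})}, \qquad r_{\alpha\beta} = \left|\mathbf{r}^g_\alpha - \mathbf{r}^g_\beta\right|,$$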

where $P_{\alpha\beta}(r_{\alpha\beta})$ is the probability of observing a distance $r_{\alpha\beta}$ between the two amino acid types α and β anywhere in the learning set, and $P_{\alpha\beta,\mathrm{ref}}(r_{\alpha\beta})$ is a reference distribution of a model expected by chance. For example, it can be a product of probabilities: the probability for a distance between any two amino acids times the probabilities of observing amino acids α and β, i.e., $P_{\alpha\beta,\mathrm{ref}}(r_{\alpha\beta}) \approx P_\alpha P_\beta P(r)$.
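As an illustration of the log-odds construction, the following minimal sketch estimates a potential from binned distance observations. It assumes the common sign convention in which a distance seen more often than expected by chance receives a negative (favorable) energy. The function name, the toy data, and the pseudocount smoothing are hypothetical choices made for this example and are not part of the original method.

```python
import math
from collections import Counter

def log_odds_potential(observed, reference, pseudocount=1.0):
    """Return W[bin] = -log(P_obs[bin] / P_ref[bin]).

    `observed` and `reference` are lists of distance-bin labels for one
    pair of particle types; the pseudocount avoids log(0) for empty bins.
    """
    bins = sorted(set(observed) | set(reference))
    obs, ref = Counter(observed), Counter(reference)
    n_obs = len(observed) + pseudocount * len(bins)
    n_ref = len(reference) + pseudocount * len(bins)
    potential = {}
    for b in bins:
        p_obs = (obs[b] + pseudocount) / n_obs
        p_ref = (ref[b] + pseudocount) / n_ref
        potential[b] = -math.log(p_obs / p_ref)
    return potential

# Toy data: bin 0 ("close") is enriched in the observed set, bin 1 is depleted.
W = log_odds_potential(observed=[0, 0, 0, 1], reference=[0, 1, 1, 1])
print(W)
```

With so few counts per bin the estimate is dominated by the pseudocount, which is the small-statistics problem the text raises for docking.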

This simple model and method, learning from known structures of proteins, has proven very effective in studying protein structures and is at the core of many successful protein-folding programs.9, 10 However, direct application of statistical potentials "as is" to docking is limited by the small statistics available, as we also illustrate in the present paper. Some programs for docking use a combination of physical interactions (e.g., electrostatic, exposed surface area) and statistical potentials.11 Here we re-emphasize learning from negative examples. Negative examples are not used in learning statistical potentials and are particularly promising in docking, in which exhaustive enumeration of all false complexes is made.

General consideration for design of docking potential with mathematical programming

For the purpose of separating correct from incorrect structures we consider the approach of linear and quadratic programming. In linear programming one learns from both positive and negative examples by requiring that the following inequalities are satisfied:

$$U(X_d;P) - U(X_n;P) > 0 \quad \forall\, n,d, \tag{1}$$

where U is the potential energy, Xn and Xd are the coordinate vector of the correct (native) and decoy structures, respectively, and P is a vector of potential parameters. The energy depends linearly on the parameters P, which are the unknowns we wish to determine. Equation 1 is therefore re-written as

$$\sum_{\alpha} p_\alpha \left[f_\alpha(X_d) - f_\alpha(X_n)\right] > 0. \tag{2}$$

The α summation is over the parameters, and the functions fα(X) depend only on geometrical variables, for example, the distance between two amino acids. The linear dependence is not a theoretical limitation since any potential can be expanded in a basis set with linear coefficients to be determined. However, in practice the choice of the expansion can impact the flexibility and reliability of the results, and is discussed further in the paper.
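Because the energy is linear in the parameters, checking an inequality of the type in Eq. 2 reduces to a dot product with a difference of feature vectors. The sketch below assumes toy two-component feature vectors. The helper names `energy` and `violated` are hypothetical and are introduced here only for illustration.

```python
def energy(p, features):
    """Linear energy U(X; P) = sum_a p_a * f_a(X)."""
    return sum(pa * fa for pa, fa in zip(p, features))

def violated(p, native_features, decoy_features_list):
    """Indices of decoys that do NOT score strictly above the native in energy."""
    u_native = energy(p, native_features)
    return [k for k, f in enumerate(decoy_features_list)
            if energy(p, f) - u_native <= 0.0]

# Toy example: two parameters, one native structure, three decoys.
p = [-1.0, 0.5]
native = [10.0, 2.0]                               # U = -9.0
decoys = [[8.0, 2.0], [12.0, 1.0], [10.0, 2.0]]    # U = -7.0, -11.5, -9.0
print(violated(p, native, decoys))
```

Each decoy contributes one row of a linear system in the unknowns p, which is why the number of inequalities grows with the number of decoys rather than with the number of parameters.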

Docking one pair of proteins generates about $10^{10}$ candidates. Therefore consideration of the complexes available in the PDB (∼600) requires the solution of ∼$10^{13}$ inequalities. This fantastically large number suggests that the available statistics for learning a potential in this case is significant. There are, however, a number of technical problems that we need to address. We provide below a verbal description of the challenges, to help the reader follow the more detailed mathematical formulation of the following sections.

The first challenge in solving the linear programming problem we just formulated is the problem of infeasibility. The number of potential parameters that we determine for the set is in the hundreds and is obviously much smaller than the number of inequalities. While it is not impossible that a set of parameters exists that satisfies exactly the $10^{13}$ inequalities, it is not likely. The set of functions, fα(X), that we use for the potential is not exact, and the optimization of parameters is limited by the flexibility of the functional form. Vendruscolo and Domany12 have shown that there are no parameters of a contact potential that rank correctly the decoy and native structures of a selected protein. Tobi et al.13, 14 pushed this argument further to demonstrate that a general parameterization of a pair potential as a function of distance is also insufficient. Their sets were much smaller than the set of $10^{13}$ inequalities we consider here, while the potential of Tobi et al.14 was more flexible than the current choice.

We therefore expect (as is indeed the case) that a straightforward application of a linear programming approach to the problem at hand will detect infeasibilities, i.e., we find that there is no set of parameters that satisfies all the inequalities at hand. How should we deal with such an imperfect potential? One solution is to come up with a more flexible functional form for which a desired set of parameters could be found. This is, however, not always possible. We must keep in mind that the structures we usually employ are approximate. It may be the case that an energy function that scores the nearest to native approximations as the best models, simply does not exist. Hence, the current learning of a potential is of learning with noise. If we try too hard to learn an energy function by adding a large number of parameters we may end up learning the noise and not the molecular data. How can we determine if the potential is the best for the current (approximate) functional form?

In machine learning it is common practice to test models learned from noisy data on independent test sets to check the transferability and generality of the designed score. Hence, while learning from data with noise, we are willing to accept some inaccuracies in order to retain a simple-to-use functional form that generalizes to other sets. The adjusted goal of optimizing with noise is reflected in a new set of inequalities and an optimization of a target function together with the solution of the inequalities,

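$$\begin{aligned} &U(X_d;P) - U(X_n;P) > 1 - \frac{\eta_{n,d}}{\Delta_{n,d}}, \qquad \eta_{n,d},\ \Delta_{n,d} \ge 0 \quad \forall\, n,d, \\ &(P,\eta) = \arg\min\ |P|^2 + C\sum_{n,d}\eta_{n,d}. \end{aligned} \tag{3}$$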

The slack variables ηn, d measure the degree of inequality violations and their sum is minimized in formulation 3. Note that formulation 3 is not exactly what we had in mind. The prime goal is to optimize the ranking, making the correct structure lower in energy than any other decoy structure. The above formulation minimizes the differences in energy, or the energy gap, not the ranking. It is possible that a solution in which many wrong structures are only slightly lower in energy than a correct structure will be preferred to a solution in which only one structure scores better than the native by a wide margin. It turns out that optimizing the ranking directly is harder. We therefore stay with optimization of the sum of the slack variables. The scaling of the slack variables with a distance measure discussed below helps remove high-scoring decoys with significant distance from the correct assembly (penalty is higher for larger mistakes).

The distance measure, $\Delta_{n,d}$, is introduced to bias the energy landscape towards a funnel.15 We penalize violations less if the structures are reasonably similar. The penalty increases for structures far from the native that have energy lower than the correct structure. In docking, a common distance measure between the model and the correct structure of the complex is the interface root mean square difference ("i-RMSD") after optimal overlap of the residues at the interface. We will use the i-RMSD for $\Delta_{n,d}$ in the present paper; however, this choice is clearly not unique. Finally, the "1" in the above formulation sets a scale for the potential parameters, which are scale-free in Eq. 1. The coefficient C determines the penalty for violation.
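To make the role of the slack variables and of the distance scaling concrete, the sketch below evaluates the objective of formulation 3 for a fixed parameter vector. It reads the constraint as gap > 1 − η/Δ, where gap is the energy of the decoy minus the energy of the native, so the smallest admissible slack is η = Δ·max(0, 1 − gap). That reading, the function names, and the toy numbers are assumptions made for this example; no optimization over P is attempted.

```python
def min_slack(gap, delta):
    """Smallest eta >= 0 with gap >= 1 - eta/delta (delta > 0 assumed).

    gap   : U(decoy) - U(native)
    delta : distance of the decoy from the native (e.g., an i-RMSD value)
    """
    return delta * max(0.0, 1.0 - gap)

def objective(p, gaps_and_deltas, c):
    """|P|^2 + C * sum of minimal slacks, for a fixed parameter vector p."""
    regularizer = sum(x * x for x in p)
    return regularizer + c * sum(min_slack(g, d) for g, d in gaps_and_deltas)

# Toy decoys as (gap, delta): satisfied, mildly violated and near the native,
# strongly violated and far from the native.
examples = [(2.0, 5.0), (0.5, 2.0), (-1.0, 10.0)]
print([min_slack(g, d) for g, d in examples])
print(objective([1.0, 2.0], examples, c=0.1))
```

The third decoy dominates the penalty because its violation is multiplied by a large Δ, which matches the statement that larger mistakes are penalized more heavily.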

Equation 3 addresses the infeasibility problem. We note that Eq. 3 is known in the field of quadratic programming and is used widely in machine learning. It is most popular nowadays in the form of the support vector machine (SVM),16 which is used for classification and more recently for ranking. The support vector machine was introduced as a method for binary classification, that is, for learning a linear separator that differentiates between two classes. The framework argues that maximizing the margin between the plane and its closest points leads to good generalization. In the present paper we build on recent results in structural SVM (Refs. 17 and 18), but our learning algorithm has useful additional twists appropriate for the docking problem.

The second challenge of our plan to solve $10^{13}$ inequalities is computational. Solving all the inequalities directly is not possible today even with the most advanced computing technologies. We have spent considerable time extending interior point code19 for our purposes. Indeed, parallelization and exploitation of the special structure of the problem (a relatively small number of parameters and a huge number of inequalities)20 increase the number of constraints that we can address in one run by more than two orders of magnitude. In the present paper we exploit clustering to make the calculations more efficient and accurate. The asymmetry of the number of inequalities, N, versus the number of parameters, L, is particularly worth exploiting within the primal-dual representation of linear and quadratic programming problems (see Appendix C and Ref. 20). The linear system solved during the process of determining an optimizing step can be made as small as $L \times L$ (rather than $N \times N$), which is clearly advantageous for our problem. Forming the matrix involved is expensive, but it can be done in a data parallel fashion even when constraints are grouped based on clusters.

However, even with these enhancements we are not able to solve more than ∼$10^{8}$ inequalities, which is still far fewer than $10^{13}$. It means that a solution of the complete convex programming problem must be made with a selected part of the inequalities. In the past we sampled subsets of inequalities heuristically by considering the first few millions of constraints.13, 14, 22, 23 This was also the approach taken in our first design of a docking potential reported in Ref. 22. While it is an appealing choice, it is not obvious that the sampled constraints are the needed set. For example, it is possible that we oversample constraints on one parameter while leaving other parameters ill determined. Rather than picking inequalities heuristically and risking missing important constraints on parameter values, it is desirable to have a rigorous approach that allows for a systematic selection of a subset of inequalities. Such a selection is expected to provide error estimates and a systematic means of improving the chosen subset of inequalities.

Structural SVM (Ref. 17) was proposed as a method for learning sequence alignments18, 24 in which inequalities are iteratively and systematically selected and added to the set of inequalities to be solved (Eq. 3). The authors used quadratic programming as the optimization method and showed that the procedure converged in linear number of steps in the number of examples. However, structural SVM is based on using one slack variable for all decoys associated with the same native data point (correct alignment). As argued in section “Learning with clustering,” this choice is not ideal for the docking problem since some violated pairs are over penalized. In the present paper we suggest another way of selecting slack variables and inequalities. The new selection addresses this issue while preserving the attractive and formal features of structural SVM. Even in this case, we obtain an iterative procedure of determining the inequalities to solve. The iterations are guaranteed to improve the quality of the solution of Eq. 3.

The new approach for the selection of inequalities exploits the low dimensionality of the problem (only six dimensions). The large number of decoy structures that we generate includes complexes that are quite similar to each other. Similar decoys yield inequalities that are not significantly different and contain little new information. Furthermore, particular violations are oversampled. Since we do not use all inequalities, oversampling of some constraints may cause other important constraints to be missed. Hence clustering allows more uniform sampling. We assign a single slack variable to a cluster of decoys.

The paper is organized as follows. We consider first the functional form of the energy that we choose to optimize. Second, we describe the grid that we employ to represent the space of docking configurations. Third, we discuss the algorithm to determine the potential parameters. Finally, we consider examples and benchmarks.

FUNCTIONAL FORM FOR DOCKING ENERGY

The energy function depends on the translation and rotation (or transformation) of one protein chain (L ligand) with respect to the other chain (R receptor). We denote a transformation by τ (τ = (t, u) where t and u are the translation and rotation of the ligand, respectively), and τ(rj) are the transformed coordinates of particle j,

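$$U(\tau) = U_{\mathrm{attr}}(\tau) + p_{\mathrm{repul}}\cdot U_{\mathrm{repul}}(\tau) + \sum_{i,j\,(i\neq j)} p_{\alpha(i,j)}\, n_{ij}(r_{ij}), \qquad r_{ij} = \left|\mathbf{r}_i - \tau(\mathbf{r}_j)\right|, \tag{4}$$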

where U(τ) is the total energy of the complex, as created with the transformation τ. Uattr(τ) and Urepul(τ) are attractive and repulsive components of the energy. The last term is a summation over interactions of pairs of amino acids. nij(rij) is a function of the distance between particles placed at different protein chains. Following Tobi and Bahar,25 we use 20 side chain centers of mass (cntd), the backbone carbonyl oxygen, and amide groups (bkbn) as different particle types. The parameter prepul determines the strength of the overall repulsion. The coefficient pα(i, j) determines the strength of the interaction type α of particles (i, j). We provide below the explicit functional form of these sub-energies.

Residue and backbone contact potential

Most knowledge-based potentials in the protein folding field employ contact potentials11, 26 (residue or otherwise), which are of the form $U_{\mathrm{contact}}(\tau) = \sum_{i=1,j=1}^{N_R,N_L} p_{\alpha(i,j)}\, n_{ij}(r_{ij})$, where $p_{\alpha(i,j)}$ is the score for contact type α and $n_{ij}(r_{ij})$, equal to 1 if $r_{ij} \le R_c$ and 0 if $r_{ij} > R_c$ (a step function), defines a contact. The total number of particles of the first protein (receptor) is $N_R$ and that of the second (ligand) is $N_L$. The total energy is therefore a sum of weighted step functions. For a continuous description of the contact as a function of distance, we use a linear interpolation function h(r) in place of n(r), where

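$$h(r) = \begin{cases} 1, & r \le R_{\min} \\ \dfrac{R_{\max} - r}{R_{\max} - R_{\min}}, & R_{\min} < r \le R_{\max} \\ 0, & r > R_{\max}. \end{cases} \tag{5}$$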

This function interpolates continuously between one and zero, using two distances $R_{\min}$ and $R_{\max}$ that depend on the particle types as

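$$R_{\min} = \begin{cases} 6\ \text{Å}, & \text{cntd, cntd} \\ 5\ \text{Å}, & \text{cntd, bkbn} \\ 4\ \text{Å}, & \text{bkbn, bkbn} \end{cases} \qquad R_{\max} = \begin{cases} 8\ \text{Å}, & \text{cntd, cntd} \\ 7\ \text{Å}, & \text{cntd, bkbn} \\ 5\ \text{Å}, & \text{bkbn, bkbn}. \end{cases} \tag{6}$$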
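The interpolated contact function of Eq. 5, with the type-dependent cutoffs of Eq. 6, can be sketched in a few lines. The dictionary layout and the function name `contact_weight` are illustrative assumptions; the cutoff values are those quoted in Eq. 6.

```python
# Cutoffs in angstroms from Eq. 6, keyed by the pair of particle kinds.
R_MIN = {("cntd", "cntd"): 6.0, ("cntd", "bkbn"): 5.0, ("bkbn", "bkbn"): 4.0}
R_MAX = {("cntd", "cntd"): 8.0, ("cntd", "bkbn"): 7.0, ("bkbn", "bkbn"): 5.0}

def contact_weight(r, kind_a, kind_b):
    """Piecewise-linear contact function h(r): 1 below R_min, 0 above R_max."""
    key = (kind_a, kind_b) if (kind_a, kind_b) in R_MIN else (kind_b, kind_a)
    r_min, r_max = R_MIN[key], R_MAX[key]
    if r <= r_min:
        return 1.0
    if r <= r_max:
        return (r_max - r) / (r_max - r_min)
    return 0.0

for r in (5.0, 7.0, 9.0):
    print(r, contact_weight(r, "cntd", "cntd"))
```

Unlike the step function n(r), this form changes continuously with distance, so small displacements of a particle produce small changes in the energy.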

For efficient calculations of the energy it is convenient to define a receptor grid. If a grid is available the calculation of the energy is proportional to the number of particles.

Receptor grid

The function representing the potential experienced by a particle type q due to the particles of the receptor is defined as

$$R_{q(j)}(l,m,n) = \sum_{i=1}^{N_R} p_{q(i)q(j)}\, h\!\left(r_{i,(l,m,n)}\right), \tag{7}$$

where $r_{i,(l,m,n)}$ is the distance between particle i and the corner of the cell (l, m, n) with the smallest coordinates (least l, m, n). The receptor grid $R_{q(j)}$ provides a discretization of the potential experienced by a particle j of type q(j). For the calculation of the grid it is convenient to consider a single particle type (q(j)) instead of a contact type α(i, j) as in Eq. 4. The receptor is placed in a rectangular box that is partitioned into cubic cells of side length g. Consider a point r contained in cell (l, m, n); the value of the potential experienced by a particle of type q(j) can be approximated by the value at the center of the cell (l, m, n). More accurate, however, would be an interpolation of the potential within the cell, and we use a trilinear interpolation. Consider a point r in cell (l, m, n). The integers (l, m, n) are defined as the largest integers that are less than or equal to the Cartesian components of r in units of g, that is, l = ⌊rx/g⌋, m = ⌊ry/g⌋, and n = ⌊rz/g⌋. The displacement of the point with respect to the lattice (grid) point (l, m, n), in units of g, is given by

$$x = \frac{r_x}{g} - l, \qquad y = \frac{r_y}{g} - m, \qquad z = \frac{r_z}{g} - n.$$

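Let $\chi_{\alpha\beta\gamma}(\mathbf{r}) = \left[(1-x)\,\delta_{\alpha,0} + x\,\delta_{\alpha,1}\right]\left[(1-y)\,\delta_{\beta,0} + y\,\delta_{\beta,1}\right]\left[(1-z)\,\delta_{\gamma,0} + z\,\delta_{\gamma,1}\right]$ (δ is the Kronecker delta), so that a corner receives a larger weight the closer it is to r. The potential for a particle of type q at r is approximated as a linear combination of the potential on the eight corners of the grid cube containing r. The function $\chi_{\alpha\beta\gamma}$ is the weight for the contribution of the potential R at the corner αβγ of the cube,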

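$$\Phi_q(\mathbf{r}) = \sum_{\alpha,\beta,\gamma\in\{0,1\}} \chi_{\alpha\beta\gamma}(\mathbf{r})\, R_q(l+\alpha,\, m+\beta,\, n+\gamma), \tag{8}$$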

which is essentially a linear interpolation between corners and edges of the box. The ligand grid $L_q$ provides occupancies of particles of type q. It is defined as $L_q(l,m,n) = \sum_{\alpha,\beta,\gamma\in\{0,1\}} \sum_{j\in(l-\alpha,\,m-\beta,\,n-\gamma)} \chi_{\alpha\beta\gamma}(\mathbf{r}(j))$, where j is a particle of type q with position r(j) in cell (l − α, m − β, n − γ).
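The interpolation of Eq. 8 can be sketched as follows. The sketch assumes the standard trilinear weights, in which the corner at the cell origin receives weight (1 − x)(1 − y)(1 − z), and it stores the grid as a dictionary keyed by integer triples. Both choices, as well as the function names, are assumptions made for illustration.

```python
import math
from itertools import product

def trilinear_weights(r, g):
    """Cell index (l, m, n) of point r and the weights of its eight corners."""
    cell = tuple(math.floor(c / g) for c in r)
    frac = tuple(c / g - i for c, i in zip(r, cell))   # x, y, z in [0, 1)
    weights = {}
    for corner in product((0, 1), repeat=3):           # (alpha, beta, gamma)
        w = 1.0
        for f, bit in zip(frac, corner):
            w *= f if bit else (1.0 - f)
        weights[corner] = w
    return cell, weights

def interpolate(grid, r, g):
    """Approximate the potential at r from the values on the cell corners."""
    cell, weights = trilinear_weights(r, g)
    return sum(w * grid[tuple(i + a for i, a in zip(cell, corner))]
               for corner, w in weights.items())

# Toy grid holding a field that is linear in the grid indices.
g = 2.0
grid = {(l, m, n): l + 2 * m + 3 * n
        for l in range(3) for m in range(3) for n in range(3)}
print(interpolate(grid, (1.0, 3.0, 0.5), g))
```

A field that is linear in the coordinates is reproduced exactly by this scheme, which provides a convenient sanity check. The same weights, applied in reverse, spread a ligand particle over the eight corners of its cell when the ligand grid is built.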

Vdw attraction and repulsion

Shape complementarity is an important determinant of protein-protein docking. FFT-based docking algorithms11, 27, 28, 29, 30 define shells of various sizes around the surface of each protein and discretize them on a grid. For a translation and rotation the overlap between the translated and rotated shells of the ligand and the shells of the receptor are computed and the shape complementarity is quantified as a linear combination of these overlaps.
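The reason Fourier transforms help is the correlation theorem: the overlap of two grids for every relative translation is obtained at once from the product of their transforms. The one-dimensional sketch below checks this identity on a toy periodic grid. It uses a naive discrete Fourier transform to stay dependency-free, whereas a production code would call an FFT library on three-dimensional grids. The function names and the toy occupancy vectors are hypothetical.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[k] * cmath.exp(sign * 2j * cmath.pi * j * k / n)
               for k in range(n)) for j in range(n)]
    return [v / n for v in out] if inverse else out

def correlate_direct(receptor, ligand):
    """Overlap score for every cyclic shift t of the ligand, one shift at a time."""
    n = len(receptor)
    return [sum(receptor[(k + t) % n] * ligand[k] for k in range(n))
            for t in range(n)]

def correlate_fourier(receptor, ligand):
    """Same scores for real-valued grids: inverse DFT of DFT(receptor) * conj(DFT(ligand))."""
    fr, fl = dft(receptor), dft(ligand)
    spectrum = [a * b.conjugate() for a, b in zip(fr, fl)]
    return [v.real for v in dft(spectrum, inverse=True)]

receptor = [0, 1, 1, 0, 0, 0, 0, 0]   # toy occupancy of a receptor shell
ligand = [1, 1, 0, 0, 0, 0, 0, 0]     # toy occupancy of a ligand shell
print(correlate_direct(receptor, ligand))
print([round(v, 6) for v in correlate_fourier(receptor, ligand)])
```

With a true FFT the cost per rotation drops from quadratic in the number of grid points to N log N, which is what makes scanning on the order of a million translations for each rotation practical.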

The total interface vdw energy of a complex when the proteins are docked according to transformation τ is $E_{\mathrm{vdw}}^{\mathrm{exact}}(\tau) = \sum_{i=1,j=1}^{N_R,N_L} V_{ij}(r_{ij})$, where $r_{ij} = |\mathbf{r}_i - \tau(\mathbf{r}_j)|$ and $V_{ij}(r_{ij}) = 4D_{ij}\left(\dfrac{\sigma_{ij}^{12}}{r_{ij}^{12}} - \dfrac{\sigma_{ij}^{6}}{r_{ij}^{6}}\right)$. The indices i, j run over the particles of the receptor and the ligand, respectively, and $N_R$, $N_L$ are the numbers of particles of the two proteins in the complex. We use the OPLS (Optimized Potentials for Liquid Simulations) force field, a standard force field for modeling proteins that was derived by optimizing the fit to gas phase and liquid phase properties of water, capped amino acids, and small peptides.31 The energy and the contact distance factorize into single-particle properties as $D_{ij} = \sqrt{D_{ii}D_{jj}} \equiv D_i D_j$ and $\sigma_{ij} = \sqrt{\sigma_{ii}\sigma_{jj}} \equiv \sigma_i \sigma_j$. We use an approximation

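$$E_{\mathrm{vdw}}(\tau) = 4\sum_{j=1}^{N_L} D_j\,\frac{\sigma_j^{3}}{3.2^{3}} \sum_{i=1}^{N_R} D_i\,\varphi\!\left(\frac{r_{ij}}{3.2\,\sigma_i}\right), \quad \text{where} \quad \varphi(x) = \begin{cases} -p_{\mathrm{repul}}\times(0.8 - x), & x \le 0.8 \\ -0.5\times(x - 0.8), & 0.8 < x \le 1.1 \\ -0.15, & 1.1 < x \le 1.32 \\ -0.15\times\dfrac{1.8 - x}{0.48}, & 1.32 < x \le 1.8 \\ 0, & x > 1.8. \end{cases} \tag{9}$$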

We adopt this approximation because it is convenient to handle using FFTs and it resembles 6–12 intermolecular potentials (comparison provided in Fig. 1).

Figure 1. The relation of the vdw approximation function to the Lennard-Jones 6/12 potential (dotted line); w_vdw_repul was set to −9.

The receptor and ligand grids are defined as

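$$R_{\mathrm{vdw}}(l,m,n) = \sum_{i} D_i\, f\!\left(\frac{r_{i,(l,m,n)}}{3.2\,\sigma_i}\right), \quad \text{where} \quad f(x) = \begin{cases} \sqrt{-1}\; p_{\mathrm{repul}}\times(0.8 - x), & x \le 0.8 \\ 0.5\times(x - 0.8), & 0.8 < x \le 1.1 \\ 0.15, & 1.1 < x \le 1.32 \\ 0.15\times\dfrac{1.8 - x}{0.48}, & 1.32 < x \le 1.8 \\ 0, & x > 1.8, \end{cases} \tag{10}$$
$$L_{\mathrm{vdw}}(l,m,n) = 4\sum_{\alpha,\beta,\gamma\in\{0,1\}}\ \sum_{k\in(l-\alpha,\,m-\beta,\,n-\gamma)} D_k\,\frac{\sigma_k^{3}}{3.2^{3}}\,\chi_{\alpha\beta\gamma}(\mathbf{r}(k)).$$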

Note that the grid Rvdw(l, m, n) has complex values. The imaginary component of the calculation stores the repulsion due to overlap. Note also the change in sign: in our formulation higher scores are better.

With the energy terms in place we discuss the algorithm to generate alternate docked conformations. The inputs for the algorithm are the coordinates of the receptor and the ligand. In the sketched algorithm below we allow for (only) rigid body transformations to be performed, but even with this restriction the number of conformations is ∼$10^{10}$. Rather than saving all transformations, we calculate the energies of the complexes on the fly and store only the top Λ candidates.
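Retaining only the best Λ candidates from a stream of scored poses is a standard bounded-heap pattern. The sketch below assumes that higher scores are better and that each pose is represented by a comparable identifier such as an integer index. The function name `keep_top` and the toy stream are hypothetical.

```python
import heapq

def keep_top(scored_poses, capacity):
    """Stream (score, pose_id) pairs and keep the `capacity` highest scores.

    A min-heap of bounded size holds the current best candidates, so memory
    stays proportional to `capacity` regardless of how many poses are scanned.
    """
    heap = []
    for score, pose_id in scored_poses:
        item = (score, pose_id)
        if len(heap) < capacity:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            heapq.heapreplace(heap, item)   # drop the current worst candidate
    return sorted(heap, reverse=True)

# Toy stream of (score, pose index); in practice the scores come from the grids.
stream = [(0.1, 0), (0.9, 1), (0.5, 2), (0.7, 3), (0.2, 4)]
print(keep_top(stream, capacity=3))
```

Each new pose costs at most a logarithmic number of comparisons in the heap size, so the bookkeeping is negligible next to the energy evaluation.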

In the present paper we do not consider the process of final selection and refinement. Refinement is the adjustment of the unbound conformations to remove sterically unacceptable shapes of complexes and bring the final structures closer to the true bound form. We comment that the algorithm as described is not new and was used in docking experiments elsewhere.11, 26, 27, 28, 29, 30, 35 We provide it for completeness since the search is strongly coupled to the learning process and to our concrete choice of energy functions. For the algorithm to function efficiently, every energy term must be presented as a product or a convolution.

The use of a grid representation for molecular positions and interaction energies is common in the field. In some cases energies are defined directly on the grid and are discontinuous. It is not obvious whether discontinuous energy functions are mapped correctly from the grid to the continuous space as the grid size is made smaller. Such mapping is important since ultimately we wish to determine docked conformations in continuous space and score these conformations with energies appropriate for that space. In Appendix A we analyze the errors of our implementation of the different energy terms and demonstrate that in our case the functions go to the correct limit.

ALGORITHM TO DETERMINE OPTIMAL POTENTIAL PARAMETERS

We are not the first group to propose a docking energy function. Below we review some of the leading potentials and algorithms and discuss them in the context of the present study. An important docking program is ZDOCK.11 ZDOCK uses an atomic contact energy (ACE)—a statistical potential35a derived with a random crystal structure as the reference state (atom pairs were randomly exchanged in the crystal structure to obtain the reference state) that explained protein solvation energies very well. While the ability of transferring potential parameters between fields is impressive and important, one may expect that a potential designed specifically for the protein-docking problem will be better at that specific task.

Another atomic statistical potential, derived with decoys obtained from a docking algorithm as the reference state, is employed in PIPER.36, 37 Perhaps the most challenging problem in the design of statistical potentials is the definition of the reference state. The reference state represents hits by chance or predictions that are false. PIPER uses decoys as the reference state but assumes that the distributions of pairwise contacts are independent of each other. In fact, contacts are highly correlated during hydrophobic collapse. The convex programming approach, which we advocate here, implicitly generates a reference state by explicitly considering pairs of false and true predictions. No independence assumptions are made in the generation of the reference state. The disadvantage of the convex programming approach is that the statistics of false positives are typically expensive to generate, and in many cases they are too poor to give an accurate grasp of the overall shape of the false positive distribution. On the other hand, sampling directly from the false positive distribution has the advantage that no ad hoc assumptions are made while proposing a reference state. The difficult task of choosing a functional form for the reference distribution is avoided.

Self-consistent iterative procedures that circumvent the choice of a reference state in deriving statistical contact potentials were proposed38 and applied to the design of scoring functions for protein-protein docking.39 The method is restricted to the class of contact potentials and is based on separating the near-native from the average incorrect structure. Again, the statistics of the average incorrect structure is not too difficult to obtain. However, the direct comparison of pairs of false and true predictions provides richer information.

Algorithm for docking

1: Input: receptor, ligand, tolerated error in energy (ε), and minimum number of transformations to retain (Λ). The tolerated error in energy is how far the best solution found by the discrete space search may deviate from the optimal solution in continuous space. In the training and testing we used $\Lambda = 2^{19} = 524\,288$.
2: Find radius of each protein and determine the density of rotational sampling and grid spacing to be used such that the error can be bounded by ε (see Appendix A).
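3: Compute grids $R_{vdw}$ and $R_j$, $\forall j \in \{1, \ldots, 22\}$, on the receptor protein and their inverse Fourier transforms, $\mathrm{IFT}(R_{vdw})$ and $\mathrm{IFT}(R_j)$, $\forall j \in \{1, \ldots, 22\}$.
4: Initiate the set of selected conformations (Γ) to the empty set.
5: Let $u_\alpha$ be a rotation matrix on a discrete grid in the space of all rotations, SO3. Begin loop on $u_\alpha$
 5.1 Compute grids $L_{vdw}$ and $L_j$, $\forall j \in \{1, \ldots, 22\}$, on the ligand protein rotated according to $u_\alpha$, and their Fourier transforms.
 5.2 Compute scores for all translations ($\Gamma_\alpha$) involving the current rotation α using the convolution theorem (all functions below carry the index α to denote the current rotation): $E^{\alpha}_{vdw} = \frac{1}{N_x N_y N_z}\,\mathrm{IFT}\{\mathrm{FFT}\{L_{vdw}\}\,\mathrm{IFT}\{R_{vdw}\}\}$, $E^{\alpha}_{particle\_pair} = \frac{1}{N_x N_y N_z}\,\mathrm{IFT}\{\sum_{j=1}^{22}\mathrm{FFT}\{L_j\}\,\mathrm{IFT}\{R_j\}\}$, and $E^{\alpha} = \mathrm{Re}\{E^{\alpha}_{vdw}\} + \mathrm{Im}\{E^{\alpha}_{vdw}\} + \mathrm{Re}\{E^{\alpha}_{particle\_pair}\}$, where $N_x$, $N_y$, $N_z$ are the dimensions of the grid used.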
 5.3 Consider the set of conformations and their energies just discovered (Γα, Eα) and the set of the other conformations (other rotations) that were already explored (Γ, E). In this step we merge the two sets. We sort both sets in decreasing order of score and retain transformations that are within top Λ (their energies are the lowest) or have a score within ε from the best solution. (Γ, E) is updated.
6: End loop
7: Output: (Γ, E)
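
The translational scan in step 5.2 rests on the convolution theorem. The following is a minimal NumPy sketch of that idea, not the DOCK/PIE implementation. It assumes the standard correlation-theorem form IFFT(FFT(R) · conj(FFT(L))) rather than the FFT/IFT pairing and normalization written in the listing, treats translations as cyclic shifts, and scores a single real-valued grid pair. The function names and the random toy grids are illustrative assumptions.

```python
import numpy as np

def score_all_translations(receptor_grid, ligand_grid):
    """Score every cyclic translation t at once: score[t] = sum_x R[x + t] * L[x]."""
    r_hat = np.fft.fftn(receptor_grid)
    l_hat = np.fft.fftn(ligand_grid)
    # Correlation theorem for real grids: corr = IFFT(FFT(R) * conj(FFT(L))).
    return np.real(np.fft.ifftn(r_hat * np.conj(l_hat)))

def score_one_translation(receptor_grid, ligand_grid, shift):
    """Direct evaluation of the same sum for one shift, used only as a check."""
    moved = np.roll(receptor_grid, tuple(-s for s in shift), axis=(0, 1, 2))
    return float(np.sum(moved * ligand_grid))

# Toy data (assumption): random 8x8x8 grids stand in for receptor/ligand terms.
rng = np.random.default_rng(0)
receptor = rng.normal(size=(8, 8, 8))
ligand = rng.normal(size=(8, 8, 8))

scores = score_all_translations(receptor, ligand)
best_shift = np.unravel_index(np.argmax(scores), scores.shape)
```

One pair of forward transforms and one inverse transform yield the scores of all translations for a given rotation, which is what makes the exhaustive search over the translation grid affordable.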

We designed parameters for scoring docked conformations in our earlier work.22 In that work an extensive set of 2-chain complexes (462 bound-bound, 123 bound-unbound, and 55 unbound-unbound cases) was derived from the PDB.5 We did not perform exhaustive sampling of all docked conformations, but instead used Patchdock to sample nativelike and incorrect transformations for each case. We then derived parameters $p_\alpha$ by minimizing the sum of the slack variables $\sum_{ij,ik}\eta_{ij,ik}$, similar to Eqs. 2, 3, such that

$$\sum_\alpha p_\alpha\,[f_{\alpha,ij} - f_{\alpha,ik}] \ge 1 - \eta_{ij,ik} \quad\text{and}\quad \eta_{ij,ik} \ge 0 \quad \forall\, ij, ik, \tag{11}$$

where $f_{\alpha,ij}$ is a vector of interface properties of the jth correctly docked structure for complex i and $f_{\alpha,ik}$ is the interface property vector of the kth mis-docked structure for complex i. The results of the parameter optimization, using structures sampled by Patchdock, may have been program dependent and not appropriate for other sampling techniques. No exhaustive enumeration of all possible translations and rotations was performed. The potential derived there compared favorably against other docking energies22 using Patchdock and Zdock generated structural sets. It is only when we tried to use that potential for exhaustive sampling that we found that our original potential also generates a significant number of false positives.

In this work we derive parameters that ensure selection of nativelike structures from all possible transformations. If one follows the linear programming formulation, one is faced with an enormous number of constraints. The number of possible transformations for a pair of proteins, when translations are sampled on a cubic grid discretized into 100 intervals in each dimension and around 54 000 rotations are used (the number of rotations required to sample SO3 at 6°), is $54 \times 10^{9}$. The total number of constraints would be 36 trillion, and the resultant linear program cannot be solved in practice: the number of inequalities is simply too large to load directly into linear programming solvers.20

A solution is possible by sampling a subset of the inequalities. In principle many inequalities do not provide new information (e.g., of the inequalities a > 5 and a > 3 it is sufficient to keep only the first inequality a > 5). While the problem at hand is usually more complex it is still expected that a smaller number of inequalities of the total possible will be sufficient to obtain a satisfactory solution. In the past we have sampled heuristically Λ inequalities (Λ is much smaller than the total number of inequalities possible) and still we were able to find high quality solutions.13, 14, 22 The inequalities chosen were the ones least satisfied.

This choice is intuitively appealing, however, it is not precise. A potential problem might be that many of the selected inequalities do not provide additional information. Sampling more of (almost) the same inequalities does not add significant new information. For a fixed total number of inequalities that we can consider this procedure may miss important constraints and some parameters may be left ill determined. It is therefore desired to have a more systematic way of choosing inequalities, perhaps using iterations if they are guaranteed to improve the solution.

Joachims et al.18 and Tsochantaridis et al.17 provide a quadratic programming formulation for these classes of learning problems and demonstrate an iterative scheme that solves these quadratic programs efficiently. The algorithm is based on iteratively adding selected violated constraints, which ensures that the optimal parameters are found in a number of iterations that is linear in the number of complexes. The optimization problem to be solved in their framework (quadratic programming: structural SVM) for learning docking potentials is

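$$\mathrm{QP(STR\_SVM)}:\quad (P^*, \eta^*) = \arg\min_{P,\eta}\ \frac{1}{2}P^tP + \frac{C}{n}\sum_i \eta_i \quad\text{such that}\quad \sum_\alpha p_\alpha\,[f_\alpha(\tau) - f_\alpha(\tau_i)] \ge 1 - \frac{\eta_i}{\Delta(\tau_i,\tau)},\quad \eta_i \ge 0,\quad \forall\, i, \tau, \tau_i:\ \tau_i \ne \tau \tag{12}$$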

Here n is the number of complexes in the training set, τ is a rigid body transformation, $\tau_i$ is the transformation for the correctly docked structure of complex i, and $f_\alpha(\tau)$ is the α element of a vector of interface properties of transformation τ. The elements of the parameter vector P are the $p_\alpha$. The function $\Delta(\tau_i, \tau)$ is the i-RMSD between structures generated by the transformations τ and $\tau_i$. The i-RMSD (or another dissimilarity measure of the interfaces that we could have chosen) helps shape the potential like a funnel: it imposes a larger penalty for mis-ranking when the interfaces of the decoy and the native complex are less similar, and the penalty becomes smaller as the complex gets closer to the correctly docked conformation. The notation $x^* = \arg\min_x f(x)$ indicates that $x^*$ minimizes the value of the function f(x).

The formulation with a single slack variable per complex belongs to the algorithm category $\mathrm{SVM}_1^{\Delta s}$ (Sec. 2.2.3 of Tsochantaridis et al.17). The algorithm proposed there can be used directly to solve the problem, and it provably converges. The efficiency of algorithms in the structural SVM framework17 comes from a formulation involving only a single slack variable per observed sequence-structure pair (there is only one slack variable $\eta_i$ per complex). This formulation has a serious limitation in our case: $\eta_i$ reflects the maximum difference between the score of the optimal mapping ($\tau^*$) according to the current set of parameters ($p_\alpha$) and the score of the observed output ($\tau_i$); that is, $\sum_\alpha p_\alpha f_\alpha(\tau^*) \ge \sum_\alpha p_\alpha f_\alpha(\tau_i) + 1 - \eta_i/\Delta_{min}$. This means that the score of the native transformation is within $\eta_i/\Delta_{min}$ of the optimal solution. This measure is not an indicator of how many mispredictions would result from a docking algorithm based on the parameter set P (see illustration in Fig. 2). The reason is that all violations for a particular complex are penalized according to the worst-case scenario, while no information is provided on the extent of violation of the rest of the inequalities. There can be many violations that are close to the native and are therefore over-penalized. For example, we find in our learning of parameters that many inequalities do not satisfy the gap criterion (their difference is smaller than one) but have scores worse than the score of the correct complex. Hence, the recognition is actually better than one may expect from the number of violations.

Figure 2.


The problem of learning to dock is approached as learning a linear separator w (represented by the line here) that scores the native transformation (+) above all possible transformations (−). In the formulation QP(STR_SVM) the same penalty is paid in both cases, although case 2 has many false positives.
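
The limitation illustrated in Fig. 2 can be reproduced with a few numbers. The sketch below is a toy illustration, not part of the published method: `margins` (how far the native outscores each decoy) and the unit `delta` values are invented, and the function name is hypothetical. It compares what a single slack per complex would pay (the largest violation) with what one slack per decoy would pay (the sum).

```python
import numpy as np

def slacks_from_margins(margins, delta):
    """eta_j = max(0, delta_j * (1 - margin_j)); margin_j = native score minus decoy score."""
    margins = np.asarray(margins, dtype=float)
    delta = np.asarray(delta, dtype=float)
    return np.maximum(0.0, delta * (1.0 - margins))

delta = np.ones(4)  # assumption: all decoys equally dissimilar to the native
case1 = slacks_from_margins([-2.0, 5.0, 5.0, 5.0], delta)     # one decoy beats the native
case2 = slacks_from_margins([-2.0, -2.0, -2.0, -2.0], delta)  # four decoys beat the native

single_slack = (case1.max(), case2.max())  # one slack per complex: worst violation only
summed_slack = (case1.sum(), case2.sum())  # one slack per decoy: extent of misprediction
```

With these numbers the single-slack penalty is identical for the two cases, while the summed penalty is four times larger for the case with four false positives.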

It would be preferable to minimize the false positive rate (the number of misclassified complexes) rather than the empirical risk, which is the sum of the slack variables (the extent of violation). However, minimizing the false positive rate is NP-hard (the computational cost grows exponentially with the number of complexes) even when all constraints are explicitly listed; this is a simple corollary of the construction employed by Hoffgen et al.40 to show that finding a separating plane with minimum misprediction rate is NP-hard.

Building on results from the linear programming approach20 for learning protein threading potentials, we propose a quadratic programming approach for sum-slack minimization. Instead of counting the number of false positives, we penalize a false positive by the extent to which it scores above the native,

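$$\mathrm{QP(extent\_misprediction)}:\quad (P^*, \eta^*) = \arg\min_{P,\eta}\ \frac{1}{2}P^tP + \frac{C}{n}\sum_{i,j}\eta_{ij} \quad\text{such that}\quad \sum_\alpha p_\alpha\,[f_\alpha(\tau_{ij}) - f_\alpha(\tau_i)] \ge 1 - \frac{\eta_{ij}}{\Delta(\tau_i,\tau_{ij})},\quad \eta_{ij} \ge 0,\quad \forall\, i, j, \tau_{ij}, \tau_i:\ \tau_i \ne \tau_{ij} \tag{13}$$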

The above representation is more flexible than the method of structural SVM, better measures the extent of violations for a particular complex, and is therefore likely to produce better potential parameters. The main assumption is that the noise level is low, since if it is high the sum of the slack variables will be higher (there are now a lot more slack variables) and the noise will bias the minimization.
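
For a fixed parameter vector, the smallest slack that satisfies each constraint of Eq. 13 is available in closed form, so the objective can be evaluated directly. The following sketch assumes constraints of the form $\sum_\alpha p_\alpha[f_\alpha(\tau_{ij}) - f_\alpha(\tau_i)] \ge 1 - \eta_{ij}/\Delta(\tau_i,\tau_{ij})$ with $\eta_{ij} \ge 0$; the function name, argument layout, and toy numbers are illustrative assumptions, and the code only evaluates the objective rather than solving the quadratic program.

```python
import numpy as np

def extent_objective(P, f_native, f_decoys, delta, C):
    """0.5 * P.P + (C / n) * sum of minimal feasible slacks, one slack per decoy.

    f_native: n feature vectors (one per complex)
    f_decoys: n arrays of shape (m_i, d), decoy features per complex
    delta:    n arrays of shape (m_i,), dissimilarity of each decoy to the native
    """
    P = np.asarray(P, dtype=float)
    n = len(f_native)
    slack_sum = 0.0
    for f_nat, f_dec, dl in zip(f_native, f_decoys, delta):
        # gap_j = sum_a p_a [f_a(decoy j) - f_a(native)]
        gap = (np.asarray(f_dec, dtype=float) - np.asarray(f_nat, dtype=float)) @ P
        # minimal slack satisfying gap_j >= 1 - eta_j / delta_j and eta_j >= 0
        slack_sum += np.sum(np.maximum(0.0, np.asarray(dl, dtype=float) * (1.0 - gap)))
    return float(0.5 * P @ P + C / n * slack_sum)

# Toy example: one complex, one feature, two decoys.
value = extent_objective([1.0], [[0.0]], [[[2.0], [0.5]]], [[1.0, 2.0]], C=2.0)
```

In the toy example the first decoy satisfies the margin and contributes no slack, while the second misses it by 0.5 and, scaled by its dissimilarity of 2, contributes a slack of 1.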

Furthermore, if some generated decoy structures are highly similar (and so are the inequalities) we would end up over penalizing for the same mistake (by adding similar inequalities we repeatedly add similar slack variables to the function to be optimized). To address both of these concerns we introduced the idea of clustering to assist in selection and weighting of inequalities. Clustering of solutions is implicit in structure prediction tasks, so similar false positives should not be penalized multiple times. A natural way of clustering is to choose ε-balls in a metric Θ on the output space (the space of all translations and rotations – SE3). Complexes that fall within a single ε-ball are collected to one cluster. One would like to have a penalty per mispredicted cluster. The goal is to derive parameters P such that an exhaustive search algorithm based on P that uses clustering of predicted outputs, results in the minimum extent of misprediction. An optimization problem that captures these properties is

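$$\mathrm{QP(extent\_misprediction\_with\_clustering)}:\quad (P^*, \eta^*, \Pi^*) = \arg\min_{\Pi,P,\eta(\Pi)}\ \frac{1}{2}P^tP + \frac{C}{n}\sum_{i,j}\eta_{\Pi_{ij}} \quad\text{such that}\quad \sum_\alpha p_\alpha\,[f_\alpha(\tau_{ij}) - f_\alpha(\tau_i)] \ge 1 - \frac{\eta_{\Pi_{ij}}}{\Delta(\tau_i,\tau_{ij})},\quad \eta_{\Pi_{ij}} \ge 0,\quad \forall\, i, j, \tau_{ij}, \tau_i:\ \tau_i \ne \tau_{ij} \tag{14}$$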

where $\Pi_{ik}$ is the cluster covering transformation $\tau_{ik}$, and $\eta_{\Pi_{ik}}$ is the slack cost associated with cluster k of complex i. A direct solution of QP(extent_misprediction) or QP(extent_misprediction_with_clustering) would require listing all negative transformations, which is impractical. For a natural metric on SE3, we provide an iterative algorithm that finds $P^+$, which is comparable in quality to the solution $P^*$ of QP(extent_misprediction_with_clustering).

Algorithm description

In short, the algorithm iteratively docks all pairs of proteins using the current estimate for P and collects newly discovered violations (a violation occurs when the score of a decoy complex is better (higher) than the score of the correctly docked protein pair). If the current set of clusters does not cover the points causing the violations, new clusters are added (that is, new slack variables are created); otherwise the existing clusters are used. Over-sampling the neighborhood of a cluster does not lead to an extra penalty, because the penalty is assigned per cluster and not per inequality. New constraints are then added, requiring the slacks to be large enough to cover the violations, and the potential is retrained. This process continues until no new violations are discovered. The number of iterations is bounded by the minimum extent of misprediction attainable on the explicit enumeration of all constraints. The comprehensive screening of all transformations to find the most violated constraints for a given choice of parameters is needed for the correctness of the algorithm. Our code DOCK/PIE and a fast rmsd algorithm accomplish the exhaustive enumeration.

The rmsd between a pair of docked structures arising in rigid body docking is computed in constant time (independent of the protein sizes) after a simple pre-processing step (Appendix B). Together with DOCK/PIE this completes the description of the procedure for efficient generation of the top Λ violated constraints in step 8 of the learning algorithm.

Below, we discuss the proof that the algorithm converges within known error bars from the exact solution. Readers that are more interested in the practical aspects of the algorithm are encouraged to skip sections: Algorithm convergence, Duality theory and a summary of previous work, and Learning with clustering.

Algorithm convergence

The proof of convergence of the learning algorithm depends on properties of the clustering algorithm. The optimal clustering should be connected to the iterative clustering procedure used in the algorithm (problem is portrayed in Fig. 3). We show that the number of clusters that the iterative procedure ends up adding for every cluster in the optimal clustering is bounded. When comparing two clustering schemes, the covering number is defined as the largest number of clusters in one scheme that intersect a cluster in another scheme. In the following text we show that the covering number is small for the docking problem and so the iterative clustering scheme is relatable to the optimal clustering.

Figure 3.


Elements of the optimal cover Π* are in red and elements of the current cover Πe are in black. When the set of parameters w* is used, slack cost is paid only for points in a red cluster.

Definition. A metric Θ on a space X is said to satisfy the small-cover property if there is a constant K such that, for all ε > 0 and x ∈ X, with $B_\varepsilon(x) = \{x' \in X : \Theta(x, x') < \varepsilon\}$ (the ε-ball around x), and for all covers $P(\varepsilon) = \bigcup_i \{B_\varepsilon(x_i)\}$ of X that satisfy the condition $\forall i \ne j:\ \Theta(x_i, x_j) > \varepsilon$, at most K elements of P(ε) are sufficient to cover $B_\varepsilon(x)$. K is said to be the covering number of (X, Θ).

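Theorem. For the metric $\Theta((t_1,u_1),(t_2,u_2)) = \sqrt{\lVert t_1 - t_2\rVert^2 + L^2 \times \lVert u_1 u_2^{-1} - I\rVert^2_{Frobenius}}$ on the space of rigid body transformations SE3, and for covers P(ε) with $\varepsilon \le L\sqrt{8}$, the covering number $K \le 4^6$.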

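Proof. The Frobenius distance is a metric on the space of matrices. For an orthogonal matrix $O = \begin{pmatrix} o_{11} & o_{12} & o_{13}\\ o_{21} & o_{22} & o_{23}\\ o_{31} & o_{32} & o_{33}\end{pmatrix}$, $\lVert O - I\rVert_F = \sqrt{2\,\mathrm{Trace}(I - O)}$, because

$$(\lVert O - I\rVert_F)^2 = (1 - o_{11})^2 + o_{12}^2 + o_{13}^2 + o_{21}^2 + (1 - o_{22})^2 + o_{23}^2 + o_{31}^2 + o_{32}^2 + (1 - o_{33})^2 = 3 - 2(o_{11} + o_{22} + o_{33}) + (o_{11}^2 + o_{12}^2 + o_{13}^2 + o_{21}^2 + o_{22}^2 + o_{23}^2 + o_{31}^2 + o_{32}^2 + o_{33}^2) = 2\,\mathrm{Trace}(I - O).$$

The Frobenius norm is invariant under rigid body rotation (Theorem 3.1 in Trefethen and Bau42). So we have $\lVert u_1 u_2^{-1} - I\rVert_F = \lVert u_1 u_2^{T} - I\rVert_F = \lVert (u_1 - u_2)\times u_2^{T}\rVert_F = \lVert u_1 - u_2\rVert_F$. Hence Θ is a metric.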
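
The two identities used in this part of the proof are easy to check numerically. The sketch below is only a sanity check under stated assumptions: rotations are built from unit quaternions with the standard quaternion-to-matrix formula, and the helper names and the two test quaternions are arbitrary choices made for illustration.

```python
import numpy as np

def rotation_from_quaternion(a, b, c, d):
    """Rotation matrix of the normalized quaternion a + bi + cj + dk."""
    a, b, c, d = np.array([a, b, c, d], dtype=float) / np.linalg.norm([a, b, c, d])
    return np.array([
        [a*a + b*b - c*c - d*d, 2*(b*c - a*d),         2*(b*d + a*c)],
        [2*(b*c + a*d),         a*a - b*b + c*c - d*d, 2*(c*d - a*b)],
        [2*(b*d - a*c),         2*(c*d + a*b),         a*a - b*b - c*c + d*d],
    ])

def dist_to_identity_sq(O):
    """Squared Frobenius distance between O and the identity."""
    return float(np.sum((O - np.eye(3)) ** 2))

u1 = rotation_from_quaternion(0.9, 0.1, -0.3, 0.2)
u2 = rotation_from_quaternion(0.5, -0.5, 0.5, 0.5)

lhs = dist_to_identity_sq(u1)                      # ||O - I||_F^2
rhs = 2.0 * np.trace(np.eye(3) - u1)               # 2 Trace(I - O)
relative = np.linalg.norm(u1 @ u2.T - np.eye(3))   # ||u1 u2^-1 - I||_F
direct = np.linalg.norm(u1 - u2)                   # ||u1 - u2||_F
```

`np.linalg.norm` applied to a matrix returns the Frobenius norm by default, so `relative` and `direct` are the two sides of the invariance statement.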

Algorithm for learning to dock

1: Input: Set of correctly docked conformations $X_{ij}$ (i is the complex index and j is the index of the protein chain; a total of n complexes and 2n chains), their sequences, and their transformations $\tau_i$: $((X_{11}, X_{12}), \tau_1), \ldots, ((X_{n1}, X_{n2}), \tau_n)$; C, the weight of the slack variable penalty; the tolerated approximation error υ; and the size of a region in output space ε.
2: Start the search by calculating an initial set of potential parameters. For all n complexes with known empirical structures generate a set of incorrect transformations $\Gamma_i^{(0)}$, i = 1, …, n (i is the index of the complex). Any set of decoys can be used to bootstrap the algorithm. In the present study we used Patchdock.41
3: Calculate the set of constraints $\{S_i\}_i$ and a set of clusters of transformations $\{G_k\}_k$ (k is the index of the cluster): $S_i = \{\forall G_k \in \Gamma_i^{(0)},\ \forall \tau_i^{(j)} \in G_k:\ \sum_\alpha p_\alpha\,(f_\alpha(\tau_i) - f_\alpha(\tau_i^{(j)})) \ge 1 - \eta_{ik}/\Delta(\tau_i, \tau_i^{(j)})\}$, where $\Delta(\tau_i, \tau_i^{(j)}) = \mathrm{irmsd}((X_{i1}, X_{i2})(\tau_i), (X_{i1}, X_{i2})(\tau_i^{(j)}))$ and $\tau_i^{(j)}$ is an element in the cluster of transformations $G_k$.
4: Solve the quadratic programming problem $(P, \eta) = \arg\min_{P,\eta} \frac{1}{2}P^tP + \frac{C}{n}\sum_{i,k}\eta_{ik}$ subject to the constraints $\bigcup_{i=1}^{n} S_i$ and $\forall i, k:\ \eta_{ik} \ge 0$.
5: Start the main iteration cycle and set the number of iterations: ς = 0
6: Repeat: ς = ς + 1
7: for i = 1, …, n do /* Loop over all complexes */
8.1: Find the most violated transformations, $T_i^{(\varsigma)}$, the energies of the violating decoys, E, and their similarity to the native, Δ. The input is the coordinates of the two chains $X_{i1}$ and $X_{i2}$ of complex i, the set of transformations $\tau_i$ to model complex i, the set of parameters P, the tolerated energy error υ, the geometrical size of a ball ε that determines the boundary of a cluster, and the number of complex structures to retain, Λ. We also provide the set of clusters added in previous iteration cycles, $\bigcup_{\beta=0}^{\varsigma-1}\bigcup_k \Gamma_{ik}^{(\beta)}$, so that clusters are added only if not already present.
find_top_violations:
• Input: receptor $X_R$, ligand $X_L$, native transformation $\tau_{nat}$, scoring function parameters w, tolerated error υ, existing clusters T, cluster size ε, and minimum number of solutions to retain Λ
• find the radius of each protein and determine the density of rotational sampling and grid spacing to be used such that the error can be bounded by υ
• compute the score of the native transformation $E_{\tau_{nat}}$
• compute grids $R_{vdw}$ and $R_j$, $\forall j \in \{1, \ldots, 22\}$, on the receptor protein and their inverse Fourier transforms
• (Γ, E, Δ) ← Ø /* set of high scoring transformations, their energies and distances from native */
• $V_{sorted}$ = [0, 0, …, 0] (sorted array of extents of the top Λ violations)
• for $u_\alpha \in U$ (the space of rotations) do
 ○ compute scores $E_\alpha^{Grid}$ for all transformations ($\Gamma_\alpha$) involving the current rotation
 ○ for $\tau \in \Gamma_\alpha$ with $E_{\tau_{nat}} - E_\tau^{Grid} \le 1 + \upsilon$ do
  ▪ compute $\Delta(\tau, \tau_{nat}) = \mathrm{irmsd}((X_R, X_L)(\tau), (X_R, X_L)(\tau_{nat}))$
  ▪ if $(1 - (E_{\tau_{nat}} - E_\tau^{Grid})) > V_{sorted}[\Lambda]/\Delta(\tau, \tau_{nat}) - \upsilon$, then
   compute the exact score $E_\tau$
   if $(1 - (E_{\tau_{nat}} - E_\tau)) > V_{sorted}[\Lambda]/\Delta(\tau, \tau_{nat})$ and $E_{\tau_{nat}} - E_\tau \le 1$, then
    compute $V_\tau = \Delta(\tau, \tau_{nat})\,(1 - (E_{\tau_{nat}} - E_\tau))$, update $V_{sorted}$
   fi
  ▪ fi
 ○ end for
• end for
• incrementally cluster the retained transformations and add/update clusters; let $T_{out}$ be the final set of clusters
• Output: ($T_{out}$, E, Δ)
More compactly: $(T_i^{(\varsigma)}, E, \Delta) = \mathrm{find\_top\_violations}(X_{i1}, X_{i2}, \tau_i, P, \upsilon, \varepsilon, \Lambda, \bigcup_{\beta=0}^{\varsigma-1}\bigcup_k \Gamma_{ik}^{(\beta)})$
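8.2: Create new clusters if regions of top scoring transformations have not been seen so far according to the current set of parameters.
 /* add violated constraints to the working set */
for $G_k \in T_i^{(\varsigma)}$ do
  for $\tau_i^{(j)} \in G_k$ do
    if $\Delta(\tau_i^{(j)}, \tau_i)\,(1 - \sum_\alpha p_\alpha[f_\alpha(\tau_i) - f_\alpha(\tau_i^{(j)})]) > 0$ /* have a violation */
     if cluster ik exists from a previous iteration or was added in this loop (say it was $\Gamma_{il}^{(\iota)}$) and $\Delta(\tau_i^{(j)}, \tau_i)\,(1 - \sum_\alpha p_\alpha[f_\alpha(\tau_i) - f_\alpha(\tau_i^{(j)})]) > \eta_{il} + \upsilon$, then $\Gamma_{il}^{(\iota)} = \Gamma_{il}^{(\iota)} \cup \{\tau_i^{(j)}\}$ fi
     if new cluster ik, $\Gamma_{ik}^{(\varsigma)} = \{\tau_i^{(j)}\}$ fi
    fi
  end for
end for
9: Set constraints $S_i = \bigcup_{\beta=0}^{\varsigma}\{\forall G_k \in \Gamma_i^{(\beta)},\ \forall \tau_i^{(j)} \in G_k:\ \sum_\alpha p_\alpha[f_\alpha(\tau_i) - f_\alpha(\tau_i^{(j)})] \ge 1 - \eta_{ik}/\Delta(\tau_i, \tau_i^{(j)})\}$ and solve the quadratic programming problem $(P, \eta) = \arg\min_{P,\eta} \frac{1}{2}P^tP + \frac{C}{n}\sum_{i,k}\eta_{ik}$ subject to the constraints $\bigcup_{i=1}^{n} S_i$ and $\forall i, k:\ \eta_{ik} \ge 0$
10: end for
11: until no new constraints are found during the iteration ($\bigcup_{i,k}\Gamma_{ik}^{(\varsigma)} = Ø$)
12: Output: P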
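
The bookkeeping in find_top_violations, retaining only the Λ largest violation extents $V_\tau = \Delta(\tau,\tau_{nat})(1 - (E_{\tau_{nat}} - E_\tau))$, can be sketched with a bounded min-heap. This is a simplified illustration under stated assumptions: candidate scores and Δ values are supplied directly instead of being produced by the grid search, the υ-tolerant pre-filter on grid scores is omitted, and all names are hypothetical.

```python
import heapq

def violation_extent(e_native, e_decoy, delta):
    """V = delta * (1 - (E_native - E_decoy)); V > 0 means the unit margin is missed."""
    return delta * (1.0 - (e_native - e_decoy))

def top_violations(e_native, candidates, max_keep):
    """Keep the max_keep largest positive violations.

    candidates: iterable of (label, e_decoy, delta) tuples.
    Returns a list of (V, label) sorted from most to least violated.
    """
    heap = []  # min-heap; heap[0] is the weakest violation currently retained
    for label, e_decoy, delta in candidates:
        v = violation_extent(e_native, e_decoy, delta)
        if v <= 0.0:
            continue                      # margin satisfied, no constraint needed
        if len(heap) < max_keep:
            heapq.heappush(heap, (v, label))
        elif v > heap[0][0]:
            heapq.heapreplace(heap, (v, label))
    return sorted(heap, reverse=True)

# Toy candidates (assumption): native score 10, four decoys.
decoys = [("a", 9.5, 2.0), ("b", 12.0, 1.0), ("c", 5.0, 4.0), ("d", 10.0, 0.5)]
kept = top_violations(10.0, decoys, max_keep=2)
```

The weakest retained violation, `heap[0]`, plays the role of $V_{sorted}[\Lambda]$ in the listing: a new candidate is kept only if it exceeds it.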

Distances max out at $\sqrt{8}$ (under the Frobenius norm) in the rotation space, while they can get arbitrarily large in the translation space. We introduce a scaling factor L to combine distances in both spaces in the metric Θ. For $\varepsilon \le L\sqrt{8}$, consider covers P(ε) of SE3 that are collections of ε-balls such that the centers of any pair of balls are at least ε apart. (It is sufficient to show the covering property for small ε; the proof can be extended to all values of ε by taking into account the maxing out of distances in SO3.) The number of elements in P(ε) is infinite, but we are interested in covering arbitrary ε-balls in SE3 using elements of P(ε). Let B be the given ε-ball that needs to be covered and let R(B) be its center. Consider all balls {B1, B2, …, Bn} in P(ε) that intersect B and let R(B1), R(B2), …, R(Bn) be their centers. Then $d(R(B), R(B_i)) < 2\varepsilon$. Since $d(R(B_i), R(B_j)) > \varepsilon$, every ball $B_i$ has a sphere of radius $\varepsilon/2$ around $R(B_i)$ that does not intersect the corresponding sphere of any other element of P(ε). So K times the volume of a ball of radius $\varepsilon/2$ is at most the volume of a ball of radius 2ε.

Lemma. For $\varepsilon \le L\sqrt{8}$, the volume of an ε-ball is proportional to $\varepsilon^6$.

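Proof of lemma. Consider the quaternion representation of SO3 and let $O \equiv a + b\mathbf{i} + c\mathbf{j} + d\mathbf{k}$. We have $a^2 + b^2 + c^2 + d^2 = 1$ and

$$(\lVert O - I\rVert_F)^2 = 2\,\mathrm{Trace}(I - O) = 2\times\big(3 - ((a^2 + b^2 - c^2 - d^2) + (a^2 - b^2 + c^2 - d^2) + (a^2 - b^2 - c^2 + d^2))\big) = 2\times(3 - 3a^2 + b^2 + c^2 + d^2) = 8\times(b^2 + c^2 + d^2),$$

i.e., for ν ⩽ 1, the volume of a ν-ball in SO3 is $\frac{4}{3}\pi\,\frac{\nu^3}{(\sqrt{8})^3}$. The volume of an ε-ball in SE3 is given by

$$\text{volume} = \int_{t_x^2 + t_y^2 + t_z^2 + L^2\times 8(a^2 + b^2 + c^2) \le \varepsilon^2} dt_x\,dt_y\,dt_z\,da\,db\,dc = \frac{1}{(L\sqrt{8})^3}\int_{t_x^2 + t_y^2 + t_z^2 + x_a^2 + x_b^2 + x_c^2 \le \varepsilon^2} dt_x\,dt_y\,dt_z\,dx_a\,dx_b\,dx_c = \frac{\pi^3}{6\times(L\sqrt{8})^3}\,\varepsilon^6.$$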

Hence $K \le 4^6$.
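
The last step can be written out explicitly. Writing the volume of an ε-ball from the lemma as $c\,\varepsilon^6$, the volume comparison between K disjoint balls of radius ε/2 and one ball of radius 2ε gives the following (a worked restatement of the argument above, with c introduced here only as shorthand):

```latex
K \cdot c\left(\frac{\varepsilon}{2}\right)^{6} \le c\,(2\varepsilon)^{6}
\;\Longrightarrow\;
K \le \frac{(2\varepsilon)^{6}}{(\varepsilon/2)^{6}} = 4^{6} = 4096,
\qquad c = \frac{\pi^{3}}{6\,(L\sqrt{8})^{3}} .
```

The constant c cancels, so the bound depends only on the dimension (six) of the space of rigid body transformations and not on the scaling factor L.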

With the metric Θ we perform greedy incremental clustering. We add new clusters if the distance (using Θ) to the existing cluster centers exceeds 10 Å. The clustering is performed as follows. We start with the most violated constraint as the center of the first cluster. If there are more violations that are farther than 10 Å from the existing clusters, we take the most violated of them and add it as a new cluster. This process is repeated until all the violations are accounted for. In practice, in a single iteration we allow the addition of $10^5$ clusters per complex.
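
The greedy procedure can be sketched as a single pass over the violations in decreasing order of violation extent. The code below is a minimal sketch under stated assumptions: Θ is taken as $\sqrt{\lVert t_1 - t_2\rVert^2 + L^2\lVert u_1u_2^T - I\rVert_F^2}$, a violation within the radius of several centers is assigned to the earliest created one (a choice made here, not stated in the text), and the function names, the scale value, and the toy transformations are hypothetical.

```python
import numpy as np

def theta(t1, u1, t2, u2, scale):
    """Assumed metric: sqrt(|t1 - t2|^2 + scale^2 * ||u1 u2^T - I||_F^2)."""
    dt = np.asarray(t1, dtype=float) - np.asarray(t2, dtype=float)
    rot = np.asarray(u1, dtype=float) @ np.asarray(u2, dtype=float).T - np.eye(3)
    return float(np.sqrt(dt @ dt + scale**2 * np.sum(rot**2)))

def greedy_clusters(transforms, violations, radius, scale):
    """transforms: list of (translation, rotation); violations: extent per transform.

    Returns (centers, assignment): indices of cluster centers, and for each
    transform the index of the cluster it belongs to.
    """
    order = np.argsort(-np.asarray(violations, dtype=float), kind="stable")
    centers, assignment = [], [-1] * len(transforms)
    for idx in order:                      # most violated first
        t, u = transforms[idx]
        for k, c in enumerate(centers):
            tc, uc = transforms[c]
            if theta(t, u, tc, uc, scale) <= radius:
                assignment[idx] = k        # covered by an existing cluster
                break
        else:
            centers.append(int(idx))       # uncovered: becomes a new center
            assignment[idx] = len(centers) - 1
    return centers, assignment

# Toy input (assumption): pure translations along x with identity rotations.
eye = np.eye(3)
toy = [((0, 0, 0), eye), ((3, 0, 0), eye), ((20, 0, 0), eye), ((25, 0, 0), eye)]
centers, assignment = greedy_clusters(toy, [5.0, 4.0, 3.0, 6.0], radius=10.0, scale=10.0)
```

Because violations are visited in decreasing order, each new center is the most violated transformation not yet covered, which matches the description above.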

Duality theory and a summary of previous work

We state results from duality theory that are used later in proving that the algorithm converges. Given a clustering Π of transformations, denoting η(Π)ik by ηik, the quadratic optimization problem with all constraints included is

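$$Z(P,\eta) = \frac{1}{2}P^tP + \frac{C}{n}\sum_{i,k}\eta_{ik} \quad\text{such that}\quad \forall\, i,\ \tau \in SE3\setminus\tau_i:\ \sum_\alpha p_\alpha\,[f_\alpha(\tau_i) - f_\alpha(\tau_i^{(j)})] \ge 1 - \frac{\eta_{ik}}{\Delta(\tau_i,\tau_i^{(j)})}\ \text{and}\ \eta_{ik} \ge 0; \qquad (P^*,\eta^*) = \arg\min_{P,\eta} Z(P,\eta) \tag{15}$$

Define M as the matrix of inequalities, $M(ij,\alpha) = f_\alpha(\tau_i) - f_\alpha(\tau_i^{(j)})$, and let $N_{j,k} = 1/\Delta(\tau_i,\tau_i^{(j)})$ if k = j(i) (transformation $\tau_i^{(j)}$ belongs to cluster k) and $N_{j,k} = 0$ otherwise. Let e be a column vector with each element equal to 1. The constraints can be written in matrix form as MP + Nη ⩾ e.

Let $L(P,\eta,\alpha,t,s) = \frac{1}{2}P^TP + \frac{C}{n}e^T\eta - \alpha^T(MP + N\eta - s - e) - t^T\eta$ such that η ⩾ 0, α ⩾ 0, s ⩾ 0, t ⩾ 0, and $D(\alpha) = -\frac{1}{2}\alpha^TMM^T\alpha + e^T\alpha$ such that $N^T\alpha \le \frac{C}{n}e$. The problem $(P^*, \eta^*) = \arg\min_{P,\eta} Z(P,\eta)$ is said to be the primal problem and $\alpha^* = \arg\max_\alpha D(\alpha)$ is said to be the dual problem. A point is said to be a feasible point of a problem if it satisfies the constraints associated with the problem.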

The following properties hold:

  1. For every feasible point (P, η, α, t, s) of L if (P, η) is a feasible point of the primal, L(P, η, α, t, s) ⩽ Z(P, η).

  2. For every feasible point (P, η) of the primal there exist α, s, t such that (P, η, α, t, s) is a feasible point of L.

  3. For every feasible point (α) of the dual there exist P, η, s, t such that (P, η, α, t, s) is a feasible point of L.

  4. For every feasible point (P, η, α, t, s) of L if (α) is a feasible point of the dual, D(α) ⩽ L(P, η, α, t, s).

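  5. If $(P^*, \eta^*) = \arg\min_{P,\eta} Z(P,\eta)$ and $\alpha^* = \arg\max_\alpha D(\alpha)$, there exist s, t such that $D(\alpha^*) = L(P^*, \eta^*, \alpha^*, t, s) = Z(P^*, \eta^*)$. Further, $(P^*, \eta^*, \alpha^*, t, s)$ satisfy $P^* - (\alpha^*)^TM = 0$ (from $\partial L/\partial P = 0$), $\frac{C}{n}e - t - (\alpha^*)^TN = 0$ (from $\partial L/\partial \eta = 0$), and $MP + N\eta - s - e = 0$ (from $\partial L/\partial \alpha = 0$).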

  6. As a result of 1–5; if (P, η) is a feasible point of the primal and α a feasible point of the dual, Z(P, η) ⩾ D(α).
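
Property 6 (weak duality) can be checked numerically on a small random instance. The sketch below assumes the dual has the standard form $D(\alpha) = -\frac{1}{2}\alpha^TMM^T\alpha + e^T\alpha$ with $\alpha \ge 0$ and $N^T\alpha \le \frac{C}{n}e$; the problem sizes, random data, and helper names are arbitrary illustrative choices, and no optimization is performed.

```python
import numpy as np

rng = np.random.default_rng(1)
n_constraints, n_features, n_slacks = 12, 4, 3
C, n = 1.0, 2                       # toy penalty weight and number of complexes

M = rng.normal(size=(n_constraints, n_features))         # rows: feature differences
cluster = rng.integers(0, n_slacks, size=n_constraints)  # cluster of each constraint
delta = rng.uniform(0.5, 3.0, size=n_constraints)        # dissimilarity to the native
N = np.zeros((n_constraints, n_slacks))
N[np.arange(n_constraints), cluster] = 1.0 / delta

def primal_value(P, eta):
    return 0.5 * P @ P + C / n * eta.sum()

def dual_value(alpha):
    v = M.T @ alpha
    return -0.5 * v @ v + alpha.sum()

def random_primal_point():
    """Random P with the smallest slacks that make M P + N eta >= e hold."""
    P = rng.normal(size=n_features)
    eta = np.zeros(n_slacks)
    np.maximum.at(eta, cluster, delta * (1.0 - M @ P))   # per-cluster maximum, floored at 0
    return P, eta

def random_dual_point():
    """Random alpha >= 0, rescaled so that N^T alpha <= (C / n) e."""
    alpha = rng.uniform(0.0, 1.0, size=n_constraints)
    worst = (N.T @ alpha).max()
    return alpha * min(1.0, (C / n) / worst)

gaps = []
for _ in range(200):
    P, eta = random_primal_point()
    alpha = random_dual_point()
    gaps.append(primal_value(P, eta) - dual_value(alpha))
smallest_gap = min(gaps)   # weak duality predicts this is non-negative
```

Every sampled pair of feasible points gives a primal value at least as large as the dual value, which is the inequality the convergence argument relies on.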

We summarize the framework of Tsochantaridis et al.17 here for the benefit of the reader. Our proof extends their ideas to incorporate clustering. Let $\Delta_i = \max_\tau\{\Delta(\tau_i, \tau)\}$, $\bar\Delta = \max_i\{\Delta_i\}$, $R_i = \max_\tau\{\lVert\sum_\alpha f_\alpha(\tau) - f_\alpha(\tau_i)\rVert\}$, and $\bar R = \max_i\{R_i\}$. For docking, $\bar\Delta$ would be the maximum i-RMSD and $\bar R$ is the maximum feature difference (in absolute terms) encountered in the problem.

The authors show that one does not have to list all constraints to solve the primal problem. They show that it is sufficient to add the violations that incur the largest penalty at each iteration. Although a quadratic optimization is solved to update the parameters at each iteration, the analysis is in terms of the progress made in solving the dual D. They show that the dual improves by at least a constant amount in each iteration. Since P = 0 and $\eta = \bar\Delta e$ is a feasible solution of the primal, $D(\alpha^*) = Z(P^*, \eta^*) \le Z(0, \bar\Delta e)$ and so the procedure converges.

Learning with clustering

Proposition. (Extension to Proposition 16 of Tsochantaridis et al.17): The improvement in the dual objective function δ is lower bounded by $\delta \ge \min\left\{\frac{C\upsilon}{2n}, \frac{\upsilon^2}{8\bar\Delta^2\bar R^2}\right\}$.

While the dual problem stays the same during the solution of SVMstruct, the dual changes as new clusters are added during the course of our algorithm, i.e., new columns and new rows are added to the $MM^T$ matrix. The old solution with zeroes added in the new dimensions is feasible for the new problem (since we get a trivial solution for the new component). This solution can now be improved following Proposition 16, i.e., given a new D(α) the solution can now be optimized.

Our task here is complicated by the fact that we do not know the best clustering scheme and instead use clustering on-the-fly as more inequalities are added. The impact of less than optimal clustering on the learning needs to be evaluated. The covering property discussed previously allows us to estimate the cost of clustering in the worst case scenario that still provides coverage of the conformation space leading to violations.

Theorem. Let $(P^*, \eta^*, \Pi^*) = \arg\min_{\Pi,P,\eta(\Pi)} \frac{1}{2}P^tP + \frac{C}{n}\sum_{i,k}\eta(\Pi)_{ik}$ such that $\forall\, i,\ \tau \in SE3\setminus\tau_i:\ \sum_\alpha p_\alpha[f_\alpha(\tau_i) - f_\alpha(\tau_i^{(j)})] \ge 1 - \frac{\eta(\Pi)_{ik}}{\Delta(\tau_i,\tau_i^{(j)})}$ (the least expensive solution over all possible clustering schemes). Let $\sigma^* = \frac{1}{2}P^{*t}P^* + \frac{C}{n}\sum_{i,k}\eta^*_{ik}$. For a given υ > 0, the learning algorithm terminates after $K \times \max\left\{\frac{2n\sigma^*}{C\upsilon}, \frac{8\bar\Delta^2\bar R^2\sigma^*}{\upsilon^2}\right\}$ iterations.

Proof. At each step the dual objective function increases by at least $\delta = \min\left\{\frac{C\upsilon}{2n}, \frac{\upsilon^2}{8\bar\Delta^2\bar R^2}\right\}$. Suppose the algorithm does not converge in the said number of iterations; let $(P^e, \eta^e, \Pi^e)$ be the solution at this stage, let $\sigma_Z$ and $\sigma_D$ be the values of the primal and dual objective functions (with the partial covering $\Pi^e$ of transformation space; a clustering of a space induces a cover on it), and let $\sigma_Z^+$ be the cost of the optimal parameters $P^*$ when the current clustering $\Pi^e$ is used.

  • $\sigma_D \le \sigma_Z$ (primal is a minimization problem)

  • $\sigma_Z \le \sigma_Z^+$ ($P^e$ minimizes the primal objective function when clustering $\Pi^e$ is used)

  • Claim. $\sigma_Z^+ \le K\sigma^*$

Proof of claim. Consider the following mapping from $\Pi^e$ to $\Pi^* \cup \{O\}$: for each cluster $G_{ik}^e \in \Pi^e$, if it does not intersect any element in $\Pi^*$, map it to O; if it does intersect, map it to the intersecting element with the largest slack cost. By the covering property, the number of elements that get mapped to any element of $\Pi^*$ is at most K, and $P^*$ does not incur any penalty on elements mapped to O. So $\sigma_Z^+ \le K\sigma^*$.

Since the algorithm makes progress of at least δ in each iteration, we have $\sigma_D > K\sigma^*$, leading to a contradiction. Hence the theorem.

It follows from the proof that the solution returned satisfies $\sigma_Z \le K\sigma^*$.
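
To see how the termination bound scales, it can be evaluated for given constants. The sketch below reads the bound as $K \times \max\{2n\sigma^*/(C\upsilon),\ 8\bar\Delta^2\bar R^2\sigma^*/\upsilon^2\}$, which equals $K\sigma^*/\delta$ for $\delta = \min\{C\upsilon/(2n),\ \upsilon^2/(8\bar\Delta^2\bar R^2)\}$; the function names and all numeric inputs are illustrative assumptions rather than values from this study.

```python
def dual_step(n, C, upsilon, delta_bar, r_bar):
    """Guaranteed per-iteration improvement of the dual objective."""
    return min(C * upsilon / (2.0 * n), upsilon**2 / (8.0 * delta_bar**2 * r_bar**2))

def iteration_bound(K, n, C, upsilon, sigma_star, delta_bar, r_bar):
    """K * max{2 n sigma* / (C upsilon), 8 Delta^2 R^2 sigma* / upsilon^2}."""
    return K * max(2.0 * n * sigma_star / (C * upsilon),
                   8.0 * delta_bar**2 * r_bar**2 * sigma_star / upsilon**2)

# Toy constants (assumption), chosen so the arithmetic is exact in floating point.
bound = iteration_bound(K=2, n=1, C=1.0, upsilon=0.5, sigma_star=3.0,
                        delta_bar=1.0, r_bar=1.0)
step = dual_step(n=1, C=1.0, upsilon=0.5, delta_bar=1.0, r_bar=1.0)
```

The bound grows linearly in the covering number K and in the optimal cost, and quadratically in 1/υ once the second term dominates.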

We emphasize that we retain all inequalities in the iterations and clustering procedure. Clustering is only used to determine the slack variables.

RESULTS

Derivation of parameters

We used the set of 640 protein-protein dimer complexes prepared in our earlier work.22 The scoring function is based on a linear combination of vdw attraction, vdw repulsion, and contacts between 22 different particle types (a particle type was chosen for the backbone carbonyl group, backbone amide group, and each residue type was represented by a different particle type). We used a piecewise linear interpolation to represent the functional form for the contact function (see section “Residue and backbone contact potential”).

The initial parameter set was computed with straightforward linear programming using decoys generated by Patchdock (as outlined in our earlier work22). This initial step does not include an exhaustive set of transformations; it relies instead on another docking program (Patchdock41) to provide a set of structures appropriate for learning.

The parameters determined from optimization with the Patchdock-based set of structures were used in exhaustive ranking of all docking candidates on a grid at the first iteration, as described in the text. At each iteration, we docked protein partners using the current parameter set. To reduce the noise in the learning we added a new constraint requiring that the native (bound) transformation score above any false transformation. If the false positive led to a new cluster, we added a new slack variable. For each complex, up to 100 000 top violated constraints were added in each iteration. All constraints from the 640 complexes were pooled with the constraints discovered so far and the resultant quadratic program was solved for the new set of parameters. The dimension of the feature space was 252. The largest quadratic optimization problem solved as part of the learning involved 258 127 822 constraints with 27 564 303 slack variables. We follow the framework outlined in OOQP,43 use the parallel routines reported in an earlier work,20 and develop a primal-dual interior point algorithm for solving the QP arising in the learning algorithm. The final quadratic program was solved in 32 h on 618 cores on Ranger, a supercomputer maintained by the Texas Advanced Computing Center. The potential converges with successive iterations (illustrated in Fig. 4 and Table 1).

Figure 4.


The scoring function converges as iterative learning proceeds. For each iteration we plot the dot product between the parameters (normalized to have L2 norm 1) at this iteration and at the previous iteration. The blip at iteration 2 arises from switching from linear programming to quadratic programming for parameter estimation.

Table 1.

A complex is said to be explained if a high quality hit, ((irmsd ≤ 3 Å) ∨ (Cα rmsd ≤ 3) ∧ (frac_native_contacts ≥ 0.5)), is ranked within the top N.

| Iteration | Method | No. of constraints | No. of clusters | Complexes explained: top 1 | Complexes explained: top 10 | Complexes explained: top 100 | Complexes explained: top 1000 |
|---|---|---|---|---|---|---|---|
| | Zdock3.0 | | | 333 | 375 | 441 | 489 |
| | Patchdock | | | 201 | 302 | 431 | 513 |
| 0 | LP | 25 719 027 | | 179 | 237 | 337 | 437 |
| 1 | QP | 66 893 447 | 8 293 956 | 171 | 212 | 278 | 387 |
| 2 | QP | 101 088 699 | 12 090 751 | 234 | 303 | 401 | 497 |
| 3 | QP | 134 079 719 | 15 109 006 | 270 | 353 | 446 | 535 |
| 4 | QP | 168 055 994 | 18 911 122 | 291 | 364 | 454 | 526 |
| 5 | QP | 204 805 256 | 22 225 252 | 315 | 389 | 471 | 546 |
| 6 | QP | 258 127 822 | 27 564 303 | 318 | 398 | 484 | 557 |

We obtained the initial guess (iteration 0) from the linear programming framework, following the procedure explained in our earlier work.22 The switch from linear programming to quadratic programming probably caused the blip at iteration 2. The dot product between the normalized parameter vectors estimated at iterations 5 and 6 is 0.978. Our results reaffirm the observation of Lu et al.32 that subtle differences in the potential grossly affect its performance in structure prediction tasks: the dot product between the parameters estimated at iterations 0 and 6 is 0.7, while the performance on protein-protein docking improves from iteration 0 to iteration 6 by a factor of 2. The final potential is provided in Table 2.
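
The convergence measure quoted above is the dot product of two parameter vectors after each is scaled to unit L2 norm. A minimal sketch follows; the function name and the toy vectors are illustrative, not parameters from Table 2.

```python
import numpy as np

def normalized_dot(p_prev, p_curr):
    """Dot product of two parameter vectors after scaling each to unit L2 norm."""
    a = np.asarray(p_prev, dtype=float)
    b = np.asarray(p_curr, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

same_direction = normalized_dot([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # rescaled copy
diverged = normalized_dot([1.0, 0.0], [1.0, 1.0])                   # 45 degrees apart
```

A value near 1 means that two successive iterations produced nearly parallel parameter vectors, independent of any overall rescaling of the potential.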

Table 2.

Vdw residue backbone scoring function (PIE_Vdw_Res_Bkbn).

  ILE VAL LEU PHE CYS MET ALA GLY THR SER TRP TYR PRO HIS ASN GLN ASP GLU LYS ARG NH CO
ILE 0.132 0.072 0.09 0.086 0.066 0.1 0.055 0.077 0.058 0.03 0.016 0.063 0.001 0.024 0.018 0.034 0.016 −0.007 −0.025 0.009 −0.041 0.006
VAL 0.072 0.16 0.104 0.059 −0.006 0.058 0.038 0.015 0.026 0.017 −0.005 0.066 0.043 0.025 0.025 −0.007 0.006 0.04 −0.007 0.041 −0.029 −0.001
LEU 0.09 0.104 0.1 0.1 0.049 0.05 0.072 0.044 0.027 0.029 0.059 0.092 0.043 0.019 0.004 0.021 −0.001 0.001 −0.008 0.023 −0.052 0.007
PHE 0.086 0.059 0.1 0.071 0.067 0.094 0.036 0.041 0.065 0.032 0.021 0.089 0.074 0.049 0.051 0.049 0.034 0.02 −0.023 0.029 −0.037 0.005
CYS 0.066 −0.006 0.049 0.067 0.118 0.124 0.007 0.034 0.005 0.02 −0.049 0.068 0.03 0.09 −0.049 −0.04 −0.016 −0.048 0.018 0.047 0.021 −0.021
MET 0.1 0.058 0.05 0.094 0.124 0.059 0.058 0.057 0.056 0.048 0.054 0.099 0.006 0.044 0.047 0.025 0.012 −0.01 −0.015 0.003 −0.049 0.009
ALA 0.055 0.038 0.072 0.036 0.007 0.058 0.052 0.026 0.027 0.032 0.07 0.046 −0.024 0.01 −0.002 0.027 −0.023 0.011 0.007 −0.002 −0.009 −0.025
GLY 0.077 0.015 0.044 0.041 0.034 0.057 0.026 0.061 0.028 0.036 0.028 0.051 0.032 0.005 0.034 0.005 0.011 −0.006 −0.006 0.043 −0.033 −0.008
THR 0.058 0.026 0.027 0.065 0.005 0.056 0.027 0.028 0.012 0.034 0.006 0.042 0.019 0.044 0.025 0.052 0.039 0.029 −0.035 0.022 −0.03 −0.007
SER 0.03 0.017 0.029 0.032 0.02 0.048 0.032 0.036 0.034 0.034 0.019 0.007 0.038 0.047 0.007 0.013 0.01 0.034 0.01 −0.007 −0.015 −0.022
TRP 0.016 −0.005 0.059 0.021 −0.049 0.054 0.07 0.028 0.006 0.019 0.186 0.003 0.074 0.107 0.053 0.069 −0.063 −0.039 0.034 0.057 −0.004 0.027
TYR 0.063 0.066 0.092 0.089 0.068 0.099 0.046 0.051 0.042 0.007 0.003 0.103 0.078 0.022 0.057 0.013 0.014 0.004 −0.002 0.046 −0.033 0.013
PRO 0.001 0.043 0.043 0.074 0.03 0.006 −0.024 0.032 0.019 0.038 0.074 0.078 0.01 0.014 0.021 0.024 0.01 0 −0.052 −0.013 −0.045 0.004
HIS 0.024 0.025 0.019 0.049 0.09 0.044 0.01 0.005 0.044 0.047 0.107 0.022 0.014 0.026 0.026 −0.025 0.035 0.018 0.016 −0.041 −0.041 0.018
ASN 0.018 0.025 0.004 0.051 −0.049 0.047 −0.002 0.034 0.025 0.007 0.053 0.057 0.021 0.026 0.054 0.015 −0.002 −0.008 −0.018 −0.002 −0.028 −0.001
GLN 0.034 −0.007 0.021 0.049 −0.04 0.025 0.027 0.005 0.052 0.013 0.069 0.013 0.024 −0.025 0.015 0.003 0.008 −0.031 0.023 0.005 −0.025 0.008
ASP 0.016 0.006 −0.001 0.034 −0.016 0.012 −0.023 0.011 0.039 0.01 −0.063 0.014 0.01 0.035 −0.002 0.008 −0.009 −0.034 0.039 0.085 0.006 −0.041
GLU −0.007 0.04 0.001 0.02 −0.048 −0.01 0.011 −0.006 0.029 0.034 −0.039 0.004 0 0.018 −0.008 −0.031 −0.034 −0.038 0.059 0.062 0.018 −0.042
LYS −0.025 −0.007 −0.008 −0.023 0.018 −0.015 0.007 −0.006 −0.035 0.01 0.034 −0.002 −0.052 0.016 −0.018 0.023 0.039 0.059 −0.04 −0.065 −0.022 0.018
ARG 0.009 0.041 0.023 0.029 0.047 0.003 −0.002 0.043 0.022 −0.007 0.057 0.046 −0.013 −0.041 −0.002 0.005 0.085 0.062 −0.065 −0.016 −0.042 0.027
NH −0.041 −0.029 −0.052 −0.037 0.021 −0.049 −0.009 −0.033 −0.03 −0.015 −0.004 −0.033 −0.045 −0.041 −0.028 −0.025 0.006 0.018 −0.022 −0.042 0.03 0.011
CO 0.006 −0.001 0.007 0.005 −0.021 0.009 −0.025 −0.008 −0.007 −0.022 0.027 0.013 0.004 0.018 −0.001 0.008 −0.041 −0.042 0.018 0.027 0.011 0.001
VDW 0.009                                          

Test on newly deposited complexes

There were 157 heterodimeric protein-protein complexes deposited in the PDB since 2008 that were not similar to any complex in the training set. Of these, 55 complexes had no ligand molecules or ions close to the interface, had no disulphide bonds, and did not involve extensive conformational change upon docking (terminal unfolding/insertion, domain rearrangement). Twelve of these complexes had unbound configurations (homolog with tm-score45 below 0.95) for at least one chain; these constitute the test set. When using homologs, we always model the structure of the native sequence based on the homolog and dock the models. The input pdbs are available at http://users.ices.utexas.edu/~ravid/pie/test_set/.

Our protocol (flowchart illustrated in Fig. 5) is to dock using the learnt potential, sort the solutions, cluster them based on i-RMSD, and return high scoring representatives from these clusters. The same algorithm was used for CAPRI targets 46, 48, 49, and 50 (parameters from iteration 3 were used for target 46). This protocol identifies a near-native solution within the top 10 (top 100) in 5 of 12 (8 of 12) cases, compared to 3 of 12 (6 of 12) by Zdock3.0, 5 of 12 (7 of 12) by Cluspro,34 and 2 of 12 (5 of 12) by Gramm-X.35 The comparison on each case is provided in Table 3. Note that the algorithm presented in the text is coarse grained, using a residue-based potential and rigid protein shapes. The other algorithms we compared to use more sophisticated descriptions, including atomic models and refinement of the initial structures. We find it encouraging that the simplified model is competitive with these more detailed approaches.
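The sort-then-cluster step of the protocol can be illustrated with a short greedy clustering sketch. The exact clustering rule, the distance cutoff, and the data layout are not specified in this section, so everything below (the function name `greedy_cluster`, the `(score, item)` pairs, the `distance` callback standing in for i-RMSD, and the `cutoff` parameter) is an assumption made for illustration.

```python
def greedy_cluster(candidates, distance, cutoff):
    """Greedy clustering of scored candidates.

    candidates: list of (score, item); a higher score is assumed to be better.
    distance:   function(item_a, item_b) -> float (stand-in for i-RMSD).
    cutoff:     items within this distance of a representative join its cluster.
    Returns a list of (representative, members), best representative first.
    """
    remaining = sorted(candidates, key=lambda c: c[0], reverse=True)
    clusters = []
    while remaining:
        rep = remaining[0][1]  # best-scoring unassigned item
        members = [item for _, item in remaining
                   if distance(rep, item) <= cutoff]
        clusters.append((rep, members))
        remaining = [(s, item) for s, item in remaining
                     if distance(rep, item) > cutoff]
    return clusters
```

Each pass takes the best-scoring unassigned solution as a cluster representative and removes everything close to it, so the representatives come out ordered by score and are mutually separated by more than the cutoff.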

Figure 5.


Outline of the algorithm used in CAPRI and in testing to predict the mode of binding in a protein-protein interaction; the algorithm is also available as a web service. We retain only 2^19 = 524 288 conformations due to computational limitations.

Table 3.

Performance of Dock/PIE is comparable to that of Zdock (Ref. 11) with Zrank (Ref. 33), Cluspro (Ref. 34), and Gramm-X (Ref. 35). A model is said to be a hit if its i-RMSD from the native structure is ⩽ 4 Å. The Besthit entries give the rank of the best-ranked model that is a hit; an entry of 0 indicates that no hit was found. ZD3.0ZR is the result of rescoring transformations generated by Zdock3.0 using Zrank. We apply the greedy i-RMSD based clustering developed as part of our algorithm to structures generated by Zdock3.0 with Zrank and report the results under the columns labeled ZDZR + cluster. In summary, Dock/PIE correctly predicted 0/5/8 complexes in the top 1/top 10/top 100 hits, Zdock3.0 1/3/6, ZD3.0ZR 1/1/4, ZDZR + cluster 1/1/6, Cluspro 0/5/7, and Gramm-X 0/2/5. According to this test, Dock/PIE is on par with these leading technologies.

                           Dock/PIE and cluster                   Zdock3.0  ZD3.0ZR               ZDZR + cluster        Cluspro               Gramm-X
Case                       Besthit  No. of hits  No. of clusters  Besthit   Besthit  No. of hits  Besthit  No. of hits  Besthit  No. of hits  Besthit  No. of hits
2wfx_from_3ho4B_-_2ibgH_-  1078     4            9604             6825      3012     4            0        0            0        0            190      1
3d65_from_3d65_E_3btmI_-   5        75           2228             77        108      1547         45       52           3        10           2        17
3di3_from_3di3_B_3di2C_-   49       8            2744             27        166      384          35       7            0        0            0        0
3fpn_from_3fpn_A_2nmvA_-   10       7            1001             1         24       462          17       4            10       1            270      1
3g9a_from_3ed8D_-_3g9a_B   10       33           3762             106       431      675          174      19           10       1            198      1
3hct_from_1fxtA_-_3hct_A   0        0            1038             20 084    662      1            0        0            0        0            137      1
3jrq_from_2iq1A_-_3jrq_B   4        3            1619             4         26       405          12       1            9        1            34       3
3l1z_from_3fshB_-_3l1z_B   75       10           3154             4985      3897     98           720      4            13       1            4        4
3l9j_from_2tnfB_-_3l9j_C   175      9            1409             108       1        624          1        11           0        0            50       1
3m18_from_3m18_A_1i56A_-   88       16           4258             5         1284     234          1128     6            0        0            134      3
3m62_from_3m62_A_1nddB_-   2        60           2357             40        26       743          17       35           4        8            16       2
3nbp_from_1mu2A_-_3nbp_B   262      1            5365             0         0        0            0        0            24       1            0        0
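The top 1/top 10/top 100 summary quoted in the caption of Table 3 follows directly from the Besthit columns. The short sketch below recomputes it for three of the methods; the helper name `count_within_top` is ours, and the reading that a Besthit of 0 means "no hit found" is an assumption based on the rows where the number of hits is also 0.

```python
def count_within_top(best_hit_ranks, cutoffs=(1, 10, 100)):
    """Count cases whose best hit is ranked within each cutoff.

    A rank of 0 is treated as "no hit" (assumption, see lead-in).
    """
    return tuple(sum(1 for r in best_hit_ranks if 0 < r <= n)
                 for n in cutoffs)


# Besthit columns copied from Table 3, in row order.
dock_pie = [1078, 5, 49, 10, 10, 0, 4, 75, 175, 88, 2, 262]
zdock30 = [6825, 77, 27, 1, 106, 20084, 4, 4985, 108, 5, 40, 0]
cluspro = [0, 3, 0, 10, 10, 0, 9, 13, 0, 0, 4, 24]

print(count_within_top(dock_pie))  # (0, 5, 8)
print(count_within_top(zdock30))   # (1, 3, 6)
print(count_within_top(cluspro))   # (0, 5, 7)
```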

Tests on Zlab benchmark

The Zlab benchmark is the de facto standard in the field, so we tested our potential on this set as well. We removed constraints corresponding to cases similar to Zlab Benchmark 2 (Ref. 46) from the learning set and retrained the potential. This retrained potential was used to evaluate the learning procedure on the benchmark. We docked every pair listed in the Zlab benchmark using our docking procedure and ranked the solutions according to the potential designed here. We did not use the potential reported in Ref. 22 since it does not work well for exhaustive sampling: when asked to pick the best transformation from all possibilities in the rigid transformation space, that scoring potential almost always picks an incorrect solution. This issue is addressed here. Benchmark 2 comprises 84 complexes for unbound protein-protein docking. Our algorithm selects a near-native solution in the top 1/top 10/top 100 in 12/28/52 cases, compared to 10/21/41 by Zdock3.0 + Zrank (Table 4).

Table 4.

Comparing Dock/PIE and ZDOCK + ZRANK on the Zlab benchmark. Dock/PIE ranks a near-native solution in the top 1/top 10/top 100 in 12/28/52 cases, compared to 10/21/41 by Zdock3.0 + Zrank.

      DOCK/PIE                             ZDOCK + ZRANK
Case  Besthit  Nos. returned  No. of hits  Besthit  No. of hits
1A2K  2        5784           23           1038     570
1ACB  1        813            4            780      581
1AHW  64       3299           5            27       347
1AK4  50       1720           17           1315     253
1AKJ  1064     3341           3            175      236
1ATN  670      14 708         11           1076     15
1AVX  5        7291           45           11       744
1AY7  24       817            4            74       407
1B6C  3        1353           5            1        509
1BGX  0        4776           0            0        0
1BJ1  3        1553           12           19       1637
1BUH  5        1396           5            353      514
1BVK  321      6793           52           116      425
1BVN  2        2252           34           10       946
1CGI  13       1818           20           22       1167
1D6R  39       13 847         107          2347     32
1DE4  4        47 977         117          426      133
1DFJ  1        2100           4            2        334
1DQJ  401      2236           11           753      374
1E6E  14       7823           84           3        448
1E6J  7        660            10           1        1244
1E96  2        3023           16           24       196
1EAW  2        787            6            1        597
1EER  97       12 028         5            330      11
1EWY  27       985            5            21       586
1EZU  1        1763           4            2247     340
1F34  19       3664           6            62       172
1F51  40       3973           7            3        304
1FAK  0        10 006         0            0        0
1FC2  96       903            15           154      92
1FQ1  189      8187           4            15 260   9
1FQJ  1301     9687           5            491      24
1FSK  1        2045           25           1        851
1GCQ  407      1602           4            922      146
1GHQ  0        6887           0            2982     2
1GRN  321      1025           1            558      166
1HE1  7        1666           4            36       253
1HE8  323      10 776         17           75       8
1HIA  1        824            6            618      145
1I2M  11       1366           1            473      98
1I4D  19       2407           4            1349     351
1I9R  112      4585           16           31       370
1IB1  26       7290           18           33 099   2
1IBR  0        2346           0            0        0
1IJK  107      3773           16           444      116
1IQD  1        4171           23           1        802
1JPS  135      2143           5            1        385
1K4C  3663     34 627         78           162      1323
1K5D  798      1514           1            84       143
1KAC  725      7541           35           11       160
1KKL  115      2815           12           70       173
1KLU  907      11 041         24           13 333   18
1KTZ  1003     3369           12           397      90
1KXP  1        5639           7            12       283
1KXQ  4        1848           3            14       200
1M10  3705     4175           1            10 647   4
1MAH  16       11 048         170          3        1177
1ML0  1        5999           167          1        548
1MLC  129      2715           17           5        616
1N2C  4        12 875         47           3203     129
1NCA  317      10 089         37           14       126
1NSN  375      28 750         121          468      174
1PPE  1        370            28           1        3616
1QA9  0        3170           0            1850     29
1QFW  1737     3265           7            192      107
1RLB  9        9268           70           1        1767
1SBB  2334     3298           1            3639     26
1TMQ  2        8857           59           71       353
1UDI  1        2855           23           2        359
1VFB  92       1571           5            437      341
1WEJ  65       1059           9            2        907
1WQ1  41       2082           2            296      142
2BTF  51       11 102         26           151      295
2HMI  67       13 332         21           272      331
2JEL  1        2696           20           42       1285
2MTA  68       940            9            57       627
2PCC  13       1549           8            218      389
2QFW  16       3978           15           6        510
2SIC  14       1010           4            1        768
2SNI  1        608            6            114      554
2VIS  6084     25 047         4            8        703
7CEI  2        1488           10           3        965

Comparison to other residue contact potentials

The algorithm used to generate decoys influences the learning of a potential. It is therefore not trivial to compare score functions on decoy structures that were generated by the same approach that is used for the learning. In the previous sections we compared the algorithms (not the energy functions), letting every protocol generate its own candidates for correct docking. Nevertheless, there are docking potentials learnt with different techniques (statistical potentials or linear programming) without a clearly defined docking algorithm that we wish to evaluate and compare to our approach. To conduct the comparison it is necessary to generate decoy structures that are independent of our own (and others') procedures. We therefore use Zdock3.0 to compute 54 000 decoys for each of the 640 complexes that were included in the training set. The results of scoring these structures with different energy functions are provided in Table 5. A statistical potential derived for template identification47 is better than random (has p-value below 0.5), potentials derived on protein-protein interfaces32 perform better, and discriminative learning improves them further. Accounting for exhaustive enumeration of conformation space improves the result even further. Ignoring the OPLS factor in our potential (setting Pvdw to 0 and using the remaining terms as is from Table 2), which results in a function closer to a contact potential, still explains a significant fraction of complexes in the training set.

Table 5.

The top 54 000 structures generated by Zdock3.0 are re-ranked under various schemes. For each case we compute the rank within which a random scoring function would find a hit with probability 0.5; the row labeled Reference summarizes this evaluation (Zdock3.0 generated many near-native structures, so that scoring at random would pick up a hit in the top 100 models in 162 cases). Statistical potentials capture signal in protein-protein interfaces in the PDB; our iterative learning procedure does a better job of mining this information. The Round6 potential does a better job when used for sampling and scoring rather than for rescoring alone.

                                            No. of complexes explained
Method                                      Top 1  Top 10  Top 100  Top 1000
Reference                                   0      1       162      501
MJ3                                         50     104     224      386
LLS                                         94     179     282      417
TB                                          148    229     350      460
Round6 potential                            290    361     434      502
Round6 potential, no shape complementarity  244    334     417      489
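One way to read the Reference row of Table 5 is as follows: if a complex has a given number of hits among its decoys and the decoys are ordered at random, the reference rank is the smallest k for which the top k contain at least one hit with probability at least 0.5. The sketch below computes that rank exactly; this interpretation, the function name `median_random_rank`, and its arguments are assumptions made for illustration rather than the procedure used to build the table.

```python
from fractions import Fraction


def median_random_rank(n_decoys, n_hits):
    """Smallest k such that a random ordering has a hit in its top k
    with probability >= 1/2 (sampling without replacement).

    Returns None when there are no hits at all.
    """
    if n_hits <= 0:
        return None
    p_miss = Fraction(1)  # probability that the top k contain no hit
    for k in range(1, n_decoys + 1):
        # The k-th pick is also a miss, given that the first k-1 were misses.
        p_miss *= Fraction(n_decoys - n_hits - (k - 1), n_decoys - (k - 1))
        if p_miss <= Fraction(1, 2):
            return k
    return n_decoys
```

With a single hit among 54 000 decoys the function returns 27 000, while a complex with several hundred hits reaches probability 0.5 within the first hundred ranks, which is why a random ordering can already "explain" a number of complexes at the top 100 and top 1000 levels.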

DISCUSSIONS

Mathematical programming was used extensively in the field of protein folding,12, 13, 14, 48, 49 protein docking,22, 25 and protein design.51, 52 These algorithms are invariably based on heuristic sampling of constraints. However carefully the selections were made, there was no proof that the algorithms converge or even improve with the addition of new constraints. The present work provides an algorithm that shows systematic and monotonic improvement in the energy function as the number of constraints added to the set increases.

The learning algorithm presented in this work connects iterative learning procedures used for potential design in protein folding and protein docking community with research in support vector machines. Learning a contact potential (residue, atomic, even after inclusion of distance dependence) that always scores native above all possible conformations is infeasible12 and it is critical to develop approaches to select the best model under these circumstances. Other simpler methods that optimize Z-scores9 or work with limited conformational sampling13, 38, 47, 48, 49 have worked well in practice and the quality of the potential generally improved with the extent of the conformational space explored. However, in these approaches there are no theoretical guarantees on the quality of solution, and it is difficult to further improve the solution upon the discovery of new examples. As more data become available, it is important for algorithms to continue systematically and consistently to improve their capacity. This is the promise of convex programming.

Recent research in structured output prediction has provided significant breakthroughs on this front.17 When the global optimum can be found, learning procedures can be designed with provable guarantees. Efficient procedures exist for finding the global optimum in certain models for protein docking, which makes the above advances particularly relevant to docking. We extend the structural SVM framework to include notions of clustering, traditionally used in protein structure prediction, and illustrate the learning procedure on protein docking.

The learning algorithm presented in this work connects iterative learning procedures used in the protein docking community to the rich body of research on support vector machines. For the first time, we provide a learning procedure that provably and systematically improves the quality of the parameter set by a large-scale minimization of the extent of misprediction. The procedure has a simple explanation: “progress in learning is achieved by looking at the largest mistakes, the more the better.”

Interestingly, a single residue-based potential does well for both filtering and ranking. It is of course possible to re-rank the final results with more finely tuned energy functions and protocols and perhaps improve the results. However, the current potential combines ranking and filtering quite successfully, and at a coarse-grained level.

The framework is applicable for parameter optimization in a wide range of tasks in structural bioinformatics – wherever the problems of finding the global minimum and finding the largest violation with a given choice of parameters are tractable and one can look at the output space as a low dimensional space. For instance, potentials can be designed for protein structure refinement by iteratively adjusting the model using the current potential, accumulating mistakes made, and constraining the potential to avoid (pay for) these mistakes. Nonlinear functional forms using Gaussian kernel functions in quadratic programming have shown promise in designing scoring functions for protein design.53 These approaches could be explored through extensions to the QP solver.

ACKNOWLEDGMENTS

We thank Michael Wagner for providing the parallel linear algebra routines and Steve Wright for the development and open source availability of OOQP, which were extremely helpful in the preparation of the QP solver. Texas Advanced Computing Centre's facilities (in particular Stampede and Ranger) were used extensively in this work. We thank the reviewers for careful reading of the paper and for making many useful suggestions. This research was supported by National Institutes of Health (NIH) Grant No. GM67823.

APPENDIX A: ERROR ANALYSIS

Theorem. Let $\tau^* = (t^*, u^*)$ be the optimal solution for docking based on the scoring function parameterized by $w$, and let $D(g, U)$ be the discretization used. There is a solution $\tau = (t, u) \in D$ such that $\varepsilon = |E(\tau^*) - E_{Grid}(\tau)|$ approaches 0 along with $\|t^* - t\|$ and $\|u^* - u\|$ as $g$ and $err(U)$ approach 0.

Proof. Let $u = \arg\min_{u_j \in U} \|u^* - u_j\|$ (the rotation in the discrete set closest to the correct rotation) and $t = [\arg\min_{(l,m,n)} \|(l,m,n)\,g - t^*\|] \times g$ (the translation in the discrete set closest to the correct translation), and let $\tau = (t, u)$.

$$E(\tau^*) - E_{Grid}(\tau) = (E(\tau^*) - E(\tau)) + (E(\tau) - E_{Grid}(\tau))$$
$$= \underbrace{(E_{vdw}(\tau^*) - E_{vdw}(\tau))}_{1} + \underbrace{(E_{particle\_pair}(\tau^*) - E_{particle\_pair}(\tau))}_{2} + \underbrace{(E_{vdw}(\tau) - E^{Grid}_{vdw}(\tau))}_{3} + \underbrace{(E_{particle\_pair}(\tau) - E^{Grid}_{particle\_pair}(\tau))}_{4}.$$

Term 2. $E_{particle\_pair}(\tau^*) - E_{particle\_pair}(\tau) = \sum_{j=1}^{22} \left[ E^{j}_{particle\_pair}(\tau^*) - E^{j}_{particle\_pair}(\tau) \right]$.

Claim. $|E^{j}_{particle\_pair}(\tau^*) - E^{j}_{particle\_pair}(\tau)| \le N_L(j) \times maxerror(\Phi_j(r))$, where $N_L(j)$ is the number of particles of type $j$ in the ligand.

Proof. Consider a particle $p$ of type $j$ of the ligand, let $P$ be its position upon application of transformation $\tau^*$ and $Q$ be the position upon application of transformation $\tau$. Consider spheres of radii $R_{min}$ and $R_{max}$ around $P$ and $Q$; let $\Omega_1$ be the region common to the smaller spheres, $\Omega_2$ be the region not common to the smaller spheres, and $\Omega_3$ be the region enclosed by the larger spheres not in $\Omega_1$ and $\Omega_2$. These regions are illustrated in Fig. 6.

$$d_p = |P - Q| = |\tau^*(r_p) - \tau(r_p)| = |(t^* - t) + (u^* - u)(r_p)| \le |t^* - t| + |(u^* - u)(r_p)| \le \frac{\sqrt{3}}{2}\,g + d_L\,err(U);$$

also, $d = \max_{1 \le p \le N_L}\{d_p\} \le \frac{\sqrt{3}}{2}\,g + d_L\,err(U)$.
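The translational part of the bound on $d_p$ rests on a simple geometric fact: on a cubic grid with spacing g, every point lies within $(\sqrt{3}/2)\,g$ of its nearest grid point. The sketch below checks this numerically; the function names `snap_to_grid` and `snap_error` are ours, and the rotational term $d_L\,err(U)$ is not modeled here.

```python
import math


def snap_to_grid(t_star, g):
    """Nearest point of the cubic grid with spacing g (per-axis rounding)."""
    return tuple(round(x / g) * g for x in t_star)


def snap_error(t_star, g):
    """Distance between a translation and its nearest grid translation."""
    return math.dist(t_star, snap_to_grid(t_star, g))


g = 1.2  # illustrative grid spacing
bound = math.sqrt(3.0) / 2.0 * g
for t_star in [(0.6, 0.6, 0.6), (1.0, 2.5, -0.3), (7.19, -4.01, 0.0)]:
    assert snap_error(t_star, g) <= bound + 1e-12
```

The first test point sits at the center of a grid cell, which is the worst case and attains the bound.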

Figure 6.


In the proof of the theorem for error analysis of DOCK/PIE, P is the position of the particle upon application of transformation τ* and Q is the position upon application of transformation τ. Consider spheres of radii Rmin and Rmax around P and Q. The difference between the contact potential at P and at Q depends only on particles in the regions Ω2 and Ω3.

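The error in the potential is $\Delta\Phi_j(p) = \Phi_j(P) - \Phi_j(Q) \le \frac{d_p}{R_{max} - R_{min}} \times \max_{i \in \{1,\ldots,22\}}\{|w_{ij}|\} \times \eta$, where $\eta$ is the number of particles of the receptor with centers in the region $\Omega_2 \cup \Omega_3$.

$\eta \le N_R$ (one can derive a tighter upper bound on $\eta$ by using the proposition that the particles of the receptor are packed such that they have impenetrable cores), so

$$maxerror(\Phi_j(r)) \le \frac{d}{R_{max} - R_{min}} \times \max_{i \in \{1,\ldots,22\}}\{|w_{ij}|\} \times N_R,$$

so

$$|E^{j}_{particle\_pair}(\tau^*) - E^{j}_{particle\_pair}(\tau)| \le N_L(j) \times \frac{d}{R_{max} - R_{min}} \times \max_{i \in \{1,\ldots,22\}}\{|w_{ij}|\} \times N_R.$$

Hence

$$|E_{particle\_pair}(\tau^*) - E_{particle\_pair}(\tau)| \le \frac{d}{R_{max} - R_{min}} \times N_L \times \max_{i \in \{1,\ldots,22\}}\{|w_{ij}|\} \times N_R.$$

Term 4. $E_{particle\_pair}(\tau) - E^{Grid}_{particle\_pair}(\tau) = \sum_{j=1}^{22} \left[ E^{j}_{particle\_pair}(\tau) - E^{Grid,\,j}_{particle\_pair}(\tau) \right]$.

Claim. $|E^{j}_{particle\_pair}(\tau) - E^{Grid,\,j}_{particle\_pair}(\tau)| \le N_L(j) \times max\_approx\_error(\Phi_j(r))$.

Proof. The error in the potential is given by

$$\Delta\Phi_j(p) = \Phi_j(P) - \Phi_j^{Grid}(P) \le \sum_{1 \le q \le N_R} w_{type(q)\,j}\,\bigl(h(r_{pq}) - h_{interpolate}(r_{pq})\bigr) \le \max_{k \in \{1,\ldots,22\}}\{|w_{kj}|\} \times \sum_{q}\bigl(h(r_{pq}) - h_{interpolate}(r_{pq})\bigr),$$

$$h(r_{pq}) - h_{interpolate}(r_{pq}) = \sum_{\alpha,\beta,\gamma \in \{0,1\}} w_{\alpha\beta\gamma}\,\bigl(h(r_{pq}) - h(r_{pq}^{\alpha\beta\gamma})\bigr),$$

where $r_{pq}^{\alpha\beta\gamma}$ is the distance between the center of particle $q$ and corner $\alpha\beta\gamma$ of the cell containing the point $P$.

$$h(r_{pq}) - h(r_{pq}^{\alpha\beta\gamma}) \le \frac{|r_{pq} - r_{pq}^{\alpha\beta\gamma}|}{R_{max} - R_{min}} \le \frac{\sqrt{3}\,g}{R_{max} - R_{min}}, \quad \text{so} \quad h(r_{pq}) - h_{interpolate}(r_{pq}) \le \frac{\sqrt{3}\,g}{R_{max} - R_{min}}.$$

Therefore $\Delta\Phi_j(p) \le \max_{k \in \{1,\ldots,22\}}\{|w_{kj}|\} \times \eta_2 \times \frac{\sqrt{3}\,g}{R_{max} - R_{min}}$, where $\eta_2$ is the maximum number of particles of the receptor with centers in the region spanned by the union of spheres of radii $R_{max}$ centered at the corners of a grid cell.

$\eta_2 \le N_R$, so $\Delta\Phi_j(p) \le \max_{k \in \{1,\ldots,22\}}\{|w_{kj}|\} \times N_R \times \frac{\sqrt{3}\,g}{R_{max} - R_{min}}$;

therefore

$$|E^{j}_{particle\_pair}(\tau) - E^{Grid,\,j}_{particle\_pair}(\tau)| \le N_L(j) \times \max_{k \in \{1,\ldots,22\}}\{|w_{kj}|\} \times N_R \times \frac{\sqrt{3}\,g}{R_{max} - R_{min}}$$

and

$$|E_{particle\_pair}(\tau) - E^{Grid}_{particle\_pair}(\tau)| \le \frac{\sqrt{3}\,g}{R_{max} - R_{min}} \times N_L \times \max_{k \in \{1,\ldots,22\}}\{|w_{kj}|\} \times N_R.$$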
Term1.|Evdw(τ*)Evdw(τ)|max{0.5,|wvdw̱repul|}×maxi,j4ɛiɛjσj2.53.23.5×NLatom×NRatom×d
Term3.|Evdw(τ)EGridvdw(τ)|max{0.5,|wvdw̱repul|}×maxi,j4ɛiɛjσj2.53.23.5×NLatom×NRatom×3g

Hence the theorem.
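The grid terms of the proof (Terms 3 and 4) compare an exact function with its interpolation from the eight corners of the grid cell containing a point, using corner weights that sum to 1. A minimal sketch of such a trilinear interpolation is given below; the storage layout `grid[i][j][k]`, the function name `trilinear`, and the requirement that the point lie strictly inside the tabulated region are assumptions made for the example.

```python
def trilinear(grid, g, point):
    """Interpolate a value at `point` from the 8 corners of its grid cell.

    grid[i][j][k] is assumed to hold the value at (i*g, j*g, k*g);
    `point` must lie inside the tabulated region, below its upper faces.
    """
    idx, frac = [], []
    for x in point:
        i = int(x // g)           # index of the lower cell corner
        idx.append(i)
        frac.append(x / g - i)    # fractional position inside the cell
    value = 0.0
    for a in (0, 1):
        for b in (0, 1):
            for c in (0, 1):
                # Corner weight; the eight weights sum to 1.
                weight = ((frac[0] if a else 1.0 - frac[0]) *
                          (frac[1] if b else 1.0 - frac[1]) *
                          (frac[2] if c else 1.0 - frac[2]))
                value += weight * grid[idx[0] + a][idx[1] + b][idx[2] + c]
    return value
```

Because the weights are non-negative and sum to 1, the interpolation error at a point is bounded by the largest difference between the function value at the point and its values at the cell corners, which is the step used in the bound above.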

APPENDIX B: FAST COMPUTATION OF RMSD IN RIGID BODY DOCKING

We provide an algorithm that, after a one-time preprocessing pass over the native structure, calculates in constant time the RMSD between a pair of docked structures arising in rigid body docking. Together with DOCK/PIE this completes the description of the procedure for efficient generation of the top Λ violated constraints in step 8 of the learning algorithm.

Preprocess

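1: Input: Sets of points $X_R = \{r_1, r_2, \ldots, r_m\}$, $X_L = \{l_1, l_2, \ldots, l_n\}$ representing the receptor and ligand in the native structure.

2: Center the point sets,

$$\bar{r} = \frac{1}{m}\sum_{i=1}^{m} r_i, \quad \bar{l} = \frac{1}{n}\sum_{i=1}^{n} l_i, \quad c = \frac{m\bar{r} + n\bar{l}}{m + n}, \quad r_i = r_i - c, \quad l_i = l_i - c,$$

and recalculate $\bar{r} = \frac{1}{m}\sum_{i=1}^{m} r_i$ and $\bar{l} = \frac{1}{n}\sum_{i=1}^{n} l_i$.

3: Compute

$$A = \begin{pmatrix} a_{xx} & a_{xy} & a_{xz} \\ a_{yx} & a_{yy} & a_{yz} \\ a_{zx} & a_{zy} & a_{zz} \end{pmatrix}, \quad \text{where} \quad a_{\alpha\beta} = \sum_{i=1}^{m} r_{i\alpha} \times r_{i\beta},$$

and

$$B = \begin{pmatrix} b_{xx} & b_{xy} & b_{xz} \\ b_{yx} & b_{yy} & b_{yz} \\ b_{zx} & b_{zy} & b_{zz} \end{pmatrix}, \quad \text{where} \quad b_{\alpha\beta} = \sum_{i=1}^{n} l_{i\alpha} \times l_{i\beta}.$$

RMSD calculation

1: Input: Rigid body transformation $\tau = (t, u)$; the goal is to compute $RMSD((X_R, X_L), (X_R, \tau(X_L)))$.

2: Compute

$$\Re = \begin{pmatrix} \rho_{xx} & \rho_{xy} & \rho_{xz} \\ \rho_{yx} & \rho_{yy} & \rho_{yz} \\ \rho_{zx} & \rho_{zy} & \rho_{zz} \end{pmatrix},$$

where

$$\rho_{\alpha\beta} = \frac{a_{\alpha\beta} + \sum_{i=1}^{n} (\tau(l_i))_{\alpha} \times l_{i\beta}}{m + n} = \frac{a_{\alpha\beta} + \sum_{i=1}^{n} (u_{\alpha x} l_{ix} + u_{\alpha y} l_{iy} + u_{\alpha z} l_{iz} + t_{\alpha}) \times l_{i\beta}}{m + n} = \frac{a_{\alpha\beta} + u_{\alpha x} b_{x\beta} + u_{\alpha y} b_{y\beta} + u_{\alpha z} b_{z\beta} + t_{\alpha} \times n\,\bar{l}_{\beta}}{m + n}.$$

3: Calculate the RMSD from the eigenvalues of $\Re^{T}\Re$.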

Claim. Let $N_R$, $N_L$ be the number of points in the receptor and ligand. The algorithm RMSD_RIGID_DOCK takes $O(N_R + N_L + M)$ time to process $M$ transformations.

The straightforward approach of computing the RMSD between the model and the native structure by explicitly listing the points of the model would involve $\Theta((m + n)M)$ operations. This procedure is also applicable to clustering solutions according to i-RMSD: all pairwise distances between $M$ transformations are calculated in $O((m + n)M + M^2)$ operations (the straightforward procedure would involve $\Theta((m + n)M^2)$ operations).
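The preprocessing and constant-time evaluation described in this appendix can be sketched in pure Python as follows. This is an illustration under stated assumptions rather than the RMSD_RIGID_DOCK code itself: it assumes that the RMSD is taken after optimal superposition of the two complexes (a Kabsch-style calculation, with singular values obtained as square roots of the eigenvalues of the 3x3 matrix $C^T C$), that the transformation acts on the centered ligand coordinates, and that the helper names (`preprocess`, `rmsd_rigid_dock`, `sym_eigvals`) are ours. The closed-form eigenvalue routine loses accuracy for degenerate point sets (for example collinear points).

```python
import math


def outer_sum(points):
    """3x3 matrix with entries sum_i p_i[a] * p_i[b]."""
    return [[sum(p[a] * p[b] for p in points) for b in range(3)]
            for a in range(3)]


def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def sym_eigvals(s):
    """Eigenvalues (descending) of a symmetric 3x3 matrix, closed form."""
    q = (s[0][0] + s[1][1] + s[2][2]) / 3.0
    off = s[0][1] ** 2 + s[0][2] ** 2 + s[1][2] ** 2
    p2 = sum((s[i][i] - q) ** 2 for i in range(3)) + 2.0 * off
    if p2 == 0.0:
        return [q, q, q]
    p = math.sqrt(p2 / 6.0)
    b = [[(s[i][j] - (q if i == j else 0.0)) / p for j in range(3)]
         for i in range(3)]
    phi = math.acos(max(-1.0, min(1.0, det3(b) / 2.0))) / 3.0
    e1 = q + 2.0 * p * math.cos(phi)
    e3 = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
    return [e1, 3.0 * q - e1 - e3, e3]


def preprocess(receptor, ligand):
    """One O(m + n) pass over the native complex."""
    total = len(receptor) + len(ligand)
    c = [sum(p[a] for p in receptor + ligand) / total for a in range(3)]
    rec = [[p[a] - c[a] for a in range(3)] for p in receptor]
    lig = [[p[a] - c[a] for a in range(3)] for p in ligand]
    return {"A": outer_sum(rec), "B": outer_sum(lig),
            "lig_sum": [sum(p[a] for p in lig) for a in range(3)],
            "n": len(lig), "total": total}


def rmsd_rigid_dock(pre, u, t):
    """RMSD between native and (receptor, u*ligand + t); O(1) per call."""
    A, B, s_l = pre["A"], pre["B"], pre["lig_sum"]
    n, total = pre["n"], pre["total"]
    # Cross term sum_i y_i x_i^T = A + u B + t s_l^T.
    cross = [[A[a][b] + sum(u[a][k] * B[k][b] for k in range(3))
              + t[a] * s_l[b] for b in range(3)] for a in range(3)]
    u_sl = [sum(u[a][k] * s_l[k] for k in range(3)) for a in range(3)]
    trace = sum(A[a][a] + B[a][a] for a in range(3))
    # Sum of model coordinates; the receptor sum is -s_l after centering.
    y_sum = [-s_l[a] + u_sl[a] + n * t[a] for a in range(3)]
    g_model = (trace + 2.0 * sum(t[a] * u_sl[a] for a in range(3))
               + n * sum(x * x for x in t)
               - sum(x * x for x in y_sum) / total)
    ctc = [[sum(cross[k][a] * cross[k][b] for k in range(3))
            for b in range(3)] for a in range(3)]
    sing = [math.sqrt(max(e, 0.0)) for e in sym_eigvals(ctc)]
    if det3(cross) < 0.0:  # exclude improper rotations (reflections)
        sing[2] = -sing[2]
    msd = (trace + g_model - 2.0 * sum(sing)) / total
    return math.sqrt(max(msd, 0.0))
```

After `preprocess` has been called once, each call to `rmsd_rigid_dock` touches only 3x3 quantities, so processing M transformations costs O(m + n + M) operations, in line with the claim above.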

APPENDIX C: QP SOLVER

OOQP (Ref. 43) provides an open-source framework with tools for solving general quadratic programs and quadratic programs arising from binary classification problems. We follow the framework outlined in OOQP, use the parallel routines reported in an earlier work (Wagner et al.20), and develop a primal-dual interior point algorithm for solving the QP arising in the learning algorithm. Predictor-corrector based interior point algorithms iteratively approach the optimal solution by alternately reducing the duality gap and maintaining centrality.

We develop two algorithms for quadratic programming. The first one follows the infeasible starting point algorithm introduced by Potra.54 The solution traverses a path guided by the Karush-Kuhn-Tucker (KKT) conditions. KKT conditions are a generalization of the method of Lagrange multipliers for solution of systems with inequality constraints. After each update, the new point is closer to satisfying feasibility and the KKT conditions compared to the starting point. The second algorithm follows Gondzio's approach.55

Primal.

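$(w, \xi) = \arg\min_{w,\xi} \frac{1}{2}\|w\|^2 + \frac{C}{n}\sum_{i,k}\xi_{ik}$ subject to $\xi_{ik} \ge 0$ and $\forall i,\ \forall \tau_i^{(j)} \in S_i:\ w^T\bigl(P_{\tau_i} - P_{\tau_i^{(j)}}\bigr) \ge 1 - \frac{\xi_{ik}}{\Delta(\tau_i, \tau_i^{(j)})}$, rewritten as $Mw + N\xi \ge e$, where $N_{j,k} = \frac{1}{\Delta(\tau_i, \tau_i^{(j)})}$ if $k = j(i)$ (transformation $\tau_i^{(j)}$ belongs to cluster $k$) and $N_{j,k} = 0$ otherwise.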
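To make the role of the cluster-shared slack variables concrete, the sketch below evaluates the primal objective for a given parameter vector. For each cluster it takes the smallest slack that satisfies every slack-rescaled constraint assigned to that cluster. The function name `primal_objective`, the argument layout, and the reading of n as a simple normalizing count are assumptions made for the example.

```python
def primal_objective(w, deltas, losses, cluster_of, c_param, n):
    """Evaluate 0.5*||w||^2 + (C/n) * sum of slacks for a fixed w.

    deltas[j]     : feature difference P(native) - P(decoy j)
    losses[j]     : loss Delta > 0 of decoy j relative to the native
    cluster_of[j] : key of the slack variable (cluster) shared by decoy j
    Constraint j  : dot(w, deltas[j]) >= 1 - slack / losses[j]
    """
    slack = {}
    for d, loss, key in zip(deltas, losses, cluster_of):
        margin = sum(wi * di for wi, di in zip(w, d))
        needed = max(0.0, loss * (1.0 - margin))  # smallest feasible slack
        slack[key] = max(slack.get(key, 0.0), needed)
    reg = 0.5 * sum(wi * wi for wi in w)
    return reg + c_param / n * sum(slack.values()), slack
```

Because decoys in the same cluster share one slack variable, only the worst violation in each cluster contributes to the objective, which is the intended effect of adding a new slack variable only when a false positive opens a new cluster.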

Primal-dual.

$\frac{1}{2}\|w\|^2 + \frac{C}{n} e^T\xi - \upsilon^T(Mw + N\xi - s - e) - t^T\xi$ subject to $\xi \ge 0$, $\upsilon \ge 0$, $s \ge 0$, $t \ge 0$.

KKT conditions.

$w - M^T\upsilon = 0$, $Mw + N\xi - s - e = 0$, $\frac{C}{n}e - N^T\upsilon - t = 0$, $VSe = 0$, and $T\Xi e = 0$, where $V$ is a diagonal matrix with $V[i, i] = \upsilon_i$; $S$, $T$, $\Xi$ are similar diagonal matrices defined by $s$, $t$, $\xi$.

The system of equations to be solved for the update at each iteration is

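$$\begin{pmatrix} I & -M^T & 0 & 0 & 0 \\ M & 0 & N & -I & 0 \\ 0 & N^T & 0 & 0 & I \\ 0 & S & 0 & V & 0 \\ 0 & 0 & T & 0 & \Xi \end{pmatrix} \begin{pmatrix} \Delta w \\ \Delta\upsilon \\ \Delta\xi \\ \Delta s \\ \Delta t \end{pmatrix} = \begin{pmatrix} r_w \\ r_M \\ r_C \\ -r_{SV} \\ -r_{T\Xi} \end{pmatrix}$$

Which reduces after block eliminations to,

$$(I + M^T D M)\,\Delta w = r_w + M^T D\,\bigl(r_M - V^{-1} r_{SV} + N T^{-1} r_{T\Xi} + N T^{-1}\Xi\, r_C\bigr)$$

where

$$D = \bigl(V^{-1}S + N T^{-1}\Xi N^T\bigr)^{-1}.$$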

The computationally intensive task is to form $I + M^T D M$ and compute its Cholesky factorization quickly, so that the linear system arising in each iteration of the interior point algorithm can be solved efficiently. We reuse the procedure described in an earlier work20 (after minor extensions to handle clustering) for the calculation of $I + M^T D M$ in a completely data parallel fashion. To handle our formulation, we modified the SVM module in OOQP into a parallelized, combined weighted-slack one-class SVM.
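The structure of this computation can be illustrated on a small dense example. The sketch below makes a simplifying assumption that does not hold in general for the clustered formulation: it takes D to be diagonal (true when every slack variable belongs to a single constraint), so that $I + M^T D M$ is a sum of independent rank-one contributions, one per constraint. The function names `normal_matrix` and `cholesky_solve` are ours.

```python
import math


def normal_matrix(m_rows, d):
    """Accumulate I + M^T diag(d) M one constraint (row of M) at a time.

    Each row contributes d_j * m_j m_j^T independently, which is what makes
    the accumulation easy to split across processors.
    """
    dim = len(m_rows[0])
    h = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    for row, weight in zip(m_rows, d):
        for i in range(dim):
            for j in range(dim):
                h[i][j] += weight * row[i] * row[j]
    return h


def cholesky_solve(h, rhs):
    """Solve h x = rhs for symmetric positive definite h via h = L L^T."""
    n = len(h)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = h[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    y = [0.0] * n
    for i in range(n):  # forward substitution: L y = rhs
        y[i] = (rhs[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution: L^T x = y
        x[i] = (y[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x
```

The matrix being factorized has the dimension of the feature space (252 in this work) regardless of the number of constraints, so the cost of each interior point iteration is dominated by the accumulation over constraints rather than by the factorization.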

References

  1. Alberts B., Cell 92, 291 (1998). doi: 10.1016/S0092-8674(00)80922-8
  2. Jones S. and Thornton J. M., Proc. Natl. Acad. Sci. U.S.A. 93(1), 13 (1996). doi: 10.1073/pnas.93.1.13
  3. Kozakov D., Hall D. R., Beglov D., Brenke R., Comeau S. R., Shen Y., Li K. Y., Zheng J. F., Vakili P., Paschalidis I. C., and Vajda S., Proteins: Struct., Funct., Bioinf. 78(15), 3124 (2010). doi: 10.1002/prot.22835
  4. Onufriev A., Bashford D., and Case D. A., Proteins: Struct., Funct., Bioinf. 55(2), 383 (2004). doi: 10.1002/prot.20033
  5. Berman H. M., Westbrook J., Feng Z., Gilliland G., Bhat T. N., Weissig H., Shindyalov I. N., and Bourne P. E., Nucleic Acids Res. 28(1), 235 (2000). doi: 10.1093/nar/28.1.235
  6. Vallat B. K., Pillardy J., and Elber R., Proteins: Struct., Funct., Bioinf. 72(3), 910 (2008). doi: 10.1002/prot.21976
  7. Miyazawa S. and Jernigan R. L., Macromolecules 18(3), 534 (1985). doi: 10.1021/ma00145a039
  8. Betancourt M. R. and Thirumalai D., Protein Sci. 8(2), 361 (1999). doi: 10.1110/ps.8.2.361
  9. Lu H. and Skolnick J., Proteins: Struct., Funct., Bioinf. 44(3), 223 (2001). doi: 10.1002/prot.1087
  10. Fleishman S. J., Corn J. E., Strauch E. M., Whitehead T. A., Andre I., Thompson J., Havranek J. J., Das R., Bradley P., and Baker D., Proteins: Struct., Funct., Bioinf. 78(15), 3212 (2010). doi: 10.1002/prot.22784
  11. Chen R., Li L., and Weng Z., Proteins: Struct., Funct., Genet. 52(1), 80 (2003). doi: 10.1002/prot.10389
  12. Vendruscolo M., Najmanovich R., and Domany E., Proteins: Struct., Funct., Genet. 38(2), 134 (2000).
  13. Tobi D., Shafran G., Linial N., and Elber R., Proteins: Struct., Funct., Genet. 40(1), 71 (2000).
  14. Tobi D. and Elber R., Proteins: Struct., Funct., Genet. 41(1), 40 (2000).
  15. Bryngelson J. D., Onuchic J. N., Socci N. D., and Wolynes P. G., Proteins: Struct., Funct., Genet. 21(3), 167 (1995). doi: 10.1002/prot.340210302
  16. Cristianini N. and Shawe-Taylor J., An Introduction to Support Vector Machines (Cambridge University Press, Cambridge, England, 2000).
  17. Tsochantaridis I., Joachims T., Hofmann T., and Altun Y., J. Mach. Learn. Res. 6, 1453 (2005).
  18. Joachims T., Galor T., and Elber R., Learning to Align Sequences: A Maximal Margin Approach (Springer-Verlag, Berlin, 2005).
  19. Wright S. J., Primal-Dual Interior-Point Methods (SIAM, Philadelphia, 1997).
  20. Wagner M., Meller J., and Elber R., Math. Program. 101(2), 301 (2004). doi: 10.1007/s10107-004-0526-7
  21. Ravikant D. V. S. and Elber R., Proteins: Struct., Funct., Bioinf. 78(2), 400 (2010). doi: 10.1002/prot.22550
  22. Qiu J. and Elber R., Proteins: Struct., Funct., Bioinf. 61(1), 44 (2005). doi: 10.1002/prot.20585
  23. Yu C.-N. J., Joachims T., Elber R., and Pillardy J., J. Comput. Biol. 15(7), 867 (2008). doi: 10.1089/cmb.2007.0152
  24. Tobi D. and Bahar I., Proteins: Struct., Funct., Bioinf. 62(4), 970 (2006). doi: 10.1002/prot.20859
  25. Comeau S. R., Kozakov D., Brenke R., Shen Y., Beglov D., and Vajda S., Proteins: Struct., Funct., Bioinf. 69(4), 781 (2007). doi: 10.1002/prot.21795
  26. Katchalski-Katzir E., Shariv I., Eisenstein M., Friesem A. A., Aflalo C., and Vakser I. A., Proc. Natl. Acad. Sci. U.S.A. 89(6), 2195 (1992). doi: 10.1073/pnas.89.6.2195
  27. Gabb H. A., Jackson R. M., and Sternberg M. J. E., J. Mol. Biol. 272(1), 106 (1997). doi: 10.1006/jmbi.1997.1203
  28. Vakser I. A. and Aflalo C., Proteins: Struct., Funct., Genet. 20(4), 320 (1994). doi: 10.1002/prot.340200405
  29. Mandell J. G., Roberts V. A., Pique M. E., Kotlovyi V., Mitchell J. C., Nelson E., Tsigelny I., and Ten Eyck L. F., Protein Eng. 14(2), 105 (2001). doi: 10.1093/protein/14.2.105
  30. Jorgensen W. L. and Tirado-Rives J., J. Am. Chem. Soc. 110(6), 1657 (1988). doi: 10.1021/ja00214a001
  31. Lu H., Lu L., and Skolnick J., Biophys. J. 84, 1895 (2003). doi: 10.1016/S0006-3495(03)74997-2
  32. Pierce B. and Weng Z., Proteins: Struct., Funct., Bioinf. 67(4), 1078 (2007). doi: 10.1002/prot.21373
  33. Comeau S. R., Gatchell D. W., Vajda S., and Camacho C. J., Bioinformatics 20(1), 45 (2004). doi: 10.1093/bioinformatics/btg371
  34. Tovchigrechko A. and Vakser I. A., Nucleic Acids Res. 34(suppl. 2), W310 (2006). doi: 10.1093/nar/gkl206
  35. Zhang C., Vasmatzis G., Cornette J. L., and DeLisi C., J. Mol. Biol. 267(3), 707 (1997). doi: 10.1006/jmbi.1996.0859
  36. Kozakov D., Brenke R., Comeau S. R., and Vajda S., Proteins: Struct., Funct., Bioinf. 65(2), 392 (2006).
  37. Chuang G.-Y., Kozakov D., Brenke R., Comeau S. R., and Vajda S., Biophys. J. 95(9), 4217 (2008). doi: 10.1529/biophysj.108.135814
  38. Thomas P. D. and Dill K. A., Proc. Natl. Acad. Sci. U.S.A. 93(21), 11628 (1996). doi: 10.1073/pnas.93.21.11628
  39. Huang S. Y. and Zou X., Proteins: Struct., Funct., Bioinf. 72(2), 557 (2008). doi: 10.1002/prot.21949
  40. Hoffgen K. U., Simon H. U., and Vanhorn K. S., J. Comput. Syst. Sci. 50(1), 114 (1995). doi: 10.1006/jcss.1995.1011
  41. Andrusier N., Nussinov R., and Wolfson H. J., Proteins: Struct., Funct., Bioinf. 69(1), 139 (2007). doi: 10.1002/prot.21495
  42. Trefethen L. N. and Bau D. III, Numerical Linear Algebra (SIAM, Philadelphia, 1997).
  43. Gertz E. M. and Wright S. J., ACM Trans. Math. Softw. 29(1), 58 (2003). doi: 10.1145/641876.641880
  44. Zhang Y. and Skolnick J., Nucleic Acids Res. 33(7), 2302 (2005). doi: 10.1093/nar/gki524
  45. Mintseris J., Wiehe K., Pierce B., Anderson R., Chen R., Janin J., and Weng Z., Proteins: Struct., Funct., Bioinf. 60(2), 214 (2005). doi: 10.1002/prot.20560
  46. Miyazawa S. and Jernigan R. L., Proteins: Struct., Funct., Bioinf. 34(1), 49 (1999).
  47. Maiorov V. N. and Crippen G. M., J. Mol. Biol. 227(3), 876 (1992). doi: 10.1016/0022-2836(92)90228-C
  48. Vendruscolo M. and Domany E., J. Chem. Phys. 109(24), 11101 (1998). doi: 10.1063/1.477748
  49. Taylor M. S., Fung H. K., Rajgaria R., Filizola M., Weinstein H., and Floudas C. A., Biophys. J. 94(7), 2470 (2008). doi: 10.1529/biophysj.107.117622
  50. Floudas C. A., Fung H. K., McAllister S. R., Monnigmann M., and Rajgaria R., Chem. Eng. Sci. 61(3), 966 (2006). doi: 10.1016/j.ces.2005.04.009
  51. Hu C., Li X., and Liang J., Bioinformatics 20(17), 3080 (2004). doi: 10.1093/bioinformatics/bth369
  52. Potra F. A., Math. Program. 67(1), 383 (1994). doi: 10.1007/BF01582228
  53. Gondzio J., Comput. Optim. Appl. 6(2), 137 (1996).

Articles from The Journal of Chemical Physics are provided here courtesy of American Institute of Physics
