Published in final edited form as: Proc IEEE Int Conf Data Min. 2011 Dec:1062–1067. doi: 10.1109/ICDM.2011.88

Learning Protein Folding Energy Functions

Wei Guan, Arkadas Ozakin, Alexander Gray, Jose Borreguero, Shashi Pandit, Anna Jagielska, Liliana Wroblewska, Jeffrey Skolnick

Abstract

A critical open problem in ab initio protein folding is protein energy function design: defining the energy of protein conformations in a way that makes folding most efficient and reliable. In this paper, we address this issue as a weight optimization problem and utilize a machine learning approach, learning-to-rank, to solve it. We investigate the ranking-via-classification approach, specifically the RankingSVM method, and compare it with the state-of-the-art approach to the problem, which uses the MINUIT optimization package. To maintain the physicality of the results, we impose non-negativity constraints on the weights. For this we develop two efficient non-negative support vector machine (NNSVM) methods, derived from the L2-norm SVM and the L1-norm SVM, respectively. We demonstrate an energy function that maintains the correct ordering with respect to structural dissimilarity to the native state more often, is more efficient and reliable to learn on large protein sets, and is qualitatively superior to the current state-of-the-art energy function.

Keywords: ab initio protein folding, energy function, learning-to-rank, support vector machine, non-negativity constrained SVM optimization

I. Introduction

Proteins are polymers assembled from 20 naturally occurring amino acids, which fold to unique, biologically active, three-dimensional conformations called native structures. Their biological functions are governed by their three-dimensional structures, which in turn are fully determined by their amino acid sequences. Predicting the native structure of a protein from its amino acid sequence is one of the most important and challenging scientific problems in contemporary biology and chemistry [1]. The capability to reliably make such predictions would allow biochemists to design drugs more efficiently, understand biological processes in detail, and answer fundamental questions about biological systems, diseases, immune response, and more.

The experimental determination of protein structure is a time consuming and expensive process. Hence, computational methods play an essential role in the prediction of the native structures of proteins. There are three classes of computational approaches to protein structure prediction: homology modeling, threading, and ab initio folding. Homology modeling and threading methods utilize proteins with known structure that are evolutionarily related to the target protein with unknown structure [2]. If one cannot find such proteins in the available library of experimentally resolved protein structures, the only remaining approach to predicting the native structure is ab initio folding.

Ab initio folding attempts to find the native structure of a protein “from scratch”. The fundamental assumption in ab initio folding is the existence of a free energy function that assigns an energy value to each three-dimensional structure the protein can in principle assume. The native structure is assumed to be the one with the lowest energy [3]. Thus, there are two main ingredients in ab initio folding: the design of a reliable energy function, and the development of an efficient approach to search the space of all possible conformations for the one with the lowest energy. In this paper, we focus on the first problem.

The energy functions used in ab initio folding are physics-based: for a given three-dimensional configuration of a protein, one first calculates various terms contributing to the total energy, such as electrostatic energy, covalent bonding energy, van der Waals energy, etc., and then adds these terms to obtain the total energy [4]. While these terms are based on physics, their functional forms are sometimes approximate, and the coefficients that appear are obtained by various fitting procedures. In this work, we represent the total energy of a configuration as a linear combination of these physics-based energy terms, and optimize the coefficients.

The fitness of a given energy function for a given protein can be visually inspected by plotting the total energy versus the structural dissimilarity from the native structure. In order to do this, one generates many possible conformations and computes the total energy and the dissimilarity from the native structure for each.¹ Fig. 1 shows such a plot for a desirable energy function. As can be seen, the energy value is higher for conformations that have large dissimilarities from the native structure, with a roughly monotonic trend. Due to this monotonic trend, reducing the energy corresponds to getting closer to the native structure during the ab initio folding procedure. If one can construct an energy function that has energy vs. structural dissimilarity plots like that of Fig. 1, one can hope to reproduce a similar trend for proteins with unknown structure.

Figure 1. Energy versus Structural Dissimilarity Plot: each dot represents a non-native conformation, and the red square represents the native structure.
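For illustration, a minimal, self-contained sketch of producing such a diagnostic plot is given below. The data, the weight vector, and the injected monotonic trend are synthetic stand-ins of our own, not the paper's data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins: X holds per-conformation energy terms, r the
# dissimilarities from the native structure (index 0 is the native, r = 0),
# and w a candidate non-negative weight vector.
rng = np.random.default_rng(0)
m, n = 1000, 20
X = rng.normal(size=(m, n))
r = np.concatenate(([0.0], rng.uniform(0.05, 0.6, size=m - 1)))
X += 2.0 * r[:, None]                  # inject a rough monotonic trend for the demo
w = rng.uniform(0.0, 1.0, size=n)

E = X @ w                              # total energy E_i = w^T x_i
plt.scatter(r[1:], E[1:], s=8, label="non-native conformations")
plt.scatter(r[0], E[0], c="red", marker="s", label="native structure")
plt.xlabel("structural dissimilarity from native")
plt.ylabel("total energy")
plt.legend()
plt.show()
```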

For a given protein, we represent the total energy of a conformation $s_i$ as $f(s_i) = \mathbf{w}^T\mathbf{x}_i = E_i$, where $\mathbf{x}_i \in \mathbb{R}^n$ represents the collection of the energy terms for $s_i$, and $\mathbf{w} \in \mathbb{R}^n$ is the vector of weight coefficients. Treating $\mathbf{w}$ as the unknown, the task of learning an ab initio protein folding energy function becomes a weight optimization problem. Much of the literature on this problem is based on maximizing the correlation (or related quantities) between the total energy and the dissimilarity [4],[6]. In this paper, we propose a ranking-based approach: given $m$ conformations for each protein, we search for the weights $\mathbf{w}$ such that, for each protein, a meaningful subset of the constraints below is satisfied.

  • The total energy of the native structure is the minimum, that is, $E_0 < E_j$ for all $j = 1, \ldots, m$.

  • The energies of random conformations with smaller structural dissimilarity are lower than those with larger dissimilarity, that is, if $r_j < r_i$, then $E_j < E_i$.

The paper is organized as follows. We begin in Section 2 by converting the weight optimization problem into a learning-to-rank task, and then describe RankingSVM, the ranking-via-classification method that we utilize. Due to physicality constraints, we restrict the problem to non-negative weights; Section 3 describes two efficient algorithms to solve the constrained ranking problem. Section 4 summarizes the experimental results. Section 5 concludes the study and discusses future work.

II. Approach: Weight Learning By Ranking

The problem of learning a protein energy function can be reduced to a learning-to-rank problem if we consider the ordering derived from structural dissimilarity as the true ordering over the protein conformations, and the ordering derived from the energy function as the predicted ordering. The reduced problem seeks a ranking function $f(s) = \mathbf{w}^T\mathbf{x}$ that optimally approximates the true ordering. That is, for each protein, we expect the predicted ordering to satisfy the following requirements as closely as possible:

  1. Rank the native structure above other conformations.

  2. Rank conformations with lower structural dissimilarities above those with higher dissimilarities.

Current machine learning approaches to learning-to-rank tasks can be divided into three classes: the pointwise approach [7],[8], the pairwise approach [9],[10], and the listwise approach [11],[12]. Pointwise and pairwise approaches have the advantage that existing theories and algorithms for regression and classification can be readily applied to the learning task. Moreover, pairwise approaches generally outperform pointwise approaches and have been successfully applied to various information retrieval applications [13],[14]. We therefore adopt the pairwise ranking-via-classification approach to solve our problem.

We next describe the RankingSVM method, which is the basis of our proposed methods.

A. Ranking Via Support Vector Machines

RankingSVM finds a ranking function $f(\cdot)$ that maximizes the expected Kendall $\tau$ statistic on a training dataset $S = \{S_1, \ldots, S_N\}$. Kendall's $\tau$ statistic is $\tau_{S_q}(O_q, \hat{O}_q) = \frac{\#\text{concordant} - \#\text{discordant}}{\#\text{concordant} + \#\text{discordant}}$ [15], where an object pair $(s_i, s_j)$ is called discordant if the orderings $O_q$ and $\hat{O}_q$ do not agree in how they order $s_i$ and $s_j$, and concordant otherwise. In our study, $S_q = \{(\mathbf{x}_i^q, r_i^q)\}_{i=0}^{m_q}$ contains the energy data of the 3D conformations of the $q$th protein. $O_q$ denotes the true ordering of the protein conformations derived from structural dissimilarity, and $\hat{O}_q$ denotes the ordering determined by the ranking function.
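As a concrete illustration (our own sketch on synthetic data, not the authors' code), the Kendall $\tau$ between a protein's dissimilarities and the energies predicted by a candidate weight vector can be computed directly:

```python
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))        # energy terms of 500 conformations (synthetic)
r = rng.uniform(0.0, 0.6, size=500)   # dissimilarities from the native structure
w = rng.uniform(0.0, 1.0, size=20)    # a candidate weight vector

# tau = +1 when the energy ordering matches the dissimilarity ordering exactly,
# -1 when it is exactly reversed.
tau, _ = kendalltau(r, X @ w)
print(f"Kendall tau = {tau:.3f}")
```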

For strict orderings on $m$ instances, we have $\frac{m(m-1)}{2} = \#\text{concordant} + \#\text{discordant}$. Maximizing the expected Kendall $\tau$ statistic of a linear ranking function on a data set $S$ is therefore equivalent to maximizing the pairwise agreement ($\#\text{concordant}$). This optimization problem can be formulated as a search for the weight vector $\mathbf{w}$ that maximizes the number of inequalities of the form $\text{sign}(r_i - r_j)\,\mathbf{w}^T(\mathbf{x}_i - \mathbf{x}_j) \geq 1$ that hold true. It can be approximately solved by learning an SVM classifier [16] on the transformed data set $S_{\text{diff}} = \{\mathbf{z}_{ij} = \mathbf{x}_i - \mathbf{x}_j,\; y_{ij} = \text{sign}(r_i - r_j)\}$, where $\mathbf{z}_{ij}$ is the pairwise difference vector, $y_{ij}$ is the sign of the rank difference of objects $s_i$, $s_j$, and $\xi_{ij}$ in (1) are the slack variables.

$$\min_{\mathbf{w}}\; \frac{1}{2}\mathbf{w}^T\mathbf{w} + \pi \sum_{ij} \xi_{ij} \quad \text{s.t.}\quad y_{ij}\,\mathbf{w}^T\mathbf{z}_{ij} \geq 1 - \xi_{ij},\;\; \xi_{ij} \geq 0,\;\; \forall i,j \tag{1}$$
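For illustration, the unconstrained problem (1) can be solved with an off-the-shelf linear SVM trained on the pairwise difference data. The sketch below (synthetic data, names, and the pair subsampling are our assumptions) shows the construction of $S_{\text{diff}}$ and the fit; the non-negativity constraints of Section 3 are not yet imposed here.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
m, n = 200, 20
X = rng.normal(size=(m, n))           # energy terms per conformation (synthetic)
r = rng.uniform(0.0, 0.6, size=m)     # dissimilarity of each conformation from native

# Transformed data set S_diff = {z_ij = x_i - x_j, y_ij = sign(r_i - r_j)},
# subsampled to keep the quadratic number of pairs manageable.
pairs = [(i, j) for i in range(m) for j in range(m) if i != j]
pairs = [pairs[k] for k in rng.choice(len(pairs), size=2000, replace=False)]
Z = np.array([X[i] - X[j] for i, j in pairs])
y = np.sign([r[i] - r[j] for i, j in pairs]).astype(int)

# Problem (1) is a soft-margin linear SVM on (z_ij, y_ij) with no intercept;
# LinearSVC's C plays the role of pi.
w = LinearSVC(C=1.0, fit_intercept=False).fit(Z, y).coef_.ravel()
```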

III. Method: Non-Negativity Constrained Weight Learning

A. Non-Negativity Constraints

The energy terms used in our optimization represent “costs”, in the sense that the natural physical tendency of the protein is to decrease each one of these values. Each energy term, taken separately, represents a uniquely defined physical tendency. In the case of electrostatic interactions, two positive charges move away from each other in order to lower their interaction energy. Reversing the sign of this interaction energy would turn the repulsive force into an attractive one, resulting in an unphysical interaction. If we sacrifice the physicality of the energy function by picking negative weights for some terms, it may be possible to obtain a better ranking on the collected set of conformations. Unfortunately, experience shows that such unphysical energy functions, while performing well on the chosen set of existing proteins, perform poorly when predicting new physical structures. This is partly because it is impossible to sample the whole set of possible conformations for a given protein, and the methods used to generate the conformations in the training set start from special, compact conformations that already satisfy various physicality properties. Dropping the non-negativity constraints could improve the ranking for these special conformations, but there will be very large, unsampled subsets of the set of possible conformations where the negative coefficients would result in incorrect foldings/orderings. Thus, one enforces a non-negativity constraint on the weights in order to avoid overfitting to the (small) set of sampled conformations.

We next describe two approaches to non-negative support vector machines.

B. Non-Negative L2-norm SVM

In this section, we propose a non-negative version of the SVM using an L2-norm approach, and solve it with the exponentiated gradient (EG) algorithm [17].

Due to the characteristics of our problem, we formulate the optimization in primal form. Adding the non-negativity constraints to the standard SVM formulation gives the optimization problem,

$$\min_{\mathbf{w} \geq 0}\; \frac{\nu}{2}\mathbf{w}^T\mathbf{w} + \frac{1}{l}\,\mathbf{1}_l^T(\mathbf{1}_l - DA\mathbf{w})_+ \tag{2}$$

where $(u)_+ = \max(u, 0)$ sets the negative elements of the vector $u$ to zero, $A$ denotes the data matrix with rows given by the $\mathbf{z}_{ij}$'s, $D = \text{diag}(y_1, \ldots, y_l)$ is the label matrix, $\mathbf{1}_l = [1, 1, \ldots, 1]^T$ is an $l$-dimensional vector of ones, and $l$ is the total number of data points (i.e., the total number of pairwise difference vectors in our study).

The objective function in (2) is non-differentiable, hence typical optimization methods cannot be directly applied to this problem. To address this issue, we use the L2-norm of the hinge loss variables in the objective function. This type of SVM has gained popularity in large scale classification because the resulting objective function J (w) is a piecewise quadratic and strongly convex function, and efficient algorithms such as coordinate descent can be applied. The Non-Negative L2-norm SVM (NNL2SVM) objective function is,

$$\min_{\mathbf{w} \geq 0} J(\mathbf{w}) = \frac{\nu}{2}\mathbf{w}^T\mathbf{w} + \frac{1}{2l}\,\|(\mathbf{1}_l - DA\mathbf{w})_+\|^2 \tag{3}$$

We use the exponentiated gradient (EG) algorithm [17] to solve this NNL2SVM problem because its optimization is naturally constrained to the non-negative orthant $\mathbb{R}_+^n$. The algorithm is summarized in Table I.

Table I. EG Algorithm for the NNL2SVM Problem (rendered as an image in the source manuscript; not reproduced here).

The standard normalization sets $\|\mathbf{w}_{t+1}\|_1 = 1$. We also investigate another normalization rule that enforces $\|\mathbf{w}_{t+1}\|_1 \leq \|\mathbf{w}_t\|_1$ by keeping $\mathbf{w}_{t+1}$ unchanged if $\|\mathbf{w}_{t+1}\|_1$ is less than $\|\mathbf{w}_t\|_1$, and setting its norm to $\|\mathbf{w}_t\|_1$ otherwise. In our study, we set the learning rate $\eta = 1/R$, where $R = \max_{ij}(\max_k z_{ij,k} - \min_k z_{ij,k})$ is the largest value over the sample set of the maximum difference between the components of a feature vector $\mathbf{z}_{ij}$.
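Since Table I is reproduced only as an image in the source manuscript, the following is our reconstruction of a plausible EG loop from the surrounding description, with the gradient of (3) derived by hand; the initialization, fixed iteration count, and the choice of the $\|\mathbf{w}\|_1 = 1$ normalization rule are assumptions.

```python
import numpy as np

def eg_nnl2svm(Z, y, nu=0.1, iters=5000):
    """Exponentiated-gradient loop for objective (3). A reconstruction from
    the paper's description (Table I is an image in the source), not the
    authors' exact algorithm."""
    l, n = Z.shape
    DA = y[:, None] * Z                          # rows are y_ij * z_ij
    R = (Z.max(axis=1) - Z.min(axis=1)).max()    # largest component spread
    eta = 1.0 / R                                # learning rate from the text
    w = np.full(n, 1.0 / n)                      # start inside the simplex
    for _ in range(iters):
        hinge = np.maximum(1.0 - DA @ w, 0.0)    # (1_l - DAw)_+
        grad = nu * w - (DA.T @ hinge) / l       # gradient of (3)
        w = w * np.exp(-eta * grad)              # multiplicative EG update keeps w > 0
        w /= np.abs(w).sum()                     # normalization ||w||_1 = 1
    return w
```

Because the update is multiplicative and starts from a positive vector, the iterates never leave the non-negative orthant, which is why EG suits this constrained problem.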

C. Non-Negative L1-norm SVM

Another approach to the NNSVM problem is to add non-negativity constraints to the L1-SVM formulation and extend the existing L1SVM algorithm [18] to solve the resulting NNL1SVM problem. The optimization problem is,

$$\min_{\mathbf{w}, \xi}\; \mathbf{1}_n^T\mathbf{w} + \pi\,\mathbf{1}_l^T\xi \quad \text{s.t.}\quad DA\mathbf{w} \geq \mathbf{1}_l - \xi, \quad \mathbf{w}, \xi \geq 0 \tag{4}$$

We solve this problem using the approach described in [18]. Proposition 1 in [18] states that for any $\epsilon \in (0, \bar{\epsilon}]$, for some $\bar{\epsilon} > 0$, the optimal solution of the exterior penalty problem gives an exact solution to the original primal problem. The exterior penalty problem is derived by assigning quadratic penalty terms to the constraints of the dual problem; for (4), it minimizes the following objective function,

$$J(\mu) = -\epsilon\,\mathbf{1}_l^T\mu + \frac{1}{2}\left(\|(A^TD\mu - \mathbf{1}_n)_+\|^2 + \|(\mu - \pi\mathbf{1}_l)_+\|^2 + \|(-\mu)_+\|^2\right) \tag{5}$$

Problem (5) is an unconstrained optimization problem. We solve it using a generalized Newton method.

Following the definition of the generalized Hessian in [18], the gradient and Hessian for (5) are given as,

$$\nabla J(\mu) = -\epsilon\,\mathbf{1}_l + DA\,(A^TD\mu - \mathbf{1}_n)_+ + (\mu - \pi\mathbf{1}_l)_+ - (-\mu)_+$$
$$\partial^2 J(\mu) = DA\,\text{diag}\{(A^TD\mu - \mathbf{1}_n)^*\}\,A^TD + \text{diag}\{(\mu - \pi\mathbf{1}_l)^* + (-\mu)^*\}$$

where $u^* = \text{sign}(u_+)$, with sign applied element-wise to the vector. Notice that at each Newton iteration, we need to invert the matrix $Q = \delta I_l + \partial^2 J(\mu)$. This is computationally expensive when the total number of data points $l$ is large ($l > 1000$). We address this issue using the Sherman-Morrison-Woodbury formula [19]. We decompose the Hessian matrix as $Q = F + HH^T$, where $F = \text{diag}(\rho)$ is diagonal with $\rho = \delta\mathbf{1}_l + (\mu - \pi\mathbf{1}_l)^* + (-\mu)^* > 0$, and $H = DAE$ with $E = \left(\text{diag}\{(A^TD\mu - \mathbf{1}_n)^*\}\right)^{1/2}$. The inverse can then be computed as $Q^{-1} = F^{-1} - F^{-1}H\,(I_n + H^TF^{-1}H)^{-1}H^TF^{-1}$, reducing the time complexity from $O(l^3)$ to $O(ln^2) + O(n^3)$.
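For illustration, the following sketch (our code, with hypothetical argument names) applies the Sherman-Morrison-Woodbury identity to solve a system $Qv = \text{rhs}$ without ever forming the $l \times l$ matrix $Q$. In the generalized Newton iteration above, `F_diag` would be $\rho$, `H` would be $DAE$, and `rhs` the gradient $\nabla J(\mu)$.

```python
import numpy as np

def smw_solve(F_diag, H, rhs):
    """Solve Q v = rhs for Q = diag(F_diag) + H H^T via the
    Sherman-Morrison-Woodbury identity. With H of shape (l, n),
    the cost is O(l n^2) + O(n^3) instead of O(l^3)."""
    Finv_rhs = rhs / F_diag                          # F^{-1} rhs
    Finv_H = H / F_diag[:, None]                     # F^{-1} H
    small = np.eye(H.shape[1]) + H.T @ Finv_H        # I_n + H^T F^{-1} H  (n x n)
    correction = Finv_H @ np.linalg.solve(small, H.T @ Finv_rhs)
    return Finv_rhs - correction                     # Q^{-1} rhs
```

The only dense linear solve involves an $n \times n$ system, which is cheap here since $n = 20$ energy terms while $l$ can exceed tens of thousands of pairs.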

IV. Results and Discussion

A. Data Set Description

The dataset used in this study consists of the values of various energy terms for a non-redundant set of 171 proteins that fall into the ab initio folding class. This set is representative of the “hard target” protein sequences in the Protein Data Bank with up to 200 residues: sequences for which current homology search tools fail to identify evolutionarily related proteins of known structure.

For each protein, a large set of non-native random conformations (over 50,000 per protein) is generated in the manner described in [4]. The energy terms for the native structure and each one of the generated conformations are collected. The energy terms are obtained from the CABS (Cα-Cβ-Side chain) force field [4], which is used in the protein structure prediction tool TASSER [20]. We include 20 different energy terms from this force field, briefly summarized in Table III. The structural dissimilarity of conformations is measured by 1 − (TM-score) [21]; the TM-score is intended as a more accurate similarity measure than the commonly used RMSD [5].

Table III. Energy Terms used in TASSER (rendered as an image in the source manuscript; not reproduced here).

B. Previous Approach

In an earlier optimization study [4], the authors proposed to use an objective function related to the correlation $\text{corr}(r^{(q)}, E^{(q)})$ between the structural dissimilarity and the total energy of the generated conformations. Namely, they used the product of two quantities $G_1$ and $G_3$, given by,

$$G_1 = \frac{1}{1 + \frac{1}{N}\sum_{q=1}^{N}\text{corr}(r^{(q)}, E^{(q)})}, \qquad G_3 = \frac{1}{1 + \frac{1}{N}\sum_{q=1}^{N} Z_n(E^{(q)})}$$

where the Z-score of the mean of the total energy is $Z_n(E^{(q)}) = \frac{\langle E^{(q)} \rangle - E_0^{(q)}}{\sqrt{\langle E^{(q)2} \rangle - \langle E^{(q)} \rangle^2}}$.
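For concreteness, a transcription of this baseline objective into code might look as follows (our sketch; the per-protein array layout is an assumption):

```python
import numpy as np

def baseline_objective(r_list, E_list, E0_list):
    """G1 * G3 from [4] (our transcription). r_list[q] and E_list[q] hold the
    dissimilarities and total energies of protein q's decoy conformations,
    and E0_list[q] is the native energy. Minimizing this product rewards both
    high energy-dissimilarity correlation and a large native-state Z-score."""
    corrs = [np.corrcoef(r, E)[0, 1] for r, E in zip(r_list, E_list)]
    zs = [(E.mean() - E0) / E.std() for E, E0 in zip(E_list, E0_list)]
    G1 = 1.0 / (1.0 + np.mean(corrs))
    G3 = 1.0 / (1.0 + np.mean(zs))
    return G1 * G3
```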

Using the CERN MINUIT package [22] to optimize the weights, they achieved significant results in CASP [20]. Their study employed proteins from all three classes: homology modeling, threading, and ab initio prediction.

C. Experiment Design

The number of all pairwise difference vectors $\mathbf{z}_{ij} = \mathbf{x}_i - \mathbf{x}_j$ is quadratic in the number of data points (conformations). In addition to this computational issue, it is not realistic to expect the energy function to rank all conformations according to their dissimilarity from the native structure. Therefore, in our experiments, we use the following sampling scheme to generate the training data set.

  • For the first class, $C_1 = \{\mathbf{z}_{i0} = \mathbf{x}_i - \mathbf{x}_0 \mid y_{i0} = \text{sign}(r_i - r_0) = 1\}$, we uniformly sample 100 non-native conformations according to their structural dissimilarity and include their comparisons with the native structure.

  • For the second class, $C_2 = \{\mathbf{z}_{jk} = \mathbf{x}_j - \mathbf{x}_k \mid y_{jk} = \text{sign}(r_j - r_k) = -1\}$, we generate pairs of comparisons between non-native conformations. If two conformations have close values of dissimilarity from the native structure, it may not be reasonable to require the energy function to rank them according to the dissimilarity. We thus restrict the second class to pairs whose dissimilarities from the native structure are sufficiently different. In particular, we first partition the set of non-native conformations into 6 subsets, $S_{(0,0.1)}, S_{[0.1,0.2)}, \ldots, S_{[0.4,0.5)}, S_{[0.5,0.6]}$, where $S_{(0,0.1)}$ contains conformations with dissimilarity from the native structure in the range $(0, 0.1)$, etc. We then uniformly sample 25 conformations $\{s_i^{(j)}\}_{i=1}^{25}$ according to dissimilarity from each subset $S_{[a_j, b_j)}$. The comparisons we include are then $(s_1^{(1)}, s_1^{(3)}), (s_2^{(1)}, s_2^{(3)}), \ldots, (s_{25}^{(1)}, s_{25}^{(3)}), (s_1^{(2)}, s_1^{(4)}), \ldots, (s_{25}^{(2)}, s_{25}^{(4)}), \ldots, (s_{25}^{(4)}, s_{25}^{(6)})$; that is, each sampled conformation in subset $j$ is compared with one in subset $j+2$.

Using the sampling method described above, we generate 100 data points in each class for each protein. This gives a total of 34,200 data points of dimension 20.
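A sketch of this sampling scheme is given below (our reconstruction; the index conventions, the assumption of at least 25 decoys per bin, and the uniform within-bin sampling are our own choices):

```python
import numpy as np

def sample_training_pairs(X, r, rng):
    """Sketch of the sampling scheme above. Index 0 is the native structure
    (r[0] = 0); assumes each dissimilarity bin holds at least 25 decoys."""
    pairs = []
    # Class C1: 100 native-vs-decoy comparisons, spread evenly over the
    # sorted dissimilarities to approximate uniform sampling in r.
    decoys = np.argsort(r[1:]) + 1
    picks = decoys[np.linspace(0, len(decoys) - 1, 100, dtype=int)]
    pairs += [(i, 0, +1) for i in picks]
    # Class C2: decoy-vs-decoy comparisons between dissimilarity bins that
    # are two bins apart, 25 sampled conformations per bin.
    bins = [np.where((r >= lo) & (r < lo + 0.1) & (np.arange(len(r)) > 0))[0]
            for lo in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5)]
    for j in range(4):                               # bin j vs bin j+2, 25 pairs each
        lo_idx = rng.choice(bins[j], size=25, replace=False)
        hi_idx = rng.choice(bins[j + 2], size=25, replace=False)
        pairs += [(a, b, -1) for a, b in zip(lo_idx, hi_idx)]
    Z = np.array([X[a] - X[b] for a, b, _ in pairs])  # pairwise differences z
    y = np.array([p[2] for p in pairs])               # labels y
    return Z, y
```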

D. Results Analysis

We evaluate the capability of our Non-Negative RankingSVM (NN-RankingSVM) approach to learn protein energy functions through 10-fold cross validation. We randomly partition the 171 proteins into 10 folds. For each fold i, we learn an energy function from the energy data of the other 9 folds using each method, and evaluate the learned energy functions on the data of fold i. We employ a grid search during cross-validation to tune the parameters of the NNSVM methods. We denote ranking via NNL2SVM with the normalization rule $\|\mathbf{w}_{t+1}\|_1 = 1$ as NNL2SVMn1, ranking via NNL2SVM with the normalization rule $\|\mathbf{w}_{t+1}\|_1 \leq \|\mathbf{w}_t\|_1$ as NNL2SVMn2, and ranking via the NNL1SVM approach as NNL1SVM. The baseline method (see Section 4.B) is denoted as TASSERMINUIT (with randomly generated initial weights).
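For illustration, the protein-level split can be sketched as follows (our reconstruction; the commented training step is a placeholder, not the authors' pipeline). Splitting whole proteins, rather than individual pairs, keeps all pairwise data from one protein on a single side of each split.

```python
import numpy as np
from sklearn.model_selection import KFold

protein_ids = np.arange(171)
kf = KFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(protein_ids):
    train_proteins, test_proteins = protein_ids[train_idx], protein_ids[test_idx]
    # w = eg_nnl2svm(Z_train, y_train)   # placeholder: fit on the training folds,
    #                                    # then score agreement/correlation on test
    print(f"train on {len(train_proteins)} proteins, test on {len(test_proteins)}")
```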

1) NN-RankingSVM versus TASSERMINUIT: We first compare the protein folding energy functions learned by our proposed NN-RankingSVM approach with those learned by the baseline method. We use two criteria to evaluate the fitness of the learned energy functions: the Kendall $\tau$ rank statistic (approximated by sampled pairwise agreement) and Pearson's correlation. Fig. 2(a) shows the sampled pairwise agreement, measured by the average testing accuracy on the labeled pairwise difference data. Fig. 2(b) shows the average correlation coefficients between the (1 − TM-score) values and the energies computed with the learned energy function during each cross-validation fold.

Figure 2. Error Plot of the Performance of the Learned Energy Functions.

Compared to the energy functions learned by the baseline method, the energy functions learned using the NN-RankingSVM approach generally achieve better performance in both sampled pairwise agreement and correlation. On average, energy functions learned using the NNL2SVM methods obtain around a 9% increase in sampled pairwise agreement and around a 34% increase in correlation values, while those output by the NNL1SVM method show around 10% and 28% improvements on those values, respectively. In addition, the average computation time of a 10-fold cross validation for NNL1SVM is about 7 seconds, which is much more efficient compared to the baseline method (around 2 minutes) and the NNL2SVM methods (around 20 minutes). In summary, the experiments show that the NNL1SVM method outperforms the other methods in terms of both learning performance and computational efficiency.

2) NNL2SVM versus NNL1SVM: We then analyze the trend of the sampled pairwise agreement (measured by testing accuracy) and the sparsity of the weight vector during the optimization of the proposed NNSVMs. As shown in Fig. 3, the NNL2SVM methods generally converge to the optimal solutions after about 5000 EG iterations. The NNL2SVMn1 method obtains sparser solutions than the NNL2SVMn2 method on average, but at the cost of classification accuracy. Overall, the NNL1SVM method demonstrates robust performance in classification accuracy and sparsity while enjoying fast convergence.

Figure 3. NNSVM Optimization Method Comparison.

V. Conclusion

A critical open problem in ab initio protein folding is protein energy function design. In this paper, we addressed this problem as a weight optimization problem, and demonstrated a machine learning approach using the ranking-via-classification paradigm. Compared with the state-of-the-art approach based on maximizing the correlation between the total energy and the structural dissimilarity, our learning-to-rank approach was able to learn energy functions that maintain the correct ordering of the conformations more often, and give higher correlations with the structural dissimilarity from the native structure. We believe that this approach to learning protein energy functions opens a promising new avenue of exploration. We will investigate generalizations of our methods to the problem of learning nonlinear energy functions. We also expect the capability to learn SVMs with non-negative weights to have diverse applications beyond protein structure prediction.

Table II. Newton Method for the NNL1SVM Problem (rendered as an image in the source manuscript; not reproduced here).

Footnotes

1. There are various notions of structural similarity used in the literature, the most basic one being the root mean squared distance (RMSD) [5] between the building blocks (e.g., atoms) of the protein as represented in two candidate structures aligned in three-dimensional space.

References

  • [1] Baker D, Sali A. Protein structure prediction and structural genomics. Science. 2001;294(5540):93. doi: 10.1126/science.1065659.
  • [2] Skolnick J, Kihara D, Zhang Y. Development and testing of the PROSPECTOR 3.0 threading algorithm. Proteins. 2004;56(3):502–518. doi: 10.1002/prot.20106.
  • [3] Anfinsen CB. Principles that govern the folding of protein chains. Science. 1973;181(4096):223–230. doi: 10.1126/science.181.4096.223.
  • [4] Zhang Y, Kolinski A, Skolnick J. Touchstone II: A new approach to ab initio protein structure prediction. Biophysical Journal. 2003;85:1145–1164. doi: 10.1016/S0006-3495(03)74551-2.
  • [5] Kabsch W. A solution for the best rotation to relate two sets of vectors. Acta Crystallographica Section A. 1976;32(5):922–923.
  • [6] Kuhlman B, Baker D. Native protein sequences are close to optimal for their structures. Proceedings of the National Academy of Sciences. 2000;97(19):10383. doi: 10.1073/pnas.97.19.10383.
  • [7] Caruana R, Baluja S, Mitchell T. Using the future to sort out the present: Rankprop and multitask learning for medical risk analysis. NIPS'95:959–965.
  • [8] Crammer K, Singer Y. Pranking with ranking. NIPS'02:641–648.
  • [9] Herbrich R, Obermayer K, Graepel T. Large margin rank boundaries for ordinal regression. In: Advances in Large Margin Classifiers. MIT Press; 2000. pp. 115–132.
  • [10] Freund Y, Iyer RD, Schapire RE, Singer Y. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research. 2003;4:933–969.
  • [11] Cao Z, Qin T, Liu T, Tsai M, Li H. Learning to rank: from pairwise approach to listwise approach. Proceedings of the 24th ICML Conference; 2007. pp. 129–136.
  • [12] Yue Y, Finley T, Radlinski F, Joachims T. A support vector method for optimizing average precision. Proceedings of the 30th ACM SIGIR Conference; 2007. pp. 271–278.
  • [13] Iyer R, Lewis D, Schapire R, Singer Y, Singhal A. Boosting for document routing. Proceedings of the 9th CIKM Conference; 2000. pp. 70–77.
  • [14] Joachims T. Optimizing search engines using clickthrough data. Proceedings of the 8th ACM SIGKDD Conference; 2002. pp. 132–142.
  • [15] Kendall M. A new measure of rank correlation. Biometrika. 1938;30(1-2):81–93.
  • [16] Vapnik V. The Nature of Statistical Learning Theory. Springer; 1995.
  • [17] Kivinen J, Warmuth M. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation. 1997;132(1):1–63.
  • [18] Mangasarian O. Exact 1-norm support vector machines via unconstrained convex differentiable minimization. Journal of Machine Learning Research. 2006;7:1517–1530.
  • [19] Golub G, Van Loan C. Matrix Computations. Johns Hopkins University Press; 1996.
  • [20] Zhang Y, Skolnick J. TASSER: An automated method for the prediction of protein tertiary structures in CASP6. Proteins. 2005;61(S7):91–98. doi: 10.1002/prot.20724.
  • [21] Zhang Y, Skolnick J. Scoring function for automated assessment of protein structure template quality. Proteins. 2004;57(4):702–710. doi: 10.1002/prot.20264.
  • [22] MINUIT. http://seal.web.cern.ch/seal/snapshot/work-packages/mathlibs/minuit/
