Journal of Zhejiang University. Science. B
2005 Dec 21;7(1):7–12. doi: 10.1631/jzus.2006.B0007

Heuristic algorithm for off-lattice protein folding problem*

Mao Chen 1, Wen-qi Huang 1
PMCID: PMC1361753  PMID: 16365919

Abstract

Inspired by the laws of interaction among objects in the physical world, we propose a heuristic algorithm for solving the three-dimensional (3D) off-lattice protein folding problem. Based on a physical model, the problem is converted from a nonlinear constraint-satisfaction problem into an unconstrained optimization problem that can be solved by the well-known gradient method. To improve the efficiency of the algorithm, a strategy is introduced for generating promising initial configurations. Computational results show that the algorithm finds states with lower energy than the putative ground states previously obtained by the nPERM algorithm for all four test chains, whose lengths range from 13 to 55.

Keywords: Protein folding, AB off-lattice model, Gradient method

INTRODUCTION

The protein folding problem, or protein structure prediction problem, is one of the central problems in bioinformatics. Studies have indicated that the biological functions of proteins are determined by their three-dimensional folded structures (Anfinsen, 1973). Since the structure of a protein is strongly correlated with its sequence of amino acid residues, predicting the native state of a protein from its sequence by theoretical computation is a feasible approach and is of great significance for protein engineering (Lau and Dill, 1989).

Since the problem is too difficult to be approached with fully realistic potentials, the theoretical science community has introduced and examined several highly simplified models. One of these is the HP lattice model of Dill (1985), in which each amino acid is treated as a point particle on a regular (square or cubic) lattice and only two types of amino acids, hydrophobic (H) and hydrophilic (P), are considered. Each pair of non-bonded hydrophobic monomers (H-H) on neighboring lattice sites contributes an energy of −1; all other pairs contribute 0.

Although it is the simplest and most popular model, the HP model considers only the interactions between neighboring non-bonded H monomers. It neglects the other nonlocal effects caused by P-P, H-P and non-neighboring H-H pairs, which also exert a significant statistical influence on the conformation of the monomers in the properly folded state.

To illustrate the influence of nonlocal effects on protein folding, Stillinger (1995) proposed a more realistic simplified model, the AB off-lattice model, which also uses only two types of monomers, now called “A” (hydrophobic) and “B” (hydrophilic). The distances between consecutive monomers along the chain are held fixed at 1, while nonconsecutive monomers interact through a modified Lennard-Jones potential. In addition, each bond angle $\theta_i$ between successive bonds contributes a bending energy. Hence, the total energy function $U_1$ for a chain of $n$ monomers is expressed as

$$U_1 = \sum_{i=2}^{n-1} V_1(\theta_i) + \sum_{i=1}^{n-2} \sum_{j=i+2}^{n} V_2(r_{ij}, \zeta_i, \zeta_j), \tag{1}$$

where

$$V_1(\theta_i) = \frac{1}{4}\,(1 - \cos\theta_i), \tag{2}$$

$$V_2(r_{ij}, \zeta_i, \zeta_j) = 4\left[r_{ij}^{-12} - C(\zeta_i, \zeta_j)\, r_{ij}^{-6}\right]. \tag{3}$$

Here $r_{ij}$ is the distance between monomers $i$ and $j$ (with $i<j$). Each $\zeta_i$ is either A or B, and $C(\zeta_i, \zeta_j)$ is +1, +1/2 and −1/2 for AA, BB and AB pairs, respectively. This produces strong attraction between AA pairs, weak attraction between BB pairs, and weak repulsion between AB pairs, roughly analogous to the situation in real proteins.
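As a concrete reading of the energy function, the following Python sketch evaluates $U_1$ for a given configuration. It assumes the standard Stillinger form of the model, $V_1(\theta) = \frac{1}{4}(1 - \cos\theta)$ and $V_2 = 4[r^{-12} - C\,r^{-6}]$, with $\theta_i$ measured so that a straight chain has $\theta_i = 0$. The names `coupling` and `ab_energy` are illustrative and do not come from the paper.

```python
import math

def coupling(a, b):
    """C(zeta_i, zeta_j): +1 for AA, +1/2 for BB, -1/2 for AB pairs."""
    if a == b:
        return 1.0 if a == "A" else 0.5
    return -0.5

def ab_energy(coords, seq):
    """Energy U_1 of a chain; coords is a list of (x, y, z), seq a string of 'A'/'B'."""
    n = len(seq)
    # Bond vectors between consecutive monomers.
    bonds = [tuple(coords[i + 1][k] - coords[i][k] for k in range(3))
             for i in range(n - 1)]
    energy = 0.0
    # Bending term: (1/4)(1 - cos(theta)) for each pair of successive bonds.
    for b1, b2 in zip(bonds, bonds[1:]):
        dot = sum(p * q for p, q in zip(b1, b2))
        norm = math.sqrt(sum(p * p for p in b1)) * math.sqrt(sum(q * q for q in b2))
        energy += 0.25 * (1.0 - dot / norm)
    # Modified Lennard-Jones term over nonconsecutive pairs (j >= i + 2).
    for i in range(n - 2):
        for j in range(i + 2, n):
            r2 = sum((coords[i][k] - coords[j][k]) ** 2 for k in range(3))
            inv6 = 1.0 / r2 ** 3
            energy += 4.0 * (inv6 * inv6 - coupling(seq[i], seq[j]) * inv6)
    return energy
```

For a straight three-monomer chain AAA with unit bonds, the bending term vanishes and the only contribution comes from the end pair at distance 2, so `ab_energy([(0, 0, 0), (1, 0, 0), (2, 0, 0)], "AAA")` evaluates $4(2^{-12} - 2^{-6}) = -63/1024 \approx -0.0615$.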

Even in this highly simplified model, it is not easy to predict the native state. The protein folding problem has been recognized to be NP-complete (Crescenzi et al., 1998), which means that no polynomial-time algorithm for it is known and none exists unless P = NP. Consequently, various heuristic schemes have been proposed for approaching this problem.

For the two-dimensional (2D) version of the model, neural networks (Stillinger, 1995), Monte Carlo methods (Irback et al., 1997) and biologically motivated methods (Torcini et al., 2001) have been used to find the native state. An improved pruned-enriched Rosenbluth method with importance sampling, nPERM, was proposed by Hsu et al. (2003); it found states with lower energy than the previously proposed putative ground states for all four Fibonacci sequences with chain lengths ≥ 13. Without modifying the energy function, Hsu et al. (2003) extended the 2D AB model to 3D and presented putative lowest energy states for the four sequences. Although the lowest energy configuration has a single hydrophobic core for the short sequence of length 13, the longer sequences, with lengths from 21 to 55, do not fold into configurations with a single hydrophobic core. Recently, better 3D results for the four sequences were achieved by the energy landscape paving (ELP) minimizer (Bachmann et al., 2005) and the conformational space annealing (CSA) method (Kim et al., 2005).

In this paper, we propose a quite different class of heuristic algorithm for predicting the native structure in the 3D AB off-lattice model. The proposed algorithm combines the well-known gradient method with a novel strategy for generating promising initial configurations in order to find the globally optimal state. The experimental results show that, compared with nPERM, our algorithm finds lower energy states and that each of the four resulting configurations has a single hydrophobic core.

PROPOSED ALGORITHM

Mathematical formulation

Consider the problem in 3D Euclidean space. Regard an amino acid sequence as a chain of black balls (A) and white balls (B) with radius $R=0.5$, numbered from 1 to $n$. Denote the coordinates of the center of the $i$th ball ($i=1, 2, \ldots, n$) by $(x_i, y_i, z_i)$. At any moment, the full set of center coordinates of the $n$ balls, $x_1, y_1, z_1, \ldots, x_n, y_n, z_n$, is called a configuration.

Now, the protein folding problem can be described as the following mathematical model:

$$\min\, U_1 \tag{4}$$

subject to

$$\sqrt{(x_i - x_{i+1})^2 + (y_i - y_{i+1})^2 + (z_i - z_{i+1})^2} = 1, \quad i = 1, 2, \ldots, n-1 \tag{5}$$

graphic file with name M5.gif (6)

In this model, there are 3n continuous deterministic variables and n−1 constraints where n is the number of balls. Constraint Eq.(5) ensures that the distances between the centers of two consecutive balls along the chain are equal to 1. A configuration that satisfies constraint Eq.(5) is termed a legal configuration.
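The notion of a legal configuration can be checked numerically. The sketch below is one possible way to do so; the tolerance `tol` and the function names are assumptions introduced here for illustration, since an exact equality test is not meaningful in floating-point arithmetic.

```python
import math

def bond_lengths(coords):
    """Distances between the centers of consecutive balls along the chain."""
    return [math.dist(coords[i], coords[i + 1]) for i in range(len(coords) - 1)]

def is_legal(coords, tol=1e-6):
    """True if every consecutive distance equals 1 to within tol (hypothetical tolerance)."""
    return all(abs(length - 1.0) <= tol for length in bond_lengths(coords))
```

For example, `is_legal([(0, 0, 0), (1, 0, 0), (1, 1, 0)])` holds, whereas a chain whose first bond has length 1.1 is rejected.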

Eqs.(4)~(6) form a nonlinear constraint-satisfaction problem, which is the mathematical model of our protein folding problem. This kind of problem is rather difficult to solve directly because of the lack of smoothness in the solution space. Therefore, a scheme is proposed below to convert it into an unconstrained optimization problem that is smooth in the solution space.

New mathematical description

Instead of fixing the distances between two successive balls, we imagine the centers of two consecutive balls i and i+1 (i=1, 2, …, n−1) are connected by a fictitious spring with natural length held to be 1. Springs have the tendency to return to their natural length after being compressed or stretched. So springs can be used to relax the requirement on the solvability of the original constraint-satisfied problem.

Under any configuration, the length of the spring connecting the centers of two consecutive balls $i$ and $i+1$ along the chain is

$$l_{i,i+1} = \sqrt{(x_i - x_{i+1})^2 + (y_i - y_{i+1})^2 + (z_i - z_{i+1})^2} \tag{7}$$

If $l_{i,i+1} > 1$, the spring is extended; if $l_{i,i+1} < 1$, the spring is compressed. According to Hooke's law, the elastic potential energy of a spring is

$$u_{i,i+1} = \frac{1}{2} K_s\, (l_{i,i+1} - 1)^2 \tag{8}$$

Here, $K_s$ is the spring coefficient, $K_s > 0$. Then the total spring potential energy of the whole configuration is

$$U_s = \sum_{i=1}^{n-1} u_{i,i+1} \tag{9}$$

Now, the total potential energy function of the whole configuration consists of three types of contributions: bond angle, Lennard-Jones and spring. The new energy function can be written as:

$$U = K_1 U_1 + U_s \tag{10}$$

Here $K_1$ is a proportional coefficient, whose use will be discussed later. It can be seen from Eqs.(1)~(3) and Eqs.(7)~(10) that the potential energy $U$ of the whole configuration is a known function of the coordinates $x_1, y_1, z_1, \ldots, x_n, y_n, z_n$ of the centers of all the balls:

$$U = U(x_1, y_1, z_1, \ldots, x_n, y_n, z_n) \tag{11}$$

$U(x_1, y_1, z_1, \ldots, x_n, y_n, z_n)$ is defined on the $3n$-dimensional Euclidean space $(-\infty, +\infty)^{3n}$ and is continuous and differentiable at every configuration in which no two ball centers coincide. Based on this new energy function, the protein folding problem is converted into the problem of minimizing the total potential energy. The aim is to find a configuration with minimum energy:

$$\min_{(x_1, y_1, z_1, \ldots, x_n, y_n, z_n) \in \mathbb{R}^{3n}} U(x_1, y_1, z_1, \ldots, x_n, y_n, z_n) \tag{12}$$

This is an unconstrained optimization problem, for which a ready-made solution algorithm exists: the gradient method, or steepest descent method (Wang et al., 2002).

Eq.(8) and Eq.(9) show that the spring potential energy is non-negative. According to Eq.(8), if the coefficient Ks is set to be large enough, a spring with length differing slightly from the natural length 1 can considerably increase the whole energy of the configuration. Accordingly, if a configuration is not a legal one, that is, there are some springs compressed or stretched, the total energy of the configuration will not be very low. Therefore, we can see that the total elastic energy of the springs acts as a penalty function of the degree of departure of a configuration from a legal one, thus ensuring that the resulting configuration is legal.
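To give a feel for the size of this penalty, the sketch below evaluates the total spring energy of a configuration. It assumes the usual Hooke's-law form $\frac{1}{2}K_s(l-1)^2$ per spring; the function name `spring_energy` and the example deviation of 0.001 are illustrative choices and are not taken from the paper.

```python
import math

def spring_energy(coords, ks):
    """Total elastic energy of the springs joining consecutive ball centers,
    assuming each spring stores 0.5 * ks * (length - 1)^2."""
    return sum(0.5 * ks * (math.dist(coords[i], coords[i + 1]) - 1.0) ** 2
               for i in range(len(coords) - 1))

# A single bond stretched by 0.1% with a stiff spring coefficient.
penalty = spring_energy([(0.0, 0.0, 0.0), (1.001, 0.0, 0.0)], ks=1e7)
```

Under these assumptions `penalty` is about 5, which is comparable in magnitude to the whole energy of the 13-monomer chain reported in Table 1 (about −4.97). A bond-length error of only 0.1% is therefore heavily penalized once $K_s$ is of order $10^7$, while a legal configuration incurs no penalty at all.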

Gradient method

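Randomly choose $3n$ real numbers as the coordinates of the initial configuration. Calculate $\operatorname{grad}U$ at this configuration:

$$\operatorname{grad}U = \left(\frac{\partial U}{\partial x_1}, \frac{\partial U}{\partial y_1}, \frac{\partial U}{\partial z_1}, \ldots, \frac{\partial U}{\partial x_n}, \frac{\partial U}{\partial y_n}, \frac{\partial U}{\partial z_n}\right) \tag{13}$$

where

graphic file with name M16.gif (14)

Then a new configuration can be calculated following the gradient method:

$$x_i^{(t+1)} = x_i^{(t)} - \varepsilon \frac{\partial U}{\partial x_i}, \quad y_i^{(t+1)} = y_i^{(t)} - \varepsilon \frac{\partial U}{\partial y_i}, \quad z_i^{(t+1)} = z_i^{(t)} - \varepsilon \frac{\partial U}{\partial z_i}, \quad i = 1, 2, \ldots, n \tag{15}$$

where the partial derivatives $\partial U/\partial x_i$, $\partial U/\partial y_i$, $\partial U/\partial z_i$, $i=1, 2, \ldots, n$, are evaluated at the configuration of step $t$ in the $3n$-dimensional space, and $\varepsilon$ is the step size, a small positive real number. We let $\varepsilon = 10^{-6}$ in our procedure. Using vector representation, with $X^{(t)}$ denoting the configuration $(x_1^{(t)}, y_1^{(t)}, z_1^{(t)}, \ldots, x_n^{(t)}, y_n^{(t)}, z_n^{(t)})$ at step $t$,

$$X^{(t+1)} = X^{(t)} - \varepsilon \operatorname{grad}U\left(X^{(t)}\right) \tag{16}$$

After a move of length $\varepsilon|\operatorname{grad}U|$ in the direction opposite to the gradient, the configuration $X^{(t)}$ becomes $X^{(t+1)}$. The physical meaning of the negative gradient, $-\operatorname{grad}U$, is the generalized force in the system: $(-\partial U/\partial x_i, -\partial U/\partial y_i, -\partial U/\partial z_i)$ represents the magnitude and direction of the total force exerted on the $i$th ball. The evolution of $(x_1, y_1, z_1, \ldots, x_n, y_n, z_n)$ under the gradient method is thus a series of movements of the $n$ balls towards a legal configuration with minimum energy.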
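A single update of the gradient method can be sketched as follows. This is only an illustration: instead of the analytic partial derivatives used in the paper, it approximates the gradient by central finite differences, and the names `numerical_gradient`, `gradient_step` and the difference width `h` are assumptions introduced here.

```python
def numerical_gradient(energy, x, h=1e-6):
    """Central-difference approximation of grad U at the flat coordinate list x."""
    grad = []
    for k in range(len(x)):
        up = x[:k] + [x[k] + h] + x[k + 1:]
        down = x[:k] + [x[k] - h] + x[k + 1:]
        grad.append((energy(up) - energy(down)) / (2.0 * h))
    return grad

def gradient_step(energy, x, eps):
    """One steepest-descent update: move each coordinate by -eps * dU/dx_k."""
    grad = numerical_gradient(energy, x)
    return [xi - eps * gi for xi, gi in zip(x, grad)]
```

Here `x` is a Python list holding the $3n$ coordinates in the order $x_1, y_1, z_1, \ldots, x_n, y_n, z_n$, and `energy` is any function of such a list. For the toy energy $U(x) = \sum_k x_k^2$, one step with `eps=0.1` from `[1.0, -2.0, 0.5]` moves the point to approximately `[0.8, -1.6, 0.4]`, that is, towards the minimum at the origin.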

To adjust the proportion of $U_1$ in the total energy, we multiply $U_1$ by a proportional coefficient $K_1$. In the initial phase of the iteration process, we let $K_1$ be much larger than $K_s$ so that $U_1$ dominates the evolution of the configuration towards low energy states. As pointed out earlier, to ensure that the resulting configuration is a legal one, the coefficient $K_s$ should be large enough that a small deformation of a spring away from its natural length causes a considerable increase in the total energy. We therefore increase $K_s$ and decrease $K_1$ gradually as the iteration continues, which raises the weight of $U_s$ and drives the configuration towards a legal one. At the end of the iteration process, the configuration is a legal one with low energy.

The calculating procedure is presented as follows:

(1) Randomly give $n$ points $(x_1, y_1, z_1), \ldots, (x_n, y_n, z_n)$ in 3D Euclidean space as the initial configuration. Let $t=0$, $K_1=4001$, $K_s=1$. Choose a very small positive number, $\lambda$, as the threshold below which $|\operatorname{grad}U|$ is regarded as approximately zero.

(2) Calculate $|\operatorname{grad}U|$ under the current configuration. If $|\operatorname{grad}U| < \lambda$, go to Step (6).

(3) graphic file with name M24.gif

(4) If $K_1 > 1$, then graphic file with name M25.gif

(5) graphic file with name M26.gif and turn to Step (2).

(6) Now, the gradient is approximately zero. Calculate the energy of the resulting configuration according to Eq.(1) as the solution and then stop the computation procedure.
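The overall procedure can be sketched in Python as below. This is a minimal illustration under several assumptions, not the authors' implementation: the update rules of Steps (3)~(5) are not reproduced here, so the geometric schedule (factors 0.99 and 1.01), the cap `ks_max`, and the values `eps=1e-3` and `k1=4.0` are hypothetical choices that keep the toy example stable (the paper uses $\varepsilon = 10^{-6}$ and $K_1 = 4001$). The gradient is approximated by central differences, the spring energy is assumed to be $\frac{1}{2}K_s(l-1)^2$, and the stopping test is applied only once the schedule has finished, which is a guard added here.

```python
import math

def coupling(a, b):
    """C = +1 (AA), +1/2 (BB), -1/2 (AB)."""
    if a == b:
        return 1.0 if a == "A" else 0.5
    return -0.5

def u1(x, seq):
    """Bending plus Lennard-Jones energy for the flat coordinate list x."""
    n = len(seq)
    p = [x[3 * i:3 * i + 3] for i in range(n)]
    e = 0.0
    for i in range(1, n - 1):
        b1 = [p[i][k] - p[i - 1][k] for k in range(3)]
        b2 = [p[i + 1][k] - p[i][k] for k in range(3)]
        dot = sum(s * t for s, t in zip(b1, b2))
        e += 0.25 * (1.0 - dot / (math.hypot(*b1) * math.hypot(*b2)))
    for i in range(n - 2):
        for j in range(i + 2, n):
            inv6 = 1.0 / math.dist(p[i], p[j]) ** 6
            e += 4.0 * (inv6 * inv6 - coupling(seq[i], seq[j]) * inv6)
    return e

def us(x, ks):
    """Spring energy, assuming 0.5 * ks * (length - 1)^2 per spring."""
    n = len(x) // 3
    p = [x[3 * i:3 * i + 3] for i in range(n)]
    return sum(0.5 * ks * (math.dist(p[i], p[i + 1]) - 1.0) ** 2
               for i in range(n - 1))

def grad(f, x, h=1e-6):
    """Central-difference gradient (a stand-in for analytic derivatives)."""
    g = []
    for k in range(len(x)):
        up = x[:k] + [x[k] + h] + x[k + 1:]
        dn = x[:k] + [x[k] - h] + x[k + 1:]
        g.append((f(up) - f(dn)) / (2.0 * h))
    return g

def fold(x, seq, eps=1e-3, k1=4.0, ks=1.0, ks_max=200.0, lam=1e-6, max_iter=5000):
    """Descend on U = k1*U1 + Us while lowering k1 and raising ks (hypothetical schedule)."""
    for _ in range(max_iter):
        g = grad(lambda y: k1 * u1(y, seq) + us(y, ks), x)
        small = math.sqrt(sum(v * v for v in g)) < lam
        if small and k1 <= 1.0 and ks >= ks_max:
            break                                   # Step (6): gradient approximately zero
        x = [xi - eps * gi for xi, gi in zip(x, g)]  # gradient move
        if k1 > 1.0:
            k1 = max(1.0, 0.99 * k1)                # gradually decrease K1
        ks = min(ks_max, 1.01 * ks)                 # gradually increase Ks
    return x
```

As a small usage example, `fold([0.0, 0.0, 0.0, 1.3, 0.0, 0.0, 2.6, 0.0, 0.0], "AAA")` starts from a collinear chain whose bonds are stretched to 1.3 and should return a configuration whose bond lengths are close to 1. For longer chains and random starting points, much smaller step sizes than the one used here would be needed, because the Lennard-Jones term becomes very steep when two balls approach each other.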

Since $K_s$ is rather large ($K_s > 10^7$) at the end of the calculation, the resulting configuration satisfies Eq.(5) approximately, that is, the lengths of the springs satisfy the following requirement:

graphic file with name M27.gif (17)

Strategy of generating promising initial configuration

It should be pointed out that the solution produced by the algorithm above may be only a local minimum (although it is hopefully also the global one). Since the gradient method is deterministic and the initial configurations are generated randomly, the quality of the resulting solutions varies considerably from run to run. We therefore restart the computation from a new initial configuration many times and choose the best of the resulting solutions. Experiments showed that more than one hundred runs were needed to obtain a good result. A good initial configuration is therefore highly desirable.
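The restart scheme described above amounts to a simple multi-start loop, sketched below. The function `best_of_restarts` and its parameters are hypothetical names introduced for illustration: `make_initial` draws a starting point from a random generator, `minimize` is any local descent routine, and `energy` scores the result.

```python
import random

def best_of_restarts(make_initial, minimize, energy, runs, seed=0):
    """Run a local minimizer from several random starts and keep the best result."""
    rng = random.Random(seed)
    best_x, best_e = None, float("inf")
    for _ in range(runs):
        x = minimize(make_initial(rng))   # deterministic descent from a random start
        e = energy(x)
        if e < best_e:                    # keep the lowest-energy solution so far
            best_x, best_e = x, e
    return best_x, best_e
```

Because each descent is deterministic, all of the variation between runs comes from `make_initial`, which is why the quality of the initial configurations matters so much.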

Inspired by the phenomenon that hydrophobic amino acids are lumped together as a compact core surrounded by hydrophilic amino acids in a protein molecule, we put forward a heuristic strategy to generate promising initial configuration that simulates the real protein structure.

We define two concentric spherical regions with radii $R_1$ and $R_2$, where $R_1$ and $R_2$ are positive numbers with $R_2 = 2R_1$. Their common center is the origin of the 3D Cartesian coordinate system. For a black ball in the initial configuration, the center position is generated randomly inside the sphere of radius $R_1$. We set

graphic file with name M28.gif

in our algorithm. For a white ball in the initial configuration, the center position is generated randomly inside the sphere of radius $R_2$ but outside the sphere of radius $R_1$. More formally:

$$x_i^2 + y_i^2 + z_i^2 \le R_1^2 \tag{18}$$

$$R_1^2 < x_j^2 + y_j^2 + z_j^2 \le R_2^2 \tag{19}$$

where $i$ indexes the black balls, $j$ indexes the white balls, and $x$, $y$, $z$ are the coordinates of the center of a randomly generated ball.
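This strategy can be sketched with simple rejection sampling: black balls (A) are placed inside the inner sphere and white balls (B) in the shell between the two spheres. In the sketch below the value of $R_1$ is left as a parameter `r1`, since the value used by the authors is not reproduced here, and uniform sampling within each region is an assumption (the text only says the positions are generated randomly). The function name `initial_configuration` is illustrative.

```python
import math
import random

def initial_configuration(seq, r1, seed=None):
    """Random start: A centers within radius r1, B centers between r1 and 2*r1."""
    rng = random.Random(seed)
    r2 = 2.0 * r1
    coords = []
    for monomer in seq:
        outer = r1 if monomer == "A" else r2
        while True:
            # Draw from the bounding cube and reject points outside the target region.
            p = tuple(rng.uniform(-outer, outer) for _ in range(3))
            d = math.hypot(*p)
            if d <= outer and (monomer == "A" or d > r1):
                coords.append(p)
                break
    return coords
```

Note that such a starting configuration is generally not legal, since consecutive centers are not at distance 1; the springs introduced earlier are what allow the descent to begin from it.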

Experimental results showed that this strategy could generate relatively better initial configurations. To illustrate this strategy, an initial configuration of 13 balls is shown in Fig.1. For ease of visualization, the illustration is confined to two dimensions.

Fig. 1. An initial configuration of 13 balls generated by the strategy of generating promising initial configuration

RESULTS

Table 1 shows the lowest energies obtained by our heuristic algorithm, along with the results of nPERM, ELP and CSA. Our results are better than those of nPERM for all four sequences, with the energy difference growing for longer chains. For the sequence of length 13, our result is also slightly better than that of ELP and equal to that of CSA. For the other three sequences, however, our algorithm does not reach the energies obtained by ELP and CSA.

Table 1. Test sequences and the lowest energies obtained by the heuristic algorithm (HA), in comparison with those obtained by nPERM, ELP, and CSA

| n | Sequence | nPERM | ELP | CSA | HA |
|---|---|---|---|---|---|
| 13 | ABBABBABABBAB | −4.9616 | −4.967 | −4.9746 | −4.9746 |
| 21 | BABABBABABBABBABABBAB | −11.5238 | −12.316 | −12.3266 | −12.0617 |
| 34 | ABBABBABABBABBABABBABABBABBABABBAB | −21.5678 | −25.476 | −25.5113 | −23.0441 |
| 55 | BABABBABABBABBABABBABABBABBABABBABBABABBABABBABBABABBAB | −32.8843 | −42.428 | −42.3418 | −38.1977 |

Fig.2 shows the lowest energy configurations obtained by our heuristic algorithm, where black circles indicate hydrophobic monomers (A) and white circles indicate hydrophilic monomers (B). Each of the four configurations has a single hydrophobic core, which is analogous to real protein structure.

Fig. 2. The lowest energy configurations for the four sequences obtained by the heuristic algorithm: (a) n=13; (b) n=21; (c) n=34; (d) n=55

It should be pointed out that each result is the best of the solutions obtained from several (≤10) randomly generated initial configurations. The runtime for all four sequences was less than 2 h on a 2.4 GHz Pentium 4 PC with 512 MB of memory, whereas the computation time of nPERM was up to 2 days on Linux and UNIX workstations. HA is thus much faster than nPERM. Note that the runtimes of ELP and CSA were not reported in the literature.

CONCLUSION

The objective of the protein folding problem is to find inherent structures for a given set of attracting particles (amino acid monomers) that initially are widely dispersed. The elastic potential energy of springs is introduced into the energy function of the configuration to convert the protein folding problem into an unconstrained optimization problem solvable by the steepest descent method. Random initial configurations of the $n$ particles are mapped onto the final inherent structures by a numerical steepest descent on the potential energy surface. Under the steepest descent algorithm, the particles move from an initial diffuse random array towards a more compact array with lower potential energy.

Since the gradient method is only a local search algorithm, it can become trapped in a local minimum. Selecting the best of many solutions, each obtained from a promising initial configuration generated in a confined space, may help to find a comparatively good solution, but at the cost of considerable computation time. In future work, we hope to find an efficient strategy for escaping from local minima in order to develop a more efficient algorithm.

Footnotes

* Project supported by the National Basic Research Program (973) of China (No. 2004CB318000) and the National Natural Science Foundation of China (No. 10471051)

References

1. Anfinsen C. Principles that govern the folding of protein chains. Science. 1973;181:223–230. doi: 10.1126/science.181.4096.223.
2. Bachmann M, Arkin H, Janke W. Multicanonical study of coarse-grained off-lattice models for folding heteropolymers. Phys Rev E. 2005;71:031906. doi: 10.1103/PhysRevE.71.031906.
3. Crescenzi P, Goldman D, Papadimitriou C, Piccolboni A, Yannakakis M. On the complexity of protein folding. Journal of Computational Biology. 1998;5(3):409–422. doi: 10.1089/cmb.1998.5.423.
4. Dill KA. Theory for the folding and stability of globular proteins. Biochemistry. 1985;24:1501–1509. doi: 10.1021/bi00327a032.
5. Hsu HP, Mehra V, Grassberger P. Structure optimization in an off-lattice protein model. Phys Rev E. 2003;68:037703. doi: 10.1103/PhysRevE.68.037703.
6. Irback A, Peterson C, Potthast F. Identification of amino acid sequences with good folding properties in an off-lattice model. Phys Rev E. 1997;55:860–867. doi: 10.1103/PhysRevE.55.860.
7. Kim SY, Lee SB, Lee J. Structure optimization by conformational space annealing in an off-lattice protein model. Phys Rev E. 2005;72:011916. doi: 10.1103/PhysRevE.72.011916.
8. Lau KF, Dill KA. A lattice statistical mechanics model of the conformational and sequence space of proteins. Macromolecules. 1989;22:3986–3997. doi: 10.1021/ma00200a030.
9. Stillinger FH. Collective aspects of protein folding illustrated by a toy model. Phys Rev E. 1995;52:2872–2877. doi: 10.1103/physreve.52.2872.
10. Torcini A, Livi R, Politi A. A dynamical approach to protein folding. J Biol Phys. 2001;27:181–186. doi: 10.1023/A:1013104123892.
11. Wang HQ, Huang WQ, Zhang Q, Xu DM. An improved algorithm for the packing of unequal circles within a larger containing circle. European Journal of Operational Research. 2002;141:440–453. doi: 10.1016/S0377-2217(01)00241-7.
