Abstract
We apply the gradient-directed Monte Carlo (GDMC) method to select optimal members of a discrete space, the space of chemically viable proteins described by a model Hamiltonian. In contrast to conventional Monte Carlo approaches, our GDMC method uses local property gradients with respect to chemical variables that have discrete values in the actual systems, e.g., residue types in a protein sequence. The local property gradients are obtained from the interpolation of discrete property values, following the linear combination of atomic potentials scheme developed recently [M. Wang et al., J. Am. Chem. Soc. 128, 3228 (2006)]. The local property derivative information directs the search toward the global minima while the Metropolis criterion incorporated in the method overcomes barriers between local minima. Using the simple HP lattice model, we apply the GDMC method to protein sequence design and folding. The GDMC algorithm proves to be particularly efficient, suggesting that this strategy can be extended to other discrete optimization problems in addition to inverse molecular design.
INTRODUCTION
Global optimization is a key challenge in many research areas.1 In chemistry, optimization is an essential part of structure prediction, protein folding, and ligand design.2 In these cases, the number of molecular conformations increases very rapidly with the number of atoms in a molecule, so determining the global energy minimum structure is nontrivial. Global optimization methods can be divided into deterministic and stochastic classes. Deterministic optimizations have no randomness associated with the optimization algorithms. In contrast, stochastic approaches incorporate probabilistic (random) elements during optimization.3 The problems to which the optimization methods are applied can be categorized in three classes: discrete problems, continuous problems, and mixed-integer problems. Here, we focus on discrete problems. For discrete variables, stochastic methods can be implemented and have been used widely in global optimization. Two popular stochastic approaches are simulated annealing (SA) (Ref. 4) and genetic algorithms (GAs).5 Both SA and GAs, however, ignore local information characteristic of the discrete space of interest. A study of Oda et al.6 showed that the ability of a discrete algorithm to find the global minimum is significantly improved when SA or GA is combined with local searching for Ising spin clusters. For instance, in the hybrid GA incorporated with one-neighborhood search,6 each spin from the current generation structure is flipped by single-point mutations, and the fittest individual is selected for further GA operations (such as crossover and selection). Their study suggests that a proper scheme of normal stochastic global optimization combined with some local searching can reduce the cost of stochastic global searches.
Instead of stochastic approaches that operate directly on discrete variable spaces, several approaches have been proposed to embed discrete spaces in virtual continuous spaces.7, 8, 9, 10 This strategy has proven to be especially effective in problems of molecular design. In particular, a linear combination of atomic potentials (LCAP)9, 11 approach was developed and used to optimize nonlinear optical (NLO) properties by varying electrostatic potentials, expressed as a linear combination of atomic electrostatic potentials. The optimal structures can be determined without enumerating species or performing stochastic searches. The LCAP study indicates that it is possible to treat discrete variables as continuous, and then to use local derivative information for the target property to perform continuous optimization.
From our experience with the LCAP approach, the complexity of the continuous virtual property surface depends on the property of interest and the specific library of chemical systems. However, we have found that LCAP gradients assist in guiding local searches in discrete space.12, 13 Specifically, we showed that the LCAP approach combined with the new gradient-directed Monte Carlo (GDMC) method efficiently optimizes NLO properties in a class of donor-acceptor substituted benzene and porphyrin motifs.12 Here, we use this GDMC approach for global optimization in discrete spaces, combining derivative information from the continuous interpolation with stochastic Monte Carlo (MC) moves. Local gradients in GDMC are constructed simply by treating the discrete variables as continuous variables. The derivatives then bias the MC moves during the optimization in discrete space. As such, the local searches, with the gradients derived from the virtual continuous surface, are used to enhance the local searching abilities of SA or GA without requiring extensive exploration of intermediate virtual structures.
In this article, we describe the mathematical background and general procedure for GDMC optimization in Sec. 2. GDMC is applied to protein sequence design and protein folding for a simple Hamiltonian in Sec. 3. For design and folding, the schemes for obtaining the local gradients are explained. The details of the GDMC procedures in both cases are also described. The results are summarized in Sec. 4.
THE GENERAL GDMC METHOD
The GDMC approach described previously12 can be modified to establish a general tool for global optimization in a discrete space. Mathematically, the constrained discrete global optimization explored here takes the following form:
$$
\min_{x_1,\ldots,x_n} f(x_1,\ldots,x_n) \quad \text{subject to} \quad h(x)=0, \;\; g(x)\le 0, \tag{1}
$$
where f(x) is an objective function, h(x) is a set of equality constraints, g(x) is a set of inequality constraints, and n is the number of discrete variables. Each element in {xi:i=1,…,n} is a discrete variable, which means that each element can only take on discrete values. However, we treat each element of {xi:i=1,…,n} as a continuous variable for the purpose of obtaining the gradients {∂f∕∂xi:i=1,…,n}. Then, using the steepest descent method,14 a new set of variables is constructed as
$$
x_i^{\mathrm{new}} = x_i - \varepsilon \frac{\partial f}{\partial x_i}, \qquad i=1,\ldots,n, \tag{2}
$$
where ε is a step-size parameter. Equation 2 implies that each element in {xi:i=1,…,n} will increase or decrease depending on the signs of the gradients {∂f∕∂xi:i=1,…,n}. Thus, these gradients can provide useful information regarding the discrete variables to guide the next generation of {xi:i=1,…,n}.
Once the gradients {∂f∕∂xi:i=1,…,n} are computed, the general GDMC procedure for the optimization defined in Eq. 1 is as follows.
(a) Set the iteration index i=0 and begin with an initial set {xk:k=1,…,n} satisfying the constraints of Eq. 1.
(b) Set i=i+1. Calculate fi=f(x) for the current set {xk:k=1,…,n}, together with the gradients {∂f∕∂xk:k=1,…,n} with respect to all variables. Exit if the optimization goal or the maximum number of iterations is reached.
(c) Make a trial move to generate a new set {xk:k=1,…,n} based on specific schemes following the computed gradients with Eq. 2. Accept the trial move with probability p=min{1,exp[−β(fi+1−fi)]} using the Metropolis criterion, where fi+1 is the value of f for the trial set. If the trial move is accepted, go to step (b); if the trial move is rejected, go to step (c). (Here, β is an empirical parameter.)
The specific schemes in this procedure use the computed gradients to generate a trial move. These schemes need not be based on the steepest descent direction. In addition, the schemes depend on the specific optimization objectives. For instance, such a scheme using the gradients to generate a trial protein sequence will be described in Sec. 3A for protein sequence design.
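Steps (a) through (c) can be sketched as a short loop. The version below is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`gdmc`, `propose`), the toy objective on integer triples, and the "one unit along the largest gradient component" proposal rule are all hypothetical choices made only to show where the gradient and the Metropolis test enter.

```python
import math
import random

def gdmc(f, grad, propose, x0, beta, max_iter, rng):
    """Generic gradient-directed Monte Carlo loop (illustrative sketch)."""
    x, fx = x0, f(x0)
    best_x, best_f = x, fx
    for _ in range(max_iter):
        g = grad(x)                   # step (b): gradients on the relaxed surface
        trial = propose(x, g, rng)    # step (c): gradient-guided discrete trial move
        if trial is None:             # the proposal scheme has nothing new to offer
            break
        f_trial = f(trial)
        # Metropolis criterion: p = min{1, exp(-beta * (f_trial - f_current))}
        if f_trial <= fx or rng.random() < math.exp(-beta * (f_trial - fx)):
            x, fx = trial, f_trial
            if fx < best_f:
                best_x, best_f = x, fx
    return best_x, best_f

# Toy problem (hypothetical): integer variables, quadratic objective.
TARGET = (3, -2, 5)

def f(x):
    return sum((xi - ti) ** 2 for xi, ti in zip(x, TARGET))

def grad(x):
    # Derivative of f obtained by treating each integer x_i as continuous.
    return [2 * (xi - ti) for xi, ti in zip(x, TARGET)]

def propose(x, g, rng):
    # Move the variable with the largest |gradient| one unit downhill.
    k = max(range(len(x)), key=lambda i: abs(g[i]))
    if g[k] == 0:
        return None
    step = -1 if g[k] > 0 else 1
    return x[:k] + (x[k] + step,) + x[k + 1:]

best_x, best_f = gdmc(f, grad, propose, (0, 0, 0), beta=1.0,
                      max_iter=100, rng=random.Random(0))
```

In this toy case every proposal lowers f, so the Metropolis test always accepts; on a rugged surface the same test is what allows occasional uphill moves.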
The main difference between the GDMC scheme and the classical MC approach is that property gradients are obtained first from a virtual continuous surface in GDMC and used to generate the next set of {xi,i=1,…,n} at each optimization step. A key issue explored here is how to construct such a continuous surface from a discrete one so that the derivative information is utilized to direct the selection of the next generation of discrete variables. For example, as implemented in the LCAP approach, linearly combining atomic potentials on each variable site provides a simple approach to link all of the discrete molecules and generate a continuous surface.9 In the current study, for protein sequence design, the change from hydrophobic (H) to polar (P) residues is taken as continuous. Then, analytical gradients on the virtual structure surface can be calculated subject to constraints. Nevertheless, for protein folding, there is no effective way to calculate gradients analytically because of the strict structural constraint that each structure in the optimization should be physically meaningful. Therefore, “numerical gradients,” as described below, which satisfy the structural constraints, are calculated and used in GDMC.
PROTEIN SEQUENCE DESIGN AND FOLDING WITH LATTICE MODELS
Protein sequence design and protein folding represent central challenges in biology.15, 16 One challenge is to determine the sequence with the lowest energy for a prescribed three-dimensional (3D) fold (the inverse folding problem). A second challenge is to determine the most stable fold for a specific sequence (the folding problem). Here, to explore these two challenges with the GDMC algorithm, we describe the protein with a simple HP lattice model.17, 18, 19 In the model, two types of residues can exist on fixed two-dimensional (2D) or 3D lattice sites. The residue at each site may be hydrophobic (H) or polar (P), and each lattice site can be occupied by only one monomer. A protein chain is configured as a self-avoiding walk (SAW), i.e., a path from one point to another that never crosses itself. A 2D lattice model with 16 lattice sites is shown in Fig. 1.7, 10 Although the simplicity of HP lattice models limits their realism somewhat,20, 21, 22 protein sequence design and folding with this HP lattice model are NP-hard (nondeterministic polynomial-time hard) problems.23, 24 Therefore, the simple HP lattice model is often used to benchmark global optimization methods. Here, we focus on the 2D lattice HP models to explore the efficiency of the GDMC optimization for protein sequence design and folding.
Figure 1.
4×4 2D HP lattice conformation. Each site is populated by either an H or P monomer. This specific conformation is used in Sec. 3A for protein sequence design.
In the HP lattice model, the total energy depends on the sum of interactions between all pairs of adjacent “nonbonded” residues in the conformation. For example, in Fig. 1, site 3 has nonbonded interactions with sites 12 and 16 (but not with sites 2 and 4). The interaction energy e depends on the interacting residue types. Note that we do not address the negative design issue (i.e., design with additional constraints).25, 26 Here, we set eHH=−2.3, eHP=−1.0, and ePP=0 (Ref. 7) for protein sequence design; for protein folding, only H–H interactions are nonzero (eHH=−1.0).27 The above values were used to compare our GDMC optimizations with corresponding methods in the literature.
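The total HP energy7 is
$$
E = \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} A(i,j)\, e(i,j), \tag{3}
$$
where
$$
A(i,j) = \begin{cases} 1, & \text{residues } i \text{ and } j \text{ are adjacent on the lattice but nonbonded},\\ 0, & \text{otherwise}, \end{cases} \tag{4}
$$
$$
e(i,j) = e_{HH} S_i S_j + e_{HP}\left[S_i(1-S_j) + S_j(1-S_i)\right] + e_{PP}(1-S_i)(1-S_j). \tag{5}
$$
In Eq. 3, N is the number of residues and A(i,j) is the interaction matrix for all residues. In Eq. 5, S represents the residue type on each site: Si=0 corresponds to a P residue on the site; Si=1 corresponds to an H residue on the site.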
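As a concrete illustration of the contact-based energy, the sketch below builds the nonbonded contact list for a lattice walk and sums the pair energies. The serpentine 4×4 walk is a hypothetical fold chosen for simplicity, not the conformation of Fig. 1, and the function names are assumptions. The value e_HP=−1.0 used in the call is the one consistent with Table 1 (a single H residue yields −2.0).

```python
def serpentine(n):
    """Boustrophedon self-avoiding walk filling an n x n lattice (hypothetical fold)."""
    path = []
    for r in range(n):
        cols = range(n) if r % 2 == 0 else range(n - 1, -1, -1)
        path.extend((r, c) for c in cols)
    return path

def contacts(path):
    """Pairs (i, j) that are lattice neighbours but not chain neighbours."""
    index = {p: i for i, p in enumerate(path)}
    pairs = []
    for i, (r, c) in enumerate(path):
        # Looking only "down" and "right" visits every lattice edge exactly once.
        for q in ((r + 1, c), (r, c + 1)):
            j = index.get(q)
            if j is not None and abs(i - j) > 1:
                pairs.append((min(i, j), max(i, j)))
    return pairs

def hp_energy(seq, pairs, e_hh, e_hp, e_pp):
    """Sum e(S_i, S_j) over nonbonded contacts; seq[i] is 1 for H and 0 for P."""
    table = {2: e_hh, 1: e_hp, 0: e_pp}
    return sum(table[seq[i] + seq[j]] for i, j in pairs)

pairs = contacts(serpentine(4))
e_all_h = hp_energy([1] * 16, pairs, e_hh=-2.3, e_hp=-1.0, e_pp=0.0)
e_all_p = hp_energy([0] * 16, pairs, e_hh=-2.3, e_hp=-1.0, e_pp=0.0)
```

Any walk that fills the 4×4 lattice has 24 lattice edges and 15 chain bonds, hence nine nonbonded contacts, so the all-H energy is 9×(−2.3)=−20.7, which matches the lowest value listed in Table 1.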
Protein sequence design
To design a protein sequence with a given fold, the total energy of a protein conformation is optimized with respect to the residue type (H or P) on each lattice site. Here, we choose the 2D lattice protein fold shown in Fig. 1.7 The set {Si,i=1,…,N} is optimized to design a protein sequence. In Ref. 7, the Si value was replaced by a Gaussian function to construct a continuous energy surface, and then continuous optimization of {Si,i=1,…,N} was carried out. In our study, however, Si itself is treated as a continuous variable. Thus the energy gradient with respect to Si is
$$
\frac{\partial E}{\partial S_i} = \sum_{j=1}^{N} A(i,j)\left[e_{HH} S_j + e_{HP}(1-2S_j) - e_{PP}(1-S_j)\right]. \tag{6}
$$
If there are no constraints on protein sequence in the HP model, Eq. 3 shows that the minimum energy is found when all sites are H residues, because eHH has the lowest interaction energy. Following the protocol of Ref. 7, the number of H residues (NH) is constrained for a nontrivial optimization to design a variety of protein sequences,
$$
\sum_{i=1}^{N} S_i = N_H. \tag{7}
$$
For this constrained minimization, a new objective function of the constrained energy is constructed,
| (8) |
where the Lagrangian multiplier is
| (9) |
Accordingly, the gradient of Si in GDMC is
| (10) |
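As a hedged illustration of how such a constrained gradient can be built, one standard Lagrangian construction is sketched below. The sign convention and the mean-gradient choice of λ are assumptions made for this sketch and may differ from the definitions used in Eqs. 8, 9, and 10.

```latex
\begin{align}
L(\{S_i\},\lambda) &= E(\{S_i\}) + \lambda\Big(\sum_{i=1}^{N} S_i - N_H\Big),\\
\frac{\partial L}{\partial S_i} &= \frac{\partial E}{\partial S_i} + \lambda,\\
\lambda &= -\frac{1}{N}\sum_{j=1}^{N}\frac{\partial E}{\partial S_j}
\quad\Longrightarrow\quad \sum_{i=1}^{N}\frac{\partial L}{\partial S_i} = 0 .
\end{align}
```

With this choice a steepest-descent step leaves the sum of the Si unchanged. Because λ shifts every gradient component by the same amount, it does not alter the rank order of the sites, and L equals E for any discrete sequence that satisfies the constraint.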
After the objective function and its gradients with respect to Si are formulated, the GDMC optimization procedure for protein sequence design is as follows.
(a) Set the iteration number i=0 and begin with a random sequence {Sk:k=1,…,N}, where each Sk is either 0 or 1, maintaining the constraint of Eq. 7 (the number of H residues equals NH).
(b) Set i=i+1. Calculate Li and the gradients {∂L∕∂Sk:k=1,…,N} using Eqs. 8, 10. Exit if the optimization goal or the maximum number of iterations is reached. (Here, we chose 1000 iterations.)
(c) Generate a new sequence {Sk:k=1,…,N} using the computed gradients {∂L∕∂Sk:k=1,…,N}. To maintain the constraint of Eq. 7, the generation scheme for a trial protein sequence (see below for details) is used. Accept the new sequence with probability p=min{1,exp[−β(Li+1−Li)]} using the Metropolis rule. (Here, the empirical parameter β is set to 1.2×10^−3 based on trial runs.)
The generation scheme in the above procedure uses the current ith sequence and its gradients to jump to the next sequence, while satisfying the constraint on the total number of hydrophobic residues. The detailed rules used to generate a new sequence are as follows.
(1) Using the computed gradients {∂L∕∂Sk:k=1,…,N}, all of the N lattice sites are ranked by their gradients. The sites with the NH lowest gradients are set to be H residues and all others are set to be P residues, so the constraint of Eq. 7 is retained.
(2) If this generated sequence was visited in an earlier step, the site that has the (NH+1)th lowest gradient is switched to an H residue. Meanwhile, the site that has the NHth lowest gradient is switched to a P residue to satisfy the constraint of Eq. 7. This switching procedure is repeated until a sequence that has not been visited is generated; the search exits if a new sequence cannot be generated after NH attempts.
In this protocol, the constraint of Eq. 7 is always enforced by the above scheme and the gradients are used to guide the MC search. In a sense, the GDMC search is a “semistochastic” algorithm. The local gradients generate the sequence deterministically while the Metropolis criterion stochastically accepts or rejects the sequence so generated.
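The generation rules and the Metropolis step can be put together in a compact sketch. Several ingredients are assumptions rather than details from the source: the fold is a hypothetical serpentine 4×4 walk (not Fig. 1), the pair energy is interpolated bilinearly in Si, e_HP is taken as −1.0, and rule (2) is read as "swap each H site in turn, weakest first, for the site ranked NH+1". The constant Lagrange shift is omitted because it does not change the ranking.

```python
import math
import random

E_HH, E_HP, E_PP = -2.3, -1.0, 0.0   # design parameters (e_HP sign assumed negative)

def serpentine_contacts(n):
    """Nonbonded contact pairs of a hypothetical serpentine walk on an n x n lattice."""
    path = [(r, c) for r in range(n)
            for c in (range(n) if r % 2 == 0 else range(n - 1, -1, -1))]
    index = {p: i for i, p in enumerate(path)}
    pairs = []
    for i, (r, c) in enumerate(path):
        for q in ((r + 1, c), (r, c + 1)):
            j = index.get(q)
            if j is not None and abs(i - j) > 1:
                pairs.append((i, j))
    return pairs

def energy(seq, pairs):
    """Bilinear pair energy summed over contacts; exact for S in {0, 1}."""
    return sum(E_HH * seq[i] * seq[j]
               + E_HP * (seq[i] * (1 - seq[j]) + seq[j] * (1 - seq[i]))
               + E_PP * (1 - seq[i]) * (1 - seq[j]) for i, j in pairs)

def gradient(seq, pairs):
    """dE/dS_k for every site, treating each S_k as continuous."""
    g = [0.0] * len(seq)
    for i, j in pairs:
        for a, b in ((i, j), (j, i)):
            g[a] += E_HH * seq[b] + E_HP * (1 - 2 * seq[b]) - E_PP * (1 - seq[b])
    return g

def propose(seq, pairs, n_h, visited):
    """Rule (1): H on the n_h lowest-gradient sites; rule (2): swaps if visited."""
    g = gradient(seq, pairs)
    order = sorted(range(len(seq)), key=lambda k: g[k])
    h_sites = order[:n_h]
    candidates = [h_sites]
    if n_h < len(seq):
        for t in range(1, n_h + 1):   # assumed reading of rule (2)
            kept = h_sites[:n_h - t] + h_sites[n_h - t + 1:]
            candidates.append(kept + [order[n_h]])
    for sites in candidates:
        trial = tuple(1 if k in sites else 0 for k in range(len(seq)))
        if trial not in visited:
            return trial
    return None

def design(pairs, n_sites, n_h, beta, max_iter, rng):
    start = rng.sample(range(n_sites), n_h)
    seq = tuple(1 if k in start else 0 for k in range(n_sites))
    e = energy(seq, pairs)
    best, visited = (seq, e), {seq}
    for _ in range(max_iter):
        trial = propose(seq, pairs, n_h, visited)
        if trial is None:
            break
        visited.add(trial)
        e_trial = energy(trial, pairs)
        if e_trial <= e or rng.random() < math.exp(-beta * (e_trial - e)):
            seq, e = trial, e_trial
            if e < best[1]:
                best = (seq, e)
    return best

pairs = serpentine_contacts(4)
best_seq, best_e = design(pairs, n_sites=16, n_h=4, beta=1.2e-3,
                          max_iter=1000, rng=random.Random(1))
```

Every trial sequence is recorded as visited whether or not it is accepted, which is one simple way to keep the deterministic generator from proposing the same sequence twice.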
In the following analysis, the conformation of Fig. 1 is examined in greater detail and the lowest energies obtained by GDMC search are shown in Table 1. MC in this table refers to runs in which the gradient values are replaced by random numbers, so that no gradient information from Eq. 10 is used, as in conventional MC sampling. Both MC and GDMC are tested in five independent runs. Information from the GDMC run that takes the smallest number of steps to find the lowest energy for each NH is shown in the last three columns of Table 1.
Table 1.
Comparison of GDMC with MC performance to design protein sequences for the fixed conformation in Fig. 1. Five MC and GDMC runs were executed independently. NH is the number of H residues. Emin is the global minimum energy found by enumerating all possible sequences. Emin(MC) is the minimum energy obtained from the standard MC search. Emin(GDMC) is the minimum energy from GDMC. The number in parentheses for Emin(MC) and Emin(GDMC) is the smallest number of steps needed to find the global minimum energy during five MC and GDMC runs. The information for the GDMC run that takes the smallest number of steps to find the lowest energy for each NH is shown in the last three columns. “Iteration” is the number of steps in the GDMC optimization. “Answers” indicates how many degenerate sequences with the same lowest energy are found during the optimization. “Successful jumps” denotes how many sequences generated by the prior scheme lower the total energy.
| NH | Emin^a | Emin(MC) | Emin(GDMC) | GDMC: Iteration | GDMC: Answers | GDMC: Successful jumps |
|---|---|---|---|---|---|---|
| 1 | −2.0 | −2.0(2) | −2.0(2) | 6 | 5 | 5 |
| 2 | −4.3 | −4.3(44) | −4.3(7) | 7 | 1 | 4 |
| 3 | −6.6 | −6.6(72) | −6.6(9) | 9 | 1 | 5 |
| 4 | −8.6 | −8.6(10) | −8.6(2) | 6 | 4 | 3 |
| 5 | −10.9 | −10.9(286) | −10.9(2) | 14 | 1 | 6 |
| 6 | −12.2 | −11.9(540) | −12.2(2) | 17 | 4 | 7 |
| 7 | −13.5 | −13.5(185) | −13.5(2) | 14 | 6 | 3 |
| 8 | −14.8 | −14.8(149) | −14.8(3) | 32 | 4 | 11 |
| 9 | −16.1 | −15.8(53) | −16.1(7) | 55 | 1 | 27 |
| 10 | −17.1 | −17.1(100) | −17.1(2) | 46 | 8 | 22 |
| 11 | −18.4 | −18.4(494) | −18.4(2) | 34 | 2 | 20 |
| 12 | −19.4 | −19.4(10) | −19.4(2) | 14 | 8 | 11 |
| 13 | −20.7 | −20.7(59) | −20.7(2) | 15 | 1 | 11 |
| 14 | −20.7 | −20.7(4) | −20.7(2) | 16 | 2 | 12 |

a Taken from Ref. 10.
Table 1 shows that the GDMC search always finds the global minimum energy that is known from exhaustive sequence enumeration.7 Several different protein sequences that share the same lowest possible energy are found efficiently. The deterministic continuous optimizations7, 10 only find one “intermediate” sequence close to one local minimum during one trial run within thousands of steps. The total number of iterations in GDMC, in contrast, is typically less than 50. Furthermore, the successful jumps in GDMC, defined as the number of sequences generated by the GDMC scheme that lower the total energy, often constitute about 50% or more of the total number of iterations. Therefore, compared to the continuous model,7, 10 our GDMC approach utilizes the smooth energy surface constructed by treating the discrete variables {Sk:k=1,…,N} as continuous and efficiently explores the real sequences in discrete space.
Table 1 shows that, although the conventional MC scheme is still robust in its ability to find the global minima for most protein sequence designs, MC fails when NH=6 and NH=9 because of the large number of possible sequences, N!∕[NH!(N−NH)!] (Ref. 10), when NH is between 6 and 9. However, using GDMC, all five independent runs for each NH find the global energy minimum. To find the first global minimum, MC usually requires tens to hundreds of steps (see the numbers in parentheses in the MC column of Table 1), while the GDMC method requires fewer than ten steps (see the numbers in parentheses in the GDMC column of Table 1). This result suggests that the constructed energy surface of {Sk:k=1,…,N} is sufficiently smooth that only a few steps are required to reach the global energy minimum using GDMC.
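The size of the constrained sequence space follows directly from the binomial coefficient. The short sketch below evaluates it for the 16-site lattice and for a 36-site lattice with eight H residues; the helper name `n_sequences` is an illustrative assumption.

```python
import math

def n_sequences(n_sites, n_h):
    """Number of HP sequences with exactly n_h H residues on n_sites sites."""
    return math.comb(n_sites, n_h)

# 4 x 4 lattice: the count peaks at N_H = 8.
counts_4x4 = {n_h: n_sequences(16, n_h) for n_h in range(1, 15)}
largest = max(counts_4x4, key=counts_4x4.get)

# 6 x 6 lattice with eight H residues.
count_6x6 = n_sequences(36, 8)
```

For N=16 the count is 8008 at NH=6, 12 870 at NH=8, and 11 440 at NH=9, so the mid-range values of NH are the hardest cases for an unguided search. For N=36 and NH=8 the count is 30 260 340, i.e., about 3×10^7.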
To further examine GDMC performance, a more complicated 6×6 2D HP lattice conformation was chosen to design the protein sequence with eight H residues out of 36. The total number of possible sequences with eight H residues is more than 3×10^7. Beginning with one random initial guess, only 48 iterations were carried out in the GDMC procedure and four degenerate sequences, shown in Fig. 2, were obtained with an optimal energy of −18.1.7 Note that the positions of five H residues are conserved on this 2D lattice, which might suggest that these five adjacent positions form a favorable hydrophobic core.
Figure 2.
The four degenerate sequences with the optimal energy of −18.1 obtained for a 6×6 2D HP lattice conformation with eight H residues out of 36. Empty circles indicate sites occupied by P residues.
For protein sequence design, the GDMC approach is not fully stochastic because the analytical gradient information is used to generate new sequences. The local gradients direct the global optimization in the discrete space to find an improved sequence and the Metropolis criterion helps GDMC avoid becoming trapped in local minima through the stochastic moves. The combination of local gradients and the Metropolis criterion enables GDMC to find the global energy minimum efficiently.
Protein folding
Protein folding with an HP lattice model causes the interaction matrix A in Eq. 3 to vary as a protein of fixed sequence folds. Thus, A(i,j) is the variable to be optimized toward the lowest energy conformations. However, to maintain a valid protein structure during folding, a constraint on the A matrix must be imposed: The HP lattice chain must remain a self-avoiding walk on the lattice.28 Analytical energy gradients with respect to the elements A(i,j) subject to the SAW constraint cannot be easily calculated. Thus, numerical gradients with respect to the conformational changes are used. Specifically, the numerical gradients are calculated from the energy changes when one residue is moved to another position along the four directions shown in Fig. 3. Accordingly, other sites must also be moved to maintain an allowed conformation. Here, we use a local move set (i.e., the pull moves of Ref. 29) to fold the extended lattice chain in the HP lattice model. Pull moves are complete; that is, any valid conformation can be reached from any other valid conformation by a sequence of pull moves. Pull moves also determine how the other residues respond in 2D HP lattice models when one residue is moved in one direction, as defined in Ref. 29.
Figure 3.
Four move directions (i.e., indices 1, 2, 3, and 4) of residue i. If position L is vacant, residue i can be moved to L. Then residue (i−1) is moved to position C to sustain a valid conformation. If this does not complete the move, residue (i−2) is moved to the position previously held by residue i, residue (i−3) is moved to the position previously held by residue (i−1), and so on, until the move is completed and a new valid conformation is obtained.
According to the pull move rules, four move directions (i.e., indices 1, 2, 3, and 4) are defined in Fig. 3. If the position in one direction (such as P, N, L, or M) is already occupied by another residue, the move in that direction is not valid. If position P, N, L, or M is vacant, the move to that position is permitted. After one valid move, residue i has a neighbor that is either residue (i−1) or (i+1). The other residues from the (i−1)th residue to the first residue in the chain need to make the necessary moves to sustain a valid protein conformation, according to specific rules of moving.29 For instance, if residue i is moved to the vacant position L, residue (i−1) is then moved to position C to complete the move and to sustain a valid conformation (as required by the SAW constraint). If the conformation is not valid after this move, residue (i−2) is moved to the position previously held by residue i. If the conformation is still invalid, residue (i−3) is then moved to the position previously held by residue (i−1) and so on, until the move is completed and a new valid conformation is obtained. The first and last residues in the chain require special pull moves (see Ref. 29). Basically, the strategy of pull moves is that the minimal possible number of residues should change their positions when one residue is moved. Therefore, pull moves are local.
Numerical gradients can be computed for each pull move of any residue. If a new conformation is not valid from a move in one direction, the gradient for this direction is set to be a large positive value. Otherwise, the energy of this new conformation is obtained from Eq. 3. Subsequently, the gradient equals the energy difference between the new conformation and the one prior to the pull move. Hence, numerical gradients for any site under the SAW constraint are readily calculated.
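The gradient rule above can be sketched in a few lines. Note the substitution: instead of the full pull moves of Ref. 29, this sketch uses simpler single-residue corner and end moves along the four diagonal directions, so only the chosen residue changes position. The constant `BIG`, the function names, and the four-residue example are hypothetical.

```python
BIG = 1.0e6  # stands in for the "large positive value" assigned to invalid moves
DIRECTIONS = ((1, 1), (1, -1), (-1, 1), (-1, -1))  # four diagonal targets

def hp_energy(conf, seq):
    """Folding energy: -1 for every nonbonded H-H lattice contact."""
    index = {p: i for i, p in enumerate(conf)}
    e = 0
    for i, (x, y) in enumerate(conf):
        if seq[i] != "H":
            continue
        for q in ((x + 1, y), (x, y + 1)):   # each lattice edge counted once
            j = index.get(q)
            if j is not None and seq[j] == "H" and abs(i - j) > 1:
                e -= 1
    return e

def single_residue_move(conf, k, d):
    """Move residue k by d if the site is free and its chain bonds stay intact."""
    x, y = conf[k]
    target = (x + d[0], y + d[1])
    if target in conf:
        return None
    for nb in (k - 1, k + 1):
        if 0 <= nb < len(conf):
            nx, ny = conf[nb]
            if abs(nx - target[0]) + abs(ny - target[1]) != 1:
                return None
    return conf[:k] + [target] + conf[k + 1:]

def numerical_gradients(conf, seq, k):
    """Energy change for each of the four moves of residue k (BIG if invalid)."""
    e_old = hp_energy(conf, seq)
    grads = []
    for d in DIRECTIONS:
        new_conf = single_residue_move(conf, k, d)
        grads.append(BIG if new_conf is None else hp_energy(new_conf, seq) - e_old)
    return grads

# Hypothetical example: an L-shaped four-residue chain, moving the last residue.
conf = [(0, 0), (1, 0), (1, 1), (1, 2)]
grads = numerical_gradients(conf, "HPPH", k=3)
```

In this example two of the four moves would break the chain and receive the penalty, one leaves the energy unchanged, and the move to (0, 1) creates an H-H contact with a gradient of −1.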
The GDMC procedure in protein folding is as follows.
(a) Start from an extended structure with the fixed protein sequence. Set i=1 and E1=0.
(b) Set i=i+1. Randomly choose a residue to move, i.e., the kth residue. Calculate numerical gradients with respect to the four move directions shown in Fig. 3 for the kth residue. Exit if the maximum number of iterations is reached.
(c) Make a trial move for the kth residue following the computed gradients for the four move directions. Accept the trial move with probability p=min{1,exp[−β(Ei−Ei−1)]} using the Metropolis criterion. If the trial move is accepted, go to step (b); if the trial move is rejected, go to step (c). (Here, the empirical parameter β is set to be 2.4 based on trial runs.)
Since eHH=−1, eHP=0, and ePP=0, the gradients with respect to the move directions of the P residues are always zero. As such, all of these gradients are set equal to random numbers. When some of the move directions of the chosen residue are degenerate, one direction is chosen randomly to move the residue. Ten proteins with different sequences in Table 2 are taken from the literature27, 29, 30 (the number of residues varies from 20 to 100). The sequences are held fixed during folding, and their conformations were optimized in this study. We compared the GDMC approach to MC, GA, and the contact interaction (CI) method. The results appear in Table 3.
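The direction choice and the acceptance test in step (c) can be illustrated as follows. This is a minimal sketch with hypothetical function names; it assumes that the trial move is taken along the lowest-gradient direction and that ties are broken uniformly at random.

```python
import math
import random

def pick_direction(grads, rng):
    """Index of the lowest gradient; degenerate directions are chosen at random."""
    lowest = min(grads)
    ties = [d for d, g in enumerate(grads) if g == lowest]
    return rng.choice(ties)

def acceptance_probability(delta_e, beta):
    """Metropolis probability p = min{1, exp(-beta * delta_e)}."""
    return min(1.0, math.exp(-beta * delta_e))

def metropolis_accept(delta_e, beta, rng):
    """Draw one accept/reject decision for an energy change delta_e."""
    return delta_e <= 0 or rng.random() < math.exp(-beta * delta_e)

rng = random.Random(0)
direction = pick_direction([1.0e6, 0, 1.0e6, -1], rng)   # unique minimum at index 3
p_uphill = acceptance_probability(1, beta=2.4)           # losing one H-H contact
```

With β=2.4, a move that breaks a single H-H contact (an energy increase of 1) is accepted with probability exp(−2.4), roughly 0.09, so uphill moves remain possible but uncommon.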
Table 2.
HP sequences for model proteins.
| ID | No. of residues | Sequencea |
|---|---|---|
| 1 | 20 | HPHPPHHPHPPHPHHPPHPH |
| 2 | 24 | HHPPHPPHPPHPPHPPHPPHPPHH |
| 3 | 25 | PPHPPHHPPPPHHPPPPHHPPPPHH |
| 4 | 36 | PPPHHPPHHPPPPPHHHHHHHPPHHPPPPHHPPHPP |
| 5 | 48 | PPHPPHHPPHHPPPPPHHHHHHHHHHPPPPPPHHPPHHPPHPPHHHHH |
| 6 | 50 | HHPHPHPHPHHHHPHPPPHPPPHPPPPHPPPHPPPHPHHHHPHPHPHPHH |
| 7 | 60 | PPHHHPHHHHHHHHPPPHHHHHHHHHHPHPPPHHHHHHHHHHHHPPPPHHHHHHPHHPHP |
| 8 | 64 | HHHHHHHHHHHHPHPHPPHHPPHHPPHPPHHPPHHPPHPPHHPPHHPPHPHPHHHHHHHHHHHH |
| 9 | 85 | HHHHPPPPHHHHHHHHHHHHPPPPPPHHHHHHHHHHHHPPPHHHHHHHHHHHHPPPHHHHHHHHHHHHPPPHPPHHPPHHPPHPH |
| 10 | 100 | PPPPPPHPHHPPPPPHHHPHHHHHPHHPPPPHHPPHHPHHHHHPHHHHHHHHHHPHHPHHHHHHHPPPPPPPPPPPHHHHHHHPPHPHHHPPPPPPHPHH |
Table 3.
Comparison of MC and GDMC with pull moves vs MC (Ref. 31), GA (Ref. 31), and the CI (Ref. 27) method for protein folding using the HP lattice model.
| ID | MC^a: E^e | MC^a: Time^f | GA^a: E^e | GA^a: Time^f | CI^b: E^e | CI^b: Time^f | CI^b: Hits^g | MC-pull moves^c: E^e | MC-pull moves^c: Time^f | MC-pull moves^c: Hits^g | GDMC-pull moves^c: E^e | GDMC-pull moves^c: Time^f | GDMC-pull moves^c: Hits^g | Duration^d (×10^3) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | −9 | 292 443 | −9 | 30 492 | −9 | 171 | 5 | −9 | 149 | 5 | −9 | 496 | 5 | 10 |
| 2 | −9 | 2 492 221 | −9 | 30 491 | −9 | 1425 | 5 | −8 | 144 | 5 | −9 | 1248 | 4 | 10 |
| 3 | −8 | 2 694 572 | −8 | 20 400 | −8 | 1132 | 5 | −8 | 674 | 4 | −8 | 1683 | 5 | 30 |
| 4 | −13 | 6 557 189 | −14 | 301 339 | −14 | 40 237 | 2 | −14 | 8895 | 5 | −14 | 4238 | 5 | 300 |
| 5 | −20 | 9 201 755 | −22 | 126 547 | −23 | 204 928 | 1 | −23 | 29 269 | 5 | −23 | 43 434 | 5 | 300 |
| 6 | −21 | 15 151 203 | −21 | 592 887 | −21 | 13 464 | 5 | −19 | 170 047 | 3 | −21 | 76 349 | 1 | 300 |
| 7 | −33 | 8 262 338 | −34 | 111 400 | −35 | 361 533 | 1 | −36 | 198 486 | 1 | −36 | 194 950 | 2^h | 500 |
| 8 | −35 | 7 848 952 | −38 | 97 220 | −40 | 461 099 | 1 | −42 | 76 401 | 4 | −42 | 61 233 | 3 | 500 |
| 9 | ⋯ | ⋯ | ⋯ | ⋯ | ⋯ | ⋯ | ⋯ | −52 | 233 039 | 5 | −53 | 4 302 404 | 2 | 5000 |
| 10 | ⋯ | ⋯ | ⋯ | ⋯ | ⋯ | ⋯ | ⋯ | −47 | 295 270 | 2 | −48^i | 1 441 692 | 1 | 5000 |

a Values were taken from Ref. 31.
b Values were taken from Ref. 27.
c Values were obtained in the current work.
d The maximum number of iteration steps for each independent trial run. The duration is the same for the trial runs of CI, MC-pull moves, and GDMC-pull moves. The step numbers were taken from Ref. 27.
e Lowest energy value found in the most efficient of five trial runs.
f Number of time steps before the lowest energy was found.
g Number of runs, out of five trial runs, in which the lowest energy value was found.
h Total runs are ten for this case.
We first examined the efficiency of pull moves using the conventional MC approach (MC-pull moves). As shown in Table 3, “MC-pull moves” have the same procedure as GDMC, except that a random choice for the movement is applied without the calculations of numerical gradients. Compared to both the MC and GA approaches without pull moves in Ref. 31, much less time is required in most cases for MC-pull moves to obtain the lowest energy. Furthermore, the pull moves are much more efficient for guiding the MC stochastic moves that lead to more stable new conformations when the sequence lengths are longer (e.g., sequences 5, 7, and 8). Thus, the pull moves used here are quite efficient.
Employing the conventional MC-pull moves without the gradient information fails to find the lowest or lower energies for sequences 2, 6, 9, and 10. Thus, the GDMC approach combined with pull moves (GDMC-pull moves) is applied in these cases. Compared to the MC-pull moves, GDMC-pull moves find conformations with lower energies for these four sequences and the same lowest energies for the other six (Table 3). Even for the much longer sequences 9 and 10 in Table 2, the GDMC approach with the local pull moves found lower energies than the MC-pull moves. The two corresponding conformations match those in Refs. 29, 30. For sequences 1, 3, and 5, GDMC requires more time steps to find the lowest energy than the MC-pull moves; for sequence 7, more trial runs in GDMC were required to find the native conformation compared to the stochastic MC-pull moves. This result suggests that the landscape for protein folding in the HP model is more complicated than the landscape for protein sequence design. (The complexity of the landscape depends on both the specific protein sequence and the number of residues.) Sometimes, using the local gradient information causes “near-sighted” side effects for global optimization and may mislead the optimization. Eventually, however, the MC moves in GDMC eliminate some side effects of the local gradients and allow the GDMC optimization to escape the local minima. On the whole, GDMC is more likely to find deeper minima for protein folding with the HP lattice model than MC-pull moves without gradient information.
The efficiency of the GDMC search is comparable with that of the CI method.27 CI is an efficient computational strategy31, 32, 33, 34, 35 to simulate and analyze protein folding for HP lattice models. As shown in Table 3, for sequences 4, 5, and 8, GDMC has more hits than CI for searches of the same duration and five trial runs. Moreover, for these sequences, GDMC requires substantially fewer optimization cycles to reach the optimum compared to CI. For sequence 8, the conformation with the lowest energy was found after 61 233 cycles in GDMC compared to 461 099 cycles in CI. For sequences 6 and 7, GDMC requires more iteration time or more trial runs than CI. This is also caused by the near-sighted local gradients in directing the simulation toward the global minimum when the landscape of protein folding is complicated. The MC stochastic moves in GDMC assist in relieving the side effects of local gradients and finally locate the native conformation of sequences 6 and 7. Overall, GDMC is more efficient than CI for finding lower energy minima.
Pull moves are also used in the equienergy (EE) sampling approach of Ref. 36. Pull moves in EE are employed as a mutation operator to change a given state and the generated configuration is accepted or rejected based on the Metropolis–Hastings criterion. With the aid of pull moves, EE samples the entire phase space while considering the entropic contributions to the folding free energy, rather than the energy landscape in the traditional optimization approaches. In contrast, our GDMC approach performs pull moves to compute numerical gradients and the trial move along the direction with the steepest gradient is accepted or rejected by the Metropolis criterion. Here, since we focus on testing the GDMC approach for global optimization, our aim is to find the global energy minimum conformation in the HP lattice models with the fixed sequences.
For HP lattice model protein folding, the analytical energy gradients with respect to the interaction matrix elements are difficult to obtain because the conformation must obey the SAW condition. To implement the GDMC approach, numerical gradients are calculated using the set of local pull moves. Our results demonstrated that GDMC, combined with those local gradients, more efficiently explores the conformation space compared to MC, GA, the MC-pull moves, and the CI method.
CONCLUSIONS
We have described a general global optimization approach for discrete space, i.e., a GDMC approach. GDMC utilizes local gradient information derived for the virtual continuous surface. The Metropolis criterion helps GDMC escape local minima. To demonstrate the efficiency of GDMC, protein sequence design and folding problems were studied for 2D HP lattice models, and the performance of the GDMC procedure for each case was described.
In protein sequence design, the energy function depends on the residue type at each lattice site. Subject to the constraint on the number of H residues in the sequence, the energy gradients with respect to continuously interchanging H and P at each site were calculated. Compared to the continuous optimization in Refs. 7 and 10 and to conventional MC, GDMC requires fewer calculations of real sequences to find the sequence with the lowest energy. In one trial run, multiple degenerate sequences with the same lowest energy were obtained.
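As an illustration of gradients with respect to a continuous H/P variable, the following Python sketch assumes a simple bilinear interpolation: each site i carries an H-character value lam[i] between 0 (P) and 1 (H), and the energy is minus the sum of lam[i]*lam[j] over the contacts of a fixed target conformation. This interpolation, the swap rule that keeps the number of H residues fixed, and the names `design_energy`, `design_gradient`, and `gdmc_swap` are assumptions made for the example, not the scheme used in this work.

```python
import math
import random

def design_energy(lam, contacts):
    """Interpolated HP energy for a fixed conformation (assumed bilinear form)."""
    return -sum(lam[i] * lam[j] for i, j in contacts)

def design_gradient(lam, contacts):
    """Analytical dE/dlam[i] of the bilinear energy above."""
    g = [0.0] * len(lam)
    for i, j in contacts:
        g[i] -= lam[j]
        g[j] -= lam[i]
    return g

def gdmc_swap(lam, contacts, temperature, rng):
    """Gradient-directed H/P swap that conserves the number of H residues."""
    g = design_gradient(lam, contacts)
    h_sites = [i for i, v in enumerate(lam) if v == 1]
    p_sites = [i for i, v in enumerate(lam) if v == 0]
    if not h_sites or not p_sites:
        return lam
    # P site where gaining H character is predicted to lower the energy most.
    to_h = min(p_sites, key=lambda i: g[i])
    # H site where losing H character is predicted to cost the least.
    to_p = max(h_sites, key=lambda i: g[i])
    trial = list(lam)
    trial[to_h], trial[to_p] = 1, 0
    # The Metropolis test uses the energies of the real (discrete) sequences.
    d_e = design_energy(trial, contacts) - design_energy(lam, contacts)
    if d_e <= 0 or rng.random() < math.exp(-d_e / temperature):
        return trial
    return lam
```

Because the gradient is first order, it ignores the coupling between the two swapped sites, so a proposed swap can leave the energy unchanged or raise it. The Metropolis test on the discrete energies decides whether such a move is kept.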
In protein folding, the SAW constraint was imposed, and numerical gradients were computed using local pull moves. Because the landscape for protein folding is rugged, the near-sighted local gradients in GDMC may trap the optimization in local minima. The Metropolis moves nevertheless allow GDMC to escape these minima and to find the lowest energies reported in the literature. Our results indicate that GDMC explores the conformation surface more efficiently than the MC, GA, MC-pull moves, and CI methods.
In conclusion, combining gradient information with MC moves is efficient for finding the globally optimal protein sequences and folds for this HP lattice model. When the virtual continuous surface is smooth (e.g., in protein sequence design), jumps between two sets of discrete variables guided by the local gradients are particularly efficient. Furthermore, the GDMC approach is general: it can be extended to any discrete optimization problem as long as gradients can be constructed from a reasonably continuous description of the discrete space. We also expect that the local gradients can be combined with a GA, giving a gradient-directed genetic algorithm (GDGA), to guide the construction of the offspring generations from the parents.
ACKNOWLEDGMENTS
Support from the University of Pittsburgh Center for Chemical Methodologies & Library Development is gratefully acknowledged (Grant No. 2P50GM067082). D.N.B. thanks the Keck and NIH Foundations for support of computational infrastructure. W.Y. acknowledges partial support from the National Science Foundation. We thank B. Christopher Rinderspacher, Hao Hu, and Bruce Donald for helpful discussions.
References
- Horst R., Pardalos P., and Thoai N., Introduction to Global Optimization, 2nd ed. (Kluwer, Dordrecht, 2000).
- Floudas C. A., Deterministic Global Optimization (Kluwer, Dordrecht, 2000).
- Spall J. C., Introduction to Stochastic Search and Optimization (Wiley-Interscience, New York, NY, 2003).
- Kirkpatrick S., Gelatt C., and Vecchi M., Science 220, 671 (1983). doi:10.1126/science.220.4598.671
- Goldberg D., Genetic Algorithms in Search, Optimization and Machine Learning (Addison-Wesley, Reading, MA, 1989).
- Oda A., Nagao H., Kitagawa Y., Shigeta Y., Shoji M., Nitta H., Okumura M., and Yamaguchi K., Int. J. Quantum Chem. 105, 645 (2005). doi:10.1002/qua.20665
- Koh S. K. and Ananthasuresh G. K., Int. J. Robot. Res. 24, 109 (2005). doi:10.1177/0278364905050354
- von Lilienfeld O. A., Lins R. D., and Rothlisberger U., Phys. Rev. Lett. 95, 153002 (2005). doi:10.1103/PhysRevLett.95.153002
- Wang M., Hu X., Beratan D. N., and Yang W., J. Am. Chem. Soc. 128, 3228 (2006). doi:10.1021/ja0572046
- Koh S. K., Ananthasuresh G. K., and Croke C., J. Mech. Des. 127, 728 (2005). doi:10.1115/1.1901705
- Xiao D., Yang W., and Beratan D. N., J. Chem. Phys. 129, 044106 (2008). doi:10.1063/1.2955756
- Hu X., Beratan D. N., and Yang W., J. Chem. Phys. 129, 064102 (2008). doi:10.1063/1.2958255
- Keinan S., Hu X., Beratan D. N., and Yang W., J. Phys. Chem. A 111, 176 (2007). doi:10.1021/jp0646168
- Snyman J. A., Practical Mathematical Optimization: An Introduction to Basic Optimization Theory and Classical and New Gradient-Based Algorithms (Springer, New York, 2005).
- Alberts B., Johnson A., Lewis J., Raff M., Roberts K., and Walter P., Molecular Biology of the Cell (Garland Science, New York and London, 2002).
- Anfinsen C., Biochem. J. 128, 737 (1972).
- Lau K. and Dill K., Macromolecules 22, 3986 (1989). doi:10.1021/ma00200a030
- Chan H. and Dill K., J. Chem. Phys. 95, 3775 (1991). doi:10.1063/1.460828
- Dill K., Bromberg S., Yue K., Fiebig K., Yee D., Thomas P., and Chan H., Protein Sci. 4, 561 (1995).
- Chan H. S., Proteins 40, 543 (2000).
- Chan H. S., Shimizu S., and Kaya H., Methods Enzymol. 380, 350 (2004). doi:10.1016/S0076-6879(04)80016-8
- Kaya H. and Chan H. S., Phys. Rev. Lett. 85, 4823 (2000). doi:10.1103/PhysRevLett.85.4823
- Berger B. and Leighton T., J. Comput. Biol. 5, 27 (1998). doi:10.1089/cmb.1998.5.27
- Crescenzi P., Goldman D., Papadimitriou C., Piccolboni A., and Yannakakis M., J. Comput. Biol. 5, 423 (1998). doi:10.1089/cmb.1998.5.423
- Yue K. and Dill K. A., Proc. Natl. Acad. Sci. U.S.A. 89, 4163 (1992). doi:10.1073/pnas.89.9.4163
- Butterfoss G. L. and Kuhlman B., Annu. Rev. Biophys. Biomol. Struct. 35, 49 (2006). doi:10.1146/annurev.biophys.35.040405.102046
- Toma L. and Toma S., Protein Sci. 5, 147 (1996).
- Domb C., Adv. Chem. Phys. 15, 229 (1969). doi:10.1002/9780470143605.ch13
- Lesh N., Mitzenmacher M., and Whitesides S., Annual Conference on Research in Computational Molecular Biology, 2003. (unpublished), pp. 188–195.
- Hsu H. P., Mehra V., Nadler W., and Grassberger P., J. Chem. Phys. 118, 444 (2003). doi:10.1063/1.1522710
- Unger R. and Moult J., J. Mol. Biol. 231, 75 (1993). doi:10.1006/jmbi.1993.1258
- Beutler T. C. and Dill K. A., Protein Sci. 5, 2037 (1996). doi:10.1002/pro.5560051010
- Chikenji G., Kikuchi M., and Iba Y., Phys. Rev. Lett. 83, 1886 (1999). doi:10.1103/PhysRevLett.83.1886
- Liang F. and Wong W. H., J. Chem. Phys. 115, 3374 (2001). doi:10.1063/1.1387478
- Huang W., Lu Z., and Shi H., Phys. Rev. E 72, 016704 (2005). doi:10.1103/PhysRevE.72.016704
- Kou S. C., Oh J., and Wong W. H., J. Chem. Phys. 124, 244903 (2006). doi:10.1063/1.2208607



