Coupling the folding of homologous proteins

Chen Keasar; Dror Tobi; Ron Elber; Jeff Skolnick

doi:10.1073/pnas.95.11.5880

1998 May 26;95(11):5880–5883. doi: 10.1073/pnas.95.11.5880

PMCID: PMC34490 PMID: 9600887

Abstract

The empirical observation that homologous proteins fold to similar structures is used to enhance the capabilities of an ab initio algorithm to predict protein conformations. A penalty function that forces homologous proteins to look alike is added to the potential and is employed in the coupled energy optimization of several homologous proteins. Significant improvement in the quality of the computed structures (as compared with the computational folding of a single protein) is demonstrated and discussed.

It is convenient to classify methods of predicting protein conformations into one of two main categories: (a) methods that optimize energy functions and (b) methods that search through databases of protein structures. In the present manuscript we call (a) “energy minimization methods” and (b) “homology.” The division is not sharp. For example, many of the energy functions that are used in the prediction of three-dimensional conformations of proteins include information extracted from databases of protein structures.

The conformation of the global energy minimum, even if we succeed in finding it, may differ from the native fold for two possible reasons: (1) the empirical energy is inaccurate, and (2) the native fold does not correspond to a global free energy minimum. To address point 1, an adjustment of the energy function may follow, whereas to address point 2 the folding pathways (and not only the most stable state) are required. We propose below a combination of the homology and the energy approaches. The combination improves the prediction of structures of homologous proteins even if their conformations do not correspond to a global minimum of the individual molecules.

In the homology approach, an empirical observation on databases of protein structures is employed: proteins with comparable sequences adopt a similar structure in the native configuration. This information is used to build models of unknown structures. The required degree of similarity between sequences is uncertain, but 40% sequence identity is a threshold with a significant safety margin. A model of a protein with an unknown structure can be built by using an experimental structure of a protein with a comparable sequence.
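The identity criterion can be made concrete with a minimal sketch. It assumes the two sequences are already aligned without gaps; the function names and the default threshold argument are illustrative choices, not part of the original work.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of aligned positions that carry the same residue (no gaps)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


def meets_identity_threshold(seq_a: str, seq_b: str, threshold: float = 40.0) -> bool:
    """True when the pair reaches the identity threshold (40% is the value quoted in the text)."""
    return percent_identity(seq_a, seq_b) >= threshold


# Toy four-residue example: three of four positions agree.
print(percent_identity("ACDE", "ACDF"))
print(meets_identity_threshold("ACDE", "WWWE"))
```

Real alignments with insertions and deletions would need a gap-aware definition of identity, which this sketch does not attempt.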

The homology protocol is the most accurate approach we have today to model protein structures on the computer. Its disadvantage is the necessity of having a similar sequence with a known structure.

In this manuscript, we describe a connection between the two approaches that improves the performance of energy optimization techniques while maintaining their generality. In the next section, we describe the algorithm, and an example for a “real” protein follows. Finally, we explain why the suggested coupling optimizes better than straightforward annealing. We suggest two reasons for the improvement. The first is smoothing of the potential energy surface, which makes it more accessible to global optimization. The second is averaging in sequence space (over the homologous proteins), which enhances weak signals.

The Algorithm.

We consider N homologous proteins with sufficient sequence identity to suggest structural similarity. The structure of the family is unknown, making the “energy minimization” approach the right choice to predict the structure (it is the only choice). The N sequences are aligned, using established sequence alignment techniques (1). Here, we assume that the alignment is adequate. The example discussed below did not include deletions or insertions of amino acids into the sequence. However, an extension that includes deletions and insertions is straightforward.

The energy of a homologous set of proteins.

An energy function, E_total, is defined that includes the sum of all the individual energies of the homologous proteins and a coupling term that penalizes structural diversity:

$$E_{\mathrm{total}} = \sum_{i=1}^{N}\Big[\varepsilon_i(X_i) + \sum_{j=1}^{N}\Delta_{ij}(X_i, X_j)\Big] \qquad [1]$$

Χ_i is the vector of coordinates for the i-th homologous protein, and ɛ_i(Χ_i) is its unique energy function. Δ_ij(Χ_i, Χ_j) is a function that measures and penalizes structural diversity between proteins i and j. The larger the difference between the two structures, the higher the value of the penalty function. In Eq. 1, we sum over the diversities of all pairs. Optimization of E_total provides a prediction for the structure of the family of the homologous proteins.
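A minimal sketch of how such a coupled energy could be evaluated follows. It reads E_total as the sum over proteins of ɛ_i(Χ_i) plus the summed pairwise penalty for protein i, so each pair is counted once from each side; that counting convention, the toy one-dimensional "structures", and the squared-distance penalty are all assumptions made for illustration.

```python
from typing import Callable, Sequence

Coords = Sequence[float]  # toy stand-in for the coordinate vector of one protein


def squared_distance(x: Coords, y: Coords) -> float:
    """Hypothetical penalty: grows as two structures move apart."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def diversity(i: int, coords: Sequence[Coords],
              penalty: Callable[[Coords, Coords], float]) -> float:
    """Summed penalty between structure i and every other structure."""
    return sum(penalty(coords[i], coords[j]) for j in range(len(coords)) if j != i)


def total_energy(energies: Sequence[Callable[[Coords], float]],
                 coords: Sequence[Coords],
                 penalty: Callable[[Coords, Coords], float]) -> float:
    """Sum of the individual energies plus the structural-diversity penalties."""
    return sum(eps(x) + diversity(i, coords, penalty)
               for i, (eps, x) in enumerate(zip(energies, coords)))


# Two toy "proteins" whose individual minima sit at 0 and at 1.
energies = [lambda x: sum(v * v for v in x),
            lambda x: sum((v - 1.0) ** 2 for v in x)]
print(total_energy(energies, [[0.0], [1.0]], squared_distance))
print(total_energy(energies, [[0.5], [0.5]], squared_distance))
```

In this toy case the shared structure at 0.5 has a lower coupled energy than the two separate minima, which is the kind of compromise the coupling term is meant to favor.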

Exploring conformations.

We used the Lattice Monte Carlo Program (LMCP) of Skolnick and Kolinski (2). The Monte Carlo procedure uses different moves on the lattice to modify a starting conformation of the protein chain. Each of the proteins is modified independently. A displacement δΧ_i is chosen according to the LMCP protocol, so that 〈δΧ_iδΧ_j〉_{i≠j} is zero (〈…〉 denotes an ensemble average). New protein energies ɛ_i(Χ_i + δΧ_i) (i = 1, … , N) are computed and supplemented by the i-th measure of structural diversity, Δ_i(Χ_i + δΧ_i) = ∑_{j=1}^{N} Δ_ij(Χ_i + δΧ_i, Χ_j). The displacement δΧ_i is then accepted or rejected according to the usual Monte Carlo criterion with the energy ɛ_i(Χ_i) + Δ_i(Χ_i). The Monte Carlo test is repeated for all the homologous structures {i = 1, … , N}.
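The accept/reject step can be sketched as follows. This is not LMCP: the chains are visited one after another rather than in parallel, the temperature is in energy units (k_B = 1), and the `propose`, `energies`, and `penalty` callables are hypothetical placeholders supplied by the caller.

```python
import math
import random


def metropolis_accept(old_energy, new_energy, temperature, rng):
    """Usual Metropolis test: downhill always, uphill with Boltzmann probability."""
    if new_energy <= old_energy:
        return True
    return rng.random() < math.exp(-(new_energy - old_energy) / temperature)


def coupled_sweep(coords, energies, penalty, temperature, propose, rng):
    """One pass over all chains; modifies and returns the list of coordinates."""
    for i in range(len(coords)):
        others = [coords[j] for j in range(len(coords)) if j != i]

        def local_energy(x):
            # Own energy plus the diversity penalty against all other chains.
            return energies[i](x) + sum(penalty(x, y) for y in others)

        # The trial move is generated from chain i alone; the penalty enters
        # only through the acceptance test.
        trial = propose(coords[i], rng)
        if metropolis_accept(local_energy(coords[i]), local_energy(trial),
                             temperature, rng):
            coords[i] = trial
    return coords


# Toy usage: two one-dimensional "chains" in identical quadratic wells.
rng = random.Random(0)
state = [0.0, 2.0]
for _ in range(100):
    coupled_sweep(state, [lambda x: x * x] * 2, lambda x, y: (x - y) ** 2,
                  0.1, lambda x, r: x + r.uniform(-0.5, 0.5), rng)
print(state)
```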

The generation of δΧ_i (but not the decision on its acceptance) depends only on Χ_i and does not take into account the penalty function on structural diversity. This protocol may lead to a large number of step rejections. However, the above choice is the simplest to use in a parallel computing environment, and it further leaves some room for future publications.

The parallel environment is important because the computations are pursued typically on a cluster of workstations or on a parallel computer with multiple CPUs. Each of the homologous proteins is assigned to one processor. The processor calculates the displacement δΧ_i and the energy ɛ_i(Χ_i) + Δ_i(Χ_i) and decides whether to accept or to reject the move. The correlation with other structures is built using the function Δ_i(Χ_i) that depends on all the other coordinates. The conformations of the set are therefore sampled from the canonical ensemble with an “energy”, E_total.

To compute the penalty function Δ_i(Χ_i), it is necessary to have the coordinates of all the proteins on each of the processors. The update of the coordinates can be done with every Monte Carlo step. However, to reduce communication overhead, we usually update the coordinates only after a few Monte Carlo steps.
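The effect of infrequent coordinate updates can be imitated in a serial sketch in which every chain evaluates its penalty against a snapshot of the other chains that is refreshed only every `sync_every` steps. The function name, the one-dimensional coordinates, and the uniform trial move are assumptions for illustration; with a stale snapshot the sampling of the coupled ensemble is only approximate.

```python
import math
import random


def run_with_delayed_sync(coords, energy, penalty, temperature,
                          n_steps, sync_every, seed=0):
    """Coupled Metropolis sampling where other chains are seen through a snapshot."""
    rng = random.Random(seed)
    coords = list(coords)
    snapshot = list(coords)  # what each "processor" believes the others look like
    n_syncs = 0
    for step in range(n_steps):
        if step % sync_every == 0:
            snapshot = list(coords)  # stands in for the communication step
            n_syncs += 1
        for i in range(len(coords)):
            others = [snapshot[j] for j in range(len(coords)) if j != i]

            def local_energy(x):
                return energy(x) + sum(penalty(x, y) for y in others)

            trial = coords[i] + rng.uniform(-0.5, 0.5)
            delta = local_energy(trial) - local_energy(coords[i])
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                coords[i] = trial
    return coords, n_syncs


final, syncs = run_with_delayed_sync([-2.0, 0.0, 2.0], lambda x: x * x,
                                     lambda x, y: (x - y) ** 2, 0.1,
                                     n_steps=200, sync_every=5)
print(final, syncs)
```

A larger `sync_every` means fewer exchanges of coordinates at the price of penalties computed from older structures.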

A related algorithm can be easily formulated for molecular dynamics by solving Newton’s equations of motion:

$$M_i \frac{d^{2}X_i}{dt^{2}} = -\frac{\partial E_{\mathrm{total}}}{\partial X_i}$$

where M_i is the mass matrix for protein i. Nevertheless, the lattice approach offers a number of unique advantages for the protein folding problem, which are discussed elsewhere (3).

We now return to the functional form of the penalty on structural variations. We experiment with two measures:

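(a) Root mean square difference (RMSD) of the shared C_α coordinates after optimal overlap (4):

$$\mathrm{RMSD}(X_i, X_j) = \sqrt{\frac{1}{n}\sum_{k=1}^{n}\big\lvert r_k^{(i)} - r_k^{(j)}\big\rvert^{2}}$$

where r_k^{(i)} is the position of the k-th shared C_α atom of protein i after optimal superposition and n is the number of shared atoms.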

(b) L is the fraction of dissimilar contacts in the maps of the two structures. L = (number of dissimilar contacts)/(total number of contacts in the two maps). (Two residues are considered to be in contact if their C_α distance is ≤6.5 Å.)
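Measure (b) can be sketched in a few lines. Two choices here are assumptions rather than details taken from the text: the "total number of contacts in the two maps" is read as the sum of the two contact counts, and sequence neighbors are not excluded from the maps.

```python
import math
from itertools import combinations


def contact_map(ca_coords, cutoff=6.5):
    """Set of residue pairs (i, j), i < j, whose C-alpha distance is <= cutoff (Å)."""
    return {(i, j)
            for (i, a), (j, b) in combinations(enumerate(ca_coords), 2)
            if math.dist(a, b) <= cutoff}


def contact_dissimilarity(coords_a, coords_b, cutoff=6.5):
    """L = (contacts present in only one map) / (contacts in map A + contacts in map B)."""
    map_a = contact_map(coords_a, cutoff)
    map_b = contact_map(coords_b, cutoff)
    total = len(map_a) + len(map_b)
    if total == 0:
        return 0.0  # two maps without contacts have no dissimilar contacts
    return len(map_a ^ map_b) / total


# Three-residue toy chains: one straight, one bent so that residues 0 and 2 touch.
straight = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
bent = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
print(contact_dissimilarity(straight, bent))
```

Note that two fully extended chains with no contacts at all also score L = 0, regardless of how different their shapes are.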

The RMSD is a common and useful measure of global similarity. However, it does poorly in detecting similar folds of structural segments. For example, if the secondary structure elements are predicted correctly but their packing is incorrect, the RMSD is typically high. In contrast, L, which is not as widely used as the RMSD, detects local similarities and shows a more uniform decrease in value as the structure quality decreases.

Both functions are useful in comparing the final structure to the native fold; however, the task of forcing the different chains to look alike is best done with the RMSD. The application of L is problematic because maps with no contacts at all have (of course) no dissimilar contacts. As a result, restraining the structures to similar L values pushes the system to unfolded swollen states. We therefore used the RMSD. The specific functional form of Δ_ij(Χ_i, Χ_j) is listed below:

Simulation Protocol and Results.

We provide a numerical example for a family of pancreatic hormones (5). In Table 1, we list the seven sequences that were used in the runs with coupled optimizations (6).

Table 1.

The seven coupled sequences that were used in the present work

PDB	Swiss-Prot entry	Sequence
1ppt	paho_chick	`GPSQPTYPGDDAPVEDLIRFYDNLQQYLNVVTRHRY`
	paho_rante	`APSEPHHPGDQATQDQLAQYYSDLYQYITFVTRPRF`
	pyy_myosc	`YPPQPESPGGNASPEDWAKYHAAVRHYVNLITRQRY`
	neuy_carau	`YPTKPDNPGEGAPAEELAKYYSALRHYINLITRQRY`
	paho_rat	`YPTKPDNPGEGAPAEELAKYYSALRHYINLITRQRY`
	paho_erieu	`VPLEPVYPGDNATPEQMAHYAAELRRYINMLTRPRY`
	pyy_pig	`YPAKPEAPGEDASPEELSRYYASLRHYLNLVTRQRY`

PDB, protein data bank.

We performed 100 Monte Carlo simulated annealing runs of the protein 1ppt and 142 coupled runs of structures of seven sequences, which were optimized in parallel. Only one experimental structure (of 1ppt) is available, and we compared the results of the computations with it. In Fig. 1, we show the energies of the 100 Monte Carlo runs of 1ppt as a function of the RMSD from the native structure. Also shown are the energies of the 142 coupled runs.

Figure 1. Comparison between coupled and uncoupled Monte Carlo runs. X, uncoupled runs; black circles, coupled runs. Each point is the final configuration of a simulated annealing run. Note that the coupled runs end more frequently at lower energies and lower RMSD values.

The runs that employed seven coupled proteins clearly cluster near lower RMSD values and therefore provide better predictions. The lowest energy structures of the coupled and the uncoupled runs (our best guesses of the native conformation) are shown in Fig. 2. Again, the coupled runs provide a better answer. The improvement does not require an increase in computational effort: each of the uncoupled 1ppt runs was seven times longer than the run of the seven sequences.

Figure 2. Comparing the native structure (a) with the lowest energy structures of the coupled (b) and the uncoupled (c) runs.

Another example for a protein family (homeodomain) can be found in ref. 6. Yet another study employed coupling in a two-dimensional lattice (7) and showed an even more profound improvement.

DISCUSSION

Here we discuss the question of “why.” Why does the proposed algorithm improve structure prediction? We have seen one example, and other examples are available in the literature (6, 7). From a global optimization perspective, it may be surprising that optimizing a system N times larger (N homologous sequences) is easier than optimizing one sequence at a time. In the limit in which the optimizations are completely independent, they should take approximately N times longer.

Obviously, the coupling plays an important role in increasing the computation efficiency and accuracy. To understand the effect, it is useful to consider a simpler system first in which the “homologous” proteins are all of the same molecule. Hence, only structural diversity remains. E_total is now E_total = ∑_i=1^N [ɛ(Χ_i) + Δ_i]. A single energy function [ɛ(Χ_i)] is used for the different conformations of the proteins {i = 1, … , N}.

In Fig. 3, we compare annealing results with coupled and uncoupled energy functions for the protein 1fsd. The distribution clearly shows that better energies are obtained when the coupling (of identical proteins) is employed. Hence, a better optimization protocol is obtained even without sequence diversity. However, it is important to emphasize that the quality of the structures (as opposed to the energies) is not necessarily better because it depends also on the quality of the energy function.

Figure 3. Comparing the optimized energies from multiple simulated annealing runs for coupled and uncoupled simulations. Dark bars, coupled runs; light bars, uncoupled runs.

The Monte Carlo procedure produces conformations that are sampled from the canonical ensemble. The weight of a coupled state Χ₁, … , Χ_N at a temperature T is given by $\exp\{-\sum_{i=1}^{N}[\varepsilon(X_i) + \Delta_i]/k_BT\} \equiv \exp(-E_{\mathrm{total}}/k_BT)$. Note that Δ_i depends on the coordinates of the rest of the copies and that, without the coupling, we obtain the classical Boltzmann factor for a set of N noninteracting copies, $\exp\{-\sum_{i=1}^{N}\varepsilon(X_i)/k_BT\}$.

Consider now the discrete formula for the quantum path integral of a system with a potential energy ɛ(Χ):

$$\exp\Big\{-\sum_{i=1}^{N}\Big[\frac{\varepsilon(X_i)}{k_BT} + (X_i - X_{i+1})\,\frac{mNk_BT}{2\hbar^{2}}\,(X_i - X_{i+1})\Big]\Big\}$$

where m is the mass matrix and ℏ is the Planck constant divided by 2π (8). For convenience, we define

and we also set N + 1 ≡ 1. The new “coupled” energy E_coupled resembles

The quantum expression couples only pairs of nearest neighbor structures because the coupling corresponds to a physical entity, the kinetic energy. Nevertheless, if λ is sufficiently large (the protein is close to being “classical”), the different structures will remain similar at each sampling point, which is exactly what we wanted to achieve in the folding of homologous proteins. The similarity between E_total and E_coupled is therefore self-evident and hints at the origin of the enhanced optimization, as discussed below.
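The correspondence can be written out with a short manipulation of the path-integral weight. The following is a sketch under an assumption: λ is taken to be the prefactor obtained when 1/k_BT is factored out of the discrete path-integral exponent, with the cyclic convention N + 1 ≡ 1; it is one reading consistent with the surrounding text, not a confirmed definition.

```latex
\begin{align}
\exp\Big\{-\sum_{i=1}^{N}\Big[\frac{\varepsilon(X_i)}{k_BT}
   + (X_i - X_{i+1})\,\frac{mNk_BT}{2\hbar^{2}}\,(X_i - X_{i+1})\Big]\Big\}
 &= \exp\Big\{-\frac{E_{\mathrm{coupled}}}{k_BT}\Big\}, \\
E_{\mathrm{coupled}}
 &= \sum_{i=1}^{N}\Big[\varepsilon(X_i) + (X_i - X_{i+1})\,\lambda\,(X_i - X_{i+1})\Big], \\
\lambda &= \frac{mN(k_BT)^{2}}{2\hbar^{2}}, \qquad X_{N+1} \equiv X_1 .
\end{align}
```

With this choice, a large mass or a high temperature gives a large λ, which matches the remark that a strongly coupled system behaves classically, and the quadratic term plays the role of the pairwise penalty in E_total restricted to nearest neighbors.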

Expressions that are related to the above quantum expression were investigated in the global optimization field (9). The key idea in a number of pioneering approaches was to define a new energy function. The new energy function is a local spatial average with different choices of densities: E_average(Χ) = ∫E(Χ₀)ρ(Χ, Χ₀)dΧ₀, where ρ(Χ, Χ₀) is the density. The result, E_average(Χ), is a smoother potential that is easier to minimize (Fig. 4), as was shown by numerous examples (9–13). Examples of smoothing densities include (but are not limited to) Gaussians (9–11), square boxes (12), and a discrete sample of points (13). For example, the above quantum expression is an average of ɛ(Χ) over a discrete number of sampling points {Χ₁, … , Χ_N}. The discrete averaging has advantages and disadvantages. An advantage is its simplicity. We do not have to perform complex integrals to obtain the average. The smoothing is done by a direct sum over the sampling points. However, because the number of points is small, the averaging is less effective compared with other methods that use analytical densities and integrals. Nevertheless, discrete smoothing is very suggestive for lattice calculations of the type that we present here.
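The effect of discrete averaging can be illustrated on a one-dimensional toy potential. Everything in this sketch is hypothetical: the rippled potential, the fixed equally spaced sampling offsets (the coupled copies in the text move freely instead), and the count of grid minima used as a crude measure of ruggedness.

```python
import math


def rugged(x):
    """Toy potential: a broad well at x = 0 decorated with short-wavelength ripples."""
    return x * x + 2.0 * math.sin(10.0 * x)


def smoothed(potential, x, n_points=9, width=0.6):
    """Average of the potential over n_points equally spaced samples centred on x."""
    offsets = [width * (k / (n_points - 1) - 0.5) for k in range(n_points)]
    return sum(potential(x + d) for d in offsets) / n_points


def count_local_minima(values):
    """Number of interior grid points lower than both of their neighbours."""
    return sum(1 for a, b, c in zip(values, values[1:], values[2:])
               if b < a and b < c)


grid = [-3.0 + 0.01 * k for k in range(601)]
n_rugged = count_local_minima([rugged(x) for x in grid])
n_smooth = count_local_minima([smoothed(rugged, x) for x in grid])
print(n_rugged, n_smooth)
```

Averaging over a window comparable to the ripple wavelength leaves the broad well in place while removing most of the small traps, which is the sense in which the averaged surface is easier to minimize.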

Figure 4. A schematic drawing of potential smoothing using discrete, local averaging: 〈V(Χ)〉 = (1/N) ∑_{i=1}^{N} V(Χ_i)·δ_{Χ,Χ_i}.

To conclude, one of the reasons that the proposed protocol improves structure prediction is spatial potential averaging. This improvement is in the spirit of a number of recent global optimization procedures (9).

Another important feature of the present protocol is averaging in sequence space. We return to the N different sequences. Each of the proteins has a unique energy surface. By virtue of experimental observations, we know that all the homologous proteins share similar structures at their native fold. At approximately the same coordinate Χ_native, we expect all the proteins to be in an energy minimum. Therefore, all of the energies are correlated and the sum ∑_{i=1}^{N} ɛ_i(Χ_native) ≈ N〈ɛ〉 is a quantity that increases linearly with N.

On the other hand, for unfolded states the energies of the different homologous structures are not necessarily correlated. Consider, for example, a correlated mutation at the hydrophobic core of the protein. To maintain the compactness of the hydrophobic core of the native state, valine and tryptophan may replace a pair of phenylalanines. At unfolded conformations, there is no reason to assume that the contacts and the energies of the above residues are still correlated; it is more likely that they are not. We therefore estimate ∑_{i=1}^{N} ɛ_i(Χ_unfolded) ≈ N^{1/2}〈ɛ′〉. This estimate is in the spirit of the Random Energy Model as applied to proteins (14).
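The two scaling estimates can be checked numerically with a small sketch. It idealizes the argument: at the shared native minimum every sequence contributes the same well depth, whereas at an unfolded conformation each sequence contributes an independent zero-mean Gaussian energy. Both the function names and these distributions are assumptions for illustration.

```python
import math
import random


def native_sum(n_sequences, well_depth=-1.0):
    """Correlated energies at the shared minimum add coherently: N times the depth."""
    return n_sequences * well_depth


def unfolded_rms_sum(n_sequences, sigma=1.0, trials=2000, seed=0):
    """Root-mean-square of the summed energy when the N contributions are independent."""
    rng = random.Random(seed)
    total_sq = 0.0
    for _ in range(trials):
        s = sum(rng.gauss(0.0, sigma) for _ in range(n_sequences))
        total_sq += s * s
    return math.sqrt(total_sq / trials)


for n in (1, 4, 16, 64):
    # The ratio of the two magnitudes is expected to grow roughly as sqrt(N).
    print(n, abs(native_sum(n)), round(unfolded_rms_sum(n), 2))
```

Under these assumptions the native contribution grows linearly with the number of sequences while the typical unfolded contribution grows only as its square root, so the shared minimum deepens relative to the rest of the surface as more homologous sequences are coupled.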

The new energy surface E_total is therefore distorted in a favorable way compared with the original ɛ_i. The shared minimum (whose existence is supported by databases of protein structures) is deeper compared with other portions of the energy surface. The enhancement of the well depth of the shared minimum may make it the global energy minimum of the new average energy even if originally it was not. This enhancement suggests that the new protocol may be effective for kinetically stable proteins.

It is the combination of the spatial and sequence averaging that provides the significant improvement in structure prediction of ab initio algorithms discussed above.

Acknowledgments

This research was supported by a Binational Science Foundation grant (to R.E. and J.S.) and by National Institutes of Health Grant GM37408 (to J.S.).

ABBREVIATIONS

LMCP: Lattice Monte Carlo program
RMSD: root mean square difference

References

1. Sander C, Schneider R. Proteins. 1991;9:56–68. doi: 10.1002/prot.340090107.
2. Kolinski A, Skolnick J. Proteins. 1994;18:338–352. doi: 10.1002/prot.340180405.
3. Kolinski A, Skolnick J. Lattice Models of Protein Folding, Dynamics and Thermodynamics. London: Chapman & Hall; 1996.
4. Kabsch W. Acta Crystallogr A. 1976;32:922–923.
5. Glover I, Blundell T. Biopolymers. 1983;22:293–304. doi: 10.1002/bip.360220138.
6. Keasar C, Elber R, Skolnick J. Fold Des. 1997;2:247–259. doi: 10.1016/S1359-0278(97)00033-3.
7. Keasar C, Elber R. J Phys Chem. 1995;99:11550–11556.
8. Feynman R P. Statistical Mechanics: A Set of Lectures. Reading, MA: Benjamin/Cummings; 1982.
9. Straub J E. In: Optimization Techniques with Applications to Proteins in Recent Developments in Theoretical Studies of Proteins. Elber R, editor. Singapore: World Scientific; 1996. pp. 137–197.
10. Piela L, Kostrowicki J, Scheraga H A. J Phys Chem. 1989;93:4024–4035.
11. Shalloway D. Global Optimizat. 1992;2:281.
12. Andricioaei I, Straub J E. Comput Phys. 1996;10:449–454.
13. Roitberg A, Elber R. J Chem Phys. 1991;95:9277–9287.
14. Bryngelson J D, Wolynes P G. J Phys Chem. 1989;93:6902–6915.
