The Journal of Chemical Physics. 2008 Sep 2;129(9):094101. doi: 10.1063/1.2968605

Generating properly weighted ensemble of conformations of proteins from sparse or indirect distance constraints

Ming Lin 1, Hsiao-Mei Lu 2, Rong Chen 3, Jie Liang 2,a)
PMCID: PMC2640457  NIHMSID: NIHMS89558  PMID: 19044859

Abstract

Inferring three-dimensional structural information of biomacromolecules such as proteins from limited experimental data is an important and challenging task. Nuclear Overhauser effect measurements based on nuclear magnetic resonance, disulfide linking, and electron paramagnetic resonance labeling studies can all provide useful partial distance constraints characteristic of the conformations of proteins. In this study, we describe a general approach for reconstructing conformations of biomolecules that are consistent with given distance constraints. Such constraints can be in the form of upper bounds and lower bounds of distances between residue pairs, contact maps based on specific contact distance cutoff values, or indirect distance constraints such as experimental ϕ-value measurements. Our approach is based on the framework of the sequential Monte Carlo method, a chain growth-based method. We have developed a novel growth potential function to guide the generation of conformations that satisfy given distance constraints. This potential function incorporates not only the distance information of the current residue during growth but also the distance information of future residues by introducing global distance upper bounds between residue pairs and the placement of reference points. To obtain protein conformations from indirect distance constraints in the form of experimental ϕ-values, we first generate properly weighted contact maps satisfying the ϕ-value constraints and then generate conformations from these contact maps. We show that our approach can faithfully generate conformations that satisfy the given constraints, which approach the native structures when distance constraints for all residue pairs are given.

INTRODUCTION

Three-dimensional structures of biomacromolecules (such as protein, DNA, and RNA molecules) are essential for understanding their biological functions. The primary sources of structural information of biomacromolecules are x-ray diffractions1 and nuclear magnetic resonance (NMR) experiments.2, 3 In NMR studies, the assessment of chemical shifts of atomic nuclei with spins can provide information about distances between specific pairs of atoms. In addition, a number of biochemical techniques such as disulfide linking4, 5 and electron paramagnetic resonance labeling6, 7 can also provide useful partial information about distances between residues. When accurate values of distances between residue pairs are available, they can be represented by a distance matrix, where an entry (i,j) of the matrix denotes the distance between the corresponding two residues i and j. In other cases, the distance information is inexact, but can be represented by a contact map, which denotes whether the distance of each residue pair is below or above a specified distance threshold.

With distance information at various levels of detail, a challenging problem is to integrate all the distance constraints into global information about the properties of ensemble structures of the biomacromolecule.2, 8, 9 When the distance constraints are complete and accurate, the positions of all residues can be obtained by carrying out a singular value decomposition calculation.10, 11 When the distance constraints are incomplete or inaccurate, one needs to solve an optimization problem and find the structures that best reproduce these distance constraints.9, 12, 13 If the molecule is large, this is a very challenging problem.

Several previous studies address the problem of generating protein conformations from contact maps.14, 15 These approaches can be expanded to generate conformations when only indirect or implicit distance constraints are available. In Ref. 16, Vendruscolo et al. generated the conformations of the transition state ensemble (TSE) important in protein folding studies. In this case, no explicit distance constraints between residue pairs are given. Rather, indirect information in the form of ϕ-value constraints is known for a subset of residues. Here the ϕ-value at a residue measured experimentally is interpreted as the ratio of the average number of native contacts formed by the residue in the transition state conformations to the number of contacts formed in the native structure of ground state.

In this paper, we focus on protein structures and develop a general method to obtain ensembles of conformations that satisfy distance constraints given in the form of an incomplete set of distance bounds, a set of binary conditions stating whether each distance is below or above a specific threshold, or the indirect form of experimentally measured ϕ-values. Our approach is based on the framework of the sequential Monte Carlo (SMC) method, a growth-based method in which residues are added to an existing partial chain one by one until a conformation of full length is obtained.17, 18, 19 In addition to generating structures, we can also estimate important physical properties of molecular ensembles. Because the probabilities of growing viable conformation samples become exceedingly small under strong distance constraints and the self-avoiding requirement, an efficient sampling strategy is critical for obtaining full chain conformations consistent with all distance constraints.

This paper is organized as follows. In Sec. 2, we introduce a cubic lattice model and the incomplete and indirect distance constraints we work with. We then discuss the general approach of SMC method and a new growth potential function. Results are presented in Sec. 3.

MODEL AND METHOD

Protein model and distance constraints

Three-dimensional model for proteins

We use a three-dimensional cubic lattice model to represent a protein conformation. The sides of a cubic cell have unit length γ=1.3 Å. A length-n protein conformation is represented by a connected chain xn=(x1,x2,…,xn), where the ith Cα atom of the conformation is located at site xi=(xi1,xi2,xi3) on the cubic lattice.18, 20, 21

For protein molecules, the locations of Cα atoms satisfy certain constraints. We assume the Cα atoms in our model can only be placed on the lattice sites with the following constraints. First, the Euclidean distance between neighboring residues xi−1 and xi is between 3.5 and 4.1 Å. Second, the direction of the vector xi−xi−1 must be within 30° of four canonical directional vectors, which are specifically determined by the residue type of residue i and the locations xi−1, xi−2, and xi−3. These canonical vectors are derived from a discrete four-state off-lattice model of proteins, which gives four possible locations of xi for monomer i given the locations of xi−1, xi−2, and xi−3, and the type of residue i−1. Details of obtaining optimal canonical directional vectors are given in Ref. 22. Third, we enforce the self-avoiding constraint. Specifically, non-neighboring residues are not permitted to be closer than 3.5 Å, which is the smallest distance between non-neighboring residues observed in 646 representative proteins from the Protein Data Bank (PDB).

On average, there are about 23 candidate positions for placing xi in our model, although the exact number depends on the different relative positions of xi−3, xi−2, and xi−1, as well as the type of the (i−1)th residue. Figure 1 provides an illustration of this lattice model.
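The first two placement rules can be illustrated with a minimal Python sketch. It enumerates integer lattice offsets whose length falls in the 3.5 to 4.1 Å window for a 1.3 Å cell, and tests an offset against a direction with the 30° tolerance. The function names and the example direction are hypothetical; the residue-dependent canonical vectors of Ref. 22 are not reproduced here.

```python
import itertools
import math

CELL = 1.3                 # lattice unit length in Å (value from the text)
D_MIN, D_MAX = 3.5, 4.1    # allowed distance between sequence neighbors, in Å

def bond_vectors(cell=CELL, d_min=D_MIN, d_max=D_MAX):
    """Return all integer lattice offsets whose length lies in [d_min, d_max]."""
    reach = int(math.floor(d_max / cell))
    offsets = []
    for v in itertools.product(range(-reach, reach + 1), repeat=3):
        length = cell * math.sqrt(sum(c * c for c in v))
        if d_min <= length <= d_max:
            offsets.append(v)
    return offsets

def within_angle(v, direction, max_deg=30.0):
    """True if offset v lies within max_deg degrees of `direction`.

    `direction` stands in for one canonical vector; it is an illustrative
    input, not a value taken from the paper.
    """
    dot = sum(a * b for a, b in zip(v, direction))
    norm = math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in direction))
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle)) <= max_deg
```

A candidate site for residue i would then be any offset from `bond_vectors()` that passes `within_angle` for at least one of the four canonical directions and does not violate self-avoidance.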

Figure 1. Illustration of the cubic lattice model. We have set the cell unit length to 1.8 Å instead of 1.3 Å here for clarity. Given the locations of xi−3, xi−2, and xi−1, there are 21 positions (marked by “◻” and “◯”) satisfying the first distance condition. Only nine positions (marked by “◯”) among them also satisfy the second direction condition. These positions all satisfy the third self-avoidance condition.

Direct distance constraints

Distance constraints for protein chain xn=(x1,…,xn)∊R3 are written in the following general form as

$$l_{ij} \le \|x_i - x_j\| \le u_{ij} \quad \text{for all } (i,j) \in D,$$

where $\|x_i - x_j\|$ is the Euclidean distance between residues $i$ and $j$, and $l_{ij}$ and $u_{ij}$ are the lower and upper bounds of the distance between residues $i$ and $j$. Here $l_{ij}$ can be 0 and $u_{ij}$ can be $+\infty$ if only an upper bound or only a lower bound is available, respectively. $D$ is a given set of residue pairs $(i,j)$ for which such constraints are known. $D$ is often a much smaller subset of the complete set of all residue pairs. The problem of determining the conformation $\mathbf{x}_n=(x_1,\ldots,x_n)$ according to such distance constraints has been studied before.8, 23 In this paper, we focus on constraints on distances between Cα atoms, and we only consider the structure of the Cα chain. The general principle can be applied to other types of distance constraints, and side chain repacking methods can be used to generate more detailed protein structures.24, 25

Indirect distance constraints and experimentally measured ϕ-values

An important class of studies on protein folding is to characterize the properties of the TSE. TSE represents the structures around the saddle point of the potential energy surface, and these structures are often followed by a large structural change in protein unfolding process.26, 27, 28, 29

It is challenging to characterize TSE due to the complexity of the folding and unfolding processes. Experimental research on this problem focuses on the measurement of ϕ-value at individual residue position, defined as the ratio of change in stability to the transition state upon mutation versus the change to the native folded state.27, 28, 29, 30, 31

ϕ-value provides information about the native likeness of TSE.32, 33 For example, if ϕi, the ϕ-value at residue i, is close to 1, the transition state is thought to have almost the same structure at residue i as the native state. If ϕi is close to 0, the transition state is likely to be in the denatured state in this region. An important question therefore is to translate ϕ-value measurements into explicit conformational information of protein structures in the TSE.16, 31, 34

Let $\phi_i^{\mathrm{exp}}$ be the experimentally measured ϕ-value at residue $i$. Based on experimental observations, it is reasonable to assume that changes in protein stability are proportional to the change in the number of contacts in a protein structure.35 Based on this assumption, the calculated ϕ-value $\phi_i^{\mathrm{calc}}$, which relates to the protein structure, is defined as $\phi_i^{\mathrm{calc}} = C_i^{\mathrm{TSE}} / N_i^{N}$, where $C_i^{\mathrm{TSE}}$ is the average number of contacts formed by residue $i$ in the TSE and $N_i^{N}$ is the number of contacts formed by residue $i$ in the native structure. In studies based on molecular dynamics simulations, Li and Daggett31 showed that $\phi_i^{\mathrm{calc}}$ is in good agreement with $\phi_i^{\mathrm{exp}}$. Vendruscolo et al.16 further introduced a different definition of $\phi_i^{\mathrm{calc}}$,

$$\phi_i^{\mathrm{calc}} = \frac{N_i^{\mathrm{TSE}}}{N_i^{N}}, \quad (1)$$

where NiTSE is taken as the average number of native contacts instead of all contacts of residue i in the TSE. In this case, the TSE is defined as the set of conformations with ϕicalc very close to the corresponding experimental measured ϕiexp at all positions. An important question is therefore how to obtain explicit conformations of TSE that satisfy these indirect distance constraints of ϕ-values. A number of studies have shown promising results.16, 34

Generating conformations with various distance constraints using SMC

In general, one can aim to obtain conformations that are at the global minimum of an error function measuring deviation in distance from the lower bounds and upper bounds of the distance constraints,

$$E(\mathbf{x}_n) = \sum_{(i,j) \in D} \left[ \max{}^2\{l_{ij} - \|x_i - x_j\|, 0\} + \max{}^2\{\|x_i - x_j\| - u_{ij}, 0\} \right], \quad (2)$$

in which the distance constraints are incomplete and inaccurate.9, 12 Our goal is to generate a set of conformations satisfying all distance constraints and following a certain target distribution $\pi(\mathbf{x}_n)$, for example, the uniform distribution of all feasible conformations satisfying the distance constraints, or the Boltzmann distribution associated with an energy function. If the true energy function were known, it could be used to estimate the thermodynamic properties of the ensemble of protein conformations following the Boltzmann distribution. In reality, one can approximate the unknown true energy function with various empirically derived energy functions, such as the Miyazawa–Jernigan potential function,36 the geometric potential,37, 38 and many other potential functions as reviewed in Ref. 39.
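As an illustration of the error function in Eq. 2, the following Python sketch sums the squared violations of the lower and upper bounds over the constrained pairs. The data layout (a list of coordinate tuples and a dictionary of bounds) is an assumption made for the example, not a format prescribed by the paper.

```python
import math

def constraint_error(coords, constraints):
    """Squared violation of distance bounds, summed over constrained pairs.

    coords: list of (x, y, z) tuples, one per residue (0-based indices).
    constraints: dict mapping (i, j) -> (lower, upper); use 0.0 or math.inf
                 when one of the two bounds is unavailable.
    """
    total = 0.0
    for (i, j), (lower, upper) in constraints.items():
        d = math.dist(coords[i], coords[j])
        # Each term is zero when the distance lies inside [lower, upper].
        total += max(lower - d, 0.0) ** 2 + max(d - upper, 0.0) ** 2
    return total
```

A conformation that satisfies every bound has error zero, so the error can serve directly as an energy-like quantity when defining a target distribution.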

Since our goal here is to minimize the error function $E(\mathbf{x}_n)$, instead of approximating the true energy function, we can set the energy function to be proportional to the error function [Eq. 2]. More details of the target distributions we use are described in Sec. 3. Let $\mathbf{x}_t=(x_1,\ldots,x_t)$ be the vector of the positions of residues from 1 up to $t$. We recursively place residue $t$ at position $x_t$ following a trial distribution $g_t(x_t \mid \mathbf{x}_{t-1})$. The trial distribution proposes possible positions with different probabilities for residue $t$, given the positions $x_1,\ldots,x_{t-1}$ of residues 1 to $t-1$. The joint trial distribution for a chain with $t$ residues at positions $x_1,\ldots,x_t$ is given by

$$g_t(\mathbf{x}_t) = g_1(x_1)\, g_2(x_2 \mid \mathbf{x}_1) \cdots g_t(x_t \mid \mathbf{x}_{t-1}).$$

Following the principle of importance sampling,40, 41, 42 the design of the trial distribution can accommodate different types of bias, which allows great flexibility for improving sampling efficiency. However, each final sample of full length chain xn needs to be weighted to remove the bias so the original target distribution π(xn) can be recovered. Specifically, we assign a weight

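$$w(\mathbf{x}_n^{(j)}) = \frac{\pi(\mathbf{x}_n^{(j)})}{g_n(\mathbf{x}_n^{(j)})}$$

to each conformation sample $\mathbf{x}_n^{(j)}$, $j=1,\ldots,m$, where $g_n(\mathbf{x}_n)$ is the trial distribution of the full chain. Then the expected value of a physical property represented by a function $h(\mathbf{x}_n)$ of conformation $\mathbf{x}_n$ following the target distribution $\pi(\mathbf{x}_n)$ can be estimated by

$$E_\pi(h(\mathbf{x}_n)) \approx \frac{\sum_{j=1}^{m} w(\mathbf{x}_n^{(j)})\, h(\mathbf{x}_n^{(j)})}{\sum_{j=1}^{m} w(\mathbf{x}_n^{(j)})}.$$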
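The weighted estimate above is a self-normalized importance sampling average. A minimal Python sketch, with hypothetical argument names, is:

```python
def weighted_mean(samples, weights, h):
    """Self-normalized importance sampling estimate of E_pi[h(x)].

    samples: list of conformations (any objects accepted by h).
    weights: importance weights, each proportional to pi(x) / g(x).
    h: function mapping a conformation to a number.
    """
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must have a positive sum")
    return sum(w * h(x) for x, w in zip(samples, weights)) / total_weight
```

Because the estimate divides by the sum of the weights, the weights only need to be known up to a common constant factor.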

We adopt the framework developed in Ref. 43 to generate sample conformations. This framework minimizes the loss introduced in the resampling step when a number of distinct samples are chosen from a larger sample set, and it helps to maintain the diversity of the samples. Let $m_t$ be the number of samples we retain in the $t$th iteration and $m_{\max}$ be the maximum value of $m_t$. The algorithm for generating samples is described in Algorithm 1.

Algorithm 1 Generating conformation

   
 Set $m_1=1$, $w_1^{(1)}=1.0$, and place the first residue at fixed $x_1^{(1)}$.  
 for $t=2$ to $n$ do  
  $L_t=0$;  
  {$L_t$: number of length-$t$ chains that can be obtained from samples obtained at step $t-1$.}  
  for sample $j=1:m_{t-1}$ do  
   Find all of the valid sites $x_t^{(i,j)}$, $i=1,\ldots,l_t^{(j)}$, for placing $x_t$ next to partial chain $\mathbf{x}_{t-1}^{(j)}$.  
   {$l_t^{(j)}$ = number of available sites to place $x_t$ next to partial chain $\mathbf{x}_{t-1}^{(j)}$.}  
   Generate $l_t^{(j)}$ chains of length $t$: $\tilde{\mathbf{x}}_t^{(L_t+i)}=(\mathbf{x}_{t-1}^{(j)},x_t^{(i,j)})$.  
   $\tilde{w}_t^{(L_t+i)}=w_{t-1}^{(j)}$. {Temporary weights for uniform distribution.}  
   $L_t=L_t+l_t^{(j)}$.  
  end for  
  if $L_t \le m_{\max}$ then  
   Let $m_t=L_t$ and $\{(\mathbf{x}_t^{(j)},w_t^{(j)})\}_{j=1}^{m_t}=\{(\tilde{\mathbf{x}}_t^{(l)},\tilde{w}_t^{(l)})\}_{l=1}^{L_t}$.  
  else  
   Let $m_t=m_{\max}$.  
   for $l=1$ to $L_t$ do  
    Assign a priority score $\beta_t^{(l)}$ for chain $\tilde{\mathbf{x}}_t^{(l)}$ according to the constraints.  
   end for  
   Find constant $c$ such that $\sum_{l=1}^{L_t}\min\{c\beta_t^{(l)},1\}=m_{\max}$. {e.g., by binary search.}  
   Draw $r$ from uniform distribution $U[0,1)$.  
   for $j=1:m_{\max}$ do  
    Let $r_j=j-r$.  
    Find integer $J_j$ such that $\sum_{l=1}^{J_j-1}\min\{c\beta_t^{(l)},1\}<r_j\le\sum_{l=1}^{J_j}\min\{c\beta_t^{(l)},1\}$.  
    Select sample $\mathbf{x}_t^{(j)}=\tilde{\mathbf{x}}_t^{(J_j)}$.  
    Set weight $w_t^{(j)}=\tilde{w}_t^{(J_j)}/\min\{c\beta_t^{(J_j)},1\}$.  
   end for  
  end if  
 end for  
 for $j=1:m_n$ do  
  Calculate importance weight $w(\mathbf{x}_n^{(j)})\propto w_n^{(j)}\pi(\mathbf{x}_n^{(j)})$.  
 end for  
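The resampling branch of Algorithm 1 can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not the authors' implementation: the function names are hypothetical, the chains are treated as opaque objects, and all priority scores are assumed to be strictly positive.

```python
import random

def capped_sum(c, beta):
    """Sum of min(c * beta_l, 1) over all candidates."""
    total = 0.0
    for b in beta:
        total += min(c * b, 1.0)
    return total

def find_scale(beta, m_max, iters=200):
    """Binary search for c such that capped_sum(c, beta) is about m_max.

    Assumes len(beta) > m_max and every beta_l > 0.
    """
    lo, hi = 0.0, 1.0 / min(beta)   # at hi, every term is capped at 1
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if capped_sum(mid, beta) < m_max:
            lo = mid
        else:
            hi = mid
    return hi

def resample(chains, weights, beta, m_max, rng=random):
    """Keep m_max of the candidate chains and reweight the survivors."""
    if len(chains) <= m_max:
        return list(zip(chains, weights))
    c = find_scale(beta, m_max)
    r = rng.random()                 # single uniform draw in [0, 1)
    kept, cumulative, j = [], 0.0, 1
    for chain, w, b in zip(chains, weights, beta):
        p = min(c * b, 1.0)          # selection probability of this candidate
        cumulative += p
        # r_j = j - r falls in this candidate's interval of the running sum
        if j <= m_max and j - r <= cumulative:
            kept.append((chain, w / p))
            j += 1
    return kept
```

Candidates whose scaled score reaches 1 are always kept with their weight unchanged, while the others are kept with probability `c * beta` and have their weight divided by that probability, which is what keeps the retained samples properly weighted.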

The key step in this algorithm is to construct high-quality priority scores $\beta_t^{(l)}$, which work as the trial distribution $g_t(x_t \mid \mathbf{x}_{t-1})$ to guide the growth of the partial chains toward more profitable regions.

In this algorithm, growth does not have to start from the first residue x1. In fact, growth can start from any place, as long as the newly placed monomer is connected to the existing partial chain. For example, growth can start in one direction from a position in the middle of the primary sequence of the chain. After it reaches the end of the chain, the growth process can go back to the starting residue and continue to grow in the other direction of the primary sequence. That is, the order of placing residues can be (xk,xk−1,…,x1,xk+1,…,xn) or (xk,xk+1,…,xn,xk−1,…,x1) for any residue k located in the middle of the chain. The steps of adding a new residue to the existing partial chain are the same as above. In this study, we choose the order of placing residues so that the fragment of the first 20 residues to be placed has the largest number of distance constraints.

Priority score

The choice of a good priority score βt used in Algorithm 1 is very important. A carefully designed βt can successfully guide the growth of the conformation so that the full chain will eventually obey all the distance constraints, hence increasing the sampling acceptance rate. A difficulty in the growth-based method is that when adding current residue, the distance information of future residues cannot be directly used. To solve this problem, the priority score we develop consists of three components: growth potential from upper bounds of the distance constraints, growth potential from reference points, and growth potential from lower bounds of the distance constraints. The first two components of the priority score incorporate the distance information of future residues.

Growth potential from upper bounds of the distance constraints

Given the upper bounds for the distances between residue pairs in a subset D of all residue pairs, we first develop distance upper bounds λij between all residue pairs (i,j), i,j=1,…,n.

Let q(k) be an upper bound of distances between two residues that are k residues away in the protein primary sequence. For constructing the upper bounds q(k) for small sequence separation k, we enumerate self-avoiding chains on the discrete lattice model using the protein sequence of interest. We have

$$q(k) = \max_i \max_{x(i,k)} \left( \|x_{i+k} - x_i\| \right),$$

where $x(i,k)$ is a self-avoiding chain of length $k$ starting at residue $i$. In this study, we enumerate fragments of chains for $k=1,\ldots,5$ at different starting positions $i$, and take the largest as $q(k)$. When the sequence separation $k$ is large, enumeration is infeasible. We approximate $q(k)$ by $k_1 q(5) + q(k_2)$ if $k = 5k_1 + k_2$, where $k_1 \in \mathbb{Z}$ and $k_2 = 0,\ldots,4$. This is an upper bound because it assumes that the chain is joined at some residues without an angle constraint.
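The extension of q(k) to large sequence separations can be written as a short Python sketch. The table of enumerated values is an input here, and the numbers used in any example call are placeholders rather than values from the paper; q(0) is taken to be zero, which is an assumption implied by the decomposition.

```python
def extend_upper_bound(q_small, k):
    """Approximate q(k) as k1 * q(5) + q(k2), where k = 5 * k1 + k2.

    q_small: dict holding the enumerated bounds q(1), ..., q(5).
    k: sequence separation, a positive integer.
    """
    k1, k2 = divmod(k, 5)
    remainder = q_small[k2] if k2 else 0.0   # q(0) is taken as 0
    return k1 * q_small[5] + remainder
```

For k of 5 or less the function returns the enumerated value itself, so it can be used uniformly for every pair of residues.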

Consider a complete graph G with n vertices, each vertex represents a residue. The length of edge between any two vertices i and j is set to

$$e_{ij} = \begin{cases} \min\{u_{ij},\, q(|i-j|)\}, & \text{if } (i,j) \in D, \\ q(|i-j|), & \text{otherwise.} \end{cases}$$

We can use the Floyd algorithm44 to identify the shortest path pij between any two vertices i and j in this complete graph G. The distance upper bound λij between residues i and j is then set to the total length of the shortest path pij.
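A minimal Python sketch of this all-pairs shortest-path step is given below, using the standard Floyd–Warshall recurrence. The function name and input layout are assumptions for illustration, and the sketch returns only the path lengths; recording the paths themselves would need an additional predecessor table.

```python
def global_upper_bounds(n, upper, q):
    """Return the matrix of distance upper bounds lambda_ij.

    n: number of residues (0-based indices).
    upper: dict mapping (i, j) with i < j to a known upper bound u_ij.
    q: function giving the sequence-separation bound q(k) for k >= 1.
    """
    lam = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            edge = q(j - i)
            if (i, j) in upper:
                edge = min(edge, upper[(i, j)])
            lam[i][j] = lam[j][i] = edge
    # Floyd-Warshall: allow each vertex k in turn as an intermediate stop.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via = lam[i][k] + lam[k][j]
                if via < lam[i][j]:
                    lam[i][j] = via
    return lam
```

A single tight constraint between two residues therefore also tightens the bounds for nearby pairs, because paths through the constrained pair become shorter than the sequence-separation bound.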

After obtaining the distance upper bound $\lambda_{ij}$ and the corresponding path $p_{ij}$, we construct the potential function that contributes to the priority score as

$$f_1(\mathbf{x}_t) = \sum_{i<j,\,(i,j) \in P} h_1(\|x_i - x_j\|, \lambda_{ij}), \quad (3)$$

where $P$ is a set of $(i,j)$ pairs such that on the shortest path $p_{ij}$ between $i$ and $j$, the two ends $x_i$ and $x_j$ are in the partial chain $\mathbf{x}_t$, but none of the residues between $i$ and $j$ are in $\mathbf{x}_t$. This is to avoid double counting of the distance constraints. The function $h_1$ is a loss function that measures the violation of the constraint $\|x_i - x_j\| \le \lambda_{ij}$. Usually, $h_1(\|x_i - x_j\|, \lambda_{ij})$ is set to zero when $\|x_i - x_j\| \le \lambda_{ij}$, and is monotonically nondecreasing as $\|x_i - x_j\| - \lambda_{ij}$ increases. Different types of $h_1(\cdot)$ can be chosen for different considerations, which we will discuss in detail in later sections.

Growth potential from reference points

Given a partial chain $\mathbf{x}_t$, if the position of a future residue $j$ ($x_j \notin \mathbf{x}_t$) is strongly constrained, e.g., there are more than four residues in the existing chain $\mathbf{x}_t$ having distance constraints related to residue $j$, then residue $j$ can only be placed in a small spatial region. We generate a candidate position for $x_j$ on lattice sites within this small region. More specifically, if a future residue $j$ has distance constraints,

$$l_{i_k j} \le \|x_{i_k} - x_j\| \le u_{i_k j}, \quad k=1,\ldots,K,$$

where $x_{i_k}$, $k=1,\ldots,K$, are in the existing chain $\mathbf{x}_t$, and $K \ge 5$, we use Newton’s climbing method45 to find a position $z$ in $\mathbb{R}^3$ such that

$$z = \arg\min_x F(x) = \arg\min_x \sum_{k=1}^{K} \left( \|x_{i_k} - x\| - u_{i_k j} \right)^2,$$

in which $z$ is obtained by iteratively performing $z = z - (\nabla^2 F(z))^{-1} \nabla F(z)$. We then search the sites on the cubic lattice around position $z$ and choose the site $x$ that minimizes

$$\sum_{k=1}^{K} \left[ \max{}^2\{l_{i_k j} - \|x_{i_k} - x\|, 0\} + \max{}^2\{\|x_{i_k} - x\| - u_{i_k j}, 0\} \right]$$

as the candidate position for residue $j$. We denote the candidate position as $x_j^*$ and use it as a reference point to guide the growth of the chain. The following potential function is used to encode this:

$$f_2(\mathbf{x}_t) = \sum_{(i,j) \in P} h_2(\|x_i - x_j^*\|, \lambda_{ij}), \quad (4)$$

where $P$ is a set of $(i,j)$ pairs such that on the shortest path $p_{ij}$ between $i$ and $j$, $x_i$ is in the partial chain $\mathbf{x}_t$ constructed so far, $x_j^*$ is the reference point, and none of the residues between $i$ and $j$ are in $\mathbf{x}_t$. As before, $h_2$ is the loss function that measures the violation of the constraint $\|x_i - x_j^*\| \le \lambda_{ij}$.

Growth potential from lower bounds of the distance constraints

This potential function penalizes the violation of lower bound constraint,

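$$f_3(\mathbf{x}_t) = \sum_{(i,j) \in S \cap D} h_3(\|x_i - x_j\|, l_{ij}), \quad (5)$$

where $S$ is the set of residue pairs $(i,j)$ for which both $x_i$ and $x_j$ exist in the partial chain $\mathbf{x}_t$. Here $h_3$ is the loss function that measures the violation of the constraint $\|x_i - x_j\| \ge l_{ij}$. Hence, $h_3(\|x_i - x_j\|, l_{ij}) = 0$ when $\|x_i - x_j\| \ge l_{ij}$, and it is monotonically nondecreasing as $\|x_i - x_j\| - l_{ij}$ decreases.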

Combined priority score

The combined priority score βt(l) for chain x˜t(l) is set as

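$$\beta_t^{(l)} = \exp\left[ -\frac{\rho_1 f_1(\tilde{\mathbf{x}}_t^{(l)}) + \rho_2 f_2(\tilde{\mathbf{x}}_t^{(l)}) + \rho_3 f_3(\tilde{\mathbf{x}}_t^{(l)})}{\tau_t} \right], \quad (6)$$

where $\rho_1$, $\rho_2$, and $\rho_3$ are coefficients of the three growth potential functions, and $\tau_t$ is a temperature-like variable. The choice of the loss functions $h_1$, $h_2$, and $h_3$ in $f_1$, $f_2$, and $f_3$, and of the coefficients $\rho_1$, $\rho_2$, $\rho_3$, and $\tau_t$ will be described in later sections.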

Generating conformations from incomplete residue distance constraints

In this section, we discuss how to use Algorithm 1 to generate protein conformations with given constraints in the form of small intervals of distances between a subset of residue pairs. The distance constraints are represented as12, 13

$$d_{ij} - \epsilon_{ij} \le \|x_i - x_j\| \le d_{ij} + \epsilon'_{ij} \quad \text{for all } (i,j) \in D,$$

where $d_{ij}$ is the distance between residues $i$ and $j$ in the native structure. The set $D$ is assigned as follows: each non-neighboring residue pair within the short range distance (SRD) in the native structure is included in $D$ independently with a given probability (e.g., 20%, 40%, …, 100%). The SRD is set to 10 Å for residue-level structure following Ref. 13. All residue pairs with distance $d_{ij} > 10$ Å are excluded from $D$. The variations $\epsilon_{ij}$ and $\epsilon'_{ij}$ of the bounds are drawn independently from the uniform distribution $U[0,1)$, so that the distance variation is under 1 Å, about 10% of the true distance $d_{ij}$ as in Ref. 13.

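In this problem, the priority score in Algorithm 1 is set according to Eq. 6 with

$$h_1(z,\lambda) = h_2(z,\lambda) = \begin{cases} (z-\lambda)^2, & \text{if } z > \lambda, \\ 0, & \text{if } z \le \lambda, \end{cases}$$

and

$$h_3(z,l) = \begin{cases} (z-l)^2, & \text{if } z < l, \\ 0, & \text{if } z \ge l, \end{cases}$$

for $f_1$, $f_2$, and $f_3$, and the parameters are set as $\rho_1=1$, $\rho_2=1$, $\rho_3=1$, and $\tau_t=0.5$. Here $z$ is the value of the corresponding distance.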
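The quadratic losses and the combined priority score can be sketched in Python as follows. The sketch assumes that the score decreases with the total loss, that is, the exponent of Eq. 6 is the negative weighted loss divided by the temperature-like variable. The function names are hypothetical, and selecting which residue pairs enter each sum (the sets P and S ∩ D) is left to the caller.

```python
import math

def loss_upper(z, lam):
    """Quadratic penalty when distance z exceeds the upper bound lam."""
    return (z - lam) ** 2 if z > lam else 0.0

def loss_lower(z, low):
    """Quadratic penalty when distance z falls below the lower bound low."""
    return (z - low) ** 2 if z < low else 0.0

def priority_score(upper_terms, reference_terms, lower_terms,
                   rho=(1.0, 1.0, 1.0), tau=0.5):
    """Combine the three growth potentials into one priority score.

    Each *_terms argument is a list of (distance, bound) pairs that the
    caller has already selected for f1, f2, and f3, respectively.
    """
    f1 = sum(loss_upper(z, lam) for z, lam in upper_terms)
    f2 = sum(loss_upper(z, lam) for z, lam in reference_terms)
    f3 = sum(loss_lower(z, low) for z, low in lower_terms)
    return math.exp(-(rho[0] * f1 + rho[1] * f2 + rho[2] * f3) / tau)
```

A partial chain that violates no bound receives the maximum score of 1, and the score falls off rapidly as violations grow, so heavily violating chains are rarely retained during resampling.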

The loss functions are chosen so that the distance between any two residues $i$ and $j$ in the conformational sample does not deviate too much from the given constrained interval $[l_{ij},u_{ij}]$, in case not all constraints can be perfectly satisfied simultaneously. The loss functions $h_1$, $h_2$, and $h_3$ are convex functions of the distance $\|x_i - x_j\|$, which increase rapidly as $\|x_i - x_j\|$ departs from the constrained interval $[l_{ij},u_{ij}]$.

Generating conformations from contact map of distance cutoff

We now describe how to generate conformations based on a given incomplete contact map, where distances between some residue pairs are known to be either above or below a cutoff value. In our calculation, we use 8.5 Å as the cutoff value. This value has been used by Vendruscolo et al. in Ref. 16.

The contact map of a length-$n$ polymer chain is an $n \times n$ symmetric matrix $C=\{c_{ij}\}_{n \times n}$, where $c_{ij}=1$ if residues $i$ and $j$ are in contact, and $c_{ij}=0$ otherwise. A given contact map is equivalent to a set of distance constraints,

$$\|x_i - x_j\| \le 8.5\ \text{Å} \quad \text{for all } c_{ij}=1,$$
$$\|x_i - x_j\| > 8.5\ \text{Å} \quad \text{for all } c_{ij}=0.$$
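The correspondence between coordinates and a contact map can be illustrated with a short Python sketch. The function names are hypothetical, the diagonal is set to zero by assumption, and the target map is passed as a dictionary so that an incomplete contact map can list only the known entries.

```python
import math

CUTOFF = 8.5  # contact cutoff in Å, as used in the text

def contact_map(coords, cutoff=CUTOFF):
    """Build the symmetric 0/1 contact matrix from Cα coordinates."""
    n = len(coords)
    return [[1 if i != j and math.dist(coords[i], coords[j]) <= cutoff else 0
             for j in range(n)]
            for i in range(n)]

def count_violations(coords, target, cutoff=CUTOFF):
    """Count known target entries (i, j) -> 0 or 1 that the chain contradicts."""
    observed = contact_map(coords, cutoff)
    return sum(1 for (i, j), value in target.items() if observed[i][j] != value)
```

A conformation is consistent with the given incomplete contact map exactly when `count_violations` returns zero.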

For this problem, we use

$$h_1(z,\lambda) = h_2(z,\lambda) = I(z-\lambda > 0)$$

and

$$h_3(z,l) = I(z-l < 0),$$

in Eqs. 3, 4, and 5, respectively, to construct the priority score in Eq. 6. Here $I(\cdot)$ is the indicator function: $I(\cdot)=1$ if the statement represented by $(\cdot)$ is true, and 0 otherwise. Parameters in Eq. 6 are taken as $\rho_1=1$, $\rho_2=1$, $\rho_3=0.8$, and $\tau_t=0.2$.

The loss functions are chosen in order to keep the contact map of the generated conformational samples as close to the given target contact map as possible, even if it cannot be completely satisfied. In particular, if the distance $\|x_i - x_j\|$ violates the distance constraint, the corresponding loss function jumps from 0 to 1.

Generating contact maps and conformations from indirect distance constraints by ϕ-values

In this section, we describe how to obtain contact maps based on indirect distance constraints in the form of experimentally measured ϕ-values.

Generating contact maps from indirect distance constraints

ϕ-values of TSE and contact maps. For generating conformations of the TSE, our target distribution π(xn) is the uniform distribution of all conformations xn satisfying the ϕ-value constraints,

$$\phi_i^{\mathrm{calc}}(\mathbf{x}_n) = \frac{N_i^{\mathrm{TSE}}}{N_i^{N}} \approx \phi_i^{\mathrm{exp}}, \quad i \in I = \{I_1,\ldots,I_T\},$$

where $I$ represents the set of residues whose ϕ-values have been measured experimentally, and $I_1,\ldots,I_T$ are the indices of these residues. By definition, $\phi_i^{\mathrm{calc}}(\mathbf{x}_n)$ can be computed when the conformation $\mathbf{x}_n$ of the full chain is known. When only information on the partial chain $\mathbf{x}_{t-1}$ is available during chain growth, it is difficult to construct an effective conditional trial distribution $g_t(x_t \mid \mathbf{x}_{t-1})$.

Our approach is to translate the ϕ-value constraints into contact maps of equivalent distance constraints. These contact maps provide more direct information on distance constraints for generating conformations. We then sample conformations following the contact map constraints, which will automatically satisfy all ϕ-value constraints. We describe briefly how to generate conformations from these ϕ-value derived contact maps in Sec. 2F2.

From ϕ-values to contact maps. Because of the intrinsic symmetry of the contact map C={cij}n×n, we consider cij and cji as the same entry in C. Let N be the set of residue pairs (i,j) forming native contacts. By definition, the calculated ϕ-value ϕicalc for residue i of a conformation only depends on the values of cij in its contact map that are native contacts formed by residue i. Let

$$C_i = \{c_{ij} \mid (i,j) \in N\}.$$

The size of this set, $|C_i|$, is the number of contacts formed by residue $i$ in the native structure. Note that if $(i,j) \in N$, both $C_i$ and $C_j$ contain $c_{ij}$.

To generate a contact map $C$ satisfying the ϕ-value constraints, we only need to decide which subset of native contacts to preserve for each residue $i$ whose experimental ϕ-value is available. To satisfy the ϕ-value constraint, there need to be $|C_i|\phi_i^{\mathrm{exp}}$ native contacts preserved for residue $i$ in the contact map. That is, we need to assign either 0 or 1 to the elements in $C_i$, $i \in I$, such that there are exactly $|C_i|\phi_i^{\mathrm{exp}}$ “1”s in $C_i$. In other words, for each generated contact map we should have $\sum_{c_{ij} \in C_i} c_{ij} = |C_i|\phi_i^{\mathrm{exp}}$, $i \in I$. For simplicity, we denote $\Psi_i = |C_i|\phi_i^{\mathrm{exp}}$ in the subsequent discussion.
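The bookkeeping behind these counts can be sketched in Python. Rounding the product of the native contact count and the experimental ϕ-value to the nearest integer is an assumption of this sketch, since the text does not state how non-integer products are handled; the function names and data layout are likewise illustrative.

```python
def required_contacts(native_pairs, phi_exp):
    """Number of native contacts to preserve for each measured residue.

    native_pairs: list of (i, j) residue pairs forming native contacts.
    phi_exp: dict mapping residue index -> experimental phi-value.
    Rounding to the nearest integer is an assumption of this sketch.
    """
    psi = {}
    for i, phi in phi_exp.items():
        size = sum(1 for a, b in native_pairs if i in (a, b))
        psi[i] = round(size * phi)
    return psi

def phi_calc(contact, native_pairs, i):
    """Fraction of residue i's native contacts present in a contact map."""
    partners = [b if a == i else a for a, b in native_pairs if i in (a, b)]
    if not partners:
        raise ValueError("residue has no native contacts")
    return sum(contact[i][j] for j in partners) / len(partners)
```

A contact map that keeps exactly the required number of native contacts for a residue reproduces its experimental ϕ-value up to the rounding applied in `required_contacts`.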

Now we generate contact map samples properly weighted with respect to the uniform distribution of all contact maps that physically satisfy the ϕ-value constraints. Each contact map sample C generated by importance sampling via the use of a trial distribution g(C) is weighted by v=1∕g(C). Here g(C) is the probability to generate contact map C.

A similar problem has been studied in Ref. 46 for generating 0–1 tables with fixed marginal sums. Although in our problem, the contact map has to be symmetric and only part of it needs to be filled, some techniques in Ref. 46 can be used to improve the sampling efficiency.

Specifically, we proceed by assigning the proper numbers of “1”s and “0”s in the rows for residues with experimental ϕ-value measurements. That is, we fill 0’s and 1’s in $C_i$, $i \in I$, and repeat this position after position, until the rows corresponding to all residues with experimental ϕ-value measurements are assigned. Let $m^*$ denote the number of contact map samples we will generate, $C_{I_1:I_t}^{(k)}$ denote the partially filled $k$th contact map obtained after finishing positions $I_1$ to $I_t$, and $v_t^{(k)}$ be the weight of the $k$th contact map that has been partially filled up to position $I_t$. The algorithm for generating contact maps from ϕ-value measurements is listed as Algorithm 2.

Algorithm 2 Generating contact map

   
 for $k=1$ to $m^*$ do  
  $C_{I_1:I_0}^{(k)}=\emptyset$, $v_0^{(k)}=1$  
 end for  
 for position index $t=1$ to $T$ do  
  for sample $k=1$ to $m^*$ do  
   for $s=t$ to $T$ do  
    Divide $C_{I_s}$ into disjoint sets $S_{0,I_s}^{(k)}$, $S_{1,I_s}^{(k)}$, and $S_{u,I_s}^{(k)}$ based on partial contact map $C_{I_1:I_{t-1}}^{(k)}$, where $S_{1,I_s}^{(k)}=\{c_{I_s,j}\ \text{already filled with 1}\}$, $S_{0,I_s}^{(k)}=\{c_{I_s,j}\ \text{already filled with 0}\}$, and $S_{u,I_s}^{(k)}=\{c_{I_s,j}\ \text{unspecified}\}$.  
   end for  
   repeat  
    for $s=t$ to $T$ do  
     if $|S_{1,I_s}^{(k)}|>\Psi_{I_s}$ then  
      Remove this sample. {Already too many “1”s.}  
     else if $|S_{1,I_s}^{(k)}|=\Psi_{I_s}$ then  
      Fill all elements in $S_{u,I_s}^{(k)}$ with 0.  
      Update $S_{0,I_j}^{(k)}$, $S_{1,I_j}^{(k)}$, $S_{u,I_j}^{(k)}$, $j\in\{t,\ldots,T\}$.  
     end if  
     if $|S_{1,I_s}^{(k)}|+|S_{u,I_s}^{(k)}|<\Psi_{I_s}$ then  
      Remove this sample. {Already too many “0”s.}  
     else if $|S_{1,I_s}^{(k)}|+|S_{u,I_s}^{(k)}|=\Psi_{I_s}$ then  
      Fill all elements in $S_{u,I_s}^{(k)}$ with 1.  
      Update $S_{0,I_j}^{(k)}$, $S_{1,I_j}^{(k)}$, $S_{u,I_j}^{(k)}$, $j\in\{t,\ldots,T\}$.  
     end if  
    end for  
   until $S_{u,I_t}^{(k)}=\emptyset$, or none of $S_{0,I_s}^{(k)}$, $S_{1,I_s}^{(k)}$, $S_{u,I_s}^{(k)}$, $s\in\{t,\ldots,T\}$ changes.  
   {This step must converge because the number of unspecified positions decreases monotonically as the iteration proceeds.}  
   if $S_{u,I_t}^{(k)}=\emptyset$ then  
    $C_{I_t}^{(k)}$ is completed and let weight $v_t^{(k)}=v_{t-1}^{(k)}$.  
   else  
    Fill $S_{u,I_t}^{(k)}$ with $\Psi_{I_t}-|S_{1,I_t}^{(k)}|$ “1”s following CP-distribution.  
    {When there are unspecified entries in this row.}  
    Update weight $v_t^{(k)}$ by Eq. 8.  
   end if  
  end for  
  Optionally resample38 $\{(C_{I_1:I_t}^{(k)},v_t^{(k)})\}_{k=1}^{m^*}$ if many samples were removed.  
 end for  

Conditional Poisson (CP) distribution. The details of the CP distribution can be found in Ref. 47. Briefly, we sample 0’s and 1’s to fill the entries $s_1,\ldots,s_{|S_{u,I_t}^{(k)}|}$ of $S_{u,I_t}^{(k)}$ described in Algorithm 2 with probability proportional to

$$g(s_1,\ldots,s_{|S_{u,I_t}^{(k)}|}) \propto \prod_{j=1}^{|S_{u,I_t}^{(k)}|} p_j^{s_j}(1-p_j)^{1-s_j},$$

subject to the total number of assigned 1’s being $\sum_{j=1}^{|S_{u,I_t}^{(k)}|} s_j = \Psi_{I_t} - |S_{1,I_t}^{(k)}|$. Here $p_j \in [0,1]$ are parameters chosen to improve the sample survival probability of this distribution.

Parameters for the conditional Poisson (CP) distribution. For each entry $s_j \in S_{u,I_t}^{(k)}$ to be filled (whose corresponding entry in the contact map is $c_{I_t,J_j}$), we assign the parameter $p_j$ for the CP distribution as

$$p_j = \begin{cases} \dfrac{\Psi_{J_j} - |S_{1,J_j}^{(k)}|}{|S_{u,J_j}^{(k)}|}, & \text{if } c_{I_t,J_j} \in S_{u,I_t}^{(k)} \cap A, \\[2ex] \max\left\{ \dfrac{\Psi_{I_t} - |S_{1,I_t}^{(k)}| - \sum_{c_{I_t,i} \in S_{u,I_t}^{(k)} \cap A} p_i}{|S_{u,I_t}^{(k)} \setminus A|},\ 0.1 \right\}, & \text{if } c_{I_t,J_j} \in S_{u,I_t}^{(k)} \setminus A, \end{cases}$$

where $A=\{c_{I_t,I_{t+1}}, c_{I_t,I_{t+2}}, \ldots, c_{I_t,I_T}\}$ are the entries recording the existence of contacts between residue $I_t$ and other future residues with experimental ϕ-values.

If $J_j$ is a position with a ϕ-measurement but currently unspecified, we assign $p_j$ as the ratio of the number of 1’s still to be assigned, $\Psi_{J_j} - |S_{1,J_j}^{(k)}|$, to the number of unspecified positions, $|S_{u,J_j}^{(k)}|$, for residue $J_j$.

If $J_j$ is a position currently unspecified but not a position with a known ϕ-measurement, we first take the number of 1’s to be assigned, $\Psi_{I_t} - |S_{1,I_t}^{(k)}|$, minus the expected number $\sum_{c_{I_t,i} \in S_{u,I_t}^{(k)} \cap A} p_i$ of 1’s that will be assigned to future positions with ϕ-values, and divide it by the number $|S_{u,I_t}^{(k)} \setminus A|$ of unspecified positions without known ϕ-values. We assign $p_j$ as this ratio or the value 0.1, whichever is larger. This choice of $p_j$ is expected to fill $\Psi_j - |S_{1,j}^{(k)}|$ 1’s in $S_{u,j}^{(k)}$ for $j \in \{I_t, I_{t+1}, \ldots, I_T\}$. Note that in this assignment, $p_j$ is guaranteed to have a value between 0 and 1.

Realization of CP distribution. The overall idea for sampling from the CP distribution is to take out $\Psi_{I_t} - |S_{1,I_t}^{(k)}|$ elements from the set $S_{u,I_t}^{(k)}$ one by one without replacement, following specific probabilities. These elements will be assigned as 1’s, while the remaining ones will be 0’s.46

Specifically, let $a_j = p_j/(1-p_j)$. Suppose $\bar{S}_{u,I_t}^{(k)}(i)$ are the remaining elements after taking out $i$ elements ($i = 0, 1, \ldots, \Psi_{I_t} - |S_{1,I_t}^{(k)}| - 1$). Each $s_j \in \bar{S}_{u,I_t}^{(k)}(i)$ will be selected as the next element to be taken out and assigned the value of 1 with probability

$$P\big(s_j, \bar{S}_{u,I_t}^{(k)}(i)\big) = \frac{a_j\, R\big(\Psi_{I_t} - |S_{1,I_t}^{(k)}| - i - 1,\ \bar{S}_{u,I_t}^{(k)}(i) \setminus \{s_j\}\big)}{\big(\Psi_{I_t} - |S_{1,I_t}^{(k)}| - i\big)\, R\big(\Psi_{I_t} - |S_{1,I_t}^{(k)}| - i,\ \bar{S}_{u,I_t}^{(k)}(i)\big)},$$

where $R(i,S)$ is

$$R(i,S) = \sum_{B \subset S,\ |B| = i} \Big( \prod_{j \in B} a_j \Big). \qquad (7)$$

It is the summation of $\prod_{j \in B} a_j$ over all size-$i$ subsets $B$ of $S$.

For an integer $i$ and a subset $S \subseteq S_{u,I_t}^{(k)}$, $R(i,S)$ can be calculated using the recursive formula

$$R(i,S) = R(i, S \setminus \{s_j\}) + a_j\, R(i-1, S \setminus \{s_j\})$$

for any $s_j \in S$. The initial conditions for the recursion are $R(0,S) = 1$ for any $S \subseteq S_{u,I_t}^{(k)}$ and $R(i,S) = 0$ for any $|S| < i$.
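The recursion and the one-by-one selection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `elementary_sums` and `sample_cp` are hypothetical, the sketch assumes every parameter satisfies p_j < 1 and that the requested number of 1's does not exceed the number of entries, and it recomputes R at every step for clarity rather than speed.

```python
import random


def elementary_sums(a, n):
    """Return [R(0,S), ..., R(n,S)] for the set S with odds a[j].

    R(i,S) is the sum, over all size-i subsets B of S, of the product
    of a[j] for j in B. Elements are added one at a time using
    R(i,S) = R(i, S\\{s_j}) + a_j * R(i-1, S\\{s_j}).
    """
    R = [1.0] + [0.0] * n
    for aj in a:
        # Go downward so R[i-1] still holds the value without s_j.
        for i in range(n, 0, -1):
            R[i] += aj * R[i - 1]
    return R


def sample_cp(p, n, rng=random):
    """Draw a 0/1 vector with exactly n ones by sequential selection."""
    a = [pj / (1.0 - pj) for pj in p]      # odds a_j = p_j / (1 - p_j)
    remaining = list(range(len(p)))
    s = [0] * len(p)
    for i in range(n):
        need = n - i                        # 1's still to be placed
        denom = need * elementary_sums([a[j] for j in remaining], need)[need]
        probs = []
        for j in remaining:
            rest = [a[k] for k in remaining if k != j]
            probs.append(a[j] * elementary_sums(rest, need - 1)[need - 1] / denom)
        pick = rng.choices(remaining, weights=probs)[0]
        s[pick] = 1
        remaining.remove(pick)
    return s
```

Calling `sample_cp([0.2, 0.5, 0.7, 0.4], 2)` returns a length-4 vector of 0's and 1's containing exactly two 1's; entries with larger p_j are selected more often.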

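Updating sample weight. The weight associated with a sample of contact map is updated as

$$v_t^{(k)} = v_{t-1}^{(k)}\, \frac{R\big(\Psi_{I_t} - |S_{1,I_t}^{(k)}|,\ S_{u,I_t}^{(k)}\big)}{\prod_{j=1}^{|S_{u,I_t}^{(k)}|} a_j^{s_j^{(k)}}}, \qquad (8)$$

where $s_1^{(k)}, \ldots, s_{|S_{u,I_t}^{(k)}|}^{(k)}$ is a realization of $s_1, \ldots, s_{|S_{u,I_t}^{(k)}|}$ for the $k$th contact map and $R(i,S)$ is defined in Eq. 7.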
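As a small worked sketch of the weight update, the code below reads Eq. 8 as the previous weight multiplied by R divided by the product of the odds a_j of the entries that were set to 1 (that is, the reciprocal of the probability of the sampled row under the CP distribution). The function name `update_weight` and its arguments are hypothetical, and p_j < 1 is assumed.

```python
def update_weight(v_prev, p, s):
    """Return the updated weight after filling one row.

    v_prev -- weight before this row was filled
    p      -- CP parameters p_j of the unspecified entries
    s      -- sampled 0/1 values for those entries
    """
    a = [pj / (1.0 - pj) for pj in p]
    n = sum(s)                              # number of 1's placed in this row

    # R(n, S): sum over all size-n subsets of the product of odds.
    R = [1.0] + [0.0] * n
    for aj in a:
        for i in range(n, 0, -1):
            R[i] += aj * R[i - 1]

    # Product of a_j over the entries that were actually set to 1.
    prod = 1.0
    for aj, sj in zip(a, s):
        if sj:
            prod *= aj
    return v_prev * R[n] / prod
```

With three entries that all have p_j = 0.5 and two 1's placed, every one of the three possible fillings is equally likely, so the weight is multiplied by 3.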

Generating conformations from contact map samples derived from ϕ-values

With a set of properly weighted samples of contact maps $\{(C^{(k)}, v_T^{(k)}),\ k = 1, \ldots, m^*\}$, we draw a subset of them. The probability for each sample to be drawn is proportional to $v_T^{(k)}$. Each selected contact map is used as the target contact map to generate conformations following Algorithm 1, using the priority score described in Sec. 2E. The set of generated conformations forms the TSE.
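Drawing contact maps with probability proportional to their weights can be sketched as below. The text does not state whether the draw is with or without replacement; this sketch assumes drawing with replacement, and the function name `draw_contact_maps` is hypothetical.

```python
import random


def draw_contact_maps(maps, weights, n_draw, rng=random):
    """Pick n_draw contact maps, each with probability proportional to its weight.

    Assumption of this sketch: sampling is done with replacement.
    """
    idx = rng.choices(range(len(maps)), weights=weights, k=n_draw)
    return [maps[i] for i in idx]
```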

RESULTS

Result of generating conformations from incomplete residue distance constraints

This section shows the result of generating protein conformations with given constraints in the form of small intervals of distances for a subset of residue pairs as described in Sec. 2D.

Consider the Boltzmann distribution π(xn)∝exp{−E(xn)∕τ∣D∣}, where E(xn)∕∣D∣ is the error function defined in Eq. 2 normalized by the number of constraints and τ is a temperaturelike parameter in the Boltzmann function. The error function reflects deviation from the lower and upper bounds of the distance constraints. Here we set τ=0.5. We use Algorithm 1 to estimate the expected root mean square distance (rmsd) to the native structure of conformations following this Boltzmann distribution. The algorithm is applied to 189 proteins chosen from the PDB, whose lengths are between 80 and 120. The distance constraints are constructed for non-neighboring residue pairs whose distances are less than 10 Å (SRD). The percentage of SRD pairs included in the given constraint set D varies from 20% to 100%.
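To make the estimate concrete, the sketch below computes the unnormalized Boltzmann factor exp{−E∕(τ∣D∣)} and a weighted average of rmsd values. It assumes the samples already carry importance weights that are proper with respect to the Boltzmann distribution; the function names `boltzmann_factor` and `weighted_mean` are hypothetical.

```python
import math


def boltzmann_factor(error, n_constraints, tau=0.5):
    """Unnormalized target density exp(-E / (tau * |D|))."""
    return math.exp(-error / (tau * n_constraints))


def weighted_mean(values, weights):
    """Estimate an expectation from properly weighted samples:
    sum_k w_k * h(x_k) / sum_k w_k."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total
```

For instance, two samples with rmsd 2 Å and 4 Å and weights 1 and 3 give an estimated expected rmsd of 3.5 Å.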

The growth priority score used in Algorithm 1 is described in Sec. 2D. We repeat the algorithm 20 times independently with at most mmax=1000 samples being kept during each computation. The corresponding estimated rmsd expectations of distance constraint set D that includes different percentages of SRDs are plotted in Fig. 2. The boxes in the figure have lines at the lower quartile, median, and upper quartile values of the estimated expectations of the 189 proteins. We can see the corresponding expectation of rmsd becomes smaller as the percentage of the constraints increases. This is expected, as the Boltzmann probabilities π(xn) of conformations close to the native structure tend to be larger as more distance constraints are available.

Figure 2.


Box plot of the expected rmsd to native structures, measured in Å, of conformations following the Boltzmann distribution of the error function π(xn)∝exp{−E(xn)∕τ∣D∣} for 189 proteins with lengths between 80 and 120. The boxes have lines at the lower quartile, median, and upper quartile values. The lines extending from each end of the boxes show the rest of the data. X axis is the percentage of native SRD pairs included in the constraint set D.

We can choose the conformation with the smallest value of the error function (Eq. 2) from the generated conformation samples as the recovered structure. In Fig. 3a, we plot the values of the normalized error function E(xn)∕∣D∣ of these recovered structures, compared to the values of the normalized error function of the fittest native structures [Fig. 3b]. The fittest native structure is the conformation in our discrete model whose rmsd to the native structure is the smallest. It is obtained by a greedy growth method (Ref. 19) with a local minimal rmsd to the native structure. Although the objective of our algorithm is to generate conformations following the Boltzmann distribution π(xn)∝exp{−E(xn)∕τ∣D∣}, we can still find conformations with smaller error function values, in terms of violation of distance constraints, than the fittest native structures.

Figure 3.


Normalized error function of the recovered structure and the fittest structure. (a) Box plot of normalized error function E(xn)∕∣D∣ of recovered structures of 189 proteins with length of 80–120; (b) box plot of normalized error function E(xn)∕∣D∣ of the fittest native structures of 189 proteins with length of 80–120. The boxes have lines at the lower quartile, median, and upper quartile values. The lines extending from each end of the boxes are to show the rest of the data. X axis is the percentage of native SRD pairs included in the constraint set D.

The rmsd’s of the recovered structures to native structures are plotted in Fig. 4. When the distance information for all SRD pairs is provided, the recovered structures of 160 out of the 189 proteins have rmsd to the native structures of less than 3 Å. In general, the recovered structures approach native structures as more distance constraints are incorporated. This shows that the priority score βt we use gives a larger probability of generating conformations close to the native structure when more distance constraints are available.

Figure 4.


Box plot of rmsd’s measured in Å of recovered structures of 189 proteins with length of 80–120 to native structures. The boxes have lines at the lower quartile, median, and upper quartile values. The lines extending from each end of the boxes are to show the rest of the data. X axis is the percentage of native SRD pairs included in the constraint set D.

We compare the difficulty of recovering structures from distance constraints among different protein classes. The rmsd’s of the recovered structures to native structures of ten proteins of different classes are reported in Table 1. Compared to alpha helical proteins, the recovered structures from incomplete distance constraints for beta proteins and alpha∕beta proteins have larger rmsd’s to the native structures. Table 2 reports the normalized error function of the recovered structures and the fittest native structures (in parentheses). Although the fittest native structure is fixed for each protein, the value of its error function normalized by the number of constraints differs with the choice of constraints at different percentages. We also report the number of violated distance constraints of the recovered structures and the fittest native structures in Table 3. The results show that although the recovered structures violate some of the distance constraints, values of the normalized error function can be much smaller than those of the fittest native structures. This is because the loss functions h1, h2, and h3 we use are concave downward functions, which focus on preventing the distance between residues from being far away from the given distance constraints.

Table 1.

rmsd’s measured in Å of the recovered structures and the fittest native structures to native structures of ten proteins of different classes. Number of all SRD pairs: the number of all residue pairs with distance less than 10 Å. % of SRD: the percentage of SRDs included in the constraint set D.

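RMSD to native structure measured in Å. The last five columns give the structure recovered from the stated % of SRD.

| PDB ID | Protein class | Protein length | # of all SRD pairs | Fittest native structure | 20% | 40% | 60% | 80% | 100% |
|---|---|---|---|---|---|---|---|---|---|
| 2mhr | All alpha | 118 | 765 | 0.9 | 5.3 | 3.9 | 1.9 | 2.3 | 1.5 |
| 256b | All alpha | 106 | 752 | 1.0 | 9.6 | 3.7 | 1.5 | 1.2 | 1.4 |
| 1cmc | All alpha | 104 | 619 | 0.9 | 6.3 | 4.9 | 2.7 | 2.1 | 2.3 |
| 1btn | All beta | 106 | 749 | 1.5 | 6.1 | 6.5 | 4.0 | 4.1 | 2.3 |
| 1f7d | All beta | 118 | 796 | 1.4 | 7.9 | 8.4 | 8.0 | 5.1 | 4.5 |
| 1f86 | All beta | 115 | 816 | 1.2 | 5.5 | 5.3 | 3.4 | 2.6 | 2.3 |
| 2trx | Alpha∕beta | 108 | 728 | 1.1 | 5.2 | 3.5 | 2.1 | 2.1 | 1.8 |
| 1bkf | Alpha∕beta | 107 | 788 | 1.5 | 5.6 | 2.6 | 2.0 | 2.1 | 1.6 |
| 1lkk | Alpha∕beta | 105 | 719 | 1.0 | 6.2 | 4.3 | 1.8 | 2.3 | 1.6 |
| 1puc | Alpha∕beta | 101 | 455 | 0.9 | 10.6 | 8.9 | 7.1 | 7.5 | 4.2 |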

Table 2.

Value of the normalized error function of the recovered structures and of the fittest native structures (in parentheses) of ten proteins of different classes. % of SRD: the percentage of SRDs included in the constraint set D.

Normalized error function of the structure recovered from the stated % of SRD (fittest native structure in parentheses).

| PDB ID | 20% | 40% | 60% | 80% | 100% |
|---|---|---|---|---|---|
| 2mhr | 0.050 (0.108) | 0.117 (0.105) | 0.085 (0.122) | 0.092 (0.136) | 0.093 (0.140) |
| 256b | 0.060 (0.237) | 0.049 (0.199) | 0.077 (0.203) | 0.085 (0.201) | 0.083 (0.202) |
| 1cmc | 0.018 (0.211) | 0.071 (0.196) | 0.071 (0.164) | 0.068 (0.165) | 0.097 (0.161) |
| 1btn | 0.072 (0.842) | 0.465 (0.648) | 0.478 (0.650) | 0.530 (0.678) | 0.370 (0.696) |
| 1f7d | 0.147 (0.669) | 0.293 (0.726) | 0.260 (0.648) | 0.343 (0.716) | 0.200 (0.688) |
| 1f86 | 0.144 (0.527) | 0.217 (0.469) | 0.330 (0.430) | 0.339 (0.415) | 0.198 (0.443) |
| 2trx | 0.044 (0.564) | 0.102 (0.466) | 0.196 (0.471) | 0.120 (0.427) | 0.129 (0.418) |
| 1bkf | 0.159 (0.526) | 0.117 (0.564) | 0.130 (0.691) | 0.131 (0.667) | 0.164 (0.584) |
| 1lkk | 0.264 (0.214) | 0.146 (0.216) | 0.128 (0.239) | 0.128 (0.246) | 0.121 (0.228) |
| 1puc | 0.009 (0.181) | 0.083 (0.157) | 0.103 (0.134) | 0.069 (0.137) | 0.068 (0.132) |

Table 3.

The numbers of violations of distance constraints of the recovered structures and the fittest native structures (in parentheses) of ten proteins of different classes. % of SRD: the percentage of SRDs included in the constraint set D.

Number of violated distance constraints for the structure recovered from the stated % of SRD (fittest native structure in parentheses).

| PDB ID | 20% | 40% | 60% | 80% | 100% |
|---|---|---|---|---|---|
| 2mhr | 67 (61) | 158 (137) | 224 (211) | 288 (303) | 383 (377) |
| 256b | 63 (74) | 123 (143) | 202 (231) | 271 (313) | 318 (372) |
| 1cmc | 36 (60) | 92 (107) | 150 (176) | 211 (236) | 281 (302) |
| 1btn | 62 (93) | 188 (175) | 267 (263) | 386 (352) | 427 (439) |
| 1f7d | 81 (83) | 183 (184) | 260 (294) | 375 (383) | 428 (463) |
| 1f86 | 72 (96) | 191 (210) | 279 (302) | 349 (384) | 455 (504) |
| 2trx | 63 (79) | 133 (156) | 214 (233) | 276 (313) | 351 (398) |
| 1bkf | 84 (98) | 160 (182) | 237 (292) | 311 (383) | 414 (489) |
| 1lkk | 82 (79) | 137 (158) | 249 (242) | 299 (327) | 376 (402) |
| 1puc | 30 (47) | 87 (88) | 133 (134) | 150 (180) | 171 (216) |

The relatively large number of constraint violations may be due to a limitation of our discrete model: there may not exist any conformation on the lattice satisfying all of the distance constraints. To address this issue, we construct a different set of distance constraints using the fittest native structure among the conformations of the discrete model, which is obtained by a greedy method. The new set of distance constraints is

$$\tilde{d}_{ij} - \epsilon_{ij} \le \| x_i - x_j \| \le \tilde{d}_{ij} + \epsilon_{ij} \quad \text{for all } (i,j) \in D,$$

where d˜ij is the distance of residues i and j in the fittest native structure. In this case, there exists at least one conformation, the fittest native structure, in the discrete model satisfying all the distance constraints. Under this setting, the normalized error function E(xn)∕∣D∣ of the recovered structures is plotted in Fig. 5, and the rmsd’s of the recovered structures to the fittest native structures are plotted in Fig. 6. Among 189 proteins, the recovered structures of 40 proteins can match the fittest native structures perfectly when all SRD pairs are in the constraint set D.
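The construction of interval constraints around a reference structure, and the count of violated constraints, can be sketched as follows. This is an illustration only: the function names are hypothetical, coordinates are plain tuples, and a single tolerance `eps` is used for every pair, whereas the text allows a pair-specific ε_ij.

```python
import math


def constraints_from_structure(coords, pairs, eps):
    """Build {(i, j): (lower, upper)} with bounds d_ij -/+ eps,
    where d_ij is the distance between residues i and j in coords."""
    bounds = {}
    for i, j in pairs:
        d = math.dist(coords[i], coords[j])
        bounds[(i, j)] = (d - eps, d + eps)
    return bounds


def count_violations(coords, bounds):
    """Count pairs whose distance falls outside its [lower, upper] interval."""
    violated = 0
    for (i, j), (lower, upper) in bounds.items():
        d = math.dist(coords[i], coords[j])
        if d < lower or d > upper:
            violated += 1
    return violated
```

By construction, the reference structure itself violates none of the constraints built from it, which is the property used in the text.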

Figure 5.


Box plot of normalized error function E(xn)∕∣D∣ of recovered structures of 189 proteins with length of 80–120 when distance constraints are constructed based on the fittest native structures. The boxes have lines at the lower quartile, median, and upper quartile values. The lines extending from each end of the boxes are to show the rest of the data. X axis is the percentage of native SRD pairs included in the constraint set D.

Figure 6.


Box plot of rmsd’s measured in Å of the recovered structures to the fittest native structures of 189 proteins with length of 80–120 when distance constraints are constructed based on the fittest native structures. The boxes have lines at the lower quartile, median, and upper quartile values. The lines extending from each end of the boxes are to show the rest of the data. X axis is the percentage of native SRD pairs included in the constraint set D.

Result of generating conformations from contact map of distance cutoff

This section shows the result of generating conformations based on a given contact map, where the distances between residue pairs are known to be either above or below a cutoff value (8.5 Å).16

We choose 20 proteins with length of 50–200 from the Protein Data Bank and generate conformations from their complete native contact map using Algorithm 1. We repeat the computation ten times independently and at most mmax=1000 samples are kept during each computation. The conformation with the smallest numbers of missing contacts (residue pairs that form contact in the native structure but not in the generated conformation) and extraneous contacts (residue pairs that form contact in the generated conformation but not in the native structure) is chosen as the recovered structure. The number of missed contacts, extraneous contacts, and rmsd to native structures measured in angstroms of the recovered structures are reported in Table 4. Figure 7 shows rmsd of the recovered structures to native structures. Again, we found that the recovered structures of alpha helical proteins have smaller rmsd to the native structures.
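Counting missed and extraneous contacts amounts to comparing two contact sets. The sketch below assumes one representative point per residue and treats a pair as a contact when its distance is below the 8.5 Å cutoff; the `min_sep` parameter, which excludes residues close in sequence, and the function names are hypothetical choices made for this illustration.

```python
import math


def contact_set(coords, cutoff=8.5, min_sep=2):
    """Set of residue pairs (i, j), j - i >= min_sep, closer than cutoff."""
    contacts = set()
    n = len(coords)
    for i in range(n):
        for j in range(i + min_sep, n):
            if math.dist(coords[i], coords[j]) < cutoff:
                contacts.add((i, j))
    return contacts


def contact_errors(native, model, cutoff=8.5, min_sep=2):
    """Return (missed, extraneous) contact counts of model vs. native."""
    nat = contact_set(native, cutoff, min_sep)
    mod = contact_set(model, cutoff, min_sep)
    missed = len(nat - mod)        # in the native structure but not in the model
    extraneous = len(mod - nat)    # in the model but not in the native structure
    return missed, extraneous
```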

Table 4.

List of proteins of different classes used to recover structures from complete native contact maps. The number of all native contacts, the number of missed contacts, the number of extraneous (false positive) contacts, and the rmsd to the native structure in Å are also listed.

| PDB ID | Protein class | Protein length | Number of native contacts | Number of missed contacts | Number of extraneous contacts | rmsd (Å) |
|---|---|---|---|---|---|---|
| 1ptq | Small protein | 50 | 164 | 8 | 9 | 1.7 |
| 1cse | Small protein | 63 | 172 | 10 | 11 | 2.6 |
| 1utg | All alpha | 70 | 206 | 6 | 3 | 2.5 |
| 1hyp | All alpha | 75 | 232 | 12 | 7 | 1.7 |
| 1lmb | All alpha | 87 | 280 | 8 | 5 | 2.0 |
| 1plc | All beta | 99 | 391 | 28 | 51 | 2.6 |
| 256b | All alpha | 106 | 363 | 17 | 7 | 1.9 |
| 2mcm | All beta | 112 | 414 | 36 | 36 | 2.3 |
| 2mhr | All alpha | 118 | 352 | 17 | 17 | 2.3 |
| 1dz3 | Alpha∕beta | 123 | 413 | 22 | 15 | 3.4 |
| 1mdc | All beta | 131 | 474 | 40 | 22 | 2.3 |
| 1stm | All beta | 141 | 521 | 68 | 77 | 4.1 |
| 1mba | All alpha | 146 | 530 | 38 | 53 | 2.7 |
| 1byr | Alpha∕beta | 152 | 641 | 52 | 42 | 1.6 |
| 4dfr | Alpha∕beta | 159 | 598 | 49 | 49 | 2.3 |
| 3dfr | Alpha∕beta | 162 | 578 | 45 | 65 | 2.6 |
| 1v37 | Alpha∕beta | 171 | 672 | 60 | 58 | 2.3 |
| 1dgw | All beta | 178 | 617 | 65 | 52 | 3.9 |
| 1fvk | Alpha∕beta | 188 | 664 | 58 | 36 | 2.0 |
| 1o7n | All beta | 193 | 622 | 77 | 85 | 3.7 |

Figure 7.


rmsd of structures recovered from complete native contact maps to native structures for 20 proteins of different classes with length of 50–200. X axis is the protein length; Y axis is the rmsd to the native structure, measured in Å, of the generated conformation that best fits the contact map.

Result of generating contact maps and conformations from indirect distance constraints by ϕ-values

This section depicts the result of generating TSE from ϕ-value constraints.

We generate the TSE of bovine acyl-coenzyme A-binding protein, a protein of length 86 with experimental ϕ-values. The PDB entry of the protein is 1nvl. The experimental ϕ-values are plotted in Fig. 8. More details of the experimental ϕ-values can be found in Ref. 48. We follow Ref. 16 and define the TSE as the conformations satisfying $|\phi_i^{\mathrm{calc}} - \phi_i^{\mathrm{exp}}| < 0.15$ for every residue $i$ with an experimentally measured ϕ-value. Hence, the target distribution is the uniform distribution over all conformations satisfying these constraints.
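The membership test for the TSE can be written as a simple check over the residues that have experimental ϕ-values. The sketch below takes the calculated ϕ-values as given inputs (their computation is defined elsewhere in the paper); the function name `in_tse` and the dictionary layout are hypothetical.

```python
def in_tse(phi_calc, phi_exp, tol=0.15):
    """True if |phi_calc[i] - phi_exp[i]| < tol for every residue i
    that has an experimental phi-value.

    phi_calc -- dict: residue index -> calculated phi-value
    phi_exp  -- dict: residue index -> experimental phi-value
    """
    return all(abs(phi_calc[i] - phi_exp[i]) < tol for i in phi_exp)
```

Residues without an experimental ϕ-value place no restriction on the conformation in this check.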

Figure 8.


Comparison of the experimental ϕ-values and calculated ϕ-values of the generated TSE of 1nvl. The filled circles represent the experimental ϕ-values, empty circles represent the calculated ϕ-values of the generated TSE.

We generate m*=10 000 contact map samples using Algorithm 2, among which 1000 contact maps are chosen with probability proportional to their corresponding weights. For each chosen contact map, Algorithm 1 is used to generate conformations. At most mmax=1000 conformations are generated for each contact map. Figure 8 reports ϕicalc of the generated TSE. It is seen that the generated TSE can faithfully reproduce the ϕ-values. The average rmsd between TSE and the native structure of 1nvl is 11.3 Å. The result shows that the conformations of TSE can be far away from the native structure.

DISCUSSION

Obtaining molecular structures from incomplete and inaccurate distance information provided by experiments is an important problem. Several global optimization methods have been applied to solve this problem,9, 12, 13, 14 in which the goal is to minimize some error function derived from the provided distance information. In this study, we use the SMC method to recover protein structures.

Compared to global optimization methods, an important advantage of our approach is that it can generate a set of conformations that are properly weighted with respect to a specified target distribution. Hence, in addition to recovering structures, we can also provide estimates of important physical parameters of the molecular ensemble, including thermodynamic properties such as energy and entropy under a given energy function.18, 19 In this paper, the average rmsd to the native structure for TSE conformations is a consistent estimate of how close the native structure is to the TSE satisfying the distance constraints indirectly provided by ϕ-values.

A difficulty in growth-based methods, such as the SMC method, is that the distance information of future residues cannot be directly used for placing the current residue. To circumvent this problem, we develop a new growth potential function that can incorporate the distance information of future residues. In this potential function, we convert upper bound constraints of distance for a subset of residue pairs to global distance upper bound constraints of all possible residue pairs. In addition, we introduce reference points of future residues to be placed.

We have used this algorithm to generate protein conformations from constraints in the form of small intervals of distances between a subset of residue pairs, from contact map, and from indirect distance constraints by ϕ-values. This algorithm can effectively recover native structures and can generate conformations satisfying any given set of distance constraints. The conformations generated by this method can also be used as the initial conformations for further refinement.9, 10, 11, 12

In this study, a discrete model for protein structures was used for simplicity, at the price of model accuracy.22 We expect further improvement by extending our model to continuous space, with additional steps of local move refinement, as demonstrated in Refs. 14, 49.

ACKNOWLEDGMENTS

This work was supported by NIH Grant Nos. GM079804-01A1 and GM081682 and by NSF Grant Nos. DBI-0646035 and DMS-0800257.

References

  1. Rhodes G., Crystallography Made Crystal Clear: A Guide for Users of Macromolecular Models (Academic, New York, 1999).
  2. Crippen G. M. and Havel T. F., Distance Geometry and Molecular Conformation (Wiley, New York, 1988).
  3. Rieping W., Habeck M., and Nilges M., Science 309, 303 (2005). doi:10.1126/science.1110428
  4. Falcon C. M. and Matthews K. S., Biochemistry 40, 15650 (2001). doi:10.1021/bi0114067
  5. Cai K., Langen R., Hubbell W. L., and Khorana H. G., Proc. Natl. Acad. Sci. U.S.A. 94, 14267 (1997).
  6. Altenbach C., Marti T., Khorana H., and Hubbell W. L., Science 248, 1088 (1990). doi:10.1126/science.2160734
  7. Altenbach C., Oh K. J., Trabanino R. J., Hideg K., and Hubbell W. L., Biochemistry 40, 15471 (2001). doi:10.1021/bi011544w
  8. Berger B., Kleinberg J., and Leighton T., J. ACM 46, 212 (1999).
  9. Moré J. J. and Wu Z., in Global Minimization of Nonconvex Energy Functions: Molecular Conformation and Protein Folding, edited by Pardalos P. M., Shalloway D., and Xue G. (American Mathematical Society, Providence, 1996), pp. 151–168.
  10. Glunt W., Hayden T. L., and Raydan M., J. Comput. Chem. 14, 114 (1993). doi:10.1002/jcc.540140115
  11. Moré J. J. and Wu Z., in Encyclopedia of Nuclear Magnetic Resonance, edited by Grant D. M. and Harris R. K. (Wiley, New York, 1995), pp. 1701–1710.
  12. Grosso A., Locatelli M., and Schoen F., “Solving molecular distance geometry problems by global optimization algorithms,” Optim. (to be published).
  13. Williams G. A., Dugan J. M., and Altman R. B., J. Comput. Biol. 8, 523 (2001). doi:10.1089/106652701753216521
  14. Vendruscolo M., Kussell E., and Domany E., Folding Des. 2, 295 (1997).
  15. Vendruscolo M. and Domany E., Folding Des. 3, 329 (1998). doi:10.1016/S1359-0278(98)00045-5
  16. Vendruscolo M., Paci E., Dobson C., and Karplus M., Nature (London) 409, 641 (2001). doi:10.1038/35054591
  17. Rosenbluth M. N. and Rosenbluth A. W., J. Chem. Phys. 23, 356 (1955). doi:10.1063/1.1741967
  18. Liang J., Zhang J., and Chen R., J. Chem. Phys. 117, 3511 (2002). doi:10.1063/1.1493772
  19. Zhang J., Lin M., Chen R., Liang J., and Liu J. S., Proteins 66, 61 (2007). doi:10.1002/prot.21203
  20. Frenkel D. and Smit B., Understanding Molecular Simulation: From Algorithms to Applications (Academic, San Diego, 1996).
  21. Landau D. P. and Binder K., Monte Carlo Simulations in Statistical Physics (Cambridge University Press, Cambridge, 2000).
  22. Zhang J., Chen R., and Liang J., Proteins 63, 949 (2006). doi:10.1002/prot.20809
  23. Moré J. J. and Wu Z., J. Global Optim. 15, 219 (1999). doi:10.1023/A:1008380219900
  24. Hom G., Mayo S., and Pierce N., J. Comput. Chem. 24, 232 (2002). doi:10.1002/jcc.10121
  25. Keating A. E., Malashkevich V. N., Tidor B., and Kim P. S., Proc. Natl. Acad. Sci. U.S.A. 98, 14825 (2001). doi:10.1073/pnas.261563398
  26. Moore J. W. and Pearson R. G., Kinetics and Mechanism (Wiley, New York, 1981).
  27. Fersht A. R., Itzhaki L. S., Elmasry N., Matthews J. M., and Otzen D. E., Proc. Natl. Acad. Sci. U.S.A. 91, 10426 (1994). doi:10.1073/pnas.91.22.10426
  28. Li L. and Shakhnovich E. I., Proc. Natl. Acad. Sci. U.S.A. 98, 13014 (2001). doi:10.1073/pnas.241378398
  29. Ozkan S., Bahar I., and Dill K. A., Nat. Struct. Biol. 8, 765 (2001). doi:10.1038/nsb0901-765
  30. Lazaridis T. and Karplus M., Science 278, 1928 (1997). doi:10.1126/science.278.5345.1928
  31. Li A. and Daggett V., Proc. Natl. Acad. Sci. U.S.A. 91, 10430 (1994). doi:10.1073/pnas.91.22.10430
  32. Fersht A. R., Leatherbarrow R. J., and Wells T. N., Biochemistry 26, 6030 (1987). doi:10.1021/bi00393a013
  33. Winter G., Fersht A. R., Wilkinson A. J., Zoller M., and Smith M., Nature (London) 299, 756 (1982). doi:10.1038/299756a0
  34. Paci E., Lindorff-Larsen K., Dobson C., Karplus M., and Vendruscolo M., J. Mol. Biol. 352, 495 (2005). doi:10.1016/j.jmb.2005.06.081
  35. Jackson S. E., Moracci M., ElMasry N., Johnson C. M., and Fersht A. R., Biochemistry 32, 11259 (1993). doi:10.1021/bi00093a001
  36. Miyazawa S. and Jernigan R., Macromolecules 18, 534 (1985). doi:10.1021/ma00145a039
  37. Li X., Hu C., and Liang J., Proteins 53, 792 (2003). doi:10.1002/prot.10442
  38. Zhang J., Chen R., Liu J., and Liang J., Proteins 63, 949 (2006). doi:10.1002/prot.20809
  39. Li X. and Liang J., Computational Algorithms for Protein Structure Prediction (Springer, New York, 2006).
  40. Marshall A., in Symposium on Monte Carlo Methods, edited by Meyer M. (Wiley, New York, 1956), pp. 123–140.
  41. Liu J. and Chen R., J. Am. Stat. Assoc. 93, 1032 (1998). doi:10.2307/2669847
  42. Liu J. S., Monte Carlo Strategies in Scientific Computing (Springer, New York, 2001).
  43. Fearnhead P. and Clifford P., J. R. Stat. Soc. Ser. B (Stat. Methodol.) 65, 887 (2003).
  44. Atallah M. J., Algorithms and Theory of Computation Handbook (CRC, Boca Raton, FL, 1998).
  45. Householder A. S., Principles of Numerical Analysis (McGraw-Hill, New York, 1953).
  46. Chen S. X. and Liu J. S., Stat. Sin. 7, 875 (1997).
  47. Chen Y., Diaconis P., Holmes S. P., and Liu J. S., J. Am. Stat. Assoc. 100, 109 (2005). doi:10.1198/016214504000001303
  48. Kragelund B. B., Osmark P., Neergaard T. B., Schiødt J., Kristiansen K., Knudsen J., and Poulsen F. M., Nat. Struct. Biol. 6, 594 (1999). doi:10.1038/9384
  49. Zhang J., Kou S. C., and Liu J. S., J. Chem. Phys. 126, 225101 (2007). doi:10.1063/1.2736681

Articles from The Journal of Chemical Physics are provided here courtesy of American Institute of Physics
