Author manuscript; available in PMC: 2013 Jun 21.
Published in final edited form as: J Chem Phys. 2008 Feb 28;128(8):084903. doi: 10.1063/1.2831905

Statistical geometry of lattice chain polymers with voids of defined shapes: Sampling with strong constraints

Ming Lin 1, Rong Chen 1,2, Jie Liang 2,*
PMCID: PMC3689594  NIHMSID: NIHMS397401  PMID: 18315083

Abstract

Proteins contain many voids, which are unfilled spaces enclosed in the interior. A few of them have shapes compatible with ligands and substrates, and are important for protein function. An important general question is how the need to maintain functional voids is influenced by, and affects, other aspects of protein structure and properties (e.g., protein folding stability, kinetic accessibility, and evolutionary selection pressure). In this paper, we examine in detail the effects of maintaining voids of different shapes and sizes using two-dimensional lattice models. We study the propensity for conformations to form a void of a specific shape, which is related to the entropic cost of void maintenance. We also study the locations at which voids of a specific shape and size tend to form, and the influence of compactness on the formation of such voids. As enumeration is infeasible for long chain polymers, a key development in this work is the design of a novel sequential Monte Carlo strategy for generating large numbers of sample conformations under very constraining restrictions. Our method is validated by comparing results obtained from sampling and from enumeration for short polymer chains. We succeeded in accurately estimating the entropic cost of void maintenance, with and without an increasing number of restrictive conditions, such as requiring the loop that forms the wall of the void to have a fixed length, and additionally a fixed starting position in the sequence. Additionally, we have identified the key structural properties of voids that are important in determining the entropic cost of void formation. We have further developed a parametric model to predict void entropy quantitatively. Our model is highly effective, and these results indicate that voids representing functional sites can be used as an improved model for studying the evolution of protein functions, and how protein function relates to protein stability.

Keywords: void shape, constraint, sequential Monte Carlo, entropy, propensity

I. INTRODUCTION

Proteins are the working molecules of the cell. Understanding how they maintain their stability and carry out their functions is a fundamental problem of molecular biology. Although it is well known that the structures of proteins are well packed1–3, there exist numerous packing defects in the form of voids buried in the interior of proteins. The size distributions of these voids are broad4. Various scaling relationships indicate that their origin may be generic steric constraints of compact chain polymers4,5. It is also well known that a few voids on a protein may play key roles in enabling protein functions6–9, for example, substrate and ligand binding.

However, the shape space of voids of folded and unfolded proteins is not well characterized, and the energetic consequences and kinetic effects of maintaining voids of a certain shape and size are largely unknown. In this paper, we examine in detail the effects of maintaining voids of different shapes in lattice models of chain polymers. Lattice models have been widely used for studying protein folding, where the conformational space of simplified polymers can be examined in detail10–18. Despite their simplistic nature, lattice models have provided important insights about proteins, including collapse and folding transitions16,19–23, the influence of packing on secondary structure and void formation11,12,24,25, the evolution of protein function26,27, nascent chain folding18, and the effects of chirality and side chains25.

In this paper, we focus on conformations that enclose voids of specific shapes. Our main objective is to study the fraction of conformations with a specific void shape among all possible conformations. This is related to the entropic cost of maintaining such a void in a polymer structure. We also study the locations at which voids of a specific shape and size tend to form, and the influence of compactness on the formation of such voids. The methodology we use is the sequential Monte Carlo (SMC) approach, designed for sampling conformations under strong constraints, i.e., the requirement of the existence of specific types of voids. SMC is a growth-based method, in which residues are added to the chain polymer one by one until the conformation of full length is obtained. This method was first used in reference 28 to estimate the average extension of molecular chains. The basic goal is to obtain a set of conformational samples, along with the probabilities of generating these conformations. Compared with other sampling methods, such as Markov chain Monte Carlo29–31, sequential Monte Carlo can generate diverse samples and can directly and accurately estimate the number of conformations containing voids of specific shapes. In this study, we develop several new strategies to improve the effectiveness of sequential Monte Carlo in generating samples under strongly constrained conditions.

Our paper is organized as follows. In Section II, we briefly describe the lattice model and define voids and their shapes. We then discuss the constrained sequential Monte Carlo method used in our study. Results are presented in Section III. The final section contains the summary and conclusion.

II. METHOD

A. Lattice Model

In lattice models, chain polymers are self-avoiding walks (SAWs) in the square lattice space ℤ². A length n conformation is denoted by a connected chain Xn = (x1, x2, ⋯, xn), where the i-th monomer is located at the site xi = (ai, bi), and ai and bi are integers. The Manhattan distance between bonded monomers xi and xi+1 is 1. The chain is self-avoiding: xi ≠ xj for all i ≠ j. We consider the beginning and the end of a polymer to be distinct. Only conformations that are not related by translation, rotation, and reflection are considered to be distinct. This is achieved by following the rule that a chain is always grown from the origin, the first step is always to the right, and the chain always goes up the first time it deviates from the x-axis. We denote the set of all length-n SAW polymers satisfying these constraints as 𝒫n.
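The canonical-orientation rule above can be illustrated with a short enumeration. This is a minimal sketch rather than the authors' code; the function name `enumerate_canonical_saws` is hypothetical, and the approach is only practical for short chains.

```python
def enumerate_canonical_saws(n):
    """Yield every length-n conformation (n >= 2 monomers) in canonical orientation.

    Canonical orientation: x1 = (0, 0), x2 = (1, 0), and the first step that
    leaves the x-axis goes up.
    """
    moves = ((1, 0), (0, 1), (-1, 0), (0, -1))

    def grow(chain, occupied, on_axis):
        if len(chain) == n:
            yield tuple(chain)
            return
        x, y = chain[-1]
        for dx, dy in moves:
            site = (x + dx, y + dy)
            if site in occupied:
                continue  # self-avoidance
            if on_axis and dy == -1:
                continue  # first deviation from the x-axis must be upward
            chain.append(site)
            occupied.add(site)
            yield from grow(chain, occupied, on_axis and dy == 0)
            occupied.remove(site)
            chain.pop()

    start = [(0, 0), (1, 0)]
    yield from grow(start, set(start), True)


def count_canonical_saws(n):
    return sum(1 for _ in enumerate_canonical_saws(n))
```

Because chain direction is kept (the two ends are distinct), only rotations, reflections, and translations are factored out; for example, `count_canonical_saws(5)` counts the 4-step walks up to those symmetries.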

B. Voids and shape of voids

Given the conformation Xn ∈ 𝒫n of a chain polymer, the unoccupied sites on the square lattice are divided by the polymer into disconnected components:

$$\mathbb{Z}^2 \setminus X_n = u \cup \upsilon_1 \cup \cdots \cup \upsilon_k,$$

where u is the outside component that connects to infinity, and υ1, ⋯, υk are the voids that are enclosed by Xn. Here two components are considered connected if they share any edges or vertices. By this definition, conformations (a) and (b) in Figure 1 both have a size-2 void, but conformation (c) does not contain any void, since the two internal unoccupied sites are connected to the outside through a vertex. This definition of a void is arbitrary, but is consistent with the definition of contact for monomers in a chain; that is, two sites in a void are considered to be connected only if they share an edge24.
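The decomposition of unoccupied sites into an outside component and enclosed voids can be sketched as a flood fill. The sketch below assumes the connectivity rule stated above (unoccupied sites are joined when they share an edge or a vertex); `find_voids` is a hypothetical helper name, not taken from the paper.

```python
from collections import deque

NEIGHBORS8 = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]


def find_voids(chain):
    """Return the enclosed voids of a conformation as a list of frozensets of sites."""
    occupied = set(chain)
    xs = [x for x, _ in chain]
    ys = [y for _, y in chain]
    # Pad the bounding box by one site: the padding ring belongs to the outside.
    x0, x1 = min(xs) - 1, max(xs) + 1
    y0, y1 = min(ys) - 1, max(ys) + 1

    seen, voids = set(), []
    for sx in range(x0, x1 + 1):
        for sy in range(y0, y1 + 1):
            if (sx, sy) in occupied or (sx, sy) in seen:
                continue
            comp, outside, queue = set(), False, deque([(sx, sy)])
            seen.add((sx, sy))
            while queue:
                x, y = queue.popleft()
                comp.add((x, y))
                if x in (x0, x1) or y in (y0, y1):
                    outside = True  # component reaches the padding ring
                for dx, dy in NEIGHBORS8:
                    nb = (x + dx, y + dy)
                    if (x0 <= nb[0] <= x1 and y0 <= nb[1] <= y1
                            and nb not in occupied and nb not in seen):
                        seen.add(nb)
                        queue.append(nb)
            if not outside:
                voids.append(frozenset(comp))
    return voids
```

A 10-monomer ring around two adjacent sites yields one size-2 void, while removing a single corner monomer lets the interior leak to the outside through a vertex, so no void is reported.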

Fig 1.

Conformations in square lattice model. (a) This conformation contains a size-2 void. (b) This conformation contains a size-2 void. (c) This conformation does not contain any void.

We are interested in the set of conformations with voids of a particular shape S. Figure 2 shows some of those shapes of sizes 2 to 6. Note that those shapes are regular shapes, in which sites are connected by edges. In this study we do not consider shapes with sites connected by a vertex only, such as the size-2 void formed by conformation (b) in Figure 1. Here the voids are labelled by their shapes, where the first digit represents the number of sites the void occupies, and the second digit is the identification number of different shapes.

Fig 2.

Regular shapes of voids of different sizes. Here the first digit represents the number of sites the void occupies, and the second digit is the identification number for different shapes. All possible shapes for voids up to size 4 are listed. Several samples for voids of size 5 and 6 are also listed.

C. Parameters of interest

To study the properties of conformations with voids of specific shapes, we consider the following parameters.

a. Propensity of void formation f1(S, n)

Let Ωn(S) be the set of conformations with at least one void of a specific shape S, that is

$$\Omega_n(S) = \{X_n \mid X_n \in \mathcal{P}_n,\ X_n \text{ has at least one void of shape } S\}.$$

The fraction of conformations with void of this particular shape S among all possible conformations is:

$$f_1(S,n) = \frac{N(S,n)}{N_{\text{all}}(n)} = \frac{\sum_{X_n \in \Omega_n(S)} 1}{\sum_{X_n \in \mathcal{P}_n} 1}. \tag{1}$$

This parameter represents the propensity of void formation, i.e., the probability of forming a void of specified shape. This relates to the question whether there are preferred shapes for binding voids to occur.

b. Propensity of void formation with fixed loop length f2(l, S, n)

The loop length of a void is defined as l = I1 − I0 + 1, where I0 and I1 are the smallest and largest indices of the monomers forming the boundary of the void. Let Ωn(l, S) ⊂ Ωn(S) be the set of length n conformations with at least one void of shape S of loop length l. The fraction of conformations with a void of a particular shape S and a particular void loop length l among all conformations with a void of the same shape but without the restriction on void loop length is defined as:

$$f_2(l,S,n) = \frac{N(l,S,n)}{N(S,n)} = \frac{\sum_{X_n \in \Omega_n(l,S)} \xi(X_n,l,S)/K(X_n,S)}{\sum_{X_n \in \Omega_n(S)} 1}, \tag{2}$$

where K(Xn, S) is the number of shape-S voids in Xn, ξ(Xn, l, S) is the number of shape-S voids with loop length l in Xn. This special treatment on N(l, S, n) is to deal with the cases when Xn has multiple voids of shape-S. In such cases, Xn is counted once in N(S, n), and counted 1/K(Xn, S) in N(l, S, n) for each combination of shape-S void and loop length l. For example, if conformation Xn has two voids of shape S, then K(Xn, S) = 2. If both voids have loop length l = 14, then ξ(Xn, l = 14, S) = 2 and this conformation contributes 1 to N(l = 14, S, n); if one void has loop length l = 14 and the other void has loop length l = 16, then this conformation contributes 1/2 to N(l = 14, S, n) and 1/2 to N(l = 16, S, n). Clearly, with this definition, we have

$$\sum_{l} N(l,S,n) = N(S,n).$$
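The counting rule for conformations with several shape-S voids can be made concrete with a small sketch. The helper names are hypothetical, and `loop_length` assumes that the boundary of a void consists of the monomers sharing an edge or a vertex with it.

```python
from collections import Counter
from fractions import Fraction


def loop_length(chain, void):
    """l = I1 - I0 + 1, with 1-based indices of the monomers touching the void."""
    wall_idx = [i for i, (x, y) in enumerate(chain, start=1)
                if any((x + dx, y + dy) in void
                       for dx in (-1, 0, 1) for dy in (-1, 0, 1))]
    return wall_idx[-1] - wall_idx[0] + 1


def contributions(loop_lengths):
    """Map each loop length l to xi(X, l, S) / K(X, S) for one conformation.

    `loop_lengths` lists the loop length of every shape-S void in the
    conformation, so K(X, S) = len(loop_lengths).
    """
    k = len(loop_lengths)
    return {l: Fraction(count, k) for l, count in Counter(loop_lengths).items()}
```

The contributions of one conformation always sum to 1, which is what makes the sum of N(l, S, n) over l equal to N(S, n).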

The parameter f2(l, S, n) represents the propensity of void formation with fixed loop length, i.e., the probability of forming a void of specified shape with fixed void loop length. In proteins, a related interesting question is how much easier it is to form voids of a certain shape and size with more local sequence fragments than with more global ones.

c. Propensity of void formation with fixed loop length and starting position f3(I0, l, S, n)

Let Ωn(I0, l, S) ⊂ Ωn(l, S) be the set of length n conformations with at least one void of shape S, loop length l, and starting at residue position I0. The fraction of conformations with void of a particular shape S, a particular loop length l, and a particular starting residue I0 among the conformations with a void of the same shape and the same loop length12,32 is defined as

$$f_3(I_0,l,S,n) = \frac{N(I_0,l,S,n)}{N(l,S,n)} = \frac{\sum_{X_n \in \Omega_n(I_0,l,S)} \xi(X_n,I_0,l,S)/K(X_n,S)}{\sum_{X_n \in \Omega_n(l,S)} \xi(X_n,l,S)/K(X_n,S)}, \tag{3}$$

where ξ(Xn, I0, l, S) is the number of shape-S voids with loop length l and starting residue I0 in Xn. Similarly, this definition ensures that

$$\sum_{I_0} N(I_0,l,S,n) = N(l,S,n).$$

The parameter f3(I0, l, S, n) represents the propensity of void formation with fixed loop length and starting position, i.e., the probability of forming a void of specified shape with fixed void loop length starting at a specific position. A related question in proteins is the propensity of forming voids of a certain shape with more local or more global sequence fragments starting at specific positions of the chain.

d. Propensity of void formation at specific compactness f4(ρ, S, n)

The compactness of a polymer ρ is defined as t/tmax(n)11, where t is the number of contacts in the conformation, and tmax(n) is the maximum number of contacts possible for length n conformations. For the square lattice, we have11:

$$t_{\max}(n) = \begin{cases} n - 2b, & \text{if } b^2 < n \le b(b+1),\\ n - 2b - 1, & \text{if } b(b+1) < n \le (b+1)^2, \end{cases}$$

where b is a positive integer. Let Ωn(ρ) ⊂ 𝒫n be the set of length n conformations with compactness ρ and Ωn(ρ, S) ⊂ Ωn(ρ) be the set of length n conformations with at least one void of shape S and compactness ρ. The fraction of conformations of a particular compactness ρ with void of a particular shape S among all conformations with the same compactness ρ is defined as:

$$f_4(\rho,S,n) = \frac{N(\rho,S,n)}{N(\rho,n)} = \frac{\sum_{X_n \in \Omega_n(\rho,S)} 1}{\sum_{X_n \in \Omega_n(\rho)} 1}. \tag{4}$$

This parameter represents the propensity of void formation with certain compactness, i.e., the probability of forming a void of specified shape for length n chain polymers at a fixed compactness.
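The compactness ρ = t/tmax(n) can be computed directly from a conformation. The sketch below uses hypothetical function names and assumes the piecewise formula above, with a contact defined as a pair of lattice-adjacent monomers that are not bonded neighbors along the chain.

```python
import math


def t_max(n):
    """Maximum number of contacts for a length-n chain on the square lattice."""
    b = math.isqrt(n - 1)  # the integer b with b^2 < n <= (b + 1)^2
    return n - 2 * b if n <= b * (b + 1) else n - 2 * b - 1


def contacts(chain):
    """Count lattice-adjacent monomer pairs that are not consecutive in the chain."""
    index = {site: i for i, site in enumerate(chain)}
    t = 0
    for (x, y), i in index.items():
        for nb in ((x + 1, y), (x, y + 1)):  # each adjacent pair is visited once
            j = index.get(nb)
            if j is not None and abs(i - j) > 1:
                t += 1
    return t


def compactness(chain):
    """rho = t / t_max(n); only meaningful for n >= 4, where t_max(n) > 0."""
    return contacts(chain) / t_max(len(chain))
```

For a chain that fills a 3 × 3 square, all 4 possible contacts are present and the compactness is 1.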

D. Estimating void parameters using sequential Monte Carlo

Exhaustive enumeration can be used to calculate the propensities defined above, but is only applicable to very short polymer chains. For longer chains, we use a modified version of the sequential Monte Carlo (SMC) method.

All the parameters described above are fractions, where the corresponding numerators and denominators N(S, n), N(l, S, n), N(I0, l, S, n) and N(ρ, S, n) can be written in the form of

$$\sum_{X_n \in \Omega_n(S)} h(X_n), \tag{5}$$

where h(·) is a function of conformation Xn. Specifically, we have:

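$$\begin{aligned}
h(X_n) &= 1 && \text{for } N(S,n);\\
h(X_n) &= \mathbb{I}_{\Omega_n(l,S)}(X_n)\,\frac{\xi(X_n,l,S)}{K(X_n,S)} && \text{for } N(l,S,n);\\
h(X_n) &= \mathbb{I}_{\Omega_n(I_0,l,S)}(X_n)\,\frac{\xi(X_n,I_0,l,S)}{K(X_n,S)} && \text{for } N(I_0,l,S,n);\\
h(X_n) &= \mathbb{I}_{\Omega_n(\rho,S)}(X_n) && \text{for } N(\rho,S,n),
\end{aligned}$$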

where 𝕀Ω(Xn) is the indicator function: 𝕀Ω(Xn) = 1 if Xn is in the set Ω, and 𝕀Ω(Xn) = 0 otherwise.

Suppose we can generate random samples of conformations $X_n^{(j)}$, j = 1, ⋯, m, from a trial distribution g(Xn). Following the importance sampling principle33,34, formula (5) can be estimated as:

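$$\frac{1}{m}\sum_{j=1}^{m} \frac{h(X_n^{(j)})\,\mathbb{I}_{\Omega_n(S)}(X_n^{(j)})}{g(X_n^{(j)})} \approx \mathbb{E}_g\!\left[\frac{h(X)\,\mathbb{I}_{\Omega_n(S)}(X)}{g(X)}\right] = \sum_{X \in \Omega_n(S)} \frac{h(X)}{g(X)}\cdot g(X) = \sum_{X_n \in \Omega_n(S)} h(X_n). \tag{6}$$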

Note that to obtain an unbiased estimate, the support of the trial distribution g(Xn) must contain the support of h(Xn)𝕀Ωn(S)(Xn); that is, g(Xn) > 0 must hold for all Xn in Ωn(S) that satisfy h(Xn) > 0. Let $w_n^{(j)} = 1/g(X_n^{(j)})$ be the weight of sample $X_n^{(j)}$; then Eqn (6) can be rewritten as

$$\sum_{X_n \in \Omega_n(S)} h(X_n) \approx \frac{1}{m}\sum_{j=1}^{m} w_n^{(j)}\, h(X_n^{(j)})\,\mathbb{I}_{\Omega_n(S)}(X_n^{(j)}). \tag{7}$$

The efficiency of the estimator of Eqn (6) depends on the choice of the trial distribution g(Xn) and the computational complexity for generating a sample. In general, if g(Xn) is approximately proportional to |h(Xn)𝕀Ωn(S)(Xn)|, with a support larger but close to Ωn(S), the estimate can be reasonably accurate34.

The original Rosenbluth and Rosenbluth growth method generates samples in the space of 𝒫n28. Starting at x1 = (0, 0), monomers are added to the chain and the associated weights are updated recursively, until the chain reaches length n. Modifications of the algorithm can be found in references 24,34–36. However, the space under our consideration is a highly constrained subspace of 𝒫n. For example, for void shape 4.1 and chain length 50, the size of the constrained space Ωn(S) is less than 2 × 10⁻³ of the size of 𝒫n. With additional constraints such as fixed loop length, the space becomes even smaller and sampling such conformations becomes more difficult. The simple growth method of reference 28 is very inefficient in generating samples for such constrained spaces. Below we reformulate the sampling space and modify the growth method to overcome this difficulty.
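The importance-sampling estimate with weights 1/g(X) can be illustrated with the plain growth method on an unconstrained problem. The sketch below is not the constrained sampler developed in this paper: it estimates the total number of self-avoiding walks with a given number of steps from the origin (without removing rotations and reflections) and compares the estimate with exhaustive enumeration. All function names are hypothetical.

```python
import random

MOVES = ((1, 0), (0, 1), (-1, 0), (0, -1))


def rosenbluth_weight(n_steps, rng):
    """Grow one walk; return its weight 1/g(X), or 0.0 if the growth is trapped."""
    last, occupied, weight = (0, 0), {(0, 0)}, 1.0
    for _ in range(n_steps):
        free = [(last[0] + dx, last[1] + dy) for dx, dy in MOVES
                if (last[0] + dx, last[1] + dy) not in occupied]
        if not free:
            return 0.0  # dead end: the sample contributes zero
        weight *= len(free)  # g picks uniformly among the free neighbors
        last = rng.choice(free)
        occupied.add(last)
    return weight


def estimate_saw_count(n_steps, m, seed=0):
    """Average of the weights over m independent growth attempts."""
    rng = random.Random(seed)
    return sum(rosenbluth_weight(n_steps, rng) for _ in range(m)) / m


def exact_saw_count(n_steps):
    """Exhaustive enumeration, feasible only for short walks."""
    def rec(pos, occupied, left):
        if left == 0:
            return 1
        total = 0
        for dx, dy in MOVES:
            nb = (pos[0] + dx, pos[1] + dy)
            if nb not in occupied:
                occupied.add(nb)
                total += rec(nb, occupied, left - 1)
                occupied.remove(nb)
        return total
    return rec((0, 0), {(0, 0)}, n_steps)
```

With h(X) = 1 the average weight estimates the size of the sampled space; restricting the sum to walks that happen to enclose a given void would be unbiased as well, but very few samples would contribute, which is the inefficiency described above.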

1. An equivalent representation of Ωn(S)

In order to avoid location ambiguity, the construction of 𝒫n is restricted to the set of SAW conformations starting at x1 = (0, 0), x2 = (1, 0) and going up the first time the chain deviates from the x-axis. Since our main interest is sampling conformations containing a specific void, we adopt an equivalent representation that is more efficient for our purpose.

Specifically, let υ = υ(S) be a set of sites in ℤ², whose union takes the shape S. Let $A(\upsilon) = (a_1(\upsilon), \cdots, a_{|A(\upsilon)|}(\upsilon))$ be the set of neighboring sites of υ, sharing either edges or vertices with υ. We call these the wall sites of υ. If a SAW completely occupies A(υ) and does not intersect with υ, then this SAW has at least one void of shape S. Denote the set of all such conformations as

$$G_n(\upsilon) = \{X_n \mid X_n \text{ is a SAW},\ A(\upsilon) \subset X_n,\ \upsilon \cap X_n = \emptyset\}.$$

Recall that by definition a conformation in 𝒫n first grows to the right, and always goes up when it first deviates from the x-axis. Note that the conformations in Gn(υ) are not restricted to 𝒫n, as they can start from any site on the lattice. In Gn(υ), we consider two SAWs as equivalent if one SAW can be transformed into the other through a combination of rotation, reflection and translation. Then Gn(υ) consists of a number of disjoint equivalent classes.

It can be easily established that there is a one-to-one mapping between conformations in Ωn(S) and the equivalent classes of conformations in Gn(υ) through transformations composed of rotations, reflections, and translations. Each such transformation maps the starting site x1 of Xn ∈ Gn(υ) to the origin (0, 0) and the second site x2 to (1, 0), and makes the first step that deviates from the x-axis go up. Hence, if h(·) is a function of Xn that takes the same value for equivalent conformations, we have:

$$\sum_{X_n \in \Omega_n(S)} h(X_n) = \sum_{X_n \in G_n(\upsilon)} \frac{h(X_n)}{E(X_n,S)},$$

where E(Xn, S) is the number of equivalent conformations of Xn in Gn(υ).
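The map from an arbitrary SAW to its representative in canonical orientation can be sketched as follows. The function name `canonicalize` is hypothetical; the sketch assumes a chain of at least two monomers given as a list of integer lattice sites.

```python
def canonicalize(chain):
    """Translate, rotate and, if needed, reflect a SAW into canonical orientation.

    After the transformation x1 = (0, 0), x2 = (1, 0), and the first monomer
    off the x-axis lies above it.
    """
    x0, y0 = chain[0]
    pts = [(x - x0, y - y0) for x, y in chain]  # translation: x1 -> origin
    # Rotation that sends the first bond direction to (1, 0).
    rotations = {
        (1, 0): lambda p: (p[0], p[1]),
        (0, 1): lambda p: (p[1], -p[0]),
        (-1, 0): lambda p: (-p[0], -p[1]),
        (0, -1): lambda p: (-p[1], p[0]),
    }
    rotate = rotations[pts[1]]
    pts = [rotate(p) for p in pts]
    # Reflection about the x-axis if the first deviation goes down.
    first_off_axis = next((y for _, y in pts if y != 0), 0)
    if first_off_axis < 0:
        pts = [(x, -y) for x, y in pts]
    return pts
```

Every member of an equivalent class is mapped to the same canonical chain, so counting distinct outputs of `canonicalize` counts classes.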

The number E(Xn, S) depends on K(Xn, S), the number of shape-S voids contained in Xn (as in Eqn (2)), and on the symmetry of the shape S. Let q(υ) be the number of combinations of rotation and reflection that map υ to itself. In two-dimensional lattice space, there are 4 possible rotations and 2 possible reflections around the x and y axes. Hence, q(υ) can only take a value in {1, 2, 4, 8}. For example, q(υ) = 8 for shape 4.2 in Figure 2, q(υ) = 4 for shapes 2.1 and 3.1, q(υ) = 2 for shapes 4.4 and 4.5, and q(υ) = 1 for shape 4.3.

When Xn contains only one S-shaped void, the size of its equivalent class E(Xn, S) is q(υ). Figure 3 shows 4 polymers in an equivalent class for void 2.1. When Xn contains a total of K(Xn, S) S-shaped voids, then E(Xn, S) = q(υ)K(Xn, S), as each of the voids contributes q(υ) members to the equivalent class.
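The wall sites A(υ) and the symmetry number q(υ) can be computed mechanically from the site set of a void. The sketch below uses hypothetical helper names; the example shapes are named by geometry (domino, 2 × 2 square, and so on) rather than by the identifiers of Figure 2, since the figure itself is not reproduced here.

```python
def normalize(sites):
    """Translate a set of sites so that its minimum x and y are both zero."""
    mx = min(x for x, _ in sites)
    my = min(y for _, y in sites)
    return frozenset((x - mx, y - my) for x, y in sites)


# The 8 combinations of rotation and reflection of the square lattice.
TRANSFORMS = [
    lambda x, y: (x, y), lambda x, y: (-y, x),
    lambda x, y: (-x, -y), lambda x, y: (y, -x),
    lambda x, y: (x, -y), lambda x, y: (-x, y),
    lambda x, y: (y, x), lambda x, y: (-y, -x),
]


def symmetry_count(void):
    """q(v): number of rotation/reflection combinations mapping the shape to itself."""
    base = normalize(void)
    return sum(normalize([t(x, y) for x, y in void]) == base for t in TRANSFORMS)


def wall_sites(void):
    """A(v): sites sharing an edge or a vertex with the void, excluding the void."""
    void = set(void)
    ring = {(x + dx, y + dy) for x, y in void
            for dx in (-1, 0, 1) for dy in (-1, 0, 1)}
    return ring - void
```

For a domino-shaped void this gives q = 4 and 10 wall sites, and for a 2 × 2 square q = 8 and 12 wall sites.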

Fig 3.

The equivalent class of conformations. The union of the sites occupied by stars (*) is the fixed void υ. Here polymer a, b, c and d are different chains enclosing void υ, as indicated by the different sites occupied by the starting monomer x1 and the next monomer x2. However, the shapes of the polymers taken up by the union of the occupied sites for these chains are the same. As a consequence, these four polymers are equivalent.

Fig 4.

The general procedure for growing chains. The union of the sites occupied by stars (*) is the fixed void υ. (a) The k-th monomer y1 = xk is first placed to the position ai(υ) of the wall sites of the void υ. (b) We then grow backward until we reach the first monomer yk = x1 of the chain to form void υ. (c) We continue by growing forward until we reach Xn.

To simplify our analysis, we note that Gn(υ) consists of disjoint subsets:

$$G_n(\upsilon) = \bigcup_{i,k} G_n(\upsilon,i,k),$$

where

$$G_n(\upsilon,i,k) = \{X_n \mid X_n \in G_n(\upsilon),\ x_k = a_i,\ A(\upsilon) \subset \{x_1, \cdots, x_k\}\}.$$

If Xn ∈ Gn(υ, i, k), then the void υ is completely enclosed by the prefix (x1, ⋯, xk) of the chain, where xk is the last monomer in the prefix and occupies the i-th site ai(υ) of the wall sites. We have k ≥ |A(υ)| since some of the monomers in the prefix (x1, ⋯, xk) may not be on the wall of the void. The remaining chain, (xk+1, …, xn), intersects neither the void space υ nor the wall sites A(υ).

Using this partition, we have that for any function h(·) that is constant within the equivalent classes,

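$$\sum_{X_n \in \Omega_n(S)} h(X_n) = \sum_{X_n \in G_n(\upsilon)} \frac{h(X_n)}{q(\upsilon)K(X_n,S)} = \frac{1}{q(\upsilon)} \sum_{i,k}\ \sum_{X_n \in G_n(\upsilon,i,k)} \frac{h(X_n)}{K(X_n,S)}. \tag{8}$$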

In the following we develop procedures to estimate the quantity

$$\sum_{X_n \in G_n(\upsilon,i,k)} \frac{h(X_n)}{K(X_n,S)},$$

for each subset Gn(υ, i, k), i = 1, ⋯, |A(υ)|, k = |A(υ)|, ⋯, n.

2. Algorithmic steps

The following procedure is used to generate Monte Carlo samples in Gn(υ, i, k) for all υ, i and k, which are then used to estimate the parameters listed in Section II C. First, we set xk = ai(υ) as defined by Gn(υ, i, k). We then grow backwards sequentially to place xk−1, xk−2, ⋯, until we reach the first monomer x1 of the chain. During this process, the wall sites A(υ) become fully occupied by monomers in {x1, …, xk}, and the void space υ remains unoccupied. Lastly, now that (x1, ⋯, xk) are placed and the constraints for void formation are satisfied, we complete the remaining conformation by sequentially placing monomers xk+1, ⋯, xn. The only constraint at this stage is that these monomers avoid the partial chain grown so far. An illustration of the procedure is shown in Figure 4.

For ease of presentation, we rearrange the monomer labels based on the above procedure. Define yn = (y1, …, yn) as (xk, …, x1, xk+1, …, xn). Formally, $y_s = x_{k-s+1}$ for s ≤ k, and $y_s = x_s$ for s > k. In this notation, the chain prefix of length s becomes ys = (y1, …, ys).
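The relabeling from growth order y back to chain order x (and vice versa) amounts to reversing the first k entries. A minimal sketch, with a hypothetical function name and 0-based Python lists standing for the 1-based sequences of the text:

```python
def relabel(x, k):
    """Return y = (x_k, ..., x_1, x_{k+1}, ..., x_n) for a list x = [x_1, ..., x_n].

    The map is its own inverse: relabel(relabel(x, k), k) == x.
    """
    return x[:k][::-1] + x[k:]
```

For example, with k = 3 the chain order (x1, x2, x3, x4, x5, x6) becomes the growth order (x3, x2, x1, x4, x5, x6).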

We adopt the general framework of the optimal sampling method37 to generate sample conformations. Let mi be the number of samples we retain in the i-th iteration, and mmax be the maximum value of {mi}. In the initial step, we set m1 = 1, $y_1^{(1)} = a_i(\upsilon)$, and $w_1^{(1)} = m_{\max}$. For s = 2, …, n, we perform the following procedure:

  1. At step s, when adding the s-th monomer, assume there are $m_{s-1}$ samples $\{\mathbf{y}_{s-1}^{(j)},\ j = 1, \cdots, m_{s-1}\}$ with weights $w_{s-1}^{(j)}$.

  2. We now add the s-th monomer to the partial chain $\mathbf{y}_{s-1}$. For each sample $\mathbf{y}_{s-1}^{(j)}$, $j = 1, \cdots, m_{s-1}$, generate $l_s^{(j)}$ new samples $\tilde{\mathbf{y}}_s^{(l)}$ by placing $y_s$ at each of the vacant sites neighboring the monomer $y_{s-1}^{(j)}$, where $l_s^{(j)}$ is the number of vacant sites neighboring $y_{s-1}^{(j)}$ in sample $\mathbf{y}_{s-1}^{(j)}$. Set the weight $\tilde{w}_s^{(l)} = w_{s-1}^{(j)}$. Assume this step results in a total of $L_s = \sum_j l_s^{(j)}$ samples $(\tilde{\mathbf{y}}_s^{(l)}, \tilde{w}_s^{(l)})$.

    Note that step k + 1 is slightly different. At steps 1, …, k, we grow the chain backwards. But at step k + 1, we start to grow the chain forward. That is, we place yk+1 = xk+1, which is connected to y1 = xk. Hence, at step k + 1, we consider the vacant neighbor(s) of y1, not yk.

  3. Assign a priority score $\beta_s^{(l)}$ to each resulting partial chain $\tilde{\mathbf{y}}_s^{(l)}$. The choice of the priority scores will be discussed in detail in the next section.

  4. If $L_s \le m_{\max}$, we keep all of the samples with their weights, set $m_s = L_s$ and go to step s + 1. If $L_s > m_{\max}$, we choose $m_{\max}$ distinct samples from $\{\tilde{\mathbf{y}}_s^{(l)},\ l = 1, \cdots, L_s\}$ according to the priority scores as follows:
    1. Find a constant c such that $\sum_{l=1}^{L_s} \min\{c\beta_s^{(l)}, 1\} = m_{\max}$.
    2. Choose distinct integers $J_1, J_2, \cdots, J_{m_{\max}}$ from $l = 1, \cdots, L_s$, with probability $b_s^{(l)} = \min\{c\beta_s^{(l)}, 1\}$. This is achieved by the following steps:
      1. Draw a sample $r_0$ from the uniform distribution between 0 and 1. Let $r_j = j - r_0$ for $j = 1, \cdots, m_{\max}$;
      2. For each $j = 1, \cdots, m_{\max}$, choose $J_j$ as the integer such that $\sum_{l=1}^{J_j - 1} b_s^{(l)} < r_j \le \sum_{l=1}^{J_j} b_s^{(l)}$ holds.
    3. Let $\mathbf{y}_s^{(j)} = \tilde{\mathbf{y}}_s^{(J_j)}$ and update its weight to be $w_s^{(j)} = \tilde{w}_s^{(J_j)} / \min\{c\beta_s^{(J_j)}, 1\}$.
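The selection in step 4 can be sketched in a few lines. The code below is an illustration under stated assumptions, not the authors' implementation: the names `find_scale` and `resample` are hypothetical, the offsets are taken as r_j = j − r_0, and more than m_max priority scores are assumed to be strictly positive.

```python
import random


def find_scale(betas, m):
    """Return c such that sum(min(c * b, 1) for b in betas) equals m.

    Assumes that more than m of the scores are strictly positive.
    """
    order = sorted(betas, reverse=True)
    tail = float(sum(order))
    for t, b in enumerate(order[:m]):
        c = (m - t) / tail  # trial value with the t largest scores capped at 1
        if c * b <= 1.0:
            return c
        tail -= b  # this score is capped as well; try again without it
    raise ValueError("need more than m strictly positive scores")


def resample(samples, weights, betas, m, rng):
    """Keep m distinct samples; return a list of (sample, updated weight)."""
    c = find_scale(betas, m)
    probs = [min(c * b, 1.0) for b in betas]
    r0 = rng.random()
    kept, cum, l = [], 0.0, -1
    for j in range(1, m + 1):
        r = j - r0
        # Advance to the first index whose cumulative probability reaches r.
        while cum < r and l < len(probs) - 1:
            l += 1
            cum += probs[l]
        kept.append((samples[l], weights[l] / probs[l]))
    return kept
```

Because every inclusion probability is at most 1 and consecutive offsets differ by exactly 1, the chosen indices are distinct. Dividing the weight by the inclusion probability keeps the weighted sample properly weighted in expectation.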

3. Priority scores

The priority score guides the growth of conformations, and its design is critically important for obtaining accurate estimates. Our priority scoring function has three components addressing three important issues, namely, the support of the target distribution, the weighting scheme of samples, and the look-ahead strategy.

The support of the target distribution

If a partial chain $\tilde{\mathbf{y}}_s^{(l)}$ at step s cannot eventually grow into the constrained space Gn(υ, i, k) at step n, it should be removed from future steps of sampling immediately at step s, since it is destined to be rejected. Define the support 𝒮s of partial chains of length s as

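$$\mathcal{S}_s = \{\mathbf{y}_s \mid \exists\, \mathbf{y}_{s+1:n} = (y_{s+1}, \cdots, y_n) \text{ such that } (\mathbf{y}_s, \mathbf{y}_{s+1:n}) \in G_n(\upsilon,i,k)\}.$$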

That is, 𝒮s contains all possible prefix chains of length s of the desired polymers. However, it is difficult to evaluate whether a partial chain is in the support. Here we use a sequence of sets ψs that contain 𝒮s but are easy to work with. Specifically, let ψ1 = {ai(υ)}, where the chain starts according to the definition of Gn(υ, i, k). The support ψs is updated sequentially as follows: for each partial chain ys−1 ∈ ψs−1, find all possible chains ys by adding a monomer at a vacant neighboring site that shares an edge with the growing end of ys−1. The new support ψs is the union of all such chains satisfying the following conditions:

  (i) $y_s \cap \mathbf{y}_{s-1} = \emptyset$ (the self-avoiding constraint) and $y_s \notin \upsilon$, where υ is the void space.

  (ii) If s ≤ k and if $A(\upsilon) \setminus \mathbf{y}_s$ is not an empty set (i.e., the wall sites have not all been filled by $\mathbf{y}_s$), then $A(\upsilon) \setminus \mathbf{y}_s$ must remain a strongly connected set. Here we define that a strong connection exists between two sites if they share an edge.

  (iii) If s ≤ k and if $A(\upsilon) \setminus \mathbf{y}_s \ne \emptyset$, then the site $y_s$ must satisfy
    $$k - s \ge d(y_s, A(\upsilon) \setminus \mathbf{y}_s) + |A(\upsilon) \setminus \mathbf{y}_s| - 1,$$
    where $d(y_s, A(\upsilon) \setminus \mathbf{y}_s)$ is the minimum Manhattan distance between $y_s$ and the unoccupied wall sites, and $|A(\upsilon) \setminus \mathbf{y}_s|$ is the number of unoccupied wall sites.

Condition (ii) reflects the property that both the filled and unfilled sites on the wall of the void must remain strongly connected at any time during the growth. Otherwise, the unfilled wall sites form multiple components that are not strongly connected to each other. In such cases, the self-avoiding property must be violated in order to fill all of them by yn ∈ Gn(υ, i, k). This is a consequence of the Jordan curve theorem in the plane38.

Condition (iii) ensures that the remaining chain length is sufficient to fill all wall sites, i.e., $A(\upsilon) \setminus \mathbf{y}_s$ must be filled by $(y_{s+1}, \cdots, y_k)$, which is a chain of length k − s connected to $y_s$.

The priority scores without look-ahead

In the optimal sampling framework37, the priority score serves both as the propagation trial distribution and as the resampling priority score. Under the importance sampling principle34, the ideal trial distribution should be proportional to |h(x)π(x)|, where π(x) is the target distribution. In our case, this translates to $\tilde{w}_s^{(l)}\, h(\tilde{\mathbf{y}}_s^{(l)})\, \mathbb{I}_{\psi_s}(\tilde{\mathbf{y}}_s^{(l)})$. Here $h(\tilde{\mathbf{y}}_s^{(l)})$ is the value of the function h(·) applied to the partial chain $\tilde{\mathbf{y}}_s^{(l)}$, which is always non-negative.

For s > k, we simply assign equal priority scores $\beta_s^{(l)} \equiv 1.0$ to all partial chain samples $\tilde{\mathbf{y}}_s^{(l)}$, since this stage is relatively easy.

For the more difficult part of the growth, s ≤ k, where the major constraints lie, we need to guide samples to grow into the support region ψs in order to reduce the sample rejection rate. Following Zhang and Liu36, we use the priority score to achieve this. Taking condition (iii) for updating the support into consideration, we define

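$$U_s(\tilde{\mathbf{y}}_s^{(l)}, \upsilon) = \begin{cases} k - s + 2 - d(\tilde{y}_s^{(l)}, A(\upsilon) \setminus \tilde{\mathbf{y}}_s^{(l)}) - |A(\upsilon) \setminus \tilde{\mathbf{y}}_s^{(l)}|, & |A(\upsilon) \setminus \tilde{\mathbf{y}}_s^{(l)}| > 0,\\ 0, & |A(\upsilon) \setminus \tilde{\mathbf{y}}_s^{(l)}| = 0. \end{cases}$$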

It evaluates how much freedom and flexibility the remaining chain possesses. When $|A(\upsilon) \setminus \tilde{\mathbf{y}}_s^{(l)}| > 0$, there are still some vacant sites on the void wall that need to be occupied. In this case, if $U_s(\tilde{\mathbf{y}}_s^{(l)}, \upsilon) \le 0$, then $\tilde{\mathbf{y}}_s^{(l)}$ is not in the support ψs because it violates condition (iii), and we reject this sample. The larger $U_s(\tilde{\mathbf{y}}_s^{(l)}, \upsilon)$ is, the less constrained the remaining chain is.
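The feasibility checks behind conditions (ii) and (iii) can be sketched as follows. This is an illustration only: the function names are hypothetical, and the slack is assumed to read U_s = k − s + 2 − d − |A(υ) \ y_s| when wall sites remain unfilled (and 0 otherwise), so that U_s > 0 is equivalent to k − s ≥ d + |A(υ) \ y_s| − 1.

```python
def slack(head, unfilled_wall, k, s):
    """U_s for a partial chain of length s whose most recent monomer is `head`."""
    if not unfilled_wall:
        return 0
    d = min(abs(head[0] - a) + abs(head[1] - b) for a, b in unfilled_wall)
    return k - s + 2 - d - len(unfilled_wall)


def passes_condition_iii(head, unfilled_wall, k, s):
    """True if the k - s remaining monomers can still reach and fill the wall."""
    return (not unfilled_wall) or slack(head, unfilled_wall, k, s) > 0


def strongly_connected(sites):
    """Condition (ii): the sites form one component under edge sharing."""
    sites = set(sites)
    if not sites:
        return True
    stack, seen = [next(iter(sites))], set()
    while stack:
        x, y = stack.pop()
        if (x, y) in seen:
            continue
        seen.add((x, y))
        for nb in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nb in sites and nb not in seen:
                stack.append(nb)
    return seen == sites
```

With one monomer left (k − s = 1) and a single unfilled wall site adjacent to the growing end, the chain is still feasible; with two unfilled wall sites it is not, and the sample would be rejected.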

Combining the value of the function $h(\tilde{\mathbf{y}}_s^{(l)})$ to be evaluated with $U_s(\tilde{\mathbf{y}}_s^{(l)}, \upsilon)$, which reflects the desired flexibility of the remaining chain, we design our priority score for s ≤ k as:

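$$\beta_s^{(l)} = \tilde{w}_s^{(l)}\, h(\tilde{\mathbf{y}}_s^{(l)})\, \mathbb{I}_{\psi_s}(\tilde{\mathbf{y}}_s^{(l)})\, \exp\{U_s^{1/2}(\tilde{\mathbf{y}}_s^{(l)}, \upsilon)/T_s\},$$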

where Ts is a temperature-like variable. The choice of values for Ts is important. In general, the constraint of forming the void is not a serious concern at the beginning, so we can use high values of Ts to enhance diversity in sampling. As the chain grows, the concern of meeting the constraints becomes stronger, since there is less freedom for the remaining chain to grow. Hence, we gradually reduce Ts, as in simulated annealing algorithms. In this study, we use $T_s = (k - s + 1)/6$ for s = 1, ⋯, k − 1.

Priority score with look-ahead

An often used strategy to improve the performance of SMC is look-ahead36,39,40. Look-ahead enables us to use information from possible future steps to construct priority scores, resulting in a smaller rejection rate of the samples. In addition, it reduces the variance of samples for estimation and hence improves sample efficiency41.

For a δ-step look-ahead, the priority score at step s is determined by exploring all possible combinations of δ-step growth from the current sample ys. Specifically, the priority scores we use are:

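$$\beta_s^{(l)} = \tilde{w}_s^{(l)} \sum_{y_{s+1}, \cdots, y_{s+\delta}} h(\tilde{\mathbf{y}}_{s+\delta}^{(l)})\, \mathbb{I}_{\psi_{s+\delta}}(\tilde{\mathbf{y}}_{s+\delta}^{(l)})\, \exp\left\{\frac{U_s^{1/2}(\tilde{\mathbf{y}}_{s+\delta}^{(l)}, \upsilon)}{T_s}\right\},$$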

where $\tilde{\mathbf{y}}_{s+\delta}^{(l)}$ denotes $(\tilde{\mathbf{y}}_s^{(l)}, y_{s+1}, \cdots, y_{s+\delta})$.

Note that as the look-ahead step δ increases, the effectiveness increases at the cost of exponentially growing computational complexity. Hence the choice of δ is a tradeoff between estimation efficiency and complexity. In this study we use δ = 1.

4. Estimation

In our framework, it is possible to estimate the parameters described in Section IIC for polymer chains of different lengths up to n when generating conformation samples of length n.

Specifically, when generating conformation samples for Gn(υ, i, k), at step s = k, k + 1, ⋯, n, the generated partial conformations are $\tilde{\mathbf{y}}_s^{(l)} = (x_1^{(l)}, \cdots, x_k^{(l)}, \cdots, x_s^{(l)})$, which are properly weighted chain polymers of length s. Hence, $\sum_{x_{n^*} \in G_{n^*}(\upsilon,i,k)} h(x_{n^*})/K(x_{n^*}, S)$, n* = k, k + 1, ⋯, n, can be estimated by the following estimator

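$$\hat{h}(x_{n^*}; i, k) = \frac{1}{m_{\max}} \sum_{l=1}^{L_s} \tilde{w}_s^{(l)}\, \mathbb{I}_{G_{n^*}(\upsilon,i,k)}(\tilde{\mathbf{y}}_s^{(l)})\, \frac{h(\tilde{\mathbf{y}}_s^{(l)})}{K(\tilde{\mathbf{y}}_s^{(l)}, S)}$$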

at step s = n*. Here the estimation is made after step (2) of the algorithmic steps in the previous subsection.

After generating samples for Gn(υ, i, k) for all possible i and k, for any n* ≤ n, we can estimate $\sum_{X_{n^*} \in \Omega_{n^*}(S)} h(X_{n^*})$ by

$$\sum_{X_{n^*} \in \Omega_{n^*}(S)} h(X_{n^*}) \approx \frac{1}{q(\upsilon)} \sum_{i,k} \hat{h}(x_{n^*}; i, k)$$

according to Eqn (8).

III. RESULTS

In this section, we present the estimates of the parameters described in Section II C. We also develop parametric models relating void and chain properties, both for interpreting the estimated results and for predicting the propensity of forming a void of a specific shape.

A. Propensity of void formation

For the propensity of void formation f1(S, n) defined in Eqn (1), we first examine size-4 voids. There are 5 different shapes for size-4 regular voids (Figure 2). To validate our procedure, the estimated propensity of void formation is compared with the true values obtained by exhaustive enumeration, for chains of length 14 to 24. Figure 5(a) shows the results for void shapes 4.3 and 4.4. The estimated values are indistinguishable from the true values. These results suggest that our sampling method works well and can provide accurate estimates.

Fig 5.

Estimating void propensity values. (a) Estimated propensity values and true propensity values of forming size-4 voids of different specific shapes for conformations of length 14–24. They superimpose very well. (b) Estimated propensity values of forming size-4 voids of different specific shapes for conformations of length 15–50.

The results for longer chains of length 15 to 50 using the SMC procedure are presented in Figure 5(b), where the propensity of void formation f1(S, n) for void shapes 4.1 to 4.5 is shown. It is clear that voids of different shapes differ significantly in their propensity of formation. This raises the question of whether voids and binding sites in proteins are similarly biased, and whether the distribution of voids of different shapes can be partly explained by intrinsic propensities analogous to what is observed here on lattice models.

a. Predictive models

To better understand our estimation results and to infer general principles, we develop a predictive model for f1(S, n) using the following parametric form:

$$\hat{f}_1(S,n) = \frac{1}{q(\upsilon)}\, c_1\, c_2^{-|A(\upsilon)|}\, (n - |A(\upsilon)| + 1)^{c_3}\, [1 - c_4(|e(\upsilon)| - 4)], \tag{9}$$

where q(υ) represents the degeneracy of the void shape, as discussed in Section IID1. We consider three factors other than q(υ) in our model: the wall size |A(υ)|, the chain length n, and the number of outer corners of the void, |e(υ)|. Here the outer corners, e(υ), are defined as the sites on the void wall that connect to the void through a single vertex only. The values of q(υ), |A(υ)|, and |e(υ)| for different void shapes are summarized in Table I.

TABLE I.

Geometric features of voids determining the fractions of chain polymers containing such voids. q(υ) is related to the symmetry of the void υ, |A(υ)| is wall size of the void υ, and |e(υ)| is the number of outer corners of void υ.

void type   2.1  3.1  3.2  4.1  4.2  4.3  4.4  4.5  5.1  6.1  6.2  6.3
q(υ)          4    4    2    4    8    1    2    2    4    4    1    2
|A(υ)|       10   12   12   14   12   14   14   14   16   18   18   18
|e(υ)|        4    4    5    4    4    5    6    6    4    4    5    6

In this model, c1, c2, c3, and c4 are positive constants. As the wall size |A(υ)| increases, the propensity of forming voids of the specific shape is expected to decrease exponentially. This is reflected by the term c2^(−|A(υ)|). As the chain length n increases, the propensity of forming voids of the specific shape is expected to increase as a power of n. This is captured by the term (n − |A(υ)| + 1)^c3. We also find that the number of outer corners, |e(υ)|, is an important determinant of the propensity of void formation. For void shapes with more outer corners, chain polymers enclosing such voids have more concave turns on the wall, which makes it more difficult for a self-avoiding chain to enclose the void. The negative term in |e(υ)| in Eqn (9) models this effect.
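The geometric quantities entering model (9) can be computed directly from a void shape. The sketch below is illustrative only and rests on assumptions not stated in this section: the wall A(υ) is taken to be every non-void site among the 8 neighbours of a void site, an outer corner is a wall site that has no edge-sharing neighbour in the void, and q(υ) is taken to be the number of lattice symmetries that map the shape onto itself. These choices reproduce the Table I entries for the shapes tested here; the function names and the polyomino coordinates are hypothetical.

```python
EDGE_STEPS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def normalize(cells):
    """Translate a shape so that its minimum x and y are both zero."""
    mx = min(x for x, _ in cells)
    my = min(y for _, y in cells)
    return frozenset((x - mx, y - my) for x, y in cells)


def symmetry_order(cells):
    """Number of the 8 square-lattice symmetries mapping the shape to itself."""
    base, pts, count = normalize(cells), list(cells), 0
    for _ in range(2):
        pts = [(y, x) for x, y in pts]        # reflect across the diagonal
        for _ in range(4):
            pts = [(-y, x) for x, y in pts]   # rotate by 90 degrees
            count += normalize(pts) == base
    return count


def wall_sites(cells):
    """Non-void sites within the 8-neighbourhood of any void site."""
    void = set(cells)
    ring = {(x + dx, y + dy) for x, y in void
            for dx in (-1, 0, 1) for dy in (-1, 0, 1)}
    return ring - void


def outer_corners(cells):
    """Wall sites that touch the void only diagonally (through a vertex)."""
    void = set(cells)
    return {(x, y) for x, y in wall_sites(void)
            if not any((x + dx, y + dy) in void for dx, dy in EDGE_STEPS)}


def void_features(cells):
    """Return (q, |A|, |e|) for a void shape under the assumptions above."""
    return (symmetry_order(cells), len(wall_sites(cells)),
            len(outer_corners(cells)))


if __name__ == "__main__":
    straight4 = [(0, 0), (1, 0), (2, 0), (3, 0)]
    square4 = [(0, 0), (1, 0), (0, 1), (1, 1)]
    print(void_features(straight4), void_features(square4))
```

Under these assumptions the straight and square size-4 shapes give (4, 14, 4) and (8, 12, 4), matching the Table I columns for void types 4.1 and 4.2.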

We estimate the coefficients in model (9) using the values of f1(S, n) estimated by SMC for voids of sizes 2 to 5 and chain lengths from 25 to 50. After a log transformation, nonlinear regression gives ĉ1 = 47.46, ĉ2 = 2.28, ĉ3 = 0.76 and ĉ4 = 0.21.
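The log transformation can be written out explicitly. The following is a sketch that assumes model (9) has the product form $\hat{f}_1 = q^{-1} c_1 c_2^{-|A|} (n - |A| + 1)^{c_3} [1 - c_4(|e| - 4)]$; it is not taken from the original derivation.

```latex
\begin{align*}
\log \hat{f}_1(S, n)
  &= \log c_1 - \log q(\upsilon) - |A(\upsilon)| \log c_2 \\
  &\quad + c_3 \log\bigl(n - |A(\upsilon)| + 1\bigr)
         + \log\bigl[1 - c_4\,(|e(\upsilon)| - 4)\bigr].
\end{align*}
```

In this form the response is linear in $\log c_1$, $\log c_2$, and $c_3$, while $c_4$ enters through the last logarithm, which is why a nonlinear regression is needed. The last term is defined only when $1 - c_4(|e(\upsilon)| - 4) > 0$, which holds for all void shapes in Table I at $\hat{c}_4 = 0.21$.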

The propensity values estimated from SMC and the fitted values f̂1(S, n) from model (9) are plotted in Figure 6. The parametric model fits the data very well. Using the parameters estimated from the training data, we predict the propensity for void shapes 6.1, 6.2 and 6.3, which were not used in deriving the regression model. The predictions are again compared with those estimated by SMC (Figure 7). The model works well, although it consistently underestimates the propensity by a small amount for void shape 6.3.
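The prediction step amounts to evaluating model (9) with the fitted coefficients and the Table I features. The sketch below assumes the product form of model (9) with a negative exponent on c2 and a factor 1/q(υ); the function and variable names are hypothetical, and the printed numbers are model evaluations, not the SMC estimates shown in Figure 7.

```python
# Table I features as (q, |A|, |e|) per void type.
TABLE_I = {
    "2.1": (4, 10, 4), "3.1": (4, 12, 4), "3.2": (2, 12, 5),
    "4.1": (4, 14, 4), "4.2": (8, 12, 4), "4.3": (1, 14, 5),
    "4.4": (2, 14, 6), "4.5": (2, 14, 6), "5.1": (4, 16, 4),
    "6.1": (4, 18, 4), "6.2": (1, 18, 5), "6.3": (2, 18, 6),
}

# Fitted coefficients reported in the text.
C1, C2, C3, C4 = 47.46, 2.28, 0.76, 0.21


def predicted_propensity(void_type, n, c1=C1, c2=C2, c3=C3, c4=C4):
    """Evaluate the assumed form of model (9) for a void type and chain length n."""
    q, wall, corners = TABLE_I[void_type]
    if n < wall:
        raise ValueError("chain is shorter than the void wall")
    return ((1.0 / q) * c1 * c2 ** (-wall)
            * (n - wall + 1) ** c3
            * (1.0 - c4 * (corners - 4)))


if __name__ == "__main__":
    for void_type in ("6.1", "6.2", "6.3"):
        values = [predicted_propensity(void_type, n) for n in (25, 50)]
        print(void_type, values)
```

Because the three size-6 shapes share the same wall size, their predicted curves differ only through the constant factor [1 − c4(|e(υ)| − 4)] / q(υ).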

Fig 6.

Propensity values of forming voids (size=2 – 5, a–d) of different specific shapes for conformations of length 25–50. These are used to develop a regression model. Dashed line: results obtained by estimation using sequential Monte Carlo. Solid line: fitted results from the regression models.

Fig 7.

Estimated and predicted propensity values of forming size-6 voids of different specific shapes for conformations of length 25 – 50. Dashed line: SMC results. Solid line: predicted results using the regression model (9).

B. Propensity of void formation with fixed loop length

Now we consider the propensity of void formation with fixed loop length, f2(l, S, n), defined in Eqn (2). We plot the estimated f2(l, S, n = 50) for different specified loop lengths l and shapes S in Figure 9. Although voids with odd loop length do exist, it is much easier to form a void with even loop length. This is because the number of wall sites, |A(υ)|, is always an even number on the lattice. To form a void υ with odd loop length, the first and the last monomers of the polymer on the void wall A(υ) cannot be adjacent, which results in a more complicated shape. A conformation enclosing a void of shape 4.1 with loop length 17 is given in Figure 8. On average, void shapes 4.1 and 4.3 have larger loop sizes than void shapes 4.4 and 4.5, because they have fewer corners. These results suggest that voids of different shapes have different propensities at specific loop lengths.
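The adjacency condition for odd loop lengths follows from a parity argument on the square lattice, sketched below. The sketch assumes that the loop runs from monomer $i$ (the first monomer on the wall) to monomer $j$ (the last monomer on the wall), so that the loop length is $l = j - i + 1$; this indexing is an assumption made for illustration.

```latex
\begin{align*}
\sigma(x, y) &= (x + y) \bmod 2
  && \text{(checkerboard colour of a lattice site)} \\
\sigma(\mathbf{r}_{k+1}) &\neq \sigma(\mathbf{r}_k)
  && \text{(every bond joins sites of opposite colour)} \\
\mathbf{r}_i,\ \mathbf{r}_j \text{ adjacent}
  &\;\Rightarrow\; \sigma(\mathbf{r}_i) \neq \sigma(\mathbf{r}_j)
  \;\Rightarrow\; j - i \text{ odd}
  \;\Rightarrow\; l = j - i + 1 \text{ even}.
\end{align*}
```

By contraposition, a loop of odd length must start and end on wall sites of the same colour, and two such sites can never be lattice neighbours.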

Fig 9.

Estimated propensity values of forming size-4 voids of different specific shapes with different specified loop length for conformations of length n = 50.

Fig 8.

Conformation with odd loop length. This conformation encloses a void of shape 4.1 with loop length 17.

C. End effect for void formation

For the propensity of void formation with fixed loop length and specified starting position, f3(I0, l, S, n), as defined in Eqn (3), we plot in Figure 10 the estimated f3(I0, l = 14, S, n = 50) for voids of shape S with a loop length of 14 in chain polymers of length 50, at different starting positions I0. The propensities f3(I0, l, S, n) at I0 = 1 and I0 = 2 are very different, indicating a strong end effect for void formation. That is, a void forms much more easily at the end of the conformation. This is likely due to the tail effect: a void at the end of the chain has only one tail, whereas a void in the middle of the conformation has two. It is difficult to constrain the tails to satisfy the multiple restrictions for forming a void of a given shape.

Fig 10.

Estimated propensity values of forming size-4 voids of different specific shapes with fixed loop length l = 14 and different specified starting position for conformations of length n = 50.

D. Propensity of void formation at different compactness

Figure 11 shows the estimated propensity of void formation at different compactness, f4(ρ, S, n), defined in Eqn (4), for chain lengths from n = 30 to 50. Conformations with size-4 voids are dominated by those with compactness around 0.3 − 0.7. We normalize f4(ρ, S, n) over compactness by defining

$$\tilde{f}_4(\rho, S, n) = \frac{f_4(\rho, S, n)}{\int f_4(\rho, S, n)\, d\rho},$$

so that f̃4(ρ, S, n) can be considered as a distribution of ρ. We plot the 0.25-quantile, the median, and the 0.75-quantile of the distribution f̃4(ρ, S, n) for different chain lengths n and fixed shape S in Figure 12. These values increase slightly as n increases from 30 to 50, which indicates that the preferred compactness range for forming these size-4 voids shifts slightly toward more compact regions as chain length increases. We also compare the propensity of forming voids of all size-2 regular shapes (2.1), all size-3 regular shapes (3.1, 3.2), and all size-4 regular shapes (4.1, 4.2, 4.3, 4.4, 4.5) for chains of length 50 at different compactness (Figure 13). The results show that smaller voids are easier to form as compactness increases. Our results from the lattice model suggest that there might be a preferred size for void formation in proteins, which all fall within a specific narrow range of compactness [3].
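The normalization and the quantiles can be sketched numerically as follows. This is a minimal illustration that assumes the propensity is tabulated on an equally spaced compactness grid, so the integral over ρ is replaced by a sum; the function names and the example profile are hypothetical and are not data from this study.

```python
def normalized_distribution(f4_values):
    """Normalize tabulated propensities so that they sum to one."""
    total = sum(f4_values)
    if total <= 0:
        raise ValueError("propensities must have a positive sum")
    return [value / total for value in f4_values]


def quantile(rho_grid, probabilities, p):
    """Smallest grid value of rho whose cumulative probability reaches p."""
    cumulative = 0.0
    for rho, weight in zip(rho_grid, probabilities):
        cumulative += weight
        if cumulative >= p:
            return rho
    return rho_grid[-1]


if __name__ == "__main__":
    # Hypothetical propensity profile over compactness (illustration only).
    rho_grid = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
    f4_values = [0.0, 1e-4, 2e-4, 4e-4, 6e-4, 4e-4, 2e-4, 1e-4, 0.0]
    probabilities = normalized_distribution(f4_values)
    print([quantile(rho_grid, probabilities, p) for p in (0.25, 0.5, 0.75)])
```

Repeating this calculation for each chain length n gives the three quantile curves of the kind plotted against n.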

Fig 11.

Estimated propensity values of forming size-4 voids of different specific shapes with certain compactness for conformations of length from n = 30 to 50. (a–d) : void 4.1, void 4.3, void 4.4, and void 4.5.

Fig 12.

Estimated quantiles (0.25, 0.5, and 0.75) of the normalized distribution f̃4(ρ, S, n) for different chain lengths n and void shapes S. (a–d): void 4.1, void 4.3, void 4.4, and void 4.5.

Fig 13.

Estimated propensity values of forming voids of all size-2 regular shapes (2.1), voids of all size-3 regular shapes (3.1, 3.2), and voids of all size-4 regular shapes (4.1, 4.2, 4.3, 4.4, 4.5) for chain length 50 and different compactness.

IV. SUMMARY AND CONCLUSION

Protein molecules contain many voids buried in their interior, with a broad distribution [4]. Although most voids are likely to originate from generic steric constraints of compact chain polymers [4,5], some voids are the functional regions of many proteins, such as enzymes, where substrates and ligands bind and biochemical reactions occur [6,7].

An important general question is how the need for maintaining functional voids, which have to be of specific shape, is influenced by, and affects, other aspects of protein structure and properties, e.g., protein folding stability, kinetic accessibility, and evolutionary selection pressure. These are broad and complex issues that require detailed studies.

In this work, we study the effects of maintaining voids of defined shape using lattice models. Because the conformational space of simplified polymers can be examined in detail, lattice models have been widely used in protein studies and have led to important insights about protein folding. The focus of our study is to generate a large number of sample conformations under very constraining restrictions in order to study general properties of voids and their shapes. We use the sequential Monte Carlo method and have developed an efficient growth method to generate conformation samples in a highly constrained space.

We show that our approach is effective in estimating the entropy of void maintenance, with and without an increasing number of restrictive conditions, such as loops of fixed length forming the wall of the void, and additionally a fixed starting position in the sequence. Our results also lead to a number of observations, including that polymers within a certain compactness range favor the formation of voids of specific size, and that voids are far easier to form near the end of the polymer. The latter finding raises the interesting question of whether voids and pockets tend to form at either the N-terminal or the C-terminal end in real proteins. A detailed analysis of voids and pockets in real proteins will be necessary for answering this question. In addition, we have developed a parametric model for explaining the propensity of forming voids of particular shapes, or equivalently, the entropic cost of maintaining such voids. Our model is highly effective in predicting the propensity of void formation for different shapes. Such lattice models of voids representing functional sites can be used as improved models for studying the evolution of protein functions [26], and how it relates to protein stability [27].

Although in this study we treat all conformations as equally likely, our approach can be applied to models with more realistic energy functions in a straightforward manner. The approach for sampling strongly constrained conformations developed in this study will be generally applicable for studying real proteins in three-dimensional space.

References

  • 1. Richards FM. Ann. Rev. Biophys. Bioeng. 1977;6:151. doi: 10.1146/annurev.bb.06.060177.001055.
  • 2. Chothia C. Nature. 1975;254:304. doi: 10.1038/254304a0.
  • 3. Richards FM, Lim WA. Q. Rev. Biophys. 1994;26:423. doi: 10.1017/s0033583500002845.
  • 4. Liang J, Dill KA. Biophys. J. 2001;81:751. doi: 10.1016/S0006-3495(01)75739-6.
  • 5. Zhang J, Chen R, Tang C, Liang J. J. Chem. Phys. 2003;118:6102.
  • 6. Laskowski RA, Luscombe NM, Swindells MB, Thornton JM. Protein Science. 1996;5:2438. doi: 10.1002/pro.5560051206.
  • 7. Liang J, Edelsbrunner H, Woodward C. Protein Science. 1998;7:1884. doi: 10.1002/pro.5560070905.
  • 8. Binkowski TA, Adamian L, Liang J. J. Mol. Biol. 2003;332:505. doi: 10.1016/s0022-2836(03)00882-9.
  • 9. Tseng Y, Liang J. Mol. Biol. Evol. 2006;23(2):421. doi: 10.1093/molbev/msj048.
  • 10. Lau KF, Dill KA. Macromolecule. 1989;93:6737.
  • 11. Chan HS, Dill KA. Macromolecules. 1989;22:4559.
  • 12. Chan HS, Dill KA. J. Chem. Phys. 1990;92:3118.
  • 13. Shakhnovich E, Gutin A. J. Chem. Phys. 1990;93:5967.
  • 14. Camacho CJ, Thirumalai D. Proc. Natl. Acad. Sci. USA. 1993;90:6369. doi: 10.1073/pnas.90.13.6369.
  • 15. Pande VS, Joerg C, Grosberg AY, Tanaka T. J. Phys. A. 1994;27:6231.
  • 16. Socci ND, Onuchic JN. J. Chem. Phys. 1994;101:1519.
  • 17. Dill K, Bromberg S, Yue K, Fiebig K, Yee D, Thomas P, Chan H. Protein Science. 1995;4:561. doi: 10.1002/pro.5560040401.
  • 18. Lu H, Liang J. Proteins. (Accepted).
  • 19. Šali A, Shakhnovich EI, Karplus M. Nature. 1994;369:248. doi: 10.1038/369248a0.
  • 20. Shrivastava I, Vishveshwara S, Cieplak M, Maritan A, Banavar JR. Proc. Natl. Acad. Sci. U.S.A. 1995;92:9206. doi: 10.1073/pnas.92.20.9206.
  • 21. Klimov DK, Thirumalai D. Phys. Rev. Lett. 1996;76:4070. doi: 10.1103/PhysRevLett.76.4070.
  • 22. Mélin R, Li H, Wingreen N, Tang C. J. Chem. Phys. 1999;110:1252.
  • 23. Kachalo S, Lu H, Liang J. Phys. Rev. Lett. 2006;96(5):058106. doi: 10.1103/PhysRevLett.96.058106.
  • 24. Liang J, Zhang J, Chen R. J. Chem. Phys. 2002;117:3511.
  • 25. Zhang J, Chen Y, Chen R, Liang J. J. Chem. Phys. 2004:592–603. doi: 10.1063/1.1756573.
  • 26. Williams PD, Pollock DD, Goldstein R. Journal of Molecular Graphics and Modelling. 2001;19:150. doi: 10.1016/s1093-3263(00)00125-x.
  • 27. Bloom JD, Wilke CO, Arnold FH, Adami C. Biophys. J. 2004;86:2758. doi: 10.1016/S0006-3495(04)74329-5.
  • 28. Rosenbluth MN, Rosenbluth AW. J. Chem. Phys. 1955;23:356.
  • 29. Gilks WR, Richardson S, Spiegelhalter DJ. Markov Chain Monte Carlo in Practice. Chapman & Hall; 1996.
  • 30. Hamelryck T, Kent J, Krogh A. PLoS Comput. Biol. 2006;2:1121. doi: 10.1371/journal.pcbi.0020131.
  • 31. Iba Y, Chikenji G, Kikuchi M. Journal of the Physical Society of Japan. 1998;67:3327.
  • 32. Chan HS, Dill KA. J. Chem. Phys. 1989;90:492.
  • 33. Marshall A. In: Meyer M, editor. Symposium on Monte Carlo Methods; Wiley; 1956. pp. 123–140.
  • 34. Liu JS. Monte Carlo Strategies in Scientific Computing. New York: Springer; 2001.
  • 35. Grassberger P. Phys. Rev. E. 1997;56:3682.
  • 36. Zhang JL, Liu JS. J. Chem. Phys. 2002;117:3492.
  • 37. Fearnhead P, Clifford P. J. R. Statist. Soc. B. 2003;65:887.
  • 38. Hatcher A. Algebraic topology. Cambridge, England: Cambridge University Press; 2002.
  • 39. Meirovitch H. J. Phys. A: Math. Gen. 1982;15:L735.
  • 40. Wang X, Chen R, Guo D. IEEE Trans. Signal Processing. 2002;50:241.
  • 41. Kong A, Liu J, Wong W. J. Amer. Statist. Assoc. 1994;89:278.
