Abstract
Proteins contain many voids, which are unfilled spaces enclosed in the interior. A few of them have shapes compatible with ligands and substrates and are important for protein function. An important general question is how the need to maintain functional voids is influenced by, and affects, other aspects of protein structures and properties (e.g., protein folding stability, kinetic accessibility, and evolutionary selection pressure). In this paper, we examine in detail the effects of maintaining voids of different shapes and sizes using two-dimensional lattice models. We study the propensity for conformations to form a void of a specific shape, which is related to the entropic cost of void maintenance. We also study the locations where voids of a specific shape and size tend to form, and the influence of compactness on the formation of such voids. As enumeration is infeasible for long chain polymers, a key development in this work is the design of a novel sequential Monte Carlo strategy for generating a large number of sample conformations under very constraining restrictions. Our method is validated by comparing results obtained from sampling and from enumeration for short polymer chains. We succeed in accurately estimating the entropic cost of void maintenance, with and without an increasing number of restrictive conditions, such as requiring the loop that forms the wall of the void to have a fixed length, and additionally a fixed starting position in the sequence. We also identify the key structural properties of voids that determine the entropic cost of void formation, and we develop a parametric model to predict void entropy quantitatively. Our model is highly effective, and these results indicate that voids representing functional sites can be used as an improved model for studying the evolution of protein functions and how protein function relates to protein stability.
Keywords: void shape, constraint, sequential Monte Carlo, entropy, propensity
I. INTRODUCTION
Proteins are the working molecules of the cell. Understanding how they maintain their stability and carry out their functions is a fundamental problem of molecular biology. Although it is well-known that the structures of proteins are well packed1–3, there exist numerous packing defects in the form of voids buried in the interior of proteins. The size distributions of these voids are broad4. Various scaling relationships indicate that their origin may be generic steric constraints of compact chain polymers4,5. It is also well-known that a few voids on a protein may play key roles in enabling protein functions6–9, for example, for substrate and ligand binding.
However, the shape space of voids of folded and unfolded proteins is not well characterized, and the energetic consequences and kinetic effects of maintaining voids of a certain shape and size are largely unknown. In this paper, we examine in detail the effects of maintaining voids of different shapes in lattice models of chain polymers. Lattice models have been widely used for studying protein folding, where the conformational space of simplified polymers can be examined in detail10–18. Despite their simplistic nature, lattice models have provided important insights about proteins, including collapse and folding transitions16,19–23, influence of packing on secondary structure and void formation11,12,24,25, the evolution of protein function26,27, nascent chain folding18, and the effects of chirality and side chains25.
In this paper, we focus on conformations that enclose voids of specific shapes. Our main objective is to study the fraction of conformations with a specific void shape among all possible conformations. This is related to the entropic cost of maintaining such a void in a polymer structure. We also study the locations where voids of a specific shape and size tend to form, and the influence of compactness on the formation of such voids. The methodology we use is the sequential Monte Carlo (SMC) approach designed for sampling conformations under strong constraints, i.e., the requirement of the existence of specific types of voids. SMC is a growth-based method, in which residues are added to the chain polymer one by one until the conformation of full length is obtained. This method was first used in Ref. 28 to estimate the average extension of molecular chains. The basic goal is to obtain a set of conformational samples, along with the probabilities of generating these conformations. Compared with other sampling methods, such as Markov chain Monte Carlo29–31, sequential Monte Carlo can generate diverse samples and can directly and accurately estimate the number of conformations containing voids of specific shapes. In this study, we develop several new strategies to improve the effectiveness of sequential Monte Carlo in generating samples under strongly constrained conditions.
Our paper is organized as follows. In Section II, we briefly describe the lattice model and define voids and their shapes. We then discuss the constrained sequential Monte Carlo method used in our study. Results are presented in Section III. The final section contains the summary and conclusion.
II. METHOD
A. Lattice Model
In lattice models, chain polymers are self-avoiding walks (SAWs) in the square lattice space ℤ². A length-n conformation is denoted by a connected chain Xn = (x1, x2, ⋯, xn), where the i-th monomer is located at the site xi = (ai, bi), and ai and bi are integers. The Manhattan distance between bonded monomers xi and xi+1 is 1. The chain is self-avoiding: xi ≠ xj for all i ≠ j. We consider the beginning and the end of a polymer to be distinct. Only conformations that are not related by translation, rotation, and reflection are considered to be distinct. This is achieved by following the rule that a chain is always grown from the origin, the first step is always to the right, and the chain always goes up the first time it deviates from the x-axis. We denote the set of all length-n SAW polymers satisfying these constraints as 𝒫n.
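As an illustration of this canonical growth rule, the following minimal sketch enumerates 𝒫n for short chains by depth-first growth. The function name `enumerate_saws` is our own choice for illustration and is not taken from the original implementation.

```python
def enumerate_saws(n):
    """Yield every length-n SAW (n monomers) in canonical orientation:
    start at the origin, first step to the right, and the first step
    that leaves the x-axis goes up."""
    if n == 1:
        yield [(0, 0)]
        return

    def grow(chain, occupied, left_axis):
        if len(chain) == n:
            yield list(chain)
            return
        x, y = chain[-1]
        for dx, dy in ((1, 0), (0, 1), (-1, 0), (0, -1)):
            nxt = (x + dx, y + dy)
            if nxt in occupied:
                continue  # self-avoidance
            if not left_axis and dy == -1:
                continue  # first deviation from the x-axis must go up
            chain.append(nxt)
            occupied.add(nxt)
            yield from grow(chain, occupied, left_axis or dy != 0)
            chain.pop()
            occupied.remove(nxt)

    start = [(0, 0), (1, 0)]
    yield from grow(start, set(start), False)


# Number of distinct conformations |P_n| for a few short chains.
sizes = {n: sum(1 for _ in enumerate_saws(n)) for n in range(2, 8)}
```

Because the run time grows exponentially with n, this kind of enumeration is only practical for short chains, which is the motivation for the sampling approach described later.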
B. Voids and shape of voids
Given the conformation Xn ∈ 𝒫n of a chain polymer, the unoccupied sites on the square lattice are divided by the polymer into disconnected components:
$$\mathbb{Z}^2 \setminus X_n = u \cup \upsilon_1 \cup \cdots \cup \upsilon_k,$$
where u is the outside component that connects to infinity, and υ1, ⋯, υk are the voids that are enclosed by Xn. Here two components are considered connected if they share any edges or vertices. By this definition, conformations (a) and (b) in Figure 1 both have a size-2 void, but conformation (c) does not contain any void, since the two internal unoccupied points are connected to the outside through a vertex. This definition of void is arbitrary, but is consistent with the definition of contact for monomers in a chain, that is, two sites in a void are considered to be connected only if they share an edge24.
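The decomposition of unoccupied sites into the outside component and the voids can be sketched with a flood fill. The sketch below assumes, as stated above, that unoccupied sites sharing an edge or a vertex belong to the same component; the function name `find_voids` is hypothetical.

```python
from collections import deque

# Offsets to the eight sites sharing an edge or a vertex with a site.
NEIGHBORS8 = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
              if (dx, dy) != (0, 0)]


def find_voids(chain):
    """Return the voids enclosed by a conformation as a list of sets of sites."""
    occupied = set(chain)
    xs = [x for x, _ in chain]
    ys = [y for _, y in chain]
    # Bounding box padded by one site; the padding ring lies in the outside u.
    x0, x1 = min(xs) - 1, max(xs) + 1
    y0, y1 = min(ys) - 1, max(ys) + 1
    free = {(x, y) for x in range(x0, x1 + 1)
            for y in range(y0, y1 + 1)} - occupied

    def flood(seed, pool):
        comp, queue = {seed}, deque([seed])
        while queue:
            x, y = queue.popleft()
            for dx, dy in NEIGHBORS8:
                nb = (x + dx, y + dy)
                if nb in pool and nb not in comp:
                    comp.add(nb)
                    queue.append(nb)
        return comp

    free -= flood((x0, y0), free)  # discard the outside component u
    voids = []
    while free:
        comp = flood(next(iter(free)), free)
        voids.append(comp)
        free -= comp
    return voids
```

For example, a closed ring of eight monomers around a single site yields one size-1 void, whereas the same ring with one corner monomer missing yields none, because the interior site then touches the outside through a vertex.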
We are interested in the set of conformations with voids of a particular shape S. Figure 2 shows some of those shapes of sizes 2 to 6. Note that those shapes are regular shapes, in which sites are connected by edges. In this study we do not consider shapes with sites connected by a vertex only, such as the size-2 void formed by conformation (b) in Figure 1. Here the voids are labelled by their shapes, where the first digit represents the number of sites the void occupies, and the second digit is the identification number of different shapes.
C. Parameters of interest
To study the properties of conformations with voids of specific shapes, we consider the following parameters.
a. Propensity of void formation f1(S, n)
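Let Ωn(S) be the set of conformations with at least one void of a specific shape S, that is,
$$\Omega_n(S) = \{X_n \in \mathcal{P}_n : X_n \text{ encloses at least one void of shape } S\}.$$
The fraction of conformations with a void of this particular shape S among all possible conformations is:
$$f_1(S, n) = \frac{N(S, n)}{|\mathcal{P}_n|}, \qquad N(S, n) = |\Omega_n(S)|. \tag{1}$$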
This parameter represents the propensity of void formation, i.e., the probability of forming a void of specified shape. This relates to the question whether there are preferred shapes for binding voids to occur.
b. Propensity of void formation with fixed loop length f2(l, S, n)
The loop length of a void is defined as l = I1 − I0 + 1, where I0 and I1 are the smallest and largest indices of the monomers forming the boundary of the void. Let Ωn(l, S) ⊂ Ωn(S) be the set of length n conformations with at least one void of shape S of loop length l. The fraction of conformations with void of a particular shape S and a particular void loop length l among all conformations with a void of the same shape but without the restriction of void loop length is defined as:
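$$f_2(l, S, n) = \frac{N(l, S, n)}{N(S, n)}, \qquad N(l, S, n) = \sum_{X_n \in \Omega_n(S)} \frac{\xi(X_n, l, S)}{K(X_n, S)}, \tag{2}$$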
where K(Xn, S) is the number of shape-S voids in Xn, and ξ(Xn, l, S) is the number of shape-S voids with loop length l in Xn. This special treatment of N(l, S, n) deals with the cases in which Xn has multiple voids of shape S. In such cases, Xn is counted once in N(S, n), and counted 1/K(Xn, S) in N(l, S, n) for each combination of shape-S void and loop length l. For example, if conformation Xn has two voids of shape S, then K(Xn, S) = 2. If both voids have loop length l = 14, then ξ(Xn, l = 14, S) = 2 and this conformation contributes 1 to N(l = 14, S, n); if one void has loop length l = 14 and the other void has loop length l = 16, then this conformation contributes 1/2 to N(l = 14, S, n) and 1/2 to N(l = 16, S, n). Clearly, with this definition, we have
$$\sum_{l} f_2(l, S, n) = 1.$$
The parameter f2(l, S, n) represents the propensity of void formation with fixed loop length, i.e., the probability of forming a void of specified shape with fixed void loop length. In proteins, a related interesting question is how much easier it is to form voids of a certain shape and size with more local sequence fragments than with more global ones.
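The fractional counting rule can be made concrete with a short sketch. Here each conformation is represented only by the list of loop lengths of its shape-S voids (an empty list means the conformation has no such void); this representation and the function names are assumptions made for illustration.

```python
from collections import Counter
from fractions import Fraction


def loop_length_contributions(loop_lengths):
    """Contribution xi(X, l, S) / K(X, S) of one conformation to N(l, S, n)."""
    k = len(loop_lengths)  # K(X, S): number of shape-S voids
    return {l: Fraction(c, k) for l, c in Counter(loop_lengths).items()}


def f2_from_ensemble(ensemble):
    """f2(l, S, n) for an exhaustively listed ensemble of conformations."""
    totals = Counter()
    n_s = 0  # N(S, n): each conformation with a shape-S void counts once
    for loop_lengths in ensemble:
        if not loop_lengths:
            continue  # conformation is not in Omega_n(S)
        n_s += 1
        for l, c in loop_length_contributions(loop_lengths).items():
            totals[l] += c
    return {l: c / n_s for l, c in totals.items()}


# The two cases discussed in the text.
both_14 = loop_length_contributions([14, 14])   # {14: 1}
mixed = loop_length_contributions([14, 16])     # {14: 1/2, 16: 1/2}
```

Using exact fractions makes it easy to confirm that the values of f2 over all loop lengths sum to one.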
c. Propensity of void formation with fixed loop length and starting position f3(I0, l, S, n)
Let Ωn(I0, l, S) ⊂ Ωn(l, S) be the set of length n conformations with at least one void of shape S, loop length l, and starting at residue position I0. The fraction of conformations with void of a particular shape S, a particular loop length l, and a particular starting residue I0 among the conformations with a void of the same shape and the same loop length12,32 is defined as
$$f_3(I_0, l, S, n) = \frac{N(I_0, l, S, n)}{N(l, S, n)}, \qquad N(I_0, l, S, n) = \sum_{X_n \in \Omega_n(S)} \frac{\xi(X_n, I_0, l, S)}{K(X_n, S)}, \tag{3}$$
where ξ(Xn, I0, l, S) is the number of shape-S voids with loop length l and starting residue I0 in Xn. Similarly, this definition ensures that
$$\sum_{I_0} f_3(I_0, l, S, n) = 1.$$
The parameter f3(I0, l, S, n) represents the propensity of void formation with fixed loop length and starting position, i.e., the probability of forming a void of specified shape with fixed void loop length starting at a specific position. A related question in proteins is what the propensity is for forming voids of a certain shape with more local or more global sequence fragments starting at specific positions of the chain.
d. Propensity of void formation at specific compactness f4(ρ, S, n)
The compactness of a polymer ρ is defined as t/tmax(n)11, where t is the number of contacts in the conformation, and tmax(n) is the maximum number of contacts possible for length n conformations. For square lattice space, we have11:
$$t_{\max}(n) = \begin{cases} n - 2b, & b^2 < n \le b(b+1), \\ n - 2b - 1, & b(b+1) < n \le (b+1)^2, \end{cases}$$
where b is a positive integer. Let Ωn(ρ) ⊂ 𝒫n be the set of length n conformations with compactness ρ and Ωn(ρ, S) ⊂ Ωn(ρ) be the set of length n conformations with at least one void of shape S and compactness ρ. The fraction of conformations of a particular compactness ρ with a void of a particular shape S among all conformations with the same compactness ρ is defined as:
$$f_4(\rho, S, n) = \frac{N(\rho, S, n)}{|\Omega_n(\rho)|}, \qquad N(\rho, S, n) = |\Omega_n(\rho, S)|. \tag{4}$$
This parameter represents the propensity of void formation with certain compactness, i.e., the probability of forming a void of specified shape for length n chain polymers at a fixed compactness.
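The compactness of a given conformation can be computed as in the sketch below. It assumes that a contact is a pair of non-bonded monomers on adjacent lattice sites, and that tmax(n) for the square lattice takes the piecewise form n − 2b for b² < n ≤ b(b + 1) and n − 2b − 1 for b(b + 1) < n ≤ (b + 1)²; both the function names and these stated forms should be read as assumptions of the sketch.

```python
from math import isqrt


def t_max(n):
    """Assumed maximum number of contacts of a length-n square-lattice chain."""
    b = isqrt(n - 1)  # the integer b with b^2 < n <= (b + 1)^2
    return n - 2 * b if n <= b * (b + 1) else n - 2 * b - 1


def contacts(chain):
    """Number of non-bonded monomer pairs occupying adjacent lattice sites."""
    index = {site: i for i, site in enumerate(chain)}
    t = 0
    for i, (x, y) in enumerate(chain):
        # Look only right and up so that each lattice edge is counted once.
        for nb in ((x + 1, y), (x, y + 1)):
            j = index.get(nb)
            if j is not None and abs(i - j) > 1:
                t += 1
    return t


def compactness(chain):
    """rho = t / t_max(n); undefined for chains too short to form a contact."""
    tm = t_max(len(chain))
    if tm == 0:
        raise ValueError("compactness is undefined when t_max(n) = 0")
    return contacts(chain) / tm
```

A maximally compact conformation, such as a chain filling a 3 × 3 square, has ρ = 1 under these definitions.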
D. Estimating void parameters using sequential Monte Carlo
Exhaustive enumeration can be used to calculate the propensities defined above, but is only applicable to very short polymer chains. For longer chains, we use a modified version of the sequential Monte Carlo (SMC) method.
All the parameters described above are fractions, where the corresponding numerators and denominators N(S, n), N(l, S, n), N(I0, l, S, n) and N(ρ, S, n) can be written in the form of
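$$\sum_{X_n \in \Omega_n(S)} h(X_n), \tag{5}$$
where h(·) is a function of conformation Xn. Specifically, we have:
$$h(X_n) = \begin{cases} 1 & \text{for } N(S, n), \\ \xi(X_n, l, S)/K(X_n, S) & \text{for } N(l, S, n), \\ \xi(X_n, I_0, l, S)/K(X_n, S) & \text{for } N(I_0, l, S, n), \\ \mathbb{I}_{\Omega_n(\rho)}(X_n) & \text{for } N(\rho, S, n), \end{cases}$$
where 𝕀Ω(Xn) is the indicator function: 𝕀Ω(Xn) = 1 if Xn is in the set Ω, and 𝕀Ω(Xn) = 0 otherwise.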
Suppose we can generate random samples of conformations $X_n^{(j)}$, j = 1, ⋯, m, from a trial distribution g(Xn). Following the importance sampling principle33,34, formula (5) can be estimated as:
$$\frac{1}{m} \sum_{j=1}^{m} \frac{h(X_n^{(j)})\, \mathbb{I}_{\Omega_n(S)}(X_n^{(j)})}{g(X_n^{(j)})}. \tag{6}$$
Note that to obtain an unbiased estimate, the support of the trial distribution g(Xn) must contain that of h(Xn)𝕀Ωn(S)(Xn), that is, g(Xn) > 0 must hold for all Xn in Ωn(S) that satisfy h(Xn) > 0. Let $w^{(j)} = \mathbb{I}_{\Omega_n(S)}(X_n^{(j)}) / g(X_n^{(j)})$ be the weight of sample $X_n^{(j)}$; then Eqn (6) can be rewritten as
$$\frac{1}{m} \sum_{j=1}^{m} w^{(j)} h(X_n^{(j)}). \tag{7}$$
The efficiency of the estimator in Eqn (6) depends on the choice of the trial distribution g(Xn) and the computational complexity of generating a sample. In general, if g(Xn) is approximately proportional to |h(Xn)𝕀Ωn(S)(Xn)|, with a support larger than but close to Ωn(S), the estimate can be reasonably accurate34.
The original Rosenbluth and Rosenbluth growth method generates samples in the space of 𝒫n28. Starting at x1 = (0, 0), monomers are added to the chain and the associated weights are updated recursively, until the chain reaches length n. Modifications of the algorithm can be found in Refs. 24, 34–36. However, the space under our consideration is a highly constrained subspace of 𝒫n. For example, for void shape 4.1 and chain length 50, the size of the constrained space Ωn(S) is less than 2 × 10⁻³ of the size of 𝒫n. With additional constraints such as fixed loop length, the space becomes even smaller and sampling such conformations becomes more difficult. The simple growth method of Ref. 28 is very inefficient in generating samples for such a constrained space. Below we reformulate the sampling space and modify the growth method to overcome this difficulty.
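For reference, the unconstrained growth scheme can be sketched as follows. This is a minimal illustration of Rosenbluth-type growth with importance weights, not the constrained method developed in this paper; for simplicity it counts all SAWs starting at the origin without the symmetry reduction used to define 𝒫n, and the function name and arguments are hypothetical.

```python
import random


def rosenbluth_count(n, samples, seed=0):
    """Estimate the number of length-n SAWs (n monomers) from the origin."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        chain = [(0, 0)]
        occupied = {(0, 0)}
        weight = 1.0
        while len(chain) < n:
            x, y = chain[-1]
            free = [p for p in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1))
                    if p not in occupied]
            if not free:
                weight = 0.0  # dead end: the sample contributes zero
                break
            # The next site is chosen uniformly among the free neighbors,
            # so the weight 1/g grows by the number of choices.
            weight *= len(free)
            nxt = rng.choice(free)
            chain.append(nxt)
            occupied.add(nxt)
        total += weight
    return total / samples


estimate = rosenbluth_count(n=11, samples=5000, seed=1)
```

The average weight is an unbiased estimate of the number of walks. For n = 11 monomers (10 steps) the exact count on the square lattice is 44100, which the estimate approaches as the number of samples grows. Requiring a void of a given shape would set the weight of almost every such sample to zero, which illustrates why the plain growth method is inefficient for the constrained space.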
1. An equivalent representation of Ωn(S)
In order to avoid location ambiguity, the construction of 𝒫n is restricted to the set of SAW conformations starting at x1 = (0, 0), x2 = (1, 0) and going up the first time the chain deviates from the x-axis. Since our main interest is in sampling conformations containing a specific void, we adopt an equivalent representation that is more efficient for our purpose.
Specifically, let υ = υ(S) be a set of sites in ℤ² whose union takes the shape S. Let A(υ) = (a1(υ), ⋯, a|A(υ)|(υ)) be the set of neighboring sites of υ, sharing either edges or vertices with υ. We call these the wall sites of υ. If a SAW completely occupies A(υ) and does not intersect υ, then this SAW has at least one void of shape S. Denote the set of all such conformations as
$$G_n(\upsilon) = \{X_n : X_n \text{ is a length-}n \text{ SAW},\ A(\upsilon) \subset X_n,\ X_n \cap \upsilon = \emptyset\}.$$
Recall that by definition a conformation in 𝒫n first grows to the right, and always goes up when it first deviates from the x-axis. Note that the conformations in Gn(υ) are not restricted to 𝒫n, as they can start from any site on the lattice. In Gn(υ), we consider two SAWs as equivalent if one SAW can be transformed into the other through a combination of rotation, reflection, and translation. Then Gn(υ) consists of a number of disjoint equivalence classes.
It can be easily established that there is a one-to-one mapping between conformations in Ωn(S) and the equivalence classes of conformations in Gn(υ) through transformations composed of rotations, reflections, and translations. Each class is mapped by the transformation under which the starting site x1 of Xn ∈ Gn(υ) becomes the origin (0, 0), the second site x2 becomes (1, 0), and the first step that deviates from the x-axis goes up. Hence, if h(·) is a function of Xn that takes the same value for equivalent conformations, we have:
$$\sum_{X_n \in \Omega_n(S)} h(X_n) = \sum_{X_n \in G_n(\upsilon)} \frac{h(X_n)}{E(X_n, S)},$$
where E(Xn, S) is the number of equivalent conformations of Xn in Gn(υ).
The number E(Xn, S) depends on K(Xn, S), the number of shape-S voids contained in Xn (as in Eqn (2)), and on the symmetry of the shape S. Let q(υ) be the number of combinations of rotation and reflection that map υ to itself. In two-dimensional lattice space, there are 4 possible rotations and 2 possible reflections around the x and y axes. Hence, q(υ) can only take a value in {1, 2, 4, 8}. For example, q(υ) = 8 for shape 4.2 in Figure 2, q(υ) = 4 for shapes 2.1 and 3.1, q(υ) = 2 for shapes 4.4 and 4.5, and q(υ) = 1 for shape 4.3.
When Xn contains only one S-shaped void, the size of its equivalence class E(Xn, S) is q(υ). Figure 3 shows 4 polymers in an equivalence class for void 2.1. When Xn contains a total of K(Xn, S) S-shaped voids, then E(Xn, S) = q(υ)K(Xn, S), as each of the voids contributes q(υ) members to the equivalence class.
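The symmetry number q(υ) can be computed by applying the eight rotation and reflection operations of the square lattice to the void and counting how many reproduce the same set of sites up to translation. The sketch below does this; the coordinate sets used to represent shapes are our own assumptions about the geometry (for instance, that the single size-2 shape is a domino and that the size-4 shape with q(υ) = 8 is the 2 × 2 square).

```python
# The eight symmetry operations of the square lattice:
# four rotations, each with and without a reflection.
D4 = [
    lambda x, y: (x, y), lambda x, y: (-y, x),
    lambda x, y: (-x, -y), lambda x, y: (y, -x),
    lambda x, y: (-x, y), lambda x, y: (-y, -x),
    lambda x, y: (x, -y), lambda x, y: (y, x),
]


def normalize(cells):
    """Translate a set of sites so that its bounding box starts at (0, 0)."""
    cells = list(cells)
    mx = min(x for x, _ in cells)
    my = min(y for _, y in cells)
    return frozenset((x - mx, y - my) for x, y in cells)


def symmetry_number(void):
    """q(v): number of rotation/reflection operations mapping v to itself."""
    base = normalize(void)
    return sum(normalize(op(x, y) for x, y in void) == base for op in D4)


def equivalence_class_size(void, n_voids):
    """E(X, S) = q(v) * K(X, S) for a conformation with K shape-S voids."""
    return symmetry_number(void) * n_voids


domino = [(0, 0), (1, 0)]
square = [(0, 0), (1, 0), (0, 1), (1, 1)]
l_tetromino = [(0, 0), (0, 1), (0, 2), (1, 0)]
```

Under these assumed geometries the computed values (4 for the domino, 8 for the square, 1 for the chiral L-shaped tetromino) agree with the values of q(υ) quoted above.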
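To simplify our analysis, we note that Gn(υ) consists of disjoint subsets:
$$G_n(\upsilon) = \bigcup_{i=1}^{|A(\upsilon)|} \bigcup_{k=|A(\upsilon)|}^{n} G_n(\upsilon, i, k),$$
where
$$G_n(\upsilon, i, k) = \{X_n \in G_n(\upsilon) : x_k = a_i(\upsilon),\ x_j \notin A(\upsilon) \text{ for all } j > k\}.$$
If Xn ∈ Gn(υ, i, k), then the void υ is completely enclosed by the prefix (x1, ⋯, xk) of the chain, where xk is the last monomer in the prefix and occupies the i-th site ai(υ) of the wall sites. We have k ≥ |A(υ)| since some of the monomers in the prefix (x1, ⋯, xk) may not be on the wall of the void. The remaining chain, (xk+1, …, xn), intersects neither the void space υ nor the wall sites A(υ).
Using this partition, we have that for any function h(·) that is constant within the equivalence classes,
$$\sum_{X_n \in \Omega_n(S)} h(X_n) = \sum_{i=1}^{|A(\upsilon)|} \sum_{k=|A(\upsilon)|}^{n} \sum_{X_n \in G_n(\upsilon, i, k)} \frac{h(X_n)}{E(X_n, S)}. \tag{8}$$
In the following we develop procedures to estimate the quantity
$$\sum_{X_n \in G_n(\upsilon, i, k)} \frac{h(X_n)}{E(X_n, S)}$$
for each subset Gn(υ, i, k), i = 1, ⋯, |A(υ)|, k = |A(υ)|, ⋯, n.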
2. Algorithmic steps
The following procedure is used to generate Monte Carlo samples in Gn(υ, i, k) for all υ, i and k, which are then used to estimate the parameters listed in Section II C. First, we set xk = ai(υ) as defined by Gn(υ, i, k). We then grow backwards sequentially to place xk−1, xk−2, ⋯, until we reach the first monomer x1 of the chain. During this process, the wall sites A(υ) become fully occupied by monomers in {x1, …, xk}, and the void space υ remains unoccupied. Lastly, now that (x1, ⋯, xk) are placed and the constraints for void formation are satisfied, we complete the remaining conformation by sequentially placing monomers xk+1, ⋯, xn. The only constraint at this stage is that these monomers avoid the partial chain grown so far. An illustration of the procedure is shown in Figure 4.
For ease of presentation, we rearrange the monomer labels based on the above procedure. Define yn = (y1, …, yn) as (xk, …, x1, xk+1, …, xn). Formally, ys = xk−s+1 for s ≤ k, and ys = xs for s > k. In this notation, the chain prefix of length s becomes ys = (y1, …, ys).
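The relabeling is a simple reversal of the first k monomers, as the following sketch shows (the function name `relabel` is hypothetical, and k is 1-based as in the text).

```python
def relabel(x, k):
    """Map (x1, ..., xn) to (y1, ..., yn) = (xk, ..., x1, x(k+1), ..., xn)."""
    return x[:k][::-1] + x[k:]


x = ["x1", "x2", "x3", "x4", "x5", "x6"]
y = relabel(x, 3)  # growth order: x3, x2, x1, then x4, x5, x6
```

Applying the same operation to y recovers x, so the original chain order is obtained from a finished sample by one more call to `relabel`.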
We adopt the general framework of the optimal sampling method37 to generate sample conformations. Let ms be the number of samples we retain in the s-th iteration, and mmax be the maximum value of {ms}. In the initial step, we set m1 = 1, $y_1^{(1)} = a_i(\upsilon)$, and $w_1^{(1)} = 1$. For s = 2, …, n, we perform the following procedure:
At step s, when adding the s-th monomer, assume there are ms−1 samples $\mathbf{y}_{s-1}^{(j)}$ with weights $w_{s-1}^{(j)}$, j = 1, ⋯, ms−1.
- We now add the s-th monomer to the partial chain ys−1. For each sample $\mathbf{y}_{s-1}^{(j)}$, j = 1, ⋯, ms−1, generate $k_{s-1}^{(j)}$ new samples by placing ys at each of the vacant sites neighboring $y_{s-1}^{(j)}$, where $k_{s-1}^{(j)}$ is the number of vacant sites neighboring $y_{s-1}^{(j)}$ in sample $\mathbf{y}_{s-1}^{(j)}$. Set the weight of each new sample to $\tilde{w}_s^{(l)} = w_{s-1}^{(j)}$. Assume this step results in a total of $L_s$ samples $\tilde{\mathbf{y}}_s^{(l)}$, l = 1, ⋯, Ls.
Note that step k + 1 is slightly different. At steps 1, …, k, we grow the chain backwards. But at step k + 1, we start to grow the chain forward. That is, we place yk+1 = xk+1, which is connected to y1 = xk. Hence, at step k + 1, we consider the vacant neighbor(s) of y1, not yk.
- Assign a priority score $\beta_s^{(l)}$ to each resulting partial chain $\tilde{\mathbf{y}}_s^{(l)}$. The choice of the priority scores is discussed in detail in the next section.
- If Ls ≤ mmax, we keep all of the samples with their weights, set ms = Ls and go to step s + 1. If Ls > mmax, we choose mmax distinct samples from $\{\tilde{\mathbf{y}}_s^{(l)}\}$ according to the priority scores as follows:
  - Find a constant c such that $\sum_{l=1}^{L_s} \min\{c\beta_s^{(l)}, 1\} = m_{\max}$.
  - Choose distinct integers J1, J2, ⋯, Jmmax from l = 1, ⋯, Ls, with probability $\min\{c\beta_s^{(l)}, 1\}$. This is achieved by the following steps:
    - Draw a sample r0 from the uniform distribution between 0 and 1. Let rj = j − r0 for j = 1, ⋯, mmax;
    - For each j = 1, ⋯, mmax, choose Jj as the integer such that $\sum_{l=1}^{J_j - 1} \min\{c\beta_s^{(l)}, 1\} < r_j \le \sum_{l=1}^{J_j} \min\{c\beta_s^{(l)}, 1\}$ holds.
  - Let $\mathbf{y}_s^{(j)} = \tilde{\mathbf{y}}_s^{(J_j)}$ and update its weight to be $w_s^{(j)} = \tilde{w}_s^{(J_j)} / \min\{c\beta_s^{(J_j)}, 1\}$. Set ms = mmax.
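The selection step can be sketched in code as follows. The sketch assumes that each candidate is retained with probability min{cβ, 1}, where β is its priority score and c is chosen so that these probabilities sum to the number of samples to keep, and that the weight of a retained sample is divided by its retention probability. The function names and the bisection search for c are our own choices for illustration, not details taken from the original implementation.

```python
import random


def inclusion_probs(beta, m):
    """Return p_l = min(1, c * beta_l) with c chosen so that sum(p) = m.
    Assumes at least m of the scores are strictly positive."""
    lo, hi = 0.0, 1.0 / min(b for b in beta if b > 0)
    for _ in range(200):  # bisection on c; the sum is nondecreasing in c
        c = 0.5 * (lo + hi)
        if sum(min(1.0, c * b) for b in beta) < m:
            lo = c
        else:
            hi = c
    return [min(1.0, hi * b) for b in beta]


def resample(samples, weights, beta, m, rng):
    """Keep m distinct samples by systematic selection; reweight by 1 / p."""
    p = inclusion_probs(beta, m)
    r0 = rng.random()
    kept, cum, j = [], 0.0, 1
    for l, pl in enumerate(p):
        cum += pl
        # Sample l is chosen when r_j = j - r0 falls in (cum - pl, cum].
        if j <= m and j - r0 <= cum:
            kept.append((samples[l], weights[l] / p[l]))
            j += 1
    return kept


rng = random.Random(0)
beta = [5.0, 1.0, 1.0, 1.0, 0.5, 0.5]
kept = resample(list("abcdef"), [1.0] * 6, beta, 3, rng)
```

Because consecutive values rj differ by exactly 1 and no retention probability exceeds 1, no candidate can be chosen twice, and a candidate whose probability equals 1 is always retained with its weight unchanged.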
3. Priority scores
The priority score guides the growth of conformations, and its design is critically important for obtaining accurate estimates. Our priority scoring function has three components addressing three important issues, namely, the support of the target distribution, the weighting scheme of samples, and the look-ahead strategy.
The support of the target distribution
If a partial chain at step s cannot eventually grow into the constrained space Gn(υ, i, k) at step n, it should be removed from future steps of sampling immediately at step s, since it is destined to be rejected. Define the support 𝒮s of partial chains of length s as
$$\mathcal{S}_s = \{\mathbf{y}_s : \exists\, (y_{s+1}, \ldots, y_n) \text{ such that } (\mathbf{y}_s, y_{s+1}, \ldots, y_n) \in G_n(\upsilon, i, k)\}.$$
That is, 𝒮s contains all possible prefix chains of length s of desired polymers. However, it is difficult to evaluate whether a partial chain is in the support. Here we use a sequence of supports ψs that contain 𝒮s but are easy to work with. Specifically, let ψ1 = {ai(υ)}, where the chain starts according to the definition of Gn(υ, i, k). The support ψs is updated sequentially as follows: for each partial chain ys−1 ∈ ψs−1, find all possible chains ys by adding a monomer to a vacant neighboring site that shares an edge with ys−1. The new support ψs is the union of all such chains satisfying the following conditions:
(i) ys ∩ ys−1 = ∅ (the self-avoiding constraint) and ys ∉ υ, where υ is the void space.
(ii) If s ≤ k and if A(υ) \ ys is not an empty set (i.e., the wall sites have not all been filled by ys), then A(υ) \ ys must remain a strongly connected set. Here we define that a strong connection exists between two sites if they share an edge.
(iii) If s ≤ k and if A(υ) \ ys ≠ ∅, then the site ys must satisfy
$$d(y_s, A(\upsilon) \setminus \mathbf{y}_s) + |A(\upsilon) \setminus \mathbf{y}_s| - 1 \le k - s,$$
where d(ys, A(υ) \ ys) is the minimum Manhattan distance between ys and the unoccupied wall sites, and |A(υ) \ ys| is the number of unoccupied wall sites in A(υ) \ ys.
Condition (ii) reflects the property that both the filled and the unfilled sites on the wall of the void must remain strongly connected at any time during the growth. Otherwise, the unfilled wall sites A(υ) \ ys consist of multiple components that are not strongly connected to one another. In such cases, the self-avoiding property must be violated in order for yn ∈ Gn(υ, i, k) to fill all of them. This is a consequence of the Jordan curve theorem in the plane38.
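A check of the connectivity requirement on the unfilled wall sites can be sketched as follows. The wall is taken to be all sites sharing an edge or a vertex with the void, and two wall sites are treated as strongly connected only if they share an edge; the function names are hypothetical.

```python
from collections import deque


def wall_sites(void):
    """A(v): sites sharing an edge or a vertex with the void, excluding it."""
    void = set(void)
    return {(x + dx, y + dy) for x, y in void
            for dx in (-1, 0, 1) for dy in (-1, 0, 1)} - void


def is_edge_connected(sites):
    """True if the sites form a single component under edge adjacency."""
    sites = set(sites)
    if not sites:
        return True
    start = next(iter(sites))
    seen, queue = {start}, deque([start])
    while queue:
        x, y = queue.popleft()
        for nb in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nb in sites and nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return seen == sites


def unfilled_wall_connected(void, partial_chain):
    """Check that the wall sites not yet occupied remain strongly connected."""
    return is_edge_connected(wall_sites(void) - set(partial_chain))


domino = [(0, 0), (1, 0)]
```

For the size-2 void above, occupying two adjacent wall sites leaves the remaining eight wall sites in one piece, whereas occupying the two wall sites at opposite ends of the void splits the remainder into two separate rows, so such a partial chain would be discarded.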
Condition (iii) ensures that the remaining chain length is sufficient to fill all wall sites, i.e., A(υ) \ ys must be filled by (ys+1, …, yk), which is a length k − s chain connected to ys.
The priority scores without look-ahead
In the optimal sampling method framework37, the priority score serves both as the propagation trial distribution and as the resampling priority score. Under the importance sampling principle34, the ideal trial distribution should be proportional to |h(x)π(x)|, where π(x) is the target distribution. In our case, it translates to . Here h(ys) is the value of the function h(·) applied to the partial chain ys, which is always non-negative.
For s > k, we simply assign equal priority scores to all partial chain samples, since this stage is relatively easy.
For the more difficult part of the growth s ≤ k where the major constraints lie, we need to guide samples to grow into the support region ψs in order to reduce the sample rejection rate. Following Zhang and Liu36, we use the priority score to achieve this. Taking condition (iii) when updating the support into consideration, we define
It evaluates how much freedom and flexibility the remaining chain possess. When , there are still some vacant sites on the void wall needs to be occupied. In this case, if is not in the support ψs, as it violates condition (iii), we reject this sample. The larger is, the less constrained the remaining chain is.
Combining the value of the function to be evaluated, and reflecting the desired flexibility of the remaining chain, we design our priority score for s ≤ k as:
where Ts is a temperature-like variable. The choice of values for Ts is important. In general, the constraint of forming a void is not a serious concern at the beginning, so we can use high values of Ts to enhance diversity in sampling. As the chain grows, meeting the constraints becomes a stronger concern, since there is less freedom for the remaining chain to grow. Hence, we gradually reduce Ts, as in simulated annealing algorithms. In this study, we use for s = 1, ⋯, k − 1.
Priority score with look-ahead
An often-used strategy to improve the performance of SMC is look-ahead36,39,40. Look-ahead enables us to use information from possible future steps to construct priority scores, resulting in a smaller rejection rate of the samples. In addition, it reduces the variance of samples for estimation and hence improves sample efficiency41.
For a δ-step look-ahead, the priority score at step s is determined by exploring all possible combinations of δ-step growth from the current sample ys. Specifically, the priority scores we use are:
where ỹs+δ denotes (ỹs, ys+1, ⋯, ys+δ).
Note that as the look-ahead step δ increases, the effectiveness increases at the cost of exponentially growing computational complexity. Hence the choice of δ is a tradeoff between estimation efficiency and complexity. In this study we use δ = 1.
4. Estimation
In our framework, it is possible to estimate the parameters described in Section IIC for polymer chains of different lengths up to n when generating conformation samples of length n.
Specifically, when generating conformation samples for Gn(i, k), at step s = k, k + 1, ⋯, n, the generated partial conformations are , which are properly weighted chain polymers of length s. Hence, , n* = k, k + 1, ⋯, n, can be estimated by the following estimator
at step s = n*. Here the estimation is made after step (2) of the algorithmic steps in the previous subsection.
After generating samples for Gn(i, k) of all possible i, k, for any n* ≤ n, we can estimate ∑Xn* ∈Ωn*(S) h(Xn*) by
according to Eqn (8).
III. RESULTS
In this section, we present the estimation results for the parameters described in Section II C. We also develop parametric models relating void and chain properties, both to interpret the estimated results and to predict the propensity of forming a void of a specific shape.
A. Propensity of void formation
For the propensity of void formation f1(S, n) defined in Eqn (1), we first examine size-4 voids. There are 5 different shapes for size-4 regular voids (Figure 2). To validate our procedure, the estimated propensity of void formation is compared with the true values obtained by exhaustive enumeration, for chains of length 14 to 24. Figure 5(a) shows the results for void shapes 4.3 and 4.4. The estimated values are indistinguishable from the true values. These results suggest that our sampling method works well and can provide accurate estimates.
The results for longer chains of length 15 to 50 using the SMC procedure are presented in Figure 5(b), where the propensity of void formation f1(S, n) for void shapes 4.1 to 4.5 is shown. It is clear that voids of different shapes differ significantly in their propensity of formation. This raises the question of whether voids and binding sites in proteins are similarly biased, and whether the distribution of voids of different shapes can be partly explained by intrinsic propensities analogous to what is observed here on lattice models.
a. Predictive models
To better understand our estimation results and to infer general principles, we develop a predictive model for f1(S, n) using the following parametric form:
(9) |
where q(υ) represents the degeneracy of the void shape as we discussed in Section IID1. We consider three factors other than q(υ) in our model: the wall size |A(υ)|, the chain length n, and the number of outer corners of the void, |e(υ)|. Here the outer corners, e(υ), are defined as the sites on the void wall that connect to the void through a single vertex only. The values of q(υ), |A(υ)|, and |e(υ)| for different void shapes are summarized in Table I.
TABLE I.
void type | 2.1 | 3.1 | 3.2 | 4.1 | 4.2 | 4.3 | 4.4 | 4.5 | 5.1 | 6.1 | 6.2 | 6.3 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
q(υ) | 4 | 4 | 2 | 4 | 8 | 1 | 2 | 2 | 4 | 4 | 1 | 2 |
|A(υ)| | 10 | 12 | 12 | 14 | 12 | 14 | 14 | 14 | 16 | 18 | 18 | 18 |
|e(υ)| | 4 | 4 | 5 | 4 | 4 | 5 | 6 | 6 | 4 | 4 | 5 | 6 |
In this model, c1, c2, c3, c4 are positive constants. As the wall size |A(υ)| increases, it is expected that the propensity of forming voids of the specific shape decreases exponentially. This is reflected by the term containing . When the chain length n increases, it is expected that the propensity of forming voids of the specific shape increases by some power. This is captured by the term of (n − |A(υ)| + 1)c3. We also find that the number of outer corners, |e(υ)|, is an important determinant of propensity of void formation. For void shapes with more outer corners, chain polymers enclosing such voids have more concave turns on the wall. This makes it more difficult for a self-avoiding chain to enclose the void. The negative term of |e(υ)| in Eqn (9) models this effect.
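The wall size |A(υ)| and the number of outer corners |e(υ)| follow directly from the geometry of a void and can be computed as in the sketch below. An outer corner is taken to be a wall site that touches the void only diagonally, i.e., it shares no edge with any void site. The coordinate sets are our own assumptions about the shapes (for example, that shape 2.1 is a domino and shape 4.2 is the 2 × 2 square), since the shapes are defined only in Figure 2.

```python
def wall_sites(void):
    """A(v): sites sharing an edge or a vertex with the void, excluding it."""
    void = set(void)
    return {(x + dx, y + dy) for x, y in void
            for dx in (-1, 0, 1) for dy in (-1, 0, 1)} - void


def outer_corners(void):
    """e(v): wall sites that share no edge with any void site."""
    void = set(void)
    return {(x, y) for x, y in wall_sites(void)
            if not {(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)} & void}


# Assumed geometries for a few of the shapes listed in Table I.
shapes = {
    "domino": [(0, 0), (1, 0)],
    "straight tromino": [(0, 0), (1, 0), (2, 0)],
    "L tromino": [(0, 0), (1, 0), (0, 1)],
    "straight tetromino": [(0, 0), (1, 0), (2, 0), (3, 0)],
    "square": [(0, 0), (1, 0), (0, 1), (1, 1)],
}
features = {name: (len(wall_sites(v)), len(outer_corners(v)))
            for name, v in shapes.items()}
```

Under these assumed geometries the computed pairs (|A(υ)|, |e(υ)|) match the corresponding columns of Table I, e.g., (10, 4) for the domino and (12, 5) for the L-shaped size-3 void.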
We estimate the coefficients in model (9) using the values of f1(S, n) estimated by SMC for voids of sizes 2 to 5 and chain lengths from 25 to 50. Taking a log transformation and using nonlinear regression, we found that ĉ1 = 47.46, ĉ2 = 2.28, ĉ3 = 0.76 and ĉ4 = 0.21.
The propensity values estimated from SMC and the fitted results from model (9) are plotted in Figure 6. It can be seen that the parametric model fits the data very well. Using the parameters estimated from the training data, we predict the propensity for void shapes 6.1, 6.2 and 6.3, which are not used in deriving the regression model. The predictions are again compared with those estimated by SMC (Figure 7). The model works well, although it consistently underestimates by a small amount for void shape 6.3.
B. Propensity of void formation with fixed loop length
Now we consider the propensity of void formation with fixed loop length f2(l, S, n) defined in Eqn (2). We plot the estimated f2(l, S, n = 50) for different specified loop lengths l and shapes S in Figure 9. Although voids with odd loop length do exist, we can see that it is much easier to form a void with even loop length. This is because the number of wall sites, |A(υ)|, is always an even number on the lattice. To form a void υ with odd loop length, the first monomer and the last monomer of the polymer on the void wall A(υ) cannot be adjacent, which results in a more complicated shape. A conformation enclosing a void of shape 4.1 with loop length 17 is given in Figure 8. On average, void shapes 4.1 and 4.3 have larger loop sizes than void shapes 4.4 and 4.5, because they have fewer corners. These results suggest that voids of different shapes have different propensities at specific loop lengths.
C. End effect for void formation
For the propensity of void formation with fixed loop length and specified starting position f3(I0, l, S, n) as defined in Eqn (3), we plot the estimated f3(I0, l = 14, S, n = 50) for voids of shapes S with a loop length of 14 in chain polymers of length 50 with different starting positions I0 in Figure 10. We find that the propensities f3(I0, l, S, n) at I0 = 1 and I0 = 2 are very different, indicating a strong end effect for void formation. That is, a void forms much more easily at the end of the conformation. This is likely due to a tail effect: a void at the end of the chain has only one tail, whereas a void in the middle of the conformation has two. It is difficult to constrain the tails to satisfy the multiple restrictions for forming a void of a certain shape.
D. Propensity of void formation at different compactness
Figure 11 shows the estimated propensity values of void formation at different compactness f4(ρ, S, n) defined in Eqn (4) for chain lengths from n = 30 to 50. Conformations with size-4 voids are dominated by those at compactness around 0.3–0.7. If we normalize f4(ρ, S, n), that is, we define
$$\bar{f}_4(\rho, S, n) = \frac{f_4(\rho, S, n)}{\sum_{\rho'} f_4(\rho', S, n)},$$
then f̅4(ρ, S, n) can be considered as a distribution of ρ. We plot the 0.25-quantile, median value, and 0.75-quantile of the distribution f̅4(ρ, S, n) for different chain lengths n and fixed shape S in Figure 12. We can see that these values increase slightly as n increases from 30 to 50. This indicates that the preferred compactness range for forming these size-4 voids shifts slightly toward more compact regions as chain length increases. We also compare the propensity values of forming voids of all size-2 regular shapes (2.1), voids of all size-3 regular shapes (3.1, 3.2), and voids of all size-4 regular shapes (4.1, 4.2, 4.3, 4.4, 4.5) for chains of length 50 at different compactness (Figure 13). The results show that smaller voids are easier to form as compactness increases. Our results from the lattice model suggest that there might be a preferred size for void formation in proteins, which all fall within a specific narrow range of compactness3.
IV. SUMMARY AND CONCLUSION
Protein molecules contain many voids buried in their interior, with a broad distribution [4]. Although most voids are likely to originate from generic steric constraints of compact chain polymers [4,5], some voids are the functional regions of many proteins, such as enzymes, where substrates and ligands bind and biochemical reactions occur [6,7].
An important general question is how the need for maintaining functional voids, which have to be of specific shape, is influenced by, and affects, other aspects of protein structure and properties, e.g., protein folding stability, kinetic accessibility, and evolutionary selection pressure. These are broad and complex issues that require detailed studies.
In this work, we study the effects of maintaining voids of defined shape using a lattice model. Because the conformational space of simplified polymers can be examined in detail, lattice models have been widely used in protein studies and have led to important insights about protein folding. The focus of our study is to generate a large number of sample conformations under very constraining restrictions in order to study general properties of voids and their shapes. We use the sequential Monte Carlo method and have developed an efficient growth method to generate conformation samples in a highly constrained space.
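To make the idea of chain growth with importance weights concrete, the following is a minimal sketch of plain Rosenbluth-style sequential growth of self-avoiding walks on the two-dimensional square lattice. It is not the constrained growth method developed in this work: it imposes no void-shape restrictions and uses no resampling. All function names (`grow_walk`, `sample_walks`, `estimate_count`, `weighted_fraction`, `ends_next_to_start`) are hypothetical.

```python
import random

MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def grow_walk(n_steps, rng):
    """Grow one self-avoiding walk; return (sites, importance weight)."""
    sites = [(0, 0)]
    occupied = {(0, 0)}
    weight = 1.0
    for _ in range(n_steps):
        x, y = sites[-1]
        free = [(x + dx, y + dy) for dx, dy in MOVES
                if (x + dx, y + dy) not in occupied]
        if not free:
            return sites, 0.0  # dead end: this sample carries zero weight
        weight *= len(free)  # inverse probability of the step chosen below
        nxt = rng.choice(free)
        sites.append(nxt)
        occupied.add(nxt)
    return sites, weight


def sample_walks(n_steps, n_samples, seed=0):
    """Draw independent weighted samples with a reproducible seed."""
    rng = random.Random(seed)
    return [grow_walk(n_steps, rng) for _ in range(n_samples)]


def estimate_count(samples):
    """The mean weight estimates the total number of self-avoiding walks."""
    return sum(w for _, w in samples) / len(samples)


def weighted_fraction(samples, predicate):
    """Estimate the fraction of walks satisfying predicate when all walks
    are treated as equally likely."""
    total = sum(w for _, w in samples)
    hits = sum(w for sites, w in samples if w > 0 and predicate(sites))
    return hits / total


def ends_next_to_start(sites):
    """True if the last site is a lattice neighbor of the first site."""
    (x0, y0), (x1, y1) = sites[0], sites[-1]
    return abs(x1 - x0) + abs(y1 - y0) == 1


samples = sample_walks(n_steps=4, n_samples=20000, seed=1)
count_estimate = estimate_count(samples)  # exact count for 4 steps is 100
```

A propensity can be estimated in the same way by passing a predicate that tests for the structural feature of interest to `weighted_fraction`. Under strong restrictions most plain-growth samples receive zero or negligible weight, which is why a specialized growth strategy is needed for the highly constrained ensembles considered here.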
We show that our approach is effective in estimating the entropy of void maintenance, with and without an increasing number of restrictive conditions, such as loops of fixed length forming the wall of the void, with an additionally fixed starting position in the sequence. Our results also lead to a number of observations, including that polymers in a certain compactness range favor the formation of voids of specific size, and that voids are far easier to form around the ends of the polymer. The latter finding raises the interesting question of whether voids and pockets tend to form at either the N-terminal or the C-terminal end in real proteins. A detailed analysis of voids and pockets in real proteins will be necessary to answer this question. In addition, we have developed a parametric model for explaining the propensity of forming voids of particular shapes, or equivalently, the entropic cost of maintaining such voids. Our model is highly effective in predicting the propensity of void formation for different shapes. Such a lattice model of voids representing functional sites can be used as an improved model for studying the evolution of protein functions [26], and how protein function relates to protein stability [27].
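The equivalence between propensity and entropic cost can be illustrated with a short sketch. It assumes that the propensity f is the fraction of all equally likely conformations that contain the void, so that the entropy lost by requiring the void is -k_B ln f. The function name `entropic_cost` and the numerical values are hypothetical and are not results from this study.

```python
import math


def entropic_cost(propensity: float) -> float:
    """Return -ln(f), the entropy loss in units of k_B, for a fraction f."""
    if not 0.0 < propensity <= 1.0:
        raise ValueError("propensity must lie in (0, 1]")
    return -math.log(propensity)


# Hypothetical propensities for two void shapes (illustration only).
cost_a = entropic_cost(1e-3)
cost_b = entropic_cost(1e-5)
extra_cost = cost_b - cost_a  # equals ln(100): the second void is 100-fold rarer
```

On this scale, a ratio of propensities between two shapes becomes a difference of entropic costs, which makes comparisons between shapes additive.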
Although in this study we treat all conformations as equally likely, our approach can be applied to models with more realistic energy functions in a straightforward manner. The approach for sampling strongly constrained conformations that we developed in this study will be generally applicable for studying real proteins in three-dimensional space.
References
- 1. Richards FM. Ann. Rev. Biophys. Bioeng. 1977;6:151. doi: 10.1146/annurev.bb.06.060177.001055.
- 2. Chothia C. Nature. 1975;254:304. doi: 10.1038/254304a0.
- 3. Richards FM, Lim WA. Q. Rev. Biophys. 1994;26:423. doi: 10.1017/s0033583500002845.
- 4. Liang J, Dill KA. Biophys. J. 2001;81:751. doi: 10.1016/S0006-3495(01)75739-6.
- 5. Zhang J, Chen R, Tang C, Liang J. J. Chem. Phys. 2003;118:6102.
- 6. Laskowski RA, Luscombe NM, Swindells MB, Thornton JM. Protein Science. 1996;5:2438. doi: 10.1002/pro.5560051206.
- 7. Liang J, Edelsbrunner H, Woodward C. Protein Science. 1998;7:1884. doi: 10.1002/pro.5560070905.
- 8. Binkowski TA, Adamian L, Liang J. J. Mol. Biol. 2003;332:505. doi: 10.1016/s0022-2836(03)00882-9.
- 9. Tseng Y, Liang J. Mol. Biol. Evol. 2006;23(2):421. doi: 10.1093/molbev/msj048.
- 10. Lau KF, Dill KA. Macromolecule. 1989;93:6737.
- 11. Chan HS, Dill KA. Macromolecules. 1989;22:4559.
- 12. Chan HS, Dill KA. J. Chem. Phys. 1990;92:3118.
- 13. Shakhnovich E, Gutin A. J. Chem. Phys. 1990;93:5967.
- 14. Camacho CJ, Thirumalai D. Proc. Natl. Acad. Sci. USA. 1993;90:6369. doi: 10.1073/pnas.90.13.6369.
- 15. Pande VS, Joerg C, Grosberg AY, Tanaka T. J. Phys. A. 1994;27:6231.
- 16. Socci ND, Onuchic JN. J. Chem. Phys. 1994;101:1519.
- 17. Dill K, Bromberg S, Yue K, Fiebig K, Yee D, Thomas P, Chan H. Protein Science. 1995;4:561. doi: 10.1002/pro.5560040401.
- 18. Lu H, Liang J. Proteins. (Accepted).
- 19. Šali A, Shakhnovich EI, Karplus M. Nature. 1994;369:248. doi: 10.1038/369248a0.
- 20. Shrivastava I, Vishveshwara S, Cieplak M, Maritan A, Banavar JR. Proc. Natl. Acad. Sci. USA. 1995;92:9206. doi: 10.1073/pnas.92.20.9206.
- 21. Klimov DK, Thirumalai D. Phys. Rev. Lett. 1996;76:4070. doi: 10.1103/PhysRevLett.76.4070.
- 22. Mélin R, Li H, Wingreen N, Tang C. J. Chem. Phys. 1999;110:1252.
- 23. Kachalo S, Lu H, Liang J. Phys. Rev. Lett. 2006;96(5):058106. doi: 10.1103/PhysRevLett.96.058106.
- 24. Liang J, Zhang J, Chen R. J. Chem. Phys. 2002;117:3511.
- 25. Zhang J, Chen Y, Chen R, Liang J. J. Chem. Phys. 2004:592–603. doi: 10.1063/1.1756573.
- 26. Williams PD, Pollock DD, Goldstein R. Journal of Molecular Graphics and Modelling. 2001;19:150. doi: 10.1016/s1093-3263(00)00125-x.
- 27. Bloom JD, Wilke CO, Arnold FH, Adami C. Biophys. J. 2004;86:2758. doi: 10.1016/S0006-3495(04)74329-5.
- 28. Rosenbluth MN, Rosenbluth AW. J. Chem. Phys. 1955;23:356.
- 29. Gilks WR, Richardson S, Spiegelhalter DJ. Markov Chain Monte Carlo in Practice. Chapman & Hall; 1996.
- 30. Hamelryck T, Kent J, Krogh A. PLoS Comput. Biol. 2006;2:1121. doi: 10.1371/journal.pcbi.0020131.
- 31. Iba Y, Chikenji G, Kikuchi M. Journal of the Physical Society of Japan. 1998;67:3327.
- 32. Chan HS, Dill KA. J. Chem. Phys. 1989;90:492.
- 33. Marshall A. In: Meyer M, editor. Symposium on Monte Carlo Methods; Wiley; 1956. pp. 123–140.
- 34. Liu JS. Monte Carlo Strategies in Scientific Computing. New York: Springer; 2001.
- 35. Grassberger P. Phys. Rev. E. 1997;56:3682.
- 36. Zhang JL, Liu JS. J. Chem. Phys. 2002;117:3492.
- 37. Fearnhead P, Clifford P. J. R. Statist. Soc. B. 2003;65:887.
- 38. Hatcher A. Algebraic Topology. Cambridge, England: Cambridge University Press; 2002.
- 39. Meirovitch H. J. Phys. A: Math. Gen. 1982;15:L735.
- 40. Wang X, Chen R, Guo D. IEEE Trans. Signal Processing. 2002;50:241.
- 41. Kong A, Liu J, Wong W. J. Amer. Statist. Assoc. 1994;89:278.