Generating properly weighted ensemble of conformations of proteins from sparse or indirect distance constraints

Ming Lin; Hsiao-Mei Lu; Rong Chen; Jie Liang

doi:10.1063/1.2968605

2008 Sep 2;129(9):094101. doi: 10.1063/1.2968605

PMCID: PMC2640457 NIHMSID: NIHMS89558 PMID: 19044859

Abstract

Inferring three-dimensional structural information of biomacromolecules such as proteins from limited experimental data is an important and challenging task. Nuclear Overhauser effect measurements based on nuclear magnetic resonance, disulfide linking, and electron paramagnetic resonance labeling studies can all provide useful partial distance constraints characteristic of the conformations of proteins. In this study, we describe a general approach for reconstructing conformations of biomolecules that are consistent with given distance constraints. Such constraints can be in the form of upper bounds and lower bounds of distances between residue pairs, contact maps based on specific contact distance cutoff values, or indirect distance constraints such as experimental ϕ-value measurement. Our approach is based on the framework of the sequential Monte Carlo method, a chain growth-based method. We have developed a novel growth potential function to guide the generation of conformations that satisfy given distance constraints. This potential function incorporates not only the distance information of the current residue during growth but also the distance information of future residues by introducing global distance upper bounds between residue pairs and the placement of reference points. To obtain protein conformations from indirect distance constraints in the form of experimental ϕ-values, we first generate properly weighted contact maps satisfying the ϕ-value constraints and then generate conformations from these contact maps. We show that our approach can faithfully generate conformations that satisfy the given constraints, which approach the native structures when distance constraints for all residue pairs are given.

INTRODUCTION

Three-dimensional structures of biomacromolecules (such as protein, DNA, and RNA molecules) are essential for understanding their biological functions. The primary sources of structural information of biomacromolecules are x-ray diffractions¹ and nuclear magnetic resonance (NMR) experiments.²^,³ In NMR studies, the assessment of chemical shifts of atomic nuclei with spins can provide information about distances between specific pairs of atoms. In addition, a number of biochemical techniques such as disulfide linking⁴^,⁵ and electron paramagnetic resonance labeling⁶^,⁷ can also provide useful partial information about distances between residues. When accurate values of distances between residue pairs are available, they can be represented by a distance matrix, where an entry (i,j) of the matrix denotes the distance between the corresponding two residues i and j. In other cases, the distance information is inexact, but can be represented by a contact map, which denotes whether the distance of each residue pair is below or above a specified distance threshold.

With distance information of various levels of details, a challenging problem is to integrate all the distance constraints into global information about the properties of ensemble structures of the biomacromolecule.²^,⁸^,⁹ When the distance constraints are complete and accurate, the positions of all residues can be obtained by carrying out a singular value decomposition calculation.¹⁰^,¹¹ When the distance constraints are incomplete or inaccurate, one needs to solve an optimization problem and find the structures that best reproduce these distance constraints.⁹^,¹²^,¹³ If the size of the molecule is large, this is a very challenging problem.
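As an illustration of the complete-and-accurate case, the sketch below recovers coordinates from a full distance matrix with a singular value decomposition. It assumes the classical multidimensional scaling construction (double centering of the squared distances), which is one common SVD-based route and not necessarily the exact procedure of Refs. 10 and 11. The function names and the random test points are hypothetical.

```python
import numpy as np

def coords_from_distances(dist, dim=3):
    """Recover coordinates, up to rotation, reflection, and translation,
    from a complete and exact distance matrix (classical MDS, assumed here)."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Gram matrix of the centered coordinates.
    gram = -0.5 * centering @ (dist ** 2) @ centering
    u, s, _ = np.linalg.svd(gram)
    # Keep the leading `dim` singular directions.
    return u[:, :dim] * np.sqrt(s[:dim])

def pairwise_distances(x):
    diff = x[:, None, :] - x[None, :, :]
    return np.linalg.norm(diff, axis=-1)

rng = np.random.default_rng(0)
true_coords = rng.normal(size=(6, 3))      # hypothetical residue positions
dist = pairwise_distances(true_coords)
recovered = coords_from_distances(dist)
```

The recovered coordinates reproduce the input distances but generally differ from the original points by a rigid motion or reflection.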

Several previous studies address the problem of generating protein conformations from contact maps.¹⁴^,¹⁵ These approaches can be expanded to generate conformations when only indirect or implicit distance constraints are available. In Ref. 16, Vendruscolo et al. generated the conformations of the transition state ensemble (TSE) important in protein folding studies. In this case, no explicit distance constraints between residue pairs are given. Rather, indirect information in the form of ϕ-value constraints is known for a subset of residues. Here the ϕ-value at a residue measured experimentally is interpreted as the ratio of the average number of native contacts formed by the residue in the transition state conformations to the number of contacts formed in the native structure of ground state.

In this paper, we focus on protein structures and develop a general method to obtain an ensemble of conformations that satisfy distance constraints given either in the form of an incomplete set of distance bounds, a set of binary conditions whether the distance is below or above a specific threshold, or in the indirect form of experimentally measured ϕ-values. Our approach is based on the framework of the sequential Monte Carlo (SMC) method, a growth-based method in which residues are added to an existing partial chain one by one until a conformation of full length is obtained.¹⁷^,¹⁸^,¹⁹ In addition to generating structures, we can also estimate important physical properties of molecular ensembles. As the probabilities of growing viable conformation samples become exceedingly small because of strong distance constraints and the self-avoiding requirement, an efficient sampling strategy becomes critical in order to obtain full chain conformations consistent with all distance constraints.

This paper is organized as follows. In Sec. 2, we introduce a cubic lattice model and the incomplete and indirect distance constraints we work with. We then discuss the general approach of SMC method and a new growth potential function. Results are presented in Sec. 3.

MODEL AND METHOD

Protein model and distance constraints

Three-dimensional model for proteins

We use a three-dimensional cubic lattice model to represent a protein conformation. The sides of a cubic cell have unit length γ=1.3 Å. A length-n protein conformation is represented by a connected chain x_n=(x₁,x₂,…,x_n), where the ith C_α atom of the conformation is located at site x_i=(x_i1,x_i2,x_i3) on the cubic lattice.¹⁸^,²⁰^,²¹

For protein molecules, the locations of C_α atoms satisfy certain constraints. We assume the C_α atoms in our model can only be placed on the lattice sites with the following constraints. First, the Euclidean distance between neighboring residues x_i−1 and x_i is between 3.5 and 4.1 Å. Second, the direction of the vector x_i−x_i−1 must be within 30° of one of four canonical directional vectors, which are specifically determined by the residue type of residue i and the locations x_i−1, x_i−2, and x_i−3. These canonical vectors are derived from a discrete four-state off-lattice model of proteins, which gives four possible locations of x_i for monomer i given the locations of x_i−1, x_i−2, and x_i−3, and the type of residue i−1. Details of obtaining optimal canonical directional vectors are given in Ref. 22. Third, we enforce the self-avoiding constraint. Specifically, non-neighboring residues are not permitted to be closer than 3.5 Å, which is the smallest distance of non-neighboring residues observed in 646 representative proteins from the Protein Data Bank (PDB).

On average, there are about 23 candidate positions for placing x_i in our model, although the exact number depends on the different relative positions of x_i−3, x_i−2, and x_i−1, as well as the type of the (i−1)th residue. Figure 1 provides an illustration of this lattice model.

Figure 1. Illustration of the cubic lattice model. We have set the cell unit length to 1.8 Å instead of 1.3 Å here for clarity. Given the locations of x_i−3, x_i−2, and x_i−1, there are 21 positions (marked by “◻” and “◯”) satisfying the first distance condition. Only nine positions (marked by “◯”) among them also satisfy the second direction condition. These positions all satisfy the third self-avoidance condition.
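The first placement condition can be made concrete with a short enumeration. The sketch below lists the lattice offsets x_i−x_i−1 whose length falls between 3.5 and 4.1 Å for a cell side of 1.3 Å. It covers only this bond-length condition: the direction condition and the self-avoidance condition are not applied, and the function name is hypothetical.

```python
import itertools
import math

GAMMA = 1.3              # lattice cell side length in angstroms (from the text)
D_MIN, D_MAX = 3.5, 4.1  # allowed distance between consecutive C-alpha atoms

def bond_offsets(gamma=GAMMA, d_min=D_MIN, d_max=D_MAX):
    """Integer lattice offsets whose physical length lies in [d_min, d_max]."""
    reach = int(math.floor(d_max / gamma))
    offsets = []
    for v in itertools.product(range(-reach, reach + 1), repeat=3):
        length = gamma * math.sqrt(sum(c * c for c in v))
        if d_min <= length <= d_max:
            offsets.append(v)
    return offsets

offsets = bond_offsets()
print(len(offsets))
```

With these values only offsets of squared integer length 8 or 9 survive, which gives 42 sites before the direction and self-avoidance conditions reduce the set further.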

Direct distance constraints

Distance constraints for protein chain x_n=(x₁,…,x_n)∊R³ are written in the following general form as

l_{i j} \leq ‖ x_{i} - x_{j} ‖ \leq u_{i j} for all (i, j) ∊ D,

where ‖x_i−x_j‖ is the Euclidean distance between residues i and j; l_ij and u_ij are the lower and upper bounds of the distance between residues i and j. Here l_ij can be 0 and u_ij can be +∞ if only upper bound or lower bound is available, respectively. D is a given set of (i,j) residue pairs, in which such constraints are known. D is often a much smaller subset of the complete set of all residue pairs. The problem of determining the conformation x_n=(x₁,…,x_n) according to such distance constraints has been studied before.⁸^,²³ In this paper, we focus on constraints of distance between C_α atoms, and we only consider the structure of C_α chain. The general principle can be applied to other types of distance constraints, and side chain repacking methods can be used to generate more detailed protein structures.²⁴^,²⁵

Indirect distance constraints and experimentally measured ϕ-values

An important class of studies on protein folding is to characterize the properties of the TSE. The TSE represents the structures around the saddle point of the potential energy surface, and these structures are often followed by a large structural change in the protein unfolding process.²⁶^,²⁷^,²⁸^,²⁹

It is challenging to characterize the TSE due to the complexity of the folding and unfolding processes. Experimental research on this problem focuses on the measurement of ϕ-values at individual residue positions, defined as the ratio of change in stability to the transition state upon mutation versus the change to the native folded state.²⁷^,²⁸^,²⁹^,³⁰^,³¹

The ϕ-value provides information about the native likeness of the TSE.³²^,³³ For example, if ϕ_i, the ϕ-value at residue i, is close to 1, the transition state is thought to have almost the same structure at residue i as the native state. If ϕ_i is close to 0, the transition state is likely to be in the denatured state in this region. An important question therefore is to translate ϕ-value measurements into explicit conformational information of protein structures in the TSE.¹⁶^,³¹^,³⁴

Let $ϕ_{i}^{exp}$ be the experimentally measured ϕ-value at residue i. Based on experimental observations, it is reasonable to assume that changes in protein stability are proportional to the change in the number of contacts in a protein structure.³⁵ Based on this assumption, the calculated ϕ-value $ϕ_{i}^{calc}$ , which relates to the protein structure, is defined as $ϕ_{i}^{calc} = C_{i}^{TSE} ∕ N_{i}^{N}$ , where $C_{i}^{TSE}$ is the average number of contacts formed by residue i in the TSE and $N_{i}^{N}$ is the number of contacts formed by residue i in the native structure. In studies based on molecular dynamics simulations, Li and Daggett³¹ showed that $ϕ_{i}^{calc}$ is in good agreement with $ϕ_{i}^{exp}$ . Vendruscolo et al.¹⁶ further introduced a different definition of $ϕ_{i}^{calc}$ ,

ϕ_{i}^{calc} = \frac{N_{i}^{TSE}}{N_{i}^{N}},

(1)

where $N_{i}^{TSE}$ is taken as the average number of native contacts instead of all contacts of residue i in the TSE. In this case, the TSE is defined as the set of conformations with $ϕ_{i}^{calc}$ very close to the corresponding experimentally measured $ϕ_{i}^{exp}$ at all positions. An important question is therefore how to obtain explicit conformations of the TSE that satisfy these indirect distance constraints of ϕ-values. A number of studies have shown promising results.¹⁶^,³⁴
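Equation 1 can be evaluated directly once contact maps are available. The sketch below computes the calculated ϕ-value of one residue from a native contact map and a list of contact maps standing in for the TSE. It assumes an unweighted average over the ensemble, whereas a weighted ensemble would use a weighted mean. The toy maps and the function name are hypothetical.

```python
import numpy as np

def phi_calc(tse_contact_maps, native_map, residue):
    """Average number of native contacts kept by `residue` in the ensemble,
    divided by its number of native contacts (Eq. 1, unweighted average)."""
    native_partners = np.flatnonzero(native_map[residue])
    if native_partners.size == 0:
        raise ValueError("residue has no native contacts")
    kept = [cmap[residue, native_partners].sum() for cmap in tse_contact_maps]
    return float(np.mean(kept)) / native_partners.size

# Hypothetical 4-residue example: residue 0 contacts residues 2 and 3 natively.
native = np.zeros((4, 4), dtype=int)
native[0, 2] = native[2, 0] = 1
native[0, 3] = native[3, 0] = 1
tse_a = native.copy()                 # keeps both native contacts
tse_b = native.copy()
tse_b[0, 3] = tse_b[3, 0] = 0         # keeps only the (0, 2) contact
phi_0 = phi_calc([tse_a, tse_b], native, residue=0)
```

Here the residue keeps on average 1.5 of its 2 native contacts, so the calculated ϕ-value is 0.75.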

Generating conformations with various distance constraints using SMC

In general, one can aim to obtain conformations that are at the global minimum of an error function measuring deviation in distance from the lower bounds and upper bounds of the distance constraints,

E (x_{n}) = \sum_{(i, j) ∊ D} [\max^{2} \{l_{i j} - ‖ x_{i} - x_{j} ‖, 0\} + \max^{2} \{‖ x_{i} - x_{j} ‖ - u_{i j}, 0\}],

(2)

in which the distance constraints are incomplete and inaccurate.⁹^,¹² Our goal is to generate a set of conformations satisfying all distance constraints and following a certain target distribution π(x_n), for example, the uniform distribution of all feasible conformations satisfying the distance constraints, or the Boltzmann distribution associated with an energy function. If the true energy function were known, it could be used to estimate the thermodynamic properties of the ensemble of protein conformations following the Boltzmann distribution. In reality, one can approximate the unknown true energy function with various empirically derived energy functions, such as the Miyazawa–Jernigan potential function,³⁶ the geometric potential,³⁷^,³⁸ and many other potential functions as reviewed in Ref. 39.
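The error function of Eq. 2 is straightforward to evaluate for a given conformation. The following minimal sketch assumes the constraint set D is stored as a dictionary from residue pairs to (lower, upper) bounds, with 0 and infinity standing for a missing bound as in the text. The data layout, names, and coordinates are illustrative assumptions.

```python
import numpy as np

def constraint_error(coords, constraints):
    """Sum of squared violations of lower and upper distance bounds (Eq. 2).

    constraints: dict mapping (i, j) -> (l_ij, u_ij); hypothetical layout.
    """
    total = 0.0
    for (i, j), (lower, upper) in constraints.items():
        d = np.linalg.norm(coords[i] - coords[j])
        total += max(lower - d, 0.0) ** 2 + max(d - upper, 0.0) ** 2
    return total

# Hypothetical three-residue chain laid out along a line, 3.8 angstroms apart.
coords = np.array([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [7.6, 0.0, 0.0]])
constraints = {
    (0, 1): (3.5, 4.1),      # satisfied
    (0, 2): (0.0, 6.0),      # upper bound violated by 1.6
    (1, 2): (5.0, np.inf),   # lower bound violated by 1.2
}
error = constraint_error(coords, constraints)
```

In this toy case the two violations contribute 1.6² + 1.2² = 4.0, and a conformation that satisfies every constraint gives an error of zero.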

Since our goal here is to minimize the error function E(x_n), instead of approximating the true energy function, we can set the energy function to be proportional to the error function [Eq. 2]. More details of the target distributions we use are described in Sec. 3. Let x_t=(x₁,…,x_t) be the vector for the positions of residues from 1 up to t. We recursively place residue t at position x_t following a trial distribution g_t(x_t∣x_t−1). The trial distribution proposes possible positions with different probabilities for residue t to be placed under the condition that positions x₁,…,x_t−1 for residues 1 to t−1 are given. The joint trial distribution for a chain with t residues at positions x₁,…,x_t is given by

g_{t} (x_{t}) = g_{1} (x_{1}) g_{2} (x_{2} ∣ x_{1}) \dots g_{t} (x_{t} ∣ x_{t - 1}) .

Following the principle of importance sampling,⁴⁰^,⁴¹^,⁴² the design of the trial distribution can accommodate different types of bias, which allows great flexibility for improving sampling efficiency. However, each final sample of full length chain x_n needs to be weighted to remove the bias so the original target distribution π(x_n) can be recovered. Specifically, we assign a weight

w (x_{n}^{(j)}) = π (x_{n}^{(j)}) ∕ g_{n} (x_{n}^{(j)})

to each conformation sample $x_{n}^{(j)}$ , j=1,…,m, where g_n(x_n) is the trial distribution of the full chain. Then the expected value of a physical property represented by a function h(x_n) of conformation x_n following the target distribution π(x_n) can be estimated by

E_{π} (h (x_{n})) ≃ \frac{\sum_{j = 1}^{m} w (x_{n}^{(j)}) h (x_{n}^{(j)})}{\sum_{j = 1}^{m} w (x_{n}^{(j)})} .
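The weighted average above can be written in a few lines. The sketch below assumes the target and trial densities are available as logarithms, possibly unnormalized, for each sample, and it rescales the weights before exponentiating for numerical stability. All names and numbers are hypothetical.

```python
import numpy as np

def weighted_estimate(values, log_target, log_trial):
    """Self-normalized importance sampling estimate of E_pi[h]."""
    log_w = np.asarray(log_target, dtype=float) - np.asarray(log_trial, dtype=float)
    log_w = log_w - log_w.max()          # common factor cancels in the ratio
    w = np.exp(log_w)
    return float(np.sum(w * np.asarray(values, dtype=float)) / np.sum(w))

# Hypothetical example with three samples.
h_values = [1.0, 2.0, 3.0]
target = np.array([0.2, 0.3, 0.5])       # pi(x) for each sample
trial = np.array([0.5, 0.25, 0.25])      # g(x) for each sample
estimate = weighted_estimate(h_values, np.log(target), np.log(trial))
```

The weights here are 0.4, 1.2, and 2.0, so the estimate equals 8.8/3.6. Because the estimator is a ratio, π only needs to be known up to a constant factor.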

We adopt the framework developed in Ref. 43 to generate sample conformations. This framework minimizes the loss introduced in the resampling step when choosing a number of distinct samples from a larger sample set, which helps to maintain the diversity of the samples. Let m_t be the number of samples we retain in the tth iteration and m_max be the maximum value of m_t. The algorithm for generating samples is described in Algorithm 1.

Algorithm 1 Generating conformation


Set m₁=1, $w_{1}^{(1)} = 1.0$ and place the first residue at fixed $x_{1}^{(1)}$ .
for t=2 to n do
    L_t=0;
    {L_t: number of length t chains that can be obtained from samples obtained at step t−1.}
    for sample j=1:m_{t−1} do
        Find all of the valid sites $x_{t}^{(i, j)}$ , $i = 1, \dots, l_{t}^{(j)}$ , for placing x_t next to partial chain $x_{t - 1}^{(j)}$ .
        { $l_{t}^{(j)}$ = number of available sites to place x_t next to partial chain $x_{t - 1}^{(j)}$ .}
        Generate $l_{t}^{(j)}$ chains of length t, ${\tilde{x}}_{t}^{(L_{t} + i)} = (x_{t - 1}^{(j)}, x_{t}^{(i, j)})$ , with
        ${\tilde{w}}_{t}^{(L_{t} + i)} = w_{t - 1}^{(j)}$ . {Temporary weights for uniform distribution.}
        $L_{t} = L_{t} + l_{t}^{(j)}$ .
    end for
    if L_t≤m_max then
        Let m_t=L_t and $\{(x_{t}^{(j)}, w_{t}^{(j)})\}_{j = 1}^{m_{t}} = \{({\tilde{x}}_{t}^{(l)}, {\tilde{w}}_{t}^{(l)})\}_{l = 1}^{L_{t}}$ .
    else
        Let m_t=m_max.
        for l=1 to L_t do
            Assign a priority score $β_{t}^{(l)}$ for chain ${\tilde{x}}_{t}^{(l)}$ according to the constraints.
        end for
        Find constant c such that $\sum_{l = 1}^{L_{t}} \min \{c β_{t}^{(l)}, 1\} = m_{max}$ . {e.g., by binary search.}
        Draw r from uniform distribution U[0,1).
        for j=1:m_max do
            Let r_j=j−r.
            Find integer J_j such that $\sum_{l = 1}^{J_{j} - 1} \min \{c β_{t}^{(l)}, 1\} < r_{j} \leq \sum_{l = 1}^{J_{j}} \min \{c β_{t}^{(l)}, 1\}$ .
            Select sample $x_{t}^{(j)} = {\tilde{x}}_{t}^{(J_{j})}$ .
            Set weight $w_{t}^{(j)} = {\tilde{w}}_{t}^{(J_{j})} ∕ \min \{c β_{t}^{(J_{j})}, 1\}$ .
        end for
    end if
end for
for j=1:m_n do
    Calculate importance weight $w (x_{n}^{(j)}) \propto w_{n}^{(j)} π (x_{n}^{(j)})$ .
end for

The key step in this algorithm is to construct high quality priority scores $β_{t}^{(l)}$ , which play the role of the trial distribution g_t(x_t∣x_t−1) and guide the growth of the partial chains toward more profitable regions.
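The resampling branch of Algorithm 1 (the case L_t>m_max) can be sketched as follows. The code finds the constant c by binary search, selects m_max chains with the offsets r_j=j−r, and divides each retained weight by min{cβ,1}. It is a minimal illustration that assumes more than m_max chains have positive priority scores. The score values and function names are hypothetical.

```python
import numpy as np

def find_scale(beta, m_max, iters=100):
    """Binary search for c such that sum(min(c * beta, 1)) equals m_max."""
    lo, hi = 0.0, 1.0 / beta[beta > 0].min()   # at hi, all positive scores are capped
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if np.minimum(mid * beta, 1.0).sum() < m_max:
            lo = mid
        else:
            hi = mid
    return hi

def resample(beta, weights, m_max, rng):
    """Select m_max chains and reweight them as in Algorithm 1."""
    c = find_scale(beta, m_max)
    keep_prob = np.minimum(c * beta, 1.0)
    cum = np.cumsum(keep_prob)
    r = rng.uniform(0.0, 1.0)
    targets = np.arange(1, m_max + 1) - r      # r_j = j - r
    # First index J with cum[J] >= r_j, i.e. cum[J-1] < r_j <= cum[J].
    picks = np.searchsorted(cum, targets, side="left")
    picks = np.minimum(picks, len(beta) - 1)   # guard against round-off at the end
    return picks, weights[picks] / keep_prob[picks]

rng = np.random.default_rng(0)
beta = np.array([0.9, 0.5, 0.1, 0.05, 0.8, 0.3])   # hypothetical priority scores
weights = np.ones_like(beta)
picks, new_weights = resample(beta, weights, m_max=3, rng=rng)
```

Because every selection probability is at most 1 and consecutive r_j differ by exactly 1, no chain is selected twice, which is how this scheme preserves sample diversity.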

In this algorithm, growth need not start from the first residue x₁. In fact, growth can start from any place, as long as the newly placed monomer is connected to the existing partial chain. For example, growth can start in one direction from a position in the middle of the primary sequence of the chain. After it reaches the end of the chain, the growth process can go back to the starting residue and continue to grow in the other direction of the primary sequence. That is, the order of placing residues can be (x_k,x_k−1,…,x₁,x_k+1,…,x_n) or (x_k,x_k+1,…,x_n,x_k−1,…,x₁) for any residue k located in the middle of the chain. The steps of adding a new residue to an existing partial chain are the same as above. In this study, we choose the order of placing residues so that the fragment of the first 20 residues to be placed has the largest number of distance constraints.

Priority score

The choice of a good priority score β_t used in Algorithm 1 is very important. A carefully designed β_t can successfully guide the growth of the conformation so that the full chain will eventually obey all the distance constraints, hence increasing the sampling acceptance rate. A difficulty in the growth-based method is that when adding the current residue, the distance information of future residues cannot be directly used. To solve this problem, the priority score we develop consists of three components: growth potential from upper bounds of the distance constraints, growth potential from reference points, and growth potential from lower bounds of the distance constraints. The first two components of the priority score incorporate the distance information of future residues.

Growth potential from upper bounds of the distance constraints

Given the upper bounds for the distances between residue pairs in a subset D of all residue pairs, we first develop distance upper bounds λ_ij between all residue pairs (i,j), i,j=1,…,n.

Let q(k) be an upper bound of distances between two residues that are k residues away in the protein primary sequence. For constructing the upper bounds q(k) for small sequence separation k, we enumerate self-avoiding chains on the discrete lattice model using the protein sequence of interest. We have

q (k) = max_{i} max_{x (i, k)} (‖ x_{i + k} - x_{i} ‖),

where x(i,k) is a self-avoiding chain of length k starting at residue i. In this study, we enumerate fragments of chains for k=1,…,5 at different starting positions i, and take the largest as q(k). When sequence separation k is large, enumeration is infeasible. We approximate q(k) by k₁q(5)+q(k₂) if k=5k₁+k₂, where k₁∊Z, k₂=0,…,4. This is an upper bound as it assumes the chain is attached at some residues without angle constraint.

Consider a complete graph G with n vertices, each vertex represents a residue. The length of edge between any two vertices i and j is set to

e_{i j} = \begin{cases} \min \{u_{i j}, q (∣ i - j ∣)\}, & \text{if } (i, j) ∊ D \\ q (∣ i - j ∣), & \text{otherwise} . \end{cases}

We can use the Floyd algorithm⁴⁴ to identify the shortest path p_ij between any two vertices i and j in this complete graph G. The distance upper bound λ_ij between residues i and j is then set to the total length of the shortest path p_ij.
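The construction of the global upper bounds λ_ij can be sketched with a vectorized Floyd-Warshall relaxation. The values of q(1),…,q(5) below are placeholders chosen for illustration, not values from the enumeration described above, and the dictionary layout for the upper bounds u_ij is likewise an assumption.

```python
import numpy as np

def q_bound(k, q_small):
    """q(k) for any separation k, using k = 5*k1 + k2 and q_small[0] = 0."""
    k1, k2 = divmod(k, 5)
    return k1 * q_small[5] + q_small[k2]

def global_upper_bounds(n, upper, q_small):
    """Shortest-path upper bounds lambda_ij on the complete graph G.

    upper: dict mapping (i, j) with i < j to u_ij for pairs in D.
    """
    lam = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                e = q_bound(abs(i - j), q_small)
                u = upper.get((min(i, j), max(i, j)))
                lam[i, j] = e if u is None else min(u, e)
    for k in range(n):   # Floyd-Warshall: allow paths passing through vertex k
        lam = np.minimum(lam, lam[:, k:k + 1] + lam[k:k + 1, :])
    return lam

# Placeholder q(0..5) in angstroms and a single constraint u_05 = 5.
q_small = [0.0, 3.8, 7.0, 10.0, 13.0, 16.0]
lam = global_upper_bounds(6, {(0, 5): 5.0}, q_small)
```

In this toy case the single constraint between residues 0 and 5 tightens the bound for the pair (0, 4) from q(4)=13.0 to 5.0+3.8=8.8, which shows how one known distance propagates to pairs outside D.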

After obtaining the distance upper bound λ_ij and the corresponding path p_ij, we construct the potential function that contributes to the priority score as

f_{1} (x_{t}) = \sum_{i < j, (i, j) ∊ P} h_{1} (‖ x_{i} - x_{j} ‖, λ_{i j}),

(3)

where P is a set of (i,j) pairs such that on the shortest path p_ij between i and j, the two ends x_i and x_j are in the partial chain x_t, but none of the residues between i and j are in x_t. This is to avoid double counting of the distance constraints. The function h₁ is a loss function to measure the violation of constraint ‖x_i−x_j‖≤λ_ij. Usually, h₁(‖x_i−x_j‖,λ_ij) is set to zero when ‖x_i−x_j‖≤λ_ij, and monotonically nondecreasing as ‖x_i−x_j‖−λ_ij increases. Different types of h₁(⋅) can be chosen for different considerations, which we will discuss in detail in later sections.

Growth potential from reference points

Given a partial chain x_t, if the position of a future residue j (x_j∉x_t) is strongly constrained, e.g., there are more than four residues in the existing chain x_t having distance constraints related to residue j, then residue j can only be placed in a small spatial region. We generate a candidate position for x_j on lattice sites within this small space. More specifically, if a future residue j has distance constraints,

l_{i_{k} j} \leq ‖ x_{i_{k}} - x_{j} ‖ \leq u_{i_{k} j}, k = 1, \dots, K,

where x_{i_k}, k=1,…,K, are in the existing chain x_t, and K≥5, we use Newton’s climbing method⁴⁵ to find a position z in R³ such that

z = arg min_{x} F (x) = arg min_{x} \sum_{k = 1}^{K} {(‖ x_{i_{k}} - x ‖ - u_{i_{k} j})}^{2},

in which z is obtained by iteratively performing z=z−(F^″(z))⁻¹F^′(z). We then search the sites on the cubic lattice around position z and choose the site x that minimizes

\sum_{k = 1}^{K} [\max^{2} \{l_{i_{k} j} - ‖ x_{i_{k}} - x ‖, 0\} + \max^{2} \{‖ x_{i_{k}} - x ‖ - u_{i_{k} j}, 0\}]

as the candidate position for residue j. Denoting the candidate position by $x_{j}^{*}$ , we use it as a reference point to guide the growth of the chain. The following potential function is used to encode this:

f_{2} (x_{t}) = \sum_{(i, j) ∊ P^{'}} h_{2} (‖ x_{i} - x_{j}^{*} ‖, λ_{i j}),

(4)

where P^′ is a set of (i,j) pairs such that on the shortest path p_ij between i and j, x_i is in the partial chain x_t constructed so far, $x_{j}^{*}$ is the reference point, and none of the residues between i and j are in x_t. As before, h₂ is the loss function to measure the violation of constraint $‖ x_{i} - x_{j}^{*} ‖ \leq λ_{i j}$ .
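The Newton update z=z−(F″(z))⁻¹F′(z) used to locate the off-lattice point z can be sketched as below for F(x)=Σ_k(‖x_{i_k}−x‖−u_{i_k j})². The gradient and Hessian are written out analytically. The anchor coordinates, starting point, and function name are hypothetical, and the subsequent search over nearby lattice sites is not included.

```python
import numpy as np

def newton_reference_point(anchors, targets, start, iters=20):
    """Minimize F(x) = sum_k (||a_k - x|| - u_k)^2 by Newton iteration."""
    z = np.asarray(start, dtype=float).copy()
    eye = np.eye(3)
    for _ in range(iters):
        diff = z - anchors                        # shape (K, 3)
        r = np.linalg.norm(diff, axis=1)          # current distances to anchors
        unit = diff / r[:, None]
        resid = r - targets
        grad = 2.0 * (resid[:, None] * unit).sum(axis=0)
        outer = unit[:, :, None] * unit[:, None, :]
        curv = (resid / r)[:, None, None] * (eye - outer)
        hess = 2.0 * (outer + curv).sum(axis=0)
        step = np.linalg.solve(hess, grad)
        z -= step
        if np.linalg.norm(step) < 1e-12:
            break
    return z

# Hypothetical anchors (K = 5 placed residues) and exact target distances.
anchors = np.array([[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [0.0, 5.0, 0.0],
                    [0.0, 0.0, 5.0], [5.0, 5.0, 5.0]])
true_point = np.array([2.0, 2.5, 3.0])
targets = np.linalg.norm(anchors - true_point, axis=1)
z = newton_reference_point(anchors, targets,
                           start=true_point + np.array([0.3, -0.2, 0.1]))
```

Newton iteration converges quickly only from a reasonable starting point, so in practice the initial guess matters. The example starts close to the solution for that reason.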

Growth potential from lower bounds of the distance constraints

This potential function penalizes the violation of lower bound constraint,

f_{3} (x_{t}) = \sum_{(i, j) ∊ S \cap D} h_{3} (‖ x_{i} - x_{j} ‖, l_{i j}),

(5)

where S is the set of residue pair (i,j) in which x_i and x_j exist in the partial chain x_t. Here h₃ is the loss function to measure the violation of constraint ‖x_i−x_j‖≥l_ij. Hence, h₃(‖x_i−x_j‖,l_ij)=0, when ‖x_i−x_j‖≥l_ij, and is monotonically nondecreasing as ‖x_i−x_j‖−l_ij decreases.

Combined priority score

The combined priority score $β_{t}^{(l)}$ for chain ${\tilde{x}}_{t}^{(l)}$ is set as

β_{t}^{(l)} = exp [- \frac{ρ_{1} f_{1} ({\tilde{x}}_{t}^{(l)}) + ρ_{2} f_{2} ({\tilde{x}}_{t}^{(l)}) + ρ_{3} f_{3} ({\tilde{x}}_{t}^{(l)})}{τ_{t}}],

(6)

where ρ₁,ρ₂, and ρ₃ are coefficients of the three growth potential functions, τ_t is a temperaturelike variable. The choice of loss functions h₁, h₂, and h₃ in f₁, f₂, and f₃, and coefficients ρ₁,ρ₂,ρ₃,τ_t will be described in later sections.

Generating conformations from incomplete residue distance constraints

In this section, we discuss how to use Algorithm 1 to generate protein conformations with given constraints in the form of small intervals of distances between a subset of residue pairs. The distance constraints are represented as¹²^,¹³

d_{i j} - ϵ_{i j} \leq ‖ x_{i} - x_{j} ‖ \leq d_{i j} + ϵ_{i j} for all (i, j) ∊ D,

where d_ij is the distance of residues i and j in the native structure. The set D is assigned as follows: each non-neighboring residue pair within short range distance (SRD) in the native structure is independently selected into D with a certain probability, e.g., (20%, 40%,…, 100%). The SRD is set to 10 Å for residue level structure, following Ref. 13. All residue pairs with distance d_ij>10 Å are excluded from D. The variations ϵ_ij of the bounds are drawn independently from the uniform distribution U[0,1), so that the distance variation is under 1 Å, about 10% of the true distance d_ij as in Ref. 13.

In this problem, the priority score in Algorithm 1 is given by Eq. 6 with

h_{1} (z, λ) = h_{2} (z, λ) = \begin{cases} {(z - λ)}^{2}, & \text{if } z > λ \\ 0, & \text{if } z \leq λ, \end{cases}

and

h_{3} (z, l) = \begin{cases} {(z - l)}^{2}, & \text{if } z < l \\ 0, & \text{if } z \geq l, \end{cases}

for f₁, f₂, and f₃, and parameters are set as ρ₁=1, ρ₂=1, ρ₃=1, and τ_t=0.5. Here z is the value of the corresponding distance.

The loss functions are chosen so that the distance between any two residues i and j in the conformational sample does not deviate too much from the given constrained interval [l_ij,u_ij], in case not all constraints can be perfectly satisfied simultaneously. The loss functions h₁, h₂, and h₃ are convex functions of the distance ‖x_i−x_j‖, which increase rapidly as ‖x_i−x_j‖ departs from the constrained interval [l_ij,u_ij].
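With the quadratic losses and the parameter values quoted above (ρ₁=ρ₂=ρ₃=1, τ_t=0.5), the priority score of Eq. 6 reduces to a short calculation. The sketch below assumes the three potentials f₁, f₂, and f₃ have already been summed over their respective residue pairs. The function names and sample distances are hypothetical.

```python
import math

def h_upper(z, lam):
    """Quadratic loss for violating an upper bound z <= lam (h1 and h2)."""
    return (z - lam) ** 2 if z > lam else 0.0

def h_lower(z, low):
    """Quadratic loss for violating a lower bound z >= low (h3)."""
    return (z - low) ** 2 if z < low else 0.0

def priority_score(f1, f2, f3, rho=(1.0, 1.0, 1.0), tau=0.5):
    """Combined priority score beta = exp(-(rho1*f1 + rho2*f2 + rho3*f3) / tau)."""
    return math.exp(-(rho[0] * f1 + rho[1] * f2 + rho[2] * f3) / tau)

# Hypothetical partial chain: one upper bound exceeded by 1.0 angstrom,
# no reference point violation, one lower bound missed by 0.5 angstrom.
f1 = h_upper(9.0, 8.0)
f2 = 0.0
f3 = h_lower(3.0, 3.5)
beta = priority_score(f1, f2, f3)
```

A chain with no violations receives the maximal score of 1, and the score decays exponentially with the total squared violation, here exp(−1.25/0.5).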

Generating conformations from contact map of distance cutoff

We now describe how to generate conformations based on a given incomplete contact map, where distances between some residue pairs are known to be either above or below a cutoff value in our calculation. We use 8.5 Å as the cutoff value. This value has been used by Vendruscolo et al. in Ref. 16.

The contact map of a length n polymer chain is an n×n symmetric matrix C={c_ij}_n×n, where c_ij=1 if residues i and j are in contact, and c_ij=0 otherwise. A given contact map is equivalent to a set of distance constraints,

‖ x_{i} - x_{j} ‖ \leq 8.5 Å for all c_{i j} = 1,

‖ x_{i} - x_{j} ‖ > 8.5 Å for all c_{i j} = 0.
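The correspondence between coordinates and a contact map at the 8.5 Å cutoff can be sketched as follows. Setting the diagonal to zero is a convention assumed here, since the text does not specify the self entries. The coordinates and function name are hypothetical.

```python
import numpy as np

def contact_map(coords, cutoff=8.5):
    """Binary contact map: c_ij = 1 when residues i and j are within `cutoff`."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    cmap = (dist <= cutoff).astype(int)
    np.fill_diagonal(cmap, 0)   # assumed convention: no self contacts
    return cmap

# Hypothetical extended chain with residues spaced 3.8 angstroms apart on a line.
coords = np.array([[3.8 * k, 0.0, 0.0] for k in range(4)])
cmap = contact_map(coords)
```

For this extended chain, residues two positions apart (7.6 Å) are in contact while residues three positions apart (11.4 Å) are not.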

For this problem, we use

h_{1} (z, λ) = h_{2} (z, λ) = I (z - λ > 0)

and

h_{3} (z, l) = I (z - l < 0),

in Eqs. 3, 4, 5, respectively, to construct the priority score in Eq. 6. Here I(⋅) is the indicator function: I(⋅)=1 if the statement represented by (⋅) is true, 0 otherwise. Parameters in Eq. 6 are taken as ρ₁=1, ρ₂=1, ρ₃=0.8, and τ_t=0.2.

The loss functions are chosen in order to keep the contact map of the generated conformational samples as close to the given target contact map as possible if not completely satisfied. In particular, if the distance ‖x_i−x_j‖ violates the distance constraint, the corresponding loss function increases from 0 to 1 instantly.

Generating contact maps and conformations from indirect distance constraints by ϕ-values

In this section, we describe how to obtain contact maps based on indirect distance constraints in the form of experimentally measured ϕ-values.

Generating contact maps from indirect distance constraints

ϕ-values of TSE and contact maps. For generating conformations of the TSE, our target distribution π(x_n) is the uniform distribution of all conformations x_n satisfying the ϕ-value constraints,

ϕ_{i}^{calc} (x_{n}) = \frac{N_{i}^{TSE}}{N_{i}^{N}} \approx ϕ_{i}^{exp}, i ∊ I = {I_{1}, \dots, I_{T}},

where I represents the set of residues whose ϕ-values have been measured experimentally, and I₁,…,I_T are the indexes of these residues. By definition, $ϕ_{i}^{calc} (x_{n})$ can be computed when the conformation x_n of the full chain is known. When only information of partial chains x_t−1 is available during chain growth, it is difficult to construct an effective conditional trial distribution g_t(x_t∣x_t−1).

Our approach is to translate the ϕ-value constraints into contact maps of equivalent distance constraints. These contact maps provide more direct information on distance constraints for generating conformations. We then sample conformations following the contact map constraints, which will automatically satisfy all ϕ-value constraints. We describe briefly how to generate conformations from these ϕ-value derived contact maps in Sec. 2F2.

From ϕ-values to contact maps. Because of the intrinsic symmetry of the contact map C={c_ij}_n×n, we consider c_ij and c_ji as the same entry in C. Let N be the set of residue pairs (i,j) forming native contacts. By definition, the calculated ϕ-value $ϕ_{i}^{calc}$ for residue i of a conformation only depends on the values of c_ij in its contact map that are native contacts formed by residue i. Let

C_{i} = {c_{i j} ∣ (i, j) ∊ N} .

The size of this set, ∣C_i∣, is the number of contacts formed by residue i in native structure. Note that if (i,j)∊N, both C_i and C_j contain c_ij.

To generate contact map C satisfying the ϕ-value constraints, we only need to decide which subset of native contacts to preserve for each residue i whose experimental ϕ-value is available. To satisfy the ϕ-value constraint, $∣ C_{i} ∣ \cdot ϕ_{i}^{exp}$ native contacts need to be preserved for residue i in the contact map. That is, we need to assign either 0 or 1 to the elements in C_i, i∊I, such that there are exactly $∣ C_{i} ∣ \cdot ϕ_{i}^{exp}$ “1”s in C_i; equivalently, for each generated contact map we should have $\sum_{c_{i j} ∊ C_{i}} c_{i j} = ∣ C_{i} ∣ \cdot ϕ_{i}^{exp}$ , i∊I. For simplicity, we denote $Ψ_{i} = ∣ C_{i} ∣ \cdot ϕ_{i}^{exp}$ in the subsequent discussion.

Now we generate contact map samples properly weighted with respect to the uniform distribution of all contact maps that physically satisfy the ϕ-value constraints. Each contact map sample C generated by importance sampling via the use of a trial distribution g(C) is weighted by v=1∕g(C). Here g(C) is the probability to generate contact map C.

A similar problem has been studied in Ref. 46 for generating 0–1 tables with fixed marginal sums. Although in our problem, the contact map has to be symmetric and only part of it needs to be filled, some techniques in Ref. 46 can be used to improve the sampling efficiency.

Specifically, we proceed by assigning the proper numbers of “1”s and “0”s in the rows for residues with experimental ϕ-value measurement. That is, we fill 0’s and 1’s in C_i, i∊I, and repeat this position after position, until the rows corresponding to all residues with experimental ϕ-value measurement are assigned. Let m^* denote the number of contact map samples we will generate, $C_{I_{1} : I_{t}}^{(k)}$ denote the partially filled kth contact map we have obtained thus far after finishing positions I₁ to I_t, and $v_{t}^{(k)}$ be the weight of the kth contact map that has been partially filled up to position I_t. The algorithm for generating contact maps from ϕ-value measurement is listed as Algorithm 2.

Algorithm 2 Generating contact map


for k=1 to m^* do
$C_{I_{1} : I_{0}}^{(k)} = \overset{/}{0}, v_{0}^{(k)} = 1$
end for
for position index t=1 to T do
for sample k=1 to m^* do
for s=t to T do
Divide C_{I_s} into disjoint sets $S_{0, I_{s}}^{(k)}$ , $S_{1, I_{s}}^{(k)}$ , and $S_{u, I_{s}}^{(k)}$ based on partial contact map $C_{I_{1} : I_{t - 1}}^{(k)}$ , where $S_{1, I_{s}}^{(k)} = {C_{I_{s}, j} ∣ already filled with 1}$ , $S_{0, I_{s}}^{(k)} = {C_{I_{s}, j} ∣ already filled with 0}$ , and $S_{u, I_{s}}^{(k)} = {C_{I_{s}, j} ∣ unspecified}$ .
end for
repeat
for s=t to T do
if $∣ S_{1, I_{s}}^{(k)} ∣ > Ψ_{I_{s}}$ then
Remove this sample. {Already too many “1”s.}
else if $∣ S_{1, I_{s}}^{(k)} ∣ = Ψ_{I_{s}}$ then
Fill all elements in $S_{u, I_{s}}^{(k)}$ with 0.
Update $S_{0, I_{j}}^{(k)}$ , $S_{1, I_{j}}^{(k)}$ , $S_{u, I_{j}}^{(k)}$ , j∊{t,⋯,T}.
end if
if $∣ S_{1, I_{s}}^{(k)} ∣ + ∣ S_{u, I_{s}}^{(k)} ∣ < Ψ_{I_{s}}$ . then
Remove this sample. {Already too many “0”s.}
else if $∣ S_{1, I_{s}}^{(k)} ∣ + ∣ S_{u, I_{s}}^{(k)} ∣ = Ψ_{I_{s}}$ then
Fill all elements in $S_{u, I_{s}}^{(k)}$ with 1.
Update $S_{0, I_{j}}^{(k)}$ , $S_{1, I_{j}}^{(k)}$ , $S_{u, I_{j}}^{(k)}$ , j∊{t,⋯,T}.
end if
end for
until $S_{u, I_{t}}^{(k)} = \emptyset$ , or none of $S_{0, I_{s}}^{(k)}$ , $S_{1, I_{s}}^{(k)}$ , $S_{u, I_{s}}^{(k)}$ , s∊{t,⋯,T} changes.
{This step must converge because the number of unspecified positions decreases monotonically as the iteration proceeds.}
if $S_{u, I_{t}}^{(k)} = \emptyset$ then
$C_{I_{t}}^{(k)}$ is completed and let weight $v_{t}^{(k)} = v_{t - 1}^{(k)}$ .
else
Fill $S_{u, I_{t}}^{(k)}$ with $Ψ_{I_{t}} - ∣ S_{1, I_{t}}^{(k)} ∣$ “1”s following CP-distribution.
{When there are unspecified entries in this row.}
Update weight $v_{t}^{(k)}$ by Eq. 8.
end if
end for
Optionally resample³⁸ ${(C_{I_{1} : I_{t}}^{(k)}, v_{t}^{(k)})}_{k = 1}^{m^{*}}$ if many samples were removed.
end for
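
The forced-fill loop of Algorithm 2 (the repeat block) can be sketched in Python as follows. This is a simplified illustration under stated assumptions rather than the authors' implementation: the contact map is a symmetric list of lists with entries 0, 1, or `None` (unspecified), the diagonal is assumed to be prefilled with 0, all constrained rows are scanned rather than only rows I_t to I_T, and removal of a sample is signaled by returning `False`. The name `propagate` and the dictionary `psi` (row index to required number of 1's) are hypothetical.

```python
def propagate(cmap, psi):
    """Fill all entries that are forced by the row counts in psi.

    cmap : symmetric list of lists with entries 0, 1, or None (unspecified)
    psi  : dict mapping a constrained row index to its required number of 1's
    Returns False if some row can no longer reach its count (sample removed),
    otherwise True after filling forced entries in place.
    """
    changed = True
    while changed:
        changed = False
        for i, need in psi.items():
            ones = sum(1 for v in cmap[i] if v == 1)
            free = [j for j, v in enumerate(cmap[i]) if v is None]
            # Already too many 1's, or too many 0's to ever reach the count.
            if ones > need or ones + len(free) < need:
                return False
            if not free:
                continue
            if ones == need:
                fill = 0          # remaining entries must all be 0
            elif ones + len(free) == need:
                fill = 1          # remaining entries must all be 1
            else:
                continue          # row is not yet forced
            for j in free:
                cmap[i][j] = fill
                cmap[j][i] = fill  # keep the map symmetric
            changed = True
    return True


example = [[0, None, None],
           [None, 0, None],
           [None, None, 0]]
ok = propagate(example, {0: 2, 1: 1})
```

The loop terminates because every pass that sets `changed` strictly reduces the number of unspecified entries, which mirrors the convergence remark in Algorithm 2.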


Conditional Poisson (CP) distribution. The details of the CP distribution can be found in Ref. 47. Briefly, we sample 0’s and 1’s to fill the entries $s_{1}, \dots, s_{∣ S_{u, I_{t}}^{(k)} ∣}$ of $S_{u, I_{t}}^{(k)}$ described in Algorithm 2 with probability proportional to

g (s_{1}, \dots, s_{∣ S_{u, I_{t}}^{(k)} ∣}) \propto \prod_{j = 1}^{∣ S_{u, I_{t}}^{(k)} ∣} p_{j}^{s_{j}} {(1 - p_{j})}^{1 - s_{j}},

subject to the condition that the total number of assigned 1’s is $\sum_{j = 1}^{∣ S_{u, I_{t}}^{(k)} ∣} s_{j} = Ψ_{I_{t}} - ∣ S_{1, I_{t}}^{(k)} ∣$ . Here p_j∊[0,1] are parameters chosen to improve the sample survival probability of this distribution.

Parameters for conditional Poisson (CP) distribution. For each entry $s_{j} ∊ S_{u, I_{t}}^{(k)}$ to be filled (whose corresponding entry in the contact map is c_{I_t,J_j}), we assign the parameter p_j for the CP distribution as

p_{j} = \begin{cases} \frac{Ψ_{J_{j}} - ∣ S_{1, J_{j}}^{(k)} ∣}{∣ S_{u, J_{j}}^{(k)} ∣}, & \text{if } c_{I_{t}, J_{j}} ∊ S_{u, I_{t}}^{(k)} \cap A, \\ \max \left\{ \frac{Ψ_{I_{t}} - ∣ S_{1, I_{t}}^{(k)} ∣ - \sum_{c_{I_{t}, i} ∊ S_{u, I_{t}}^{(k)} \cap A} p_{i}}{∣ S_{u, I_{t}}^{(k)} \setminus A ∣}, 0.1 \right\}, & \text{if } c_{I_{t}, J_{j}} ∊ S_{u, I_{t}}^{(k)} \setminus A, \end{cases}

where $A = \{c_{I_{t}, I_{t + 1}}, c_{I_{t}, I_{t + 2}}, \dots, c_{I_{t}, I_{T}}\}$ is the set of entries recording the existence of contacts between residue I_t and the future residues with experimental ϕ-values.

If J_j is a position with ϕ-measurement but currently unspecified, we assign p_j as the ratio of the number of 1’s to be assigned $Ψ_{J_{j}} - ∣ S_{1, J_{j}}^{(k)} ∣$ and the number of unspecified positions $∣ S_{u, J_{j}}^{(k)} ∣$ for residue J_j.

If J_j is a position currently unspecified but without a known ϕ-measurement, we assign p_j as the larger of 0.1 and the following ratio: the number of 1’s still to be assigned, $Ψ_{I_{t}} - ∣ S_{1, I_{t}}^{(k)} ∣$ , minus the expected number $\sum_{c_{I_{t}, i} ∊ S_{u, I_{t}}^{(k)} \cap A} p_{i}$ of 1’s that will be assigned to future positions with ϕ-values, divided by the number $∣ S_{u, I_{t}}^{(k)} \setminus A ∣$ of unspecified positions without known ϕ-values. This choice of p_j is expected to fill $Ψ_{J_{j}} - ∣ S_{1, J_{j}}^{(k)} ∣$ 1’s in the $∣ S_{u, j}^{(k)} ∣$ unspecified entries for $j ∊ \{I_{t}, I_{t + 1}, \dots, I_{T}\}$ . Note that in this assignment, p_i is guaranteed to have a value between 0 and 1.
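
As a minimal sketch of the two-case assignment of p_j described above, the Python function below computes the parameters for one row. The function name `cp_parameters` and its argument layout are hypothetical: `free_cols` lists the unspecified columns of the current row I_t, `need` is the number of 1's still to be placed in that row, and `future_rows` maps each column that is a future position with a ϕ-value to the pair (number of 1's that row still needs, number of its unspecified entries).

```python
def cp_parameters(free_cols, need, future_rows):
    """Assign p_j to each unspecified entry of the current row.

    free_cols   : list of unspecified column indices J_j in the current row
    need        : number of 1's still to be assigned in the current row
    future_rows : dict column -> (remaining_ones, n_unspecified) for columns
                  that are future positions with a phi-value (the set A)
    """
    p = {}
    # Case 1: the column is a future constrained position; use its own ratio.
    # n_unspecified >= 1 here because the symmetric entry is still unspecified.
    for j in free_cols:
        if j in future_rows:
            remaining_ones, n_unspecified = future_rows[j]
            p[j] = remaining_ones / n_unspecified
    expected_in_a = sum(p.values())
    # Case 2: spread the remaining expected 1's over the other columns,
    # with a floor of 0.1 as in the text.
    others = [j for j in free_cols if j not in future_rows]
    if others:
        base = max((need - expected_in_a) / len(others), 0.1)
        for j in others:
            p[j] = base
    return p


params = cp_parameters([3, 5, 7, 8], need=2, future_rows={5: (1, 4)})
```

In this example column 5 receives p = 1/4, and the remaining expected 1.75 contacts are spread evenly over the three other columns.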

Realization of CP distribution. The overall idea for sampling from the CP distribution is to take out $Ψ_{I_{t}} - ∣ S_{1, I_{t}}^{(k)} ∣$ elements from the set $S_{u, I_{t}}^{(k)}$ one by one without replacement, following specific selection probabilities. These elements will be assigned as 1’s, while the remaining ones will be 0’s.⁴⁶

Specifically, let a_j=p_j∕(1−p_j). Suppose ${\bar{S}}_{u, I_{t}}^{(k)} (i)$ is the set of remaining elements after taking out i elements $(i = 0, 1, \dots, Ψ_{I_{t}} - ∣ S_{1, I_{t}}^{(k)} ∣ - 1)$ . Each $s_{j} ∊ {\bar{S}}_{u, I_{t}}^{(k)} (i)$ will be selected as the next element to be taken out and assigned the value of 1 with probability

P (s_{j}, {\bar{S}}_{u, I_{t}}^{(k)} (i)) = \frac{a_{j} \cdot R (Ψ_{I_{t}} - ∣ S_{1, I_{t}}^{(k)} ∣ - i - 1, {\bar{S}}_{u, I_{t}}^{(k)} (i) \setminus \{s_{j}\})}{(Ψ_{I_{t}} - ∣ S_{1, I_{t}}^{(k)} ∣ - i) \cdot R (Ψ_{I_{t}} - ∣ S_{1, I_{t}}^{(k)} ∣ - i, {\bar{S}}_{u, I_{t}}^{(k)} (i))},

where R(i,S) is

R (i, S) = \sum_{B \subset S, ∣ B ∣ = i} (\prod_{j ∊ B} a_{j}) .

(7)

That is, R(i,S) is the sum of $\prod_{j ∊ B} a_{j}$ over all subsets B of S of size i.

For an integer i and a subset $S \subset S_{u, I_{t}}^{(k)}$ , R(i,S) can be calculated using the recursive formula

R (i, S) = R (i, S \setminus \{s_{j}\}) + a_{j} R (i - 1, S \setminus \{s_{j}\})

for any s_j∊S. The initial conditions for the recursion are R(0,S)=1 for any $S \subset S_{u, I_{t}}^{(k)}$ and R(i,S)=0 for any ∣S∣<i.
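
The recursion for R(i,S) can be evaluated by adding the elements of S one at a time. The Python sketch below is an illustration under our own naming (`r_values`, `r_brute` are hypothetical): it computes R(0,S),…,R(n,S) from the list of odds a_j with the recursion and the stated initial conditions, and compares the result with a direct sum over all subsets of size i as in Eq. 7.

```python
from itertools import combinations
from math import prod


def r_values(a, n):
    """Return [R(0,S), ..., R(n,S)] for the set S whose odds a_j are listed in a.

    Uses R(i, S) = R(i, S \\ {s_j}) + a_j * R(i - 1, S \\ {s_j}),
    with R(0, S) = 1 and R(i, S) = 0 whenever |S| < i.
    """
    R = [1.0] + [0.0] * n          # values for the empty set
    for aj in a:                   # add one element s_j at a time
        for i in range(n, 0, -1):  # descending so R[i - 1] still excludes s_j
            R[i] += aj * R[i - 1]
    return R


def r_brute(a, i):
    """Direct evaluation of Eq. 7: sum of products over all size-i subsets."""
    return sum(prod(subset) for subset in combinations(a, i))


odds = [0.5, 2.0, 1.0]
table = r_values(odds, 3)
```

The recursive evaluation costs on the order of n∣S∣ operations, whereas the direct sum grows combinatorially with ∣S∣.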

Updating sample weight. The weight associated with a sample of contact map is updated as

v_{t}^{(k)} = v_{t - 1}^{(k)} \cdot \frac{R (Ψ_{I_{t}} - ∣ S_{1, I_{t}}^{(k)} ∣, S_{u, I_{t}}^{(k)})}{\prod_{j = 1}^{∣ S_{u, I_{t}}^{(k)} ∣} a_{j}^{s_{j}^{(k)}}},

(8)

where $s_{1}^{(k)}, \dots, s_{∣ S_{u, I_{t}}^{(k)} ∣}^{(k)}$ is a realization of $s_{1}, \dots, s_{∣ S_{u, I_{t}}^{(k)} ∣}$ for the kth contact map and R(i,S) is defined in Eq. 7.
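
The draw-by-draw realization of the CP distribution and the weight factor of Eq. 8 can be sketched together in Python. This is a minimal illustration, assuming all p_j are strictly less than 1 so that the odds a_j are finite; the function names (`r_values`, `selection_probs`, `weight_factor`, `sample_cp`) are hypothetical and not taken from the original work.

```python
import random
from math import prod


def r_values(a, n):
    """Return [R(0,S), ..., R(n,S)] for the odds a_j listed in a."""
    R = [1.0] + [0.0] * n
    for aj in a:
        for i in range(n, 0, -1):
            R[i] += aj * R[i - 1]
    return R


def selection_probs(a_remaining, k):
    """Probability of taking out each remaining element next,
    when k more 1's are still to be placed."""
    denom = k * r_values(a_remaining, k)[k]
    probs = []
    for idx, aj in enumerate(a_remaining):
        rest = a_remaining[:idx] + a_remaining[idx + 1:]
        probs.append(aj * r_values(rest, k - 1)[k - 1] / denom)
    return probs


def weight_factor(a, s):
    """Factor multiplying the previous weight in Eq. 8: R(n, S) / prod a_j^{s_j}."""
    n = sum(s)
    return r_values(a, n)[n] / prod(aj for aj, sj in zip(a, s) if sj)


def sample_cp(p, n, rng):
    """Place exactly n 1's among len(p) entries, one element at a time."""
    a = [pj / (1.0 - pj) for pj in p]   # requires every p_j < 1
    remaining = list(range(len(p)))
    s = [0] * len(p)
    for i in range(n):
        probs = selection_probs([a[j] for j in remaining], n - i)
        pick = rng.choices(remaining, weights=probs)[0]
        remaining.remove(pick)
        s[pick] = 1
    return s, weight_factor(a, s)


assignment, factor = sample_cp([0.2, 0.5, 0.7, 0.4], 2, random.Random(1))
```

The reciprocal of `weight_factor` is the probability of the realized assignment under the CP distribution, so these reciprocals sum to 1 over all assignments with the required number of 1's.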

Generating conformations from contact map samples derived from ϕ-values

With a set of properly weighted contact map samples ${(C^{(k)}, v_{T}^{(k)}), k = 1, \dots, m^{*}}$ , we draw a subset of them. The probability for each sample to be drawn is proportional to $v_{T}^{(k)}$ . Each selected contact map is used as the target contact map to generate conformations following Algorithm 1, using the priority score described in Sec. 2E. The set of generated conformations forms the TSE.
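
Drawing contact maps with probability proportional to their weights can be sketched with the Python standard library. The function name `draw_contact_maps` is hypothetical, and the sketch assumes drawing with replacement, which the text does not specify.

```python
import random


def draw_contact_maps(maps, weights, n_draw, seed=0):
    """Draw n_draw contact maps with probability proportional to their weights.

    Sampling is done with replacement (an assumption of this sketch), so a
    heavily weighted contact map may be selected more than once.
    """
    rng = random.Random(seed)
    return rng.choices(maps, weights=weights, k=n_draw)


# Toy labels stand in for contact maps; "A" has zero weight and is never drawn.
picked = draw_contact_maps(["A", "B", "C"], [0.0, 1.0, 3.0], 1000)
```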

RESULTS

Result of generating conformations from incomplete residue distance constraints

This section shows the result of generating protein conformations with given constraints in the form of small intervals of distances for a subset of residue pairs as described in Sec. 2D.

Consider the Boltzmann distribution π(x_n)∝exp{−E(x_n)∕τ∣D∣}, where E(x_n)∕∣D∣ is the error function defined in Eq. 2 normalized by the number of constraints, which reflects deviation from the lower and upper bounds of the distance constraints, and τ is a temperaturelike parameter in the Boltzmann function. Here we set τ=0.5. We use Algorithm 1 to estimate the expected root mean square distance (rmsd) to the native structure of conformations following this Boltzmann distribution. The algorithm is applied to 189 proteins chosen from the PDB, whose lengths are between 80 and 120. The distance constraints are constructed for non-neighboring residue pairs whose distances are less than 10 Å (SRD). The percentage of SRD pairs included in the given constraint set D varies from 20% to 100%.
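
As a rough illustration of how an expectation under π(x_n) is estimated from weighted samples, the Python sketch below reweights a set of conformations by the Boltzmann factor. It assumes, purely for illustration, that the conformations were drawn uniformly, so that each weight is proportional to exp{−E(x_n)∕τ∣D∣}; in the actual procedure the weights come from Algorithm 1. The function names are hypothetical.

```python
import math


def boltzmann_factor(error, n_constraints, tau=0.5):
    """exp(-E / (tau * |D|)) for a raw error value E and |D| constraints."""
    return math.exp(-error / (tau * n_constraints))


def weighted_expectation(values, weights):
    """Self-normalized weighted average of values."""
    total = sum(weights)
    return sum(w * v for v, w in zip(values, weights)) / total


def expected_rmsd(errors, rmsds, n_constraints, tau=0.5):
    """Estimate the expected rmsd under the Boltzmann distribution,
    assuming the conformations were sampled uniformly (illustration only)."""
    weights = [boltzmann_factor(e, n_constraints, tau) for e in errors]
    return weighted_expectation(rmsds, weights)


# Two toy conformations: the one with zero error dominates the estimate.
estimate = expected_rmsd(errors=[0.0, 5.0], rmsds=[1.0, 3.0], n_constraints=10)
```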

The growth priority score used in Algorithm 1 is described in Sec. 2D. We repeat the algorithm 20 times independently with at most m_max=1000 samples being kept during each computation. The corresponding estimated rmsd expectations of distance constraint set D that includes different percentages of SRDs are plotted in Fig. 2. The boxes in the figure have lines at the lower quartile, median, and upper quartile values of the estimated expectations of the 189 proteins. We can see the corresponding expectation of rmsd becomes smaller as the percentage of the constraints increases. This is expected, as the Boltzmann probabilities π(x_n) of conformations close to the native structure tend to be larger as more distance constraints are available.

Box plot of the expected rmsd to native structures, measured in Å, of conformations following the Boltzmann distribution of the error function π(x_n)∝exp{−E(x_n)∕τ∣D∣} for 189 proteins with length between 80 and 120. The boxes have lines at the lower quartile, median, and upper quartile values. The lines extending from each end of the boxes show the rest of the data. X axis is the percentage of native SRD pairs included in the constraint set D.

We can choose the conformation with the smallest error function Eq. 2 from the generated conformation samples as the recovered structure. In Fig. 3a, we plot the values of the normalized error function E(x_n)∕∣D∣ of these recovered structures, compared to the values of the normalized error function of the fittest native structures [Fig. 3b]. The fittest native structure is the conformation in our discrete model whose rmsd to the native structure is the smallest. It is obtained by a greedy growth method (Ref. ¹⁹) with a local minimal rmsd to the native structure. Although the objective of our algorithm is to generate conformations following the Boltzmann distribution π(x_n)∝exp{−E(x_n)∕τ∣D∣}, we can still find conformations with smaller error function values, in terms of violation of distance constraints, than the fittest native structures.

Normalized error function of the recovered structure and the fittest structure. (a) Box plot of normalized error function E(x_n)∕∣D∣ of recovered structures of 189 proteins with length of 80–120; (b) box plot of normalized error function E(x_n)∕∣D∣ of the fittest native structures of 189 proteins with length of 80–120. The boxes have lines at the lower quartile, median, and upper quartile values. The lines extending from each end of the boxes are to show the rest of the data. X axis is the percentage of native SRD pairs included in the constraint set D.

The rmsd’s of the recovered structures to native structures are plotted in Fig. 4. When the distance information of all SRD pairs is provided, the recovered structures of 160 out of the 189 proteins have rmsd to the native structures less than 3 Å. In general, the recovered structures approach native structures as more distance constraints are incorporated. This shows that the priority score β_t we use gives larger probability to generating conformations close to the native structure when more distance constraints are available.

Box plot of rmsd’s measured in Å of recovered structures of 189 proteins with length of 80–120 to native structures. The boxes have lines at the lower quartile, median, and upper quartile values. The lines extending from each end of the boxes are to show the rest of the data. X axis is the percentage of native SRD pairs included in the constraint set D.

We compare the difficulties of recovering structures from distance constraints among different protein classes. The rmsd’s of the recovered structures to native structures of ten proteins of different classes are reported in Table 1. Compared to alpha helical proteins, the recovered structures from incomplete distance constraints for beta proteins and alpha∕beta proteins have larger rmsd’s to the native structures. Table 2 reports the normalized error function of the recovered structures and the fittest native structures (in parentheses). Although the recovered structure and the fittest structure are both fixed, values of the error function normalized by the number of constraints differ depending on the choice of the constraints at different percentages. We also report the number of violated distance constraints of the recovered structures and the fittest native structures in Table 3. The results show that although the recovered structures violate some of the distance constraints, values of the normalized error function can be much smaller than those of the fittest native structures. This is because the loss functions h₁, h₂, and h₃ we use are concave downward functions, which focus on preventing the distance between residues from being far away from the given distance constraints.

Table 1.

rmsd’s measured in Å of the recovered structures and the fittest native structures to native structures of ten proteins of different classes. Number of all SRD pairs: the number of all residue pairs with distance less than 10 Å. % of SRD: the percentage of SRDs included in the constraint set D.

RMSD to native structure measured in Å (fittest native structure, and structure recovered from the given % of SRD)
PDB ID	Protein class	Protein length	# of all SRD pairs	Fittest native structure	20%	40%	60%	80%	100%
2mhr	All alpha	118	765	0.9	5.3	3.9	1.9	2.3	1.5
256b	All alpha	106	752	1.0	9.6	3.7	1.5	1.2	1.4
1cmc	All alpha	104	619	0.9	6.3	4.9	2.7	2.1	2.3
1btn	All beta	106	749	1.5	6.1	6.5	4.0	4.1	2.3
1f7d	All beta	118	796	1.4	7.9	8.4	8.0	5.1	4.5
1f86	All beta	115	816	1.2	5.5	5.3	3.4	2.6	2.3
2trx	Alpha∕beta	108	728	1.1	5.2	3.5	2.1	2.1	1.8
1bkf	Alpha∕beta	107	788	1.5	5.6	2.6	2.0	2.1	1.6
1lkk	Alpha∕beta	105	719	1.0	6.2	4.3	1.8	2.3	1.6
1puc	Alpha∕beta	101	455	0.9	10.6	8.9	7.1	7.5	4.2


Table 2.

Value of the normalized error function of the recovered structures and of the fittest native structures (in parentheses) of ten proteins of different classes. % of SRD: the percentage of SRDs included in the constraint set D.

Normalized error function of the structure recovered from the given % of SRD (fittest native structure in parentheses)
PDB ID	20%	40%	60%	80%	100%
2mhr	0.050 (0.108)	0.117 (0.105)	0.085 (0.122)	0.092 (0.136)	0.093 (0.140)
256b	0.060 (0.237)	0.049 (0.199)	0.077 (0.203)	0.085 (0.201)	0.083 (0.202)
1cmc	0.018 (0.211)	0.071 (0.196)	0.071 (0.164)	0.068 (0.165)	0.097 (0.161)
1btn	0.072 (0.842)	0.465 (0.648)	0.478 (0.650)	0.530 (0.678)	0.370 (0.696)
1f7d	0.147 (0.669)	0.293 (0.726)	0.260 (0.648)	0.343 (0.716)	0.200 (0.688)
1f86	0.144 (0.527)	0.217 (0.469)	0.330 (0.430)	0.339 (0.415)	0.198 (0.443)
2trx	0.044 (0.564)	0.102 (0.466)	0.196 (0.471)	0.120 (0.427)	0.129 (0.418)
1bkf	0.159 (0.526)	0.117 (0.564)	0.130 (0.691)	0.131 (0.667)	0.164 (0.584)
1lkk	0.264 (0.214)	0.146 (0.216)	0.128 (0.239)	0.128 (0.246)	0.121 (0.228)
1puc	0.009 (0.181)	0.083 (0.157)	0.103 (0.134)	0.069 (0.137)	0.068 (0.132)


Table 3.

The numbers of violations of distance constraints of the recovered structures and the fittest native structures (in parentheses) of ten proteins of different classes. % of SRD: the percentage of SRDs included in the constraint set D.

Number of violated distance constraints of the structure recovered from the given % of SRD (fittest native structure in parentheses)
PDB ID	20%	40%	60%	80%	100%
2mhr	67 (61)	158 (137)	224 (211)	288 (303)	383 (377)
256b	63 (74)	123 (143)	202 (231)	271 (313)	318 (372)
1cmc	36 (60)	92 (107)	150 (176)	211 (236)	281 (302)
1btn	62 (93)	188 (175)	267 (263)	386 (352)	427 (439)
1f7d	81 (83)	183 (184)	260 (294)	375 (383)	428 (463)
1f86	72 (96)	191 (210)	279 (302)	349 (384)	455 (504)
2trx	63 (79)	133 (156)	214 (233)	276 (313)	351 (398)
1bkf	84 (98)	160 (182)	237 (292)	311 (383)	414 (489)
1lkk	82 (79)	137 (158)	249 (242)	299 (327)	376 (402)
1puc	30 (47)	87 (88)	133 (134)	150 (180)	171 (216)


The relatively large number of constraint violations may be due to limitations of our discrete model: there may not exist any conformation on the lattice satisfying all of the distance constraints. To address this issue, we construct a different set of distance constraints using the fittest native structure among the conformations of the discrete model, which is obtained by a greedy method. The new set of distance constraints is

{\tilde{d}}_{i j} - ϵ_{i j} \leq ‖ x_{i} - x_{j} ‖ \leq {\tilde{d}}_{i j} + ϵ_{i j} for all (i, j) ∊ D,

where ${\tilde{d}}_{i j}$ is the distance of residues i and j in the fittest native structure. In this case, there exists at least one conformation, the fittest native structure, in the discrete model satisfying all the distance constraints. Under this setting, the normalized error function E(x_n)∕∣D∣ of the recovered structures is plotted in Fig. 5, and the rmsd’s of the recovered structures to the fittest native structures are plotted in Fig. 6. Among 189 proteins, the recovered structures of 40 proteins can match the fittest native structures perfectly when all SRD pairs are in the constraint set D.
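
The construction of interval constraints from a reference structure, and the counting of violations, can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, coordinates are plain tuples, and a single tolerance `eps` is assumed for all pairs, whereas the text allows a pair-specific ϵ_ij.

```python
import math


def constraints_from_structure(ref_coords, pairs, eps):
    """Build {(i, j): (d_ij, eps)} from distances in a reference structure."""
    return {(i, j): (math.dist(ref_coords[i], ref_coords[j]), eps)
            for i, j in pairs}


def count_violations(coords, constraints):
    """Count pairs whose distance falls outside [d_ij - eps, d_ij + eps]."""
    violated = 0
    for (i, j), (d, eps) in constraints.items():
        dist = math.dist(coords[i], coords[j])
        if dist < d - eps or dist > d + eps:
            violated += 1
    return violated


reference = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 10.0)]
constraints = constraints_from_structure(reference, [(0, 1), (0, 2), (1, 2)], eps=0.5)
n_ref = count_violations(reference, constraints)
```

By construction the reference structure violates none of its own constraints, which corresponds to the observation that at least one conformation in the discrete model satisfies the new constraint set.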

Box plot of normalized error function E(x_n)∕∣D∣ of recovered structures of 189 proteins with length of 80–120 when distance constraints are constructed based on the fittest native structures. The boxes have lines at the lower quartile, median, and upper quartile values. The lines extending from each end of the boxes are to show the rest of the data. X axis is the percentage of native SRD pairs included in the constraint set D.

Box plot of rmsd’s measured in Å of the recovered structures to the fittest native structures of 189 proteins with length of 80–120 when distance constraints are constructed based on the fittest native structures. The boxes have lines at the lower quartile, median, and upper quartile values. The lines extending from each end of the boxes are to show the rest of the data. X axis is the percentage of native SRD pairs included in the constraint set D.

Result of generating conformations from contact map of distance cutoff

This section shows the result of generating conformations based on a given contact map, where the distances between residue pairs are known to be either above or below a cutoff value (8.5 Å).¹⁶

We choose 20 proteins with length of 50–200 from the Protein Data Bank and generate conformations from their complete native contact map using Algorithm 1. We repeat the computation ten times independently and at most m_max=1000 samples are kept during each computation. The conformation with the smallest numbers of missing contacts (residue pairs that form contact in the native structure but not in the generated conformation) and extraneous contacts (residue pairs that form contact in the generated conformation but not in the native structure) is chosen as the recovered structure. The number of missed contacts, extraneous contacts, and rmsd to native structures measured in angstroms of the recovered structures are reported in Table 4. Figure 7 shows rmsd of the recovered structures to native structures. Again, we found that the recovered structures of alpha helical proteins have smaller rmsd to the native structures.
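
Counting missing and extraneous contacts between a generated conformation and the native structure can be sketched in Python as below. The 8.5 Å cutoff follows the text; the function names and the `min_sep` parameter (minimum sequence separation for a pair to be considered) are assumptions of this sketch, since the exact set of residue pairs entering the contact map is not restated here.

```python
import math


def contact_set(coords, cutoff=8.5, min_sep=1):
    """Return the set of residue pairs (i, j), i < j, closer than cutoff."""
    contacts = set()
    for i in range(len(coords)):
        for j in range(i + min_sep, len(coords)):
            if math.dist(coords[i], coords[j]) < cutoff:
                contacts.add((i, j))
    return contacts


def contact_errors(native, model, cutoff=8.5, min_sep=1):
    """Return (missing, extraneous) contact counts of model versus native."""
    nat = contact_set(native, cutoff, min_sep)
    mod = contact_set(model, cutoff, min_sep)
    missing = len(nat - mod)      # native contacts absent from the model
    extraneous = len(mod - nat)   # model contacts absent from the native map
    return missing, extraneous


native = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
stretched = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
errors = contact_errors(native, stretched)
```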

Table 4.

List of proteins of different classes used to recover structures from complete native contact maps. The number of all native contacts, the number of missed contacts, the number of extraneous (false positive) contacts, and the rmsd to the native structure in Å are also listed.

PDB ID	Protein class	Protein length	Number of native contacts	Number of missed contacts	Number of extraneous contacts	rmsd (Å)
1ptq	Small protein	50	164	8	9	1.7
1cse	Small protein	63	172	10	11	2.6
1utg	All alpha	70	206	6	3	2.5
1hyp	All alpha	75	232	12	7	1.7
1lmb	All alpha	87	280	8	5	2.0
1plc	All beta	99	391	28	51	2.6
256b	All alpha	106	363	17	7	1.9
2mcm	All beta	112	414	36	36	2.3
2mhr	All alpha	118	352	17	17	2.3
1dz3	Alpha∕beta	123	413	22	15	3.4
1mdc	All beta	131	474	40	22	2.3
1stm	All beta	141	521	68	77	4.1
1mba	All alpha	146	530	38	53	2.7
1byr	Alpha∕beta	152	641	52	42	1.6
4dfr	Alpha∕beta	159	598	49	49	2.3
3dfr	Alpha∕beta	162	578	45	65	2.6
1v37	Alpha∕beta	171	672	60	58	2.3
1dgw	All beta	178	617	65	52	3.9
1fvk	Alpha∕beta	188	664	58	36	2.0
1o7n	All beta	193	622	77	85	3.7


rmsd of structures recovered from complete native contact maps to native structures for 20 proteins of different classes with length of 50–200. X axis is the protein length; Y axis is the rmsd to the native structure, measured in Å, of the generated conformation that best fits the contact map.

Result of generating contact maps and conformations from indirect distance constraints by ϕ-values

This section depicts the result of generating TSE from ϕ-value constraints.

We generate the TSE of bovine acyl-coenzyme A-binding protein, a protein of length 86 with experimental ϕ-values. The PDB entry of the protein is 1nvl. The experimental ϕ-values are plotted in Fig. 8. More details of the experimental ϕ-values can be found in Ref. 48. We follow Ref. 16 and define the TSE as the conformations satisfying $∣ ϕ_{i}^{calc} - ϕ_{i}^{exp} ∣ < 0.15$ for all residues i with experimentally measured ϕ-values. Hence, the target distribution is the uniform distribution of all conformations satisfying these constraints.
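
The TSE membership criterion can be written as a simple filter. In the Python sketch below, ϕ-values are stored in dictionaries keyed by residue index; the function name `in_tse` and this data layout are hypothetical, and the calculation of ϕ_i^calc itself is taken as given.

```python
def in_tse(phi_calc, phi_exp, tol=0.15):
    """True if |phi_calc[i] - phi_exp[i]| < tol for every residue i
    that has an experimentally measured phi-value."""
    return all(abs(phi_calc[i] - phi_exp[i]) < tol for i in phi_exp)


# Toy values for two measured residues (indices and values are made up).
phi_exp = {3: 0.5, 10: 0.2}
accepted = in_tse({3: 0.6, 10: 0.25, 11: 0.9}, phi_exp)
rejected = in_tse({3: 0.7, 10: 0.2}, phi_exp)
```

Residues without an experimental ϕ-value, such as residue 11 in the first toy conformation, do not enter the criterion.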

Comparison of the experimental ϕ-values and calculated ϕ-values of the generated TSE of 1nvl. The filled circles represent the experimental ϕ-values, empty circles represent the calculated ϕ-values of the generated TSE.

We generate m^*=10 000 contact map samples using Algorithm 2, among which 1000 contact maps are chosen with probability proportional to their corresponding weights. For each chosen contact map, Algorithm 1 is used to generate conformations. At most m_max=1000 conformations are generated for each contact map. Figure 8 reports $ϕ_{i}^{calc}$ of the generated TSE. It is seen that the generated TSE can faithfully reproduce the ϕ-values. The average rmsd between TSE and the native structure of 1nvl is 11.3 Å. The result shows that the conformations of TSE can be far away from the native structure.

DISCUSSION

Obtaining molecular structures from incomplete and inaccurate distance information provided by experiments is an important problem. Several global optimization methods have been applied to solve this problem,⁹,¹²,¹³,¹⁴ in which the goal is to minimize some error function derived from the provided distance information. In this study, we use the SMC method to recover protein structures.

Compared to global optimization methods, an important advantage of our approach is that it can generate a set of conformations that are properly weighted with respect to a specified target distribution. Hence, in addition to recovering structures, we can also provide estimates of important physical parameters of the molecular ensemble, including thermodynamic properties such as energy and entropy under a given energy function.¹⁸,¹⁹ In this paper, the average rmsd to the native structure for TSE conformations is a consistent estimate of how close the native structure is to the TSE satisfying the distance constraints indirectly provided by ϕ-values.

A difficulty in growth-based methods, such as the SMC method, is that the distance information of future residues cannot be directly used for placing the current residue. To circumvent this problem, we develop a new growth potential function that can incorporate the distance information of future residues. In this potential function, we convert upper bound constraints of distance for a subset of residue pairs to global distance upper bound constraints of all possible residue pairs. In addition, we introduce reference points of future residues to be placed.

We have used this algorithm to generate protein conformations from constraints in the form of small intervals of distances between a subset of residue pairs, from contact map, and from indirect distance constraints by ϕ-values. This algorithm can effectively recover native structures and can generate conformations satisfying any given set of distance constraints. The conformations generated by this method can also be used as the initial conformations for further refinement.⁹,¹⁰,¹¹,¹²

In this study, a discrete model for protein structures was used for simplicity, at the price of model accuracy.²² We expect further improvement by extending our model to continuous space, with additional steps of local move refinement, as demonstrated in Refs. ¹⁴,⁴⁹.

ACKNOWLEDGMENTS

This work was supported by NIH Grant Nos. GM079804-01A1 and GM081682 and by NSF Grant Nos. DBI-0646035 and DMS-0800257.

References

Rhodes G., Crystallography Made Crystal Clear: A Guide for Users of Macromolecular Models (Academic, New York, 1999).
Crippen G. M. and Havel T. F., Distance Geometry and Molecular Conformation (Wiley, New York, 1988).
Rieping W., Habeck M., and Nilges M., Science 309, 303 (2005). doi:10.1126/science.1110428
Falcon C. M. and Matthews K. S., Biochemistry 40, 15650 (2001). doi:10.1021/bi0114067
Cai K., Langen R., Hubbell W. L., and Khorana H. G., Proc. Natl. Acad. Sci. U.S.A. 94, 14267 (1997).
Altenbach C., Marti T., Khorana H., and Hubbell W. L., Science 248, 1088 (1990). doi:10.1126/science.2160734
Altenbach C., Oh K. J., Trabanino R. J., Hideg K., and Hubbell W. L., Biochemistry 40, 15471 (2001). doi:10.1021/bi011544w
Berger B., Kleinberg J., and Leighton T., J. ACM 46, 212 (1999).
Moré J. J. and Wu Z., in Global Minimization of Nonconvex Energy Functions: Molecular Conformation and Protein Folding, edited by Pardalos P. M., Shalloway D., and Xue G. (American Mathematical Society, Providence, 1996), pp. 151–168.
Glunt W., Hayden T. L., and Raydan M., J. Comput. Chem. 14, 114 (1993). doi:10.1002/jcc.540140115
Moré J. J. and Wu Z., in Encyclopedia of Nuclear Magnetic Resonance, edited by Grant D. M. and Harris R. K. (Wiley, New York, 1995), pp. 1701–1710.
Grosso A., Locatelli M., and Schoen F., “Solving molecular distance geometry problems by global optimization algorithms,” Optim. (to be published).
Williams G. A., Dugan J. M., and Altman R. B., J. Comput. Biol. 8, 523 (2001). doi:10.1089/106652701753216521
Vendruscolo M., Kussell E., and Domany E., Folding Des. 2, 295 (1997).
Vendruscolo M. and Domany E., Folding Des. 3, 329 (1998). doi:10.1016/S1359-0278(98)00045-5
Vendruscolo M., Paci E., Dobson C., and Karplus M., Nature (London) 409, 641 (2001). doi:10.1038/35054591
Rosenbluth M. N. and Rosenbluth A. W., J. Chem. Phys. 23, 356 (1955). doi:10.1063/1.1741967
Liang J., Zhang J., and Chen R., J. Chem. Phys. 117, 3511 (2002). doi:10.1063/1.1493772
Zhang J., Lin M., Chen R., Liang J., and Liu J. S., Proteins 66, 61 (2007). doi:10.1002/prot.21203
Frenkel D. and Smit B., Understanding Molecular Simulation: From Algorithms to Applications (Academic, San Diego, 1996).
Landau D. P. and Binder K., Monte Carlo Simulations in Statistical Physics (Cambridge University Press, Cambridge, 2000).
Zhang J., Chen R., and Liang J., Proteins 63, 949 (2006). doi:10.1002/prot.20809
Moré J. J. and Wu Z., J. Global Optim. 15, 219 (1999). doi:10.1023/A:1008380219900
Hom G., Mayo S., and Pierce N., J. Comput. Chem. 24, 232 (2002). doi:10.1002/jcc.10121
Keating A. E., Malashkevich V. N., Tidor B., and Kim P. S., Proc. Natl. Acad. Sci. U.S.A. 98, 14825 (2001). doi:10.1073/pnas.261563398
Moore J. W. and Pearson R. G., Kinetics and Mechanism (Wiley, New York, 1981).
Fersht A. R., Itzhaki L. S., Elmasry N., Matthews J. M., and Otzen D. E., Proc. Natl. Acad. Sci. U.S.A. 91, 10426 (1994). doi:10.1073/pnas.91.22.10426
Li L. and Shakhnovich E. I., Proc. Natl. Acad. Sci. U.S.A. 98, 13014 (2001). doi:10.1073/pnas.241378398
Ozkan S., Bahar I., and Dill K. A., Nat. Struct. Biol. 8, 765 (2001). doi:10.1038/nsb0901-765
Lazaridis T. and Karplus M., Science 278, 1928 (1997). doi:10.1126/science.278.5345.1928
Li A. and Daggett V., Proc. Natl. Acad. Sci. U.S.A. 91, 10430 (1994). doi:10.1073/pnas.91.22.10430
Fersht A. R., Leatherbarrow R. J., and Wells T. N., Biochemistry 26, 6030 (1987). doi:10.1021/bi00393a013
Winter G., Fersht A. R., Wilkinson A. J., Zoller M., and Smith M., Nature (London) 299, 756 (1982). doi:10.1038/299756a0
Paci E., Lindorff-Larsen K., Dobson C., Karplus M., and Vendruscolo M., J. Mol. Biol. 352, 495 (2005). doi:10.1016/j.jmb.2005.06.081
Jackson S. E., Moracci M., ElMasry N., Johnson C. M., and Fersht A. R., Biochemistry 32, 11259 (1993). doi:10.1021/bi00093a001
Miyazawa S. and Jernigan R., Macromolecules 18, 534 (1985). doi:10.1021/ma00145a039
Li X., Hu C., and Liang J., Proteins 53, 792 (2003). doi:10.1002/prot.10442
Zhang J., Chen R., Liu J., and Liang J., Proteins 63, 949 (2006). doi:10.1002/prot.20809
Li X. and Liang J., Computational Algorithms for Protein Structure Prediction (Springer, New York, 2006).
Marshall A., in Symposium on Monte Carlo Methods, edited by Meyer M. (Wiley, New York, 1956), pp. 123–140.
Liu J. and Chen R., J. Am. Stat. Assoc. 93, 1032 (1998). doi:10.2307/2669847
Liu J. S., Monte Carlo Strategies in Scientific Computing (Springer, New York, 2001).
Fearnhead P. and Clifford P., J. R. Stat. Soc. Ser. B (Stat. Methodol.) 65, 887 (2003).
Atallah M. J., Algorithms and Theory of Computation Handbook (CRC, Boca Raton, FL, 1998).
Householder A. S., Principles of Numerical Analysis (McGraw-Hill, New York, 1953).
Chen S. X. and Liu J. S., Stat. Sin. 7, 875 (1997).
Chen Y., Diaconis P., Holmes S. P., and Liu J. S., J. Am. Stat. Assoc. 100, 109 (2005). doi:10.1198/016214504000001303
Kragelund B. B., Osmark P., Neergaard T. B., Schiødt J., Kristiansen K., Knudsen J., and Poulsen F. M., Nat. Struct. Biol. 6, 594 (1999). doi:10.1038/9384
Zhang J., Kou S. C., and Liu J. S., J. Chem. Phys. 126, 225101 (2007). doi:10.1063/1.2736681

[c1] Rhodes G., Crystallography Made Crystal Clear: A Guide for Users of Macromolecular Models (Academic, New York, 1999). [Google Scholar]

[c2] Crippen G. M. and Havel T. F., Distance Geometry and Molecular Conformation (Wiley, New York, 1988). [Google Scholar]

[c3] Rieping W., Habeck M., and Nilges M., Science 10.1126/science.1110428 309, 303 (2005). [DOI] [PubMed] [Google Scholar]

[c4] Falcon C. M. and Matthews K. S., Biochemistry 40, 15650 (2001). doi:10.1021/bi0114067

[c5] Cai K., Langen R., Hubbell W. L., and Khorana H. G., Proc. Natl. Acad. Sci. U.S.A. 94, 14267 (1997).

[c6] Altenbach C., Marti T., Khorana H., and Hubbell W. L., Science 248, 1088 (1990). doi:10.1126/science.2160734

[c7] Altenbach C., Oh K. J., Trabanino R. J., Hideg K., and Hubbell W. L., Biochemistry 40, 15471 (2001). doi:10.1021/bi011544w

[c8] Berger B., Kleinberg J., and Leighton T., J. ACM 46, 212 (1999).

[c9] Moré J. J. and Wu Z., in Global Minimization of Nonconvex Energy Functions: Molecular Conformation and Protein Folding, edited by Pardalos P. M., Shalloway D., and Xue G. (American Mathematical Society, Providence, 1996), pp. 151–168.

[c10] Glunt W., Hayden T. L., and Raydan M., J. Comput. Chem. 14, 114 (1993). doi:10.1002/jcc.540140115

[c11] Moré J. J. and Wu Z., in Encyclopedia of Nuclear Magnetic Resonance, edited by Grant D. M. and Harris R. K. (Wiley, New York, 1995), pp. 1701–1710.

[c12] Grosso A., Locatelli M., and Schoen F., “Solving molecular distance geometry problems by global optimization algorithms,” Optim. (to be published).

[c13] Williams G. A., Dugan J. M., and Altman R. B., J. Comput. Biol. 8, 523 (2001). doi:10.1089/106652701753216521

[c14] Vendruscolo M., Kussell E., and Domany E., Folding Des. 2, 295 (1997).

[c15] Vendruscolo M. and Domany E., Folding Des. 3, 329 (1998). doi:10.1016/S1359-0278(98)00045-5

[c16] Vendruscolo M., Paci E., Dobson C., and Karplus M., Nature (London) 409, 641 (2001). doi:10.1038/35054591

[c17] Rosenbluth M. N. and Rosenbluth A. W., J. Chem. Phys. 23, 356 (1955). doi:10.1063/1.1741967

[c18] Liang J., Zhang J., and Chen R., J. Chem. Phys. 117, 3511 (2002). doi:10.1063/1.1493772

[c19] Zhang J., Lin M., Chen R., Liang J., and Liu J. S., Proteins 66, 61 (2007). doi:10.1002/prot.21203

[c20] Frenkel D. and Smit B., Understanding Molecular Simulation: From Algorithms to Applications (Academic, San Diego, 1996).

[c21] Landau D. P. and Binder K., Monte Carlo Simulations in Statistical Physics (Cambridge University Press, Cambridge, 2000).

[c22] Zhang J., Chen R., and Liang J., Proteins 63, 949 (2006). doi:10.1002/prot.20809

[c23] Moré J. J. and Wu Z., J. Global Optim. 15, 219 (1999). doi:10.1023/A:1008380219900

[c24] Hom G., Mayo S., and Pierce N., J. Comput. Chem. 24, 232 (2002). doi:10.1002/jcc.10121

[c25] Keating A. E., Malashkevich V. N., Tidor B., and Kim P. S., Proc. Natl. Acad. Sci. U.S.A. 98, 14825 (2001). doi:10.1073/pnas.261563398

[c26] Moore J. W. and Pearson R. G., Kinetics and Mechanism (Wiley, New York, 1981).

[c27] Fersht A. R., Itzhaki L. S., Elmasry N., Matthews J. M., and Otzen D. E., Proc. Natl. Acad. Sci. U.S.A. 91, 10426 (1994). doi:10.1073/pnas.91.22.10426

[c28] Li L. and Shakhnovich E. I., Proc. Natl. Acad. Sci. U.S.A. 98, 13014 (2001). doi:10.1073/pnas.241378398

[c29] Ozkan S., Bahar I., and Dill K. A., Nat. Struct. Biol. 8, 765 (2001). doi:10.1038/nsb0901-765

[c30] Lazaridis T. and Karplus M., Science 278, 1928 (1997). doi:10.1126/science.278.5345.1928

[c31] Li A. and Daggett V., Proc. Natl. Acad. Sci. U.S.A. 91, 10430 (1994). doi:10.1073/pnas.91.22.10430

[c32] Fersht A. R., Leatherbarrow R. J., and Wells T. N., Biochemistry 26, 6030 (1987). doi:10.1021/bi00393a013

[c33] Winter G., Fersht A. R., Wilkinson A. J., Zoller M., and Smith M., Nature (London) 299, 756 (1982). doi:10.1038/299756a0

[c34] Paci E., Lindorff-Larsen K., Dobson C., Karplus M., and Vendruscolo M., J. Mol. Biol. 352, 495 (2005). doi:10.1016/j.jmb.2005.06.081

[c35] Jackson S. E., Moracci M., ElMasry N., Johnson C. M., and Fersht A. R., Biochemistry 32, 11259 (1993). doi:10.1021/bi00093a001

[c36] Miyazawa S. and Jernigan R., Macromolecules 18, 534 (1985). doi:10.1021/ma00145a039

[c37] Li X., Hu C., and Liang J., Proteins 53, 792 (2003). doi:10.1002/prot.10442

[c38] Zhang J., Chen R., Liu J., and Liang J., Proteins 63, 949 (2006). doi:10.1002/prot.20809

[c39] Li X. and Liang J., Computational Algorithms for Protein Structure Prediction (Springer, New York, 2006).

[c40] Marshall A., in Symposium on Monte Carlo Methods, edited by Meyer M. (Wiley, New York, 1956), pp. 123–140.

[c41] Liu J. and Chen R., J. Am. Stat. Assoc. 93, 1032 (1998). doi:10.2307/2669847

[c42] Liu J. S., Monte Carlo Strategies in Scientific Computing (Springer, New York, 2001).

[c43] Fearnhead P. and Clifford P., J. R. Stat. Soc. Ser. B (Stat. Methodol.) 65, 887 (2003).

[c44] Atallah M. J., Algorithms and Theory of Computation Handbook (CRC, Boca Raton, FL, 1998).

[c45] Householder A. S., Principles of Numerical Analysis (McGraw-Hill, New York, 1953).

[c46] Chen S. X. and Liu J. S., Stat. Sin. 7, 875 (1997).

[c47] Chen Y., Diaconis P., Holmes S. P., and Liu J. S., J. Am. Stat. Assoc. 100, 109 (2005). doi:10.1198/016214504000001303

[c48] Kragelund B. B., Osmark P., Neergaard T. B., Schiødt J., Kristiansen K., Knudsen J., and Poulsen F. M., Nat. Struct. Biol. 6, 594 (1999). doi:10.1038/9384

[c49] Zhang J., Kou S. C., and Liu J. S., J. Chem. Phys. 126, 225101 (2007). doi:10.1063/1.2736681


