Abstract
Root mean-square deviation (RMSD) after roto-translational least-squares fitting is a measure of global structural similarity of macromolecules used commonly. On the other hand, experimental x-ray B-factors are used frequently to study local structural heterogeneity and dynamics in macromolecules by providing direct information about root mean-square fluctuations (RMSF) that can also be calculated from molecular dynamics simulations. We provide a mathematical derivation showing that, given a set of conservative assumptions, a root mean-square ensemble-average of an all-against-all distribution of pairwise RMSD for a single molecular species, <RMSD2>1/2, is directly related to average B-factors (<B>) and <RMSF2>1/2. We show this relationship and explore its limits of validity on a heterogeneous ensemble of structures taken from molecular dynamics simulations of villin headpiece generated using distributed-computing techniques and the Folding@Home cluster. Our results provide a basis for quantifying global structural diversity of macromolecules in crystals directly from x-ray experiments, and we show this on a large set of structures taken from the Protein Data Bank. In particular, we show that the ensemble-average pairwise backbone RMSD for a microscopic ensemble underlying a typical protein x-ray structure is ∼1.1 Å, under the assumption that the principal contribution to experimental B-factors is conformational variability.
Introduction
The most frequently used measure for structure comparison in structural biology is, arguably, the atom-positional root mean-square deviation (RMSD) obtained after roto-translational least-squares fitting (1–5). Its applications are diverse and include monitoring structural changes in simulations of protein folding and dynamics (6–12), evaluating the quality of structure prediction schemes (13–16), comparing the diversity of model structures derived from experiments (17,18), assessing the properties of modeling approaches at different levels of resolution (19,20), and defining high-resolution shapes of polymers (21). Furthermore, structural diversity of an ensemble of biomolecular structures obtained through computer simulations is analyzed frequently by calculating an all-against-all distribution of RMSD values (pairwise RMSD) (22,23). Such calculation is also carried out commonly in NMR spectroscopy to assess the mutual similarity of the lowest energy structures in an ensemble produced by the refinement process (24–26). The resulting distribution of pairwise RMSD values captures the degree of structural heterogeneity of a given ensemble that can be due to either the intrinsic flexibility of a given structure or the uncertainties of the refinement procedure. The properties of this distribution, calculated typically for backbone atoms, are often summarized by reporting its arithmetic mean. Even though the calculations of pairwise RMSD values can be computationally demanding for large ensembles, they are also frequently used as an appropriate measure for clustering of structures (7,27–29).
A distribution of pairwise RMSD values provides information on the mutual similarity of members of a given ensemble when it comes to their global structure. However, to obtain information on local structural flexibility, thermal stability, and heterogeneity of macromolecules, root mean-square fluctuations (RMSF) are often studied (30–32). Most importantly, RMSF can be obtained through Debye-Waller or temperature factors (B-factors) in x-ray experiments using Eq. 1, where B-factors are usually defined as a measure of spatial fluctuations of atoms around their average position and where their motion is described as an isotropic Gaussian distribution of displacements about the average position (33). The inverse of this equation has often been used in the literature to calculate B-factors from various models (most often molecular dynamics simulations or Gaussian network models) and to compare them to experimental values (34–42):
(1) |
B-factors have also been used in a variety of studies to predict protein flexibility (43,44), assess their thermal stability (45–47), test for errors in protein structures (48), analyze active sites and binding pockets (49–51), correlate side-chain mobility with conformation (52,53), investigate crystal packing contacts (54), analyze and predict protein disordered regions (55–58), and study protein dynamics (37,40,59). However useful B-factors may be, one should always keep in mind that they include not only the positional variance of macromolecules that is due to local thermal motion, but also the effects of noise due to refinement errors, lattice defects, crystal contacts, and rigid-body motions (36,41,60). Furthermore, they also contain components coming from both static and dynamic disorder (61,62) whose separation is nontrivial (36). Finally, RMSF can also be predicted from NMR chemical shifts via a measure called random coil index (63–65).
Because pairwise RMSD, B-factors (or RMSF) are all frequently used to give information on different aspects of biomolecular ensembles, we study their relationship. We present a derivation showing that, given a set of conservative assumptions, <RMSD2>1/2 is directly proportional to average experimental B-factors (<B>), i.e., <RMSF2>1/2 for a single molecular species. Our finding is illustrated and its limits of validity probed by calculations made on structures taken from molecular dynamic (MD) simulations of the native and unfolded state of the villin headpiece domain (10,66) generated using worldwide-distributed computing techniques. In particular, we use simulated ensembles to study the effects of the exact method of structure alignment on the derived relationship, and show that the influence is typically only marginal. Finally, the newly derived relation is used to calculate quadratic means of pairwise RMSD distributions for a set of x-ray structures, given the B-factors reported in the Protein Data Bank (PDB), to assess their heterogeneity in the crystal environment.
To foreshadow the derivation presented in this study, we would like to introduce a useful analogy between <RMSD2>1/2 and <RMSF2>1/2 on the one hand and the radius of gyration (Rg) on the other. The radius of gyration, a measure often used to describe the dimensions of biopolymers such as proteins (67–69), can be analytically calculated in two ways: one using the pairwise distances between monomers (Eq. 2, Fig. 1 A), and the other using the distances between each monomer and their center of mass, i.e., the average position of all monomers if they have the same mass (Eq. 3, Fig. 1 B).
(2) |
(3) |
Indices i and j refer to different monomers, whereas Nm is a total number of monomers in a chain. Vector represents spatial coordinates of a monomer, whereas is the average position of Nm monomers. Equations 2 and 3 are shown to be identical by modifications of the Lagrange's theorem (70).
Our derivation of the relationship between <B>, <RMSF2>1/2, and <RMSD2>1/2, which is the main result of this study, mirrors the relationship between these two definitions of the radius of gyration. Namely, the two definitions given for Rg can be applied easily to ensembles of biomolecular structures where monomers are replaced by structures and RMSD is used as a measure of distance between them (Eq. 2, Fig. 1, C and D). Following this analogy, there is equivalence between root mean-square average pairwise RMSD, <RMSD2>1/2 (recalling the first definition of Rg, see Eq. 2) and the root mean-square average deviation between each structure and the average structure of the ensemble, <RMSF2>1/2 (recalling the second definition of Rg, see Eq. 3). The exact relationship between <RMSD2>1/2, <RMSF2>1/2, and <B> is explored below, together with an analysis of a novel measure of structural diversity in ensembles, the structural radius (Rstruct), which can be thought of as a structural analog of Rg.
Materials and Methods
Molecular dynamics simulations
Thousands of tens of nanoseconds long, independent trajectories for the villin headpiece domain were generated using a heterogeneous computer cluster as a part of the ongoing Folding@Home distributed computing project (10,66). The folding simulations were initiated from fully extended conformations (ϕ = −135°, ψ = 135°) with N-acetyl and C-amino caps. The equilibrium simulations were started from the experimental NMR structure of the molecule (PDB code 1VII, average structure) (66). The simulations, run using Tinker biomolecular simulation package, involved Langevin dynamics in implicit GB/SA solvent (71) (velocity damping parameter of γ = 91 ps−1) with a 2-fs integration step, at 300 K. Bond lengths were constrained using RATTLE (72). No cutoffs were used for electrostatics. The protein was modeled using the OPLSua force field (73). The molecule in the equilibrium simulations was stable with respect to both secondary and tertiary structure (10,74).
The structures were divided into two data sets for calculations: one that included native-like structures (1543 structures taken from the same number of independent equilibrium simulations at t = 20 ns), whereas the other one contained unfolded structures 5213 structures taken from the same number of independent folding trajectories at t = 27 ns).
RMSD calculations—pairwise alignment
To illustrate the relationship between average RMSD and RMSF for ensembles spanning a large range of average RMSD values, we used a clustering procedure on the two villin data sets. The main purpose of this procedure was to derive a set of mutually different distributions of pairwise RMSD to help us illustrate and assess the properties of the derivation provided in this study. Backbone atoms for each pair of structures from both simulated data sets were optimally aligned (pairwise alignment (PA)) before RMSD calculations. Nonweighted pairwise RMSDs were then calculated for the aligned backbone atoms that included C, N, and Cα of every residue (108 atoms in total). A distribution of the calculated pairwise RMSD values was plotted and divided into 20 equal segments between the smallest and the largest RMSD value. The structure that appeared in the highest number of pairs in a given segment was chosen as the center of a cluster, and the structures paired with it were assigned to that particular cluster as well. Twenty clusters were obtained through such a procedure for each data set (number of structures in each cluster is listed in Table S1 in the Supporting Material). Nonweighted pairwise RMSD was calculated for each cluster in the same way as described above using backbone-based PA. Quadratic mean of pairwise RMSD for each cluster was calculated as well.
We have noticed that the choice of the reference structure for the alignment of all the structures before calculating RMSF does affect its quadratic value in the very heterogeneous data set as shown in the Results. Therefore, RMSF for each cluster was calculated by using every single structure from the cluster for the alignment before the calculations and then quadratically averaging the obtained values to get the RMSF value for each cluster. All the alignments and calculations were done by using GROMACS-3.3 and its routines (75).
RMSD calculations—reference structure alignment
Backbones of all the structures were aligned to the backbone of the native structure of the villin headpiece domain taken from the PDB (average NMR structure, PDB code 1VII) (reference structure alignment (RSA)) to rule out the alignment effect from the calculations. Fitted structures were then subjected to the same procedure described in the previous section to obtain clusters (number of structures in each cluster is also listed in Table S1) and quadratic averages of RMSD and RMSF values. Structures were aligned to the reference structure using the McLachlan algorithm (76) as implemented in the program ProFit v3.1 (Martin, A.C.R., http://www.bioinf.org.uk/software/profit/).
Results
Demonstration of a direct proportionality between <RMSD2>1/2 and <RMSF2>1/2
RMSD is defined as the root mean-square-average distance between atoms of two optimally superimposed macromolecules (Si and Sj) and is calculated as a minimum over all rotations and translations of one of the structures being compared (Eq. 4).
(4) |
where Na is the number of atoms in a structure and should not be confused with the Avogadro constant. Indices i and j refer to different structures, whereas the index k refers to the atom position in a given structure. Vector represents spatial coordinates of a given atom.
To capture the properties of a distribution of pairwise RMSD, in this study we have used its quadratic mean calculated using Eq. 5 as it lends itself to better analytic manipulation compared to the arithmetic mean that is usually reported in NMR studies.
(5) |
where Ns is the number of structures in an ensemble. Here and in the rest of the derivation we will assume that all the structures are aligned to the same reference structure (therefore, the notation from Eq. 4 was simplified). Note that for the derivation it is not relevant what the exact nature of the reference structure is, as long as the same structure is used for aligning the whole ensemble. This is to be contrasted with typical calculation of pairwise RMSD, where each pair of structures is mutually superimposed.
RMSF for a specific number of structures is defined as a root mean-square-average distance between an atom and its average position in a given set of structures (Eq. 6)
(6) |
where is the average position of the atom k over Ns structures (Eq. 7)
(7) |
In the following, we have used the quadratic mean of RMSF calculated using Eq. 8
(8) |
If Eq. 6 is inserted into Eq. 8,
(9) |
On the other hand, if Eq. 4 is inserted into Eq. 5,
(10) |
The sums in Eq. 10 can be rearranged
(11) |
The sums over Na can now be written
(12) |
For simplicity, let us define a new variable
(13) |
The first term on the right-hand side of Eq. 13 can be separated by applying Eq. 7 and the second term can be represented as a double summation over the number of structures
(14) |
Terms in Eq. 14 can be added
(15) |
By combining Eqs. 13 and 15, we now see that
(16) |
Using the former definition of in Eq. 16, we can express <RMSF2>1/2 (Eq. 9) as
(17) |
Furthermore, we can express <RMSD2>1/2 (Eq. 12) as
(18) |
Finally, combining Eqs. 1, 17, and 18, a formula is derived that proves that <RMSD2>1/2 is directly proportional to <RMSF2>1/2 and, subsequently, B-factors.
(19) |
Finally, for Ns ≫ 1,
(20) |
Exclusion of the ensemble size effect and the derivation of identity
Typical calculations of the average pairwise RMSD for a given ensemble exclude the RMSDs between the same structures (equaling zero). One can show (see below) that this causes the relationship between the average RMSD and RMSF (and subsequently B-factors) to depend on the number of structures in an ensemble as in Eq. 19. This effect, as seen in Eq. 20, vanishes only for a large number of structures in the ensemble. One can define a new measure, Rstruct (that we term structural radius), using the following equation instead of Eq. 5 to eliminate the aforementioned effect:
(21) |
Combining Eqs. 4 and 21, it follows
(22) |
The sums in Eq. 22 can be rearranged
(23) |
Equation 16 can be inserted into Eq. 23
(24) |
The calculation of the average RMSD and RMSF remains the same as in the previous section, so by combining Eqs. 1, 17, 20, and 24, it follows
(25) |
With Eq. 25 we have shown that Rstruct and <RMSF2>1/2 are identical and can be linked directly with both <RMSD2>1/2 and experimental B-factors. In this sense, Rstruct serves as a measure of structural diversity of an ensemble of structures that is in an intuitively clear fashion directly related to B-factors, RMSF, and RMSD.
Illustrations of the relationship between the average RMSD and RMSF
To demonstrate the derived relationship between the average values of pairwise RMSD and RMSF, structures taken from distributed-computing MD simulations of villin headpiece domain were used to calculate the two measures. Each data set (native and unfolded) was divided into two sets of 20 clusters based on the distributions of pairwise RMSD values for two types of alignment (PA or reference structure alignment (RSA)). For RSA, all structures were first roto-translationally aligned to a common reference structure before their pairwise RMSD values were calculated. For PA, each individual pair of structures was first optimally aligned before calculating their RMSD. Fig. 2 shows these distributions, their arithmetic means and standard deviations. For the native data set, we can see that distributions of pairwise RMSD are very much alike regardless of the type of the alignment. Their arithmetic means and standard deviations are also very similar: 4.02 ± 1.95 Å for the PA curve and 4.07 ± 2.01 Å for the RSA curve. On the other hand, distributions associated with the unfolded data set and generated with different types of alignment are more different that can be seen from their arithmetic means and standard deviations: 7.38 ± 1.23 Å for the PA curve and 7.86 ± 1.39 Å for the RSA curve. From the given values, it can be seen that RMSD values calculated after aligning structures to a reference structure are higher than the ones calculated after the pairwise alignment. This is, of course, expected as for each individual pair of structures, PA gives by definition the lowest values over all possible roto-translational fittings. To further demonstrate this point, in the inset of Fig. 2 we compare RMSD values calculated using both types of alignment for several hundred randomly chosen pairs of structures from both data sets. It is clear from the inset of Fig. 2 that none of the RMSD values calculated after pairwise alignment of structures is higher than the corresponding values calculated after aligning structures to a reference structure.
For every pair of structures belonging to a particular cluster, pairwise RMSD (Eq. 4) was calculated using either PA or RSA, and the quadratic mean of all the RMSD values (Eq. 5) in the given cluster was determined. RSA-calculations were used to study the effect of the optimal alignment on the derived relationship. RMSF for backbone atoms in the cluster was computed as well and its quadratic mean was calculated (Eq. 8). The average values of pairwise RMSD and RMSF for clusters of both data sets and types of alignment are presented in Fig. 3. As can be seen, in the case of RSA, the real data completely agrees with the analytical derivation above (Fig. 3 A). The slope of the trendline shown in Fig. 3 A is in complete agreement with the expected value of 21/2 (seen from the derivation) and its squared correlation coefficient (R2) equals 1. However, in the case of PA (Fig. 3 B), the slope of the trendline (1.2968) deviates from the expected value due to the pairwise alignment of structures before calculations, but the R2 still has a high value of 0.9956. Deviations caused by the pairwise alignment are smaller for the native data set and the trendline applied to it would have a slope of 1.3862 with an R2 of 0.9998 (not shown).
The relationship shown between RMSD and RMSF explored in Fig. 3 still depends on the number of structures contained in a cluster. That effect can be eliminated by using the structural radius (Eq. 21) as a measure of structural diversity. The calculated values of the structural radius and <RMSF2>1/2 for clusters of both data sets and types of alignment are presented in Fig. 4. In the case of RSA (Fig. 4 A), both slope of the applied trendline and its squared correlation coefficient equal 1 showing that the two measures are identical and confirming the derivation (Eq. 25). The agreement of the two measures is slightly worse in Fig. 4 Bdue to the pairwise alignment of structures before calculations and the slope of the trendline (0.9099) differs from the expected value of 1, but still has a very high R2 of 0.9960. Once more, deviations caused by the pairwise alignment are smaller for the native data set and the trendline applied to it would have a slope of 0.9804 with R2 of 0.9999 (not shown). Altogether, one can claim that the effects of the alignment are negligible (<2%) for ensembles with <RMSD2>1/2 under 4 Å.
We also examined how the choice of the reference structure for alignment affects the <RMSF2>1/2 values, by using every single structure in a given cluster for the alignment before calculations and comparing the obtained distributions of <RMSF2>1/2 values for every cluster through their arithmetic means and standard deviations. We have also analyzed the maximal and the minimal quadratic means of every cluster to show how extreme the effects of the choice of the reference structure for the alignment can be (Fig. 5). As can be seen from the figure, arithmetic means of both data sets are very close to the minimal values of quadratic means. Their standard deviations are also quite small and they do not exceed the value of 0.26 Å, but they are higher for the unfolded data set. However, for some of the clusters, the difference between the maximal and the minimal value of <RMSF2>1/2 is more than twofold (e.g., clusters 15 and 18 in the native ensemble) that implies that the choice of the structure for the alignment can make a significant difference for a very diverse data set such as this.
Structural heterogeneity of proteins in crystals
The above relationship between experimental B-factors, <RMSD2>1/2 and the structural radius provided us with an opportunity to calculate the latter (<RMSD2>1/2 and Rstruct) for an ensemble of structures in a crystal using Eq. 25 and the measured B-factors, thus assessing the heterogeneity of a given crystal. The calculations were made under the assumption that the crystal contains a very large number of structures (Ns ≫ 1) that eliminated cluster size effect. Distributions of <RMSD2>1/2 and Rstruct values for backbone and all atoms for a representative set of x-ray structures from the PDB with ∼4800 structures are presented in Fig. 6 (see the Supporting Material for selection criteria). <RMSD2>1/2 values for the backbone and all atoms are quite similar: 1.07 ± 0.23 Å for the backbone and 1.13 ± 0.23 Å for all atoms with the maximum values that are <2.5 Å, but still with >6% of structures with <RMSD2>1/2 >1.5 Å for all atoms. The structural radius values are lower than <RMSD2>1/2 values, but are still similar: 0.76 ± 0.17 Å for the backbone and 0.80 ± 0.16 Å for all atoms with the maximum values that do not exceed the value of 1.8 Å.
Discussion
To the best of our knowledge, the heterogeneity of biomolecular ensembles in crystals used in x-ray experiments has never been evaluated previously on the level of pairwise RMSD. Here, we have derived and illustrated a nontrivial relationship between ensemble-average RMSD and RMSF and, subsequently, isotropic B-factors that gave us the opportunity to evaluate the heterogeneity of the microscopic ensembles underlying the typical crystal structures deposited in the PDB. When these values (Fig. 6) are compared to the values for villin headpiece obtained through simulation (Figs. 3 and 4), we can easily see that even the highest values for proteins in a crystal are rather small and coincide with the values for the native data set in the villin graphs, for which the relationship between <RMSD2>1/2 and <RMSF2>1/2 is close to exact, regardless of the alignment. The crystal lattice aligns the protein structures to a significant degree such that the RSA would likely be a valid approximation, but it is reassuring to see that the typical values for <RMSD2>1/2 in crystals occupy the regime in which the choice of alignment makes very little, if any, difference. It is our estimate that if one calculates <RMSD2>1/2 from B-factors as in Eq. 20, the error committed is <2% on average compared to the ideal-case pairwise alignment. Namely, 2% is the average deviation between the <RMSD2>1/2 values obtained using RSA and PA for all villin structures below <RMSD2>1/2 of 4 Å (Fig. 3). It has been proposed recently that a single crystallographic structure deposited in the PDB is not enough to assess the heterogeneity of a crystal and that an ensemble of models would be a more suitable representation (77). Nevertheless, we feel that the identity we have shown in this study is a good starting point for the assessment of crystal heterogeneity that could be generalized to an ensemble of models in a straightforward manner. Finally, the derivation presented here could in future research potentially be generalized to include anisotropic B-factors as well. Due to the additional information present, the associated <RMSD2>1/2 values would likely be more informative in that case.
Given the analogy between the structural radius and the radius of gyration and our derivation, it becomes obvious that the structural radius can be used for an ensemble of macromolecular structures in the same sense as the radius of gyration is used for a single macromolecular structure. Although the radius of gyration provides information on a macromolecule's size, the structural radius tells how diverse structures in a given ensemble are on a global scale. The structural radius can be calculated either by using pairwise RMSD as a measure of distance between structures (Eq. 21), or by using RMSD values between each structure and the average structure of the ensemble. The latter corresponds to <RMSF2>1/2 as shown in Eq. 9, but with a difference of first summing over the number of atoms (Na), and then over the number of structures (Ns), which is actually identical to Eq. 9 because the sums are interchangeable. Note that if the structural radius is calculated in this way, its usage should be restricted to molecular ensembles whose <RMSD2>1/2 is ≲6 Å (for which the alignment effects are ≲3%). For more heterogeneous ensembles, we would advise to either use the PA approach or simply exercise caution when interpreting results because of the potential deviations at high RMSD values. Here, it should be mentioned that it has been shown previously that the sum of squared distances for all atomic pairs equals the sum of squared distances to the average structure (78). Even though the authors suggested that this connection could be used for speeding up RMSD calculations (as the number of RMSD evaluations is reduced from Ns(Ns − 1)/2 to Ns) and improving algorithms in multiple structure alignment, they made no explicit link between RMSD, RMSF, and B-factors.
An important challenge in quantifying the relationship between RMSF and RMSD is the influence of optimal alignment of two structures on their mutual RMSD value. All the RMSD values calculated after the optimal pairwise alignment of two structures (PA) are lower than the ones calculated after the initial alignment of all the structures to a reference (RSA). Optimal alignment means that roto-translational fitting of the structures is carried out to minimize the RMSD value between them. Now, if one optimally aligns two structures to a reference structure, those two structures will not be optimally aligned with each other and their mutual RMSD will not be minimal, unless all three structures are mutually highly similar. For example, the effect of the optimal alignment is noticeably lesser in the native ensemble of villin than in the unfolded one, because 1), the unfolded ensemble is much more heterogeneous than the native one, and 2), the structure used as a reference is the native form of the villin headpiece domain taken from the PDB. However, as shown here, the effects of the alignment are typically only marginal. Parenthetically, one way of avoiding roto-translational alignment altogether would be to use internal coordinates to represent biomolecules and assess their structural heterogeneity. Nevertheless, as biomolecular structures (including the associated B-factors) are refined in Cartesian coordinates, we believe that the most natural measure for evaluating their global structural diversity directly from experiment should also involve atom-positional Cartesian representation, such as in the case of RMSD. In fact, the mathematical simplicity of the connection between B-factors, RMSF, and RMSD described in this study actually serves as indirect evidence supporting this claim.
In addition, RMSF values are also affected by the choice of the reference structure for the alignment. In Fig. 5, we show the arithmetic mean and standard deviation of distributions of <RMSF2>1/2 values calculated for every cluster, but differing in the choice of the structure used for the alignment before calculations. Even though the standard deviations throughout the clusters are quite small and they never exceed the value of 0.26 Å, the differences between the maximal and minimal values of quadratic means in some clusters are twofold (Fig. 5), which suggests that the choice of a reference structure in certain rare cases could indeed influence the outcome significantly. Contrary to this finding, Yang et al. (39) found no major effect of the choice of the alignment structure in their studies where they calculated residual RMSD of the ensembles of NMR models. We explain this discrepancy with the greater diversity of structures in our data set compared to most ensembles of NMR models, and we would like to stress the necessity of evaluating the diversity of an ensemble before ruling out the possible effect of the choice of the reference structure.
Here, we would like to emphasize that all of our conclusions involving B-factors and other nonprimary data depend on several critical assumptions. The danger of using derived data, such as B-factors, lies in the inaccuracies of the refinement processes linking the primary data from x-ray crystallography and NMR with model structures. All the structures submitted to the PDB are based on time and ensemble-average signals that undoubtedly affect the nature of the models derived from them and potentially cause different artifacts to appear (79,80). Furthermore, it is hard to tell whether the refinement has been conducted using the state-of-the-art software at the time of deposition and whether the software has been used in an optimal manner (81). For that reason, there have been several re-refinement attempts that yielded improved structural models (81–83). Until all the artifacts are fully resolved and understood, making assumptions based on the comparison of simulations and secondary or derived data can result in overinterpreted or misinterpreted conclusions (84). Nonetheless, we find the interpretation of our results to be useful for determining the <RMSD2>1/2 of an ensemble of structures in a crystal from B-factors as long as the Eq. 1 holds, i.e., as long as the major contribution to the measured B-factors is indeed the structural heterogeneity of molecules.
Supporting Material
One table is available at http://www.biophysj.org/biophysj/supplemental/S0006-3495(09)01738-X.
Supporting Material
Acknowledgments
We thank Ivo F. Sbalzarini, Christian L. Müller, and the members of the Laboratory of Computational Biophysics at MedILS for useful comments on the manuscript. Contribution of Folding@Home members is gratefully acknowledged.
This work was supported in part by the National Foundation for Science, Higher Education and Technological Development of Croatia (EMBO Installation grant to B.Z.), the Unity Through Knowledge Fund (UKF 1A to B.Z.), and a National Institutes of Health R01-GM062868 grant (Folding@Home).
References
- 1.McLachlan A.D. Mathematical procedure for superimposing atomic coordinates of proteins. Acta Crystallogr. A. 1972;28:656–657. [Google Scholar]
- 2.Kabsch W. A solution for the best rotation to relate two sets of vectors. Acta Crystallogr. A. 1976;32:922–923. [Google Scholar]
- 3.Kabsch W. A discussion of the solution for the best rotation to relate two sets of vectors. Acta Crystallogr. A. 1978;34:827–828. [Google Scholar]
- 4.Kneller G.R. Superposition of molecular structures using quaternions. Mol. Simul. 1991;7:113–119. [Google Scholar]
- 5.Kneller G.R. Comment on “Using quaternions to calculate RMSD” [J. Comp. Chem. 25, 1849 (2004)] J. Comput. Chem. 2005;26:1660–1662. doi: 10.1002/jcc.20296. [DOI] [PubMed] [Google Scholar]
- 6.Duan Y., Kollman P.A. Pathways to a protein folding intermediate observed in a 1-microsecond simulation in aqueous solution. Science. 1998;282:740–744. doi: 10.1126/science.282.5389.740. [DOI] [PubMed] [Google Scholar]
- 7.Daura X., Jaun B., Mark A.E. Reversible peptide folding in solution by molecular dynamics simulation. J. Mol. Biol. 1998;280:925–932. doi: 10.1006/jmbi.1998.1885. [DOI] [PubMed] [Google Scholar]
- 8.Daura X., van Gunsteren W.F., Mark A.E. Folding-unfolding thermodynamics of a beta-heptapeptide from equilibrium simulations. Proteins. 1999;34:269–280. doi: 10.1002/(sici)1097-0134(19990215)34:3<269::aid-prot1>3.0.co;2-3. [DOI] [PubMed] [Google Scholar]
- 9.Zagrovic B., Sorin E.J., Pande V. Beta-hairpin folding simulations in atomistic detail using an implicit solvent model. J. Mol. Biol. 2001;313:151–169. doi: 10.1006/jmbi.2001.5033. [DOI] [PubMed] [Google Scholar]
- 10.Zagrovic B., Snow C.D., Pande V.S. Simulation of folding of a small alpha-helical protein in atomistic detail using worldwide-distributed computing. J. Mol. Biol. 2002;323:927–937. doi: 10.1016/s0022-2836(02)00997-x. [DOI] [PubMed] [Google Scholar]
- 11.Yang J.S., Chen W.W., Shakhnovich E.I. All-atom ab initio folding of a diverse set of proteins. Structure. 2007;15:53–63. doi: 10.1016/j.str.2006.11.010. [DOI] [PubMed] [Google Scholar]
- 12.Verma A., Wenzel W. A free-energy approach for all-atom protein simulation. Biophys. J. 2009;96:3483–3494. doi: 10.1016/j.bpj.2008.12.3921. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Schueler-Furman O., Wang C., Baker D. Progress in modeling of protein structures and interactions. Science. 2005;310:638–642. doi: 10.1126/science.1112160. [DOI] [PubMed] [Google Scholar]
- 14.Rangwala H., Karypis G. fRMSDPred: predicting local RMSD between structural fragments using sequence information. Proteins. 2008;72:1005–1018. doi: 10.1002/prot.21998. [DOI] [PubMed] [Google Scholar]
- 15.Zhang Y. Progress and challenges in protein structure prediction. Curr. Opin. Struct. Biol. 2008;18:342–348. doi: 10.1016/j.sbi.2008.02.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Bowman G.R., Pande V.S. The roles of entropy and kinetics in structure prediction. PLoS One. 2009;4:e5840. doi: 10.1371/journal.pone.0005840. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Andrec M., Snyder D.A., Levy R.M. A large data set comparison of protein structures determined by crystallography and NMR: statistical test for structural differences and the effect of crystal packing. Proteins. 2007;69:449–465. doi: 10.1002/prot.21507. [DOI] [PubMed] [Google Scholar]
- 18.Saccenti E., Rosato A. The war of tools: how can NMR spectroscopists detect errors in their structures? J. Biomol. NMR. 2008;40:251–261. doi: 10.1007/s10858-008-9228-4. [DOI] [PubMed] [Google Scholar]
- 19.Sullivan D.C., Kuntz I.D. Conformation spaces of proteins. Proteins. 2001;42:495–511. [PubMed] [Google Scholar]
- 20.Sullivan D.C., Kuntz I.D. Distributions in protein conformation space: implications for structure prediction and entropy. Biophys. J. 2004;87:113–120. doi: 10.1529/biophysj.104.041723. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Müller C.L., Sbalzarini I.F., Hünenberger P.H. In the eye of the beholder: inhomogeneous distribution of high-resolution shapes within the random-walk ensemble. J. Chem. Phys. 2009;130:214904–214925. doi: 10.1063/1.3140090. [DOI] [PubMed] [Google Scholar]
- 22.Brüschweiler R. Efficient RMSD measures for the comparison of two molecular ensembles. Proteins. 2003;50:26–34. doi: 10.1002/prot.10250. [DOI] [PubMed] [Google Scholar]
- 23.Zagrovic B., van Gunsteren W.F. Computational analysis of the mechanism and thermodynamics of inhibition of phosphodiesterase 5A by synthetic ligands. J. Chem. Theory Comput. 2007;3:301–311. doi: 10.1021/ct600322d. [DOI] [PubMed] [Google Scholar]
- 24.Laurents D., Pérez-Cañadillas J.M., Bruix M. Solution structure and dynamics of ribonuclease Sa. Proteins. 2001;44:200–211. doi: 10.1002/prot.1085. [DOI] [PubMed] [Google Scholar]
- 25.Kövér K.E., Bruix M., Rico M. The solution structure and dynamics of human pancreatic ribonuclease determined by NMR spectroscopy provide insight into its remarkable biological activities and inhibition. J. Mol. Biol. 2008;379:953–965. doi: 10.1016/j.jmb.2008.04.042. [DOI] [PubMed] [Google Scholar]
- 26.Zhou Z., Feng H.Q., Bai Y. The high-resolution NMR structure of the early folding intermediate of the Thermus thermophilus ribonuclease H. J. Mol. Biol. 2008;384:531–539. doi: 10.1016/j.jmb.2008.09.044. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Shortle D., Simons K.T., Baker D. Clustering of low-energy conformations near the native structures of small proteins. Proc. Natl. Acad. Sci. USA. 1998;95:11158–11162. doi: 10.1073/pnas.95.19.11158. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Betancourt M.R., Skolnick J. Finding the needle in a haystack: educing native folds from ambiguous ab initio protein structure predictions. J. Comput. Chem. 2001;22:339–353. [Google Scholar]
- 29.Zhang Y., Skolnick J. SPICKER: a clustering approach to identify near-native protein folds. J. Comput. Chem. 2004;25:865–871. doi: 10.1002/jcc.20011. [DOI] [PubMed] [Google Scholar]
- 30.Król M., Roterman I., Spólnik P. Analysis of correlated domain motions in IgG light chain reveals possible mechanisms of immunological signal transduction. Proteins. 2005;59:545–554. doi: 10.1002/prot.20434. [DOI] [PubMed] [Google Scholar]
- 31.Yin J., Bowen D., Southerland W.M. Barnase thermal titration via molecular dynamics simulations: detection of early denaturation sites. J. Mol. Graph. Model. 2006;24:233–243. doi: 10.1016/j.jmgm.2005.08.011. [DOI] [PubMed] [Google Scholar]
- 32.Sousa S.F., Fernandes P.A., Ramos M.J. Molecular dynamics simulations on the critical states of the farnesyltransferase enzyme. Bioorg. Med. Chem. 2009;17:3369–3378. doi: 10.1016/j.bmc.2009.03.055. [DOI] [PubMed] [Google Scholar]
- 33.Willis B.T.M., Pryor A.W. Cambridge University Press; London; New York: 1975. Thermal Vibrations in Crystallography. [Google Scholar]
- 34.Phillips G.N. Comparison of the dynamics of myoglobin in different crystal forms. Biophys. J. 1990;57:381–383. doi: 10.1016/S0006-3495(90)82540-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Halle B. Flexibility and packing in proteins. Proc. Natl. Acad. Sci. USA. 2002;99:1274–1279. doi: 10.1073/pnas.032522499. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Meinhold L., Smith J.C. Fluctuations and correlations in crystalline protein dynamics: a simulation analysis of staphylococcal nuclease. Biophys. J. 2005;88:2554–2563. doi: 10.1529/biophysj.104.056101. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Lu W.C., Wang C.Z., Ho K.M. Dynamics of the trimeric AcrB transporter protein inferred from a B-factor analysis of the crystal structure. Proteins. 2006;62:152–158. doi: 10.1002/prot.20743. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Glykos N.M. On the application of molecular-dynamics simulations to validate thermal parameters and to optimize TLS-group selection for macromolecular refinement. Acta Crystallogr. D Biol. Crystallogr. 2007;63:705–713. doi: 10.1107/S0907444907014928. [DOI] [PubMed] [Google Scholar]
- 39.Yang L.W., Eyal E., Bahar I. Insights into equilibrium dynamics of proteins from comparison of NMR and x-ray data with computational predictions. Structure. 2007;15:741–749. doi: 10.1016/j.str.2007.04.014. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Lu C.H., Huang S.W., Hwang J.K. On the relationship between the protein structure and protein dynamics. Proteins. 2008;72:625–634. doi: 10.1002/prot.21954. [DOI] [PubMed] [Google Scholar]
- 41.Li D.W., Brüschweiler R. All-atom contact model for understanding protein dynamics from crystallographic B-factors. Biophys. J. 2009;96:3074–3081. doi: 10.1016/j.bpj.2009.01.011. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Hu Z., Jiang J. Assessment of biomolecular force fields for molecular dynamics simulations in a protein crystal. J. Comp. Chem. 2010;31:371–380. doi: 10.1002/jcc.21330. [DOI] [PubMed] [Google Scholar]
- 43.Karplus P.A., Schulz G.E. Prediction of chain flexibility in proteins—a tool for the selection of peptide antigens. Naturwissenschaften. 1985;72:212–213. [Google Scholar]
- 44.Vihinen M., Torkkila E., Riikonen P. Accuracy of protein flexibility predictions. Proteins. 1994;19:141–149. doi: 10.1002/prot.340190207. [DOI] [PubMed] [Google Scholar]
- 45.Vihinen M. Relationship of protein flexibility to thermostability. Protein Eng. 1987;1:477–480. doi: 10.1093/protein/1.6.477. [DOI] [PubMed] [Google Scholar]
- 46.Parthasarathy S., Murthy M.R.N. Protein thermal stability: insights from atomic displacement parameters (B values) Protein Eng. 2000;13:9–13. doi: 10.1093/protein/13.1.9. [DOI] [PubMed] [Google Scholar]
- 47.Reetz M.T., Soni P., Fernández L. Knowledge-guided laboratory evolution of protein thermolability. Biotechnol. Bioeng. 2009;102:1712–1717. doi: 10.1002/bit.22202. [DOI] [PubMed] [Google Scholar]
- 48.Stroud R.M., Fauman E.B. Significance of structural changes in proteins: expected errors in refined protein structures. Protein Sci. 1995;4:2392–2404. doi: 10.1002/pro.5560041118. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Carugo O., Argos P. Accessibility to internal cavities and ligand binding sites monitored by protein crystallographic thermal factors. Proteins. 1998;31:201–213. [PubMed] [Google Scholar]
- 50.Yuan Z., Zhao J., Wang Z.X. Flexibility analysis of enzyme active sites by crystallographic temperature factors. Protein Eng. 2003;16:109–114. doi: 10.1093/proeng/gzg014. [DOI] [PubMed] [Google Scholar]
- 51.Mohan S., Sinha N., Smith-Gill S.J. Modeling the binding sites of anti-hen egg white lysozyme antibodies HyHEL-8 and HyHEL-26: an insight into the molecular basis of antibody cross-reactivity and specificity. Biophys. J. 2003;85:3221–3236. doi: 10.1016/S0006-3495(03)74740-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Carugo O., Argos P. Correlation between side chain mobility and conformation in protein structures. Protein Eng. 1997;10:777–787. doi: 10.1093/protein/10.7.777. [DOI] [PubMed] [Google Scholar]
- 53.Eyal E., Najmanovich R., Sobolev V. Protein side-chain rearrangement in regions of point mutations. Proteins. 2003;50:272–282. doi: 10.1002/prot.10276. [DOI] [PubMed] [Google Scholar]
- 54.Carugo O., Argos P. Protein-protein crystal-packing contacts. Protein Sci. 1997;6:2261–2263. doi: 10.1002/pro.5560061021. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55.Altman R., Hughes C., Jardetsky O. Compositional characteristics of relatively disordered regions in proteins. Protein Pept. Lett. 1994;1:120–127. [Google Scholar]
- 56.Romero P., Obradovic Z., Dunker A.K. Identifying disordered regions in proteins from amino acid sequence. The 1997 IEEE International Conference on Neural Networks Proc, Houston, TX. 1997;1:90–95. [Google Scholar]
- 57.Romero P., Obradovic Z., Dunker A.K. Thousands of proteins likely to have long disordered regions. Pac. Symp. Biocomput. 1998;3:437–448. [PubMed] [Google Scholar]
- 58.Radivojac P., Obradovic Z., Dunker A.K. Protein flexibility and intrinsic disorder. Protein Sci. 2004;13:71–80. doi: 10.1110/ps.03128904. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 59.Navizet I., Lavery R., Jernigan R.L. Myosin flexibility: structural domains and collective vibrations. Proteins. 2004;54:384–393. doi: 10.1002/prot.10476. [DOI] [PubMed] [Google Scholar]
- 60.Kuriyan J., Weis W.I. Rigid protein motion as a model for crystallographic temperature factors. Proc. Natl. Acad. Sci. USA. 1991;88:2773–2777. doi: 10.1073/pnas.88.7.2773. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 61.Frauenfelder H., Petsko G.A., Tsernoglou D. Temperature-dependent x-ray diffraction as a probe of protein structural dynamics. Nature. 1979;280:558–563. doi: 10.1038/280558a0. [DOI] [PubMed] [Google Scholar]
- 62.Chong S.H., Joti Y., Parak F. Dynamical transition of myoglobin in a crystal: comparative studies of x-ray crystallography and Mössbauer spectroscopy. Eur. Biophys. J. 2001;30:319–329. doi: 10.1007/s002490100152. [DOI] [PubMed] [Google Scholar]
- 63.Berjanskii M., Wishart D.S. NMR: prediction of protein flexibility. Nat. Protoc. 2006;1:683–688. doi: 10.1038/nprot.2006.108. [DOI] [PubMed] [Google Scholar]
- 64.Berjanskii M.V., Wishart D.S. The RCI server: rapid and accurate calculation of protein flexibility using chemical shifts. Nucleic Acids Res. 2007;35(Web Server issue):W531–W537. doi: 10.1093/nar/gkm328. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 65.Berjanskii M.V., Wishart D.S. Application of the random coil index to studying protein flexibility. J. Biomol. NMR. 2008;40:31–48. doi: 10.1007/s10858-007-9208-0. [DOI] [PubMed] [Google Scholar]
- 66.McKnight C.J., Matsudaira P.T., Kim P.S. NMR structure of the 35-residue villin headpiece subdomain. Nat. Struct. Biol. 1997;4:180–184. doi: 10.1038/nsb0397-180. [DOI] [PubMed] [Google Scholar]
- 67.Eliezer D., Jennings P.A., Tsuruta H. The radius of gyration of an apomyoglobin folding intermediate. Science. 1995;270:487–488. doi: 10.1126/science.270.5235.487. [DOI] [PubMed] [Google Scholar]
- 68.Bright J.N., Woolf T.B., Hoh J.H. Predicting properties of intrinsically unstructured proteins. Prog. Biophys. Mol. Biol. 2001;76:131–173. doi: 10.1016/s0079-6107(01)00012-8. [DOI] [PubMed] [Google Scholar]
- 69.Knott M., Chan H.S. Criteria for downhill protein folding: calorimetry, chevron plot, kinetic relaxation, and single-molecule radius of gyration in chain models with subdued degrees of cooperativity. Proteins. 2006;65:373–391. doi: 10.1002/prot.21066. [DOI] [PubMed] [Google Scholar]
- 70.Flory P.J. Hanser Publishers; New York: 1989. Statistical Mechanics of Chain Molecules. [Google Scholar]
- 71.Qiu D., Shenkin P.S., Still W.C. The GB/SA continuum model for solvation. A fast analytical method for the calculation of approximate Born radii. J. Phys. Chem. 1997;101:3005–3014. [Google Scholar]
- 72.Andersen H.C. Rattle—a velocity version of the shake algorithm for molecular-dynamics calculations. J. Comput. Phys. 1983;52:24–34. [Google Scholar]
- 73.Jorgensen W.L., Tiradorives J. The Opls potential functions for proteins - energy minimizations for crystals of cyclic-peptides and crambin. J. Am. Chem. Soc. 1988;110:1657–1666. doi: 10.1021/ja00214a001. [DOI] [PubMed] [Google Scholar]
- 74.Zagrovic B., Snow C.D., Pande V.S. Native-like mean structure in the unfolded ensemble of small proteins. J. Mol. Biol. 2002;323:153–164. doi: 10.1016/s0022-2836(02)00888-4. [DOI] [PubMed] [Google Scholar]
- 75.Lindahl E., Hess B., van der Spoel D. GROMACS 3.0: a package for molecular simulation and trajectory analysis. J. Mol. Model. 2001;7:306–317. [Google Scholar]
- 76.McLachlan A.D. Rapid comparison of protein structures. Acta Crystallogr. A. 1982;38:871–873. [Google Scholar]
- 77.Furnham N., Blundell T.L., Terwilliger T.C. Is one solution good enough? Nat. Struct. Mol. Biol. 2006;13:184–185. doi: 10.1038/nsmb0306-184. discussion 185. [DOI] [PubMed] [Google Scholar]
- 78.Wang, X., and J. Snoeyink. 2006. Multiple structure alignment by optimal RMSD implies that the average structure is a consensus. LSS Computational Systems Bioinformatics Conference, Stanford, CA. 79–87. [PubMed]
- 79.Bürgi R., Pitera J., van Gunsteren W.F. Assessing the effect of conformational averaging on the measured values of observables. J. Biomol. NMR. 2001;19:305–320. doi: 10.1023/a:1011295422203. [DOI] [PubMed] [Google Scholar]
- 80.Zagrovic B., van Gunsteren W.F. Comparing atomistic simulation data with the NMR experiment: how much can NOEs actually tell us? Proteins. 2006;63:210–218. doi: 10.1002/prot.20872. [DOI] [PubMed] [Google Scholar]
- 81.Joosten R.P., Womack T., Bricogne G. Re-refinement from deposited x-ray data can deliver improved models for most PDB entries. Acta Crystallogr. D Biol. Crystallogr. 2009;65:176–185. doi: 10.1107/S0907444908037591. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 82.Nabuurs S.B., Nederveen A.J., Spronk C.A. DRESS: a database of REfined solution NMR structures. Proteins. 2004;55:483–486. doi: 10.1002/prot.20118. [DOI] [PubMed] [Google Scholar]
- 83.Joosten R.P., Vriend G. PDB improvement starts with data deposition. Science. 2007;317:195–196. doi: 10.1126/science.317.5835.195. [DOI] [PubMed] [Google Scholar]
- 84.van Gunsteren W.F., Dolenc J., Mark A.E. Molecular simulation as an aid to experimentalists. Curr. Opin. Struct. Biol. 2008;18:149–153. doi: 10.1016/j.sbi.2007.12.007. [DOI] [PubMed] [Google Scholar]
- 85.Humphrey W., Dalke A., Schulten K. VMD: visual molecular dynamics. J. Mol. Graph. Model. 1996;14:33–38. doi: 10.1016/0263-7855(96)00018-5. [DOI] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.