This paper provides a rigorous mathematical derivation of Wilson’s prediction that the power spectrum of many molecules of biological interest is approximately flat at high frequencies. The analysis elucidates the precise cutoff frequency above which the flat approximation holds and extends the result to other types of statistics with applications to electron cryomicroscopy (cryo-EM).
Keywords: power spectrum, cryo-EM, Wilson statistics, Fourier analysis, Guinier plot
Abstract
The power spectrum of proteins at high frequencies is remarkably well described by the flat Wilson statistics. Wilson statistics therefore plays a significant role in X-ray crystallography and more recently in electron cryomicroscopy (cryo-EM). Specifically, modern computational methods for three-dimensional map sharpening and atomic modelling of macromolecules by single-particle cryo-EM are based on Wilson statistics. Here the first rigorous mathematical derivation of Wilson statistics is provided. The derivation pinpoints the regime of validity of Wilson statistics in terms of the size of the macromolecule. Moreover, the analysis naturally leads to generalizations of the statistics to covariance and higher-order spectra. These in turn provide a theoretical foundation for assumptions underlying the widespread Bayesian inference framework for three-dimensional refinement and for explaining the limitations of autocorrelation-based methods in cryo-EM.
1. Introduction
The power spectrum of proteins is often modelled by the Guinier law at low frequencies and the Wilson statistics at high frequencies. At low frequencies, there is a quadratic decay of the power spectrum characterized by the moment of inertia of the molecule (e.g. its radius of gyration). At high frequencies, the power spectrum is approximately flat. In structural biology, it is customary to plot the logarithm of the spherically averaged power spectrum of a three-dimensional structure as a function of the squared spatial frequency. This Guinier plot typically depicts the two different frequency regimes. It is not surprising that these laws are of critical importance in structural biology, with applications in X-ray crystallography (Drenth, 2007) and electron cryomicroscopy (cryo-EM) (Rosenthal & Henderson, 2003). However, while the Guinier law has a very simple mathematical derivation based on a Taylor expansion, in the literature we could only find heuristic arguments in support of Wilson statistics, such as the original argument provided by Wilson in his seminal one-page Nature paper (Wilson, 1942). Here we provide a rigorous mathematical derivation of Wilson statistics in the form of Theorem 3 and derive other forms of statistics with potential application to cryo-EM. The main ingredients in our analysis are a scaling argument, basic probability theory, and modern results in Fourier analysis that have found various applications within mathematics (such as the distribution of lattice points in domains), but whose application to structural biology appears to be new.
1.1. Random bag of atoms
The model underlying Wilson statistics is a random ‘bag of atoms’, where the random ‘protein’ consists of N atoms whose locations $x_1, \ldots, x_N$ are independent and identically distributed (i.i.d.). For example, each could be uniformly distributed inside a container such as a cube or a ball, though other shapes and non-uniform distributions are also possible. The electron scattering potential of the protein is modelled as
$$\phi(x) = \sum_{i=1}^{N} f(x - x_i), \qquad (1)$$
where f is a bump function such as a Gaussian, or a delta function in the limit of an ideal point mass. For simplicity of exposition, we assume that the atoms are identical. Otherwise, one can use different f’s to describe the scattering from each atom type. The Fourier transform of (1) is given by
$$\hat{\phi}(\xi) = \hat{f}(\xi) \sum_{i=1}^{N} e^{-\mathrm{i}\langle \xi, x_i\rangle}. \qquad (2)$$
1.2. Wilson statistics
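Wilson’s original argument (Wilson, 1942) uses (2) to evaluate the power spectrum as follows:
$$|\hat{\phi}(\xi)|^2 = |\hat{f}(\xi)|^2 \Big|\sum_{i=1}^{N} e^{-\mathrm{i}\langle \xi, x_i\rangle}\Big|^2 = |\hat{f}(\xi)|^2 \Big[N + \sum_{i \neq j} e^{-\mathrm{i}\langle \xi, x_i - x_j\rangle}\Big]. \qquad (3)$$
Wilson argued that the sum of the complex exponentials in (3) is negligible compared with N, as those terms oscillate wildly and cancel each other, especially for high frequency ξ, so that
$$|\hat{\phi}(\xi)|^2 \approx N |\hat{f}(\xi)|^2. \qquad (4)$$
We shall make this hand-wavy argument rigorous and the term ‘high frequency’ mathematically precise. Note that for an ideal point mass $\hat{f} \equiv 1$ and (4) implies that the power spectrum is flat, i.e. $|\hat{\phi}(\xi)|^2 \approx N$.
The challenge is to show that there is so much cancellation that adding $O(N^2)$ oscillating terms of size $O(1)$ in (3) is negligible compared with N. For a random walk, the sum of n i.i.d. zero-mean random variables of variance 1 is $O(\sqrt{n})$ (the square root of the number of terms), which for the $N(N-1)$ terms in (3) would only give a sum of order N. In order to show that the sum is negligible compared with N, additional cancellation must be happening. The role that ξ plays also needs to be carefully analysed, as for $\xi = 0$, clearly $|\hat{\phi}(0)|^2 = N^2 |\hat{f}(0)|^2$. What is the mechanism by which $|\hat{\phi}(\xi)|^2$ decays from $N^2$ to N as ξ increases?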
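The cancellation at stake can be probed numerically. The following Python sketch is an illustration only and not part of the derivation: it assumes ideal point atoms ($\hat{f} \equiv 1$), a ball-shaped container of radius $N^{1/3}$ (an arbitrary choice that fixes the density), and the sign convention $e^{-\mathrm{i}\langle \xi, x\rangle}$; all function names are hypothetical. It compares the power spectrum at ξ = 0 with its Monte Carlo average at a higher frequency.

```python
import cmath
import random

def random_point_in_ball(radius, rng):
    """Rejection-sample a point uniformly distributed inside a ball."""
    while True:
        p = [rng.uniform(-radius, radius) for _ in range(3)]
        if sum(c * c for c in p) <= radius * radius:
            return p

def power_spectrum(atoms, xi):
    """|sum_i exp(-i <xi, x_i>)|^2 for ideal point atoms (f_hat = 1)."""
    total = sum(cmath.exp(-1j * sum(a * b for a, b in zip(xi, x))) for x in atoms)
    return abs(total) ** 2

def mean_power(n_atoms, k, trials, rng):
    """Average the power spectrum at xi = (k, 0, 0) over random bags of atoms."""
    radius = n_atoms ** (1.0 / 3.0)  # assumed container size (fixed density)
    acc = 0.0
    for _ in range(trials):
        atoms = [random_point_in_ball(radius, rng) for _ in range(n_atoms)]
        acc += power_spectrum(atoms, (k, 0.0, 0.0))
    return acc / trials

rng = random.Random(0)
N = 200
p_zero = mean_power(N, 0.0, 1, rng)    # every exponential equals 1, so this is N^2
p_high = mean_power(N, 3.0, 400, rng)  # Monte Carlo average at a higher frequency
print(p_zero, p_high)
```

Under these assumptions the value at ξ = 0 is exactly $N^2$, whereas the averaged value at the higher frequency should fluctuate around N, which is the behaviour the rest of the analysis sets out to explain.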
2. Derivation of Wilson statistics
2.1. $N^{1/3}$ scaling
Since $x_1, \ldots, x_N$ are i.i.d., one might be tempted to apply the central limit theorem (CLT) to (2) and conclude that $\hat{\phi}(\xi)$ is approximately a Gaussian, for which the mean and variance can be readily calculated as done by Wilson (1949). However, one should proceed with caution, because if the container Ω is fixed, then in the limit $N \to \infty$, the density of the atoms also grows indefinitely, whereas the density of atoms in a protein is clearly bounded. If the density of the atoms is to be kept fixed, the container Ω has to grow with N. To make this dependency explicit, we denote the container by $\Omega_N$. The volume of the container must be proportional to N. The length scale is therefore proportional to $N^{1/3}$, that is, $\Omega_N = N^{1/3}\Omega_1$ or $x_i = N^{1/3} y_i$ with $y_i \sim U(\Omega_1)$ in the uniform case, and $y_i$ i.i.d. in general. The Fourier transform (2) is rewritten as
$$\hat{\phi}(\xi) = \hat{f}(\xi) \sum_{i=1}^{N} e^{-\mathrm{i} N^{1/3} \langle \xi, y_i\rangle}, \qquad (5)$$
but now the CLT can no longer be applied in a straightforward manner, because the summands in (5) are random variables that depend on N.
2.2. Shape of container and decay rate of the Fourier transform
The representation (5) facilitates the calculation of any moment of $\hat{\phi}(\xi)$. The expectation (first moment) of $\hat{\phi}(\xi)$ is given by
$$\mathbb{E}[\hat{\phi}(\xi)] = N \hat{f}(\xi)\, \hat{g}(N^{1/3}\xi), \qquad (6)$$
where g is the probability density function of the rescaled atom locations $y_i = N^{-1/3} x_i$ and $\hat{g}$ is its Fourier transform. The dependency on $N^{1/3}\xi$ and N being a large parameter together suggest that the decay rate of $\hat{g}$ at high frequencies is critical for analysing Wilson statistics.
Different container shapes and choices of g can lead to different behaviour of its Fourier transform $\hat{g}$. Before stating known theoretical results, it is instructive to consider a couple of examples.
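(i) A uniform distribution in a ball. Here the container is a ball of radius 1, denoted B, and the uniform density is $g = \frac{3}{4\pi}\chi_B$, where $\chi_B$ is the characteristic function of the ball. It is a radial function, a property that can readily be used to calculate its Fourier transform as
$$\hat{g}(\xi) = \frac{3\,(\sin|\xi| - |\xi|\cos|\xi|)}{|\xi|^3}. \qquad (7)$$
In particular, (7) implies that $|\hat{g}(\xi)| \le C|\xi|^{-2}$ for some constant C.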
(ii) A uniform distribution in a cube. Here is the unit cube, and is a product of three rectangular window functions whose Fourier transform is the sinc function. As a result,
Taking ξ along one of the axes, e.g. gives . In this case, for some . Notice that the decay of in directions not normal to its faces is faster. For example, for we have
We are now ready to state existing theoretical results about the decay rate of the Fourier transform for containers of general shape.
Theorem 1
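(1) (See Stein & Shakarchi, 2011, p. 336.) Suppose $\Omega \subset \mathbb{R}^d$ is a bounded region whose boundary $M = \partial\Omega$ has non-vanishing Gauss curvature at each point, then
$$\hat{\chi}_\Omega(\xi) = O\big(|\xi|^{-(d+1)/2}\big) \quad \text{as } |\xi| \to \infty.$$
(2) If M has m non-vanishing principal curvatures at each point, then
$$\hat{\chi}_\Omega(\xi) = O\big(|\xi|^{-(m+2)/2}\big) \quad \text{as } |\xi| \to \infty.$$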
The decay rates previously observed for the three-dimensional ball ($d = 3$ or $m = 2$) and the cube ($m = 0$) are particular cases of Theorem 1.
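The two decay rates can be checked numerically. The sketch below is illustrative only. It assumes the standard closed form $3(\sin k - k\cos k)/k^3$ for the Fourier transform of the uniform density on the unit ball, and a unit cube centred at the origin, $[-1/2, 1/2]^3$, for which the transform along a coordinate axis is $\sin(k/2)/(k/2)$; the centring and the function names are assumptions made for this example.

```python
import math

def g_hat_ball(k):
    """Fourier transform of the uniform density on the unit ball at radial frequency k."""
    if k == 0.0:
        return 1.0
    return 3.0 * (math.sin(k) - k * math.cos(k)) / k ** 3

def g_hat_cube_axis(k):
    """Fourier transform of the uniform density on [-1/2, 1/2]^3 at xi = (k, 0, 0)."""
    if k == 0.0:
        return 1.0
    return math.sin(k / 2.0) / (k / 2.0)

# Sample frequencies between 10 and about 1010 (dimensionless units).
ks = [10.0 + 0.01 * j for j in range(100000)]

# If |g_hat| decays like k^-2 (ball) or k^-1 (cube, along an axis),
# the rescaled quantities below must stay bounded over the whole range.
ball_sup = max(k ** 2 * abs(g_hat_ball(k)) for k in ks)
cube_sup = max(k * abs(g_hat_cube_axis(k)) for k in ks)
print(ball_sup, cube_sup)
```

Both maxima remain bounded over two decades of frequency, consistent with a $k^{-2}$ decay for the ball and a $k^{-1}$ decay for the cube in the direction normal to a face.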
Although the decay rate in different directions could be different (as the example of the cube illustrates), for a large family of containers (convex sets and open sets with sufficiently smooth boundary surface), the following theorem asserts that the spherical average of the power spectrum has the same decay rate as that of the ball.
Theorem 2
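(See Brandolini et al., 2003.) Suppose $\Omega \subset \mathbb{R}^d$ is a convex body or an open bounded set whose boundary is $C^{3/2}$. Then,
$$\int_{S^{d-1}} |\hat{\chi}_\Omega(k\omega)|^2 \, d\omega \le C k^{-(d+1)}.$$
Here k is the radial frequency and $S^{d-1}$ is the unit sphere in $\mathbb{R}^d$.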
2.3. Validity regime of Wilson statistics
We are now in position to state and prove our main result that fully characterizes the regime of validity of Wilson statistics.
Theorem 3
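(1) For the random bag of atoms model, the expected power spectrum is given by
$$\mathbb{E}|\hat{\phi}(\xi)|^2 = N|\hat{f}(\xi)|^2 \big[1 + (N-1)|\hat{g}(N^{1/3}\xi)|^2\big]. \qquad (12)$$
(2) If the container is a convex body or an open set with a $C^{3/2}$ boundary surface, and the atom locations are uniformly distributed in the container, then the expected spherically averaged power spectrum satisfies
$$\frac{1}{4\pi}\int_{S^2} \mathbb{E}|\hat{\phi}(k\omega)|^2 \, d\omega \approx N|\hat{f}(k)|^2 \qquad (13)$$
for $k \gg N^{-1/12}$.
(3) If the Fourier transform of the density g satisfies $|\hat{g}(\xi)| \le C|\xi|^{-2}$, then
$$\mathbb{E}|\hat{\phi}(\xi)|^2 \approx N|\hat{f}(\xi)|^2 \quad \text{for } k \gg N^{-1/12}, \qquad (14)$$
where $k = |\xi|$.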
Proof —
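Starting with Wilson’s original approach, from (5) it follows that the power spectrum of ϕ is given by
$$|\hat{\phi}(\xi)|^2 = |\hat{f}(\xi)|^2 \Big[N + \sum_{i \neq j} e^{-\mathrm{i} N^{1/3}\langle \xi, y_i - y_j\rangle}\Big].$$
Since the $y_i$’s are i.i.d., the expected power spectrum satisfies
$$\mathbb{E}|\hat{\phi}(\xi)|^2 = |\hat{f}(\xi)|^2 \big[N + N(N-1)|\hat{g}(N^{1/3}\xi)|^2\big],$$
establishing (12). Assuming f (hence also $\hat{f}$) are radial functions, the expectation of the spherically averaged power spectrum satisfies
$$\frac{1}{4\pi}\int_{S^2} \mathbb{E}|\hat{\phi}(k\omega)|^2 \, d\omega = |\hat{f}(k)|^2 \Big[N + \frac{N(N-1)}{4\pi}\int_{S^2} |\hat{g}(N^{1/3}k\omega)|^2 \, d\omega\Big]. \qquad (16)$$
Theorem 2 with $d = 3$ implies
$$\frac{N(N-1)}{4\pi}\int_{S^2} |\hat{g}(N^{1/3}k\omega)|^2 \, d\omega \le C N^2 (N^{1/3}k)^{-4} = C N^{2/3} k^{-4}. \qquad (17)$$
This term is negligible compared with N in (16) for $k \gg N^{-1/12}$, proving (13). Finally, if $|\hat{g}(\xi)| \le C|\xi|^{-2}$, then $N(N-1)|\hat{g}(N^{1/3}\xi)|^2$ = $O(N^{2/3}|\xi|^{-4})$, which is $o(N)$ for $|\xi| \gg N^{-1/12}$.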
□
Note that (16) and (17) suggest that the spherically averaged power spectrum decays to its high-frequency limit as $k^{-4}$. This decay rate at high frequencies is reminiscent of Porod’s law in SAXS (small-angle X-ray scattering) (Porod, 1951, 1982). At first, the 1/12 exponent of the cutoff frequency might seem mysterious. In hindsight, it is simply the product of the exponent 1/3, which stems from the dimension $d = 3$ that resulted in the scaling of $N^{1/3}$, and the exponent 1/4 associated with the decay rate of $k^{-4}$.
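The role of the $N^{-1/12}$ cutoff can be illustrated for a ball-shaped container. For i.i.d. atoms the expected power spectrum divided by $N|\hat{f}|^2$ equals $1 + (N-1)|\hat{g}(N^{1/3}k)|^2$; the sketch below, which is only an illustration under the assumption of a uniform density in the unit ball (with the standard closed form for its Fourier transform) and hypothetical function names, evaluates the excess over 1 at frequencies $k = c\,N^{-1/12}$ for several values of N.

```python
import math

def g_hat_ball(t):
    """Fourier transform of the uniform density on the unit ball at radial frequency t."""
    if t == 0.0:
        return 1.0
    return 3.0 * (math.sin(t) - t * math.cos(t)) / t ** 3

def relative_excess(n_atoms, k):
    """(N - 1) |g_hat(N^(1/3) k)|^2, i.e. expected power / (N |f_hat|^2) minus 1."""
    return (n_atoms - 1) * g_hat_ball(n_atoms ** (1.0 / 3.0) * k) ** 2

for n_atoms in (10 ** 3, 10 ** 6, 10 ** 9):
    k_unit = n_atoms ** (-1.0 / 12.0)  # dimensionless cutoff scale
    row = [relative_excess(n_atoms, c * k_unit) for c in (0.0, 1.0, 5.0, 20.0)]
    print(n_atoms, ["%.3g" % v for v in row])
```

With the bound $|\hat{g}(t)| \lesssim 3/t^2$, the excess at $k = c\,N^{-1/12}$ is at most about $9/c^4$ whatever the value of N, which is why measuring frequency in units of $N^{-1/12}$ is the natural way to express the validity regime.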
2.4. Spherical averaging and statistical fluctuation
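Note that in our derivation of Wilson statistics, we first took expectation with respect to the atom positions followed by spherically averaging the power spectrum. On the other hand, spherically averaging (3) first gives
$$\frac{1}{4\pi}\int_{S^2} |\hat{\phi}(k\omega)|^2 \, d\omega = |\hat{f}(k)|^2 \Big[N + \sum_{i \neq j} \frac{\sin(k|x_i - x_j|)}{k|x_i - x_j|}\Big], \qquad (18)$$
as in Debye’s scattering equation (Debye, 1915), due to the identity
$$\frac{1}{4\pi}\int_{S^2} e^{\mathrm{i} k \langle \omega, x\rangle} \, d\omega = \frac{\sin(k|x|)}{k|x|}.$$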
Although the decay of the sinc function in (18) sheds some light on the mechanism by which the sum over atom pairs decreases with k, it does not seem to provide a good starting point for a rigorous derivation of Wilson statistics, nor does it provide a clear path for the generalizations considered later in this paper.
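The spherical-average identity behind Debye’s equation, namely that the average of $e^{\mathrm{i}k\langle\omega, x\rangle}$ over the unit sphere equals $\sin(k|x|)/(k|x|)$, is easy to verify numerically. The sketch below uses a simple midpoint rule in $\cos\theta$ and the azimuth; the grid sizes, test vector and function names are arbitrary choices made for illustration.

```python
import cmath
import math

def sphere_average(k, x, n_u=200, n_phi=200):
    """Average of exp(i k <omega, x>) over the unit sphere (midpoint quadrature)."""
    total = 0j
    for a in range(n_u):
        u = -1.0 + (2.0 * a + 1.0) / n_u          # midpoint in cos(theta)
        s = math.sqrt(1.0 - u * u)
        for b in range(n_phi):
            phi = 2.0 * math.pi * (b + 0.5) / n_phi
            omega = (s * math.cos(phi), s * math.sin(phi), u)
            total += cmath.exp(1j * k * sum(w * c for w, c in zip(omega, x)))
    # Uniform weights: the area element is du dphi and the sphere has area 4 pi.
    return total / (n_u * n_phi)

def debye_term(k, x):
    """Closed form sin(k r) / (k r) with r = |x|."""
    r = math.sqrt(sum(c * c for c in x))
    return math.sin(k * r) / (k * r)

x = (1.0, -2.0, 0.5)   # arbitrary interatomic vector
k = 1.3                # arbitrary radial frequency
numeric = sphere_average(k, x)
exact = debye_term(k, x)
print(numeric, exact)
```

The quadrature result is real up to rounding and agrees with the closed form, so each atom pair contributes a sinc term that decays only like $1/k$, in line with the remark that the Debye form by itself does not make the required cancellation evident.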
While Theorem 3 characterizes the expected power spectrum, one may wonder whether the statistical fluctuations of the power spectrum could overwhelm its mean. This turns out not to be the case. Similar to the derivation of Wilson statistics, one can show that if then
Since = for , it follows that for
In other words, the standard deviation of the power spectrum is , so the fluctuation is smaller than the mean value.
3. Theoretical Guinier plots and cutoff frequencies
A realistic estimate of the density of atoms in proteins gives rise to theoretical Guinier plots and prediction of the cutoff frequency above which Wilson statistics holds. The protein density is approximately ρ ≃ 0.8 Da Å$^{-3}$ (Henderson, 1995). The number of carbon atom equivalents, using 9.1 carbon equivalents per amino acid of molecular weight 110, is $N = 9.1\,M_w/110$, where $M_w$ is the molecular weight. For a spherically shaped protein of radius R, the molecular weight and number of carbon atom equivalents are given by $M_w = \frac{4\pi}{3}\rho R^3$ and $N = \frac{9.1}{110}\cdot\frac{4\pi}{3}\rho R^3$, respectively. In particular, $N \approx 2.8 \times 10^5$ and $M_w$ = 3.3 MDa for R = 100 Å, while $N \approx 4.3 \times 10^3$ and $M_w$ = 52 kDa for R = 25 Å [see Table 2 of Henderson (1995)].
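The arithmetic behind these figures can be reproduced in a few lines. The sketch below simply applies the stated density (0.8 Da Å$^{-3}$) and the conversion of 9.1 carbon equivalents per 110 Da residue to a sphere of radius R; the constant and function names are hypothetical.

```python
import math

RHO = 0.8                     # protein density in Da per cubic angstrom
CARBON_EQ_PER_RESIDUE = 9.1   # carbon atom equivalents per amino acid
RESIDUE_MASS = 110.0          # average amino-acid molecular weight in Da

def molecular_weight(radius):
    """Molecular weight (Da) of a spherical protein of the given radius (angstrom)."""
    return RHO * 4.0 / 3.0 * math.pi * radius ** 3

def carbon_equivalents(radius):
    """Number of carbon atom equivalents in a spherical protein of the given radius."""
    return CARBON_EQ_PER_RESIDUE * molecular_weight(radius) / RESIDUE_MASS

for radius in (100.0, 25.0):
    print(radius, molecular_weight(radius), carbon_equivalents(radius))
```

For R = 100 Å this gives a molecular weight of about 3.35 MDa, and for R = 25 Å about 52 kDa, matching the rounded values quoted in the text.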
Theoretical Guinier plots of the logarithm of the expected power spectra [using (12) and (7)] as a function of the squared spatial frequency for these representative cases are shown in Fig. 1. The effect of the atomic structure factor is not included in Fig. 1, for which $\hat{f} \equiv 1$. Also not included is the modification due to solvent contrast. The low-frequency signal is modified by the partial contrast-matching of solvent. In the work of Rosenthal & Henderson (2003) the remaining contrast is estimated to be 0.42, so the low-frequency spectral density should be modified accordingly.
The theoretical Guinier plots qualitatively resemble experimental Guinier plots, such as Fig. 8 of Rosenthal & Henderson (2003). For the larger molecule with R = 100 Å the power spectrum is approximately flat above $k^2$ = 0.01 Å$^{-2}$, corresponding to 10 Å resolution, whereas for the smaller molecule with R = 25 Å the transition occurs closer to $k^2$ = 0.015 Å$^{-2}$, or 8.2 Å resolution.
The notable oscillations in the Guinier plots are due to the oscillations of given by (7). Fig. 2 ▸ shows and (the latter is multiplied by 10 in order to make the two plots comparable in scale). We see that [i.e. the constant C in can be taken as ]. It is important to keep in mind that proteins are not perfectly spherically symmetric. Although oscillations in the Guinier plot are still expected (and are indeed observed), their magnitude and periodicity are shape dependent.
Theorem 3 implies that the transition to Wilson statistics in the Guinier plot occurs at , and for higher radial frequencies the spherically averaged power spectrum is approximately flat. The cutoff frequency can be determined by balancing the two terms in (12). Specifically, we require the second term of (12) to be at most . This criterion, together with the bound with imply
or . The radius of the unit cell (that occupies a single atom on average) satisfies = . Therefore, , and the dimensional cutoff frequency (in Å−1) is given by
in terms of the radius, or equivalently
in terms of the molecular weight. The cutoff frequency decreases with the size of the molecule, but the decrease is quite gradual due to the small exponent 1/12 in (23). For example, the cutoff frequency increases by just 47% when the molecular weight decreases by a factor of 100. For a large macromolecule with $M_w$ = 3.3 MDa and R = 100 Å the cutoff frequency is $k_c$ = 0.088 Å$^{-1}$, corresponding to 11.3 Å resolution. For a smaller macromolecule with $M_w$ = 52 kDa and R = 25 Å the cutoff frequency is $k_c$ = 0.125 Å$^{-1}$, corresponding to 8.0 Å resolution. These predictions are in agreement with our previous estimates for the cutoff frequencies that were obtained by observing Fig. 1. Fig. 3 illustrates the cutoff frequency as a function of the molecular size for radii ranging from 20 to 150 Å. The cutoff frequency is relatively stable and varies only a little across a wide range of molecular sizes (from 7.5 to 12.5 Å resolution). This behaviour and these resolutions are in agreement with empirical evidence about the validity regime of Wilson statistics (Rosenthal & Henderson, 2003).
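The quoted numbers follow from the scaling alone. Because the cutoff frequency scales as the molecular weight to the power −1/12, and the molecular weight of a sphere scales as $R^3$, the cutoff scales as $R^{-1/4}$. The sketch below checks the 47% figure and the relation between the two quoted cutoffs; it takes the value 0.088 Å$^{-1}$ from the text as given, reads resolution as $1/k_c$, and uses hypothetical function names.

```python
def cutoff_ratio_from_mass(mass_ratio):
    """Ratio of cutoff frequencies when the molecular weight changes by mass_ratio."""
    return mass_ratio ** (-1.0 / 12.0)

def cutoff_ratio_from_radius(radius_ratio):
    """Same ratio expressed through the radius, since molecular weight scales as R^3."""
    return radius_ratio ** (-0.25)

# Relative increase of the cutoff when the molecular weight drops 100-fold.
increase = cutoff_ratio_from_mass(1.0 / 100.0) - 1.0

# Rescale the quoted cutoff for R = 100 angstrom to R = 25 angstrom.
k_large = 0.088                                   # value quoted in the text
k_small = k_large * cutoff_ratio_from_radius(25.0 / 100.0)

print(round(100 * increase, 1), round(k_small, 4), 1.0 / k_large, 1.0 / k_small)
```

A hundredfold change in molecular weight moves the cutoff by less than a factor of 1.5, which is why a single resolution range describes the onset of Wilson statistics for most macromolecules.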
4. Generalizations and applications to cryo-EM
4.1. Existing applications to cryo-EM
A common practice in single-particle cryo-EM is to apply a filter to the reconstructed map. The filter boosts medium and high frequencies such that the power spectrum of the sharpened map is approximately flat and consistent with Wilson statistics (Rosenthal & Henderson, 2003; Fernández et al., 2008). The filter is an exponentially growing filter whose parameter is estimated using the Guinier plot. The boost of medium- and high-frequency components increases the contrast of many structural features of the map and helps to model the atomic structure. This is the so-called B-factor correction, B-factor flattening or B-factor sharpening. It is a tremendously effective method to increase the interpretability of the reconstructed map. In fact, most map depositions in the Electron Microscopy Data Bank (EMDB) only contain sharpened maps (Vilas et al., 2020). Map sharpening is still an active area of research and method development (see e.g. Jakobi et al., 2017; Kaur et al., 2021 and references therein). Wilson statistics is also used to reason about and extrapolate the number of particles required to reach high resolution (Rosenthal & Henderson, 2003).
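As a rough illustration of how such a filter can be estimated from a Guinier plot, the sketch below fits a straight line to the logarithm of the power spectrum against $k^2$ and flattens the spectrum with the resulting exponential. It assumes the common crystallographic convention in which amplitudes fall off as $\exp(-Bk^2/4)$, so that the power falls off as $\exp(-Bk^2/2)$; this convention, the synthetic data and the function names are assumptions made for the example, not a description of any particular software package.

```python
import math

def fit_b_factor(k_values, power_values):
    """Least-squares line through (k^2, ln P); returns B assuming P ~ exp(-B k^2 / 2)."""
    xs = [k * k for k in k_values]
    ys = [math.log(p) for p in power_values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -2.0 * slope

def sharpen_power(k_values, power_values, b_factor):
    """Amplitude filter exp(B k^2 / 4), i.e. the power is multiplied by exp(B k^2 / 2)."""
    return [p * math.exp(b_factor * k * k / 2.0) for k, p in zip(k_values, power_values)]

# Synthetic high-frequency band (0.10 to 0.30 inverse angstrom) with B = 100 square angstrom.
ks = [0.10 + 0.01 * j for j in range(21)]
true_b = 100.0
power = [5.0 * math.exp(-true_b * k * k / 2.0) for k in ks]

b_est = fit_b_factor(ks, power)
flat = sharpen_power(ks, power, b_est)
print(b_est, min(flat), max(flat))
```

On noise-free synthetic data the fit recovers the decay parameter and the sharpened power spectrum is constant, which is the flat profile that Wilson statistics predicts at high frequencies. In practice the fit would be restricted to the frequency band above the cutoff and limited by the noise level of the map.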
4.2. Generalization of Wilson statistics to covariance with application to three-dimensional iterative refinement
We now highlight a certain generalization of Wilson statistics with potential application to three-dimensional iterative refinement, arguably the main component of the computational pipeline for single-particle analysis (Singer & Sigworth, 2020). Specifically, the Bayesian inference framework underlying the popular software toolbox RELION (Scheres, 2012b) requires the covariance matrix of $\hat{\phi}$ and approximates it with a diagonal matrix (Scheres, 2012a). For tractable computation, the variance (the diagonal of the covariance matrix) is further assumed to be a radial function.
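The random bag of atoms model underlying Wilson statistics provides the covariance matrix
$$\mathrm{Cov}[\hat{\phi}(\xi_1), \hat{\phi}(\xi_2)] = \mathbb{E}\big[\hat{\phi}(\xi_1)\overline{\hat{\phi}(\xi_2)}\big] - \mathbb{E}[\hat{\phi}(\xi_1)]\,\overline{\mathbb{E}[\hat{\phi}(\xi_2)]} \qquad (24)$$
in closed form as
$$\mathrm{Cov}[\hat{\phi}(\xi_1), \hat{\phi}(\xi_2)] = N \hat{f}(\xi_1)\overline{\hat{f}(\xi_2)} \Big[\hat{g}\big(N^{1/3}(\xi_1 - \xi_2)\big) - \hat{g}(N^{1/3}\xi_1)\overline{\hat{g}(N^{1/3}\xi_2)}\Big]. \qquad (25)$$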
Before proving this result, note that it implies a vast reduction in the number of parameters needed to describe the covariance matrix. In general, for a three-dimensional map represented as an array of voxels, the covariance matrix is of size which requires entries, which is prohibitively large. However, (25) suggests that the covariance depends on only parameters. Furthermore, approximating by a radial function implies that the covariance depends on just parameters, the same number of parameters in the existing Bayesian inference method for three-dimensional iterative refinement. Moreover, comparing the two terms in (25), the decay of implies that whenever . Therefore, for
Since is largest for and decays with increasing distance , it follows from (26) that the covariance matrix restricted to frequencies above is approximately a band matrix with bandwidth , such that the diagonal is dominant and matrix entries decay when moving away from the diagonal. Note that is a very low frequency corresponding to resolution of the size of the protein (as implied by the scaling). Therefore, the covariance is well approximated by a band matrix with a very small number of diagonals. This serves as a theoretical justification for the diagonal approximation in the Bayesian inference framework (Scheres, 2012a ▸), as correlations of Fourier coefficients with are negligible. On the flip side, correlations for which should not be ignored and correctly accounting for them could potentially lead to further improvement of the Bayesian inference framework (Scheres, 2012a ▸).
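To prove (25), we evaluate the two terms in the right-hand side of (24) separately. The second term is directly obtained from (6) as
$$\mathbb{E}[\hat{\phi}(\xi_1)]\,\overline{\mathbb{E}[\hat{\phi}(\xi_2)]} = N^2 \hat{f}(\xi_1)\overline{\hat{f}(\xi_2)}\, \hat{g}(N^{1/3}\xi_1)\overline{\hat{g}(N^{1/3}\xi_2)}. \qquad (27)$$
To evaluate the first term, we substitute $\hat{\phi}(\xi_1)$ and $\hat{\phi}(\xi_2)$ by (5), separate the summation into diagonal terms ($i = j$) and off-diagonal terms ($i \neq j$) as in Wilson’s original argument, and use that the rescaled atom locations $y_i$ are i.i.d., resulting in
$$\mathbb{E}\big[\hat{\phi}(\xi_1)\overline{\hat{\phi}(\xi_2)}\big] = N \hat{f}(\xi_1)\overline{\hat{f}(\xi_2)} \Big[\hat{g}\big(N^{1/3}(\xi_1 - \xi_2)\big) + (N-1)\,\hat{g}(N^{1/3}\xi_1)\overline{\hat{g}(N^{1/3}\xi_2)}\Big]. \qquad (28)$$
Subtracting (27) from (28) proves (25). This is a generalization of Wilson statistics, as setting $\xi_1 = \xi_2$ reduces (28) to (12).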
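Note that the diagonal of the covariance matrix satisfies
$$\mathrm{Var}[\hat{\phi}(\xi)] = N|\hat{f}(\xi)|^2 \big[1 - |\hat{g}(N^{1/3}\xi)|^2\big]. \qquad (29)$$
The variance vanishes for $\xi = 0$ because $\hat{\phi}(0) = N\hat{f}(0)$ regardless of the atom positions. The small variance at very low frequencies shares the same origins as the Guinier law.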
In existing Bayesian inference approaches (Scheres, 2012a), the mean of each frequency voxel is assumed to be zero. However, comparing (6) and (29) for the mean $\mathbb{E}[\hat{\phi}(\xi)]$ and the variance $\mathrm{Var}[\hat{\phi}(\xi)]$, we see that the variance dominates the squared mean only for $|\xi| \gg N^{-1/12}$, which is the validity regime of Wilson statistics. It follows that it is justified to assume a zero-mean signal only for high frequencies, but not at low frequencies. Including an explicit (approximately radial) non-zero mean in the Bayesian inference framework may therefore bring further improvement.
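The low-frequency behaviour of the mean and variance can be checked by simulation. For ideal point atoms whose rescaled positions are uniform in the unit ball, a direct calculation under the i.i.d. model gives a mean of $N\hat{g}(N^{1/3}\xi)$ and a variance of $N[1 - |\hat{g}(N^{1/3}\xi)|^2]$. The sketch below compares these with Monte Carlo estimates at a low frequency where the mean is far from zero; the sign convention $e^{-\mathrm{i}\langle\xi, x\rangle}$, the parameter values and the function names are assumptions made for the example.

```python
import cmath
import math
import random

def random_point_in_unit_ball(rng):
    """Rejection-sample a point uniformly distributed inside the unit ball."""
    while True:
        p = [rng.uniform(-1.0, 1.0) for _ in range(3)]
        if sum(c * c for c in p) <= 1.0:
            return p

def g_hat_ball(t):
    """Fourier transform of the uniform density on the unit ball at radial frequency t."""
    return 1.0 if t == 0.0 else 3.0 * (math.sin(t) - t * math.cos(t)) / t ** 3

def sample_phi_hat(n_atoms, xi, rng):
    """One realization of phi_hat(xi) for point atoms located at x_i = N^(1/3) y_i."""
    scale = n_atoms ** (1.0 / 3.0)
    total = 0j
    for _ in range(n_atoms):
        y = random_point_in_unit_ball(rng)
        total += cmath.exp(-1j * scale * sum(a * b for a, b in zip(xi, y)))
    return total

rng = random.Random(1)
N, trials = 100, 2000
k = 2.0 / N ** (1.0 / 3.0)   # low frequency chosen so that N^(1/3) k = 2
samples = [sample_phi_hat(N, (k, 0.0, 0.0), rng) for _ in range(trials)]

mean_mc = sum(samples) / trials
var_mc = sum(abs(s - mean_mc) ** 2 for s in samples) / (trials - 1)

g = g_hat_ball(2.0)
mean_theory = N * g
var_theory = N * (1.0 - g * g)
print(mean_mc, mean_theory, var_mc, var_theory)
```

At this frequency the squared mean is far larger than the variance, which illustrates why a zero-mean prior is only appropriate in the high-frequency regime.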
4.3. Generalization of Wilson statistics to higher-order spectra with application to autocorrelation analysis
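Autocorrelation analysis, originally proposed by Kam (1977, 1980), has recently found revived interest for experiments using X-ray free-electron lasers (XFEL) (von Ardenne et al., 2018; Kurta et al., 2017; Liu et al., 2013) and cryo-EM (Sharon et al., 2020; Bendory et al., 2018, 2019). In autocorrelation analysis, the three-dimensional molecular structure is determined from the correlation statistics of the noisy images. Typically, the second- or third-order correlation functions are sufficient in principle to uniquely determine the structure (Bandeira et al., 2017; Sharon et al., 2020). It is therefore of interest to derive a third-order statistics analogue of (12). Specifically, $\mathbb{E}[\hat{\phi}(\xi_1)\hat{\phi}(\xi_2)\hat{\phi}(\xi_3)]$ is given by
$$\begin{aligned}
\mathbb{E}[\hat{\phi}(\xi_1)\hat{\phi}(\xi_2)\hat{\phi}(\xi_3)] = \hat{f}(\xi_1)\hat{f}(\xi_2)\hat{f}(\xi_3) \Big\{ & N \hat{g}\big(N^{1/3}(\xi_1 + \xi_2 + \xi_3)\big) \\
& + N(N-1)\big[\hat{g}\big(N^{1/3}(\xi_1 + \xi_2)\big)\hat{g}(N^{1/3}\xi_3) + \hat{g}\big(N^{1/3}(\xi_1 + \xi_3)\big)\hat{g}(N^{1/3}\xi_2) + \hat{g}\big(N^{1/3}(\xi_2 + \xi_3)\big)\hat{g}(N^{1/3}\xi_1)\big] \\
& + N(N-1)(N-2)\,\hat{g}(N^{1/3}\xi_1)\hat{g}(N^{1/3}\xi_2)\hat{g}(N^{1/3}\xi_3) \Big\}. \qquad (31)
\end{aligned}$$
This result is obtained by separating the sum over all triplets $(i, j, k)$ into five groups: $i = j = k$, $i = j \neq k$, $i = k \neq j$, $j = k \neq i$ and $i \neq j \neq k$ (all three indices distinct).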
Similar to the power spectrum which is the Fourier transform of the autocorrelation function, the bispectrum is the Fourier transform of the triple-correlation function. The bispectrum, like the power spectrum, is also shift-invariant. As such, it plays an important role in various autocorrelation analysis techniques. The expected bispectrum under the random bag of atoms model is obtained by setting in (31)
for .
The bispectrum drops from $N^3$ for $\xi_1 = \xi_2 = 0$ to N at high frequencies. This drop is even more pronounced than that of the power spectrum, which decreases from $N^2$ to N. This may lead to numerical difficulties in inverting the bispectrum as it has a large dynamic range, e.g. it spans eight orders of magnitude for $N = 10^4$.
The terms in the first two lines of (31) have similar behaviour to the power spectrum (12). The last term depends on the decay rate of . If as for the ball, then
for , which can be regarded as a generalization of Wilson statistics [e.g. (13)] to higher-order spectra. However, for higher-order spectra such as the bispectrum the behaviour at high frequencies is more involved. For example, taking and to be high frequencies does not imply is necessarily a high frequency, as can be readily seen by taking for which . For this particular choice of the expected bispectrum is always greater than .
5. Discussion
This paper provided the first formal mathematical derivation of Wilson statistics, offered generalizations to other statistics, and highlighted potential applications in structural biology.
The assumption underlying Wilson statistics of independent atom locations is too simplistic as it ignores correlations between atom positions in the protein. It is well known that the power spectrum deviates from Wilson statistics at frequencies that correspond to interatomic distances associated with secondary structure, such as α-helices, which produce a peak at 10 Å, and β-sheets, which produce a peak at 4.5 Å. A more refined model that includes such correlations is beyond the scope of this paper.
From the computational perspective, we note that numerical evaluation of Fourier transforms and power spectra associated with Wilson statistics involves computing sums of complex exponentials of the form (2). These can be efficiently computed as a type-1 three-dimensional non-uniform fast Fourier transform (NUFFT) (Dutt & Rokhlin, 1993). The computational complexity of a naïve procedure is $O(NM)$, where M is the number of target frequencies, whereas the asymptotic complexity of NUFFT is $O(N + M)$ (up to logarithmic factors). These considerations will be taken into account in future computational work for numerical validation of the theoretical predictions including comparison with the power spectra and bispectra of density maps created from atomic models (Sorzano et al., 2015).
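For reference, the naïve procedure mentioned above amounts to a double loop over atoms and target frequencies. The sketch below is a plain-Python illustration of that direct summation, with hypothetical function names and the sign convention $e^{-\mathrm{i}\langle\xi, x\rangle}$ assumed; it is not a NUFFT, which would replace the double loop by spreading onto a grid followed by a standard FFT.

```python
import cmath
import math

def direct_exponential_sums(points, freqs, weights=None):
    """Evaluate sum_j w_j exp(-i <xi, x_j>) for every xi in freqs by direct summation.

    The cost is proportional to len(points) * len(freqs).
    """
    if weights is None:
        weights = [1.0] * len(points)
    out = []
    for xi in freqs:
        acc = 0j
        for x, w in zip(points, weights):
            acc += w * cmath.exp(-1j * sum(a * b for a, b in zip(xi, x)))
        out.append(acc)
    return out

# Two point atoms separated by pi along the first axis.
points = [(0.0, 0.0, 0.0), (math.pi, 0.0, 0.0)]
freqs = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
values = direct_exponential_sums(points, freqs)
print([abs(v) ** 2 for v in values])
```

In this toy example the two contributions add at frequencies 0 and 2 and cancel at frequency 1. For N atoms and M frequencies on the order of those considered in this paper, the direct sum quickly becomes the bottleneck, which motivates the NUFFT.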
Wilson statistics is an instance of a universality phenomenon: all proteins regardless of their shape and specific atomic positions exhibit a similar spherically averaged power spectrum at high frequencies. From the computational standpoint in cryo-EM, this universality is a blessing and a curse at the same time. On the one hand, it enables one to correct the magnitudes of the Fourier coefficients of the reconstructed map so they agree with the theoretical prediction. On the other hand, it implies that the high-frequency part of the spherically averaged power spectrum is not particularly useful for structure determination, as it does not discriminate between molecules. The generalization of Wilson statistics to the higher-order spectra shows that the bispectrum also becomes flat at high frequencies. These observations may help explain difficulties of the autocorrelation approach as a high-resolution reconstruction method (Bendory et al., 2018).
Acknowledgments
The author is indebted to Nicholas Marshall, Fred Sigworth, Ti-Yen Lan, Tamir Bendory and Joe Kileel for valuable discussions and comments.
Funding Statement
This work was funded by Air Force Office of Scientific Research grants FA9550-17-1-0291 and FA9550-20-1-0266; Simons Foundation grant Math+X Investigator; Gordon and Betty Moore Foundation grant Data-Driven Discovery Investigator; National Science Foundation, Directorate for Mathematical and Physical Sciences grant DMS-2009753; National Science Foundation, Directorate for Computer and Information Science and Engineering grant IIS-1837992; National Institute of General Medical Sciences grant 1R01GM136780-01.
References
- Ardenne, B. von, Mechelke, M. & Grubmüller, H. (2018). Nat. Commun. 9, 2375.
- Bandeira, A. S., Blum-Smith, B., Kileel, J., Perry, A., Weed, J. & Wein, A. S. (2017). arXiv:1712.10163.
- Bendory, T., Boumal, N., Leeb, W., Levin, E. & Singer, A. (2018). arXiv:1810.00226.
- Bendory, T., Boumal, N., Leeb, W., Levin, E. & Singer, A. (2019). Inverse Probl. 35, 104003.
- Brandolini, L., Hofmann, S. & Iosevich, A. (2003). Geom. Funct. Anal. GAFA, 13, 671–680.
- Debye, P. (1915). Ann. Phys. 351, 809–823.
- Drenth, J. (2007). Principles of Protein X-ray Crystallography. New York, NY: Springer Science & Business Media.
- Dutt, A. & Rokhlin, V. (1993). SIAM J. Sci. Comput. 14, 1368–1393.
- Fernández, J., Luque, D., Castón, J. & Carrascosa, J. (2008). J. Struct. Biol. 164, 170–175.
- Henderson, R. (1995). Q. Rev. Biophys. 28, 171–193.
- Jakobi, A. J., Wilmanns, M. & Sachse, C. (2017). Elife, 6, e27131.
- Kam, Z. (1977). Macromolecules, 10, 927–934.
- Kam, Z. (1980). J. Theor. Biol. 82, 15–39.
- Kaur, S., Gomez-Blanco, J., Khalifa, A. A., Adinarayanan, S., Sanchez-Garcia, R., Wrapp, D., McLellan, J. S., Bui, K. H. & Vargas, J. (2021). Nat. Commun. 12, 1240.
- Kurta, R. P., Donatelli, J. J., Yoon, C. H., Berntsen, P., Bielecki, J., Daurer, B. J., DeMirci, H., Fromme, P., Hantke, M. F., Maia, F. R., Munke, A., Nettelblad, C., Pande, K., Reddy, H. K. N., Sellberg, J. A., Sierra, R. G., Svenda, M., van der Schot, G., Vartanyants, I. A., Williams, G. J., Xavier, P. L., Aquila, A., Zwart, P. H. & Mancuso, A. P. (2017). Phys. Rev. Lett. 119, 158102.
- Liu, H., Poon, B. K., Saldin, D. K., Spence, J. C. H. & Zwart, P. H. (2013). Acta Cryst. A69, 365–373.
- Porod, G. (1951). Kolloid-Zeitschrift, 124, 83–114.
- Porod, G. (1982). Small Angle X-ray Scattering, pp. 17–51. London, UK: Academic Press.
- Rosenthal, P. B. & Henderson, R. (2003). J. Mol. Biol. 333, 721–745.
- Scheres, S. H. (2012a). J. Mol. Biol. 415, 406–418.
- Scheres, S. H. (2012b). J. Struct. Biol. 180, 519–530.
- Sharon, N., Kileel, J., Khoo, Y., Landa, B. & Singer, A. (2020). Inverse Probl. 36, 044003.
- Singer, A. & Sigworth, F. J. (2020). Annu. Rev. Biomed. Data Sci. 3, 163–190.
- Sorzano, C. O. S., Vargas, J., Otón, J., Abrishami, V., de la Rosa-Trevín, J. M., del Riego, S., Fernández-Alderete, A., Martínez-Rey, C., Marabini, R. & Carazo, J.-M. (2015). AIMS Biophysics, 2, 8–20.
- Stein, E. M. & Shakarchi, R. (2011). Functional Analysis: Introduction to Further Topics in Analysis, Vol. 4. Princeton University Press.
- Vilas, J., Vargas, J., Martínez, M., Ramírez-Aportela, E., Melero, R., Jimenez-Moreno, A., Garduño, E., Conesa, P., Marabini, R., Maluenda, D., Carazo, J. M. & Sorzano, C. O. S. (2020). J. Struct. Biol. 209, 107447.
- Wilson, A. (1942). Nature, 150, 152.
- Wilson, A. J. C. (1949). Acta Cryst. 2, 318–321.