Abstract
In cryo-electron microscopy (cryo-EM), a microscope generates a top view of a sample of randomly oriented copies of a molecule. The problem of single particle reconstruction (SPR) from cryo-EM is to use the resulting set of noisy two-dimensional projection images taken at unknown directions to reconstruct the three-dimensional (3D) structure of the molecule. In some situations, the molecule under examination exhibits structural variability, which poses a fundamental challenge in SPR. The heterogeneity problem is the task of mapping the space of conformational states of a molecule. It has been previously suggested that the leading eigenvectors of the covariance matrix of the 3D molecules can be used to solve the heterogeneity problem. Estimating the covariance matrix is challenging, since only projections of the molecules are observed, but not the molecules themselves. In this paper, we formulate a general problem of covariance estimation from noisy projections of samples. This problem has intimate connections with matrix completion problems and high-dimensional principal component analysis. We propose an estimator and prove its consistency. When there are finitely many heterogeneity classes, the spectrum of the estimated covariance matrix reveals the number of classes. The estimator can be found as the solution to a certain linear system. In the cryo-EM case, the linear operator to be inverted, which we term the projection covariance transform, is an important object in covariance estimation for tomographic problems involving structural variation. Inverting it involves applying a filter akin to the ramp filter in tomography. We design a basis in which this linear operator is sparse and thus can be tractably inverted despite its large size. We demonstrate via numerical experiments on synthetic datasets the robustness of our algorithm to high levels of noise.
Keywords: cryo-electron microscopy, X-ray transform, inverse problems, structural variability, classification, heterogeneity, covariance matrix estimation, principal component analysis, high-dimensional statistics, Fourier projection slice theorem, spherical harmonics
1. Introduction
1.1. Covariance matrix estimation from projected data
Covariance matrix estimation is a fundamental task in statistics. Statisticians have long grappled with the problem of estimating this statistic when the samples are only partially observed. In this paper, we consider this problem in the general setting where “partial observations” are arbitrary linear projections of the samples onto a lower-dimensional space.
Problem 1.1
Let X be a random vector on C^p, with E[X] = μ0 and Var[X] = Σ0 (Var[X] denotes the covariance matrix of X). Suppose also that P is a random q × p matrix with complex entries, and E is a random vector in C^q with E[E] = 0 and Var[E] = σ²I_q. Finally, let I denote the random vector in C^q given by
$$\mathbf{I} = \mathbf{P}\mathbf{X} + \mathbf{E}. \tag{1.1}$$
Assume now that X, P, and E are independent. Estimate μ0 and Σ0 given observations I1, …, In and P1, …, Pn of I and P, respectively.
Here, and throughout this paper, we write random quantities in boldface to distinguish them from deterministic quantities. We use regular font (e.g., X) for vectors and matrices, calligraphic font (e.g., X ) for functions, and script font for function spaces (e.g., B). We denote true parameter values with a subscript of zero (e.g., μ0), estimated parameter values with a subscript of n (e.g., μn), and generic variables with no subscript (e.g., μ).
Problem 1.1 is quite general, and has many practical applications as special cases. The main application this paper addresses is the heterogeneity problem in single particle reconstruction (SPR) from cryo-electron microscopy (cryo-EM). SPR from cryo-EM is an inverse problem where the goal is to reconstruct a three-dimensional (3D) molecular structure from a set of its two-dimensional (2D) projections from random directions [12]. The heterogeneity problem deals with the situation in which the molecule to be reconstructed can exist in several structural classes. In the language of Problem 1.1, X represents a discretization of the molecule (random due to heterogeneity), Ps the 3D-to-2D projection matrices, and Is the noisy projection images. The goal of this paper is to estimate the covariance matrix associated with the variability of the molecule. If there is a small, finite number (C) of classes, then Σ0 has low rank (C − 1). This ties the heterogeneity problem to principal component analysis (PCA) [40]. If Σ0 has eigenvectors V1,… , Vp (called principal components) corresponding to eigenvalues λ1 ≥ ··· ≥ λp, then PCA states that Vi accounts for a variance of λi in the data. In modern applications, the dimensionality p is often large, while X typically has much fewer intrinsic degrees of freedom [11]. The heterogeneity problem is an example of such a scenario; for this problem, we demonstrate later that the top principal components can be used in conjunction with the images to reconstruct each of the C classes.
Another class of applications closely related to Problem 1.1 is missing data problems in statistics. In these problems, X1,… , Xn are samples of a random vector X. The statistics of this random vector must be estimated in a situation where certain entries of the samples Xs are missing [31]. This amounts to choosing Ps to be coordinate-selection operators, operators which output a certain subset of the entries of a vector. An important problem in this category is PCA with missing data, which is the task of finding the top principal components when some data are missing. Closely related to this is the noisy low rank matrix completion problem [9]. In this problem, only a subset of the entries of a low rank matrix A are known (possibly with some error), and the task is to fill in the missing entries. If we let Xs be the columns of A, then the observed variables in each column are PsXs + Es, where Ps acts on Xs by selecting a subset of its coordinates, and Es is noise. Note that the matrix completion problem involves filling in the missing entries of Xs, while Problem 1.1 requires us only to find the covariance matrix of these columns. However, the two problems are closely related. For example, if the columns are distributed normally, then the missing entries can be found as their expectations conditioned on the known variables [51]. Alternatively, we can find the missing entries by choosing the linear combinations of the principal components that best fit the known matrix entries. A well-known application of matrix completion is in the field of recommender systems (also known as collaborative filtering). In this application, users rate the products they have consumed, and the task is to determine what new products they would rate highly. We obtain this problem by interpreting Ai,j as the jth user’s rating of product i. In recommender systems, it is assumed that only a few underlying factors determine users’ preferences. Hence, the data matrix A should have low rank. 
A high profile example of recommender systems is the Netflix prize problem [6].
In both of these classes of problems, Σ0 is large but should have low rank. Despite this, note that Problem 1.1 does not have a low rank assumption. Nevertheless, as our numerical results demonstrate, the spectrum of our (unregularized) covariance matrix estimator reveals low rank structure when it is present in the data. Additionally, the framework we develop in this paper naturally allows for regularization.
Having introduced Problem 1.1 and its applications, let us delve more deeply into one particular application: SPR from cryo-EM.
1.2. Cryo-electron microscopy
Electron microscopy is an important tool for structural biologists, as it allows them to determine complex 3D macromolecular structures. A general technique in electron microscopy is called SPR. In the basic setup of SPR, the data collected are 2D projection images of ideally assumed identical, but randomly oriented, copies of a macromolecule. In particular, one specimen preparation technique used in SPR is called cryo-EM, in which the sample of molecules is rapidly frozen in a thin ice layer [12, 63]. The electron microscope provides a top view of the molecules in the form of a large image called a micrograph. The projections of the individual particles can be picked out from the micrograph, resulting in a set of projection images. Mathematically, we can describe the imaging process as follows. Let X : R3 → R represent the Coulomb potential induced by the unknown molecule. We scale the problem to be dimension-free in such a way that most of the “mass” of X lies within the unit ball B ⊂ R3 (since we later model X to be bandlimited, we cannot quite assume it is supported in B). To each copy of this molecule corresponds a rotation R ∈ SO(3), which describes its orientation in the ice layer. The idealized forward projection operator P = P(R) : L1(R3) → L1(R2) applied by the microscope is the X-ray transform
$$(\mathcal{P}\mathcal{X})(x, y) = \int_{\mathbb{R}} \mathcal{X}(R^T r)\, dz, \tag{1.2}$$
where r = (x, y, z)^T. Hence, P first rotates X by R, and then integrates along vertical lines to obtain the projection image. The microscope yields the image PX, discretized onto an N × N Cartesian grid, where each pixel is also corrupted by additive noise. Let there be q ≈ πN²/4 pixels contained in the inscribed disc of an N × N grid (the remaining pixels contain little or no signal because X is concentrated in B). If S : L^1(R^2) → R^q is a discretization operator, then the microscope produces images I given by
$$\mathbf{I} = S\mathcal{P}\mathcal{X} + \mathbf{E}, \tag{1.3}$$
with E ~ N (0, σ2Iq ), where for the purposes of this paper we assume additive white Gaussian noise. The microscope has an additional blurring effect on the images, a phenomenon we will discuss shortly, but will leave out of our model. Given a set of images I1,… , In, the cryo-EM problem is to estimate the orientations R1,… , Rn of the underlying volumes and reconstruct X . Note that throughout this paper, we will use “cryo-EM” and “cryo-EM problem” as shorthand for the SPR problem from cryo-EM images; we also use “volume” as a synonym for “3D structure.”
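As an illustration of the forward model (1.2)-(1.3), the following minimal sketch simulates one noisy projection image. It rests on simplifying assumptions that are not part of the paper: the molecule is replaced by a handful of weighted point masses (for which the line integral in (1.2) reduces to dropping the z-coordinate of each rotated position), the discretization S is nearest-pixel binning on [-1, 1]², and the helper names `rotation_z`, `project_point_masses`, and `noisy_image` are hypothetical.

```python
import math
import random

def rotation_z(theta):
    """Rotation by angle theta about the z-axis, as a nested-list 3x3 matrix."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def project_point_masses(atoms, R, N):
    """Noise-free projection of point masses onto an N x N grid covering [-1, 1]^2.

    atoms: list of (weight, (x, y, z)) with positions inside the unit ball.
    R:     3x3 rotation matrix; each position a is moved to R a, which is the
           point-mass analogue of evaluating the density at R^T r.
    """
    image = [[0.0] * N for _ in range(N)]
    for w, a in atoms:
        # Rotate, then integrate along z: only the x and y coordinates survive.
        x = sum(R[0][k] * a[k] for k in range(3))
        y = sum(R[1][k] * a[k] for k in range(3))
        col = min(N - 1, max(0, int((x + 1.0) / 2.0 * N)))
        row = min(N - 1, max(0, int((y + 1.0) / 2.0 * N)))
        image[row][col] += w
    return image

def noisy_image(atoms, R, N, sigma, rng):
    """Add independent N(0, sigma^2) noise to every pixel of the clean projection."""
    clean = project_point_masses(atoms, R, N)
    return [[v + rng.gauss(0.0, sigma) for v in row] for row in clean]

# Usage: two point masses viewed from two orientations.
atoms = [(1.0, (0.6, 0.1, 0.3)), (2.0, (0.1, 0.1, -0.4))]
top_view = project_point_masses(atoms, rotation_z(0.0), 4)
rotated_view = project_point_masses(atoms, rotation_z(math.pi / 2), 4)
observed = noisy_image(atoms, rotation_z(0.0), 4, 0.1, random.Random(0))
```

The total mass of a noise-free projection does not depend on the rotation, which is a quick sanity check on any discretization of (1.2).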
The cryo-EM problem is challenging for several reasons. Unlike most other imaging modalities of computerized tomography, the rotations Rs are unknown, so we must estimate them before reconstructing X . This challenge is one of the major hurdles to reconstruction in cryo-EM. Since the images are not perfectly centered, they also contain in-plane translations, which must be estimated as well. The main challenge in rotation estimation is that the projection images are corrupted by extreme levels of noise. This problem arises because only low electron doses can scan the molecule without destroying it. To an extent, this problem is mitigated by the fact that cryo-EM datasets often have tens or even hundreds of thousands of images, which makes the reconstruction process more robust. Another issue with transmission electron microscopy in general is that technically, the detector only registers the magnitude of the electron wave exiting the specimen. Zernike realized in the 1940s that the phase information could also be recovered if the images were taken out of focus [60]. While enabling measurement of the full output of the microscope, this out-of-focus imaging technique produces images representing the convolution of the true image with a point spread function (PSF). The Fourier transform of the PSF is called the contrast transfer function (CTF). Thus the true images are multiplied by the CTF in the Fourier domain to produce the output images. Hence, the Ps operators in practice also include the blurring effect of a CTF. This results in a loss of information at the zero crossings of the (Fourier-domain) CTF and at high frequencies [12]. In order to compensate for the former effect, images are taken with several different defocus values, whose corresponding CTFs have different zero crossings.
The field of cryo-EM has recently seen a drastic improvement in detector technology. New direct electron detector cameras have been developed, which, according to a recent article in Science, have “unprecedented speed and sensitivity” [24]. This technology has enabled SPR from cryo-EM to succeed on smaller molecules (down to about 150 kDa) and achieve higher resolutions (up to 3 Å) than before. Such high resolution allows tracing of the polypeptide chain and identification of residues in protein molecules [28, 3, 15, 34, 68]. Recently, single particle methods have provided high resolution structures of the TRPV1 ion channel [30] and of the large subunit of the yeast mitochondrial ribosome [1]. While X-ray crystallography is still the imaging method of choice for small molecules, cryo-EM now holds the promise of reconstructing larger, biomedically relevant molecules not amenable to crystallization.
The most common method for solving the basic cryo-EM problem is guessing an initial structure and then performing an iterative refinement procedure, where iterations alternate between (1) estimating the rotations of the experimental images by matching them with projections of the current 3D model and (2) tomographic inversion producing a new 3D model based on the experimental images and their estimated rotations [12, 61, 44]. There are no convergence guarantees for this iterative scheme, and the initial guess can incur bias in the reconstruction. An alternative is to estimate the rotations and reconstruct an accurate initial structure directly from the data. Such an ab initio structure is a much better initialization for the iterative refinement procedure. This strategy helps avoid bias and reduce the number of refinement iterations necessary to converge [70]. In the ab initio framework, rotations can be estimated by one of several techniques (see, e.g., [55, 64] and references therein).
1.3. Heterogeneity problem
As presented above, a key assumption in the cryo-EM problem is that the sample consists of (rotated versions of) identical molecules. However, in many datasets this assumption does not hold. Some molecules of interest exist in more than one conformational state. For example, a subunit of the molecule might be present or absent, have a few different arrangements, or be able to move continuously from one position to another. These structural variations are of great interest to biologists, as they provide insight into the functioning of the molecule. Unfortunately, standard cryo-EM methods do not account for heterogeneous samples. New techniques must be developed to map the space of molecules in the sample, rather than just reconstruct a single volume. This task is called the heterogeneity problem. A common case of heterogeneity is when the molecule has a finite number of dominant conformational classes. In this discrete case, the goal is to provide biologists with 3D reconstructions of all these structural states. While cases of continuous heterogeneity are possible, in this paper we mainly focus on the discrete heterogeneity scenario.
While we do not investigate the 3D rotation estimation problem in the heterogeneous case, we conjecture that this problem can be solved without developing sophisticated new tools. Consider, for example, the case when the heterogeneity is small, i.e., the volumes X1,… , Xn can be rotationally aligned so they are all close to their mean (in some norm). For example, this property holds when the heterogeneity is localized (e.g., as in Figure 1). In this case, one might expect that by first assuming homogeneity, existing rotation estimation methods would yield accurate results. Even if the heterogeneity is large, an iterative scheme can be devised to alternately estimate the rotations and conformations until convergence (though this convergence is local, at best). Thus, in this publication, we assume that the 3D rotations Rs (and in-plane translations) have already been estimated.
With the discrete heterogeneity and known rotations assumptions, we can formulate the heterogeneity problem as follows.
Problem 1.2 (heterogeneity problem)
Suppose a heterogeneous molecule can take on one of C different states: X^1, …, X^C ∈ B, where B is a finite-dimensional space of bandlimited functions (see section 3.2). Let Ω = {1, 2, …, C} be a sample space, and p1, …, pC probabilities (summing to one) so that the molecule assumes state c with probability pc. Represent the molecule as a random field X : Ω × R^3 → R, with
$$\mathbb{P}[\boldsymbol{\mathcal{X}} = \mathcal{X}^c] = p_c, \quad c = 1, \dots, C. \tag{1.4}$$
Let R be a random rotation with some distribution over SO(3), and define the corresponding random projection P = P(R) (see (1.2)). Finally, E ~ N(0, σ²I_q). Assume that X, R, E are independent. A random image of a particle is obtained via
$$\mathbf{I} = S\boldsymbol{\mathcal{P}}\boldsymbol{\mathcal{X}} + \mathbf{E}, \tag{1.5}$$
where S : L^1(R^2) → R^q is a discretization operator. Given observations I1, …, In and R1, …, Rn of I and R, respectively, estimate the number of classes C, the structures X^c, and the probabilities pc.
Note that SP|B is a (random) linear operator between finite-dimensional spaces, and so it has a matrix version P : Rp → Rq , where p = dim B. If we let X be the random vector on Rp obtained by expanding X in the basis for B, then we recover the equation I = PX + E from Problem 1.1. Thus, the main factors distinguishing Problem 1.2 from Problem 1.1 are that the former assumes a specific form for P and posits a discrete distribution on X. As we discuss in section 4, Problem 1.2 can be solved by first estimating the covariance matrix as in Problem 1.1, finding coordinates for each image with respect to the top eigenvectors of this matrix, and then applying a standard clustering procedure to these coordinates.
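To see why C classes lead to a covariance matrix of rank C − 1, the following short calculation may help. It is a sketch in which X^c ∈ R^p denotes the coefficient vector of the cth structure in the basis for B (a notation assumed here for illustration):

$$\mu_0 = \sum_{c=1}^{C} p_c X^c, \qquad \Sigma_0 = \sum_{c=1}^{C} p_c (X^c - \mu_0)(X^c - \mu_0)^H.$$

The range of Σ0 is spanned by the C vectors X^c − μ0, and these satisfy the linear relation $\sum_{c} p_c (X^c - \mu_0) = 0$. Hence rank Σ0 ≤ C − 1, with equality when all pc > 0 and the X^c are affinely independent. For C = 2, the expression collapses to

$$\Sigma_0 = p_1 p_2 (X^1 - X^2)(X^1 - X^2)^H,$$

a rank-one matrix whose single principal component is proportional to the difference between the two structures.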
One of the main difficulties of the heterogeneity problem is that, compared to usual SPR, we must deal with an even lower effective signal-to-noise ratio (SNR). Indeed, the signal we seek to reconstruct is the variation of the molecules around their mean, as opposed to the mean volume itself. We propose a precise definition of SNR in the context of the heterogeneity problem in section 7.1. Another difficulty is the indirect nature of our problem. Although the heterogeneity problem is an instance of a clustering problem, it differs from the usual ones in that we do not have access to the objects we are trying to cluster; only projections of these objects onto a lower-dimensional space are available. This makes it challenging to apply any standard clustering technique directly.
The heterogeneity problem is considered one of the most important problems in cryo-EM. In his 2013 Solvay public lecture on cryo-EM, Dr. Joachim Frank emphasized the importance of “the ability to obtain an entire inventory of coexisting states of a macromolecule from a single sample” [13]. Speaking of approaches to the heterogeneity problem in a review article, Frank discussed “the potential these new technologies will have in exploring functionally relevant states of molecular machines” [14]. It is stressed there that much room for improvement remains; current methods cannot automatically identify the number of conformational states and have trouble distinguishing between similar conformations.
1.4. Previous work
Much work related to Problems 1.1 and 1.2 has already been done. There is a rich statistical literature on the covariance estimation problem in the presence of missing data, a special case of Problem 1.1. In addition, work on the low rank matrix sensing problem (a generalization of matrix completion) is also closely related to Problem 1.1. Regarding Problem 1.2, several approaches to the heterogeneity problem have been proposed in the cryo-EM literature.
1.4.1. Work related to Problem 1.1
Many approaches to covariance matrix estimation from missing data have been proposed in the statistics literature [31]. The simplest approach to dealing with missing data is to ignore the samples with any unobserved variables. Another simple approach is called available case analysis, in which the statistics are constructed using all the available values. For example, the (i, j) entry of the covariance matrix is constructed using all samples for which the ith and jth coordinates are simultaneously observed. These techniques work best under certain assumptions on the pattern of missing entries, and more sophisticated techniques are preferred [31]. One of the most established such approaches is maximum likelihood estimation (MLE). This involves positing a probability distribution on X (e.g., multivariate normal) and then maximizing the likelihood of the observed partial data with respect to the parameters of the model. Such an approach to fitting models from partial observations was known as early as the 1930s, when Wilks used it for the case of a bivariate normal distribution [66]. Wilks proposed to maximize the likelihood using a gradient-based optimization approach. In 1977, Dempster, Laird, and Rubin introduced the expectation-maximization (EM) algorithm [10] to solve maximum likelihood problems. The EM algorithm is one of the most popular methods for solving missing data problems in statistics. Also, there is a class of approaches to missing data problems called imputation, in which the missing values are filled either by averaging the available values or through more sophisticated regression-based techniques. Finally, see [32, 33] for other approaches to related problems.
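The available case analysis described above can be sketched in a few lines. This is a minimal illustration of one common variant (each mean uses every available value of that coordinate, each covariance entry uses the samples in which both coordinates are observed), for real-valued data; the function name `available_case_estimates` and the mask representation are assumptions made for this example.

```python
def available_case_estimates(samples, masks):
    """Available-case mean and covariance for partially observed real vectors.

    samples[s][i]: value of coordinate i in sample s (ignored where unobserved).
    masks[s][i]:   True if coordinate i of sample s was observed.
    Every pair (i, j) must be jointly observed in at least one sample,
    otherwise the corresponding average is undefined (division by zero).
    """
    n, p = len(samples), len(samples[0])

    # Mean of coordinate i: average over the samples in which i is observed.
    mu = []
    for i in range(p):
        observed = [samples[s][i] for s in range(n) if masks[s][i]]
        mu.append(sum(observed) / len(observed))

    # Entry (i, j): average of centered products over samples observing both.
    cov = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            both = [s for s in range(n) if masks[s][i] and masks[s][j]]
            total = sum((samples[s][i] - mu[i]) * (samples[s][j] - mu[j])
                        for s in both)
            cov[i][j] = total / len(both)
    return mu, cov

# Usage: three samples of a 2-vector, the second coordinate missing once.
samples = [[1.0, 2.0], [3.0, None], [5.0, 4.0]]
masks = [[True, True], [True, False], [True, True]]
mu, cov = available_case_estimates(samples, masks)
```

Because different entries are averaged over different subsets of samples, the resulting matrix is not guaranteed to be positive semidefinite.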
Closely related to covariance estimation from missing data is the problem of PCA with missing data. In this problem, the task is to find the leading principal components, and not necessarily the entire covariance matrix. Not surprisingly, EM-type algorithms are popular for this problem as well. These algorithms often search directly for the low rank factors. See [18] for a survey of approaches to PCA with missing data. Closely related to PCA with missing data is the low rank matrix completion problem. Many of the statistical methods discussed above are also applicable to matrix completion. In particular, EM algorithms to solve this problem are popular, e.g., [51, 27].
Another more general problem setup related to Problem 1.1 is the low rank matrix sensing problem, which generalizes the low rank matrix completion problem. Let A ∈ Rp×n be an unknown rank-k matrix, and let M : Rp×n → Rd be a linear map, called the sensing matrix. We would like to find A, but we only have access to the (possibly noisy) data M(A). Hence, the low rank matrix sensing problem can be formulated as follows [19]:
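$$\text{minimize } \operatorname{rank}(A) \quad \text{subject to} \quad \mathcal{M}(A) = b, \tag{1.6}$$
where b ∈ R^d denotes the measured data.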
Note that when Σ0 is low rank, Problem 1.1 is a special case of the low rank matrix sensing problem. Indeed, consider putting the unknown vectors X1,… , Xn together as the columns of a matrix A. The rank of this matrix is the number of degrees of freedom in X (in the cryo-EM problem, this relates to the number of heterogeneity classes of the molecule). The linear projections P1,… , Pn can be combined into one sensing matrix M acting on A. In this way, our problem falls into the realm of matrix sensing.
One of the first algorithms for matrix sensing was inspired by the compressed sensing theory [46]. This approach uses a matrix version of l1 regularization called nuclear norm regularization. The nuclear norm is the sum of the singular values of a matrix, and is a convex proxy for its rank. Another approach to this problem is alternating minimization, which decomposes A into a product of the form UV T and iteratively alternates between optimizing with respect to U and V . The first proof of convergence for this approach was given in [19]. Both the nuclear norm and alternating minimization approaches to the low rank matrix sensing problem require a restricted isometry property on M for theoretical guarantees.
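To make the alternating minimization idea concrete, here is a minimal sketch for the simplest special case: rank-one matrix completion, where A is modeled as u v^T and only some entries are observed. It is an illustration under these assumptions, not the algorithm analyzed in [19]; the function name `rank_one_completion` and the dictionary format for observed entries are hypothetical.

```python
def rank_one_completion(entries, n_rows, n_cols, n_iters=200):
    """Fit A[i][j] ~ u[i] * v[j] to the observed entries by alternating least squares.

    entries: dict mapping (i, j) -> observed value.
    Every row and every column must contain at least one observed entry, and the
    factors must stay nonzero, otherwise a denominator below vanishes.
    """
    u = [1.0] * n_rows
    v = [1.0] * n_cols
    for _ in range(n_iters):
        # With v fixed, each u[i] solves a one-dimensional least-squares problem.
        for i in range(n_rows):
            num = sum(a * v[j] for (r, j), a in entries.items() if r == i)
            den = sum(v[j] ** 2 for (r, j) in entries if r == i)
            u[i] = num / den
        # With u fixed, each v[j] is updated in the same way.
        for j in range(n_cols):
            num = sum(a * u[i] for (i, c), a in entries.items() if c == j)
            den = sum(u[i] ** 2 for (i, c) in entries if c == j)
            v[j] = num / den
    return u, v

# Usage: the matrix [[2, 4], [3, 6]] with the (1, 1) entry missing.
entries = {(0, 0): 2.0, (0, 1): 4.0, (1, 0): 3.0}
u, v = rank_one_completion(entries, 2, 2)
filled_in = u[1] * v[1]  # estimate of the missing entry
```

Note that the rank (here, one) is fixed in advance, which is the limitation of this family of methods discussed in the text.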
While the aforementioned algorithms are widely used, we believe they have limitations as well. EM algorithms require postulating a distribution over the data and are susceptible to getting trapped in local optima. Regarding the former point, Problem 1.1 avoids any assumptions on the distribution of X, so our estimator should have the same property. Matrix sensing algorithms (especially alternating minimization) often assume that the rank is known in advance. However, there is no satisfactory statistical theory for choosing the rank. By contrast, the estimator we propose for Problem 1.1 allows automatic rank estimation.
1.4.2. Work related to Problem 1.2
Several approaches to the heterogeneity problem have been proposed. Here we give a brief overview of some of these approaches.
One approach is based on the notion of common lines. By the Fourier projection slice theorem (see Theorem 3.1), the Fourier transforms of any two projection images of an object will coincide on a line through the origin, called a common line. The idea of Shatsky et al. [52] was to use common lines as a measure of how likely it is that two projection images correspond to the same conformational class. Specifically, given two projection images and their corresponding rotations, we can take their Fourier transforms and correlate them on their common line. From there, a weighted graph of the images is constructed, with edges weighted based on this common line measure. Then spectral clustering is applied to this graph to classify the images. An earlier common lines approach to the heterogeneity problem is described in [16].
Another approach is based on MLE. It involves positing a probability distribution over the space of underlying volumes, and then maximizing the likelihood of the images with respect to the parameters of the distribution. For example, Wang et al. [65] model the heterogeneous molecules as a mixture of Gaussians and employ the EM algorithm to find the parameters. A challenge with MLE approaches is that the resulting objective functions are nonconvex and have a complicated structure. For more discussion of the theory and practice of maximum likelihood methods, see [53] and [50], respectively. Also see [49] for a description of a software package which uses maximum likelihood to solve the heterogeneity problem.
A third approach to the heterogeneity problem is to use the covariance matrix of the set of original molecules. Penczek, Kimmel, and Spahn outline a bootstrapping approach in [43] (see also [41, 42, 67, 29]). In this approach, one repeatedly takes random subsets of the projection images and reconstructs 3D volumes from these samples. Then, one can perform PCA on this set of reconstructed volumes, which yields a few dominant “eigenvolumes.” Penczek, Kimmel, and Spahn propose to then produce mean-subtracted images by subtracting projections of the mean volume from the images. The next step is to project each of the dominant eigenvolumes in the directions of the images, and then obtain a set of coordinates for each image based on its similarity with each of the eigenvolume projections. Finally, using these coordinates, this resampling approach proceeds by applying a standard clustering algorithm such as K-means to classify the images into classes.
While existing methods for the heterogeneity problem have their success stories, each suffers from its own shortcomings: the common line approach does not exploit all the available information in the images, the maximum likelihood approach requires explicit a priori distributions and is susceptible to local optima, and the bootstrapping approach based on covariance matrix estimation is a heuristic sampling method that lacks theoretical guarantees.
Note that the above overview of the literature on the heterogeneity problem is not comprehensive. For example, very recently, an approach to the heterogeneity problem based on normal mode analysis was proposed [20].
1.5. Our contribution
In this paper, we propose and analyze a covariance matrix estimator Σn to solve the general statistical problem (Problem 1.1), and then apply this estimator to the heterogeneity problem (Problem 1.2).
Our covariance matrix estimator has several desirable properties. First, we prove that the estimator is consistent as n → ∞ for fixed p, q. Second, our estimator does not require a prior distribution on the data, unlike MLE methods. Third, when the data have low intrinsic dimension, our method does not require knowing the rank of Σ0 in advance. The rank can be estimated from the spectrum of the estimated covariance matrix. This sets our method apart from alternating minimization algorithms that search for the low rank matrix factors themselves. Fourth, our estimator is given in closed form and its computation requires only a single linear inversion.
To implement our covariance matrix estimator in the cryo-EM case, we must invert a high-dimensional matrix Ln (see definition (2.8)). The size of this matrix is so large that typically it cannot even be stored on a computer; thus, inverting Ln is the greatest practical challenge we face. We consider two ways of addressing this challenge. In the primary approach we consider, we replace Ln by its limiting operator L, which does not depend on the rotations Rs and is a good approximation of Ln as long as these rotations are distributed uniformly enough. We then carefully construct new bases for images and volumes to make L a sparse, block diagonal matrix. While L has dimensions on the order of N_res^6 × N_res^6, this matrix has only O(N_res^9) total nonzero entries in the bases we construct, where N_res is the grid size corresponding to the target resolution. These innovations lead to a practical algorithm to estimate the covariance matrix in the heterogeneity problem. The second approach we consider is an iterative inversion of Ln, which has a low storage requirement and avoids the requirement of uniformly distributed rotations. We compare the complexities of these two methods, and find that each has its strengths and weaknesses.
The limiting operator L is a fundamental object in tomographic problems involving variability, and we call it the projection covariance transform. The projection covariance transform relates the covariance matrix of the imaged object to data that can be acquired from the projection images. Standard weighted back-projection tomographic reconstruction algorithms involve application of the ramp filter to the data [38], and we find that the inversion of L entails applying a similar filter, which we call the triangular area filter. The triangular area filter has many of the same properties as the ramp filter, but reflects the slightly more intricate geometry of the covariance estimation problem. The projection covariance transform is an interesting mathematical object in its own right, and we begin studying it in this paper.
Finally, we numerically validate the proposed algorithm (the first algorithm discussed above). We demonstrate this method’s robustness to noise on synthetic datasets by obtaining a meaningful reconstruction of the covariance matrix and molecular volumes even at low SNR levels. Excluding precomputations (which can be done once and for all), reconstruction from 10000 projection images of size 65 × 65 pixels takes less than five minutes on a standard laptop computer.
The paper is organized as follows. In section 2, we construct an estimator for Problem 1.1, state theoretical results about this estimator, and connect our problem to high-dimensional PCA. In section 3, we specialize the covariance estimator to the heterogeneity problem and investigate its geometry. In section 4, we discuss how to reconstruct the conformations once we have estimated the mean and covariance matrix. In section 5, we discuss computational aspects of the problem and construct a basis in which L is block diagonal and sparse. In section 6, we explore the complexity of the proposed approach. In section 7, we present numerical results for the heterogeneity problem. We conclude with a discussion of future research directions in section 8. Appendices A, B, and C contain calculations and proofs.
2. An estimator for Problem 1.1
2.1. Constructing an estimator
We define estimators μn and Σn through a general optimization framework based on the model (1.1). As a first step, let us calculate the first- and second-order statistics of I, conditioned on the observed matrix Ps for each s. Using the assumptions in Problem 1.1, we find that
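(2.1) $\mathbb{E}[I_s \mid P_s] = P_s \mu_0$
and
(2.2) $\operatorname{Var}[I_s \mid P_s] = P_s \Sigma_0 P_s^H + \sigma^2 I_q.$
Note that $P_s^H$ denotes the conjugate transpose of Ps.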
Based on (2.1) and (2.2), we devise least-squares optimization problems for μn and Σn:
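(2.3) $\mu_n = \arg\min_{\mu} \frac{1}{n} \sum_{s=1}^{n} \| I_s - P_s \mu \|^2,$
(2.4) $\Sigma_n = \arg\min_{\Sigma} \frac{1}{n} \sum_{s=1}^{n} \left\| (I_s - P_s \mu_n)(I_s - P_s \mu_n)^H - (P_s \Sigma P_s^H + \sigma^2 I_q) \right\|_F^2.$
Here we use the Frobenius norm, which is defined by $\|A\|_F^2 = \sum_{i,j} |A_{ij}|^2$.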
Note that these optimization problems do not encode any prior knowledge about μ0 or Σ0. Since Σ0 is a covariance matrix, it must be positive semidefinite (PSD). As discussed above, in many applications Σ0 is also low rank. The estimator Σn need not satisfy either of these properties. Thus, regularization of (2.4) is an option worth exploring. Nevertheless, here we only consider the unregularized estimator Σn. Note that in most practical problems, we only are interested in the leading eigenvectors of Σn, and if these are estimated accurately, then it does not matter if Σn is not PSD or low rank. Our numerical experiments show that in practice, the top eigenvectors of Σn are indeed good estimates of the true principal components for high enough SNR.
Note that we first solve (2.3) for μn, and then use this result in (2.4). This makes these optimization problems quadratic in the elements of μ and Σ, and hence they can be solved by setting the derivatives with respect to μ and Σ to zero. This leads to the following equations for μn and Σn (see Appendix A for the derivative calculations):
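(2.5) $A_n \mu_n := \left( \frac{1}{n} \sum_{s=1}^{n} P_s^H P_s \right) \mu_n = \frac{1}{n} \sum_{s=1}^{n} P_s^H I_s =: b_n,$
(2.6) $L_n \Sigma_n := \frac{1}{n} \sum_{s=1}^{n} P_s^H P_s \Sigma_n P_s^H P_s = \frac{1}{n} \sum_{s=1}^{n} P_s^H (I_s - P_s \mu_n)(I_s - P_s \mu_n)^H P_s - \sigma^2 \frac{1}{n} \sum_{s=1}^{n} P_s^H P_s =: B_n.$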
When p = q and P = Ip, μn and Σn reduce to the sample mean and sample covariance matrix. When P is a coordinate-selection operator (recall the discussion following the statement of Problem 1.1), (2.5) estimates the mean by averaging all the available observations for each coordinate, and (2.6) estimates each entry of the covariance matrix by averaging over all samples for which both coordinates are observed. These are exactly the available-case estimators discussed in [31, section 3.4].
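The construction above can be sketched numerically. The following is a minimal illustration, not the implementation used in this paper: it assumes that (2.5) and (2.6) are the normal equations $A_n \mu_n = b_n$ and $L_n \Sigma_n = B_n$ of the least-squares problems (2.3) and (2.4), it represents $L_n$ as a dense $p^2 \times p^2$ matrix (feasible only for small p), and it omits any safeguard against near-singular systems. The function name `estimate_mean_cov` is hypothetical.

```python
import numpy as np

def estimate_mean_cov(images, projs, sigma2):
    """Solve the assumed normal equations for mu_n and Sigma_n (no regularization)."""
    n = len(images)
    p = projs[0].shape[1]

    # Mean: A_n mu_n = b_n with A_n = mean(P^H P), b_n = mean(P^H I).
    A_n = np.zeros((p, p), dtype=complex)
    b_n = np.zeros(p, dtype=complex)
    for P, I in zip(projs, images):
        A_n += P.conj().T @ P / n
        b_n += P.conj().T @ I / n
    mu_n = np.linalg.solve(A_n, b_n)

    # Covariance: L_n(Sigma_n) = B_n. With row-major flattening,
    # vec(M S M) = kron(M, M^T) vec(S), where M = P^H P.
    L_n = np.zeros((p * p, p * p), dtype=complex)
    B_n = np.zeros((p, p), dtype=complex)
    for P, I in zip(projs, images):
        M = P.conj().T @ P
        r = P.conj().T @ (I - P @ mu_n)  # backprojected residual
        L_n += np.kron(M, M.T) / n
        B_n += (np.outer(r, r.conj()) - sigma2 * M) / n
    Sigma_n = np.linalg.solve(L_n, B_n.reshape(-1)).reshape(p, p)
    return mu_n, Sigma_n

# Sanity check of the special case p = q, P = I_p, sigma = 0.
rng = np.random.default_rng(0)
n, p = 500, 4
X = rng.standard_normal((n, p))
mu_n, Sigma_n = estimate_mean_cov(list(X), [np.eye(p)] * n, sigma2=0.0)
```

In this special case the outputs should coincide with the sample mean and the (1/n-normalized) sample covariance matrix, in line with the reduction noted above.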
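Observe that (2.5) requires inversion of the matrix
(2.7) $A_n = \frac{1}{n} \sum_{s=1}^{n} P_s^H P_s,$
and (2.6) requires inversion of the linear operator Ln : Cp×p → Cp×p defined by
(2.8) $L_n(\Sigma) = \frac{1}{n} \sum_{s=1}^{n} P_s^H P_s \, \Sigma \, P_s^H P_s.$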
Since the Ps are drawn independently from P, the law of large numbers implies that
(2.9) $A_n \to A$ and $L_n \to L$ almost surely,
where the convergence is in the operator norm, and
(2.10) $A = \mathbb{E}[P^H P], \qquad L(\Sigma) = \mathbb{E}[P^H P \, \Sigma \, P^H P].$
The invertibilities of A and L depend on the distribution of P . Intuitively, if P has a nonzero probability of “selecting” any coordinate of its argument, then A will be invertible. If P has a nonzero probability of “selecting” any pair of coordinates of its argument, then L will be invertible. In this paper, we assume that A and L are invertible. In particular, we will find that in the cryo-EM case, A and L are invertible if, for example, the rotations are sampled uniformly from SO(3). Under this assumption, we will prove that An and Ln are invertible with high probability for sufficiently large n. In the case when An or Ln are not invertible, we cannot define estimators from the above equations, so we simply set them to zero. Since the RHS quantities bn and Bn are noisy, it is also not desirable to invert An or Ln when these matrices are nearly singular. Hence, we propose the following estimators:
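(2.11) $\mu_n = \begin{cases} A_n^{-1} b_n & \text{if } \|A_n^{-1}\| \le 2 \|A^{-1}\|, \\ 0 & \text{otherwise}, \end{cases} \qquad \Sigma_n = \begin{cases} L_n^{-1}(B_n) & \text{if } \|L_n^{-1}\| \le 2 \|L^{-1}\|, \\ 0 & \text{otherwise}. \end{cases}$
The factors of 2 are somewhat arbitrary; any α > 1 would do.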
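Let us make a few observations about An and Ln. By inspection, An is symmetric and PSD. We claim that Ln satisfies the same properties, with respect to the Hilbert space Cp×p equipped with the inner product $\langle A, B \rangle = \operatorname{tr}(B^H A)$. Using the property tr(AB) = tr(BA), we find that for any Σ1, Σ2,
(2.12) $\langle L_n \Sigma_1, \Sigma_2 \rangle = \frac{1}{n} \sum_{s=1}^{n} \operatorname{tr}\left( \Sigma_2^H P_s^H P_s \Sigma_1 P_s^H P_s \right) = \frac{1}{n} \sum_{s=1}^{n} \operatorname{tr}\left( (P_s^H P_s \Sigma_2 P_s^H P_s)^H \Sigma_1 \right) = \langle \Sigma_1, L_n \Sigma_2 \rangle.$
Thus, Ln is self-adjoint. Next, we claim that Ln is PSD. Indeed,
(2.13) $\langle L_n \Sigma, \Sigma \rangle = \frac{1}{n} \sum_{s=1}^{n} \operatorname{tr}\left( \Sigma^H P_s^H P_s \Sigma P_s^H P_s \right) = \frac{1}{n} \sum_{s=1}^{n} \| P_s \Sigma P_s^H \|_F^2 \ge 0.$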
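These two properties can be checked numerically on a small example. The sketch below assumes that $L_n(\Sigma) = \frac{1}{n}\sum_s P_s^H P_s \Sigma P_s^H P_s$ and uses random complex matrices as stand-ins for the Ps; the dimensions are arbitrary illustrative choices. Under the inner product tr(B^H A), self-adjointness of Ln corresponds to the matrix of Ln (acting on flattened Σ) being Hermitian.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 50, 5, 3  # illustrative sizes only
projs = [rng.standard_normal((q, p)) + 1j * rng.standard_normal((q, p))
         for _ in range(n)]

# Matrix of L_n on row-major vec(Sigma): vec(M S M) = kron(M, M^T) vec(S).
L_mat = np.zeros((p * p, p * p), dtype=complex)
for P in projs:
    M = P.conj().T @ P
    L_mat += np.kron(M, M.T) / n

# Self-adjoint: the matrix equals its conjugate transpose.
is_self_adjoint = np.allclose(L_mat, L_mat.conj().T)
# PSD: the smallest eigenvalue is nonnegative up to rounding.
min_eig = np.linalg.eigvalsh(L_mat).min()
```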
2.2. Consistency of µn and Σn
In this section, we state that under mild conditions on P , X, E, the estimators μn and Σn are consistent. Note that here, and throughout this paper, ∥·∥ will denote the Euclidean norm for vectors and the operator norm for matrices. Also, define
(2.14) |
where Y is a random vector.
Proposition 2.1
Suppose A (defined in (2.10)) is invertible, that ∥P∥ is bounded almost surely, and that |||X|||2, |||E|||2 < ∞. Then, for fixed p, q we have
(2.15) |
Hence, under these assumptions, μn is consistent.
Proposition 2.2
Suppose A and L (defined in (2.10)) are invertible, that ∥P∥ is bounded almost surely, and that there is a polynomial Q for which
(2.16) |
Then, for fixed p, q, we have
(2.17) |
Hence, under these assumptions, Σn is consistent.
Remark 2.3
The moment growth condition (2.16) on X and E is not very restrictive. For example, bounded, subgaussian, and subexponential random vectors all satisfy (2.16) with deg Q ≤ 1 (see [62, sections 5.2 and 5.3]).
See Appendix B for the proofs of Propositions 2.1 and 2.2. We mentioned that μn and Σn are generalizations of available-case estimators. Such estimators are known to be consistent when the data are missing completely at random (MCAR). This means that the pattern of missingness is independent of the (observed and unobserved) data. Accordingly, in Problem 1.1, we assume that P and X are independent, a generalization of the MCAR condition. The above propositions state that the consistency of μn and Σn also generalizes to Problem 1.1.
2.3. Connection to high-dimensional PCA
While the previous section focused on the “fixed p, large n” regime, in practice both p and n are large. Now, we consider the latter regime, which is common in modern high-dimensional statistics. In this regime, we consider the properties of the estimator Σn when Σ0 is low rank, and the task is to find its leading eigenvectors. What is the relationship between the spectra of Σn and Σ0? Can the rank of Σ0 be deduced from that of Σn? To what extent do the leading eigenvectors of Σn approximate those of Σ0? In the setting of (1.1) when P = Ip, the theory of high-dimensional PCA provides insight into such properties of the sample covariance matrix (and thus of Σn). In particular, an existing result gives the correlation between the top eigenvectors of Σn and Σ0 for given settings of SNR and p/n. It follows from this result that if the SNR is sufficiently high compared to √(p/n), then the top eigenvector of Σn is a useful approximation of the top eigenvector of Σ0. If generalized to the case of nontrivial P, this result would be a useful guide for using the estimator Σn to solve practical problems, such as Problem 1.2. In this section, we first discuss the existing high-dimensional PCA literature, and then raise some open questions about how these results generalize to the case of nontrivial P.
Given independent and identically distributed (i.i.d.) samples I1, …, In ∈ Rp from a centered distribution I with covariance matrix Σ̃0 (called the population covariance matrix), the sample covariance matrix Σ̃n is defined by
(2.18) $\tilde{\Sigma}_n = \frac{1}{n} \sum_{s=1}^{n} I_s I_s^H.$
We use the new tilde notation because in the context of Problem 1.1, Σ̃0 is the signal-plus-noise covariance matrix, as opposed to the covariance of the signal itself. High-dimensional PCA is the study of the spectrum of Σ̃n for various distributions of I in the regime where n, p → ∞ with p/n → γ.
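The first case to consider is X = 0, i.e., I = E, where E ~ N(0, σ²Ip). In a landmark paper, Marčenko and Pastur [35] proved that the spectrum of Σ̃n converges to the Marčenko–Pastur (MP) distribution, which is parameterized by γ and σ²:
(2.19) $\mathrm{MP}(x) = \frac{\sqrt{(\gamma_+ - x)(x - \gamma_-)}}{2 \pi \sigma^2 \gamma x} \, 1_{[\gamma_-, \gamma_+]}(x), \qquad \gamma_\pm = \sigma^2 (1 \pm \sqrt{\gamma})^2.$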
The above formula assumes γ ≤ 1; a similar formula governs the case γ > 1. Note that there are much more general statements about classes of I for which this convergence holds; see, e.g., [54]. See Figure 2(a) for MP distributions with a few different parameter settings.
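As a quick numerical companion to the MP law, the following sketch evaluates the standard MP density for γ ≤ 1, namely $\sqrt{(\gamma_+ - x)(x - \gamma_-)} / (2\pi\sigma^2\gamma x)$ on $[\gamma_-, \gamma_+]$ with $\gamma_\pm = \sigma^2(1 \pm \sqrt{\gamma})^2$, and checks that it carries unit mass. The function name `mp_density` and the parameter values are illustrative assumptions.

```python
import numpy as np

def mp_density(x, gamma, sigma2=1.0):
    """Marcenko-Pastur density for aspect ratio gamma <= 1 and noise variance sigma2."""
    lo = sigma2 * (1 - np.sqrt(gamma)) ** 2  # left edge of the bulk
    hi = sigma2 * (1 + np.sqrt(gamma)) ** 2  # right edge of the bulk
    x = np.asarray(x, dtype=float)
    dens = np.zeros_like(x)
    inside = (x > lo) & (x < hi)
    xi = x[inside]
    dens[inside] = np.sqrt((hi - xi) * (xi - lo)) / (2 * np.pi * sigma2 * gamma * xi)
    return dens

gamma, sigma2 = 0.2, 1.0
grid = np.linspace(0.0, 3.0, 300001)
# Riemann-sum approximation of the total probability mass.
total_mass = np.sum(mp_density(grid, gamma, sigma2)) * (grid[1] - grid[0])
```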
Johnstone [21] took this analysis a step further and considered the limiting distribution of the largest eigenvalue of Σ̃n. He showed that the distribution of this eigenvalue converges to the Tracy–Widom distribution centered on the right edge of the MP spectrum. In the same paper, Johnstone considered the spiked covariance model, in which
(2.20) $I = X + E,$
where E is as before and the covariance matrix of X has rank r with nonzero eigenvalues τ1 ≥ ⋯ ≥ τr > 0, so that the population covariance matrix Σ̃0 has eigenvalues τ1 + σ², …, τr + σ², σ², …, σ². Here, X is the signal and E is the noise. In this view, the goal is to accurately recover the top r eigenvectors, as these will determine the subspace on which X is supported. The question then is the following: for what values of τ1, …, τr will the top r eigenvectors of the sample covariance matrix be good approximations to the top eigenvectors of the population covariance? Since we might not know the value of r a priori, it is important to first determine for what values of τ1, …, τr we can detect the presence of “spiked” population eigenvalues. In [5], the spectrum of the sample covariance matrix in the spiked model was investigated. It was found that the bulk of the distribution still obeys the MP law, whereas for each k such that
(2.21) $\frac{\tau_k}{\sigma^2} \ge \sqrt{\gamma},$
the sample covariance matrix will have an eigenvalue tending to (τk + σ²)(1 + γσ²/τk). The signal eigenvalues below this threshold tend to the right edge of the noise distribution. Thus, (2.21) defines a criterion for detection of signal. In Figure 2(b), we illustrate these results with a numerical example. We choose p = 800, n = 4000, and a spectrum corresponding to r = 3, with τ1, τ2 above, but τ3 below, the threshold corresponding to γ = p/n = 0.2. Figure 2(b) is a normalized histogram of the eigenvalues of the sample covariance matrix. The predicted MP distribution for the bulk is superimposed. We see that indeed we have two eigenvalues separated from this bulk. Moreover, the eigenvalue of Σ̃n corresponding to τ3 does not pop out of the noise distribution.
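An experiment of this kind is easy to reproduce in outline. The sketch below is not the computation behind Figure 2(b): it uses smaller dimensions (p = 400, n = 2000, so that γ = 0.2 as well), and the spike values in `taus` are hypothetical choices, two above and one below the threshold √γ ≈ 0.447. It compares the top sample eigenvalues with the standard spiked-model prediction (τ + σ²)(1 + γσ²/τ) and with the right edge σ²(1 + √γ)² of the MP bulk.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, sigma2 = 400, 2000, 1.0
gamma = p / n
taus = np.array([3.0, 1.5, 0.2])  # hypothetical spikes; threshold is sqrt(gamma) ~ 0.447

# I = X + E, with the signal X supported on the first three coordinates.
E = np.sqrt(sigma2) * rng.standard_normal((n, p))
X = np.zeros((n, p))
X[:, :3] = rng.standard_normal((n, 3)) * np.sqrt(taus)
I = X + E

sample_cov = I.T @ I / n
eigs = np.sort(np.linalg.eigvalsh(sample_cov))[::-1]  # descending order

# Predicted limits for the two spikes above the threshold, and the bulk edge.
predicted = (taus[:2] + sigma2) * (1 + gamma * sigma2 / taus[:2])
bulk_edge = sigma2 * (1 + np.sqrt(gamma)) ** 2
```

With these settings one expects `eigs[0]` and `eigs[1]` to lie near `predicted`, while `eigs[2]` stays close to `bulk_edge`, up to finite-sample fluctuations.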
It is also important to compare the top eigenvectors of the sample and population covariance matrices. Considering the simpler case of a spiked model with r = 1, [4, 37] showed a “phase transition” effect: as long as τ1 is above the threshold in (2.21), the correlation of the top eigenvector (VPCA) with the true principal component (V ) tends to a limit between 0 and 1:
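(2.22) $|\langle V_{\mathrm{PCA}}, V \rangle|^2 \to \frac{1 - \gamma \sigma^4 / \tau_1^2}{1 + \gamma \sigma^2 / \tau_1}.$
Otherwise, the limiting correlation is zero. Thus, high-dimensional PCA is inconsistent. However, if τ1/σ² is sufficiently high compared to √γ, then the top eigenvector of the sample covariance matrix is still a useful approximation.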
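The phase transition can be tabulated with a few lines of code. The sketch assumes the standard rank-one spiked-model limit for the squared correlation, (1 − γ/SNR²)/(1 + γ/SNR) with SNR = τ1/σ², above the threshold SNR > √γ, and zero otherwise; the function name is hypothetical.

```python
import math

def pca_correlation_limit(tau, sigma2, gamma):
    """Limiting squared correlation between the sample and true top eigenvectors (r = 1)."""
    snr = tau / sigma2
    if snr <= math.sqrt(gamma):
        return 0.0  # below the detection threshold
    return (1 - gamma / snr**2) / (1 + gamma / snr)

# Example: SNR = 1 and gamma = 0.25 give (1 - 0.25) / (1 + 0.25).
value = pca_correlation_limit(tau=1.0, sigma2=1.0, gamma=0.25)
below = pca_correlation_limit(tau=0.3, sigma2=1.0, gamma=0.25)
```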
While all the statements made so far have concerned the limiting case n, p → ∞, similar (but slightly more complicated) statements hold for finite n, p as well (see, e.g., [37]). Thus, (2.21) has a practical interpretation. Again considering the case r = 1, note that the quantity τ1/σ² is the SNR. When faced with a problem of the form (2.20) with a given p and SNR, one can determine how many samples one needs in order to detect the signal. If V represents a spatial object as in the cryo-EM case, then p can reflect the resolution to which we reconstruct V. Hence, if we have a dataset with a certain number of images n and a certain estimated SNR, then (2.21) determines the resolution to which V can be reconstructed from the data.
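To make this interpretation concrete, suppose the detection criterion reads SNR ≥ √(p/n), as in the rank-one spiked model. Rearranging gives n ≥ p/SNR² for a fixed dimension and p ≤ n·SNR² for a fixed number of samples. The helper names and numbers below are purely illustrative, and finite-sample corrections are ignored.

```python
import math

def min_samples_for_detection(p, snr):
    """Smallest n with snr >= sqrt(p / n), under the asymptotic criterion."""
    return math.ceil(p / snr**2)

def max_dimension_for_detection(n, snr):
    """Largest p with snr >= sqrt(p / n), under the asymptotic criterion."""
    return math.floor(n * snr**2)

n_needed = min_samples_for_detection(p=1000, snr=0.5)
p_allowed = max_dimension_for_detection(n=10000, snr=0.5)
```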
This information is important to practitioners (e.g., in cryo-EM), but as of now, the above theoretical results only apply to the case when P is trivial. Of course, moving to the case of more general P brings additional theoretical challenges. For example, with nontrivial P , the empirical covariance matrix of X is harder to disentangle from that of I, because the operator Ln becomes nontrivial (see (2.6) and (2.8)). How can our knowledge about the spiked model be generalized to the setting of Problem 1.1? We raise some open questions along these lines.
In what high-dimensional parameter regimes (in terms of n, p, q) is there hope to detect and recover any signal from Σn? With the addition of the parameter q, the traditional regime p ≈ n might no longer be appropriate. For example, in the random coordinate-selection case with the (extreme) parameter setting q = 2, it is expected that n = p2 log p samples are needed just for Ln to be invertible (by the coupon collector problem).
In the case when there is no signal (X = 0), we have I = E. In this case, what is the limiting eigenvalue distribution of Σn (in an appropriate parameter regime)? Is it still the MP law? How does the eigenvalue distribution depend on the distribution of P ? This is perhaps the first step towards studying the signal-plus-noise model.
In the no-signal case, what is the limiting distribution of the largest eigenvalue of Σn? Is it still Tracy–Widom? How does this depend on n, p, q, and P ? Knowing this distribution can provide p-values for signal detection, as is the case for the usual spiked model (see [21, p. 303]).
In the full model (1.1), if X takes values in a low-dimensional subspace of Rp, is the limiting eigenvalue distribution of Σn a bulk distribution with a few separated eigenvalues? If so, what is the generalization of the SNR condition (2.21) that would guarantee separation of the top eigenvalues? What would these top eigenvalues be, in terms of the population eigenvalues? Would there still be a phase-transition phenomenon in which the top eigenvectors of Σn are correlated with the principal components as long as the corresponding eigenvalues are above a threshold?
Answering these questions theoretically would require tools from random matrix theory such as the ones used by [21, 5, 37]. We do not attempt to address these issues in this paper, but remark that such results would be very useful theoretical guides for practical applications of our estimator Σn. Our numerical results show that the spectrum of the cryo-EM estimator Σn has qualitative behavior similar to that of the sample covariance matrix.
At this point, we have concluded the part of our paper focused on the general properties of the estimator Σn. Next, we move on to the cryo-EM heterogeneity problem.
3. Covariance estimation in cryo-EM heterogeneity problem
Now that we have examined the general covariance matrix estimation problem, let us specialize to the cryo-EM case. In this case, the matrices P have a specific form: they are finite-dimensional versions of P (defined in (1.2)). We begin by describing the Fourier-domain counterpart of P, which will be crucial in analyzing the cryo-EM covariance estimation problem. Our Fourier transform convention is
(3.1) |
The following classical theorem in tomography (see, e.g., [38] for a proof) shows that the operator P takes on a simpler form in the Fourier domain.
Theorem 3.1 (Fourier projection slice theorem)
Suppose Y ∈ L2(R3)∩L1(R3) and J : R2 → R. Then
(3.2) |
where P̂ : C(R3) → C(R2) is defined by
(3.3) $(\hat{P} \hat{Y})(\hat{x}, \hat{y}) = \hat{Y}(\hat{x} R^1 + \hat{y} R^2).$
Here, $R^i$ is the ith row of R.
Hence, P̂ rotates a function by R and then restricts it to the horizontal plane ẑ = 0. If we let ξ = (x̂, ŷ, ẑ), then another way of viewing P̂ is that it restricts a function to the plane $\xi \cdot R^3 = 0$.
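A discrete analogue of the projection slice theorem can be verified directly with FFTs. The sketch below covers only the simplest case, R equal to the identity with projection along the third axis, where the statement holds exactly for the discrete Fourier transform; general rotations require interpolation and are not addressed here. The random volume is a stand-in for a molecule.

```python
import numpy as np

rng = np.random.default_rng(0)
volume = rng.standard_normal((16, 16, 16))  # stand-in for a 3D density

# Real-space projection along the z axis, then a 2D Fourier transform.
projection = volume.sum(axis=2)
image_ft = np.fft.fft2(projection)

# Central slice (zero frequency along z) of the 3D Fourier transform.
central_slice = np.fft.fftn(volume)[:, :, 0]

slices_agree = np.allclose(image_ft, central_slice)
```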
3.1. Infinite-dimensional heterogeneity problem
To build intuition for the Fourier-domain geometry of the heterogeneity problem, consider the following idealized scenario, taking place in Fourier space. Suppose detector technology improves to the point that images can be measured continuously and noiselessly and that we have access to the full joint distribution of R and Î. We would like to estimate the mean m̂0 : R3 → C and covariance function Ĉ0 : R3 × R3 → C of the random field X̂, defined by
(3.4) $\hat{m}_0(\xi) = \mathbb{E}[\hat{X}(\xi)], \qquad \hat{C}_0(\xi_1, \xi_2) = \mathbb{E}\left[ (\hat{X}(\xi_1) - \hat{m}_0(\xi_1)) \overline{(\hat{X}(\xi_2) - \hat{m}_0(\xi_2))} \right].$
Heuristically, we can proceed as follows. By the Fourier projection slice theorem, every image Î provides an observation of X̂(ξ) for ξ ∈ R3 belonging to a central plane perpendicular to the viewing direction corresponding to Î. By abuse of notation, let ξ ∈ Î if Î carries the value of X̂(ξ), and let Î(ξ) denote this value. Informally, we expect that we can recover m̂0 and Ĉ0 via
(3.5) $\hat{m}_0(\xi) = \mathbb{E}[\hat{I}(\xi) \mid \xi \in \hat{I}], \qquad \hat{C}_0(\xi_1, \xi_2) = \mathbb{E}\left[ (\hat{I}(\xi_1) - \hat{m}_0(\xi_1)) \overline{(\hat{I}(\xi_2) - \hat{m}_0(\xi_2))} \mid \xi_1, \xi_2 \in \hat{I} \right].$
Now, let us formalize this problem setup and intuitive formulas for m̂0 and Ĉ0.
Problem 3.2
Let X̂ : Ω × R3 → C be a random field, where (Ω, F, ν) is a probability space. Here X̂(ω, ·) is a Fourier volume for each ω ∈ Ω. Let R : Ω → SO(3) be a random rotation, independent of X̂, having the uniform distribution over SO(3). Let P̂ = P̂(R) be the (random) projection operator associated with R via (3.3). Define the random field Î : Ω × R2 → C by
(3.6) $\hat{I} = \hat{P} \hat{X}.$
Given the joint distribution of Î and R, find the mean m̂0 and covariance function Ĉ0 of X̂, defined in (3.4). Let X̂ be regular enough that
(3.7) |
In this problem statement, we do not assume that X̂ has a discrete distribution. The calculations that follow hold for any X̂ satisfying (3.7).
We claim that m̂0 and Ĉ0 can be found by solving
(3.8) |
and
(3.9) |
equations whose interpretations we shall discuss in this section. Note that (3.8) and (3.9) can be seen as the limiting cases of (2.5) and (2.6) for σ2 = 0, p → ∞, and n → ∞.
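In the equations above, we define the backprojection operator P̂* as the adjoint of P̂ in the sense of distributions: it maps continuous linear functionals on test functions of R2 to continuous linear functionals on test functions of R3. Thus, both sides of (3.8) are continuous linear functionals on test functions of R3. To verify this equation, we apply both sides to a test function Ŷ: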
(3.10) |
Note that
(3.11) |
from which it follows that in the sense of distributions,
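(3.12) $(\hat{P}^* \hat{P} \hat{m})(\xi) = \hat{m}(\xi) \, \delta(\xi \cdot R^3).$
Intuitively, this means that P̂*P̂ inputs the volume m̂ and outputs a “truncated” volume that coincides with m̂ on a plane perpendicular to the viewing direction and is zero elsewhere. This reflects the fact that the image Î = P̂X̂ only gives us information about X̂ on a single central plane. When we aggregate this information over all possible R, we obtain the operator Â:
(3.13) $(\hat{A} \hat{m})(\xi) = \mathbb{E}\left[ \hat{m}(\xi) \, \delta(\xi \cdot R^3) \right] = \hat{m}(\xi) \, \frac{1}{4\pi} \int_{S^2} \delta(\xi \cdot \theta) \, d\theta = \frac{\hat{m}(\xi)}{2 |\xi|}.$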
We used the fact that R3 is uniformly distributed over S2 if R is uniformly distributed over SO(3). Here, dθ is the surface measure on S2 (hence the normalization by 4π). The last step holds because the integral over S2 is equal to the circumference of a great circle on S2, so it is 2π.
By comparing (3.8) and (2.7), it is clear that Â is the analogue of An for infinite n and p. Also, (3.8) echoes the heuristic formula (3.5). The backprojection operator P̂* simply “inserts” a 2D image into 3D space by situating it in the plane perpendicular to the viewing direction of the image, and so the RHS of (3.8) at a point ξ is the accumulation of values Î(ξ). Moreover, the operator Â is diagonal, and for each ξ, Â reflects the measure of the set {ξ ∈ Î}, i.e., the density of central planes passing through ξ under the uniform distribution of rotations. Thus, (3.8) encodes the intuition from the first equation in (3.5). Inverting Â involves multiplying by the radial factor 2|ξ|. In tomography, this factor is called the ramp filter [38]. Traditional tomographic algorithms proceed by applying the ramp filter to the projection data and then backprojecting. Note that solving (3.8) implies performing these operations in the reverse order; however, backprojection and application of the ramp filter commute.
Now we move on to (3.9). Both sides of this equation are continuous linear functionals on . Indeed, for , the LHS of (3.9) operates on through the definition
(3.14) |
where we view as operating on pairs (η1, η2) of elements in via
(3.15) |
Using these definitions, we verify (3.9):
(3.16) |
Substituting (3.12) into the last two lines of the preceding calculation, we find
(3.17) |
This reflects the fact that an image Î gives us information about Ĉ0(ξ1, ξ2) for ξ1, ξ2 ∈ Î.
Taking the expectation over R, we find that
(3.18) $(\hat{L} \hat{C})(\xi_1, \xi_2) = \mathbb{E}\left[ \hat{C}(\xi_1, \xi_2) \, \delta(\xi_1 \cdot R^3) \, \delta(\xi_2 \cdot R^3) \right] = \hat{C}(\xi_1, \xi_2) \, K(\xi_1, \xi_2).$
Like Â, the operator L̂ is diagonal. L̂ is a fundamental operator in tomographic inverse problems involving variability; we term it the projection covariance transform. In the same way that (3.8) reflected the first equation of (3.5), we see that (3.9) resembles the second equation of (3.5). In particular, the kernel value K(ξ1, ξ2) reflects the density of central planes passing through ξ1, ξ2.
To understand this kernel, let us compute it explicitly. We have
(3.19) $K(\xi_1, \xi_2) = \frac{1}{4\pi} \int_{S^2} \delta(\xi_1 \cdot \theta) \, \delta(\xi_2 \cdot \theta) \, d\theta.$
For fixed ξ1, note that δ(ξ1 · θ) is supported on the great circle of S2 perpendicular to ξ1. Similarly, δ(ξ2 · θ) corresponds to a great circle perpendicular to ξ2. Choose ξ1, ξ2 ∈ R3 so that |ξ1 × ξ2| ≠ 0. Then, note that these two great circles intersect in two antipodal points θ = ±(ξ1 × ξ2)/|ξ1 × ξ2|, and the RHS of (3.19) corresponds to the total measure of δ(ξ1 · θ)δ(ξ2 · θ) at those two points.
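To calculate this measure explicitly, let us define the approximation to the identity $\delta_\epsilon(t) = \frac{1}{2\epsilon} 1_{[-\epsilon, \epsilon]}(t)$. Fix ε1, ε2 > 0. Note that $\delta_{\epsilon_1}(\xi_1 \cdot \theta)$ is supported on a strip of width 2ε1/|ξ1| centered at the great circle perpendicular to ξ1. Similarly, $\delta_{\epsilon_2}(\xi_2 \cdot \theta)$ is supported on a strip of width 2ε2/|ξ2| intersecting the first strip transversely. For small ε1, ε2, the intersection of the two strips consists of two approximately parallelogram-shaped regions, S1 and S2 (see Figure 3).
The sine of the angle between the sides of each of these regions is |ξ1 × ξ2|/(|ξ1||ξ2|), and a simple calculation shows that the area of one of these regions is (2ε1)(2ε2)/|ξ1 × ξ2|. It follows that
(3.20) $K(\xi_1, \xi_2) = \lim_{\epsilon_1, \epsilon_2 \to 0} \frac{1}{4\pi} \int_{S^2} \delta_{\epsilon_1}(\xi_1 \cdot \theta) \, \delta_{\epsilon_2}(\xi_2 \cdot \theta) \, d\theta = \frac{1}{4\pi} \cdot 2 \cdot \frac{1}{(2\epsilon_1)(2\epsilon_2)} \cdot \frac{(2\epsilon_1)(2\epsilon_2)}{|\xi_1 \times \xi_2|} = \frac{1}{2\pi |\xi_1 \times \xi_2|}.$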
This analytic form of K sheds light on the geometry of L̂. Recall that K(ξ1, ξ2) is a measure of the density of central planes passing through ξ1 and ξ2. Note that this density is nonzero everywhere, which reflects the fact that there is a central plane passing through each pair of points in R3. The denominator in K is proportional to the magnitudes |ξ1| and |ξ2|, which indicates that there is a greater density of planes passing through pairs of points nearer the origin. Finally, note that K varies inversely with the sine of the angle between ξ1 and ξ2; indeed, a greater density of central planes passes through a pair of points nearly collinear with the origin. In fact, there is a singularity in K when ξ1, ξ2 are linearly dependent, reflecting the fact that infinitely many central planes pass through collinear points. As a way to sum up the geometry encoded in K, note that except for the factor of 1/4π, 1/K is the area of the triangle spanned by the vectors ξ1 and ξ2. For this reason, we call 1/K the triangular area filter.
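Both density formulas can be checked by Monte Carlo integration over the sphere. The sketch below assumes the closed forms 1/(2|ξ|) for the single-point density and 1/(2π|ξ1 × ξ2|) for K, and replaces each delta function by a box of half-width `eps`, mirroring the strip argument above. The sample size, `eps`, and the test vectors are arbitrary illustrative choices, so agreement is only up to sampling error and the finite strip width.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.standard_normal((2_000_000, 3))
theta /= np.linalg.norm(theta, axis=1, keepdims=True)  # uniform points on S^2

def box_delta(t, eps):
    """Box approximation to the delta function: 1/(2 eps) on (-eps, eps)."""
    return (np.abs(t) < eps) / (2 * eps)

eps = 0.05
xi1 = np.array([1.0, 0.0, 0.0])
xi2 = np.array([1.0, 1.0, 0.0])

# Averaging over uniform theta equals (1 / 4 pi) times the surface integral.
ramp_mc = box_delta(theta @ xi1, eps).mean()
ramp_exact = 1 / (2 * np.linalg.norm(xi1))

K_mc = (box_delta(theta @ xi1, eps) * box_delta(theta @ xi2, eps)).mean()
K_exact = 1 / (2 * np.pi * np.linalg.norm(np.cross(xi1, xi2)))
```

The reciprocal `1 / K_exact` equals 4π times the area of the triangle spanned by `xi1` and `xi2`, which is the triangular area filter evaluated at this pair.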
Note that the triangular area filter is analogous to the ramp filter: it grows linearly with the frequencies |ξ1| and |ξ2| to compensate for the loss of high frequency information incurred by the geometry of the problem. So, this filter is a generalization of the ramp filter appearing in the estimation of the mean to the covariance estimation problem. The latter has a somewhat more intricate geometry, which is reflected in K.
The properties of K translate into the robustness of inverting L̂ (supposing we added noise to our model). In particular, the robustness of recovering Ĉ0(ξ1, ξ2) grows with K(ξ1, ξ2). For example, recovering higher frequencies in Ĉ0 is more difficult. However, the fact that K is everywhere positive means that L̂ is at least invertible. This statement is important in proving theoretical results about our estimators, as we saw in section 2.2. Note that an analogous problem of estimating the covariance matrix of 2D objects from their one-dimensional line projections would not satisfy this condition, because for most pairs of points in R2, there is not a line passing through both points as well as the origin.
3.2. The discrete covariance estimation problem
The calculation in the preceding section shows that if we could sample images continuously and if we had access to projection images from all viewing angles, then L̂ would become a diagonal operator. In this section, we explore the modifications necessary for the realistic case where we must work with finite-dimensional representations of volumes and images.
Our idea is to follow what we did in the fully continuous case treated above and estimate the covariance matrix in the Fourier domain. One possibility is to choose a Cartesian basis in the Fourier domain. With this basis, a tempting way to define P̂s would be to restrict the Fourier 3D grid to the pixels of a 2D central slice by nearest-neighbor interpolation. This would make P̂s a coordinate-selection operator, making L̂n diagonal. However, this computational simplicity comes at a great cost in accuracy; numerical experiments show that the errors induced by such a coarse interpolation scheme are unacceptably large. Such an interpolation error should not come as a surprise, considering similar interpolation errors in computerized tomography [38]. Hence, we must choose other bases for the Fourier volumes and images.
The finite sampling rate of the images limits the 3D frequencies we can hope to reconstruct. Indeed, since the images are sampled on an N × N grid confining a disc of radius 1, the corresponding Nyquist bandlimit is ωNyq = Nπ/2. Hence, the images carry no information past this 2D bandlimit. By the Fourier slice theorem, this means that we also have no information about X̂ past the 3D bandlimit ωNyq. In practice, the exponentially decaying envelope of the CTF leaves even fewer frequencies that can be reconstructed. Moreover, we saw in section 3.1 and will see in section 6.2 that reconstruction of Σ0 becomes more ill-conditioned as the frequency increases. Hence, it often makes sense to take a cutoff ωmax < ωNyq. We can choose ωmax to correspond to an effective grid size of Nres pixels, where Nres ≤ N. In this case, we would choose ωmax = Nresπ/2. Thus, it is natural to search for X in a space of functions bandlimited in Bωmax (the ball of radius ωmax) and with most of their energy contained in the unit ball. The optimal space B with respect to these constraints is spanned by a finite set of 3D Slepian functions [56]. For a given bandlimit ωmax, we have
(3.21) |
This dimension is called the Shannon number, and is the trace of the kernel in [56, eq. 6].
For the purposes of this section, let us work abstractly with the finite-dimensional spaces V̂ ⊂ C0(Bωmax) and Î ⊂ C0(Dωmax), which represent Fourier volumes and Fourier images, respectively (Dωmax ⊂ R2 is the disc of radius ωmax). For example, V̂ could be spanned by the Fourier transforms of the 3D Slepian functions. Let
(3.22) |
with dim(V̂) = p̂ and dim(Î) = q̂. Assume that for all R, P̂(V̂) ⊂ Î (i.e., we do not need to worry about interpolation). Denote by P̂ the matrix expression of the projection operator in these bases. Thus, P̂ ∈ Cq̂×p̂. Let X̂1, …, X̂n be the coefficient representations of the n Fourier volumes in the basis for V̂.
Since we are given the images Is in the pixel basis Rq, let us consider how to map these images into Î. Let Q1 : Rq → Î be the mapping which fits (in the least-squares sense) an element of Î to the pixel values defined by a vector in Rq. It is easiest to express Q1 in terms of the reverse mapping Q2 : Î → Rq. The ith column of Q2 consists of the evaluations of gi at the real-domain grid points inside the unit disc. It is easy to see that the least-squares method of defining Q1 is equivalent to $Q_1 = Q_2^{+} = (Q_2^H Q_2)^{-1} Q_2^H$.
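The least-squares fit described here is the usual pseudoinverse construction, which can be illustrated on random data. In the sketch below, `Q2` is a random tall matrix standing in for the evaluation matrix of the image basis functions at the grid points (the actual basis is not constructed), and `pixels` is a random vector of pixel values; both are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
q, q_hat = 40, 6  # number of pixels and of basis functions (illustrative)
Q2 = rng.standard_normal((q, q_hat)) + 1j * rng.standard_normal((q, q_hat))
pixels = rng.standard_normal(q)

# Normal-equations form of the least-squares map: Q1 = (Q2^H Q2)^{-1} Q2^H.
Q1 = np.linalg.inv(Q2.conj().T @ Q2) @ Q2.conj().T
coeffs_normal = Q1 @ pixels

# The same coefficients from a generic least-squares solver.
coeffs_lstsq = np.linalg.lstsq(Q2, pixels.astype(complex), rcond=None)[0]
```

For larger or poorly conditioned problems, a solver such as `np.linalg.lstsq` is generally preferable to forming the inverse explicitly.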
Now, note that
(3.23) |
The last approximate equality is due to the Fourier slice theorem. The inaccuracy comes from the discretization operator S. Note that, for the least-squares mapping Q1, $\operatorname{Var}[Q_1 E_s] = \sigma^2 Q_1 Q_1^H = \sigma^2 (Q_2^H Q_2)^{-1}$. We would like the latter matrix to be a multiple of the identity matrix so that the noise in the images remains white. Let us calculate the entries of $Q_2^H Q_2$ in terms of the basis functions gi. Given the fact that we are working with volumes hi which have most of their energy concentrated in the unit ball, it follows that gi have most of their energy concentrated in the unit disc. If x1, …, xq are the real-domain image grid points, it follows that
(3.24) |
It follows that in order for $(Q_2^H Q_2)^{-1}$ to be (approximately) a multiple of the identity matrix, we should require {ĝi} to be an orthonormal set in L2(R2). If we let cq = 4π³/q, then we find that
(3.25) |
It follows that, if we make the approximations in (3.23) and (3.25), we can formulate the heterogeneity problem entirely in the Fourier domain as follows:
(3.26) |
where Var[Ê] = σ²cq Iq̂. Thus, we have an instance of Problem 1.1 with σ² replaced by σ²cq, q replaced by q̂, and p replaced by p̂. We seek μ̂0 = E[X̂] and Σ̂0 = Var[X̂]. Equations (2.5) and (2.6) become
(3.27) |
and
(3.28) |
3.3. Exploring Â and L̂
In this section, we seek to find expressions for Â and L̂ like those in (3.13) and (3.18). The reason for finding these limiting operators is twofold. First of all, recall that the theoretical results in section 2.2 depend on the invertibility of these limiting operators. Hence, knowing Â and L̂ in the cryo-EM case will allow us to verify the assumptions of Propositions 2.1 and 2.2. Second, the law of large numbers guarantees that for large n, we have Ân ≈ Â and L̂n ≈ L̂. We shall see in section 5 that approximating Ân and L̂n by their limiting counterparts makes possible the tractable implementation of our algorithm.
In section 3.1, we worked with functions m̂ : R3 → C and Ĉ : R3 × R3 → C. Now, we are in a finite-dimensional setup, and we have formulated (3.27) and (3.28) in terms of vectors and matrices. Nevertheless, in the finite-dimensional case we can still work with functions as we did in section 3.1 via the identifications
(3.29) |
where we define
(3.30) |
and V̂ ⊗ V̂ = span{ĥi ⊗ ĥj}. Thus, we identify Cp̂ and Cp̂×p̂ with spaces of bandlimited functions. For these identifications to be isometries, we must endow V̂ with an inner product for which the ĥi are orthonormal. We consider a family of inner products, weighted by radial functions w(|ξ|):
(3.31) |
The inner product on V̂ ⊗ V̂ is inherited from that of V̂.
Note that Ân and L̂n both involve the projection-backprojection operator $\hat{P}_s^H \hat{P}_s$. Let us see how to express $\hat{P}_s^H \hat{P}_s$ as an operator on V̂. The ith column of P̂s is the representation of the projection of ĥi in the orthonormal basis for Î. Hence, using the isomorphism Cq̂ ↔ Î and reasoning along the lines of (3.11), we find that
(3.32) |
Note that here and throughout this section, we perform manipulations (like those in section 3.1) that involve treating elements of V̂ as test functions for distributions. We will ultimately construct V̂ so that its elements are continuous, but not in C∞(R3), as assumed in section 3.1. Nevertheless, since we are only dealing with distributions of order zero, continuity of the elements of V̂ is sufficient.
From (3.32), it follows that if μ̂ ∈ Cp̂ ↔ m̂ ∈ V̂, then
(3.33) |
where
(3.34) |
is a projection onto the finite-dimensional subspace V̂.
In analogy with (3.8), we have
(3.35) |
Note that Â resembles its infinite-dimensional counterpart obtained in (3.8), with the addition of the “low-pass filter” πV̂. As a particular choice of weight, one might consider w(|ξ|) = 1/|ξ| in order to cancel the ramp filter. For this weight, note that
(3.36) |
where is the orthogonal projection onto VP with respect to the weight w. Thus, for this weight we find that
A calculation analogous to (3.33) shows that for Σ̂ ∈ Cp̂×p̂ ↔ Ĉ ∈ V̂ ⊗ V̂,
(3.37) |
Then, taking the expectation over R3, we find that
(3.38) |
This shows that L̂ is linked to its infinite-dimensional counterpart via the low-pass filter πV̂⊗V̂, analogously to (3.34).
3.4. Properties of Â and L̂
In this section, we will prove several results about Â and L̂, defined in (3.35) and (3.38). We start by proving a useful lemma.
Lemma 3.3
For and Ŷ Ĉ, we have
(3.39) |
Likewise, if , we have
(3.40) |
Proof
Indeed, we have
(3.41) |
The proof of the second claim is similar.
Note that Â and L̂ are self-adjoint and PSD because each Ân and L̂n satisfies this property. In the next proposition, we bound the minimum eigenvalues of these two operators from below.
Proposition 3.4
Let Mw(ωmax) = max|ξ|≤ωmax |ξ|w(|ξ|). Then,
(3.42) |
Proof
Let μ̂ ∈ Cp̂ ↔ m̂ ∈ V̂. We find
(3.43) |
The bound on the minimum eigenvalue of L̂ follows from a similar argument, using (3.38) and the following bound:
(3.44) |
By inspecting Mw(ωmax), we see that choosing w = 1/|ξ| leads to better conditioning of both Â and L̂, as compared to w = 1. This is because the former weight compensates for the loss of information at higher frequencies. We see from (3.36) that for w = 1/|ξ|, Â is perfectly conditioned. This weight also cancels the linear growth of the triangular area filter with radial frequency. However, it does not cancel K altogether, since the dependency on sin γ in the denominators in (3.44) remains, where γ is the angle between ξ1 and ξ2.
The maximum eigenvalue of L̂ cannot be bounded as easily, since the quotient in (3.44) is not bounded from above. A bound on λmax(L̂) might be obtained by using the fact that a bandlimited Ĉ can only be concentrated to a limited extent around the singular set {ξ1, ξ2 : |ξ1 × ξ2| = 0}.
Finally, we prove another property of Â and L̂: they commute with rotations. Let us define the group action of SO(3) on functions R3 → C as follows: for R ∈ SO(3) and Ŷ : R3 → C, let $(R.\hat{Y})(\xi) = \hat{Y}(R^T \xi)$. Likewise, define the group action of SO(3) on functions Ĉ : R3 × R3 → C via $(R.\hat{C})(\xi_1, \xi_2) = \hat{C}(R^T \xi_1, R^T \xi_2)$.
Proposition 3.5
Suppose that the subspace VP is closed under rotations. Then, for any Y ∈ V , C ∈ V ⊗ V , and R ∈ SO(3), we have
(3.45) |
where APX and LPX are understood via the identifications (3.29).
Proof
We begin by proving the first half of (3.45). First of all, extend the group action of SO(3) to the space , via
(3.46) |
We claim that for any , we have R.(π Pη) = π P(R.η). Since VP is closed under rotations, both sides of this equation are elements of VĈ. We can verify their equality by taking an inner product with an arbitrary element Ĉ VĈ. Using Lemma 3.3 and the fact that VP is closed under rotations, we obtain
(3.47) |
Next, we claim that for any Ĉ VĈ, we have R.( P P) = Ĉ(R. Ĉ). To check whether these two elements of are the same, we apply them to a test function :
(3.48) |
Putting together what we have, we find that
(3.49) |
which proves the first half of (3.45). The second half is proved analogously.
This property of AP and LP is to be expected, given the rotationally symmetric nature of these operators. This suggests that LP can be studied further using the representation theory of SO(3).
Finally, let us check that the assumptions of Propositions 2.1 and 2.2 hold in the cryo-EM case. It follows from Proposition 3.4 that as long as Mw(ωmax) < ∞, the limiting operators AP and LP are invertible. Of course, it is always possible to choose such a weight w. In particular, the weights already considered, w = 1 and w = 1/|ξ|, satisfy this property. Moreover, by rotational symmetry, the norm ‖PĈ(R)‖ is independent of R, and so this quantity is uniformly bounded. Thus, we have checked all the necessary assumptions to arrive at the following conclusion.
Proposition 3.6
If we neglect the errors incurred in moving to the Fourier domain and assume that the rotations are drawn uniformly from SO(3), then the estimators μPn and ΣP n obtained from (3.27) and (3.28) are consistent.
4. Using μPn and ΣP n to determine the conformations
To solve Problem 1.2, we must do more than just estimate μP0 and ΣP 0. We must also estimate C, XP c, and pc, where XP c is the coefficient vector of Pc in the basis for VĈ. Once we solve (3.27) and (3.28) for μPn and ΣP n, we perform the following steps.
From the discussion on high-dimensional PCA in section 2.3, we expect to determine the number of structural states by inspecting the spectrum of ΣP n. We expect the spectrum of ΣP n to consist of a bulk distribution along with C − 1 separate eigenvalues (assuming the SNR is sufficiently high), a fact confirmed by our numerical results. Hence, given ΣP n, we can estimate C.
Next, we discuss how to reconstruct XP 1,… , XP C and p1,… , pC. Our approach is similar to that of Penczek, Kimmel, and Spahn [43]. By the principle of PCA, the leading eigenvectors of ΣP 0 span the space of mean-subtracted volumes. Hence, using the C − 1 leading eigenvectors of ΣP n, we can write
(4.1) |
Note that there is only approximate equality because we have replaced the mean μP0 by the estimated mean μPn, and the eigenvectors of ΣP 0 by those of ΣP n. We would like to recover the coefficients αs = (αs,1,… , αs,C−1), but the XPs are unknown. Nevertheless, if we project the above equation by PPs, then we get
(4.2) |
For each s, we can now solve this equation for the coefficient vector αs in the least-squares sense. This gives us n vectors in C^(C−1). These should be clustered around C points αc for c = 1,… ,C, corresponding to the C underlying volumes. At this point, Penczek, Kimmel, and Spahn propose to perform K-means clustering on αs in order to deduce which image corresponds to which class. However, if the images are too noisy, then it would be impossible to separate the classes via clustering. Note that in order to reconstruct the original volumes, all we need are the means of the C clusters of coordinates. If the mean volume and top eigenvectors are approximately correct, then the main source of noise in the coordinates is the Gaussian noise in the images. It follows that the distribution of the coordinates in C^(C−1) is a mixture of Gaussians. Hence, we can find the mean αc of each cluster using either an EM algorithm (of which the K-means algorithm used by Penczek is a limiting case [8]) or the method of moments, e.g., [23]. In the current implementation, we use an EM algorithm. Once we have the C mean vectors, we can reconstruct the original volumes using (4.1). Putting these steps together, we arrive at a high-level algorithm to solve the heterogeneity problem (see Algorithm 1).
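To illustrate the clustering step, the following is a minimal sketch of an EM iteration that recovers cluster means from noisy coordinates. It is not the implementation used for the experiments: the function name `em_cluster_means`, the farthest-point initialization, the single shared isotropic variance, and the use of real-valued coordinates are all assumptions made for the example.

```python
import numpy as np

def em_cluster_means(alpha, C, n_iter=200):
    """EM for a C-component Gaussian mixture with one shared isotropic variance.

    alpha: (n, d) real array of coordinates (hypothetical stand-in for the alpha_s).
    Returns the estimated cluster means (C, d) and mixing weights (C,).
    """
    n, d = alpha.shape
    # Farthest-point initialization (an assumption, chosen to be deterministic).
    means = [alpha[0]]
    for _ in range(1, C):
        d2 = np.min([((alpha - m) ** 2).sum(axis=1) for m in means], axis=0)
        means.append(alpha[np.argmax(d2)])
    means = np.array(means, dtype=float)
    weights = np.full(C, 1.0 / C)
    var = float(alpha.var())
    for _ in range(n_iter):
        # E-step: posterior probability of each class for each point.
        sq = ((alpha[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        log_r = np.log(weights)[None, :] - sq / (2.0 * var)
        log_r -= log_r.max(axis=1, keepdims=True)
        resp = np.exp(log_r)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update weights, means, and the shared variance.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / nk.sum()
        means = (resp.T @ alpha) / nk[:, None]
        sq = ((alpha[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        var = float((resp * sq).sum() / (n * d)) + 1e-12
    return means, weights

# Synthetic two-class example (C = 2, so the coordinates are one-dimensional).
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=4000)
true_means = np.array([[-1.0], [1.0]])
alpha = true_means[labels] + 0.5 * rng.standard_normal((4000, 1))
means, weights = em_cluster_means(alpha, C=2)
```

Only the estimated means are needed to reconstruct the volumes via (4.1); the per-image class assignments are a by-product.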
5. Implementing Algorithm 1
In this section, we confront the practical challenges of implementing Algorithm 1. We consider different approaches to addressing these challenges and choose one approach to explore further.
5.1. Computational challenges and approaches
The main computational challenge in Algorithm 1 is solving for ΣP n in
(5.1) |
given the immense size of this problem. Two possibilities for inverting LPn immediately come to mind. The first is to treat (5.1) as a large system of linear equations, viewing ΣP n as a vector in C^(pP^2) and LPn as a matrix in C^(pP^2 × pP^2). In this scheme, the matrix LPn could be computed once and stored. However, this approach has an unreasonably large storage requirement. Since pP = O(Nres^3), it follows that LPn has size O(Nres^6) × O(Nres^6). Even for a small Nres value such as 17, each dimension of LPn is 1.8 × 10^6. Storing such a large LPn requires over 23 terabytes. Moreover, inverting this matrix naively is completely intractable.
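The storage figure can be checked with a few lines of arithmetic. The sketch below assumes 8 bytes per stored entry and measures terabytes in binary units (2^40 bytes); both are assumptions, since the text does not state them.

```python
dim = 1.8e6            # each dimension of the full matrix, as quoted in the text
bytes_per_entry = 8    # assumption: 8 bytes per stored entry
total_bytes = dim ** 2 * bytes_per_entry
tib = total_bytes / 2 ** 40   # assumption: binary terabytes
print(f"{total_bytes:.3e} bytes, about {tib:.1f} TiB")
```

Under these assumptions the result is consistent with the "over 23 terabytes" quoted above.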
The second possibility is to abandon the idea of forming LPn as a matrix, and instead to use an iterative algorithm, such as the conjugate gradient (CG) algorithm, based on repeatedly applying LPn to an input matrix. From (3.28), we see that applying LPn to a matrix is dominated by n multiplications of a qP × pP matrix by a pP × pP matrix, which costs O(n qP pP^2) = O(n Nres^8). If κn is the condition number of LPn, then CG will converge in O(√κn) iterations (see, e.g., [58]). Hence, while the storage requirement of this alternative algorithm is only O(pP^2) = O(Nres^6), the computational complexity is O(n Nres^8 √κn). Thus, the price to pay for reducing the storage requirement is that n matrix multiplications must be performed at each iteration. While this computational complexity might render the algorithm impractical for a regular computer, one can take advantage of the fact that the n matrix multiplications can be performed in parallel.
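A matrix-free CG iteration of this kind can be sketched as follows. The operator is taken to be Σ ↦ (1/n) Σ_s Ps^H Ps Σ Ps^H Ps, following the expression evaluated in section 6.4; the 1/n normalization, the function names, and the small random real matrices standing in for the projection matrices are assumptions made for illustration.

```python
import numpy as np

def apply_Ln(Sigma, Ps):
    """Apply Sigma -> (1/n) sum_s P_s^H P_s Sigma P_s^H P_s without forming the operator."""
    out = np.zeros_like(Sigma)
    for P in Ps:
        G = P.conj().T @ P          # p x p Gram matrix of one projection
        out += G @ Sigma @ G
    return out / len(Ps)

def cg_solve(B, Ps, tol=1e-10, max_iter=500):
    """Solve L_n(Sigma) = B by conjugate gradients in the Frobenius inner product."""
    Sigma = np.zeros_like(B)
    R = B.copy()                    # residual for the zero initial guess
    D = R.copy()                    # search direction
    rs = np.vdot(R, R).real
    b_norm = np.sqrt(np.vdot(B, B).real)
    for _ in range(max_iter):
        if np.sqrt(rs) <= tol * b_norm:
            break
        LD = apply_Ln(D, Ps)
        step = rs / np.vdot(D, LD).real
        Sigma += step * D
        R -= step * LD
        rs_new = np.vdot(R, R).real
        D = R + (rs_new / rs) * D
        rs = rs_new
    return Sigma

# Tiny synthetic example: p = 4 coefficients, q = 2 measurements per "image".
rng = np.random.default_rng(0)
p, q, n = 4, 2, 40
Ps = [rng.standard_normal((q, p)) for _ in range(n)]
A = rng.standard_normal((p, p))
Sigma_true = A @ A.T
B = apply_Ln(Sigma_true, Ps)
Sigma_est = cg_solve(B, Ps)
```

Each iteration calls `apply_Ln` once, which is where the n matrix multiplications per iteration arise; the loop over `Ps` is the part that parallelizes.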
We propose a third numerical scheme, one which requires substantially less storage than the first scheme above and does not require O(n) operations at each iteration. We assume that the Rs are drawn from the uniform distribution over SO(3), and so for large n, the operator LPn does not differ much from its limiting counterpart LP (defined in (3.38)). Hence, if we replace LPn by LP in (5.1), we would not be making too large an error. Of course, LP is a matrix of the same size as LPn, so it is also impossible to store on a computer. However, we leverage the analytic form of LP in order to invert it more efficiently. At this point, we have not yet chosen the spaces VP and IP, and by constructing these carefully we give LP a special structure. This approach also entails a tradeoff: in practice the approximation LPn ≈ LP is accurate only to the extent that the viewing directions of R1,… , Rn are uniformly distributed on S2. Hence, we must extract a subset of the given rotations whose viewing angles are approximately uniformly distributed on the sphere. Thus, the sacrifice we make in this approach is a reduction in the sample size. Moreover, since the subselected viewing directions are no longer statistically independent, the theoretical consistency result stated in Proposition 3.6 does not necessarily extend to this numerical scheme.
Nevertheless, the latter approach is promising because the complexity of inverting LP is independent of the number of images, and this computation might be tractable for reasonable values of Nres if LP has enough structure. It remains to construct VP and IP to induce a special structure in LP, which we turn to next.
5.2. Choosing VP to make LP sparse and block diagonal
In this section, we write down an expression for an individual element of LP, and discover that for judiciously chosen basis functions hPi, the matrix LP becomes sparse and block diagonal.
First, let us fix a functional form for the basis elements hPi: let
(5.2) | hPi(r, α) = fi(r) ai(α)
where fi : R+ → R are radial functions and ai : S2 → C are spherical harmonics. Note, for example, that the 3D Slepian functions have this form [56, eq. 110]. If the hPi are orthogonal with respect to the weight w, then
(5.3) |
where we use as a shorthand for . The 3D Slepian functions satisfy the above condition with w = 1, because they are orthogonal in L2(R3).
Next, we write down the formula for an element LPi1,i2,j1,j2 (here, j1, j2 are the indices of the input matrix, and i1, i2 are the indices of the output matrix). From (3.38) and Lemma 3.3, we find
(5.4) |
Thus, to make many of the radial inner products in LP vanish, we see from (5.3) that the correct weight is
(5.5) | w(r) = 1/r
Recall that this is the weight needed to cancel the ramp filter in AP (see (3.36)). We obtain a cancellation in LP as well because the kernel of this operator also grows linearly with radial frequency. From this point on, w will represent the weight above, and we will work in the corresponding weighted L2 space.
What are sets of functions of the form (5.2) that are orthonormal in L2 (R3)? If we chose 3D Slepian functions, we would get the functional form
(5.6) |
However, these functions are orthonormal with weight w = 1 instead of w = 1/r. Consider modifying this construction by replacing the fk,ℓ(r) by the radial functions arising in the 2D Slepian functions. These satisfy the property
(5.7) |
With this property, (5.6) becomes orthonormal in L2 (R3). This gives LP a certain degree of sparsity. However, note that the construction (5.6) has different families of L2-orthogonal radial functions corresponding to each angular function. Thus, we only have orthogonality of the radial functions fk1,ℓ1 and fk2,ℓ2 when ℓ1 = ℓ2. Hence, many of the terms ⟨fj, fi⟩ in (5.4) are still nonzero.
A drastic improvement on (5.6) would be to devise an orthogonal basis in L2 that used one set of r-weighted orthogonal functions fk for all the angular functions, rather than a separate set for each angular function. Namely, suppose we chose
(5.8) |
where J is some indexing set. Note that fk and J need to be carefully constructed so that span{hk,ℓ,m} ≈ B (see section 5.3 for this construction). We have
(5.9) |
Here, we assume that each fk is either even or odd at the origin, and we extend fk(r) to r ∈ R according to this parity. The above calculation implies that fk should have the same parity as ℓ. Let us suppose that fk has the same parity as k. Then, it follows that (k, ℓ, m) ∈ J only if k = ℓ mod 2. Thus, hk,ℓ,m will be orthonormal in L2 if
(5.10) |
If we let ki be the radial index corresponding to i, then we claim that the above construction implies
(5.11) |
This statement does not follow immediately from (5.10), because we still need to check the case when ki1 ≠ kj1 mod 2. Note that in this case, the dependence on α in the integral over S2 × S2 is odd, and so indeed LPi1,i2,j1,j2 = 0 in that case as well. If VPk is the space spanned by fk(r)Y_ℓ^m(α) for all ℓ, m, then the above implies that LP operates separately on each VPk1 ⊗ VPk2. In the language of matrices, this means that if we divide ΣP n into blocks ΣP k1,k2 based on radial indices, LP operates on these blocks separately. We denote each of the corresponding “blocks” of LP by LPk1,k2. Let us reindex the angular functions so that a_i^k denotes the ith angular basis function paired with fk. From (5.11), we have
(5.12) |
This block diagonal structure of LP makes it much easier to invert. Nevertheless, each block LPk1,k2 is a square matrix with dimension O(k1^2 k2^2). Hence, inverting the larger blocks of LP can be difficult. Remarkably, it turns out that each block of LP is sparse. In Appendix C, we simplify the above integral over S2 × S2. Then, (5.12) becomes
(5.13) |
where the constants c(ℓ) are defined in (C.8) and Cℓ,m(ψ) is the ℓ, m coefficient in the spherical harmonic expansion of ψ : S2 → C. It turns out that the above expression is zero for most sets of indices. To see why, recall that the functions a_i^k are spherical harmonics. It is known that the product Y_ℓ^m Y_ℓ'^m' can be expressed as a linear combination of harmonics Y_L^M, where M = m + m' and |ℓ − ℓ'| ≤ L ≤ ℓ + ℓ'. Thus, the coefficient vectors of the products ai aj are sparse, which shows that each block LPk1,k2 is sparse. For example, LP15,15 has each dimension approximately 2 × 10^4. However, only about 10^7 elements of this block are nonzero, which is only about 3% of its total number of entries. This is about the same number of elements as a 3000 × 3000 full matrix.
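The figures quoted for the block with k1 = k2 = 15 can be reproduced by counting basis functions. The sketch below assumes the pairing rule established in section 5.3 (fk is paired with spherical harmonics of degree ℓ ≤ k with ℓ = k mod 2) and takes the nonzero count of about 10^7 from the text rather than computing it; the helper name `dim_Vk` is hypothetical.

```python
def dim_Vk(k):
    """Number of angular functions paired with f_k: degrees l <= k, l = k mod 2, |m| <= l."""
    return sum(2 * l + 1 for l in range(k % 2, k + 1, 2))

k = 15
block_dim = dim_Vk(k) ** 2          # side length of the block with k1 = k2 = 15
total_entries = block_dim ** 2      # number of entries if the block were stored densely
nnz_quoted = 1e7                    # nonzero count quoted in the text (not computed here)
fraction = nnz_quoted / total_entries
equivalent_full_dim = nnz_quoted ** 0.5
print(block_dim, f"{fraction:.1%}", round(equivalent_full_dim))
```

The side length comes out close to 2 × 10^4, the fill fraction close to 3%, and the dense matrix with the same number of entries has a side length of roughly 3000.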
Thus, we have found a way to tractably solve the covariance matrix estimation problem: reconstruct ΣP n (approximately) by solving the sparse linear systems
(5.14) |
where we recall that BPn is the RHS of (3.28). Also, using the fact that , we can estimate μPn from
(5.15) |
In the next two sections, we discuss how to choose the radial components fk(r) and define IP and VP more precisely.
5.3. Constructing fk(r) and the space VP
We have discussed so far that
(5.16) |
with (k, ℓ, m) ∈ J only if k = ℓ mod 2. Moreover, we have required the orthonormality condition (5.10). However, recall that we initially assumed that the real-domain functions Xs belonged to the space of 3D Slepian functions B. Thus, we must choose VP to approximate the image of B under the Fourier transform. Hence, the basis functions fk(r)Y_ℓ^m(θ, ϕ) should be supported in the ball of radius ωmax and have their inverse Fourier transforms concentrated in the unit ball. Moreover, we must have dim(VP) ≈ dim(B). Finally, the basis functions hPi should be analytic at the origin (they are the truncated Fourier transforms of compactly supported molecules). We begin by examining this condition.
Expanding hPi in a Taylor series near the origin up to a certain degree, we can approximate it locally as a finite sum of homogeneous polynomials. By [57, Theorem 2.1], a homogeneous polynomial of degree d can be expressed as
(5.17) |
where each Yℓ represents a linear combination of spherical harmonics of degree ℓ. Hence, if (k, ℓ, m) ∈ J, then we require that fk(r) = αℓ r^ℓ + αℓ+2 r^(ℓ+2) + ···, where some coefficients can be zero. We satisfy this requirement by constructing f0, f1,… so that
(5.18) | fk(r) = αk,k r^k + αk,k+2 r^(k+2) + ···
for small r with αk,k ≠ 0, and combine fk with Y_ℓ^m if k = ℓ mod 2 and ℓ ≤ k. This leads to the following set of 3D basis functions:
(5.19) |
Written another way, we define
(5.20) |
Following the reasoning preceding (5.17), it can be seen that near the origin, this basis spans the set of polynomial functions up to degree K.
Now, consider the real- and Fourier-domain content of hPi. The bandlimitedness requirement on Xs is satisfied if and only if the functions fk are supported in the interval [0, ωmax]. To deal with the real-domain requirement, we need the inverse Fourier transform of fk(r)Y_ℓ^m(θ, ϕ). With the Fourier convention (3.1), it follows from [2] that
(5.21) |
Here, jℓ is the spherical Bessel function of order ℓ, and Sℓ is the spherical Hankel transform. Also note that (r, θ, ϕ) are Fourier-domain spherical coordinates, while (rx, θx, ϕx) are their real-domain counterparts. Thus, satisfying the real-domain concentration requirement amounts to maximizing the percentage of the energy of Sℓfk that is contained in [0, 1] for 0 ≤ k ≤ K, 0 ≤ ℓ ≤ k, ℓ = k mod 2.
Finally, we have arrived at the criteria we would like fk(r) to satisfy:
1. supp fk ⊂ [0, ωmax];
2. {fk : k even} and {fk : k odd} orthonormal in L2(R+, r);
3. fk(r) = αk,k r^k + αk,k+2 r^(k+2) + ··· near r = 0;
4. under the above conditions, maximize the percentage of the energy of Sℓfk in [0, 1], for 0 ≤ k ≤ K, 0 ≤ ℓ ≤ k, ℓ = k mod 2.
While it might be possible to find an optimal set of such functions {fk} by solving an optimization problem, we can directly construct a set of functions that adequately satisfies the above criteria.
Note that since ℓ ranges in [0, k], it follows that for larger k, we need to have higher-order spherical Hankel transforms Sℓfk remain concentrated in [0, 1]. Since higher-order spherical Hankel transforms tend to be less concentrated for oscillatory functions, it makes sense to choose fk to be less and less oscillatory as k increases. Note that the functions fk cannot all have only a few oscillations because the even and odd functions must form orthonormal sets. Using this intuition, we construct fk as follows. Since the even and odd fk can be constructed independently, we will illustrate the idea by constructing the even fk. For simplicity, let us assume that K is odd, with K = 2K0 + 1. Define the cutoff χ = χ([0, ωmax]). First, consider the sequence
(5.22) |
where zk,m is the mth positive zero of Jk (the kth-order Bessel function). Note that the functions in this list satisfy criteria 1 (by construction) and 3 (due to the asymptotics of the Bessel function at the origin). Also note that we have chosen the scaling of the arguments of the Bessel functions so that the number of zero crossings decreases as the list goes on. Thus, the functions become less and less oscillatory, which is the pattern that might lead to satisfying criterion 4. However, since these functions might not be orthogonal with respect to the weight r, we need to orthonormalize them with respect to this weight (via Gram–Schmidt). We need to be careful to orthonormalize them in such a way as to preserve the properties that they already satisfy. This can be achieved by running the (r-weighted) Gram–Schmidt algorithm from higher k towards lower k. This preserves the supports of the functions, their asymptotics at the origin, and the oscillation pattern. Moreover, the orthogonality property now holds as well. See Figure 4 for the first several even radial basis functions. Constructing the odd radial functions requires following an analogous procedure. Also, changing the parity of K requires the obvious modifications.
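The ordering of the Gram–Schmidt step can be illustrated numerically. In the sketch below, the monomials r^k on [0, ωmax] are hypothetical stand-ins for the Bessel-based sequence (5.22), which is not reproduced here; the grid size, the midpoint quadrature, and the function name are likewise assumptions. The point is only that processing the functions from higher k towards lower k yields r-weighted orthonormality while keeping the leading behavior r^k at the origin.

```python
import numpy as np

def weighted_gram_schmidt_high_to_low(G, r, dr):
    """Orthonormalize the rows of G w.r.t. <f, g> = sum f*g*r*dr, last row first."""
    w = r * dr
    F = np.zeros_like(G)
    for i in range(G.shape[0] - 1, -1, -1):
        f = G[i].copy()
        # Subtract components along the already-processed, higher-order functions.
        for j in range(i + 1, G.shape[0]):
            f -= np.sum(f * F[j] * w) * F[j]
        F[i] = f / np.sqrt(np.sum(f * f * w))
    return F

omega_max = 1.0
n_pts = 2000
dr = omega_max / n_pts
r = (np.arange(n_pts) + 0.5) * dr          # midpoint quadrature nodes on [0, omega_max]
orders = [0, 2, 4, 6]                      # even radial indices k
G = np.array([r ** k for k in orders])     # stand-ins with leading behavior r^k
F = weighted_gram_schmidt_high_to_low(G, r, dr)
gram = (F * (r * dr)) @ F.T                # r-weighted Gram matrix of the result
```

Because each function is only modified by functions of higher order in r, its leading term near the origin is unchanged up to scaling, which is why criterion 3 survives the orthonormalization.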
It remains to choose K. We do this based on how well criterion 4 is satisfied. For example, we can calculate how much energy of Sℓfk is contained in the unit interval for all 0 ≤ k ≤ K, 0 ≤ ℓ ≤ k, ℓ = k mod 2. Numerical experiments show that K = Nres − 2 is a reasonable value. For each value of Nres that we tested, this choice led to Sℓfk having at least 80% of its energy concentrated in the unit interval for each relevant (ℓ, k), and at least 95% on average over all such pairs (ℓ, k). Thus, our experiments show that for our choice of fk, choosing roughly K ≈ Nres leads to acceptable satisfaction of criterion 4. A short calculation yields
(5.23) |
(5.24) |
Hence, we have pP/p = 6/π^2 ≈ 0.6. Thus, the dimension of the space VP we have constructed is within a constant factor of the dimension of B. This factor is the price we pay for the computational simplicity VP provides.
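The dimension of the constructed space can be checked by enumerating the index set directly. The sketch below assumes the rule stated above (0 ≤ k ≤ K, 0 ≤ ℓ ≤ k, ℓ = k mod 2, |m| ≤ ℓ); the closed form (K+1)(K+2)(K+3)/6 is derived here from that rule and is not quoted from the text, and the function name is hypothetical.

```python
import math

def index_set(K):
    """All (k, l, m) with 0 <= k <= K, 0 <= l <= k, l = k mod 2, |m| <= l."""
    return [(k, l, m)
            for k in range(K + 1)
            for l in range(k % 2, k + 1, 2)
            for m in range(-l, l + 1)]

K = 15                                   # value used in the experiments (Nres = 17)
p_hat = len(index_set(K))
closed_form = (K + 1) * (K + 2) * (K + 3) // 6
leading_ratio = 6 / math.pi ** 2         # leading-order ratio of dimensions quoted above
print(p_hat, closed_form, round(leading_ratio, 3))
```

For K = 15 the enumeration and the closed form agree, and the leading-order ratio evaluates to roughly 0.6.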
Note that a different construction of fk might have even better results. Choosing better radial functions can be the topic of further research. In any case, the specific choice of fk does not affect the structure of our algorithm at all because LP is independent of these functions, as can be seen from (5.12). Thus, the selection of the radial basis functions can be viewed as an independent module in our algorithm. The radial functions we choose here work well in numerical experiments; see section 7.
5.4. Constructing IP
Finally, the remaining piece in our construction is the finite-dimensional space of Fourier images, IP. To motivate our construction, consider applying Ps to a basis element of VP. The first observation to make is that the radial components fk(r) factor through Ps completely:
Recall from (3.21) that
(5.25) |
Note that the Ps on the LHS should be interpreted as a map C(R3) → C(R2), whereas the one on the RHS is the restricted map C(S2) → C(S1), which we also call Ps. The correct interpretation should be clear in each case. Viewed in this new way, Ps : C(S2) → C(S1) rotates a function on the sphere by Rs ∈ SO(3), and then restricts the result to the equator.
By the rotational properties of spherical harmonics, a short calculation shows that
(5.26) |
where the constants cℓ,m,m' depend on the Wigner D matrices D^ℓ [36]. Hence, Ps(VP) ⊂ IP if
(5.27) |
Thus, we construct IP by pairing fk with the (suitably normalized) angular function e^(imϕ) if k = m mod 2 and |m| ≤ k. This leads to the 2D basis functions
(5.28) |
Written another way, we construct
(5.29) |
If IPk is the subspace of IP spanned by the basis functions with radial component fk, (5.24) shows that Ps(VPk) ⊂ IPk for each k. Thus, PPs has a block diagonal structure, as depicted in Figure 5.
Let us now compare the dimension of IP to that of the corresponding space of 2D Slepian functions, as we did in the previous section. We have
(6.1) |
The Shannon number in 2D corresponding to the bandlimit ωmax is ωmax^2/4. Thus, we are short of this dimension by a constant factor of 8/π^2 ≈ 0.8. Another comparison to make is that the number of grid points in the disc inscribed in the Nres × Nres grid is πNres^2/4 = ωmax^2/π. Thus, dim(IP) is short of this number by a factor of 2/π. Note that this is the same factor that was obtained in a similar situation in [69], so IP is comparable in terms of approximation to the Fourier–Bessel space constructed there.
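These two ratios can be checked numerically at leading order. The sketch below counts the pairs (k, m) with |m| ≤ k and m = k mod 2, sets K = Nres as a leading-order simplification (the experiments use K = Nres − 2), and uses ωmax = πNres/2, which is the relation implied by πNres^2/4 = ωmax^2/π. The function name is hypothetical.

```python
import math

def dim_I(K):
    """Number of pairs (k, m) with 0 <= k <= K, |m| <= k, m = k mod 2."""
    return sum(1 for k in range(K + 1)
                 for m in range(-k, k + 1) if (m - k) % 2 == 0)

N_res = 400                                   # large value, to expose the leading-order behavior
omega_max = math.pi * N_res / 2
shannon_2d = omega_max ** 2 / 4               # 2D Shannon number quoted above
disc_points = math.pi * N_res ** 2 / 4        # grid points in the inscribed disc
ratio_shannon = dim_I(N_res) / shannon_2d     # should approach 8 / pi^2
ratio_disc = dim_I(N_res) / disc_points       # should approach 2 / pi
print(dim_I(15), round(ratio_shannon, 3), round(ratio_disc, 3))
```

For K = 15 the count is 136, and for large Nres the two ratios approach 8/π^2 and 2/π respectively.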
Thus, by this point we have fully specified our algorithm for the heterogeneity problem. After finding ΣP n numerically via (5.14), we can proceed as in steps 6–9 of Algorithm 1 to solve Problem 1.2.
6. Algorithm complexity
In this section, we explore the consequences of the constructions of VP and IP for the complexity of the proposed algorithm. We also compare this complexity with that of the straightforward CG approach discussed in section 5.1.
To calculate the computational complexity of inverting the sparse matrix LPk1,k2 via the CG algorithm, we must bound the number of nonzero elements of this matrix and its condition number.
6.1. Sparsity of LP and storage complexity
Preliminary numerical experiments confirm the following conjecture.
Conjecture 6.1
(6.2) |
where nnz(A) is the number of nonzero elements in a matrix A, and the term involving the square is the total number of elements in LPk1,k2 .
Hence, the percentage of nonzero elements in each block of LP decays linearly with the frequencies associated with that block. This conjecture remains to be verified theoretically.
We pause here to note the storage complexity of the proposed algorithm, which is dominated by the cost of storing LP. In fact, since we process all the blocks separately, storing only one LPk1,k2 at a time will suffice. Hence, the storage complexity is the memory required to store the largest block of LP, which is nnz(LPK,K) = O(K^7) = O(Nres^7). Compare this to the required storage for a full matrix of the size of LP, which is O(Nres^12).
6.2. Condition number of LĈ
Here we find the condition number of each LPk1 ,k2 . We already proved in Proposition 3.4 that λmin(LĈ) ≥ 1/2π. For any k1, k2, this implies that λmin(LPk1 ,k2 ) ≥ 1/2π. This is confirmed by a numerical experiment: in Figure 6(a) are plotted the minimum eigenvalues of LPk,k for 0 ≤ k ≤ 15. Note that the eigenvalues actually approach the value 1/2π (marked with a horizontal line) as k increases. We remarked in section 3.4 that an upper bound on the maximum eigenvalue is harder to find. Nevertheless, numerical experiments have led us to the following conjecture.
Conjecture 6.2
The maximal eigenvalue of LPk1,k2 grows linearly with min(k1, k2).
Moreover, a plot of the maximal eigenvalue of LPk,k shows a clear linear dependence on k. See Figure 6(b). The line of best fit is approximately
(6.3) |
Taken together, Proposition 3.4 and Conjecture 6.2 imply the following conjecture about the condition number of LPk1,k2 , which we denote by κ(LPk1 ,k2 ).
Conjecture 6.3
(6.4) |
In particular, this implies that
(6.5) |
6.3. Algorithm complexity
Using the above results, we estimate the computational complexity of Algorithm 1. We proceed step by step through the algorithm and estimate the complexity at each stage. Before we do so, note that due to the block diagonal structure of PPs (depicted in Figure 5), it can be easily shown that an application of PPs or PPsH costs O(K^4).
Sending the images from the pixel domain into IP requires n applications of the matrix Q1 ∈ C^(qP × q), which costs O(n q qP) = O(n N^2 Nres^2). Note that this complexity can be improved using an algorithm of the type [39], but in this paper we do not delve into the details of this alternative.
Finding μPn from (5.15) requires n applications of the matrix , and so has complexity .
Next, we must compute the matrix BPn. Note that the second term in BPn can be replaced by a multiple of the identity matrix by (3.36), so only the first term of BPn must be computed.
Note that BPn is a sum of n matrices, and each matrix can be found as the outer product of PPsH (IPs − PPsμPn) ∈ C^pP with itself. Calculating this vector has complexity O(K^4), from which it follows that calculating BPn costs O(nK^4) = O(nNres^4).
Next, we must invert LP. As mentioned in section 5.1, the inversion of a matrix A via CG takes √κ(A) iterations. If A is sparse, then applying it to a vector has complexity nnz(A). Hence, the total complexity for inverting a sparse matrix is √κ(A) nnz(A). Conjectures 6.1 and 6.3 imply that
(6.6) |
Since LP has size of the order K^6 × K^6, note that the complexity of inverting a full matrix of this size would be K^18. Thus, our efforts to make LP sparse have saved us a K^8.5 complexity factor. Moreover, the fact that LP is block diagonal makes its inversion parallelizable.
Assuming that C = O(1), solving each of the n least-squares problems (4.2) is dominated by a constant number of applications of PPs to a vector. Thus, finding αs for s = 1,… ,n costs O(nK^4).
Next, we must fit a mixture of Gaussians to αs to find αc. An EM approach to this problem requires O(n) operations per iteration. Assuming that the number of iterations is constant, finding αc has complexity O(n).
Finally, reconstructing XP c via (4.1) has complexity O(Nres^3).
Hence, neglecting lower-order terms, we find that the total complexity of our algorithm is
(6.7) |
6.4. Comparison to straightforward CG approach
We mentioned in section 5.1 that a CG approach is possible in which at each iteration, we apply LPn to ΣP using the definition (3.28). This approach has the advantage of not requiring uniformly spaced viewing directions. While the condition number of LPn depends on the rotations R1,… , Rn, let us assume here that κ(LPn) ≈ κ(LP). We estimated the computational complexity of this approach in section 5.1, but at that point we assumed that each PPs was a full matrix. If we use the bases VP and IP, we reap the benefit of the block diagonal structure of PPs. Hence, for each s, evaluating PPsH PPs ΣP PPsH PPs is dominated by the multiplication PPs ΣP, which has complexity Nres^7. Hence, applying LPn to ΣP has complexity nNres^7. By (6.4), we assume that κ(LPn) = O(Nres). Hence, the full complexity of inverting LPn using the conjugate gradient approach is O(nNres^7.5).
Compare this to a complexity of O(Nres^9.5) for inverting LP. Given that n is usually on the order of 10^5 or 10^6, for moderate values of Nres we have Nres^9.5 ≤ nNres^7.5. Nevertheless, both algorithms have possibilities for parallelization, which might change their relative complexities. As for memory requirements, note that the straightforward CG algorithm only requires O(Nres^6) storage, whereas we saw in section 6.1 that the proposed algorithm requires O(Nres^7) storage.
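The comparison between the two operation counts reduces to comparing n with Nres^2, which the following sketch makes explicit. Constants hidden in the O(·) notation are ignored, so the numbers indicate scaling only; the function name is hypothetical.

```python
def cg_to_sparse_ratio(n, N_res):
    """Ratio of the two leading-order counts, n * N_res^7.5 over N_res^9.5 (constants ignored)."""
    return (n * N_res ** 7.5) / (N_res ** 9.5)

N_res = 17
ratios = {n: cg_to_sparse_ratio(n, N_res) for n in (10 ** 4, 10 ** 5, 10 ** 6)}
for n, ratio in ratios.items():
    print(f"n = {n:>7}: straightforward CG / sparse inversion ~ {ratio:.0f}")
```

The ratio equals n / Nres^2, so the straightforward CG count is the larger one whenever n exceeds Nres^2 (289 for Nres = 17).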
In summary, these two algorithms each have their strengths and weaknesses, and it would be interesting to write parallel implementations for both and compare their performances. In the present paper, we have implemented and tested only the algorithm based on inverting LĈ.
7. Numerical results
Here, we provide numerical results illustrating Algorithm 1, with the bases IP and VP chosen so as to make LĈ sparse, as discussed in section 5. The results presented below are intended for proof-of-concept purposes, and they demonstrate the qualitative behavior of the algorithm. They are not, however, biologically significant results. We have considered an idealized setup in which there is no CTF effect, and have assumed that the rotations Rs (and translations) have been estimated perfectly. In this way, we do not perform a “full-cycle” experiment, starting from only the noisy images. Therefore, we cannot gauge the overall effect of noise on our algorithm because we do not account for its contribution to the misspecification of rotations; we investigate the effect of noise on the algorithm only after the rotation estimation step. Moreover, we use simulated data instead of experimental data. The application of our algorithm to experimental datasets is left for a separate publication.
7.1. An appropriate definition of SNR
Generally, the definition of SNR is
(7.2) |
where P denotes power. In our setup, we will find appropriate definitions for both P(signal) and P(noise). Let us consider first the noise power. The standard definition is P(noise) = σ^2. However, note that in our case, the noise has a power of σ^2 in each pixel of an N × N grid, but we reconstruct the volumes to a bandlimit ωmax, corresponding to Nres. Hence, if we downsampled the N × N images to size Nres × Nres, then we would still obey the Nyquist criterion (assuming the volumes actually are bandlimited by ωmax). This would have the effect of reducing the noise power by a factor of N^2/Nres^2. Hence, in the context of our problem, we define
(7.3) | P(noise) = σ^2 Nres^2/N^2
Now, consider P (signal). In standard SPR, a working definition of signal power is
(7.4) |
However, in the case of the heterogeneity problem, the object we are trying to reconstruct is not the volume itself, but rather the deviation from the average volume, due to heterogeneity. Thus, the relevant signal to us is not the images themselves, but the parts of the images that correspond to projections of the deviations of Xs from μ0. Hence, a natural definition of signal power in our case is
(7.5) |
Using the above definitions, let us define SNRhet in our problem by
(7.6) |
Even with the correction factor N^2/Nres^2, SNRhet values are lower than the SNR values usually encountered in structural biology. Hence, we also define
(7.7) |
We will present our numerical results primarily using SNRhet, but we will also provide the corresponding SNR values in parentheses.
To get a sense of the difference between this definition of SNR and the conventional one, compare the signal strength in a projection image to that in a mean-subtracted projection image in Figure 7.
7.2. Experimental procedure
We performed three numerical experiments: one with two heterogeneity classes, one with three heterogeneity classes, and one with continuous variation along the perimeter of a triangle defined by three volumes. The first two demonstrate our algorithm in the setup of Problem 1.2, and the third shows that we can estimate the covariance matrix and discover a low-dimensional structure in more general setups than the discrete heterogeneity case.
As a first step in each of the experiments, we created a number of phantoms analytically. We chose the phantoms to be linear combinations of Gaussian densities:
(A.1) |
For the discrete heterogeneity cases, we chose probabilities p1,… , pC and generated X1,… , Xn by sampling from X 1,… , X C accordingly. For the continuous heterogeneity case, we generated each Xs by choosing a point uniformly at random from the perimeter of the triangle defined by X 1, X 2, X 3.
For all of our experiments, we chose n = 10000, N = 65, Nres = 17, K = 15, and selected the set of rotations Rs to be approximately uniformly distributed on SO(3). For each Rs, we calculated the clean continuous projection image PsXs analytically, and then sampled the result on an N × N grid. Then, for each SNR level, we used (7.5) to find the noise power σ2 to add to the images.
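The step of choosing σ^2 for a target SNR can be sketched as follows, using the noise power σ^2 Nres^2/N^2 from section 7.1 and the generic relation SNR = P(signal)/P(noise). The signal power is treated as a given input, since its exact definition is not reproduced here; the function name and the example value of 1.0 for the signal power are assumptions.

```python
def noise_variance(signal_power, snr, N, N_res):
    """Per-pixel noise variance sigma^2 such that
    signal_power / (sigma^2 * N_res^2 / N^2) equals the target SNR."""
    return signal_power * N ** 2 / (snr * N_res ** 2)

N, N_res = 65, 17            # grid sizes used in the experiments
signal_power = 1.0           # hypothetical value, for illustration only
target_snr = 0.013
sigma2 = noise_variance(signal_power, target_snr, N, N_res)
achieved_snr = signal_power / (sigma2 * N_res ** 2 / N ** 2)
```

Because the effective noise power is reduced by the factor N^2/Nres^2, the per-pixel variance that achieves a given SNR is larger by that same factor than it would be under the standard definition P(noise) = σ^2.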
After simulating the data, we ran Algorithm 1 on the images Is and rotations Rs on an Intel i7-3615QM CPU with 8 cores, and 8 GB of RAM. The runtime for the entire algorithm with the above parameter values (excluding precomputations) is 257 seconds. For the continuous heterogeneity case, we stopped the algorithm after computing the coordinates αs (we did not attempt to reconstruct individual volumes in this case). To quantify the resolution of our reconstructions, we use the Fourier shell correlation (FSC), defined as the correlation of the reconstruction with the ground truth on each spherical shell in Fourier space [48]. For the discrete cases, we calculated FSC curves for the mean, the top eigenvectors, and the mean-subtracted reconstructed volumes. We also plotted the correlations of the mean, top eigenvectors, and mean-subtracted volumes with the corresponding ground truths for a range of SNR values. Finally, we plotted the coordinates αs. For the continuous heterogeneity case, we tested the algorithm on only a few different SNR values. By plotting αs in this case, we recover the triangle used in constructing Xs.
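The FSC described above can be sketched in a few lines. This is an illustrative implementation only, assuming cubic volumes sampled on a regular grid and shells of unit width in Fourier space; the function name and the shell binning are hypothetical choices, not necessarily those used to produce the figures.

```python
import numpy as np

def fourier_shell_correlation(vol_a, vol_b, n_shells=None):
    """Correlation between two cubic volumes on concentric Fourier shells."""
    n = vol_a.shape[0]
    fa = np.fft.fftshift(np.fft.fftn(vol_a))
    fb = np.fft.fftshift(np.fft.fftn(vol_b))
    # Integer-valued frequency coordinates, with zero frequency at the center.
    freqs = np.fft.fftshift(np.fft.fftfreq(n)) * n
    kx, ky, kz = np.meshgrid(freqs, freqs, freqs, indexing="ij")
    shell = np.rint(np.sqrt(kx ** 2 + ky ** 2 + kz ** 2)).astype(int)
    if n_shells is None:
        n_shells = n // 2
    fsc = np.zeros(n_shells)
    for r in range(n_shells):
        mask = shell == r
        num = np.real(np.sum(fa[mask] * np.conj(fb[mask])))
        den = np.sqrt(np.sum(np.abs(fa[mask]) ** 2) * np.sum(np.abs(fb[mask]) ** 2))
        fsc[r] = num / den if den > 0 else 0.0
    return fsc

rng = np.random.default_rng(0)
vol = rng.normal(size=(17, 17, 17))            # hypothetical ground-truth volume
fsc_self = fourier_shell_correlation(vol, vol)
fsc_noisy = fourier_shell_correlation(vol, vol + 0.1 * rng.normal(size=vol.shape))
```

A volume compared with itself gives an FSC of 1 on every shell, and the curve drops towards 0 on shells where the reconstruction is dominated by noise.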
7.3. Experiment: Two classes
In this experiment, we constructed two phantoms X 1 and X 2 of the form (7.7), with M1 = 1, M2 = 2. Cross sections of X 1 and X 2 are depicted in the top row panels (c) and (d) in Figure 8. We chose the two heterogeneity classes to be equiprobable: p1 = p2 = 1/2. Note that the theoretical covariance matrix in the two-class heterogeneity problem has rank 1, with dominant eigenvector proportional to the difference between the two volumes.
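The rank-one claim can be checked numerically. The sketch below uses random vectors as hypothetical stand-ins for the two class volumes and compares the covariance of the two-point distribution with the closed form p1 p2 (X1 − X2)(X1 − X2)^T.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 50                      # hypothetical number of voxels
x1 = rng.normal(size=dim)     # stand-ins for the two class volumes
x2 = rng.normal(size=dim)
p1, p2 = 0.5, 0.5

# Covariance of a random volume equal to x1 w.p. p1 and x2 w.p. p2.
mean = p1 * x1 + p2 * x2
cov = p1 * np.outer(x1 - mean, x1 - mean) + p2 * np.outer(x2 - mean, x2 - mean)

# Closed form: a single rank-one term along the difference of the volumes.
closed_form = p1 * p2 * np.outer(x1 - x2, x1 - x2)

eigvals, eigvecs = np.linalg.eigh(cov)
top = eigvecs[:, -1]
direction = (x1 - x2) / np.linalg.norm(x1 - x2)
alignment = abs(top @ direction)      # |cosine| between top eigenvector and x1 - x2
rank = np.linalg.matrix_rank(cov)
```

The only nonzero eigenvalue equals p1 p2 times the squared norm of X1 − X2, so the spectral gap grows with the size of the conformational change.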
Figure 8 shows the reconstructions of the mean, top eigenvector, and two volumes for SNRhet = 0.013, 0.003, 0.0013 (0.25, 0.056, 0.025). In Figure 9, we display eigenvalue histograms of the reconstructed covariance matrix for the above SNR values. Figure 10 shows the FSC curves for these reconstructions. Figure 11 shows the correlations of the computed means, top eigenvectors, and (mean-subtracted) volumes with their true values for a broader range of SNR values. In Figure 12, we plot a histogram of the coordinates αs from step 7 of Algorithm 1.
Our algorithm was able to meaningfully reconstruct the two volumes for SNRhet as low as about 0.003 (0.06). Note that the means were always reconstructed with at least a 94% correlation to their true values. On the other hand, the eigenvector reconstruction shows a phase-transition behavior, with the transition occurring between SNRhet values of 0.001
Regarding the coefficients αs depicted in Figure 12, note that in the noiseless case, there should be a distribution composed of two spikes. By adding noise to the images, the two spikes start blurring together. For SNR values up to a certain point, the distribution is still visibly bimodal. However, after a threshold the two spikes coalesce into one. The proportions pc are reliably estimated until this threshold.
7.4. Experiment: Three classes
In this experiment, we constructed three phantoms X 1, X2, X 3 of the form (7.7), with M1 = 2, M2 = 2, M3 = 1. The cross sections of X 1, X 2, X 3 are depicted in Figure 13 (top row, panels (d)–(f)). We chose the three classes to be equiprobable: p1 = p2 = p3 = 1/3. Note that the theoretical covariance matrix in the three-class heterogeneity problem has rank 2.
Figures 13, 14, 15, 16, 17 are the three-class analogues of Figures 8, 9, 10, 11, 12 in the two-class case.
Qualitatively, we observe behavior similar to that in the two-class case. The mean is reconstructed with at least 90% accuracy for all SNR values considered, while both top eigenvectors experience a phase-transition phenomenon (Figure 16(a)). As in the two-class case, the disappearance of the eigengap coincides with the phase-transition behavior in the reconstruction of the top eigenvectors. However, in the three-class case we have two eigenvectors, and the accuracy of the second eigenvector decays more quickly than that of the first. This reflects the fact that the top eigenvalue of the true covariance Σ̂0 is 2.1 × 10^5, while the second eigenvalue is 1.5 × 10^5. These two eigenvalues differ because the differences between the class volumes have unequal norms, which means that the two directions of variation have different associated variances. Hence, recovering the second eigenvector is less robust to noise. In particular, there are SNR values for which the top eigenvector can be recovered, but the second eigenvector cannot; SNRhet = 0.0044 (0.03) is one such value. We see in Figure 14 that for this SNR value, only the top eigenvalue separates from the bulk distribution. In this case, we would incorrectly estimate the rank of the true covariance as 1, and conclude that C = 2.
The coefficients αs follow a similar trend to those in the two-class case. For high SNRs, the coordinates cluster clearly around three points, as in Figure 17(a). As the noise is increased, the three clusters become less well defined. In Figure 17(b), we see that in this threshold case, the three clusters begin merging into one. As in the two-class case, this is the same threshold up to which the pc are accurately estimated. By the time SNRhet = 0.0044 (0.03), there is no visible cluster separation, just as we observed in the two-class case. Although the SNR threshold for finding pc from the αs coefficients comes earlier than the one for the eigengap, the quality of volume reconstruction roughly tracks the quality of the eigenvector reconstruction. This suggests that the estimation of cluster means is more robust than that of the probabilities pc.
7.5. Experiment: Continuous variation
In this experiment, we sampled Xs uniformly from the perimeter of the triangle determined by volumes X 1, X 2, X 3 (from the three-class discrete heterogeneity experiment). This setup is better suited to modeling the case when the molecule can vary continuously between each pair X i and X j. Although this experiment does not fall under Problem 1.2, Figure 18 shows that we still recover the rank-two structure. Indeed, all the clean mean-subtracted volumes still belong to a subspace of dimension 2. Moreover, we can see the triangular pattern of heterogeneity in the scatter plots of αs (Figure 19). However, note that once the images get moderately noisy, the triangular structure starts getting drowned out. Thus, in practice, without any prior assumptions, just looking at the scatter plots of αs will not necessarily reveal the heterogeneity structure in the dataset. To detect continuous variation, a new algorithmic step must be designed to follow covariance matrix estimation. Nevertheless, this experiment shows that by solving the general Problem 1.1, we can estimate covariance matrices beyond those considered in the discrete case of the heterogeneity problem.
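The sampling scheme and the resulting rank-two structure can be illustrated as follows. This is a sketch under stated assumptions: "uniformly" is taken to mean uniform with respect to arc length in the Euclidean norm on voxel space, the three vertices are random vectors standing in for the volumes, and the function name is hypothetical.

```python
import numpy as np

def sample_triangle_perimeter(vertices, n_samples, rng):
    """Sample points uniformly (by arc length) on the perimeter of a triangle.

    vertices: array of shape (3, dim), one vertex per row.
    """
    starts = vertices
    ends = np.roll(vertices, -1, axis=0)          # edges 0->1, 1->2, 2->0
    lengths = np.linalg.norm(ends - starts, axis=1)
    # Pick an edge with probability proportional to its length ...
    edge = rng.choice(3, size=n_samples, p=lengths / lengths.sum())
    # ... then a uniform position along that edge.
    t = rng.uniform(size=(n_samples, 1))
    return (1 - t) * starts[edge] + t * ends[edge]

rng = np.random.default_rng(2)
volumes = rng.normal(size=(3, 40))     # hypothetical stand-ins for the three volumes
samples = sample_triangle_perimeter(volumes, 5000, rng)

# After mean subtraction, the samples span only two directions.
centered = samples - samples.mean(axis=0)
singular_values = np.linalg.svd(centered, compute_uv=False)
```

Only the first two singular values are nonzero up to rounding, which matches the two-dimensional structure seen in the spectrum of the estimated covariance.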
8. Discussion
In this paper, we proposed a covariance matrix estimator from noisy linearly projected data and proved its consistency. The covariance matrix approach to the cryo-EM heterogeneity problem is essentially a special case of the general statistical problem under consideration, but has its own practical challenges. We overcame these challenges and proposed a methodology to tractably estimate the covariance matrix and reconstruct the molecular volumes. We proved the consistency of our estimator in the cryo-EM case and also began the mathematical investigation of the projection covariance transform. We discovered that inverting the projection covariance transform involves applying the triangular area filter, a generalization of the ramp filter arising in tomography. Finally, we validated our methodology on simulated data, producing accurate reconstructions at low SNR levels. Our implementation of this algorithm is now part of the ASPIRE package at spr.math.princeton.edu. In what follows, we discuss several directions for future research.
As discussed in section 2.3, our statistical framework and estimators have opened many new questions in high-dimensional statistics. While a suite of results is already available for the traditional high-dimensional PCA problem, generalizing these results to the projected data case would require new random matrix analysis. Our numerical experiments in the cryo-EM case have shown many qualitative similarities between the estimated covariance matrix in the cryo-EM case and the sample covariance matrix in the spiked model. There is again a bulk distribution with eigenvalues separated from it. Moreover, there is a phase-transition phenomenon in the cryo-EM case, in which the top eigenvectors of the estimated covariance lose correlation with those of the population covariance once the corresponding eigenvalues are absorbed by the bulk distribution. Answering the questions posed in section 2.3 would be very useful in quantifying the theoretical limitations of our approach.
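For comparison, the phase transition in the classical spiked model (fully observed data, not the cryo-EM estimator) is easy to reproduce. The sketch below assumes Gaussian samples with population covariance I + ℓ vv^T; for such data the bulk of the sample spectrum ends near (1 + √γ)^2 with γ = p/n, and the top sample eigenvector carries information about v only when ℓ exceeds roughly √γ. All names and parameter values are illustrative.

```python
import numpy as np

def top_eigenpair_alignment(spike, p, n, rng):
    """Sample covariance of n draws from N(0, I + spike * v v^T) in dimension p.

    Returns the largest sample eigenvalue and |<v_hat, v>|.
    """
    v = np.zeros(p)
    v[0] = 1.0
    data = rng.normal(size=(n, p))
    data[:, 0] *= np.sqrt(1.0 + spike)      # inflate the variance along v
    sample_cov = data.T @ data / n
    eigvals, eigvecs = np.linalg.eigh(sample_cov)
    return eigvals[-1], abs(eigvecs[:, -1] @ v)

rng = np.random.default_rng(3)
p, n = 200, 400
gamma = p / n
bulk_edge = (1 + np.sqrt(gamma)) ** 2       # upper edge of the Marchenko-Pastur bulk

# One spike well above the threshold sqrt(gamma), one well below it.
strong_eig, strong_align = top_eigenpair_alignment(3.0, p, n, rng)
weak_eig, weak_align = top_eigenpair_alignment(0.1, p, n, rng)
```

In the strong case the top eigenvalue sits clearly above the bulk edge and the eigenvector is well aligned with v; in the weak case the eigenvalue is absorbed by the bulk and the alignment is close to what a random direction would give.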
As an additional line of further inquiry, note that the optimization problem (2.4) for the covariance matrix is amenable to regularization. If n ≥ f (p, q) is the high-dimensional statistical regime in which the unregularized estimator still carries a signal, then we need regularization when n ≤ f (p, q). Here, f is a function depending on the distribution of the operators Ps. Moreover, regularization increases robustness to noise, so it could prove useful in applications like cryo-EM. Tikhonov regularization does not increase the complexity of our algorithm, but has the potential to make L̂n invertible. Under what conditions can we still achieve accurate recovery in a regularized setting? Other regularization schemes can take advantage of a priori knowledge of Σ0, such as nuclear norm regularization when Σ0 is known to be low rank. See [25] for an application of nuclear norm minimization in the context of dealing with heterogeneity in cryo-electron tomography. Another special structure Σ0 might have is sparsity in a certain basis. The localized variability assumption in the heterogeneity problem is one such example; in this case, the covariance matrix is sparse in the real Cartesian basis or a wavelet basis. This sparsity can be encouraged using a matrix 1-norm regularization term. Other methods, such as sparse PCA [22] or covariance thresholding [7], might be applicable in certain cases when we have sparsity in a given basis.
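The effect of Tikhonov regularization on a singular linear system can be sketched generically. The example below is not the cryo-EM operator: it assumes a hypothetical positive semidefinite matrix with a nontrivial null space and shows that adding λI costs nothing extra in the solve while making the system invertible.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 30
# Hypothetical stand-in for the operator: a positive semidefinite matrix
# with a 5-dimensional null space, so the unregularized system is singular.
factor = rng.normal(size=(d, d - 5))
L = factor @ factor.T
x_true = L @ rng.normal(size=d)        # an unknown lying in the range of L
b = L @ x_true

lam = 1e-6                             # illustrative regularization weight
x_reg = np.linalg.solve(L + lam * np.eye(d), b)   # Tikhonov-regularized solve

rank_L = np.linalg.matrix_rank(L)
relative_error = np.linalg.norm(x_reg - x_true) / np.linalg.norm(x_true)
```

Components of the unknown in the null space of the operator cannot be recovered by any choice of λ; the regularized solution simply sets them to zero, which is why the example draws the unknown from the range.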
We developed our algorithm in an idealized environment, assuming that the rotations Rs (and in-plane translations) are known exactly and correspond to approximately uniformly distributed viewing directions, and that the molecules belong to B. Moreover, we did not account for the CTF effect of the electron microscope. In practice, of course, rotations and translations are estimated with some error. Also, certain molecules might exhibit a preference for a certain orientation, invalidating the uniform rotations assumption. Note that as long as L̂n is invertible, our framework produces a valid estimator, but without the uniform rotations assumption, the computationally tractable approach to inverting this matrix proposed in section 5 no longer applies. Moreover, molecules might have higher frequencies than those we reconstruct, which could potentially lead to artifacts. Thus, an important direction of future research is to investigate the stability of our algorithm to perturbations from the idealized assumptions we have made. An alternative research direction is to devise numerical schemes to invert L̂n without replacing it by L̂, which could allow incorporation of the CTF and obviate the need to assume uniform rotations. We proposed one such scheme in section 5.1.
As we discussed in the introduction, our statistical problem (1.1) is actually a special case of the matrix sensing problem. In future work, it would be interesting to test matrix sensing algorithms on our problem and, in the cryo-EM case, to compare them with our approach. It would also be interesting to explore the applications of our methodology to other tomographic problems involving variability. For example, the field of four-dimensional (4D) electron tomography focuses on reconstructing a 3D structure that is a function of time [26]. This 4D reconstruction is essentially a movie of the molecule in action.
The methods developed in this paper can in principle be used to estimate the covariance matrix of a molecule varying with time. This is another kind of “heterogeneity” that is amenable to the same analysis we used to investigate structural variability in cryo-EM.
Acknowledgments
E. Katsevich thanks Jane Zhao, Lanhui Wang, and Xiuyuan Cheng (PACM, Princeton University) for their valuable advice on several theoretical and practical issues. Parts of this work have appeared in E. Katsevich’s undergraduate Independent Work at Princeton University.
The authors are also indebted to Philippe Rigollet (ORFE, Princeton), as this work benefited from discussions with him regarding the statistical framework. Also, the authors thank Joachim Frank (Columbia University) and Joakim Anden (PACM, Princeton University) for providing helpful comments about the manuscript. They also thank Dr. Frank and Hstau Liao (Columbia University) for allowing them to reproduce Figure 2 from [29] as Figure 1. Finally, they thank the editor and the referees for their many helpful comments.
The research of this author was partially supported by Award DMS-1115615 from NSF.
The research of this author was partially supported by Award R01GM090200 from the NIGMS, by Awards FA9550-12-1-0317 and FA9550-13-1-0076 from AFOSR, and by Award LTR DTD 06-05-2012 from the Simons Foundation.
Appendix A. Matrix derivative calculations
The goal of this appendix is to differentiate the objective functions of (2.3) and (2.4) to verify formulas (2.5) and (2.6). In order to differentiate with respect to vectors and matrices, we appeal to a few results from [17]. The results are as follows:
(A.2) |
Here, the lowercase letters represent vectors and the uppercase letters represent matrices. Also note that z* denotes the complex conjugate of z. The general term of (2.3) is
(A.3) |
We can differentiate this with respect to μ* by using the first two formulas of (A.1). We get
Summing in s gives us (2.5).
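The derivative with respect to μ* can be verified numerically. The sketch below assumes that the general term of (2.3) is the squared residual norm ‖Is − Psμ‖^2, in which case its derivative with respect to μ* is −Ps^H (Is − Psμ); the finite-difference check uses the Wirtinger convention ∂/∂z* = (∂/∂x + i ∂/∂y)/2. The matrices and function names are hypothetical.

```python
import numpy as np

def objective(mu, P, image):
    """Assumed general term: squared norm of the residual image - P mu."""
    r = image - P @ mu
    return np.real(np.vdot(r, r))

def analytic_grad_conj(mu, P, image):
    """Derivative with respect to conj(mu): -P^H (image - P mu)."""
    return -P.conj().T @ (image - P @ mu)

def numeric_grad_conj(mu, P, image, h=1e-6):
    """Wirtinger derivative d/d conj(mu_k) = (d/dx_k + 1j * d/dy_k) / 2."""
    grad = np.zeros_like(mu)
    for k in range(mu.size):
        e = np.zeros_like(mu)
        e[k] = 1.0
        dx = (objective(mu + h * e, P, image) - objective(mu - h * e, P, image)) / (2 * h)
        dy = (objective(mu + 1j * h * e, P, image) - objective(mu - 1j * h * e, P, image)) / (2 * h)
        grad[k] = 0.5 * (dx + 1j * dy)
    return grad

rng = np.random.default_rng(7)
q, p = 6, 4
P = rng.normal(size=(q, p)) + 1j * rng.normal(size=(q, p))
image = rng.normal(size=q) + 1j * rng.normal(size=q)
mu = rng.normal(size=p) + 1j * rng.normal(size=p)

analytic = analytic_grad_conj(mu, P, image)
numeric = numeric_grad_conj(mu, P, image)
```

Setting the summed derivative to zero gives a linear system in μ, which is the structure of the normal equations referred to in the text.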
If we let As = (Is − Psμn)(Is − Psμn)H − σ2I, then the general term of (2.4) is
Using the last two formulas of (A.1), we find that the derivative of this expression with respect to Σ is
(B.1) |
Taking a Hermitian and summing in s gives us (2.6).
Appendix B. Consistency of µn and Σn
In this appendix, we will prove the consistency results about μn and Σn stated in section 2.2. Recall that μn and Σn are defined nontrivially if $\|A_n^{-1}\| \le 2\|A^{-1}\|$ and $\|L_n^{-1}\| \le 2\|L^{-1}\|$. As a necessary step towards our consistency results, we must first prove that the probability of these events tends to 1 as n → ∞. Such a statement follows from a matrix concentration argument based on Bernstein’s inequality [59, Theorem 1.4], which we reproduce here for the reader’s convenience as a lemma.
Lemma B.1 (matrix Bernstein’s inequality)
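Consider a finite sequence Ys of independent, random, self-adjoint matrices with dimension p. Assume that each random matrix satisfies
$$\mathbb{E}[Y_s] = 0 \quad\text{and}\quad \lambda_{\max}(Y_s) \le R \quad\text{almost surely}. \tag{B.1}$$
Then, for all t ≥ 0,
$$\mathbb{P}\Big\{\lambda_{\max}\Big(\sum_s Y_s\Big) \ge t\Big\} \le p \cdot \exp\!\left(\frac{-t^2/2}{\sigma^2 + Rt/3}\right), \quad\text{where } \sigma^2 := \Big\|\sum_s \mathbb{E}\big[Y_s^2\big]\Big\|. \tag{B.2}$$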
Next, we prove another lemma, which is essentially the Bernstein inequality in a more convenient form.
Lemma B.2
Let Z be a symmetric d × d random matrix, with $\|Z\| \le B$ almost surely. If Z1,… , Zn are i.i.d. samples from Z, then
(B.4) |
Moreover,
(B.5) |
where C is an absolute constant.
Proof
The proof is an application of the matrix Bernstein inequality. Let . Then, note that E[Ys] = 0 and
(B.6) |
Next, we have
(B.7) |
It follows that
(B.8) |
Now, by the matrix Bernstein inequality, we find that
(B.9) |
(B.10) |
This proves (B.3). The bound (B.4) follows from [59, Remark 6.5].
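The concentration behind this lemma can be illustrated with a small Monte Carlo experiment. The sketch below assumes a hypothetical zero-mean random matrix Z with ‖Z‖ = B, and compares the empirical tail of the deviation of the sample mean with the two-sided bound obtained directly from [59, Theorem 1.4] applied to Ys = Zs/n and to −Ys, using ‖Ys‖ ≤ B/n and σ2 ≤ B^2/n. It is a sanity check rather than a reproduction of the constants in the lemma.

```python
import numpy as np

def random_bounded_symmetric(d, B, rng):
    """Zero-mean symmetric d x d matrix with spectral norm exactly B."""
    g = rng.normal(size=(d, d))
    s = g + g.T
    return B * s / np.linalg.norm(s, 2)

def deviation_of_mean(n, d, B, rng):
    """Spectral norm of the average of n i.i.d. samples (the true mean is 0)."""
    total = np.zeros((d, d))
    for _ in range(n):
        total += random_bounded_symmetric(d, B, rng)
    return np.linalg.norm(total / n, 2)

def bernstein_tail_bound(t, n, d, B):
    """Two-sided matrix Bernstein bound with R = B / n and sigma^2 <= B^2 / n."""
    return 2 * d * np.exp(-(t ** 2 / 2) / (B ** 2 / n + B * t / (3 * n)))

rng = np.random.default_rng(5)
d, B, n, trials, t = 5, 1.0, 200, 100, 0.3
deviations = np.array([deviation_of_mean(n, d, B, rng) for _ in range(trials)])
empirical_tail = np.mean(deviations >= t)     # fraction of trials exceeding t
bound = bernstein_tail_bound(t, n, d, B)
```

The bound is typically far from tight for a specific distribution, but it holds uniformly over all bounded Z, which is what the consistency argument requires.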
Corollary B.3
Let P be a random q × p matrix such that $\|P\| \le B_P$ almost surely. Let $A = \mathbb{E}[P^H P]$ and let $A_n = \frac{1}{n}\sum_{s=1}^{n} P_s^H P_s$, where P1,… , Pn are i.i.d. samples from P. Then,
(B.11) |
Moreover,
(B.12) |
where the last equality holds if n ≥ 4 log p.
Proof
These bounds follow by letting $Z = P^H P$ in Lemma B.2 and noting that $\|Z\| \le B_P^2$ almost surely.
Corollary B.4
Let P be a random q × p matrix such that $\|P\| \le B_P$ almost surely. Let $L_n \Sigma = \frac{1}{n}\sum_{s=1}^{n} P_s^H P_s \Sigma P_s^H P_s$, where P1,… , Pn are i.i.d. samples from P. Then,
(B.13) |
(B.14) |
Moreover,
where the last equality holds if n ≥ 8 log p.
Proof
We wish to apply Lemma B.2 again, this time for ZΣ = PHP ΣPHP . In this case we must be careful because Z is an operator on the space of p × p matrices. We can view it as a p2 × p2 matrix if we represent its argument (a p × p matrix Σ) as a vector of length p2 (denoted by vec(Σ)). Then, almost surely,
(B.15) |
In the penultimate inequality above we used the fact that $\|A\|_F \le \sqrt{\operatorname{rank}(A)}\,\|A\|$ for an arbitrary matrix A. Now, (B.11) follows from (B.3) by setting $B = q^2 B_P^4$ and $d = p^2$.
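The identification of Z with a p^2 × p^2 matrix can be made concrete with the Kronecker product. The sketch below assumes the column-stacking convention for vec, under which vec(GΣG) = (G^T ⊗ G) vec(Σ) with G = P^H P; the random matrices are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)
q, p = 3, 5
P = rng.normal(size=(q, p)) + 1j * rng.normal(size=(q, p))
Sigma = rng.normal(size=(p, p)) + 1j * rng.normal(size=(p, p))

G = P.conj().T @ P                       # P^H P, a p x p Hermitian matrix

def vec(M):
    """Stack the columns of M into one long vector (column-major order)."""
    return M.reshape(-1, order="F")

# vec(G Sigma G) = (G^T kron G) vec(Sigma) for column-major vec.
Z_matrix = np.kron(G.T, G)               # the p^2 x p^2 representation of Z
lhs = vec(G @ Sigma @ G)
rhs = Z_matrix @ vec(Sigma)

operator_norm = np.linalg.norm(Z_matrix, 2)
```

In this representation the spectral norm of the p^2 × p^2 matrix is the product of the spectral norms of the two Kronecker factors, so it is controlled by the fourth power of ‖P‖.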
Proposition B.5
Let $\mathcal{E}_A$ be the event that $\|A_n^{-1}\| \le 2\|A^{-1}\|$, and let $\mathcal{E}_L$ be the event that
(B.16) |
where
(B.17) |
Proof
Note that $\lambda_{\min}(A_n) \ge \lambda_{\min}(A) - \|A_n - A\|$. It follows that
(B.18) |
By Corollary B.3, it follows that
(B.19) |
Analogously, Corollary B.4 implies that
(B.20) |
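The reduction used in this proof can be spelled out as a short worked step. It is a sketch, assuming that A and An are Hermitian positive semidefinite (as holds for averages and expectations of matrices of the form P^H P) and that A is invertible, so that the norm of the inverse is the reciprocal of the smallest eigenvalue:

```latex
\[
\|A_n - A\| \le \tfrac{1}{2}\lambda_{\min}(A)
\;\Longrightarrow\;
\lambda_{\min}(A_n) \ge \lambda_{\min}(A) - \|A_n - A\| \ge \tfrac{1}{2}\lambda_{\min}(A) > 0
\;\Longrightarrow\;
\|A_n^{-1}\| = \frac{1}{\lambda_{\min}(A_n)} \le \frac{2}{\lambda_{\min}(A)} = 2\,\|A^{-1}\| .
\]
\[
\text{Hence}\quad
\mathbb{P}\big[\|A_n^{-1}\| > 2\|A^{-1}\|\big]
\;\le\;
\mathbb{P}\big[\|A_n - A\| > \tfrac{1}{2}\lambda_{\min}(A)\big],
\]
```

and the right-hand side is exactly the kind of tail probability that the concentration bounds for An control. The same argument applies to Ln and L.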
Now, we prove the consistency results, which we restate for convenience. In the following propositions, define
(B.21) |
Note that
(B.22) |
Also, recall the following notation introduced in section 2.2:
(B.23) |
where V is a random vector. For example, (B.20) can be written as .
Proposition B.6
Suppose A (defined in (2.10)) is invertible, that $\|P\| \le B_P$ almost surely, and that |||X|||2, |||E|||2 < ∞. Then, for fixed p, q we have
(B.24) |
Hence, under these assumptions, μn is consistent.
Proof
Since $\mathbb{P}[\|\mu_n - \mu_0\| \ge t] \le t^{-1}\,\mathbb{E}[\|\mu_n - \mu_0\|]$ by Markov’s inequality, it is sufficient to prove that $\mathbb{E}[\|\mu_n - \mu_0\|] \to 0$ as n → ∞. Note that by the definition of µn and Proposition B.5,
(B.25) |
(B.26) |
where these summands are i.i.d., we find
Since
(B.27) |
Putting together what we have, we arrive at
(B.28) |
Inspecting this bound reveals that $\mathbb{E}[\|\mu_n - \mu_0\|] \to 0$ as n → ∞, as needed.
Remark B.7
Note that with a simple modification to the above argument, we obtain
(B.29) |
This bound will be useful later.
Before proving the consistency of Σn, we state a lemma.
Lemma B.8
Let V be a random vector on Cp with E[VVH ] = ΣV , and let V1,… , Vn be i.i.d. samples from V . Then, for some absolute constant C,
(B.30) |
provided the RHS does not exceed $\|\Sigma_V\|$.
Proof
This result is a simple modification of [47, Theorem 1].
Proposition B.9
Suppose A and L (defined in (2.10)) are invertible, that $\|P\| \le B_P$ almost surely, and that there is a polynomial Q for which
(B.31) |
Then, for fixed p, q, we have
(B.32) |
Hence, under these assumptions, Σn is consistent.
Proof
In parallel to the proof of Proposition B.6, we will prove that $\mathbb{E}[\|\Sigma_n - \Sigma_0\|] \to 0$ as n → ∞. We compute
(B.33) |
Now, we will bound E . To do this, we write
(B.34) |
Let us consider each of these four difference terms in order. Note that
(B.35) |
Moreover,
(B.36) |
Using the Cauchy–Schwarz inequality and (B.26), we find
(B.37) |
Here, we used (B.26). This bound also holds for the second term in the last line of (B.33). As for the third term,
(B.38) |
Putting these bounds together, we arrive at
(B.39) |
(B.40) |
Next, we move on to analyzing D2. If V = PH (I − P μ0), note that
(B.41) |
By Lemma B.8, we find (B.38).
Since Σ0 = E[(X − μ0)(X − μ0)H ], it follows that $\|\Sigma_0\| \le \mathbb{E}[\|X - \mu_0\|^2] = |||X|||_2^2$. Further, the calculation (B.13) implies that
(B.42) |
Also, it is clear that . Furthermore, Minkowski inequality implies that
(B.43) |
Hence, (B.38) becomes
(B.44) |
Next, a bound for D3 follows immediately from (B.10):
(C.1) |
Similarly, (B.12) gives
(C.2) |
Combining the four bounds (B.36), (B.39), (B.42), (B.43) with (B.30) and (B.31), we arrive at
(C.3) |
Fixing all the variables except n, we see that the largest term is the one in the second line, and it decays as Q(log n)/√n due to the moment growth condition (B.28).
Appendix C. Simplifying (5.12)
Here, we simplify the expression for an element of LPk1,k2:
(C.4) |
Let . Then, (C.1) becomes
(C.5) |
Recall from section 5.3 that is a spherical harmonic of order up to k. It follows that has a spherical harmonic expansion up to order 2k1 (using the formula for the product of two spherical harmonics, which involves the Clebsch–Gordan coefficients). The same holds for , where the order goes up to 2k2. Let us write for the l, m coefficient of the spherical harmonic expansion of . Thus, we have
(C.6) |
It follows that
(C.7) |
Since K(α, β) depends only on α · β, by an abuse of notation we can write K(α, β) = K(α · β). Thus, the Funk–Hecke theorem applies [38], so we may write
(C.8) |
where
(C.9) |
Note that the $P_\ell$ are the Legendre polynomials. Since K is an even function of t and $P_\ell$ has the same parity as ℓ, it follows that c(ℓ) = 0 for odd ℓ. For even ℓ, we have
(C.9) |
It follows from formula 3 on p. 423 of [45] that
(C.9) |
Using Stirling’s formula, we can find that $c(\ell) \sim \ell^{-1}$ for large ℓ.
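The parity argument for the coefficients can be checked numerically. The sketch below integrates an even kernel against Legendre polynomials with Gauss–Legendre quadrature and confirms that the odd-degree coefficients vanish. The kernel 1/(1 + t^2) is a hypothetical smooth stand-in chosen only for illustration (it is not the kernel K of this appendix), and the normalization constants of the Funk–Hecke theorem are omitted.

```python
import numpy as np
from numpy.polynomial import legendre

def funk_hecke_coefficients(kernel, max_degree, n_nodes=200):
    """Approximate integral_{-1}^{1} kernel(t) P_l(t) dt for l = 0..max_degree
    with Gauss-Legendre quadrature (normalization constants omitted)."""
    nodes, weights = legendre.leggauss(n_nodes)
    values = kernel(nodes)
    coeffs = []
    for l in range(max_degree + 1):
        basis = np.zeros(l + 1)
        basis[l] = 1.0                        # coefficient vector selecting P_l
        coeffs.append(np.sum(weights * values * legendre.legval(nodes, basis)))
    return np.array(coeffs)

# Hypothetical smooth even kernel, used only to illustrate the parity argument.
even_kernel = lambda t: 1.0 / (1.0 + t ** 2)
c = funk_hecke_coefficients(even_kernel, max_degree=8)
```

The decay rate of the even-degree coefficients depends on the kernel; the ℓ^−1 rate stated above is specific to K and is not reproduced by this smooth stand-in.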
Finally, plugging the result of Funk–Hecke into (C.4), we obtain
(C.9) |
Thus, we have verified (5.13).
Footnotes
Received by the editors September 3, 2013; accepted for publication (in revised form) September 22, 2014; published electronically January 22, 2015.
REFERENCES
- [1] Amunts A, Brown A, Bai X, Llácer J, Hussain T, Emsley P, Long F, Murshudov G, Scheres S, Ramakrishnan V. Structure of the yeast mitochondrial large ribosomal subunit. Science. 2014;343:1485–1489. doi: 10.1126/science.1249410.
- [2] Baddour N. Operational and convolution properties of three dimensional Fourier transforms in spherical polar coordinates. J. Opt. Soc. Amer. A. 2010;27:2144–2155. doi: 10.1364/JOSAA.27.002144.
- [3] Bai X, Fernandez I, McMullan G, Scheres S. Ribosome structures to near-atomic resolution from thirty thousand cryo-em particles. eLife. 2013;2:e00461. doi: 10.7554/eLife.00461.
- [4] Baik J, Ben Arous G, Péché S. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. Ann. Probab. 2005;33:1643–1697.
- [5] Baik J, Silverstein JW. Eigenvalues of large sample covariance matrices of spiked population models. J. Multivariate Anal. 2006;97:1382–1408.
- [6] Bennett J, Lanning S. The Netflix prize. 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; San Jose, CA, ACM, New York. 2007.
- [7] Bickel PJ, Levina E. Covariance regularization by thresholding. Ann. Statist. 2008;36:2577–2604.
- [8] Bishop C. Pattern Recognition and Machine Learning. Inf. Sci. Statist. Springer-Verlag; New York: 2006.
- [9] Candès E, Plan Y. Matrix completion with noise. Proc. IEEE. 2010;98:925–936.
- [10] Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B Stat. Methodol. 1977;39:1–38.
- [11] Donoho D. High-dimensional data analysis: The curses and blessings of dimensionality. Math Challenges of the 21st Century. Los Angeles: 2000.
- [12] Frank J. Three-Dimensional Electron Microscopy of Macromolecular Assemblies: Visualization of Biological Molecules in Their Native State. Oxford University Press; Oxford: 2006.
- [13] Frank J. Exploring the Dynamics of Supramolecular Machines with Cryo-Electron Microscopy. Proceedings of the 23rd International Solvay Conference on Chemistry; Brussels: International Solvay Institutes; 2013.
- [14] Frank J. Story in a sample – the potential (and limitations) of cryo-electron microscopy applied to molecular machines. Biopolymers. 2013;99:832–836. doi: 10.1002/bip.22274.
- [15] Henderson R. Realizing the potential of electron cryo-microscopy. Quart. Rev. Biophys. 2004;37:3–13. doi: 10.1017/s0033583504003920.
- [16] Herman G, Kalinowski M. Classification of heterogeneous electron microscopic projections into homogeneous subsets. Ultramicroscopy. 2008;108:327–338. doi: 10.1016/j.ultramic.2007.05.005.
- [17] Hjørungnes A, Gesbert D. Complex-valued matrix differentiation: Techniques and key results. IEEE Trans. Signal Process. 2007;55:2740–2746.
- [18] Ilin A, Raiko T. Practical approaches to principal component analysis in the presence of missing values. J. Mach. Learn. Res. 2010;11:1957–2000.
- [19] Jain P, Netrapalli P, Sanghavi S. Low-rank matrix completion using alternating minimization. Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, STOC ’13, ACM; New York. 2013. pp. 665–674.
- [20] Jin Q, Sorzano COS, de la Rosa-Trevín JM, Bilbao-Castro JR, Núñez-Ramírez R, Llorca O, Tama F, Jonić S. Iterative elastic 3D-to-2D alignment method using normal modes for studying structural dynamics of large macromolecular complexes. Structure. 2014;22:496–506. doi: 10.1016/j.str.2014.01.004.
- [21] Johnstone I. On the distribution of the largest eigenvalue in principal components analysis. Ann. Statist. 2001;29:295–327.
- [22] Johnstone I, Lu A. On consistency and sparsity for principal components analysis in high dimensions. J. Amer. Statist. Assoc. 2009;104:682–693. doi: 10.1198/jasa.2009.0121.
- [23] Kalai AT, Moitra A, Valiant G. Disentangling Gaussians. Commun. ACM. 2012;55:113–120.
- [24] Kühlbrandt W. The resolution revolution. Science. 2014;343:1443–1444. doi: 10.1126/science.1251652.
- [25] Kuybeda O, Frank GA, Bartesaghi A, Borgnia M, Subramaniam S, Sapiro G. A collaborative framework for 3D alignment and classification of heterogeneous subvolumes in cryoelectron tomography. J. Struct. Biol. 2013;181:116–127. doi: 10.1016/j.jsb.2012.10.010.
- [26] Kwon O, Zewail AH. 4D electron tomography. Science. 2010;328:1668–1673. doi: 10.1126/science.1190470.
- [27] Leger F, Yu G, Sapiro G. Efficient matrix completion with Gaussian models. IEEE 2011 International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE; Piscataway, NJ. 2011. pp. 1113–1116.
- [28] Li X, Mooney P, Zheng S, Booth C, Braunfeld M, Gubbens S, Agard D, Cheng Y. Electron counting and beam-induced motion correction enable near-atomic-resolution single-particle cryo-em. Nature Methods. 2013;10:584–590. doi: 10.1038/nmeth.2472.
- [29] Liao H, Frank J. Classification by bootstrapping in single particle methods. Proceedings of the 2010 IEEE International Conference on Biomedical Imaging: From Nano to Macro, IEEE; Piscataway, NJ. 2010. pp. 169–172.
- [30] Liao M, Cao E, Julius D, Cheng Y. Structure of the TRPV1 ion channel determined by electron cryo-microscopy. Nature. 2013;504:107–124. doi: 10.1038/nature12822.
- [31] Little R, Rubin D. Statistical Analysis with Missing Data. 2nd ed. Wiley Ser. Probab. Stat. John Wiley; Hoboken, NJ: 2002.
- [32] Loh P, Wainwright M. High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity. Ann. Statist. 2012;40:1637–1664.
- [33] Lounici K. High-dimensional covariance matrix estimation with missing observations. Bernoulli. 2014;20:1029–1058.
- [34] Ludtke S, Baker M, Chen D, Song J, Chuang D, Chiu W. De novo backbone trace of GroEL from single particle electron cryomicroscopy. Structure. 2008;16:441–448. doi: 10.1016/j.str.2008.02.007.
- [35] Marčenko VA, Pastur LA. Distribution of eigenvalues of some sets of random matrices. Math. USSR Sb. 1967;1:507–536.
- [36] Morrison MA, Parker GA. A guide to rotations in quantum mechanics. Aust. J. Phys. 1987;40:465–497.
- [37] Nadler B. Finite sample approximation results for principal component analysis: A matrix perturbation approach. Ann. Statist. 2008;36:2791–2817.
- [38] Natterer F. The Mathematics of Computerized Tomography. Classics Appl. Math. SIAM; Philadelphia: 2001.
- [39] O’Neil M, Woolfe F, Rokhlin V. An algorithm for the rapid evaluation of special function transforms. Appl. Comput. Harmon. Anal. 2010;28:203–226.
- [40] Pearson K. On lines and planes of closest fit to systems of points in space. Philos. Mag. 1901;2:559–572.
- [41] Penczek P, Liang ZP. Variance in three-dimensional reconstructions from projections. In: Unser M, editor. Proceedings of the 2002 IEEE International Symposium on Biomedical Imaging; IEEE; Piscataway, NJ. 2002. pp. 749–752.
- [42] Penczek P, Chao Y, Frank J, Spahn CMT. Estimation of variance in single-particle reconstruction using the bootstrap technique. J. Struct. Biol. 2006;154:168–183. doi: 10.1016/j.jsb.2006.01.003.
- [43] Penczek P, Kimmel M, Spahn C. Identifying conformational states of macromolecules by eigenanalysis of resampled cryo-EM images. Structure. 2011;19:1582–1590. doi: 10.1016/j.str.2011.10.003.
- [44] Penczek P, Renka R, Schomberg H. Gridding-based direct Fourier inversion of the three-dimensional ray transform. J. Opt. Soc. Amer. A. 2004;21:499–509. doi: 10.1364/josaa.21.000499.
- [45] Prudnikov AP, Brychkov YA, Marichev OI. Integrals and Series: Special Functions. Gordon and Breach; Amsterdam: 1983.
- [46] Recht B, Fazel M, Parrilo PA. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev. 2010;52:471–501.
- [47] Rudelson M. Random vectors in the isotropic position. J. Funct. Anal. 1999;164:60–72.
- [48] Saxton WO, Baumeister W. The correlation averaging of a regularly arranged bacterial cell envelope protein. J. Microscopy. 1982;127:127–138. doi: 10.1111/j.1365-2818.1982.tb00405.x.
- [49] Scheres S. Relion: Implementation of a Bayesian approach to cryo-EM structure determination. J. Struct. Biol. 2012;180:519–530. doi: 10.1016/j.jsb.2012.09.006.
- [50] Scheres S. Maximum-likelihood methods in cryo-EM. Part II: Application to experimental data. J. Struct. Biol. 2013;181:195–206.
- [51] Schneider T. Analysis of incomplete climate data: Estimation of mean values and covariance matrices and imputation of missing values. J. Climate. 2001;14:853–871.
- [52] Shatsky M, Hall R, Nogales E, Malik J, Brenner S. Automated multi-model reconstruction from single-particle electron microscopy data. J. Struct. Biol. 2010;170:98–108. doi: 10.1016/j.jsb.2010.01.007.
- [53] Sigworth F, Doerschuk P, Carazo J, Scheres S. Maximum-likelihood methods in cryo EM. Part I: Theoretical basis and overview of existing approaches. Methods Enzymology. 2010;482:263–294. doi: 10.1016/S0076-6879(10)82011-7.
- [54] Silverstein JW, Bai ZD. On the empirical distribution of eigenvalues of a class of large dimensional random matrices. J. Multivariate Anal. 1995;54:175–192.
- [55] Singer A, Shkolnisky Y. Three-dimensional structure determination from common lines in cryo-EM by eigenvectors and semidefinite programming. SIAM J. Imag. Sci. 2011;4:543–572. doi: 10.1137/090767777.
- [56] Slepian D. Prolate spheroidal wave functions, Fourier analysis and uncertainty – IV: Extensions to many dimensions; generalized prolate spheroidal functions. Bell System Tech. J. 1964;43:3009–3057.
- [57] Stein EM, Weiss GL. Introduction to Fourier Analysis on Euclidean Spaces. Princeton University Press; Princeton, NJ: 1971.
- [58] Trefethen L, Bau D III. Numerical Linear Algebra. SIAM; Philadelphia: 1997.
- [59] Tropp J. User-friendly tail bounds for sums of random matrices. Found. Comput. Math. 2012;12:389–434.
- [60] van Heel M. Principles of Phase Contrast (Electron) Microscopy. 2009. http://www.singleparticles.org/methodology/MvH_Phase Contrast.pdf
- [61] van Heel M, Gowen B, Matadeen R, Orlova EV, Finn R, Pape T, Cohen D, Stark H, Schmidt R, Patwardhan A. Single particle electron cryo-microscopy: Towards atomic resolution. Quart. Rev. Biophys. 2000;33:307–369. doi: 10.1017/s0033583500003644.
- [62] Vershynin R. Introduction to the non-asymptotic analysis of random matrices. In: Eldar Y, Kutyniok G, editors. Compressed Sensing, Theory and Applications. Cambridge University Press; Cambridge: 2012. pp. 210–268.
- [63] Wang L, Sigworth FJ. Cryo-EM and single particles. Physiology (Bethesda). 2006;21:13–18. doi: 10.1152/physiol.00045.2005.
- [64] Wang L, Singer A, Wen Z. Orientation determination of cryo-EM images using least unsquared deviations. SIAM J. Imag. Sci. 2013;6:2450–2483. doi: 10.1137/130916436.
- [65] Wang Q, Matsui T, Domitrovic T, Zheng Y, Doerschuk P, Johnson J. Dynamics in cryo EM reconstructions visualized with maximum-likelihood derived variance maps. J. Struct. Biol. 2013;181:195–206. doi: 10.1016/j.jsb.2012.11.005.
- [66] Wilks SS. Moments and distributions of estimates of population parameters from fragmentary samples. Ann. Math. Statist. 1932;3:163–195.
- [67] Zhang W, Kimmel M, Spahn CM, Penczek P. Heterogeneity of large macromolecular complexes revealed by 3d cryo-em variance analysis. Structure. 2008;16:1770–1776. doi: 10.1016/j.str.2008.10.011.
- [68].Zhang X, Settembre E, Xu C, Dormitzer P, Bellamy R, Harrison S, Grigorieff N. Near-atomic resolution using electron cryomicroscopy and single-particle reconstruction. Proc. Natl. Acad. Sci. USA. 2008;105:1867–1872. doi: 10.1073/pnas.0711623105. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [69].Zhao Z, Singer A. Fourier-Bessel rotational invariant eigenimages. J. Opt. Soc. Amer. A. 2013;30:871–877. doi: 10.1364/JOSAA.30.000871. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [70].Zhao Z, Singer A. Rotationally invariant image representation for viewing direction classification in cryo-EM. J. Struct. Biol. 2014;186:153–166. doi: 10.1016/j.jsb.2014.03.003. [DOI] [PMC free article] [PubMed] [Google Scholar]