Biophysical Journal. 2015 Mar 10;108(5):1165–1175. doi: 10.1016/j.bpj.2014.12.054

Bayesian Inference of Initial Models in Cryo-Electron Microscopy Using Pseudo-atoms

Paul Joubert, Michael Habeck
PMCID: PMC4375433  PMID: 25762328

Abstract

Single-particle cryo-electron microscopy is widely used to study the structure of macromolecular assemblies. Tens of thousands of noisy two-dimensional images of the macromolecular assembly viewed from different directions are used to infer its three-dimensional structure. The first step is to estimate a low-resolution initial model and initial image orientations. This is a challenging global optimization problem with many unknowns, including an unknown orientation for each two-dimensional image. Obtaining a good initial model is crucial for the success of the subsequent refinement step. We introduce a probabilistic algorithm for estimating an initial model. The algorithm is fast, has very few algorithmic parameters, and yields information about the precision of estimated model parameters in addition to the parameters themselves. Our algorithm uses a pseudo-atomic model to represent the low-resolution three-dimensional structure, with isotropic Gaussian components as moveable pseudo-atoms. This leads to a significant reduction in the number of parameters needed to represent the three-dimensional structure, and a simplified way of computing two-dimensional projections. It also contributes to the speed of the algorithm. We combine the estimation of the unknown three-dimensional structure and image orientations in a Bayesian framework. This ensures that there are very few parameters to set, and specifies how to combine different types of prior information about the structure with the given data in a systematic way. To estimate the model parameters we use Markov chain Monte Carlo sampling. The advantage is that instead of just obtaining point estimates of model parameters, we obtain an ensemble of models revealing the precision of the estimated parameters. We demonstrate the algorithm on both simulated and real data.

Introduction

Single-particle cryo-electron microscopy (cryo-EM) is a method used to determine the three-dimensional structure of macromolecular assemblies (1). Many copies of the assembly of interest are prepared in a thin ice layer, and imaged using an electron microscope. Each image, called a micrograph, contains non-overlapping two-dimensional images of hundreds of particles, all assumed to have approximately the same three-dimensional structure, but oriented differently. Tens of thousands of these particle images are extracted from a collection of micrographs. Such large numbers are required due to the extremely low signal-to-noise ratio (SNR) of the images.

The standard image formation model for this setting is to model each image as the linear projection of the unknown structure along an unknown direction, convolved with a known point-spread function (due to the electron microscope), and corrupted by noise (1). The reconstruction problem is to infer the three-dimensional structure from the two-dimensional images.

The workflow for solving the reconstruction problem can be divided into two parts: obtaining a low-resolution initial model, followed by a refinement of this model. A common refinement algorithm is three-dimensional projection matching (2). It alternates between updating the three-dimensional model and the image orientations. Given the current three-dimensional model, its projections are calculated along a discrete grid of directions. Each image is aligned to the best matching projection. Once all the image orientations have been updated, a new three-dimensional model is reconstructed (using direct Fourier inversion, for example), and the steps are repeated until convergence.

A low-resolution initial model could be a previous reconstruction of the same structure, or a model of a similar structure. However, in cases where a suitable initial model is not available, it has to be reconstructed from the data using an ab initio reconstruction algorithm. This is an important step: if the initial model does not represent the structure accurately enough, it may lead the refinement algorithm to converge to an incorrect model.

As input data, many ab initio algorithms do not use the individual particle images, but work with two-dimensional class averages instead. Class averages are obtained by clustering, aligning, and averaging the two-dimensional images to improve the SNR (3). This significantly reduces the number of unknown image orientations to be estimated by the algorithm.

Several ab initio algorithms exist. They include common-lines-based algorithms (4–9), random-model methods (10,11), methods using stochastic hill climbing (12) or nonlinear dimensionality reduction (13), and a Bayesian approach (14).

A drawback of most of these ab initio algorithms is that they have many ad hoc parameters whose effect on the results is difficult to interpret, and which are a potential source of bias by the user.

This has motivated the use of statistical modeling in cryo-EM reconstruction, first in the form of maximum-likelihood methods (15,16), and more recently as maximum a posteriori (MAP) estimates in a Bayesian framework (14,17,18). Statistical modeling requires a complete description of how the data (i.e., the two-dimensional images) are generated from the model (i.e., the three-dimensional structure and image orientations). It distinguishes between parameters used to describe the statistical model and algorithmic parameters influencing, for example, how fast the algorithm runs, but which cannot bias the results. Such an approach therefore has parameters that are easier to interpret, and a higher degree of objectivity.

The cryo-EM reconstruction problem is highly ill-posed: different models can give rise to very similar data. The standard way of dealing with this is to regularize, for instance by penalizing three-dimensional structures with too much high-frequency content. From the Bayesian perspective, this is equivalent to introducing prior assumptions about the model (in this case that the three-dimensional structure should have mostly low-frequency content). The Bayesian approach provides a systematic and theoretically well-grounded way to combine such explicit prior knowledge about the model with the data to find models (i.e., three-dimensional structures and image orientations) that are consistent with both the prior knowledge and the data.

Bayesian approaches to reconstruction tend to be very computationally intensive. Reported CPU times range from days (14) to several months (15).

We introduce a probabilistic ab initio algorithm that addresses the above-mentioned challenges. It uses a pseudo-atomic model with several hundred pseudo-atoms that can move around and change their size. As we will show later, this significantly reduces the number of parameters needed to describe the three-dimensional structure. Computing two-dimensional projections of the three-dimensional structure also becomes much simpler and faster.

Our reconstruction algorithm uses a Bayesian approach. The data-generation process is simple and intuitive, with only a small number of adjustable parameters such as the number of pseudo-atoms. Expressing prior knowledge becomes straightforward.

Instead of just generating the single model most consistent with the data and prior knowledge (the MAP estimate), our algorithm generates multiple similar models that are all consistent with the data and prior knowledge. The ensemble of models can be analyzed to obtain information about the precision of the estimated three-dimensional structure and image orientations. This approach also allows us to integrate out the image orientations without the use of a discrete grid, which slows down other Bayesian approaches.

We demonstrate our algorithm using simulated and experimental data, and show that in all cases it can obtain suitable initial models in a relatively short time.

Materials and Methods

Model

Pseudo-atomic model

We use a coarse-grained representation of the three-dimensional structure as a cloud of K pseudo-atoms. Each pseudo-atom is a spherical blob centered at position μk with unknown radius σ. All the pseudo-atoms have the same (adjustable) size, and their positions can vary continuously, i.e., they are not fixed to a regular grid. Each pseudo-atom has an unknown weight wk. In analogy to high-resolution atomic structures, the μk vectors are the Cartesian coordinates of the kth pseudo-atom, and wk and σ are its occupancy and temperature factor, respectively. In contrast to atomic models, pseudo-atoms are much larger than atoms, and far fewer of them are therefore required to represent a structure.

Pseudo-atomic models have been used to rigidly fit multiple subunits into a given low-resolution three-dimensional density map (19), and to identify possible conformational changes through a normal mode analysis (20,21). In all these applications the pseudo-atomic model is fit to a three-dimensional structure that has been reconstructed earlier using other algorithms. In contrast, we are exploiting the advantages of the pseudo-atomic representation for the reconstruction problem itself: the parameters of the pseudo-atoms will be estimated directly from the two-dimensional images, without any reference to three-dimensional volumes on regular grids.

If we choose our pseudo-atoms to have a Gaussian shape, then from a statistical perspective our representation is known as a Gaussian mixture model (GMM) (22). GMMs are widely used to estimate probability density functions from observed data points, and are a smooth and efficient alternative to the common histogram estimator. The advantage of casting cryo-EM reconstruction as a GMM fitting problem is that we can draw inspiration from well-established statistical methods for estimating the parameters of the pseudo-atoms.

Each pseudo-atom is represented by a Gaussian function G3D(x;μk,σ) describing the density at the three-dimensional point x of a pseudo-atom centered at μk with radius σ (the parameters μk and σ are the mean and standard deviation of the Gaussian). The density map ρ representing the entire three-dimensional structure is a weighted sum of K such pseudo-atoms,

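$$\rho(x) = \sum_{k=1}^{K} w_k\, G_{3D}(x; \mu_k, \sigma) = \sum_{k=1}^{K} \frac{w_k}{(2\pi)^{3/2}\sigma^{3}} \exp\left\{-\frac{\|x-\mu_k\|^{2}}{2\sigma^{2}}\right\}, \qquad (1)$$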

where ‖x − μk‖ denotes the Euclidean distance between any three-dimensional point x and the position of the pseudo-atom μk. Equation 1 is used, for example, in many flexible fitting algorithms that fit atomic structures to experimental EM maps.
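As a minimal sketch of Eq. 1, the density can be evaluated at any point directly from the pseudo-atom parameters. The function names and the two-atom toy model below are illustrative assumptions, not taken from the paper.

```python
import math

def gaussian_3d(x, mu, sigma):
    """Isotropic 3D Gaussian G_3D(x; mu, sigma) as it appears in Eq. 1."""
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    norm = (2.0 * math.pi) ** 1.5 * sigma ** 3
    return math.exp(-sq_dist / (2.0 * sigma ** 2)) / norm

def density(x, mus, weights, sigma):
    """rho(x) = sum_k w_k * G_3D(x; mu_k, sigma)."""
    return sum(w * gaussian_3d(x, mu, sigma) for w, mu in zip(weights, mus))

# Hypothetical toy model: two pseudo-atoms 10 A apart with sigma = 5 A.
mus = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
weights = [0.5, 0.5]
sigma = 5.0

rho_mid = density((5.0, 0.0, 0.0), mus, weights, sigma)   # between the atoms
rho_far = density((50.0, 0.0, 0.0), mus, weights, sigma)  # far outside the model
print(rho_mid, rho_far)
```

Evaluating `density` on every voxel of a grid gives a conventional volume for visualization, at whatever sampling is convenient.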

The pseudo-atomic model has many advantages compared to the standard three-dimensional grid based representation. The first advantage is that it requires far fewer parameters (see Fig. 1 and Movie S1 in the Supporting Material). Each pseudo-atom needs only four parameters to describe its position and weight, whereas a three-dimensional grid has one parameter for each three-dimensional voxel. The pseudo-atomic model can also be evaluated on an arbitrarily fine grid for visualization purposes.
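The parameter counts being compared are simple to work out. The sketch below only counts parameters for the grid sizes and downsampling factors quoted in the Figure 1 caption; the pseudo-atom numbers K are example values, and no claim is made here about which K matches which accuracy level.

```python
def grid_parameters(side):
    """One parameter per voxel of a cubic grid with the given side length."""
    return side ** 3

def pseudo_atom_parameters(K):
    """Four parameters per pseudo-atom: three coordinates plus one weight."""
    return 4 * K

full_side = 112  # reference grid is 112 x 112 x 112
for factor in (1, 2, 4, 8):
    side = full_side // factor
    print(f"grid downsampled by {factor}: {grid_parameters(side)} parameters")

for K in (100, 500, 2000):  # example values of K
    print(f"K = {K}: {pseudo_atom_parameters(K)} parameters")
```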

Figure 1.

Comparison between the number of parameters required for the standard grid-based representation and the pseudo-atomic representation of RNA polymerase II. The normalized cross-correlation was computed with respect to the original reference structure on a grid with dimensions 112 × 112 × 112. The reference structure was downsampled by factors of 2, 4, and 8 to obtain the grid-based representations. The number of parameters is the number of voxels and four times the number of components (K) for the respective representations. The figure shows that, for any specified level of accuracy (quantified by the cross-correlation), the pseudo-atomic representation needs <10% of the number of parameters needed by the grid-based representation. See also Movie S1.

Another important advantage is that computing two-dimensional projections is simple and fast. A given image orientation is described by a three-dimensional rotation matrix R. The three-dimensional structure is projected along the corresponding direction by first rotating it by R, and then integrating along the z axis to obtain an image in the x,y plane. The in-plane translation is denoted by the vector t. In-plane rotations are already accounted for by the rotation matrix R.

Applying this procedure to our pseudo-atomic model is very simple: first we apply the rotation by transforming each pseudo-atom position μk to Rμk. Then we project to the x,y plane by just discarding the z coordinate. Formally, we project Rμk to PRμk, where

$$P = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}.$$

Finally, we translate the projection by t. The resulting translated two-dimensional projection of all the pseudo-atoms is also a GMM, of the form

$$(PR\rho)(x) = \sum_{k=1}^{K} w_k\, G_{2D}(x; PR\mu_k + t, \sigma), \qquad (2)$$

where

$$G_{2D}(x; \mu, \sigma) = \frac{1}{2\pi\sigma^{2}} \exp\left\{-\frac{\|x-\mu\|^{2}}{2\sigma^{2}}\right\}$$

is a two-dimensional Gaussian. The weights wk and size σ are the same as for the three-dimensional model. The computation only requires a small number of elementary matrix operations. There is also no need for any interpolation.
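The projection described above can be sketched in a few lines: rotate each center, drop z, add the in-plane shift, and reuse the same weights and σ. The helper names, the 90° rotation about the y axis, and the toy model are assumptions chosen for illustration.

```python
import math

def rotate(R, v):
    """Apply a 3x3 rotation matrix (nested lists) to a 3-vector."""
    return tuple(sum(R[i][j] * v[j] for j in range(3)) for i in range(3))

def project_centers(mus, R, t):
    """Map each 3D center mu_k to the 2D center P R mu_k + t."""
    centers = []
    for mu in mus:
        x, y, _z = rotate(R, mu)  # discarding z is the action of P
        centers.append((x + t[0], y + t[1]))
    return centers

def gaussian_2d(x, mu, sigma):
    """Isotropic 2D Gaussian G_2D(x; mu, sigma)."""
    sq_dist = (x[0] - mu[0]) ** 2 + (x[1] - mu[1]) ** 2
    return math.exp(-sq_dist / (2.0 * sigma ** 2)) / (2.0 * math.pi * sigma ** 2)

def projected_density(x, mus, weights, sigma, R, t):
    """Right-hand side of Eq. 2 evaluated at the 2D point x."""
    centers = project_centers(mus, R, t)
    return sum(w * gaussian_2d(x, c, sigma) for w, c in zip(weights, centers))

# Hypothetical example: two pseudo-atoms stacked along z, viewed after a
# 90-degree rotation about the y axis, so they appear side by side along x.
R_y90 = [[0.0, 0.0, 1.0],
         [0.0, 1.0, 0.0],
         [-1.0, 0.0, 0.0]]
mus = [(0.0, 0.0, 0.0), (0.0, 0.0, 10.0)]
weights = [0.5, 0.5]
sigma = 5.0
t = (1.0, -2.0)

print(project_centers(mus, R_y90, t))
```

No interpolation is involved: only the K centers are transformed, and the image is obtained by evaluating the two-dimensional mixture at the pixel positions.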

The parameters of the pseudo-atomic model, together with a rotation Ri and translation ti for each image, constitute our unknown model parameters,

θ={μ,σ,w,R,t},

where μ = {μk}, w = {wk}, R = {Ri}, and t = {ti}.

We do not have to specify the size σ of the pseudo-atoms because it is estimated by the algorithm. Instead, we have to specify the number of pseudo-atoms, K, which implicitly determines the optimal size σ. We choose K such that the estimated size σ is approximately the same as the pixel size of the two-dimensional projection images. A similar rule-of-thumb has been shown to work when fitting atomic models to three-dimensional volumes (20), by choosing the pseudo-atom size to be roughly the same as the voxel size.

Using the Bayesian approach, we have to encode our prior assumptions by defining a probability distribution p(θ) (called the “prior”) over all possible models describing how plausible each model is before including any data. For instance, we use a three-dimensional Gaussian distribution as the prior for each position μk to encode our assumption that the pseudo-atoms should be spread across a region roughly the size of the unknown three-dimensional structure. For each rotation Ri we use a uniform prior to model the assumption that each image orientation is equally likely.

Our prior distribution has only four additional parameters (hyperparameters), which determine the shape of the prior and are fixed during the reconstruction: one for the expected size of the structure (which can be estimated from the size of the images), one determining how much the individual weights wk are allowed to deviate from the average weight, and two specifying the range of plausible sizes σ for the pseudo-atoms. The hyperparameters have only a minor influence on the final model, and the default values will work for a large number of reconstructions. See the Supporting Material for an analysis of the effect of the hyperparameters on the final model.

Data-generation model

One way to create two-dimensional projection images is to project the three-dimensional mixture model to a two-dimensional mixture model as described above (Eq. 2), and then to evaluate this two-dimensional mixture model on a two-dimensional grid. This approach will be used below to generate simulated data.

Viewing our pseudo-atomic model as a GMM, we would like to make use of the powerful statistical algorithms that exist for fitting GMMs to three-dimensional point clouds. Examples of such algorithms include expectation maximization (23) and Gibbs sampling (24). Two complications prevent us from directly applying one of these algorithms: we have two-dimensional intensities instead of individual two-dimensional points, and a dimension is missing (we have two-dimensional data instead of three-dimensional points). To address these complications and connect with existing methods for estimating GMM parameters, we adopt an alternative view of the data-generation process (see the Supporting Material for a formal description).

Starting with a pseudo-atomic structure, we assume that the first step in generating the ith image is to generate a three-dimensional point cloud with C points covering the same region as the pseudo-atomic model. Each point in the point cloud is created by first randomly selecting a pseudo-atom according to its weight wk, and then randomly placing the point near the center of the pseudo-atom. This is exactly how algorithms such as expectation maximization and Gibbs sampling assume the three-dimensional point cloud to have been generated in the standard application of fitting a GMM to a three-dimensional point cloud.

The three-dimensional point cloud is then rotated and translated by Ri and ti, and projected to a two-dimensional point cloud by discarding the z coordinate. Finally, a two-dimensional histogram is formed by using the two-dimensional pixels as bins. The two-dimensional histogram is viewed as a quantized image, with the number of points in each bin being the image intensity at that pixel. The input data D to the algorithm consists of all the quantized images together.
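The generative view above can be simulated directly. The following is a minimal sketch, assuming an image frame centered on the origin and silently dropping points that fall outside it (both choices are assumptions, as are the function name and the toy parameters).

```python
import math
import random

def sample_quantized_image(mus, weights, sigma, R, t, n_points,
                           n_pixels, pixel_size, rng):
    """Simulate one quantized image: sample a 3D point cloud from the
    mixture, rotate, shift in-plane, drop z, and histogram into pixels."""
    image = [[0] * n_pixels for _ in range(n_pixels)]
    half = 0.5 * n_pixels * pixel_size  # frame spans [-half, half) in x and y
    for _ in range(n_points):
        # Select a pseudo-atom with probability proportional to its weight.
        k = rng.choices(range(len(mus)), weights=weights)[0]
        # Place the point near the center of the selected pseudo-atom.
        p = [rng.gauss(m, sigma) for m in mus[k]]
        # Rotate, keep only x and y (projection), then translate in-plane.
        x = sum(R[0][j] * p[j] for j in range(3)) + t[0]
        y = sum(R[1][j] * p[j] for j in range(3)) + t[1]
        col = math.floor((x + half) / pixel_size)
        row = math.floor((y + half) / pixel_size)
        if 0 <= row < n_pixels and 0 <= col < n_pixels:
            image[row][col] += 1  # pixel intensity = number of points in bin
    return image

# Hypothetical toy setup: two pseudo-atoms, identity rotation, no shift.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
mus = [(-20.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
weights = [0.5, 0.5]
image = sample_quantized_image(mus, weights, 5.0, identity, (0.0, 0.0),
                               1000, 50, 6.0, random.Random(0))
print(sum(sum(row) for row in image))
```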

The randomness in generating the three-dimensional point cloud translates into randomness in the generated data D. Given fixed parameters for the pseudo-atoms and the rotations and translations, the probability distribution over possible datasets that can be generated is denoted by p(D | θ). In statistical parlance, p(D | θ) is called the “data likelihood” and defines a random forward model of how the observed images could have been generated. Estimation of the model parameters θ is achieved by inverting the data generation process with the help of Bayes’ theorem.

To use our ab initio algorithm we first have to convert the raw particle images to quantized class averages. The preprocessing steps needed to obtain nonnegative (real-valued) class averages are described below. The nonnegative images are converted to quantized images by first scaling them by α and then rounding to the nearest integer. The scaling factor α is chosen such that the total number of points C in the image equals a predetermined constant.
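The quantization step can be sketched as follows. This version simply sets α to the target count divided by the total intensity, so after rounding the total may differ slightly from the target; how the paper handles that residual is not stated, so treat this as one plausible reading.

```python
def quantize(image, total_points):
    """Scale a nonnegative image by alpha and round to integer counts.

    alpha is chosen so that the scaled intensities sum to total_points;
    rounding can shift the final sum by a few counts.
    """
    total = sum(sum(row) for row in image)
    if total <= 0:
        raise ValueError("image must contain positive intensity")
    alpha = total_points / total
    quantized = [[int(round(alpha * v)) for v in row] for row in image]
    return quantized, alpha

# Hypothetical 2 x 2 class average and target number of points C = 80.
class_average = [[0.0, 1.0], [3.0, 4.0]]
quantized, alpha = quantize(class_average, 80)
print(quantized, alpha)
```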

The idea of the ab initio algorithm to be described below is to reverse the above data-generation process: starting with two-dimensional points from the quantized image, we first back-project them to three-dimensional points by estimating their missing z coordinates. Then we assign each three-dimensional point to a pseudo-atom that was likely to have generated it. And finally we move the pseudo-atom to align it to its assigned three-dimensional points. The last part of this strategy is the same as used in the standard application of Gibbs sampling to three-dimensional point clouds, and very similar to expectation maximization.

Data preprocessing

Similarly to many other ab initio algorithms, we use class averages instead of raw particle images as input to our algorithm. This yields a computational advantage by significantly reducing the number of unknown rotations, in addition to an increase in the SNR. It comes at the cost of corrupting the high-frequency information in the images, but this is not a drawback for ab initio methods, where we are only interested in low-resolution reconstructions.

Our ab initio algorithm requires the class averages to be nonnegative. This is a sensible assumption, given that in the standard model of cryo-EM image formation, the images are taken to be nonnegative before the contrast transfer function (CTF) is applied. If we apply commonly used CTF-correction algorithms such as Wiener filtering or phase-flipping, the resulting images typically still have negative values.

Here we describe an extra deconvolution step to correct for the CTF that can be appended to the class-averaging algorithm to ensure that the resulting images are nonnegative (see Fig. 2). The deconvolution algorithm can be applied either to individual raw images that have been clustered and aligned, or to class averages as produced by any existing class-averaging algorithm.

Figure 2.

Preprocessing pipeline to prepare the data for the ab initio algorithm. The raw images on the left are clustered and aligned using any of the standard class-averaging algorithms. The deconvolution algorithm in the text is then applied to every cluster or class average to obtain a deconvolved image (on the right).

Let {zi} be the raw images from a single class that have been two-dimensionally aligned relative to each other. We model each image zi = fi ∗ y + ϵi as the convolution (denoted by ∗) of a nonnegative image y with a point-spread function fi, with added independent and identically distributed Gaussian noise ϵi. Each point-spread function is the inverse Fourier transform of the corresponding CTF for that image, which is assumed to be known. The unknown image y is the projection of the unknown density map along an unknown direction.

A MAP estimate for y is found by minimizing the convex loss function,

$$L(y) = \frac{1}{2}\sum_{i} \|z_i - f_i * y\|^{2} + \frac{1}{2}\alpha\, \|\nabla y\|^{2}, \qquad (3)$$

subject to the constraint that y be nonnegative. Here ∇ is the gradient operator, and α is a hyperparameter controlling the smoothness of y (it is unrelated to the scaling factor α used for quantization). The regularization parameter α can be chosen manually, and was fixed to a value of 10 for all our experiments. We use the L-BFGS-B algorithm (25) to optimize Eq. 3.
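Because L-BFGS-B is a gradient-based method, it may help to write out the gradient of the loss. The expression below is a sketch under assumptions not stated in the text: each point-spread function is real, the flipped kernel \tilde{f}_i (the adjoint of convolution with f_i) is introduced here for convenience, and the boundary conditions (for example, periodic) are such that the smoothness term differentiates to a Laplacian.

```latex
\nabla L(y) = \sum_{i} \tilde{f}_i * \left( f_i * y - z_i \right) - \alpha\, \Delta y,
\qquad \tilde{f}_i(x) = f_i(-x)
```

The nonnegativity constraint does not appear in the gradient; it enters as a lower bound of zero on every pixel of y, which is the kind of box constraint L-BFGS-B is designed to handle.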

Algorithm

In the previous section we defined the model parameters θ, the data D, the prior distribution on the parameters p(θ), and the data-generation model p(D | θ). Bayes’ theorem dictates how to compute the posterior distribution p(θ | D):

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}. \qquad (4)$$

The posterior is a probability distribution over all possible models, quantifying how well each model explains the data without violating the prior assumptions.

Gibbs sampling is a widely-used Markov chain Monte Carlo algorithm for sampling from the posterior distribution, in other words for generating model realizations that follow the posterior distribution and are therefore consistent with both the data and the prior information (24).

The first step is to generate a random initial model by sampling the model parameters from the prior distribution. The parameters are then updated in turn: first, the assignments of points to pseudo-atoms and missing z coordinates (back-projection); then, the pseudo-atom parameters (positions, weights, and size); and finally, the rotations and translations. Each parameter update depends on the current values of the other parameters. This single Gibbs sampling step is iterated several times until the parameters converge to a stable region of parameter space that should be independent of the initial random model.

Each group of parameters is updated according to their corresponding conditional distribution. The conditional distribution quantifies the likelihood of each possible value of a given parameter, assuming that all other model parameters are known and fixed.

Importantly, in the Bayesian framework all the conditional distributions are completely determined by just the prior distribution p(θ) and the data-generation model p(D | θ). The only way to modify these distributions is by making different prior assumptions or by using different data. Furthermore, each conditional distribution is a well-known distribution from which it is straightforward to generate parameters. For instance, the conditional distribution for each pseudo-atom position is a Gaussian distribution. Less well known is the conditional distribution for each rotation Ri, which is of the form exp[tr(AᵀRi)] for some matrix A. This is a unimodal distribution, which can be seen as the analog for three-dimensional rotations of the well-known von Mises distribution for two-dimensional rotations. We use the algorithm introduced by Habeck (26) to generate rotations from this distribution.

We will first give an overview of the entire algorithm, which consists of several Gibbs sampling steps, and then describe a single Gibbs sampling step in more detail.

In the flowchart in Fig. 3, the algorithm is divided into two parts: an initial stage and a refinement stage. A very low resolution structure with only 100 or 200 pseudo-atoms is constructed during the initial stage, and then refined with 500 or 2000 pseudo-atoms during the refinement stage. See Movie S2 for a visualization of the algorithm.

Figure 3.

Algorithm flowchart showing initial and refinement stages. The initial stage consists of 25 steps of updating first the pseudo-atoms, then the rotations, each using 100 Gibbs sampling steps. The refinement stage consists of 5000 Gibbs sampling steps, of which the first 2500 are discarded as the burn-in phase. The number of steps is conservatively chosen to be far more than is needed for the algorithm to converge in all tested cases.

The initial stage is reminiscent of the projection-matching algorithm described in the Introduction. We alternate between multiple Gibbs sampling steps to update just the pseudo-atom parameters, and multiple Gibbs sampling steps to update just the rotations. The rotation updates using Gibbs sampling tend to make only small adjustments, and can sometimes get stuck in local optima. Therefore we add a global rotation update during every outer loop. During this global rotation update, each image is compared to 10,000 two-dimensional projections of the pseudo-atomic model in random orientations. For each orientation, we compute the likelihood of the image given the two-dimensional projection corresponding to the orientation. These likelihoods form the coefficients of a discrete approximation to the conditional posterior distribution over the rotation. We then sample an orientation from this discrete distribution, and use it to update the rotation for the image. In this way rotations can escape local optima.
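The global rotation update can be sketched as below. The image likelihood is left as a caller-supplied function because its exact form is given in the Supporting Material rather than here; `toy_log_likelihood`, the function names, and the use of normalized Gaussian quaternions to draw uniform random rotations are all assumptions of this sketch.

```python
import math
import random

def random_rotation(rng):
    """Uniformly distributed rotation matrix from a normalized random quaternion."""
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    n = math.sqrt(sum(c * c for c in q))
    w, x, y, z = (c / n for c in q)
    return [[1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
            [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
            [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)]]

def sample_index(log_likelihoods, rng):
    """Draw an index with probability proportional to exp(log-likelihood)."""
    m = max(log_likelihoods)  # subtract the maximum for numerical stability
    probs = [math.exp(l - m) for l in log_likelihoods]
    return rng.choices(range(len(probs)), weights=probs)[0]

def global_rotation_update(image, log_likelihood, n_candidates, rng):
    """Sample a rotation from a discrete approximation to its conditional
    posterior, built from randomly oriented candidates (uniform prior)."""
    candidates = [random_rotation(rng) for _ in range(n_candidates)]
    logls = [log_likelihood(image, R) for R in candidates]
    return candidates[sample_index(logls, rng)]

# Hypothetical stand-in likelihood that favors rotations keeping z upright.
def toy_log_likelihood(image, R):
    return 50.0 * R[2][2]

rng = random.Random(0)
R_new = global_rotation_update(None, toy_log_likelihood, 200, rng)
print(R_new[2][2])
```

Sampling from the discrete distribution, rather than taking the best-scoring candidate, keeps the update stochastic, consistent with the sampling-based description in the text.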

For the refinement stage, we increase the number of pseudo-atoms to 500 or 2000 and then sample all parameters using Gibbs sampling. During this stage, only minor adjustments are made to the rotations.

During both stages we can monitor the progress and convergence of the Gibbs sampler via the log-posterior, defined as log p(θ | D), where θ is the model being used (see Fig. 5 B).

Figure 5.

Results for the 50S ribosome. (A) Starting from a random initial model, the initial stage converges within <10 steps. The number of pseudo-atoms (shown as small solid circles) is then increased from 100 to 2000, and multiple models from the posterior distribution are shown. These are averaged to obtain the final reconstruction. The cross-correlation with the reference model at 25 Å is 0.990. (B) The log-posterior measures how well the estimated model matches the data and the prior. It shows that both the initial and refinement stages converge rapidly. The estimation of the optimal pseudo-atom size converges fast as well. The figure shows that increasing the number of pseudo-atoms leads to a decrease in their size from ∼10 to ∼5 Å. (C) Instead of a single value for the pseudo-atom size, the algorithm gives us a distribution of plausible sizes. Comparing the distributions shows that the size is more precisely determined for the refinement stage. (D) For each of the 25 images, we compare the Euler angles of the original rotation and the final rotation. All rotation estimates are very accurate, with most of the angular errors <1°.

The output of the algorithm is an ensemble of pseudo-atomic models from the posterior distribution. This ensemble consists of every 50th model generated during the refinement stage after discarding the first 2500 models to exclude the burn-in period. To represent the result as a single volume we evaluate each model in the posterior family on a three-dimensional grid, and report either the mean of these volumes, or any one of the volumes (they are all very similar).
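The ensemble selection amounts to discarding the burn-in and thinning the remaining chain. A minimal sketch, assuming the chain is stored as a list and that thinning starts at the first post-burn-in sample (the exact offset is not specified in the text):

```python
def thin_chain(samples, burn_in=2500, stride=50):
    """Drop the first burn_in samples, then keep every stride-th one."""
    return samples[burn_in::stride]

# Stand-in for 5000 refinement-stage models: here just their step indices.
chain = list(range(5000))
ensemble = thin_chain(chain)
print(len(ensemble), ensemble[0], ensemble[-1])
```

With 5000 refinement steps this yields an ensemble of 50 models.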

We explain a single Gibbs sampling step in Fig. 4 using a toy two-dimensional reconstruction example with only two pseudo-atoms and two images. Each one-dimensional image is shown as a bar chart with the height of each bar indicating the number of points at the corresponding one-dimensional pixel. For clarity, most of the steps are shown only for the data lying below the pseudo-atoms.

Figure 4.

Simple two-dimensional reconstruction example to explain a single Gibbs sampling iteration. (Solid lines) One-dimensional projections of this model. (Dashed lines) One-dimensional projections of the previous model. Initially (1), the one-dimensional projections differ significantly from the one-dimensional data. They improve after moving the pseudo-atoms (5), and once again after updating the image orientations/rotations (6). The final projections approximate the data quite well.

The first step (panels 2 and 3) is to evaluate the one-dimensional projection of each pseudo-atom at all the one-dimensional pixels, and assign points to pseudo-atoms. At each pixel, the relative value of the two pseudo-atoms determines the proportion of points to assign to each. For instance, for pixels on the left, all points are assigned to the bottom-left pseudo-atom, while for pixels in the middle, the points are distributed equally among the two pseudo-atoms (only half the bar is shaded). The second step (panel 4) is to estimate the missing y coordinates (missing z coordinates in the three-dimensional case). For each one-dimensional point, its y coordinate is chosen randomly near the y coordinate of the pseudo-atom to which it was assigned. The next step (panel 5) is to update the pseudo-atoms, i.e., their weights, positions, and size.

In this example, we update only their positions. The position of each pseudo-atom is chosen randomly near the mean of the two-dimensional points assigned to that pseudo-atom. After this update, the one-dimensional projections of the new pseudo-atomic model match the input data more closely. In the final step (panel 6), we update the rotations. The pseudo-atoms remain fixed, and the two-dimensional point cloud is rotated about the origin to better fit the pseudo-atoms. As a result of the rotation, the one-dimensional projections match the input data very well.
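The first three steps of the toy example can be sketched in code. This is an illustrative simplification, not the authors' implementation: the image is assumed to be a projection along y with no rotation, and the position update ignores the Gaussian prior on positions (a flat prior), so the new position is drawn around the mean of the assigned points with standard deviation σ/√n.

```python
import math
import random

def gauss_1d(x, mu, sigma):
    """One-dimensional Gaussian density."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)

def responsibilities(x, centers_x, weights, sigma):
    """Proportion of the points at pixel x to assign to each pseudo-atom."""
    vals = [w * gauss_1d(x, c, sigma) for w, c in zip(weights, centers_x)]
    total = sum(vals)
    return [v / total for v in vals]

def assign_and_backproject(pixel_x, count, centers, weights, sigma, rng):
    """For `count` points at one pixel, sample a pseudo-atom label and the
    missing y coordinate of each point."""
    probs = responsibilities(pixel_x, [c[0] for c in centers], weights, sigma)
    points = []
    for _ in range(count):
        k = rng.choices(range(len(centers)), weights=probs)[0]
        y = rng.gauss(centers[k][1], sigma)  # near the assigned atom's y
        points.append((k, pixel_x, y))
    return points

def update_position(assigned_xy, sigma, rng):
    """Draw a new 2D position near the mean of the assigned points
    (flat-prior simplification)."""
    n = len(assigned_xy)
    mean_x = sum(p[0] for p in assigned_xy) / n
    mean_y = sum(p[1] for p in assigned_xy) / n
    s = sigma / math.sqrt(n)
    return (rng.gauss(mean_x, s), rng.gauss(mean_y, s))

# Hypothetical toy model: two pseudo-atoms, bottom-left and top-right.
centers = [(-5.0, -3.0), (5.0, 3.0)]
weights = [0.5, 0.5]
rng = random.Random(0)
points = assign_and_backproject(0.0, 40, centers, weights, 2.0, rng)
print(responsibilities(0.0, [-5.0, 5.0], weights, 2.0))
```

At a pixel midway between the two pseudo-atoms the points are split evenly on average, while at a pixel directly below the left pseudo-atom nearly all points go to it, matching the shading described for panels 2 and 3.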

Results

We used five different datasets to test our algorithm: one consisting of simulated class averages, one with realistically simulated raw particles, and three with real data.

For the first dataset, we converted an atomic model of the ribosome 50S subunit (Protein Data Bank (PDB): 1VOR) to a three-dimensional volume at 15 Å using the software CHIMERA (University of California at San Francisco, San Francisco, CA) and projected it using random orientations to create 25 class averages. The size of the images is 50 × 50 pixels and the sampling rate is 6 Å/pixel. Fig. 5 A shows the progression of the reconstruction, including the positions of the pseudo-atoms. Also see Movie S2. The computation took 22 min on a single core, with 7 and 15 min for the initial and refinement stages, respectively. We used 100 pseudo-atoms for the initial stage and 2000 for the refinement stage. The models produced by the refinement stage have a pseudo-atom size of ∼5.0 Å, as can be seen in Fig. 5, B and C. To evaluate the reconstruction, we created a reference volume at res = 25 Å. The reference volume was created from the atomic model with CHIMERA's MOLMAP command, which describes each atom using a three-dimensional Gaussian with size 0.225 × res ≈ 5.6 Å, i.e., almost the same as our final pseudo-atom size. Our final reconstruction agrees very well with the reference: the normalized cross-correlation between the two structures is 0.990, and they agree to a resolution of 15.9 Å as measured by the FSC = 0.5 criterion (27). The FSC curves are shown in Fig. S1 in the Supporting Material. We also compared our final estimated rotations with the true rotations, and found that most of them agree to within 0.5° with a maximum error of <1.5° (see Fig. 5 D).

The second dataset consists of 5000 realistically simulated RNA polymerase II (PDB: 1I3Q) particles with size 100 × 100 pixels at a sampling rate of 2.5 Å/pixel. The reference volume was projected along random orientations, and random translations were applied to the images. The CTF was applied with a random defocus value for each image, followed by Gaussian noise with SNR = 0.2 (Fig. 6). We used EMAN2 (28) to compute 41 class averages in 98 min, followed by deconvolution, which took 73 min. Applying our ab initio algorithm to the deconvolved images took another 38 min, a total of 209 min. The final reconstruction has a cross-correlation of 0.966 compared to the reference model at 20 Å, and they agree to a resolution of 14.5 Å at FSC = 0.5.
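To make the noise level concrete, the following sketch adds Gaussian noise to an image at a prescribed SNR. It assumes that SNR is defined as the ratio of signal variance to noise variance, which the text does not state explicitly; the image is a flat list of pixel values, and all names are hypothetical.

```python
import math
import random


def variance(values):
    """Population variance of a list of pixel values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def noise_std_for_snr(image, snr):
    """Noise standard deviation giving SNR = var(signal) / var(noise) (assumed definition)."""
    return math.sqrt(variance(image) / snr)


def add_gaussian_noise(image, snr, rng):
    """Return a noisy copy of the image at the requested SNR."""
    std = noise_std_for_snr(image, snr)
    return [v + rng.gauss(0.0, std) for v in image]


clean = [0.0, 2.0, 0.0, 2.0]
noisy = add_gaussian_noise(clean, snr=0.2, rng=random.Random(1))
```

Under this definition, SNR = 0.2 means the noise variance is five times the signal variance, which is consistent with the raw particles in Fig. 6 being dominated by noise.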

Figure 6.

Results for realistically simulated RNA polymerase II data. At the top left are nine of the 5000 raw particles that were used to compute 41 deconvolved class averages, of which nine are shown (top right). The final reconstruction agrees well with the reference at 20 Å, as shown by the cross-correlation value of 0.966.

For the third dataset, we used publicly available experimental 70S ribosome data from the EMDB test image data (15,29). The dataset consists of 5000 images with size 130 × 130 at a sampling rate of 2.82 Å/pixel. We used the software toolbox ASPIRE (30) to compute 50 class averages, followed by deconvolution. The algorithm was initialized with GroEL, an unrelated structure. Fig. 7 shows how the algorithm eliminates the bias caused by the incorrect initial model and quickly converges to the correct 70S structure. To provide further evidence of the robustness of the algorithm, we successfully repeated the reconstruction with a random initial model. We compared the final reconstruction to the result obtained using the PRIME algorithm (12). As shown in Fig. 7, the two structures are visually very similar, certainly enough for each to be used as an initial model for a refinement. The normalized cross-correlation between the two structures is 0.900, and they agree to a resolution of 31.1 Å at FSC = 0.5. The computation of the first reconstruction (starting with GroEL) took 102 min in total (50 min for forming class averages, 24 min for deconvolution, and 28 min for the initial and refinement stages of our algorithm). The class-averaging step used eight cores on a desktop computer, while the other steps used a single core on a laptop, a total of <8 CPU h. In contrast, the PRIME reconstruction took ∼10 h on a cluster with 40 cores. In general, PRIME takes ∼500–1000 CPU hours to compute an initial model. This example shows that our algorithm produces comparable results in a fraction of the time required by PRIME.
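The CPU-time comparison follows directly from the wall-clock times and core counts quoted above. The short calculation below only restates that arithmetic; the helper name `cpu_hours` is hypothetical.

```python
def cpu_hours(steps):
    """Sum (wall-clock minutes x cores) over all steps and convert to hours."""
    return sum(minutes * cores for minutes, cores in steps) / 60.0


# (minutes, cores) for class averaging, deconvolution, and reconstruction.
ours = cpu_hours([(50, 8), (24, 1), (28, 1)])

# PRIME run quoted in the text: about 10 h of wall-clock time on 40 cores.
prime = cpu_hours([(10 * 60, 40)])

speedup = prime / ours
```

With these figures the pseudo-atom pipeline uses about 7.5 CPU h, against roughly 400 CPU h for this particular PRIME run, so most of the cost on our side is the eight-core class-averaging step rather than the reconstruction itself.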

Figure 7.

Results for the 70S ribosome, using real data. The algorithm was initialized with a model of an unrelated structure, GroEL, and successfully converged to the 70S structure. Shown below the labeled models from the initial stage are multiple models from the posterior distribution. These are averaged to obtain the final reconstruction (last two rows). We computed a second reconstruction starting from a different, random initial model (rightmost column). Once again the algorithm converged to the correct structure, showing its robustness to the choice of initial model. Shown as the reference is the reconstruction obtained by the PRIME algorithm, low-pass-filtered. The cross-correlation between each of our reconstructions and the PRIME reconstruction is 0.900 for our first and 0.895 for our second reconstruction. The cross-correlation between our reconstructions is 0.986.

For the fourth experiment, we used a publicly available experimental GroEL dataset (31) consisting of ∼5000 images with size 128 × 128 at a sampling rate of 2.12 Å/pixel. EMAN2 was used to obtain 13 class averages in 19 min, followed by deconvolution, which took 2 min. Applying the initial and refinement stages of our algorithm took another 13 min, for a total of 34 min. We also took into account the known D7 symmetry of the structure, which shows that our framework is flexible enough to include symmetry constraints as prior information. Our final result (Fig. 8) has a cross-correlation of 0.927 with the reference model (PDB:1OEL) at 20 Å, and they agree to a resolution of 17.5 Å at FSC = 0.5.
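One simple way to impose a D7 constraint on pseudo-atom positions is to sample only the positions in one asymmetric unit and generate the remaining copies by applying the symmetry operations. The sketch below illustrates this idea and is not necessarily how the constraint was implemented in this work. It assumes the sevenfold axis lies along z and one twofold axis along x; the function name `dn_copies` is hypothetical.

```python
import math


def dn_copies(point, n_fold=7):
    """Return the 2 * n_fold symmetry-related copies of a point under D_n.

    Assumed convention: n-fold axis along z, one twofold axis along x.
    """
    x, y, z = point
    copies = []
    for flipped in (False, True):
        # A twofold rotation about the x axis maps (x, y, z) to (x, -y, -z).
        bx, by, bz = (x, -y, -z) if flipped else (x, y, z)
        for k in range(n_fold):
            angle = 2.0 * math.pi * k / n_fold
            c, s = math.cos(angle), math.sin(angle)
            # Rotation by `angle` about the z axis.
            copies.append((c * bx - s * by, s * bx + c * by, bz))
    return copies


# One free pseudo-atom yields 14 symmetry-related pseudo-atoms.
ring = dn_copies((1.0, 0.5, 2.0))
```

With this parameterization the number of free position parameters drops by a factor of 14, and every sampled model is exactly symmetric by construction.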

Figure 8.

Results for GroEL, using real data. The final reconstruction agrees well with the reference at 20 Å, as shown by the cross-correlation value of 0.927.

For the fifth and final experiment, we tested the algorithm using experimental data from the human Anaphase Promoting Complex (APC/C) (32). Approximately 10,000 particles of size 80 × 80 pixels at a sampling rate of 4.9 Å/pixel were processed using reference-free alignment to produce 61 class averages. As required for a realistic test case, no knowledge of previous structures was used in computing the class averages. As before, the class averages were deconvolved to obtain nonnegative images, which were then used as input to our ab initio algorithm. Our reconstruction (Fig. 9) was compared to the reconstruction (EMD-2354) published earlier using data from the same source (32). The structures have a cross-correlation of 0.902 and agree to a resolution of 24.8 Å at FSC = 0.5.

Figure 9.

Results for experimental APC/C data. Class averages were computed from 10,000 raw particles in an ab initio setting, without making use of previous structures. The final reconstruction has a cross-correlation of 0.902 with the reference. At the bottom is the distribution of rotations at the end of the initial stage. Instead of estimating just a single rotation for each image, we obtain a cluster of rotations consistent with the image. The width of each cluster gives an indication of the precision of the estimated rotation.
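The width of a cluster of sampled rotations can be summarized by a single number. The sketch below is one possible summary, not the measure used in the figure: it considers only the projection directions (unit vectors), ignores the in-plane rotation, and reports the root-mean-square angle to the mean direction. The function name `angular_spread_deg` is hypothetical.

```python
import math


def angular_spread_deg(directions):
    """RMS angle in degrees between sampled unit vectors and their mean direction.

    Assumptions: inputs are unit vectors that form a single cluster, so that
    the summed vector is non-zero.
    """
    mx = sum(d[0] for d in directions)
    my = sum(d[1] for d in directions)
    mz = sum(d[2] for d in directions)
    norm = math.sqrt(mx * mx + my * my + mz * mz)
    mean = (mx / norm, my / norm, mz / norm)
    total = 0.0
    for d in directions:
        dot = d[0] * mean[0] + d[1] * mean[1] + d[2] * mean[2]
        # Clamp to guard against rounding slightly outside [-1, 1].
        cos_angle = max(-1.0, min(1.0, dot))
        total += math.acos(cos_angle) ** 2
    return math.degrees(math.sqrt(total / len(directions)))


theta = math.radians(10.0)
samples = [(math.sin(theta), 0.0, math.cos(theta)), (-math.sin(theta), 0.0, math.cos(theta))]
spread = angular_spread_deg(samples)
```

A small spread indicates that the image constrains its projection direction well, whereas a broad cluster flags an image whose orientation remains uncertain.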

Discussion

Our algorithm differs significantly from other cryo-EM reconstruction algorithms in the way in which three-dimensional structures are represented. Typically one uses a cubic three-dimensional grid comprising a large number of voxels. An alternative approach is to use rotationally symmetric blobs (33), each roughly the size of a voxel, positioned on a regular three-dimensional grid. The blobs are fixed in shape and size, the only free parameters being their weights. Thus the voxel and blob representations require a similar number of parameters, but projecting from three dimensions to two dimensions is faster and more accurate when using blobs instead of voxels. Blobs are used in the software package XMIPP (34), and were reported to produce superior quality reconstructions at lower computational cost (33).

Our approach can be seen as an extension of the blob approach, where we use pseudo-atoms instead of blobs, and allow their positions to vary smoothly instead of fixing them to a regular grid. This allows for a more parsimonious representation (Fig. 1), as pseudo-atoms can be moved to regions where they are more needed. Furthermore, the size of voxels or blobs needs to be fixed before reconstruction. In our case, the pseudo-atoms still all have the same size, but the appropriate size is estimated during reconstruction (Fig. 5). Instead of specifying their size, we have to choose the number of pseudo-atoms. There is a strong inverse relation between the number of pseudo-atoms and their size: as the number increases, the size must decrease to fill the same volume. Therefore, choosing the number is equivalent to implicitly choosing the size.
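The inverse relation can be made concrete with an idealized scaling argument. If N pseudo-atoms of size s tile a fixed volume, then N·s³ is roughly constant and s scales as N^(-1/3). This cube-root rule is an assumption introduced here for illustration, not a result from this work, and the function name `scaled_size` is hypothetical.

```python
def scaled_size(size_ref, n_ref, n_new):
    """Idealized pseudo-atom size after changing the count from n_ref to n_new.

    Assumes N * size**3 stays constant (pseudo-atoms tiling a fixed volume).
    """
    return size_ref * (n_ref / n_new) ** (1.0 / 3.0)


# Ribosome example: about 9.9 Å with 100 pseudo-atoms in the initial stage.
predicted = scaled_size(9.9, 100, 2000)
```

For the ribosome the idealized rule predicts about 3.6 Å with 2000 pseudo-atoms, whereas the estimated size in the refinement stage is about 5.0 Å. The rule therefore captures the direction of the trend but not its magnitude, which is one reason to estimate the size from the data rather than fix it in advance.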

As mentioned before, our rule of thumb is to choose the number of pseudo-atoms such that the resulting pseudo-atom size is similar to the pixel size. For example, for our ribosome reconstruction, during the initial stage the pixel size is 9.4 Å and the final pseudo-atom size is ∼9.9 Å, indicating that 100 pseudo-atoms was an appropriate choice. For the refinement stage, the corresponding values are 6.0 Å and 5.0 Å. Guided by this strategy, we used either 100 or 200 pseudo-atoms for the initial stage, and either 500 or 2000 pseudo-atoms for the refinement stage, for all our experiments.

The significant reduction in the number of parameters needed to describe a structure (Fig. 1) has two advantages. The first is that the algorithm is very fast. Starting from the class averages, the algorithm took <40 min for each of the five structures. All experiments, with the exception of class averaging with ASPIRE, were done on a standard laptop (Dell, Round Rock, TX) with a 2.40-GHz Core i7 quad-core processor (Intel, Santa Clara, CA) and 8 GB of memory. Except for the EMAN2 and ASPIRE class-averaging steps, the entire algorithm runs on a single core. In comparison, almost all other ab initio algorithms take multiple days, with only a single recent exception that is comparable to ours in terms of speed (13). Our algorithm was implemented in PYTHON (Python Software Foundation, python.org) with CYTHON extensions (cython.org), and is available upon request. The second advantage is reduced model complexity, i.e., a smaller set of three-dimensional structures that can be represented by our pseudo-atomic model. During the initial stage of our algorithm, when there are only a few large pseudo-atoms, it is impossible to represent high-frequency information in the three-dimensional structure. Because we are interested in a low-resolution model, this excludes a large number of undesired models from our search space, thereby simplifying the problem and making the algorithm more robust. Some other reconstruction algorithms (both ab initio (12) and refinement (35)) apply a low-pass filter to the current volume at every iteration to achieve a similar effect. However, in our case, this is a property of the model itself.
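Part of the speed comes from how cheaply a pseudo-atomic model can be projected: the projection of an isotropic three-dimensional Gaussian along any direction is an isotropic two-dimensional Gaussian of the same width. The sketch below illustrates this in pure Python and is not the authors' Cython code. The function name `project`, the argument layout, and the image convention (rows as y, columns as x, origin at the image center) are assumptions.

```python
import math


def project(positions, weights, sigma, rotation, size, pixel):
    """Project weighted isotropic 3D Gaussians onto a size x size image.

    `rotation` is a 3x3 matrix (list of rows); the projection is along z
    after rotation. `pixel` is the sampling rate in the units of `positions`.
    """
    half = (size - 1) / 2.0
    norm = 1.0 / (2.0 * math.pi * sigma ** 2)
    two_var = 2.0 * sigma ** 2
    image = [[0.0] * size for _ in range(size)]
    for (x, y, z), w in zip(positions, weights):
        # Rotate the center, then drop the coordinate along the projection axis.
        u = rotation[0][0] * x + rotation[0][1] * y + rotation[0][2] * z
        v = rotation[1][0] * x + rotation[1][1] * y + rotation[1][2] * z
        for i in range(size):
            dy = (i - half) * pixel - v
            for j in range(size):
                dx = (j - half) * pixel - u
                image[i][j] += w * norm * math.exp(-(dx * dx + dy * dy) / two_var)
    return image


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image = project([(0.0, 0.0, 0.0)], [1.0], 2.0, identity, 21, 1.0)
```

Only the pseudo-atom centers are rotated, so the cost per projection scales with the number of pseudo-atoms times the number of pixels, with no interpolation of a voxel grid. A practical implementation would likely evaluate each Gaussian only within a few widths of its center.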

Another principal difference from existing reconstruction methods is that we use Markov chain Monte Carlo sampling to generate an ensemble of models from the posterior distribution. This allows us to assess the ambiguity of the data when there are multiple reconstructions compatible with the projection images. It also adds to the robustness of our method. Moreover, we are able to estimate the precision of every model parameter including the pseudo-atom positions and the rotations (Figs. 5 and 9). Yet another advantage of sampling is that by computing the mean we can represent the final structure more accurately than would be possible using any single model, such as the MAP estimate. Compare, for example, any of the posterior models from the refinement stage with the final reconstruction in either Fig. 5 or 7. The individual pseudo-atoms are no longer visible in the posterior mean.
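The effect of averaging over the ensemble can be illustrated in one dimension. In the sketch below, each sampled model is a list of (position, weight) pairs with a shared width `sigma`, and the posterior mean density is the pointwise average of the sampled densities. The one-dimensional setting, the shared width, and the function names are simplifying assumptions for illustration only.

```python
import math


def density(model, sigma, grid):
    """Evaluate a 1D mixture of equal-width Gaussians on the grid."""
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return [
        sum(w * norm * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) for mu, w in model)
        for x in grid
    ]


def posterior_mean_density(samples, sigma, grid):
    """Average the densities of all sampled models, point by point."""
    total = [0.0] * len(grid)
    for model in samples:
        for i, value in enumerate(density(model, sigma, grid)):
            total[i] += value
    return [t / len(samples) for t in total]


# Two sampled models whose single pseudo-atom sits at slightly different positions.
samples = [[(-1.0, 1.0)], [(1.0, 1.0)]]
grid = [-1.0, 0.0, 1.0]
mean_density = posterior_mean_density(samples, sigma=1.0, grid=grid)
```

Because the pseudo-atom positions differ from sample to sample, the average is smoother than any single sampled density, which is consistent with individual pseudo-atoms no longer being visible in the posterior mean.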

Using our pseudo-atomic model, it is possible to express many different forms of prior information about the structure, and the Bayesian framework dictates how to incorporate such prior information. In this article, nonnegativity and smoothness were used as prior information, by using nonnegative weights for the pseudo-atoms, and restricting all pseudo-atoms to be the same size. Another form of prior information that we demonstrated using GroEL is symmetry constraints, which can be imposed on the pseudo-atom positions for inferring initial models with known symmetry. Some extensions are straightforward, such as using a known low-resolution version of the structure as a prior distribution for the pseudo-atom positions, or using a nonuniform prior distribution for the rotations in the case of structures with preferred orientations. A more ambitious possibility for future work is to incorporate data from other sources, such as cross-linking/mass spectrometry, or crystallography. Another direction of future research is to modify the algorithm to handle conformational heterogeneity by inferring multiple structural conformations.

Author Contributions

P.J. and M.H. designed and performed the research, and wrote the paper. P.J. contributed the analytic tools and analyzed the data.

Acknowledgments

We thank Niels Fischer and Holger Stark (MPI for Biophysical Chemistry Göttingen) for providing us with the APC/C data. We also thank Hans Elmlund for providing us with the 70S ribosome initial model obtained using the software PRIME (12). Molecular graphics and analyses were performed with the UCSF CHIMERA software package (36).

This work was supported by Deutsche Forschungsgemeinschaft grants No. HA 5918/1-1 and No. SFB860 TP B9.

Contributor Information

Paul Joubert, Email: pjouber@gwdg.de.

Michael Habeck, Email: mhabeck@gwdg.de.

Supporting Material

Document S1. Supporting Materials and Methods and three figures
mmc1.pdf (652.3KB, pdf)
Movie S1. Multiple Pseudo-Atomic Models of RNA Polymerase II with Increasing Numbers of Pseudo Atoms

It demonstrates that a small number of pseudo-atoms are sufficient for representing low-resolution structures and that the number of model parameters required is orders of magnitude fewer than with the standard grid-based representation.

Download video file (10.9MB, mp4)
Movie S2. Reconstruction of the 50S Ribosome from the Initial Random Model to the Final Model

The trajectories of the individual pseudo-atoms as well as those of the individual rotations can be seen. For the rotations, only the projection direction is shown (the first two Euler angles), not the in-plane rotation component (the third Euler angle).

Download video file (9.4MB, mp4)
Document S2. Article plus Supporting Material
mmc4.pdf (2.5MB, pdf)

References

  • 1. Frank J. Oxford University Press; New York: 2006. Three-Dimensional Electron Microscopy of Macromolecular Assemblies.
  • 2. Penczek P.A., Grassucci R.A., Frank J. The ribosome at improved resolution: new techniques for merging and orientation refinement in 3D cryo-electron microscopy of biological particles. Ultramicroscopy. 1994;53:251–270. doi: 10.1016/0304-3991(94)90038-8.
  • 3. Scheres S.H.W., Valle M., Carazo J.M. Maximum-likelihood multi-reference refinement for electron microscopy images. J. Mol. Biol. 2005;348:139–149. doi: 10.1016/j.jmb.2005.02.031.
  • 4. Penczek P.A., Zhu J., Frank J. A common-lines based method for determining orientations for N > 3 particle projections simultaneously. Ultramicroscopy. 1996;63:205–218. doi: 10.1016/0304-3991(96)00037-x.
  • 5. van Heel M. Angular reconstitution: a posteriori assignment of projection directions for 3D reconstruction. Ultramicroscopy. 1987;21:111–123. doi: 10.1016/0304-3991(87)90078-7.
  • 6. Elmlund H., Lundqvist J., Lindahl M. A new cryo-EM single-particle ab initio reconstruction method visualizes secondary structure elements in an ATP-fueled AAA+ motor. J. Mol. Biol. 2008;375:934–947. doi: 10.1016/j.jmb.2007.11.028.
  • 7. Singer A., Shkolnisky Y. Three-dimensional structure determination from common lines in cryo-EM by eigenvectors and semidefinite programming. SIAM J. 2011;4:543–572. doi: 10.1137/090767777.
  • 8. Elmlund D., Elmlund H. SIMPLE: software for ab initio reconstruction of heterogeneous single-particles. J. Struct. Biol. 2012;180:420–427. doi: 10.1016/j.jsb.2012.07.010.
  • 9. Lyumkis D., Vinterbo S., Carragher B. OPTIMOD—an automated approach for constructing and optimizing initial models for single-particle electron microscopy. J. Struct. Biol. 2013;184:417–426. doi: 10.1016/j.jsb.2013.10.009.
  • 10. Yan X., Dryden K.A., Baker T.S. Ab initio random model method facilitates 3D reconstruction of icosahedral particles. J. Struct. Biol. 2007;157:211–225. doi: 10.1016/j.jsb.2006.07.013.
  • 11. Sanz-García E., Stewart A.B., Belnap D.M. The random-model method enables ab initio 3D reconstruction of asymmetric particles and determination of particle symmetry. J. Struct. Biol. 2010;171:216–222. doi: 10.1016/j.jsb.2010.03.017.
  • 12. Elmlund H., Elmlund D., Bengio S. PRIME: probabilistic initial 3D model generation for single-particle cryo-electron microscopy. Structure. 2013;21:1299–1306. doi: 10.1016/j.str.2013.07.002.
  • 13. Vargas J., Alvarez-Cabrera A.L., Sorzano C.O. Efficient initial volume determination from electron microscopy images of single particles. Bioinformatics. 2014;30:2891–2898. doi: 10.1093/bioinformatics/btu404.
  • 14. Jaitly N., Brubaker M.A., Lilien R.H. A Bayesian method for 3D macromolecular structure inference using class average images from single particle electron microscopy. Bioinformatics. 2010;26:2406–2415. doi: 10.1093/bioinformatics/btq456.
  • 15. Scheres S.H.W., Gao H., Carazo J.M. Disentangling conformational states of macromolecules in 3D-EM through likelihood optimization. Nat. Methods. 2007;4:27–29. doi: 10.1038/nmeth992.
  • 16. Scheres S.H.W. Maximum-likelihood methods in cryo-EM. Part II: application to experimental data. Methods Enzymol. 2010;482:295–320. doi: 10.1016/S0076-6879(10)82012-9.
  • 17. Scheres S.H.W. A Bayesian view on cryo-EM structure determination. J. Mol. Biol. 2012;415:406–418. doi: 10.1016/j.jmb.2011.11.010.
  • 18. Scheres S.H.W. RELION: implementation of a Bayesian approach to cryo-EM structure determination. J. Struct. Biol. 2012;180:519–530. doi: 10.1016/j.jsb.2012.09.006.
  • 19. Kawabata T. Multiple subunit fitting into a low-resolution density map of a macromolecular complex using a Gaussian mixture model. Biophys. J. 2008;95:4643–4658. doi: 10.1529/biophysj.108.137125.
  • 20. Nogales-Cadenas R., Jonic S., Sorzano C.O. 3DEM LOUPE: analysis of macromolecular dynamics using structures from electron microscopy. Nucleic Acids Res. 2013;41:W363–W367. doi: 10.1093/nar/gkt385.
  • 21. Jin Q., Sorzano C.O.S., Jonić S. Iterative elastic 3D-to-2D alignment method using normal modes for studying structural dynamics of large macromolecular complexes. Structure. 2014;22:496–506. doi: 10.1016/j.str.2014.01.004.
  • 22. McLachlan G., Peel D. John Wiley; New York: 2000. Finite Mixture Models, Wiley Series in Probability and Statistics.
  • 23. Dempster A.P., Laird N.M., Rubin D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. A. 1977;39:1–38.
  • 24. Geman S., Geman D. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 1984;6:721–741. doi: 10.1109/tpami.1984.4767596.
  • 25. Byrd R.H., Lu P., Zhu C. A limited memory algorithm for bound constrained optimization. SIAM J. Sci. Comput. 1995;16:1190–1208.
  • 26. Habeck M. Generation of three-dimensional random rotations in fitting and matching problems. Comput. Stat. 2009;24:719–731.
  • 27. Rosenthal P.B., Henderson R. Optimal determination of particle orientation, absolute hand, and contrast loss in single-particle electron cryomicroscopy. J. Mol. Biol. 2003;333:721–745. doi: 10.1016/j.jmb.2003.07.013.
  • 28. Tang G., Peng L., Ludtke S.J. EMAN2: an extensible image processing suite for electron microscopy. J. Struct. Biol. 2007;157:38–46. doi: 10.1016/j.jsb.2006.05.009.
  • 29. Frank, J. 2014. 70S E. coli ribosome. Protein Data Bank. http://www.ebi.ac.uk/pdbe/emdb/test_data.html. Accessed December 2014.
  • 30. Zhao Z., Singer A. Rotationally invariant image representation for viewing direction classification in cryo-EM. J. Struct. Biol. 2014;186:153–166. doi: 10.1016/j.jsb.2014.03.003.
  • 31. EMAN Wiki. 2014. EMAN2.1 Workshops, Summer 2014. http://blake.bcm.edu/emanwiki/Ws2014. Accessed December 2014.
  • 32. Frye J.J., Brown N.G., Schulman B.A. Electron microscopy structure of human APC/C(CDH1)-EMI1 reveals multimodal mechanism of E3 ligase shutdown. Nat. Struct. Mol. Biol. 2013;20:827–835. doi: 10.1038/nsmb.2593.
  • 33. Marabini R., Herman G.T., Carazo J.M. 3D reconstruction in electron microscopy using ART with smooth spherically symmetric volume elements (blobs). Ultramicroscopy. 1998;72:53–65. doi: 10.1016/s0304-3991(97)00127-7.
  • 34. Sorzano C.O.S., Marabini R., Pascual-Montano A. XMIPP: a new generation of an open-source image processing package for electron microscopy. J. Struct. Biol. 2004;148:194–204. doi: 10.1016/j.jsb.2004.06.006.
  • 35. Scheres S.H.W., Chen S. Prevention of overfitting in cryo-EM structure determination. Nat. Methods. 2012;9:853–854. doi: 10.1038/nmeth.2115.
  • 36. Pettersen E.F., Goddard T.D., Ferrin T.E. UCSF CHIMERA—a visualization system for exploratory research and analysis. J. Comput. Chem. 2004;25:1605–1612. doi: 10.1002/jcc.20084.

Articles from Biophysical Journal are provided here courtesy of The Biophysical Society
