Author manuscript; available in PMC: 2023 Feb 3.
Published in final edited form as: Comput Vis ECCV. 2022 Oct 23;13681:540–557. doi: 10.1007/978-3-031-19803-8_32

CryoAI: Amortized Inference of Poses for Ab Initio Reconstruction of 3D Molecular Volumes from Real Cryo-EM Images

Axel Levy 1,2, Frédéric Poitevin 1, Julien Martel 2, Youssef Nashed 3, Ariana Peck 1, Nina Miolane 4, Daniel Ratner 3, Mike Dunne 1, Gordon Wetzstein 2
PMCID: PMC9897229  NIHMSID: NIHMS1824058  PMID: 36745134

Abstract

Cryo-electron microscopy (cryo-EM) has become a tool of fundamental importance in structural biology, helping us understand the basic building blocks of life. The algorithmic challenge of cryo-EM is to jointly estimate the unknown 3D poses and the 3D electron scattering potential of a biomolecule from millions of extremely noisy 2D images. Existing reconstruction algorithms, however, cannot easily keep pace with the rapidly growing size of cryo-EM datasets due to their high computational and memory cost. We introduce cryoAI, an ab initio reconstruction algorithm for homogeneous conformations that uses direct gradient-based optimization of particle poses and the electron scattering potential from single-particle cryo-EM data. CryoAI combines a learned encoder that predicts the poses of each particle image with a physics-based decoder to aggregate each particle image into an implicit representation of the scattering potential volume. This volume is stored in the Fourier domain for computational efficiency and leverages a modern coordinate network architecture for memory efficiency. Combined with a symmetrized loss function, this framework achieves results of a quality on par with state-of-the-art cryo-EM solvers for both simulated and experimental data, one order of magnitude faster for large datasets and with significantly lower memory requirements than existing methods.

Keywords: Cryo-electron Microscopy, Neural Scene Representation

1. Introduction

Understanding the 3D structure of proteins and their associated complexes is crucial for drug discovery, studying viruses, and understanding the function of the fundamental building blocks of life. Towards this goal, cryo-electron microscopy (cryo-EM) of isolated particles has been developed as the go-to method for imaging and studying molecular assemblies at near-atomic resolution [21,31,39]. In a cryo-EM experiment, a purified solution of the molecule of interest is frozen in a thin layer of vitreous ice, exposed to an electron beam, and randomly oriented projections of the electron scattering potential (i.e., the volume) are imaged on a detector (Fig. 1 (a)). These raw micrographs are then processed by an algorithm that reconstructs the volume and estimates the unknown pose, including orientation and centering shift, of each particle extracted from the micrographs (Fig. 1 (b)).

Fig. 1.

(a) (Top) Illustration of a cryo-EM experiment. Molecules are frozen in a random orientation and their electron scattering potential (i.e., volume) V interacts with an electron beam imaged on a detector. (Bottom) Noisy projections (i.e., particles) of V selected from the full micrograph measured by the detector. (b) Output of a reconstruction algorithm: poses ϕi and volume V. Each pose is characterized by a rotation in SO(3) (hue represents in-plane rotation) and a translation in $\mathbb{R}^2$ (not shown). An equipotential surface of V is shown on the right. (c) Evolution of the maximum number of images collected in one day [29] and established and emerging state-of-the-art reconstruction methods.

Recent advances in sample preparation, instrumentation, and data collection capabilities have resulted in very large amounts of data being recorded for each cryo-EM experiment [4,29] (Fig. 1 (c)). Millions of noisy (images of) particles, each with an image size on the order of $100^2$–$400^2$ pixels, need to be processed by the reconstruction algorithm to jointly estimate the pose of each particle and the unknown volume. Most existing algorithms that have been successful with experimental cryo-EM data address this problem using a probabilistic approach that iteratively alternates between updating the volume and the estimated poses [44,37,59,61]. The latter “orientation matching” step, however, is computationally expensive, requiring an exhaustive search in a 5-dimensional space ($\phi_i \in SO(3) \times \mathbb{R}^2$) for each particle. In spite of using smart pose search strategies and optimization schedules, the orientation matching step is the primary bottleneck of existing cryo-EM reconstruction algorithms, requiring hours to estimate a single volume and scaling poorly with increasing dataset sizes.

We introduce cryoAI, a technique that uses direct gradient-based optimization to jointly estimate the poses and the electron scattering potential of a non-deformable molecule (homogeneous reconstruction). Our method operates in an unsupervised manner over a set of images with an encoder–decoder pipeline. The encoder learns a discriminative model that associates each particle image with a pose and the decoder is a generative physics-based pipeline that uses the predicted pose and a description of the volume to predict an image. The volume is maintained by an implicit, i.e., neural network–parameterized, representation in the decoder, and the image formation model is simulated in Fourier space, thereby avoiding the approximation of integrals via the Fourier-slice theorem (see Sec. 3.1). By learning a mapping from images to poses, cryoAI avoids the computationally expensive step of orientation matching that limits existing cryo-EM reconstruction methods. Our approach thus amortizes over the size of the dataset and provides a scalable approach to working with modern, large-scale cryo-EM datasets. We demonstrate that cryoAI performs homogeneous reconstructions of a comparable resolution but with nearly one order of magnitude faster runtime than state-of-the-art methods using datasets containing millions of particles.

Specifically, our contributions include

  • a framework that learns to map images to particle poses while reconstructing an electron scattering potential for homogeneous single-particle cryo-EM;

  • demonstration of reconstruction times and memory consumption that amortize over the size of the dataset, with nearly an order of magnitude improvement over existing algorithms on large datasets;

  • formulations of a symmetrized loss function and an implicit Fourier-domain volume representation that enable the high-quality reconstructions we show.

Source code will be made public upon publication.

2. Related Work

Estimating the 3D structure of an object from its 2D projections with known orientations is a classical problem in tomography and has been solved using backprojection-based methods [18,43] or compressive sensing–style solvers [8,12]. In cryo-EM, the reconstruction problem is complicated by several facts: (1) the poses of the unknown object are also unknown for all projections; (2) the signal-to-noise ratio (SNR) is extremely low (around −20 dB for experimental datasets [6,5]); (3) the molecules in a sample can deform and be frozen in various (unknown) conformations. Unlike homogeneous reconstruction methods, heterogeneous methods take into account the deformations of the molecule and reconstruct a discrete set or a low-dimensional manifold of conformations. Although they give more structural information, most recent heterogeneous methods [59,36,62,9] assume the poses to be known. For each particle i, a pose ϕi is defined by a rotation $R_i \in SO(3)$ and a translation $t_i \in \mathbb{R}^2$. In this work, we do not assume the poses to be known and aim to estimate the electron scattering function V of a unique underlying molecule in a homogeneous setting. We classify previous work on pose estimation into two inference categories [11]: non-amortized and amortized.

Non-amortized Inference

Non-amortized Inference refers to a class of methods where the posterior distribution of the poses p(ϕi|Yi, V) is computed independently for each image Yi. Common-line approaches [51,47,55,16,35,57], projection-matching strategies [33,3] and Bayesian formulations [24,10,45,37] belong to this category. The software package RELION [44] widely popularized the Bayesian approach by performing Maximum-A-Posteriori (MAP) optimization through Expectation–Maximization (EM). Posterior distributions over the poses (and the optional conformational states) are computed for each image in the expectation step and all frequency components of the volume are updated in the maximization step, which makes the approach computationally costly. The competing software cryoSPARC [37] proposed to perform MAP optimization jointly using stochastic gradient descent (SGD) to optimize the volume V and branch-and-bound algorithms [22] to estimate the poses ϕi. While a gradient-based optimization scheme for V circumvents the costly updates in the maximization step of RELION, a pose must be estimated for each image by aligning each 2D projection Yi with the estimated 3D volume V. Although branch-and-bound algorithms can accelerate the pose search, this step remains computationally expensive and is one of the bottlenecks of the method in terms of runtime. Ullrich et al. [50] proposed a variational and differentiable formulation of the optimization problem in the Fourier domain. Although they demonstrated that their method can estimate the volume when poses are known, they also showed that jointly optimizing the pose posterior distributions by SGD fails due to the high non-convexity of the problem. Instead of parameterizing the volume with a 3D voxel array, Zhong et al. proposed in cryoDRGN [60,59,61] to use a coordinate-based representation (details in Sec. 3.4) to directly approximate the electron scattering function in Fourier space. 
Their neural representation takes 3D Fourier coordinates and a latent vector encoding the conformational state as input, therefore accounting for continuous deformations of the molecule. The latest published version of cryoDRGN [59] reports excellent results on the reconstruction of conformational heterogeneities but assumes the poses to be determined by a consensus reconstruction. Poses are jointly estimated with V in cryoDRGN-BNB [60] and cryoDRGN2 [61], but in spite of a frequency-marching strategy, the use of a branch-and-bound algorithm and a later-introduced multi-resolution approach, the global 5D pose search remains the most computationally expensive step in their pipeline.

Amortized Inference

Amortized Inference techniques, on the other hand, learn a parameterized function qξ(Yi) that approximates the posterior distribution of the poses p(ϕi|Yi, V) [14]. At the expense of optimizing the parameter ξ, these techniques avoid the orientation matching step, which is the main computational bottleneck in non-amortized methods. Lian et al. [23] demonstrated the possibility of using a convolutional neural network to approximate the mapping between cryo-EM images and orientations, but their method cannot perform end-to-end volume reconstruction. In cryoVAEGAN [27], Miolane et al. showed that the in-plane rotation could be disentangled from the contrast transfer function (CTF) parameters in the latent space of an encoder. Rosenbaum et al. [41] were the first to demonstrate volume reconstruction from unknown poses in a framework of amortized inference. In their work, distributions of poses and conformational states are predicted by the encoder of a Variational Autoencoder (VAE) [20]. In their model-based decoder, the predicted conformation is used to deform a base backbone frame of Gaussian blobs and the predicted pose is used to make a projection of these blobs. The reconstructed image is compared to the measurement in order to optimize the parameters of both the encoder and the decoder. While this method is able to account for conformational heterogeneity in a dataset, it requires a priori information about the backbone frame. CryoPoseNet [30] proposed a non-variational autoencoder framework that can perform homogeneous reconstruction with a random initialization of the volume, avoiding the need for prior information about the molecule. Although it demonstrated the possibility of using a non-variational encoder to predict the orientations Ri, cryoPoseNet assumes the translations ti to be given and stores the volume in real space in the decoder (while the image formation model is in Fourier space, see Sec. 3.1), thereby requiring a 3D Fourier transform at each forward pass and making the overall decoding step slow. The volume reconstructed by cryoPoseNet often gets stuck in local minima, which is a problem we also address in this paper (see Sec. 3.5). Finally, the last two methods were only demonstrated on simulated datasets and, to the best of our knowledge, no amortized inference technique for volume estimation from unknown poses has been proven to work with experimental datasets in cryo-EM.

Previous methods differ in the way poses are inferred in the generative model. Yet, the only variable of interest is the description of the conformational state (for heterogeneous methods) and associated molecular volumes, while poses can be considered “nuisance” variables. As a result, recent works have explored methods that avoid the inference of poses altogether, such as GAN-based approaches [1]. CryoGAN [17], for example, used a cryo-EM simulator and a discriminator neural network to optimize a 3D volume. Although preliminary results are shown on experimental datasets, the reconstruction cannot be further refined with other methods due to the absence of predicted poses.

Our approach performs an amortized inference of poses and therefore circumvents the need for the expensive searches over $SO(3) \times \mathbb{R}^2$ required by non-amortized techniques. In the implementation, no parameter needs to be statically associated with each image. Consequently, the memory footprint and the runtime of our algorithm do not scale with the number of images in the dataset. We introduce a loss function called “symmetrized loss” that prevents the model from getting stuck in local minima with spurious planar symmetries. Finally, in contrast to previous amortized inference techniques, our method can perform volume reconstruction on experimental datasets.

3. Methods

3.1. Image Formation Model and Fourier-slice Theorem

In a cryo-EM sample, the charges carried by each molecule and their surrounding environment create an electrostatic potential that scatters probing electrons, which we refer to as the electron scattering “volume,” and consider as a mapping

$$V : \mathbb{R}^3 \to \mathbb{R}. \tag{1}$$

In the sample, each molecule i is in an unknown orientation $R_i \in SO(3) \subset \mathbb{R}^{3 \times 3}$. The probing electron beam interacts with the electrostatic potential and its projections

$$Q_i : (x, y) \mapsto \int_z V\left(R_i \cdot [x, y, z]^T\right) dz \tag{2}$$

are considered mappings from $\mathbb{R}^2$ to $\mathbb{R}$. The beam then interacts with the lens system characterized by the Point Spread Function (PSF) Pi and individual particles are cropped from the full micrograph. The obtained images may not be perfectly centered on the molecule and small translations are modeled by $t_i \in \mathbb{R}^2$. Finally, taking into account signal arising from the vitreous ice into which the molecules are embedded as well as the non-idealities of the lens and the detector, each image Yi is generally modeled as

$$Y_i = T_{t_i} * P_i * Q_i + \eta_i \tag{3}$$

where * is the convolution operator, Tt the t-translation kernel and ηi white Gaussian noise on $\mathbb{R}^2$ [53,44].

With a formulation in real space, both the integral over z in Eq. (2) and the convolution in Eq. (3) make the simulation of the image formation model computationally expensive. A way to avoid these operations is to use the Fourier-slice Theorem (FST) [7], which states that for any volume V and any orientation Ri,

$$\mathcal{F}_{2D}[Q_i] = S_i\left[\mathcal{F}_{3D}[V]\right], \tag{4}$$

where $\mathcal{F}_{2D}$ and $\mathcal{F}_{3D}$ are the 2D and 3D Fourier transform operators and $S_i$ the “slice” operator defined such that for any $\hat{V} : \mathbb{R}^3 \to \mathbb{C}$,

$$S_i[\hat{V}] : (k_x, k_y) \mapsto \hat{V}\left(R_i \cdot [k_x, k_y, 0]^T\right). \tag{5}$$

That is, $S_i[\hat{V}]$ corresponds to a 2D slice of $\hat{V}$ with orientation $R_i$ and passing through the origin. In a nutshell, if $\hat{Y}_i = \mathcal{F}_{2D}[Y_i]$ and $\hat{V} = \mathcal{F}_{3D}[V]$, the image formation model in Fourier space can be expressed as

$$\hat{Y}_i = \hat{T}_{t_i} \odot C_i \odot S_i[\hat{V}] + \hat{\eta}_i, \tag{6}$$

where ⊙ is the element-wise multiplication, $C_i = \mathcal{F}_{2D}[P_i]$ is the Contrast Transfer Function (CTF), $\hat{T}_t$ the t-translation operator in Fourier space (phase shift) and $\hat{\eta}_i$ complex white Gaussian noise on $\mathbb{R}^2$. Based on this generative model, cryoAI solves the inverse problem of inferring $\hat{V}$, $R_i$ and $t_i$ from $\hat{Y}_i$ assuming $C_i$ is known.
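The Fourier-slice theorem can be checked numerically in its simplest discrete form. The sketch below is an illustration only, not the authors' code: it assumes the identity orientation, a random toy volume, and NumPy's FFT conventions, and compares the 2D transform of a z-projection with the central slice of the 3D transform.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 16
V = rng.standard_normal((L, L, L))   # toy real-space volume indexed as V[x, y, z]

# Real-space route: integrate along z (Eq. 2 with the identity rotation), then 2D FFT.
Q = V.sum(axis=2)
Q_hat = np.fft.fft2(Q)

# Fourier-space route: 3D FFT, then take the central slice k_z = 0 (Eq. 5).
V_hat = np.fft.fftn(V)
slice_hat = V_hat[:, :, 0]

max_abs_diff = np.max(np.abs(Q_hat - slice_hat))
print(max_abs_diff)   # agreement up to floating-point error
```

For a general orientation the slice no longer falls on grid points, which is one motivation for a volume representation that can be queried at arbitrary coordinates.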

3.2. Overview of CryoAI

CryoAI is built with an autoencoder architecture (see Fig. 2). The encoder takes an image Yi as input and outputs a predicted orientation Ri along with a predicted translation ti (Sec. 3.3). Ri is used to rotate a 2-dimensional grid of $L^2$ 3D coordinates $[k_x, k_y, 0] \in \mathbb{R}^3$, which are then fed into the neural network V^θ. This neural network is an implicit representation of the current estimate of the volume V^ (in Fourier space), and this query operation corresponds to the “slicing” defined by Eq. (5) (Sec. 3.4). Based on the estimated translation ti and given CTF parameters Ci, the rest of the image formation model described in Eq. (6) is simulated to obtain X^i, a noise-free estimation of Y^i. These images are compared using a loss described in Sec. 3.5 and gradients are backpropagated throughout the differentiable model in order to optimize both the encoder and the neural representation.
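The rotated coordinate grid described above can be sketched in a few lines. The function name `slice_coordinates`, the centered frequency grid in cycles per pixel, and the example rotation are assumptions made for illustration; they are not taken from the authors' implementation.

```python
import numpy as np

def slice_coordinates(R, L):
    """Return an (L, L, 3) array holding R @ [kx, ky, 0]^T for every grid point."""
    k = np.fft.fftshift(np.fft.fftfreq(L))            # centered frequencies, cycles per pixel
    kx, ky = np.meshgrid(k, k, indexing="ij")
    grid = np.stack([kx, ky, np.zeros_like(kx)], axis=-1)   # the plane k_z = 0
    return grid @ R.T                                  # applies R to every coordinate

# A 90-degree rotation about the x axis maps the k_z = 0 plane onto the k_y = 0 plane.
R = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])
coords = slice_coordinates(R, L=8)
```

Each of the L × L rows of `coords` would then be passed to the volume representation to obtain one Fourier coefficient of the predicted slice.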

Fig. 2.

Overview of our pipeline. The encoder, parameterized by ξ, learns to map images Yi to their associated pose ϕi = (Ri, ti). The matrix Ri rotates a slice of 3D coordinates in Fourier space. The coordinates are fed into a neural representation of V^, parameterized by θ. The output is multiplied by the CTF Ci and the translation operator T^ti to build X^i, a noise-free estimation of F2D[Yi] = Y^i. X^i and Y^i are compared via the symmetrized loss Lsym. Differentiable parameters are represented in blue.

3.3. Pose Estimation

CryoAI uses a Convolutional Neural Network (CNN) to predict the parameters Ri and ti from a given image, thereby avoiding expensive orientation matching computations performed by other methods [44,37,61]. The architecture of this encoder has three layers.

  1. Low-pass filtering: $Y_i \in \mathbb{R}^{L \times L}$ is fed into a bank of Gaussian low-pass filters.

  2. Feature extraction: the filtered images are stacked channel-wise and fed into a CNN whose architecture is inspired by the first layers of VGG16 [46], which is known to perform well on image classification tasks.

  3. Pose estimation: the resulting feature vector finally becomes the input of two separate fully-connected neural networks. The first one outputs a vector of dimension 6 in $S^2 \times S^2$ [63] (two vectors on the unit sphere in $\mathbb{R}^3$), which is converted into a matrix $R_i \in \mathbb{R}^{3 \times 3}$ using the PyTorch3D library [38]. The second one outputs a vector of dimension 2, directly interpreted as a translation vector $t_i \in \mathbb{R}^2$.

We call ξ the set of differentiable parameters in the encoder described above. We point the reader to Supp. B for more details about the architecture of the encoder.
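To make the 6-dimensional output concrete, the sketch below converts a 6-vector into a rotation matrix by Gram-Schmidt orthonormalization, the usual construction for this kind of representation. It is a NumPy stand-in written for illustration: the paper states that PyTorch3D performs the conversion, and the function name `six_d_to_rotation` and the choice of stacking the basis vectors as rows are assumptions.

```python
import numpy as np

def six_d_to_rotation(v6):
    """Map a 6-vector (two 3D vectors) to a 3x3 rotation matrix by Gram-Schmidt."""
    a1 = np.asarray(v6[:3], dtype=float)
    a2 = np.asarray(v6[3:], dtype=float)
    b1 = a1 / np.linalg.norm(a1)
    a2_perp = a2 - np.dot(b1, a2) * b1       # remove the component of a2 along b1
    b2 = a2_perp / np.linalg.norm(a2_perp)
    b3 = np.cross(b1, b2)                    # right-handed third axis
    return np.stack([b1, b2, b3], axis=0)    # rows form an orthonormal basis (assumed convention)

R = six_d_to_rotation([1.0, 2.0, 0.5, -0.3, 0.7, 1.1])
```

Because every step is differentiable wherever the two input vectors are not collinear, gradients can flow from the image loss back to the network that produced the 6-vector.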

3.4. Neural Representation in Fourier Space (FourierNet)

Instead of using a voxel-based representation, we maintain the current estimate of the volume using a neural representation. This representation is parameterized by θ and can be seen as a mapping $\hat{V}_\theta : \mathbb{R}^3 \to \mathbb{C}$.

In imaging and volume rendering, neural representations have been used to approximate signals defined in real space [32,2,13,25,49]. Neural Radiance Field (NeRF) [26] is a successful technique to maintain a volumetric representation of a real scene. A view-independent NeRF model, for example, maps real 3D-coordinates [x, y, z] to a color vector and a density scalar using positional encoding [52] and a set of fully-connected layers with ReLU activation functions. Sinusoidal Representation Networks (SIRENs) [48] can also successfully approximate 3D signed distance functions with a shallow fully-connected neural network using sinusoidal activation functions. However, these representations are tailored to approximate signals defined in real space. Here, we want to directly represent the Fourier transform of the electrostatic potential of a molecule. Since this potential is a smooth function of the spatial coordinates, the amplitude of its Fourier coefficients V^(k) is expected to decrease with |k|, following a power law (see Supp. C for more details). In practice, this implies that |V^| can vary over several orders of magnitude and SIRENs, for example, are known to poorly approximate these types of functions [48]. The first method to use neural representations for volume reconstruction in cryo-EM, cryoDRGN [60,61], proposed to use a Multi-Layer Perceptron (MLP) with positional encoding in Hartley space (where the FST still applies).

With our work, we introduce a new kind of neural representation (FourierNet), tailored to represent signals defined in the Fourier domain, inspired by the success of SIRENs for signals defined in real space. Our idea is to allow a SIREN to represent a signal with a high dynamic range by passing its output through an exponential function. Said differently, the SIREN only represents a signal that scales logarithmically with the approximated function. Since Fourier coefficients are defined on the complex plane, we use a second network in our implicit representation to account for the phase variations. This architecture is summarized in Fig. 2 and details on memory requirements are given in Supp. C. Input coordinates [kx, ky, kz] are fed into two separate SIRENs outputting 2-dimensional vectors. For one of them, the exponential function is applied element-wise and the two obtained vectors are finally element-wise multiplied to produce a vector in $\mathbb{R}^2$, mapped to $\mathbb{C}$ with the Cartesian coordinate system. Since V^θ must represent the Fourier transform of real signals, it should satisfy $\hat{V}_\theta(-k) = \hat{V}_\theta(k)^*$. We enforce this property by defining

$$\hat{V}_\theta(k) = \hat{V}_\theta(-k)^* \quad \text{if } k_x < 0. \tag{7}$$

Benefits of this neural representation are shown on 2-dimensional signals in the Supp. C.
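A forward pass through a FourierNet-like representation can be sketched as follows. This is a minimal illustration with untrained random weights; the layer widths, the frequency scale `w0`, the initialization, and all function names are assumptions rather than the authors' settings. It shows the exponential applied to one network, the element-wise product, the mapping to a complex number, and the Hermitian symmetry enforced for kx < 0.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_siren(widths, w0=30.0):
    """Random weights for a small sine-activated MLP (sizes and w0 are illustrative)."""
    params = []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        bound = np.sqrt(6.0 / n_in) / w0
        params.append((rng.uniform(-bound, bound, (n_in, n_out)), np.zeros(n_out)))
    return params

def siren(params, k, w0=30.0):
    h = k
    for W, b in params[:-1]:
        h = np.sin(w0 * (h @ W + b))       # sinusoidal activations on hidden layers
    W, b = params[-1]
    return h @ W + b                       # linear output layer, shape (N, 2)

exp_net = init_siren([3, 32, 32, 2])       # its output goes through exp()
lin_net = init_siren([3, 32, 32, 2])       # its output is used as-is

def fouriernet_raw(k):
    out = np.exp(siren(exp_net, k)) * siren(lin_net, k)   # element-wise product in R^2
    return out[:, 0] + 1j * out[:, 1]                     # Cartesian (re, im) -> complex

def fouriernet(k):
    """Query only the half-space k_x >= 0 and fill k_x < 0 by Hermitian symmetry."""
    k = np.asarray(k, dtype=float)
    flip = k[:, 0] < 0
    values = fouriernet_raw(np.where(flip[:, None], -k, k))
    return np.where(flip, np.conj(values), values)

k = rng.uniform(-0.5, 0.5, (5, 3))
values = fouriernet(k)
```

Because the exponential multiplies the second network's output, small changes in the first network translate into multiplicative changes of the amplitude, which is what lets the representation span several orders of magnitude.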

The neural representation is queried for a set of $L^2$ 3D coordinates [kx, ky, kz], thereby producing a discretized slice $S_i[\hat{V}_\theta] \in \mathbb{C}^{L \times L}$. The rest of the image formation model (6) is simulated by element-wise multiplying $S_i[\hat{V}_\theta]$ by the CTF $C_i$ and a translation matrix,

$$\hat{X}_i = \hat{T}_{t_i} \odot C_i \odot S_i[\hat{V}_\theta], \tag{8}$$

where $\hat{T}_{t_i}$ is defined by

$$\hat{T}_{t_i}(k) = \exp(-2j\pi \, k \cdot t_i). \tag{9}$$

The parameters of the CTF are provided by external CTF estimation software such as CTFFIND [40]. The whole encoder–decoder pipeline can be seen as a function that we call Γξ,θ, such that X^i = Γξ,θ(Yi).
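The translation operator is a pure phase ramp, which the following sketch verifies on a toy image. The sign of the exponent is tied to the Fourier transform convention: with NumPy's forward FFT, a shift by +t corresponds to multiplying by exp(−2jπ k·t), and that convention is assumed here. The function name `translation_operator` and the integer test shift are illustrative choices.

```python
import numpy as np

def translation_operator(t, L):
    """Phase ramp exp(-2j*pi*k.t) on an L x L Fourier grid; t is a shift in pixels."""
    k = np.fft.fftfreq(L)                      # cycles per pixel, FFT ordering
    kx, ky = np.meshgrid(k, k, indexing="ij")
    return np.exp(-2j * np.pi * (kx * t[0] + ky * t[1]))

rng = np.random.default_rng(0)
L = 32
image = rng.standard_normal((L, L))
t = (3, -5)                                    # integer shift so np.roll gives a reference

# Shift in Fourier space: one element-wise product, no convolution needed.
shifted_fourier = np.fft.ifft2(translation_operator(t, L) * np.fft.fft2(image)).real
shifted_direct = np.roll(image, shift=t, axis=(0, 1))
```

The same phase ramp accepts non-integer shifts, which is what allows the predicted translation to be a continuous, differentiable quantity.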

3.5. Symmetrized Loss

In the image formation model of Eq. (3), the additive noise ηi is assumed to be Gaussian and uncorrelated (white Gaussian noise) [53,44], which means that its Fourier transform η^i follows the same kind of distribution. Therefore, maximum likelihood estimation on a batch B amounts to the minimization of the L2-loss.
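The step from the Gaussian noise model to the L2 loss can be spelled out as follows. The constant c > 0 is introduced here for illustration only; it depends solely on the noise variance, which is assumed to be the same for all Fourier coefficients and all images.

```latex
-\log p\big(\hat{Y}_i \mid \hat{X}_i\big)
  = c \,\big\lVert \hat{Y}_i - \hat{X}_i \big\rVert^2 + \text{const}
\quad\Longrightarrow\quad
\arg\max_{\xi,\theta} \prod_{i \in B} p\big(\hat{Y}_i \mid \hat{X}_i\big)
  = \arg\min_{\xi,\theta} \sum_{i \in B}
    \big\lVert \hat{Y}_i - \Gamma_{\xi,\theta}(Y_i) \big\rVert^2 .
```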

Nonetheless, we empirically observed that using this loss often led the model to get stuck in local minima where the estimated volume showed spurious planar symmetries (see Sec. 4.3). We hypothesize that this behaviour is linked to the fundamental ambiguity contained in the image formation model in which, given unknown poses, one cannot distinguish two “mirrored” versions of the same volume [42]. We discuss this hypothesis in more detail in Supp. D. To solve this problem, we designed a loss that we call “symmetrized loss” defined as

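$$\mathcal{L}_{\text{sym}} = \sum_{i \in B} \min\left\{ \left\lVert \hat{Y}_i - \Gamma_{\xi,\theta}(Y_i) \right\rVert^2, \; \left\lVert R_\pi[\hat{Y}_i] - \Gamma_{\xi,\theta}(R_\pi[Y_i]) \right\rVert^2 \right\} \tag{10}$$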

where Rπ applies an in-plane rotation of π on L × L images. Using the symmetrized loss, the model can be supervised on a set of images Yi in which the predicted in-plane rotation (embedded in the predicted matrix Ri) can always fall in [−π/2, π/2] instead of [−π, π]. As shown in Sec. 4.3 and explained in Supp. D, this prevents cryoAI from getting stuck in spuriously symmetrical states.
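The symmetrized loss can be sketched with a generic callable standing in for the encoder–decoder pipeline. Two simplifications are assumed here and are not claims about the authors' code: the rotation by π is implemented by flipping both image axes, and the rotated Fourier target is computed as the FFT of the rotated image. The toy pipeline `toy_gamma` is purely hypothetical.

```python
import numpy as np

def rot_pi(image):
    """In-plane rotation by pi of an L x L image (flip both axes)."""
    return image[::-1, ::-1]

def symmetrized_loss(images, gamma):
    """Sum over the batch of the smaller of the two squared errors.

    `gamma` is any callable mapping a real image to a predicted Fourier image.
    """
    total = 0.0
    for y in images:
        y_rot = rot_pi(y)
        err = np.sum(np.abs(np.fft.fft2(y) - gamma(y)) ** 2)
        err_rot = np.sum(np.abs(np.fft.fft2(y_rot) - gamma(y_rot)) ** 2)
        total += min(err, err_rot)
    return total

rng = np.random.default_rng(0)
y0 = rng.standard_normal((8, 8))
target = np.fft.fft2(rot_pi(y0))

def toy_gamma(y):
    # Toy stand-in that ignores its input and always predicts the rotated image.
    return target

plain_l2 = np.sum(np.abs(np.fft.fft2(y0) - toy_gamma(y0)) ** 2)
sym = symmetrized_loss([y0], toy_gamma)
```

In this toy case the plain L2 error is large while the symmetrized loss vanishes, which mirrors the intent of the loss: a prediction that is correct up to an in-plane rotation of π is not penalized.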

4. Results

We qualitatively and quantitatively evaluate cryoAI for ab initio reconstruction of both simulated and experimental datasets. We first compare cryoAI to the state-of-the-art method cryoSPARC [37] in terms of runtime on a simulated dataset of the 80S ribosome with low levels of noise. We then compare our method with baseline methods in terms of resolution and pose accuracy on simulated datasets with and without noise (spike, spliceosome). Next, we show that cryoAI can perform ab initio reconstruction on an experimental cryo-EM dataset (80S), which is the first time for a method estimating poses in an amortized fashion. Finally, we highlight the importance of a tailored neural representation in the decoder and the role of the symmetrized loss in an ablation study.

4.1. Reconstruction on Simulated Datasets

Experimental Setup.

We synthesize three datasets from deposited Protein Data Bank (PDB) structures of the Plasmodium falciparum 80S ribosome (PDB: 3J79 and 3J7A) [56], the SARS-CoV-2 spike protein (PDB: 6VYB) [54] and the pre-catalytic spliceosome (PDB: 5NRL) [34]. First, a 3D grid map, the ground-truth volume, is generated in ChimeraX [15] from each atomic model using the steps described in Supp. A. Then a dataset is generated from the ground-truth volume using the image formation model described in Sec. 3.1. Images are sampled at L = 128. Rotations Ri are randomly generated following a uniform distribution over SO(3) and random translations ti are generated following a zero-mean Gaussian distribution (σ = 20 Å). The defocus parameters of the CTFs are generated with a log-normal distribution. We build noise-free (ideal) and noisy versions of each dataset (SNR = 0dB for 80S, SNR = −10dB for the others, see Supp. E for details). We compare cryoAI with three baselines: the state-of-the-art software cryoSPARC v3.2.0 [37], the neural network–based method cryoDRGN2 [61] and the autoencoder-based method cryoPoseNet [30] (with the image formation model in real space in the decoder, see Supp. A). We quantify the accuracy of the reconstructed volume by computing the Fourier Shell Correlations (FSC) between the reconstruction and the ground truth and reporting the resolution at the 0.5 cutoff. All experiments are run on a single Tesla V100 GPU with 8 CPUs.
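For readers unfamiliar with the metric, a Fourier Shell Correlation and a resolution read-out can be sketched as below. This is a simplified illustration, not the procedure used by the authors or by cryoSPARC: shells are binned by rounded integer radius, no mask is applied, and the resolution is taken at the first shell below the cutoff without interpolation.

```python
import numpy as np

def fourier_shell_correlation(vol_a, vol_b):
    """FSC per integer-radius shell for two cubic volumes of the same size."""
    L = vol_a.shape[0]
    fa, fb = np.fft.fftn(vol_a), np.fft.fftn(vol_b)
    k = np.fft.fftfreq(L) * L                                  # integer frequency indices
    kx, ky, kz = np.meshgrid(k, k, k, indexing="ij")
    shell = np.rint(np.sqrt(kx**2 + ky**2 + kz**2)).astype(int).ravel()
    n_shells = L // 2 + 1                                      # drop corners beyond Nyquist
    keep = shell < n_shells

    def shell_sum(values):
        return np.bincount(shell[keep], weights=values.ravel()[keep], minlength=n_shells)

    cross = shell_sum((fa * np.conj(fb)).real)
    power_a = shell_sum(np.abs(fa) ** 2)
    power_b = shell_sum(np.abs(fb) ** 2)
    return cross / np.sqrt(power_a * power_b)

def resolution_in_pixels(fsc, L, cutoff=0.5):
    """Period, in pixels, of the first non-zero shell where the FSC is below the cutoff."""
    below = np.nonzero(fsc[1:] < cutoff)[0]
    if below.size == 0:
        return 2.0                                             # never drops: Nyquist
    return L / (below[0] + 1)

rng = np.random.default_rng(0)
vol = rng.standard_normal((16, 16, 16))
fsc_self = fourier_shell_correlation(vol, vol)
```

A resolution expressed in pixels converts to ångströms by multiplying by the pixel size of the dataset.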

Convergence Time.

We compare cryoAI with cryoSPARC in terms of runtime for datasets of increasing size in Fig. 3. We use the simulated 80S dataset and define the running time as the time needed to reach a resolution of 10 Å (2.65 pixels), which is a sufficiently accurate resolution to perform refinement with cryoSPARC (see workflow in Supp. A). We instruct cryoSPARC to stop the ab initio reconstruction when this resolution is reached. We additionally show the time required by cryoSPARC for importing data and for the refinement step. With cryoAI, the computational complexity and the number of statically maintained variables do not scale with the number of images, making the convergence time independent of the size of the dataset. By contrast, the computation time of cryoSPARC increases with the number of images and can reach 5 hours with a dataset of 9M particles. We additionally show in Supp. F the time required to estimate all the poses of the dataset with cryoAI’s encoder.

Fig. 3.

(Left) Time to reach 10 Å of resolution with cryoAI (range and average over 5 runs per datapoint) and cryoSPARC vs. number of images in the simulated 80S dataset. (Right) Estimated volume at initialization and after 35 min of running cryoAI vs. cryoSPARC after convergence, with 9M images.

Accuracy.

We compare cryoAI with baseline methods on the spike and spliceosome datasets in Table 1. We compare the reconstructed variables (volume and poses) with their ground truth values (from simulation). Results of cryoDRGN2 are reported from available data in [59]. Images were centered for cryoPoseNet since the method does not predict ti. A “tight” adaptive mask was used with cryoSPARC. The performance of cryoAI is comparable with the baselines. The spliceosome and the noise-free spike protein are reconstructed with state-of-the-art accuracy. In the noisy spike dataset, the accuracy of cryoAI and cryoSPARC decreases, which may be due to the pseudo-symmetries shown by the molecule (visual reconstruction in Supp. F). CryoPoseNet gets stuck for at least 24 hours in a state where the resolution is very poor on both spike datasets.

Table 1.

Accuracy of pose and volume estimation for simulated data. Resolution (Res.) is reported using the FSC = 0.5 criterion, in pixels (↓). Rotation (Rot.) error is the median square Frobenius norm between predicted and ground truth matrices Ri (↓). Translation (Trans.) error is the mean square L2-norm, in pixels (↓).

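| Dataset | Metric | cryoPoseNet | cryoSPARC | cryoDRGN2 | cryoAI |
|---|---|---|---|---|---|
| Spliceosome (ideal) | Res. | 2.78 | 2.13 | – | 2.13 |
| | Rot. | 0.004 | 0.0002 | – | 0.0004 |
| | Trans. | – | 0.006 | – | 0.001 |
| Spliceosome (noisy) | Res. | 3.15 | 2.61 | – | 2.61 |
| | Rot. | 0.01 | 0.002 | – | 0.007 |
| | Trans. | – | 0.007 | – | 0.01 |
| Spike (ideal) | Res. | 16.0 | 2.33 | – | 2.29 |
| | Rot. | 5 | 0.0003 | 0.0001 | 0.0003 |
| | Trans. | – | 0.007 | – | 0.001 |
| Spike (noisy) | Res. | 16.0 | 3.56 | 2.03 | 2.91 |
| | Rot. | 6 | 0.02 | 0.01 | 0.01 |
| | Trans. | – | 0.008 | – | 0.003 |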
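The two pose error metrics defined in the caption of Table 1 can be written down directly. The sketch below assumes that predicted and ground-truth poses are already expressed in the same reference frame (any global alignment step is outside its scope), and the function names are hypothetical.

```python
import numpy as np

def rotation_error(R_pred, R_true):
    """Median over particles of the squared Frobenius norm of R_pred - R_true."""
    diff = np.asarray(R_pred, dtype=float) - np.asarray(R_true, dtype=float)   # (N, 3, 3)
    return np.median(np.sum(diff ** 2, axis=(1, 2)))

def translation_error(t_pred, t_true):
    """Mean over particles of the squared L2 norm of t_pred - t_true (pixels squared)."""
    diff = np.asarray(t_pred, dtype=float) - np.asarray(t_true, dtype=float)   # (N, 2)
    return np.mean(np.sum(diff ** 2, axis=1))

identity = np.eye(3)
rot_z_pi = np.diag([-1.0, -1.0, 1.0])          # rotation by pi about the z axis

err_rot = rotation_error([rot_z_pi], [identity])
err_trans = translation_error([[3.0, 4.0]], [[0.0, 0.0]])
```

With this definition the rotation error lies between 0 (identical matrices) and 8 (a rotation by π away), which gives a scale for reading the values in the table.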

4.2. Reconstruction on Experimental Datasets

Experimental Setup.

We use the publicly available 80S experimental dataset EMPIAR-10028 [56,58,19] containing 105,247 images of length L = 360 (1.34 Å per pixel), downsampled to L = 256. The dataset is evenly split in two; each method runs independent reconstructions on each half, and the FSC is measured between the two reconstructions. We compare cryoAI with cryoPoseNet and cryoSPARC. The dataset fed to cryoAI and cryoPoseNet is masked with a circular mask of radius 84 pixels, while cryoSPARC adaptively updates a “tight” mask. CryoAI and cryoPoseNet reconstruct a volume of size $128^3$. For cryoSPARC, both the ab initio volume and the volume subsequently homogeneously refined from it were downsampled to the same size $128^3$. We also demonstrate the possibility of refining cryoAI’s output with the software cryoSPARC. Finally, we report the results published for cryoDRGN2 [61] that were obtained on a filtered version of the same dataset [58] downsampled to L = 128 prior to reconstruction.

Results.

We report quantitative and qualitative results in Fig. 4. CryoAI is the first amortized method to demonstrate proper volume reconstruction on an experimental dataset, although techniques predicting poses with an orientation-matching step (like cryoDRGN2) or followed by an EM-based refinement step (like cryoSPARC) can reach slightly higher resolutions. State-of-the-art results can be obtained with cryoSPARC’s refinement, initialized from either cryoSPARC’s or cryoAI’s ab initio. Since simulated datasets were built using the same image formation model as the one cryoAI uses in its decoder, the gap in performance between the experimental and simulated datasets suggests that improvements could potentially be achieved with a more accurate physics model.

Fig. 4.

(Top left) Volume reconstruction on a noise-free simulated dataset of the spliceosome (L = 128, pixel size = 4.25 Å). (Bottom left) Volume reconstruction for the experimental 80S dataset (L = 128, pixel size = 3.77 Å). (Right) Fourier Shell Correlations, reconstruction-to-ground-truth (top) or reconstruction-to-reconstruction (bottom). A resolution of 2.0 pixels corresponds to the Nyquist frequency. CryoAI can be refined using the software cryoSPARC.

4.3. Ablation Study

Importance of Symmetrized Loss.

The purpose of the symmetrized loss is to prevent the model from getting stuck in local minima where the volume shows incorrect planar symmetries. Ullrich et al. showed in [50] that optimizing the poses using a gradient-based method often leads the model to fall into sub-optimal minima, due to the high non-convexity of the optimization problem. In [61], Zhong et al. implemented an autoencoder-based method (dubbed PoseVAE), and compared it to cryoDRGN2. The method is unable to properly reconstruct a synthetic hand, and a spurious planar symmetry appears in their reconstruction. We use a noisy dataset (L = 128) generated from a structure of adenylate kinase (PDB 4AKE) [28]. We show in Fig. 5 that our method presents the same kind of artifact when using an L2 loss and validate that the symmetrized loss prevents these artifacts. In Fig. 5, we also compare our method to cryoPoseNet with and without the symmetrized loss on a simulated ideal dataset of the same molecule (L = 64). Both methods use an autoencoder-based architecture and both converge significantly faster with the symmetrized loss. With the same loss, cryoAI is always faster than cryoPoseNet since our method operates in Fourier space and uses the FST to avoid approximating integrals.

Fig. 5.

(Top left) Ablation study on the symmetrized loss with cryoAI and cryoPoseNet with simulated noise-free adenylate kinase (L = 64). We report the minimal convergence time out of 5 runs. CryoPoseNet is always slower and achieves worse results. The symmetrized loss always accelerates convergence. (Bottom left) Volume reconstruction when using an L2 loss vs. the symmetrized loss. The latter prevents the model from getting stuck in a symmetrical local minimum. (Right) Loss and resolution (in pixels, FSC = 0.143 cutoff) vs. number of iterations with a FourierNet, a SIREN [48] and an MLP with ReLU activation functions and positional encoding (32 images per batch).

Comparison of Neural Representations.

We replaced FourierNet with other neural representations in the decoder and compared the convergence rate of these models on the noisy Adenylate kinase dataset (L = 128). In Fig. 5, we compare our architecture with a multi-layer perceptron (MLP) with sinusoidal activation functions (i.e., a SIREN [48]) and an MLP with ReLU activation functions and positional encoding, as used by cryoDRGN2 [61]. We keep approximately 300k trainable parameters in all representations. FourierNet significantly outperforms the two other architectures in terms of convergence speed.
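
To make the two baseline representations concrete, the sketch below shows, in NumPy, a generic sinusoidal positional encoding, a single SIREN-style layer and a helper that counts the parameters of a fully connected network. It is only an illustration: the function names, the power-of-two frequency schedule, the value ω0 = 30 (a common SIREN default) and any layer widths passed to the helper are assumptions and are not taken from this paper or from cryoDRGN2. FourierNet itself is not sketched, since its architecture is described elsewhere in the paper.

```python
import numpy as np

def positional_encoding(coords, n_freqs=6):
    """Map coords of shape (N, 3) to sin/cos features of shape (N, 3 * 2 * n_freqs)."""
    freqs = (2.0 ** np.arange(n_freqs)) * np.pi       # assumed frequency schedule
    angles = coords[:, :, None] * freqs               # (N, 3, n_freqs)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(coords.shape[0], -1)

def siren_layer(x, weight, bias, omega_0=30.0):
    """One layer with a sinusoidal activation: sin(omega_0 * (x W + b))."""
    return np.sin(omega_0 * (x @ weight + bias))

def mlp_param_count(widths):
    """Number of weights and biases of an MLP with the given layer widths."""
    return sum(m * n + n for m, n in zip(widths[:-1], widths[1:]))
```

A helper such as `mlp_param_count` is one simple way to check that competing architectures are compared at a similar parameter budget, as done in this ablation.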

5. Discussion

The amount of collected cryo-EM data is rapidly growing [29], which increases the need for efficient ab initio reconstruction methods. CryoAI proposes a framework of amortized inference to meet this need by having a complexity that does not grow with the size of the dataset. Since cryoAI jointly estimates volume and poses, it can be followed by reconstruction methods that address conformational heterogeneities, such as the ones available in cryoSPARC [37], RELION [44], or cryoDRGN [59]. The ever-increasing size of cryo-EM datasets is necessary to provide sufficient sampling of conformational heterogeneities with increasing accuracy, in particular when imaging molecules that display complex dynamics. However, existing methods that tackle the more complex inference task of heterogeneous reconstruction also see their runtime suffer as datasets grow, again showing the need for new developments that leverage amortized inference.

Future work on cryoAI includes adding features to the image formation model implemented in the decoder. CTFs, for example, are currently only characterized by three parameters (two defocus parameters and an astigmatism angle) but could be readily enhanced to account for higher-order effects (see e.g. [64]). A richer noise model, in place of the current assumption of white Gaussian noise, could also improve the performance of the algorithm. In order to tackle the case of very noisy experimental datasets, adaptive masking techniques, such as those used by cryoSPARC, could be beneficial. In terms of hardware development, cryoAI would benefit from being able to run on more than a single GPU using data parallelism and/or model parallelism, thereby improving both runtime and efficiency. CryoAI, as described here, belongs to the class of homogeneous reconstruction methods; future developments should explore its performance in a heterogeneous reconstruction setting, where conformational heterogeneity is built into the generative model and the encoder is enhanced to predict descriptions of conformational states in a low-dimensional latent space along with the poses.
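
As an illustration of what a three-parameter CTF looks like, the following NumPy sketch evaluates an astigmatic CTF on a 2D frequency grid from two defocus values and an astigmatism angle. It follows a convention commonly used in CTF estimation software and is not necessarily the exact expression implemented in cryoAI's decoder: the function name, the sign convention and the default acceleration voltage, spherical aberration and amplitude contrast are assumptions made here.

```python
import numpy as np

def ctf_2d(L, pixel_size, defocus_u, defocus_v, astig_angle,
           voltage_kv=300.0, cs_mm=2.7, amp_contrast=0.1):
    """Astigmatic CTF on an L x L grid (assumed convention, illustration only).

    pixel_size is in Angstrom, defoci are in Angstrom, astig_angle is in radians.
    voltage_kv, cs_mm and amp_contrast are assumed microscope constants.
    """
    volts = voltage_kv * 1e3
    # Relativistic electron wavelength in Angstrom.
    wavelength = 12.2643 / np.sqrt(volts + 0.97845e-6 * volts**2)
    cs = cs_mm * 1e7                                   # mm to Angstrom
    freqs = np.fft.fftshift(np.fft.fftfreq(L, d=pixel_size))
    kx, ky = np.meshgrid(freqs, freqs, indexing="xy")
    k2 = kx**2 + ky**2
    angle = np.arctan2(ky, kx)
    # Direction-dependent defocus: the three free parameters enter only here.
    defocus = 0.5 * (defocus_u + defocus_v
                     + (defocus_u - defocus_v) * np.cos(2.0 * (angle - astig_angle)))
    chi = np.pi * wavelength * k2 * (defocus - 0.5 * wavelength**2 * k2 * cs)
    chi = chi + np.arctan(amp_contrast / np.sqrt(1.0 - amp_contrast**2))
    return -np.sin(chi)
```

Accounting for higher-order effects, as suggested above, would amount to adding further terms to the phase `chi`.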

Conclusion.

Advancing our understanding of the building blocks of life hinges upon our ability to leverage cryo-EM to its full potential. While recent advances in instrumentation and hardware have enabled massive datasets to be recorded at unprecedented throughput, advancing the associated algorithms to efficiently scale with these datasets is crucial for the field to move forward. Our work presents important steps towards this goal.

Supplementary Material

Supplementary materials

Acknowledgment.

We thank Wah Chiu for numerous discussions that helped shape this project. This work was supported by the U.S. Department of Energy, under DOE Contract No. DE-AC02-76SF00515. N.M. acknowledges support from the National Institutes of Health (NIH), grant No. 1R01GM144965-01. We acknowledge the use of the computational resources at the SLAC Shared Scientific Data Facility (SDF).

References

  • 1. Akçakaya M, Yaman B, Chung H, Ye JC: Unsupervised deep learning methods for biological image reconstruction and enhancement: An overview from a signal processing perspective. IEEE Signal Processing Magazine 39, 28–44 (2022)
  • 2. Atzmon M, Lipman Y: Sal: Sign agnostic learning of shapes from raw data. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2565–2574 (2020)
  • 3. Baker TS, Cheng RH: A model-based approach for determining orientations of biological macromolecules imaged by cryoelectron microscopy. Journal of Structural Biology 116, 120–130 (1996)
  • 4. Baldwin PR, Tan YZ, Eng ET, Rice WJ, Noble AJ, Negro CJ, Cianfrocco MA, Potter CS, Carragher B: Big data in cryoem: automated collection, processing and accessibility of em data. Current Opinion in Microbiology 43, 1–8 (2018)
  • 5. Bendory T, Bartesaghi A, Singer A: Single-particle cryo-electron microscopy: Mathematical theory, computational challenges, and opportunities. IEEE signal processing magazine 37, 58–76 (2020)
  • 6. Bepler T, Kelley K, Noble AJ, Berger B: Topaz-denoise: general deep denoising models for cryoem and cryoet. Nature communications 11, 1–12 (2020)
  • 7. Bracewell RN: Strip integration in radio astronomy. Australian Journal of Physics 9, 198–217 (1956)
  • 8. Candes E, Romberg J, Tao T: Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory 52, 489–509 (2006)
  • 9. Chen M, Ludtke SJ: Deep learning-based mixed-dimensional Gaussian mixture model for characterizing variability in cryo-EM. Nature Methods 18, 930–936 (2021)
  • 10. Dempster AP, Laird NM, Rubin DB: Maximum Likelihood from Incomplete Data Via the EM Algorithm. Journal of the Royal Statistical Society: Series B (Methodological) 39, 1–22 (1977)
  • 11. Donnat C, Levy A, Poitevin F, Miolane N: Deep Generative Modeling for Volume Reconstruction in Cryo-Electron Microscopy. arXiv: 2201.02867 (2022)
  • 12. Donoho D: Compressed sensing. IEEE Transactions on Information Theory 52, 1289–1306 (2006)
  • 13. Genova K, Cole F, Vlasic D, Sarna A, Freeman WT, Funkhouser T: Learning shape templates with structured implicit functions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7154–7164 (2019)
  • 14. Gershman S, Goodman N: Amortized inference in probabilistic reasoning. In: Proceedings of the annual meeting of the cognitive science society. vol. 36 (2014)
  • 15. Goddard TD, Huang CC, Meng EC, Pettersen EF, Couch GS, Morris JH, Ferrin TE: Ucsf chimerax: Meeting modern challenges in visualization and analysis. Protein Science 27, 14–25 (2018)
  • 16. Greenberg I, Shkolnisky Y: Common lines modeling for reference free ab-initio reconstruction in cryo-EM. Journal of Structural Biology 200, 106–117 (2017)
  • 17. Gupta H, McCann MT, Donati L, Unser M: CryoGAN: A New Reconstruction Paradigm for Single-Particle Cryo-EM Via Deep Adversarial Learning. IEEE Transactions on Computational Imaging 7, 759–774 (2021)
  • 18. Hertle A: On the Problem of Well-Posedness for the Radon Transform. In: Herman GT, Natterer F (eds.) Mathematical Aspects of Computerized Tomography. pp. 36–44. Springer (1981)
  • 19. Iudin A, Korir P, Salavert-Torres J, Kleywegt G, Patwardhan A: Empiar: A public archive for raw electron microscopy image data. Nature Methods 13, 387–388 (2016)
  • 20. Kingma DP, Welling M: An introduction to variational autoencoders. arXiv preprint arXiv:1906.02691 (2019)
  • 21. Kühlbrandt W: The resolution revolution. Science 343, 1443–1444 (2014)
  • 22. Lawler EL, Wood DE: Branch-and-bound methods: A survey. Operations research 14, 699–719 (1966)
  • 23. Lian R, Huang B, Wang L, Liu Q, Lin Y, Ling H: End-to-end orientation estimation from 2D cryo-EM images. Acta Crystallographica Section D: Structural Biology 78, 174–186 (2022)
  • 24. Mallick S, Agarwal S, Kriegman D, Belongie S, Carragher B, Potter C: Structure and View Estimation for Tomographic Reconstruction: A Bayesian Approach. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06). vol. 2, pp. 2253–2260 (2006)
  • 25. Michalkiewicz M, Pontes JK, Jack D, Baktashmotlagh M, Eriksson A: Implicit surface representations as layers in neural networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4743–4752 (2019)
  • 26. Mildenhall B, Srinivasan PP, Tancik M, Barron JT, Ramamoorthi R, Ng R: Nerf: Representing scenes as neural radiance fields for view synthesis. In: European conference on computer vision. pp. 405–421. Springer (2020)
  • 27. Miolane N, Poitevin F, Li YT, Holmes S: Estimation of Orientation and Camera Parameters from Cryo-Electron Microscopy Images with Variational Autoencoders and Generative Adversarial Networks. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). pp. 4174–4183. IEEE (2020)
  • 28. Müller C, Schlauderer G, Reinstein J, Schulz GE: Adenylate kinase motions during catalysis: an energetic counterweight balancing substrate binding. Structure 4, 147–156 (1996)
  • 29. Namba K, Makino F: Recent progress and future perspective of electron cryomicroscopy for structural life sciences. Microscopy 71, i3–i14 (2022)
  • 30. Nashed YSG, Poitevin F, Gupta H, Woollard G, Kagan M, Yoon CH, Ratner D: CryoPoseNet: End-to-End Simultaneous Learning of Single-Particle Orientation and 3D Map Reconstruction From Cryo-Electron Microscopy Data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops. pp. 4066–4076 (2021)
  • 31. Nogales E: The development of cryo-em into a mainstream structural biology technique. Nature Methods 13, 24–27 (2016)
  • 32. Park JJ, Florence P, Straub J, Newcombe R, Lovegrove S: Deepsdf: Learning continuous signed distance functions for shape representation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 165–174 (2019)
  • 33. Penczek PA, Grassucci RA, Frank J: The ribosome at improved resolution: new techniques for merging and orientation refinement in 3D cryo-electron microscopy of biological particles. Ultramicroscopy 53, 251–270 (1994)
  • 34. Plaschka C, Lin PC, Nagai K: Structure of a pre-catalytic spliceosome. Nature 546, 617–621 (2017)
  • 35. Pragier G, Shkolnisky Y: A common lines approach for ab-initio modeling of cyclically-symmetric molecules. Inverse Problems 35, 124005 (2019)
  • 36. Punjani A, Fleet DJ: 3d flexible refinement: structure and motion of flexible proteins from cryo-em. BioRxiv (2021)
  • 37. Punjani A, Rubinstein JL, Fleet DJ, Brubaker MA: cryoSPARC: algorithms for rapid unsupervised cryo-EM structure determination. Nature Methods 14, 290–296 (2017)
  • 38. Ravi N, Reizenstein J, Novotny D, Gordon T, Lo WY, Johnson J, Gkioxari G: Accelerating 3D Deep Learning with PyTorch3D. arXiv: 2007.08501 (2020)
  • 39. Renaud JP, Chari A, Ciferri C, ti Liu W, Rémigy HW, Stark H, Wiesmann C: Cryo-em in drug discovery: achievements, limitations and prospects. Nature Reviews Drug Discovery 17, 471–492 (2018)
  • 40. Rohou A, Grigorieff N: Ctffind4: Fast and accurate defocus estimation from electron micrographs. Journal of structural biology 192, 216–221 (2015)
  • 41. Rosenbaum D, Garnelo M, Zielinski M, Beattie C, Clancy E, Huber A, Kohli P, Senior AW, Jumper J, Doersch C, Eslami SMA, Ronneberger O, Adler J: Inferring a Continuous Distribution of Atom Coordinates from Cryo-EM Images using VAEs. arXiv:2106.14108 (Jun 2021)
  • 42. Rosenthal PB, Henderson R: Optimal determination of particle orientation, absolute hand, and contrast loss in single-particle electron cryomicroscopy. Journal of molecular biology 333, 721–745 (2003)
  • 43. Rudin LI, Osher S, Fatemi E: Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena 60, 259–268 (1992)
  • 44. Scheres SH: RELION: Implementation of a Bayesian approach to cryo-EM structure determination. Journal of Structural Biology 180, 519–530 (2012)
  • 45. Sigworth FJ: A maximum-likelihood approach to single-particle image refinement. Journal of Structural Biology 122, 328–339 (1998)
  • 46. Simonyan K, Zisserman A: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
  • 47. Singer A, Coifman RR, Sigworth FJ, Chester DW, Shkolnisky Y: Detecting consistent common lines in cryo-EM by voting. Journal of Structural Biology 169, 312–322 (2010)
  • 48. Sitzmann V, Martel J, Bergman A, Lindell D, Wetzstein G: Implicit neural representations with periodic activation functions. Advances in Neural Information Processing Systems 33, 7462–7473 (2020)
  • 49. Sitzmann V, Zollhöfer M, Wetzstein G: Scene representation networks: Continuous 3d-structure-aware neural scene representations. Advances in Neural Information Processing Systems 32 (2019)
  • 50. Ullrich K, Berg R.v.d., Brubaker M, Fleet D, Welling M: Differentiable probabilistic models of scientific imaging with the fourier slice theorem. arXiv preprint arXiv:1906.07582 (2019)
  • 51. Vainshtein B, Goncharov A: Determination of the spatial orientation of arbitrarily arranged identical particles of unknown structure from their projections. In: Soviet Physics Doklady. vol. 31, p. 278 (1986)
  • 52. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I: Attention is all you need. Advances in neural information processing systems 30 (2017)
  • 53. Vulović M, Ravelli RBG, van Vliet LJ, Koster AJ, Lazić I, Lücken U, Rullgård H, Öktem O, Rieger B: Image formation modeling in cryo-electron microscopy. Journal of Structural Biology 183, 19–32 (2013)
  • 54. Walls AC, Park YJ, Tortorici MA, Wall A, McGuire AT, Veesler D: Structure, function, and antigenicity of the sars-cov-2 spike glycoprotein. Cell 181, 281–292 (2020)
  • 55. Wang L, Singer A, Wen Z: Orientation Determination of Cryo-EM Images Using Least Unsquared Deviations. SIAM Journal on Imaging Sciences 6, 2450–2483 (2013)
  • 56. Wong W, Bai X.c., Brown A, Fernandez IS, Hanssen E, Condron M, Tan YH, Baum J, Scheres SH: Cryo-em structure of the plasmodium falciparum 80s ribosome bound to the anti-protozoan drug emetine. Elife 3, e03080 (2014)
  • 57. Zehni M, Donati L, Soubies E, Zhao ZJ, Unser M: Joint Angular Refinement and Reconstruction for Single-Particle Cryo-EM. IEEE Transactions on Image Processing 29, 6151–6163 (2020)
  • 58. Zhong E: cryodrgn-empiar (2022), https://github.com/zhonge/cryodrgn_empiar
  • 59. Zhong ED, Bepler T, Berger B, Davis JH: CryoDRGN: reconstruction of heterogeneous cryo-EM structures using neural networks. Nature Methods 18, 176–185 (2021)
  • 60. Zhong ED, Bepler T, Davis JH, Berger B: Reconstructing continuous distributions of 3D protein structure from cryo-EM images. arXiv:1909.05215 (2019)
  • 61. Zhong ED, Lerer A, Davis JH, Berger B: Cryodrgn2: Ab initio neural reconstruction of 3d protein structures from real cryo-em images. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4066–4075 (2021)
  • 62. Zhong ED, Lerer A, Davis JH, Berger B: Exploring generative atomic models in cryo-EM reconstruction. arXiv:2107.01331 (2021)
  • 63. Zhou Y, Barnes C, Lu J, Yang J, Li H: On the Continuity of Rotation Representations in Neural Networks. arXiv: 1812.07035 (2020)
  • 64. Zivanov J, Nakane T, Forsberg BO, Kimanius D, Hagen WJ, Lindahl E, Scheres SH: New tools for automated high-resolution cryo-em structure determination in relion-3. elife 7, e42166 (2018)
