eLife. 2016 Nov 15;5:e18722. doi: 10.7554/eLife.18722

Accelerated cryo-EM structure determination with parallelisation using GPUs in RELION-2

Dari Kimanius 1, Björn O Forsberg 1, Sjors HW Scheres 2,*, Erik Lindahl 1,3,*
Editor: Sriram Subramaniam
PMCID: PMC5310839  PMID: 27845625

Abstract

By reaching near-atomic resolution for a wide range of specimens, single-particle cryo-EM structure determination is transforming structural biology. However, the necessary calculations come at large computational costs, which has introduced a bottleneck that is currently limiting throughput and the development of new methods. Here, we present an implementation of the RELION image processing software that uses graphics processors (GPUs) to address the most computationally intensive steps of its cryo-EM structure determination workflow. Both image classification and high-resolution refinement have been accelerated more than an order of magnitude, and template-based particle selection has been accelerated well over two orders of magnitude on desktop hardware. Memory requirements on GPUs have been reduced to fit widely available hardware, and we show that the use of single precision arithmetic does not adversely affect results. This enables high-resolution cryo-EM structure determination in a matter of days on a single workstation.

DOI: http://dx.doi.org/10.7554/eLife.18722.001

Research Organism: None

Introduction

With the advent of direct-electron detectors and advanced methods of image processing, structural characterisation of macromolecular complexes to near-atomic resolution is now feasible using single-particle electron cryo-microscopy (cryo-EM) (Cheng, 2015; Fernandez-Leiro and Scheres, 2016b). Although this has caused a rapid gain in its popularity, two technological factors still limit wide applicability of cryo-EM as a standard tool for structural biology.

First, partly due to the steep increase in demand, access to high-end microscopes is limited. This is being addressed with acquisition of new equipment in a large number of departments worldwide, as well as the establishment of shared infrastructures (Saibil et al., 2015). Second, processing the large amounts of data produced by these microscopes requires computational hardware that is not directly accessible to many labs. Even at larger centres the computational requirements are so high that cryo-EM now suffers from a computational bottleneck. The work presented here addresses this second problem, with the aim of drastically reducing the computational time and investment necessary for cryo-EM structure determination.

A typical cryo-EM data set may comprise hundreds or thousands of images (called micrographs) of a thin layer of vitreous ice in which multiple individual macromolecular complexes (called particles) are imaged. Because radiation damage imposes strict limitations on the electron exposure, micrographs are extremely noisy. Thus, to extract fine structural details, one needs to average over multiple images of identical complexes to cancel noise sufficiently. This is achieved by isolating two-dimensional particle-projections in the micrographs, which can then be recombined into a three-dimensional structure (Cheng et al., 2015). The latter requires the estimation of the relative orientations of all particles, which can be done by a wide range of different image processing programs, such as SPIDER (Frank et al., 1981, 1996), IMAGIC (van Heel et al., 1996), BSOFT (Heymann and Belnap, 2007), EMAN2 (Tang et al., 2007), SPARX (Hohn et al., 2007), FREALIGN (Grigorieff, 2007), XMIPP (Scheres et al., 2008), RELION (Scheres, 2012a), or SIMPLE (Elmlund and Elmlund, 2012).

These programs also need to tackle the problem that any one data set typically comprises images of multiple different structures; for example, purified protein samples are rarely free from all contaminants. Multiple conformations, non-stoichiometric complex formation, or sample degradation are all possible sources of additional data heterogeneity. The classification of heterogeneous data into homogeneous subsets has therefore proven critical for high-resolution structure determination and provides a tool for structural analysis of dynamic systems. However, identifying structurally homogeneous subsets in the data by image classification algorithms adds computational complexity, and often increases the computational load dramatically.

An increasingly popular choice for processing cryo-EM data is an empirical Bayesian approach to single-particle analysis (Scheres, 2012b) implemented in the computer program RELION (Scheres, 2012a). In the underlying regularised likelihood optimisation algorithm, optimal weights for the contribution of different Fourier components to the determination of orientations, as well as to the three-dimensional reconstruction(s), are learnt from the data in an iterative manner, thereby reducing the need for user input. In addition, RELION has proven highly effective in classifying a wide range of structural variation, such as conformational dynamics within protein domains (Bai et al., 2015), or of very small sub-populations in large data-sets (Fernández et al., 2013). Unfortunately, the regularised likelihood optimisation algorithm that underlies these calculations is computationally demanding. We estimate that a recent 3.7 Å structure of a yeast spliceosomal complex (Nguyen et al., 2016) required more than half a million CPU hours of classification and high-resolution refinement. Computations of this magnitude require the use of high-performance computing clusters with dedicated staff, and restrict the exploration of new image processing schemes.

The introduction of hardware accelerators, such as graphics processors (GPUs), has recently transformed other scientific fields where computation was a bottleneck. To exploit this type of hardware, substantial redesigns of algorithms are required to make many independent tasks simultaneously available for computation, which is known as exposing (low-level) parallelism. However, the possible gain is equally substantial; together with commodity hardware it has been a revolution e.g. for molecular dynamics simulations (Salomon-Ferrer et al., 2013; Abraham et al., 2015), quantum chemistry (Ufimtsev and Martínez, 2008), and machine learning (Jia et al., 2014).

GPU acceleration of computationally expensive algorithms has also been performed in other cryo-EM software, with varying success and subsequent impact in the field. Many attempts have achieved significant improvement in performance through acceleration of only a few subroutines (Li et al., 2010; Tagare et al., 2010; Li et al., 2013; Hoang et al., 2013), relieving major bottlenecks. The overall software architecture and data structures were however not modified to better suit the hardware. In contrast, other tools were designed or fundamentally reformulated with particular hardware in mind (Li et al., 2015; Zhang, 2016); this tends to lead to more substantial long-term performance benefits, and for some steps the accelerated implementations have fully replaced CPU codes. Historically, RELION (like most alternatives) has scaled to the large resources it needs by using higher-level parallelism, where computational tasks are divided into subsets of particle images in each iterative refinement step. However, lower-level core computations on individual particle images in RELION have remained serialised since its introduction more than four years ago.

Here, we describe a new implementation of the regularised likelihood optimisation algorithm in RELION that uses GPUs to address its computational bottlenecks. We have chosen to implement our increased parallelism in CUDA, a programming language provided by NVIDIA. The CUDA language currently dominates the GPU computing market, and provides a stable programming environment with a rich C++ interface. We also utilise a number of libraries provided within the CUDA framework, such as cuFFT for fast Fourier transforms (FFTs), and CUB/Thrust for sorting and other standard functions. In addition to high-end professional cards there is wide availability of cheap consumer hardware that supports CUDA, which provides outstanding value for many research groups. However, the acceleration and parallelisation approaches are general and should be possible to port to other architectures in the future.

The present acceleration of RELION addresses the most computationally intensive steps in a typical image processing workflow. This includes classification of data into structurally homogeneous subsets (2D or 3D classification) and high-resolution refinement of each such homogeneous set of particles (3D auto-refine). In addition, we describe an improved algorithm for semi-automated selection of particles from micrographs (Scheres, 2015), which also targets GPUs. Memory requirements have been reduced to fit widely available consumer graphics cards, and we show that the current adaptation to use single precision floating-point arithmetic does not cause loss of resolvable detail in the final structures. These developments enable high-resolution cryo-EM structure determination in a matter of days on individual workstations rather than relying on large clusters.

Methods

Regularised likelihood optimisation

The regularised likelihood optimisation in RELION uses an Expectation-Maximization algorithm (Dempster et al., 1977) to find the most likely 3D density map for a large set of 2D particle images with unknown orientations, under the prior expectation that the 3D map has limited power in the Fourier domain. This iterative algorithm involves two fundamental steps at every iteration. In the expectation, or E-step, one calculates probability distributions for the relative orientations of all particles based on the current estimate of the 3D map. In the subsequent maximization, or M-step, one updates the estimate for the 3D map. Intuitively, these two steps represent alignment of the particles with respect to a common 3D reference, and reconstruction of a new reference map, respectively. The algorithm is typically started from an initial 3D reference map at low resolution, and is guaranteed to increase the likelihood of the 3D map given the data and the prior at every iteration, until it converges onto a local optimum. In the presence of structural variability in the data set, multiple 3D reference maps, or 3D classes, can be refined simultaneously. In this case, the class assignments of the particles are not known either. RELION also allows the refinement of multiple 2D references, or 2D classes. In that case, only in-plane orientations are sampled.
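In schematic form (our notation, not that of the original derivation in Scheres, 2012b), one iteration can be written as

    \Gamma^{(n)}_{i\phi} = \frac{ P(X_i \mid \phi, \Theta^{(n)})\, \pi(\phi) }{ \sum_{\phi'} P(X_i \mid \phi', \Theta^{(n)})\, \pi(\phi') }    (E-step)

    \Theta^{(n+1)} = \arg\max_{\Theta} \sum_i \sum_{\phi} \Gamma^{(n)}_{i\phi} \ln \left[ P(X_i \mid \phi, \Theta)\, P(\Theta) \right]    (M-step)

where X_i are the particle images, \phi runs over the discretised orientations, translations and class assignments, \Theta^{(n)} is the current estimate of the reference map(s) and noise model, \pi(\phi) is the prior over orientations and classes, and P(\Theta) is the prior that limits the power of the map in the Fourier domain.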

Within a single iteration, the E-step requires four major computational stages (Figure 1): (i) The current 3D reference map is projected along many orientations; (ii) the difference between these reference projections and every particle image is calculated for each orientation and for each sampled translation; (iii) all these differences are converted to probability-weights; and (iv) those weights are used to update a running sum of the back-projected particle images in 3D Fourier space. In practice, to limit the number of operations, stages (i) and (ii) are performed twice for every particle. A first pass examines all orientations using a relatively coarse orientational sampling, and a second pass re-examines the regions of orientational space with significant probabilities at an increased sampling density. Still, millions of orientations and translations are typically compared for each particle at every iteration. Converting the resulting differences to weights is a relatively cheap operation as each image-orientation pair at this stage is represented by a single scalar value. The back-projection in stage (iv) is more demanding since it is again necessary to work with all pixels of each image in multiple orientations. After the E-step has been evaluated for all (typically tens to hundreds of thousands of) particles, in the M-step the running sum in 3D Fourier space is transformed into an updated 3D reference map for the next iteration. The corresponding reconstruction algorithm is expensive in terms of computer memory, but it typically needs orders of magnitude less time than the E-step because it is only performed once for every iteration. Therefore, stages (i), (ii) and (iv) of the E-step have dominated execution time in previous versions of RELION.
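In schematic form (again our notation; the full expressions, including class weights and per-frequency noise estimates, are given in Scheres, 2012b), the difference computed in stage (ii) for particle i, orientation \phi and translation \mathbf{t} is a noise-weighted squared deviation in Fourier space,

    -\ln P(X_i \mid \phi, \mathbf{t}, \Theta) \approx \mathrm{const} + \sum_{k} \frac{ \left| X_{ik}\, e^{-2\pi i\, \mathbf{k}\cdot\mathbf{t}} - \mathrm{CTF}_{ik}\, (\mathbf{P}^{\phi} V)_k \right|^2 }{ 2\sigma^2_{ik} },

where (\mathbf{P}^{\phi} V)_k is the k-th Fourier component of the 2D slice taken from the reference V in orientation \phi, \mathrm{CTF}_{ik} is the contrast transfer function and \sigma^2_{ik} the estimated noise power. Stage (iii) then converts these values into normalised probability weights by exponentiation and division by their sum over all orientations and translations.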

Figure 1. High level flowchart of RELION.

Figure 1.

(A) Operations and the real vs. Fourier spaces used during (B) image reconstruction in RELION. Micrograph input and model setup use the CPU, while most subsequent processing steps have been adapted for accelerator hardware. The highlighted orientation-dependent difference calculation is by far the most demanding task, and is fully accelerated. Taking 2D slices out of (and setting them back into) the reference transforms has also been accelerated at high gain. The inverse FFT operation has not yet been accelerated and still runs on the CPU.

DOI: http://dx.doi.org/10.7554/eLife.18722.002

Extracting parallelism for accelerators

Modern accelerator processors - such as GPUs - achieve very high floating-point performance by incorporating a large set (thousands) of very simple and streamlined functional units instead of the handful of advanced general-purpose cores in a normal (CPU) processor. In previous versions, RELION has scaled over multiple CPU cores by using separate processes for independent particle images. GPUs however require much broader low-level parallelism to increase performance substantially. Our implementation of RELION therefore required a reformulation to expose additional available parallelism. Since most of the computation time is spent on the E-step for a typical application, we focussed our acceleration efforts on this part. By treating multiple reference maps, all relative image translations and orientations, and even individual image pixels as parallel tasks (Figure 2), we created sufficient low-level parallelism to provide outstanding performance on accelerator hardware such as GPUs.

Figure 2. Extensive task-level parallelism for accelerators.

Figure 2.

While previously RELION only exploited parallelism over images (left), in the new implementation classes and all orientations of each class are expressed as tasks that can be scheduled independently on the accelerator hardware (e.g. GPUs). Even individual pixels for each orientation can be calculated in parallel, which makes the algorithm highly suited for GPUs.

DOI: http://dx.doi.org/10.7554/eLife.18722.003

Formulating and enabling the available parallelism in RELION for GPUs did present specific challenges. In stages (i) and (ii) of the E-step, it would for instance require very large amounts of fast memory to pre-calculate projections of the 3D reference maps for all examined orientations. Pre-calculation could potentially increase re-use of data on the GPU, but it would impose severe limitations on the number of simultaneously examined references, input image sizes, and sampling granularity. Instead, like the CPU code, our implementation stores a two-fold oversampled Fourier transform of each reference in GPU memory, and 2D slices (along any orientation) are extracted only when needed. This alleviates limitations imposed by hardware, and makes it possible to improve performance through the use of so-called GPU textures. Textures store data that require fast lookup, and the corresponding texture units provide support for performing non-integer pixel interpolation of texture data in a single instruction with only marginal loss of precision. In graphics applications, textures are used to efficiently rescale image data. In RELION, they turn out to be well-suited for the resampling operations when taking slices from the 3D Fourier transform. The on-the-fly extraction of 2D Fourier slices is combined with the very broad parallelism of calculating squared differences between all pixels of all particle images and the reference projections in all orientations and all translations. For the technically interested reader, details of how we exploit this parallelism in our implementation of the squared difference calculation are presented in Appendix 1. Technical details of how we also accelerated stage (iv) of the E-step, the back-projection of 2D images into the oversampled 3D Fourier transform, are presented in Appendix 2. When combined, the E-step can now be evaluated in a fraction of the runtime previously needed.
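To illustrate how texture-based slice extraction and the squared-difference calculation fit together, the following much-simplified CUDA sketch (our own illustration, not the RELION-2 source; the names, the float2 texture layout and the coordinate handling are assumptions, translations and Friedel symmetry are omitted, and CTF and noise weighting are reduced to per-component scale factors) handles one orientation per thread-block, with the threads striding over the Fourier components:

// Minimal sketch with hypothetical names: one thread-block per orientation.
__global__ void squaredDifferenceKernel(
        cudaTextureObject_t refFT,   // oversampled 3D Fourier transform of the reference
        const float2 *imgFT,         // Fourier transform of one (already translated) particle image
        const float  *ctf,           // CTF value per Fourier component
        const float  *invSigma2,     // 1 / (2 sigma^2) per Fourier component
        const float  *rotMat,        // 3x3 rotation matrices, 9 floats per orientation
        const int2   *coords,        // (kx, ky) Fourier coordinate per component
        int nComponents,
        float pad,                   // oversampling factor of the reference grid
        float origin,                // centre offset of the 3D texture grid
        float *diffOut)              // one accumulated squared difference per orientation
{
    const float *R = rotMat + 9 * blockIdx.x;

    for (int n = threadIdx.x; n < nComponents; n += blockDim.x)
    {
        float kx = coords[n].x, ky = coords[n].y;

        // Rotate the 2D Fourier coordinate into the 3D reference transform.
        float x = pad * (R[0] * kx + R[1] * ky);
        float y = pad * (R[3] * kx + R[4] * ky);
        float z = pad * (R[6] * kx + R[7] * ky);

        // Hardware tri-linear interpolation through the texture unit.
        float2 ref = tex3D<float2>(refFT, x + origin + 0.5f,
                                          y + origin + 0.5f,
                                          z + origin + 0.5f);

        float dRe = ctf[n] * ref.x - imgFT[n].x;
        float dIm = ctf[n] * ref.y - imgFT[n].y;

        // Per-orientation accumulation; the production kernel instead reduces
        // through shared memory (see Appendix 1).
        atomicAdd(&diffOut[blockIdx.x], (dRe * dRe + dIm * dIm) * invSigma2[n]);
    }
}

The texture object is assumed here to have been created over a two-channel (real/imaginary) float array with linear filtering enabled, so that the tri-linear interpolation is performed by the texture unit in a single fetch.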

Finally, like most scientific software, the CPU version of RELION has historically used double precision, and merely recompiling in single precision does not improve performance. However, all GPUs provide higher performance for single precision, and the dedicated texture units can typically only perform reduced-precision operations. For these reasons we also reformulated our implementation of the E-step to be less precision-sensitive, making it possible to execute these calculations in single precision without adverse effects on the final results. As we did not accelerate the M-step, this part of the algorithm remains on the CPU and is still executed in double precision.

Semi-automated picking

RELION also implements a template-based particle selection procedure, which calculates a probability measure (the R-value) for each pixel in the micrograph to signify the likelihood that it is the location of any of the provided templates (Scheres, 2015). The R-value map of a micrograph takes all possible rotations of each template into account, and is subsequently used in a peak-search algorithm that locates particles within the original micrograph (Figure 3). These calculations are performed in Fourier space, where they are highly efficient (Roseman, 2003). In fact, they are so fast that their execution time becomes negligible compared to the time spent performing FFTs to transform image objects between real and Fourier spaces. Consequently, even though reference templates are also treated as independent tasks to increase parallelism in the GPU version, a much larger gain is found at the level of template rotations, through parallel execution of FFTs. For example, when using 5-degree incremental template rotations, 72 such inverse FFTs are now performed concurrently on the GPU through the cuFFT CUDA library. The size of these FFTs is now also padded automatically, since substantial performance penalties can occur if the transform size includes any large prime factors.
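As a sketch of how such batched transforms can be set up (a host-side illustration only, with invented names; the batch of 72 rotations follows the example above, and the padded dimensions ny and nx are assumed to have been chosen with only small prime factors):

#include <cufft.h>

// Create one inverse complex-to-real plan that transforms the correlation
// grids for all template rotations in a single call.
cufftHandle makeBatchedInversePlan(int ny, int nx, int nRotations)
{
    int n[2]       = { ny, nx };         // padded real-space grid size
    int inembed[2] = { ny, nx / 2 + 1 }; // half-spectrum complex layout
    int onembed[2] = { ny, nx };         // full real-space layout

    cufftHandle plan;
    cufftPlanMany(&plan, 2, n,
                  inembed, 1, ny * (nx / 2 + 1),  // input stride and distance per batch
                  onembed, 1, ny * nx,            // output stride and distance per batch
                  CUFFT_C2R, nRotations);         // e.g. 72 rotations at 5-degree steps
    return plan;
}

// Usage sketch: all correlation maps are transformed concurrently.
//   cufftHandle plan = makeBatchedInversePlan(ny, nx, 72);
//   cufftExecC2R(plan, d_correlationFT, d_correlationMaps);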

Figure 3. Semi-automated particle picking in RELION-2.

Figure 3.

The low-pass filter applied to micrographs is a novel feature in RELION, aimed at reducing the size and execution time of the highlighted inverse FFTs, which accounts for most of the computational work. In addition to the inverse FFTs, all template- and rotation-dependent parallel steps have also been accelerated on GPUs.

DOI: http://dx.doi.org/10.7554/eLife.18722.004

Results

Acceleration of regularised likelihood optimisation

Performance

The performance of our implementation on a workstation equipped with modern GPUs can exceed that of hundreds of CPU cores (Figure 4). This is most prominent for increasing numbers of pixels, orientations and classes, due to the increased low-level parallelism RELION-2 has been designed to use efficiently. Therefore, calculations where many classes and orientations need to be sampled, e.g. 3D classifications over multiple classes and with fine sampling of orientations, experience the greatest gain from the acceleration (Figure 4). Traditionally such large problems have required cluster-size resources with high-performance interconnects for fast communication. As seen in the performance benchmarks, RELION-2 makes it possible to run many of these calculations even on workstations or low-cost desktop solutions (Figure 4). GPU hardware evolves rapidly, but Appendix 3 contains a few recommendations about the solutions we currently think are most cost-efficient. For sufficiently large computational problems, RELION's processing time scales linearly with the number of classes, but since the extra calculations are much faster with the GPU-enabled version, the relative advantage is larger the more classes are used (Figure 4D). In practice we believe this will make it more common to use very large numbers of classes in 3D classification. Further scrutiny reveals that the calculations still performed on the CPU actually dominate both the execution and scaling even for the GPU version (Figure 4E), which indicates that new bottlenecks are now limiting scaling; we intend to focus on these parts of the code for future improvements.

Figure 4. RELION-2 enables desktop classification and refinement using GPUs.

Figure 4.

EMPIAR (Iudin et al., 2016) entry 10028 was used to assess performance, using refinements of 105k ribosomal particles in 360²-pixel images. (A) A quad-GPU workstation easily outperforms even a large cluster job in 3D classification. (B) In 2D classification, the GPU desktop performs slightly better in the first few iterations and then provides performance equivalent to the 280 CPU cores. (C) Total time for 25 iterations of 3D classification for a few different hardware configurations. (D) Additional classes are processed at reduced cost compared to CPU-only execution, due to faster execution and increased capacity for latency hiding. (E) With an increasing number of classes, the time spent in non-accelerated vs accelerated execution increases. (F) The workstation also beats the cluster for single-class refinement to high resolution, despite the generally lower degree of parallelism. This is particularly striking for the finer exhaustive sampling at 3.8° due to the GPU’s ability to parallelise the drastically increased number of tasks.

DOI: http://dx.doi.org/10.7554/eLife.18722.005

Limited precision and accuracy

RELION has used double precision arithmetic since its first release in order to be as accurate as possible, at the cost of increased memory requirements. While there are professional GPUs with good double precision performance, the consumer market is dominated by visualisation and gaming applications, and for this reason cheap hardware only provides good performance for single precision. Even for professional hardware, the performance is better with single precision, although the difference is smaller. This makes it highly desirable to use single precision arithmetic wherever possible. In addition to much better floating-point throughput, single precision calculations reduce the memory requirements by a factor of two, and make it possible to employ textures for image rotation. The relative advantage of single precision can thus be much higher on GPUs compared to CPUs. Because the required precision depends on the algorithms used in the application, part of the development of RELION-2 involved the evaluation of reconstruction quality when using single precision. We evaluate this primarily by examining the agreement of refinement results, characterising both the reconstructed volume and the image orientation assignment statistics. As image orientations are analogous to a location and rotation on the unit sphere, we compared refinement results of the EMPIAR 10028 dataset by the distribution of angular differences for all images. Two double-precision CPU refinements (using different random seeds) produce distributions where 81% of images fall within 1° of one another. Given that this distribution is modulated by the sine of the angle, this is a very close agreement. The single-precision GPU implementation finds 82% of images within the same 1° tolerance (Figure 5—figure supplement 1). In other words, the differences due to random seeds are at least as large as any systematic variation between the CPU and GPU versions. While a small loss of precision was observed e.g. in the fast interpolation (texture) intrinsics, the assignments of image orientations are similar enough to not influence the final reconstruction quality (Figure 5). In contrast, single-precision execution of the iterative gridding algorithm that underlies the reconstruction in the M-step (Scheres, 2012a) did exhibit a notable deviation from the double-precision version when first tested. Therefore, we opted for a hybrid implementation of the algorithm: the computationally demanding slice projections, probability calculations, and back-projections in the E-step are performed in single precision on the GPU, while the reconstruction in the M-step remains in double precision on the CPU. Because the M-step is partially responsible for the new bottlenecks apparent in Figure 4E this is a candidate for additional future optimisation, but for now the hybrid approach provides a good compromise.

Figure 5. The GPU reconstruction is qualitatively identical to the CPU version.

(A) A high-resolution refinement of the Plasmodium falciparum 80S ribosome using single precision GPU arithmetic achieves a gold-standard Fourier shell correlation (FSC) indistinguishable from double precision CPU-only refinement (previously deposited as EMD-2660). The FSC of full reconstructions comparing the two methods shows their agreement far exceeds the recoverable signal (grey), and as shown in Figure 5—figure supplement 1 the variation in angle assignments matches the differences between CPU runs with different random seeds. (B) Partial snapshots of the final reconstruction following post-processing, superimposed on PDB ID 3J79 (Wong et al., 2014).

DOI: http://dx.doi.org/10.7554/eLife.18722.006

Figure 5.

Figure 5—figure supplement 1. The CPU and GPU implementations provide qualitatively identical distributions of image orientations.

Figure 5—figure supplement 1.

For two CPU runs with different random seeds, 81% of images fall within 1°, and for a GPU vs. CPU run the figure is 82%. Note that the probability of observing small angles vanishes since the number of potentially available points is proportional to the sine of the angle, which approaches zero for identical orientations. Both distributions were aligned against the reference refinement by fitting reconstructed models.

Disk and memory considerations

RELION has traditionally required large amounts of memory. Fortunately, its peak memory use does not occur during the accelerated, most computationally intensive parts of the algorithm. Rather, memory use peaks during the M-step, which is executed on the CPU using the gridding reconstruction algorithm mentioned above. The amount of available GPU memory still remains a limitation, as it determines the capacity for storage of the oversampled Fourier transforms of one or more references. This is of particular concern for larger and higher-resolution structures, which require more memory to be faithfully represented. When resolving detail at the Nyquist frequency, due to the twofold oversampling (associated with gridding during reconstruction), RELION requires a grid with sides twice the image dimension. For example, when using 400²-pixel particle images, the required grid size is 800³, which becomes 2 GB per class, since each value requires 4 bytes in single precision. Moreover, as the reconstructed object also needs to be accommodated, this number is effectively multiplied by 2.5. In practice, memory use with image sizes up to 400² indicates that at most 6 GB of GPU memory is needed to perform refinement to Nyquist (Figure 6). Classification using 3D references is usually performed at resolutions much lower than the Nyquist limit, and for this reason its memory requirements are typically much lower.
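As a worked example of this estimate (a back-of-the-envelope calculation using only the factors stated above), the GPU memory per class when refining D-pixel images to Nyquist is approximately

    M \approx 2.5 \times (2D)^3 \times 4 \text{ bytes},

so for D = 400: (2 \times 400)^3 = 800^3 \approx 5.1 \times 10^8 grid values, \times 4 bytes \approx 2.0 GB, and \times 2.5 \approx 5.1 GB, consistent with the at most 6 GB observed in Figure 6.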

Figure 6. GPU memory requirements.

Figure 6.

(A) The required GPU memory scales linearly with the number of classes. (B) The maximum required GPU memory occurs for single-class refinement to the Nyquist frequency, which increases rapidly with the image size. Horizontal grey lines indicate available GPU memory on different cards.

DOI: http://dx.doi.org/10.7554/eLife.18722.008

To enable efficient evaluation and good scaling on GPUs, several new methods for explicit data management have been implemented. Lower levels of parallelism are coalesced into larger objects using customised tools, which results in more efficient use of memory (see Appendices 1 and 2 for details). In addition, because of the much improved performance, multiple tasks have become limited by how fast input data can be read from disk. Therefore, we now find it highly beneficial to explicitly cache data on local solid state devices (SSDs), as has also been observed for GPU-accelerated CTF estimation (Zhang, 2016). To allow this in a straightforward way, RELION-2 features the ability to automatically copy data sets to fast local disks prior to refinement, which further increases performance during less computationally intensive steps, such as 2D classification.

Acceleration of automatic particle picking

Low-pass filtering of micrographs

Even after parallelisation and acceleration on GPUs, the cross-correlation-based particle selection in RELION is dominated by computing many large inverse FFTs (Figure 3), as has been observed previously for similar methods (Castaño-Díez et al., 2008). Reducing their size is thus the most straightforward way to further improve execution performance. Reference templates are typically subject to low-pass filtering, and for this reason we investigated the possibility of applying a similar filter to all micrographs, which reduces computations by discarding high-frequency information.

We found little difference in the particles selected when discarding resolution information in micrographs beyond that of the search templates. While intuitively straightforward, this observation permits a drastic reduction in the size of the FFT grids and subsequent computations, which provides a large acceleration at virtually no loss of quality. The low-pass filtering also significantly reduces the amount of memory required for particle selection, which permits parallelism to target hardware like desktop workstations more efficiently.
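To give a sense of the savings (a back-of-the-envelope estimate, not a statement about the exact grid sizes chosen by the implementation): a micrograph of N pixels per side at pixel size a that is low-pass filtered to resolution d only needs Fourier components up to the spatial frequency 1/d, so the working grid can be shrunk to roughly

    N' \approx N \times \frac{2a}{d}.

For a 4096-pixel micrograph at 1.62 Å/pixel filtered to 20 Å this gives N' \approx 4096 \times 3.24 / 20 \approx 664 pixels per side, i.e. FFT grids roughly (4096/664)^2 \approx 38 times smaller, in line with the order-of-magnitude speed-up reported below once padding and the remaining unfiltered steps are taken into account.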

Autopicking performance

We tested both the picking speed and the quality of picked particles in RELION-2. In an initial test, a single 4096²-pixel micrograph containing ribosomes at 1.62 Å/pixel was processed against 8 templates with 5-degree angular sampling and no low-pass filtering. This took 675 s to evaluate on a CPU-only workstation (i7-5960X, using a single thread to provide per-core performance). When applying low-pass filtering to 20 Å, this time is reduced to 39 s, i.e. by a factor of 17. When using a single consumer-level GPU (GTX 1080) with a single CPU thread, execution is further reduced to just 0.73 s, or an additional factor of 54. Each GPU added to the workstation can therefore now process 940 micrographs in the same time previously required to process just 8 micrographs (one per available core), as shown in Figure 7.

Figure 7. Low-pass filtering and acceleration of particle picking.

Figure 7.

(A) Ribosomal particles were auto-picked from representative 4096²-pixel micrographs collected at 1.62 Å/pixel using four template classes, showing near-identical picking with and without low-pass filtering to 20 Å. The only differing particle is indicated in orange, and likely does not depict a ribosomal particle. (B–C) Despite near-identical particle selection, performance is dramatically improved. (D) Filtering alone provides almost 20-fold performance improvement on any hardware compared to previous versions of RELION, and when combined with GPU-accelerated particle picking the resulting performance gain is more than two orders of magnitude using only a single GPU (GTX 1080).

DOI: http://dx.doi.org/10.7554/eLife.18722.009

We further evaluated the quality of filtered selection on the β-galactosidase benchmark (EMPIAR entry 10017) used in the original implementation in RELION-1.3 (Scheres, 2015). This data set consists of 84 micrographs of 4096² pixels (1.77 Å/pixel), and comes with coordinates for 40,863 particles that were manually selected by Richard Henderson. The latter were used for comparison with our autopicking results, with a centre cutoff distance of 35 pixels for particles to be considered identical (Table 1). Filtered selection did not decrease the quality of the results, but provided a slightly increased recall without increasing the false discovery rate (FDR; see e.g. Langlois and Frank (2011) for definitions of recall and FDR). When filtering and GPU-acceleration are combined, a single GPU provides roughly 120 times the performance of an 8-core desktop, and the desktop can easily be equipped with quad GPUs for an approximately 500-fold performance gain. In fact, similarly to the regularised likelihood optimisation, the semi-automated particle picking can become limited by disk access unless data is read from an SSD.

Table 1.

Quality and speed of autopicking for the β-galactosidase benchmark. The GPU version yields results identical to those of the CPU version, but at much reduced computational cost, and increasing levels of low-pass filtering yield progressively higher recall at similar FDR. Filtering does not depend on GPU-acceleration, and will perform similarly using only CPUs.

DOI: http://dx.doi.org/10.7554/eLife.18722.011

Code   Filter (Å)   # picked particles   Recall   FDR    Time (s/micrograph)   Performance (CPU core units)
CPU    none         54,301               0.88     0.34   1,227                 1
GPU    none         54,325               0.88     0.34   10                    122
GPU    5            55,629               0.90     0.34   5.8                   211
GPU    10           55,886               0.90     0.34   2.1                   584
GPU    15           56,450               0.92     0.33   1.6                   766
GPU    20           57,361               0.95     0.33   1.3                   943

A complete workflow for β-galactosidase

To illustrate the impact of our GPU implementation and show how it can alter practical work, we chose to re-analyse the EMPIAR-10061 dataset of β-galactosidase (Bartesaghi et al., 2015) using RELION-2. This represents the largest presently available dataset in the EMPIAR database, and provides a realistic challenge. We performed an entire processing workflow on a single workstation with four GTX 1080 cards, including initial beam-induced motion correction in UNBLUR (Grant and Grigorieff, 2015), CTF estimation in Gctf (Zhang, 2016), and, in RELION-2, automated particle picking, 2D and 3D classification, movie-refinement, particle polishing (Scheres, 2014) and high-resolution auto-refinement. Calculating a map to 2.2 Å resolution (Figure 8A) took a few days, comparable to (if not faster than) the time required to acquire this amount of high-quality data. Figure 8B shows an overview of the most computationally demanding steps during this processing. The parts of the workflow that have been GPU-accelerated no longer dominate execution, but this exposes other new bottlenecks. In particular, steps that involve reading large movie files from disk become problematic. We also note that due to the rigid nature of the β-galactosidase complex used for this benchmark, only a single 3D classification was performed. This is not representative of many other use cases: typically, 3D classification is repeated multiple times to identify, sort, and isolate structural heterogeneity. In such a scenario, the impact of the GPU acceleration is even larger, as increased or multiple 3D classifications would still not dominate the complete workflow.

Figure 8. High-resolution structure determination on a single desktop.

Figure 8.

(A) The resulting 2.2 Å map (deposited as EMD-4116) shows excellent high-resolution density throughout the complex. (B) The most time-consuming steps in the image processing workflow. GPU-accelerated steps are indicated in orange. The total time of image processing was less than that of downloading the data. (C) The resolution estimate is based on the gold-standard FSC after correcting for the convolution effects of a soft solvent mask (black). The FSC between the RELION map and the atomic model in PDB ID 5A1A is shown in orange. The FSC between EMD-2984 and the same atomic model is shown for comparison (dashed grey).

DOI: http://dx.doi.org/10.7554/eLife.18722.010

Discussion

We present a GPU-enabled implementation of RELION-2, as a first step to address current and future needs for large and expedient computations in the field of cryo-EM structure determination. The principal benefits drawn from the presented work are twofold. First, the nature of progress in scientific applications is to continually re-evaluate and examine data in many different ways. With re-processing of data made easy, the threshold for trial, error and successive improvement of existing methods is now markedly lowered. Second, the order-of-magnitude speedups make it possible to get by with much less hardware for cryo-EM processing, in most cases even desktops. This removes a computational bottleneck for large labs, and enables any group to perform their own reconstructions without access to supercomputers.

In the next few years, larger data sets and image sizes are expected, as well as new methods that require expedient processing of large data sets. The large reduction in computational costs opens up the possibility to perform more ambitious computational analyses without increasing the investments. For example, the favourable scaling of performance we observed for multi-class refinements will make it feasible to use many more classes than was practical before, which will lead to better descriptions of conformational diversity in flexible molecules. Additionally, with even faster algorithms and hardware it might soon be possible to perform highly automated, on-the-fly, structure determination while data acquisition is ongoing. In anticipation of these developments, RELION-2 already implements a pipelined approach for automated execution of pre-determined image processing workflows (Fernandez-Leiro and Scheres, 2016a).

While the new GPU implementation has removed many of the previous computational bottlenecks in RELION, the large speedup has exposed several new areas of the code that can now dominate execution time, such as data input/output and the reconstruction step during iterative image refinement. Although these parts of the algorithm were previously insignificant, in some cases they now collectively account for roughly 50% of total execution time. These parts of the code will benefit from further modifications. Future work will, for example, strive to further generalise parallelism such that performance is less dependent on the type of refinement performed, as sufficient parallelism is always available within the RELION core algorithm. Memory requirements on the GPU are also expected to be reduced further, so that larger image sizes and more classes can be handled to higher resolution.

With the current implementation, cryo-EM structures to near-atomic resolution can be calculated in a matter of days on a single workstation, or hours on a GPU-cluster. Nevertheless, the aim of the current adaptations is not to present a final solution to computational needs in RELION; while the present version achieves excellent speedup on a wide range of low-cost systems, we expect the acceleration to improve both in performance and coverage. Generalising the low-level parallelism described here to vectorised CPU calculations, and possibly an open GPU language like OpenCL, will constitute little more than translating this parallelism to new instructions. This is something we intend to pursue in the future. As such, RELION-2 represents a new incarnation of an existing algorithm, which is intended to be developed much further in the coming years. Meanwhile, we hope that the current implementation will have as much impact in the broader community as it is already having in our labs.

Materials and methods

Availability

RELION-2 is open-source, free software distributed under the GPLv2 licence. It is publicly available for download through http://www2.mrc-lmb.cam.ac.uk/relion.

Data sets and hardware specifications

The ribosome data used for the 3D classification and refinement in Figures 4 and 5 correspond to EMPIAR entry 10028 (Wong et al., 2014). For autopicking, EMPIAR entry 10017 was used (Scheres, 2015). The complete workflow for β-galactosidase used EMPIAR entry 10061 (Bartesaghi et al., 2015), and the reconstructed map was deposited in the EMDB (EMD-4116). In all cases where different hardware and/or software implementations were compared, identical refinement parameters were used, including the random seed provided to RELION. In cases where identical hardware and software were repeatedly used to examine variability, different seeds were used. The nondeterministic scheduling and summations of GPU-enabled execution can introduce some minor noise in results, but all runs achieve convergence and variations in final results are well within the estimated errors and resolution of the raw data.

The acceleration in RELION-2 works with any NVIDIA GPU of compute capability 3.5 or higher, which covers all models launched in the last three years. Version 7.0 of the CUDA toolkit is also required, and was used for all presented results. CPU performance was benchmarked on a cluster of 10 compute nodes equipped with dual Xeon E5-2690v4 CPUs (2×14 physical cores, for a total of 280 cores) running at 2.6 GHz with 128 GB memory. While we refer to the physical core count when describing hardware, hyperthreading was enabled and used for all benchmarks (i.e., starting 560 threads on the 10 nodes) since it slightly improves performance on the CPU side. Two different workstations were used for GPU benchmarks: first, a cost-efficient desktop with a single Core i7-6700K (four cores, 4 GHz), dual GTX 1070 GPUs, and a 500 GB SSD; second, a workstation equipped with a single Core i7-5960X CPU (eight cores, 3 GHz), four GTX 1080 GPUs, and either a single SSD or two configured in RAID0 for higher bandwidth. Both workstations had 64 GB of memory, and two CPU threads were used for each GPU to improve utilisation. In all cases hyper-threading was utilised to the fullest extent possible to improve CPU performance.

β-galactosidase image processing

Super-resolution 8k×8k micrograph movies with 38 frames were submitted to initial beam-induced motion correction using UNBLUR (Grant and Grigorieff, 2015). The resulting average micrographs were used for CTF estimation in Gctf (Zhang, 2016). Autopicking with six templates yielded an initial data set of 130,375 particles, which were subjected to reference-free 2D classification using 200 classes. This initial classification was done using 4× downscaled particles (with a pixel size of 1.274 Å and a box size of 192 pixels). Selection of the 75 best classes resulted in 120,514 particles. All subsequent calculations were performed using 2× downscaling (resulting in a pixel size of 0.637 Å and a box size of 384 pixels). The selected particles were subjected to an initial 3D auto-refinement that used PDB ID 3I3E (Dugdale et al., 2010) as an initial model. Subsequent movie-refinement (with a running average of 7 movie frames and a standard deviation of 2 pixels on the translations) was followed by particle polishing (using a standard deviation of 1000 pixels on the inter-particle distance). The resulting shiny particles were submitted to a single round of 3D classification with exhaustive 7.5-degree angular searches and eight classes. Selection of the seven best classes yielded a final data set of 109,963 particles, which were submitted to 3D auto-refinement. The final resolution was estimated using phase-randomisation to account for the convolution effects of a solvent mask on the FSC between the two independently refined half-maps (Chen et al., 2013). This mask was generated by binarisation of a 15 Å low-pass filtered version of the reconstructed map, with addition of a five-pixel wide cosine-shaped soft edge. FSC curves between the model and the solvent-masked map were calculated with relion_image_handler. The same soft solvent mask was also used for the calculation between EMD-2984 and the atomic model.

Acknowledgements

The authors would like to thank Szilárd Páll, Nikolay Markovskiy, and Mark Berger for fruitful discussions and CUDA suggestions; Shintaro Aibara and Marta Carroni for providing data and early quality testing; and Toby Darling, Jake Grimmett, and Stefan Fleichmann for technical support.

Appendix 1

Difference calculation

Exhaustive coarse sampling

To align each particle image against a reference model, RELION performs an exhaustive grid search of a large number of reference model projections against a number of image translations. In our implementation, this difference-calculation kernel is by far the most significant computational load during classification and refinement. The computations are performed in Fourier space (Figure 1), where projections are interpolated slices of the Fourier transform of the reference model. The tri-linear interpolation of the Fourier components performed during extraction of the Fourier-volume slice is computationally heavy and incurs significant latency when reading data. This is despite the use of texture objects, which use a dedicated cache and specialised storage formats. To improve memory usage, it is important to reuse the sliced data in the kernel. The maximum number of parallel tasks within this kernel is P×T×C, where P is the number of orientations, T the number of translations, and C the number of Fourier components (pixels). However, reuse of sliced data from the reference model requires at least one synchronisation and data communication within each of the P groups of processes comparing translations against a common Fourier slice. To avoid performance penalties, we make sure each such group is executed within a single thread-block to enable fast communication through shared memory.

The limitations in shared memory size are circumvented by splitting the reference slice into chunks of components (pixels) that are loaded separately, although managing these chunks creates further overhead. We further improve performance by grouping reference orientations (in groups of P0 slices) to also enable reuse of translated image components. The number of completely parallel tasks is then in fact reduced, which may limit performance by potentially not saturating the hardware with a sufficient number of independent tasks. However, this reduction in turn reduces the number of reads of the Fourier transform of the particle image. This, together with the reuse of translated components, ultimately provided a significant performance boost.

The described protocol was implemented by dividing the kernel into two stages, where reference data is prepared in the first stage and then reused as much as possible in the second. Appendix 1—figure 1 illustrates this setup in detail. The two arbitrary parameters N and P0 can be tuned to balance the shared memory size against the reuse of the translation intermediates. To avoid overfitting these parameters to benchmarking systems, we have selected values that yield a suitable minimum required shared memory size for typical available hardware.
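A stripped-down CUDA sketch of this two-stage layout follows (our own illustration with invented names, not the RELION-2 source; CTF and noise weighting are omitted, the slice interpolation is delegated to a small helper, and the final reduction uses global atomics instead of the shared-memory reduction used in practice):

#define P0 4      // reference orientations handled per thread-block
#define N  128    // Fourier components per chunk; launched with blockDim.x == N

// Helper: rotate a 2D Fourier coordinate into the oversampled 3D reference
// transform and interpolate it through the texture unit.
__device__ float2 sliceComponent(cudaTextureObject_t refFT, const float *R,
                                 float kx, float ky, float pad, float origin)
{
    return tex3D<float2>(refFT,
                         pad * (R[0] * kx + R[1] * ky) + origin + 0.5f,
                         pad * (R[3] * kx + R[4] * ky) + origin + 0.5f,
                         pad * (R[6] * kx + R[7] * ky) + origin + 0.5f);
}

// Helper: apply a translation as a complex multiplication with a phase factor.
__device__ float2 applyShift(float2 x, float2 s)
{
    return make_float2(x.x * s.x - x.y * s.y, x.x * s.y + x.y * s.x);
}

__global__ void coarseDifferenceKernel(
        cudaTextureObject_t refFT, const float2 *imgFT, const float2 *shiftFactors,
        const int2 *coords, const float *rotMat,
        int nComponents, int nTranslations, int nOrientations,
        float pad, float origin, float *diffOut)
{
    __shared__ float2 sRef[P0][N];              // stage A: cached reference components
    const int firstOrient = blockIdx.x * P0;

    for (int i = 0; i * N < nComponents; ++i)   // loop over chunks of N components
    {
        const int n = i * N + threadIdx.x;

        // --- Stage A: fetch and cache this chunk of P0 reference slices ---
        if (n < nComponents)
            for (int p = 0; p < P0 && firstOrient + p < nOrientations; ++p)
                sRef[p][threadIdx.x] = sliceComponent(refFT,
                        rotMat + 9 * (firstOrient + p),
                        coords[n].x, coords[n].y, pad, origin);
        __syncthreads();

        // --- Stage B: reuse the cached slices for every translation ---
        for (int t = 0; t < nTranslations; ++t)
        {
            float2 x = make_float2(0.f, 0.f);
            if (n < nComponents)
                x = applyShift(imgFT[n], shiftFactors[t * nComponents + n]);

            for (int p = 0; p < P0 && firstOrient + p < nOrientations; ++p)
            {
                float d = 0.f;
                if (n < nComponents) {
                    float dRe = sRef[p][threadIdx.x].x - x.x;
                    float dIm = sRef[p][threadIdx.x].y - x.y;
                    d = dRe * dRe + dIm * dIm;
                }
                atomicAdd(&diffOut[(firstOrient + p) * nTranslations + t], d);
            }
        }
        __syncthreads();                        // protect sRef before the next chunk
    }
}

In the launch configuration described in Appendix 1—figure 1 below, this corresponds to starting ceil(P/P0) thread-blocks of N threads each.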

Appendix 1—figure 1. Computational flow in difference calculation kernel.

Appendix 1—figure 1.

The kernel is initiated with ceil(P/P0) thread-blocks and N threads, where P is the total number of projections. The work flow of a thread-block in each iteration i is divided into two stages. In stage A the N pixels of P0 reference slices are fetched through texture memory, interpolated, and stored in shared memory. This data is then exhaustively reused in stage B, where groups of threads compute the differences to the corresponding translated image components. Individual threads within a group work with different image components, n, of each reference slice, p. Collectively all threads iterate through the N components of each reference slice, for a total of N×P0 components for each iteration i. The final result is reduced back into shared memory through atomic reduction operations. All image components are covered as i goes from 1 to ceil(C/N), where C is the total number of Fourier components. A reduced sum of differences for each pair of orientation and translation is written to global memory prior to the kernel exiting.

DOI: http://dx.doi.org/10.7554/eLife.18722.012

Sparse fine sampling

Following the exhaustive search of orientations described above, RELION performs a second, fine-grained search of the orientations which contributed most to the total alignment weight. In the majority of cases this constitutes a sparser operation, as a few orientations and/or classes typically dominate. If one were to invoke the same kernel in this case, many thread-blocks would contribute insignificantly or even perform null work. We therefore chose to specialise this as a separate kernel, to reach better efficiency and stay within hardware requirements under sparse but fine sampling. The fine-grained search therefore proceeds through a preparatory stage wherein the significantly contributing combinations of reference orientations and image translations are divided into jobs (see Appendix 1—figure 2). Lists of the relevant orientations and translations are also created. A specialised kernel is then provided with this list of jobs, and invoked in parallel to execute them independently. Jobs are also constructed to reuse unique information (a reference slice), and to minimise the overhead associated with reading relevant translation indices from high-latency global memory by using sequential translations within each job. See Appendix 1—figure 2 for an in-depth description of this implementation.
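A minimal sketch of such a job description (invented names and fields; the actual data layout in RELION-2 differs) could look as follows, with one thread-block launched per job:

// Each job reuses a single reference orientation for a contiguous run of
// significant translations, so the kernel can load the slice once and index
// the translations implicitly.
struct FineSearchJob {
    int orientationIdx;    // index into the list of significant orientations
    int firstPairIdx;      // first entry in the flat list of (orientation, translation) weights
    int numTranslations;   // number of contiguous translations handled by this job
};

// Launch sketch:
//   fineDifferenceKernel<<<numJobs, BLOCK_SIZE>>>(d_jobs, d_orientations,
//                                                 d_translations, ...);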

Appendix 1—figure 2. A dedicated kernel function performs the targeted fine-grained examination of the most significantly matching regions during image alignment against a reference model.

Appendix 1—figure 2.

The oversampling of each of the five fitting dimensions during the fine-grained search renders storage of all possible weights intractable, so input and output data are stored with explicit mapping arrays. These are read by the kernel function thread-block, rather than inferred based on block ID. This creates overhead and possible latency of global memory access, which further distinguishes this kernel from the exhaustive kernel represented in Appendix 1—figure 1. Here, a pixel-chunk of a single projection is reused for a number of sequential translations, arranged contiguously if possible. Invoking separate thread-blocks for non-contiguous translations allows some implicit indexing of them, which affords better access patterns for SIMD instructions and reduced latency. Due to the sparseness, shared memory can also be used for in-kernel summation of all pixels of each image, which, despite some required explicit thread-level synchronisation, increases throughput by avoiding the higher latency of the atomic write operations used during image summation in the coarse-search kernel.

DOI: http://dx.doi.org/10.7554/eLife.18722.013

Appendix 2

Back-projection

The calculated weights for the different orientations of the particle image are used to back-project the image data into a 3D volume (Appendix 2—figure 1A). This subroutine takes a comparably small amount of 2D data as input and outputs it to a large container of 3D data, made up of voxels, elements much like the pixels of an image. Each image pixel contributes to eight voxels through tri-linear interpolation.

Appendix 2—figure 1. Back-projection of 2D image data into the 3D volume, and the memory access approaches explored.

Appendix 2—figure 1.

(A) Weighted back-projection of a 2D image into three different planes. (B) We explored two memory access approaches for this task, namely gather and scatter. In the gather approach, a process (marked in orange) is assigned to individual or groups of 3D voxels. The process reads from the input image and updates the data of the assigned voxel(s).

DOI: http://dx.doi.org/10.7554/eLife.18722.014

Two fundamentally different implementations were explored for this task, using either gather or scatter memory access patterns (Appendix 2—figure 1B). The issue of parallel processes writing to the same memory position simultaneously is avoided in the gather approach by restricting subsets of positions in the output to individual processes. In the scatter approach, by contrast, write clashes are handled with atomic writes, where memory positions are reserved just in time prior to the write operation. This extra set of operations has an overhead and can become a considerable performance issue in regions with many clashes, e.g. close to the origin. However, the scatter approach enables full reuse of the interpolation of pixels for all the affected voxels, since no access restrictions exist for write operations.

In the scatter approach, processes are instead assigned to individual pixels or groups of pixels and can output to any of the voxels. In our benchmarks the scatter approach performed significantly better and was thus selected as the standard. This is most likely due to the fact that individual images on average contribute very sparsely to the 3D volume and hence only affect a small subset of voxels. Since the number of processes in the gather approach is proportional to the number of voxels, this renders many initiated processes jobless. Both this issue and the reduced reuse of intermediates can be addressed to some degree by enabling processes to manage groups of voxels. This yielded some improvement, but the scatter approach nevertheless provided superior performance.
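The following simplified CUDA sketch illustrates the scatter approach (our own illustration with invented names, not the RELION-2 kernel; bounds checks, Friedel symmetry and CTF weighting are omitted, and the probability weight is assumed to be folded into a single per-component factor):

// One thread per image Fourier component: each component is spread over the
// eight neighbouring voxels of the 3D grid with tri-linear weights, using
// atomic additions to resolve write clashes between threads.
__global__ void backprojectScatter(
        const float2 *imgFT,      // Fourier components of the particle image
        const float  *weight,     // probability weight per component
        const int2   *coords,     // (kx, ky) Fourier coordinate per component
        const float  *R,          // 3x3 rotation matrix for this orientation
        int nComponents, int volDim, float origin,
        float2 *volume,           // running sum of back-projected data, volDim^3 voxels
        float  *volumeWeight)     // accumulated weights for the later gridding correction
{
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n >= nComponents) return;

    float kx = coords[n].x, ky = coords[n].y;
    float x = R[0] * kx + R[1] * ky + origin;
    float y = R[3] * kx + R[4] * ky + origin;
    float z = R[6] * kx + R[7] * ky + origin;

    int   x0 = (int)floorf(x), y0 = (int)floorf(y), z0 = (int)floorf(z);
    float fx = x - x0,          fy = y - y0,         fz = z - z0;

    for (int dz = 0; dz <= 1; ++dz)
        for (int dy = 0; dy <= 1; ++dy)
            for (int dx = 0; dx <= 1; ++dx)
            {
                float w = (dx ? fx : 1.f - fx) *
                          (dy ? fy : 1.f - fy) *
                          (dz ? fz : 1.f - fz) * weight[n];

                size_t v = ((size_t)(z0 + dz) * volDim + (y0 + dy)) * volDim + (x0 + dx);
                atomicAdd(&volume[v].x, w * imgFT[n].x);
                atomicAdd(&volume[v].y, w * imgFT[n].y);
                atomicAdd(&volumeWeight[v], w);
            }
}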

Appendix 3

Hardware recommendations

Each new generation of GPU hardware has provided significant performance gains and increasing amounts of memory, so the most important factor is to use a new GPU (currently Pascal-class cards). Titan-X (Pascal) cards are likely to provide the highest performance, but due to the substantial price premium of these cards, we believe GTX 1080 cards offer better value. The workstation with quad GTX 1080 GPUs described in the materials can currently be assembled from parts for around $5000 (see links from the RELION home page), or purchased pre-assembled from a few vendors for about $6200. Since the heavy computational steps run on the GPU, there is no need for dual CPU sockets. However, disk I/O bandwidth needs are high, and as seen in Figure 4C it helps to use two SSDs configured in RAID0. This provides a powerful cluster replacement for many users, but the quad-GPU configuration is sensitive to specific motherboard models and the fans can be a bit loud for a normal office environment. In our labs, the currently most popular option is rather a dual-GPU desktop that is both quiet and smaller. This will work with virtually any motherboard with two GPU slots. With the cheap GTX 1070 cards the machine parts only cost $2000, and we expect it can be obtained from vendors for roughly $2500. A single SSD is sufficient in this case. All these prices are subject to fluctuations (and changes due to new hardware), but it is nevertheless interesting to compare the performance/price for the most computationally intensive 3D classification with the x86 CPU cluster ($8500/node, not including the fast network). Based on the run times in Figure 4C, the performance/price ratio is a factor of 27 higher for the quad-GPU workstation (with RAID0 SSDs) compared to x86 CPU nodes, and a factor of 45 higher for the low-cost dual GTX 1070 desktop (at the cost of lower absolute performance). This is based on pre-assembled system prices, and while some vendors might charge more, there is always the option to assemble systems from parts for even better value. When it comes to rack-mounted alternatives, many such nodes only work with professional-class (Tesla) cards, but there are larger 4U nodes that work with quad GTX 1080 cards, and even a few 1U models. However, this is highly dependent on the hardware and the reader is advised to seek up-to-date recommendations from vendors or the RELION web pages.

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Funding Information

This paper was supported by the following grants:

  • Medical Research Council MC UP A025 1013 to Sjors HW Scheres.

  • Vetenskapsrådet 2013-5901 to Erik Lindahl.

  • Horizon 2020 EINFRA-2015-1-675728 to Erik Lindahl.

  • Swedish e-Science Research Centre to Erik Lindahl.

  • Swedish National Infrastructure for Computing 2015/16-45 to Erik Lindahl.

Additional information

Competing interests

SHWS: Reviewing editor, eLife.

The other authors declare that no competing interests exist.

Author contributions

DK, Conception and design, Acquisition of data, Analysis and interpretation of data, Drafting or revising the article.

BOF, Conception and design, Acquisition of data, Analysis and interpretation of data, Drafting or revising the article.

SHWS, Conception and design, Analysis and interpretation of data, Drafting or revising the article.

EL, Conception and design, Analysis and interpretation of data, Drafting or revising the article.

Additional files

Major datasets

The following dataset was generated:

Kimanius D, Forsberg BO, Scheres SHW, Lindahl E (2016). RELION-2.0 reconstruction for beta-galactosidase data in EMPIAR-10061. Publicly available at the EBI Protein Data Bank in Europe (accession no: EMD-4116): http://www.ebi.ac.uk/pdbe/entry/emdb/EMD-4116

The following previously published datasets were used:

Scheres SH (2014). Beta-galactosidase Falcon-II micrographs plus manually selected coordinates by Richard Henderson. Publicly available at the EBI Electron Microscopy Pilot Image Archive (accession no: EMPIAR-10017): https://www.ebi.ac.uk/pdbe/emdb/empiar/entry/10017/

Wong W, Bai XC, Brown A, Fernandez IS, Hanssen E, Condron M, Tan YH, Baum J, Scheres SHW (2014). Cryo-EM structure of the Plasmodium falciparum 80S ribosome bound to the anti-protozoan drug emetine. Publicly available at the EBI Electron Microscopy Pilot Image Archive (accession no: EMPIAR-10028): https://www.ebi.ac.uk/pdbe/emdb/empiar/entry/10028/

Bartesaghi A, Merk A, Banerjee S, Matthies D, Wu X, Milne JL, Subramaniam S (2015). 2.2 Å resolution cryo-EM structure of beta-galactosidase in complex with a cell-permeant inhibitor. Publicly available at the EBI Electron Microscopy Pilot Image Archive (accession no: EMPIAR-10061): https://www.ebi.ac.uk/pdbe/emdb/empiar/entry/10061/

References

  1. Abraham MJ, Murtola T, Schulz R, Páll S, Smith JC, Hess B, Lindahl E. GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX. 2015;1-2:19–25. doi: 10.1016/j.softx.2015.06.001. [DOI] [Google Scholar]
  2. Bai XC, Rajendra E, Yang G, Shi Y, Scheres SH. Sampling the conformational space of the catalytic subunit of human γ-secretase. eLife. 2015;4:e11182. doi: 10.7554/eLife.11182. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Bartesaghi A, Merk A, Banerjee S, Matthies D, Wu X, Milne JL, Subramaniam S. 2.2 Å resolution cryo-EM structure of β-galactosidase in complex with a cell-permeant inhibitor. Science. 2015;348:1147–1151. doi: 10.1126/science.aab1576. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Castaño-Díez D, Moser D, Schoenegger A, Pruggnaller S, Frangakis AS. Performance evaluation of image processing algorithms on the GPU. Journal of Structural Biology. 2008;164:153–160. doi: 10.1016/j.jsb.2008.07.006. [DOI] [PubMed] [Google Scholar]
  5. Chen S, McMullan G, Faruqi AR, Murshudov GN, Short JM, Scheres SH, Henderson R. High-resolution noise substitution to measure overfitting and validate resolution in 3D structure determination by single particle electron cryomicroscopy. Ultramicroscopy. 2013;135:24–35. doi: 10.1016/j.ultramic.2013.06.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Cheng Y, Grigorieff N, Penczek PA, Walz T. A primer to single-particle cryo-electron microscopy. Cell. 2015;161:438–449. doi: 10.1016/j.cell.2015.03.050. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Cheng Y. Single-Particle Cryo-EM at Crystallographic Resolution. Cell. 2015;161:450–457. doi: 10.1016/j.cell.2015.03.049. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological). 1977;39:1–38. [Google Scholar]
  9. Dugdale ML, Dymianiw DL, Minhas BK, D'Angelo I, Huber RE. Role of Met-542 as a guide for the conformational changes of Phe-601 that occur during the reaction of β-galactosidase (Escherichia coli). Biochemistry and Cell Biology. 2010;88:861–869. doi: 10.1139/O10-009. [DOI] [PubMed] [Google Scholar]
  10. Elmlund D, Elmlund H. SIMPLE: Software for ab initio reconstruction of heterogeneous single-particles. Journal of Structural Biology. 2012;180:420–427. doi: 10.1016/j.jsb.2012.07.010. [DOI] [PubMed] [Google Scholar]
  11. Fernandez-Leiro R, Scheres SHW. A pipeline approach to single-particle processing in Relion. bioRxiv. 2016a doi: 10.1101/078352. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Fernandez-Leiro R, Scheres SHW. Unravelling biological macromolecules with cryo-electron microscopy. Nature. 2016b;537:339–346. doi: 10.1038/nature19948. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Fernández IS, Bai XC, Hussain T, Kelley AC, Lorsch JR, Ramakrishnan V, Scheres SH. Molecular architecture of a eukaryotic translational initiation complex. Science. 2013;342:1240585. doi: 10.1126/science.1240585. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Frank J, Radermacher M, Penczek P, Zhu J, Li Y, Ladjadj M, Leith A. SPIDER and WEB: processing and visualization of images in 3D electron microscopy and related fields. Journal of Structural Biology. 1996;116:190–199. doi: 10.1006/jsbi.1996.0030. [DOI] [PubMed] [Google Scholar]
  15. Frank J, Shimkin B, Dowse H. Spider—A modular software system for electron image processing. Ultramicroscopy. 1981;6:343–357. doi: 10.1016/S0304-3991(81)80221-5. [DOI] [Google Scholar]
  16. Grant T, Grigorieff N. Measuring the optimal exposure for single particle cryo-EM using a 2.6 Å reconstruction of rotavirus VP6. eLife. 2015;4:e06980. doi: 10.7554/eLife.06980. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Grigorieff N. FREALIGN: high-resolution refinement of single particle structures. Journal of Structural Biology. 2007;157:117–125. doi: 10.1016/j.jsb.2006.05.004. [DOI] [PubMed] [Google Scholar]
  18. Heymann JB, Belnap DM. Bsoft: image processing and molecular modeling for electron microscopy. Journal of Structural Biology. 2007;157:3–18. doi: 10.1016/j.jsb.2006.06.006. [DOI] [PubMed] [Google Scholar]
  19. Hoang TV, Cavin X, Ritchie DW. gEMfitter: a highly parallel FFT-based 3D density fitting tool with GPU texture memory acceleration. Journal of Structural Biology. 2013;184:348–354. doi: 10.1016/j.jsb.2013.09.010. [DOI] [PubMed] [Google Scholar]
  20. Hohn M, Tang G, Goodyear G, Baldwin PR, Huang Z, Penczek PA, Yang C, Glaeser RM, Adams PD, Ludtke SJ. SPARX, a new environment for Cryo-EM image processing. Journal of Structural Biology. 2007;157:47–55. doi: 10.1016/j.jsb.2006.07.003. [DOI] [PubMed] [Google Scholar]
  21. Iudin A, Korir PK, Salavert-Torres J, Kleywegt GJ, Patwardhan A. EMPIAR: a public archive for raw electron microscopy image data. Nature Methods. 2016;13:387–388. doi: 10.1038/nmeth.3806. [DOI] [PubMed] [Google Scholar]
  22. Jia Y, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, Guadarrama S, Darrell T. Caffe: convolutional architecture for fast feature embedding. Proceedings of the 22nd ACM international conference on Multimedia. 2014. pp. 675–678.
  23. Langlois R, Frank J. A clarification of the terms used in comparing semi-automated particle selection algorithms in cryo-EM. Journal of Structural Biology. 2011;175:348–352. doi: 10.1016/j.jsb.2011.03.009. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Li X, Grigorieff N, Cheng Y. GPU-enabled FREALIGN: accelerating single particle 3D reconstruction and refinement in Fourier space on graphics processors. Journal of Structural Biology. 2010;172:407–412. doi: 10.1016/j.jsb.2010.06.010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Li X, Mooney P, Zheng S, Booth CR, Braunfeld MB, Gubbens S, Agard DA, Cheng Y. Electron counting and beam-induced motion correction enable near-atomic-resolution single-particle cryo-EM. Nature Methods. 2013;10:584–590. doi: 10.1038/nmeth.2472. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Li X, Zheng S, Agard DA, Cheng Y. Asynchronous data acquisition and on-the-fly analysis of dose fractionated cryoEM images by UCSFImage. Journal of Structural Biology. 2015;192:174–178. doi: 10.1016/j.jsb.2015.09.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Nguyen TH, Galej WP, Bai XC, Oubridge C, Newman AJ, Scheres SH, Nagai K. Cryo-EM structure of the yeast U4/U6.U5 tri-snRNP at 3.7 Å resolution. Nature. 2016;530:298–302. doi: 10.1038/nature16940. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Roseman AM. Particle finding in electron micrographs using a fast local correlation algorithm. Ultramicroscopy. 2003;94:225–236. doi: 10.1016/S0304-3991(02)00333-9. [DOI] [PubMed] [Google Scholar]
  29. Saibil HR, Grünewald K, Stuart DI. A national facility for biological cryo-electron microscopy. Acta Crystallographica. Section D, Biological Crystallography. 2015;71:127–135. doi: 10.1107/S1399004714025280. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Salomon-Ferrer R, Götz AW, Poole D, Le Grand S, Walker RC. Routine microsecond molecular dynamics simulations with AMBER on GPUs. 2. Explicit solvent particle mesh Ewald. Journal of Chemical Theory and Computation. 2013;9:3878–3888. doi: 10.1021/ct400314y. [DOI] [PubMed] [Google Scholar]
  31. Scheres SH, Núñez-Ramírez R, Sorzano CO, Carazo JM, Marabini R. Image processing for electron microscopy single-particle analysis using XMIPP. Nature Protocols. 2008;3:977–990. doi: 10.1038/nprot.2008.62. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Scheres SH. A Bayesian view on cryo-EM structure determination. Journal of Molecular Biology. 2012b;415:406–418. doi: 10.1016/j.jmb.2011.11.010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Scheres SH. RELION: implementation of a Bayesian approach to cryo-EM structure determination. Journal of Structural Biology. 2012a;180:519–530. doi: 10.1016/j.jsb.2012.09.006. [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Scheres SH. Beam-induced motion correction for sub-megadalton cryo-EM particles. eLife. 2014;3:e03665. doi: 10.7554/eLife.03665. [DOI] [PMC free article] [PubMed] [Google Scholar]
  35. Scheres SH. Semi-automated selection of cryo-EM particles in RELION-1.3. Journal of Structural Biology. 2015;189:114–122. doi: 10.1016/j.jsb.2014.11.010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  36. Tagare HD, Barthel A, Sigworth FJ. An adaptive Expectation-Maximization algorithm with GPU implementation for electron cryomicroscopy. Journal of Structural Biology. 2010;171:256–265. doi: 10.1016/j.jsb.2010.06.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
  37. Tang G, Peng L, Baldwin PR, Mann DS, Jiang W, Rees I, Ludtke SJ. EMAN2: an extensible image processing suite for electron microscopy. Journal of Structural Biology. 2007;157:38–46. doi: 10.1016/j.jsb.2006.05.009. [DOI] [PubMed] [Google Scholar]
  38. Ufimtsev IS, Martínez TJ. Graphical processing units for quantum chemistry. Computing in Science & Engineering. 2008;10:26–34. doi: 10.1109/MCSE.2008.148. [DOI] [Google Scholar]
  39. van Heel M, Harauz G, Orlova EV, Schmidt R, Schatz M. A new generation of the IMAGIC image processing system. Journal of Structural Biology. 1996;116:17–24. doi: 10.1006/jsbi.1996.0004. [DOI] [PubMed] [Google Scholar]
  40. Wong W, Bai XC, Brown A, Fernandez IS, Hanssen E, Condron M, Tan YH, Baum J, Scheres SHW. Cryo-EM structure of the Plasmodium falciparum 80S ribosome bound to the anti-protozoan drug emetine. eLife. 2014;3:e03080. doi: 10.7554/eLife.03080. [DOI] [PMC free article] [PubMed] [Google Scholar]
  41. Zhang K. Gctf: Real-time CTF determination and correction. Journal of Structural Biology. 2016;193:1–12. doi: 10.1016/j.jsb.2015.11.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
eLife. 2016 Nov 15;5:e18722. doi: 10.7554/eLife.18722.023

Decision letter

Editor: Sriram Subramaniam

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

Thank you for submitting your article "Accelerated cryo-EM structure determination with parallelisation using GPUs in RELION-2" for consideration by eLife. Your article has been reviewed by four peer reviewers, one of whom, Sriram Subramaniam, is a member of our Board of Reviewing Editors, and the evaluation has been overseen by John Kuriyan as the Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Steven J Ludtke and J Bernard Heymann. A further reviewer remains anonymous.

Following extensive discussions with the reviewers, we are willing to re-review a revised version of your manuscript. The revised version must address all of the technical comments from the reviewers and include a substantive technical section with more details of the software implementation, as well as a more extensive description of suitable platforms with a credible comparison of performance.

The main points of this discussion are noted below:

1) There was agreement that the main point of this manuscript was to advertise the existence of a GPU port of Relion, which many people in the field will likely wish to use. The changes to Relion described in the manuscript do not improve the scientific results, but improve computational efficiency that could potentially accelerate discoveries. Nevertheless, the reviewers expressed doubts that the increase in computational efficiency would stand out as a significant advance technologically or methodologically.

2) The reviewers also noted that the extent of the performance improvement with GPU was exaggerated and potentially misleading; this is an important parameter because it would lead to users making purchasing decisions without the correct facts. While the reviewers felt it is not possible to be completely equitable when comparing completely different hardware platforms, the view was that this manuscript is more strongly skewed towards GPU for the following reasons:

a) Very different generations of hardware were compared (the CPUs are 4 years older than the GPUs, and many changes have been made to CPUs in that time)

b) The code in Relion 2.0 was substantially optimized in the GPU version; it seems that many of these optimizations could be easily included in the CPU version of Relion 2.0. A better comparison would be to compare equivalent generations of GPU/CPUs, and equivalent versions of Relion.

3) There was concern that the specific improvements to the code that helped improve GPU performance were not identified clearly and that it was difficult to tell which are only of interest for GPU and which can be implemented for CPU. A portion of the community may have strong interest in the details of these improvements, while others will likely only be interested in the question of which platform is more likely to be cost efficient and remain relevant over a relatively extended lifetime. However, this manuscript does not lay out these improvements clearly for the first group, but also does not provide fully controlled comparison of comparable platforms for the second group.

The suggestions from the reviewers follow:

1) The manuscript needs to include a paragraph discussing previous EM work that used GPU implementations, encompassing the references listed below. Pubmed /26592709 is cited in section 2.1 in another context, but not for GPU implementation.

http://www.ncbi.nlm.nih.gov/pubmed/20538058
http://www.ncbi.nlm.nih.gov/pubmed/24060989
http://www.ncbi.nlm.nih.gov/pubmed/26370395
http://www.ncbi.nlm.nih.gov/pubmed/23644547
http://www.ncbi.nlm.nih.gov/pubmed/20558298
http://www.ncbi.nlm.nih.gov/pubmed/26592709

2) It would be beneficial to include significantly more algorithmic details in the paper. The way it is currently written seems to present mainly the benchmark tests, while leaving largely absent the explanation of what was important for the development of the GPU code.

3) On a technical note, regarding the experiments made for comparing CPU and GPU code and between single and double precision, one would assume that the authors used the same initial model, parameters and random seed in all cases (--random_seed option in RELION) to initialize all the runs, but this is not mentioned anywhere in the text. If this was actually done, a fairly thorough analysis of the assigned orientations and orientational probabilities could've been done where the single difference between the runs would've been the different program implementations.

4) In paragraph one of the Results section, the mention of what the authors believe to have the most prominent influence on performance is only vaguely explained:

"To overcome this, our implementation instead stores oversampled Fourier transforms of every class-reference in GPU memory, and extracts 2D-slices (in any orientation) on demand. By utilising fast-access data structures known as textures (normally used to project images on 3D objects), on-demand projection in fact achieves faster execution compared to reading pre-calculated projections from memory."

i) The authors claim to store oversampled Fourier transforms of every class-reference in GPU memory. What kind of transforms are being referred to, 3D or 2D? If this is a 3D oversampled FFT volume, then it should be mentioned and the sentence needs to be corrected to have one FFT transform per class-reference. On the other hand, if the authors mention 2D reference FFT slices (which is less likely because it is not consistent with the on-demand projection statement), they need to elaborate on how many such slices they used and on angular sampling strategies.

ii) The authors need to expand upon using texture operations for the on-demand projection. Specific citation(s) would be helpful.

5) There are inconsistent statements regarding the differences between the methods used in RELION 1.4 and the new accelerated version of RELION. On one side the authors state: "Neither method nor behaviour of RELION has changed from that of version 1.4.", but later on they describe algorithmic changes to the particle picking strategy in Section 2.2. The only tangible difference appears to be the discarding of resolution information in micrographs beyond that of search templates, but this has to be properly highlighted and the statement in Section 2.1 revised to reflect the fact that there are changes, albeit subtle, between versions 1.4 and 2.0 other than the GPU speedup. It will save end-users time and confusion if all of the changes are documented.

Is it correct that Relion adopts a maximum-likelihood approach, and not a Bayesian approach, because all the priors (distributions) it uses are refined? It would be good if a person familiar with the difference between "Bayesian" and "Maximum Likelihood" could clarify this point so that it is right in the published version.

6) There are several comparisons noted in the paper that need to be clarified.

i) The comparison drawn between overall processing time and the time it took the authors to download the data from EMPIAR is meaningless; many factors can influence the download speed of large datasets, and using that as a benchmark can be misleading. The typical RELION user will probably not be downloading data from a database for processing but will instead be collecting and processing their own data. A comparison against the time it takes to collect the data would therefore be more appropriate and would give the reader a better reference point for the other rate-limiting steps in single-particle cryo-EM. Another useful comparison is to estimate the number of CPUs required to process the data in the same time (115h) that it took for a GPU-driven solution.

ii) Distributing a job across a cluster leads to substantial overhead (e.g. network traffic) that will not be present when running on a single workstation. While it may be justified to look at the overhead as one reason for the slower execution of jobs on CPU systems, this should ideally be separated in benchmark tests from the actual CPU hours needed to complete the job. Maybe some of the tests could be rerun on a single workstation to exclude cluster-related overhead.

7) It is not clear who might be the intended reader of the manuscript. It is clearly not meant for an average user of the Relion package, who most likely is a structural biologist oblivious to the intricacies of computer code and its dependence on the hardware platform. The problems mentioned are immediately apparent at the beginning of the Results section. The fact that "class" is used to mean "3D structure" took me by surprise, particularly since one has to compare Figure 1 with Figure 2 to realize that. Some formulations are cryptic – "Even individual image pixels are evaluate independently from on another". Specifically, how does one evaluate individual pixels, for what purpose, and why would doing it independently constitute an advantage?

Results section, first paragraph; what is a "class-reference"? What follows is difficult to understand – if, as stated, 2D slices are extracted from a presumably 3D volume, this constitutes a projection operation, but the next sentence casts doubt on this, as it states that this is being done using textures "normally used to project images on 3D objects". So is it not a projection? Or is some other projection meant?

Subsequently, much is made of the number of classes, but it is not clear what these classes are and why one would want to have them. How is their number decided? What if one wants to perform the analysis without any classes – is that possible? What would the speed-up be?

The most cryptic is the closing sentence of the section: "Neither method nor behaviour of relion has changed from that of version 1.4." First, what is so special about version 1.4? How is a casual reader supposed to know that or, worse, care about this number? It would also appear that the statement is contradicted by what follows, as much is made of the single-precision GPU implementation as opposed to the double precision of the general-purpose original code. As an aside, why is double precision "so-called"? I am not sure what is meant here.

Finally, the section on Limited precision does not offer much beyond generalities and ends with a surprising statement: "Execution of the iterative gridding algorithm… appeared to show significant loss of information." Appeared? How carefully was this investigated? What is "significant"?

8) Figures are microscopic, which probably reflects the authors' interest in atomic-scale objects, but at the same time they are filled with details essential for understanding the text. At the very least, they should be properly referenced in the text in key places. The final maps should also be deposited, as is customary in the field; this intent is not indicated in the manuscript.

9) As this manuscript may entice people to spend tens to hundreds of thousands of dollars in research funds, it is incumbent upon the authors and reviewers to ensure that this study is as balanced and fair as possible. There was a similar pro-GPU move ~8 years ago, when Frealign, EMAN2 and SPARX were all parallelized for the GPU, achieving typically 20-80x speedups; many labs purchased hardware, but few made effective use of it for very long. While the reviewers do not dispute the accuracy of the presented results, they believe that the results are highly biased by the specific hardware selection, and by issues not discussed in the manuscript.

10) There is some price/performance advantage in GPUs, but the current estimate is simply ludicrous, and many people reading this manuscript will not understand the technical differences. Primary issues that occur immediately are:

– The processors used in the 120-core test are a 4-year-old model considered "end of life" by Intel. These are being compared to a set of 4 GPUs so new that they are difficult to acquire from most vendors. The vectorization and architecture changes made by Intel in recent years, major hyperthreading improvements, and other factors will come into play. Specifically:

– CPU optimization for the specific processors in use? Image processing code compiled for specific current-generation CPUs with -march=native often sees speedups of 30% or more, without any coding changes.

–hyperthreading use. This likely will not have an impact the older processors in the 120 core test, but Xeon v3 processors can often see speedups of ~30% even on compute-intensive loads with moderate thread oversubscription.

– Were the core parallelism improvements made for the GPU (those which were applicable) also compiled into the CPU version?

– Running on 120 cores implies use of MPI, and brings many other factors (network, available I/O) into play when comparing speed.

– Was the CPU computation done with single precision at similar points to make the comparison fair?

In Figure 4D, the authors are claiming a 10-40x price-performance improvement. One reviewer notes that he/she configured two machines at a reputable, but inexpensive vendor. Both machines had 64 GB of RAM, a 1.2 TB SSD. One was configured with 4x 1080 GPU cards and 8 CPU cores at 2.6 GHz. The other was configured with a base-level GPU for display and 2x E5-2697Av4 -> 32 cores @ 2.6 GHz. Both machines come in at ~$9000. As compared to their "one node", this machine has 32 rather than 12 cores, and with improvements in the CPU, each is ~1.5x more powerful than their test machine. This would already bring their price/performance comparison down by 4x, ignoring other issues.

11) The reviewers do not believe the current comparison in the manuscript is fair without at least some attempt to test on current-generation CPUs, with a real attempt to ensure things are compiled and executed optimally. This need not be a month-long test, but at least a few iterations of a single model refinement to establish a baseline for modern hardware. As the major point of the manuscript is benchmarking, this is a minimum requirement for publication.

12) The following statements are not clear:

"In fact, using GPU-enabled relion, the time needed for classification is only weakly dependent on the number of classes used, whereas the CPU-based implementation has a much steeper linear dependence (Figure 3)"

and:

"In the CPU-only version, the computational time scales linearly with increased number of classes (Figure 3B) due to the serialised single image calculations, whereas GPU-enabled execution can show better-than linear scaling."

Looking at Figure 3B, this is not clear. The GPU curve also visually appears to be following a linear trend, and rough estimates show a similar scaling to the GPU curve.

The values on the GPU curve are small enough compared to the size of the dots that this is difficult to assess quantitatively, but this figure certainly doesn't illustrate the point being made.

13) In Figure 3E, the prominent lump in the middle is quite odd. The statement "GPU's ability to parallelise the drastically increased number of tasks" does not really explain it. If N independent tasks need to be completed in iteration 10, and 30N need to be completed in iteration 15, the only way this curve makes sense is if the GPU is massively underutilizing resources during iteration 10. It also would make no sense that going from 7.5 to 3.8 degree sampling would increase the number of tasks 30-fold. Something very peculiar is going on here, and it seems important to understand it.

"While relion-1.4 only exploited parallelism over images (left), in the new implementation classes and all orientations of each class are expressed as tasks that can be scheduled independently on the accelerator hardware (e.g. GPUs)"

Could the same change not be tested on CPUs as well?

14) Figure 4 – Why is this test suddenly shown for a ribosome? How does this compare on the benchmark βGal data set?

15) The use of GPUs is usually problematic because of the rapid progress in GPU technology. This means that many different GPUs are available, and older ones in particular suffer from poor feature support. The authors need to explain exactly what is required of a GPU and which GPUs are too old or inadequate.

16) The loss of information in the gridding reconstruction algorithm using single precision is not explained clearly. Typically, this indicates very small or very large values that could be avoided by some normalization approach. This could also indicate that the iterative refinement in the gridding algorithm is too fast, generating instabilities with single precision.

17) In the Introduction, the authors mention some other software packages giving the impression of a comprehensive list. However, some software packages are left out. Either be comprehensive or be more general in referring to the wiki showing software packages.

18) "This allows high-resolution structure determination with minimal bias or user input." The impression is that a Bayesian approach is less biased than other methods. This is incorrect, as Bayesian methods are just as biased as any other when it is fed erroneous data (such as poorly picked particles) or an inappropriate starting reference (prior) is used.

19) There is no information about the availability of the package (web site and license) or about deposition of the maps.

eLife. 2016 Nov 15;5:e18722. doi: 10.7554/eLife.18722.024

Author response


The main points of this discussion are noted below:

1) There was agreement that the main point of this manuscript was to advertise the existence of a GPU port of Relion, which many people in the field will likely wish to use. The changes to Relion described in the manuscript do not improve the scientific results, but improve computational efficiency that could potentially accelerate discoveries. Nevertheless, the reviewers expressed doubts that the increase in computational efficiency would stand out as a significant advance technologically or methodologically.

As previously discussed, the acceleration described in our paper has the potential to completely change how we work with image reconstruction. With the cost of image processing substantially reduced, more extensive classification strategies, for example, may now be explored. Because this now becomes possible even for labs that do not have access to large supercomputing facilities, our approach will also contribute to increasing the accessibility of high-resolution cryo-EM structure determination to new users. In the revised version, we now also describe the modifications to our algorithms on a level where others can build on them. We have extended the manuscript with a more detailed methods section, descriptions of how the various steps have been tailored to modern GPU hardware, and we also provide appendices with detailed descriptions of the technical implementations, as requested by the reviewers.

2) The reviewers also noted that the extent of the performance improvement with GPU was exaggerated and potentially misleading; this is an important parameter because it would lead to users making purchasing decisions without the correct facts. While the reviewers felt it is not possible to be completely equitable when comparing completely different hardware platforms, the view was that this manuscript is more strongly skewed towards GPU for the following reasons:

a) Very different generations of hardware were compared (the CPUs are 4 years older than the GPUs, and many changes have been made to CPUs in that time)

b) The code in Relion 2.0 was substantially optimized in the GPU version; it seems that many of these optimizations could be easily included in the CPU version of Relion 2.0. A better comparison would be to compare equivalent generations of GPU/CPUs, and equivalent versions of Relion.

Yes, as we previously discussed, this comparison should have used the latest-generation hardware. We have rewritten the text and base all comparisons on brand-new Xeon E5-2960v4 nodes. However, because of the higher cost of such nodes, because more cores share memory, and because we introduced further optimizations in the GPU code, the relative performance advantages remain. When comparing to the even more cost-efficient GTX 1070 GPUs that we now include, the performance/price ratio can even get close to a factor of 50. However, we are not interested in arguing specific numbers (which will no doubt change), so to keep the manuscript relevant in the long run we have also altered the formulations in the Abstract and now only mention order-of-magnitude advantages.

We have also included detailed specifications of hardware, and provide users with examples of both cost-efficient and performance-leading hardware (in appendix III).

We are equally interested in improving CPU performance, and will invest effort there in the future, but as seen e.g. from the efforts by Intel to accelerate RELION on Xeon Phi (which only provide a 30% speedup), this is not trivial work.

3) There was concern that the specific improvements to the code that helped improve GPU performance were not identified clearly and that it was difficult to tell which are only of interest for GPU and which can be implemented for CPU. A portion of the community may have strong interest in the details of these improvements, while others will likely only be interested in the question of which platform is more likely to be cost efficient and remain relevant over a relatively extended lifetime. However, this manuscript does not lay out these improvements clearly for the first group, but also does not provide fully controlled comparison of comparable platforms for the second group.

We were hesitant about making the initial version of the manuscript too technical, but have now rewritten the methods section to provide a better technical background to the algorithms used, where the bottlenecks are, and how this maps to GPU hardware features, and we have added two appendices with specific details of the kernel implementations for technically interested readers. We think that this has improved the paper substantially for both audiences, and we thank the reviewers for their suggestion.

The suggestions from the reviewers follow:

1) The manuscript needs to include a paragraph discussing previous EM work that used GPU implementations, encompassing the references listed below. Pubmed /26592709 is cited in section 2.1 in another context, but not for GPU implementation.

http://www.ncbi.nlm.nih.gov/pubmed/20538058
http://www.ncbi.nlm.nih.gov/pubmed/24060989
http://www.ncbi.nlm.nih.gov/pubmed/26370395
http://www.ncbi.nlm.nih.gov/pubmed/23644547
http://www.ncbi.nlm.nih.gov/pubmed/20558298
http://www.ncbi.nlm.nih.gov/pubmed/26592709

We’re delighted to add all of them, and these contributions are now mentioned in the Introduction.

2) It would be beneficial to include significantly more algorithmic details in the paper. The way it is currently written seems to present mainly the benchmark tests, while leaving largely absent the explanation of what was important for the development of the GPU code.

In the revised manuscript we have added a methods section which more thoroughly introduces parallelism and goes on to explain the fundamental aspects of the acceleration. Additionally, appendices are provided and referred to in the text, which explain and schematically show the program flow and discuss effects on performance.

3) On a technical note, regarding the experiments made for comparing CPU and GPU code and between single and double precision, one would assume that the authors used the same initial model, parameters and random seed in all cases (--random_seed option in RELION) to initialize all the runs, but this is not mentioned anywhere in the text. If this was actually done, a fairly thorough analysis of the assigned orientations and orientational probabilities could've been done where the single difference between the runs would've been the different program implementations.

All experiments were performed using the same command line; however, the non-deterministic nature of the GPU-enabled execution provides an element of variability which to some extent makes identical seeds superfluous. We nonetheless agree that this is an important consideration to mention, and this is now explicitly done in the revised manuscript. The analysis of the assigned orientations is a good suggestion; we have added histograms of the differences in orientations as Figure 5—figure supplement 1.

4) In paragraph one of the Results section, the mention of what the authors believe to have the most prominent influence on performance is only vaguely explained:

"To overcome this, our implementation instead stores oversampled Fourier transforms of every class-reference in GPU memory, and extracts 2D-slices (in any orientation) on demand. By utilising fast-access data structures known as textures (normally used to project images on 3D objects), on-demand projection in fact achieves faster execution compared to reading pre-calculated projections from memory."

i) The authors claim to store oversampled Fourier transforms of every class-reference in GPU memory. What kind of transforms are being referred to, 3D or 2D? If this is a 3D oversampled FFT volume, then it should be mentioned and the sentence needs to be corrected to have one FFT transform per class-reference. On the other hand, if the authors mention 2D reference FFT slices (which is less likely because it is not consistent with the on-demand projection statement), they need to elaborate on how many such slices they used and on angular sampling strategies.

For 3D refinements or 3D classification, we keep oversampled 3D Fourier transforms in memory and take 2D slices out of these on demand. This is covered in more depth in the expanded methods section. The phrase “Instead, like the CPU code, our implementation stores a two-fold oversampled Fourier transform of each reference in GPU memory, and 2D slices (along any orientation) are extracted only when needed.” has also been added to clarify the storage of the transformed volumes.

ii) The authors need to expand upon using texture operations for the on-demand projection. Specific citation(s) would be helpful.

We now explain textures and texture units, and how they can be used to improve performance, in the expanded methods section and appendix I.
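
For readers interested in what this looks like in practice, the following is a minimal CUDA sketch of the idea, under the assumption that the complex 3D transform is stored as float2 values: the volume is placed in a 3D texture configured for linear filtering, and each thread reads one pixel of a 2D central slice along an arbitrary orientation, with the texture unit performing the trilinear interpolation in hardware. Function names, the row-major rotation matrix and the half-texel offset convention are illustrative assumptions, not RELION's actual code.

    // Illustrative sketch of on-demand slice extraction from a 3D Fourier volume
    // held in texture memory (hardware trilinear interpolation). Not RELION code.
    #include <cuda_runtime.h>

    __global__ void extract_slice(cudaTextureObject_t volTex, // 3D complex volume (float2 texture)
                                  const float* R,             // 3x3 rotation matrix, row-major
                                  float2* slice,              // output 2D Fourier slice
                                  int sliceDim, int volDim)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= sliceDim || y >= sliceDim) return;

        // Central-slice theorem: rotate the centred 2D frequency into the 3D transform.
        float xc = x - sliceDim / 2;
        float yc = y - sliceDim / 2;
        float xp = R[0] * xc + R[1] * yc + volDim / 2 + 0.5f; // +0.5f addresses texel centres
        float yp = R[3] * xc + R[4] * yc + volDim / 2 + 0.5f;
        float zp = R[6] * xc + R[7] * yc + volDim / 2 + 0.5f;

        // The texture unit performs the trilinear interpolation in hardware.
        slice[y * sliceDim + x] = tex3D<float2>(volTex, xp, yp, zp);
    }

    // Host-side helper: upload the complex volume into a cudaArray and wrap it in a
    // texture object configured for linear (trilinear) filtering.
    cudaTextureObject_t make_volume_texture(const float2* hostVol, int dim)
    {
        cudaChannelFormatDesc ch = cudaCreateChannelDesc<float2>();
        cudaExtent extent = make_cudaExtent(dim, dim, dim);
        cudaArray_t arr;
        cudaMalloc3DArray(&arr, &ch, extent);

        cudaMemcpy3DParms copy = {};
        copy.srcPtr = make_cudaPitchedPtr((void*)hostVol, dim * sizeof(float2), dim, dim);
        copy.dstArray = arr;
        copy.extent = extent;
        copy.kind = cudaMemcpyHostToDevice;
        cudaMemcpy3D(&copy);

        cudaResourceDesc res = {};
        res.resType = cudaResourceTypeArray;
        res.res.array.array = arr;

        cudaTextureDesc td = {};
        td.addressMode[0] = td.addressMode[1] = td.addressMode[2] = cudaAddressModeClamp;
        td.filterMode = cudaFilterModeLinear; // interpolation is done by the texture unit
        td.readMode = cudaReadModeElementType;
        td.normalizedCoords = 0;

        cudaTextureObject_t tex = 0;
        cudaCreateTextureObject(&tex, &res, &td, nullptr);
        return tex;
    }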

5) There are inconsistent statements regarding the differences between the methods used in RELION 1.4 and the new accelerated version of RELION. On one side the authors state: "Neither method nor behaviour of RELION has changed from that of version 1.4.", but later on they describe algorithmic changes to the particle picking strategy in Section 2.2. The only tangible difference appears to be the discarding of resolution information in micrographs beyond that of search templates, but this has to be properly highlighted and the statement in Section 2.1 revised to reflect the fact that there are changes, albeit subtle, between versions 1.4 and 2.0 other than the GPU speedup. It will save end-users time and confusion if all of the changes are documented.

The conflicting statements have been resolved by removing the former. All implemented changes have been described and documented.

Is it correct that Relion adopts a maximum-likelihood approach, and not a Bayesian approach, because all the priors (distributions) it uses are refined? It would be good if a person familiar with the difference between "Bayesian" and "Maximum Likelihood" could clarify this point so that it is right in the published version.

This is not entirely correct, although much of this discussion is semantics. RELION uses a regularised likelihood approach, where a likelihood function is complemented with a prior probability of the model. According to Bayes' law, the regularised likelihood function is proportional to the posterior probability of the model given the experimental data. Because RELION estimates the parameters of the prior from the data themselves, this type of approach is called an empirical Bayesian approach (see Wikipedia for more details: https://en.wikipedia.org/wiki/Empirical_Bayes_method). Note that un-regularised likelihood optimisation algorithms (such as those implemented in Xmipp, and in a limited manner also in Frealign) do not include a prior on the model. These approaches are also called "maximum likelihood", but often depend on more ad-hoc regularisation approaches, for example through customised low-pass filters. To distinguish RELION from the unregularised likelihood approaches, we use the term "Bayesian". In response to the reviewer's comment, we've added the distinction of "empirical" Bayesian to the revised manuscript, and explain the entire algorithm better in the methods section.
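
For reference, the relationship invoked above is simply Bayes' law; the notation below is generic rather than taken from the manuscript, with X denoting the experimental images and \Theta the model (including the 3D references):

    % Bayes' law underlying the regularised-likelihood target; generic notation.
    \begin{equation}
      P(\Theta \mid X) \;=\; \frac{P(X \mid \Theta)\, P(\Theta)}{P(X)}
      \;\propto\; P(X \mid \Theta)\, P(\Theta)
    \end{equation}

Maximising the product on the right is what is meant by regularised likelihood, and estimating the parameters of the prior P(\Theta) from the data themselves is what makes the approach an empirical Bayesian one.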

6) There are several comparisons noted in the paper that need to be clarified.

i) The comparison drawn between overall processing time and the time it took the authors to download the data from EMPIAR is meaningless; many factors can influence the download speed of large datasets, and using that as a benchmark can be misleading. The typical RELION user will probably not be downloading data from a database for processing but will instead be collecting and processing their own data. A comparison against the time it takes to collect the data would therefore be more appropriate and would give the reader a better reference point for the other rate-limiting steps in single-particle cryo-EM. Another useful comparison is to estimate the number of CPUs required to process the data in the same time (115h) that it took for a GPU-driven solution.

This is a valid objection. This data set comprises more than 1,500 micrographs, and although no exact timings for the data acquisition are given in the corresponding paper, it is likely that the data acquisition took multiple days. Therefore, we've replaced this statement and instead state that the total processing time was probably comparable to the data acquisition time.

Regarding the second point and quantitative comparisons of CPU and GPU hardware: such comparisons are given in detail for the ribosome benchmarks, and Figure 4 as well as appendix III now provide both performance and performance/price ratios for different GPU options. The section on the betagal data set is instead intended as a representative use case, illustrating that it is now indeed possible to process even one of today's largest data sets on a single GPU workstation.

ii) Distributing a job across a cluster leads to substantial overhead (e.g. network traffic) that will not be present when running on a single workstation. While it may be justified to look at the overhead as one reason for the slower execution of jobs on CPU systems, this should ideally be separated in benchmark tests from the actual CPU hours needed to complete the job. Maybe some of the tests could be rerun on a single workstation to exclude cluster-related overhead.

Because the absolute speed of the CPU implementation is so much lower, the network overhead is presently not a major limiting factor; when comparing performance for 1 vs. 10 CPU nodes in our benchmark, the communication overhead during the expectation step was below 5% when using a 10 Gb network. A comparison of CPU and GPU implementations on a single desktop would only serve to illustrate that the CPU implementation is entirely unfeasible to run on desktops (it would take over a month), whereas it works well with GPUs. Therefore, a limited amount of network overhead is inescapable for realistic processing using CPUs, whereas it can be avoided using GPUs.

However, doing extensive MPI benchmarking fairly would require an entire paper of its own, so we have instead removed our statement about MPI overhead.

7) It is not clear who might be the intended reader of the manuscript. It is clearly not meant for an average user of the Relion package, who most likely is a structural biologist oblivious to the intricacies of computer code and its dependence on the hardware platform. The problems mentioned are immediately apparent at the beginning of the Results section. The fact that "class" is used to mean "3D structure" took me by surprise, particularly since one has to compare Figure 1 with Figure 2 to realize that. Some formulations are cryptic – "Even individual image pixels are evaluate independently from on another". Specifically, how does one evaluate individual pixels, for what purpose, and why would doing it independently constitute an advantage?

We now include more algorithmic detail in the expanded methods section, and we also add more useful information on the different hardware platforms for the average RELION user (Figure 4C, appendix III). We agree that the modern structural biologist interested in doing cryo-EM does need some minimum knowledge about issues pertaining to high-performance computing in order to make the best use of the available hardware (using RELION or any other program). By adding more information about the different aspects of the computations in our implementation, the revised manuscript should also be useful in that respect. The formulation of pixel “evaluation” has also been revised to more directly indicate any logical or arithmetical operation.

Results section, first paragraph; what is a "class-reference"? What follows is difficult to understand – if, as stated, 2D slices are extracted from a presumably 3D volume, this constitutes a projection operation, but the next sentence casts doubt on this, as it states that this is being done using textures "normally used to project images on 3D objects". So is it not a projection? Or is some other projection meant?

Subsequently, much is made of the number of classes, but it is not clear what these classes are and why one would want to have them. How is their number decided? What if one wants to perform the analysis without any classes – is that possible? What would the speed-up be?

We have removed the term class-reference from the manuscript. We now speak about "3D reference maps, or 3D classes" (e.g. subsection “Regularised likelihood optimisation”) to make this clearer. The number of classes to refine is a user-defined parameter, as is the case with most multi-reference refinement programs in the field.

The most cryptic is the closing sentence of the section: "Neither method nor behaviour of relion has changed from that of version 1.4." First, what is so special about version 1.4? How is a casual reader supposed to know that or, worse, care about this number? It would also appear that the statement is contradicted by what follows, as much is made of the single-precision GPU implementation as opposed to the double precision of the general-purpose original code. As an aside, why is double precision "so-called"? I am not sure what is meant here.

Finally, the section on Limited precision does not offer much beyond generalities and ends with a surprising statement: "Execution of the iterative gridding algorithm… appeared to show significant loss of information." Appeared? How carefully was this investigated? What is "significant"?

Version 1.4 is simply the most recent stable version of RELION. We've clarified this in the revised manuscript by instead referring to 'previous versions' of RELION. We have also removed the “so-called” statement regarding double precision. With regard to the gridding algorithm, we felt it necessary to briefly comment on what was omitted from GPU acceleration, but felt it unnecessary to elaborate on changes not made. In the revised manuscript we have removed the "significant" statement, and now mention the gridding algorithm as an opportunity for further improvement.

8) Figures are microscopic, which probably reflects the authors' interest in atomic-scale objects, but at the same time they are filled with details essential for understanding the text. At the very least, they should be properly referenced in the text in key places. The final maps should also be deposited, as is customary in the field; this intent is not indicated in the manuscript.

We don't really understand how this happened (on our screen/prints the figures aren't microscopic at all). Regardless, we apologize for any inconvenience caused. All figures are vector based and will be rescaled to balance layout and readability, and we have increased the relative size of small type and objects. We have deposited the β-galactosidase map at the EMDB (EMD-4116). We chose not to deposit the ribosome reconstruction from Figure 5 (as numbered in the revised manuscript), as it is basically identical to the EMDB entry 2660, so would only constitute a duplication of an existing entry.

9) As this manuscript may entice people to spend tens to hundreds of thousands of dollars in research funds, it is incumbent upon the authors and reviewers to ensure that this study is as balanced and fair as possible. There was a similar pro-GPU move ~8 years ago, when Frealign, EMAN2 and SPARX were all parallelized for the GPU, achieving typically 20-80x speedups; many labs purchased hardware, but few made effective use of it for very long. While the reviewers do not dispute the accuracy of the presented results, they believe that the results are highly biased by the specific hardware selection, and by issues not discussed in the manuscript.

There are two aspects to this point. First, there is the point of hardware selection. Second, the point of long-term usefulness of the code.

To address the first point, we have expanded the comparison of different combinations of hardware as part of Figure 4 (and appendix III). This comparison is not a comprehensive study of all hardware available on the market, but rather illustrates the considerations relevant for modern GPUs. We believe that the relatively wide spread of different GPUs (and CPUs) gives potential users a reasonable insight into what hardware to buy.

Regarding the second point, the impact of any software implementation depends on long-term maintenance, and on the code being incorporated in the main codebase rather than as a separate spin-off from an old release. This is one key reason why everyone uses GPUs for some cryo-EM programs like Gctf and Motioncorr (1 or 2), but not e.g. Frealign. We note here that our implementation represents a close collaboration between the Scheres and Lindahl groups, and is not a one-off optimization of a specific application version, but a long-term commitment to GPUs for RELION, now part of the main codebase. We are ourselves also making large hardware investments in GPUs. In addition, after having supported GPUs for close to a decade in Gromacs, the molecular simulation users have switched entirely to GPUs. We believe this backs up our claims to long-term commitment.

10) There is some price/performance advantage in GPUs, but the current estimate is simply ludicrous, and many people reading this manuscript will not understand the technical differences. Primary issues that occur immediately are:

– The processors used in the 120-core test are a 4-year-old model considered "end of life" by Intel. These are being compared to a set of 4 GPUs so new that they are difficult to acquire from most vendors.

We understand these concerns and have updated the benchmarks to use newer CPU hardware. However, not all benchmarks have improved in favour of the CPU when considering price/performance, and the cheap GTX 1070 cards provide even better performance/price ratios. Still, despite this we have decided to tone down the statements and no longer mention specific numbers in the Abstract, to focus on principles rather than numbers.

The vectorization and architecture changes made by Intel in recent years, major hyperthreading improvements, and other factors will come into play. Specifically:

– CPU optimization for the specific processors in use? Image processing code compiled for specific current-generation CPUs with -march=native often sees speedups of 30% or more, without any coding changes.

We used hyper-threading and the fastest (as far as we know) flags for the x86 code (RELION uses -O3 by default, which includes -march=native). Intel has also worked to accelerate RELION-1.4 for their latest Xeon Phi processors. When changing the code and selecting the best compiler flags, they see a 20-30% performance improvement – but only when adding a $4800 accelerator (the cost per node would be much higher than for our nodes above). While this indicates possible improvements on the CPU side, they do not amount to those estimated by the reviewer.

– Hyperthreading use. This likely will not have an impact on the older processors in the 120-core test, but Xeon v3 processors can often see speedups of ~30% even on compute-intensive loads with moderate thread oversubscription.

As mentioned above, hyperthreading was of course utilized for all CPU runs as is now clarified by “While we refer to the physical core count when describing hardware, hyperthreading was enabled and used for all benchmarks since it improves performance on the CPU side slightly.”

– Were the core parallelism improvements made for the GPU (those which were applicable) also compiled into the CPU version?

Some concepts – such as partial use of single precision as in the GPU implementation – were tested on the CPU but made execution slower (due to extra conversions), and they were thus not applied in the code nor in any benchmarks. Others – e.g. the autopick filtering – were clearly beneficial and were thus implemented in both the GPU and CPU code-bases. More generally, the extraction of large amounts of parallelism is key for GPUs, which require some 10,000 or more independent tasks to achieve good performance, meaning that the present developments are largely inapplicable to CPUs, which rather achieve good performance with a handful of general-purpose cores. Applying the GPU-intended methods and algorithms directly to CPUs would just cause overhead. Generalised vector operations are, however, noted as a future improvement.

– Running on 120 cores implies use of MPI, and brings many other factors (network, available I/O) into play when comparing speed.

The mentioned overhead is in part addressed in the updated benchmarks, where all nodes make a copy of the data to a local disk prior to the computations, thereby removing the major part of the network/MPI communication cost. For our benchmark using 10 vs. 1 CPU nodes, during the expectation step we never observed communication overhead in excess of 5% (on 10GbE). However, as mentioned above, there is also not much alternative to MPI when using CPU clusters, since few users (in our opinion) would be willing to wait a month for the results just to achieve slightly higher efficiency by limiting a run to a single node.

– Was the CPU computation done with single precision at similar points to make the comparison fair?

Unfortunately, the performance of the CPU version does not improve with a simple translation of instructions to use single precision (mainly due to conversions to double precision in the parts that require more accurate computations). We would additionally like to note that it is not merely a matter of “changing double to single”: to get accurate results from single precision on CPUs we would need to implement similar amendments to core algorithms as in the GPU code, and since there is little or no performance improvement this is presently of limited value. Even if this were done, the relative advantage of single precision is much higher on GPUs, both because of a factor ~8 higher performance and because of the use of texture units. We have updated the text to explain this better.

In Figure 4D, the authors are claiming a 10-40x price-performance improvement. One reviewer notes that he/she configured two machines at a reputable, but inexpensive vendor. Both machines had 64 GB of RAM, a 1.2 TB SSD. One was configured with 4x 1080 GPU cards and 8 CPU cores at 2.6 GHz. The other was configured with a base-level GPU for display and 2x E5-2697Av4 -> 32 cores @ 2.6 GHz. Both machines come in at ~$9000. As compared to their "one node", this machine has 32 rather than 12 cores, and with improvements in the CPU, each is ~1.5x more powerful than their test machine. This would already bring their price/performance comparison down by 4x, ignoring other issues.

Yes, as we acknowledge above it is more fair to use the very latest Xeon cores for a comparison, and we now have access to such nodes. However, as seen from Figure 4, with the latest code version running on brand-new Xeon cores vs. GTX1080 the difference persists – and then we have not even included the 25% faster Titan-X Pascal GPUs.

Rather than discussing how theoretically “powerful” different cores might be (the reviewer e.g. does not take memory bandwidth into account), we focus on how fast the reconstruction actually runs on the hardware, taking a pragmatic view which we believe benefits users better.

There are definitely highly priced GPU systems around too, in particular when selecting dual CPU socket systems (which are not needed for RELION). We have provided publicly available (www.cryoem.se/RELION-gpu) examples of specific hardware which we have purchased, tested and re-evaluated recently, at significantly better prices. However, ultimately we do not think specific prices should be featured in a scientific paper, since they change rapidly. We think it is better to show the performance for different hardware, and encourage educated users to make their own decision based on current prices. Because we do however realize the need for some guidance on purchase options for our users, we will maintain the blog mentioned above, and have also added benchmark results with different types of hardware on the Relion WIKI page (http://www2.mrc-lmb.cam.ac.uk/relion/index.php/Benchmarks_%26_computer_hardware).

11) The reviewers do not believe the current comparison in the manuscript is fair without at least some attempt to test on current-generation CPUs, with some real attempt to ensure things are compiled and executed optimally. This need not be a month-long test, but at least a few iterations of a single model refinement to establish a baseline for modern hardware. As the major point of the manuscript is benchmarking, this is a minimum requirement for publication.

Agreed – as described above, we have now completely replaced the previous benchmarks using state-of-the-art CPU hardware, which we believe backs up our statements better.

12) The following statements are not clear:

"In fact, using GPU-enabled relion, the time needed for classification is only weakly dependent on the number of classes used, whereas the CPU-based implementation has a much steeper linear dependence (Figure 3)"

and:

"In the CPU-only version, the computational time scales linearly with increased number of classes (Figure 3B) due to the serialised single image calculations, whereas GPU-enabled execution can show better-than linear scaling."

Looking at Figure 3B, this is not clear. The GPU curve also visually appears to be following a linear trend, and rough estimates show a similar scaling to the GPU curve.

The values on the GPU curve are small enough compared to the size of the dots that this is difficult to assess quantitatively, but this figure certainly doesn't illustrate the point being made.

The indicated statements have been reformulated to “For sufficiently large computational problems, RELION’s processing time scales linearly with increased number of classes, but since the extra calculations are much faster with the GPU-enabled version the relative advantage is larger the more classes are used (Figure 4D).” This also improves the clarity of the message conveyed by the indicated figure.

13) In Figure 3E, the prominent lump in the middle is quite odd. The statement "GPU's ability to parallelise the drastically increased number of tasks" does not really explain it. If N independent tasks need to be completed in iteration 10, and 30N need to be completed in iteration 15, the only way this curve makes sense is if the GPU is massively underutilizing resources during iteration 10. It also would make no sense that going from 7.5 to 3.8 degree sampling would increase the number of tasks 30-fold. Something very peculiar is going on here, and it seems important to understand it.

"While relion-1.4 only exploited parallelism over images (left), in the new implementation classes and all orientations of each class are expressed as tasks that can be scheduled independently on the accelerator hardware (e.g. GPUs)"

Could the same change not be tested on CPUs as well?

The reason for the “lump” is simply that the fine exhaustive search requires a huge number of floating-point operations and a great deal of memory bandwidth, which becomes very costly on the CPU. The reason this is not evident on the GPU is that the reviewer is quite correct: we are still nowhere near utilizing the full GPU resources in the early iterations (the GPU time per iteration does go up by ~50% in this region, but the absolute number is still much lower than on the CPU).

The factor of ~30 stems from the two-fold increase in sampling along each of the five fitting dimensions (three angular and two translational), whereby the load increases by 2^5 = 32. Similar parallel procedures could be adopted for the CPU implementation; however, the anticipated improvement does not motivate the time investment, as explained previously.
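
As a purely illustrative sketch (the structure is hypothetical and simplified; it is not the actual RELION scheduler), the following C++ program flattens the work over images, classes and all orientation/translation samples into a single task list, and shows why halving the sampling step in all five search dimensions multiplies the number of independent tasks by 2^5 = 32:

    // Hypothetical sketch of task flattening over images, classes and
    // orientation/translation samples (not RELION's actual scheduler).
    #include <cstdio>
    #include <vector>

    struct Task { int image, cls, rot, tilt, psi, dx, dy; };

    std::vector<Task> buildTasks(int nImages, int nClasses, int samplesPerDim) {
        std::vector<Task> tasks;
        for (int img = 0; img < nImages; ++img)
            for (int c = 0; c < nClasses; ++c)
                for (int rot = 0; rot < samplesPerDim; ++rot)
                    for (int tilt = 0; tilt < samplesPerDim; ++tilt)
                        for (int psi = 0; psi < samplesPerDim; ++psi)
                            for (int dx = 0; dx < samplesPerDim; ++dx)
                                for (int dy = 0; dy < samplesPerDim; ++dy)
                                    tasks.push_back({img, c, rot, tilt, psi, dx, dy});
        return tasks;
    }

    int main() {
        const int nImages = 2, nClasses = 4;
        size_t coarse = buildTasks(nImages, nClasses, 2).size();
        size_t fine   = buildTasks(nImages, nClasses, 4).size();  // sampling step halved
        std::printf("coarse tasks: %zu\nfine tasks  : %zu\nratio       : %zu\n",
                    coarse, fine, fine / coarse);                 // ratio = 2^5 = 32
        return 0;
    }

Because tasks of this kind can be scheduled concurrently on the accelerator, the extra work of the finer search is largely hidden as long as the device is not yet saturated, which is consistent with the modest increase in GPU time per iteration noted above.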

14) Figure 4 – Why is this test suddenly shown for a ribosome? How does this compare on the benchmark βGal data set?

Throughout the manuscript, we use the Plasmodium falciparum ribosome data set to benchmark each of the different algorithms separately (e.g. auto-picking, 3D classification and 3D auto-refinement). We have chosen this data set (for which so-called 'polished particles' from RELION's movie-processing procedure are available for download from the EMPIAR database by anyone interested) as the standard benchmark for RELION. The βGal data set is used as an additional illustration of the impact of the GPU implementation on an entire processing workflow. We believe that a second example data set, being the highest-resolution data set available in the EMPIAR database, adds value to the paper. We have added the following sentence to the revised manuscript to clarify this: "This represents the largest presently available dataset in the EMPIAR database, and provides a realistic challenge."

15) The use of GPUs is usually problematic because of rapid progress in GPU technology. This means that many different GPUs are available and, in particular, older ones suffer from poor feature support. The authors need to explain exactly what is required in a GPU and which GPUs are too old or inadequate.

We understand the reviewer’s concern that rapid development of hardware can be problematic for code optimizations directed at specific hardware. We would argue that the rapid progress in GPU technology is good, and provides opportunities for long-term commitments that exploit such hardware competently and without excessive optimization, of which we believe RELION is a good example. We do agree that information about performance and support for older cards is useful, and we have added this to the dataset and hardware section. In short, we support at least the three latest generations of NVIDIA GPUs and will continue to provide updated support and development, making the rapid progress of GPU hardware an asset rather than a problem.

16) The loss of information in the gridding reconstruction algorithm using single precision is not explained clearly. Typically, this indicates very small or very large values that could be avoided by some normalization approach. This could also indicate that the iterative refinement in the gridding algorithm is too fast, generating instabilities with single precision.

Yes, this is an excellent point, and we already use this normalization approach in other parts of the code. Because the current implementation avoids the issue by leaving the gridding algorithm non-accelerated (i.e. it always runs on the CPU), we now mention this as an opportunity for future improvement in the revised manuscript, as mentioned above.
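
As a hedged illustration of the normalization idea (this is not the gridding code itself, and the values are made up), the following C++ sketch shows how rescaling by the largest magnitude keeps intermediate values within single-precision range, whereas the naive computation underflows to zero:

    // Illustrative only: avoiding single-precision underflow by rescaling.
    #include <cstdio>
    #include <cmath>
    #include <vector>
    #include <algorithm>

    float naiveNorm(const std::vector<float>& v) {
        float s = 0.0f;
        for (float x : v) s += x * x;        // squares of tiny values underflow to 0
        return std::sqrt(s);
    }

    float scaledNorm(const std::vector<float>& v) {
        float m = 0.0f;
        for (float x : v) m = std::max(m, std::fabs(x));
        if (m == 0.0f) return 0.0f;
        float s = 0.0f;
        for (float x : v) { float r = x / m; s += r * r; }  // intermediates stay in range
        return m * std::sqrt(s);             // undo the scaling at the end
    }

    int main() {
        std::vector<float> tiny = {3e-25f, 4e-25f};          // true norm = 5e-25
        std::printf("naive : %g\nscaled: %g\n", naiveNorm(tiny), scaledNorm(tiny));
        return 0;
    }

A rescaling along these lines could, in principle, be applied to the accumulated values in the gridding step; this is the kind of opportunity for future improvement referred to above.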

17) In the Introduction, the authors mention some other software packages giving the impression of a comprehensive list. However, some software packages are left out. Either be comprehensive or be more general in referring to the wiki showing software packages.

We feel it would be inappropriate not to cite at least some of the available alternative programs, so we have extended this list with several additional software packages to be more comprehensive.

18) "This allows high-resolution structure determination with minimal bias or user input." The impression is that a Bayesian approach is less biased than other methods. This is incorrect, as Bayesian methods are just as biased as any other when they are fed erroneous data (such as poorly picked particles) or an inappropriate starting reference (prior) is used.

Agreed. We have removed the word 'bias' from the sentence.

19) There is no information about the availability of the package (web site and license) or about deposition of the maps.

We have added the following sentence to the manuscript: "RELION-2 is both open source and free software, distributed under the GPLv2 licence. It is publicly available for download through http://www2.mrc-lmb.cam.ac.uk/relion." We also provide the accession code of the newly deposited map and that of the previously available ribosome map.

