CLASSIFICATION BY BOOTSTRAPPING IN SINGLE PARTICLE METHODS

Hstau Y Liao; Joachim Frank

doi:10.1109/ISBI.2010.5490386

Author manuscript; available in PMC: 2010 Aug 20.

Published in final edited form as: Proc IEEE Int Symp Biomed Imaging. 2010 Apr 14;2010:169–172. doi: 10.1109/ISBI.2010.5490386


PMCID: PMC2924593 NIHMSID: NIHMS180400 PMID: 20729994

Abstract

In single-particle reconstruction methods, projections of macromolecules at random orientations are collected. Often, several classes of conformations or binding states coexist in a biological sample, which requires classification, so that each conformation can be reconstructed separately. In this work, we examine bootstrap techniques for classifying the projection data. When these techniques are applied to variance estimation, the projection images (particles) are randomly sampled with replacement from the data set and a bootstrap volume is reconstructed from each sample. In a recent extension of the bootstrap technique to classification, each particle is assigned to a volume in the space spanned by the bootstrap volumes, such that the projection of the assigned volume best matches the particle. In this work we explain the rationale of these techniques by discussing the nature of the bootstrap volumes and provide some statistical analyses.

Keywords: classification, electron microscopy, bootstrapping, single particle, variance estimation

1. INTRODUCTION

In single-particle reconstruction methods [1], projections of macromolecules at random, unknown orientations are collected by a transmission electron microscope. Often, several classes of conformations or binding states coexist in a sample. To obtain accurate structures, the classes must be separated before the macromolecule is reconstructed. In this work, we take a close look at bootstrap techniques for classifying the projection data. In the bootstrap techniques for variance estimation [2], the projection images (or particles) are randomly sampled with replacement from the data set and a bootstrap volume is reconstructed from each sample, assuming the orientations to be known. In a recent extension of the bootstrap technique to classification [3], each particle is assigned to a volume in the space spanned by the bootstrap volumes, such that the projection (in the same orientation as the particle) of the assigned volume best matches the particle. Then, a clustering algorithm applied to the assigned volumes determines the class to which the particle belongs. In this work we explain the rationale of these techniques by discussing the nature of the bootstrap volumes, i.e., how they relate to the underlying structural classes. Furthermore, several statistical analyses become straightforward to apply in our framework. Finally, we closely examine the way the particles are assigned to volumes in the space spanned by the bootstrap volumes; our proposed solution differs from that given in [3].

In Section 2 we discuss the nature of the bootstrap volumes and the effect of noise, as well as the classification method based on the analysis of the bootstrap volumes (‘bootstrap classification’). Section 3 shows the results obtained by bootstrap classification for simulated and experimental data and a comparison of the bootstrap method with a maximum likelihood classification approach [4]. Finally, discussion and conclusions are provided in Section 4.

2. BOOTSTRAPPING IN THE SINGLE PARTICLE METHOD

The aim of the bootstrapping technique [6] is to estimate the sampling distribution of an estimator by sampling with replacement from a given sample. It is a general-purpose approach to statistical inference, which circumvents the problem posed by the unavailability of large sample-size data.
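As a generic illustration of the idea (not specific to electron microscopy), the following minimal sketch resamples a one-dimensional data set with replacement and uses the spread of the recomputed estimator as an estimate of its sampling variability. The function name `bootstrap_distribution`, the toy Gaussian data, and the number of resamples are assumptions made for this example.

```python
import numpy as np

def bootstrap_distribution(sample, estimator, n_boot=2000, seed=0):
    """Return estimator values computed on n_boot resamples drawn with replacement."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample)
    n = sample.shape[0]
    values = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # n indices drawn with replacement
        values[b] = estimator(sample[idx])
    return values

# Toy data: 200 draws from a Gaussian (hypothetical example data).
rng = np.random.default_rng(1)
data = rng.normal(loc=5.0, scale=2.0, size=200)

boot_means = bootstrap_distribution(data, np.mean)
boot_se = boot_means.std(ddof=1)                    # bootstrap standard error of the mean
plugin_se = data.std(ddof=1) / np.sqrt(data.size)   # textbook standard error, for comparison
```

For the sample mean the two standard errors should be close; the appeal of the bootstrap is that the same recipe applies when no closed-form expression is available, as is the case for a reconstructed volume.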

2.1. Variability of classes via the bootstrap method

If we repeatedly sample the projection data with replacement and reconstruct a 3D volume from each sample (assuming the corresponding orientations are known), we obtain an estimate of the probability distribution that reflects the “variability” in the data. This variability, which is estimated as the variance of the bootstrap volumes, is due not only to the presence of different conformational or binding states, whose detection is the goal of 3D variance estimation [2], but also to imperfections in the data collection such as instrument shot noise, “background” noise, reconstruction artifacts, contrast transfer function (CTF) effects [1], alignment error, etc. Two major sources of variance are the coexistence of different conformational or binding states and instrument noise. Unlike the latter, which is characteristic of the 2D projection data, the former is three-dimensional in nature. Therefore, care must be taken when relating the two. Here we attempt to establish such a relation by describing how the bootstrap volumes relate to the underlying true structures of the classes.

2.2. Bootstrap volumes and the class structures

We show that the bootstrap volumes are in fact approximations to convex combinations of the true structures. For simplicity of discussion, we consider the case where the projection data come from a molecule occurring in M = 2 conformations. The analysis for more than two conformations follows straightforwardly. Assuming a discrete model, given the data $y \in R^{I}$, the least-squares estimator is a popular criterion¹ for finding the true volume $x \in R^{J}$:

$$x_{LSQ} = \arg \min_{x} \| y - R x \|, \tag{1}$$

where R is the discrete Radon transform and $\| \cdot \|$ is the Euclidean norm. Assuming that $R'R$ is invertible ($R'$ denotes the transpose of R), the solution to (1) has a closed form

$$x_{LSQ} = (R' R)^{-1} R' y. \tag{2}$$
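The closed form (2) can be checked numerically on a small problem. In the sketch below a random Gaussian matrix stands in for the discrete Radon transform R; that substitution, the toy sizes, and all variable names are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_data, n_vox = 60, 8                    # toy sizes: data length and number of voxels J
R = rng.normal(size=(n_data, n_vox))     # stand-in for the discrete Radon transform
x_true = rng.normal(size=n_vox)
y = R @ x_true                           # noise-free data

# Closed form (2): solve the normal equations (R'R) x = R'y.
x_normal = np.linalg.solve(R.T @ R, R.T @ y)

# The same minimizer from a least-squares solver that avoids forming R'R.
x_lstsq, *_ = np.linalg.lstsq(R, y, rcond=None)
```

In practice one would rarely form $(R'R)^{-1}$ explicitly for a full-size volume; iterative schemes such as the SIRT algorithm mentioned in the footnote are the usual route.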

In bootstrapping, data come from the two structures. Let $y_{1} \in R^{H_{1} I}$ and $y_{2} \in R^{H_{2} I}$ be the respective sampled projection data, where H_i is the number of projection images taken from class i = 1, 2 and I is the number of pixels in a projection image. Let $R_{1} \in R^{(H_{1} I) \times J}$ and $R_{2} \in R^{(H_{2} I) \times J}$ be the corresponding Radon transforms; i.e., if $x_{1}, x_{2} \in R^{J}$ are the true volumes, then in the absence of imperfections in the data

$$y_{i} = R_{i} x_{i}, \tag{3}$$

for i = 1, 2. Without loss of generality, we can set the sampled data to be $y = \begin{bmatrix} y_{1} \\ y_{2} \end{bmatrix}$ and $R = \begin{bmatrix} R_{1} \\ R_{2} \end{bmatrix}$. Substituting these values of y and R in (2), we obtain an expression for the reconstructed bootstrap volume, based on the least-squares criterion,

$$x_{BS} = (R_{1}' R_{1} + R_{2}' R_{2})^{-1} (R_{1}' y_{1} + R_{2}' y_{2}). \tag{4}$$

Taking into account (3) and the fact that $R_{1}^{'} R_{1} + R_{2}^{'} R_{2} = R^{'} R$, the bootstrap volume can be viewed as a sum of linear transformations of the true volumes x₁ and x₂

$$x_{BS} = (R' R)^{-1} (R_{1}' R_{1} x_{1} + R_{2}' R_{2} x_{2}), \tag{5}$$

whose linear transformations ${(R^{'} R)}^{- 1} R_{i}^{'} R_{i}, i = 1, 2$ , sum up to the identity matrix of appropriate size.
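The algebra leading from (4) to (5) can be verified on a toy example. Random Gaussian matrices again stand in for the Radon transforms R₁ and R₂ (an assumption for illustration; `T1` and `T2` are names introduced here for the two linear transformations).

```python
import numpy as np

rng = np.random.default_rng(0)
J, rows1, rows2 = 6, 40, 25              # voxels; data lengths H_1*I and H_2*I
R1 = rng.normal(size=(rows1, J))
R2 = rng.normal(size=(rows2, J))
x1 = rng.normal(size=J)
x2 = rng.normal(size=J)
y1, y2 = R1 @ x1, R2 @ x2                # equation (3), no imperfections

G = R1.T @ R1 + R2.T @ R2                # equals R'R for the stacked R
x_bs = np.linalg.solve(G, R1.T @ y1 + R2.T @ y2)   # equation (4)

T1 = np.linalg.solve(G, R1.T @ R1)       # (R'R)^{-1} R_1'R_1
T2 = np.linalg.solve(G, R2.T @ R2)       # (R'R)^{-1} R_2'R_2
x_bs_5 = T1 @ x1 + T2 @ x2               # equation (5)

R = np.vstack([R1, R2])                  # stacked operator used in (2)
```

The two transformations sum to the identity exactly; what is only approximate is the further step of replacing each of them by a scalar multiple of the identity.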

We have shown that a bootstrap volume is a sum of linear transformations of the true classes. In fact, this sum is an approximation to a convex combination of the classes. To see this, we note that the effect of $R_{i}^{'} R_{i}, i = 1, 2$, is essentially a blurring with a kernel that goes like 1/r (r is the radial distance; [8]) multiplied by a factor that is proportional to the number of projection images H_i taken from class i = 1, 2. Thus, the effect of the linear transformations ${(R^{'} R)}^{- 1} R_{i}^{'} R_{i}$ is basically a multiplication by the constant factor H_i/H, where H is the total number of projections in a bootstrap sample; i.e.,

$$x_{BS} \simeq \frac{H_{1}}{H} x_{1} + \frac{H_{2}}{H} x_{2}. \tag{6}$$
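Relation (6) becomes an equality in the idealized situation where $R_{i}^{'} R_{i}$ is exactly H_i times a common matrix. The sketch below constructs such a case artificially by giving every projection image the same hypothetical operator `P`; this is a deliberate simplification chosen to isolate the H_i/H weighting, not a model of real projection geometry.

```python
import numpy as np

rng = np.random.default_rng(0)
J, I = 5, 12
P = rng.normal(size=(I, J))      # one hypothetical per-image projection operator
H1, H2 = 30, 70                  # images contributed by class 1 and class 2
H = H1 + H2

R1 = np.vstack([P] * H1)         # R1'R1 = H1 * P'P
R2 = np.vstack([P] * H2)         # R2'R2 = H2 * P'P
x1, x2 = rng.normal(size=J), rng.normal(size=J)

R = np.vstack([R1, R2])
y = np.concatenate([R1 @ x1, R2 @ x2])
x_bs, *_ = np.linalg.lstsq(R, y, rcond=None)     # least-squares bootstrap volume

x_convex = (H1 / H) * x1 + (H2 / H) * x2         # right-hand side of (6)
```

With realistic, unevenly distributed orientations the two sides agree only approximately, which is why (6) is written with ≃.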

2.3. Profile of the distribution of the bootstrap volumes

It is easy to see that the right hand side of (6) corresponds to summing volumes from Bernoulli trials with support {x₁, x₂}, probability p (whose realization is H₁/H in this case), and dividing the sum by the total number of projections H:

$$x_{BS} \simeq \frac{1}{H} \sum_{h = 1}^{H} x^{h}, \quad \text{where } x^{h} = \begin{cases} x_{1} & \text{w.p. } p \\ x_{2} & \text{w.p. } 1 - p. \end{cases} \tag{7}$$

That is, most of the bootstrap volumes are located near the center of the convex hull with vertices x₁ and x₂. A concentration near the vertices would be more desirable from the point of view of estimating the convex hull.
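The concentration of the bootstrap volumes can be illustrated by simulating only the mixing weight H₁/H in (6). The sketch assumes that the number of class-1 particles in a bootstrap sample is binomial with parameters H and p; the values p = 0.5, H = 1000 and the number of simulated bootstrap samples are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
p, H, n_boot = 0.5, 1000, 5000

# Number of class-1 particles in each simulated bootstrap sample of size H.
H1 = rng.binomial(H, p, size=n_boot)
weights = H1 / H                           # coefficient of x1 in (6)

mean_w = weights.mean()
std_w = weights.std(ddof=1)
std_theory = np.sqrt(p * (1 - p) / H)      # binomial standard deviation of H1/H
```

With these values the weights stay within a narrow band around p, far from the vertices at 0 and 1, and the band narrows like $1/\sqrt{H}$.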

2.4. Imperfections in the projection data

Imperfections in the data come from the electron optics (astigmatism, spherical aberration, etc.), background noise, shot noise, alignment error, etc. Let us assume an additive noise model for the 2D projection data, such that it can be “back-projected,” leading to an additive noise model for the 3D volumes. That is, in (3) we have, for i = 1, 2, $y_{i} = R_{i} (x_{i} + g_{i}) = R_{i} x_{i} + h_{i}$, for some $g_{i}$ and $h_{i} = R_{i} g_{i}$, which are respectively the 3D and 2D noise components. This is realizable if, for instance, $h_{i}$ consists of uniform independent Gaussians at the pixels, the reconstruction region is spherical, and the $g_{i}$ are independent Gaussians at the voxels, with lower variance close to the center of the volume than away from the center. Accordingly, (7) becomes $x_{BS} \simeq \frac{1}{H} \sum_{h = 1}^{H} (x^{h} + g^{h})$. The variance of $x_{BS}$ is thus composed of two terms: one that is due to the class difference, which is the signal part (computed as $\frac{p (1 - p)}{H} d_{j}^{2}$, where $d_{j}$ is the difference between x₁ and x₂ at voxel j, for j = 1, …, J), and another that is due to noise. Since both terms are scaled by the same factor 1/H, the signal-to-noise ratio (SNR) for detecting class difference will not be improved by increasing H.
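The scaling argument can be written out per voxel. The following sketch assumes, beyond what is stated above, that the noise terms $g^{h}$ are independent across h, independent of the class labels, and have zero mean and variance $\sigma_{j}^{2}$ at voxel j; the symbol $\sigma_{j}$ is introduced here only for this illustration.

```latex
\begin{align*}
\operatorname{Var}\!\left[(x_{BS})_{j}\right]
  &\simeq \frac{1}{H^{2}} \sum_{h=1}^{H}
     \left( \operatorname{Var}\!\left[x^{h}_{j}\right]
          + \operatorname{Var}\!\left[g^{h}_{j}\right] \right)
   = \frac{p(1-p)}{H}\, d_{j}^{2} + \frac{\sigma_{j}^{2}}{H}, \\
\mathrm{SNR}_{j}
  &= \frac{p(1-p)\, d_{j}^{2} / H}{\sigma_{j}^{2} / H}
   = \frac{p(1-p)\, d_{j}^{2}}{\sigma_{j}^{2}} .
\end{align*}
```

The first term follows because $x^{h}_{j}$ takes two values separated by $d_{j}$ with probabilities p and 1 − p. H cancels in the ratio, so under these assumptions the SNR depends only on the class proportions, the class difference, and the per-voxel noise level.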

2.5. Classification using bootstrap method

An immediate consequence of the fact that the bootstrap volumes are convex combinations of the class volumes is that the space spanned by the bootstrap volumes approximates that spanned by the class volumes. Hence, for each projection image we can restrict ourselves to that space when estimating the volume that generated that projection. Ideally, these estimated volumes cluster around the true class volumes.

We now proceed to consider the case of M ≥ 2 classes. Suppose we have generated a sufficient number of bootstrap volumes and H is large enough, so that their principal directions are close to the principal directions of the space of the class volumes. Let $z_{1}, \dots, z_{N} \in R^{J}$ be the resulting eigen-volumes and z₀ the average volume. Given a projection image y_P, we wish to find an element z(α) in that space (reconstitution problem); i.e., $z(\alpha) = z_{0} + \sum_{n = 1}^{N} \alpha_{n} z_{n}$, such that the discrepancy between its projection $P z(\alpha) \in R^{I}$ (in the same orientation as that of y_P) and y_P is minimized. For simplicity, in this paper we choose the discrepancy to be the Euclidean distance

$$\left\| \sum_{n = 1}^{N} \alpha_{n} P z_{n} + P z_{0} - y_{P} \right\|. \tag{8}$$

To avoid shift and scale variabilities, y_P and Pz_n, 0 ≤ n ≤ N, are replaced by their normalized (to zero mean and unit variance) version, prior to computing α in (8).
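Minimizing (8) over α is a linear least-squares problem whose design matrix has the projected eigen-volumes as columns. The sketch below is one way to set it up, under the assumption that each of Pz₀, …, Pz_N and y_P is normalized separately to zero mean and unit variance before solving; the dense matrix `P`, the helper names `normalize`, `design`, and `estimate_alpha`, and the toy data are all hypothetical.

```python
import numpy as np

def normalize(img):
    """Scale a flattened image to zero mean and unit variance."""
    img = np.asarray(img, dtype=float)
    return (img - img.mean()) / img.std()

def design(P, z0, Z, y_p):
    """Build the least-squares system A @ alpha ~ b corresponding to (8).

    P   : (I, J) projection matrix for the particle's orientation
    z0  : (J,)   average volume
    Z   : (N, J) eigen-volumes, one per row
    y_p : (I,)   particle image
    """
    A = np.stack([normalize(P @ z) for z in Z], axis=1)   # (I, N)
    b = normalize(y_p) - normalize(P @ z0)                # (I,)
    return A, b

def estimate_alpha(P, z0, Z, y_p):
    """Return the coefficient vector alpha that minimizes (8)."""
    A, b = design(P, z0, Z, y_p)
    alpha, *_ = np.linalg.lstsq(A, b, rcond=None)
    return alpha

# Toy example with a random stand-in for the projection operator.
rng = np.random.default_rng(0)
I, J, N = 50, 20, 3
P = rng.normal(size=(I, J))
z0 = rng.normal(size=J)
Z = rng.normal(size=(N, J))
y_p = P @ (z0 + 0.7 * Z[0] - 0.2 * Z[2]) + 0.1 * rng.normal(size=I)

A, b = design(P, z0, Z, y_p)
alpha = estimate_alpha(P, z0, Z, y_p)
```

Because of the normalization, the recovered coefficients are not expected to equal the unnormalized mixing weights used to generate the toy particle; what matters for classification is that particles from the same class receive similar α.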

2.6. Algorithm for classification based on bootstrapping

Algorithm 1 summarizes our proposed approach to classification using bootstrapping, which is the same as the existing algorithm [3], except for the way in which the coefficients are determined. In [3], the α_n are apparently set to the inner product between Pz_n and y_P, 1 ≤ n ≤ N. We stress that, owing to space limitations, the description of our algorithm given here is rather sketchy. In a separate communication we will treat, within our framework, such issues as the dependence of the results on the number of bootstrap volumes and particles used, filtering of the bootstrap volumes, criteria for estimating α, etc.; several useful statistical analyses are already dealt with in [9] for variance estimation.

Algorithm 1. Bootstrap Classification

1. Sample with replacement the projection data

2. Reconstruct a bootstrap volume from each sample

3. Compute the eigenvolumes of the set of bootstrap volumes

4. For each particle:

4a. Project the eigenvolumes in the orientation of the particle

4b. Compute α in (8)

5. Classify the particles by clustering α, for all the particles
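The steps of Algorithm 1 can be sketched end to end on synthetic data. This is a toy under strong assumptions, not the SPIDER/SPARX pipeline used for the results: volumes are short vectors, each particle has its own random dense matrix in place of a true projection operator, orientations are taken as known, the normalization preceding (8) is skipped, a single eigen-volume (N = 1) is kept, and a plain Lloyd's k-means with a deterministic start replaces a library implementation. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
J, I, K, B, N = 30, 10, 400, 60, 1   # voxels, pixels, particles, bootstrap volumes, eigen-volumes

# Toy data: two class volumes and one random linear "projection" per particle.
base, d = rng.normal(size=J), rng.normal(size=J)
classes = np.stack([base + d / 2, base - d / 2])
true_label = rng.integers(0, 2, size=K)
P = rng.normal(size=(K, I, J))                       # P[k]: stand-in projector of particle k
Y = np.einsum("kij,kj->ki", P, classes[true_label]) + rng.normal(size=(K, I))

# Steps 1-2: resample particles with replacement, least-squares reconstruction.
PtP = np.einsum("kij,kil->kjl", P, P)                # per-particle P'P
PtY = np.einsum("kij,ki->kj", P, Y)                  # per-particle P'y
boot = np.empty((B, J))
for b in range(B):
    idx = rng.integers(0, K, size=K)                 # bootstrap sample of particle indices
    boot[b] = np.linalg.solve(PtP[idx].sum(axis=0), PtY[idx].sum(axis=0))

# Step 3: average volume and leading eigen-volumes (PCA via SVD).
z0 = boot.mean(axis=0)
_, _, Vt = np.linalg.svd(boot - z0, full_matrices=False)
Z = Vt[:N]                                           # (N, J)

# Step 4: per-particle coefficients alpha minimizing (8), without normalization.
alpha = np.empty((K, N))
for k in range(K):
    A = P[k] @ Z.T                                   # projected eigen-volumes, (I, N)
    alpha[k], *_ = np.linalg.lstsq(A, Y[k] - P[k] @ z0, rcond=None)

# Step 5: cluster the coefficient vectors with a minimal Lloyd's k-means.
def kmeans(points, k, n_iter=50):
    order = np.argsort(points[:, 0])
    centers = points[order[np.linspace(0, len(points) - 1, k).astype(int)]]
    for _ in range(n_iter):
        dist = ((points[:, None, :] - centers[None]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        centers = np.stack([points[labels == c].mean(axis=0) if np.any(labels == c)
                            else centers[c] for c in range(k)])
    return labels

labels = kmeans(alpha, 2)
agreement = np.mean(labels == true_label)
accuracy = max(agreement, 1 - agreement)             # cluster labels are arbitrary
```

The toy separates the two classes easily because the class difference is large relative to the noise; it is meant to show how the steps fit together, not to reproduce the difficulty of real data at an SNR of 0.06.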


3. RESULTS

We tested our proposed algorithm on experimental and simulated data. The experimental data set consists of ten thousand 130 × 130 particles randomly chosen from a larger data set, on which the maximum likelihood (ML) classification method (a popular alternative to ours) [5] was previously tested, giving rise to two main structures: the 70S E. coli ribosome in the classical and hybrid state (see Fig. 1). For the simulated data set, we used these two states as phantoms and generated ten thousand 130 × 130 noisy projections in exactly the same manner as described in [10]; i.e., the SNR was 0.06 and the CTF was applied. To gain computational speed, we decimated the particles to 65 × 65 pixels in both data sets, aligned them to a library of reference projections (on a ten-degree angular grid) of a common 3D reference (the density map of the ribosome in one of the two states), and used SPIDER [11] to generate forty thousand bootstrap volumes in each case. It was necessary to filter the volumes, and for that we used a low-pass filter with cut-off 0.1, which was limited to the first lobe of the CTF [9] (though this value was not optimized). We relied on SPARX [12] to perform the eigen-decompositions. The clustering of the coefficient vectors α was performed via the k-means algorithm. To assess the classification performance, we also tested the ML method on the simulated data set.

Fig. 1 — 70S *E. coli* ribosome in classical (left) and hybrid (center) states [5], and our classification (right) of noisy mixed projections of the two structures.

3.1. Experimental data

Fig. 2 shows the result of classification using our version of the bootstrapping classification method. We used five eigen-volumes (N = 5) and looked for two classes (M = 2). One can immediately recognize the differences between the two structures in the presence/absence of EF-G and the position of the L1 stalk. Not visible is the presence/absence of the A-site tRNA, which is another difference that the algorithm was able to pull out.

3.2. Simulated data

As measured by a classification error score whose minimum is 0% (perfect classification) and the maximum is 50% (random guess), the bootstrapping method (with N = M = 2) yields 16%±0.2% (see Fig. 1) versus 34%±10% for the ML approach [4] (refinement angle of ten degrees, two classes, 20 iterations). The confidence interval of the classification error was obtained by running the respective algorithms ten times. The large dispersion of the figure in the ML method is likely due to the presence of local maxima (which is not an issue in the bootstrapping approach, except for the k-means algorithm) and the relatively small data size.
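The text does not spell out how the error score is computed. One definition consistent with a range from 0% to 50% for two classes is the misclassification rate minimized over the two possible assignments of cluster labels to classes; the sketch below implements that assumed definition (the function name `two_class_error` and the toy labels are hypothetical).

```python
import numpy as np

def two_class_error(pred, truth):
    """Fraction misclassified, minimized over the two possible label assignments."""
    pred = np.asarray(pred)
    truth = np.asarray(truth)
    err = np.mean(pred != truth)
    return min(err, 1.0 - err)

truth = np.array([0, 0, 0, 1, 1, 1, 1, 0])

# Labels swapped but otherwise consistent: counts as a perfect classification.
perfect = two_class_error(1 - truth, truth)

# One of eight particles assigned to the wrong class.
one_wrong = two_class_error([0, 0, 1, 1, 1, 1, 1, 0], truth)
```

Taking the minimum over the two assignments is what caps the score at 50%: a labeling that disagrees with the truth on more than half of the particles is simply read with the class names exchanged.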

4. DISCUSSION AND CONCLUSIONS

We have explained the rationale of bootstrapping in the context of classification and proposed an algorithm that differs from the one initially proposed in an important detail. Through repeated reconstructions from bootstrap samples, we can estimate the space spanned by the underlying class structures. Searching this space for a volume whose projection best matches a given particle helps us decide on the class to which the particle belongs. We show that the bootstrapping approach offers a competitive alternative to current popular methods, such as the ML approach: the former does not suffer from the local-maxima effect (except in the clustering algorithm, if k-means is used). Note that in the experiments, the angular assignment of the projection data was done only once, at the beginning, and with respect to one reference volume. An iterative process, in which the angles are refined with respect to the current reconstructed class volumes, should provide even better results. Further improvement may also come from alternative ways of finding the coefficient vector α, since the Euclidean distance in (8) is sensitive to outliers. Finally, it should be noted that classification becomes more challenging as the variability of the structure classes competes with noise in the data, both of which are scaled down by the number of particles used in the sampling. Hence, to reduce the noise, it is necessary to find ways other than increasing the number of particles; if filtering is used for that purpose, the loss of high-frequency information can be detrimental to the classification.

5. ACKNOWLEDGMENT

We are grateful to Zhi-Quan (Tom) Luo for help with optimization.

Footnotes

¹ For instance, the well-known SIRT algorithm can be viewed as a gradient descent algorithm for finding x_LSQ [7].

6. REFERENCES

[1] Frank J. Three-Dimensional Electron Microscopy of Macromolecular Assemblies. Oxford University Press; New York: 2006.
[2] Penczek PA, Yang C, Frank J, Spahn CMT. Estimation of variance in single-particle reconstruction using the bootstrap technique. J. Struct. Biol. 2006;154:168–183. doi: 10.1016/j.jsb.2006.01.003.
[3] Spahn CMT, Penczek PA. Exploring conformational modes of macromolecular assemblies by multi-particle cryo-EM. Current Opinion in Structural Biology. 2009;19:623–631. doi: 10.1016/j.sbi.2009.08.001.
[4] Scheres SHW, Valle M, Grob P, Nogales E, Carazo JM. Maximum likelihood refinement of electron microscopy data with normalization errors. J. Struct. Biol. 2009;166:234–240. doi: 10.1016/j.jsb.2009.02.007.
[5] Scheres SHW, Gao H, Valle M, Herman GT, Eggermont PPB, Frank J, Carazo JM. Disentangling conformational states of macromolecules in 3D-EM through likelihood optimization. Nat. Methods. 2007;4:27–29. doi: 10.1038/nmeth992.
[6] Efron B. Bootstrap methods: Another look at the jack-knife. The Annals of Statistics. 1979;1:1–26.
[7] Herman GT. Image Reconstruction from Projections: The Fundamentals of Computerized Tomography. Academic Press; New York: 1980.
[8] Deans SR. The Radon transform and some of its applications. John Wiley & Sons; New York: 2006.
[9] Zhang W, Kimmel M, Spahn CM, Penczek PA. Heterogeneity of large macromolecular complexes revealed by 3D cryo-EM variance analysis. Structure. 2008;16:1770–1776. doi: 10.1016/j.str.2008.10.011.
[10] Baxter WT, Grassucci RA, Gao H, Frank J. Determination of signal-to-noise ratios and spectral SNRs in cryo-EM low-dose imaging of molecules. J. Struct. Biol. 2009;166(2):126–132. doi: 10.1016/j.jsb.2009.02.012.
[11] Shaikh TR, Gao H, Baxter W, Asturias FJ, Boisset N, Leith A, Frank J. SPIDER image processing for single-particle reconstruction of biological macromolecules from electron micrographs. Nat. Protoc. 2008;3:1941–1974. doi: 10.1038/nprot.2008.156.
[12] Hohn M, Tang G, Goodyear G, Baldwin PR, Huang Z, Penczek PA, Yang Ch., Glaeser RM, Adams P, Ludtke SJ. SPARX, a new environment for cryo-EM image processing. J. Struct. Biol. 2007;157:47–55. doi: 10.1016/j.jsb.2006.07.003.
