Skip to main content
Bioinformatics logoLink to Bioinformatics
. 2012 Apr 27;28(15):1972–1979. doi: 10.1093/bioinformatics/bts243

Optimal simultaneous superpositioning of multiple structures with missing data

Douglas L Theobald 1,*, Phillip A Steindel 1
PMCID: PMC3400950  PMID: 22543369

Abstract

Motivation: Superpositioning is an essential technique in structural biology that facilitates the comparison and analysis of conformational differences among topologically similar structures. Performing a superposition requires a one-to-one correspondence, or alignment, of the point sets in the different structures. However, in practice, some points are usually ‘missing’ from several structures, for example, when the alignment contains gaps. Current superposition methods deal with missing data simply by superpositioning a subset of points that are shared among all the structures. This practice is inefficient, as it ignores important data, and it fails to satisfy the common least-squares criterion. In the extreme, disregarding missing positions prohibits the calculation of a superposition altogether.

Results: Here, we present a general solution for determining an optimal superposition when some of the data are missing. We use the expectation–maximization algorithm, a classic statistical technique for dealing with incomplete data, to find both maximum-likelihood solutions and the optimal least-squares solution as a special case.

Availability and implementation: The methods presented here are implemented in THESEUS 2.0, a program for superpositioning macromolecular structures. ANSI C source code and selected compiled binaries for various computing platforms are freely available under the GNU open source license from http://www.theseus3d.org.

Contact: dtheobald@brandeis.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

How should we properly compare and contrast the 3D conformations of similar structures? This fundamental problem in structural biology is commonly addressed by performing a superposition, which removes arbitrary differences in translation and rotation so that a set of structures is oriented in a common reference frame (Flower, 1999). For instance, the conventional solution to the superpositioning problem uses the least-squares optimality criterion, which orients the structures in space so as to minimize the sum of the squared distances between all corresponding points in the different structures. Superpositioning problems, also known as Procrustes problems, arise frequently in many scientific fields, including anthropology, archaeology, astronomy, computer vision, economics, evolutionary biology, geology, image analysis, medicine, morphometrics, paleontology, psychology and molecular biology (Dryden and Mardia, 1998; Gower and Dijksterhuis, 2004; Lele and Richtsmeier, 2001). A particular case we consider here is the superpositioning of multiple 3D macromolecular coordinate sets, where the points to be superpositioned correspond to atoms. Although our analysis specifically concerns the conformations of macromolecules, the methods developed herein are generally applicable to any entity that can be represented as a set of Cartesian points in a multidimensional space, whether the particular structures under study are proteins, skulls, MRI scans or geological strata.

We draw an important distinction here between a structural ‘alignment’ and a ‘superposition.’ An alignment is a discrete mapping between the residues of two or more structures. One of the most common ways to represent an alignment is using the familiar row and column matrix format of sequence alignments using the single letter abbreviations for residues (Fig. 1). An alignment may be based on sequence information or on structural information (or on both). A superposition, on the other hand, is a particular orientation of structures in 3D space.

Fig. 1.

Fig. 1.

Cytochrome c alignment. An alignment of cytochrome c proteins, each of known structure, from seven different species. The gaps can be considered as ‘missing data’ in a likelihood framework. As in Figure 2, the subset of residues completely shared among all four proteins is indicated by asterisks

Calculating an optimal superposition normally requires a one-to-one correspondence (a bijection) between the atoms in the different structures (Bourne and Weissig, 2003; Flower, 1999; Gower, 1975). For instance, a sequence alignment is necessary to superposition protein molecules. In many real cases, however, certain residues (and their atoms) are ‘missing’ in some of the structures. As a case in point, one crystal structure of a protein may omit loop regions that are present in another crystal structure of the same protein. Figure 2a shows a sequence alignment of four protein structures that we wish to superposition. In this example, the protein sequences are identical except that several residues are missing from some structures; only about half of the atoms to be superpositioned are shared among all four proteins (indicated by blue asterisks). An analogous situation exists when we wish to superposition a set of homologous proteins, which in general will have different sequences, various lengths, and gaps and insertions in the alignment (Fig. 1). In both cases, particular columns in the alignment will have gaps, and immediately the question emerges of how to properly incorporate these positions into a global superposition.

Fig. 2.

Fig. 2.

Alignments with missing data. (a) A sequence alignment of four identical proteins, except that different residues are missing in each of the proteins. This could, for instance, correspond to the case of superpositioning different crystal structures where different regions of the protein are disordered in different crystal forms. The subset of residues completely shared among all four proteins is indicated by asterisks. (b) A second alignment of the same four proteins with different missing data. (c) A third alignment of the same four proteins with no ‘common core,’ in which no residues are completely shared among all four proteins

Current multiple superposition methods explicitly require complete data (Diamond, 1992; Flower, 1999; Gerber and Müller, 1987; Gower, 1975; Kearsley, 1990; Shapiro et al., 1992; Sutcliffe et al., 1987; Theobald and Wuttke, 2006a), and hence current superposition implementations deal with missing data crudely, usually by excluding many atoms from the calculation (Birzele et al., 2007; Dror, 2003; Guda et al., 2001; Hill and Reilly, 2006; Konagurthu et al., 2006; Maiti et al., 2004; Menke et al., 2008; Ortiz, 2002; Shatsky et al., 2002; Ye and Godzik, 2005). For the proteins in Figures 1 and 2, standard practice would calculate the superposition based on only the small subset of fully shared residues, often referred to as the ‘common core’ (indicated above the alignment by blue asterisks). This method, which corresponds to superpositioning based only on columns in the alignment that contain no gaps, is therefore inefficient as it disregards much of the observed data. Atoms that are not shared among all structures are nevertheless informative, and ideally they should be considered in calculating an optimal superposition. In the most extreme case, no residues are completely shared among the macromolecules (Fig. 2c). Here, the practice of disregarding positions with missing data prohibits the calculation of a superposition altogether, because there is no common core and consequently nothing to include in the calculation. Ignoring some atoms in the superposition also clearly fails to satisfy the least-squares criterion, where the object is to minimize the sum of squared distances among all corresponding atoms, not just among a subset of the atoms.

Despite the popularity of ordinary least squares as an optimality criterion for determining the best superposition, other criteria are better justified both theoretically and empirically (Theobald and Wuttke, 2006a, b, 2008). According to the Gauss–Markov theorem, two basic assumptions must be met to justify the use of least squares. In terms of a superposition, these two assumptions require that all atoms have the same variance and that none of the atoms are spatially correlated. Both of these assumptions are strongly violated with biological macromolecules, as certain regions of a structure are more variable than others (due to a combination of experimental imprecision, dynamics and conformational heterogeneity) and because atoms physically communicate with each other (e.g. via electrostatic, Van der Waals and covalent interactions).

We previously presented a maximum-likelihood superposition method that addresses these problems with the least-squares criterion by explicitly allowing individual atoms to be correlated and to have different variances (Theobald and Wuttke, 2006a, b, 2008). As is often true with Gaussian distributed data, the conventional least-squares superposition solution falls out as a special case of the likelihood analysis (namely, when assuming uncorrelated data with equal variances). Most importantly for the present work, likelihood-based methods can elegantly handle cases of missing data via the expectation–maximization (EM) algorithm (Dempster et al., 1977; McLachlan and Krishnan, 1997).

Here, we present solutions for finding the optimal superposition with incomplete structural data. We use the EM algorithm, a classic statistical technique for dealing with missing data, to find maximum-likelihood solutions, which include the conventional least-squares solution as a special case. For completeness, we present two different classes of solutions, listed in decreasing order of generality, complexity and computational requirements: (i) the ‘non-isotropic’ solution, which allows for heterogeneous atomic variances but assumes that there are no correlations; and (ii) the ‘isotropic’ solution, which corresponds to the least-squares solution and assumes that all atomic variances are equivalent and no atoms are correlated. This manner of presentation should allow for the straightforward modification of existing least-squares superpositioning routines to handle missing data.

2 APPROACH

2.1 A Gaussian statistical model for the superposition problem

To analyze the superposition problem within a likelihood-based framework and to use the EM algorithm, one must choose a statistical model for the observed data. We assume a perturbation model in which each structure is distributed according to a Gaussian (or Normal) probability density (Goodall, 1991; Lele, 1993; Theobald and Wuttke, 2006a).

The EM algorithm is frequently used in likelihood analyses to determine maximum-likelihood estimates of parameters of a model when some data are missing (Dempster et al., 1977; Pawitan, 2001). The EM method involves cycling between two steps: (i) the ‘E-step’ in which the expected likelihood function is calculated, conditional on the observed data and the current estimates of the model parameters; and (ii) the ‘M-step’ in which the expected likelihood function is maximized over an unknown parameter.

3 METHODS

3.1 Representation of structures with missing atoms

Consider the case of superpositioning r different structures (Xi, i=1,…, r), each with a total of k corresponding atoms. We represent each structure as a k×3 matrix of k rows of atoms, where each atom is a 3-vector. Some of the atoms in each structure may be missing (or unobserved). For each structure Xi, there are mi missing atoms and a complementary number of ni observed atoms, so that mi+ni=k.

To represent structures with missing points, we imagine that the complete data are given in the matrix Xi. A portion of this data is missing or unobserved. The complete data matrix can then be represented as the sum of the observed data and the unobserved data:

graphic file with name bts243m1.jpg (1)

where νiXi is the observed data and μiXi is the unobserved, missing data. The complementary indicator matrices νi and μi are square, symmetric, diagonal, k×k matrices. The ‘observed’ indicator matrix, νi, contains a one on the diagonal if the corresponding atom in the structure Xi is observed (i.e. not missing) and zeros otherwise. For example, the following three ‘observed’ ν indicator matrices correspond to the three small protein fragments in the alignment in Figure 3:

graphic file with name bts243m2.jpg (2)

Fig. 3.

Fig. 3.

An alignment of three proteins with missing residues. Missing data in this alignment is indicated by the three ‘observed’ indicator matrices ν1, ν2 and ν3 corresponding to proteins p1, p2 and p3, respectively, which are presented in equation (2)

Conversely, the ‘unobserved’ indicator matrix, μi, contains a 1 on the diagonal if the corresponding atom in the structure Xi is missing, and 0s otherwise. Both indicator matrices, though redundant, are useful for simplifying the presentation of the solutions. If no data are missing for structure Xi then νi equals the identity matrix and μi=0, the square matrix of zeros.

The following useful properties hold for the indicator matrices:

graphic file with name bts243m3.jpg (3)
graphic file with name bts243m4.jpg (4)
graphic file with name bts243m5.jpg (5)

where Ik is the k×k identity matrix and tr A is the trace of matrix A (i.e. the sum of the diagonal elements of A).

3.2 The Gaussian perturbation model

In our probabilistic model, each structure Xi is considered to be a randomly rotated and translated Gaussian perturbation of a mean structure M:

graphic file with name bts243m6.jpg (6)

where ti is a 3×1 translational row vector, 1k denotes the k×1 column vector of ones and Ri is a proper, orthogonal 3×3 rotation matrix. The k×3 matrix Ei is a matrix of Gaussian random errors with mean 0, being a random variate from a matrix Gaussian distribution i.e. Ei ~ Nk,3(0, Σ, I3). Here, Σ is a k×k covariance matrix for the atoms, which describes the variance of each atom and the covariances among the atoms. For simplicity, we assume that the variance about each atom is spatially spherical. Extensions to higher (and lower) dimensions are trivial (e.g. 4D data would be represented by a k×4 matrix and use 4×4 rotation matrices and 4×1 translation vectors).

For the non-isotropic solution, the covariance matrix is diagonal, with all offdiagonal, covariance elements constrained to 0. For the isotropic solution, which corresponds to least squares, the covariance matrix is constrained to be diagonal and to have identical diagonal elements (i.e. Σ=σ2I).

3.3 The superposition likelihood function

The full joint PDF for our likelihood superposition problem is thus obtained from a multivariate matrix normal distribution (Dutilleul, 1999; Gupta and Nagar, 2000) corresponding to the perturbation model described by equation (6)

graphic file with name bts243m7.jpg (7)

where

graphic file with name bts243m8.jpg (8)

and d=3 for 3D data. The Jacobian for the transformation from Yi to Xi is the product of the Jacobians for the translation and rotation, which are each simply unity [see Chapter 1.3 of (Gupta and Nagar, 2000)]. Detailed background and justification of this likelihood treatment can be found elsewhere (Theobald and Wuttke, 2006a, b, 2008).

The full superposition log-likelihood ℓ(R, t, M, Σ|X)=ℓS is therefore given by

graphic file with name bts243m9.jpg (9)

With the likelihood function in hand, the ML estimate of a parameter can be derived straightforwardly by taking the derivative of the log-likelihood with respect to the parameter (producing the ‘score function’), setting the derivative to zero, and solving for the parameter. Note that columns of the alignment that contain all gaps except for one lone sequence have no influence on the likelihood and should be excluded from the maximization calculations. For the E-step of the EM algorithm, one first finds the expected log likelihood, where the expectation is over the missing data conditional on the observed data and current estimates of the other parameters. In practice, these conditional expectations can be cast in terms of the current parameter estimates, and hence the expectations can be combined with other terms in the log likelihood containing those parameters. For the M-step, the expected log likelihood is maximized over a given parameter by taking the derivative as explained above. In the following sections, the conditional ML estimates are provided for both the complete data and missing data cases. Detailed derivations are provided in the Supplementary Material.

3.4 The translations

ML estimates of the translation parameters are given below.

3.4.1 Complete data solution

Where Inline graphic is the estimate of the translation:

graphic file with name bts243m10.jpg (10)
graphic file with name bts243m11.jpg (11)

3.4.2 Missing data solution

graphic file with name bts243m12.jpg (12)
graphic file with name bts243m13.jpg (13)

For the remaining solutions, it will be convenient to define a centred structure:

graphic file with name bts243m14.jpg (14)

3.5 The rotations

The optimal rotations are calculated using a singular value decomposition (SVD). Let the SVD of an arbitrary matrix A be U Λ V′. The ML rotations Inline graphic are estimated by

graphic file with name bts243m15.jpg (15)

where rotoinversions can be avoided by constraining the determinant of Inline graphic to be 1 by using P=I if |V||U|=1 or P=diag(1,…, 1,−1) if |V||U|=−1.

3.5.1 Complete data solution

In the non-isotropic case, the U and V matrices are determined from the SVD as follows:

graphic file with name bts243m16.jpg (16)

and in the isotropic case, from

graphic file with name bts243m17.jpg (17)

3.5.2 Missing data solution

For the non-isotropic case

graphic file with name bts243m18.jpg (18)

and for the isotropic case

graphic file with name bts243m19.jpg (19)

3.6 The mean structure

3.6.1 Complete data solution

The mean structure is estimated as the arithmetic average of the optimally translated and rotated structures

graphic file with name bts243m20.jpg (20)

3.6.2 Missing data solution

graphic file with name bts243m21.jpg (21)

Both of these solutions are independent of the covariance matrix.

3.7 Covariance matrix and superposition variance

The following equations for the covariance matrix estimates are only valid when the translations are known (Theobald and Wuttke, 2006a). In general, this is not the case, and thus a constrained (regularized) estimator of the covariance matrix is necessary. For instance, the estimates given below may be modified to give a hierarchical estimator as shown in equation (6) of (Theobald and Wuttke, 2006a) or equation (10) of (Theobald and Wuttke, 2008). The estimators of the isotropic variance are already adequately constrained and do not need to be adjusted.

To simplify the following formulae, we first define the matrix Di:

graphic file with name bts243m22.jpg (22)

3.7.1 Complete data solution

The unconstrained estimate of the diagonal, non-isotropic covariance matrix is

graphic file with name bts243m23.jpg (23)

where Ik is the k×k identity matrix and ‘⊙’ is the Hadamard operator for elementwise matrix multiplication. The Hadamard operation simply sets all offdiagonal elements of the covariance matrix to 0.

For a least-squares analysis, which is equivalent to assuming that Σ=σ2I, the ML estimate of the variance is

graphic file with name bts243m24.jpg (24)

Note that in equation (24), one needs to only calculate the diagonal elements.

3.7.2 Missing data solution

The ML estimate of the non-isotropic, diagonal covariance matrix is

graphic file with name bts243m25.jpg (25)

The estimate of the isotropic covariance matrix (Σ=σ2I) is given by

graphic file with name bts243m26.jpg (26)

In both isotropic cases, calculation of the isotropic variance σ2 can often be omitted, as it is not needed to estimate any of the other parameters. The summation terms may be calculated easily by noting that

graphic file with name bts243m27.jpg (27)

4 ALGORITHM

The simultaneous solution of the optimal parameters must be solved numerically, as each of the unknown parameters is a function of some of the others. Our iterative algorithm is an extension of similar algorithms proposed previously, and it is based on nested rounds of EM cycles and conditional maximization (Dempster et al., 1977; Dutilleul, 1999; Goodall, 1991; Theobald and Wuttke, 2006a, b). As with all superposition algorithms, this algorithm requires a priori knowledge of the alignment (the one-to-one correspondence among atoms/points in the structures):

  1. Initialize: Set Inline graphic for all i. Estimate the mean structure Inline graphic by embedding the average of the distance matrices, including gaps, for each structure (Crippen and Havel, 1978; Lele, 1993; Lele and Richtsmeier, 2001). Rather than embedding, one may simply choose one of the structures (preferably with the fewest gaps) to serve as the mean for the first iteration, setting missing coordinates to zeros (convergence may be hindered in cases with a large fraction of gaps).

  2. Translate: Translate (i.e. centre) each Xi. For the first iteration, the μiMRi term can be omitted from equations (12)–(13).

  3. Rotate: Calculate each rotation Inline graphic and rotate each translated structure: Inline graphic.

  4. Estimate the mean: Recalculate the average structure Inline graphic. Return to Step 2 and loop until convergence.

  5. Estimate the covariance matrix Inline graphic or variance σ2: If the covariance matrix or the translations are unknown (the usual case), calculate Inline graphic [equation (25)] and modify it to constrain the eigenvalues to all be >0. In the isotropic case, this step can be omitted if desired. Return to Step 2 and loop until convergence.

5 IMPLEMENTATION

The algorithm described above for calculating optimal superpositions with missing data is implemented in the command-line UNIX program THESEUS (Theobald and Wuttke, 2006a, b). THESEUS functions in two modes: (i) one mode for superpositioning structures with identical sequences [such as different (NMR) models] and (ii) an ‘alignment mode,’ which superpositions structures with different sequences (i.e. with missing data) conditional on a known alignment. THESEUS does not attempt to determine structure-based sequence alignments, which is a distinct bioinformatics problem (Bourne and Weissig, 2003). When superpositioning homologous proteins with different sequences or identical proteins with missing portions, a sequence alignment must be provided to THESEUS. We provide a wrapper script, theseus_align, to perform this latter procedure transparently for the user. The theseus_align wrapper script will automatically extract the proper sequences from the Protein Data Bank (PDB) files, align them with a sequence alignment program of the users choice and superposition the structures based on this alignment with THESEUS.

THESEUS will superposition any number of structures within the limits set by the operating system and memory capability. Via command line options, users can choose to superposition assuming an isotropic covariance matrix (i.e. the conventional LS (least squares)-method) or using the non-isotropic model. Specified alignment columns can also be excluded from the calculation. On modern personal desktop computers, convergence is usually very fast (within seconds for even very large problems).

6 RESULTS AND DISCUSSION

To demonstrate the advantages of the method, we constructed three test sets of structures based on four NMR models of a zinc finger domain protein. In each of the three test sets, different portions of the the four structures were removed. In the first test set, the C-terminal helix of the zinc finger is the only region fully shared among all four partial structures (indicated by asterisks above the alignment columns in Fig. 2a). In the second test set, the N-terminal β-sheet is the only region fully shared (Fig. 2b). For the third test set, no regions of the protein are fully shared among all four structures—in this set, there is no ‘common core’ (Fig. 2c), and hence this test set is impossible to superposition using conventional algorithms.

The four original, unmodified zinc finger structures were superpositioned with THESEUS (using both the conventional least-squares criterion and non-isotropic maximum likelihood) to provide a ‘complete data’ superposition. For comparison, the modified structures (with various portions deleted) were then superpositioned using the EM algorithm and using the traditional method which omits columns with gaps. The superpositions with the modified structures can then be compared with the ‘complete data’ superposition, which serves as a reference.

The EM method produces a least-squares superposition much closer to the ‘true’ complete data superposition than the conventional method (Fig. 4). The EM superposition is also largely independent of which portions were fully aligned. The EM variances are much lower for the entire structure than the conventional method variances, and they are generally much closer to the ‘true’ variances (Fig. 7). Results for the non-isotropic ML superpositions are similar to those of the LS superpositions (Figs. 5 and 8). The EM method can easily handle the ‘impossible’ situation seen in Figure 2c, with results similar to the true superposition (Figs. 6, 7c and 8c).

Fig. 4.

Fig. 4.

Least-squares (isotropic) superpositions with missing data. In each pane, four protein structures are superpositioned, each with a different conformation. The top row, (a)–(c), compares superpositions of proteins corresponding to the alignment in Figure 2a, where only residues in the α-helix are fully shared among the structures. Other regions of the structures, e.g. the two-stranded β-sheet in the right side of the images, are missing in some of the structures. The bottom row, (d)–(f), compares superpositions corresponding to the alignment in Figure 2b, where only residues in the β-sheet are fully shared. The left-most column, (a) and (d), shows superpositions found using the EM method described here. The middle column, (b) and (e), shows the reference superposition using all of the data; this can be thought of as the ‘true’ superposition before regions of the structures were deleted. For ease of comparison, in these images, the missing residues are not displayed, even though all of the original data were included in the superposition calculation. The right-most column, (c) and (f), shows conventional superpositions based on only the subset of fully shared residues. The structures used in these superpositions were derived from four NMR models of a zinc finger domain, PDB ID 1zfd

Fig. 7.

Fig. 7.

Standard deviation of α-carbons for least-squares superpositions. For each residue in the structures, the α-carbon standard deviation is plotted. The reference superposition (‘complete data’) is plotted as the green line, a superposition using the conventional algorithm (based on only the subset of fully shared residues) is plotted as the blue line, and the superposition from our EM algorithm, which includes all data, is plotted as the red line. The reference data (green line) is in fact the same in all three panes; it is re-plotted in each for convenience. (a) Superpositions of proteins corresponding to the alignment in Figure 2a, where only residues in the α-helix are fully shared among the structures. (b) Superpositions corresponding to the alignment in Figure 2b, where only residues in the β-sheet are fully shared. (c) Superpositions when no residues are completely shared among the four proteins (‘no common core’)

Fig. 5.

Fig. 5.

Maximum-likelihood (non-isotropic) superpositions with missing data. Aside from the optimization criterion, all other details, structures and alignments are as in Figure 4

Fig. 8.

Fig. 8.

Standard deviation of α-carbons for non-isotropic maximum-likelihood superpositions. For details see the legend to Figure 7

Fig. 6.

Fig. 6.

Superpositions when there is no ‘common core.’ (a) and (b). Least-squares superpositions when no residues are completely shared among the four proteins. Panel (a) shows the results of the EM missing data algorithm, based on the alignment shown in Figure 2c. (b) The original least-squares superposition when all missing data are included in the calculations (c) and (d). Corresponding non-isotropic ML superpositions

Because the conventional superpositions ignore regions of the structure that have missing data, they are biased to closely superposition only the regions that are fully shared. For instance, in Figures 4c and 5c, the superposition is biased to closely orient the α-helix at the expense of the β-sheet, as the α-helix is the only region of the protein fully shared among the three structures. Similarly, the conventional superposition is biased toward the β-sheet in panels 4f and 5f, since the β-sheet is the only region of the protein fully shared among this set of structures.

Our test sets represent somewhat extreme cases, and we expect the effect of accounting for missing data in real proteins to be more subtle. In practice, the effect of missing data will vary from case to case, depending on which regions are missing and on the patterns of structural variability in the proteins. We have found, however, that in many cases, properly handling missing data is an important concern with significant effects on the superposition. One salient example is given in Figure 9, which shows the effects of accounting for missing data in the superpositions of the NMR structure families of two different scorpion toxins. Using both ML and LS superposition methods, significant differences are seen when the missing data are accounted for properly. In all cases, our EM method results in a closer superposition for the unaligned regions (shown in cyan) that would conventionally be excluded from the analysis. The fully aligned regions (shown in dark blue) also exhibit changes in relative conformation, as seen clearly in the per residue RMSD plots.

Fig. 9.

Fig. 9.

Comparison of conventional superpositions versus our missing data method for NMR families of two scorpion neurotoxins (PDB IDs 1big and 1i6g). The Centruroides neurotoxin (1i6g) contains two insertions, shown in cyan, not found in the Chinese scorpion neurotoxin (1big). The cyan regions hence represent positions with gaps in the alignment, whereas the dark blue regions are fully aligned. The top row, (a) and (b), compares non-isotropic ML superpositions of the two neurotoxins, with a conventional superposition using only fully aligned residues at left and a superposition accounting for all data using our EM algorithm at right. The middle row, (c) and (d), compares analogous LS superpositions. The bottom row shows plots of α-carbon Root Mean Square Deviation (RMSD) for each residue position

7 CONCLUSION

We have developed a method for superpositioning multiple structures when some of the structures have residues that other structures lack. Our algorithm uses the EM algorithm to find the optimal superposition by treating the gaps in an alignment as ‘missing data.’ To our knowledge, this is the first superposition solution proposed for cases with incomplete or missing structural information. The use of indicator matrices allows our EM method to be incorporated easily in conventional LS superpositioning procedures. Programs need only be modified to keep track of these matrices and to adjust the calculations accordingly. Hence, our method should be widely applicable to the diverse superposition problems found throughout molecular biology.

Supplementary Material

Supplementary Data

ACKNOWLEDGEMENTS

The authors thank Catherine J.L. Theobald for comment and criticism.

Funding: National Institutes of Health, United States [GM094468 and GM096053].

Conflict of Interest: none declared.

REFERENCES

  1. Birzele F., et al. Vorolign—Fast structural alignment using voronoi contacts. Bioinformatics. 2007;23:e205–e211. doi: 10.1093/bioinformatics/btl294. [DOI] [PubMed] [Google Scholar]
  2. Bourne P.E., Weissig H. Methods of Biochemical Analysis. Vol. 44. Hoboken, NJ: Wiley-Liss; 2003. Structural Bioinformatics. [PubMed] [Google Scholar]
  3. Crippen G.M., Havel T.F. Stable calculation of coordinates from distance information. Acta Crystallogr. A. 1978;34:282–284. [Google Scholar]
  4. Dempster A.P., et al. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. B Metab. 1977;39:1–38. [Google Scholar]
  5. Diamond R. On the multiple simultaneous superposition of molecular-structures by rigid body transformations. Protein Sci. 1992;1:1279–1287. doi: 10.1002/pro.5560011006. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Dror O. Multiple structural alignment by secondary structures: algorithm and applications. Protein Sci. 2003;12:2492–2507. doi: 10.1110/ps.03200603. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Dryden I.L., Mardia K.V. Wiley series in probability and statistics. Chichester, New York: John Wiley & Sons; 1998. Statistical Shape Analysis. [Google Scholar]
  8. Dutilleul P. The MLE algorithm for the matrix normal distribution. J. Stat. Comput. Simul. 1999;64:105–123. [Google Scholar]
  9. Flower D.R. Rotational superposition: a review of methods. J. Mol. Graph Model. 1999;17:238–244. [PubMed] [Google Scholar]
  10. Gerber P.R., Müller K. Superimposing several sets of atomic coordinates. Acta Crystallogr A. 1987;43:426–428. [Google Scholar]
  11. Goodall C.R. Procrustes methods in the statistical analysis of shape. J. Roy. Stat. Soc. B Metab. 1991;53:285–321. [Google Scholar]
  12. Gower J.C. Generalized Procrustes analysis. Psychometrika. 1975;40:33–51. [Google Scholar]
  13. Gower J.C., Dijksterhuis G.B. Procrustes Problems. Vol. 30. Oxford, New York: Oxford Statistical Science Series, Oxford University Press; 2004. [Google Scholar]
  14. Guda C., et al. A new algorithm for the alignment of multiple protein structures using monte carlo optimization. Pacific Symposium on Biocomputing Pacific Symposium on Biocomputing. 2001;6:275–286. doi: 10.1142/9789814447362_0028. [DOI] [PubMed] [Google Scholar]
  15. Gupta A.K., Nagar D.K. Matrix Variate Distributions. Vol. 104. Chapman and Hall; 2000. [Google Scholar]
  16. Hill A.D., Reilly P.J. Comparing programs for rigid-body multiple structural superposition of proteins. Proteins. 2006;64:219–226. doi: 10.1002/prot.20975. [DOI] [PubMed] [Google Scholar]
  17. Kearsley S.K. An algorithm for the simultaneous superposition of a structural series. J. Comput. Chem. 1990;11:1187–1192. [Google Scholar]
  18. Konagurthu A.S., et al. Mustang: a multiple structural alignment algorithm. Proteins. 2006;64:559–574. doi: 10.1002/prot.20921. [DOI] [PubMed] [Google Scholar]
  19. Lele S. Euclidean distance matrix analysis (EDMA)—estimation of mean form and mean form difference. Math. Geol. 1993;25:573–602. [Google Scholar]
  20. Lele S., Richtsmeier J.T. Interdisciplinary statistics. Boca Raton, FL: Chapman and Hall/CRC; 2001. An Invariant Approach to Statistical Analysis of Shapes. [Google Scholar]
  21. Maiti R., et al. SuperPose: a simple server for sophisticated structural superposition. Nucleic Acids Res. 2004;32:W590–W594. doi: 10.1093/nar/gkh477. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. McLachlan G.J., Krishnan T. Wiley series in probability, and statistics, Applied Probability and Statistics. New York: Wiley; 1997. The EMAlgorithm and Extensions. [Google Scholar]
  23. Menke M., et al. Matt: local flexibility aids protein multiple structure alignment. PLoS Comput. Biol. 2008;4:e10. doi: 10.1371/journal.pcbi.0040010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Ortiz A.R. Mammoth (matching molecular models obtained from theory): an automated method for model comparison. Protein Sci. 2002;11:2606–2621. doi: 10.1110/ps.0215902. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Pawitan Y. In All Likelihood: Statistical Modeling and Inference Using Likelihood. Oxford: Oxford Science Publications, Clarendon Press; 2001. [Google Scholar]
  26. Shapiro A., et al. A method for multiple superposition of structures. Acta Crystallogr. A. 1992;48:11–14. doi: 10.1107/s010876739100867x. [DOI] [PubMed] [Google Scholar]
  27. Shatsky M., et al. A method for simultaneous alignment of multiple protein structures. Proteins: Structure, Function, and Bioinformatics. 2004;56:143–56. doi: 10.1002/prot.10628. [DOI] [PubMed] [Google Scholar]
  28. Sutcliffe M., et al. Knowledge based modelling of homologous proteins, part I: three-dimensional frameworks derived from the simultaneous superposition of multiple structures. Protein Engi. 1987;1:377–384. doi: 10.1093/protein/1.5.377. [DOI] [PubMed] [Google Scholar]
  29. Theobald D.L., Wuttke D.S. Empirical Bayes hierarchical models for regularizing maximum likelihood estimation in the matrix Gaussian Procrustes problem. Proc. Natl. Acad. Sci. USA. 2006a;103:18521–18527. doi: 10.1073/pnas.0508445103. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Theobald D.L., Wuttke D.S. THESEUS: Maximum likelihood superpositioning and analysis of macromolecular structures. Bioinformatics. 2006b;22:2171–2172. doi: 10.1093/bioinformatics/btl332. [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Theobald D.L., Wuttke D.S. Accurate structural correlations from maximum likelihood superpositions. PLoS Comput. Biol. 2008;4:e43. doi: 10.1371/journal.pcbi.0040043. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Ye Y., Godzik A. Multiple flexible structure alignment using partial order graphs. Bioinformatics. 2005;21:2362–2369. doi: 10.1093/bioinformatics/bti353. [DOI] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Data

Articles from Bioinformatics are provided here courtesy of Oxford University Press

RESOURCES