Skip to main content
PLOS ONE logoLink to PLOS ONE
. 2012 May 25;7(5):e37493. doi: 10.1371/journal.pone.0037493

SAS-Pro: Simultaneous Residue Assignment and Structure Superposition for Protein Structure Alignment

Shweta B Shah 1, Nikolaos V Sahinidis 1,2,*
Editor: Vasilis J Promponas3
PMCID: PMC3360771  PMID: 22662161

Abstract

Protein structure alignment is the problem of determining an assignment between the amino-acid residues of two given proteins in a way that maximizes a measure of similarity between the two superimposed protein structures. By identifying geometric similarities, structure alignment algorithms provide critical insights into protein functional similarities. Existing structure alignment tools adopt a two-stage approach to structure alignment by decoupling and iterating between the assignment evaluation and structure superposition problems. We introduce a novel approach, SAS-Pro, which addresses the assignment evaluation and structure superposition simultaneously by formulating the alignment problem as a single bilevel optimization problem. The new formulation does not require the sequentiality constraints, thus generalizing the scope of the alignment methodology to include non-sequential protein alignments. We employ derivative-free optimization methodologies for searching for the global optimum of the highly nonlinear and non-differentiable RMSD function encountered in the proposed model. Alignments obtained with SAS-Pro have better RMSD values and larger lengths than those obtained from other alignment tools. For non-sequential alignment problems, SAS-Pro leads to alignments with high degree of similarity with known reference alignments. The source code of SAS-Pro is available for download at http://eudoxus.cheme.cmu.edu/saspro/SAS-Pro.html.

Introduction

Protein alignment is a problem that has gained tremendous attention in bioinformatics and proteomics due to its applicability in protein clustering, identifying homology relationships, and inferring structure-activity relationships about new and existing proteins. Proteins may be compared with each other through sequence alignment, where the similarities between the proteins are identified through similarities within their amino acid residue sequences. Research on protein sequence alignment has led to the development of numerous dynamic programming algorithms [1], [2] that are central to the BLAST code [3], [4], an alignment tool that radically transformed the bioinformatics field and found extensive applications in the biotechnology industry. However, structural information of proteins is difficult to infer from sequence information alone. While sequence similarity generally implies structural similarity between proteins, there exist a large number of protein pairs, including haemoglobin and myoglobin found in the human body, that are structurally similar but possess low sequence similarities (also known as twilight zone proteins). Physical comparisons of protein structures [5], [6] further demonstrate the need for direct comparison of 3D protein structures, also known as the protein structure alignment problem, which is the focus of this paper.

The aim of protein structure alignment is to determine structural similarities between a given pair of proteins so that further functional relationships between them may be identified. Thus, protein structure alignment tools are useful in systematic classification of proteins based on their functional and homology relationships. They may be further employed in predicting functional properties of newly discovered or newly synthesized proteins based on structural similarity with existing proteins. Protein structure alignment tools may also be used in the pharmaceutical industry to determine alternative options for existing drugs, or development of personalized medication. Further applications are also possible in the bio-catalysis and other protein-based product industries, where structure alignment tools could help in development of new protein-based products.

Over the past three decades, a variety of algorithms have been developed for finding protein structural alignments, which has turned out to be a very difficult computational problem. Kolodny et al. [7], Gibrat et al. [8], Lancia and Istrail [9], Singh and Brutlag [10], and Novotny et al. [11] provide descriptions and comparisons of the most frequently used structure alignment tools. These tools include DALI [12], CE [13], Structal [14], and SSM [15], all of which are known to provide good quality sequential alignments in low computational times. These tools have been instrumental in the development of various protein structure databases like FSSP [16], SCOP [17], CATH [18] and HOMSTRAD [19], which provide extensive information on classification of protein folds and domains. However, these alignment tools employ heuristic methods and provide only approximate alignments with no guarantee of optimality. This may lead to inaccurate conclusions about relationships between proteins. Thus, for accurate analysis of structural similarities, exact structure alignment tools are required. Lancia et al. [20], Caprara et al. [21], Xie and Sahinidis [22], and Wohlers et al. [23] have developed exact structure alignment algorithms based on contact maps representations of proteins. However, these exact algorithms are often computationally expensive and may not be practical for performing a large number of structure comparisons. The development of protein structure alignment tools that strike a balance between fully optimal alignments and low computational requirements remains a challenge.

Early protein structure comparisons were based on computing the root mean square deviation (RMSD) amongst two protein structures of known residue correspondence. In order to make such comparisons on a large-scale, McLachlan [24] and Sippl [25] developed algorithms for fast RMSD computations. These algorithms were then used to construct the first protein structure alignment tools [26][28] that were based on determining the optimal correspondence amongst individual residues of two proteins. The structure alignment problem is traditionally formulated as a continuous optimization problem, where similar protein substructures are superimposed onto each other to evaluate structural similarity through RMSD calculation. Here, the proteins are represented using the 3D coordinates of all the CInline graphic atoms representing the protein backbone. To obtain an alignment, one of the protein structures is rotated and translated to superimpose it onto the other protein structure, while optimizing a measure of similarity between them. Current structure alignment tools address the alignment optimization problem through a two-step process. In the first step, ‘assignment’ between amino-acid residues of two proteins is established using dynamic programming or heuristic methods. The objective here is to obtain the largest possible sequential alignment between the two proteins. In the second step, ‘superposition’ is achieved via computing optimal values for rotation-translation variables by various convex optimization techniques. In the superposition step, the RMSD value or a variant of the RMSD value is minimized. An iterative application of this process results in obtaining the final alignment. Structal [14], MAMMOTH [29], and alignment tools developed by Wu et al. [30], Andreani and Martinez [31], and Andreani et al. [32] are all based on this two-step approach. These approaches differ in the algorithms they use for assignment evaluation and structure superposition, as well as the choice of the objective functions in the two stages of alignment. Nearly all these methods determine the assignment by basic dynamic programming, and utilize different ways of building the similarity matrices based on different structural characteristics of the proteins. The exception is Andreani et al. [32], who determine the assignment of amino-acid residues by a heuristic method.

The two-step approach to structural alignment has clear computational advantages and results in very fast implementations. However, by decoupling the inter-dependence between the assignment and superposition problems, alignment tools based on this approach may produce suboptimal alignments. In this work, we present a novel approach, Simultaneous Alignment and Superposition of PROteins (SAS-Pro), that combines the evaluation of the assignment and the rotation-translation problems into a single bilevel optimization formulation. We further propose a combination of optimization algorithms, which we demonstrate leads to a practical computational approach for the solution of the proposed formulation.

By eliminating the residue-sequentiality constraints, the SAS-Pro approach is additionally capable of providing both sequential and non-sequential structure alignments. Most structure alignment tools developed in the past are designed to provide only sequential alignments between protein structures. However, there exist a multitude of similar protein pairs that exhibit non-sequential structure similarities. Thus, development of alignment tools to identify non-sequential similarities is important. This problem is only recently being addressed through the development of alignment tools such as STSA [33], and the introduction of non-sequential alignment capabilities in DALI [12] and SSM [15].

The remainder of this paper is structured as follows. After stating the protein structure alignment problem, the SAS-Pro optimization model is presented and a numerical solution algorithm is proposed. The implementation is subsequently discussed along with computational results, followed by conclusions.

Methods

The problem and a natural decomposition

Consider proteins A and B to be structurally aligned. Let Inline graphic represent the Inline graphic residue of protein A, and Inline graphic represent the Inline graphic residue of protein B. In addition, let Inline graphic and Inline graphic represent the 3D coordinates of the corresponding amino-acid residues. We seek to align amino-acid residues of A to amino-acid residues of B so that, when A is rotated-translated onto B, a similarity measure between the two proteins is minimized. The RMSD function will be used to determine the similarity between the protein structures and is defined as

graphic file with name pone.0037493.e008.jpg (1)

Here, Inline graphic is a binary variable that equals Inline graphic when Inline graphic is aligned to Inline graphic and Inline graphic otherwise, and Inline graphic represents the rotation-translation transformation applied to protein A. The rotation-translation transformation is characterized by the three components of the translation vector and the Inline graphic, and Inline graphic angles of rotations about the Inline graphic, Inline graphic and Inline graphic axes, respectively.

The problem of minimizing the RMSD may be represented as the following mixed-integer nonlinear optimization program:

graphic file with name pone.0037493.e020.jpg (2)
graphic file with name pone.0037493.e021.jpg (3)
graphic file with name pone.0037493.e022.jpg (4)
graphic file with name pone.0037493.e023.jpg (5)

Here, the parameter Inline graphic in Constraint (4) is the minimum number of residues that must be aligned to ensure that the global optimum of the model attains a non-trivial value. Constraints (2) and (3) ensure that no more than one amino-acid residue of protein A is aligned with an amino-acid residue of protein B and vice versa. Constraint (5) enforces the binary nature of the assignment variables Inline graphic.

Two-stage approach

A two-stage solution approach employed by existing alignment tools decouples the effects of Inline graphic and Inline graphic variables and evaluates the effect of the assignment variables Inline graphic and rotation-translation variables Inline graphic separately. The two-stage optimization problem may be viewed as follows:

graphic file with name pone.0037493.e030.jpg
graphic file with name pone.0037493.e031.jpg
graphic file with name pone.0037493.e032.jpg
graphic file with name pone.0037493.e033.jpg
graphic file with name pone.0037493.e034.jpg (6)
graphic file with name pone.0037493.e035.jpg
graphic file with name pone.0037493.e036.jpg
graphic file with name pone.0037493.e037.jpg

where Inline graphic and Inline graphic are optimal values of Inline graphic and Inline graphic, respectively, obtained in Stage 1 and Stage 2 of an iteration of the two-stage optimization problem. Constraint (6) in Stage 1 is imposed implicitly in the model by solution procedures utilized to solve for Inline graphic.

In typical approaches, values for the assignment variables Inline graphic are determined by heuristic methods and dynamic programming techniques. The function Inline graphic is thus selected as the dynamic programming objective function based on different similarity matrices designed for the alignment tool. The similarity matrices currently in use are based on structural features of the proteins, including inter-residue distances [14], [31], bond angles [29], and radii of fragment curvature [30]. These heuristic methods and dynamic programming techniques do not guarantee optimality of the alignment obtained with respect to the objective of Stage 2, the RMSD value. Thus, the final alignment obtained from the iterative procedure is not guaranteed to be globally optimal, and is known to be dependent on the initialization of the process [14], [31], [32]. Hence, the two-stage formulation may provide only a feasible solution of the MINLP and not necessarily a global optimum. Global optimality cannot be guaranteed unless the MINLP is somehow solved directly.

SAS-Pro model

The SAS-Pro model reformulates the MINLP model into a single bilevel optimization problem. For any given Inline graphic, the function Inline graphic may be defined as

graphic file with name pone.0037493.e047.jpg

The master problem of the SAS-Pro model optimizes over the solution of the subproblem Inline graphic. The bilevel SAS-Pro model is as follows:

graphic file with name pone.0037493.e049.jpg
graphic file with name pone.0037493.e050.jpg
graphic file with name pone.0037493.e051.jpg
graphic file with name pone.0037493.e052.jpg
graphic file with name pone.0037493.e053.jpg
graphic file with name pone.0037493.e054.jpg
graphic file with name pone.0037493.e055.jpg
graphic file with name pone.0037493.e056.jpg
graphic file with name pone.0037493.e057.jpg

The master problem objective function Inline graphic is in the space of the Inline graphic variables alone. Yet, it is trivial to see that any assignment/superposition feasible to the MINLP is also feasible to the SAS-Pro master problem. Hence, our reformulation maintains optimality.

Evaluation of the function Inline graphic involves solving the subproblem and determining the optimal assignment variables Inline graphic, for given values of Inline graphic and parameter Inline graphic. Our key observation is that, for a given value of Inline graphic, the subproblem can be reformulated as the following k-cardinality linear assignment problem (k-LAP):

graphic file with name pone.0037493.e065.jpg (7)
graphic file with name pone.0037493.e066.jpg
graphic file with name pone.0037493.e067.jpg
graphic file with name pone.0037493.e068.jpg (8)
graphic file with name pone.0037493.e069.jpg

where Inline graphic, Inline graphic, Inline graphic. A highly efficient polynomial-time algorithm, SKAP [34], has been developed to solve the k-LAP problem and can be readily utilized in this context. The solution to the k-LAP problem will provide an assignment of exactly Inline graphic amino-acid residues, as constrained in equation (8). The numerical value of Inline graphic can be obtained from the objective value in equation (7) of the k-LAP problem as Inline graphic. The k-LAP model does not include any sequence preserving constraints. Thus, the SAS-Pro model is designed to provide an optimal assignment and structure superposition of protein structures for specified values of the parameter Inline graphic, with no sequence-preserving constraints. We later show how to recover a sequential alignment, if desired, from the SAS-Pro alignment.

Kolodny and Linial [35] also present a bilevel approach to structure alignment by utilizing the SAS [36] similarity measure as the objective function in the master problem, as opposed to the RMSD value. They obtain values for the assignment variables Inline graphic through a dynamic programming methodology and determine the rotation-translation variables by enumeration over a grid in the Inline graphic space. Our approach differs from their approach in three major aspects. First, the objective function used by Kolodny and Linial in the subproblem to determine the assignment variables Inline graphic (dynamic programming based objective) differs from their master problem objective (SAS score). We use the same objective in both the subproblem as well as the master problem of the SAS-Pro model, which guarantees that a SAS-Pro optimal solution is optimal also for the original MINLP problem. Second, we utilize efficient search techniques to solve the master problem and obtain near-optimal rotation-translation variables, as opposed to the expensive enumeration approach used by Kolodny and Linial. Finally, our approach has the added capability of providing both sequential and non-sequential structure alignments for protein pairs.

As mentioned above, an optimal solution of the MINLP is feasible to our reformulation. In order for an optimal solution to be identified, suitable algorithms must be used to solve the master problem to global optimality. Indeed, there exist derivative-free optimization (DFO) algorithms that can achieve this goal based on dense sampling of the domain [37]. However, in the search of the most computationally efficient approach, in the next section we will also evaluate local search techniques for solving the master problem. With the same goal in mind, we will introduce a heuristic approach for determining the optimal parameter Inline graphic as well as for curtailing the number of degrees of freedom for the alignment problem.

Algorithm

Derivative-free optimization

The landscape of the RMSD function with varying values of the rotation angles Inline graphic and Inline graphic is presented in the contour plot of Figure 1 for proteins 1B00 and 1DBW. As seen in this figure, the objective function in the SAS-Pro model is highly multi-modal and nonlinear. This multi-modality can be addressed by optimization techniques that span the entire search space of the problem in the search for global optima. Furthermore, an explicit algebraic form for the SRMSD objective function for the master problem is not available, thus making it difficult to utilize derivative-based optimization methods. Thus, we opted to employ DFO techniques in order to solve the SAS-Pro model.

Figure 1. Contour plot of the landscape of the RMSD function for 1B00 and 1DBW proteins in the Inline graphic rotation angles plane.

Figure 1

We performed extensive computational analysis with 28 different DFO solvers, based on a variety of techniques that included direct search, pattern search, surrogate management frameworks, domain partitioning methods, local search, global search, deterministic and stochastic algorithms [37]. Our experiments indicated that the derivative-free solver SNOBFIT [38] provides the best performance for a small number of function evaluations. This observation is consistent with the results reported in [37]. Keeping the number of function evaluations low was dictated by our desire to design an algorithm that would take no more than a few CPU minutes on a standard computer workstation for the alignment of protein pairs that are routinely analyzed nowadays.

Our interface to SNOBFIT is based on the ‘mydfo’ interface developed by Rios [39]. We have limited SNOBFIT to Inline graphic function evaluations for each value of the parameter Inline graphic. Every RMSD function evaluation for a given value of Inline graphic involves solving the k-LAP problem using the SKAP code developed by Dell'Amico and Martello [34].

Choice of parameter Inline graphic

The solution to the SAS-Pro model is dependent on the parameter Inline graphic. Different values of Inline graphic may lead to very different optimal alignments. The best alignment is found when the value of Inline graphic is close to the number of biologically relevant residue matches. It is therefore important to determine the right value of the parameter Inline graphic. Furthermore, it is important for an implementation to select a value for this parameter automatically, i.e., without requiring the user to specify it. This is achieved here as follows.

Proteins with high level of similarity have a large length of alignment, usually corresponding to 85% or more of size of the smaller protein. Hence, the number of biologically relevant residues matches is expected to be between to 85% to 100% of the size of the smaller protein. To identify the best value for Inline graphic, we systematically vary the value of Inline graphic from 100% to 85% of the size of the smaller protein, until an alignment with a good similarity measure cutoff is obtained. The similarity measure used here is SASInline graphic, a modified version of the SAS score, that is further discussed later in this paper. In our implementation, for a given structure alignment problem, we evaluate structure alignments for different values of Inline graphic and select the one for which an SASInline graphic score of less than 4 Å is obtained.

Lower levels of similarities between proteins may arise while attempting to obtain a match of a smaller substructure from one protein with other proteins. In order to use SAS-Pro in such a context, it is advisable to isolate the relevant substructure in question before comparing with larger proteins. This will increase the chances of obtaining a suitable alignment within the limits of choice of the parameter Inline graphic.

Reducing the number of degrees of freedom

The solution to the SAS-Pro model involves determining the optimal values of both the assignment variables Inline graphic as well as the rotation-translation variables Inline graphic. The assignment variables Inline graphic are obtained as an exact solution to the SAS-Pro subproblem. Thus, the only degrees of freedom available in the SAS-Pro master problem are the three translation vector components Inline graphic, Inline graphic, and Inline graphic along the X, Y and Z axes, respectively, and the three rotation angles Inline graphic, and Inline graphic about the X, Y and Z axes, respectively.

In the course of our computational experimentations, we observed that, for proteins with similar sizes, a good approximation of the translation vectors is very often obtained if the centroids of the two protein structures are required to coincide. Thus, while comparing proteins of similar sizes, the number of degrees of freedom for optimization may be reduced to only the three rotation angles. As demonstrated in [37], for a collection of over 500 test problems, problems with up to three or four variables were almost always solved to global optimality by a variety of DFO algorithms. Thus, while solving the SAS-Pro optimization problem, the small number of degrees of freedom provides a computational advantage in terms of obtaining globally optimal structure alignments.

For structural comparison of proteins with different sizes, the SAS-Pro algorithm offers an option to utilize all six degrees of freedom. In this case, in order to maintain solution quality of the DFO solvers, we found it necessary to increase the number of function evaluations to 1000 for each value of Inline graphic considered.

Extracting sequential alignments

The solution to the SAS-Pro model is usually a non-sequential structure alignment between the two proteins. However, a sequential alignment is easy to extract from the non-sequential alignment obtained from the SAS-Pro algorithm in a post-processing step. A dynamic programming algorithm was designed to identify the largest sequential alignment amongst the aligned residues provided by SAS-Pro. This algorithm sequentially evaluates the largest length of sequential alignment terminating at residue Inline graphic of protein A and stores it in the vector Inline graphic. The algorithm maintains a pointer to the residue before Inline graphic in the sequential alignment in the vector Inline graphic. Inline graphic denotes the residue Inline graphic of protein B which is aligned to Inline graphic. The largest value of Inline graphic provides the length of the largest sequential alignment terminating at residue Inline graphic. Backtracking the residues from this value of Inline graphic using the vector Inline graphic provides the corresponding alignment. A pseudo-code of the algorithm is presented below:

graphic file with name pone.0037493.e118.jpg

Similarity measure

For sequential protein alignments, where the sequence of the amino acid residues is preserved in the alignment, many suitable similarity measures, such as the Structure Alignment Score SAS [36] and the Similarity Index SI [40], have been defined. These measures are based on weighted ratios of the RMSD value and the length of alignment produced by the algorithm:

graphic file with name pone.0037493.e119.jpg (9)
graphic file with name pone.0037493.e120.jpg (10)

Here, Inline graphic and Inline graphic represent the lengths of the proteins A and B, and Inline graphic represents the number of sequentially aligned residues between the two proteins. For non-sequential structure alignments, the length of alignment is not properly defined and hence cannot be used to calculate the SAS and SI measures. We introduce a new measure of length of alignment, the total fragment length (Inline graphic), to extend the definition of the SAS similarity measure to non-sequential structure alignments. Following earlier works [41][43], the total fragment length is defined as the sum of lengths of aligned continuous fragments of five or more residues. Sequentiality of the amino-acid residues in the fragment is not required, thus providing for a measure of the length of alignment that is applicable to both sequential and non-sequential structure alignment.

The similarity between proteins is then determined using the proposed SASInline graphic measure, which is defined as

graphic file with name pone.0037493.e126.jpg (11)

This measure reduces to the SAS measure for the case of sequential structure alignments.

The best non-sequential structure alignment obtained from the SAS-Pro algorithm may include multiple local small-length matches as opposed to a single large global alignment. This disorder of the alignment can be measured by the value of the fragment length. A disordered alignment is expected to have a small fragment length, while a biologically relevant ordered alignment is expected to have a large fragment length, thus providing lower SASInline graphic values for biologically relevant alignments. Hence, the best alignment for a given pair of proteins is expected to be one with the lowest SASInline graphic score.

Results

We performed computational experiments based on three data sets:

  • the Sokol data set [44], which is a set of 9 small size proteins with proteins from three different fold families,

  • the Skolnick data set [20], which is a set of 40 large globular proteins from four different fold families from the SCOP data base, and

  • the RIPC data set [45], which is a set of 23 complex structure alignment problems.

An all-to-all pairwise alignment for all the proteins in the Sokol and Skolnick data sets was obtained, resulting in 850 pairwise alignment problems with 222 similar protein pairs and 628 dissimilar protein pairs. The Sokol data set includes 20 similar protein pairs that align sequentially. The Skolnick data set consists of proteins from four fold families: a) Flavodoxin-like fold CheY-related, b) Plastocyanin, c) TIM beta/alpha-barrel, and d) Ferratin. Protein pairs within the same fold family are termed as similar pairs and exhibit sequential similarity. The RIPC data set consists of 23 protein alignment problems for which a biologically relevant reference alignment is available. These 23 alignment problems are complex and exhibit non-sequential structure similarities. The complexity of these alignments arises from repetitions, insertions/deletions, permutations, and conformational changes between the protein pairs that are not easily handled by alignment algorithms. All data sets are provided at http://eudoxus.cheme.cmu.edu/saspro/SAS-Pro.html.

In all tests, the typical computing time requirements for SAS-Pro were around 1 CPU minute per protein pair on an Intel Quad Core 2.83 GHz processor with 6 GB RAM, while providing sequential and non-sequential alignments with exceptional classification ability.

Sequential structure alignments

The Sokol and Skolnick data sets were analyzed to evaluate the performance of SAS-Pro in obtaining sequential alignment problems. To obtain sequential alignments from the non-sequential alignments provided by SAS-Pro, the procedure described in the subsection entitled “Extracting sequential alignments” was used. Alignments were compared using the RMSD values as well as the geometric similarity measures SI and SAS.

A comparison of the RMSD, SI, and SAS values obtained by SAS-Pro for similar and dissimilar proteins is presented in Table 1. For protein pairs within the same fold family, alignments with low RMSD, SI, and SAS values were obtained. For pairs from different fold families, the values of RMSD, SI, and SAS were comparatively higher than the corresponding values for similar proteins. In addition, the alignments obtained from the SAS-Pro alignment tool were near-sequential for similar protein pairs and were 96% in agreement with known optimal alignments between the proteins that were obtained from the exact structure alignment tool CMOS [22]. These optimal alignments contain both large fragments of aligned residues as well as a few isolated aligned residues. SAS-Pro matches the large fragments of aligned residues with these optimal alignments exactly. However, the alignments may differ in isolated residue matches, that are not of biological consequence, resulting in an average of 96% agreement between the alignments between SAS-Pro and CMOS.

Table 1. Average (standard deviation) RMSD value, SI score, SAS score, and match with reference alignments for the Sokol and Skolnick data sets for similar and dissimilar protein pairs.

Sokol set Skolnick set
Similar Dissimilar Similar Dissimilar
RMSD 0.60 (0.4) 2.9 (1.45) 1.72 (0.78) 3.94 (0.6)
SI 1.17 (0.4) 7.04 (1.45) 3.15 (1.23) 9.77 (3.9)
SAS 1.61 (0.7) 7.37 (1.78) 2.19 (0.89) 8.51 (2.9)
% agreement with optimal alignment 96 N.A. 96 N.A.

The alignments obtained from SAS-Pro were also compared with those obtained from the CE [13], SSM [15], and STSA [33] alignment tools. Raw comparison results for SAS-Pro and other methods are provided in File S1.zip of the Supporting Information. The results are summarized in Table 2. The SAS-Pro approach provided alignments with better or equal RMSD for over 59 to 69% of the similar structures. For some problems, SAS-Pro was able to provide RMSD, SI, and SAS scores which were smaller by more than 4 Å than those obtained from CE. Moreover, the RMSD values of more than three quarters of the remaining problems were observed to exceed those in CE and SSM by only a single standard deviation (0.5 Å), while preserving a 96% similarity with the corresponding sequential structure alignments. Consequently, the corresponding SI and SAS scores for these problems were also within a single standard deviation of those from CE and SSM. t-test results for SAS-Pro, CE, SSM, and STSA show that these algorithms distinguish between similar and dissimilar protein pairs with the same high significance (t-test value Inline graphic). However, SAS-Pro has lower mean and standard deviation values for the similarity measures, resulting in better quality solutions with an average t-test significance value of 0.5.

Table 2. Comparison of SAS-Pro with CE, SSM, and STSA for the similar protein pairs of the Sokol and Skolnick data sets using RMSD, SI, and SAS measures.

% Problems where
SAS-Pro is better SAS-Pro is at par
Solver RMSD SI SAS RMSD SI SAS
CE 57 51 51 12 12 12
SSM 47 36 36 12 12 12
STSA 44 40 40 21 21 21
Average (standard deviation) improvement obtained by SAS-Pro (Å)
Solver RMSD SI SAS RMSD SI SAS
CE 0.45 (0.46) 0.3 (0.41) 0.3 (0.42) N.A. N.A. N.A.
SSM 0.26 (0.2) 0.2 (0.12) 0.16 (0.1) N.A. N.A. N.A.
STSA 0.4 (0.15) 0.4 (0.15) 0.21 (0.1) N.A. N.A. N.A.

The table presents the percentage of problems where SAS-Pro performed better than, or at par with CE, SSM, and STSA. In addition, the table presents the average improvement in the RMSD, SI, SAS scores for these problems when SAS-Pro is used instead of other solvers.

The Sokol and Skolnick data sets together include 222 similar protein pairs and 628 dissimilar protein pairs. A classification of these 850 problems into similar and dissimilar pairs was sought based on the SAS scores of the alignments obtained. The CE, SSM, and SAS-Pro alignment tools provided exact classification of these protein pairs. The STSA algorithm, however, produced very short alignments for 5 of the similar pairs, leading to an imperfect classification.

Figure 2 shows the distributions of the SAS values obtained for similar and dissimilar protein pairs for the Skolnick data set by SAS-Pro. The distributions for the similar and dissimilar proteins were observed to be completely disjoint, with lower SAS scores for similar proteins and higher SAS scores for dissimilar proteins. A SAS score cutoff of 4 Å produced a perfect classification of the alignment problems into similar and dissimilar protein pairs. Based on this observation, a termination criterion for the SAS-Pro code was implemented. For computations reported in the sequel, SAS-Pro was designed to terminate if (a) an alignment with a SAS score of 4 Å or less is obtained, or (b) all values of Inline graphic between 85% and 100% of the size of the smaller protein are explored. In either case, the best alignment and the corresponding RMSD, SAS score, and fragment length of the alignment are returned by the software.

Figure 2. Distribution of SAS values obtained by SAS-Pro for similar and dissimilar proteins in the Skolnick data set.

Figure 2

The means (standard deviations) for the similar and dissimilar protein pairs are 2.19 (0.89) and 8.51 (2.9) Å, respectively.

Non-sequential structure alignments

We performed a computational study to determine the quality of SAS-Pro's non-sequential structure alignments utilizing the RIPC data set and the non-sequential alignment problems presented by Salem and Zaki [33]. Salem and Zaki [33] provided two examples of non-sequential structure alignments for which their alignment tool, STSA, performs better than other structure alignment tools. We performed an alignment of the corresponding two protein pairs, 2LH3:A with 2HPD:A, and 1FSF:A with 1IG0:A, and obtained better alignments with SAS-Pro than STSA for both cases. For the 2LH3:A and 2HPD:A proteins, SAS-Pro provided an alignment with length 126 and RMSD 3.17 Å, as compared to STSA's alignment of length 117 and RMSD 3.27 Å. For the 1FSF:A and 1IG0:A proteins, SAS-Pro obtained an alignment with length 117 and RMSD 2.68 Å, as compared to STSA's alignment of length 104 and RMSD 5.4 Å. We present a quantitative comparison of the SAS-Pro alignment between the 2LH3:A and 2HPD:A proteins and other solvers in Table 3. As the results in this table demonstrate, SAS-Pro provides an RMSD in the same ball-park range as most other tools but with larger alignment length, thus providing a superior structure alignment as the SASInline graphic values indicate.

Table 3. Comparison of performance of alignment tools for aligning 2LH3:A and 2HPD:A proteins.

Alignment tool RMSD (Å) NInline graphic SASInline graphic
SAS-Pro 3.17 126 2.5
SARF2 3.05 108 2.8
STSA 3.37 117 2.9
STRUCTAL 2.27 56 4
CE 4.05 91 4.4
DALI 4.8 87 5.5

(All results, except SAS-Pro, taken from [33].)

We next present results from a computational study with the 23 protein pairs in the RIPC data set. The 3D coordinates of the C-alpha atoms for the SAS-Pro alignments for the 23 pairs are provided in File S2.zip of the Supporting Information. For this test set, SAS-Pro provided alignments which are 30% to 100% in agreement with the reference alignments. The mean agreement of SAS-Pro is 62% and the median is 70%. SAS-Pro provides alignments with greater mean and median agreements than CE, DALI, FATCAT, MATRAS, CA, SHEBA, SARF, and LGA. The corresponding box and whisker plot of percentage agreement with reference alignments is shown in Figure 3. STSA provides alignments with better mean and median agreements with reference alignments than SAS-Pro. However, SAS-Pro provides excellent quality alignments with 100% agreement with the reference alignments for eight problems, while STSA provides alignments in 100% agreement with reference alignments for only four problems. Amongst the remaining alignment methods, only DALI, FATCAT, and MATRAS provide some (fewer than four) alignments that are in 100% agreement with the reference alignments. Even though STSA provides non-sequential alignments, it is bound by the sequentiality and choice of the five-residue fragments it utilizes. SAS-Pro is more flexible in allowing non-sequentiality, thus resulting in better alignments than STSA and other solvers for several problems. As suggested by Mayr et al. [45], while the provided reference alignments are biologically relevant, multiple alternative alignments that result in equivalent structurally optimal solutions may exist, especially for proteins with conformational variability and multiple insertions/deletions. In these cases, results obtained from different alignment tools may differ considerably, where one of the alignments matches with the provided reference alignment while others provide alternative optimal alignments.

Figure 3. Box and whisker plot for the performance of different alignment tools for the RIPC data set.

Figure 3

The red line represents the mean and the dot represents the median of the box. (All results, except for SAS-Pro and CE, were taken from [33]).

The eight alignments for which SAS-Pro is in complete agreement with reference structures are shown in Figure 4. These eight protein pairs represent alignment problems spanning all four types of alignment challenges encountered in the RIPC data set, namely, repetitions, insertions/deletions, permutations, and conformational changes. The protein pairs 1gbg-1ovw (Figure 4(a)) and 1jj7-1vga (Figure 4(b)) present alignments with large requirements of insertions/deletions, not handled by all alignment tools. Specifically, 1gbg-1ovw are glucan hydrolase proteins with Inline graphic-sandwich structure, while proteins 1jj7-1vga are P-loop containing NTP hydrolases that vary in the number of Inline graphic-strands in the central region. Thus, these protein alignment problems require a large number of insertions/deletions for a good alignment. Mayr et al. [45] indicate that different alignment tools provide very different alignments for these proteins, usually matching only the N-terminal ADP binding site of 1jj7-1vga proteins correctly. SAS-Pro places no limit on the number of insertions/deletions, resulting in a very good alignment for these proteins. Protein pairs 1nkl-1qdm (Figure 4(c)), 1qas-1rsy (Figure 4(d)), 1nls-2bqp (Figure 4(e)), and 1qq5-3chy (Figure 4(f)) are examples of proteins with permutations. The 1nkl-1qdm, 1qas-1rsy, and 1qq5-3chy proteins consist of multiple Inline graphic-helices, which do not align sequentially. Most structure alignment tools mentioned above align the Inline graphic-helices sequentially, resulting in incorrect structure alignments for these proteins. SAS-Pro correctly aligns the right Inline graphic-helices with each other, producing biologically relevant alignments. The 1nls-2bqp proteins have a Inline graphic-sandwich structure, where 1nls is posttranslationally cleaved, resulting in different N- and C-terminals. As a result, in the 1nls-2bqp protein pair, the N-terminus of one protein aligns with the C-terminus of the other protein and vice versa. Most alignment codes match only the N-terminus half of 1nls with the C-terminus half of 1bqp. Additionally, most alignment methods align only five out of the six reference alignment points, while SAS-Pro aligns the entire protein accurately. Finally, protein pairs 1gsa-2hgs (Figure 4(g)) and 1l5b-1l5e (Figure 4(h)) present conformational changes which cause slight bends in the structures. The 1gsa-2hgs proteins both contain the three-layered Inline graphic-Inline graphic-Inline graphic sandwich structure, similar to the Rossmann fold, while 1l5b-1l5e are both HIV-inactivating proteins with Inline graphic-roll structures. SAS-Pro was able to provide the correct structural alignment with 100% match with the reference.

Figure 4. Alignments obtained by SAS-Pro for the RIPC data set.

Figure 4

These alignments are in 100% agreement with the reference alignments [45].

There are three problems in the RIPC data set for which the agreement of the SAS-Pro alignment with the reference in less than 50%. These three problems are from the permutation class of alignments for which, as Mayr et al. [45] suggest, biologically relevant alternative alignments may exist. Hence, it is likely that SAS-Pro's performance may be even better than what the results of this section suggest.

Mayr et al. [45] and Salem and Zaki [33] have discussed eight protein pairs from the RIPC data set that are difficult to align. Amongst these, Salem and Zaki [33] reported the 1nkl-1qdm protein pair and the 1qq5-3chy protein pair, for which most alignment tools provided a 0% match with the reference alignment. For both of these pairs, SAS-Pro and STSA provided a 100% match with the reference alignment. Amongst the remaining six protein pairs, SAS-Pro provided high quality alignments with 100% agreement with the reference for three pairs and over 50% agreement with the reference for the remaining three pairs.

Discussion

In this paper, we presented a novel formulation of the protein structure alignment problem as a single bilevel optimization problem that addresses the assignment of amino acid residues and the structural superposition of proteins simultaneously. We employed derivative-free optimization techniques to deal with the multi-modality and non-differentiability of the RMSD function in the proposed formulation. The proposed structure alignment methodology is capable of providing both sequential and non-sequential alignments.

Our computational experiments demonstrate that the SAS-Pro model captures similarities within proteins accurately and provides alignments with lower RMSD values and larger lengths of alignments as compared to CE, SSM, and STSA for a majority of problems in the Sokol and Skolnick data sets. Moreover, SAS-Pro exhibits very good performance for the RIPC data set, for which it provided alignments with 100% agreement with the reference for a large number of protein pairs.

While the present methodology addresses both sequential and non-sequential alignments, future work should investigate the introduction of additional degrees of freedom (bond rotation) for the development of a more comprehensive structure alignment tool.

Supporting Information

File S1.

Results with the Skolnick-Sokol dataset.

(ZIP)

File S2.

Results with the RIPC dataset.

(ZIP)

Footnotes

Competing Interests: The authors have declared that no competing interests exist.

Funding: The authors have no support or funding to report.

References

  • 1.Needleman S, Wunsch C. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970;48:443–453. doi: 10.1016/0022-2836(70)90057-4. [DOI] [PubMed] [Google Scholar]
  • 2.Smith T, Waterman M. Identification of common molecular subsequences. J Mol Biol. 1981;147:195–197. doi: 10.1016/0022-2836(81)90087-5. [DOI] [PubMed] [Google Scholar]
  • 3.Altschul SF, Gish W, Miller W. Basic local alignment search tool. Journal of Molecular Biology. 1990;215:403–410. doi: 10.1016/S0022-2836(05)80360-2. [DOI] [PubMed] [Google Scholar]
  • 4.Altschul SF, Schaffer AA. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Freer ST, Kraut J, Robertus JD, Wright HT, Xuong NH. Chymotrypsinogen: 2.5-A Crystal Structure, Comparison with α-Chymotrypsin, and Implications for Zymogen Activation. Biochemistry. 1970;9:1997–2009. doi: 10.1021/bi00811a022. [DOI] [PubMed] [Google Scholar]
  • 6.Huber R, Epp O, Steigemann W, Formanek H. The Atomic Structure of Ery-throcruorin in the Light of the Chemical Sequence and its Comparison with Myoglobin. European journal of biochemistry. 1971;19:42–50. doi: 10.1111/j.1432-1033.1971.tb01285.x. [DOI] [PubMed] [Google Scholar]
  • 7.Kolodny R, Koehl P, Levitt M. Comprehensive evaluation of protein structure alignment methods: Scoring by geometric measures. Journal of Molecular Biology. 2005;346:1173–1188. doi: 10.1016/j.jmb.2004.12.032. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Gibrat JF, Madej T, Bryant SH. Surprising similarities in structure comparison. Current Opinion in Structural Biology. 1996:377–385. doi: 10.1016/s0959-440x(96)80058-3. [DOI] [PubMed] [Google Scholar]
  • 9.Lancia G, Istrail S. Protein structure comparison: Algorithms and applications. Lecture Notes in Computer Science. 2004;2666/2004:1–33. [Google Scholar]
  • 10.Singh AP, Brutlag DL. Protein structure alignment: A comparison of methods. Nature Structural Biology 2001 [Google Scholar]
  • 11.Novotny M, Madsen D, Kleywegt GJ. Evaluation of protein fold comparison servers. Proteins: Structure, Function and Bioinformatics. 2004;54:260–270. doi: 10.1002/prot.10553. [DOI] [PubMed] [Google Scholar]
  • 12.Holm L, Park J. Dalilite workbench for protein structure comparison. Bioinformatics Applications Note. 2000;16:566–567. doi: 10.1093/bioinformatics/16.6.566. [DOI] [PubMed] [Google Scholar]
  • 13.Shindyalov I, Bourne P. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Engineering. 1998;11:739–747. doi: 10.1093/protein/11.9.739. [DOI] [PubMed] [Google Scholar]
  • 14.Gerstein M, Levitt M. Using iterative dynamic programming to obtain accurate pairwise and multiple alignments of protein structures. In: States D, Agarwal P, Gaaster-land T, Hunter L, Smith R, editors. Proceedings of international conference on intelligent systems in molecular biology. AAAI Press; 1996. pp. 59–67. [PubMed] [Google Scholar]
  • 15.Krissinel E, Henrick K. Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions. Acta Crystallographica Section D: Biological Crystallography. 2004;60:2256–2268. doi: 10.1107/S0907444904026460. [DOI] [PubMed] [Google Scholar]
  • 16.Holm L, Sander C. The fssp database: Fold classification based on structure- structure alignment of proteins. Nucleic Acids Research. 1996;24:206–209. doi: 10.1093/nar/24.1.206. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Murzin AG, Brenner SE, Hubbard T, Chothia C. Scop: A structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology. 1995;247:536–540. doi: 10.1006/jmbi.1995.0159. [DOI] [PubMed] [Google Scholar]
  • 18.Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, et al. CATH–A hierarchic classification of protein domain structures. Structure. 1997;5:1093–1108. doi: 10.1016/s0969-2126(97)00260-8. [DOI] [PubMed] [Google Scholar]
  • 19.Mizuguchi K, Deane CM, Blundell TL, Overington JP. HOMSTRAD: A database of protein structure alignments for homologous families. Protein Science. 1998;7:2469–2471. doi: 10.1002/pro.5560071126. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Lancia G, Carr R, Walenz B, Istrail S. Proceedings of the fifth annual international conference on Computational biology. Montreal, Quebec, Canada: 2001. 101 optimal PDB structure alignments: A Branch-and-cut algorithm for the maximum contact map overlap problem. pp. 193–202. [Google Scholar]
  • 21.Caprara A, Carr R, Istrail S, Lancia G, Walenz B. 1001 optimal PDB structure alignments: Integer programming methods for finding the maximum contact map overlap. Journal of Computational Biology. 2004;11:27–52. doi: 10.1089/106652704773416876. [DOI] [PubMed] [Google Scholar]
  • 22.Xie W, Sahinidis NV. A reduction-based exact algorithm for the contact map overlap problem. Journal of Computational Biology. 2007;14:637–654. doi: 10.1089/cmb.2007.R007. [DOI] [PubMed] [Google Scholar]
  • 23.Wohlers I, Domingues FS, Klau GW. Towards optimal alignment of protein structure distance matrices. Bioinformatics. 2010;26:2273–2280. doi: 10.1093/bioinformatics/btq420. [DOI] [PubMed] [Google Scholar]
  • 24.McLachlan AD. Rapid comparison of protein structures. Acta Cryst. 1982;A38:871–873. [Google Scholar]
  • 25.Sippl MJ. On the problem of comparing proteins: Development and applications of a new method for the assessment of structural similarities and polypeptide conformations. Journal of Molecular Biology. 1982;156:359–388. doi: 10.1016/0022-2836(82)90334-5. [DOI] [PubMed] [Google Scholar]
  • 26.Alexandrov NN, Fischer D. Analysis of topological and nontopological structural similarities in the PDB: new examples with old structures. Proteins. 1996 doi: 10.1002/(SICI)1097-0134(199607)25:3<354::AID-PROT7>3.0.CO;2-F. [DOI] [PubMed] [Google Scholar]
  • 27.Alexandrov NN, Takahashi K, Go N. Common spatial arrangements of backbone fragments in homologous and non-homologous proteins. Journal of Molecular Biology. 1992;225:5–9. doi: 10.1016/0022-2836(92)91021-g. [DOI] [PubMed] [Google Scholar]
  • 28.Lackner P, Koppensteiner WA, Sippl MJ, Domingues FS. ProSup: A refined tool for protein structure alignment. Protein Engineering. 2000;11:745–752. doi: 10.1093/protein/13.11.745. [DOI] [PubMed] [Google Scholar]
  • 29.Ortiz AR, Strauss CEM, Olmea O. Mammoth (matching molecular models obtained from theory): An automated method for model comparison. Protein Science. 2002;11:2606–2612. doi: 10.1110/ps.0215902. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Wu TD, Schmidler SC, Hastie T, Brutlag DL. Modeling and superposition of multiple protein structures using affine transformations: Analysis of the globins. Pac Sym on Bio. 1998:509–520. [PubMed] [Google Scholar]
  • 31.Andreani R, Martinez JM. Trust-region superposition methods for protein align- ment. IMA Journal of Numerical Analysis. 2008;28:690–710. [Google Scholar]
  • 32.Andreani R, Martnez JM, Martnez L, Yano F. Continuous optimization methods for structure alignments. Math Programming Ser B. 2008;112:93–124. [Google Scholar]
  • 33.Salem S, Zaki MJ, Bystroff C. Iterative non-sequential protein structural alignment. Journal of Bioinformatics and Computational Biology. 2009;7:571–596. doi: 10.1142/s0219720009004205. [DOI] [PubMed] [Google Scholar]
  • 34.Dell'Amico M, Lodi A, Martello S. Efficient algorithms and codes for k-cardinality assignment problems. Dis App Math. 2001;110:25–40. [Google Scholar]
  • 35.Kolodny R, Linial N. Approximate protein structural alignment in polynomial time. Proc Natl Acad Sci USA. 2004;101:12201–12206. doi: 10.1073/pnas.0404383101. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Subbiah S, Laurents DV, Levitt M. Structural similarity of DNA-binding domains of bacteriophage repressors and the globin core. Current Biology. 1993;3:141–148. doi: 10.1016/0960-9822(93)90255-m. [DOI] [PubMed] [Google Scholar]
  • 37.Rios LM, Sahinidis NV. Derivative-free optimization: A review of algorithms and comparison of software implementations. 2011. Working Paper, http://thales.cheme.cmu.edu/dfo.
  • 38.Huyer W, Neumaier A. SNOBFIT–Stable noisy optimization by branch and fit. ACM Transactions on Mathematical Software. 2008;35:1–25. [Google Scholar]
  • 39.Rios LM. Algorithms for derivative-free optimization. 2009. Ph.D. thesis, Department of Industrial and Enterprise Systems Engineering, University of Illinois, Urbana, IL.
  • 40.Kleywegt GJ, Jones A. Superposition. CCP4/ESF-EACBM Newsletter Protein Crystallogy. 1994;31:9–14. [Google Scholar]
  • 41.Ye Y, Godzik A. Flexible structure alignment by chaining aligned fragment pairs allowing twists. Bioinformatics. 2003;19:246–255. doi: 10.1093/bioinformatics/btg1086. [DOI] [PubMed] [Google Scholar]
  • 42.Rocha J, Seguraa J, Wilson RC, Dasgupta S. Flexible structural protein alignment by a sequence of local transformations. Bioinformatics. 2009;25:1625–1631. doi: 10.1093/bioinformatics/btp296. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Salem S, Zaki MJ, Bystroff C. FlexSnap: Flexible Non-sequential Protein Structure Alignment. Algorithms for Molecular Biology. 2010;5:12–24. doi: 10.1186/1748-7188-5-12. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Carr RD, Lancia G, Istrail S. Branch-and-cut algorithms for independent set problems: Integrality gap and an application to protein structural alignment. Technical report, Sandia National laboratories. Sandia Report. 2000;SAND2000-2171 [Google Scholar]
  • 45.Mayr G, Domingues FS, Lackner P. Comparative Analysis of Protein Structure Alignments. BMC Structural Biology. 2007;7:50–64. doi: 10.1186/1472-6807-7-50. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

File S1.

Results with the Skolnick-Sokol dataset.

(ZIP)

File S2.

Results with the RIPC dataset.

(ZIP)


Articles from PLoS ONE are provided here courtesy of PLOS

RESOURCES