Abstract
In the era of structural genomics, it is necessary to generate accurate structural alignments in order to build good templates for homology modeling. Although a great number of structural alignment algorithms have been developed, most of them ignore intermolecular interactions during the alignment procedure. Therefore, structures in different oligomeric states are barely distinguishable, and it is very challenging to find correct alignment in coil regions. Here we present a novel approach to structural alignment using a clique finding algorithm and environmental information (SAUCE). In this approach, we build the alignment based on not only structural coordinate information but also realistic environmental information extracted from biological unit files provided by the Protein Data Bank (PDB). At first, we eliminate all environmentally unfavorable pairings of residues. Then we identify alignments in core regions via a maximal clique finding algorithm. Two extreme value distribution (EVD) form statistics have been developed to evaluate core region alignments. With an optional extension step, global alignment can be derived based on environment-based dynamic programming linking. We show that our method is able to differentiate three-dimensional structures in different oligomeric states, and is able to find flexible alignments between multidomain structures without predetermined hinge regions. The overall performance is also evaluated on a large scale by comparisons to current structural classification databases as well as to other alignment methods.
Keywords: structural alignment, database search, flexible alignment, fold recognition, structural genomics
There are many different ways to compare and classify proteins. Which is best depends on the overall goal of the study: detection of evolutionary relations, succinctly categorizing well-defined tertiary structure motifs, recognizing similar patterns in residue environments in spite of differences in sequence and tertiary structure, etc. Here the eventual goal is to develop a set of categories for globular soluble proteins such that a given sequence will recognize the category corresponding to its native fold. It is possible that polypeptide chains A and B adopt much the same fold as do monomers under native conditions, even though they may have little sequence similarity. On the other hand, chain C may fold up the same way but associate as a homodimer, CC. The dimer interface of the C subunit would be hydrophobic, whereas the corresponding surface residues in A and B would be relatively hydrophilic. Clearly, different kinds of sequences would form monomers versus dimers, even though the secondary and tertiary structures for the individual folded chains are quite similar. This suggests a protein comparison approach that emphasizes spatial similarity in well-ordered parts of the chain, residue environmental features correlated with sequence but not as precise as the sequence, and recognition that the environment depends on more than intrachain interactions.
Using sequence alignment techniques, evolutionary linkages between proteins can be found, and functions for novel proteins can be inferred from known proteins. Well-defined statistical theories have been founded to measure the statistical significance of alignment scores (ASs) (Karlin and Altschul 1990). Based on the extreme value distribution (EVD) model, the probability of finding an AS s larger than x between unrelated sequences is equal to
$$P(s > x) = 1 - \exp\left(-e^{-\lambda (x - \mu)}\right) \qquad (1)$$
where λ and μ are scale and location parameters, respectively.
P(s > x) is also known as the P-value. Some use the E-value instead of the P-value to measure the statistical significance of an alignment. The E-value is defined as
[Equation 2: definition of the E-value]
The lower the P-value or E-value, the more probable it is that two proteins are homologous to each other.
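As a minimal sketch of the P-value calculation, the function below assumes the usual Gumbel-type form of the EVD, P(s > x) = 1 − exp(−e^(−λ(x − μ))). The function name and the overflow guard are illustrative choices, not part of the original method, and the E-value is not computed here.

```python
import math

def evd_p_value(x, lam, mu):
    """P(s > x) under a Gumbel-type extreme value distribution.

    Assumes P = 1 - exp(-exp(-lam * (x - mu))), with lam the scale
    parameter and mu the location parameter.
    """
    t = -lam * (x - mu)
    if t > 700.0:
        # exp(t) would overflow; the probability is 1 to machine precision.
        return 1.0
    # -expm1(-y) equals 1 - exp(-y) but stays accurate when y is tiny,
    # which is the case for highly significant scores.
    return -math.expm1(-math.exp(t))
```

For a score equal to the location parameter (x = μ) this gives 1 − 1/e, and the value falls off roughly exponentially as x increases beyond μ.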
If sequence identity falls below the twilight zone of 20%–30% (Jaroszewski et al. 2002), sequence alignment can no longer reliably detect functional linkage. On the other hand, similarities can be detected for proteins having <10% sequence identity with the aid of three-dimensional (3D) structures. The ongoing Protein Structure Initiative (PSI) will ultimately make 3D structural annotations available for almost every protein sequence (Norvell and Machalek 2000), which will afford greater opportunity for applying structural information to function annotation. Many structure-based alignment methods have been developed (CE [Shindyalov and Bourne 1998]; Dali [Holm and Sander 1993]; LOCK 2 [Shapiro and Brutlag 2004]). Generally, there are two types of approaches to structural alignment: coordinate-based and environment-based.
In the coordinate-based approach, an alignment is just like aligning two sets of points (or fragments), and the similarity is evaluated based on how well the two sets can be superimposed in 3D space. The problem is that each protein chain is isolated from its native environment, thus ignoring interchain interactions. This situation becomes worse when protein structures are broken into domains before structural comparison (some intrachain interactions are also missing), worse when a domain is only represented as a trace of Cα atoms (some side-chain interactions are missing too), and even worse when the similarity measurement is only based on Cartesian coordinates of those Cα atoms (amino acid types are disregarded). Unfortunately, this is the situation we find in most structural alignment algorithms. The consequence of this approach is obvious: These methods are unable to distinguish protein chains in different oligomeric states.
In the environment-based approach, structure-derived descriptors rather than explicit Cartesian coordinate–based distances are used to generate the structure-based alignment. Most proteins are polymers of the 20 standard amino acids, and different physicochemical environments will favor different amino acid types. Thus a structure alignment can be transformed into an alignment of those environments, while the physicochemical properties of each residue’s environment can be described as a combination of solvent accessibility, hydrogen bond strengths, and other structure-derived environmental descriptors (SED). Since these SEDs may be precalculated under native conditions, intermolecular and intramolecular interactions can be retained very well, and an alignment can be derived in the end that is compatible with the native state.
A pure environmental-based structural alignment method was developed by Suyama et al. (1997). Four SEDs (called “pseudoenergy functions” in their article)—i.e., side-chain packing, solvation, hydrogen bonding, and local structure—are used to convert a 3D structure into a position-dependent matrix that contains the fitness scores of the 20 residue types to each position. Given two structures, an environmental compatibility distance matrix is built based on covariances between positions in the two 3D position-dependent matrices. Then the Needleman-Wunsch dynamic programming algorithm is used to generate the final alignment (Needleman and Wunsch 1970). This method can produce globally environment-compatible structural alignments, but it is hard to justify the accuracy of the alignment results. We can imagine that positions on two distinct β-sheet strands buried in a protein core may have exactly the same local environment (Taylor 1999), which cannot be differentiated based on the SEDs. Therefore, alignment errors are expected among those regions. Furthermore, there is no available statistical theory yet to measure the significance of global ASs produced by the Needleman-Wunsch algorithm.
COMPARER is another method to search for environmentally compatible structural alignments (Šali and Blundell 1990). Twenty structure- and sequence-derived descriptors are used to build “element-by-element dissimilarity distance matrices,” over which alignments are determined by the Needleman-Wunsch algorithm. In order to get accurate alignment, an additional simulated annealing step is used to optimize the original alignment and correct incompatible interresidue hydrogen bond patterns for each aligned residue pair. COMPARER can produce relatively accurate environmentally compatible alignments. However, it does not provide a measurement to score the goodness of an alignment result. Instead, users have to check the corresponding 3D superimposed structure coordinates, “as automation of this alignment process cannot be guaranteed” (http://www-cryst.bioc.cam.ac.uk/). Thus this method is also not able to evaluate environmental incompatibilities in structural alignments.
Coding 3D structures into SEDs alone results in a loss of information, which makes it hard to obtain an accurate structural alignment. In many structural alignment methods, an environment-based alignment is only the first step, followed by additional coordinate-based alignment procedures to refine the initial alignment (Taylor 1999). However, in most of these programs, there are no constraints imposed to ensure environmental compatibility during the latter step, and the previously discussed drawbacks for coordinate-based approaches still apply to these methods.
In order to measure environmental compatibilities, the most important thing is to derive SEDs under a realistic environment. Recently RCSB has released a new format of protein structure file: the biological unit file. Each biological unit file contains a biological unit, which is defined as “the macromolecule that has been shown to be or is believed to be functional” (http://www.rcsb.org/pdb/). Biologically functional interactions (especially intermolecular interactions) in these files are more complete and realistic than those in the original Protein Data Bank (PDB) files (Berman et al. 2000). Based on realistic environments provided by these files, we have developed a novel structural alignment method as follows.
First, we apply environmental criteria to eliminate all environmentally incompatible chain fragment pairs from the candidate pool. Then we search for (nonoverlapping) maximal cliques in the candidate pool: Each maximal clique is the largest set of residues that can be aligned in Cartesian space. After iteratively determining all the cliques, the gaps (loop regions) between aligned residue pairs are bridged by dynamic programming over environmentally compatible distance matrices.
There are several unique features of this alignment method: First of all, the resulting alignments are both structurally compatible (alignments in core regions can be superimposed in 3D space) and environmentally compatible (environmentally incompatible pairings have been eliminated). Second, loop regions can be aligned without having similar 3D structures. Third, each clique corresponds to a core region of a domain/motif; proteins with multiple domains may be aligned by introducing multiple cliques. Therefore, it is not necessary to initially split the chain into domains.
Structural ASs have been found to follow the EVD in some coordinate-based alignment methods (FATCAT [Ye and Godzik 2004]; Structal [Levitt and Gerstein 1998]). Using the methods reported in FATCAT, we did a similar simulation of random structural alignments, based on which we have successfully developed two statistics to measure the alignment quality of a clique: One is the number of the aligned residue pairs (clique size [CS]), and the other is the AS, to measure overall environmental compatibility. The former corresponds to a coordinate-based similarity measurement, while the latter corresponds to an environment-based similarity measurement. We find that both statistics roughly follow the EVD. With the environment-based statistic (AS), we are finally able to distinguish protein structures with different oligomeric states.
Results
3D–1D table
A 75 × 20 3D–1D table (Bowie et al. 1991; Rice and Eisenberg 1997) was derived to compute environmental compatibility (see Materials and Methods; Supplemental Material). Given a 3D–1D table, a 3D structure can be converted to a structural profile, which is a position-dependent matrix (very similar to the 3D profile used in the work of Suyama et al. [1997]). To test the power of the 3D–1D table to represent 3D structures, we did a self-recognition test: align each structural profile to its native sequence compared with 200 random permutations of the native sequence. The histogram of Z-scores of 1632 self-recognition tests is shown in Figure 1A. Almost all native sequences are compatible with their corresponding native structures (with large Z-scores), suggesting that our 3D–1D table is in agreement with residue preferences for different types of environments.
Figure 1.
Self-recognition test results. (A) For 1632 proteins the 3D–1D table gave alignments for the native sequence to the native structural profile having a good Z-score compared with alignments of random permutations of the native sequence. (B) For 677 structurally diverse proteins, the 3D–1D table gave alignments of the native sequence onto the native structural profile having a good Z-score compared with alignments of the native sequence onto all 677 structural profiles.
The 3D–1D table is also compatible with the native structure in that the alignment of a protein’s sequence onto its own structural template gives a good (large) Z-score, compared with aligning to nonnative structures (Fig. 1B). For this test we used a library of 677 structurally diverse proteins selected from the total set of 1632 proteins.
Statistics of structural alignments
We find that both the CS (i.e., the number of structurally aligned residues) and the environment-based AS approximately follow the EVD (Fig. 2), although the quantile-to-quantile plots indicate that the fitting is not perfect in regions of very small CS/AS and extremely large CS/AS. A similar result has been found and discussed in FATCAT, when Ye and Godzik (2004) fit structural ASs to an EVD model. Our EVD model will give higher than expected P-values or E-values for extremely high CS/AS. Therefore, our EVD model is slightly more stringent in the high scoring regions. Interestingly, the opposite trend is found in FATCAT.
Figure 2.
Fitting the sizes and alignment scores of the first clique found to EVD. The histograms shown above were generated by 5000 alignments of unrelated structures with a length of 148 residues. The two plots on the right are quantile-to-quantile plots to compare data distribution vs. theoretical EVD distribution.
The location and scale parameters are, as in FATCAT, dependent on the size of structures used in the alignment. We find that these two parameters are not linearly correlated with the length of the protein chain (as in FATCAT), but are linearly correlated with log(MN), where MN is the product of chain lengths of the two proteins to be aligned. The EVD parameters for CS are
[Equations 3 and 4: μ and λ for CS as linear functions of log(MN)]
and the correlation coefficients for μ and λ are 0.996 and 0.896, respectively. The EVD parameters for ASs are
[Equations 5 and 6: μ and λ for AS as linear functions of log(MN)]
and the correlation coefficients for μ and λ are 0.997 and 0.966, respectively. Based on these linear functions, the P-value and E-value of a CS or of ASs can be estimated by equations 1 and 2.
Overall performance of statistics
Recently, receiver-operating characteristic (ROC) curves have been used by Kolodny et al. (2005) and Gribskov and Robinson (1996) to evaluate the performance of different geometric measurements for scoring structural alignments. Using an automatic protein structure classification database, CATH (Orengo et al. 1997), as a reference, we plot the ROC curves for our structural alignment algorithm. In general, all three statistics (CSs, ASs, and the combination of the CS and the AS) largely agree with CATH. Among them, using CSs as the measurement of the statistical significance gives the best agreement with CATH, while using environment-based ASs gives the least agreement (Fig. 3).
Figure 3.
Receiver-operating characteristic curves of statistics. CS indicates clique sizes only; AS, alignment scores only; and CS_AS, combination of clique sizes and alignment scores.
We also plot error per query (EPQ) curves (Fig. 4). At an EPQ level of 0.1, our method can find 26%–45% of the possible homologs, depending on whether different CATH homologous families were counted as errors (Fig. 4A) or whether different CATH topologies were counted as errors (Fig. 4B). At an EPQ of 1.0, our method can identify 46%–67% of the possible homologs. This shows that our alignment performance is in general similar to that of Dali and better than that of PSI-Blast (Sierk and Pearson 2004). It also shows that under low error rates, combining the CS and the AS performs best to distinguish CATH homologous proteins from nonhomologous proteins.
Figure 4.
EPQ plots of statistics. In both plots, structures from the same CATH homologous families were treated as “similar” (hits). (A) Structures from different CATH homologous families were treated as dissimilar (errors). (B) Structures from different CATH topologies were treated as errors. CS, AS, and CS_AS are defined as in Figure 3. Solid square indicates PSI-Blast; solid triangle, Dali.
Therefore, in general, our method can identify structural homologs as well as other structural alignment algorithms can, and it agrees with CATH classification.
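The coverage figures quoted at a given EPQ level can be illustrated with a short sketch. It assumes the common definition in which all query-target comparisons are pooled and ranked by E-value, errors per query is the number of false positives divided by the number of queries, and coverage is the fraction of true homolog pairs recovered before the error budget is exceeded. The function and argument names are hypothetical, and the original evaluation may differ in detail.

```python
def coverage_at_epq(results, n_queries, n_true_pairs, epq_level):
    """Fraction of true homolog pairs found at a given errors-per-query level.

    results      -- iterable of (e_value, is_hit) over all pooled comparisons
    n_queries    -- number of query structures
    n_true_pairs -- total number of true homolog pairs that could be found
    epq_level    -- allowed errors per query (e.g. 0.1 or 1.0)
    """
    allowed_errors = epq_level * n_queries
    hits = 0
    errors = 0
    # Walk down the ranked list from most to least significant.
    for _, is_hit in sorted(results, key=lambda r: r[0]):
        if is_hit:
            hits += 1
        elif errors + 1 > allowed_errors:
            break  # the next error would exceed the budget
        else:
            errors += 1
    return hits / n_true_pairs
```

Sweeping `epq_level` over a range of values and plotting the returned coverage gives a curve of the kind shown in Figure 4.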
Statistical vs. nonstatistical measurement
A coordinate-based measurement, ρ (Maiorov and Crippen 1995), was also derived after 3D superposition of the extended global structural alignment. The advantage of ρ is that it is size independent and can be viewed as a scaled root mean squared deviation (RMSD). A pair of structures with clear visual similarity will have ρ < 0.5. We randomly sampled 200 pairs of structures with different levels of similarities and calculated ρ values. The results are shown in Figure 5. With an E-value cutoff of 0.01, the combination of E-values identifies more homologs than the purely coordinate-based ρ.
Figure 5.
Evaluation of structural alignment using statistical and nonstatistical criteria. Both ρ based on the rigid-body superposition of the first clique found and ρ based on flexible superposition were used for nonstatistical measurement. Unrelated, Class, Architecture, Topology, and Homology are defined based on CATH hierarchical levels. The E-values of clique size (A), alignment score (B), and the geometric mean of E-values of the clique size and the alignment score (C) of the first clique were used as a combination score for statistical measurement.
Example: Differentiating oligomeric states
Using the environment-based AS, we can distinguish structures preferring different oligomeric states. One example is shown in Figure 6. 1idp is scytalone dehydratase (E.C. 4.2.1.94) existing as a homotrimer. 1oh0 is ketosteroid isomerase (E.C. 5.3.3.1) as a homodimer. The structural superposition of the subunits based on coordinates is very good (ρ = 0.27 for 62 aligned residue pairs). CATH defines them as homologs. The size of the first clique is statistically significantly larger than unrelated structure pairs (E-value = 3.3 × 10⁻⁵). However, the environment-based score (E-value = 0.12) is not statistically significant, suggesting there are some differences in the environments.
Figure 6.
Superposition of 1oh0:A and 1idp:A. (A) biological unit of 1oh0 (chain A is black). (B) Biological unit of 1idp (chain A is black). (C) The alignment of 1oh0:A (black) to 1idp:A (gray). Figure was generated by MolMol 2k.2 (Koradi et al. 1996).
Example: Structural alignment of multidomain proteins
By iteratively applying the maximal clique finding algorithm, we can find more than one clique, which may correspond to multiple domains. Figure 7 shows the alignment of a pair of immunoglobulins (8fab:A vs. 1dcl:B). Both proteins are dimeric two-domain proteins (other subunits in the biological unit were omitted for clarity). By using two cliques, we get a “flexible” alignment. E-values for the CS (2.1 × 10⁻⁸) and AS (1.3 × 10⁻¹⁰) are both statistically significant. ρ of the “flexible” superposition is 0.10. It can be shown that subsequent cliques can also be evaluated by similar statistics as the first clique (data not shown). Therefore, statistics were also applied to measure the statistical significance of the second clique, giving E-values of 1.7 × 10⁻⁹ and 1.7 × 10⁻¹⁵ for the CS and environment score, respectively.
Figure 7.

Alignment of immunoglobulins 8fab:A (black) and 1dcl:B (gray). (A) Rigid body alignment based on the first clique. (B) Flexible alignment using two cliques. Figure was generated by MolMol 2k.2 (Koradi et al. 1996).
Example: Alignment of 10 “difficult” protein pairs
Ten difficult protein pairs are widely used as a benchmark to evaluate the performance of structural alignment methods (Fischer et al. 1996). As shown in Table 1, all of these pairs of proteins are statistically significantly similar to each other, as they should be.
Table 1.
Alignment of 10 difficult cases
| PDB entries | CS | ρ | E-value (CS)ᵃ | E-value (AS)ᵇ | E-value (CS_AS)ᶜ |
| --- | --- | --- | --- | --- | --- |
| 1fxiA:1ubq_ | 40 | 0.28 | 4.68 × 10⁻³ | 5.11 × 10⁻³ | 4.89 × 10⁻³ |
| 1ten_:3hhrB | 76 | 0.16 | 2.39 × 10⁻⁶ | 3.18 × 10⁻⁶ | 2.76 × 10⁻⁶ |
| 3hlaB:2rhe_ | 39 | 0.2 | 1.26 × 10⁻² | 4.95 × 10⁻² | 2.50 × 10⁻² |
| 2azaA:1paz_ | 58 | 0.22 | 1.70 × 10⁻⁴ | 8.22 × 10⁻⁴ | 3.74 × 10⁻⁴ |
| 1cewI:1molA | 67 | 0.18 | 6.56 × 10⁻⁶ | 5.11 × 10⁻³ | 7.76 × 10⁻⁵ |
| 1cid_:2rhe_ | 60 | 0.15 | 1.77 × 10⁻⁴ | 1.42 × 10⁻² | 1.59 × 10⁻³ |
| 1crl_:1ede_ | 73 | 0.19 | 6.34 × 10⁻⁴ | 1.16 × 10⁻² | 2.71 × 10⁻³ |
| 2sim_:1nsbA | 90 | 0.16 | 1.97 × 10⁻⁵ | 5.47 × 10⁻⁵ | 3.28 × 10⁻⁵ |
| 1bgeB:2gmfA | 58 | 0.18 | 2.60 × 10⁻⁴ | 3.63 × 10⁻⁴ | 3.07 × 10⁻⁴ |
| 1tie_:4fgf_ | 63 | 0.2 | 8.83 × 10⁻⁵ | 9.72 × 10⁻⁴ | 2.93 × 10⁻⁴ |
a Based on clique sizes only.
b Based on alignment scores only.
c Based on the combination of clique sizes and alignment scores.
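The Figure 5 legend describes the combination score as the geometric mean of the clique-size and alignment-score E-values. The short sketch below applies that rule to the first two rows of Table 1; the function name is illustrative.

```python
import math

def combined_e_value(e_cs, e_as):
    """Geometric mean of the clique-size and alignment-score E-values."""
    return math.sqrt(e_cs * e_as)

# First two rows of Table 1: 1fxiA:1ubq_ and 1ten_:3hhrB
print(f"{combined_e_value(4.68e-3, 5.11e-3):.2e}")  # prints 4.89e-03
print(f"{combined_e_value(2.39e-6, 3.18e-6):.2e}")  # prints 2.76e-06
```

Because the geometric mean works on a logarithmic scale, a pair is called significant only if both statistics are reasonably small; one very small E-value cannot fully offset a poor one.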
Discussion
We used Bron and Kerbosch’s method (1973) to generate the alignment of core regions. This maximal clique finding algorithm has been used to identify 3D-superimposable ligand binding sites (Kuhl et al. 1984). The advantages of the maximal clique finding algorithm are as follows: (1) It does not require many adjustable parameters (no gap penalty is needed), and (2) it can be extended to generate flexible alignments, as in FATCAT (Ye and Godzik 2004), making it possible to align multidomain proteins without separating the domains.
Many investigators have recognized that RMSD is not a good measurement of the quality of structural alignment. Many alternative measurements have been developed (Maiorov and Crippen 1995; Yang and Honig 2000; Kolodny et al. 2005). However, all of these measurements are coordinate-based. Therefore, given a structural alignment carried out on the domain level, there is no way to distinguish proteins with different oligomeric states because the intermolecular/interdomain interactions are irrelevant to these measurements. Another concern is that this coordinate-based approach has resulted in a misconception in the structural alignment area: The best structural alignment method should find the largest number of equivalent residue pairs as long as each pair of residues is close enough after being superimposed in 3D space. For alignments in conserved structural regions (i.e., the core regions), this approach is reasonable enough. But such an approach may produce matching errors between residues in loop regions.
With an environment-based approach, we can ensure environmental compatibility between matched residues. This not only eliminates possible wrong alignments but also makes it possible to find alignments in loop regions. We used a very stringent cutoff (1.5 Å) to define core regions (compared with the 3.0 Å cutoff used in CE). Alignments based on core regions alone may therefore seem relatively short compared with those of other methods, because we want to make sure that the aligned pairs are correct. A much longer alignment is easily obtainable by using the extension procedures described in Materials and Methods.
Domain splitting is a hard problem, but rigid body superposition of unsplit domains can result in erroneous conclusions. For example, Gan et al. (2002) used 8fab:A versus 1dcl:B as an example of a pair of proteins with fairly high sequence identity but low structural similarity. In fact, it is just an example of structural flexibility. Although domain splitting is one way to solve this problem, we have shown that the overall structural similarity can also be found by using our method. Instead of an example of discrepancy between sequence and structure similarities, it is an example of how structures and sequences are related. The high environment-based AS (low E-values) suggests that the environments of the two structures are quite similar on the biological unit level.
In the fold recognition area, each fold is converted to a structural template, and a sequence is aligned to a library of structural profiles in order to find the most suitable structure for the sequence. In order to accommodate structural flexibilities, a structural template can be generated by a family of proteins (Shi et al. 2001; Tang et al. 2003). This calls for “accurate” structural alignment methods. Note the “accurate” here does not mean “superimposable” structural alignment, but a sequence-compatible structural alignment. Our environmental-based approach is what is needed because positions that tend to favor similar types of amino acids will tend to be aligned together.
Materials and methods
Data sets
We selected our data set based on CATH (Orengo et al. 1997) version 2.5.1. At first, 1938 single domain proteins with <35% sequence identity were selected. After removing membrane proteins, proteins containing residues other than the standard 20 types, and proteins with <20 residues, we got 1632 protein chains. The reason that we removed chains having nonstandard residues is that we want to simplify the preprocessing procedure to calculate relative solvent accessibilities and secondary structures. All protein biological-unit files were downloaded from ftp.rcsb.org. These 1632 chains were used in self-recognition tests.
Preprocessing of biological unit files
Three SEDs were used in our work: bond strength of secondary structures (BSSS), relative solvent accessibility (RSA), and fraction buried by polar atoms (FBP). These three SEDs were calculated for each residue in a structure. We used the method described in DSSP (Kabsch and Sander 1983) to calculate the electrostatic energy of a hydrogen bond. In DSSP, helices and sheets are defined in terms of having a certain pair of hydrogen bonds, and a hard cutoff (−1.0 kcal/mol) was used to define the existence of a hydrogen bond. If a residue has both of the interturn hydrogen bonds required in a helix, it is declared to be a helical residue; if it has both the required interstrand hydrogen bonds, it is declared to be a β-sheet residue. Here we use the strengths of the two pairs of hydrogen bonds to decide whether a residue is more helical or β-strand, but instead of using a fixed cutoff, we define BSSS as a continuous variable representing secondary structures by ranging over both positive and negative values. The absolute value of BSSS is equal to the magnitude of the weaker of the two hydrogen bond energies; residues in helices are given a negative sign, and residues in sheets are given a positive sign. Thus the more negative, the more likely a residue is in a helix; the more positive, the more likely the residue is in a sheet. Values close to zero correspond to the coil state.
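The BSSS definition can be sketched as follows. The hydrogen bond energy uses the DSSP electrostatic model (partial charges 0.42e and 0.20e, factor 332, distances in Å, energy in kcal/mol). How a residue is assigned to the helix or sheet branch is not spelled out above, so the `in_helix` flag is left to the caller, and treating unfavorable (positive) energies as "no bond" is an assumption of this sketch.

```python
def hbond_energy(r_on, r_ch, r_oh, r_cn):
    """DSSP-style electrostatic hydrogen bond energy in kcal/mol.

    Distances (in Angstrom) are between the backbone C=O of one residue
    and the backbone N-H of the other.
    """
    return 0.084 * 332.0 * (1.0 / r_on + 1.0 / r_ch - 1.0 / r_oh - 1.0 / r_cn)

def bsss(energy_1, energy_2, in_helix):
    """Bond strength of secondary structure for one residue.

    energy_1, energy_2 -- energies of the two hydrogen bonds that define
                          the secondary structure element (negative = favorable)
    in_helix           -- True for the helix pattern, False for the sheet pattern
    """
    # The weaker bond has the energy closer to zero; positive energies are
    # treated as no bond at all (assumption).
    magnitude = min(max(0.0, -energy_1), max(0.0, -energy_2))
    return -magnitude if in_helix else magnitude
```

With this convention a residue whose two helical hydrogen bonds have energies of −2.5 and −1.0 kcal/mol gets BSSS = −1.0, and values near zero indicate coil.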
Solvent-accessible surface area (SASA) (Lee and Richards 1971) of each atom was determined by placing 512 equally spaced sample points on the surface of its imaginary “solvent sphere,” with a radius equal to the sum of the atom’s van der Waals radius and the radius of a water molecule. If a point was in the solvent sphere of any other atom, it was defined as buried; otherwise, it was defined as solvent-accessible. SASA of an atom was then determined by (Nacc/512) × Areasp, where Nacc is the number of solvent-accessible points and Areasp is the surface area of the solvent sphere. The calculations of SASA were always performed in a biological unit context. SASA for a residue was the sum of the SASAs for each atom in the residue. RSA for a residue was defined as the ratio between SASA of the residue in a biological unit versus SASA for that residue X in the pentapeptide GGXGG.
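A minimal sketch of this point-sampling calculation is given below. The text does not say how the 512 points were spaced, so a golden-spiral (Fibonacci) lattice is assumed here, as is a water probe radius of 1.4 Å; both are illustrative choices, and the brute-force neighbor loop is only suitable for small examples.

```python
import math

def sphere_points(n=512):
    """Approximately evenly spaced points on a unit sphere (golden spiral)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for k in range(n):
        z = 1.0 - (2.0 * k + 1.0) / n          # strictly inside (-1, 1)
        r = math.sqrt(1.0 - z * z)
        points.append((r * math.cos(golden_angle * k),
                       r * math.sin(golden_angle * k), z))
    return points

def atom_sasa(index, coords, vdw_radii, probe=1.4, n=512):
    """SASA of atom `index`: (N_acc / n) * area of its solvent sphere."""
    cx, cy, cz = coords[index]
    radius = vdw_radii[index] + probe
    n_acc = 0
    for ux, uy, uz in sphere_points(n):
        px, py, pz = cx + radius * ux, cy + radius * uy, cz + radius * uz
        buried = False
        for j, (ox, oy, oz) in enumerate(coords):
            if j == index:
                continue
            other = vdw_radii[j] + probe
            # A point inside any other atom's solvent sphere is buried.
            if (px - ox) ** 2 + (py - oy) ** 2 + (pz - oz) ** 2 < other * other:
                buried = True
                break
        if not buried:
            n_acc += 1
    return n_acc / n * 4.0 * math.pi * radius * radius
```

To follow the biological unit convention, `coords` would contain every atom of the biological unit rather than only the atoms of the chain being analyzed.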
FBP for a residue was given by (Np/Ntotalb), where Np is the total number of sample points in the residue that were buried by polar atoms and Ntotalb is the total number of buried sample points for the residue.
3D–1D table
At first, we classified environment states based on each of the above SEDs. Five BSSS states, five RSA states, and three FBP states were defined, giving altogether 5 × 5 × 3 combinations of states. The five BSSS states were defined by (1) strong helix with BSSS < −2.067, (2) medium helix with −2.067 < BSSS < −0.181, (3) weak helix with −0.181 < BSSS < −0.031, (4) coil state with −0.031 < BSSS < 0, and (5) β strand with BSSS > 0. The five RSA states were defined by (1) thoroughly buried with RSA < 1.4%, (2) 1.4% < RSA < 12.5%, (3) 12.5% < RSA < 31.8%, (4) 31.8% < RSA < 54.5%, and (5) very exposed with RSA > 54.4%. Three FBP states were defined by (1) FBP < 26.7%, (2) 26.7% < FBP < 33.5%, and (3) FBP > 33.5%. For each SED the boundaries of the states were chosen so that equal numbers of residues from the 1632 proteins fell into each state.
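The state assignment can be sketched with a few threshold lookups. The cut points are those listed above (54.5% is used for the last RSA boundary); the ordering of the 75 combined states and the treatment of values that fall exactly on a boundary are assumptions of this sketch.

```python
from bisect import bisect_left

BSSS_CUTS = [-2.067, -0.181, -0.031, 0.0]   # strong/medium/weak helix, coil, strand
RSA_CUTS = [1.4, 12.5, 31.8, 54.5]          # percent
FBP_CUTS = [26.7, 33.5]                     # percent

def environment_state(bsss, rsa, fbp):
    """Map a residue's three SED values to a state index in 0..74.

    bisect_left places a value equal to a cut point in the lower bin
    (so BSSS = 0 counts as coil); this tie rule is an assumption.
    """
    b = bisect_left(BSSS_CUTS, bsss)   # 0..4
    r = bisect_left(RSA_CUTS, rsa)     # 0..4
    f = bisect_left(FBP_CUTS, fbp)     # 0..2
    return (b * 5 + r) * 3 + f
```

For example, a weak-coil residue with BSSS = −0.01, RSA = 20%, and FBP = 30% falls in BSSS bin 3, RSA bin 2, and FBP bin 1, giving state index 52 under this ordering.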
The 3D–1D table has 20 rows and 75 columns, where entry i, j represents the log-likelihood of residue type i being in environment state j, based on a survey of the 1632 proteins. Some bins contained very few hits for certain amino acids, so pseudo-counts were added to observations based on the background frequency of all residue types. The weight of pseudo-counts was controlled to be 10% of the total observations in each state bin. Based on the corrected counts, the log-likelihood–based 3D–1D score for amino acid i in bin j was obtained as follows: 100 log {P(i, j)/[P(i) P(j)]}. (For the 75 × 20 3D–1D table used in this work, see Supplemental Material.) Each 3D structure was translated into a structural profile by the 3D–1D table. A structural profile is an N by 20 matrix, where N is the number of residues in a structure and each row contains 3D–1D scores for the 20 residue types in the state corresponding to that position.
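The table construction can be sketched as below. It assumes that the pseudo-counts added to each state bin total 10% of that bin's observations and are distributed according to the background residue frequencies, and that the logarithm is natural; neither detail is fixed by the description above. The sketch also assumes every state bin and every residue type has at least one observation.

```python
import math

def build_3d1d_table(counts, pseudo_weight=0.10):
    """Log-likelihood scores 100 * log(P(i,j) / (P(i) * P(j))).

    counts[i][j] -- observed number of residues of type i in state j
    Returns a table with the same shape as `counts`.
    """
    n_types, n_states = len(counts), len(counts[0])
    total = sum(sum(row) for row in counts)
    background = [sum(row) / total for row in counts]

    # Add pseudo-counts per state bin, proportional to background frequency.
    corrected = [[0.0] * n_states for _ in range(n_types)]
    for j in range(n_states):
        bin_total = sum(counts[i][j] for i in range(n_types))
        for i in range(n_types):
            corrected[i][j] = counts[i][j] + pseudo_weight * bin_total * background[i]

    grand = sum(sum(row) for row in corrected)
    p_i = [sum(row) / grand for row in corrected]
    p_j = [sum(corrected[i][j] for i in range(n_types)) / grand
           for j in range(n_states)]
    return [[100.0 * math.log(corrected[i][j] / grand / (p_i[i] * p_j[j]))
             for j in range(n_states)]
            for i in range(n_types)]

def structural_profile(states, table):
    """N x 20 profile: row k holds the scores of all residue types in state k."""
    return [[table[i][s] for i in range(len(table))] for s in states]
```

A residue type that is enriched in a state relative to its overall frequency receives a positive score there, and a depleted one receives a negative score.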
Self-recognition test
For each structural profile, the native sequence fitness score s was defined as the sum of 3D–1D scores for the native residues at each position. This fitness score was transformed into a Z-score:
$$Z = \frac{s - \bar{s}}{\sigma_s}$$
where $\bar{s}$ and $\sigma_s$ are the mean and standard deviation of fitness scores for that protein, estimated by calculating the fitness scores for 200 random permutations of the native sequence. Thus the overall amino acid composition was held constant, but the permuted sequences had very low sequence identity to the native.
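The permutation test can be sketched as follows, taking the Z-score as the native fitness score minus the mean over permutations, divided by their standard deviation. The residue ordering, the fixed random seed, and the use of the sample standard deviation are illustrative assumptions; a sequence made of a single residue type would give a zero standard deviation and is not handled.

```python
import random
import statistics

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: k for k, aa in enumerate(AMINO_ACIDS)}

def fitness(profile, sequence):
    """Sum of 3D-1D scores of the given residues at each profile position."""
    return sum(row[AA_INDEX[aa]] for row, aa in zip(profile, sequence))

def self_recognition_z(profile, sequence, n_perm=200, seed=0):
    """Z-score of the native sequence against shuffled versions of itself."""
    rng = random.Random(seed)
    native_score = fitness(profile, sequence)
    residues = list(sequence)
    shuffled_scores = []
    for _ in range(n_perm):
        rng.shuffle(residues)          # keeps amino acid composition fixed
        shuffled_scores.append(fitness(profile, residues))
    mean = statistics.mean(shuffled_scores)
    sd = statistics.stdev(shuffled_scores)
    return (native_score - mean) / sd
```

Here `profile` is an N × 20 structural profile whose columns follow the order in `AMINO_ACIDS`.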
Alternatively, in order to test the recognition of the correct fold by a single sequence, we noted that the 1632 proteins fell into 677 homologous families based on CATH version 2.5.1. In order to ensure that the set of alternative proteins was structurally diverse, we selected at random one representative from each family. The sequence of each representative was threaded onto all 677 structural templates by a Smith-Waterman dynamic programming algorithm with affine gap penalties (300 for opening a gap and 30 for extending a gap). To compensate for varying chain lengths, the resulting threading score was divided by the length of the 3D structural template that it was threaded to. The Z-score over adjusted threading scores was calculated as above, where s is the adjusted score for threading the native sequence onto the native structure, and the mean and standard deviations were estimated by the adjusted scores from threading that same sequence onto all 677 structural templates.
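The threading step can be sketched as a score-only local alignment of a sequence against a structural profile with affine gaps (Gotoh's recurrences). The sketch assumes that a gap of length k costs 300 + 30(k − 1); whether the first gap position also pays the extension penalty is not stated above. The final division by the template length follows the length normalization described in the text.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: k for k, aa in enumerate(AMINO_ACIDS)}

def threading_score(profile, sequence, gap_open=300.0, gap_extend=30.0):
    """Length-adjusted Smith-Waterman score of `sequence` on `profile`.

    profile  -- N x 20 matrix of 3D-1D scores (columns ordered as AMINO_ACIDS)
    sequence -- string of one-letter residue codes
    """
    n, m = len(profile), len(sequence)
    neg_inf = float("-inf")
    h_prev = [0.0] * (m + 1)        # best local score ending at (i-1, j)
    f_prev = [neg_inf] * (m + 1)    # best score ending with a gap in the sequence
    best = 0.0
    for i in range(1, n + 1):
        h_cur = [0.0] * (m + 1)
        f_cur = [neg_inf] * (m + 1)
        e = neg_inf                 # best score ending with a gap in the template
        for j in range(1, m + 1):
            e = max(h_cur[j - 1] - gap_open, e - gap_extend)
            f_cur[j] = max(h_prev[j] - gap_open, f_prev[j] - gap_extend)
            diag = h_prev[j - 1] + profile[i - 1][AA_INDEX[sequence[j - 1]]]
            h_cur[j] = max(0.0, diag, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best / n                 # compensate for template length
```

Threading one sequence against every template in the library and converting the adjusted scores to a Z-score then gives the fold-recognition test described above.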
Core structural alignment
The alignment of core structures was carried out in the following three steps:
1. Determination of candidate aligned fragment pairs (AFPs). If two octapeptides from two structures were similar in terms of 3D local structures as well as compatible in environments, they were treated as a candidate pairing for structural alignment. The 3D local structure similarity was calculated by
$$D_{ij} = \frac{1}{m^2} \sum_{k=0}^{m-1} \sum_{l=0}^{m-1} \left| d^A_{p_i^A + k,\; p_j^A + l} - d^B_{p_i^B + k,\; p_j^B + l} \right| \qquad (7)$$

where Dij is the difference of pairing two peptides with length m = 8 in protein A (starting at piA and pjA) and two peptides in protein B (starting at piB and pjB), and $d^A_{p,q}$ is the distance between the Cα atoms of residues p and q in protein A (likewise $d^B$ for protein B). Dii is the difference of pairing a peptide in A and a peptide in B starting at piA and piB, respectively. See the CE algorithm (Shindyalov and Bourne 1998) for details. A pair of structurally similar octapeptides (i.e., AFP with m = 8) was defined to have Dii < 1.5 Å. The environmental compatibility for each residue pair was defined as the Pearson correlation coefficient, cor, of the two corresponding bins of states on the basis of 3D–1D scores. The environment compatibility of two octapeptides (Eij) is equal to the sum of the compatibility scores for the eight pairs of residues:
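$$E_{ij} = \sum_{k=0}^{m-1} \mathrm{cor}\left(\mathbf{v}^A_{i+k},\, \mathbf{v}^B_{j+k}\right) \qquad (8)$$

where $\mathbf{v}^A_i$ is the vector of 20 3D–1D scores of the residue types for the state bin corresponding to residue i in structure A, $\mathbf{v}^B_j$ is defined likewise for residue j in structure B, and m = 8. If Eij ≤ 0, we defined the two octapeptides to be environmentally incompatible.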
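The environmental filter can be illustrated with a short sketch: the Pearson correlation between the 20-score rows of two residues, summed over the eight aligned positions of a fragment pair. The function names are hypothetical, and returning 0 for a constant score vector is an assumption made to avoid division by zero.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length score vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0                      # assumed convention for constant vectors
    return cov / (norm_u * norm_v)

def fragment_compatibility(profile_a, i, profile_b, j, m=8):
    """Sum of per-residue correlations for fragments starting at i in A and j in B."""
    return sum(pearson(profile_a[i + k], profile_b[j + k]) for k in range(m))

def environmentally_compatible(profile_a, i, profile_b, j, m=8):
    """A fragment pair is kept only if its summed compatibility is positive."""
    return fragment_compatibility(profile_a, i, profile_b, j, m) > 0.0
```

Because each residue's row depends only on its state bin, the correlations could equally be precomputed once as a 75 by 75 matrix over state bins and looked up per residue pair.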
2. Extension of AFP-based alignment by maximal clique finding algorithm. Consider a mathematical graph having N nodes corresponding to the AFP candidates found in the previous step. Two nodes are joined by an undirected edge if the two AFPs i and j satisfy (1) Dij < 1.5 Å, (2) (piA − pjA) × (piB − pjB) > 0, and (3) |piA − pjA| ≥ m and |piB − pjB| ≥ m. In other words, the two AFPs are geometrically compatible, sequence order is maintained, and the octapeptides do not overlap in sequence in either protein. A maximal clique of such a graph corresponds to the largest subset of completely connected nodes, i.e., the most aligned residues involved in mutually geometrically compatible AFPs. Bron and Kerbosch’s method (1973) was used to find the maximal clique.
3. Iterative clique finding. After a run of the maximal clique finding routine, we eliminated AFPs in the resulting clique from the connection matrix and reran the maximal clique finding procedure to find another clique. Iterations ended when the returned CS was <20, or the maximal number of allowed cliques (e.g., three) had been achieved.
The quality of a clique (i.e., a core region) was measured by two statistics: the CS (CS = number of structurally aligned residues) and the environment-based AS (AS = sum over all residue pairs in the clique of the environmental compatibility scores, as in equation 8).
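The three steps above can be sketched as follows. Each AFP is represented by its start positions (pA, pB), and the edge test encodes the conditions as described in words: geometric compatibility, preserved sequence order, and no overlap. The callback `dist`, standing in for the Dij of equation 7, and all function names are hypothetical. The clique search is a Bron-Kerbosch recursion with pivoting that keeps the largest clique found. Discarding a final clique whose CS falls below the threshold is an assumption, since the text only says that iteration stops there.

```python
def build_graph(afps, dist, m=8, d_max=1.5):
    """Adjacency sets over AFP indices; afps[i] = (start in A, start in B)."""
    adj = {i: set() for i in range(len(afps))}
    for i in range(len(afps)):
        a_i, b_i = afps[i]
        for j in range(i + 1, len(afps)):
            a_j, b_j = afps[j]
            ordered = (a_i - a_j) * (b_i - b_j) > 0
            disjoint = abs(a_i - a_j) >= m and abs(b_i - b_j) >= m
            if ordered and disjoint and dist(afps[i], afps[j]) < d_max:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def largest_clique(adj, nodes):
    """Largest clique among `nodes`, by Bron-Kerbosch with pivoting."""
    best = []

    def expand(r, p, x):
        nonlocal best
        if not p and not x:             # r is a maximal clique
            if len(r) > len(best):
                best = list(r)
            return
        pivot = max(p | x, key=lambda u: len(adj[u] & p))
        for v in list(p - adj[pivot]):
            expand(r + [v], p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand([], set(nodes), set())
    return best

def iterative_cliques(afps, dist, m=8, min_cs=20, max_cliques=3):
    """Repeatedly extract the largest clique, removing its AFPs each round."""
    adj = build_graph(afps, dist, m)
    active = set(adj)
    cliques = []
    while len(cliques) < max_cliques:
        clique = largest_clique(adj, active)
        if m * len(clique) < min_cs:    # CS = m residues per non-overlapping AFP
            break
        cliques.append(sorted(clique))
        active -= set(clique)
    return cliques
```

Since AFPs in a clique cannot overlap, CS is simply m times the number of AFPs in the clique; AS would be the sum of the fragment compatibility scores over the same AFPs.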
EVD model
We adopted the method used by FATCAT (Ye and Godzik 2004) to generate unrelated structures: Given a chain length l, all protein chains longer than l were selected from the above 1632 chains. Then we randomly chose two proteins from different topologies and chopped them into two fragments containing l contiguous residues each at random starting points. Since it is unlikely that structures in different topologies are actually homologous (Sierk and Pearson 2004), this randomization gives us a pair of unrelated structures. For each chain length of 43, 55, 70, 90, 106, 148, 191, 245, and 314, up to 5000 pairs of unrelated structures were generated. The structural profile of each fragment was calculated in the context of the original biological units (see “Preprocessing of biological unit files”). Three hundred five structures were used in this training set. EVD parameters were estimated by the method of moments (Altschul and Erickson 1986).
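For illustration, a method-of-moments fit of a Gumbel-type EVD to a sample of scores from unrelated pairs might look like the sketch below. The function names are hypothetical, and converting the tail probability to an E-value by multiplying by the number of comparisons is a common convention assumed here, not taken from the text.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329

def fit_evd(scores):
    """Method-of-moments estimates (location mu, scale beta) of a Gumbel EVD."""
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)
    beta = sd * math.sqrt(6.0) / math.pi
    mu = mean - EULER_GAMMA * beta
    return mu, beta

def p_value(score, mu, beta):
    """P(S >= score) = 1 - exp(-exp(-(score - mu) / beta))."""
    exponent = min(-(score - mu) / beta, 700.0)   # clamp to avoid overflow
    return -math.expm1(-math.exp(exponent))

def e_value(score, mu, beta, n_comparisons):
    """Assumed convention: expected number of chance hits in a search."""
    return n_comparisons * p_value(score, mu, beta)
```

In practice one pair (mu, beta) would be estimated per chain length from the random pairs and per statistic (CS or AS).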
Large-scale test
The unbiased test set was generated by a method similar to the one described by Sierk and Pearson (2004). Among our 1632 protein chains selected above, we removed proteins with chain length >300 residues, resulting in 1456 structures. After removing structures used in the training set, as well as homologous superfamilies containing only one structure, we got 880 structures representing 101 topologies and 174 homologous families. We then selected out 44 “large” homologous families having more than five members per family and used the longest chain from each of these families as a query. The 44 queries covered 31 CATH topologies. The remaining 836 structures were used as our target library. The test was carried out by pairwise alignment of each of the 44 query structures to the 836 targets in the library. Altogether, 36,784 pairwise alignments were performed.
ROC curves were generated by calculating specificity and sensitivity at different cutoffs. E-values of CS and AS of a clique were calculated. The geometric mean of these two E-values was also used as the combination of size and AS (CS AS). The three E-values were used to evaluate the statistical significance of a structural alignment.
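In the ROC plots, specificity and sensitivity are defined as follows:

$$\text{specificity} = \frac{TN}{TN + FP} \qquad (9)$$

$$\text{sensitivity} = \frac{TP}{TP + FN} \qquad (10)$$

where TP, FP, TN, and FN are the numbers of true positives, false positives, true negatives, and false negatives at a given cutoff, and positives and negatives in equations 9 and 10 were defined at the CATH homologous family level. This is more challenging than the topology level tests used in purely structural comparisons, and it is more appropriate because we also include environmental effects in AS.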
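A minimal sketch of how ROC points could be generated from a ranked list of hits is shown below, using the usual definitions specificity = TN / (TN + FP) and sensitivity = TP / (TP + FN), together with the geometric mean used to combine the CS and AS E-values. The function names are hypothetical, and tied E-values are not given any special treatment.

```python
import math

def combined_e_value(e_cs, e_as):
    """Geometric mean of the CS and AS E-values."""
    return math.sqrt(e_cs * e_as)

def roc_points(e_values, is_homolog):
    """(cutoff, specificity, sensitivity) as the E-value cutoff is relaxed."""
    n_pos = sum(1 for flag in is_homolog if flag)
    n_neg = len(is_homolog) - n_pos
    tp = fp = 0
    points = []
    for e, flag in sorted(zip(e_values, is_homolog)):
        if flag:
            tp += 1                     # homolog accepted at this cutoff
        else:
            fp += 1                     # non-homolog accepted at this cutoff
        specificity = (n_neg - fp) / n_neg
        sensitivity = tp / n_pos
        points.append((e, specificity, sensitivity))
    return points
```

Here `is_homolog` marks pairs belonging to the same CATH homologous family, matching the level at which positives and negatives are defined.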
EPQ tests were performed as described in Sierk and Pearson (2004). Coverage was defined as the percentage of correct hits (i.e., CATH homologs) found at a given cutoff. EPQ was defined as the ratio of the total number of errors occurring at a given cutoff over the total number of queries (i.e., 44 in this work). Because CATH is fairly stringent in defining homologs, some structurally well-aligned structures may be classified into different homologous families. Sierk and Pearson (2004) suggested using topology classification to count errors. Therefore, we used two criteria to define errors: CATH non-homologs and CATH nontopologs (in Fig. 4, A and B, respectively).
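The coverage versus EPQ curve can be sketched by pooling the hits of all queries, ranking them by E-value, and accumulating correct hits and errors. The function and argument names are hypothetical; coverage is returned as a fraction rather than a percentage, and `n_true_pairs` is assumed to be the total number of correct query-target pairs in the library.

```python
def coverage_vs_epq(hits, n_queries, n_true_pairs):
    """hits: (e_value, is_correct) pooled over all queries.

    Returns a list of (errors per query, coverage) as the cutoff is relaxed.
    """
    correct = errors = 0
    curve = []
    for _, is_correct in sorted(hits, key=lambda h: h[0]):
        if is_correct:
            correct += 1
        else:
            errors += 1
        curve.append((errors / n_queries, correct / n_true_pairs))
    return curve
```

Switching between the two error criteria only changes how `is_correct` is assigned: a hit counts as an error if it is a CATH non-homolog under the first criterion, or a CATH nontopolog under the second.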
Extension to global structural alignment
The global alignment was made by linking the fragments found in the maximal clique finding procedure. In gaps between fragments, an alignment path was generated by dynamic programming over the environmental compatibility scoring matrix. Rigid structure superposition was generated by superimposing residue pairs found in the first clique. When more than one clique was found, residue pairs were superimposed for each clique, and nonclique residues were rotated and translated corresponding to the operation of the nearest clique. The aligned residue pairs were used in the calculation of ρ (Maiorov and Crippen 1995). ρ is defined as
[Equation 11; see Maiorov and Crippen (1995)]
where ai and bi are the coordinates (x,y,z) centered at the origin for the ith aligned residues from structure A and B, respectively. Among the 36,784 pairwise structural alignments, we randomly sampled 200 alignments with at least 60% of the residues covered by cliques. We eliminated alignments with small CS because those structures cannot be similar, whether measured by statistical or nonstatistical methods. Among the 200 alignments, 52 pairs were from different CATH classes, 51 pairs were from the same class but different architectures, 38 were from the same architecture but different topologies, 18 were CATH topologs, and 41 were CATH homologs. After extensions to global alignments, ρ was calculated based on the above equation.
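The gap-linking step of the global extension, in which residues between two consecutive clique fragments are aligned by dynamic programming over the environmental compatibility matrix, might be sketched as below. The linear gap penalty and its value are assumptions, since the text does not give the gap scheme; `sim[i][j]` is taken to be the per-residue compatibility (for example, the Pearson correlation of the two residues' score vectors) within the gap region.

```python
def link_gap(sim, gap=0.5):
    """Global alignment of a gap region; returns aligned (i, j) index pairs."""
    n = len(sim)
    m = len(sim[0]) if n else 0
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = -gap * i
    for j in range(1, m + 1):
        score[0][j] = -gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sim[i - 1][j - 1],
                              score[i - 1][j] - gap,
                              score[i][j - 1] - gap)
    # Trace back from the corner, re-evaluating the same expressions.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if score[i][j] == score[i - 1][j - 1] + sim[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]
```

The returned indices are relative to the gap region, so they would be offset by the end positions of the preceding clique fragment before being merged into the global alignment.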
Electronic supplemental material
The 75 × 20 (5 × 5 × 3 × 20) 3D–1D table used in this work is available online.
Acknowledgments
This work was supported in part by a grant from the Vahlteich Research Endowment Fund of the College of Pharmacy of the University of Michigan.
Article published online ahead of print. Article and publication date are at http://www.proteinscience.org/cgi/doi/10.1110/ps.051428205.
Supplemental material: see www.proteinscience.org
References
- Altschul, S.F. and Erickson, B.W. 1986. A nonlinear measure of subalignment similarity and its significance levels. Bull. Math. Biol. 48 617–632.
- Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., and Bourne, P.E. 2000. The Protein Data Bank. Nucleic Acids Res. 28 235–242.
- Bowie, J.U., Luthy, R., and Eisenberg, D. 1991. A method to identify protein sequences that fold into a known three-dimensional structure. Science 253 164–170.
- Bron, C. and Kerbosch, J. 1973. Finding all cliques of an undirected graph [H]. Commun. ACM. 16 575–577.
- Fischer, D., Elofsson, A., Rice, D., and Eisenberg, D. 1996. Assessing the performance of fold recognition methods by means of a comprehensive benchmark. Pac. Symp. Biocomput. 1996 300–318.
- Gan, H.H., Perlow, R.A., Roy, S., Ko, J., Wu, M., Huang, J., Yan, S., Nicoletta, A., Vafai, J., Sun, D., et al. 2002. Analysis of protein sequence/structure similarity relationships. Biophys. J. 83 2781–2791.
- Gribskov, M. and Robinson, N.L. 1996. Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Comput. Chem. 20 25–33.
- Holm, L. and Sander, C. 1993. Protein structure comparison by alignment of distance matrices. J. Mol. Biol. 233 123–138.
- Jaroszewski, L., Li, W., and Godzik, A. 2002. In search for more accurate alignments in the twilight zone. Protein Sci. 11 1702–1713.
- Kabsch, W. and Sander, C. 1983. Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22 2577–2637.
- Karlin, S. and Altschul, S.F. 1990. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. 87 2264–2268.
- Kolodny, R., Koehl, P., and Levitt, M. 2005. Comprehensive evaluation of protein structure alignment methods: Scoring by geometric measures. J. Mol. Biol. 346 1173–1188.
- Koradi, R., Billeter, M., and Wüthrich, K. 1996. MOLMOL: A program for display and analysis of macromolecular structures. J. Mol. Graph. 14 51–55.
- Kuhl, F.S., Crippen, G.M., and Friesen, D.K. 1984. A combinatorial algorithm for calculating ligand-binding. J. Comp. Chem. 5 24–34.
- Lee, B. and Richards, F.M. 1971. The interpretation of protein structures: Estimation of static accessibility. J. Mol. Biol. 55 379–400.
- Levitt, M. and Gerstein, M. 1998. A unified statistical framework for sequence comparison and structure comparison. Proc. Natl. Acad. Sci. 95 5913–5920.
- Maiorov, V.N. and Crippen, G.M. 1995. Size-independent comparison of protein three-dimensional structures. Proteins 22 273–283.
- Needleman, S.B. and Wunsch, C.D. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48 443–453.
- Norvell, J.C. and Machalek, A.Z. 2000. Structural genomics programs at the US National Institute of General Medical Sciences. Nat. Struct. Biol. S7 931.
- Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B., and Thornton, J.M. 1997. CATH: A hierarchic classification of protein domain structures. Structure 5 1093–1108.
- Rice, D.W. and Eisenberg, D. 1997. A 3D–1D substitution matrix for protein fold recognition that includes predicted secondary structure of the sequence. J. Mol. Biol. 267 1026–1038.
- Šali, A. and Blundell, T.L. 1990. Definition of general topological equivalence in protein structures: A procedure involving comparison of properties and relationships through simulated annealing and dynamic programming. J. Mol. Biol. 212 403–428.
- Shapiro, J. and Brutlag, D. 2004. FoldMiner: Structural motif discovery using an improved superposition algorithm. Protein Sci. 13 278–294.
- Shi, J., Blundell, T.L., and Mizuguchi, K. 2001. FUGUE: Sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J. Mol. Biol. 310 243–257.
- Shindyalov, I.N. and Bourne, P.E. 1998. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 11 739–747.
- Sierk, M.L. and Pearson, W.R. 2004. Sensitivity and selectivity in protein structure comparison. Protein Sci. 13 773–785.
- Suyama, M., Matsuo, Y., and Nishikawa, K. 1997. Comparison of protein structures using 3D profile alignment. J. Mol. Evol. 44S 163–173.
- Tang, C.L., Xie, L., Koh, I.Y., Posy, S., Alexov, E., and Honig, B. 2003. On the role of structural information in remote homology detection and sequence alignment: New methods using hybrid sequence profiles. J. Mol. Biol. 334 1043–1062.
- Taylor, W.R. 1999. Protein structure comparison using iterated double dynamic programming. Protein Sci. 8 654–665.
- Yang, A.S. and Honig, B. 2000. An integrated approach to the analysis and modeling of protein sequences and structures. III. A comparative study of sequence conservation in protein structural families using multiple structural alignments. J. Mol. Biol. 301 691–711.
- Ye, Y. and Godzik, A. 2004. Database searching by flexible protein structure alignment. Protein Sci. 13 1841–1850.