Skip to main content
Bioinformatics logoLink to Bioinformatics
. 2008 Jul 1;24(13):i15–i23. doi: 10.1093/bioinformatics/btn171

The effectiveness of position- and composition-specific gap costs for protein similarity searches

Aleksandar Stojmirović 1, E Michael Gertz 1, Stephen F Altschul 1, Yi-Kuo Yu 1,*
PMCID: PMC2718649  PMID: 18586708

Abstract

Motivation: The flexibility in gap cost enjoyed by hidden Markov models (HMMs) is expected to afford them better retrieval accuracy than position-specific scoring matrices (PSSMs). We attempt to quantify the effect of more general gap parameters by separately examining the influence of position- and composition-specific gap scores, as well as by comparing the retrieval accuracy of the PSSMs constructed using an iterative procedure to that of the HMMs provided by Pfam and SUPERFAMILY, curated ensembles of multiple alignments.

Results: We found that position-specific gap penalties have an advantage over uniform gap costs. We did not explore optimizing distinct uniform gap costs for each query. For Pfam, PSSMs iteratively constructed from seeds based on HMM consensus sequences perform equivalently to HMMs that were adjusted to have constant gap transition probabilities, albeit with much greater variance. We observed no effect of composition-specific gap costs on retrieval performance. These results suggest possible improvements to the PSI-BLAST protein database search program.

Availability: The scripts for performing evaluations are available upon request from the authors.

Contact:yyu@ncbi.nlm.nih.gov

1 INTRODUCTION

Information retrieval from molecular databases by sequence alignment is an essential component of modern biology. The effectiveness of retrieval strategies depends crucially on how alignments are scored. A pairwise alignment score typically combines scores for the substitutions, insertions and deletions that transform one sequence into another. Scores for substitutions are derived from a substitution matrix, while scores for insertions and deletions are known as gap costs. The importance of gap costs has prompted numerous studies proposing various reasonable gap penalty schemes (Benner et al., 1993; Chang and Benner 2004; Pascarella and Argos 1992; Qiu and Elber 2006; Reese and Pearson 2002; Wrabl and Grishin 2004).

Search accuracy may be improved substantially by using position-specific scoring matrices (PSSMs; Gribskov et al., 1987). In addition, it is possible to introduce position- and composition-specific gap costs, which so far have been implemented primarily by hidden Markov models (HMMs) (Durbin et al., 1998; Krogh et al., 1994). In this article, we attempt to quantify the effect of different gap scores on retrieval performance using PSI-BLAST (Altschul et al., 1997) and HMMER (Eddy 1998, 2003), canonical examples of software tools employing PSSMs and HMMs, respectively.

As its name suggests, a PSSM assigns scores to amino acids in a database sequence based on the position in which they occur in the alignment. PSI-BLAST computes and scores alignments using a heuristic approximation to the Smith–Waterman algorithm (Smith and Waterman 1981) with affine gap costs (Gotoh 1982) providing uniform penalties for opening and extending a gap. PSSMs used by PSI-BLAST may be generated through an iterative search procedure or obtained from other sources, such as databases of curated multiple sequence alignments (MSAs).

Two publicly available sources of curated alignments are the Pfam (Finn et al., 2006) and SUPERFAMILY (Gough et al., 2001; Wilson et al., 2007) databases. The latter is derived from the SCOP (structural classification of proteins) database (Andreeva et al., 2007; Murzin et al., 1995). In both, each MSA is represented by an HMM, which may be used for similarity searches. An HMM is a finite-state automaton, characterized by state-to-state transition probabilities and emission probabilities that generate hypothetical protein sequences. See Figure 1 for an example and Appendix 1 for more details.

Fig. 1.

Fig. 1.

An example of a protein profile HMM architecture used by HMMER. The model contains n positions plus a begin state (B) and end state (E). Each position contains a substitution (S) and a deletion state (D), with a possible insertion state (I) between two S-nodes. Allowed transitions are shown by arrows. To simulate local alignments, transitions BSi and SiE, for any Si, are permitted.

The HMMER package (Eddy 1998, 2003) uses the Viterbi algorithm (Durbin et al., 1998), which finds the highest scoring sequence of states in the HMM that produces the database sequence. The probability that a particular amino acid is emitted in a HMMER substitution state may be identified with the probability that it occurs in a corresponding position in a PSI-BLAST PSSM. On the other hand, HMMER allows position- and composition-specific gap scores, which model the probability that an insertion or deletion occurs at a particular position in an alignment.

With their greater gap cost flexibility, HMMs may be expected to have better retrieval accuracy than PSSMs. We attempt to quantify the effect of HMMER's use of more general gap parameters by separately examining the influence of position- and composition-specific gap scores. We also compare the retrieval accuracy of the PSSMs constructed using PSI-BLAST's iterative procedure to that of the HMMs provided by the Pfam and SUPERFAMILY collections. Our results may suggest some directions for improvements to PSI-BLAST, and the magnitude of the improvements one might expect.

We collected protein profile HMMs from SUPERFAMILY and Pfam. We then modified the profiles from each source to simulate different retrieval strategies, and used them as queries for HMMER and PSI-BLAST to search a set of sequences from the SCOP database, which forms our ‘gold standard’. We use the results of the searches to evaluate and compare the retrieval performance of the search methods considered.

SCOP is a database of protein domains, classified by structure, function and sequence. Protein domains are classified into a hierarchy of class, fold, superfamily and family. Domains sharing the same superfamily are assumed to be homologous. For our testing purposes, we use the ASTRAL 40 (Chandonia et al., 2004) subset of SCOP (release 1.71), consisting of domain sequences that were filtered so that no two sequences share more than 40% pairwise identity. ASTRAL has been used as the testing set in a number of performance evaluations of protein sequence comparison algorithms (Green and Brenner 2002; Price et al., 2005; Vinga et al., 2004; Yu et al., 2006).

It is generally useful to evaluate not only the difference in performance of two search methods, but also whether such a difference is statistically significant. A number of procedures have been proposed, mostly based on bootstrap resampling with replacement (Green and Brenner 2002; Price et al., 2005). In this context, Green and Brenner (2002) observed that large superfamilies have an undue influence on the results, as the number of possible relationships grows quadratically with the number of members in a superfamily. They, therefore, proposed two weighting schemes that reduce the influence of large superfamilies. Price et al., (2005) noted technical challenges in obtaining accurate variances for the weighted statistics and proposed an improved bootstrap.

Our query sets, based on Pfam and SUPERFAMILY, contain several models for each SCOP-classified superfamily. Some superfamilies are overrepresented both in the query sets and in the ASTRAL database. We propose a different method than Price et al., (2005) to address the difficulties associated with having superfamilies of different sizes. Our strategy is to sample without replacement three fourths of the superfamilies and then select a single model for each superfamily in any given query set. Hence, each sample contains no more than a single profile from each superfamily and therefore captures the most distant relationships among queries.

2 MATERIALS AND METHODS

2.1 Software tools

For HMM-based queries, we used the HMMER package (version 2.3.2) (Eddy 1998, 2003), which is also used internally by Pfam. Local alignment between a sequence and an HMM is allowed by the non-zero probabilities of entering match nodes directly from the begin state, as well as moving directly to the end state from them (Fig. 1). The statistical significance of each alignment score is estimated using an assumed extreme value distribution, with model-specific parameters. The final E-value, adjusted for model and sequence composition, is used to rank the hits. Another popular HMM platform is SAM (Barrett et al., 1997; Hughey and Krogh 1996; Karplus et al., 2005), which is used by SUPERFAMILY;We used HMMER rather than SAM for all our HMM-based queries because the programs’ retrieval performances were shown to be comparable (Madera and Gough 2002; Wistrand and Sonnhammer 2005) and because the SUPERFAMILY models were available in HMMER format.

For PSSM-based queries, we used PSI-BLAST (version 2.2.17) Altschul et al., 1997). The statistics of PSI-BLAST scores are based on the extreme value distribution (Gumbel 1958) with a correction for finite sequence length. The statistical significance of each database hit is refined by taking into account its composition as well as that of the PSSM (Schäffer et al., 2001).

PSI-BLAST allows one to start a search from a ‘checkpoint’ file containing a PSSM saved from an earlier PSI-BLAST run, or built by other means. In addition to a PSSM, PSI-BLAST requires gap penalties as input: a gap opening cost and a gap extension cost. The choice of gap penalties is restricted to a few values because the parameters required to produce accurate statistics are precomputed using large-scale simulations. For both HMMER and PSI-BLAST runs, we used the standard search exectutables with their default settings.

2.2 Query sets

Following Wistrand and Sonnhammer (2005), we constructed a query set of Pfam (release 22.0) models by identifying all Pfam-A models that were cross-referenced by Pfam with an identifier in SCOP 1.71, and mapping the cross-referenced SCOP identifier to a SCOP superfamily. We did not consider models that have multiple domains mapping to different superfamilies.

We filtered the resulting set of Pfam models using two additional rules. First, any model mapping to a SCOP superfamily that had fewer than four members in ASTRAL 40 was removed from further consideration, to avoid superfamilies with a small number of members from disproportionally influencing the results. Next, we examined the MSA used to generate the Pfam profile and kept only those families whose MSA contained at least 10 sequences and had an average sequence length of at least 30 amino acids. Our final Pfam query set contained 703 Pfam models representing 299 superfamilies. We used the profiles from the Pfam_fs set, built for local/local alignment.

Our second query set consisted of all 6729 models from the SUPERFAMILY database (release 1.69) that belonged to the 299 superfamilies in the Pfam query set. These models were also built for local/local alignment. The above query sets, paired with HMMER, formed our first two search methods, which we named HOF (HMM, ‘original’, Pfam) and HOU (HMM, ‘original’, SUPERFAMILY).

The second pair of search methods, called HBF and HBU (see Table 1 for an outline of all search methods), was constructed by taking the HMMs from HOF and HOU, respectively, and replacing all emission scores for each insert state with 0. This is equivalent to setting all insertion emission probabilities to the background probabilities.

Table 1.

Nomenclature of search strategies

Name Description
HO Original HMM dataset
HB HMMs, background insertion emission probabilities
HG HMMs, constant state transitions and background insertion emissions
PO PSSMs, converted from original HMMs.
PC PSSMs, from five PSI-BLAST iterations over nr using profile consensus seeds
PS PSSMs, from five PSI-BLAST iterations over nr using SCOP domain sequence seeds

As shown in this table, the first two letters of the abbreviations of various search strategies denote the type of profile (HMM or PSSM), and the method of construction. The third letter is optionally appended to show the database of origin (ℱ for Pfam, U for SUPERFAMILY).

We constructed the third pair of search methods, called HGF and HGU by taking the HMMs from HBF and HBU, respectively, and adjusting the state transition probabilities to correspond to those implied by the affine gap penalties used by PSI-BLAST (see Appendix 1 for a detailed explanation).

Let α denote the gap opening cost and β the gap extension cost, in bits. We used the default penalty of PSI-BLAST, which is 11 (α=5.040 bits) for gap opening and 1 for gap extension (β=0.458 bits). This scale was chosen to match the scale of BLOSUM62 Henikoff and Henikoff 1992, the default scoring matrix of BLAST.

For each position m of an HMM, we left the probabilities P(BSm) and em=P(SmE) unchanged and set the remaining transition probabilities as follows:

graphic file with name btn171m1.jpg 1
graphic file with name btn171m2.jpg 2
graphic file with name btn171m3.jpg 3
graphic file with name btn171m4.jpg 4

where μ=2α+β and ν=2β. The probabilities were read from HMMER files, converted from scores, modified and written back as scores, as per HMMER convention (Eddy 2003). After modification, the HMMER statistical parameters of each HMM of HBF, HBU, HGF and HGU were recalibrated.

The remaining search methods used PSI-BLAST with default gap penalties. POF and POU used PSSMs derived from HOF and HOU, respectively, by taking the match state emission probabilities and writing them in PSI-BLAST checkpoint format. PCF and PCU used PSSMs obtained using the standard PSI-BLAST iterative procedure. We obtained the consensus (most likely) sequences of POF and POU profiles and used them as seeds for the initial searches, running five iterations in total against nr, the database of non-redundant protein sequences maintained by NCBI (frozen on April 11, 2007) (Wheeler et al., 2007).

The final search method, named PSU used the same construction procedure as POU except that the SCOP sequences associated with SUPERFAMILY models were used as PSI-BLAST seeds instead of profile consensus sequences.

2.3 Performance evaluation

As described earlier, our query sets contained no profiles assigned to more than one SCOP superfamily. Each pair p,s, where p is a query profile and s is an ASTRAL sequence, was classified as similar (‘positive’) if s belongs to the superfamily associated with p, and not similar (‘negative’) otherwise. For every query pk from a set of queries, denote by Np(pk) the number of ASTRAL 40 sequences belonging to the same superfamily as pk (i.e. the total number of positives for pk) and let Np=∑k Np(pk).

Comparing each query profile to the ASTRAL 40 database, we retrieved a number of sequences ranked according to their E-values. These sequences were classified as true or false positives. For a given search strategy, after merging the results for the whole set of queries, we obtain the (step) functions p(E) and f(E) giving, respectively, the cumulative numbers of true and false positives with E-value E or smaller. The function p can also be expressed as a function of f, the number of false positives and the graph of p(f) versus f is called the receiver operating characteristic (ROC) curve (Gribskov and Robinson 1996; Hajian-Tilaki and Hanley 2002; Hanley and McNeil 1982). The same curve can be displayed as a coverage versuss error-per-query (EPQ) or which is known as a CVE plot.

Our main performance statistic is the (truncated) ROC score. Given a number of false positives F, the ROCF score is defined by

graphic file with name btn171m5.jpg 5

It represents the accuracy of the search method (given a set of queries) for a given number of false positives. To compare two search methods M1 and M2 we compute their relative ROCF score difference, denoted RRSDF, defined by

graphic file with name btn171m6.jpg 6

To overcome the aforementioned problems associated with overrepre-sentation of large superfamilies, we sampled according to the superfamily classification. For each sample we randomly picked 224 out of 299 super-families (leaving one-fourth out) without replacement. Then, we selected one representative profile for each superfamily to form a sample query set. Search methods using the profiles originating from the same source (Pfam or SUPERFAMILY) used the same samples so that their performances could be compared for each sample. Our main statistic is the RRSD224 per sample, which measures performance at 1 EPQ or less. It allows a fair comparison of search methods.

3 RESULTS

Figure 2 shows the distributions of ROC224 scores and their relative differences (RRSD224) per sample with respect to HO for all query sets. Comparison of Figure 2a and b shows that, in general, the strategies using profiles from SUPERFAMILY perform better than those using Pfam profiles. In terms of relative difference (Figure 2c and d, Table 2), using both Pfam and SUPERFAMILY profiles, original HMMs (HO) perform significantly better than all other query sets except HB. There is no perceivable difference between HB and HO. There is also no significant difference between HG and PO.

Fig. 2.

Fig. 2.

ROC score statistics of 1 million samples. In each sample, 224 superfamilies are first randomly chosen from 299 superfamilies. A representative query profile is then randomly selected from each chosen superfamily. ROC score histograms from using Pfam HMMs (a) and SUPERFAMILY HMMs (b) show appreciable difference in average ROC scores for each search method tested: SUPERFAMILY HMMs always perform better. Note that in panels (a) and (b), the curve for HO is completely covered by that for HB. Using HOF and HOU as baselines, the values of RRSD224 (measurement at 1 EPQ) between various methods and the baselines are computed for each sample. The resulting histograms are shown in panels (c) and (d).

Table 2.

Summary of statistics of RRSD224 between every pair search strategies using the same source

graphic file with name btn171i19.jpg

In Figure 2c and d, HOF and HOU were used as the baselines for Pfam and SUPERFAMILY search strategies, respectively, and the histograms of RRSD224 relative to the baselines are shown. It is impractical to show such histograms for all possible baselines. However, for each pair of search strategies, we may sort (in ascending order) their 1 million values of RRSD224 and record the corresponding RRSD224 value at various designated percentiles. In the table, there are three numbers in a row for any given pair of search strategies. As an example, the numbers 2.9, 4.5 and 6.3, associated with M1=HBF and M2=HGF, are located in the row labeled by HBF and within the column headed by HGF. Those numbers, when divided by 100, have the following interpretation: the leftmost corresponds to the RRSD224 value at the 2.5th percentile, the middle to the median and the rightmost to the 97.5th percentile. Panel A records the numbers associated with Pfam search methods, while Panel B documents those associated with the SUPERFAMILY strategies tested.

In the case of PSSMs, POU gives better performance than PCU and PSU, but there is no significant difference between POF and PCF, although PCF shows a large variance in performance. In a number of cases, a PCF sample even outperforms the corresponding HOF sample. The relative ROC score difference between PCU and PSU is slightly positive, but not significantly so.

Using profiles from Pfam (SUPERFAMILY), we observed two (three) clusters of search strategies that performed equivalently based on RRSD224 (Fig. 2c and d). This trend in performance is supported by Figure 3, which displays examples of CVE curves for all alignment methods tested. The samples associated with these CVE curves have the median ROC224 score.

Fig. 3.

Fig. 3.

Example CVE curves for various search strategies based on Pfam (a) and SUPERFAMILY (b) profiles. Each curve shown is a representative that corresponds to a sample with ROC224 score equal to the median of 1 000 000 samples.

4 DISCUSSION AND CONCLUSION

The clear separation in retrieval performance between the SUPERFAMILY and Pfam profiles could be explained by the fact that the former are based on ASTRAL sequences, which form our testing set as well. In contrast, Pfam models are based on a variety of sequence sources and were not trained on ASTRAL. Hence, a degree of overfitting the SUPERFAMILY models to the testing set, as well as the fact that ASTRAL is structure based, may explain the overall differences in performance.

Another interesting observation is that CVE curves (Fig. 3) cross at low EPQ and form distinct clusters above 0.5 EPQ. Due to small sample size, the coverage at low EPQ is expected to have a larger uncertainty, thus the crossing of CVE curves there is anticipated. At moderate EPQ, the distinct clusters indicate that the relative retrieval efficiency is not influenced by the choice of EPQ.

On both testing collections, we have observed almost no difference in performance between the original HMMs (HO) and the models derived from them having insertion emission probabilities reset to the background (HB). Examining the models in HMMER format, we found that the insertion emission distributions were almost constant over all the positions, with the common distribution being slightly biased in favor of hydrophilic amino acids. The average relative entropy between this distribution and the background distribution is very small (0.037 bits for Pfam, 0.005 bits for SUPERFAMILY), explaining the very small effect of the insertion emissions on the retrieval performance. Note that SUPERFAMILY models had higher overall probabilities of entering a gap state and hence showed a larger influence of insertion emissions than Pfam models (Figure 2c and d).

In addition, an insertion emission distribution biased in favor of hydrophilic amino acids may not be appropriate for all positions within proteins: it implicitly assumes a globular protein structure, with hydrophobic core and hydrophilic surface. Finally, from an information theoretic point of view, it is very difficult to reliably estimate insertion emission probabilities. In particular, if one wishes to establish an emission model whose emission probabilities are similar to those of the background and wants to confidently distinguish those two sets of probabilities, it is necessary to have a large amount of data. The following example illustrates this point.

In the Pfam insertion emission model, Leucine's emission probability, 0.0676, has the largest deviation compared to the background 0.0934. Consider a simple coin tossing experiment where the probability of seeing a leucine (head) is P=0.0676 and the probability of seeing any other amino acid (tail) is 1−0.0676. One may ask how many tosses (number of amino acids present in a gap column of an MSA) are needed in order to confidently rule out the possibility that the probability is 0.0934. It is well known that a binomial distribution in the large number limit becomes a Gaussian. In our example, the probability of observing k heads out of n tosses becomes

graphic file with name btn171um1.jpg

To reject with 85% confidence the value of 0.0934 as the probability of seeing a head, the absolute difference between the two probabilities, 0.0934 and 0.0676, must be greater than or equal to 1.037 times the SD, Inline graphic. This leads to

graphic file with name btn171um2.jpg

When applied to estimating insertion emission probabilities, this example implies that one needs to have about 137 amino acids in a gap column of a multiple alignment. This number seems large for columns associated with an insert state, as these columns normally have more gaps than amino acids. On the other hand, we can confidently determine emission probabilities for columns that contain mostly amino acids and are therefore usually assigned to substitution states. Furthermore, the dominant amino acid in a match column often has very different observed and background frequencies. For example, consider a match column with 20% leucine. The same calculation as above tells us that we need only eight or more amino acids in the match column to indicate a preference for leucine. Of course, considering the subdominant amino acids require more entries in the match column.

Comparing HO to HG and PO, we see that profiles with positiondependent gap parameters have 5% better retrieval performance (as measured by the median RRSD224 value) than those with position-independent ones. This is an area where HMMs are clearly superior to the PSSMs with constant gap penalties, as used by PSI-BLAST. Hence, a possible direction for improvement of PSI-BLAST is to introduce position-dependent gap parameters. When interpreting this difference, one should note that we did not optimize the PSI-BLAST gap penalties, but use only the defaults. It is therefore possible that the performance of PSI-BLAST with a better set of gap opening and extension penalties would more closely match the performance of HMMs. Another possibility is to estimate and optimize gap parameters for each PSSM separately, at the time of its creation (that is, each PSSM would still carry a single, position independent, gap opening and gap extension penalty, but they would not be input beforehand but estimated from the data). The practical problem with these suggested improvements is that the statistical parameters for position-specific gap penalties cannot be quickly computed as yet, and one is therefore restricted to the costs for which the parameters have been precomputed. Another possibility is to modify PSI-BLAST to use the hybrid alignment algorithm (Yu and Hwa 2001; Yu et al., 2002), which is probabilistic, naturally accepts PSSMs with position-specific gap costs, and has well-characterized, universal statistics.

It is not surprising that the performances of HG and PO show no significant difference because HG was designed to simulate the PSI-BLAST gap parameters in the HMM framework. Some differences still exist due to a fundamental difference between the underlying algorithms. First, although the score statistics for HMMER and PSI-BLAST are both based on the extreme value distribution, there are still differences in details. Second, PSI-BLAST alignments may have longer segments of ungapped alignment because the score associated with ungapped alignment is not reduced by the probability of entering another node. Some difference can also be explained by slightly different background probabilities in each case. Finally, local alignment is achieved through different mechanisms: PSI-BLAST alignments terminate when their accumulated score is maximal, while HMMER alignments terminate only when they hit the end state. Thus, HMMER alignments may tend to be more global with respect to the profile.

The difference in performance of PSI-BLAST using PSSMs constructed in different ways shows that focusing on profile construction as well as on position-specific gaps may yield significant improvement. In particular, the performance of PSSMs converted from HMMs (PO) versus those iteratively constructed (PC and PS) shows that a more carefully constructed profile may yield better performance, with the difference being more pronounced in SUPERFAMILY than in Pfam. The fact that the PSSMs obtained iteratively from nr based on SUPERFAMILY consensus seeds generally perform better than those originating from Pfam consensus seeds shows the importance of the choice of the initial seed sequence. This is further emphasized by the slightly better performance of the PSSMs based on the consensus sequence as seed (PCU) than the performance of those based on the seeds taken from ASTRAL (PSU). Hence, another possible way of improving PSI-BLAST would be to run one iteration using the normal scoring matrix and construct a profile as before, but then to rerun the search using the consensus sequence as the seed instead of proceeding into the iterative stage with the profile. In that way, a more ‘central’ seed can be obtained, which, while not corresponding exactly to any sequence present in the dataset, may yield a more accurate profile for the iterative steps. Naturally, the choice of the weighting scheme for the multiple alignment used to obtain the consensus sequence or profile as well as the associated pseudocounts will also exert a significant influence on the result.

Finally, our methodology must be understood in the context of the small size of the testing suite. This does not present a significant problem when testing different parameter sets of the same alignment algorithm but when comparing different algorithms it is essential to eliminate bias due to superfamily size. Our approach, based on sampling three fourth of the superfamilies without replacement, was designed with this aim in mind.

ACKNOWLEDGEMENTS

We thank M. Wistrand and E. Sonnhammer for useful corres-pondence.

Funding: This work was supported by the Intramural Research Program of the National Library of Medicine at National Institutes of Health.

Conflict of Interest: none declared.

APPENDIX 1

The connection between the transition probabilities of HMMs for sequence evolution and the scoring function (scoring matrices and gap parameters) used in sequence comparison is elaborated in this appendix. Since such a connection has been sketched explicitly in earlier publications on hybrid alignment (Yu and Hwa 2001; Yu et al., 2002), interested readers are encouraged to look into the original literature. We present a self-contained exposition here to save the reader some effort in reading through earlier papers, and to present a minor extension needed for aligning a protein sequence to a local HMM with explicit termination probabilities at its nodes. Note that keeping a non-zero termination probability is how HMMER achieves local alignments. Hybrid alignments achieve a local alignment by taking the maximum of the log-odd ratios at each possible termination point, and hence do not need to deal explicitly with the termination probabilities of the HMMs.

The fundamental idea of protein sequence comparison is rooted in the amino acid score (substitution) matrix, where the (i,j)-th entry

graphic file with name btn171m7.jpg A1

is the log–odd ratio of the joint probability Qij of amino acids i and j in the target ensemble to the product of the background probabilities, pi and pj, of the two amino acids. Here λ is just a scale and is set to unity from this point on. For a valid scoring matrix (Yu et al., 2003), one has pi=∑j Qij, and one may express Qij as Qij=pi T(ji)=pj T(i|j), with T(ji) being the probability for amino acid i to mutate into amino acid j. In this case, we may also write

graphic file with name btn171m8.jpg A2

which may now be viewed as the log–odds ratio of a conditional emission probability to the background probability.

Extending this concept (Yu and Hwa 2001; Yu et al., 2002), one may score the global relatedness (alignment) between two protein sequences, a and b, the same way: using the log–odds ratio of Q[a,b] to P[a]P[b] (the background probability of generating a pair of random sequences a and b). In terms of global relatedness, Q[a, b] may be regarded again as P[a] T[ba] and

graphic file with name btn171m9.jpg A3

Here T[ba] is the probability for sequence a to mutate into sequence b. It is not hard to convince oneself that there are many different ‘ways’ or ‘paths’ for sequence a to mutate into sequence b. In fact, it has been argued that the usual optimal alignment corresponds to the most probable evolutionary path. In this context, the gap cost is related to the transition probabilities in and out of the insertion/deletion states of the HMM.

A protein HMM consists of a number of nodes. Except at the begin node, each node j allows two possible states, substitution (S) and deletion (D). The substitution state associated with node j is characterized by the transition probability from aj to other amino acids. The deletion state is further divided into cases depending on its preceding state. In between two nodes, one can have an insertion (I) state. The transition probabilities from a given state to all other allowed states have to sum to 1. Four transitions—SS, SD, DS, and DD—will each advance the node index by 1. Transitions SI and IS combined together increase effectively the node index by 1, while the transitions II and DI (if allowed) do not change the node index at all. In many HMMs, such as the ones used by HMMER, the transitions between I and D states are strictly forbidden and we follow this rule here to simplify our exposition.

Constrained by the probability conservation condition, the transition probabilities are usually made node-specific (or equivalently termed position-specific). Focusing on the substitution scores of protein HMMs, position-specific scoring simply means that the substitution states at different nodes may emit amino acids with different sets of probabilities.

As a concrete example, let us consider an alignment of a partial HMM model of eight nodes aligned with a sequence b=[b1, b2, …, b7] of seven amino acids. Their alignment is shown in Figure A1. The alignment score S is given by

graphic file with name btn171m10.jpg A4

where si(bj) represents the substitution score for amino acid j at node i, Inline graphic represents the gap score associated with deleting nodes 3 through 6, and Inline graphic represents the gap score associated with inserting amino acids b4, b5 and b6 between nodes 7 and 8 of the HMM. The superscript ‘+’ associated with the I state will be suppressed from this point on. The probability of occurrence associated with this alignment 𝒜 may be written as

graphic file with name btn171um3.jpg

where Inline graphic is the insertion probability of amino acid b between nodes 7 and 8. Assuming that P[b]=∏i p(bi), one obtains the ratio

graphic file with name btn171m11.jpg A5

Comparing (4) and (5) and events of similar type yields the following mappings:

graphic file with name btn171m12.jpg A6

Frequently, HMMs take P(Di−1Di) and P(IjIj) each to be a constant. In this case, the ratio Inline graphic contributes to a position-specific gap opening cost and the ratio Inline graphic contributes to a composition-dependent insertion cost. The quantity P(II) contributes to the insertion gap extension cost. If one keeps emission probabilities Tai(b) node-dependent, but demands that all the state-to-state transition probabilities be node-independent, one essentially has a PSSM with uniform affine gap costs, although possibly with composition-specific insertion gap costs if Inline graphic is chosen to be different from the background p.

Fig. A1.

Fig. A1.

An example of a partial alignment between a profile HMM and a protein sequence. Note that in the text, the state preceding S1 is assumed to be a substitution state.

Since the transition probabilities are constrained by the respective conservation conditions, and those probabilities are related to the scoring function through (6), the substitution and gap scores are no longer independent if one wishes to have a probabilistic interpretation. We now turn to the relationship among score parameters when the state-to-state transition probabilities are node-independent constants. Let η≡P(SS), μD1P(DS)/P(SS), μD2P(SD), μI1P(IS)/P(SS), μI2P(SI), νIP(II) and νDP(DD). Because μI1D1) and μI2D2) always appear together as a product, we further define μI≡μI1μI2D≡μD1 μD2). The probability conservation condition then demands that

graphic file with name btn171m13.jpg A7
graphic file with name btn171m14.jpg A8
graphic file with name btn171m15.jpg A9

Treating νI, νD, μI and μD as fixed parameters allows us to express η, μI2 and μD2 in terms of νI(D) and μI(D). To do so, we multiply (8) by μI2 and multiply (9) by μD2. Together with (7), we have three linear equations with three unknowns: η, μI2 and μD2. Solving these equations yields

graphic file with name btn171um4.jpg

For the case μDI=μ and νDI=ν, these expressions simplify to

graphic file with name btn171um5.jpg

Note that with this notation, we may rewrite (6) as

graphic file with name btn171um6.jpg

It becomes evident that ln(μ/ν) corresponds to the gap opening score while ln(ν) corresponds to the gap extension score, and Inline graphic becomes an additional composition-specific insertion score.

In HMMER, the local alignment is terminated by going into the end state, and the end state can be reached only from substitution states. In this context, the probability conservation Equations (8) and (9) remain unchanged. However, we may allow a node-specific termination probability from the S state. This requires the introduction of a position index for the other transition probabilities. Let ηmP(SmSm+1), emP(SmE), Inline graphic, Inline graphic, Inline graphic, Inline graphic. However, note that P(ImSm+1) should remain the same, because there is no direct transition ImE. Thus, we may still keep both Inline graphic and Inline graphic as constants. The probability conservation condition then yields

graphic file with name btn171m16.jpg A10
graphic file with name btn171m17.jpg A11
graphic file with name btn171m18.jpg A12

the solution of which is

graphic file with name btn171um7.jpg

Although Inline graphic and Inline graphic are decreased, Inline graphic and Inline graphic are kept the same as before. As a consequence, the only change is that the substitution score at each node is reduced by a node-specific constant ln [1/(1−em)] when it is not preceded by a gap state. If an alignment has deletion at node m followed by k more substitutions from node m+1 to node m+k, then the substitution score reduction starts only at node m+2 and persists to node m+k.

REFERENCES

  1. Altschul SF, et al. Gapped BLAST and PSI–BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Andreeva AA, et al. Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res. 2007;36(Database issue):D419–D425. doi: 10.1093/nar/gkm993. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Barrett C, et al. Scoring hidden Markov models. Comput. Appl. Biosci. 1997;13:191–199. doi: 10.1093/bioinformatics/13.2.191. [DOI] [PubMed] [Google Scholar]
  4. Benner SA, et al. Empirical and structural models for insertions and deletions in the divergent evolution of proteins. J. Mol. Biol. 1993;229:1065–1082. doi: 10.1006/jmbi.1993.1105. [DOI] [PubMed] [Google Scholar]
  5. Chandonia J-M, et al. The ASTRAL Compendium in 2004. Nucleic Acids Res. 2004;32:D189–D192. doi: 10.1093/nar/gkh034. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Chang MS, Benner SA. Empirical analysis of protein insertions and deletions determining parameters for the correct placement of gaps in protein sequence alignments. J. Mol. Biol. 2004;341:617–631. doi: 10.1016/j.jmb.2004.05.045. [DOI] [PubMed] [Google Scholar]
  7. Durbin R, et al. Biological Sequence Analysis. Cambridge, UK: Cambridge University Press; 1998. [Google Scholar]
  8. Eddy SR. Profile hidden Markov models. Bioinformatics. 1998;14:755–763. doi: 10.1093/bioinformatics/14.9.755. [DOI] [PubMed] [Google Scholar]
  9. Eddy SR. HMMER user's guide. 2003 ftp://selab.janelia.org/pub/software/hmmer/CURRENT/Userguide.pdf.
  10. Finn RD, et al. Pfam: clans, web tools and services. Nucleic Acids Res. 2006;34:D247–D251. doi: 10.1093/nar/gkj149. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Gotoh O. An improved algorithm for matching biological sequences. J. Mol. Biol. 1982;162:705–708. doi: 10.1016/0022-2836(82)90398-9. [DOI] [PubMed] [Google Scholar]
  12. Gough J, et al. Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J. Mol. Biol. 2001;313:903–919. doi: 10.1006/jmbi.2001.5080. [DOI] [PubMed] [Google Scholar]
  13. Green RE, Brenner SA. Bootstrapping and normalization for enhanced evaluations of pairwise sequence comparison. Proc. IEEE. 2002;90(12):1834–1847. [Google Scholar]
  14. Gribskov M, Robinson NL. Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Comput. Chem. 1996;20:25–33. doi: 10.1016/s0097-8485(96)80004-0. [DOI] [PubMed] [Google Scholar]
  15. Gribskov M, et al. Profile analysis: detection of distantly related proteins. Proc. Natl Acad. Sci. USA. 1987;84:4355–4358. doi: 10.1073/pnas.84.13.4355. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Gumbel EJ. Statistics of Extremes. New York: Columbia University Press; 1958. [Google Scholar]
  17. Hajian-Tilaki KO, Hanley JA. Comparison of three methods for estimating the standard error of the area under the curve in ROC analysis of quantitative data. Acad. Radiol. 2002;9:1278–1285. doi: 10.1016/s1076-6332(03)80561-5. [DOI] [PubMed] [Google Scholar]
  18. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982;143:29–36. doi: 10.1148/radiology.143.1.7063747. [DOI] [PubMed] [Google Scholar]
  19. Henikoff S, Henikoff J. Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA. 1992;89:10915–10919. doi: 10.1073/pnas.89.22.10915. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Hughey R, Krogh A. Hidden Markov models for sequence analysis: extension and analysis of the basic method. Comput. Appl. Biosci. 1996;12:95–107. doi: 10.1093/bioinformatics/12.2.95. [DOI] [PubMed] [Google Scholar]
  21. Karplus K, et al. Calibrating E-values for hidden Markov models using reverse-sequence null models. Bioinformatics. 2005;21:4107–4115. doi: 10.1093/bioinformatics/bti629. [DOI] [PubMed] [Google Scholar]
  22. Krogh A, et al. Hidden Markov models in computational biology: applications to protein modeling. J. Mol. Biol. 1994;235:1501–1531. doi: 10.1006/jmbi.1994.1104. [DOI] [PubMed] [Google Scholar]
  23. Madera M, Gough J. A comparison of profile hidden Markov model procedures for remote homology detection. Nucleic Acids Res. 2002;30:4321–4328. doi: 10.1093/nar/gkf544. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Murzin AG, et al. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 1995;247:536–540. doi: 10.1006/jmbi.1995.0159. [DOI] [PubMed] [Google Scholar]
  25. Pascarella S, Argos P. Analysis of insertions/deletions in protein structures. J. Mol. Biol. 1992;224:461–471. doi: 10.1016/0022-2836(92)91008-d. [DOI] [PubMed] [Google Scholar]
  26. Price GA, et al. Statistical evaluation of pairwise protein sequence comparison with the Bayesian bootstrap. Bioinformatics. 2005;21:3824–3831. doi: 10.1093/bioinformatics/bti627. [DOI] [PubMed] [Google Scholar]
  27. Qiu J, Elber R. SSALN: an alignment algorithm using structure-dependent substitution matrices and gap penalties learned from structurally aligned protein pairs. Proteins. 2006;62:881–891. doi: 10.1002/prot.20854. [DOI] [PubMed] [Google Scholar]
  28. Reese JT, Pearson WR. Empirical determination of effective gap penalties for sequence comparison. Bioinformatics. 2002;18:1500–1507. doi: 10.1093/bioinformatics/18.11.1500. [DOI] [PubMed] [Google Scholar]
  29. Schäffer AA, et al. Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res. 2001;29:2994–3005. doi: 10.1093/nar/29.14.2994. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Smith TF, Waterman MS. Identification of common molecular subsequences. J. Mol. Biol. 1981;147:195–197. doi: 10.1016/0022-2836(81)90087-5. [DOI] [PubMed] [Google Scholar]
  31. Vinga S, et al. Comparative evaluation of word composition distances for the recognition of SCOP relationships. Bioinformatics. 2004;20:206–215. doi: 10.1093/bioinformatics/btg392. [DOI] [PubMed] [Google Scholar]
  32. Wheeler DL, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2007;35:D5–D12. doi: 10.1093/nar/gkl1031. [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Wilson D, et al. The SUPERFAMILY database in 2007: families and functions. Nucleic Acids Res. 2007;35:D308–D313. doi: 10.1093/nar/gkl910. [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Wistrand M, Sonnhammer EL. Improved profile HMM performance by assessment of critical algorithmic features in SAM and HMMER. BMC Bioinformatics. 2005;6:99. doi: 10.1186/1471-2105-6-99. [DOI] [PMC free article] [PubMed] [Google Scholar]
  35. Wrabl JO, Grishin NV. Gaps in structurally similar proteins: towards improvement of multiple sequence alignment. Proteins. 2004;54:71–87. doi: 10.1002/prot.10508. [DOI] [PubMed] [Google Scholar]
  36. Yu Y-K, Hwa T. Statistical significance of probabilistic sequence alignment and related local hidden Markov models. J. Comput. Biol. 2001;8:249–282. doi: 10.1089/10665270152530845. [DOI] [PubMed] [Google Scholar]
  37. Yu Y-K, et al. Hybrid alignment: high-performance with universal statistics. Bioinformatics. 2002;18:864–872. doi: 10.1093/bioinformatics/18.6.864. [DOI] [PubMed] [Google Scholar]
  38. Yu Y-K, et al. The compositional adjustment of amino acid substitution matrices. Proc. Natl Acad. Sci. USA. 2003;100:15688–15693. doi: 10.1073/pnas.2533904100. [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. Yu Y-K, et al. Retrieval accuracy, statistical significance and compositional similarity in protein sequence database searches. Nucleic Acids Res. 2006;34:5966–5973. doi: 10.1093/nar/gkl731. [DOI] [PMC free article] [PubMed] [Google Scholar]

Articles from Bioinformatics are provided here courtesy of Oxford University Press

RESOURCES