Using homology relations within a database markedly boosts protein sequence similarity search

Jing Tong; Ruslan I Sadreyev; Jimin Pei; Lisa N Kinch; Nick V Grishin

doi:10.1073/pnas.1424324112

2015 May 18;112(22):7003–7008. doi: 10.1073/pnas.1424324112

PMCID: PMC4460465 PMID: 26038555

Significance

In the field of protein structure prediction, identifying homology to known folds offers the most successful and practically useful strategy to provide protein spatial structure models. For protein sequence targets without closely related structures, even very distant homologs can provide reasonable templates for modeling. Despite significant progress in the field, remote sequence similarity search is far from perfection and fresh ideas are needed to extend detection limits. In computer science, the concept of utilizing internal relations within a database to improve similarity search was key to the success of search engines such as Google. Here, we show that similar consideration of the homology network within a protein database of structure templates can dramatically improve the accuracy of homology search.

Keywords: homology detection, remote sequence similarity search, homology network, protein modeling, similarity score

Abstract

Inference of homology from protein sequences provides an essential tool for analyzing protein structure, function, and evolution. Current sequence-based homology search methods are still unable to detect many similarities evident from protein spatial structures. In computer science a search engine can be improved by considering networks of known relationships within the search database. Here, we apply this idea to protein-sequence–based homology search and show that it dramatically enhances the search accuracy. Our new method, COMPADRE (COmparison of Multiple Protein sequence Alignments using Database RElationships) assesses the relationship between the query sequence and a hit in the database by considering the similarity between the query and hit’s known homologs. This approach increases detection quality, boosting the precision rate from 18% to 83% at half-coverage of all database homologs. The increased precision rate allows detection of a large fraction of protein structural relationships, thus providing structure and function predictions for previously uncharacterized proteins. Our results suggest that this general approach is applicable to a wide variety of methods for detection of biological similarities. The web server is available at prodata.swmed.edu/compadre.

Prediction of protein structure and function by sequence homology is among the most important problems in computational biology of proteins, perhaps next after the grand problem of de novo protein folding. The existing gap between the number of known protein sequences and the number of experimentally determined 3D structures is bound to grow with more genomes sequenced by high-throughput technologies (1, 2). Currently, the most reliable and effective way to predict the structure of an uncharacterized protein is to find a sequence homolog with available structural information (3, 4). The chance of finding such a template for a given protein sequence is increasing as sequence space is becoming more extensively covered by 3D structures (5). However, there is, and will be for a long time, a significant fraction of proteins for which finding experimentally characterized sequence homologs is challenging or impossible. The structures of many such proteins, when solved, reveal their remote homology to previously known structures that are undetectable by current sequence-based homology search methods (6). Therefore, the quality of sequence-based homology search remains key for accurate structure prediction, as consistently confirmed by multiple rounds of the Critical Assessment of protein Structure Prediction (7).

In the last several years, methods for sequence similarity search have been greatly improved by the analysis of sequence patterns reflecting evolutionary, structural, and functional constraints in protein families. Introduction of numerical profiles (3) and hidden Markov models (HMM) (8) has allowed comparing a sequence to a multiple sequence alignment (MSA) (8–10). Such work was later followed by methods for profile–profile (11–13) and HMM–HMM (14–16) comparison, aimed at detecting similarities between distant families. In addition to the residue substitution preferences at sequence positions, MSA can reveal highly informative patterns of interdependence between amino acid content at different positions, in the form of MSA motifs and secondary structure predictions (15, 17, 18).

Is it possible to further improve sequence-based homology search by considering nonsequence information? In a typical homology search aimed at 3D structure prediction, a sequence or family of interest is compared with a database of proteins with known structures. This knowledge allows confident establishment of evolutionary links within the database. In computer science, networks of relationships between database subjects have been successfully used to improve the quality of search methods, most notably web searchers (19). Here, we show that knowledge of the protein database homology network dramatically increases the accuracy of sequence-based search.

To capitalize on this idea, we modified PROCAIN (protein profile comparison with assisting information) (18), our sensitive method for sequence profile similarity search, by considering the template’s homologs within the database (Fig. 1). We designed the new similarity measure of COMPADRE (COmparison of Multiple Protein sequence Alignments using Database RElationships) as a linear combination of the original score for the given template with the scores for a set of its homologs identified by structure, function, and sequence. Consistent similarity of these homologs to the query elevates the original score, which can increase the significance of a marginal sequence-based similarity to a level above detection threshold. On the other hand, a favorable score for a spurious hit becomes less significant if the set of its homologs is consistently dissimilar from the query. Therefore, this measure improves both sensitivity and specificity of homology detection. The resulting method is implemented on a web server (prodata.swmed.edu/compadre) that allows submission of a query sequence for identifying homologs with available 3D structure.

Fig. 1. — Context of template’s relationships within the database is used to modify the original score ( $S_{0}^{T}$ ) for sequence-based similarity between query and template. The modified measure is a linear combination of $S_{0}^{T}$ and scores ( $\sum S_{0}^{H}$ ) for the similarity between query and template’s homologs.

Results

Defining Close and Remote Template Homology Networks.

To establish the network of homologous relationships in the structure database, we defined homologous and nonhomologous pairs in a set of 5,116 representative structural domains (also known as templates) from the Structural Classification of Proteins (SCOP) database (20). Because a good homology search method should rank homologs according to their distance from the query as well as discriminate homologs from nonhomologs, we considered homologous relationships between protein pairs at two different levels. The first level represents “close homologs,” and was defined as domain pairs assigned to the same superfamily in SCOP. These templates are typically closely related to each other by 3D structure, tend to share similarities in sequence and function, and should be ranked higher than more remote homologs. The second level represents “all homologs,” and includes more distantly homologous protein pairs. To define all homologs, the SCOP classification was supplemented with a support vector machine (SVM) classifier (21) that uses a number of sequence and structure similarity scores to establish homology (see SI Appendix, Methods for details). This SVM classifier finds similarities between domain pairs that are most likely evolutionarily related and would be meaningful templates for 3D structure modeling.
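As an illustration of how the two homology levels might be stored, the following Python sketch builds a close-homolog map from superfamily assignments and then extends it to an all-homolog map with additional pairs. The domain identifiers, superfamily labels, and the `extra_pairs` input (standing in for pairs accepted by the SVM classifier) are hypothetical; the data structures actually used by COMPADRE are not described in the text.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def close_homologs(superfamily_of: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each domain to the other domains in the same superfamily (H_close)."""
    members: Dict[str, Set[str]] = defaultdict(set)
    for domain, superfamily in superfamily_of.items():
        members[superfamily].add(domain)
    return {d: members[sf] - {d} for d, sf in superfamily_of.items()}


def all_homologs(close: Dict[str, Set[str]],
                 extra_pairs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Extend H_close with extra (symmetric) homologous pairs to obtain H_all."""
    h_all = {d: set(hs) for d, hs in close.items()}  # copy, keep H_close intact
    for a, b in extra_pairs:
        h_all.setdefault(a, set()).add(b)
        h_all.setdefault(b, set()).add(a)
    return h_all


# Hypothetical toy data: three domains in two superfamilies.
toy_superfamilies = {"d1": "sfA", "d2": "sfA", "d3": "sfB"}
h_close = close_homologs(toy_superfamilies)
h_all = all_homologs(h_close, [("d1", "d3")])  # one classifier-supported remote pair
```

Keeping the two maps separate mirrors the text: the same template can be scored against either its close homologs or all of its homologs.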

Establishing a Scoring Scheme by Using Homology Networks in the Search Database.

The original PROCAIN E values reflecting the sequence-based similarity between the query and templates were first log-transformed into similarity scores $S_{0}$ (see Methods for details). The closest query homolog can often be identified as the top hit by this direct score. To further improve PROCAIN scoring, similarity scores on a particular template ( $S_{0}^{T}$ ) were boosted by the similarity scores of the template’s homologs ( $S_{0}^{H}$ ) according to the following equation:

$$ S_{1} = w_{T} S_{0}^{T} + (1 - w_{T}) \sum S_{0}^{H}, \tag{1} $$

where $S_{0}^{T}$ and $S_{0}^{H}$ are the similarity scores $S_{0}$ for the given template and for a set H of its structure-based homologs within a certain evolutionary distance from the template, respectively. The $S_{0}^{H}$ similarity score can be calculated using either the close homolog level (H_close) or the all-homolog level (H_all) described above. w_T is a weight optimized for the performance (w_T = 0.8).
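A minimal Python sketch of Eq. 1 may help make the effect of the homolog term concrete. It uses w_T = 0.8 from the text and the log transform S₀ = −log(E/E_c) with log(E_c) = 6 given in Methods; the base-10 logarithm and the toy scores are assumptions made only for illustration.

```python
import math
from typing import Sequence


def s0_from_evalue(e_value: float, log_ec: float = 6.0) -> float:
    """S0 = -log(E / E_c) = log(E_c) - log(E). Base 10 is an assumption here."""
    return log_ec - math.log10(e_value)


def s1_score(s0_template: float,
             s0_homologs: Sequence[float],
             w_t: float = 0.8) -> float:
    """Eq. 1: S1 = w_T * S0^T + (1 - w_T) * sum of S0^H over the homolog set H."""
    return w_t * s0_template + (1.0 - w_t) * sum(s0_homologs)


# Toy scores (hypothetical): a marginal hit whose homologs also resemble the query,
# and a spurious hit whose homologs are consistently dissimilar from the query.
marginal = s1_score(2.0, [3.0, 4.0, 1.0])
spurious = s1_score(6.0, [-4.0, -5.0, -3.0])
```

In this toy case the marginal hit rises from 2.0 to about 3.2, while the spurious hit drops from 6.0 to about 2.4, which is the qualitative behavior described for the measure.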

Additional information about the query’s top hit may help detect the query’s homologs in the database. Indeed, we find (SI Appendix, Fig. S1 and Tables S1 and S2) that performance can be additionally improved by transforming the measure in Eq. 1 to boost scores for the templates that are homologous to the top hit and to reduce scores for nonhomologous templates:

$$ S_{2} = S_{1} + h_{top} (\alpha S_{1} + \beta), \tag{2} $$

where $S_{1}$ is the measure defined by Eq. 1, α and β are optimized parameters, and h_top depends on the homology of the template to the top hit that has the highest PROCAIN score: h_top = 1 if the template is homologous to the top hit, h_top = −1 if it is confidently nonhomologous, and h_top = 0 if the homology is unclear (see SI Appendix, Methods for details).
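Eq. 2 can be sketched in a few lines of Python. The parameter pairs are the optimized values reported in Methods (α = 0.3, β = 5.5 for close homologs; α = 0.8, β = 8.0 for all homologs); the function name and the example score are hypothetical.

```python
# Optimized (alpha, beta) pairs as reported in Methods.
PARAMS_CLOSE = (0.3, 5.5)   # scores based on close homologs (H_close)
PARAMS_ALL = (0.8, 8.0)     # scores based on all homologs (H_all)


def s2_score(s1: float, h_top: int, alpha: float, beta: float) -> float:
    """Eq. 2: S2 = S1 + h_top * (alpha * S1 + beta), with h_top in {-1, 0, 1}."""
    if h_top not in (-1, 0, 1):
        raise ValueError("h_top must be -1, 0, or 1")
    return s1 + h_top * (alpha * s1 + beta)


# The same S1 = 10 under the three possible relations to the top hit.
boosted = s2_score(10.0, 1, *PARAMS_CLOSE)     # homologous to the top hit
unchanged = s2_score(10.0, 0, *PARAMS_CLOSE)   # homology unclear
reduced = s2_score(10.0, -1, *PARAMS_CLOSE)    # confidently nonhomologous
```

With these parameters a template with S₁ = 10 moves to about 18.5 when it is homologous to the top hit and to about 1.5 when it is confidently nonhomologous.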

Improvement of COMPADRE Scoring Scheme by the Choice of Homology Network.

The choice of homolog set H in Eqs. 1 and 2 has a dramatic influence on the method’s behavior. Including scores for all of a template’s homologs (H_all) results in a wider sampling of protein space around the template and thus should provide more representative information for homolog/nonhomolog discrimination. Such a wide sample, however, may lead to a scrambled ranking among detected homologs, with the closest ones being placed below more distant ones. For a query’s close homologs, the strong direct similarity signal from the first scoring term of Eq. 1 may be diluted by the contribution of a diverse set of the template’s homologs from the second scoring term of Eq. 1. Restricting the set’s diversity to close homologs (H_close) should improve the ranking of close homologs, but may limit the sensitivity of detection at remote homology levels. Therefore, the size of homolog set H may require adjustment for different evolutionary distances between query and template.

Indeed, applying different sets H (H_all or H_close) to generate $S_{1}$ and $S_{2}$ results in a very different performance of our scoring scheme. We used receiver operating characteristic (ROC) curves to evaluate the homology detection performance of Eq. 2 for all query homologs designated as true positives (Fig. 2A) and for only close homologs designated as true positives (Fig. 2B). An ROC curve plots true positive numbers against false positive numbers at various E-value thresholds. Both plots (Fig. 2 A and B) include the curves for several published sequence-based searchers: the popular PSI-BLAST (position-specific iterated basic local alignment search tool) method (22), PROCAIN (18), and a comparable state-of-the-art method HHSearch (15) (see Methods for details). These plots are shown together with those produced by our scoring scheme using two definitions of H. When contribution from all template homologs (H_all) is allowed in determining S₁ and S₂, the quality of homolog/nonhomolog discrimination is dramatically higher than in other methods (Fig. 2A and SI Appendix, Tables S1 and S3). However, when the contribution is limited to close homologs (H_close), the discrimination between homologs and nonhomologs becomes worse, especially for more distant homologs, in the area further from the plot’s origin (Fig. 2A and SI Appendix, Tables S1 and S3). The situation is opposite when only close homology detection is evaluated (Fig. 2B and SI Appendix, Tables S2 and S4). Inclusion of all template homologs (H_all) in determining S₁ and S₂ of Eq. 2 results in extremely poor identification of close homologs, suggesting that more distant relationships are often erroneously assigned higher significance. Limiting the set H to the close homology level (H_close) leads to an accurate close homology detection far surpassing the original PROCAIN performance (Fig. 2B and SI Appendix, Tables S2 and S4). These effects are consistent among all major protein secondary structure classes (SI Appendix, Figs. S2–S5).
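The count-based ROC curve used in this evaluation (true positive numbers against false positive numbers as the threshold is relaxed) can be sketched as follows. This is a generic illustration rather than the authors' evaluation code; the hit list is hypothetical and ties between scores are not treated specially.

```python
from typing import List, Sequence, Tuple


def roc_counts(hits: Sequence[Tuple[float, bool]]) -> List[Tuple[int, int]]:
    """Return cumulative (false positives, true positives) down the ranked list.

    Each hit is (score, is_true_homolog); higher scores are more significant.
    """
    ranked = sorted(hits, key=lambda h: h[0], reverse=True)
    fp = tp = 0
    curve = [(0, 0)]
    for _, is_true in ranked:
        if is_true:
            tp += 1
        else:
            fp += 1
        curve.append((fp, tp))
    return curve


# Hypothetical query: five hits, three of them true homologs.
toy_hits = [(9.0, True), (7.5, True), (6.0, False), (5.0, True), (1.0, False)]
toy_curve = roc_counts(toy_hits)
```

Changing which pairs are labeled as true (all homologs versus only close homologs) changes the curve without changing the ranking, which is how the two panels of Fig. 2 differ.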

The results shown in Fig. 2 suggest that to improve both the detection of remote homology and the ranking by evolutionary distance to the query, we can adjust the contribution from more distant homologs in Eqs. 1 and 2 according to the template’s distance from the query. Both goals may be achieved if set H is kept relatively narrow for close query–template relationships (the left part of the orange curve in Fig. 2 A and B), and the input from remote template homologs is added only for templates more distant from the query (the right part of the brown curve in Fig. 2 A and B). We construct a combined scoring function for such an adjustment:

$$ S_{3} = w_{c}(S_{2}^{c}) \, S_{2}^{c} + \left(1 - w_{c}(S_{2}^{c})\right) S_{2}^{r}, \tag{3} $$

where $S_{2}^{c}$ and $S_{2}^{r}$ are determined by Eq. 2 with different definitions of set H: only close homologs (H_close) for $S_{2}^{c}$ and all database homologs (H_all) for $S_{2}^{r}$ . The weight w_c is a variable depending on score $S_{2}^{c}$ as a measure of closeness of template to query. For high $S_{2}^{c}$ values [closely similar to the query, when $S_{2}^{c}$ is above an upper boundary $S_{2}^{c (2)}$ ] this weight is set to 1, so that the final score includes only the score of close homologs of a template ( $S_{3} = S_{2}^{c}$ ), whereas for low $S_{2}^{c}$ values [distantly similar to the query, when $S_{2}^{c}$ is less than a lower boundary $S_{2}^{c (1)}$ ] the weight is set to 0 so that the score includes only the score for all template homologs ( $S_{3} = S_{2}^{r}$ ). For the intermediate values of $S_{2}^{c}$ , i.e., $S_{2}^{c (1)}$ < $S_{2}^{c}$ < $S_{2}^{c (2)}$ , the weight monotonically grows from 0 to 1, to gradually mix $S_{2}^{c}$ with $S_{2}^{r}$ . After testing several functions, we find that exponential dependency of w_c on $S_{2}^{c}$ (SI Appendix, Fig. S11) provides the best performance (see Methods for details).
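The mixing in Eq. 3 can be sketched in Python using the boundaries and exponent reported in Methods (lower boundary 33.95, upper boundary 111.5, γ = 0.08). The coefficients a and b are solved here by requiring the weight to equal 0 at the lower boundary and 1 at the upper boundary; that reading of the boundary conditions is an assumption, and the rescaling of $S_{2}^{r}$ mentioned in Methods is omitted.

```python
import math

S2C_LOW = 33.95    # lower boundary of S2^c, from Methods
S2C_HIGH = 111.5   # upper boundary of S2^c, from Methods
GAMMA_C = 0.08     # exponent, from Methods


def w_c(s2_close: float) -> float:
    """Weight of the close-homolog score: 0 below, 1 above, exponential between."""
    if s2_close < S2C_LOW:
        return 0.0
    if s2_close > S2C_HIGH:
        return 1.0
    e_lo = math.exp(GAMMA_C * S2C_LOW)
    e_hi = math.exp(GAMMA_C * S2C_HIGH)
    b = 1.0 / (e_hi - e_lo)   # assumed boundary conditions: w = 0 at low, 1 at high
    a = -b * e_lo
    return a + b * math.exp(GAMMA_C * s2_close)


def s3_score(s2_close: float, s2_remote: float) -> float:
    """Eq. 3: S3 = w_c * S2^c + (1 - w_c) * S2^r."""
    w = w_c(s2_close)
    return w * s2_close + (1.0 - w) * s2_remote
```

Because γ is positive, the weight under this reading stays small over most of the mixing interval and rises sharply near the upper boundary, so the all-homolog score dominates for all but the closest templates.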

Although consideration of a template’s homologs in Eqs. 1–3 can boost scores of marginally detectable homologs, it can also reduce the significance of original PROCAIN E values for highly confident homologs. Thus, we construct a second combined scoring function:

$$ S_{4} = w_{3}(\ln E_{p}) \, S_{3} + \left(1 - w_{3}(\ln E_{p})\right) S_{p}, \tag{4} $$

where $S_{3}$ is determined by Eq. 3 and S_p is the score obtained from the original PROCAIN E value E_p using the Gumbel extreme value distribution (EVD) (23), which approximates a distribution of sequence similarity scores of random comparisons (9, 24–26). Since it was introduced in sequence analysis by BLAST (24), this distribution has been widely used to estimate statistical significance of sequence and profile similarity scores (i.e., to compute E value from a score) in many applications (9, 18, 27). Karlin and Altschul and Dembo et al. (25, 26) estimated parameters of EVD and suggested a formula to transform a sequence score to E value. E values can be back-transformed to scores using this approximation. The weight w₃ is a function of lnE_p, the logarithm of the PROCAIN E value. For low lnE_p values [highly confident PROCAIN hits, when lnE_p is less than a lower boundary lnE_p⁽¹⁾] this weight is set to 0, so that the final score is only determined by the original PROCAIN score (S₄ =S_p), whereas for high lnE_p values [marginal PROCAIN hits, when lnE_p is above an upper boundary lnE_p⁽²⁾] this weight is set to 1 so that the final score S₄ is equal to the new score $S_{3}$ . For the intermediate values of lnE_p, i.e., lnE_p⁽¹⁾ ≤ lnE_p ≤ lnE_p⁽²⁾, the weight monotonically increases from 0 to 1, to gradually mix S₃ with S_p. Testing several functions, we find that exponential dependency of w₃ on lnE_p (SI Appendix, Fig. S12) gives the best performance (see Methods for details). Based on the score $S_{4}$ for a given template, statistical significance of the detected similarity is provided in the form of E value estimated by transforming the score using the EVD approximation.
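A short Python sketch of Eq. 4 follows, using the boundaries and exponent from Methods (lnE_p boundaries of −50 and −15, γ = −0.5). As an assumption, a and b are chosen so that the weight is 0 at the lower boundary and 1 at the upper boundary. The two conversion helpers use the standard Karlin–Altschul form E = K·m·n·e^(−λS); the values of `lam` and `ln_kmn` are hypothetical placeholders, not the EVD parameters fitted by the authors.

```python
import math

LN_E_LOW = -50.0    # lower boundary of ln(E_p), from Methods
LN_E_HIGH = -15.0   # upper boundary of ln(E_p), from Methods
GAMMA_3 = -0.5      # exponent, from Methods


def w3(ln_e_p: float) -> float:
    """Weight of S3: 0 for highly confident PROCAIN hits, 1 for marginal ones."""
    if ln_e_p < LN_E_LOW:
        return 0.0
    if ln_e_p > LN_E_HIGH:
        return 1.0
    e_lo = math.exp(GAMMA_3 * LN_E_LOW)
    e_hi = math.exp(GAMMA_3 * LN_E_HIGH)
    b = 1.0 / (e_hi - e_lo)   # assumed boundary conditions: w = 0 at low, 1 at high
    a = -b * e_lo
    return a + b * math.exp(GAMMA_3 * ln_e_p)


def s4_score(s3: float, s_p: float, ln_e_p: float) -> float:
    """Eq. 4: S4 = w3 * S3 + (1 - w3) * S_p."""
    w = w3(ln_e_p)
    return w * s3 + (1.0 - w) * s_p


def score_from_ln_evalue(ln_e: float, lam: float, ln_kmn: float) -> float:
    """Invert ln E = ln(K*m*n) - lam * S to recover a score from an E value."""
    return (ln_kmn - ln_e) / lam


def ln_evalue_from_score(score: float, lam: float, ln_kmn: float) -> float:
    """Forward direction: ln E = ln(K*m*n) - lam * S."""
    return ln_kmn - lam * score
```

Under this reading the weight climbs quickly above the lower boundary (roughly 0.99 already at lnE_p = −40), so the original PROCAIN score is retained essentially only for the most confident hits.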

The final scoring function $S_{4}$ offers the best performance both in remote homology detection and in ranking by evolutionary distance to a query. Performance of the resulting measure is compared with several methods in Fig. 3 A and B and SI Appendix, Figs. S6–S8. The inclusion of all template homologs (H_all) in the set H leads to highly sensitive and accurate retrieval of homology relationships (Fig. 3A and SI Appendix, Tables S1 and S3). At the same time, using a restricted set H for shorter ranges of query–template distance [only close homologs (H_close)] leads to the correct placement of the close query homologs above others (Fig. 3B and SI Appendix, Tables S2 and S4). One of the most important characteristics of this scoring scheme is precision rate, i.e., the expected proportion of true positives among top hits. As shown in SI Appendix, Table S3, the scoring function $S_{4}$ achieves a precision rate of 83% at half-coverage of all homologs, more than quadruple the original PROCAIN rate of 18%. Thus, the combined measure $S_{4}$ by far exceeds the current state-of-the-art performance levels in both capturing remote protein relationships and ranking homologs consistently with evolutionary distance. We refer to the resulting detection method as COMPADRE, for COmparison of Multiple Protein sequence Alignments using Database RElationships.
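The precision rate at half-coverage can be illustrated with a small Python function. It takes one plausible reading of the definition (an assumption, since the exact procedure is given in the SI Appendix): walk down the ranked hit list until half of all true homologs have been recovered, then report the fraction of true positives among the hits seen so far. The label lists are hypothetical.

```python
import math
from typing import Sequence


def precision_at_coverage(ranked_labels: Sequence[bool],
                          coverage: float = 0.5) -> float:
    """Precision at the rank where `coverage` of all true homologs is reached.

    `ranked_labels` lists hits from most to least significant; True marks a
    true homolog.
    """
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    total_true = sum(bool(x) for x in ranked_labels)
    if total_true == 0:
        raise ValueError("no true homologs in the ranked list")
    needed = math.ceil(coverage * total_true)
    tp = 0
    for rank, is_true in enumerate(ranked_labels, start=1):
        tp += bool(is_true)
        if tp >= needed:
            return tp / rank
    raise RuntimeError("unreachable: coverage target was not met")


# Two hypothetical rankings of the same four homologs and two nonhomologs.
good_ranking = [True, True, False, True, False, True]
poor_ranking = [False, True, False, True, True, True]
```

In this toy example the first ranking reaches half-coverage with precision 1.0 and the second with precision 0.5, showing how the same set of detectable homologs can yield very different precision depending on where false positives fall in the list.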

Fig. 3. — Performance of combined similarity measure implemented in COMPADRE method. As illustrated by the ROC plots (red), the score both accurately discriminates homologs from nonhomologs (A) and assigns top ranks to closest sequence relationships (B).

A web server based on this method (prodata.swmed.edu/compadre) is developed to detect homologs for the query sequence provided by the user. The output page shows ranked significant hits, their sequence alignments to the query, and their 3D structures. The sequence alignments between the query and each template are generated by PROMALS (profile multiple alignment with predicted local structures) (28), which produces better alignments than PROCAIN.

Comparison with Structure Similarity Score.

A more detailed analysis of the COMPADRE results suggests that it accurately captures a large fraction of structural similarities that are only weakly reflected in sequence, and at the same time highlights the similarity of local functional motifs that may be missed by an automatic structure comparison method. As an illustration, Fig. 4 shows the comparison of protein groupings based on COMPADRE E value and on the structural similarity measured by DALI (distance-matrix alignment) (29) Z score. We use 1,313 representative protein domains from the α/β class in our database to perform hierarchical clustering by all-to-all COMPADRE scores (logarithm of COMPADRE E values) (Fig. 4A) and by DALI Z scores (Fig. 4B). The resulting matrices of scores for domain pairs are represented as colored maps, with the scores used for grouping shown above diagonal and the corresponding scores by the other method shown below diagonal.

Fig. 4. — Protein groupings according to sequence-based similarity scoring by COMPADRE compared with structure-based scoring by DALI. Scores for pairs of 1,313 representative domains of α/β class are shown in color. Each panel compares sequence- and structure-based scores, with the score used for clustering shown above diagonal. Major protein groups are labeled on the side. (Scale bars show color coding for DALI Z score and decimal logarithm of COMPADRE E value.) (A) Grouping by COMPADRE score. (B) Grouping by DALI Z score. Similarity between the groupings suggests that COMPADRE is able to accurately retrieve the majority of structural relationships, although a number of remote similarities still remain a challenge. In some cases, sequence comparison is able to better highlight important local motifs resulting in functionally relevant grouping, with P-loop hydrolases as the most notable example.

These groupings provide several notable observations. First, there is a strong general correlation in clustering by both scores, including the identity of major protein groups such as TIM (triosephosphate isomerase)-barrels, Rossmann fold, and SAM (S-adenosylmethionine)-dependent methyltransferases (Fig. 4 A and B). Second, as expected, a fraction of structure-based relationships still remains undetected by sequence. These relationships include both similarities outside major clusters (off-diagonal area of the matrices, Fig. 4B) and links within clusters. For example, TIM-barrels have uniformly high DALI Z scores, whereas the coverage by COMPADRE scores is more fragmented (Fig. 4 A and B). Third, COMPADRE produces several clusters of pronounced similarity that stand out from the background (red in Fig. 4A) but are not produced by DALI. These clusters correspond to local functional sequence motifs whose presence is less obvious from structure comparison alone. The most notable example is P-loop nucleoside triphosphatases that are accurately placed together by COMPADRE but split apart by structure similarity (Fig. 4A). As another example, COMPADRE grouping within the TIM-barrel fold highlights a superfamily of (trans)glycosidases that share similar phosphate-binding sites, which is challenging for DALI-based clustering (Fig. 4A).

Detecting More Homologs at the SCOP Superfamily Level.

Compared with the original PROCAIN scoring, COMPADRE detects more homologs at the SCOP superfamily level. As an example, in bacterial lysozyme [Protein Data Bank (PDB) ID code 1jfx, chain A, SCOP family 1,4-beta-N-acetylmuraminidase, Fig. 5A], the last α/β unit of 1jfx is atypical for TIM-barrel folds: it has an antiparallel hairpin replacing the typical parallel α/β units. A typical TIM-barrel structure (PDB ID code 1bqc, chain A) is shown in Fig. 5B. Due to this alteration, it is difficult for PROCAIN to detect homologs of 1jfx, and the PROCAIN E values for 1jfx vs. other TIM-barrel examples are high. The exceptions are an N-terminal domain of endolysin (PDB ID code 2j8g, chain A, domain 2) and an uncharacterized bacterial protein (PDB ID code 1sfs, chain A), which are in the same SCOP family, 1,4-beta-N-acetylmuraminidase. A scatter plot of E value vs. DALI Z score shows COMPADRE E values shifted lower to significant E values (E value = 0.005 line displayed in Fig. 5C), while keeping roughly the same ranking as PROCAIN. SCOP classifies all of these structures (dots in Fig. 5C) in the same superfamily: (trans)glycosidases, and they catalyze reactions with similar chemistry, suggesting they should be homologous. PROCAIN only detects two closest sequences at the SCOP family level (PDB ID code 2j8g, chain A, and PDB ID code 1sfs, chain A) with a significant E value, whereas COMPADRE detects all of the most distant structures with significant E values.

Fig. 5. — Detection of more homologs at the SCOP superfamily level. PROCAIN only detects the closest sequence at the SCOP family level (PDB ID code 1sfs, chain A) with a significant E value, whereas COMPADRE detects all but eight of the most distant structures with significant E values. (A) Structure of an atypical TIM-barrel fold (PDB ID code 1jfx, chain A), which has an antiparallel hairpin replacing the typical parallel α/β unit (magenta), with aligned portion colored. (B) Structure of a typical TIM-barrel fold (PDB ID code 1bqc, chain A), with aligned portion colored. (C) The scatter plot shows DALI Z score vs. E values of COMPADRE (red dots) and PROCAIN (blue dots) for 1jfx and all other same SCOP superfamily structures (E value = 0.005 line displayed).

Detecting Homology Relationship for a Recently Determined Structure.

For the inference of structure, function, and evolution of a given uncharacterized protein, COMPADRE improves the opportunity for detection of experimentally characterized homologs. As another example, the zinc-finger antiviral protein (ZAP) is a host factor that specifically inhibits the replication of certain viruses, such as HIV-1 (30). The N-terminal part of ZAP is the major functional region that binds target RNA and recruits the mRNA degradation machinery. The structure of the N-terminal region of ZAP was determined recently (31) (NZAP225, PDB ID code 3u9g, chain A), consisting of an uncharacterized top “cockpit” layer (1-65aa) and four connected zinc fingers (66-225aa). Fig. 6 shows how the confidence of an originally marginal sequence similarity can be verified for the uncharacterized top cockpit domain by using our COMPADRE method. For the query sequence of the first 65 residues from NZAP225, no confident homology to proteins with known 3D structure can be detected by direct query–template sequence comparisons. PSI-BLAST search in NCBI (National Center for Biotechnology Information) nr database converges after three iterations with no hits of known structures, whereas all HHpred and PROCAIN hits in structural databases are outside the significance threshold (best probability of 82.4% for HHpred and best E value of 76.1 for PROCAIN). COMPADRE, however, is able to assign significant E values to the similarities between the N-terminal domain of the query and multiple DNA-binding helix–turn–helix (HTH) domains, with the top E value as low as 2e-3 (Fig. 6). Detected similarity to HTH (Fig. 6B) suggests that the first 65-residue domain of ZAP may also play an important role in recognizing target viral RNA. Sequence alignment to the template (Fig. 6A) points to specific residues that may be involved in RNA recognition and binding, providing potential targets for mutagenesis studies.

Fig. 6. — Confident detection of homology relationship for a newly resolved structure. COMPADRE finds a significant similarity between the top cockpit layer of N-terminal ZAP protein and a DNA-binding HTH domain, suggesting structural fold and specific mode of RNA binding. (A) COMPADRE result, including the alignment of sequences and predicted secondary structures for the query (top) and the hit (bottom). The actual secondary structure of the template is shown as colored arrows below the alignment. (B) Hit structure (PDB ID code 1tc3, chain C) superimposed to the target structure (PDB ID code 3u9g, chain A, 1–65aa), with aligned portion colored.

Conclusions

Our findings show that defined homology relationships within the search database contain essential information for the improvement of sequence-based homology search. Although this concept is not new to computer science in general, to our knowledge, it has not been successfully applied to protein sequence searches before. Our method, COMPADRE, shows a dramatic increase in the performance of sequence search compared with current methods that are based on traditional query–template similarity measures. COMPADRE detects a large fraction of structural protein relationships and allows for predictions of structure and function in previously uncharacterized proteins. Furthermore, our results suggest that this approach may be applicable to a wide variety of methods for the detection of biological similarities.

Methods

Protein Database.

To perform homology searches, we use the set of 5,116 SCOP domain representatives (18, 21), with relationships defined by the combination of manual expert classification in SCOP with an automated SVM-based analysis of multiple scores for structure and sequence similarity (SI Appendix, Methods). PSI-BLAST alignments of homologs for each protein are used for PSI-BLAST, HHSearch, PROCAIN, and COMPADRE comparisons with default settings. PROCAIN is a sequence profile search method that, in addition to sequences, incorporates similarity in secondary structure, positional conservation, and sequence motifs into profile–profile scoring. When combined with an improved estimation of statistical significance of hits, this scoring results in better performance compared with other methods (18).

Scoring Function.

The original similarity scores ( $S_{0}$ ) are produced by logarithmic scaling of PROCAIN E values: S₀ = −log (E/E_c), where E is the E value of the PROCAIN hit and E_c is a constant offset [log (E_c) = 6]. In Eq. 1, the optimal weight is empirically determined as w_T = 0.8. In Eq. 2, optimal parameters are α = 0.3, β = 5.5 for the scores based on close template homologs (same SCOP superfamily) and α = 0.8, β = 8.0 for the scores based on all homologs. In Eq. 3, scores $S_{2}^{c}$ and $S_{2}^{r}$ are defined by Eqs. 1 and 2, with $S_{2}^{r}$ additionally rescaled to be comparable to $S_{2}^{c}$ in the area of mixing (SI Appendix, Methods). Weight w_c( $S_{2}^{c}$ ) is defined as follows (SI Appendix, Figs. S9 and S11): w_c = 0 if $S_{2}^{c}$ < $S_{2}^{c (1)}$ , w_c = 1 if $S_{2}^{c}$ > $S_{2}^{c (2)}$ , and $w_{c} = a + b e^{\gamma S_{2}^{c}}$ if $S_{2}^{c (1)}$ ≤ $S_{2}^{c}$ ≤ $S_{2}^{c (2)}$ , where $S_{2}^{c (1)}$ = 33.95, $S_{2}^{c (2)}$ = 111.5, and γ = 0.08; a, b are derived from boundary conditions at $S_{2}^{c (1)}$ and $S_{2}^{c (2)}$ . In Eq. 4, $S_{3}$ is determined by Eq. 3 and S_p is the equivalent score converted from the original PROCAIN E value based on the EVD estimation (25) of S₃. The weight of $S_{3}$ is defined as follows (SI Appendix, Figs. S10 and S12): w₃ = 0 if lnE_p < lnE_p⁽¹⁾, w₃ = 1 if lnE_p > lnE_p⁽²⁾, and $w_{3} = a + b e^{\gamma \ln E_{p}}$ if lnE_p⁽¹⁾ ≤ lnE_p ≤ lnE_p⁽²⁾, where lnE_p⁽¹⁾ = −50, lnE_p⁽²⁾ = −15, and γ = −0.5; a, b are derived from boundary conditions at lnE_p⁽¹⁾ and lnE_p⁽²⁾. Based on the final scores in Eq. 4, E values are calculated and similarities are ranked by statistical significance. Hierarchical clustering is performed by Cluster 3.0 (32).
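The coefficients a and b are not written out explicitly in the main text. One natural reading, stated here as an assumption, is that each weight is continuous at its two boundaries, taking the value 0 at the lower boundary $x_{1}$ and 1 at the upper boundary $x_{2}$, where x stands for $S_{2}^{c}$ in Eq. 3 or for $\ln E_{p}$ in Eq. 4. Under that reading, the two conditions

$$ a + b e^{\gamma x_{1}} = 0, \qquad a + b e^{\gamma x_{2}} = 1 $$

give

$$ b = \frac{1}{e^{\gamma x_{2}} - e^{\gamma x_{1}}}, \qquad a = -\frac{e^{\gamma x_{1}}}{e^{\gamma x_{2}} - e^{\gamma x_{1}}}, $$

so that, on the mixing interval,

$$ w(x) = \frac{e^{\gamma x} - e^{\gamma x_{1}}}{e^{\gamma x_{2}} - e^{\gamma x_{1}}}. $$

This form increases monotonically from 0 to 1 for either sign of γ, consistent with the behavior described for both w_c (γ = 0.08) and w₃ (γ = −0.5).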

Acknowledgments

The authors acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing grid resources that have contributed to the research results reported within this paper. This work was supported in part by the National Institutes of Health (GM094575 to N.V.G.) and the Welch Foundation (I-1505 to N.V.G.).

Footnotes

The authors declare no conflict of interest.

This article is a PNAS Direct Submission. A.S. is a guest editor invited by the Editorial Board.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1424324112/-/DCSupplemental.

References

1. Koboldt DC, Steinberg KM, Larson DE, Wilson RK, Mardis ER. The next-generation sequencing revolution and its impact on genomics. Cell. 2013;155(1):27–38. doi: 10.1016/j.cell.2013.09.006.
2. Mardis ER. Next-generation sequencing platforms. Annu Rev Anal Chem (Palo Alto Calif). 2013;6:287–303. doi: 10.1146/annurev-anchem-062012-092628.
3. Gribskov M, McLachlan AD, Eisenberg D. Profile analysis: Detection of distantly related proteins. Proc Natl Acad Sci USA. 1987;84(13):4355–4358. doi: 10.1073/pnas.84.13.4355.
4. Huang YJ, Mao B, Aramini JM, Montelione GT. Assessment of template-based protein structure predictions in CASP10. Proteins. 2014;82(Suppl 2):43–56. doi: 10.1002/prot.24488.
5. Zhang Y, Hubner IA, Arakaki AK, Shakhnovich E, Skolnick J. On the origin and highly likely completeness of single-domain protein structures. Proc Natl Acad Sci USA. 2006;103(8):2605–2610. doi: 10.1073/pnas.0509379103.
6. Kryshtafovych A, et al. Challenging the state of the art in protein structure prediction: Highlights of experimental target structures for the 10th Critical Assessment of Techniques for Protein Structure Prediction Experiment CASP10. Proteins. 2014;82(Suppl 2):26–42. doi: 10.1002/prot.24489.
7. Kryshtafovych A, Fidelis K, Moult J. CASP10 results compared to those of previous CASP experiments. Proteins. 2014;82(Suppl 2):164–174. doi: 10.1002/prot.24448.
8. Eddy SR. Profile hidden Markov models. Bioinformatics. 1998;14(9):755–763. doi: 10.1093/bioinformatics/14.9.755.
9. Altschul SF, et al. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–3402. doi: 10.1093/nar/25.17.3389.
10. Karplus K, et al. Predicting protein structure using only sequence information. Proteins. 1999;(Suppl 3):121–125. doi: 10.1002/(sici)1097-0134(1999)37:3+<121::aid-prot16>3.3.co;2-h.
11. Rychlewski L, Jaroszewski L, Li W, Godzik A. Comparison of sequence profiles. Strategies for structural predictions using sequence information. Protein Sci. 2000;9(2):232–241. doi: 10.1110/ps.9.2.232.
12. Sadreyev R, Grishin N. COMPASS: A tool for comparison of multiple protein alignments with assessment of statistical significance. J Mol Biol. 2003;326(1):317–336. doi: 10.1016/s0022-2836(02)01371-2.
13. Margelevicius M, Venclovas C. Detection of distant evolutionary relationships between protein families using theory of sequence profile-profile comparison. BMC Bioinformatics. 2010;11:89. doi: 10.1186/1471-2105-11-89.
14. Madera M. Profile Comparer: A program for scoring and aligning profile hidden Markov models. Bioinformatics. 2008;24(22):2630–2631. doi: 10.1093/bioinformatics/btn504.
15. Söding J. Protein homology detection by HMM-HMM comparison. Bioinformatics. 2005;21(7):951–960. doi: 10.1093/bioinformatics/bti125.
16. Remmert M, Biegert A, Hauser A, Söding J. HHblits: Lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat Methods. 2012;9(2):173–175. doi: 10.1038/nmeth.1818.
17. Ginalski K, et al. ORFeus: Detection of distant homology using sequence profiles and predicted secondary structure. Nucleic Acids Res. 2003;31(13):3804–3807. doi: 10.1093/nar/gkg504.
18. Wang Y, Sadreyev RI, Grishin NV. PROCAIN: Protein profile comparison with assisting information. Nucleic Acids Res. 2009;37(11):3522–3530. doi: 10.1093/nar/gkp212.
19. Brin S, Page L. The anatomy of a large-scale hypertextual web search engine. Comput Networks ISDN. 1998;30(1):107–117.
20. Andreeva A, et al. Data growth and its impact on the SCOP database: New developments. Nucleic Acids Res. 2008;36(Database issue):D419–D425. doi: 10.1093/nar/gkm993.
21. Qi Y, Sadreyev RI, Wang Y, Kim BH, Grishin NV. A comprehensive system for evaluation of remote sequence similarity detection. BMC Bioinformatics. 2007;8:314. doi: 10.1186/1471-2105-8-314.
22. Schäffer AA, et al. Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res. 2001;29(14):2994–3005. doi: 10.1093/nar/29.14.2994.
23. Gumbel EJ. Les valeurs extrêmes des distributions statistiques. Ann Inst Henri Poincaré. 1935;5(2):115–158.
24. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215(3):403–410. doi: 10.1016/S0022-2836(05)80360-2.
25. Karlin S, Altschul SF. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci USA. 1990;87(6):2264–2268. doi: 10.1073/pnas.87.6.2264.
26. Dembo A, Karlin S, Zeitouni O. Limit distribution of maximal non-aligned two-sequence segmental score. Ann Probab. 1994;22:18.
27. Sadreyev RI, Tang M, Kim BH, Grishin NV. COMPASS server for remote homology inference. Nucleic Acids Res. 2007;35(Web Server issue):W653–W658.
28. Pei J, Grishin NV. PROMALS: Towards accurate multiple sequence alignments of distantly related proteins. Bioinformatics. 2007;23(7):802–808. doi: 10.1093/bioinformatics/btm017.
29. Holm L, Sander C. Protein structure comparison by alignment of distance matrices. J Mol Biol. 1993;233(1):123–138. doi: 10.1006/jmbi.1993.1489.
30. Zhu Y, et al. Zinc-finger antiviral protein inhibits HIV-1 infection by selectively targeting multiply spliced viral mRNAs for degradation. Proc Natl Acad Sci USA. 2011;108(38):15834–15839. doi: 10.1073/pnas.1101676108.
31. Chen S, et al. Structure of N-terminal domain of ZAP indicates how a zinc-finger protein recognizes complex RNA. Nat Struct Mol Biol. 2012;19(4):430–435. doi: 10.1038/nsmb.2243.
32. de Hoon MJ, Imoto S, Nolan J, Miyano S. Open source clustering software. Bioinformatics. 2004;20(9):1453–1454. doi: 10.1093/bioinformatics/bth078.


