Author manuscript; available in PMC: 2017 Apr 1.
Published in final edited form as: Proteins. 2016 Feb 4;84(4):467–472. doi: 10.1002/prot.24993

Protein Fold Recognition Using an Improved Single Source K Diverse Shortest Paths Algorithm

John Lhota 1, Lei Xie 2
PMCID: PMC4934902  NIHMSID: NIHMS754308  PMID: 26800480

Abstract

Protein structure prediction, when construed as a fold recognition problem, is one of the most important applications of similarity search in bioinformatics. A new protein fold recognition method is reported which combines a single-source k diverse shortest path (SSKDSP) algorithm with the Enrichment of Network Topological Similarity (ENTS) algorithm to search a graph-based feature space generated using sequence similarity and structural similarity metrics. A modified, more efficient SSKDSP algorithm is developed to improve the performance of graph searching. The new implementation of the SSKDSP algorithm empirically requires 82% less memory and 61% less time than the current implementation, allowing for the analysis of larger, denser graphs. Furthermore, the statistical significance of the fold ranking generated by SSKDSP is assessed using ENTS. The reported ENTS-SSKDSP algorithm outperforms the original ENTS, which uses random walk with restart (RWR) for the graph search, as well as the state-of-the-art protein structure prediction algorithms HHsearch and Sparks-X, as evaluated on a benchmark of 600 query proteins. The reported methods may easily be extended to other similarity search problems in bioinformatics and chemoinformatics. The SSKDSP software is available at http://compsci.hunter.cuny.edu/~leixie/sskdsp.html.

Keywords: Similarity search, structure prediction, graph algorithm, ENTS, structural bioinformatics

Introduction

Protein Structure Prediction and Similarity Search

Protein fold recognition, the goal of which is to predict a protein’s three-dimensional structure given only its amino acid sequence, is one of the largest open problems in bioinformatics. Due to substantial advances in genome sequencing technology, biologists now have access to an extremely large volume of protein sequences; however, due to the costly and time-intensive nature of x-ray crystallography and other structure determination techniques, it is much more difficult to empirically find the structures of these proteins (1). This bottleneck is especially unfortunate as structural information about a protein can provide significant insights into its functional and evolutionary attributes (2) and can facilitate the discovery of ligands with high affinity and selectivity that may function well as drugs (3).

Since the physics of protein folding is not well understood, all successful protein structure prediction algorithms created to date rely heavily on comparing the sequence of the query protein to the sequences of proteins of known structure, under the assumption that sequence similarity implies structural similarity. To further simplify the problem, the Structural Classification of Proteins (SCOP) has grouped proteins with very similar structures together into “folds,” which reduces protein structure prediction to a classification problem where similarity search may be applied (4, 5). For example, the state-of-the-art algorithm HHsearch computes the probability that a given protein belongs to a fold by constructing a hidden Markov model (HMM) that approximates the probability of a given amino acid appearing at a given location in a protein that belongs to that fold (6, 7).

Despite recent advances in computational protein structure prediction, only about 10% of structures predicted by current state-of-the-art algorithms are accurate enough to elucidate their protein’s biological role or to use as a model for drug discovery (4). It is believed that algorithms using similarity networks can identify similarity relationships that algorithms using only simple pairwise comparisons of the query protein to known proteins will miss (9).

ENTS Search Paradigm

Enrichment of Network Topological Similarity (ENTS) is a computational method for similarity search introduced by Lhota et al. (10) and previously applied to the protein fold recognition problem. In this setting, the algorithm takes a single query sequence whose structure is to be predicted, together with a large data set of known sequence-structure pairs covering a wide range of folds. The data are assembled into a graph by connecting the query to the known structures by sequence similarity and by connecting the known structures all-against-all by structural similarity. Then an interchangeable graph search algorithm is run to quantify the similarity of each known structure, which is represented as a node in the similarity graph, to the query. Each fold’s raw score is calculated as the mean of the similarity scores of all proteins of that fold in the graph; this mean score is then compared to the mean scores of randomly formed sets of nodes of the same size. Assuming a normal distribution, the number of standard deviations by which the fold score differs from the mean of these random-set scores becomes the fold’s new, normalized similarity score. This process of assigning normalized scores to folds is called ENTS. Finally, the query protein is classified into the fold with the highest normalized similarity score. Lhota et al. (10) used Random Walk with Restart (RWR) for the graph search stage of this algorithm, but it remains to be seen whether a superior graph search algorithm may improve ENTS’s performance.
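The normalization step described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function name `ents_fold_scores`, the number of random sets (`n_random`), and the use of the population standard deviation are all hypothetical choices made for the example.

```python
import random
import statistics

def ents_fold_scores(node_scores, fold_of, n_random=1000, seed=0):
    """Return {fold: z-score} by comparing each fold's mean node score
    with the means of random node sets of the same size.

    node_scores: {node: similarity score from any graph search}
    fold_of:     {node: fold label}
    n_random and seed are hypothetical parameters for this sketch.
    """
    rng = random.Random(seed)
    all_scores = list(node_scores.values())

    # Group the node scores by fold.
    members = {}
    for node, score in node_scores.items():
        members.setdefault(fold_of[node], []).append(score)

    z = {}
    for fold, scores in members.items():
        raw = statistics.fmean(scores)
        # Null distribution: mean score of random sets of equal size.
        null = [statistics.fmean(rng.sample(all_scores, len(scores)))
                for _ in range(n_random)]
        mu = statistics.fmean(null)
        sigma = statistics.pstdev(null)
        z[fold] = (raw - mu) / sigma if sigma > 0 else 0.0
    return z

# Toy data: fold "A" holds the two highest-scoring nodes.
toy_scores = {"a1": 0.9, "a2": 0.8, "b1": 0.1, "b2": 0.2, "b3": 0.15, "c1": 0.3}
toy_folds = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "b3": "B", "c1": "C"}
toy_z = ents_fold_scores(toy_scores, toy_folds)
best_fold = max(toy_z, key=toy_z.get)
```

In this toy example the query would be assigned to fold "A", whose mean score lies furthest above the random-set means.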

The Single-Source K Diverse Shortest Paths Algorithm

Shih et al. (11) have proposed a single-source k shortest paths algorithm that heuristically approximates the top k shortest paths from the source to all other nodes in the graph. The importance value or similarity score of each node relative to the source is then defined as the sum of the reciprocals of the k path lengths to that node. The algorithm takes an argument h, the maximum possible number of edges in a path, since finding the k shortest paths of unbounded length is intractable. The algorithm’s time complexity on a graph with n nodes and m edges is O(nk log(nk) + mk(h + k)).
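To make the idea concrete, the following Python sketch finds up to k short paths with at most h edges from one source to every node and then scores a node by the sum of reciprocal path lengths. It is a generic label-setting heuristic written for illustration, not the algorithm of Shih et al. (11); the function names and the toy graph are assumptions, and edge lengths are assumed to be strictly positive.

```python
import heapq

def k_shortest_paths(graph, source, k, h):
    """Heuristically collect up to k short simple paths (at most h edges)
    from source to every other node.

    graph: {u: [(v, length), ...]} with positive edge lengths.
    Returns {node: [path lengths in increasing order]}.
    """
    found = {}
    counter = 0                      # tie-breaker so paths are never compared
    heap = [(0.0, counter, (source,))]
    while heap:
        dist, _, path = heapq.heappop(heap)
        node = path[-1]
        if node != source:
            lengths = found.setdefault(node, [])
            if len(lengths) >= k:    # node already has k paths: prune
                continue
            lengths.append(dist)
        if len(path) - 1 >= h:       # hop limit reached: do not extend
            continue
        for nxt, w in graph.get(node, ()):
            if nxt in path:          # keep paths simple (no repeated nodes)
                continue
            counter += 1
            heapq.heappush(heap, (dist + w, counter, path + (nxt,)))
    return found

def importance(lengths):
    """Sum of reciprocals of the path lengths found for one node."""
    return sum(1.0 / p for p in lengths)

toy_graph = {"q": [("a", 1.0), ("b", 2.0)], "a": [("b", 0.5)], "b": [("c", 1.0)]}
toy_paths = k_shortest_paths(toy_graph, "q", k=2, h=3)
```

Here node "b" is reached by the paths q-a-b (length 1.5) and q-b (length 2.0), so its importance is 1/1.5 + 1/2.0.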

Since the top k shortest paths to a node are likely to be very similar, Shih et al. (11) further propose a single-source diverse k shortest paths algorithm, in which the algorithm also takes a diversity threshold λ (0≤λ≤1). A constraint is imposed stating that for each node, if the nth shortest path has j edges, then that path must have at least λj edges not used in any of the n-1 shorter paths. The diverse algorithm runs the original k shortest paths algorithm, stores any diverse paths that happen to have been generated, deletes the most frequently used edges from the graph, and repeats this process until k diverse short paths to all nodes have been found or the results approximately converge. Shih et al. (11) suggested that the k diverse shortest paths algorithm outperformed RWR in mining protein-protein interaction networks.
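The diversity constraint can be illustrated with a short Python sketch. It only shows the filtering of candidate paths for a single node, not the full iterative procedure with edge deletion; the function name `select_diverse` and the representation of edges as ordered node pairs are assumptions made for the example.

```python
def select_diverse(paths, lam, k):
    """Keep at most k paths, each with at least lam * j edges that no
    previously kept (shorter) path uses, where j is its edge count.

    paths: candidate paths to one node, sorted by increasing length;
           each path is a list of edges (u, v).
    """
    used = set()     # edges of all paths kept so far
    kept = []
    for path in paths:
        novel = sum(1 for edge in path if edge not in used)
        if novel >= lam * len(path):
            kept.append(path)
            used.update(path)
            if len(kept) == k:
                break
    return kept

p1 = [("q", "a"), ("a", "t")]
p2 = [("q", "a"), ("a", "b"), ("b", "t")]   # shares edge (q, a) with p1
p3 = [("q", "c"), ("c", "t")]
diverse = select_diverse([p1, p2, p3], lam=0.75, k=2)
```

With λ = 0.75, the second candidate has only 2 new edges out of 3 (fewer than the required 2.25) and is rejected, so the first and third paths are kept.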

Contributions of this work

This study makes two contributions. First, we improve the implementation of the single-source k diverse shortest path (SSKDSP) algorithm, making it applicable to larger graphs. Secondly, we investigate the possibility of using the SSKDSP algorithm in conjunction with ENTS (ENTS-SSKDSP), or with other fold-ranking protocols, in network-based similarity search. In the ENTS framework, a variable graph mining algorithm is used to measure the distance from a source vertex to the other nodes in a similarity network; then random set theory (or some other form of set enrichment analysis) is used to normalize these distance metrics, assigning each vertex a statistically significant measurement which is used to generate a similarity ranking. Our benchmark studies demonstrate that ENTS-SSKDSP outperforms the original ENTS based on RWR as well as the established algorithms HHsearch (7) and Sparks-X (8). Thus, ENTS-SSKDSP provides another powerful tool in similarity search, and may have broad applications in many areas of bioinformatics and chemoinformatics.

Materials and Methods

Modification of the Single-Source K Diverse Shortest Paths Algorithm

We have made several significant modifications to the original algorithm to reduce its time complexity and memory usage. Between iterations of the non-diverse algorithm, each node’s k paths must be checked to determine which are diverse. In the original implementation, each of the O(k) paths must check O(k) shorter paths, which each have O(h) edges, against its own O(h) edges. Therefore, the overall update procedure’s time complexity is O(nh^2k^2). A new system was adopted where each node creates a binary search tree and adds all of the previously stored diverse paths’ edges to it. As soon as a path is found to be sufficiently diverse, all of its edges are added to the tree. The size of each search tree is O(hk), so all lookup and insertion operations are O(log(hk)). O(hk) insertions are required to build up the tree and O(hk) lookup operations are required for diversity testing at each node. The overall time complexity of the improved diverse path update is therefore O(nhk log(hk)).
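As a rough worked comparison of the two bounds, consider the per-node operation counts for k = 4 and h = 50 (the values used for the final benchmark in this paper). The base-2 logarithm and the omission of constant factors are assumptions of this estimate, so the ratio indicates only the order of the expected saving, not a measured speed-up.

```latex
\[
\underbrace{h^2 k^2}_{\text{pairwise check}} = 50^2 \cdot 4^2 = 40{,}000,
\qquad
\underbrace{hk \log_2(hk)}_{\text{search tree}} = 200 \cdot \log_2 200 \approx 200 \cdot 7.64 \approx 1{,}529
\]
\[
\frac{h^2 k^2}{hk \log_2(hk)} = \frac{hk}{\log_2(hk)} \approx \frac{200}{7.64} \approx 26
\]
```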

The original algorithm also used fixed-length k-arrays to store paths; these were initially filled with placeholder paths of length infinity and replaced with real, shorter paths as the program progressed. These arrays were replaced with variable-length linked lists, so the algorithm would not have to waste time searching for the actual end of the paths array each time it added a new path.

In the original algorithm, paths were also stored as lists of edges; here, paths were stored recursively as a single edge and a reference to a “prefix” path. Since many paths have the same prefix, this formulation is much more memory-efficient.
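The prefix-sharing representation can be sketched as a small linked structure in Python. The class and function names below are hypothetical and chosen only for illustration; the actual SSKDSP software may organize its data differently.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Edge = Tuple[str, str]

@dataclass(frozen=True)
class PathNode:
    """A path stored as its last edge plus a reference to its prefix."""
    edge: Edge
    length: float
    prefix: Optional["PathNode"] = None

def extend(prefix: Optional[PathNode], edge: Edge, weight: float) -> PathNode:
    """Create a new path by appending one edge; the prefix is shared, not copied."""
    base = prefix.length if prefix is not None else 0.0
    return PathNode(edge, base + weight, prefix)

def edges(path: Optional[PathNode]) -> List[Edge]:
    """Recover the full edge list by walking the prefix references."""
    out = []
    while path is not None:
        out.append(path.edge)
        path = path.prefix
    out.reverse()
    return out

qa = extend(None, ("q", "a"), 1.0)
qab = extend(qa, ("a", "b"), 0.5)
qac = extend(qa, ("a", "c"), 2.0)   # reuses the same prefix object as qab
```

Both `qab` and `qac` point to the same `qa` object, so the shared prefix is stored once regardless of how many paths extend it.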

To test the non-asymptotic change in speed and memory usage, the algorithm was run on 12 randomly generated graphs with 10,000, 50,000, 250,000, and 1,000,000 nodes and node degrees of 5, 10, and 20, and the memory and time used were recorded. The system used had 8.00 GB of RAM and a 2.40 GHz Intel® Core™ i7-3630QM CPU.

Benchmark and Protein Similarity Graphs

The protein similarity graphs and benchmarks were constructed similarly to those in the ENTS paper (10). Briefly, the sequences and structures of 36,003 proteins were downloaded from the RCSB PDB. These proteins had less than 40% sequence identity to each other. A total of 885 query proteins were selected such that each SCOP superfamily included only one query protein, and each query was connected to a graph with nodes corresponding to the 36,003 downloaded proteins. However, proteins that shared the same SCOP family or superfamily as the query protein were excluded to make the benchmark more rigorous. The HMM-HMM similarity between the query sequence and the sequences of all proteins in the graph was calculated using HHblits, a faster version of HHsearch (13). Structural similarity between proteins of known structure was computed using TM-align (14), and two proteins were connected in the graph if their structural similarity was greater than 0.4.

For each of the 885 queries, two graphs were generated. In each, pairwise similarities S (0.4≤S≤1) were converted into edge lengths p using equations (1) and (2):

$p = -\log(S)$ (1)
$p = 1 - \log(S)$ (2)

where “log” denotes the base-ten logarithm. They are termed (1) the negative log score (NLS) and (2) the reversed log score (RLS), respectively.
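A minimal Python sketch of the two conversions follows. It assumes that equations (1) and (2) read p = -log10(S) for the negative log score and p = 1 - log10(S) for the reversed log score; the function names `nls` and `rls` are hypothetical.

```python
import math

def nls(s: float) -> float:
    """Negative log score: edge length for a similarity s in [0.4, 1]."""
    return -math.log10(s)

def rls(s: float) -> float:
    """Reversed log score: adds a constant 1 to every edge length."""
    return 1.0 - math.log10(s)

# Path of two edges with similarities 0.8 and 0.5.
sims = [0.8, 0.5]
nls_length = sum(nls(s) for s in sims)   # equals -log10(0.8 * 0.5)
rls_length = sum(rls(s) for s in sims)   # equals 2 - log10(0.8 * 0.5)
```

Under this reading, an RLS path length exceeds the corresponding NLS length by exactly the number of edges in the path, so paths with more hops are penalized even when every edge has similarity 1.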

Of the 885 queries in the benchmark, a randomly selected subsample of 285 queries was used to optimize the performance of ENTS-SSKDSP. The remaining 600-query subsample was used as the benchmark to compare ENTS-SSKDSP to other algorithms.

Fold recognition

As in our previous study (10), fold recognition was performed in two steps. First, the similarity between the query and each protein in the similarity graph was determined by the SSKDSP algorithm. An importance function was used to calculate a similarity score between the query and each known structure, based on the k path lengths found. Then folds were ranked by their fold score, which was calculated using ENTS (or one of several other fold scoring functions) to combine the individual similarity scores of structures in the fold into one aggregate score.

Given a node N, the similarity score of that node was calculated using one of the three equations below:

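$\mathrm{Imp}_1(N) = \sum_{i=1}^{k} \frac{1}{p(N,i)}$ (3)
$\mathrm{Imp}_2(N) = 1 - \prod_{i=1}^{k} \left(1 - 10^{-p(N,i)}\right)$ (4)
$\mathrm{Imp}_3(N) = 1 - \prod_{i=1}^{k} \left(1 - 10^{\,n(N,i) - p(N,i)}\right)$ (5)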

where p(N, i) denotes the length of the ith shortest diverse path to the node N, and n(N, i) denotes the number of edges in that path. Imp2 and Imp3 were used with the edge lengths generated by equations (1) and (2), respectively; similarities bounded between 0.4 and 1 were treated as probabilities of “true” relatedness, so that Imp2 and Imp3 represent the probability that at least one of the paths from the source to N is connected through a true relationship. Imp1 was used with both edge length equations. It is the same as the importance function used by Shih et al. (11).
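The three importance functions can be sketched in Python as follows. The sketch assumes the reading Imp1 = Σ 1/p, Imp2 = 1 - Π(1 - 10^(-p)) and Imp3 = 1 - Π(1 - 10^(n - p)), which makes 10^(-p) under equation (1) and 10^(n - p) under equation (2) both equal to the product of the edge similarities along a path; the function names are hypothetical.

```python
import math

def imp1(lengths):
    """Sum of reciprocal path lengths (the importance of Shih et al.)."""
    return sum(1.0 / p for p in lengths)

def imp2(lengths):
    """Probability that at least one path is 'true', for NLS edge lengths."""
    return 1.0 - math.prod(1.0 - 10.0 ** (-p) for p in lengths)

def imp3(lengths, n_edges):
    """Same probability for RLS edge lengths; n_edges[i] is the hop count
    of path i and cancels the constant 1 added to every edge."""
    return 1.0 - math.prod(1.0 - 10.0 ** (n - p)
                           for p, n in zip(lengths, n_edges))

# Two paths to the same node: one edge with S=0.5, and two edges with
# S=0.8 and S=0.5 (product 0.4).
nls_lengths = [-math.log10(0.5), -math.log10(0.8) - math.log10(0.5)]
rls_lengths = [1 - math.log10(0.5), 2 - math.log10(0.8) - math.log10(0.5)]
hops = [1, 2]
```

For this toy node both probabilistic scores evaluate to 1 - (1 - 0.5)(1 - 0.4) = 0.7, whichever edge length equation is used.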

Three different fold scoring functions were used. In the first method, each fold was assigned the similarity score of the highest-scored protein it contained. It is denoted the Max score. In the second, ENTS as described by Lhota et al. (10) was applied. This is called the ENTS score. In the third – used only for importance functions 2 and 3 – the fold’s score was the probability that it contained at least one protein that was “truly” connected to the query. It is referred to as the Prob score.
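The Max and Prob fold scores can be sketched in Python as below. The exact formula for the Prob score is not given in the text; the sketch assumes that protein scores are independent probabilities, so that the fold score is 1 minus the product of (1 - score). The function names are hypothetical.

```python
import math

def max_score(scores):
    """Max score: the similarity score of the best protein in the fold."""
    return max(scores)

def prob_score(scores):
    """Prob score (assumed form): probability that at least one protein
    in the fold is truly connected, treating scores as independent."""
    return 1.0 - math.prod(1.0 - s for s in scores)

def score_folds(node_scores, fold_of, scorer):
    """Group protein scores by fold and apply a fold scoring function."""
    members = {}
    for node, score in node_scores.items():
        members.setdefault(fold_of[node], []).append(score)
    return {fold: scorer(scores) for fold, scores in members.items()}

toy_scores = {"a1": 0.5, "a2": 0.5, "b1": 0.7}
toy_folds = {"a1": "A", "a2": "A", "b1": "B"}
by_max = score_folds(toy_scores, toy_folds, max_score)
by_prob = score_folds(toy_scores, toy_folds, prob_score)
```

The toy example shows how the two schemes can disagree: fold "B" wins under the Max score (0.7 versus 0.5), whereas fold "A" wins under the assumed Prob score (0.75 versus 0.7).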

Results and Discussion

Efficiency of modified SSKDSP algorithm

Each individual modification to the algorithm was tested to verify that it did increase the algorithm’s time- and memory-efficiency. Overall, the modified algorithm used 81.61% less memory (with a standard deviation of 3.22%) and 60.96% less time (with a standard deviation of 14.04%) than the original, as shown in Figure 1. Data was not collected for the graphs with 1,000,000 nodes and average degrees of 10 and 20, as the unmodified version of the algorithm crashed while trying to process them.

Fig. 1.

Comparison of time and memory usage of Shih et al.’s single source k diverse shortest paths algorithm and the one presented here. Our implementation of the algorithm outperforms the original algorithm in both speed and memory use.

The influence of parameter and scoring scheme on the accuracy of SSKDSP algorithm

To determine the effectiveness of the similarity search methods independent of fold scoring systems (ENTS, Max, and Prob), the ranks for each algorithm’s protein similarity scores were evaluated for the top 1, 3, 10, and 20 proteins as shown in Table 1. (The rank for the top k proteins is the percentage of the top k scored proteins that are in the same fold as their query. There were 285 queries used for optimization.) The results generated using edge length equation (1) were generally inferior (data not shown). Shih et al. (11) show that using k<3 paths produces noisy data and using k>5 generates largely redundant information. Varying k between 3, 4, and 5 produced no discernible difference in the algorithm’s results, so k=4 was chosen arbitrarily for all later tests. Changing the values of h and the diversity threshold did not have any substantial effect on the algorithm’s performance. The importance function Imp1 demonstrated superior rankings for some parameter values, as shown in Table 1. However, it was not significantly better, so all protein importance functions were also tested in conjunction with the fold-scoring functions later on. These rankings suggest that varying the k value of the search algorithm is not very consequential and that Imp1 may be slightly better than Imp3 at ranking individual proteins’ similarity scores.

Table 1. Percentage of correctly identified folds with different SSKDSP scoring schemes.

All scores used parameters k=4 and h=23, except those with a diversity threshold of 0.5, which used h=50. The best-performing values are highlighted in bold. These rank values were generated using the 285 proteins not in the random 600-query subsample.

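| Rank | Imp1, λ=0.25 | Imp1, λ=0.5 | Imp1, λ=1.0 | Imp3, λ=0.25 | Imp3, λ=0.5 | Imp3, λ=1.0 |
| --- | --- | --- | --- | --- | --- | --- |
| Top 1 | **28.1%** | 27.7% | 27.4% | 27.4% | 27.0% | 27.4% |
| Top 3 | **34.4%** | **34.4%** | 33.7% | 33.3% | 32.3% | 33.7% |
| Top 10 | **45.6%** | **45.6%** | 42.8% | 38.9% | 38.6% | 43.2% |
| Top 20 | **56.1%** | 55.8% | 51.9% | 44.6% | 44.2% | 51.6% |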

Although the probabilistic importance function Imp3 was only slightly worse than Imp1 at ranking individual proteins (as shown in Table 1), this performance did not translate into good fold prediction. The Imp3 similarity scores of each fold’s constituent proteins were aggregated using two fold-scoring techniques: the probabilistic method described in the Methods (the “Prob” score) and the score of the highest-scored protein in the fold (the “Max” score). In Figure 2, all of the query-fold hits for the benchmark query proteins were compiled, ranked by fold score, and used to generate true positive ratios for the top n hits (0≤n≤2000). Imp3 was not originally paired with the ENTS fold scoring system, because probabilities are bounded between 0 and 1 and applying ENTS to values on a finite interval somewhat defeats its purpose: ENTS is designed to normalize unbounded similarity scores so that they are more statistically meaningful. Nevertheless, ENTS was tested on the Imp3 scores to determine whether it would make a difference; the results were greatly inferior and are not shown. Instead, the results of scoring based on Imp3 were compared to the ENTS analysis of the scores produced by Imp1.

Fig. 2.

The legend here lists different runs of the algorithm first with the fold-scoring system used, then, in parentheses, with the protein importance function used. “Max (Imp3)” refers to a fold scoring system where each fold was assigned the maximum probability of any protein it contains, and “Prob (Imp3)” refers to the probabilistic fold score method.

The scoring functions based on Imp3 found significantly fewer true positives in the top 2000 query-fold pairs, as shown in Figure 2. Based on these data, the protein importance function Imp1 in conjunction with ENTS is the best of the tested configurations for the SSKDSP similarity search.

As demonstrated in Figure 3, edge weights generated using RLS were superior to those generated with NLS – most likely because the former allows nodes with increasing degrees of separation from the source to be viewed as inherently less reliable – and ENTS was a better fold scoring system than the Max fold score. It was determined that ENTS performs optimally when proteins with scores lower than one standard deviation above the mean of all scores in the sample are neglected, so those nodes were ignored every time ENTS was performed.
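The score cutoff applied before ENTS can be sketched in Python as follows. The function name `filter_for_ents` and the use of the population standard deviation are assumptions of this sketch; the text does not specify which estimator was used.

```python
import statistics

def filter_for_ents(node_scores, n_sd=1.0):
    """Keep only nodes scoring at least n_sd standard deviations above
    the mean of all scores; the rest are ignored by the fold scoring."""
    values = list(node_scores.values())
    cutoff = statistics.fmean(values) + n_sd * statistics.pstdev(values)
    return {node: s for node, s in node_scores.items() if s >= cutoff}

# Mean 2.8 and standard deviation 3.6 give a cutoff of 6.4.
toy_scores = {"a": 10.0, "b": 1.0, "c": 1.0, "d": 1.0, "e": 1.0}
kept = filter_for_ents(toy_scores)
```

In the toy example only node "a" survives the cutoff, so only its fold would contribute a non-trivial score.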

Fig. 3.

Comparison of ENTS and Max fold scoring systems.

ENTS-SSKDSP outperforms the state-of-the-art algorithms in fold recognition

To compare the ENTS-SSKDSP algorithm with HHsearch, Sparks-X, and ENTS-RWR, the algorithms were tested on a random 600-protein subsample of the 885 queries. (The other 285 queries were used exclusively for optimization.) As shown in Table 2 and Figure 4, the optimized algorithm (Imp1 and ENTS with a k value of 4, a diversity threshold of 0.5, and an h value of 50) can outperform the state-of-the-art algorithm HHsearch and the more recently developed Sparks-X, which uses 3D structural information in fold recognition (8), especially in the top-ranked region. ENTS-SSKDSP places a correct hit at rank 1 for the largest share of queries. ENTS-RWR performs best for the top 3 and top 10 ranks, and Sparks-X performs best for the top 20 ranks, as shown in Table 2. When query-fold hits are ranked by similarity score, as shown in Figure 4, ENTS-SSKDSP detected 30% more true positives in the top 100 hits than HHsearch, 190% more than Sparks-X, and 22% more than the ENTS-RWR presented by Lhota et al. (10), in which the protein sequence-structure graph is searched using the RankProp implementation of random walk with restart (RWR). The performance of ENTS-SSKDSP flattens in the low-rank region, and ENTS-RWR outperforms ENTS-SSKDSP beyond rank 1000. In real applications, however, highly ranked hits are more important than lower-ranked ones. Thus, in general, ENTS-SSKDSP may deliver better performance than ENTS-RWR.

Table 2.

Fold rankings for ENTS-SSKDSP compared with two state-of-the-art algorithms (HHsearch and Sparks-X) and with ENTS-RWR, the search algorithm used in the original version of ENTS. The data for the column “ENTS-SSKDSP” were generated at k=4, λ=0.5, and h=50 using Imp1. The best-performing values are highlighted in bold.

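| Rank | HHsearch | Sparks-X | ENTS-RWR | ENTS-SSKDSP |
| --- | --- | --- | --- | --- |
| Top 1 | 19.2% | 25.8% | 26.8% | **29.4%** |
| Top 3 | 24.2% | 42.4% | **44.1%** | 36.8% |
| Top 10 | 26.2% | 61.4% | **63.5%** | 47.3% |
| Top 20 | 26.4% | **71.8%** | 70.4% | 55.4% |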

Fig. 4.

ENTS in conjunction with SSKDSP (ENTS-SSKDSP) vs. HHsearch, Sparks-X, and ENTS using RankProp (ENTS-RWR).

Conclusion

The single-source k diverse shortest paths algorithm, combined with ENTS, was able to outperform state-of-the-art fold prediction software among the top-ranked hits when searching a protein set without any sequence homologies detectable using conventional techniques, suggesting that it may be useful in applications where the structure of a protein without any known homologous sequence must be predicted. The algorithm’s advantages are believed to come from its incorporation of similarity networks generated using both sequence and structural similarity to find distant similarity relationships. For the top-ranked hits, the k shortest paths algorithm outperformed the RankProp algorithm in searching these similarity graphs.

The similarity networks were generated using the sequence similarities found by HHsearch and structural similarities found by TM-align; however, these could easily be replaced with other similarity metrics as new ones are developed. The probabilistic protein importance functions Imp2 and Imp3 may have some promise due to their higher fold rankings; in the future it may be possible to find a better fold scoring method using the probabilistic importance.

Although the k shortest paths algorithm was applied here to the problem of protein fold recognition, the general methodology could be used in other similarity search problems, and may have the potential to elucidate evolutionary and functional sequence relationships that other algorithms are currently unable to detect.

Acknowledgments

We appreciate the constructive comments of the reviewers and the editor. This research was supported by the National Library of Medicine of the National Institutes of Health under grant R01LM011986, the National Science Foundation under grants CNS-0958379 and CNS-0855217, and the City University of New York High Performance Computing Center at the College of Staten Island.

Contributor Information

John Lhota, Hunter College High School.

Lei Xie, Department of Computer Science, Hunter College; The Graduate Center, The City University of New York.

References

1. Lee J, Wu S, Zhang Y. Ab initio protein structure prediction. In: Rigden DJ, editor. From Protein Structure to Function with Bioinformatics [Internet] London: Springer; 2009. [cited 2015 May 10]. pp. 3–4. Available from: http://zhanglab.ccmb.med.umich.edu/papers/2009_4.pdf.
2. Kolodny R, Petrey D, Honig B. Protein structure comparison: implications for the nature of 'fold space', and structure and function prediction. Curr Opin Struct Biol. 2006;16:393–398. doi: 10.1016/j.sbi.2006.04.007.
3. Breda A, Valadares NF, de Souza ON, Garratt RC. Protein Structure, Modelling and Applications. In: Gruber A, Durham AM, Huynh C, del Portillo HA, editors. Bioinformatics in Tropical Disease Research [Internet] Bethesda: National Center for Biotechnology Information (US); 2008. [cited 2015 May 3]. Available from: http://www.ncbi.nlm.nih.gov/books/NBK6818/
4. Dill KA, MacCallum JL. The protein-folding problem, 50 years on. Science. 2012;338:1042–1046. doi: 10.1126/science.1219021.
5. Andreeva A, Howorth D, Chothia C, Kulesha E, Murzin AG. SCOP2 prototype: a new approach to protein structure mining. Nucleic Acids Research. 2012;42:D310–D314. doi: 10.1093/nar/gkt1242.
6. Soding J, Biegert A, Lupas AN. The HHpred interactive server for protein homology detection and structure prediction. Nucleic Acids Research. 2005;33:W244–W248. doi: 10.1093/nar/gki408.
7. Soding J. Protein homology detection by HMM-HMM comparison. Bioinformatics. 2005;21:951–960. doi: 10.1093/bioinformatics/bti125.
8. Yang Y, Faraggi E, Zhao H, Zhou Y. Improving protein fold recognition and template-based modeling by employing probabilistic-based matching between predicted one-dimensional structural properties of the query and corresponding native properties of templates. Bioinformatics. 2011;27:2076–2082. doi: 10.1093/bioinformatics/btr350.
9. Atkinson HJ, Morris JH, Ferrin TE, Babbitt PC. Using sequence similarity networks for visualization of relationships across diverse protein superfamilies. PLoS One. 2009;4:e4345. doi: 10.1371/journal.pone.0004345.
10. Lhota J, Hauptman R, Hart T, Ng C, Xie L. A new method to improve network topological similarity search: applied to fold recognition. Bioinformatics. 2015;31:2106–2114. doi: 10.1093/bioinformatics/btv125.
11. Shih YK, Parthasarathy S. A single source k-shortest paths algorithm to infer regulatory pathways in a gene network. Bioinformatics. 2012;28:i49–i58. doi: 10.1093/bioinformatics/bts212.
12. Ma J, Peng J, Wang S, Xu J. A conditional neural fields model for protein threading. Bioinformatics. 2012;28:i59–i66. doi: 10.1093/bioinformatics/bts213.
13. Remmert M, Biegert A, Hauser A, Soding J. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat Methods. 2012;9:173–175. doi: 10.1038/nmeth.1818.
14. Zhang Y, Skolnick J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Research. 2005;33:2302–2309. doi: 10.1093/nar/gki524.
