Skip to main content
Bioinformatics logoLink to Bioinformatics
. 2008 Nov 6;25(1):121–122. doi: 10.1093/bioinformatics/btn567

Rankprop: a web server for protein remote homology detection

Iain Melvin 1,*, Jason Weston 1, Christina Leslie 2, William Stafford Noble 3,4,*
PMCID: PMC2638939  PMID: 18990723

Abstract

Summary: We present a large-scale implementation of the Rankprop protein homology ranking algorithm in the form of an openly accessible web server. We use the NRDB40 PSI-BLAST all-versus-all protein similarity network of 1.1 million proteins to construct the graph for the Rankprop algorithm, whereas previously, results were only reported for a database of 108 000 proteins. We also describe two algorithmic improvements to the original algorithm, including propagation from multiple homologs of the query and better normalization of ranking scores, that lead to higher accuracy and to scores with a probabilistic interpretation.

Availability: The Rankprop web server and source code are available at http://rankprop.gs.washington.edu

Contact: iain@nec-labs.com; noble@gs.washington.edu

1 INTRODUCTION

Rankprop (Weston et al., 2004) is a network-based inference algorithm for identifying subtle protein sequence similarities, corresponding to remote homology relationships or to structural similarities. The algorithm operates on a protein similarity network, a graph in which each node is a protein and each weighted edge connecting two proteins indicates their similarity. Such a network can be built using existing tools, such as PSI-BLAST (Altschul et al., 1997).

The key idea of the Rankprop algorithm is to extract global information from a protein similarity network by propagating outward from a user-specified query protein. Effectively, the algorithm sums over all possible paths from the query to each target protein. Thus, after propagation, the resulting activation score for each node includes global information about that protein's relationship to the query. Ranking proteins by these scores is analogous to performing a database search using a tool such as PSI-BLAST, except that the ranking induced by Rankprop reflects the global topology of the protein similarity network.

In (Weston et al., 2004), PSI-BLAST is used to measure sequence similarity, and the unnormalized weight for the edge from node i to node j is Wij=exp(−Sj(i)/σ), where Sj(i) is the PSI-BLAST E-value assigned to protein i given query j, and the parameter σ is a positive constant. Edges are only included in the network for E-values smaller than a fixed threshold. We obtain a stochastic connectivity matrix M for the protein similarity network by row-normalizing edge weights Wij to obtain transition probabilities: Mij = Wij / ∑j Wij.

Given such a network and a query sequence q, the Rankprop algorithm is simple to describe. First, all nodes are assigned initial activation scores that reflect each target protein's similarity to q. Like the edge weights, these scores are computed from PSI-BLAST E-values using the same equation. At each iteration of the algorithm, the activation score at a given node is replaced by the weighted sum of the scores from all of its incoming edges. The update rule includes a diffusion constant α that controls the rate of diffusion through the network. Formally, we define the initial activation scores as P0i= exp(−Sq(i)/σ). Viewing Pt as the column vector of activation levels at iteration t, the algorithm is given by Pit+1= αMPit+Pi0 if Piq and Pit+1 = 1 otherwise, where α ∈(0,1). One can show that this iterative procedure converges to a fixed point, which in practice happens in a small number of iterations. The output of the Rankprop algorithm is a ranking of the nodes in the network according to their final activation values. Proteins that receive a high activation score are linked to the query via many strongly weighted paths and vice versa. A multidomain query protein will produce strong matches to any target protein that contains one or more of the query domains. A single domain query A may connect through a multidomain protein AB to infer a false relationship with B. However, previous work (Weston et al., 2004) has found that as long as the query sequence is connected to many other proteins, then the true homologs will be mutually reinforcing and receive a higher rank.

In this work, we extend the original Rankprop algorithm in two ways: (1) improving accuracy by propagating simultaneously from proteins that are very closely related to the query, and (2) improving the interpretability of the scores produced by Rankprop by empirically mapping them to probabilities. The mapped score can be interpreted as the probability that the target protein is a member of the same SCOP superfamily as the query. We also announce the availability of a free web server that allows individual queries against a protein similarity network derived from the NRDB40, comprising 1.1 million targets.

2 METHODS

The Rankprop server uses the PSI-BLAST all-versus-all similarity matrix for NRDB40 provided by the PairsDB website (Heger et al., 2008). NRDB40 is a subset of the non-redundant sequence database, filtered so that no pairs exhibit >40% sequence identity.

We generalize the Rankprop algorithm to accept a set Q of query proteins, rather than a single query protein. To use this extra information we perform propagation as usual, but we constrain the activation scores for all the query points such that they are highly ranked. In particular, we choose the set Q to be all the proteins that have a match with the initial query q with a PSI-BLAST E-value<0.001. We then constrain our algorithm to have Pj = 1−Sq(j), ∀ jQ. This modification is useful because, instead of propagating from a single query source node in the graph, we can propagate from several source nodes that all belong to the same family or superfamily that we are searching for.

The original Rankprop algorithm outputs scalar values that are not directly interpretable. In the new version of the algorithm, we map each Rankprop score to an estimate of the probability that the corresponding query and target proteins belong to the same structural superfamily. We employ the SCOP database (Murzin et al., 1995) to compute a histogram of empirical frequencies of the activation levels Pi for each protein i. More specifically, we choose bin centers vk and compute the following quantities: nk, the number of times Pi falls into bin vk, and sn, the number of times that the latter occurs and i is in the same superfamily as the query. We are interested in the value sk / nk, which can be interpreted as the probability for each activation value bin of the target being in the same superfamily as the query. We choose the bin centers v= (0,0.01,0.02,…, 0.2 , 0.3 ,…, 1), and we enforce monotonicity in the final output by setting pi/ni = pi−1/ni−1 if pi/ni<pi−1/ni−1.

3 RESULTS

Table 1 compares our large-scale Rankprop results with PSI-BLAST (using NRDB40 and the same blastpgp parameters as PairsDB: −j 10 −e 1 −h 0.001 −b 10000 −v 10000) and the previously published version of Rankprop (using the SWISSPROT database, 108k proteins). Rankprop NRDB40 is a straightforward scaling up of the previous Rankprop algorithm to NRDB40. In addition, Rankprop+homologs NRDB40 employs the extensions described in Section 2. Accuracy is measured following the methodology given in (Weston et al., 2004): SCOP version 1.59 is split into train and test portions, and hyperparameters are chosen by using the training set. Then, each test protein is treated as a query, and the quality of a method's protein ranking is measured by using the area under the receiver operating characteristic (ROC) curve, up to the first (ROC1) or 50th (ROC50) false positive. We report results as average ROC1 and ROC50 scores across all 3083 test proteins. Using a larger network yields improvements across all four performance metrics, and propagating from multiple queries improves the performance still further. A Wilcoxon signed rank test, corrected for multiple tests, shows that all differences in Table 1 are significant at 0.01, except for the three pairs of methods marked with asterisks.

Table 1.

Ranking accuracy

Family Family S-Fam S-Fam

Method ROC1 ROC50 ROC1 ROC50
PSI-BLAST 0.833* 0.851 0.609* 0.628
RankProp SWISSPROT 0.816* 0.906 0.592* 0.725
RankProp NRDB40 0.872 0.923 0.696 0.779*
RankProp+homologs NRDB40 0.884 0.928 0.710 0.775*

*Indicate pairs of values that are not different at P < 0.01 (Wilcoxon signed rank).

We also evaluate the performance of Rankprop using a combined ROC curve across all the queries in our test set, following the protocol of (Altschul et al., 1997). Figure 1 shows the combined ROC curves for Rankprop NRDB40 (ranked by activation value), Rankprop+homologs NRDB40 (ranked by probability) and PSI-BLAST (ranked by E-value). Compared with average per-query ROC scores, the combined ROC curve requires that scores are well calibrated from one query to the next. The figure shows that the mapping of Rankprop scores to probabilities significantly improves the calibration, yielding better performance than PSI-BLAST for all but the first few false positives (across 3083 queries).

Fig. 1.

Fig. 1.

Combined ROC curve across multiple queries. For each method, search results from 3083 queries were sorted into a single list. The figure plots, for varying thresholds in the ranked list, the fraction of all known homologs (SCOP superfamily members) that fall above the threshold, as a function of the number of non-superfamily members above the threshold.

The Rankprop web server first looks for an exact match of the query sequence against the sequences in NRDB40. If such a match is found, the server will retrieve the precomputed PSI-BLAST results from the database and then apply the Rankprop algorithm. In this case the server takes around 90 s to process a query. If the sequence is not found in the database, then the server will run PSI-BLAST first, which on average takes an additional 15 min.

Funding

National Institutes of Health award (R01 GM074257).

Conflict of Interest: none declared.

References

  1. Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Heger A, et al. Pairsdb atlas of protein sequence space. Nucleic Acids Res. 2008;36:D276–D280. doi: 10.1093/nar/gkm879. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Murzin AG, et al. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 1995;247:536–540. doi: 10.1006/jmbi.1995.0159. [DOI] [PubMed] [Google Scholar]
  4. Weston J, et al. Protein ranking: from local to global structure in the protein similarity network. Proc. Natl Acad. Sci. 2004;101:6559–6563. doi: 10.1073/pnas.0308067101. [DOI] [PMC free article] [PubMed] [Google Scholar]

Articles from Bioinformatics are provided here courtesy of Oxford University Press

RESOURCES