PLOS One. 2011 Apr 19;6(4):e18607. doi: 10.1371/journal.pone.0018607

Scoring Protein Relationships in Functional Interaction Networks Predicted from Sequence Data

Gaston K Mazandu 1, Nicola J Mulder 1,*
Editor: Christophe Herman 2
PMCID: PMC3079720  PMID: 21526183

Abstract

The abundance of diverse biological data from various sources constitutes a rich source of knowledge, which has the power to advance our understanding of organisms. This requires computational methods in order to integrate and exploit these data effectively and elucidate local and genome wide functional connections between protein pairs, thus enabling functional inferences for uncharacterized proteins. These biological data are primarily in the form of sequences, which determine functions, although functional properties of a protein can often be predicted from just the domains it contains. Thus, protein sequences and domains can be used to predict protein pair-wise functional relationships, and thus contribute to the function prediction process of uncharacterized proteins in order to ensure that knowledge is gained from sequencing efforts. In this work, we introduce information-theoretic based approaches to score protein-protein functional interaction pairs predicted from protein sequence similarity and conserved protein signature matches. The proposed schemes are effective for data-driven scoring of connections between protein pairs. We applied these schemes to the Mycobacterium tuberculosis proteome to produce a homology-based functional network of the organism with a high confidence and coverage. We use the network for predicting functions of uncharacterised proteins.

Availability

Protein pair-wise functional relationship scores for Mycobacterium tuberculosis strain CDC1551 sequence data and python scripts to compute these scores are available at http://web.cbio.uct.ac.za/~gmazandu/scoringschemes.

Introduction

In recent years we have experienced an exponential growth of biological data, including primary data such as genomic sequences resulting from worldwide DNA sequencing efforts, as well as functional data from high-throughput experiments. This abundance of primary sequence data and the wide availability of public gene and protein sequence databases can provide many new insights into the biology of organisms. Several studies have shown that very often functional properties of a protein are not necessarily determined by the whole sequence but only by some of its sub-sequences [1]. Sequences sharing similar or conserved features are referred to as homologous sequences, and these features can be used for inferring and scoring protein pair-wise functional connections. One of these features is a protein domain, defined as a part of a protein sequence and structure that can evolve, function and exist independently of the rest of the protein chain [2].

Discovering sequence homology and modelling functional interactions between homologues from sequence and experimental data constitutes an important problem in molecular biology, as these can help to describe their behaviour in cellular processes and reveal the interplay between particular genes and proteins. In order to determine functional similarity between proteins, many approaches try to identify the sub-sequences of the proteins that may contribute to their function. Several bioinformatics tools have been designed for deriving and storing these functional features. These include standard sequence comparison tools such as BLAST [3], [4], protein sequence databases such as UniProt [5], and protein signature databases such as InterPro [6], which integrates predictive models or protein signatures representing protein domains, families and functional sites from multiple source databases, namely PROSITE, Pfam, PRINTS, ProDom, SMART, TIGRFAMs, PIRSF, SUPERFAMILY, Gene3D and PANTHER [7].

Using homologous datasets obtained from pair-wise sequence similarities, and protein domains and families in public databases, the inference of functional connections can be carried out based on the fact that two proteins sharing common domains or belonging to the same family are more likely to be functionally linked [8], i.e., have similar functions with respect to molecular function and biological process. Note that the interactions discussed here are potential functional interactions, not direct physical interactions. These functional associations may be set in Boolean or binary form, i.e., either two genes or proteins are functionally linked, in which case the score is 1, or they are not and the score is 0. Such a scoring scheme is not consistent since it does not take into account the nature of parameters used to derive these functional associations. Understanding the properties of these functional relationships is key to successful mathematical modelling of such a system and developing efficient scoring techniques.

There are several problems with generating functional interaction networks using diverse data types such as sequence and functional genomics data. Considering that we are dealing with inaccurate data obtained from different experiments [9], [10], the uncertainty of data and noise inherent in each experiment must be efficiently managed by systematically weighing or scoring these functional associations [11]. This is referred to as a reliability or confidence score of functional associations for the particular computational approach used for prediction. This produces a graph with confidence-weighted relationships between each protein pair, which weighs each evidence type on the basis of its accuracy. Data-driven prediction methods should be able to extract essential features from particular datasets and to discount unwanted information. So, these scoring schemes must be data source and technology dependent, meaning that a given scoring scheme should normally vary according to the data sources and be designed on the basis of the technology used. Furthermore, the effectiveness of a scoring scheme for functional associations is critical for the quality of the analyses performed on the resulting network, including functional and structural analysis. An inability to accurately infer and score these protein pair functional associations leads to the propagation of annotation errors [12] and may negatively impact on the prediction analyses performed on the basis of these networks.

Several scoring schemes have been proposed for sequence data and are, so far, limited to only finding the similarity scores of proteins that are referred to as scoring functions. In the case of protein domain and family data, the scoring function is deduced from the number of common signatures shared by two proteins [10], [13]. These schemes miss other features related to the data under consideration, including their nature and sources. On the other hand, for sequence similarity data this scoring function is just the E-value obtained from sequence comparison tools, and pair-wise functional interactions between proteins are obtained by simply applying an E-value cut-off [10], [14]–[17]. However, there is no single fixed E-value describing where homology ends and non-homology begins. This shows that these schemes are not equipped to meet the requirements for scoring functional relationships, i.e., they do not capture all information shared between sequences.

In order to overcome these shortcomings, we propose an information-theoretic based measure to score protein-protein relationships in functional interaction networks predicted from homology data. This approach is shown to be effective for scoring functional pair-wise relationships from homology data, and translating the amount of biological content shared between proteins into the score of their functional relationships. We apply our method to score functional relationships between proteins in Mycobacterium tuberculosis (MTB) strain CDC1551 to produce a functional network from sequence data for this organism. This approach is compared to the STRING (Search Tool for the Retrieval of Interacting Genes/Proteins) [11], [18] homology scoring system for sequence similarity, and to existing scoring schemes for protein family and domain sharing [10], [13] in terms of functional classification coherence. Results show that the new scoring approach is as effective as that of the STRING approach, but produces a reliable functional network with higher coverage. The MTB functional network produced is then used to predict the functional class of proteins of unknown function, evaluated using leave-one-out cross validation.

Materials and Methods

This section describes novel scoring schemes for protein family and domain data extracted from protein family databases, as well as for protein sequence similarity obtained by running sequence comparison tools such as the Basic Local Alignment Search Tool (BLAST). Sequences in FASTA format and InterPro data for the organism were downloaded from the Integr8 project of the European Bioinformatics Institute (EBI) at http://www.ebi.ac.uk/integr8. Scoring functional relationships for data from protein families and domains has been widely addressed by the bioinformatics community. However, the approaches described so far in the literature are limited to finding the similarity scores between proteins by the number of common signatures shared by proteins. Two examples of such a scheme are given below.

Scheme 1: Scoring Function of Pfam Domain Sharing [10].

The scoring function $S(p, q)$ of Pfam domain sharing is simply the number of common domains of the two proteins, defined as follows:

$$S(p, q) = |D_p \cap D_q| \quad (1)$$

where $D_p$ is the set of Pfam domains found in protein $p$.

Scheme 2: Scoring Function based on Protein Signature Profiling [13].

The similarity score between a pair of proteins Inline graphic is computed using a binary similarity function between a pair of their signature profiles and is given by

graphic file with name pone.0018607.e014.jpg (2)

where Inline graphic is the number of signatures contained in proteins of a genome of interest and Inline graphic the signature profile of protein Inline graphic, with Inline graphic, if the signature Inline graphic exists in protein Inline graphic and Inline graphic otherwise.

Note that the scheme Inline graphic expressed by the equation (1) can be rewritten using Boolean operator ‘and (Inline graphic)’ as follows:

graphic file with name pone.0018607.e024.jpg

and similarly, the scheme Inline graphic in the equation (2) can also be written using set operators ‘intersection (Inline graphic)’ and ‘union (Inline graphic)’ as

graphic file with name pone.0018607.e028.jpg

with Inline graphic and Inline graphic as defined above.

These two schemes just count the number of shared signatures without taking into account the nature of the data and experiments used to derive them. In addition, the limitation of the second scheme can be seen in this small illustration: Let's consider three proteins Inline graphic Inline graphic and Inline graphic with Inline graphic Inline graphic and Inline graphic detected signatures, respectively. If we assume that Inline graphic and Inline graphic share Inline graphic signatures and Inline graphic signatures are shared by Inline graphic and Inline graphic we have: Inline graphic and Inline graphic. So, Inline graphic, whereas one should expect to have Inline graphic when looking at the number of the common signatures shared by these proteins. In fact, the scoring function as a function of the number of common signatures shared by a pair of proteins, is expected to be increasing. This property does not hold for scoring functions based on protein signature profiling, making this unattractive.
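The limitation described above can be illustrated with a small sketch. It assumes that the Scheme 2 similarity is the intersection-over-union (Jaccard) ratio of the two signature sets, which is what the set-operator rewriting suggests; the signature names and set sizes are hypothetical and chosen only to expose the effect.

```python
def shared_count(a: set, b: set) -> int:
    """Scheme 1 style score: number of signatures common to both proteins."""
    return len(a & b)


def jaccard(a: set, b: set) -> float:
    """Intersection over union of two signature sets (assumed form of Scheme 2)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical signature sets for three proteins, for illustration only.
p = {"s1", "s2", "s3", "s4", "s5", "s6"}
q = {"s1", "s2", "s3", "s7", "s8", "s9", "s10", "s11"}
r = {"s1", "s2"}

pq_count, pr_count = shared_count(p, q), shared_count(p, r)
pq_ratio, pr_ratio = jaccard(p, q), jaccard(p, r)

# p and q share more signatures than p and r, yet the ratio ranks them lower.
print(pq_count, pr_count, round(pq_ratio, 3), round(pr_ratio, 3))
```

Under this assumption the ratio penalizes proteins that carry many unshared signatures, so it is not an increasing function of the number of shared signatures.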

In the case of sequence similarity, the existing scoring schemes rely on the use of the negative logarithm of the E-value obtained from a sequence similarity tool. As pointed out previously, the problem with these scoring schemes is that there is no single fixed E-value describing where homology ends and non-homology begins. This constitutes an impediment to these scoring schemes, beyond the fact that they lead to singularities when the E-value is zero.

Thus, these schemes are not equipped to capture all the parameters related to the data under consideration and technology used to derive them. In order to overcome these shortcomings, we introduce novel scoring schemes based on the information-theoretic approach, taking into account the nature of the data and technology used and where the user can tune parameters based on their confidence in the data source.

Scoring Scheme For Protein Family and Domain

Consider two proteins denoted Inline graphic and Inline graphic, sharing signatures or entries Inline graphic. We define the similarity score Inline graphic of proteins Inline graphic and Inline graphic as the minimum number of occurrences of these signatures in proteins Inline graphic and Inline graphic, i.e.,

graphic file with name pone.0018607.e059.jpg (3)

where Inline graphic is the number of occurrences of signatures Inline graphic in the protein Inline graphic.

Broadly speaking, the reliability or confidence score increases with the confidence-level of data, which depends on the data source, and is reduced by the uncertainty-level of data linked to the dispersion measure Inline graphic. As we are dealing with data from experiments containing a certain level of uncertainty, which propagates into the data, it is natural to use the normal distribution, as these data can be summarized in terms of mean and standard deviation. In fact, in this case this distribution constitutes an attractive approximation, as it maximizes information entropy among distributions with a given mean and standard deviation. Thus, we set the confidence-level Inline graphic of the similarity score Inline graphic as

graphic file with name pone.0018607.e066.jpg (4)

with the function $\Phi$ the cumulative probability of the standard Gaussian distribution defined by

$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^{2}/2}\, dt \quad (5)$$

and Inline graphic the calibration control parameter, with Inline graphic strengthening the impact of the confidence-level for the data under consideration, whereas Inline graphic is associated with low confidence data. The training dataset Inline graphic consists of all pairs Inline graphic, where Inline graphic is the number of times the signature Inline graphic was observed. In order to get rid of observations that lie at abnormal distances from the rest of the data, referred to as outliers, it is recommended to use the rectified dataset Inline graphic, the subset of the training dataset Inline graphic consisting of the data points that fall inside Inline graphic, i.e.,

graphic file with name pone.0018607.e080.jpg

with $Q_1$ and $Q_3$, respectively, the first (lower) and third (upper) quartile, and $IQR = Q_3 - Q_1$ the interquartile range. $\sigma$ is thus the standard deviation of the rectified dataset, estimated from maximum likelihood and given by

$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (n_i - \mu)^{2}} \quad (6)$$

where $N$ is the number of signatures found in the rectified dataset, $n_i$ the number of times signature $i$ was observed, and $\mu$ the mean or average of the set.
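The rectification step and the dispersion estimate can be sketched as follows. This is a minimal illustration, not the authors' script: the fence multiplier of 1.5 is an assumption (the conventional interquartile-range rule), the quartile interpolation is whatever `statistics.quantiles` uses by default, and the observation counts are made up.

```python
import statistics


def rectified(counts, fence=1.5):
    """Keep counts inside [Q1 - fence*IQR, Q3 + fence*IQR].

    The multiplier fence=1.5 is an assumed, conventional choice.
    """
    q1, _, q3 = statistics.quantiles(counts, n=4)
    iqr = q3 - q1
    low, high = q1 - fence * iqr, q3 + fence * iqr
    return [c for c in counts if low <= c <= high]


def mle_std(values):
    """Maximum-likelihood standard deviation: divides by N, not N - 1."""
    return statistics.pstdev(values)


# Hypothetical numbers of times each signature was observed.
counts = [1, 2, 2, 3, 3, 3, 4, 4, 5, 40]
kept = rectified(counts)
sigma = mle_std(kept)
print(len(kept), round(sigma, 4))
```

Removing the single extreme count before estimating the dispersion keeps one heavily repeated signature from inflating the standard deviation used for all protein pairs.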

Given the confidence-level Inline graphic of the similarity score Inline graphic defined in equation (4), the uncertainty measure related to the outcome Inline graphic resulting from the data is obtained from the binary entropy function, given by

$$H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p) \quad (7)$$

where $p$ stands for the confidence level of equation (4).

In fact, the uncertainty measure function Inline graphic is defined in the interval Inline graphic with Inline graphic since Inline graphic and also Inline graphic. Finally, we define the capacity of inferring the functional relationship score between two proteins belonging to the same family or sharing common signatures as

graphic file with name pone.0018607.e099.jpg (8)

and the reliability or confidence score of the functional relationship between two proteins by

graphic file with name pone.0018607.e100.jpg (9)

Note that for Inline graphic significantly large, Inline graphic converges to Inline graphic. Therefore, the uncertainty measure Inline graphic converges to Inline graphic, leading to the maximum capacity of inferring the functional relationship of Inline graphic. This means that the reliability of a functional relationship between two proteins is given by

graphic file with name pone.0018607.e107.jpg (10)
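The building blocks of this scheme can be explored numerically. The sketch below implements the standard Gaussian cumulative probability and the binary entropy, and assumes that the capacity is one minus the binary entropy of the confidence level, which matches the limiting behaviour described above but is not confirmed by the text. How the argument of the Gaussian function is built from the number of shared signatures, the calibration parameter and the dispersion is defined by equation (4) and is not reproduced here, so `z` is only a stand-in.

```python
import math


def std_normal_cdf(x: float) -> float:
    """Cumulative probability of the standard Gaussian, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def binary_entropy(p: float) -> float:
    """Binary entropy in bits; taken as 0 at p = 0 and p = 1."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def capacity(p: float) -> float:
    """Assumed capacity: one minus the binary entropy of the confidence level."""
    return 1.0 - binary_entropy(p)


# z is a stand-in for the argument defined in equation (4).
levels = [std_normal_cdf(z) for z in (0.0, 0.5, 1.0, 2.0, 4.0)]
capacities = [capacity(p) for p in levels]
for p, c in zip(levels, capacities):
    print(f"confidence={p:.4f}  capacity={c:.4f}")
```

At `z = 0` the confidence level is 0.5, the entropy is maximal and the assumed capacity vanishes; as `z` grows the confidence level approaches 1 and the capacity approaches its maximum.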

To illustrate the dependency of this new measure on the data under consideration and the technology used to produce them, we plot the variation of confidence level Inline graphic, uncertainty Inline graphic and capacity Inline graphic in terms of common domains Inline graphic between proteins, for different values of Inline graphic, which keeps track of the technology used to produce data, and Inline graphic, controlling the impact of data under consideration, respectively. These are user-tunable parameters and results are shown in figures 1–4.

Figure 1. Confidence level variation for Inline graphic.

For a fixed calibration control parameter, as the number of shared domains increases, the confidence level also increases with a decrease in the standard deviation Inline graphic.

Figure 2. Confidence level variation for Inline graphic.

For a fixed standard deviation, as the number of shared domains increases, the confidence level also increases with an increase in the calibration control parameter.

Figure 3. Variation of uncertainty in terms of Inline graphic.

As the number of shared domains increases, the uncertainty component decreases as the standard deviation Inline graphic decreases.

Figure 4. Variation of capacity in terms of Inline graphic.

As the number of shared domains increases, the capacity for inferring functional relationships between proteins, and therefore link confidence scores increases as the standard deviation Inline graphic decreases.

These results show that the confidence level Inline graphic increases as the number of common signatures between the two proteins increases, and that for a higher value of Inline graphic, indicating the efficiency level of the technology used to derive data, the confidence level Inline graphic is higher, and so is the reliability or confidence score, because in this case the uncertainty component is smaller. Similarly, the impact of data obtained from each technology is taken into account through Inline graphic. Interestingly, this confidence score formula accommodates the case where no common pattern is found between two proteins in the training dataset, in which case the confidence score or reliability of a functional relationship is Inline graphic. In addition, this scoring scheme takes into account a false positive assignment of any of the common patterns by lowering the confidence score of proteins containing only one common signature, depending on the measure of dispersion Inline graphic, which can provide a hint on the nature of the data under consideration. Indeed, the measure of dispersion Inline graphic impacts on the confidence score in the sense that if data is far away from the average, in which case Inline graphic is high, the uncertainty component might be large and significant while calculating the confidence score, thus yielding a lower confidence score. Thus, with knowledge of the data source, the measure of dispersion Inline graphic can be penalized by a factor Inline graphic between 0 and 1, in order to reduce the impact of the uncertainty component.

Scoring Scheme For Protein Sequence Similarity

For a given set of pair-wise homologous sequences, Bastian [19], [20] showed that their biological evolution can be formalized by the evolution of their shared amount of information. This is measured by the mutual information in the sense of Hartley [21], [22], estimating the information they share due to their common origin and parallel evolution under similar selective pressure. Moreover, this mutual information is proportional to the bit score computed with standard methods in sequence comparisons.

Let Inline graphic be the bit score alignment of homologous sequences Inline graphic and Inline graphic, set with its standard units, and Inline graphic mutual information between these two sequences. We have

graphic file with name pone.0018607.e135.jpg (11)

where Inline graphic is a constant defining the unity, which depends on the statistical parameter scale Inline graphic for the search size (http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html) derived from the scoring matrix and amino acid composition of the sequence [23]. Therefore, generally Inline graphic and they are equal only if they have the same scale for the search size. However, the mutual information Inline graphic between two sequences Inline graphic and Inline graphic satisfies Inline graphic and Inline graphic [24].

Equation (11) shows that the mutual information Inline graphic increases with the bit score Inline graphic, which measures the average information available per position to distinguish an alignment from chance, calculated using relative entropy of target and background distributions [25] as

$$H = \sum_{i,j} q_{ij} \log_2 \frac{q_{ij}}{p_i\, p_j} \quad (12)$$

where $q_{ij}$ is the “target” residue substitution frequency, the probability of finding a residue $i$ aligned with a residue $j$ after a certain amount of evolution given that they have both evolved from a common ancestor who had a residue Inline graphic at that position. $p_i$ is the probability of occurrence of a residue $i$ in a collection of sequences, i.e., the probability that a residue $i$ would align by chance based solely on its frequency in a sequence.
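The relative entropy of target and background distributions can be computed directly from its definition. The sketch below uses a hypothetical two-letter alphabet with made-up frequencies, and assumes the standard form in which each aligned pair contributes its target frequency times the base-2 logarithm of the ratio between the target frequency and the product of the two background frequencies.

```python
import math


def relative_entropy(q, p):
    """Sum over residue pairs of q[i][j] * log2(q[i][j] / (p[i] * p[j])), in bits."""
    total = 0.0
    for i, row in q.items():
        for j, qij in row.items():
            if qij > 0.0:
                total += qij * math.log2(qij / (p[i] * p[j]))
    return total


# Hypothetical two-letter alphabet; all frequencies are for illustration only.
background = {"A": 0.5, "B": 0.5}
independent = {"A": {"A": 0.25, "B": 0.25}, "B": {"A": 0.25, "B": 0.25}}
conserved = {"A": {"A": 0.45, "B": 0.05}, "B": {"A": 0.05, "B": 0.45}}

h_independent = relative_entropy(independent, background)
h_conserved = relative_entropy(conserved, background)
print(round(h_independent, 4), round(h_conserved, 4))
```

When aligned residues pair up exactly as chance predicts, no information per position is available to distinguish the alignment from a random one; the more the target frequencies favour conserved pairs, the larger the value.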

Thus, we define the reliability or confidence score Inline graphic of a functional relationship between two protein sequences Inline graphic and Inline graphic as normalized mutual information calculated [26] as

graphic file with name pone.0018607.e158.jpg (13)

measuring how the protein sequence Inline graphic is able to predict the protein sequence Inline graphic and where Inline graphic is the relative entropy obtained by aligning a protein sequence Inline graphic by itself. Indeed, the increase of mutual information with relative entropy yields bias, and this bias is corrected by dividing the mutual information by the maximum entropy of the sequence pair.

Using equation (11), the mutual information Inline graphic can be computed as follows:

graphic file with name pone.0018607.e164.jpg (14)

where Inline graphic and Inline graphic are constants defining unity for Inline graphic and Inline graphic, respectively. For a protein sequence Inline graphic Inline graphic, obtained using equation (14) and given by

graphic file with name pone.0018607.e171.jpg (15)

Finally, Inline graphic is independent of constants defining unity for Inline graphic and Inline graphic, and calculated as

graphic file with name pone.0018607.e175.jpg (16)

It is obvious that this scoring scheme relies only on the two protein sequences for which the confidence score is being computed. When the mutual information of the evolutionary history of two protein sequences, embedded in their similarity score, is 0, the two sequences are not similar and so their confidence score is also 0. Thus, this scoring scheme accommodates the case where no similarity is found between two protein sequences, and the error due to the arbitrary growth of the mutual information between two proteins is corrected by the maximum entropy induced.
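As a rough sketch of the normalization idea, the function below divides the bit score of a pair-wise alignment by the larger of the two self-alignment bit scores. This specific form is an assumption made for illustration: the exact expression of equation (16) is not reproduced in this text, and the function name and bit score values are hypothetical.

```python
def similarity_confidence(bits_xy: float, bits_xx: float, bits_yy: float) -> float:
    """Assumed form: pair bit score over the larger self-alignment bit score.

    The result is clamped to [0, 1]; a non-positive denominator yields 0.
    """
    denom = max(bits_xx, bits_yy)
    if denom <= 0.0:
        return 0.0
    return max(0.0, min(1.0, bits_xy / denom))


# Hypothetical bit scores, for illustration only.
no_hit = similarity_confidence(0.0, 410.0, 385.0)
partial = similarity_confidence(205.0, 410.0, 385.0)
identical = similarity_confidence(410.0, 410.0, 410.0)
print(no_hit, partial, identical)
```

Dividing by the larger self-alignment score means that a long sequence matching only a short fragment of another cannot reach a score close to 1, which reflects the bias correction described above.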

Results and Discussion

MTB Functional Network Derived from Sequence Data

The computation of relationship scores (as described in the methods section) was performed on the whole Mycobacterium tuberculosis strain CDC1551 proteome to produce functional links between proteins from homology data, including pair-wise links from sequence similarity and protein family data derived from the InterPro database. Sequence similarity searches were carried out using BLASTP under a BLOSUM62 matrix, based on the premise that if the E-value is less than Inline graphic, the hit is similar to the query sequence and is likely to be evolutionarily related [27]. Resulting functional link scores are provided in Table S1.

We investigated the general behaviour of the link confidence scores induced from homology datasets. Results are depicted in Table 1 in terms of number and frequency of functional links in a given bin Inline graphic, where Inline graphic corresponds to link score values ranging between Inline graphic and Inline graphic, Inline graphic. These results indicate that the link confidence scores from protein family data are either low (Inline graphic) or high (Inline graphic). This is due to the calibration control parameter applied to data from the InterPro database, which is Inline graphic with penalty parameter Inline graphic, producing either low or high confidence according to whether two proteins share only one domain or more than one domain, respectively. Moreover, in most cases, prediction of functional links from sequence similarity matches that of protein family data but at different confidence levels. The link score Inline graphic between proteins Inline graphic and Inline graphic obtained for the combined data is given by

$$S_{xy} = 1 - \left(1 - S^{\mathrm{seq}}_{xy}\right)\left(1 - S^{\mathrm{fam}}_{xy}\right) \quad (17)$$

under the assumption of independence, where $S^{\mathrm{seq}}_{xy}$ and $S^{\mathrm{fam}}_{xy}$ are link confidence scores obtained from sequence similarity and protein family datasets, respectively.
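A common way to combine confidence scores from independent evidence sources is to take the probability that at least one source supports the link. The sketch below assumes this form for the combined score; the function name and the input scores are hypothetical.

```python
def combine_independent(*scores: float) -> float:
    """Probability that at least one independent evidence source supports the link."""
    miss = 1.0
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError("scores must lie in [0, 1]")
        miss *= 1.0 - s
    return 1.0 - miss


# Hypothetical sequence-similarity and protein-family scores for two links.
combined = combine_independent(0.6, 0.5)
only_family = combine_independent(0.0, 0.8)
print(round(combined, 3), round(only_family, 3))
```

With this form a link supported by a single source keeps that source's score, and agreement between sources can only raise the combined score.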

Table 1. MTB strain CDC1551 functional links derived from sequence data using our approach, STRING homology scheme for sequence similarity, and using the SFSP approach for protein family and domain sharing.

                                    Sequence Similarity          Protein Family and Domain
Confidence Bins                     Our Approach  STRING scheme  Our Approach  SFSP-Under  SFSP-Aver  SFSP-Over
Low                 Inline graphic          4321              0             0       33240          0          0
                    Inline graphic          3001              0             0        4365          0          0
                    Inline graphic          1206              0             0         814          0          0
                    Inline graphic           606             44         20915         172      27494          0
Medium              Inline graphic           424            263             0           6          6          6
                    Inline graphic           215            140             0          41       5746          0
                    Inline graphic            96             99             0          45       1394          0
High                Inline graphic            31             57          7847           0       3906          0
                    Inline graphic            21             58             0          18        155         45
                    Inline graphic            25             52          9945           6          6      38656
Medium-High Total:                           812            669         17792         116      11213      38707
Overall Total:                              9946            713         38707       38707      38707      38707

Number of Interactions per Source and Link Score shown separately by bin.

Evaluating the Scoring Scheme

We compared our approach for scoring functional interactions inferred from sequence similarity to the STRING homology scoring scheme. STRING is a database of known and predicted protein-protein associations for a large number of organisms derived from high-throughput experimental data, the mining of databases and literature, and from predictions based on genomic analysis. For this assessment we used only their links derived from homology data, which uses a scoring scheme based on E-values obtained from the Smith-Waterman algorithm with a reasonably strict cut-off score to ensure high quality matches [28]. We also compared our approach for scoring functional interactions from protein family and domain to the scoring scheme for protein signature profiling (SFSP).

The STRING scheme classifies its functional link confidence scores into three different categories, low, medium and high confidence, with corresponding scores less than 0.4, between 0.4 and 0.7, and greater than 0.7, respectively [11]. These scores measure our confidence in pair-wise functional interactions in the networks produced. Even though sequence data are initially accurate, computational tools used to produce sequence similarity data may introduce noise due to certain unpredictable factors, such as arbitrary increases of bit score or over-estimation of similarity patterns between sequences. In order to take into account these uncertainties in sequence similarity data while ensuring the accuracy of functional interactions produced, one can set a cut-off score above which a given interaction is more likely to occur. Therefore, the comparison was performed in terms of functional classification accuracy for links with a medium confidence level and upwards (link score greater than Inline graphic). The number of associations predicted in different MTB functional networks produced using different approaches are shown separately in Table 1 for each approach and confidence ranging from low to high.
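The three confidence categories can be expressed as a small classifier. The thresholds 0.4 and 0.7 come from the text; treating scores of exactly 0.4 and 0.7 as medium is an assumption, and the link scores are hypothetical.

```python
from collections import Counter


def confidence_category(score: float) -> str:
    """Low below 0.4, medium from 0.4 to 0.7, high above 0.7."""
    if score > 0.7:
        return "high"
    if score >= 0.4:
        return "medium"
    return "low"


# Hypothetical link scores, for illustration only.
scores = [0.05, 0.39, 0.40, 0.55, 0.70, 0.71, 0.98]
tally = Counter(confidence_category(s) for s in scores)
medium_or_better = tally["medium"] + tally["high"]
print(dict(tally), medium_or_better)
```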

Because the SFSP as defined by equation (2) may produce several link scores for the same number of shared domains, we have considered their maximum score when over-estimating, their minimum when under-estimating, and their average score, referred to as SFSP-Max, SFSP-Under and SFSP-Mean, respectively (labelled SFSP-Over, SFSP-Under and SFSP-Aver in Tables 1 and 2). We plot the scores obtained using our approach and those from SFSP, and results are shown in figure 5. As pointed out previously, the scoring function should be increasing, since our confidence level increases with the number of common signatures shared between protein pairs. These results show that only SFSP-Under estimation provides an increasing scoring function, but unfortunately it yields poor coverage and for this reason it is not considered for further performance evaluation. The scoring scheme developed here produces an increasing scoring function and provides a better trade-off between SFSP-Max and SFSP-Mean. Considering the confidence score cut-off applied, the configuration of the network produced from SFSP-Max estimation is the same as that derived using the scheme based on the scoring function of domain sharing described by equation (1).

Figure 5. Variation of Scores in the Protein Signature Profiling (SFSP) based approach compared to our approach.

Change in Protein Signature Profiling Score minimum, mean and maximum and our approach when varying the number of shared domains between proteins.

Statistical significance of Functional Interactions Derived

We evaluated the statistical significance and biological relevance of the functional interactions inferred using our scoring approach in terms of functional classification coherence. To measure this, an interaction between two proteins is said to be significant or correct if these proteins belong to the same functional class.

The functional classes were extracted from Tuberculist (http://genolist.pasteur.fr/Tuberculist), and the distribution of interacting proteins in the functional network per functional class or category for different configurations is shown in Table 2. The evaluation was done using a sub-network generated by each protein in the functional network, consisting of functional interactions between the protein under consideration and its direct neighbours, referred to as a P-subgraph. The proteins in the unknown functional class were excluded from the evaluation.
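The P-subgraph of a protein and the number of its links that stay within one functional class can be sketched as follows. The protein identifiers, links and class labels are hypothetical, and proteins of unknown class are represented simply by leaving them out of the class mapping.

```python
from collections import defaultdict


def p_subgraphs(links):
    """Map each protein to the set of its direct neighbours."""
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return neighbours


def coherent_links(protein, neighbours, classes):
    """Count links from `protein` to neighbours in the same functional class."""
    own = classes.get(protein)
    if own is None:
        return 0
    return sum(1 for nb in neighbours[protein] if classes.get(nb) == own)


# Hypothetical links and class labels; P4 has no known class.
links = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3")]
classes = {"P1": "lipid metabolism", "P2": "lipid metabolism", "P3": "regulatory proteins"}

neighbours = p_subgraphs(links)
size_p1 = len(neighbours["P1"])
same_class_p1 = coherent_links("P1", neighbours, classes)
print(size_p1, same_class_p1)
```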

Table 2. Distribution of MTB strain CDC1551 proteins per functional class.

                                              Sequence Similarity          Protein Family and Domain
Functional Class                              Our Approach  STRING Scheme  Our Approach  SFSP-Under  SFSP-Aver  SFSP-Over
1   Virulence, detoxification and adaptation            34             33            89           0         82        143
2   Lipid Metabolism                                    47             97           190          19        133        222
3   Information Pathways                                12             21           148           2        125        183
4   Cell-wall and Cell Process                          82            101           236           2        181        355
5   Stable RNAs                                          -              -             -           -          -          -
6   Insertion Sequences and Phages                      32              2            42           0         30         55
7   PE/PPE/PGRS Proteins                                89             43            59           0         57        142
8   Intermediary Metabolism and Respiration             65            174           603           1        508        759
9   Protein of Unknown Function                         77             77           287           0        222        555
10  Regulatory Proteins                                 17             14           148           0        145        165
    Total                                              455            562          1802          24       1483       2579

Number of proteins per functional class in the functional networks produced using our approach and the STRING homology scheme, and using the SFSP approach for protein family and domain sharing.

To assess the functional category coherence of the derived functional interactions against a random model, we compute a P-value for each P-subgraph, defined as the probability that the P-subgraph under consideration occurs by chance or consists of randomly drawn interactions. The P-value is modelled with the hypergeometric distribution [14], which yields the probability of observing at least Inline graphic interactions between proteins from a given P-subgraph of size Inline graphic by chance among Inline graphic interactions of the same type in the entire functional network, taken as the background distribution, and is given by

graphic file with name pone.0018607.e209.jpg (18)

where Inline graphic is the size of the functional network, i.e., the number of functional links in the network, with all the proteins in the unknown class removed.
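The hypergeometric test described above can be sketched in Python as an upper-tail probability. This is only an illustration: the argument names (`total_links`, `same_type_links`, `subgraph_size`, `observed`) are assumptions chosen for readability, since the symbols of equation (18) are not available here, and the mapping of network quantities onto these arguments should be checked against the original equation.

```python
from math import comb


def hypergeom_upper_tail(total_links, same_type_links, subgraph_size, observed):
    """Return P(X >= observed) for a hypergeometric random variable X.

    total_links     -- interactions in the background network (population size)
    same_type_links -- background interactions of the type of interest
    subgraph_size   -- interactions drawn, i.e. the size of the P-subgraph
    observed        -- interactions of that type seen in the P-subgraph
    All names are hypothetical; they are not taken from the paper.
    """
    if not 0 <= same_type_links <= total_links:
        raise ValueError("same_type_links must lie in [0, total_links]")
    if not 0 <= subgraph_size <= total_links:
        raise ValueError("subgraph_size must lie in [0, total_links]")
    # Largest count of same-type interactions that a draw can contain.
    upper = min(same_type_links, subgraph_size)
    favourable = sum(
        comb(same_type_links, i)
        * comb(total_links - same_type_links, subgraph_size - i)
        for i in range(max(observed, 0), upper + 1)
    )
    return favourable / comb(total_links, subgraph_size)
```

With toy numbers (not data from this study), `hypergeom_upper_tail(10, 4, 3, 3)` gives the chance that all three interactions drawn from a background of ten are among the four of the type of interest. A P-subgraph would be called significant when this probability falls below the chosen significance level.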

We assessed the functional category coherence of functional interactions derived using our approach and STRING homology data for sequence similarity, as well as those inferred using our scheme for protein family and domain, and those obtained using SFSP-Mean and SFSP-Max estimation. Results displayed in figures 6 and 7 show that the functional interactions induced have a very low probability of occurring by chance. Note that this statistical test against a random distribution aims at checking whether a given P-subgraph in the functional network consists of randomly grouped proteins. These figures show that, using a significance level of Inline graphic as the optimal threshold, more P-subgraphs derived using our approach are statistically significant than those obtained from the STRING homology scoring, and that our approach provides a roughly equal percentage of statistically significant P-subgraphs to the SFSP-Mean and SFSP-Max schemes. For sequence similarity, a total of Inline graphic out of Inline graphic, representing Inline graphic of P-subgraphs in our network, are significant, compared to Inline graphic out of Inline graphic, representing Inline graphic of P-subgraphs, for the STRING scoring system. For protein family and domain, a total of Inline graphic out of Inline graphic, representing Inline graphic of P-subgraphs in our network, are significant, compared to Inline graphic out of Inline graphic, representing Inline graphic of P-subgraphs, for SFSP-Mean and to Inline graphic out of Inline graphic, representing Inline graphic, for SFSP-Max.

Figure 6. Significance of functional interactions derived using our approach and the STRING scheme.

At each significance level Inline graphic in these graphs, we counted all relevant predicted associations for the two approaches and computed the percentage. Each Inline graphic corresponds to the number of associations with p-value Inline graphic and Inline graphic, where Inline graphic is the significance level just before Inline graphic in the plot.

Figure 7. Significance of functional interactions derived using our approach and SFSP approach.

At each significance level Inline graphic in these graphs, we counted all relevant predicted associations for the two approaches and computed the percentage. Each Inline graphic corresponds to the number of associations with p-value Inline graphic and Inline graphic, where Inline graphic is the significance level just before Inline graphic in the plot.

Effectiveness of the Novel Scoring Scheme

To evaluate the classification power of the new scoring scheme, we used a modified Receiver Operating Characteristic (ROC) curve analysis that plots the number of true positive (TP) predictions (functional interactions correctly identified) against the number of false positive (FP) predictions (functional interactions incorrectly identified) [29], with the area under the ROC curve (AUC) used as a measure of discriminative power. The larger the upper AUC value (the portion between the curve and the line TP  =  FP), the more powerful the scheme.

For a given number of P-subgraphs ranging from Inline graphic to Inline graphic, we randomly generated Inline graphic independent samples and computed the average numbers of correctly and incorrectly predicted interactions, which are expected to be normally distributed by the central limit theorem. We then performed modified ROC analyses for the two scoring approaches, and the results for sequence similarity are shown in figure 8. These results indicate that our approach outperforms the STRING scheme, with an average of Inline graphic and Inline graphic of functional interactions correctly and incorrectly identified, respectively, out of Inline graphic P-subgraphs, compared to an average of Inline graphic and Inline graphic of functional interactions correctly and incorrectly identified, respectively, out of Inline graphic P-subgraphs for the STRING scheme. This shows not only that applying a reasonably strict cut-off score when using the Smith-Waterman algorithm is insufficient to ensure high quality matches [28], but also that this practice may lead to poor coverage. Results in figure 9 indicate that our method performs comparably to the SFSP-Max and SFSP-Mean schemes, and provides a better trade-off, in terms of precision and coverage, between the over-estimating and averaging SFSP schemes. Our approach provides an average of Inline graphic and Inline graphic of functional interactions correctly and incorrectly identified, respectively, out of Inline graphic P-subgraphs. SFSP-Mean yields an average of Inline graphic and Inline graphic of functional interactions correctly and incorrectly identified, respectively, out of Inline graphic P-subgraphs, while SFSP-Max produces an average of Inline graphic and Inline graphic of functional interactions correctly and incorrectly identified, respectively, out of Inline graphic P-subgraphs.
Apart from the general limitation common to scoring schemes based on signature profiling, SFSP-Max yields poor precision. This is because over-estimation admits all false positives; our approach corrects this, providing improved precision and coverage.
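The sampling procedure behind the modified ROC curves can be sketched as follows. This is a minimal illustration under stated assumptions: each P-subgraph is summarised by a hypothetical pair of counts (correct, incorrect), samples are drawn without replacement, and the function and parameter names are invented for this example rather than taken from the authors' scripts.

```python
import random


def modified_roc_points(subgraphs, sizes, n_samples=100, seed=0):
    """Average (FP, TP) interaction counts over random samples of P-subgraphs.

    subgraphs -- list of (correct, incorrect) interaction counts, one pair
                 per P-subgraph (an assumed summary of each P-subgraph)
    sizes     -- numbers of P-subgraphs at which a curve point is computed
    n_samples -- independent random samples drawn per size
    Returns one (mean_incorrect, mean_correct) pair per entry of sizes.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    points = []
    for size in sizes:
        correct_total = 0
        incorrect_total = 0
        for _ in range(n_samples):
            # Draw `size` distinct P-subgraphs (sampling without replacement).
            sample = rng.sample(subgraphs, size)
            correct_total += sum(correct for correct, _incorrect in sample)
            incorrect_total += sum(incorrect for _correct, incorrect in sample)
        points.append((incorrect_total / n_samples, correct_total / n_samples))
    return points
```

Plotting the returned pairs for two scoring schemes on the same axes gives curves of the kind compared here; the scheme whose curve lies further above the line TP = FP has the larger upper AUC.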

Figure 8. Modified ROC curves for functional interactions.

Number of incorrect functional interactions (false positives) versus number of correct functional interactions (true positives) in the MTB strain CDC1551 functional networks produced by our approach and the STRING homology network for sequence similarity.

Figure 9. Modified ROC curves for functional interactions.

Number of incorrect functional interactions (false positives) versus number of correct functional interactions (true positives) in the MTB strain CDC1551 functional networks produced by our approach and the SFSP scheme for protein family and domain.

General Analysis of the Structure of the Functional Network Produced

We performed a general analysis of the homology-based functional network produced by integrating into a single network all functional interactions inferred from sequence similarity and protein family and domain data using our scheme. The number of functional links in the combined network, which contains a total of Inline graphic proteins (nodes), is given in Table 3. The results in figure 10 show that this network exhibits scale-free topology, i.e., the degree distribution of proteins approximates a power law Inline graphic with the degree exponent Inline graphic. We analyzed the general behavior of this network by finding the number of cliques and the distribution of hubs. Here protein hubs are described as “single points of failure” able to disconnect the network. This functional network contains Inline graphic clusters, or cliques, with Inline graphic hubs and with the biggest cluster containing Inline graphic gene products.
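One simple way to check a power-law degree distribution is to fit a straight line to log P(k) against log k. The sketch below does this with ordinary least squares in plain Python. The fitting method is an assumption made for illustration (the procedure used for figure 10 is not stated in the text), and the function name is hypothetical.

```python
from collections import Counter
from math import log


def degree_exponent(degrees):
    """Estimate gamma in P(k) ~ k**(-gamma) from a list of node degrees.

    Uses a least-squares line through the points (log k, log P(k)), where
    P(k) is the fraction of nodes with degree k. Isolated nodes (k = 0)
    are ignored because log 0 is undefined.
    """
    counts = Counter(k for k in degrees if k > 0)
    if len(counts) < 2:
        raise ValueError("need at least two distinct positive degrees")
    n_nodes = sum(counts.values())
    xs = [log(k) for k in counts]
    ys = [log(c / n_nodes) for c in counts.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # The fitted slope is -gamma, so flip the sign.
    return -sxy / sxx
```

A log-log regression of this kind is a quick diagnostic rather than a rigorous estimator, so the fitted exponent is best read as indicative.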

Table 3. MTB strain CDC1551 functional links derived from sequence data using our approach.

Confidence | Bin | Interactions from Sequence Similarity | Interactions from Protein Family (InterPro data) | Combined Interactions
Low | Inline graphic | 4321 | 0 | 206
Low | Inline graphic | 3001 | 0 | 125
Low | Inline graphic | 1206 | 0 | 62
Low | Inline graphic | 606 | 20915 | 18381
Medium | Inline graphic | 424 | 0 | 1634
Medium | Inline graphic | 215 | 0 | 605
Medium | Inline graphic | 96 | 0 | 262
High | Inline graphic | 31 | 7847 | 6998
High | Inline graphic | 21 | 0 | 855
High | Inline graphic | 25 | 9945 | 10022
Medium-High Total | | 812 | 17792 | 20376
Overall Total | | 9946 | 38707 | 39150

Number of Interactions per Source and Link Score shown separately by bin.

Figure 10. Power law property of MTB strain CDC1551 functional network obtained from sequence data.

Connectivity distribution of detected functional links Inline graphic per protein, plotted as a function of frequency Inline graphic.

Predicting Protein Functional Class

Several approaches have been proposed for predicting protein functions from functional networks, and these are mainly classified into two categories, namely global network topology based and local neighborhood based approaches. Global network topology based approaches use global optimization [30]–[32], probabilistic methods [33]–[36] or machine learning [37]–[39] to improve the prediction accuracy using the global structure of the network under consideration. Unfortunately, these approaches raise a scalability issue, and their cost might not be proportional to the improvement in predictions compared to more straightforward approaches, which rely only on the local neighborhood [40] of uncharacterized proteins.

In the case of local neighborhood based approaches, known as ‘Guilt-by-Association’, ‘Majority Voting’ or ‘Neighbor Counting’ [41], the direct interacting neighbors of proteins are used to predict protein functions. However, the biggest limitation of approaches relying on the direct neighbors of the protein under consideration is that they are unable to characterize proteins whose direct interacting neighbors are all uncharacterized, which impacts negatively on annotation coverage. Investigating the relation between interacting neighbors of a given protein using network topology, Chua et al. [8], [42] showed that in many cases a protein shares functional similarity with its level-2 neighbors (2 branch-lengths away), and proposed a functional similarity weight (FS-Weight) method for predicting protein functions from protein interaction data. Here, we analyze the performance of using direct interacting neighbors and second level interacting neighbors. The second level interacting neighbors were used when we were unable to use direct interacting neighbors, in order to improve coverage.
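The fallback from direct to second level neighbors can be sketched as below. This is an illustration only: the network is assumed to be a dictionary mapping each protein to the set of its interacting partners, the label `"unknown"` and all function names are hypothetical, and a level-2 neighbor is counted once per connecting direct neighbor, which is one possible counting convention.

```python
from collections import Counter


def level2_neighbours(protein, neighbours):
    """Proteins two links away, counted once per connecting direct neighbour."""
    direct = neighbours.get(protein, set())
    counts = Counter()
    for partner in direct:
        for second in neighbours.get(partner, set()):
            # Skip the query protein itself and its direct neighbours.
            if second != protein and second not in direct:
                counts[second] += 1
    return counts


def vote(protein, neighbours, protein_class, unknown="unknown"):
    """Majority vote over direct neighbours, else over level-2 neighbours."""
    level1 = Counter(
        protein_class.get(q, unknown) for q in neighbours.get(protein, set())
    )
    level1.pop(unknown, None)  # uncharacterized neighbours cast no vote
    if level1:
        return level1.most_common(1)[0][0]
    level2 = Counter()
    for second, weight in level2_neighbours(protein, neighbours).items():
        label = protein_class.get(second, unknown)
        if label != unknown:
            level2[label] += weight
    return level2.most_common(1)[0][0] if level2 else None
```

When every direct neighbor is uncharacterized, `vote` returns the most frequent class among the level-2 neighbors, or `None` if those are uncharacterized as well.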

The functional network produced from sequence data was used to predict, where possible, the functional class of proteins in the Tuberculist unknown functional class using a local neighborhood based approach. Through this, a new functional class is assigned to an unknown protein based on the functional class frequently occurring among its direct interacting neighbors. In this case, the score of a given functional class Inline graphic for a protein Inline graphic is given by the frequency Inline graphic of occurrence of functional class Inline graphic among direct neighbors of Inline graphic, and calculated as follows:

graphic file with name pone.0018607.e283.jpg (19)

where Inline graphic refers to the set of direct interacting partners of protein Inline graphic, and Inline graphic is the Inline graphic function indicator given by

graphic file with name pone.0018607.e288.jpg
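The frequency score of equation (19) amounts to counting, for each functional class, the direct neighbors annotated to that class. A minimal sketch follows, assuming the network is stored as a dictionary of partner sets and annotations as a dictionary of class labels; these data structures, the `"unknown"` label and the function name are assumptions for illustration.

```python
from collections import Counter


def class_frequencies(protein, neighbours, protein_class, unknown="unknown"):
    """Count occurrences of each known class among the direct neighbours.

    Each neighbour contributes 1 to its own class and 0 to every other
    class, which is the role played by the indicator function.
    """
    return Counter(
        protein_class[q]
        for q in neighbours.get(protein, ())
        if protein_class.get(q, unknown) != unknown
    )
```

Calling `class_frequencies(p, neighbours, protein_class).most_common(1)` then returns the class occurring most often around protein `p`, together with its count.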

Since the objective is to assign only one functional class to an unknown protein, we make use of global network information, and the prediction of a protein's functional class is based on an over-represented functional class found amongst its direct neighbors. The functional class with the largest chi-square score is assigned to the protein. The chi-square score of functional class Inline graphic for protein Inline graphic [43] is given by

graphic file with name pone.0018607.e291.jpg (20)

where Inline graphic is defined in equation (19) and Inline graphic is the global expected number of proteins belonging to the functional class Inline graphic, given by Inline graphic, with Inline graphic that of proteins belonging to the class Inline graphic among all the proteins in the functional network under consideration and Inline graphic the order of the functional network, i.e., the number of proteins in the network.
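A chi-square score of this kind can be sketched as follows. Because the exact expression for the expected count is not available here, the sketch assumes the common form in which the expected number of neighbors in a class equals the number of direct neighbors multiplied by the network-wide fraction of proteins in that class. This assumption, the data structures and the function name are illustrative and may differ from equation (20).

```python
from collections import Counter


def chi_square_scores(protein, neighbours, protein_class, unknown="unknown"):
    """Chi-square score of every known functional class for one protein.

    observed -- neighbours of `protein` annotated to the class
    expected -- (number of direct neighbours) * (class size / network order),
                an assumed form of the global expected count
    """
    partners = neighbours.get(protein, ())
    observed = Counter(
        protein_class[q] for q in partners if protein_class.get(q, unknown) != unknown
    )
    class_sizes = Counter(c for c in protein_class.values() if c != unknown)
    n_proteins = len(protein_class)  # order of the network
    scores = {}
    if not partners:
        return scores  # no neighbours, so nothing can be scored
    for label, size in class_sizes.items():
        expected = len(partners) * size / n_proteins
        scores[label] = (observed[label] - expected) ** 2 / expected
    return scores
```

The predicted class is then `max(scores, key=scores.get)`. Note that a squared deviation is also large when a class is under-represented, so an implementation may wish to consider only classes whose observed count exceeds the expected one.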

As an illustration, protein ‘fadA6’ (MT3660 or Rv3557c), named Acetyltransferase FADA6 (UniProt accession P96834), which is involved in lipid metabolism (figure 11), is functionally linked to proteins annotated to the lipid metabolism class. This means that if the protein ‘fadA6’ had not been classified, it would likely have been annotated to the lipid metabolism class. Similarly, protein ‘lprJ’ (MT1729 or Rv1690), named lipoprotein LPRJ (O33192), is also known to be involved in lipid metabolism (figure 12). All its direct interacting partners are of the unknown class, so if the class of ‘lprJ’ were not known, the use of level-1 neighbors would fail to classify this protein. However, using the level-2 neighbors would successfully classify it. Finally, figure 13 shows protein MT1417 (Rv1372, Q7D8I1), which is of unknown class in Tuberculist, but is suggested by UniProt to belong to the chalcone/stilbene synthase family, known to be involved in lipid metabolism. The prediction method annotates this protein to lipid metabolism, thus supporting this suggestion.

Figure 11. Illustration of Guilt-By-Association using level-1 interacting neighbors for protein classification.

P-subgraph showing the direct interacting partners of protein ‘fadA6’ (shown in white at the center). Proteins in white are involved in lipid metabolism, while the gray nodes are of the unknown class.

Figure 12. Illustration of Guilt-By-Association using level-2 interacting neighbors for protein classification.

Graph depicting level-1 and level-2 interacting partners of protein ‘lprJ’. Proteins in white are involved in lipid metabolism and those shown in gray are of unknown class.

Figure 13. Illustration of protein functional classification inference.

P-subgraph showing the direct interacting partners of protein ‘MT1417’ (gray node in the center) of unknown class. Proteins in white are involved in lipid metabolism.

Once again, the classification performance of these approaches can be evaluated with modified ROC curve analyses. We used leave-one-out cross-validation to evaluate these prediction approaches by computing the number of proteins correctly classified and the number incorrectly classified. Note that when using the level-2 interacting neighbors to classify a protein, each instance of a protein is counted, i.e., if a given level-2 neighbor interacts with two different direct interacting neighbors, it will be counted twice. In order to compare the effectiveness of these approaches, we combined their modified ROC curves, and the results are shown in figure 14. These results indicate that while the level-2 interacting partners may be used to improve coverage, they contain many false positives, which impacts negatively on precision. Combining level-1 and level-2 interacting partners slightly improves precision and coverage. These two measures of protein classification quality are computed as follows:

graphic file with name pone.0018607.e312.jpg

where TP (true positive) is the number of proteins correctly classified, i.e., the number of proteins for which the actual classification is the same as the one predicted, FP (false positive) is the number of proteins for which the actual classification differs from the one predicted, and Inline graphic is the total number of classified proteins in the functional network. Thus, precision measures the proportion of proteins with correct classifications among all proteins classified, and coverage measures the proportion of proteins correctly classified among the proteins in the functional network. The use of level-1 neighbors provides a precision of Inline graphic with a coverage of Inline graphic, while the use of level-2 neighbors produces a precision of Inline graphic with a coverage of Inline graphic. Combining level-1 and level-2 neighbors yields a precision of Inline graphic with a coverage of Inline graphic. This is only a slight improvement over using level-1 neighbors only, but the illustration for ‘lprJ’ above shows the value in using both.
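Following the verbal definitions above, precision is TP / (TP + FP) and coverage is TP divided by the total number of classified proteins in the network. The sketch below computes both from two dictionaries. The function name and the dictionary layout are assumptions for illustration, and the predictions are assumed to come from a leave-one-out run in which every predicted protein also has a known class.

```python
def precision_and_coverage(predicted, actual):
    """Return (precision, coverage) for a set of class predictions.

    predicted -- protein -> predicted class, only for proteins that
                 received a prediction
    actual    -- protein -> known class, for every classified protein
                 in the network
    """
    true_pos = sum(1 for p, label in predicted.items() if actual.get(p) == label)
    false_pos = len(predicted) - true_pos
    precision = true_pos / (true_pos + false_pos) if predicted else 0.0
    coverage = true_pos / len(actual) if actual else 0.0
    return precision, coverage
```

Proteins that receive no prediction lower coverage but leave precision unchanged, which is why adding second level neighbors can raise coverage while the extra false positives pull precision down.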

Figure 14. Performance evaluation of classification prediction approaches.

Number of proteins incorrectly classified (false positives) versus number of proteins correctly classified (true positives) using level-1, level-2, and combined level-1 and level-2 interacting partners to improve coverage.

Conclusions

We have developed novel information-theoretic schemes for calculating link confidence scores, or link reliability, for homology data, i.e., data from protein family and sequence similarity. These convert the amount of biological content shared between proteins into confidence scores of their functional relationships. The methods could be used for clustering analysis, but here they are used for functional network generation.

We applied these schemes to the genome of Mycobacterium tuberculosis strain CDC1551 to produce a protein-protein functional network. Results showed that the novel scheme is efficient and effective compared to the existing schemes and can be used to improve functional networks inferred from sequence data in terms of precision and coverage.

We analyzed the global behaviour of the network obtained from the new scoring schemes. Furthermore, the functional network produced was used to classify proteins in the unknown class using a local neighborhood based approach extended to level-2 protein neighbors in order to improve genomic coverage.

Currently, we are integrating all pair-wise functional interactions obtained from different data sources, including genetic interactions and functional genomics data, into a single protein-protein functional network, in order to predict functions, where possible, of uncharacterized proteins in the genome and to study the biology of the organism.

Supporting Information

Table S1

# scores of functional interactions derived from sequence data.

(XLS)

Acknowledgments

Any work dependent on open-source software owes a debt to those who developed these tools. The authors thank everyone involved with free software, from the core developers to those who contributed to the documentation. Many thanks to the authors of the freely available libraries for making this work possible.

Footnotes

Competing Interests: The authors have declared that no competing interests exist.

Funding: This work was funded by the National Bioinformatics Network in South Africa, grant number NBN RFA2008. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Baldi P, Brunak S. BIOINFORMATICS: The Machine Learning Approach, Massachusetts Institute of Technology 2001
  • 2. Hunter S, Apweiler R, Attwood TK, Bairoch A, Bateman A, et al. InterPro: the integrative protein signature database, Nucleic Acids Research. 2009;37:D211–D215. doi: 10.1093/nar/gkn785.
  • 3. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. A basic local alignment search tool, Journal of Molecular Biology. 1990;215(3):403–410. doi: 10.1016/S0022-2836(05)80360-2.
  • 4. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Research. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389.
  • 5. UniProt Consortium. The Universal protein resources, Nucleic Acids Research. 2007;35:D224–D228. doi: 10.1093/nar/gkl929.
  • 6. Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, et al. New Development in InterPro Database, Nucleic Acids Research. 2007;35:D224–D228. doi: 10.1093/nar/gkl841.
  • 7. Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, et al. InterPro, progress and status in 2005, Nucleic Acids Research. 2005;33:D201–D205. doi: 10.1093/nar/gki106.
  • 8. Chua HN, Sung WK, Wong L. Exploiting indirect neighbours and topological weight to predict protein function from protein-protein interactions, Bioinformatics. 2006;22:1623–1630. doi: 10.1093/bioinformatics/btl145.
  • 9. Myers CL, Troyanskaya OG. Context data integration and prediction of biological networks, Bioinformatics. 2007;23(17):2322–2330. doi: 10.1093/bioinformatics/btm332.
  • 10. Chua HN, Sung WK, Wong L. An efficient strategy for extensive integration of diverse biological data for protein function prediction, Bioinformatics. 2007;23(24):3364–3373. doi: 10.1093/bioinformatics/btm520.
  • 11. von Mering C, Jensen LJ, Snel B, Hooper SD, Krupp M, et al. STRING: known and predicted protein-protein associations, integrated and transferred across organisms, Nucleic Acids Research. 2005;33:D433–D437. doi: 10.1093/nar/gki005.
  • 12. Devos D, Valencia A. Practical limits of function prediction, PROTEINS: Structure, Function, and Genetics. 2000;41(1):98–107.
  • 13. Mahdavi MA, Lin Y-H. Prediction of Protein-Protein Interactions Using Protein Signature Profiling, Genomics, Proteomics & Bioinformatics. 2007;5(3–4):177–186. doi: 10.1016/S1672-0229(08)60005-4.
  • 14. Mao X, Cai T, Olyarchuk JG, Wei L. Automated genome annotation and pathway identification using the KEGG Orthology (KO) as a controlled vocabulary, Bioinformatics. 2005;21(19):3787–3793. doi: 10.1093/bioinformatics/bti430.
  • 15. Yellaboina S, Goyal K, Mande SC. Inferring genome-wide functional linkages in E. coli by combining improved genome context methods: Comparison with high-throughput experimental data, Genome Research. 2007;17:527–535. doi: 10.1101/gr.5900607.
  • 16. Raman K, Yeturu K, Chandra N. targetTB: A target identification pipeline for Mycobacterium tuberculosis through an interactome, reactome and genome-scale structure analysis, BMC Systems Biology. 2008;2 doi: 10.1186/1752-0509-2-109.
  • 17. Krawczyk J, Kohl TA, Goesmann A, Kalinowski J, Baumbach J. From Corynebacterium glutamicum to Mycobacterium tuberculosis-towards transfers of gene regulatory network and integrated data analyses with MycoRegNet, Nucleic Acids Research. 2009:1–15. doi: 10.1093/nar/gkp453.
  • 18. Jensen LJ, Kuhn M, Stark M, Chaffron S, Creevey C, et al. STRING 8-a global view on proteins and their functional interactions in 630 organisms, Nucleic Acids Research. 2008;37:D412–D416. doi: 10.1093/nar/gkn760.
  • 19. Bastian O, Ortet P, Roy S, Maréchal E. A configuration space of homologous proteins conserving mutual information and allowing a phylogeny inference based on pair-wise Z-score probabilities, BMC Bioinformatics. 2005;6 doi: 10.1186/1471-2105-6-49.
  • 20. Bastian O, Maréchal E. Evolution of Biological sequences implies an extrema value distribution of type I for both global and local pair-wise alignments scores, BMC Bioinformatics. 2008;9 doi: 10.1186/1471-2105-9-332.
  • 21. Hartley RVL. Transmission of Information, The Bell System Technical Journal. 1928;3:535–564.
  • 22. Shannon CE. A Mathematical Theory of Communication, The Bell System Technical Journal. 1948;27:379–423.
  • 23. Pearson WR. Protein sequence comparison and Protein evolution, Tutorial-ISMB2000
  • 24. MacKay DJC. Information Theory, Inference, and Learning Algorithms, Cambridge University Press. 2004.
  • 25. Altschul SF. Amino acid substitution matrices from an information theoretic perspective, J. Mol. Biol. 1991;219:555–565. doi: 10.1016/0022-2836(91)90193-A.
  • 26. Li M, Chen X, Li X, Ma B, Vitányi PMB. The Similarity Metric, IEEE Transactions on Information Theory. 2004;50(12):3250–3264.
  • 27. Subramanian G, Koonin EV, Aravind L. Comparative Genome Analysis of the Pathogenic Spirochetes Borrelia burgdorferi and Treponema pallidum, Infection and Immunity. 2000;68(3):1633–1648. doi: 10.1128/iai.68.3.1633-1648.2000.
  • 28. von Mering C, Jensen LJ, Kuhn M, Chaffron S, Doerks T, et al. STRING 7-recent developments in the integration and prediction of protein interactions, Nucleic Acids Res. 2007;35:D358–D362. doi: 10.1093/nar/gkl825.
  • 29. Aaron PG, Sonia ML, William AB, Lawrence EH, Debra SG. Improving protein function prediction methods with integrated literature data, BMC Bioinformatics. 2008;9 doi: 10.1186/1471-2105-9-198.
  • 30. Vazquez A, Flammini A, Maritan A, Vespignani A. Global protein function prediction from protein-protein interaction networks, Nature Biotechnology. 2003;21(6):697–700. doi: 10.1038/nbt825.
  • 31. Tsuda K, Shin H, Schölkopf B. Fast protein classification with multiple networks, Bioinformatics. 2005;21:ii59–ii65. doi: 10.1093/bioinformatics/bti1110.
  • 32. Nabieva E, Jim K, Agarwal A, Chazelle B, Singh M. Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps, Bioinformatics. 2005;21(1):i302–i310. doi: 10.1093/bioinformatics/bti1054.
  • 33. Troyanskaya OG, Dolinski K, Owen AB, Altman RB, Botstein D. A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae), PNAS. 2003;100(14):8348–8353. doi: 10.1073/pnas.0832373100.
  • 34. Deng M, Chen T, Sun F. An Integrated Probabilistic Model for Functional Prediction of Proteins, Journal of Computational Biology. 2004;11(2–3):463–475. doi: 10.1089/1066527041410346.
  • 35. Letovsky S, Kasif S. Predicting protein function from protein/protein interaction data: a probabilistic approach, Bioinformatics. 2003;19(Suppl 1):i197–i204. doi: 10.1093/bioinformatics/btg1026.
  • 36. Cho Y-R, Shi L, Ramanathan M, Zhang A. A probabilistic framework to predict protein function from interaction data integrated with semantic knowledge, BMC Bioinformatics. 2008;9 doi: 10.1186/1471-2105-9-382.
  • 37. Lanckriet GRG, Deng M, Cristianini N, Jordan MI, Noble WS. Kernel-Based Data Fusion and Its Application to Protein Function Prediction in Yeast, Pacific Symposium on Biocomputing. 2004;9:300–311. doi: 10.1142/9789812704856_0029.
  • 38. Chen Y, Xu D. Global protein function annotation through mining genome-scale data in yeast Saccharomyces cerevisiae, Nucleic Acids Research. 2004;32(21):6414–6424. doi: 10.1093/nar/gkh978.
  • 39. Xiong J, Rayner S, Luo K, Li Y, Chen S. Genome wide prediction of protein function via a generic knowledge discovery approach based on evidence integration, BMC Bioinformatics. 2006;7 doi: 10.1186/1471-2105-7-268.
  • 40. Murali TM, Wu CJ, Kasif S. The art of gene function prediction, Nature Biotechnology. 2006;24(12):1474–1475. doi: 10.1038/nbt1206-1474.
  • 41. Schwikowski B, Uetz P, Fields S. A network of protein-protein interactions in yeast, Nature Biotechnology. 2000;18(12):1257–1261. doi: 10.1038/82360.
  • 42. Chua HN, Sung WK, Wong L. Using Indirect Protein Interactions for the Prediction of Gene Ontology Functions, BMC Bioinformatics. 2007;8(4) doi: 10.1186/1471-2105-8-S4-S8.
  • 43. Deng M, Sun F, Chen T. Assessment of the reliability of protein-protein interactions and protein function prediction, Pacific Symposium on Biocomputing. 2003;8:140–151.

Articles from PLoS ONE are provided here courtesy of PLOS
