Abstract
In recent years, gene networks have become one of the most useful tools for modeling biological processes. Many inference gene network algorithms have been developed as techniques for extracting knowledge from gene expression data. Ensuring the reliability of the inferred gene relationships is a crucial task in any study in order to prove that the algorithms used are precise. Usually, this validation process can be carried out using prior biological knowledge. The metabolic pathways stored in KEGG are one of the most widely used knowledgeable sources for analyzing relationships between genes. This paper introduces a new methodology, GeneNetVal, to assess the biological validity of gene networks based on the relevance of the gene-gene interactions stored in KEGG metabolic pathways. Hence, a complete KEGG pathway conversion into a gene association network and a new matching distance based on gene-gene interaction relevance are proposed. The performance of GeneNetVal was established with three different experiments. Firstly, our proposal is tested in a comparative ROC analysis. Secondly, a randomness study is presented to show the behavior of GeneNetVal when the noise is increased in the input network. Finally, the ability of GeneNetVal to detect biological functionality of the network is shown.
1. Background
Modeling process occurring in living organisms is one of the main goals in bioinformatics [1–4]. Gene networks (GNs) have become one of the most important approaches to discover which gene-gene relationships are involved in a specific biological process.
A GN can be represented as a graph where genes, proteins, and/or metabolites are represented as nodes and their relationships as edges [1].
It is important to note that GNs can vary substantially depending on the model architecture used to infer the network. These models can be categorized into four main approaches according to Hecker et al. [1]: correlation [5, 6], logical [7–9], differential equation-based, and Bayesian networks [10, 11]. These approaches have been broadly used in bioinformatics. For example, Rangel et al. [12] used linear modeling to infer T-cell activation from temporal gene expression data, or Faith et al. [13] adapted correlation and Bayesian networks to develop a method for inferring the regulatory interactions of Escherichia coli.
Once a model has been generated, it is very important to assure the algorithm reliability in order to demonstrate its efficacy. The quality of the algorithm(s) can be measured by applying so-called synthetic data [14] and/or by using prior biological knowledge [15]. Synthetic data approaches can be used to analyze the performance of the GN inference algorithm, whereas a study of biological validity is supported by real data.
Synthetic data methods produce an artificial data set according to a previously known network. The values of the simulated gene expression are stored in a data set and used as input for the GN inference algorithm. Finally, the performance of the algorithm is tested comparing both GNs. Currently, this process can be carried out using different tools as GeneNetWeaver [16] or SynTReN [17].
Although this approach is commonly used for comparing inference algorithms, it can not fully reproduce the internal features of real biological processes. This drawback means they are not suitable for the validation of the inferred models, from a biological point of view.
To address this issue, comparison with prior biological knowledge has been proposed [18, 19]. Currently, there are a number of different available biological repositories where the Kyoto encyclopedia of genes and genomes (KEGG) is one of the most widely used for analyzing relationships between genes [20, 21]. KEGG's metabolic pathways contain knowledge about different biological processes. These pathways are represented as a graph where nodes represent genes, enzymes, or compounds (i.e., carbohydrates, lipids, and amino acids) and edges encode relationships, reactions, or interactions between the nodes. The pathways contained in the KEGG database represent the actual knowledge of molecular interaction and reaction networks for metabolism, genetic information processing, environmental information processing, cellular processes, and human disease. They provide useful structured information for gene network validation. For example, C. Li and H. Li [15] used KEGG transcriptional pathways to perform a network analysis of the glioblastoma microarray data, or Ko et al. [22] tested a new Bayesian network approach using gene-gene relationships stored in KEGG. In this line we proposed a GN validation framework based on a direct comparison between a gene network and KEGG pathways [23].
The aforementioned approaches, hereafter called the classical use of KEGG, present three major shortcomings: (a) not all the biological information is used, (b) only strong gene-gene relationships are considered, and (c) the current biological knowledge is not complete.
Gene-gene relationships are only usually considered by metabolic pathways based GN validation approaches. Hence, all other biological information provided by pathways is ignored, such as gene-compound or compound-compound relationships (see Table 1). For example, Wei and Li [24] only used human gene-gene interactions stored in the KEGG pathways when performing simulation studies, excluding gene-compound and compound-compound relationships. Or Zhou and Wong [25] used the relationship between KEGG gene pairs (mainly PPrel and ECrel) to study protein-protein interaction data sets.
Table 1.
KEGG relationships | Description |
---|---|
Gene relationships | |
ECrel | Enzyme-enzyme relation |
PPrel | Protein-protein interaction |
GErel | Gene expression interaction |
PCrel | Protein-compound interaction |
Others | |
compound-compound | Compound-compound interaction |
Furthermore, current GN validation approaches are not entirely accurate as they only consider strong relationships between genes (direct gene-gene interactions), leaving weaker relationships to one side [4].
In addition, the use of prior biological knowledge could present another important lack, the current limitations of the biological databases. As described by Dougherty and Shmulevich [2], biological knowledge has some intrinsic limitations in the sense that they depend inherently on the nature of scientific knowledge. Others are contingent depending on the present state of knowledge, including technology. Current validation methods use these biological databases in order to classify the inferred relationships as true or false positives. Due to the intrinsic problem of the biological databases, it is not possible to argue that these false positives are actually caused by a bad prediction from the inference methods or because of incomplete knowledge.
This paper proposes a new methodology, GeneNetVal, to analyze the biological validity of a gene network by utilizing the biological information stored in KEGG by weighting the gene-gene relationships. GeneNetVal uses different types of relationships contained in KEGG pathways (gene-gene, gene-compound, and compound-compound), carrying out an exhaustive and complete conversion of a pathway into a gene network. The network obtained will be used as a gold standard in comparison with the input network. Moreover, a novel matching distance is proposed. This measure, based on gene-gene interaction relevance, takes into account the concept of weak relationships between a pair of genes to present a set of nondeterministic indices with different levels of accuracy. Thus, we do not categorically accept or refuse a gene-gene relationship, but a weighted value is assigned according to distance of those genes in the pathway. Through these values we generated a new gene network validity measure and mitigate the problem of the incomplete biological knowledge.
2. Methods
In this section, the GeneNetVal methodology and also the methods used to perform the experiments will be presented. These methods will be used in Results and Discussion section.
2.1. GeneNetVal Methodology
As already stated, the two-step methodology proposed, GeneNetVal, is based on KEGG metabolic pathways and summarized in Figure 1. In the first step, a complete conversion of a metabolic pathway into a gene association network is carried out. In the second step, the biological validity of a GN is determined. In order to do this, a novel matching distance between networks is used.
2.1.1. Step One: From Metabolic Pathways to Gene Association Networks
KEGG database stores knowledge about many different organisms, but we only need the information pertaining to the network to be analyzed. Hence, only the KEGG metabolic pathways for the same organism of the input network are considered. This is represented in Figure 1, where all pathways of the organism o are extracted.
These pathways are converted into gene association networks where all types of pathway relationships (see Table 1), including gene-gene (PPrel, ECrel, and GErel), gene-compound (PCrel), and compound-compound, are used.
As stated previously, a metabolic pathway is composed of different types of nodes (genes or other compounds) while genes are only used in gene networks. This difference exhibits that direct comparison between them is unreliable based on the information containing different elements. This difference is overcome by increasing the abstraction level of the pathways. Concretely, each pathway is converted into a gene association network, the highest level of abstraction for reconstruction of gene regulatory processes as it is described by Martínez-Ballesteros et al. [30]. This conversion process is represented in Figure 2 and explained bellow.
Firstly, all the compound nodes presented in the pathway are removed. However, gene nodes are conserved along with their relationships of influence (nondirect edges), be they PPrel, ECrel, or GErel. The PCrel, compound-compound, and others relationships are processed in different way.
The compound nodes located between two genes carry information from one gene to another. They act as a bridge between the genes, so these two gene nodes should be related. Based on this, after removing the compound nodes, new undirected gene-gene relationships will be created. These relationships are established between each pair of genes that were previously associated with the same compound node.
Figure 2 shows the conversion process from “Pathway M” (Figure 1) to a gene network in detail. For example, genes 3 and 8 are associated with a compound node in the pathway but there is no direct relationship between them. However, the information pertaining to this indirect gene-gene influence should be taken into account so that a new influence relationship between genes is created. Similarly, a relationship is generated between genes 6 and 7.
The conversion presented in Figure 2 is a simple example; pathways are often more complex. In a pathway, multiple genes are likely related to the same compound node, or the chemical compounds are transferred by two or more genes/enzymes. These two cases should be considered to carry out an exhaustive conversion. In the first type, multiple genes somehow interact with the same compound (substrate of a chemical reaction, product, etc.). This biological information is preserved creating new relationships (see Figure 3(a)). In the second group, the genes responsible for transferring the compounds should be related in the new GN, since they actually interact with the chemical compounds simultaneously. Hence, new relationships between these genes are included (see Figure 3(b)).
2.1.2. Second Step: Biological Validity
In the second step, the metabolic pathways are used as biological knowledge to evaluate the input network. Usually, the literature applies a scoring methodology [1, 27, 29] to evaluate an inferred model using prior knowledge, be it synthetic or biological data. Based on this idea and on the notion of the strong and weak relationships in GNs [4], the authors have developed a novel measure for evaluating the validity of an input network that is based on the relevance of the gene-gene interactions stored in KEGG.
Let G I = {N I, E I} and G P = {N P, E P} be two graphs, where (N I, N P) represent the nodes of the graphs and (E I, E P) represent the edges (gene-gene relationships). The validity of the input graph (G I), according to the biological information of pathway P represented in the graph G P, is measured as the difference between both graphs at certain level of distance.
Definition 1 (Level). —
Let a graph G = {N, E} and two nodes g α, g β ∈ N. The level of the relationship between (g α, g β) is calculated as the number of edges between nodes g α and g β in G.
For example, in Figure 4, the relationship between nodes 3 and 7 in G P has a level of 2 because there are two edges between these nodes.
Definition 2 (Hits at level l (Hit l)). —
The number of edges where the level between the nodes directly connected in G I is l in G P.
An example of Hit1 and Hit2 can be found in Figure 4, where the edge between genes 3 and 2 represents the Hit1 and the edge between 3 and 7 is Hit2. Obviously the greater the distance between nodes, the lower the relevance of the evaluated relationship. Thus, the new matching distance provides two weighted indices through comparison with the selected level.
Definition 3 . —
Cumulative hits at level n, H n, can be defined as the weighted sum of correctly inferred edges at level n in G I according to the information presented in G P. Consider
(1) where H n denotes the sum of edges that were correctly inferred weighted by their relevance in the network with distance (level) n.
Figure 4 presents an example of calculation of H 1 and H 2.
Definition 4 . —
Cumulative failures at level n, F n, can be defined as the number of incorrect inferred edges at level n in G I
(2) where ||E I|| is the number of edges in G I. Thus, F n denotes the number of edges that were not correctly inferred in the network with distance (level) n.
Figure 4 shows an example of calculation of F 1 and F 2. At level 1, the graph presents one cumulative failure because of the genes 3 and 7, which are directly connected in G I and have a distance of 2 in G P. As the interaction between 3 and 7 is weak (hit of level 2), the value of the cumulative failure level 2 is 0.5. Accordingly, the validity measure can be defined.
Definition 5 . —
The validity (GeneNetVal measure) of graph G I according to G P level n, V n, is defined as the proportion of correctly inferred edges at level n in G I. Consider
(3) This measure ranges between 0 and 1, where 0 is the lowest validity value and 1 the highest. The validity measure estimates the ratio of correctness of G I with respect to G P.
The biological validity is obtained as the proportion of positive prediction according to the cumulative hits and failures. This is the principal measure obtained by our methodology to rate the quality of a GN.
2.2. ROC Study
A receiver operating characteristic (ROC) analysis will be presented in the Results section. The goal of this study is to compare the performance of different gene network validity approaches, evaluating real networks against random networks (without biological sense). The three networks will be used in the experiment, attempt to encompass the regulation of a large number of functional processes in yeast. Hence, we have assumed that these networks contain biological meaning of each functional process described in the KEGG pathways (they are functionally complex networks).
Therefore, the evaluation of these networks should produce relevant validity results for each of the pathways considered. In contrast, the biological validity of random networks ought to yield poor results because, in fact, they should not contain biological meaning.
A validity threshold (T) has been used to decide if the input network has relevant information for each selected pathway. T denotes the minimum validity value for a network with a specific pathway to be considered as valid value. In order to generate the ROC curve for each experiment, we have used 101 different T values (from 0 to 1). A confusion matrix is obtained for each iteration. If the validity value obtained for a pathway exceeds the T value, the input network is classified as a positive (true positive or false positive, depending on whether the input network is a real network or a random network). If the obtained value is lower, the input network is described as a negative (true negative or false negative). With this idea, for each iteration the indexes are computed for the confusion matrix.
Hence, it is possible to calculate 101 confusion matrices and 101 true positive rates (TPR) and false positive rates (FPR) values to draw the curve.
Figure 5 provides a toy example showing the entire process (only for one random network). It offers a comparison between the results obtained by a real network and the results obtained by a random network. With the validity values obtained for both networks (Figure 5(a)), different confusion matrices were generated according to different thresholds, only 4 thresholds in this example (Figure 5(b)). Thus, for each iteration it is possible to obtain the values of TPR and FPR (Figure 5(c)). With these values, the ROC curve is finally represented (Figure 5(d)).
It is important to note that the results presented in Figure 6 are average values for a sample of random networks.
2.3. Selecting Functional Description with GeneNetVal
The specific functionality of the input network could be studied in accordance with the biological process information store in a specific KEGG pathway. A metabolic pathway represents a model of a particular biological process. Different sets of genes are involved in different pathways. This should be considered if a functional assessment of the input network is performed. If a pathway contains a set of genes, this set is annotated to the pathway's biological function. Hence, any information from the input network that does not belong to the specific biological process will be not taken into account for this validation. Note that these relationships should not be considered as a failure because actually there is no information to classify the validity of the interactions from genes in the input network that are not present in metabolic pathways.
This pruning process, which is depicted in Algorithm 1, entails removing any edge from the input network if the corresponding genes are not present in the specific pathway. The input network will suffer a different pruning for each pathway. Through this pruning, the input network can be evaluated independently for each process. An example of this pruning is shown in Figure 4 where the purple edges are removed for the comparison with the pathway.
After pruning, the comparisons with each pathway will show the validity measure. The functionality described by the pathway with the highest value of V n (GeneNetVal measure) will be the functionality that best fits the input network. A high V n value means that the input network fully or partially describes the functionality that is described by that particular metabolic pathway.
Hence, M different comparisons have been carried out in Figure 1, where the highest value was generated by the gene network extracted from “pathwayM.”
It is also possible for the input network to contain information about more than one specific biological process. Alternatively, the biological processes are usually interrelated (e.g., the cell cycle and the meiosis). An example of this situation in Figure 1 might be the comparison between the gene network from “pathw a y2” and the input network. In that case, the highest values of the validity measure could be considered, to determine which processes are better described.
3. Results and Discussion
The performance of our proposal was tested through three experiments using different types of networks. Firstly our proposal was compared with the classical use of KEGG. A ROC analysis of different distance level of GeneNetVal and precision measure were carried out. The behavior of the method proposed with different noise level is tested in the second experiment. Finally, the ability of GeneNetVal to detect the biological functionality encoded in a input network is analyzed in the third experiment.
3.1. ROC Analysis
The ROC analysis was performed to show the improvement achieved by our approach over those that only consider direct gene-gene relationships [24, 25], along with its robustness against information without biological meaning (see Section 2.2).
ROC analysis has been widely used in the literature [31, 32] because it is able to score the performance of classifiers and rankers as a trade-off between a true positive rate and false positive rate. Additionally, the area under the ROC curve (AUC) is presented, as it provides information about the level of randomness of the approach.
For this study three complex and contrasting yeast gene networks with different types of gene relationships were used. A protein-protein interaction network was used by Batada et al. [33] in the analysis of highly connected proteins in a network (hubs). The network resulting of selecting the protein-protein and protein-DNA interactions of the Saccharomyces Genome Database (SGD) [34] provides an access to the complete Saccharomyces cerevisiae (yeast) genomic sequence. And, finally, the network was presented by Lee et al. [35] (YeastNet v.2) which combines protein-protein, protein-DNA, coexpression, phylogenetic conservation, and literature information.
For every input network explained above, two different topologies of random networks were considered: pure random and scale-free. This latter topology is used since biological networks usually follow it [36, 37].
The sample size for each input network and topology was calculated with a confidence interval of 95% for an infinite population of networks [38]. Hence, a sample size of 385 random networks was used. Pure random networks were designed to have the same node and edge size as the input network, but gene-gene relationships were randomly generated. Scale-free networks were generated using the open source library JGraphT, with the same nodes as well. To use information stored in KEGG, we extracted the KGML files of yeast pathways using the KEGG API.
The results of the analysis are represented in Figure 6, where each row represents the study of a different input network. The left column in the figure represents the study for pure random topology, and the rightmost shows the scale-free topology. Each graph contains five lines that encode the behavior of GeneNetVal considering the distance levels from one to four and the precision measure [30, 39] for the classical use of KEGG. In total, more than 11000 (3 input networks × 2 topologies × 5 measures/levels × 385 networks) evaluations were carried out.
The ROC curves show that the results of the three networks follow a similar pattern for both topologies. Particularly striking is the distance between the point (1, 1) and the one above. FPR is 1 for a threshold equal to zero (see Section 2.2 for more details) but represents a very low value for the next checkpoint (threshold = 0.01). This could be due to the fact that the use of KEGG as gold standard is very effective detecting interactions with no biological meaning.
For some levels the lines do not start at point (0, 0) (Figures 6(b) and 6(d)). This is because some KEGG pathways do not contain many interactions (e.g., sce00062 pathway contains only 5); so a random network might contain those gene relationships at a certain distance level.
Regarding the values obtained for the area under the curve (AUC), it is important to note that the major is the number of type of relationships considered in the network the better the methodology performs. The best results are obtained by Lee's network [35] which combines four different types of relationships. The second best result is generated using SGD, while Batada's network presents the worst result. This makes sense since KEGG pathways gather biological data from various contrast sources.
Comparing the classical use of KEGG with level 1 of our proposal, which only differs on how the pathways information is managed, is possible to argue that the conversion proposed produces significant improvement in AUC. Level 1 produces better result in all cases. For example, the AUC value of 0.88 is increased to 0.92 in SGD for scale-free topology (Figure 6(d)). Furthermore, it is possible to improve AUC by increasing the distance level in the comparison. The best result is shown by level 2, while levels 3 and 4 perform worse than levels 1 and 2.
The results presented show that GeneNetVal is capable of detecting gene relationships with and without biological meaning. Furthermore, the methodology presents a significant improvement compared to the classical approach (precision) for all levels studied. In particular, the best performance is obtained by level 2 for all the experiments.
Finally, in spite of the fact that biological databases are crucial information sources for evaluating results obtained in any study, they have some limitations. These limitations are intrinsic to all of them, in the sense that they depend inherently on the nature of scientific knowledge; others are contingent, depending on the present state of knowledge, including technology [2, 40]. Such limitations can include incorrect event or entity labels, missdirections in the relationships, absence of associations, and other ambiguities. Consequently, the performance of prior knowledge-based methods could be affected by these limitations, including our approach. In particular, GeneNetVal could be affected for incorrect event or entity labels and also for the absence of association in the metabolic pathways in terms of bad classification of relationships (incorrect hit or failure). Despite this fact, it is worth mentioning that the classical approaches are also affected for the problems presented above. In this sense, GeneNetVal presents a more robust performance than the classical approaches, since the use of indirect relationships mitigates these problems. This affirmation is supported by the results presented in this ROC analysis, where GeneNetVal performs better than the classical approach even though the same databases (containing the same lacks) are used in both methods.
3.2. Randomness Study
Despite the fact that in the ROC analysis section it was shown that GeneNetVal is better distinguishing real networks from random networks than a classical approach extracted from the literature, in this section the behavior of the methodology to the progressive inclusion of noise will be shown.
Concretely, we have carried out the study for all of the yeast networks which were previously presented in the paper (Batada, Lee, and SGD networks). These input networks were changed increasing randomness in their gene relationships. Hence, in a loop process composed of 10 iterations, the random relationships added to the networks were increased in 10% at each iteration. In the same way a 10% of the original relationships were removed. To avoid bias, this was done 385 times (sample size with a confidence interval of 95% assuming an infinite population of random networks) [38]. Therefore, 15360 (385 networks × 10 iterations × 4 original networks) different random networks were analyzed.
According to the results presented in the ROC analysis section, the validity value level 2 was considered in this experiment. As gold standard, we have used the pathway sce04111 (yeast cell cycle) since it is one of the most studied pathways from yeast [41–43]. The results averages are summarized in Figure 7.
Figure 7 presents the evolution of the validity values for the yeast networks. It can be observed that the different validity values follow a similar behavior. This behavior verifies that the loss of relevant information in the networks is progressive, and it increases as the randomness is increased in them as well. These results show that our method is able to detect the loss of information as the randomness increases in the networks.
3.3. A Functional Study: Yeast Cell Cycle Networks
In this section some well-known yeast networks are used to prove the usefulness of our approach by detecting specific biological functionality as it was described in Section 2.3. These networks were produced by applying different gene networks inference approaches to the same time-series yeast cell-cycle microarray [44]. Concretely, the networks were generated by applying the approaches of the network presented by Nariai et al. [26], which is obtained through a Bayesian-based algorithm; Bulashevska and Eils [28] that is another Bayesian-based algorithm; Ponzoni et al. [29] whose algorithm called GRNCORP is based on a combinatorial optimization; and finally the network presented by Gallo et al. [27] (called GRNCORP2) that is a performance improvement of GRNCORP.
For this study, all the information stored in KEGG has been brought together in a single complex network. This global network (KEGG global network, KGN) is generated according to the knowledge gathered in each gene association network generated from Saccharomyces cerevisiae pathways. The aim of KGN is to conduct a global evaluation of the different networks to decide whether the networks contain biological knowledge or not. Specifically, the evaluation has been performed with level 2, according to the results obtained in the ROC analysis section. To compare the gene networks, only the relationships between genes contained in the input network and KGN have been considered. It is not possible to establish the quality of those interactions, because KEGG contains no information to ascertain whether the gene-gene interactions are biologically relevant or not.
In Table 2, the KGN rows, the global evaluation results, are shown. It is worth mentioning that two of the four networks obtain better validity results with the KGN because of the inclusion of a greater number of the indirect relationships (Hit2).
Table 2.
Once the results of the global KGN network are obtained, a specific functional biological analysis was performed. This analysis reveals whether the cell cycle's network describes a specific biological process or if the information is dispersed among all the pathways stored in the organism.
Therefore, all input networks have been compared with each of the 105 GNs obtained from Saccharomyces cerevisiae pathways. The most representative results were obtained with the sce04111 and sce04113 pathways, which represent the cell cycle and meiosis processes, respectively. These results are presented in the sce04111 and sce04113 rows of Table 2. They show that the networks store the majority of its biological information regarding the cell cycle metabolic pathway. For that reason, we obtain practically the same values using the cell cycle pathway or KGN for the evaluation.
Moreover, all the hits (Hits1 and Hits2) obtained in comparison to KGN are also found in cell cycle pathway sce04111. This information is summarized in Table 2 where the results of Hits1 and Hits2 are presented for all networks studied. As expected, the functionality found in the global evaluation is completed in relation to the yeast cell cycle process.
The input networks also contain biological information for the meiosis pathway (sce04113) (last row of Table 2). This does not contradict the previous result, since meiosis is related to the cell cycle [45], sharing gene-gene relationships.
The presented results show that our approach is able to identify the biological functionality that best describes the input network. Moreover, it is also able to find more related subprocesses.
4. Conclusions
A new methodology to score the biological validity of gene networks is proposed. Our presented approach, called GeneNetVal, entails identifying biological knowledge included in the input network. It is based on the gene-gene interaction relevance between the genes involved in the KEGG metabolic pathways. The method provides a complete and exhaustive conversion from a pathway to a gene association network. This approach also uses the concept of weak relationships between genes to present a new matching distance with different distance levels.
Three different experiments have been carried out. Firstly, our approach was compared to the classical use of KEGG to score the gene network validity. Comparisons were made for three different Saccharomyces cerevisiae complex networks. To demonstrate the robustness of the methodology, ROC analysis was performed for pure random and scale-free topologies. The results show that the proposal represents a significant improvement over the classical use of KEGG for assessing gene networks. Furthermore, it is possible to obtain better results by increasing the level of GeneNetVal in the comparison, where level 2 presents the best performance for all the experiments.
Secondly, a randomness study was performed. This study shows GeneNetVal is able to detect the loss of information in the input networks as the noise increases in them. Hence, our proposal distinguishes biologically correct from less correct networks.
Finally, the ability of GeneNetVal in finding a specific biological functionality was tested using some yeast cell cycle networks. The results show that our proposal is able to identify the main biological process described in an input network and other related processes.
Acknowledgments
This research was partially supported by the Ministry of Science and Innovation, Projects TIN2011-28956-C02-1, and Pablo de Olavide University. Francisco Gómez-Vela thanks Fernando Casares for helpful discussions about metabolic pathways and weak relationships.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Authors' Contribution
Francisco Gómez-Vela and Norberto Díaz-Díaz conceived the method, conducted the study, and wrote the paper. Francisco Gómez-Vela designed and implemented the algorithm. Norberto Díaz-Díaz motivated the research problem. All authors read, edited, and approved the final paper.
References
- 1.Hecker M, Lambeck S, Toepfer S, van Someren E, Guthke R. Gene regulatory network inference: data integration in dynamic models—a review. BioSystems. 2009;96(1):86–103. doi: 10.1016/j.biosystems.2008.12.004. [DOI] [PubMed] [Google Scholar]
- 2.Dougherty ER, Shmulevich I. On the limitations of biological knowledge. Current Genomics. 2012;13(7):574–587. doi: 10.2174/138920212803251445. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Yeh C, Yeh H, Arias CR, Soo V. Pathway detection from protein interaction networks and gene expression data using color-coding methods and a∗ search algorithms. The Scientific World Journal. 2012;2012:14 pages. doi: 10.1100/2012/315797.315797 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Poyatos JF. The balance of weak and strong interactions in genetic networks. PLoS ONE. 2011;6(2) doi: 10.1371/journal.pone.0014598.e14598 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Butte AJ, Kohane IS. Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. Pacific Symposium on Biocomputing. 2000:418–429. doi: 10.1142/9789814447331_0040. [DOI] [PubMed] [Google Scholar]
- 6.Zoppoli P, Morganella S, Ceccarelli M. Timedelay-ARACNE: reverse engineering of gene networks from time-course data by an information theoretic approach. BMC Bioinformatics. 2010;11(1, article 154) doi: 10.1186/1471-2105-11-154. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Bornholdt S. Boolean network models of cellular regulation: prospects and limitations. Journal of the Royal Society Interface. 2008;5(1):S85–S94. doi: 10.1098/rsif.2008.0132.focus. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Glass L, Kauffman SA. The logical analysis of continuous, non-linear biochemical control networks. Journal of Theoretical Biology. 1973;39(1):103–129. doi: 10.1016/0022-5193(73)90208-7. [DOI] [PubMed] [Google Scholar]
- 9.Shmulevich I, Dougherty ER, Kim S, Zhang W. Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks. Bioinformatics. 2002;18(2):261–274. doi: 10.1093/bioinformatics/18.2.261. [DOI] [PubMed] [Google Scholar]
- 10.Needham CJ, Bradford JR, Bulpitt AJ, Westhead DR. A primer on learning in Bayesian networks for computational biology. PLoS Computational Biology. 2007;3(8, article e129) doi: 10.1371/journal.pcbi.0030129. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Zhang Q, Cao Y, Li Y, Zhu Y, Sun SSM, Guo D. A sub-space greedy search method for efficient Bayesian Network inference. Computers in Biology and Medicine. 2011;41(9):763–770. doi: 10.1016/j.compbiomed.2011.06.012. [DOI] [PubMed] [Google Scholar]
- 12.Rangel C, Angus J, Ghahramani Z, et al. Modeling T-cell activation using gene expression profiling and state-space models. Bioinformatics. 2004;20(9):1361–1372. doi: 10.1093/bioinformatics/bth093. [DOI] [PubMed] [Google Scholar]
- 13.Faith JJ, Hayete B, Thaden JT, et al. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biology. 2007;5(1, article e8) doi: 10.1371/journal.pbio.0050008. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Gill R, Datta S. A statistical framework for differential network analysis from microarray data. BMC Bioinformatics. 2010;11, article 95 doi: 10.1186/1471-2105-11-95. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Li C, Li H. Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics. 2008;24(9):1175–1182. doi: 10.1093/bioinformatics/btn081. [DOI] [PubMed] [Google Scholar]
- 16.Schaffter T, Marbach D, Floreano D. GeneNetWeaver: in silico benchmark generation and performance profiling of network inference methods. Bioinformatics. 2011;27(16):2263–2270. doi: 10.1093/bioinformatics/btr373. [DOI] [PubMed] [Google Scholar]
- 17.Van den Bulcke T, Van Leemput K, Naudts B, et al. SynTReN: a generator of synthetic gene expression data for design and analysis of structure learning algorithms. BMC Bioinformatics. 2006;7, article 43 doi: 10.1186/1471-2105-7-43. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Aguilar-Ruiz JS, Rodriguez-Baena DS, Diaz-Diaz N, Nepomuceno-Chamorro IA. CarGene: characterisation of sets of genes based on metabolic pathways analysis. International Journal of Data Mining and Bioinformatics. 2011;5(5):558–573. doi: 10.1504/ijdmb.2011.043033. [DOI] [PubMed] [Google Scholar]
- 19.Díaz-Díaz N, Aguilar-Ruiz JS. GO-based functional dissimilarity of gene sets. BMC Bioinformatics. 2011;12, article 360 doi: 10.1186/1471-2105-12-360. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Clemente JC, Ikeo K, Valiente G, Gojobori T. Optimized ancestral state reconstruction using Sankoff parsimony. BMC Bioinformatics. 2009;10(1, article 51) doi: 10.1186/1471-2105-10-51. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Binder H, Schumacher M. Incorporating pathway information into boosting estimation of high-dimensional risk prediction models. BMC Bioinformatics. 2009;10, article 18 doi: 10.1186/1471-2105-10-18. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Ko Y, Zhai CX, Rodriguez-Zas S. Inference of gene pathways using mixture Bayesian networks. BMC Systems Biology. 2009;3, article 54 doi: 10.1186/1752-0509-3-54. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Díaz-Díaz N, Gómez-Vela F, Rodriguez-Baena DS, Aguilar-Ruiz J. Gene regulatory networks validation framework based in kegg. Proceedings of the 6th International Conference on Hybrid Artificial Intelligent Systems (HAIS '11); May 2011; Wroclaw, Poland. pp. 279–286. [Google Scholar]
- 24.Wei Z, Li H. A Markov random field model for network-based analysis of genomic data. Bioinformatics. 2007;23(12):1537–1544. doi: 10.1093/bioinformatics/btm129. [DOI] [PubMed] [Google Scholar]
- 25.Zhou H, Wong L. Comparative analysis and assessment of M. tuberculosis H37Rv protein-protein interaction datasets. BMC Genomics. 2011;12(supplement 3):p. S20. doi: 10.1186/1471-2164-12-S3-S20. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Nariai N, Kim S, Imoto S, Miyano S. Using protein-protein interactions for refining gene networks estimated from microarray data by Bayesian networks. Pacific Symposium on Biocomputing. 2004;347:336–347. doi: 10.1142/9789812704856_0032. [DOI] [PubMed] [Google Scholar]
- 27.Gallo CA, Carballido JA, Ponzoni I. Discovering time-lagged rules from microarray data using gene profile classifiers. BMC Bioinformatics. 2011;12, article 123 doi: 10.1186/1471-2105-12-123. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Bulashevska S, Eils R. Inferring genetic regulatory logic from expression data. Bioinformatics. 2005;21(11):2706–2713. doi: 10.1093/bioinformatics/bti388. [DOI] [PubMed] [Google Scholar]
- 29.Ponzoni I, Azuaje FJ, Augusto JC, Glass DH. Inferring adaptive regulation thresholds and association rules from gene expression data through combinatorial optimization learning. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2007;4(4):624–633. doi: 10.1109/tcbb.2007.1049. [DOI] [PubMed] [Google Scholar]
- 30.Martínez-Ballesteros M, Nepomuceno-Chamorro IA, Riquelme JC. Discovering gene association networks by multi-objective evolutionary quantitative association rules. Journal of Computer and System Sciences. 2014;80(1):118–136. [Google Scholar]
- 31.Alakwaa FM, Solouma NH, Kadah YM. Construction of gene regulatory networks using biclustering and bayesian networks. Theoretical Biology and Medical Modelling. 2011;8, article 39 doi: 10.1186/1742-4682-8-39. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Xu T, Du LF, Zhou Y. Evaluation of GO-based functional similarity measures using S. cerevisiae protein interaction and expression profile data. BMC Bioinformatics. 2008;9, article 472 doi: 10.1186/1471-2105-9-472. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Batada NN, Reguly T, Breitkreutz A, et al. Stratus not altocumulus: a new view of the yeast protein interaction network. PLoS Biology. 2006;4(10, article e317) doi: 10.1371/journal.pbio.0040317. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Cherry JM, Hong EL, Amundsen C, et al. Saccharomyces genome database: the genomics resource of budding yeast. Nucleic Acids Research. 2012:D700–D705. doi: 10.1093/nar/gkr1029. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Lee I, Li Z, Marcotte EM. An improved, bias-reduced probabilistic functional gene network of baker's yeast, Saccharomyces cerevisiae. PLoS ONE. 2007;2(10, article e988) doi: 10.1371/journal.pone.0000988. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Albert R. Scale-free networks in cell biology. Journal of Cell Science. 2005;118, part 21:4947–4957. doi: 10.1242/jcs.02714. [DOI] [PubMed] [Google Scholar]
- 37.Khanin R, Wit E. How scale-free are biological networks. Journal of Computational Biology. 2006;13(3):810–818. doi: 10.1089/cmb.2006.13.810. [DOI] [PubMed] [Google Scholar]
- 38.Bartlett JE, Kotrlik JW, Higgins CC. Organizational research: determining appropriate sample size in survey research. Information Technology, Learning, and Performance Journal. 2001;19(1):43–50. [Google Scholar]
- 39.Cuevas Tello JC, Hernández-Ramírez D, García-Sepúlveda CA. Support vector machine algorithms in the search of KIR gene associations with disease. Computers in Biology and Medicine. 2013;43(12):2053–2062. doi: 10.1016/j.compbiomed.2013.09.027. [DOI] [PubMed] [Google Scholar]
- 40.Green ML, Karp PD. Genome annotation errors in pathway databases due to semantic ambiguity in partial EC numbers. Nucleic Acids Research. 2005;33(13):4035–4039. doi: 10.1093/nar/gki711. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Nagarajan S, Kruckeberg AL, Schmidt KH, et al. Uncoupling reproduction from metabolism extends chronological lifespan in yeast. Proceedings of the National Academy of Sciences of the United States of America. 2014 doi: 10.1073/pnas.1323918111. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Ponzoni I, Nueda M, Tarazona S, et al. Pathway network inference from gene expression data. BMC Systems Biology. 2014;8(supplement 2):p. S7. doi: 10.1186/1752-0509-8-S2-S7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Reshetova P, Smilde A, van Kampen A, Westerhuis J. Use of prior knowledge for the analysis of high-throughput transcriptomics and metabolomics data. BMC Systems Biology. 2014;8, supplement 2:p. S2. doi: 10.1186/1752-0509-8-S2-S2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Spellman PT, Sherlock G, Zhang MQ, et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell. 1998;9(12):3273–3297. doi: 10.1091/mbc.9.12.3273. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Honigberg SM, Purnapatre K. Signal pathway integration in the switch from the mitotic cell cycle to meiosis in yeast. Journal of Cell Science. 2003;116(11):2137–2147. doi: 10.1242/jcs.00460. [DOI] [PubMed] [Google Scholar]