Abstract
The Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database provides a manual curation of biological pathways that involve genes (or gene products), metabolites, chemical compounds, maps, and other entries. However, most applications and datasets involved in omics are gene or protein-centric requiring pathway representations that include direct and indirect interactions only between genes. Furthermore, special methodologies, such as Bayesian networks require acyclic representations of graphs. We developed KEGG2Net, a web resource that generates a network involving only the genes represented on a KEGG pathway with all of the direct and indirect gene-gene interactions deduced from the pathway. KEGG2Net offers four different methods to remove cycles from the resulting gene interaction network, converting them into directed acyclic graphs (DAGs). We generated synthetic gene expression data using the gene interaction networks deduced from the KEGG pathways and performed a comparative analysis of different cycle removal methods by testing the fitness of their DAGs to the data and by the number of edges they eliminate. Our results indicate that an ensemble method for cycle removal performs as the best approach to convert the gene interaction networks into DAGs. Resulting gene interaction networks and DAGs are represented in multiple user-friendly formats that can be used in other applications, and as images for quick and easy visualisation. The KEGG2Net web portal converts KEGG maps for any organism into gene-gene interaction networks and corresponding DAGS representing all of the direct and indirect interactions among the genes.
Introduction
The KEGG pathway database provides hundreds of manually curated maps that involve molecular interactions between gene products, compounds, maps, DNA, RNA, and other molecules (Kanehisa and Goto, 2000). The maps are categorised under seven groups, such as “Metabolism” or “Environmental Information Processing,” which are stored in proprietary files in XML format, called KGML. There exist numerous approaches that process the KEGG pathway maps, such as KEGGtranslator (Wrzodek et al., 2011), KEGGParser (Arakelyan and Nersisyan, 2013), CyKEGGParser (Nersisyan et al., 2014), KEGGgraph (Zhang and Wiemann, 2009), KEGGconverter (Moutselos et al., 2009), and graphite (Sales et al., 2012) among others (Wrzodek et al., 2013). These approaches convert KGML files to other formats (e.g., SBML, BioPAX) to be used in applications for data visualisation (e.g., Cytoscape) or graph-theoretic analysis (e.g., MATLAB®, Bioconductor).
Although these approaches are extremely useful, they do not provide a representation that only includes genes deduced from the KEGG pathways considering all of the direct and indirect gene interactions. However, when analysing experimental datasets that only involve the genes (or gene products) in the context of KEGG pathways (e.g., transcriptomic or proteomic data), an interaction network that only involves these molecules is required. Among the existing tools, KEGGgraph provides a “genesOnly” parameter that results in a gene-oriented graph; but that approach only deduces the direct gene interactions provided in the maps. graphite also represents gene-only networks, but it does not take into account all of the compounds between the genes to obtain the exhaustive set of indirect interactions – some compounds based on their identity or localisation are ignored. Indeed, there is a need for an approach that recovers all of the direct and indirect gene-gene interactions from a KEGG pathway that can be used in downstream analysis involving data coming only from these molecules.
KEGG pathways include cycles that may be problematic in analysis approaches, such as Bayesian networks (BNs), which use directed acyclic graphs (DAGs) (Friedman et al., 2000; Isci et al., 2014; Isci et al., 2011; Korucuoglu et al., 2014; Liu et al., 2016). None of the existing approaches that process KEGG pathways provide DAGs as their output. Furthermore, there has been no study that compares different cycle removal methods in the context of the fitness of biological data to the resulting DAGs.
In light of these observations and perceived needs, we developed KEGG2Net, a web resource that converts KEGG pathways into gene interaction networks involving all of the direct and indirect relations between the genes that can be deduced from the map. KEGG2Net offers four alternative methods to convert the resulting gene interaction networks into DAGs. This paper also provides a comparative assessment of the cycle removal methods via their fitness to the data obtained from the original gene interaction network deduced from KEGG.
Implementation
Given the KGML file for a pathway map, the engine parses the file to obtain an adjacency matrix that represents all of the interactions (relation, reaction, etc.) between all of the node types (compound, map, gene, etc.) defined in the file. In this graph, if there exists a path between two genes that contain non-gene nodes only, then an indirect relation between the two genes is established by placing an edge between them. Next, nodes that are not genes are removed from the adjacency matrix. This way, the resulting graph represents all of the direct and indirect interactions between the genes that can be deduced from the KEGG pathway map.
The resulting gene network may contain cycles that are removed using four different methods: a depth-first search (DFS) (Suominen and Mader, 2014), a greedy local heuristic to the minimum feedback arc set (MFAS) problem (Eades et al., 1993), and two graph-hierarchy-based methods where the hierarchy is inferred either through PageRank (PR) (Page et al., 1999) or an ensemble method (EN) (Sun et al., 2017) based on the TrueSkillTM (Herbrich et al., 2007) and social agony (Gupte et al., 2011) metrics. The DFS-based approaches use fast, simple heuristics to remove back edges, MAFS-based approaches try to minimise the number of edges removed, and hierarchy-based methods define a hierarchy in the graph first and then devise an edge removal strategy that prioritises the maintenance of the defined hierarchy as much as possible.
The input to KEGG2Net is the KGML files for the pathways that belong to the organism selected by the user. The output of KEGG2Net consists of the gene interaction networks deduced from the pathways and four DAGs per network where the cycles are removed by the aforementioned four algorithms. The networks and DAGs are represented as adjacency matrices and simple interaction files (a.k.a. SIF or .sif format) for use in the downstream analysis by other software and for visualisation purposes.
In order to compare the accuracy of different cycle removal methods and to provide a sample output using graph images for visualisation purposes, we applied the KEGG2Net approach on the 335 available human KEGG pathways. The KEGG2Net workflow adopted in this paper is shown in Figure 1.
Out of the 335 pathways, we considered only the networks with six or more non-isolated nodes, which left us with 280 networks. Of these networks, 150 did not have any cycles. For each of the remaining 130 that were cyclic, a synthetic gene expression data fitting the graph topology was generated using SynTReN (v. 1.2) (Van den Bulcke et al., 2006). The expression data was processed to be used by BN scoring methods as previously described (Isci et al., 2011). The four cycle removal methods were applied to the 130 networks with cycles generating four DAGs per network, which were scored based on the Bayesian Information Criterion (BIC) in the bnlearn R package (Scutari, 2010) using the processed synthetic expression data. For each of the four DAGs per network, we generated 1,000 random DAGs (with the same number of edges and nodes as the original DAG) and obtained their BIC scores based on the same processed synthetic expression data to assess the goodness of the original DAG’s score.
Results
Our web portal provides the gene interaction networks obtained for all of the 335 human KEGG pathways. For the networks that have cycles, we also list the DAGs obtained using the four methods. The gene interaction networks and the DAGs obtained from them are represented by adjacency matrices, SIF format files, and graph images. KEGG2Net can be used for all of the organisms listed in KEGG where the user can download the relevant network and DAG files via our web portal.
One direct way to assess the performance of different cycle removal methods is to compare the number of edges removed by each method. However, this approach does not infer the degree to which the topology and the dependency structure among the nodes of the network are preserved in the DAGs. For this purpose, we first generated synthetic gene expression data that follow the regulatory dynamics explained by the gene interaction network deduced by KEGG2Net. We then scored each of the four DAGs with this expression data using BIC scoring, where a higher score indicated a better fit. Finally, for each of the four DAGs that result from the given network, we generated 1,000 random DAGs (i.e., 4,000 random DAGs per network) where the random DAGs had the same number of nodes and links as the DAG they were associated with. This exercise was repeated for all of the 130 networks, and the DAG statistics were compared.
The complete set of results for our simulations are given in the Supplementary Data, which involves the 130 KEGG2Net gene interaction networks with six or more non-isolated nodes and a cycle. We provide the original pathway’s ID, name, number of edges, number of nodes, the number of edges removed by the four cycle removal methods, the rank of each method for each pathway based on the edges removed, and the rank of the score for each method (among themselves and their 1,000 random DAGs).
These results, summarised in Figure 2, showed that on average, the EN method required the least number of edges to be removed and provided the DAG with the highest BIC score.
On average, the EN method removed 4.71 edges per pathway; and it was the method that required the least number of edges to be removed in 127 out of 130 networks. The EN method accomplished an average rank of 1.03 in all networks, where 1 represented the rank that removed the minimum number of edges.
The average rank of the EN score among the four DAGs was also the highest, where it attained an average rank of 2.04. In other words, on average, the EN DAG provided the best topology that fit the synthetic expression data. The only category where the EN method was outperformed was its average rank among the 1,000 random DAGs. On average, 16.25 random DAGs for a given network performed better than the EN DAG, whereas this number was 15.13 for the PR DAG, the only method that beat the EN approach in this category. Given the comprehensive evaluation summarised in Figure 2, we recommend EN as the method of choice for DAG generation.
Discussion
In this work, we provide a web resource that converts KEGG pathways into gene interaction networks representing all of the direct and indirect interactions between the genes. We also provide four alternative ways of converting the resulting graph into a DAG. Our results showed that the EN method finds the best DAG that explains the underlying hierarchy and dependency structure defined in the interaction network with the minimum number of edge removals. Our web portal lists the graph structure and the four DAGs for each of the KEGG human pathways. The KEGG2Net web resource can be used to obtain the networks and DAGs for any organism listed in the KEGG database.
Supplementary Material
Key Points.
Deduces only the gene-gene interactions from a KEGG pathway considering all direct and indirect interactions.
Provides networks that are ready to be applied to transcriptomic or proteomic data for system-level analysis.
Using multiple alternative methods, converts networks to directed acyclic graphs (DAGs) that can be used in methodologies such as Bayesian networks, which require DAGs.
Generates multiple output formats that can be directly used in different network visualisation or analysis software.
Reports a comparative analysis of cycle removal methods for biological pathways.
Funding
The research reported in this publication was supported by the National Library of Medicine (NLM) of the National Institutes of Health (NIH) under award number R21LM012759. The funding body was not involved in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.
Footnotes
Competing interests: SKC none; MA none; HC none; HHO none
References
- 1.Arakelyan A and Nersisyan L (2013) KEGGParser: parsing and editing KEGG pathway maps in Matlab. Bioinformatics 29(4): 518–519 10.1093/bioinformatics/bts730 [DOI] [PubMed] [Google Scholar]
- 2.Eades P, Lin X and Smyth WF (1993) A fast and effective heuristic for the feedback arc set problem. Inform Process Lett 47(6): 319–323 10.1016/0020-0190(93)90079-O [DOI] [Google Scholar]
- 3.Friedman N, Linial M, Nachman I and Pe’er D (2000) Using Bayesian networks to analyze expression data. J Comput Biol 7(3–4): 601–620 10.1089/106652700750050961 [DOI] [PubMed] [Google Scholar]
- 4.Gupte M, Shankar P, Li J, Muthukrishnan S and Iftode L (2011) Finding Hierarchy in Directed Online Social Networks. Proceedings of the 20th International Conference on World Wide Web (WWW ‘11): 557–566 10.1145/1963405.1963484 [DOI] [Google Scholar]
- 5.Herbrich R, Minka T and Graepel T (2007) TrueSkillTM: A Bayesian Skill Rating System. In: Advances in Neural Information Processing Systems (NIPS), Scholkopf PB, Platt JC, Hoffman T (eds.) pp. 569–576.
- 6.Isci S, Dogan H, Ozturk C and Otu HH (2014) Bayesian network prior: network analysis of biological data using external knowledge. Bioinformatics 30(6): 860–867 10.1093/bioinformatics/btt643 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Isci S, Ozturk C, Jones J and Otu HH (2011) Pathway analysis of high-throughput biological data within a Bayesian network framework. Bioinformatics 27(12): 1667–1674 10.1093/bioinformatics/btr269 [DOI] [PubMed] [Google Scholar]
- 8.Kanehisa M and Goto S (2000) KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28(1): 27–30 10.1093/nar/28.1.27 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Korucuoglu M, Isci S, Ozgur A and Otu HH (2014) Bayesian pathway analysis of cancer microarray data. Plos One 9(7): e102803 10.1371/journal.pone.0102803 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Liu F, Zhang SW, Guo WF, Wei ZG and Chen L (2016) Inference of Gene Regulatory Network Based on Local Bayesian Networks. PLoS Comput Biol 12(8): e1005024 10.1371/journal.pcbi.1005024 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Moutselos K, Kanaris I, Chatziioannou A, Maglogiannis I and Kolisis FN (2009) KEGGconverter: a tool for the in-silico modelling of metabolic networks of the KEGG Pathways database. BMC Bioinformatics 10(324) 10.1186/1471-2105-10-324 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Nersisyan L, Samsonyan R and Arakelyan A (2014) CyKEGGParser: tailoring KEGG pathways to fit into systems biology analysis workflows. F1000Res 3(145) 10.12688/f1000research.4410.2 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Page L, Brin S, Motwani R and Winograd T, (1999). “The PageRank citation ranking: Bringing order to the web”, Technical Report Stanford InfoLab.
- 14.Sales G, Calura E, Cavalieri D and Romualdi C (2012) graphite - a Bioconductor package to convert pathway topology to gene network. BMC Bioinformatics 13(20) 10.1186/1471-2105-13-20 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Scutari M (2010) Learning Bayesian Networks with the bnlearn R Package. Journal of Statistical Software 35(3): 1–22 10.18637/jss.v035.i0321603108 [DOI] [Google Scholar]
- 16.Sun J, Ajwani D, Nicholson PK, Sala A and Parthasarathy S (2017) Breaking Cycles In Noisy Hierarchies. Proceedings of the 2017 ACM on Web Science Conference: 151–160 10.1145/3091478.3091495 [DOI] [Google Scholar]
- 17.Suominen O and Mader C (2014) Assessing and Improving the Quality of SKOS Vocabularies. Journal on Data Semantics 3(1): 47–73 10.1007/s13740-013-0026-0 [DOI] [Google Scholar]
- 18.Van den Bulcke T, Van Leemput K, Naudts B, van Remortel P, Ma H et al. (2006) SynTReN: a generator of synthetic gene expression data for design and analysis of structure learning algorithms. BMC Bioinformatics 7(43) 10.1186/1471-2105-7-43 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Wrzodek C, Buchel F, Ruff M, Drager A and Zell A (2013) Precise generation of systems biology models from KEGG pathways. BMC Syst Biol 7(15) 10.1186/1752-0509-7-15 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Wrzodek C, Drager A and Zell A (2011) KEGGtranslator: visualizing and converting the KEGG PATHWAY database to various formats. Bioinformatics 27(16): 2314–2315 10.1093/bioinformatics/btr377 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Zhang JD and Wiemann S (2009) KEGGgraph: a graph approach to KEGG PATHWAY in R and bioconductor. Bioinformatics 25(11): 1470–1471 10.1093/bioinformatics/btp167 [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.