Skip to main content
PLOS Computational Biology logoLink to PLOS Computational Biology
. 2013 Sep 19;9(9):e1003194. doi: 10.1371/journal.pcbi.1003194

Using Biological Pathway Data with Paxtools

Emek Demir 1,*, Özgün Babur 1, Igor Rodchenkov 2, Bülent Arman Aksoy 1,3, Ken I Fukuda 4, Benjamin Gross 1, Onur Selçuk Sümer 1, Gary D Bader 2, Chris Sander 1
Editor: Andreas Prlic5
PMCID: PMC3777916  PMID: 24068901

Abstract

A rapidly growing corpus of formal, computable pathway information can be used to answer important biological questions including finding non-trivial connections between cellular processes, identifying significantly altered portions of the cellular network in a disease state and building predictive models that can be used for precision medicine. Due to its complexity and fragmented nature, however, working with pathway data is still difficult. We present Paxtools, a Java library that contains algorithms, software components and converters for biological pathways represented in the standard BioPAX language. Paxtools allows scientists to focus on their scientific problem by removing technical barriers to access and analyse pathway information. Paxtools can run on any platform that has a Java Runtime Environment and was tested on most modern operating systems. Paxtools is open source and is available under the Lesser GNU public license (LGPL), which allows users to freely use the code in their software systems with a requirement for attribution. Source code for the current release (4.2.0) can be found in Software S1. A detailed manual for obtaining and using Paxtools can be found in Protocol S1. The latest sources and release bundles can be obtained from biopax.org/paxtools.


This is a PLOS Computational Biology Software Article

Introduction

The total volume of computational pathway data mapped by biologists has entered a rapid growth phase. There are more than 100,000 biochemical reactions and more than a million molecular interactions in publicly available 325 on-line resources [1]. Pathway analysis has been used to determine genetically altered pathways in diverse cancer types [2], [3], predict the functions and phenotypic effects of previously uncharacterized cancer genes [4], [5], prioritize candidate genes and pathways for functional validation in model systems [6], [7], identify network biomarkers correlated with clinical outcome in human disease [5], and translate genomic information into new targeted clinical applications [8].

Despite these promising studies, the use of computable pathway information in biology is still in its infancy. Pathway data sources were originally developed to use their internal pathway representation, resulting in a heterogeneous set of resources that are difficult to combine and use. There are two key technical challenges we need to address to streamline the use of this rapidly expanding information : (i) standardizing and integrating pathway data from multiple heterogeneous sources and (ii) developing methods and tools that can help scientists easily access, map and analyze pathway information.

BioPAX (Biological Pathway Exchange) [9] a standard language developed by a community of pathway data providers, tool developers and scientists enables data exchange between data providers and acts as a common interface for software tools to access pathway data from multiple sources. Recently released BioPAX Level 3 covers a large spectrum of pathway types, including signaling and metabolic pathways as well as genetic and protein-protein interactions. BioPAX, as a common discussion forum, also led to convergence on the semantics of pathway representation over the years. There are multiple aggregation services for publicly available pathway data in BioPAX [10], [11]. Pathway Commons (PC) [12], for example, validates, normalizes and integrates pathway information from Reactome [13], NCI/Nature PID [14], BioCyc [15], Panther Pathway [16], PhosphoSitePlus [17] and other major pathway databases.

The second key challenge is to build methods, algorithms and software tools to work with pathway data to answer biological questions [18][21]. Given the large number of classes and properties in BioPAX, developing software that consumes BioPAX can be a daunting task. To address this problem, we developed Paxtools, an open-source, Java library that contains facilities and algorithms for common but difficult to implement tasks such as reading, writing, searching, merging, comparing and transforming pathway information. Paxtools is the only complete BioPAX API available and is quickly becoming a common platform for accessing and analyzing pathway data as well as development of pathway software [22][25].

Design and Implementation

At the core of Paxtools is a complete and consistent object-oriented implementation of the BioPAX specification. A BioPAX record includes pathways, interactions, reactions and their participants. These elements are implemented as Java beans that have methods to access and manipulate the properties as described in BioPAX specification. The Model class acts as a container that allows adding, removing and searching for elements.

BioPAX uses OWL semantics that are not automatically covered by object oriented languages such as Java. Properties can be symmetric, transitive or subtyped into other properties. For example, a protein binding relationship is symmetric, if A binds to B, the inverse is also true. Also, the “standard name” property is a subtype of “name” property, so updating the standard name of the protein should also update its list of names. Paxtools implements these additional semantics and automatically updates object fields.

In the BioPAX specification, properties are unidirectional for brevity. For example, the “participant” property links interactions to physical entities. Paxtools provides additional “inverse” links for key properties that allow efficient bidirectional navigation. These inverse links are also automatically updated when the forward link is updated.

Paxtools includes two input/output handlers for reading and writing BioPAX files. Jena based IO can handle most OWL encodings but can be relatively slow and demands a significant amount of memory for very large files. The StaX based “Simple” IO handler class can only read BioPAX formatted as RDF/XML and does not have the flexibility of the Jena handler but it can read hundreds of thousands of elements within minutes with a very small memory footprint.

On top of this foundation Paxtools implements methods, algorithms and tools to solve common tasks encountered while working with pathway data (Figure 1). Some components, such as graph search algorithms, are difficult to implement correctly without a detailed understanding of BioPAX. For example, developers unfamiliar with the BioPAX specification might not be aware of generic entities and their semantics and fail to traverse these portions of the network. These algorithms in return might fail to find interesting and important interactions. Paxtools components were developed by researchers who have extensive experience working with BioPAX and are involved in BioPAX development. They were further tested and corrected by an active user community. Some of these components include:

Figure 1. Paxtools can be used to obtain pathway data from different sources and use the data to answer a wide range of biological questions.

Figure 1

Paxtools facilitates working with pathway information by removing technical barriers and acts as a common development platform for other tools and algorithms.

Efficient traversal and editing

The “Controller” package allows users to manipulate BioPAX models without actually hard coding property and class names, define XML Path Language-like queries and traverse the object graph efficiently. An example which implements an object inspector to display the properties of objects can be found in the user's guide (Text S1). This pattern considerably simplifies development of BioPAX exporters and other tools and makes it easier to extend and update them to support future changes in the BioPAX specification.

Graph search algorithms for biological questions

Paxtools offers an array of graph search algorithms that are specifically targeted towards answering common biological questions. For example a mutually exclusive relationship between the mutations observed in two genes in a set of drug-resistant tumors can be explained by a common downstream pathway. Similarly, an observed correlation in gene expression between two genes can be caused by a signaling path that connects one gene to the control of the expression of the other. Starting from a set of entities, users can find common upstream and downstream events, feedback loops, connecting paths and networks of interest (Figure 2). A list of algorithms and their complexity analysis can be found in [26].

Figure 2. A plausible signaling path (yellow) between the androgen receptor (AR, top) and the TP53 protein (p53, right) is found by a Paxtools “paths-between” query and visualized in the Chisio BioPAX Editor.

Figure 2

These cross-talks are difficult to find manually without a graph search because of the large number of connections to/from AR and p53 and because of fragmentation of data across multiple pathway data sources.

Algorithms for merging, extracting and reducing pathways

Working with BioPAX often requires complex manipulations to merge models from different sources, extract subparts or reduce them to simpler formats (Figure 3). Paxtools provides an array of carefully designed algorithms for these tasks that pays considerable attention to the subtle details of the BioPAX format such as subsumption relationships between entities (i.e. generic proteins and complex molecules).

Figure 3. A Sif (Simple Interaction Format) rule reduces mechanistic BioPAX interactions (left column) to simpler binary interactions (right column) based on the pattern defined in the rule.

Figure 3

Different rules should be used dependent on the biological question at hand. For example a State Change interaction is inferred when the rule detects phosphorylation of p53 by p38. This is a useful relationship that can be applied to protein signaling cascade analysis from proteomics data. “Sequential Catalysis” rule on the other hand links entities that are responsible for catalyzing subsequent reactions. This is a relationship that frequently occurs in metabolic pathways and is useful for metabolomic studies where changes in the concentration of substrates are observed.

Converting to and from different formats

Paxtools can convert Proteomics Standards Initiative - Molecular Interaction (PSI-MI) [27] models to BioPAX Level 3. In addition, BioPAX models can be exported back to OWL and several other useful formats, including Systems Biology Graph Notation markup language (SBGN-ML) [28], Simple Interaction Format (SIF) and Gene Set Enrichment Analysis (GSEA) [29].

Seamless handling of different BioPAX levels

The recently released BioPAX Level 3 introduced significant extensions to previous levels with some loss of backwards compatibility. Paxtools supports all three BioPAX levels (1, 2 and 3) and provides facilities for upgrading older BioPAX models to Level 3.

Syntactic validation

Each operation that modifies the model is internally validated by Paxtools to comply with BioPAX syntax, including RDF well-formedness, domain and range restrictions, bidirectional links, and redundancies. This is especially useful for tools that create BioPAX such as pathway editors, converters or exporters as it allows early detection of errors, increasing the quality of the produced BioPAX data. This syntactic validation does not include more detailed checking of semantics that is performed by the BioPAX validator.

Object/relational mapping

Paxtools comes with mappings for Java Persistence API that allows out-of-the-box persistence of BioPAX models into relational database systems, such as MySQL. This is an especially important feature for working with large biopax models, such as complete pathway database exports.

Modular and lightweight structure

Paxtools source code is currently distributed as a modular Maven project which allows developers to easily select just the parts of Paxtools they need in their application. The Paxtools core module, which provides a complete implementation of BioPAX and read in/write out functionality, is a very compact library (400 kB) that can be expanded as required with additional modules, such as those described above.

Results

Paxtools implements all of these features in a lightweight, extensible structure. Paxtools can be embedded in a software tool as a library, can be used for programmatic access to pathway data with Java or Jython or can be used to run most commonly used Paxtools algorithms and searches from the command line. Here, we briefly review several applications where Paxtools was used to answer biological questions or develop bioinformatics tools.

Finding cellular processes connecting a set of genes of interest

When an experimental method identifies a set of genes associated with a phenotype, such as prostate cancer, cellular processes that connect these genes can offer a causal explanation. In this scenario, a user writes a small Java program that loads the cellular process network curated by a pathway database. Alternatively, instead of using BioPAX exports, the data can be queried dynamically from PC using the PC-client module. The program then runs a paths-between query, implemented in the Query package that finds the paths in the network that connect the seed genes. Entities will be mapped using UniProt identifiers and the algorithm will return the connecting sub-network in BioPAX. This BioPAX network can be visualized or used for subsequent analysis. An implementation of this scenario can be found in cBio Cancer Genomics Portal [30].

Finding overrepresented modules or paths in a set of genomic alteration profiles

In this use case, a set of high throughput profiles are used in conjunction with pathway information to find significantly altered modules. Due to the nature of the experimental data, detailed BioPAX information, such as the phosphorylation states of interacting proteins is unnecessary in this case. To facilitate analysis, complex BioPAX pathways can be reduced to a simple interaction network using the SIF converter that effectively merges multiple states of the entity into a single node. Alteration profiles can be directly mapped onto these “entity” nodes using external references to Entrez Gene and UniProt databases as defined in BioPAX models. Existing clustering and module-finding algorithms [18], [31] can then be directly used on this SIF network. Example implementations of this usage can be found in the NetPath [32] and MeMo [33] algorithms.

Pathway visualization and editing

Visualization is key for understanding complex relationships between biological entities. Paxtools significantly facilitates implementing Pathway visualizers and editors by handling the reading/writing and syntactic validation. Paxtools also provides additional common functionality for interactive tools such as undo/redo and object inspection. Finally, the Pathway Commons client enables pathway visualization tools to directly access and query Pathway Commons. Pathway Editors and visualizers that use Paxtools for these purposes include ChiBE [34], CellDesigner [23], Cytoscape [22] and Vanted [35].

Protein activity network

The effect of post-translational modifications on protein activity can be useful for assessing the function of mutations in the vicinity of a modification site. Specifically, we would like to know if a given post-translational modification causes a protein to gain or lose a function. We can search pathways for processes that modify a protein and predict how this change will affect the controlled downstream processes. The algorithm also needs to consider processes that contain the protein as a part of molecular complex or as a part of a protein family. Since these inclusion relationships and the type and location of the modifications are stated explicitly in BioPAX, equivalent protein modifications can be matched to integrate findings from different pathways. A complete implementation of this algorithm can be found in the Paxtools code.

Availability and Future Directions

Paxtools can run on any platform that has a Java Runtime Environment and was tested on most modern operating systems. Paxtools is open source and is available under the Lesser GNU public license (LGPL), which allows users to freely use the code in their software systems with a requirement for attribution. Source code for the current release (4.2.0) can be found in Software S1. A detailed manual for obtaining and using Paxtools can be found in Protocol S1. The latest sources and release bundles can be obtained from biopax.org/paxtools. Paxtools is supported by an active community of developers and users that can be reached at http://lists.sourceforge.net/lists/listinfo/biopax-paxtools. Sample code that demonstrates basic tasks and operations can be found in the org.biopax.paxtools.examples package.

A rapidly growing open-source BioPAX software infrastructure, available as part of the Pathway Commons project, is directly built on top of Paxtools, including a state-of-the-art persistence and full-text search system cPath2, an advanced validator that allows checking complex rules and best practices, a pathway alignment tool and a framework for quickly mapping experimental data on BioPAX pathways. Since these tools use a common API, it is possible to combine and re-use software components across multiple applications. Detailed examples of using Paxtools for real-life applications can be found in these projects. We encourage users to contribute to the codebase and expand it to build a pathway informatics platform that addresses the growing needs of the community.

One of our major future goals is to reach a wider community of researchers by implementing different language bindings and usage modes for Paxtools. We are planning to include bindings for the R statistical programming language and implement an interactive scripting mode for more efficient non-programmatic manipulation.

Paxtools can significantly reduce the time spent parsing, normalizing and querying BioPAX information. Without removing these barriers, it is difficult to tap into the rapidly growing computable pathway information corpus produced by multiple pathway databases and curation groups. Paxtools allows researchers to shift their effort from data plumbing to answering biological questions.

Supporting Information

Protocol S1

Paxtools User's Guide. This guide provides details on how to obtain and use Paxtools. It also provides examples and technical details for the key features as well as a frequently asked questions section.

(PDF)

Software S1

The complete source code for all Paxtools modules as of the 4.2.0 release as a single bundle. The latest sources and release bundles can be obtained from biopax.org/paxtools.

(TAR)

Acknowledgments

We would like to thank Paxtools users for their valuable feedback.

Funding Statement

This research was funded by the US National Institutes of Health (grant #5U41 HG006623-02). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Bader GD, Cary MP, Sander C (2006) Pathguide: a pathway resource list. Nucleic acids research 34: D504–6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2. Network TCGA (2008) Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 455: 1061–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. Jones S, Zhang X, Parsons DW, Lin JCH, Leary RJ, et al. (2008) Core signaling pathways in human pancreatic cancers revealed by global genomic analyses. Science (New York, NY) 321: 1801–6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4. Hu P, Bader G, Wigle DA, Emili A (2007) Computational prediction of cancer-gene function. Nature reviews Cancer 7: 23–34. [DOI] [PubMed] [Google Scholar]
  • 5. Chuang HY, Lee E, Liu YT, Lee D, Ideker T (2007) Network-based classification of breast cancer metastasis. Molecular systems biology 3: 140. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Aerts S, Lambrechts D, Maity S, Van Loo P, Coessens B, et al. (2006) Gene prioritization through genomic data fusion. Nature biotechnology 24: 537–44. [DOI] [PubMed] [Google Scholar]
  • 7. Chen J, Bardes EE, Aronow BJ, Jegga AG (2009) ToppGene Suite for gene list enrichment analysis and candidate gene prioritization. Nucleic acids research 37: W305–11. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Vaske CJ, Benz SC, Sanborn JZ, Earl D, Szeto C, et al. (2010) Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics (Oxford, England) 26: i237–45. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. Demir E, Cary MP, Paley S, Fukuda K, Lemer C, et al. (2010) The BioPAX community standard for pathway data sharing. Nat Biotechnol 28: 935–942. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. Yu N, Seo J, Rho K, Jang Y, Park J, et al. (2012) hiPathDB: a human-integrated pathway database with facile visualization. Nucleic acids research 40: D797–802. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11. Sreenivasaiah PK, Rani S, Cayetano J, Arul N, Kim DH (2012) IPAVS: Integrated Pathway Resources, Analysis and Visualization System. Nucleic acids research 40: D803–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Cerami EG, Gross BE, Demir E, Rodchenkov I, Babur O, et al. (2011) Pathway Commons, a web resource for biological pathway data. Nucleic Acids Res 39: D685–90. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13. Croft D, O'Kelly G, Wu G, Haw R, Gillespie M, et al. (2011) Reactome: a database of reactions, pathways and biological processes. Nucleic Acids Res 39: D691–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Schaefer CF, Anthony K, Krupa S, Buchoff J, Day M, et al. (2009) PID: the Pathway Interaction Database. Nucleic acids research 37: D674–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15. Caspi R, Altman T, Dale JM, Dreher K, Fulcher CA, et al. (2010) The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic acids research 38: D473–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16. Mi H, Dong Q, Muruganujan A, Gaudet P, Lewis S, et al. (2010) PANTHER version 7: improved phylogenetic trees, orthologs and collaboration with the Gene Ontology Consortium. Nucleic acids research 38: D204–10. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Hornbeck PV, Kornhauser JM, Tkachev S, Zhang B, Skrzypek E, et al. (2012) PhosphoSitePlus: a comprehensive resource for investigating the structure and function of experimentally determined post-translational modifications in man and mouse. Nucleic acids research 40: D261–70. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18. Kirouac DC, Saez-Rodriguez J, Swantek J, Burke JM, Lauffenburger DA, et al. (2012) Creating and analyzing pathway and protein interaction compendia for modelling signal transduction networks. BMC systems biology 6: 29. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19. Searls DB (2005) Data integration: challenges for drug discovery. Nature reviews Drug discovery 4: 45–58. [DOI] [PubMed] [Google Scholar]
  • 20. Sorger PK (2005) A reductionist's systems biology: opinion. Current opinion in cell biology 17: 9–11. [DOI] [PubMed] [Google Scholar]
  • 21. Kelder T, Conklin BR, Evelo CT, Pico AR (2010) Finding the right questions: exploratory pathway analysis to enhance biological discovery in large datasets. PLoS biology 8: 5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Cline MS, Smoot M, Cerami E, Kuchinsky A, Landys N, et al. (2007) Integration of biological networks and gene expression data using Cytoscape. Nature protocols 2: 2366–82. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23. Mi H, Muruganujan A, Demir E, Matsuoka Y, Funahashi A, et al. (2011) BioPAX support in CellDesigner. Bioinformatics (Oxford, England) 27: 3437–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Li C, Donizelli M, Rodriguez N, Dharuri H, Endler L, et al. (2010) BioModels Database: An enhanced, curated and annotated resource for published quantitative kinetic models. BMC systems biology 4: 92. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25. Babur O, Demir E, Gönen M, Sander C, Dogrusoz U (2010) Discovering modulators of gene expression. Nucleic Acids Res 38: 5648–5656. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26. Dogrusoz U, Cetintas A, Demir E, Babur O (2009) Algorithms for effective querying of compound graph-based pathway databases. BMC Bioinformatics 10: 376. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27. Kerrien S, Orchard S, Montecchi-Palazzi L, Aranda B, Quinn AF, et al. (2007) Broadening the horizon–level 2.5 of the HUPO-PSI format for molecular interactions. BMC Biol 5: 44. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28. van Iersel MP, Villéger AC, Czauderna T, Boyd SE, Bergmann FT, et al. (2012) Software support for SBGN maps: SBGN-ML and LibSBGN. Bioinformatics 28: 2016–21. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, et al. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America 102: 15545–50. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Cerami E, Gao J, Dogrusoz U, Gross BE, Sumer SO, et al. (2012) The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer discovery 2: 401–4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. Ma'ayan A (2009) Insights into the organization of biochemical regulatory networks using graph theory analyses. The Journal of biological chemistry 284: 5451–5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32. Cerami EG, Bader GD, Gross BE, Sander C (2006) cPath: open source software for collecting, storing, and querying biological pathways. BMC bioinformatics 7: 497. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33. Ciriello G, Cerami E, Sander C, Schultz N (2012) Mutual exclusivity analysis identifies oncogenic network modules. Genome research 22: 398–406. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34. Babur O, Dogrusoz U, Demir E, Sander C (2010) ChiBE: interactive visualization and manipulation of BioPAX pathway models. Bioinformatics (Oxford, England) 26: 429–31. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35. Czauderna T, Klukas C, Schreiber F (2010) Editing, validating and translating of SBGN maps. Bioinformatics (Oxford, England) 26: 2340–1. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Protocol S1

Paxtools User's Guide. This guide provides details on how to obtain and use Paxtools. It also provides examples and technical details for the key features as well as a frequently asked questions section.

(PDF)

Software S1

The complete source code for all Paxtools modules as of the 4.2.0 release as a single bundle. The latest sources and release bundles can be obtained from biopax.org/paxtools.

(TAR)


Articles from PLoS Computational Biology are provided here courtesy of PLOS

RESOURCES