Abstract
Summary
Submission to the MetaboLights repository for metabolomics data currently requires users to report instrument and acquisition parameters manually in ISA-Tab format, a process that is time-consuming and prone to input error. Since the large majority of these parameters are embedded in instrument raw data files, an opportunity exists to capture this metadata more accurately. Here we report a set of Python packages that automatically generate ISA-Tab metadata file stubs from raw XML metabolomics data files. The parsing packages are separated into mzML2ISA (encompassing mzML and imzML formats) and nmrML2ISA (nmrML format only). Overall, the use of mzML2ISA & nmrML2ISA substantially reduces the time needed to capture metadata (capturing 90% of metadata on assay and sample levels), is much less prone to user input errors, improves compliance with minimum information reporting guidelines and facilitates more finely grained data exploration and querying of datasets.
Availability and Implementation
mzML2ISA & nmrML2ISA are available under version 3 of the GNU General Public Licence at https://github.com/ISA-tools. Documentation is available from http://2isa.readthedocs.io/en/latest/.
Supplementary information
Supplementary data are available at Bioinformatics online.
1 Introduction
MetaboLights, a database of experimental and derived metabolomics data (Kale et al., 2016), currently requires studies to be submitted in the ISA-Tab file format (Sansone et al., 2008), a hierarchical structure consisting of three components (Investigation, Study and Assay). ISA-Tab allows experimentalists to describe and record metadata in a simple format that aids reproducibility and shareability of experimental data and results. In addition to being a requirement for MetaboLights, the format is used by data-centric journals (GigaScience, Scientific Data) and can facilitate data analysis (González-Beltrán et al., 2014; Lekschas and Gehlenborg, 2016).
Data acquisition within a metabolomics experiment involves a wide range of instrument and data pre-processing parameters. Monitoring and recording those parameters is essential to maximize the reproducibility of experimental data. Currently MetaboLights utilizes the ISA software suite (Rocca-Serra et al., 2010), particularly ISAcreator, to capture this information in ISA-Tab syntactic elements. Manual entry of >40 potential parameters and associated ontology references for MetaboLights submission can be a laborious, time-consuming and error-prone process. However, the majority of this instrument metadata exists already within the instrument vendor data files and corresponding open source data formats.
The use of freely available open source data formats for both mass spectrometry (MS) and nuclear magnetic resonance (NMR) spectroscopy is advantageous for software development as it allows a single route to access instrument data and allows metadata to be standardized between vendors. The open source MS file format mzML (Martens et al., 2011) was created in 2008 as part of the Human Proteome Organization (HUPO) Proteomics Standards Initiative working group for mass spectrometry (PSI-MS). The mzML format amalgamated many of the benefits of the older MS formats, mzData (Orchard et al., 2004) and mzXML (Pedrioli et al., 2004), while adding new features. Importantly, the mzML format has an MS controlled vocabulary and is easily extendable (Martens et al., 2011). The imzML format is a relatively recent extension designed specifically for imaging MS (Schramm et al., 2012). The nmrML file format is a newly developed file format for NMR data; while not an extension of mzML, it is built around the same principles (http://nmrml.org/).
Through the use of Python packages, we automated both the extraction of the instrument metadata and the subsequent population of an ISA-Tab structure. In so doing, we aim to both remove the submission bottleneck and help ensure the experimental data is reusable, sharable and standards compliant (Salek et al., 2015).
2 Materials and methods
2.1 ISA-Tab
The ISA-Tab structure consists of three file types. The Investigation file describes the overall objectives of a project as well as defining factors, protocols and parameters. The Study files describe the subjects studied and their characteristics and sampling methods. For every Study file there are one or more associated Assay files describing a measurement (e.g. metabolic profiling) and the technique (e.g. NMR). The instrument metadata extracted from mzML, imzML or nmrML files are predominantly used to automatically populate fields within the Assay file(s).
2.2 Implementation and availability
The mzML2ISA & nmrML2ISA software can be used as an application programming interface (API), via the command line interface (CLI) or via a graphical user interface (GUI). To allow integration into Galaxy, a popular workflow platform for metabolomics and other ’omic data analysis (Afgan et al., 2016), the Python packages have been wrapped into a Galaxy compatible tool format. Additionally, mzML2ISA & nmrML2ISA are available as tools within the PhenoMeNal Galaxy public instance and as Docker containers, as detailed in the supplementary materials. See Supplementary Table S1 for a full list of nmrML2ISA & mzML2ISA software and Supplementary section S2 for further implementation details. Instructions on how to use the tools and examples for common use cases can be found in the documentation (http://2isa.readthedocs.io/en/latest/).
2.3 Workflow
Due to the nature of the analytical technologies and structural differences between MS and NMR file formats, the software described here is divided into mzML2ISA (encompassing mzML and imzML formats) and nmrML2ISA (nmrML format only). However, the same workflow applies for both technologies (see Fig. 1).
From a user-supplied collection of XML-based data files and some experimental information (minimally a study identifier), the metadata are extracted from the XML data files. The Python Pronto package (https://pypi.python.org/pypi/pronto) is used to extract parameters by reference to their accession numbers within either the HUPO PSI-MS ontology (Mayer et al., 2013), imagingMS ontology (Schramm et al., 2012) or nmrCV (http://nmrml.org/).
Some instrument metadata is not available within the XML file format (e.g. chromatography parameters); such metadata can be added manually by the user. Study design or wet lab experimental metadata (e.g. sample preparation details) can also be added manually. The GUIs developed provide the easiest route for users who wish to provide additional metadata at this stage.
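As an illustration of accession-based extraction, the following minimal sketch collects controlled vocabulary parameters from an mzML-like fragment. It uses only the Python standard library rather than Pronto, and it is not the mzML2ISA implementation: the embedded fragment, the function name collect_cv_params and the layout of the returned dictionary are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hand-written mzML-like fragment for illustration; not a complete, valid mzML file.
MZML_SNIPPET = """<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <instrumentConfigurationList count="1">
    <instrumentConfiguration id="IC1">
      <cvParam cvRef="MS" accession="MS:1000449" name="LTQ Orbitrap" value=""/>
      <cvParam cvRef="MS" accession="MS:1000529" name="instrument serial number" value="SN0001"/>
    </instrumentConfiguration>
  </instrumentConfigurationList>
</mzML>"""

# XML namespace declared by mzML documents.
MZML_NS = "http://psi.hupo.org/ms/mzml"


def collect_cv_params(xml_text):
    """Map each cvParam accession to its name, value and source vocabulary.

    Hypothetical helper for this sketch; a later cvParam with the same
    accession overwrites an earlier one.
    """
    root = ET.fromstring(xml_text)
    params = {}
    for elem in root.iter("{%s}cvParam" % MZML_NS):
        params[elem.get("accession")] = {
            "name": elem.get("name"),
            "value": elem.get("value", ""),
            "cv": elem.get("cvRef"),
        }
    return params


if __name__ == "__main__":
    for accession, info in sorted(collect_cv_params(MZML_SNIPPET).items()):
        print(accession, info["name"], info["value"])
```

In a fuller pipeline the collected accessions would then be resolved against the relevant ontology, which is the role Pronto plays in the packages described here.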
The metadata is then converted into the ISA-Tab file format where a large number of fields are automatically populated. Depending on the level of additional metadata already provided by the user, the ISA-Tab files may require review and expansion to meet annotation requirements (Fiehn et al., 2007; Sansone et al., 2007). These remaining fields can be manually added through the ISAcreator tool.
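To make the conversion step concrete, the sketch below writes extracted values into a tab-delimited table of the kind used for ISA-Tab Assay files. The column subset, the helper name render_assay_table and the example row (including the file and sample names) are assumptions chosen for illustration; they do not reproduce the MetaboLights assay template or the output of mzML2ISA.

```python
import csv
import io

# Illustrative subset of assay columns; a real assay template has many more.
HEADER = [
    "Sample Name",
    "Protocol REF",
    "Parameter Value[Instrument]",
    "Term Source REF",
    "Term Accession Number",
    "Raw Spectral Data File",
]

# Hypothetical row, as might be assembled from extracted metadata.
EXAMPLE_ROWS = [
    {
        "Sample Name": "sample_01",
        "Protocol REF": "Mass spectrometry",
        "Parameter Value[Instrument]": "LTQ Orbitrap",
        "Term Source REF": "MS",
        "Term Accession Number": "MS:1000449",
        "Raw Spectral Data File": "sample_01.mzML",
    }
]


def render_assay_table(rows):
    """Return rows as tab-delimited text with every field double-quoted."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer, delimiter="\t", quoting=csv.QUOTE_ALL, lineterminator="\n"
    )
    writer.writerow(HEADER)
    for row in rows:
        # Fields the parser could not fill are left empty for later review.
        writer.writerow([row.get(column, "") for column in HEADER])
    return buffer.getvalue()


if __name__ == "__main__":
    print(render_assay_table(EXAMPLE_ROWS), end="")
```

Leaving unfilled fields empty rather than omitting the column keeps the table rectangular, so the remaining values can be completed later in a tool such as ISAcreator.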
The metadata generated from the XML file parsing is stored as a Python dictionary that can be rendered in JSON (JavaScript Object Notation), making it accessible to other software tools and analysis independent of ISA-Tab generation.
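A minimal sketch of this idea, assuming a simplified dictionary layout: the keys shown here are hypothetical and do not reflect the actual structure produced by mzML2ISA or nmrML2ISA.

```python
import json

# Hypothetical metadata dictionary; the real packages define their own keys.
EXAMPLE_METADATA = {
    "sample_01.mzML": {
        "Instrument": {"name": "LTQ Orbitrap", "accession": "MS:1000449"},
        "Instrument serial number": {"value": "SN0001"},
    }
}


def to_json(metadata):
    """Serialize the metadata dictionary to indented, key-sorted JSON text."""
    return json.dumps(metadata, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(to_json(EXAMPLE_METADATA))
```

Because the dictionary contains only strings and nested dictionaries, it survives a round trip through JSON unchanged, which is what allows other tools to consume it without going through ISA-Tab.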
See Supplementary section S2 for full details of all MetaboLights studies used for assessing software in this paper.
3 Results
3.1 Extracted experimental metadata
Up to 23, 42 and 46 instrument metadata terms for mzML, imzML and nmrML respectively are automatically extracted and parsed into the assay component of the ISA-Tab structure.
Extracted terms for the XML files include generic instrument descriptors (e.g. the instrument name, manufacturer and software), data transformation descriptors (e.g. file conversion details) and platform specific descriptors (e.g. mass analyzer, detector and m/z range for MS derived XML files). Where possible, extracted terms from each file format should be found within a relevant controlled vocabulary. Full details of the extracted terms can be found at http://2isa.readthedocs.io/en/latest/ (see the extracted terms sections).
3.2 MetaboLights case studies
The mzML2ISA & nmrML2ISA packages have parsed and created ISA-Tab structures for all MetaboLights studies that have associated mzML, imzML or nmrML files. The resulting ISA-Tab structures have then been successfully validated using the ISAvalidator tool (https://github.com/ISA-tools/ISAvalidator-ISAconverter-BIImanager). See Supplementary Table S2 for details of the 21 studies tested.
Using sample sets of 50 XML files (either mzML, imzML or nmrML) derived from MetaboLights, XML to ISA-Tab conversion is completed in less than 45 seconds. The exact time will vary depending on file type and size. See Supplementary Table S3 for details.
4 Discussion
The mzML, imzML and nmrML data file formats used in metabolomics provide a parameter-rich, technical layer of metadata that we have exploited here to both improve the reliability of MetaboLights submissions and increase the ease and speed of submission. Additionally, the generated ISA-Tab structures can be used in downstream analysis with software such as Risa (González-Beltrán et al., 2014) and the Refinery Platform (Lekschas and Gehlenborg, 2016).
However, to reach core information for metabolomics reporting compliance (CIMR) (BioSharing: Bsg-s000175), derived from the Metabolomics Standards Initiative (Fiehn et al., 2007; Sansone et al., 2007), a range of distinct descriptors still have to be reported into the ISA-Tab structure. Typically, these are experimental design key descriptors (predictor variables, replicate information, study group information), subject-to-sample relationships, as well as biological sample characteristics, and sample separation procedures if relevant (e.g. chromatography). The development of the ISA-API (https://github.com/ISA-tools/isa-api) will further facilitate the automation of these missing descriptors, especially regarding experimental design, and aid in reducing the barrier and time required to submit metabolomics data to repositories. Indeed, instrument vendor raw files may harbour many more parameters that are often absent in the open source equivalent files.
Metadata extraction and integration into an open source file format can however be an involved process to implement for each instrument vendor API. Additionally, licensing issues can limit full exploitation of the vendor raw files. Further work that brings together the metabolomics and proteomics community, along with the instrument vendors, could help identify and integrate useful but missing parameters into the appropriate open source format.
Acknowledgement
We thank Drs David Johnson, Alejandra Gonzalez-Beltran and Peter Li for contributions provided during the China UK data dissemination in metabolomics (CUDDEL) workshop (2016).
Funding
This work was supported financially through a NERC CASE PhD studentship with GigaScience, grant number NE/L002493/1; BBSRC, grant numbers BB/L005077/1, BB/M019985/1 and BB/M027635/1; MRC UK MEDical BIOinformatics partnership, grant number MR/L01632X/1; the PhenoMeNal European Commission’s Horizon2020 programme, grant number 654241; and the Wellcome Trust, grant number 202952/Z/16/Z.
Conflict of Interest: none declared.
References
- Afgan E. et al. (2016) The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2016 update. Nucleic Acids Res., 44, W3–W10.
- Fiehn O. et al. (2007) The metabolomics standards initiative (MSI). Metabolomics, 3, 175–178.
- González-Beltrán A. et al. (2014) The Risa R/Bioconductor package: integrative data analysis from experimental metadata and back again. BMC Bioinformatics, 15, S11.
- Kale N.S. et al. (2016) MetaboLights: An Open-Access database repository for metabolomics data. Curr. Protoc. Bioinf., 53, 14.13.1–14.13.18.
- Lekschas F., Gehlenborg N. (2016) SATORI: A system for Ontology-Guided visual exploration of biomedical data repositories. bioRxiv.
- Martens L. et al. (2011) mzML—a community standard for mass spectrometry data. Mol. Cell. Proteomics, 10, R110–000133.
- Mayer G. et al. (2013) The HUPO proteomics standards initiative-mass spectrometry controlled vocabulary. Database, 2013, bat009.
- Orchard S. et al. (2004) Advances in the development of common interchange standards for proteomic data. Proteomics, 4, 2363–2365.
- Pedrioli P.G.A. et al. (2004) A common open representation of mass spectrometry data and its application to proteomics research. Nat. Biotechnol., 22, 1459–1466.
- Rocca-Serra P. et al. (2010) ISA software suite: supporting standards-compliant experimental annotation and enabling curation at the community level. Bioinformatics, 26, 2354–2356.
- Salek R.M. et al. (2015) COordination of standards in MetabOlomicS (COSMOS): facilitating integrated metabolomics data access. Metabolomics, 11, 1587–1597.
- Sansone S.A. et al. (2007) The metabolomics standards initiative. Nat. Biotechnol., 25, 846–848.
- Sansone S.A. et al. (2008) The first RSBI (ISA-TAB) workshop: “can a simple format work for complex studies?” Omics, 12, 143–149.
- Schramm T. et al. (2012) imzML — a common data format for the flexible exchange and processing of mass spectrometry imaging data. J. Proteomics, 75, 5106–5110.