Omics Metadata Management Software (OMMS)

Martha O Perez-Arriaga; Susan Wilson; Kelly P Williams; Joseph Schoeniger; Russel L Waymire; Amy Jo Powell

Bioinformation. 2015 Apr 30;11(4):165–172. doi: 10.6026/97320630011165

PMCID: PMC4479048 PMID: 26124554

Abstract

Next-generation sequencing projects involve underappreciated information management tasks that require detailed attention to specimen curation, nucleic acid sample preparation and sequence production methods, all of which are needed for downstream data processing, comparison, interpretation, sharing and reuse. The few existing metadata management tools for genome-based studies provide weak curatorial frameworks for experimentalists to store and manage idiosyncratic, project-specific information, and typically offer no automation supporting unified naming and numbering conventions for sequencing production environments that routinely deal with hundreds, if not thousands, of samples at a time. Moreover, existing tools are not readily interfaced with bioinformatics executables (e.g., BLAST, Bowtie 2, custom pipelines). Our application, the Omics Metadata Management Software (OMMS), answers both needs, empowering experimentalists to generate consistent metadata, and to perform analyses and information management tasks via an intuitive web-based interface. Several use cases with short-read sequence datasets are provided to validate installation and integrated function, and to suggest possible methodological road maps for prospective users. The examples highlight possible OMMS workflows for metadata curation, multistep analyses, and results management and downloading. The OMMS can be implemented as a stand-alone package for individual laboratories, or can be configured for web-based deployment supporting geographically dispersed projects. The OMMS was developed using an open-source software base, and is flexible, extensible and easily installed and executed. The OMMS can be obtained at http://omms.sandia.gov.

Availability

The OMMS can be obtained at http://omms.sandia.gov

Keywords: Bioinformatics, relational database management system, omics, next-generation sequencing, biological curation, open-source software, integrated workflow

Background

Next-generation sequencing has revolutionized research and medicine, and has been accompanied by increasingly challenging, if underappreciated, metadata management requirements [1]. Sequencing projects that have large, complex metadata demand lightweight, flexible, stable software to support dataset standardization, biological curation activities, and streamlining of pre- and post-processing steps with pipeline analyses [2]. The few existing open-source tools for managing next-generation sequencing metadata have limited flexibility, which hampers project-specific tailoring, and cannot be easily deployed and managed by a single administrator for multiple research sites [3, 4]. Commercially available laboratory information management alternatives are typically expensive and cumbersome, and require proprietary administration and maintenance for the software lifecycle [5]. None of the available tools, open-source or otherwise, readily integrates curation, processing and advanced analyses. We developed the Omics Metadata Management Software (OMMS), a flexible, extensible, open-source, web-based tool that provides semi-automated curation utilities and integrated implementation with widely used bioinformatics executables, such as BLAST [6] and Bowtie 2 [7], for human-microbiome-oriented research in our laboratories [8]. Example use cases with publicly available human microbiome and chimpanzee RNASeq datasets [9, 10] are detailed to demonstrate OMMS function and versatility, and its operation as a pipeline frontend.

Omics Metadata Management Software:

The OMMS was engineered to support a large, multidisciplinary, geographically dispersed research team developing next-generation sequencing-based approaches for identifying potentially rare etiologic agents in human microbiomes across hundreds of distinct sample types [8]. The OMMS graphical user interface (GUI) enables semi-automated, project-specific metadata entries in each table (Figures 1 & 2; Tables 1, 3-5), and associated input sequence data are referenced to archiving locations in directories generated on Linux-based file systems. Intuitive point-and-click selection of input sequence files, analysis configuration and execution are carried out in the “Analysis” portal (Figure 3; Table 2). In the examples provided here, BLAST, Bowtie 2, TopHat and Cufflinks were integrated and implemented with the OMMS interface [6, 7, 10]. In principle, any open-source application, including in-house pipelines, can be integrated with the OMMS by means of custom scripts developed for that purpose.

Unified framework for metadata management and state-of-the-art analyses. Curation (highlighted in aqua) and analyses (indicated in yellow) tasks are intrinsically related (overlap region) in next-generation sequencing studies, because sample handling and sequence production are multistep processes, and careful metadata tracking and management are required for downstream analyses and publication preparation. The OMMS supports user input of project metadata, automated creation of consistently named and enumerated unique identifiers for specimens, samples and sequence production information, and straightforward integration with bioinformatics utilities. Spreadsheets can be generated for structured data extraction and local download. Standard input and output of executables used here are stored in automatically-generated files and directories.

Omics Metadata Management Software (OMMS) curation and analysis interface. The OMMS was designed to integrate with open-source bioinformatics tools, such as BLAST, Bowtie 2 and TopHat, and/or custom pipelines. These tools are accessed via the “Analysis Portal” (panel A). End users select the identifier (“Sequence Run ID”) of interest, which is referenced to particular sequence files (panel B and inset). Following input selection, the desired program is chosen and parameterized (panel C and inset) to launch a run. Output from a given analysis run can be downloaded via the OMMS “Results” portal (not shown).

Methodology

Creating a record:

The three main portals are displayed after login. Tables for detailed biological curation (“Specimen Info,” “Sample Processing,” “Sequence MetaInfo”) reside in the “MetaData” portal. To create an entry, select “Create New,” then click on “New (Empty fields)” in the “Specimen Info” table, and provide the required information (Figure 1; Table 1). The following values (in parentheses) were entered: Host Species (Homo sapiens); Tissue Sampled (Stool). Click the “Add Specimen” button to generate the “Specimen Unique ID” (HsStoo_01). For the second record, repeat the previous steps, but insert “Pan troglodytes” and “Brain” for the host and tissue, respectively, to generate the “Specimen Unique ID” (PtBrai_02).
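
The identifier pattern visible in the two examples above (HsStoo_01, PtBrai_02) can be sketched as follows. The rule used here (genus initial, species initial, the first four letters of the tissue, and a running two-digit record number) is only inferred from those two examples; the function name and its arguments are hypothetical and are not taken from the OMMS source code.

```python
def make_specimen_uid(host_species: str, tissue: str, record_number: int) -> str:
    """Build a specimen identifier such as 'HsStoo_01' (assumed convention)."""
    genus, species = host_species.split()[:2]
    # Genus initial in upper case, species initial in lower case.
    host_code = genus[0].upper() + species[0].lower()
    # First four letters of the tissue, first letter capitalised.
    tissue_code = tissue.strip()[:4].capitalize()
    # Running record number, zero-padded to two digits.
    return f"{host_code}{tissue_code}_{record_number:02d}"


print(make_specimen_uid("Homo sapiens", "Stool", 1))
print(make_specimen_uid("Pan troglodytes", "Brain", 2))
```

Under these assumptions, the two calls reproduce the identifiers reported in the text.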

Corresponding records are generated in the “Sample Processing” table, where dropdown menus streamline curation by explicitly linking the “Specimen Unique ID” (e.g., HsStoo_01, PtBrai_02) and the user-defined “Sample Alias” with the new sample entries. “Sample Unique ID” entries are generated by clicking “Add Sample” (e.g., HsStoo_01_01, PtBrai_02_01). Four corresponding records were created in the “Sequence MetaInfo” table to complete the curation exercise and to illustrate functional integration with executables (Table 2). In the “Provider Sequence Directory Name” field, arbitrary directory names were given; for “Fastq File Mate Pair 1” and “Fastq File Mate Pair 2,” test input file names were used; and appropriate options were chosen in the “Read Type” field for testing in this order: Bowtie 2 (single-end, then paired-end, with human microbiome stool fastq files), BLAST (with human microbiome stool fasta input), and TopHat and Cufflinks (with the single-end chimp RNASeq file). The “Sequence Run ID” and “Unique Experiment Name” were generated by clicking “Add Sequence.” In each of the tables, the “Update” function can be used to extend curation. Methods for generating and downloading custom metadata tables are further detailed in the “OMMS Integrated Workflow” link (under Quick Start).
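
The example identifiers suggest that each level extends its parent by one two-digit suffix (specimen HsStoo_01, sample HsStoo_01_01, sequence run HsStoo_01_01_01). A minimal sketch of that nesting is given below; the helper names are hypothetical, and the four-part layout is an assumption drawn from the examples rather than a documented OMMS rule.

```python
from typing import NamedTuple


class RunLineage(NamedTuple):
    specimen_uid: str
    sample_uid: str
    sequence_run_id: str


def child_id(parent_id: str, index: int) -> str:
    """Append a zero-padded child index to a parent identifier."""
    return f"{parent_id}_{index:02d}"


def split_run_id(sequence_run_id: str) -> RunLineage:
    """Recover the specimen and sample identifiers embedded in a run ID."""
    parts = sequence_run_id.split("_")
    if len(parts) != 4:
        raise ValueError(f"expected four '_'-separated parts: {sequence_run_id!r}")
    return RunLineage(
        specimen_uid="_".join(parts[:2]),
        sample_uid="_".join(parts[:3]),
        sequence_run_id=sequence_run_id,
    )


sample_uid = child_id("PtBrai_02", 1)
run_id = child_id(sample_uid, 1)
print(split_run_id(run_id))
```

Because the parent identifiers are embedded in the run identifier, a results file named after a run can be traced back to its sample and specimen without a database lookup.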

Enabling integrated workflows:

To call integrated executables via the “Analysis” portal, click on “Select Input” for the relevant Sequence Run ID (e.g., HsStoo_01_01_01), and then choose the desired program (Figure 2). To launch a Bowtie 2 run on “Sequence Run ID” HsStoo_01_01_01, select the “Staphylococcus_aureus” index from the dropdown menu, enter an integer value in “Processors,” and click “Go.” The results file name will appear, and standard output can be downloaded upon run completion. The same steps apply for paired-end analyses with Bowtie 2. For BLAST runs, select the pertinent Sequence Run ID and input (HsStoo_BLAST500.fa), and choose the desired program (blastn) and database (Clostridium kluyveri). Set the significance threshold expectation (E) value at 0.001 or higher, indicate the desired output format in the dropdown menu, and click “Go.” Similar steps are followed for splice-variant and/or differential expression analyses using TopHat and Cufflinks (Figure 3; Table 2). The chimp “Sequence Run ID” was selected (PtBrai_02_01_01, referenced to RNASeq file SRR023838_RNASeq.fq) and aligned with the hg19 index [7]. Standard output can be downloaded via the “Results” portal by choosing the “Sequence Run ID” of interest (e.g., HsStoo_01_01_01). The website provides additional instructions for building integrated analyses (in the “OMMS Integrated Workflow” link under Quick Start).
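
For readers who want to relate the portal choices to the underlying executables, the sketch below assembles command lines equivalent to the Bowtie 2 and blastn selections described above. It is not how the OMMS itself invokes the tools, which the text does not specify. The flags shown are standard Bowtie 2 and BLAST+ options, and the use of BLAST+ is an assumption; the function names, output file names, read file names and the database name "Clostridium_kluyveri" are hypothetical.

```python
from typing import List, Optional


def bowtie2_command(index: str, mate1: str, mate2: Optional[str] = None,
                    processors: int = 1, output_sam: str = "out.sam") -> List[str]:
    """Argument list for a single-end or paired-end Bowtie 2 alignment."""
    cmd = ["bowtie2", "-x", index, "-p", str(processors)]
    if mate2 is None:
        cmd += ["-U", mate1]               # unpaired (single-end) reads
    else:
        cmd += ["-1", mate1, "-2", mate2]  # paired-end mates
    return cmd + ["-S", output_sam]


def blastn_command(query_fasta: str, database: str, evalue: float = 0.001,
                   outfmt: int = 6, output: str = "out.tsv") -> List[str]:
    """Argument list for a BLAST+ blastn search (assumes BLAST+ is installed)."""
    return ["blastn", "-query", query_fasta, "-db", database,
            "-evalue", str(evalue), "-outfmt", str(outfmt), "-out", output]


# Hypothetical read file name; the index name is the one given in the text.
print(" ".join(bowtie2_command("Staphylococcus_aureus", "reads.fq", processors=4)))
print(" ".join(blastn_command("HsStoo_BLAST500.fa", "Clostridium_kluyveri")))
```

Building the command as a list rather than a single string means it can be passed to `subprocess.run` without shell quoting issues.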

Software

Design and function:

Interoperable, open-source software packages (i.e., the LAMP bundle: Linux, Apache, MySQL, PHP) were used to develop the browser-based OMMS interface to support next-generation sequencing-based research efforts in our laboratories [8]. Real-world metadata associated with the test datasets were entered in the three tables, “Specimen Info,” “Sample Processing,” and “Sequence MetaInfo” (Figure 1; Tables 1, 3-5), to instantiate example database records. Most fields in the tables accept free-form values such as character strings (e.g., the “Sample_Alias” field in the “Sample_Processing” table); where fewer values are possible, dropdown menus are provided (e.g., the “Nucleic Acid” field in the “Sample Processing” table).
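
The relationship among the three tables can be illustrated with a small relational sketch. SQLite is used here only so that the example runs without a server, whereas the OMMS itself is built on MySQL. All table names, column names and the permitted dropdown values are illustrative assumptions, not the actual OMMS schema.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; names below are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE specimen_info (
    specimen_uid   TEXT PRIMARY KEY,
    host_species   TEXT NOT NULL,
    tissue_sampled TEXT NOT NULL
);
CREATE TABLE sample_processing (
    sample_uid   TEXT PRIMARY KEY,
    specimen_uid TEXT NOT NULL REFERENCES specimen_info(specimen_uid),
    sample_alias TEXT,                                        -- free-form text
    nucleic_acid TEXT CHECK (nucleic_acid IN ('DNA', 'RNA'))  -- dropdown-like
);
CREATE TABLE sequence_metainfo (
    sequence_run_id TEXT PRIMARY KEY,
    sample_uid      TEXT NOT NULL REFERENCES sample_processing(sample_uid),
    read_type       TEXT CHECK (read_type IN ('single', 'paired'))
);
""")

conn.execute("INSERT INTO specimen_info VALUES (?, ?, ?)",
             ("HsStoo_01", "Homo sapiens", "Stool"))
conn.execute("INSERT INTO sample_processing VALUES (?, ?, ?, ?)",
             ("HsStoo_01_01", "HsStoo_01", "stool test sample", "DNA"))
conn.execute("INSERT INTO sequence_metainfo VALUES (?, ?, ?)",
             ("HsStoo_01_01_01", "HsStoo_01_01", "single"))

# Walk from a sequence run back to the specimen it was derived from.
rows = conn.execute("""
    SELECT q.sequence_run_id, s.host_species, s.tissue_sampled
    FROM sequence_metainfo AS q
    JOIN sample_processing AS p ON p.sample_uid = q.sample_uid
    JOIN specimen_info AS s ON s.specimen_uid = p.specimen_uid
""").fetchall()
print(rows)
```

The `CHECK` constraints play the role of the dropdown menus by restricting a field to a short list of values, while the foreign keys mirror the propagation of identifiers from one table to the next.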

Test datasets for validating installation and benchmarking integrated tools:

Distinct hosts and tissues (Homo sapiens, Pan troglodytes; Stool, Brain) were used to demonstrate automated metadata tracking, storage and functional integration with utilities [9, 10]. Test datasets were obtained from the GenBank Short Read Archive (accessions SRX025177: SRR063480 and SRX008322: SRR023838), and were pre-processed using the NCBI SRA Toolkit, the FASTX-Toolkit and in-house custom scripts. These pre-processing steps are explained in the README file included in the distribution and in the “OMMS Integrated Workflow” link (see the supplementary material for fine details pertaining to curation and pre-processing steps).
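
One pre-processing step implied by the workflow is converting FASTQ reads to FASTA for use as BLAST input. The sketch below is a plain-Python stand-in for that kind of conversion and is not the authors' in-house script; it assumes simple four-line FASTQ records, and the function name and the `max_records` parameter are hypothetical.

```python
from typing import Iterable, Iterator, List, Optional


def fastq_to_fasta(lines: Iterable[str],
                   max_records: Optional[int] = None) -> Iterator[str]:
    """Yield FASTA records from four-line FASTQ records (assumed layout)."""
    record: List[str] = []
    written = 0
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) < 4:
            continue
        header, sequence = record[0], record[1]
        if not header.startswith("@"):
            raise ValueError(f"not a FASTQ header line: {header!r}")
        # Keep the read name and bases; drop the '+' line and the qualities.
        yield f">{header[1:]}\n{sequence}\n"
        written += 1
        record = []
        if max_records is not None and written >= max_records:
            return


example = ["@read1\n", "ACGT\n", "+\n", "IIII\n",
           "@read2\n", "GGCC\n", "+\n", "IIII\n"]
print("".join(fastq_to_fasta(example, max_records=1)), end="")
```

In practice the function would be called on an open file handle, and `max_records` could be used to cut a small test subset from a large run.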

Semi-automated curation and results downloading:

After the minimum required information (indicated by asterisks) is entered for a specimen, the OMMS generates a unique identifier under the “Specimen_UID” field (Figure 1; Tables 1 & 3) describing the subject/host and tissue/microhabitat from which nucleic acid preparations and sequence data will be derived (Figures 1 & 2; Table 1). Unique identifiers are automatically propagated to corresponding fields in the other tables (Figures 1 & 2), intuitively linking specimen, sample and sequence data. Input sequence data files can be uploaded, and results (output) files downloaded, as can metadata for specific entries and custom table-overview spreadsheets (Figures 2 & 3).
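
The spreadsheet download can be pictured as writing selected columns of the metadata records to a delimited text file. The following is a minimal sketch under that assumption; the OMMS export format is not described in the text, and the function name and column names are hypothetical.

```python
import csv
import io
from typing import Dict, List, Sequence


def records_to_csv(records: List[Dict[str, str]], columns: Sequence[str]) -> str:
    """Serialise the chosen columns of metadata records as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(columns),
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)  # missing keys are written as empty cells
    return buffer.getvalue()


specimens = [
    {"Specimen_UID": "HsStoo_01", "Host_Species": "Homo sapiens", "Tissue": "Stool"},
    {"Specimen_UID": "PtBrai_02", "Host_Species": "Pan troglodytes", "Tissue": "Brain"},
]
print(records_to_csv(specimens, ["Specimen_UID", "Host_Species"]), end="")
```

Selecting columns at export time corresponds to the custom, table-overview spreadsheets mentioned above, where only the fields of interest are included.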

Concluding Remarks

The freeware reported here supports standardized, intelligible, automated curation and management of biological metadata, as well as integrated analyses. Recent events, from the outbreak of Ebola Virus Disease in West Africa to the emergence of antibiotic-resistant bacteria (e.g., Clostridium difficile, carbapenem-resistant Enterobacteriaceae), underscore the importance of rigorous metadata curation and management systems in high-intensity scenarios, clinical and otherwise. For our project, the OMMS frontend was foundational for handling metadata inherent to next-generation sequencing-based experiments involving large numbers of samples at a time. In the context of a host microbiome, potential etiologic agents are typically rare and difficult to detect using standard in silico and experimental approaches, and careful metadata curation is crucial both for identifying signal (infectious disease) in the presence of overwhelming noise (background microbiota) and for interpreting results. Looking ahead, the OMMS and user-tailored versions of it are promising, easy-to-use tools for microbiome-centric research, from clinical and public health challenges to new frontiers in agricultural research and development, where handling and tracking hundreds, if not thousands, of samples from diverse subjects at a particular location and time are necessary. Additionally, the OMMS enables development of integrated workflows with state-of-the-art utilities (BLAST, Bowtie 2) and in-house pipelines (e.g., local implementations of Galaxy), facilitating fine-grained comparative analyses, such as strain discrimination (e.g., Zaire vs. Sudan ebolavirus) and microbiome composition and functional profiling.

Supplementary material

Data 1

97320630011165S1.pdf (106.3 KB, pdf)

Acknowledgments

This R & D was supported by the Laboratory Directed Research and Development (LDRD) program at Sandia National Laboratories [under the auspices of the Rapid Threat Organism Recognition (RapTOR) Grand Challenge (LDRD # 142042)]. Sandia National Laboratories are multi-program laboratories managed and operated by Sandia Corporation, a wholly-owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration (contract DE-AC04-94AL85000). We offer our sincerest thanks to the RapTOR team, and Dr. Elisa LaBauve, in particular, for invaluable discussions around software design and development, and Professor Melanie Moses for guidance on manuscript preparation. As ever, I extend my deepest gratitude to Professors Donald O. Natvig and Gavin C. Conant for invaluable, ongoing scientific discussions, editorial assistance and software testing. Sincerest thanks to Steven Arroyo, Brian Nelson, Michael W. Folsom and Mark D. Murton for software testing and website development. We are grateful to the University of New Mexico Center for Advanced Research Computing and Dr. Susan Atlas, Director, for computational resources and technical assistance provided in support of this work.

Footnotes

Citation: Arriaga et al, Bioinformation 11(4): 165-172 (2015)

References

1. Pareek C, et al. J Appl Genetics. 2011;52:413. doi: 10.1007/s13353-011-0057-x.
2. Wruck W, et al. Brief Bioinform. 2014;15:65. doi: 10.1093/bib/bbs064.
3. Rocca-Serra P, et al. Bioinformatics. 2010;26:2354. doi: 10.1093/bioinformatics/btq415.
4. Wolstencroft K, et al. Bioinformatics. 2011;27:2021. doi: 10.1093/bioinformatics/btr312.
5. http://www.labguru.com/
6. Altschul SF, et al. Nucleic Acids Research. 1997;25:3389. doi: 10.1093/nar/25.17.3389.
7. Langmead B, Salzberg SL. Nat Meth. 2012;9:357. doi: 10.1038/nmeth.1923.
8. Bent ZW, et al. Anal Biochem. 2013;438:90. doi: 10.1016/j.ab.2013.03.008.
9. Human Microbiome Project Consortium. Nature. 2012;486:207. doi: 10.1038/nature11234.
10. Trapnell C, et al. Nat Protocols. 2012;7:562. doi: 10.1038/nprot.2012.016.
