xGDBvm is a novel tool for scalable, reproducible, and expandable genome annotation via Web-based interfaces that seamlessly integrate background cloud-based data storage and high-performance computer resources.
Abstract
Genome-wide annotation of gene structure requires the integration of numerous computational steps. Currently, annotation is arguably best accomplished through collaboration of bioinformatics and domain experts, with broad community involvement. However, such a collaborative approach is not scalable at today’s pace of sequence generation. To address this problem, we developed the xGDBvm software, which uses an intuitive graphical user interface to access a number of common genome analysis and gene structure tools, preconfigured in a self-contained virtual machine image. Once their virtual machine instance is deployed through iPlant’s Atmosphere cloud services, users access the xGDBvm workflow via a unified Web interface to manage inputs, set program parameters, configure links to high-performance computing (HPC) resources, view and manage output, apply analysis and editing tools, or access contextual help. The xGDBvm workflow will mask the genome, compute spliced alignments from transcript and/or protein inputs (locally or on a remote HPC cluster), predict gene structures and gene structure quality, and display output in a public or private genome browser complete with accessory tools. Problematic gene predictions are flagged and can be reannotated using the integrated yrGATE annotation tool. xGDBvm can also be configured to append or replace existing data or load precomputed data. Multiple genomes can be annotated and displayed, and outputs can be archived for sharing or backup. xGDBvm can be adapted to a variety of use cases including de novo genome annotation, reannotation, comparison of different annotations, and training or teaching.
INTRODUCTION
The number of sequenced eukaryotic genomes is increasing rapidly due to advances in sequencing technology and cost-effectiveness; for recent lists, see https://gold.jgi.doe.gov (Reddy et al., 2015) and http://www.diark.org/diark (Hammesfahr et al., 2011). However, the pace of data acquisition leads to bottlenecks at both assembly and annotation stages, before the sequence data can be consumed for research. In particular, annotating a novel genome is often challenging due to our incomplete knowledge of what constitutes a gene across a wide range of species, meaning that ab initio gene prediction, although useful, is inadequate (Yandell and Ence, 2012). Full genome annotation typically consists of at minimum (1) optionally repeat masking the genome, (2) splice-aligning transcripts and proteins from related species for evidence-based gene structure prediction, (3) using ab initio gene finding algorithms to annotate possible gene structures, (4) combining the above data sources to create a set of possible gene structures, and (5) filtering the results through quality and/or similarity filters to find the most probable set of structures that represent full-length or near-full-length coding genes. As a result, genome annotation is necessarily a time-consuming and computationally intensive process that combines numerous types of sequence analysis and heuristic prediction, typically relying on well-annotated genomes as a reference and typically resulting in a far from perfect (but arguably useful) draft annotation. A number of groups have published complete computational pipelines for eukaryotic genome annotation (Mungall et al., 2002; Potter et al., 2004; Uberbacher et al., 2004; Cantarel et al., 2008; Foissac et al., 2008; Holt and Yandell, 2011; Specht et al., 2011; Grigoriev et al., 2012; Leroy et al., 2012; Thibaud-Nissen et al., 2013; Hoff et al., 2015). However, these pipelines require considerable expertise to install, configure, troubleshoot, and manage. 
We propose that a “turnkey” genome annotation system could greatly benefit researchers who desire a credible draft genome annotation to facilitate further research, as well as foster comparative genomics, as early as possible in the life of their project. Such a system should have the following attributes (described in more detail below): it should be easy to configure, easy to use, editable, reproducible, scalable, and publishable.
An annotation workflow will necessarily combine a wide range of computational tools whose successful configuration and interoperability would be challenging for the nonspecialist, so ideally it should be available as a precompiled package. A common method for packaging and distributing such a complex system is via a virtual machine (VM), which encapsulates the underlying server operating system, the application software components along with all requisite software dependencies, and configuration settings, all of which are stored (“imaged”) in such a way that they can be copied and launched by means of commonly available virtualization tools and made available to anyone with access to virtual server software, such as KVM (http://www.linux-kvm.org) or VirtualBox (https://www.virtualbox.org). VMs have a number of advantages for complex informatics analysis (Nocq et al., 2013), of which the preinstallation of all required software for complex tasks as well as temporary access to all the computer resources needed for completion of the task are of most practical value for a typical biologist user. Cloud computing platforms such as OpenStack (https://www.openstack.org) and Docker (https://www.docker.com) offer VM and container-based technologies that can be managed, accessed remotely, and readily deployed on commercial cloud-based services such as Amazon Web Services (https://aws.amazon.com). Government-funded consortia such as the iPlant Collaborative (now CyVerse) (Goff et al., 2011) make such virtual platforms readily accessible to individual users via the internet.
Although most genome researchers are familiar with a wide range of online tools to evaluate sequence data, they will not necessarily know how to put them together and configure them appropriately. Ideally, an annotation platform should have a cohesive graphical user interface (GUI) that guides the user through setup, configuration, parameter setting, and status reporting. Importantly, all setup and processing steps should be managed with data sanity checks (for completeness and format), context-dependent menus, error logging and reporting, and help documentation/tutorials.
The ability to edit and improve automated annotation should be built in. This means both the ability to add data once the workflow has completed and the ability to modify individual annotations so that the most critical regions of the genome are well annotated.
With variable parameters and source data sets, automated documentation and simple archiving are essential for ensuring repeatability of the genome annotation process.
With large genomes and large transcript data sets, computations such as spliced alignment can take days or weeks on a typical lab computer, whereas with access to high-performance computing (HPC) resources the process can be completed in a few hours. Many research facilities have such resources, but their use is complex, and they are not necessarily available to every researcher who might be interested.
Once computation is complete, the annotated genome and its input/output files should be available online either to a select community (with password access) or to the research community as a whole, thus placing output data and/or community annotation tools in the hands of the target audience in a timely manner.
With the above attributes in mind, we created a self-contained genome annotation platform, xGDBvm, for use by the research community. We report below our initial release of xGDBvm in the iPlant (CyVerse) Atmosphere cloud infrastructure (http://www.iplantcollaborative.org/ci/atmosphere) as an on-demand virtual server for genome annotation that can be adapted for a wide range of research needs.
RESULTS
Overview of xGDBvm
xGDBvm is a Linux-based platform that accepts genomic and transcript and/or protein sequence inputs and creates a genome annotation that can be displayed in the included, full-featured genome browser, with separate tracks for genome segments, transcript and protein alignments, gene predictions, and repeat masked regions (Figure 1). xGDBvm uses a modified and extended version of the xGDB (Extensible Genome Data Broker) Web platform (Schlueter et al., 2006) written in Perl and PHP, along with a Web server, workflow automation scripts, and executables packaged together as a virtual server and configured for access over HTTP or HTTPS via a GUI. xGDBvm is compact in size, occupying ∼13 gigabytes (GB) of a typical 20-GB VM root partition. Data inputs/outputs are preferably stored on external volumes mounted to the VM, thus alleviating constraints on VM size.
Figure 1.
Overview of xGDBvm as Implemented at CyVerse (iPlant).
xGDBvm is a virtual server environment for gene structure annotation that can be cloned, configured, populated with input data, and run from a Web browser in a few steps, as summarized here.
(A) Log in to the CyVerse Atmosphere Control Panel (https://atmo.iplantcollaborative.org/application) (1) and click to create a new instance (cloned copy) of xGDBvm (2), create a block storage volume for output data, and attach it to the instance (3). Open a Web shell interface (4), accessible from the Control Panel, and type a series of commands to set up and configure the new xGDBvm instance, also mounting the Data Store and the attached volume.
(B) Log in to the CyVerse Data Store cloud storage system (https://de.iplantcollaborative.org/de/) and upload input data files to an input data directory (accessible to the VM) using a batch uploading tool. Naming conventions are used to identify each input type.
(C) Log in to the xGDBvm instance’s GUI using HTTPS via its unique IP address or using a VNC client (1). All subsequent steps are performed using the xGDBvm GUI. Authorize the VM to connect to remote HPC resources via the Agave API (http://agaveapi.co) (2). Configure the path to Data Store inputs and set other parameters including remote job execution (optional). xGDBvm will validate files, return expected outputs, and flag any input file errors (3). Initiate automated workflows and monitor progress (4). The workflow sends some data remotely for processing on HPC resources (https://www.xsede.org/) managed by Agave APIs and processes other files locally using the attached volume as a scratch disk. The xGDBvm workflow waits for HPC outputs and then proceeds with the annotation process. Output data are written to the external volume and can be accessed from the xGDBvm Web browser as GDB001, GDB002, etc. (5). In addition to a fully featured genome browser, xGDBvm includes tools to query, update, reannotate, download, or archive outputs to the user’s Data Store. For details, refer to the xGDBvm wiki (http://goblinx.soic.indiana.edu/wiki/doku.php).
Computational processes in xGDBvm (Figure 2) are managed by automated, user-configurable workflows, with a built-in option for calls to HPC resources. Optional masking of genome segments is performed using Vmatch (Abouelhoda et al., 2002) based on user-provided masking libraries. Spliced alignments of transcripts and proteins to the genome are computed using GeneSeqer (Usuka et al., 2000) and GenomeThreader (Gremme et al., 2005), respectively. xGDBvm optionally creates gene model predictions using CpGAT (Comprehensive Gene Annotation Tool; http://plantgdb.org/AtGDB/cgi-bin//WebCpGAT.pl), a set of scripts and binaries that integrates spliced alignment data and ab initio gene predictions along with BLAST similarity filters and alternative structures to derive a high-quality gene prediction data set. The xGDBvm workflow can also upload precomputed gene predictions from a user-provided GFF3-formatted file. All steps are logged and displayed dynamically during workflow operation. Once the workflow is complete, each feature is displayed as a separate track in a fully featured genome browser complete with search/download tools and tabular feature views. A quality score assigned to each annotated locus facilitates the identification of low-quality models, which can then be reannotated and curated using the built-in yrGATE annotation tool (Wilkerson et al., 2006). Additional genomes can be configured and created with the same VM, and the user can archive and retrieve single or global data sets. Any data type can be appended or replaced using an “update” feature. The outcome is a rich, editable environment for genome exploration and annotation, accessible locally or remotely on the Web (for feature overview, see Table 1).
Figure 2.
Data Process Schema.
Input data types (with standardized names as indicated), computational modules, and outputs are shown. Images are screenshots of color-coded track glyph types (gene models; splice alignments) and track flags (quality scores) displayed in the xGDBvm genome browser.
Table 1. xGDBvm Features.
| Section | Feature | Functions |
|---|---|---|
| Manage | Administrate | Modify password protection; customize site name; administer yrGATE user accounts |
| | Create/configure | Configure new GDB; validate input files; view/edit configuration; initiate, monitor automated workflows; view log files; archive/restore/delete GDB, copy archive to Data Store |
| | Remote jobs | Configure OAuth2 login, job APIs, app IDS; submit jobs; view job status; manage jobs (CyVerse login required) |
| View GDB | GDB home page | GDB summary data; view genome region or search for sequence |
| | Genome context view | View all tracks by genome segment and region; zoom, jump up- or downstream, view nucleotide level alignments |
| View GDB: feature tracks | Gene predictions (loci) | All annotated loci and metadata in tabular view; search/filter queries; yrGATE summaries for each locus; download as .csv |
| | Aligned proteins, aligned transcripts | All spliced alignments in tabular views; search/filter queries; download as .csv |
| | GAEVAL scores | Detailed gene quality scores for each Gene Prediction track; search/filter queries |
| View GDB: tools (genome context view) | Download region | Download any sequence type from region as FASTA; download annotations from region in GFF3 or NCBI format |
| | Download data | Download individual input files, output files (all types), or GDB archive files to the local drive |
| | Search ID or keyword | Search and retrieve FASTA sequence or subsequence (introns, exons, up/downstream) for any feature displayed on GDB |
| | BLAST GDB | Match sequence within GDB |
| | BLAST all GDB | Match sequence across multiple GDB |
| | CpGAT annotate region | Regional gene predictions and quality scores |
| | Add custom track | Add custom track from local GFF3 file |
| | GenomeThreader region | Regional spliced alignment of proteins |
| | yrGATE | Tool for creating/submitting user-contributed annotations; with portals to NCBI ORF finder; NCBI BLAST; GENSCAN; GeneMark; CpGAT |
| | Community central | Searchable list of curated yrGATE (user-submitted) annotations; download annotations (FASTA, GFF3) |
| Annotate | My annotations | Manage user annotations (admin account and login required) |
| | My groups | View group annotations (admin account and login required) |
| | My admin | Curate user-submitted annotations (admin account and login required) |
| Help | Help pages | User instructions and video tutorials; also available as contextual help pop-ups |
| | xGDBvm wiki (external) | Documentation and instructions for users/admins/developers |
| | GitHub repository (external) | Source code; issue tracking; case studies |
Features as implemented on iPlant (CyVerse) Atmosphere cloud service.
xGDBvm-iPlant
We implemented xGDBvm as a VM image on iPlant’s Atmosphere cloud platform (https://atmo.iplantcollaborative.org/application), available to registered life sciences researchers (http://www.iplantcollaborative.org/content/acceptable-use-policy). We further customized the VM taking advantage of iPlant’s data and job execution application programming interfaces (APIs), making xGDBvm a one-stop destination for genome annotation and display. Registered iPlant users can create and configure an xGDBvm instance via the Atmosphere control panel and then access the xGDBvm instance via a Web browser to perform all subsequent tasks: validate inputs, run HPC jobs, initiate local workflows, check progress, and view/edit the resulting genome annotation. The genome browser(s) can be made public or private as desired. The following sections detail xGDBvm’s functionality in its current version on iPlant Atmosphere.
Inputs and Data Processing
Figure 3 diagrams the modular architecture used by xGDBvm at iPlant. For managing inputs, xGDBvm uses iPlant’s Data Store cloud storage service (http://www.iplantcollaborative.org/ci/data-store), which provides high-capacity storage and tools for quickly uploading user data files. During the xGDBvm configuration process, the user’s Data Store home directory is mounted to the VM’s file system using iRODS FUSE (http://irods.org), and files uploaded to the Data Store are thus accessible on the VM using Unix file system commands. For output data (alignment files, GFF3 files, sequence indexes, MySQL database tables, configuration files, and archives), the user can attach a block storage volume to the VM via the Atmosphere control panel and mount it to the VM’s file system. This data partitioning strategy has the advantage that all data outputs are separate from the VM and do not consume its limited storage capacity, while at the same time providing scalability because data transfer for HPC jobs occurs directly with the Data Store. Moreover, the complete xGDBvm display can be reconstituted by mounting the volume to a new xGDBvm instance, which is useful in the event a VM becomes unavailable.
Figure 3.
xGDBvm Architecture.
An xGDBvm VM instance, as hosted on the CyVerse Atmosphere cloud infrastructure (https://atmo.iplantcollaborative.org/application), has separate file system partitions under root (containing the xGDBvm Web GUI, scripts, binaries, and other software) and /home/ (which is configured with mount points for the user’s Data Store home directory for data input and a block storage volume for data output). The Agave API, hosted by the CyVerse Discovery Environment, is used for authentication of the VM via OAuth2 and for management of HPC applications and job submission. A key feature of xGDBvm is the ability to attach and mount the output volume to a different VM and reconstitute the annotation outputs and display. See text for details.
Managing files and ensuring validity of inputs (sanity checks) is a challenge for computational pipelines where multiple inputs of various types and formats may be used. xGDBvm makes use of filename standardization and extensive validation tools to reduce the incidence of input errors. Each input file is required to be named according to its data type and file format, e.g., ∼est.fa for a FASTA file of EST sequences, where “∼” is any user prefix, and all input files are placed in a single directory whose path is saved as a configuration variable. In addition, output files (including copies of input files) are all named according to the same conventions, with the GDB number as a prefix, e.g., GDB001est.fa, and deposited in subdirectories according to their type/process. Once an input path has been specified, xGDBvm displays valid filenames in the input directory according to type, displays predicted output tracks, and alerts the user to any missing files that would compromise output. The user then initiates a script to validate sequence deflines (description lines), error-check IDs, and enumerate file contents either singly or in batch mode (Supplemental Figure 1). File validity metadata are stored along with a unique file stamp, so files need only be validated once unless modified.
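The filename-based sanity checks described above can be illustrated with a minimal Python sketch. Only the `est.fa` suffix and the `GDB001est.fa` output pattern are taken from the text; the other suffixes in `SUFFIX_TYPES`, the function names, and the details of the ID check are assumptions made for illustration and are not xGDBvm's actual validation code.

```python
import io

# Suffix -> data type. Only "est.fa" comes from the text; the other
# entries are hypothetical placeholders for illustration.
SUFFIX_TYPES = {
    "est.fa": "EST sequences",
    "cdna.fa": "cDNA sequences (assumed suffix)",
    "prot.fa": "protein sequences (assumed suffix)",
}


def classify(filename):
    """Return the data type implied by a filename suffix, or None if unrecognized."""
    for suffix, data_type in SUFFIX_TYPES.items():
        if filename.endswith(suffix):
            return data_type
    return None


def output_name(gdb_number, suffix):
    """Build an output filename with the GDB number as prefix, e.g. GDB001est.fa."""
    return f"GDB{gdb_number:03d}{suffix}"


def validate_fasta(handle):
    """Count FASTA records and collect IDs that are empty or repeated."""
    seen, bad_ids, count = set(), [], 0
    for line in handle:
        if not line.startswith(">"):
            continue  # sequence line, not a defline
        count += 1
        fields = line[1:].split()
        seq_id = fields[0] if fields else ""
        if not seq_id or seq_id in seen:
            bad_ids.append(seq_id)
        seen.add(seq_id)
    return count, bad_ids


# Usage: a three-record file in which the ID "a" appears twice.
sample = io.StringIO(">a desc\nACGT\n>b\nGG\n>a\nTT\n")
count, bad_ids = validate_fasta(sample)
```

Keying the checks on the filename suffix means the input type never has to be declared separately, which is presumably why the naming convention is enforced before any content validation runs.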
Supplemental Figure 2 shows the complete, automated workflow for creating and updating a genome annotation. Typical inputs include a genome sequence assembly and a set of transcript sequences (EST, cDNA, or short read/transcript sequence assembly [TSA]) and/or predicted protein sequences, in FASTA format. Depending on availability, transcripts may be from the same or a closely related species (Wang et al., 2008). Protein sequences should be from a well-characterized genome as close as possible taxonomically to the target species. With transcript (EST, cDNA, or TSA) inputs, xGDBvm will compute spliced alignments, according to user-specified or default parameters, using the multithreaded GeneSeqer-MPI spliced alignment program (Usuka et al., 2000) installed locally or on an HPC server with up to 128 cores. For this step, the user can opt to apply repeat masking to the genome sequence using vmktree/vmatch (Abouelhoda et al., 2002) to reduce computation time, with inclusion of a suitable repeat mask sequence library. Alternatively, the user can provide an N-masked genome file as input. For related-species protein inputs, xGDBvm computes spliced alignments using the GenomeThreader program (Gremme et al., 2005) either locally or on an HPC server. Spliced alignments that meet a quality threshold are ultimately displayed in the xGDBvm genome browser as discrete tracks with standard box-line glyphs to indicate exon/intron boundaries (Figure 2). The user can also provide GeneSeqer and/or GenomeThreader output files, created offline, as inputs, bypassing the above steps.
The xGDBvm workflow next uses spliced alignment data as input for CpGAT, which assembles gene model predictions for the genome. CpGAT uses EVM (Evidence Modeler; http://evidencemodeler.github.io) (Haas et al., 2008) to evaluate GeneSeqer transcript alignments and/or GenomeThreader protein spliced alignments, together with ab initio gene finder results from BGF (http://bgf.genomics.org.cn), GeneMark (http://exon.gatech.edu/GeneMark/) (Borodovsky and Lomsadze, 2011), and Augustus (http://bioinf.uni-greifswald.de/augustus/) (Stanke et al., 2006) and derives an optimal set of transcript models that are then BLASTed against a reference protein data set (if supplied by the user). In addition, some PASA (Haas et al., 2003) functions are used to aggregate splice variant models where indicated by evidence alignments. Optionally, the user can request repeat masking of the genome prior to ab initio gene prediction. The output from CpGAT is a set of BLAST-filtered or unfiltered gene model structures for each genome segment, complete with coordinates for start/stop codon and predicted untranslated regions where possible, in GFF3 format, which are loaded to the xGDBvm database. Several CpGAT parameters are user-configurable with the xGDBvm GUI, allowing the user to select species model or bypass ab initio gene finders, relax reference protein BLAST filtering, or request repeat masking, and the complete set of CpGAT parameters can be modified by editing the CpGAT configuration file.
As a final step, xGDBvm calculates the GAEVAL score for each gene model, consisting of a set of statistics representing the degree of congruence of the model with available alignment evidence (http://plantgdb.org/GAEVAL/docs/index.html). GAEVAL also reports alternative splicing evidence and classifies annotation errors into discrete types such as gene fusion, gene fission, etc. GAEVAL data summaries are displayed in xGDBvm as a flag associated with each track glyph (Schlueter et al., 2005).
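GAEVAL's actual statistics are documented at the URL above and are not reproduced here. As a rough illustration of what "congruence with alignment evidence" can mean, the following Python sketch computes one simple, hypothetical measure: the fraction of a gene model's introns that exactly match an intron in at least one spliced alignment. The function names and the coordinate convention (1-based, inclusive) are assumptions, and this is not the GAEVAL scoring formula.

```python
def introns(exons):
    """Derive intron (start, end) intervals from (start, end) exons, 1-based inclusive."""
    ordered = sorted(exons)
    return [
        (left_end + 1, right_start - 1)
        for (_, left_end), (right_start, _) in zip(ordered, ordered[1:])
    ]


def intron_confirmation(model_exons, evidence_alignments):
    """Fraction of model introns matched exactly by an intron in any evidence alignment."""
    model = introns(model_exons)
    if not model:
        return None  # single-exon model: there are no introns to confirm
    supported = set()
    for alignment_exons in evidence_alignments:
        supported.update(introns(alignment_exons))
    return sum(1 for intron in model if intron in supported) / len(model)


# Usage: a three-exon model; one of its two introns is supported by evidence.
model = [(1, 100), (201, 300), (401, 500)]
evidence = [[(50, 100), (201, 250)], [(220, 300), (451, 500)]]
score = intron_confirmation(model, evidence)
```

A model whose introns are largely unsupported would receive a low value under such a measure and could be prioritized for manual reannotation.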
Users can also upload precomputed genome annotations provided as GFF3 file(s) along with optional transcript and translation FASTA files. These data are displayed in the form of a separate annotation track, with GAEVAL scores calculated as described above. If gene descriptions are available in tabular form, these can also be uploaded to augment gene annotation tracks.
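For readers unfamiliar with the GFF3 input mentioned above, the sketch below parses a single GFF3 feature line into its nine standard tab-separated columns. It is a minimal illustration of the file format only; the function name is hypothetical, and it makes no claim about how xGDBvm itself loads GFF3 data.

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict; return None for comments and blank lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    attributes = {}
    for pair in attrs.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            # GFF3 percent-encodes reserved characters in attribute text.
            attributes[unquote(key)] = unquote(value)
    return {
        "seqid": seqid, "source": source, "type": ftype,
        "start": int(start), "end": int(end),
        "score": score, "strand": strand, "phase": phase,
        "attributes": attributes,
    }


# Usage: an mRNA feature linked to its parent gene through the Parent attribute.
record = parse_gff3_line("chr1\tuser\tmRNA\t100\t900\t.\t+\t.\tID=m1;Parent=g1\n")
```

The `ID` and `Parent` attributes are what tie exons and transcripts to a locus, so a precomputed annotation that lacks them cannot be assembled into gene models.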
xGDBvm Setup, Configuration, and Data Processing
xGDBvm was designed to be easy to configure and run (Figure 1). As a supplement to online help and video tutorials (see below), beginning users can consult the xGDBvm wiki (http://goblinx.soic.indiana.edu/wiki/doku.php), which includes step-by-step instructions and information about how to choose the correct VM size and storage capacity for their particular genome annotation needs.
After instance creation, the user accesses the shell via a terminal emulator or the Atmosphere’s built-in shell emulator and types a series of simple commands to configure and password-protect the VM environment. Subsequent steps are accomplished using a Web browser connecting to the VM via HTTPS or by connecting to the VM using a virtual network computing (VNC) client (Atmosphere offers a built-in VNC window as well). xGDBvm’s hierarchical user interface is organized by task type, i.e., “Manage,” “View,” “Annotate,” and “Help,” with submenus under each section. Under “Manage” are “Admin” (manage site passwords, admin emails, and yrGATE users), “Configure/Create” (create or update a genome browser), and “Remote Jobs” (configure and manage remote HPC jobs; see next section). End-user-oriented sections include “View” (browse/analyze genomes) and “Annotate” (submit/manage user annotations). Each section and subsection includes a “Getting Started” page that outlines the suggested workflow along with key links and one or more “Help” pages with detailed documentation, including video tutorials that can be viewed on the VM. Contextual pop-up help dialogs are also provided for each page/step.
Under Manage → Configure/Create, a user can check volume capacity of the VM, manage license keys for certain installed software, and then consult a decision tree to guide them to the correct data sources, a table of file name conventions, and a guide to CpGAT annotation. Once the data files are in place, the user clicks “Create New GDB,” selects a file path pointing to the data input files, enters any nondefault parameters as well as genome metadata, and then saves the configuration setup, which is assigned “Development” status and an ID (GDB001, etc.) that will be associated with the output database (Figure 4A). The user can now click to validate file contents as described above. To initiate data processing, the user selects “Data Process Options” followed by “Create GDB,” which changes status to “Locked,” initiates the central data processing workflow, and displays a running report of progress together with any errors. The workflow can be aborted at any time by clicking the “Abort” button under “Data Process Options”; this removes all dynamically created directories and kills all associated processes, returning the configuration to “Development” status. On successful workflow completion, GDB status is changed to “Current” and the new genome is added to the “View” menu structure. Input data sets, annotation statistics, and output data sets can be viewed online. Output errors are logged and displayed to the user along with context-specific help dialogs (Supplemental Figure 3).
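The status changes described above amount to a small state machine. The Python sketch below models only the three transitions named in the text (“Create GDB,” “Abort,” and successful completion); the class and action names are hypothetical, and other operations such as updating or archiving a GDB are deliberately left out.

```python
from enum import Enum


class Status(Enum):
    DEVELOPMENT = "Development"
    LOCKED = "Locked"
    CURRENT = "Current"


# (status, action) -> next status, following the text: "Create GDB" locks the
# configuration, "Abort" returns it to Development, and successful completion
# marks it Current.
TRANSITIONS = {
    (Status.DEVELOPMENT, "create"): Status.LOCKED,
    (Status.LOCKED, "abort"): Status.DEVELOPMENT,
    (Status.LOCKED, "complete"): Status.CURRENT,
}


class GDBConfig:
    """A saved configuration, identified as GDB001, GDB002, etc."""

    def __init__(self, number):
        self.gdb_id = f"GDB{number:03d}"
        self.status = Status.DEVELOPMENT  # new configurations start in Development

    def apply(self, action):
        """Apply an action, rejecting any transition not listed in TRANSITIONS."""
        key = (self.status, action)
        if key not in TRANSITIONS:
            raise ValueError(f"{action!r} is not allowed while {self.status.value}")
        self.status = TRANSITIONS[key]
        return self.status
```

Making the “Locked” state explicit is what prevents a configuration from being edited, or a second workflow from being started, while processing is under way.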
Figure 4.
xGDBvm Data Management.
(A) Screenshot of the GDB Configuration page, set up for processing Example data. Each genome annotation is assigned a unique identifier (GDB001, GDB002, etc.) and a user-provided name. In addition to form fields for input data path, annotation parameters, and metadata, this page provides extensive color-coded information about all system settings (e.g., license keys, storage capacity, and login status, displayed in blue-green), input data validity (light green), and expected output (orange). The form includes buttons that launch modal windows to initiate computational workflow or edit configuration.
(B) Screenshot of Archive/Delete menu, showing genome databases with “Current” (blue; computation complete) or “Development” (gray; not yet run) status. Genome annotations are identified as GDB001, GDB002, etc. Each table row displays information about a GDB including time stamps as well as action buttons that allow the user to drop, delete, archive, delete archive, or copy database (see text for details). Global action buttons (top right) allow the user to delete or archive all data on the VM.
(C) Screenshot of “List All Jobs” page with tools to monitor and manage remote HPC jobs. The page displays IDs, job metadata, time stamps, color-coded status indicators, and action buttons to manage output (Stop Job, Delete Job, View Logs, Copy Output) via the Agave API. See text for details.
Any of several lightweight, preconfigured sample data sets (Supplemental Figure 4) can be loaded with a single button click from the “Create New” page and then saved and processed to a finished GDB in no more than a few minutes. Because these examples cover the complete range of processes and workflows in the xGDBvm code, they also serve as functional tests when first setting up an xGDBvm instance or modifying its code.
High-Performance Computing Option
On multiprocessor VMs, xGDBvm automatically invokes parallel processing where possible, for certain computational steps (Supplemental Figure 1). This can speed up spliced alignment and genome annotation (CpGAT) jobs, in that more than one genome segment can be evaluated concurrently on separate processor threads. As an alternative for even more processing power, xGDBvm is capable of sending input data for spliced alignment jobs to high-performance computing facilities, either as a standalone job or as part of an annotation workflow. For this option, the user’s input data must be on a VM-mounted iPlant Data Store directory and assigned to a GDB with “Development” status. GeneSeqer-MPI and GenomeThreader binaries, along with wrapper scripts for job submission to an HPC server, are installed in iPlant’s Discovery Environment (https://de.iplantcollaborative.org/de/) as executable apps. Client access to HPC resources and apps is managed via the Agave API (Dooley et al., 2012), http://agaveapi.co, which provides an open-source platform for interacting with computational resources that are managed under the XSEDE system (https://www.xsede.org/). xGDBvm uses Agave’s implementation of the OAuth2 (http://oauth.net) standard for authorization and subsequent authentication to use apps. Under Manage → Remote Jobs, users first submit their iPlant user name/password in return for OAuth2 credentials that are stored securely on the VM and allow access to remote applications (GeneSeqer-MPI and GenomeThreader). The user can then log in and obtain a temporary access token and refresh token for authentication. The VM-cached refresh token is also used by local scripts to reauthenticate API access during automated workflow processing. The user can select the app size (i.e., number of processors) for optimal efficiency given their genome size and complexity and then return to the GDB Configuration page, select the “remote” option for spliced alignment, and initiate the automated workflow. 
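The role of the cached refresh token during long-running workflows can be sketched as follows. This is a generic illustration of the OAuth2 access-token/refresh-token pattern, not Agave's client code: `TokenCache` is a hypothetical name, and `refresh_fn` stands in for whatever HTTP request actually exchanges a refresh token for new credentials, so no real endpoint or API call is shown.

```python
import time


class TokenCache:
    """Hold an access token and renew it with a refresh token when it nears expiry."""

    def __init__(self, refresh_fn, refresh_token, clock=time.time):
        # refresh_fn: callable taking a refresh token and returning
        # (access_token, new_refresh_token, lifetime_in_seconds).
        self._refresh_fn = refresh_fn
        self._refresh_token = refresh_token
        self._clock = clock
        self._access_token = None
        self._expires_at = 0.0

    def get(self, margin_s=30):
        """Return a valid access token, refreshing if missing or within margin_s of expiry."""
        if self._access_token is None or self._clock() >= self._expires_at - margin_s:
            access, refresh, lifetime_s = self._refresh_fn(self._refresh_token)
            self._access_token = access
            self._refresh_token = refresh
            self._expires_at = self._clock() + lifetime_s
        return self._access_token
```

Because the workflow script asks the cache for a token before each API call, a job that outlives the access token's lifetime can continue without the user logging in again.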
The xGDBvm workflow script copies relevant input data (genome, transcript, and/or protein) to a temporary directory on the user’s mounted Data Store directory and issues a job submission command via cURL (https://curl.haxx.se) to a custom wrapper script (Figure 3). The wrapper script accepts parameters, splits and indexes input files as appropriate for multiple processors, and then issues a command to launch GeneSeqer-MPI or GenomeThreader on the specified HPC server cluster. The xGDBvm workflow updates remote job status periodically using a callback URL to xGDBvm and/or an email notification service. Output data are copied to a specified subdirectory on the user’s Data Store directory where xGDBvm’s workflow can access them for further processing. Remote job details and status are tracked by xGDBvm, and users can access job lists, query remote job status, and kill a remote job using the Manage → Remote Jobs GUI (Figure 4C).
Remote GeneSeqer or GenomeThreader spliced alignment jobs can also be run as a standalone process via Manage → Remote Jobs. Output is archived on the user’s Data Store directory, and xGDBvm can be directed to evaluate the output and copy output files to an input directory for inclusion in workflow processing.
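The text does not specify how the wrapper script divides input files among processors. As one plausible approach, the Python sketch below reads FASTA records and distributes them over a fixed number of bins, always adding the next-longest sequence to the bin with the fewest residues so far. The function names and the greedy balancing strategy are assumptions for illustration only.

```python
def read_fasta(lines):
    """Yield (defline, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_records(records, n_parts):
    """Greedily assign records, longest first, to the bin with the fewest residues."""
    bins = [[] for _ in range(n_parts)]
    sizes = [0] * n_parts
    for header, seq in sorted(records, key=lambda rec: len(rec[1]), reverse=True):
        smallest = sizes.index(min(sizes))
        bins[smallest].append((header, seq))
        sizes[smallest] += len(seq)
    return bins


# Usage: three sequences of lengths 6, 2, and 3 split across two processors.
records = list(read_fasta([">a\n", "AAAA\n", "AA\n", ">b\n", "CC\n", ">c\n", "GGG\n"]))
bins = split_records(records, 2)
```

Balancing by residue count rather than by record count matters because alignment time scales with sequence length, so one very long sequence would otherwise leave the remaining processors idle.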
Logging/Troubleshooting
Each step in xGDBvm’s computational workflow script (Supplemental Figure 2) is displayed dynamically during automated workflow operation and saved in a process log. Common errors (e.g., mismatch in data input/output, incorrect format, duplicate IDs) are flagged and logged in an error file, along with user hints to remedy the problem (Supplemental Figure 3). A separate file is created for logging CpGAT progress.
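The kind of input check that flags duplicate IDs or malformed files can be sketched in Python as follows. This is an illustration only: the function name `validate_fasta` and the wording of the messages are hypothetical and do not reproduce xGDBvm's validation scripts.

```python
from collections import Counter


def validate_fasta(text):
    """Return a list of human-readable problems found in FASTA text."""
    problems = []
    ids = []
    seen_header = False
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            seen_header = True
            fields = line[1:].split()
            if not fields:
                problems.append(f"line {lineno}: header has no ID")
            else:
                # The ID is taken to be the first word of the header.
                ids.append(fields[0])
        elif not seen_header:
            problems.append(f"line {lineno}: sequence data before first header")
    for seq_id, count in Counter(ids).items():
        if count > 1:
            problems.append(f"duplicate ID '{seq_id}' appears {count} times")
    return problems


# Illustrative input containing a repeated ID.
report = validate_fasta(">a\nACGT\n>b\nGG\n>a\nTT\n")
```

Returning a list of messages, rather than stopping at the first problem, lets a log file present all detected issues together with hints for fixing them.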
Outputs and Data Analysis Tools
xGDBvm displays the output of workflow processing as schematized glyphs, organized into color-coded tracks, in a full-featured genome browser (Figure 5). Standard tracks include EST, cDNA, TSA, and protein spliced alignments; precomputed and CpGAT gene predictions; and regions that have been repeat masked or assigned as spacer regions (N-substituted). Additional user-generated tracks include yrGATE annotations and region-specific CpGAT annotations. Advanced users can create unlimited additional tracks by manually populating new data tables and modifying configuration files. The xGDBvm genome browser has track features similar to those currently available at http://plantgdb.org (zoom/scroll, show/hide or reorder tracks, change font size, view base pair level). The genome browser also includes a suite of analysis tools including search and retrieve for sequence or subsequence regions (introns, exons, up/downstream regions), NCBI-BLAST for sequence queries within or across genomes, region-specific GenomeThreader and CpGAT tools, and the ability to add a custom track from a local GFF file. Complementing the Genome Context View are searchable, tabular views for each Feature Track type ordered by genome position. The Gene Models table displays annotated loci along with structural metadata, similarity descriptions, GAEVAL gene quality/coverage, and yrGATE annotation status (see below). The Aligned Proteins and Aligned Transcripts tables display splice-aligned sequences of each type with filters for alignment quality/coverage and links to alignment details. A separate page for GAEVAL Scores displays comprehensive gene quality data based on comparison of gene predictions with alignment evidence and offers multiple search filters.
Figure 5.
Genome Context View.
Shown is a typical region from the C. rubella genome annotation described in Results. Genome span is shown in yellow, and genome features (tracks) are as labeled to the left and above each track. Drag-and-drop reorder and “hide track” features are implemented here. Top bar provides search and navigation controls; left bar contains links to tools and views, as well as to configuration and help pages. Region submenu (orange) contains zoom/scroll, region-specific tools, and formatting controls. See Table 1 for details of xGDBvm tools and features.
All inputs, outputs, and archives (see below) are stored hierarchically under /xGDBvm/data/GDBnnn/data/, and they are also available for download to local storage using the VM’s GUI (View → GDBnnn → Data Download). Using this download service, the user could, for example, retrieve GFF-formatted annotation outputs from CpGAT for use in further analysis or display on a different genome browser. Data files can also be copied to the Data Store either manually or by creating and copying a GDB Archive (see below).
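As an example of further analysis on a downloaded GFF file, the Python sketch below tallies feature types in GFF3 text. It relies only on the general GFF3 layout (nine tab-separated columns, comment lines beginning with "#"); the sample records, including the "CpGAT" source label and the feature IDs, are invented for illustration and are not taken from xGDBvm output.

```python
from collections import Counter


def count_feature_types(gff3_text):
    """Count GFF3 records by feature type (column 3)."""
    counts = Counter()
    for line in gff3_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # skip lines that are not well-formed GFF3 records
        counts[fields[2]] += 1
    return counts


# Illustrative GFF3 content: one gene with one transcript and two exons.
gff3 = (
    "##gff-version 3\n"
    "scaffold_1\tCpGAT\tgene\t100\t900\t.\t+\t.\tID=g1\n"
    "scaffold_1\tCpGAT\tmRNA\t100\t900\t.\t+\t.\tID=g1.t1;Parent=g1\n"
    "scaffold_1\tCpGAT\texon\t100\t400\t.\t+\t.\tParent=g1.t1\n"
    "scaffold_1\tCpGAT\texon\t600\t900\t.\t+\t.\tParent=g1.t1\n"
)
counts = count_feature_types(gff3)
```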
Updating or Adding Tracks
In cases where the user may wish to append or replace data, xGDBvm includes an “Update” branch to the data workflow allowing any track to be appended or replaced. The user sets an “Update” flag on the configuration page, specifies a directory where update data resides, and selects the data type(s) and update action(s) desired. The user then clicks “Update,” which adds or replaces data inputs and reruns appropriate scripts to update the genome data tables, indices, and display. All update actions are logged in the same way as a new GDB, appended to the same process log.
The xGDBvm wiki (http://goblinx.soic.indiana.edu/wiki/) includes complete instructions for adding additional annotation or alignment tracks beyond the five standard tracks available. Users familiar with MySQL and the necessary computational steps can completely customize an instance of xGDBvm, using precomputed data as inputs.
Managing xGDBvm Data Sets
Output data sets can be managed on the Manage → Config/Create → Archive/Delete page (Figure 4B). For archiving a GDB, the entire output directory tree is compressed as a tar archive and stored in an archive directory under /xGDBvm/data/ArchiveGDB/, and the archive can be copied to the user’s Data Store with a single click. If the corresponding GDB is later dropped (see below) or becomes corrupted, the archive can be readily restored using the “Restore from Archive” button. GDB archives also facilitate sharing data with other researchers, who can use the “Restore from Archive” function to load any archive to their own VM. In addition, all GDB can be archived together using the “Archive All” function. Any “Current” xGDBvm database can be discarded using the “Drop” button. This removes all GDB-associated directories and their output data but preserves the GDB ID and its stored configuration data, allowing users to build on the previous configuration or restore (see above) a GDB. Finally, the most recently added GDB can be deleted using “Delete,” or all GDB can be deleted using “Delete All.”
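The archive and restore operations amount to packing and unpacking a directory tree. A minimal Python sketch of that idea is given below; it is not xGDBvm's implementation. The function names, the use of gzip compression, and the temporary directory layout are assumptions for illustration.

```python
import tarfile
import tempfile
from pathlib import Path


def archive_gdb(gdb_dir, archive_dir):
    """Compress a GDB output directory tree into <archive_dir>/<name>.tar.gz."""
    gdb_dir, archive_dir = Path(gdb_dir), Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    archive_path = archive_dir / f"{gdb_dir.name}.tar.gz"
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(gdb_dir, arcname=gdb_dir.name)
    return archive_path


def restore_gdb(archive_path, data_dir):
    """Unpack an archive created by archive_gdb under data_dir."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Safer extraction on Python versions that support filters.
            tar.extractall(data_dir, filter="data")
        else:
            tar.extractall(data_dir)
    return data_dir / Path(archive_path).name[: -len(".tar.gz")]


# Round trip in a temporary directory with a made-up GDB tree.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "data" / "GDB001"
    (src / "data").mkdir(parents=True)
    (src / "data" / "genes.gff3").write_text("##gff-version 3\n")
    archive = archive_gdb(src, Path(tmp) / "ArchiveGDB")
    restored = restore_gdb(archive, Path(tmp) / "restored")
    restored_text = (restored / "data" / "genes.gff3").read_text()
    archive_name = archive.name
```

Storing the tree under its own directory name inside the archive means the same file can be unpacked on another VM without knowing the original absolute path.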
Reannotating with yrGATE
A key feature of xGDBvm is the ability to flag low-quality gene structures and improve them in place by manual reannotation. For each genome displayed on xGDBvm, the “Gene Models” page provides filters to select high-coverage/low-integrity models (based on GAEVAL quality score and coverage) that might be improved by manual inspection (Figure 6A). Users can create an annotation login account and correct, confirm, or disqualify any gene prediction using the yrGATE annotation tool (Wilkerson et al., 2006; Figure 6B). The yrGATE tool offers point-and-click simplicity for building a gene structure, enhanced by dynamic reporting of GAEVAL scores to guide the user to the best possible model based on evidence alignments. yrGATE includes curation tools for users who are assigned Administrator status, providing a quality check for submitted annotations prior to their display. All reannotation and curation steps are performed in a single browser window with portals to NCBI BLAST and other analysis tools, and users can manage their own annotations (save, submit for curation, delete) on the “Community Central” pages. Administrative features include the ability to assign users to annotation working groups, track annotation totals for each user, and configure one or more email addresses for administrative notification. Once curated, yrGATE annotations are displayed as a separate track in the xGDBvm genome browser with color-coding to indicate reannotation class (Figure 6B), and these can be downloaded in GFF3 or FASTA format.
Figure 6.
Gene Model Improvement Using yrGATE.
(A) A published gene model from C. rubella (Carubv1011418m.g) showing high coverage/low integrity in the Locus Table (upper table, highlighted columns).
(B) Corresponding gene model in genome context view (blue glyph). CpGAT annotated this region as two distinct loci (magenta glyph), backed up by both Arabidopsis protein (black) and cDNA (light blue). The region was then reannotated using yrGATE (dark and light green glyphs) to confirm the most probable gene structure of this region based on available evidence. yrGATE glyphs are color-coded according to the type assigned by the annotator, e.g., dark green (improved structure) and light green (new structure not previously annotated).
Benchmarking xGDBvm
Whole-Genome Annotation
Capsella rubella is an Arabidopsis thaliana relative with a sequenced genome totaling 134.8 Mb (Slotte et al., 2013). We evaluated xGDBvm as a tool for new genome annotation using the C. rubella genome assembly (see Methods for sequence sources and parameters). We obtained both Arabidopsis cDNA sequences and Arabidopsis predicted proteins as input for evidence alignments. We first computed high-quality transcript and protein spliced alignments using the standalone HPC job submission tool in an xGDBvm instance at iPlant. The GeneSeqer-MPI job (8 processors with 64 threads) and GenomeThreader job (2 processors with 12 threads) finished in 7 h and 1 h, respectively. These outputs were used as input for an annotation workflow (with CpGAT option selected) in xGDBvm. The CpGAT reference data set was the entire set of UniRef90 Viridiplantae proteins (see Methods). In addition, the C. rubella annotation data set (in GFF3 format) was uploaded to xGDBvm for comparison. The annotation of 873 scaffolds was completed in ∼12 d on a single core processor VM with 4 GB RAM. The results are shown in Table 2. xGDBvm completed 49,947 cDNA spliced alignments and 28,595 protein spliced alignments. The CpGAT annotation generated 25,498 gene models, compared with 28,447 gene models from the published C. rubella annotation. A total of 4368 loci from the published annotation had no match in the CpGAT set (as determined by overlap), while 861 loci were unique to CpGAT. Comparison of 19,892 loci with gene models from both CpGAT and the published annotation using ParsEval (Standage and Brendel, 2012) revealed a high level of congruence between the two data sets. More than 60% of the gene models compared had identical coding sequences. At the level of individual exons, the sensitivity (true positive rate) was 69% and the specificity (true negative rate) was 68%, or 89 and 88%, respectively, if restricted to coding exons. 
At the level of individual nucleotides, the sensitivity and specificity were 97 and 96%, respectively. These data demonstrate the reliability of CpGAT as a workflow for producing a provisional genome annotation (our purpose is not to present a detailed comparison of these two annotations; the respective evidence alignment data sets and thresholds were likely not identical, making such detailed analysis complex).
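Matching loci between two annotation sets "by overlap" can be sketched as follows. This Python example is illustrative and does not reproduce the method used for Table 2: the coordinates are invented, strand is ignored, and any shared base counts as a match, all of which are assumptions.

```python
def overlaps(a, b):
    """True if two loci (seqid, start, end) share at least one base.

    Coordinates are 1-based and inclusive, as in GFF3.
    """
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def unmatched(query, reference):
    """Return the loci in query that overlap no locus in reference."""
    return [q for q in query if not any(overlaps(q, r) for r in reference)]


# Made-up loci for two annotation sets on the same assembly.
published = [
    ("scaffold_1", 100, 500),
    ("scaffold_1", 900, 1200),
    ("scaffold_2", 50, 300),
]
cpgat = [
    ("scaffold_1", 450, 800),
    ("scaffold_2", 400, 700),
]

published_only = unmatched(published, cpgat)
cpgat_only = unmatched(cpgat, published)
```

The all-against-all comparison is quadratic in the number of loci; for whole-genome sets, sorting loci by position or using an interval index would be the usual refinement.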
Table 2. Annotation of the C. rubella Genome.
| Genome Segments | Total Length (bp) | Arabidopsis cDNA Spliced Alignments: Total | Arabidopsis cDNA Spliced Alignments: Cognate (b) | Arabidopsis Protein Spliced Alignments | CpGAT Gene Predictions: Transcripts | CpGAT Gene Predictions: Loci | CpGAT Gene Predictions: Questionable (c) | Published Gene Predictions (a): Transcripts | Published Gene Predictions (a): Loci | Published Gene Predictions (a): Questionable (c) |
|---|---|---|---|---|---|---|---|---|---|---|
| 853 | 134,834,574 | 49,947 | 44,870 | 34,629 | 25,498 | 22,698 | 254 | 28,447 | 26,521 | 558 |
See also http://goblinx.soic.indiana.edu/GDB002/ for data display and download.
(a) Source: ftp://ftp.jgi-psf.org/pub/compgen/phytozome/v9.0/Crubella/annotation/Crubella_183_gene.gff3.gz.
(b) The single location with the best alignment score for a given query sequence.
(c) Less than 75% integrity score and greater than 75% coverage based on GAEVAL analysis (see Methods).
Reannotation of Low-Quality Predictions
We evaluated GAEVAL gene quality for the C. rubella annotation data set on a locus basis by setting a locus table filter for average integrity <75% and coverage >75%. This filter resulted in 254 questionable loci with likely annotation errors for CpGAT models compared with 558 questionable models in the published annotation set (Table 2). This subset represents models for which reannotation has a high probability of improving gene prediction via the yrGATE tool. We chose an example of a locus from the published annotation that was flagged by GAEVAL as possibly erroneous, Carubv1011418m.g (Figure 6). The CpGAT annotation for this region was split into two distinct, complete gene structures, identified as scaffold_1.g5.t1 and scaffold_1.g6.t1. Using the yrGATE tool, we confirmed the CpGAT models as more accurately representing the evidence alignments (dark and light-green tracks in Figure 6B).
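The locus table filter described above (integrity below 75%, coverage above 75%) can be expressed as a short Python sketch. The record layout, field names, locus IDs, and scores here are invented for illustration and are not GAEVAL output.

```python
def questionable(loci, max_integrity=0.75, min_coverage=0.75):
    """Select loci with integrity below and coverage above the thresholds."""
    return [
        locus
        for locus in loci
        if locus["integrity"] < max_integrity and locus["coverage"] > min_coverage
    ]


# Made-up scores expressed as fractions between 0 and 1.
loci = [
    {"id": "locus_A", "integrity": 0.92, "coverage": 0.88},  # well supported
    {"id": "locus_B", "integrity": 0.40, "coverage": 0.95},  # flagged
    {"id": "locus_C", "integrity": 0.30, "coverage": 0.10},  # too little evidence
]
flagged_ids = [locus["id"] for locus in questionable(loci)]
```

Requiring high coverage alongside low integrity restricts the list to models where enough alignment evidence exists to guide a manual correction.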
Genome Region
Another use for xGDBvm is to annotate a genome segment containing a specific gene or region of interest. This would typically be a rapid turnaround analysis compared with whole-genome analysis and thus could be performed using internal computing resources, possibly repeatedly under different parameter regimes. As an example, we used a Setaria italica predicted protein, annotated as “stem-specific protein TSJT1-like” as a tBLASTn query against the Musa acuminata subsp malaccensis whole-genome sequence data in GenBank. We retrieved a contig (839) that contained a region of high similarity to this sequence (see Methods). We then configured xGDBvm inputs consisting of Musa genomic contig 839, the current M. acuminata EST data set from GenBank, and the predicted protein translations from the annotated genome of a related monocotyledonous plant species Brachypodium distachyon (http://www.brachypodium.org). The workflow included gene prediction using CpGAT with UniRef90 proteins from Viridiplantae as a reference data set (see Methods). The CpGAT output included 4 evidence-based loci and 12 ab initio predicted genes, including a model fully supported by transcript alignment in the region with high similarity to XP_004977556 (Supplemental Figure 5).
xGDBvm Implementation
iPlant
xGDBvm has been deployed as a public image on iPlant’s Atmosphere Cloud Service (https://atmo.iplantcollaborative.org/application). Researchers can launch an xGDBvm instance and explore it once they have obtained an iPlant user account (https://user.iplantcollaborative.org/register/) using an institutional email address. An iPlant account also grants the user a home page on iPlant’s Data Store. Step-by-step instructions for setting up xGDBvm, available at http://goblinx.soic.indiana.edu/wiki/doku.php?id=user_instructions, can be summarized as follows: (1) In the Atmosphere Control Panel, find the latest xGDBvm image, launch an instance, and attach an external block storage volume using drag-and-drop; (2) access the instance’s secure shell using iPlant credentials and type simple commands to update xGDBvm code, set a Web password, initialize iRODS/FUSE, mount external storage, and launch a configuration script; and (3) access the VM’s GUI via HTTPS or VNC and follow instructions there to configure/create a genome annotation.
Indiana University
xGDBvm has also been implemented on a “production” virtual server at Indiana University, serving as a host for PdomGDB (http://goblinx.soic.indiana.edu/PdomGDB), a genome database for Polistes dominula (European paper wasp), as well as the test data sets described here (see Data Access). PdomGDB provides a showcase for the xGDBvm platform, including the addition of extra nonstandard feature tracks created using methods outlined in the xGDBvm wiki (http://goblinx.soic.indiana.edu/wiki/doku.php?id=configure_new_track). PdomGDB is actively being updated by the P. dominula research community using the yrGATE tool for contributing expert-curated gene annotations, as described in this manuscript (accepted submissions are accessible at http://goblinx.soic.indiana.edu/yrGATE/GDB001/CommunityCentral.pl). This website also includes general information on the xGDBvm project on the project home page (http://goblinx.soic.indiana.edu/index.php).
Public Repository
The xGDBvm project maintains a presence at http://brendelgroup.github.io/xGDBvm/. The xGDBvm-specific software can be accessed and updated from https://github.com/BrendelGroup/xGDBvm, where developers can contribute via pull requests, and users can screen pending issues and report new ones. xGDBvm is licensed under GNU General Public License, version 3. The repository includes case studies that illustrate real-world projects implemented using xGDBvm (https://github.com/BrendelGroup/xGDBvm/tree/master/case-studies/).
DISCUSSION
xGDBvm’s Utility
As an all-in-one solution to genome annotation and analysis, xGDBvm is unique among currently available packages. Configured as a virtual server with a complete GUI interface and HPC capabilities, xGDBvm removes barriers to entry imposed by extensive software installation, testing and troubleshooting, and command-line operation. The xGDBvm GUI guides inexperienced users by presenting only actionable choices and instructions at each step, as well as providing preinstalled sample data sets, input data validation, error flagging, and extensive help pop-ups. Data management is handled entirely within the xGDBvm environment, allowing the user to focus on the overall annotation task rather than managing intermediate input/output files. The resulting website can be either public or password-protected as desired, and the contents can be archived, shared, or exported for display using other genome display platforms. We expect that this combination of features will make xGDBvm attractive to research groups with a desire to annotate genome data but limited access to informatics support.
There are several use cases for xGDBvm in its current implementation at iPlant: (1) Researchers with a newly assembled genome who can quickly align relevant transcript assembly and/or protein data to determine probable gene location and then perform gene structure computation on either a portion of the genome or the genome in its entirety, resulting in a “first pass” genome annotation; (2) researchers with a recently annotated genome who wish to share it and improve annotation quality via community annotation; (3) researchers who wish to create their own copy of a “finished” genome annotation in order to run gene quality analyses with up-to-date transcript data, and/or carry out targeted or general reannotation; and (4) instructors desiring a hands-on environment for exploring the principles of genome annotation with real data and access to HPC resources.
In scope, xGDBvm provides an easy-to-use and versatile platform for annotating and analyzing genomes at various stages of completion. At one extreme, a finished genome can be loaded from data files available online, giving the user complete freedom to analyze and reannotate genes previously published. At the opposite extreme, a newly assembled genome can be loaded together with related-species data and/or short read assemblies, and CpGAT can be invoked to automatically build a credible draft genome annotation for further analysis. With any implementation, the powerful built-in tools for gene quality analysis and reannotation make xGDBvm a valuable asset for improving genome structure annotation as well.
Another advantage of xGDBvm is its flexibility, as it allows multiple genome views to be created in one instance and supports updates to any type of existing data. Finally, xGDBvm provides extensive documentation of the annotation and update process, important both for troubleshooting and for reporting results.
Comparison to Similar Tools
Other cloud-based annotation tools are available: Maker (http://www.yandell-lab.org/software/maker.html) is a eukaryotic genome annotation pipeline that can be installed in a variety of server environments (Cantarel et al., 2008), and a version of Maker (Maker-P) is installed at iPlant Atmosphere as a virtual machine with links to HPC (https://pods.iplantcollaborative.org/wiki/display/sciplant/MAKER-P+at+iPlant). The Web-based genome analysis platform Galaxy (Goecks et al., 2010) offers cloud installation via Amazon’s Elastic Compute Cloud (EC2) service (https://aws.amazon.com/ec2/). xGDBvm differs from these tools in that it offers a comprehensive package combining a structured environment for data inputs, automated data processing with sanity checks, and tools for genome display, search, and reannotation built in.
Limitations
As currently configured, xGDBvm is unable to map short read data onto a genome, so users will need to assemble short reads de novo, prior to submitting data to xGDBvm as a TSA data set. xGDBvm’s computational workflow can currently accommodate only one track per spliced alignment data type (EST, cDNA, TSA, protein) and two tracks for gene model predictions. Users who require additional tracks must configure them manually. xGDBvm’s HPC processes are currently limited to spliced alignment computations, whereas gene structure annotation via CpGAT is limited by the processing power of the VM.
VM availability and usage at iPlant, as well as access to HPC resources, can be expected to be limited based on overall capacity and the amount of demand on the respective systems. Users wishing to increase their usage quotas may be required to justify their request.
Future Directions
xGDBvm is still being developed and improved. The road map includes additional features, such as modular data workflows allowing unlimited track numbers and additional options for gene annotation and evaluation. xGDBvm’s implementation of the Agave API should facilitate the addition of new standalone or pipeline-integrated computation tools that can take advantage of high-performance processing (e.g., Maker). We also envision integrating xGDBvm with other analysis platforms including one that allows visualization of common introns (Wilkerson et al., 2006).
METHODS
xGDBvm Architecture and Software
The xGDBvm architecture is shown in Figure 3, and a more detailed description can be found in the wiki (http://goblinx.soic.indiana.edu/wiki/). We currently maintain two parallel implementations of xGDBvm, one at Indiana University (xGDBvm-GoblinX) on a virtual server using Red Hat Enterprise Linux (http://www.redhat.com) and the other on the iPlant Atmosphere platform (xGDBvm-iPlant) using CentOS Linux (https://www.centos.org). Both implementations run Apache Web server (http://www.apache.org) with very similar configurations, but xGDBvm-iPlant also includes openSSL (https://www.openssl.org) and Apache’s mod_ssl for secure access over HTTPS. Additional software includes MySQL client/server software (https://www.mysql.com), Perl (http://www.perl.org), and PHP (http://php.net/) to handle Web scripts and some server-side functions, with additional Perl modules for cgi and session management. Installed Javascript libraries include JQuery and JQuery UI (https://jquery.com). BioPerl (http://www.bioperl.org/wiki/Main_Page) and EMBOSS (http://emboss.sourceforge.net) were installed to handle certain operations. Additional binaries, including NCBI-BLAST+ (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/) as well as the computation-related software described earlier, were installed under /usr/local/bin/ or /usr/local/src/ (see Supplemental Table 1 for a complete list of installed binaries).
The document root directory is /xGDBvm/ under the VM’s root partition. xGDB scripts (modified from Schlueter et al. [2006]), PHP scripts, and other assets (Javascript files, css files, and images) were installed under /xGDBvm/XGDB/ and administrative scripts under /xGDBvm/admin/. Workflow-related shell scripts are found under /xGDBvm/scripts/, and custom yrGATE, GAEVAL, and CpGAT packages were installed under /xGDBvm/src/. The entire document root contents (excluding binaries) are maintained as a public repository at GitHub (https://github.com/BrendelGroup/xGDBvm).
The xGDBvm architecture is designed to segregate input data, dynamically generated output data, and static Web scripts that comprise the xGDBvm core (Figure 3). The user’s Data Store directory (for inputs, segregated under a common subdirectory xgdbvm/) and block storage volume (for outputs) are mounted under /home/xgdb-input/ and /home/xgb-data/, respectively. These are symbolically linked to paths under the document root (/xGDBvm/input and /xGDBvm/data), and all xGDBvm scripts reference these data paths for reading and writing data. Data destination directories are assigned ownership by group “xgdb” with read-write privileges, and the “apache” user is added to the “xgdb” group under /etc/group. Temporary data are saved to /xGDBvm/data/tmp.
To provide secure transactions where passwords are being sent over the Web, xGDBvm-iPlant enforces HTTPS (with self-signed cert) on all pages. Website password protection via .htaccess is required upon initial configuration, so only users who have the password can view the website online. Password protection can also be modified using the xGDBvm “Admin” GUI to include just the “Manage” functions (Admin, Configure/Create, and Remote Jobs); in this configuration, the VM’s genome browsers and data download sections are public. The backend MySQL password can also be customized via the GUI for additional site security. Web access to the mounted storage directories is blocked by the Apache configuration, so the user’s mounted disks are not exposed on the Internet. Certain VM assets (OAuth2 credentials, MySQL password) are stored under /xGDBvm/admin/ which is protected via the Apache configuration.
Benchmarking xGDBvm
The hardmasked Capsella rubella assembly (Slotte et al., 2013) was downloaded from the Joint Genome Institute (ftp://ftp.jgi-psf.org/pub/compgen/phytozome/v9.0/Crubella/assembly/Crubella_183_hardmasked.fa.gz; user account required). Arabidopsis thaliana cDNA FASTA sequences were downloaded from NCBI (http://www.ncbi.nlm.nih.gov/nuccore/?term=(“mrna+NOT+est”[Filter])+AND+Arabidopsis+thaliana[Organism]). Predicted protein translations were obtained from the Arabidopsis TAIR10 genome release (ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR10_genome_release/). UniRef90 proteins from Viridiplantae were retrieved in FASTA format from Uniprot (http://www.uniprot.org/uniref/?query=uniprot:(taxonomy:viridiplantae)+identity:0.9) and the file renamed as UniRef90-Viridiplantae.fa. A genome annotation based on these input data was created on an xGDBvm instance at iPlant with two CPUs and 4 GB RAM (https://atmo.iplantcollaborative.org/application). xGDBvm’s GeneSeqer parameters were species model: Arabidopsis, alignment stringency: strict. CpGAT parameters were BGF: Arabidopsis, Augustus: Arabidopsis, GeneMark: a_thaliana; Skip Mask = T. For comparison, the current C. rubella annotation (GFF3) was downloaded (ftp://ftp.jgi-psf.org/pub/compgen/phytozome/v9.0/Crubella/annotation/Crubella_183_gene.gff3.gz) and included as input in the genome workflow. Additional spliced alignment benchmarking and case studies used GeneSeqer-MPI and GenomeThreader running on high-performance computing systems at the Texas Advanced Computing Center (https://www.tacc.utexas.edu), accessed from xGDBvm as public apps via the Agave API.
For the second use case, we queried the NCBI whole-genome shotgun sequence (wgs) library for Musa acuminata subsp malaccensis (banana; http://www.ncbi.nlm.nih.gov/assembly/GCF_000313855.1/) using tblastn (http://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=tblastn) with a Setaria italica predicted protein (XP_004977556.1). M. acuminata contig 839 (GenBank accession CAIC01023586.1) was retrieved from NCBI (http://www.ncbi.nlm.nih.gov/Traces/wgs/fdump.cgi?CAIC01,23586); the resulting file was named Musa_contig_839.gdna.fa, and the FASTA header was simplified to “>Musa_contig839.” M. acuminata EST sequences in FASTA format were retrieved from NCBI (http://www.ncbi.nlm.nih.gov/nucest?term=Musa_acuminata%5BOrganism%5D) and renamed as musa_est.fa. UniRef90 proteins from Viridiplantae were retrieved in FASTA format as described above. xGDBvm’s GeneSeqer parameters were species model: rice, alignment stringency: strict. CpGAT parameters were BGF: rice, Augustus: maize, GeneMark: o_sativa; Skip Mask = T.
Data Access
Data sets described under Benchmarking can be viewed and downloaded from the xGDBvm project pages at http://goblinx.soic.indiana.edu/GDB002/ (C. rubella genome) and http://goblinx.soic.indiana.edu/GDB003/ (M. acuminata contig 839). A list of all Web resources referenced in this manuscript is found in Supplemental Table 2.
Supplemental Data
Supplemental Figure 1. Input data validation.
Supplemental Figure 2. The xGDBvm automated workflow.
Supplemental Figure 3. Output data validation.
Supplemental Figure 4. Preconfigured example data sets.
Supplemental Figure 5. Annotation of a single genomic contig.
Supplemental Table 1. xGDBvm installed software.
Supplemental Table 2. Hyperlinks referenced in the manuscript.
ACKNOWLEDGMENTS
We thank Ann Fu for help with initial development of the automated workflow, Shannon Schlueter for advice in adapting his XGDB core code for the virtual environment, James Denton for extensive debugging and yrGATE feature development, Jianqing Guan for code to calculate dynamic GAEVAL scores, and Bruce Shei for system support at Indiana University. We especially thank collaborators and colleagues at the iPlant Collaborative (CyVerse) and Texas Advanced Computing Center (TACC) for their assistance in integrating xGDBvm into the Atmosphere cloud environment and the Agave API: Roger Barthelson and Shabari Subramaniam, who wrote and tested HPC wrapper scripts for GeneSeqer-MPI and GenomeThreader, respectively; Andre Mercer, who provided prototype PHP scripts for the API; and Edwin Skidmore, Rion Dooley, and Matthew Vaughn who provided system troubleshooting and advice. This work was supported by National Science Foundation Award 1221984 to V.P.B.
AUTHOR CONTRIBUTIONS
V.P.B. conceived the project and provided overall guidance. J.D. carried out the project and managed collaborations. D.S.S. tested xGDBvm functionality with actual data sets, configured and extended a production xGDBvm server, ran ParsEval comparisons, and contributed some parsing scripts. N.M. provided guidance for xGDBvm’s implementation at iPlant and created the prototype HPC wrapper scripts.
Glossary
- VM
virtual machine
- GUI
graphical user interface
- HPC
high-performance computing
- TSA
transcript sequence assembly
- VNC
virtual network computing
Footnotes
Articles can be viewed without a subscription.
References
- Abouelhoda M.I., Kurtz S., Ohlebusch E. (2002). The enhanced suffix array and its applications to genome analysis. In Second Workshop on Algorithms in Bioinformatics, R. Guigo and D. Gusfield, eds (Rome:Springer-Verlag; ), pp. 449–463. [Google Scholar]
- Borodovsky M., Lomsadze A. (2011). Eukaryotic gene prediction using GeneMark.hmm-E and GeneMark-ES. Curr. Protoc. Bioinformatics 4: 4.6.1–4.6.10. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cantarel B.L., Korf I., Robb S.M., Parra G., Ross E., Moore B., Holt C., Sánchez Alvarado A., Yandell M. (2008). MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res. 18: 188–196. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Dooley R., Vaughn M., Stanzione D., Terry S., Skidmore E. (2012). Software-as-a-Service: The iPlant Foundation API. In 5th IEEE Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS) (IEEE; ). [Google Scholar]
- Foissac S., Gouzy J.P., Rombauts S., Mathé C., Amselem J., Sterck L., Van de Peer Y., Rouzé P., Schiex T. (2008). Genome annotation in plants and fungi: EuGene as a model platform. Curr. Bioinform. 3: 87–97.
- Goecks J., Nekrutenko A., Taylor J.; Galaxy Team (2010). Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 11: R86.
- Goff S.A., et al. (2011). The iPlant Collaborative: Cyberinfrastructure for plant biology. Front. Plant Sci. 2: 34.
- Gremme G., Brendel V., Sparks M.E., Kurtz S. (2005). Engineering a software tool for gene structure prediction in higher organisms. Inf. Softw. Technol. 47: 965–978.
- Grigoriev I.V., et al. (2012). The genome portal of the Department of Energy Joint Genome Institute. Nucleic Acids Res. 40: D26–D32.
- Haas B.J., Delcher A.L., Mount S.M., Wortman J.R., Smith R.K. Jr., Hannick L.I., Maiti R., Ronning C.M., Rusch D.B., Town C.D., Salzberg S.L., White O. (2003). Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31: 5654–5666.
- Haas B.J., Salzberg S.L., Zhu W., Pertea M., Allen J.E., Orvis J., White O., Buell C.R., Wortman J.R. (2008). Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol. 9: R7.
- Hammesfahr B., Odronitz F., Hellkamp M., Kollmar M. (2011). diArk 2.0 provides detailed analyses of the ever increasing eukaryotic genome sequencing data. BMC Res. Notes 4: 338.
- Hoff K.J., Lange S., Lomsadze A., Borodovsky M., Stanke M. (2015). BRAKER1: Unsupervised RNA-Seq-based genome annotation with GeneMark-ET and AUGUSTUS. Bioinformatics 32: 767–769.
- Holt C., Yandell M. (2011). MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics 12: 491.
- Leroy P., et al. (2012). TriAnnot: A versatile and high performance pipeline for the automated annotation of plant genomes. Front. Plant Sci. 3: 5.
- Mungall C.J., et al. (2002). An integrated computational pipeline and database to support whole-genome sequence annotation. Genome Biol. 3: 0081.1–0081.11.
- Nocq J., Celton M., Gendron P., Lemieux S., Wilhelm B.T. (2013). Harnessing virtual machines to simplify next-generation DNA sequencing analysis. Bioinformatics 29: 2075–2083.
- Potter S.C., Clarke L., Curwen V., Keenan S., Mongin E., Searle S.M., Stabenau A., Storey R., Clamp M. (2004). The Ensembl analysis pipeline. Genome Res. 14: 934–941.
- Reddy T.B.K., Thomas A.D., Stamatis D., Bertsch J., Isbandi M., Jansson J., Mallajosyula J., Pagani I., Lobos E.A., Kyrpides N.C. (2015). The Genomes OnLine Database (GOLD) v.5: a metadata management system based on a four level (meta)genome project classification. Nucleic Acids Res. 43: D1099–D1106.
- Schlueter S.D., Wilkerson M.D., Dong Q., Brendel V. (2006). xGDB: open-source computational infrastructure for the integrated evaluation and analysis of genome features. Genome Biol. 7: R111.
- Schlueter S.D., Wilkerson M.D., Huala E., Rhee S.Y., Brendel V. (2005). Community-based gene structure annotation. Trends Plant Sci. 10: 9–14.
- Slotte T., et al. (2013). The Capsella rubella genome and the genomic consequences of rapid mating system evolution. Nat. Genet. 45: 831–835.
- Specht M., Stanke M., Terashima M., Naumann-Busch B., Janssen I., Höhner R., Hom E.F., Liang C., Hippler M. (2011). Concerted action of the new Genomic Peptide Finder and AUGUSTUS allows for automated proteogenomic annotation of the Chlamydomonas reinhardtii genome. Proteomics 11: 1814–1823.
- Standage D.S., Brendel V.P. (2012). ParsEval: parallel comparison and analysis of gene structure annotations. BMC Bioinformatics 13: 187.
- Stanke M., Keller O., Gunduz I., Hayes A., Waack S., Morgenstern B. (2006). AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res. 34: W435–W439.
- Thibaud-Nissen F., Souvorov A., Murphy T., DiCuccio M., Kitts P. (2013). Eukaryotic Genome Annotation Pipeline. In The NCBI Handbook, 2nd ed (Bethesda, MD: National Center for Biotechnology Information), http://www.ncbi.nlm.nih.gov/books/NBK169439/.
- Uberbacher E.C., Hyatt D., Shah M. (2004). GrailEXP and Genome Analysis Pipeline for genome annotation. Curr. Protoc. Hum. Genet. 39: 6.5.1–6.5.15.
- Usuka J., Zhu W., Brendel V. (2000). Optimal spliced alignment of homologous cDNA to a genomic DNA template. Bioinformatics 16: 203–211.
- Wang B.B., O’Toole M., Brendel V., Young N.D. (2008). Cross-species EST alignments reveal novel and conserved alternative splicing events in legumes. BMC Plant Biol. 8: 17.
- Wilkerson M.D., Schlueter S.D., Brendel V. (2006). yrGATE: a web-based gene-structure annotation tool for the identification and dissemination of eukaryotic genes. Genome Biol. 7: R58.
- Yandell M., Ence D. (2012). A beginner’s guide to eukaryotic genome annotation. Nat. Rev. Genet. 13: 329–342.