Abstract
In this work, we present the Genome Modeling System (GMS), an analysis information management system capable of executing automated genome analysis pipelines at a massive scale. The GMS framework provides detailed tracking of samples and data coupled with reliable and repeatable analysis pipelines. The GMS also serves as a platform for bioinformatics development, allowing a large team to collaborate on data analysis, or an individual researcher to leverage the work of others effectively within its data management system. Rather than separating ad-hoc analysis from rigorous, reproducible pipelines, the GMS promotes systematic integration between the two. As a demonstration of the GMS, we performed an integrated analysis of whole genome, exome and transcriptome sequencing data from a breast cancer cell line (HCC1395) and matched lymphoblastoid line (HCC1395BL). These data are available for users to test the software, complete tutorials and develop novel GMS pipeline configurations. The GMS is available at https://github.com/genome/gms.
This is a PLOS Computational Biology Software Article.
Introduction
The increasing sequence data output of massively parallel sequencing platforms [1] has allowed the application of sequencing to an incredible diversity of research projects in the biological, genomic, and medical fields [2–6]. These technologies have inundated their adopters with petabytes of data, outpacing their ability to effectively manage and analyze the data. A rapid proliferation of tools and resources to analyze these data [7–10] complicates the creation and maintenance of analysis pipelines.
The GMS is the core analysis system at The Genome Institute (TGI) of Washington University, processing terabases of genomic data and proving integral to a wide variety of large- and small-scale sequencing projects (Fig 1). Pipelines implemented within the GMS include reference sequence alignment, germline variant detection, somatic variant detection, RNA-seq (expression, novel transcript detection, and fusion detection), differential expression, and others (Table 1). The GMS also includes an integration, annotation, and interpretation pipeline, ‘MedSeq’, which attempts to converge all single-subject data into a form suitable for identification of clinically actionable events [11]. A typical genome analysis using the GMS might start from any combination of whole-genome, exome or RNA-seq data and produce alignments against a reference genome, somatic variant calls including single nucleotide variants (SNVs), structural variants (SVs), copy-number variants (CNVs), transcript expression levels, RNA fusion predictions, and more. To date, the GMS has been used to process >4,800 human whole genome samples, >40,000 exomes and >1,400 transcriptomes for a total of >700 terabases of sequence data (Table 2).
Table 1. Major GMS pipelines.
Pipeline | Description | Products |
---|---|---|
Genotype Microarray | Performs genotype calling on SNP array data against a reference sequence. | SNVs BED file. |
Reference Alignment | Performs alignment and variant detection for reads from a single sample. Works with WGS data and capture data. | BAM file of aligned reads, VCF files and BED files for germline SNVs, Indels, SVs, and CNVs. Reports on coverage. |
Somatic Variation | Performs tumor/normal variant detection. Extends reference alignment with somatic evaluation, LOH analysis, annotation and prioritization. Works with WGS data and capture data. | VCF files and BED files for somatic SNVs, Indels, SVs, and CNVs. |
RNA-seq | Uses Bowtie/TopHat/Cufflinks to assemble transcripts and estimate abundance, alternative splicing, alternative promoter usage, etc. Also uses various tools to perform comprehensive quality and coverage analysis of RNA-seq libraries. | Spliced alignment BAM, FPKM expression, digital expression, fusion detection, etc. |
Differential Expression | Combines results from a pair of RNA-seq builds and performs differential expression analysis. | CuffDiff and CummeRbund output. |
Med Seq (aka Clin Seq) | Integrates data from WGS, exome and transcriptome sequencing of a single patient’s tumor. Visualization and annotation of somatic events. Prioritization of somatic events by relevance to cancer biology and therapeutic decision making. | Approximately 2,000 files, including: spreadsheets of ranked and annotated variants, drug-gene interactions, Circos plots, copy number images, mutation diagrams, etc. |
Table 2. Data processed by the GMS.
Metric | Human | Non-human | Total |
---|---|---|---|
WGS cases (samples) | 2,517 (4,349) | 355 (534) | 2,872 (4,883) |
Exome/targeted cases (samples) | 30,343 (35,366) | 6,027 (8,270) | 36,370 (43,636) |
RNA/cDNA cases (samples) | 375 (555) | 711 (855) | 1,086 (1,410) |
Bp of Illumina NGS reads | 622 terabases | 82 terabases | 704 terabases |
As a demonstration of the GMS, we describe a complete integrated analysis of whole genome, exome and transcriptome sequencing of a breast cancer cell line (HCC1395) and a matched lymphoblastoid cell line (HCC1395 BL). The complete dataset is publicly available (https://xfer.genome.wustl.edu/gxfer1/project/gms/). The GMS is available as open-source software with installation instructions at http://github.com/genome/gms. Once installed, users can run tutorials, reproduce the results from this publication, and test novel GMS pipeline configurations. This ability to replicate and iteratively improve upon large and complex genome analyses will allow researchers to more easily manage the immense challenges of modern large-scale sequence analysis.
Design and Implementation
To address challenges of scale, tracking, optimization, and reproducibility, we have developed an analysis information management system called the Genome Modeling System (GMS). The GMS tracks analysis processing steps while also managing project and sample information (Fig 1). It records sufficient detail in a relational database about each computational experiment to reproduce it entirely from metadata. The information is stored and indexed to enable free-form search. Results are also stored in standard formats (e.g. BAM [12], Variant Call Format (VCF) [13], Tabix [14]) with a record of the methods and inputs that produced them. The system can automatically bypass regeneration of intermediate results when those results have already been created as part of another process, hence saving immense amounts of disk and compute resources. It can also automatically aggregate data across samples within a project to provide high-level overviews of analysis status and results. Finally, the GMS facilitates the comparison of analysis results. A user can compare the output of several analysis pipelines utilizing different alignment parameters, variant callers, filters, and many more variables. The ultimate goal of the GMS is to make data management, analysis, and integration more accessible at scale.
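The reuse of intermediate results described above can be illustrated with a minimal sketch. This is not the GMS implementation (which records software results in a relational database); it only shows the general idea of keying a result on the tool, version, parameters, and inputs that produced it. The class name, the `threads` parameter, and the assumption that input order does not matter are all hypothetical.

```python
import hashlib
import json


class SoftwareResultCache:
    """Hypothetical in-memory stand-in for a software-result lookup."""

    def __init__(self):
        self._results = {}
        self.executions = 0  # counts how often a step actually ran

    @staticmethod
    def key(tool, version, params, inputs):
        # Canonical JSON so that dictionary ordering does not change the key.
        # Sorting the inputs assumes that input order is irrelevant.
        payload = json.dumps(
            {"tool": tool, "version": version, "params": params, "inputs": sorted(inputs)},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_run(self, tool, version, params, inputs, run):
        k = self.key(tool, version, params, inputs)
        if k not in self._results:
            self.executions += 1
            self._results[k] = run()
        return self._results[k]


cache = SoftwareResultCache()
align = lambda: "alignment_results/12345.bam"  # placeholder for real work

first = cache.get_or_run("bwa", "0.5.9", {"threads": 4}, ["I1", "I2"], align)
second = cache.get_or_run("bwa", "0.5.9", {"threads": 4}, ["I2", "I1"], align)  # reused
third = cache.get_or_run("bwa", "0.7.2a", {"threads": 4}, ["I1", "I2"], align)  # new version, reruns
```

In this sketch, changing any element of the key (here, the tool version) forces a new execution, while an identical request is served from the stored result.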
The GMS is driven by a flexible command-line interface and a web interface for monitoring. The web interface includes search capability for all of the major entities stored in the system, with free-form search based on full-text matching. The command-line interface is built around a single command, "genome", which offers a multi-level tree of sub-commands. These commands give access to all of the tools and data in the system (S1 Fig). The top level of the command tree allows interaction with instrument data, samples, and analysis results that are stored in the database (Box 1), including the ability to create, list, update and delete them (Box 2).
Box 1. Terminology for the Genome Modeling System.
Term | Definition |
---|---|
Subject | The entities around which analysis occurs. Exist at multiple levels of granularity. For example, an individual, a cohort, a sample from an individual, or even a species. Anything that can be described abstractly as “having a genome”. When the subject is a human patient, use of the GMS will normally require appropriate ethics review and informed consent of the patient. Related documentation will be linked to analyses via an anonymized unique patient number (UPN) stored in the GMS subject database table along with additional metadata. |
Model | The basic unit of analysis. Each model represents one state of belief about the sequence and features of a given subject. Multiple models can be made of the same subject, with different processing profiles, and/or different input data used as evidence. |
Pipeline | Each type of model defines a distinct analysis pipeline. The definition includes a specification for inputs and parameters to each model, as well as logic to construct a workflow to build results given specific values for those inputs and parameters. |
Processing Profile | A reusable collection of parameters describing how to build a model of a particular type/pipeline. Each is a complete computational method specification, including exact tool names and versions, as well as sufficient logic to determine the precise workflow. All models with the same processing profile have been processed the same way, though input data may vary. |
Build | One attempt to execute the required workflow for a model, given its inputs. The last complete build for a model represents the current “state” of the model. While models can be updated, the information content in each build is a static snapshot of results. |
Instrument Data | A unit of data from a sequencer, microarray instrument, or other device, used as primary input to the GMS. Illumina data, for instance, produces one unit of instrument data per flow cell, lane, and index sequence. It is typically associated with a file of reads, and a collection of metrics. |
Software Result | A reusable intermediate result made by the build process. When the exact same process is to occur a second time on the same inputs with the same parameters, the software result produced the first time is detected. The GMS uses these to prevent redundant work, and expedite processing after minor analysis protocol changes. |
Disk Allocation | A record of a slice of disk being allocated to a given owner. Builds, software results, and instrument data are owners of disk allocations. |
Workflow | A graph of steps, and the data flow between those steps. A workflow is generated for each attempt to build a model. Individual steps may also define subordinate workflows, leading to a nested graph of tasks to accomplish the analysis goal. |
Box 2. Example Usage
Simplified examples of command-line usage are provided for illustrative purposes (see the tutorials at http://github.com/genome/gms/wiki for fully functional examples). First, samples are listed for a given patient/subject by anonymized identifier (patient1). All commands that work with database entities support an expression syntax that allows items to be selected from the database by ID, or by other characteristics. Next, specific units of instrument data are examined for the first sample (S1). Processing profiles are listed for the reference alignment pipeline. A model is then defined for the first sample (S1) using the second processing profile (P2). Instrument data (I1, I2, and I3) are assigned as an input. The build process is then initiated, recording the new build (B1) uniquely in the database, and starting jobs on the compute cluster. A build view command is then used to monitor the steps involved in the build workflow, examine logs and check run times. The results are accessible as files for downstream analysis, with additional metrics also stored in the database.
> genome sample list "individual.common_name = patient1"
id common_name individual.common_name
S1 tumor patient1
S2 normal patient1
S3 relapse patient1
> genome instrument-data list sample.id = S1
id flow_cell_id lane index_sequence sample.id
I1 ABC123 1 <NULL> S1
I2 ABC123 2 AGCT S1
I3 ABC123 2 TCAG S1
> genome processing-profile list reference-alignment
id type_name name
P1 reference alignment BWA 0.5.9 and samtools
P2 reference alignment BWA-MEM 0.7.2a and Gatk
> genome model define reference-alignment --subject id=S1 --processing-profile id=P2 --name="TST1 tumor"
defined genome model M1
> genome model input add instrument_data id=M1 "flow_cell_id='ABC123' and lane in [1,2]"
assigned instrument data I1, I2 and I3 to model M1
> genome model build start id=M1
new build B1 started for model M1 with data directory at /opt/gms/MYSYS1/fs/model_data/M1/buildB1/
> genome model build view id=B1
> cd /opt/gms/MYSYS1/fs/model_data/M1/buildB1
> samtools view alignment_results/12345.bam
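The selection expression used in Box 2 to assign instrument data can be mirrored in a few lines of Python. This is only an illustrative sketch of the filtering semantics, using the three instrument data records listed in Box 2; it does not parse the GMS expression syntax, and the `InstrumentData` class and `select` helper are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InstrumentData:
    id: str
    flow_cell_id: str
    lane: int
    index_sequence: Optional[str]
    sample_id: str


# The three records shown in the Box 2 listing.
INSTRUMENT_DATA = [
    InstrumentData("I1", "ABC123", 1, None, "S1"),
    InstrumentData("I2", "ABC123", 2, "AGCT", "S1"),
    InstrumentData("I3", "ABC123", 2, "TCAG", "S1"),
]


def select(rows, **criteria):
    """Keep rows matching every criterion; a list, tuple, or set means 'in'."""
    def matches(row):
        for field, wanted in criteria.items():
            value = getattr(row, field)
            if isinstance(wanted, (list, tuple, set)):
                if value not in wanted:
                    return False
            elif value != wanted:
                return False
        return True
    return [row for row in rows if matches(row)]


# Equivalent in spirit to: "flow_cell_id='ABC123' and lane in [1,2]"
assigned = select(INSTRUMENT_DATA, flow_cell_id="ABC123", lane=[1, 2])
lane_two = select(INSTRUMENT_DATA, lane=2)
```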
Tool Tree and Application Programming Interface (API)
At the core of the GMS is a "tool tree", into which bioinformaticians collaboratively add components to build up a software library of computational tools and methods for their organization. Tools are accessible through the “genome tools” command, aliased by “gmt”. Adding a component to the tool tree requires writing a command class by following detailed documentation aimed at prospective developers with basic programming skills. Tools work directly on simple files, and provide fast access to the small scripts an analyst typically creates during their daily bioinformatics work. These components can evolve into complex systems, gradually, and only as needed. Additional features such as tests, documentation, and compositional pieces can be added incrementally. A low barrier to initial entry is essential to keeping the tool tree at the center of method development. S2 Fig shows an example tool, its position in the tree, its source code, and the help text generated from metadata in the software module.
Any analyst using the system automatically works in their own software ‘sandbox’, allowing private changes to any part of the system. Tools and pipelines can be added without outside registration and function for that user as though the user had deployed the tool at large in the GMS. The analyst can then push their changes to be used more broadly in the organization, or share them with the community at large. The tool tree packaged with the GMS contains over 1,500 bioinformatics components organized into 150 categories. These include tools to work with established bioinformatics software such as BWA [15], TopHat [16], Blat [17], HTSeq [18], and liftOver [19], as well as in-house tools such as DGIdb [11].
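The idea of a tool tree, in which small tools are registered at a path and reached through a single multi-level command, can be sketched as follows. This is an analogy in Python rather than the GMS command-class mechanism itself; the `tool` decorator, the `run` dispatcher, and the `fasta gc-content` tool are hypothetical and exist only for illustration.

```python
TOOL_TREE = {}


def tool(*path):
    """Register a function at the given path in the tree."""
    def register(func):
        node = TOOL_TREE
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = func
        return func
    return register


@tool("fasta", "gc-content")
def gc_content(sequence):
    """Fraction of G and C bases in a non-empty sequence (hypothetical tool)."""
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def run(argv):
    """Walk the tree along argv until a tool is found, then call it with the rest."""
    node = TOOL_TREE
    for i, part in enumerate(argv):
        node = node[part]
        if callable(node):
            return node(*argv[i + 1:])
    raise KeyError("incomplete command: " + " ".join(argv))


result = run(["fasta", "gc-content", "GATTACA"])
```

Adding a tool in this sketch requires only one decorated function, which reflects the low barrier to entry that the text describes as essential.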
Models
The central metaphor for analysis products in the GMS is the ‘genome model’ (Fig 2A). Each model represents one state of belief about the sequence data and features of a given subject. Multiple approaches to arrive at a conclusion for the same subject will be represented as multiple models in the system, each with a different ‘processing profile’ to describe the methods in precise computational terms (Box 1).
Processing Profiles
Each processing profile describes in detail how an analysis should occur. It does so in a declarative fashion. A processing profile embeds exact tool versions and parameters, such that two models built with the same processing profile, inputs, and GMS software version will have identical results (Box 1). This also allows all subjects in a given cohort to be processed in the same way if consistency is desired (Fig 2B). Each pipeline in the GMS has a collection of processing profiles that describe each of the ways the pipeline can be run. Each profile is given an identifier in the database, and new processing profiles can be created to apply different computational approaches, either by constructing a new processing profile from scratch, or by copying an existing one, and adjusting the parameters. For example, a user might decide to detect variants with a different tool, or to apply read trimming before alignment. This system allows an analyst to experiment with different methods almost as easily as describing those methods in a conversation. Hence, complex workflows do not require manual construction and can be computationally derived from a declarative specification (Box 1).
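A declarative, immutable profile that is copied and adjusted rather than edited in place can be sketched with a frozen dataclass. The field names below are hypothetical and much simpler than a real GMS processing profile; the profile name is taken from the Box 2 listing.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ProcessingProfile:
    # Hypothetical, simplified fields; real profiles carry many more parameters.
    name: str
    aligner: str
    aligner_version: str
    variant_caller: str
    trim_reads: bool = False


p2 = ProcessingProfile(
    name="BWA-MEM 0.7.2a and Gatk",
    aligner="bwa-mem",
    aligner_version="0.7.2a",
    variant_caller="gatk",
)

# Copy an existing profile and adjust one parameter: read trimming before alignment.
p2_trimmed = replace(p2, name="BWA-MEM 0.7.2a and Gatk, trimmed", trim_reads=True)
```

Because the dataclass is frozen, the original profile cannot be modified after creation, so every model that refers to it is guaranteed to have been processed with the same parameters.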
Subjects
The subject of a model determines which genome it intends to examine, much as the processing profile determines how it will be examined (Box 1). The subject of a model is sometimes a particular individual, but is more often a specific sample from some individual. In germline analysis of human disease, one model will be created for each individual, and a model group or population model used to summarize across a cohort. In cancer analysis, one model will be made for the genome of the tumor, and another for the genome of a matched normal, with a third performing the comparison between the two. The MedSeq (aka ClinSeq) models target the individual in general, taking other models as inputs, each with more specific subjects relating to tumor or normal DNA or RNA. It should be noted that while this work primarily describes a computational/analysis platform, when the GMS is applied to real patients its use will normally require appropriate IRB review and informed consent as per the requirements of the user’s jurisdiction and institutional policies.
Inputs
In the most basic case, a model’s inputs will include instrument data. The system can handle sequence data generated by sequencing instruments from Illumina (GAII, HiSeq 2000, HiSeq 2500, and MiSeq), Pacific Biosciences and Ion Torrent. In addition to sequence data, microarray data can also be supplied as input. Models often require reference sequences, annotation, or lists of regions of interest, depending on the model type. The subject of a model may limit what inputs can be assigned, ensuring that assigned reads are actually from the subject in question, and that an input reference sequence applies to the species of the subject.
Builds (Performing Analysis)
Once a model is defined, it is ‘built’. Each attempt to build a model launches a ‘workflow’ on the compute cluster, and adds a record of that build to the database for the model in question to track processing. The workflow management process is described below.
A model may be built multiple times. This occurs typically when new instrument data are assigned (Fig 2C), a new reference sequence becomes available, or new gene annotations are published and imported into the GMS. It also occurs when processing errors cause a build to fail. A complete build of a model represents a collection of results of the processing specified by the model (e.g. germline variants discovered in blood, somatic variants discovered in a tumor, novel transcripts expressed in a tissue, genes differentially expressed between conditions, etc.). The disk space allocated for the build contains VCF files for variants, BAM files for alignments, and a variety of other reports and images. At a logical level, the bundle of data produced during the build process can be interrogated by build ID to query the state of the genome in question. The resulting model can subsequently be used as an input to other models. In this case, each build of the downstream model records the current build of the upstream model as an input (Fig 2D). Because builds are conceptually immutable, every data product in the GMS can be traced back to original sequencing instrument data, and can be reproduced reliably.
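The traceability that follows from immutable builds can be illustrated with a short sketch in which each build records either instrument data or upstream builds as inputs. The `Build` class and the identifiers I4, B2, B3, M2, and M3 are hypothetical; only I1 to I3, B1, and M1 appear in Box 2.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Build:
    id: str
    model_id: str
    instrument_data: Tuple[str, ...] = ()
    input_builds: Tuple["Build", ...] = ()


def trace_instrument_data(build):
    """Collect every instrument-data ID reachable from a build, depth first."""
    found = set(build.instrument_data)
    for upstream in build.input_builds:
        found |= trace_instrument_data(upstream)
    return found


tumor = Build("B1", "M1", instrument_data=("I1", "I2", "I3"))
normal = Build("B2", "M2", instrument_data=("I4",))
# A downstream (e.g. somatic) build records the upstream builds it consumed.
somatic = Build("B3", "M3", input_builds=(tumor, normal))

sources = trace_instrument_data(somatic)
```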
Pipelines
Each type of model defines a distinct analysis pipeline, including a specification for inputs and parameters to be supplied when models are created as well as logic to construct the workflow and to parse build results. Adding new pipelines requires writing a software module to describe the new sub-type of model. The simplest pipelines are no more complicated than a small script, and the most complicated have an elaborate graph of steps, each with distinct processing requirements. As an example of the latter, Fig 3 details the workflow of the Somatic Variation pipeline. In most cases, the exact tools and versions to use for any given stage in a pipeline are configurable in the processing profile. Some fields are specific thresholds or other simple parameters. In many cases, however, the processing profile fields contain expressions that can be expanded into a sub-workflow. For example, variant detection is specified with four fields. The ‘sv_detection_strategy’ shown in Fig 3 involves a pair of variant detectors, one of which is run twice in different modes, and a series of different filters and intersection logic for the results. The entire process will create a sub-workflow based on the specification shown. One of the detectors defines another sub-workflow to process data by chromosome, and another to look for inter-chromosomal translocations. Some of the filters simply examine metrics, while others perform realignment. Other filters perform small de novo assemblies to validate structural variant predictions in silico. This example illustrates how arbitrarily complex workflows can be specified by creation of custom processing profiles.
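How a compact strategy description might be expanded into a graph of steps can be sketched as below. The strategy structure, detector names, and filter names are placeholders and do not reproduce the actual GMS detection-strategy syntax shown in Fig 3. The sketch relies on `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later) to derive a valid execution order.

```python
from graphlib import TopologicalSorter

# Hypothetical strategy: each detector is followed by its own chain of filters,
# and the filtered results are combined in a final merge step.
strategy = {
    "detector-a": ["filter-1", "filter-2"],
    "detector-b": ["filter-3"],
}


def expand(strategy, merge_step="merge"):
    """Return {step: set of prerequisite steps} for the strategy."""
    graph = {merge_step: set()}
    for detector, filters in strategy.items():
        graph[detector] = set()
        previous = detector
        for name in filters:
            step = f"{detector}/{name}"
            graph[step] = {previous}
            previous = step
        graph[merge_step].add(previous)
    return graph


graph = expand(strategy)
# static_order() yields each step only after all of its prerequisites.
order = list(TopologicalSorter(graph).static_order())
```

The two detector branches share no dependencies in this sketch, so a scheduler would be free to run them in parallel before the merge step.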
For additional details on design and implementation, refer to the Supplementary Methods (S1 Text).
Results
The GMS has been used at The Genome Institute to analyze a large number of genomes in both clinical and discovery contexts (Table 2). For example, the GMS has been instrumental for the analysis of nearly all The Cancer Genome Atlas (TCGA), Pediatric Cancer Genome Project (PCGP) [20], and other large-scale cancer genomics efforts at the Genome Institute, helping to map the landscapes of endometrial carcinomas [21], acute myeloid leukemias [22], pediatric low-grade gliomas [23], breast cancers [24], non-small-cell lung cancers [25], colon and rectal cancers [26], and ovarian cancers [27], among others. The GMS has also been used to assemble new genomes [28, 29], conduct studies of common [30] and rare disease [31, 32], track the evolution of viruses [33], and characterize the human microbiome [34, 35].
As a demonstration, we applied the GMS to an integrated analysis of whole genome (WGS), exome, and transcriptome sequencing of a breast cancer cell line (HCC1395) and matched ‘normal’ lymphoblastoid cell line (HCC1395/BL [36]). The latter cell line is matched to the same individual (also referred to as ‘TST1’ below). A total of 10 lanes of HiSeq 2000 (v3 chemistry) sequence data consisting of ~1.8 billion 2x100bp reads were produced for HCC1395 and HCC1395/BL. Whole genome sequencing, exome sequencing and RNA-seq were performed as described previously ([25, 37] and S1 Text). HCC1395 and HCC1395/BL were sequenced to average coverage levels of 56x (WGS)/155x (exome) and 31x (WGS)/124x (exome), respectively. RNA sequencing achieved 20x coverage of >50% of known junctions for 8,640 genes for HCC1395 and 9,437 genes for HCC1395/BL, respectively. Complete quality and coverage statistics from automatically generated GMS reports were summarized for WGS (S1 Table), exome (S2 Table) and RNA-seq data (S3 Table). Genotypes determined from whole genome NGS data were compared to those determined by Illumina Infinium microarrays, and overall concordances of 98.7% and 99.6% were observed for the tumor and normal calls, respectively. Fig 4 shows the collection of models and their forward progression through the HCC1395 analysis. All of the following statistics and figures were drawn directly from automated output of the following GMS pipelines: ‘genotype microarray’, ‘reference alignment’, ‘somatic variation’, ‘rna seq’, ‘differential expression’ and ‘med seq’ (aka ‘clin seq’). Distinct somatic-variation processing profiles were used for the whole genome and exome data sets. The HCC1395 data are made publicly available (https://xfer.genome.wustl.edu/gxfer1/project/gms/) to allow GMS end users to reproduce this analysis. All tutorials and examples in the online documentation are based on these data.
For complete details on how these data were generated, refer to the Supplementary Methods (S1 Text).
Examples of key data produced by GMS analysis pipelines are summarized in Fig 5 and provided in the supplementary materials (S3–S11 Figs and S1–S7 Data). S3 Fig shows the copy-number analysis for WGS data of tumor and normal, and one example of a selected CNV amplification on chromosome 12. Amplifications of known cancer-related genes such as KRAS and ETV6 are automatically labeled. Unsurprisingly for a cell line, the ploidy of HCC1395 is highly aberrant, with large-scale amplifications and deletions evident on all chromosomes. The highly copy number altered genome of HCC1395 complicates accurate somatic event detection. The GMS facilitates integrated use of multiple variant detectors to take advantage of the varying strengths of each. A breakdown of somatic SNV calls by algorithm, and the results from manual review of those variants in the Integrative Genomics Viewer [38] (IGV), are provided in S4 Fig. A high mutation rate was observed in HCC1395 (47 mutations/Mbp), likely due to the large number of cell divisions in multiple cell line passages and to the mutations we detected in DNA damage surveillance/DNA repair genes, including: MSH6, TP53, ATRX, BRCA2, MSH5, and POLH. Selected lists of cancer genes, curated by the Genome Institute from a variety of sources and released with this system, are intersected with high-confidence variant calls (S5 Fig). This allows rapid sorting of mutated gene lists according to those identified as previously mutated in Cosmic [39] or belonging to cancer-relevant gene categories according to GO [40], the cancer gene census [41], Entrez [42], and other sources. A selection of these mutations and associated annotations is provided in S4 Table. When variants affect protein coding genes, ‘lolliplot’ mutation diagrams of the predicted amino acid effect are automatically generated, showing the location of the mutation(s) relative to known domains and to the known mutational landscape according to Cosmic (S6 Fig).
For example, in HCC1395 we observed a potentially novel mutation in BRCA2 as well as mutations in NCOR2 and TP53 that occur at previously observed hotspots. A complete list of all somatic SNVs detected in HCC1395 is provided in S1 Data. S7 Fig shows a TAF1 deletion, with an image of the reads in all five of the samples, a clear visualization of the variant in the tumor DNA (WGS and exome) as well as in tumor RNA, and a compelling absence of such variation in any of the normal samples. The MedSeq pipeline automatically creates XML session files to allow rapid loading of all necessary BAM alignment files, BED files of variant calls and the appropriate reference genome in the IGV browser from which this screenshot was produced. We find this particularly useful for putative Indels, where a high false positive rate is common. A companion ‘lolliplot’ shows that this is an in-frame deletion of TAF1. The complete list of predicted Indels in HCC1395 is provided in S2 Data. S8 Fig shows coverage and variant allele frequency (VAF) data for tumor and normal samples and contrasts the values derived from the WGS, exome and RNA-seq data. The complete list of predicted CNV events is provided as S3 Data. S9 Fig shows a list of putative ORF-maintaining gene fusions detected with the SV pipeline using BreakDancer [43] and CREST [44] (aka ‘SquareDancer’). A ‘pairoscope’ plot illustrates the supporting reads for one of these potential fusions between PRTG and MALT1 on chromosomes 15 and 18 (S9 Fig). The complete list of predicted SVs from BreakDancer is provided as S4 Data. A complete set of gene expression and exon splicing results is provided as S5 Data and S6 Data. The complete list of RNA gene fusion predictions from ChimeraScan [45] is provided as S7 Data.
S10 Fig shows a clonality plot, demonstrating a very pure and homogeneous sample as evidenced by a single clear distribution of variant allele frequencies (VAF) centered almost exactly at 50% VAF, as expected for heterozygous variants. S11 Fig illustrates a small sample of the many graphs automatically generated to interpret RNA-seq results. Library quality can be assessed by observed insert size distribution (S11A Fig) and end bias (S11B Fig) plots. Alignment quality is evaluated by percentages of reads aligning to the expected transcribed regions (S11C Fig) and coverage metrics for known exon-exon junctions (S11D Fig). The observed patterns of splice site usage provide a general overview of alternative splicing patterns (S11E Fig). Finally, the expression of individual genes can be compared to the overall distribution to identify potentially up-regulated outliers (S11F Fig).
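For readers unfamiliar with the VAF values plotted in the clonality analysis, the quantity is the percentage of reads at a site that support the variant allele. A minimal sketch follows; the function name and the read counts in the usage lines are hypothetical and are not taken from the HCC1395 data.

```python
def variant_allele_frequency(ref_reads, var_reads):
    """Return the percentage of reads at a site that support the variant allele."""
    total = ref_reads + var_reads
    if total == 0:
        raise ValueError("no reads cover this site")
    return 100.0 * var_reads / total


# A heterozygous variant in a pure, diploid sample is expected near 50%.
het = variant_allele_frequency(ref_reads=50, var_reads=50)
# A lower value could reflect a subclone, contamination, or a copy number change.
low = variant_allele_frequency(ref_reads=30, var_reads=10)
```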
The preceding analysis was repeated in its entirety multiple times on standalone installations of the GMS with various hardware configurations on systems at our center, on consumer hardware available to ‘citizen scientists’, and on cloud computing services such as Amazon AWS EC2 (see S5 Table for examples). While potential alternative genome analysis platforms to the GMS are under development as both commercial and academic solutions, the breadth and comprehensiveness of cancer analysis described above and combination of additional features are to our knowledge unique to the GMS (S6 Table).
The GMS is a highly flexible and scalable system designed to enable genome analysts to maximize the yields from their data by increasing their ability to run a wide variety of analysis programs and explore the parameter space of each. The ability to reuse processing profiles offers reproducibility for complex processes (S12 and S13 Figs). A researcher can thus focus on just the variable of interest (e.g., tumor subtype, drug concentration, disease status, age of onset, etc.), confident that other variables (e.g., alignment software version, variant calling software parameters, reference genome sequence version, reference transcript annotation version, etc.) are truly constant. It also acts as the foundation for hypothesis testing of new computational methods. By allowing an analyst to produce alternatives to a given analysis pipeline with a few commands, the GMS permits an increased pace of tool and method development. Our testing of the GMS on cloud computing platforms demonstrates a mechanism for sharing complex results with collaborators or the community at large (S14 Fig). Finally, it allows standardization of analysis approaches when producing large sets of data in collaborative groups or consortia. A UML diagram of key GMS concepts is provided as S15 Fig.
In addition to the development advantages of the GMS described above, adoption of the GMS may provide practical advantages for a group attempting analysis of genome sequence data, especially in the context of cancer genomics. For example, a current adopter has access to well-vetted pipelines and tools for cancer genome analysis including: BWA, Strelka [46], VarScan2 [47], SomaticSniper [48], Pindel [49], GATK [50], BreakDancer [43], CREST, TIGRA_SV, ChimeraScan, the Tuxedo suite [51], the HTSeq and edgeR [52] combination, CopyCat (unpublished), and many more. Results include annotations according to cancer relevance; useful visualizations such as ‘lolliplot’ mutation diagrams, mutation spectrum diagrams, Circos [53] plots, XML session files for manual review in IGV, and intersection of altered genes with potential druggability from DGIdb.
Availability and Future Directions
The HCC1395 analysis demonstrates the current abilities of the GMS to detect, summarize, visualize, and interpret the various types of somatic and germline events encountered in variant analysis such as SNVs, Indels, SVs, CNVs, differential expression, alternative expression and more. This analysis, while extensive, is still far from complete. Many further improvements are currently under way and will be released publicly at regular intervals. The HCC1395 data itself may also serve as a resource for external development. There are few publicly available datasets of this quality, with all three of the major sequence data types (WGS, exome, and RNA-seq), for a single tumor/normal pair, on a current platform, to facilitate development of tools. As the clinical sequencing analysis facilitated by the MedSeq pipeline is a primary area of interest, several new resources are under development for release in future versions of the GMS to further aid the interpretation of genomic events in a clinical translation and reporting context.
Flexibility, scalability, and ease of use have been the guiding principles behind development of the GMS. The GMS makes open, high-throughput genome analysis available to groups tasked with analyzing the deluge of data from high-throughput sequencing experiments.
The GMS is made available under the open source GNU Lesser General Public License Version 3 (http://www.gnu.org/copyleft/lesser.html) and can be found on the GitHub Genome Institute pages (https://github.com/genome/gms).
Supporting Information
Acknowledgments
The system was tested and developed in cooperation with the production sequencing team at The Genome Institute, led by Robert S. Fulton and Lucinda Fulton, with IT support from the LIMS team and the Systems team. Development was facilitated by work from Krishna Kanchi, Ling Lin, Heather Schmidt, Joelle Veizer, James Koval, Rick Meyer, Xin Hong, Jerome Peirick, Jon Schindler, Todd C. Carter, Eric deMello, Kevin Crouse, Kenneth Swanson, Shin Leong and Susanna Siebert. The system core was influenced by work from Ryan Richt, Phil Kimmey, Randy Hancock, Karyn Meltz-Steinberg, John Martin, Noorus Sahar Abubucker, Karthik Kota, Sasi Suruliraj, John Osborne, Mark Johnson, Shunfang Hou, John W. Wallis, and Michael C. Wendl.
Data Availability
All relevant data are within the paper and its Supporting Information files. The GMS is made available under the open source GNU Lesser General Public License Version 3 (http://www.gnu.org/copyleft/lesser.html) and can be found on the GitHub Genome Institute pages (https://github.com/genome/gms). All source code is available on GitHub and all demonstration data is available for download here: https://xfer.genome.wustl.edu/gxfer1/project/gms/testdata/.
Funding Statement
The development of the Genome Modeling System was funded by an NHGRI Large Scale Sequencing and Analysis Center grant (U54 HG003079) to RKW. Additional funding to make this system usable by the community was also provided by NHGRI Genome Sequencing Informatics Tools (GS-IT) Program U01 HG006517 to DJD (year 1) and LD (years 1-4). Test data hosting was generously donated by an Amazon AWS in Education Research Grant Award to MG. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1. Mardis ER. Next-generation sequencing platforms. Annu Rev Anal Chem (Palo Alto Calif). 2013;6:287–303. Epub 2013/04/09.
- 2. Mardis ER. Genome sequencing and cancer. Current opinion in genetics & development. 2012;22(3):245–50. Epub 2012/04/27.
- 3. Parker M, Chen X, Bahrami A, Dalton J, Rusch M, Wu G, et al. Assessing telomeric DNA content in pediatric cancers using whole-genome sequencing data. Genome biology. 2012;13(12):R113. Epub 2012/12/13. doi:10.1186/gb-2012-13-12-r113
- 4. Shaffer HB, Minx P, Warren DE, Shedlock AM, Thomson RC, Valenzuela N, et al. The western painted turtle genome, a model for the evolution of extreme physiological adaptations in a slowly evolving lineage. Genome biology. 2013;14(3):R28. Epub 2013/03/30. doi:10.1186/gb-2013-14-3-r28
- 5. Ley TJ, Mardis ER, Ding L, Fulton B, McLellan MD, Chen K, et al. DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature. 2008;456(7218):66–72. Epub 2008/11/07. doi:10.1038/nature07485
- 6. Schloissnig S, Arumugam M, Sunagawa S, Mitreva M, Tap J, Zhu A, et al. Genomic variation landscape of the human gut microbiome. Nature. 2013;493(7430):45–50. Epub 2012/12/12. doi:10.1038/nature11711
- 7. Gonzalez-Perez A, Mustonen V, Reva B, Ritchie GR, Creixell P, Karchin R, et al. Computational approaches to identify functional genetic variants in cancer genomes. Nature methods. 2013;10(8):723–9. Epub 2013/08/01. doi:10.1038/nmeth.2562
- 8. Fonseca NA, Rung J, Brazma A, Marioni JC. Tools for mapping high-throughput sequencing data. Bioinformatics. 2012;28(24):3169–77. Epub 2012/10/13. doi:10.1093/bioinformatics/bts605
- 9. Li JW, Schmieder R, Ward RM, Delenick J, Olivares EC, Mittelman D. SEQanswers: an open access community for collaboratively decoding genomes. Bioinformatics. 2012;28(9):1272–3. Epub 2012/03/16. doi:10.1093/bioinformatics/bts128
- 10. Parnell LD, Lindenbaum P, Shameer K, Dall'Olio GM, Swan DC, Jensen LJ, et al. BioStar: an online question & answer resource for the bioinformatics community. PLoS computational biology. 2011;7(10):e1002216. Epub 2011/11/03. doi:10.1371/journal.pcbi.1002216
- 11. Griffith M, Griffith OL, Coffman AC, Weible JV, McMichael JF, Spies NC, et al. DGIdb: mining the druggable genome. Nature methods. 2013;10(12):1209–10. Epub 2013/10/15. doi:10.1038/nmeth.2689
- 12. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25(16):2078–9. Epub 2009/06/10. doi:10.1093/bioinformatics/btp352
- 13. Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, et al. The variant call format and VCFtools. Bioinformatics. 2011;27(15):2156–8. Epub 2011/06/10. doi:10.1093/bioinformatics/btr330
- 14. Li H. Tabix: fast retrieval of sequence features from generic TAB-delimited files. Bioinformatics. 2011;27(5):718–9. Epub 2011/01/07. doi:10.1093/bioinformatics/btq671
- 15. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25(14):1754–60. Epub 2009/05/20. doi:10.1093/bioinformatics/btp324
- 16. Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009;25(9):1105–11. Epub 2009/03/18. doi:10.1093/bioinformatics/btp120
- 17. Kent WJ. BLAT—the BLAST-like alignment tool. Genome research. 2002;12(4):656–64. Epub 2002/04/05.
- 18. Anders S, Huber W. Differential expression analysis for sequence count data. Genome biology. 2010;11(10):R106. Epub 2010/10/29. doi:10.1186/gb-2010-11-10-r106
- 19. Hinrichs AS, Karolchik D, Baertsch R, Barber GP, Bejerano G, Clawson H, et al. The UCSC Genome Browser Database: update 2006. Nucleic acids research. 2006;34(Database issue):D590–8. Epub 2005/12/31.
- 20. Downing JR, Wilson RK, Zhang J, Mardis ER, Pui CH, Ding L, et al. The Pediatric Cancer Genome Project. Nature genetics. 2012;44(6):619–22. Epub 2012/05/30. doi:10.1038/ng.2287
- 21. Kandoth C, Schultz N, Cherniack AD, Akbani R, Liu Y, Shen H, et al. Integrated genomic characterization of endometrial carcinoma. Nature. 2013;497(7447):67–73. Epub 2013/05/03. doi:10.1038/nature12113
- 22. Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia. The New England journal of medicine. 2013;368(22):2059–74. Epub 2013/05/03. doi:10.1056/NEJMoa1301689
- 23. Zhang J, Wu G, Miller CP, Tatevossian RG, Dalton JD, Tang B, et al. Whole-genome sequencing identifies genetic alterations in pediatric low-grade gliomas. Nature genetics. 2013;45(6):602–12. Epub 2013/04/16. doi:10.1038/ng.2611
- 24. Comprehensive molecular portraits of human breast tumours. Nature. 2012;490(7418):61–70. Epub 2012/09/25. doi:10.1038/nature11412
- 25. Govindan R, Ding L, Griffith M, Subramanian J, Dees ND, Kanchi KL, et al. Genomic landscape of non-small cell lung cancer in smokers and never-smokers. Cell. 2012;150(6):1121–34. Epub 2012/09/18. doi:10.1016/j.cell.2012.08.024
- 26. Comprehensive molecular characterization of human colon and rectal cancer. Nature. 2012;487(7407):330–7. Epub 2012/07/20. doi:10.1038/nature11252
- 27. Integrated genomic analyses of ovarian carcinoma. Nature. 2011;474(7353):609–15. Epub 2011/07/02. doi:10.1038/nature10166
- 28. McGaugh SE, Gross JB, Aken B, Blin M, Borowsky R, Chalopin D, et al. The cavefish genome reveals candidate genes for eye loss. Nature communications. 2014;5:5307. Epub 2014/10/21. doi:10.1038/ncomms6307
- 29. Montague MJ, Li G, Gandolfi B, Khan R, Aken BL, Searle SM, et al. Comparative analysis of the domestic cat genome reveals genetic signatures underlying feline biology and domestication. Proceedings of the National Academy of Sciences of the United States of America. 2014;111(48):17230–5. Epub 2014/11/12. doi:10.1073/pnas.1410083111
- 30. Service SK, Teslovich TM, Fuchsberger C, Ramensky V, Yajnik P, Koboldt DC, et al. Re-sequencing expands our understanding of the phenotypic impact of variants at GWAS loci. PLoS genetics. 2014;10(1):e1004147. Epub 2014/02/06. doi:10.1371/journal.pgen.1004147
- 31. Daiger SP, Bowne SJ, Sullivan LS, Blanton SH, Weinstock GM, Koboldt DC, et al. Application of next-generation sequencing to identify genes and mutations causing autosomal dominant retinitis pigmentosa (adRP). Advances in experimental medicine and biology. 2014;801:123–9. Epub 2014/03/26. doi:10.1007/978-1-4614-3209-8_16
- 32. Yu Y, Triebwasser MP, Wong EK, Schramm EC, Thomas B, Reynolds R, et al. Whole-exome sequencing identifies rare, functional CFH variants in families with macular degeneration. Human molecular genetics. 2014;23(19):5283–93. Epub 2014/05/23. doi:10.1093/hmg/ddu226
- 33. Wylie KM, Wylie TN, Orvedahl A, Buller RS, Herter BN, Magrini V, et al. Genome sequence of enterovirus D68 from St. Louis, Missouri, USA. Emerging infectious diseases. 2015;21(1):184–6. Epub 2014/12/23. doi:10.3201/eid2101.141605
- 34. Wylie KM, Mihindukulasuriya KA, Zhou Y, Sodergren E, Storch GA, Weinstock GM. Metagenomic analysis of double-stranded DNA viruses in healthy adults. BMC biology. 2014;12:71. Epub 2014/09/13. doi:10.1186/s12915-014-0071-7
- 35. Zhou Y, Holland MJ, Makalo P, Joof H, Roberts CH, Mabey DC, et al. The conjunctival microbiome in health and trachomatous disease: a case control study. Genome medicine. 2014;6(11):99. Epub 2014/12/09. doi:10.1186/s13073-014-0099-x
- 36. Gazdar AF, Kurvari V, Virmani A, Gollahon L, Sakaguchi M, Westerfield M, et al. Characterization of paired tumor and non-tumor cell lines established from patients with breast cancer. International journal of cancer Journal international du cancer. 1998;78(6):766–74. Epub 1998/12/02.
- 37. Mardis ER, Ding L, Dooling DJ, Larson DE, McLellan MD, Chen K, et al. Recurring mutations found by sequencing an acute myeloid leukemia genome. The New England journal of medicine. 2009;361(11):1058–66. Epub 2009/08/07. doi:10.1056/NEJMoa0903840
- 38. Thorvaldsdottir H, Robinson JT, Mesirov JP. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform. 2013;14(2):178–92. Epub 2012/04/21. doi:10.1093/bib/bbs017
- 39. Forbes SA, Bindal N, Bamford S, Cole C, Kok CY, Beare D, et al. COSMIC: mining complete cancer genomes in the Catalogue of Somatic Mutations in Cancer. Nucleic acids research. 2011;39(Database issue):D945–50. Epub 2010/10/19. doi:10.1093/nar/gkq929
- 40. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature genetics. 2000;25(1):25–9. Epub 2000/05/10.
- 41. Futreal PA, Coin L, Marshall M, Down T, Hubbard T, Wooster R, et al. A census of human cancer genes. Nature reviews Cancer. 2004;4(3):177–83. Epub 2004/03/03.
- 42. Maglott D, Ostell J, Pruitt KD, Tatusova T. Entrez Gene: gene-centered information at NCBI. Nucleic acids research. 2011;39(Database issue):D52–7. Epub 2010/12/01. doi:10.1093/nar/gkq1237
- 43. Chen K, Wallis JW, McLellan MD, Larson DE, Kalicki JM, Pohl CS, et al. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nature methods. 2009;6(9):677–81. Epub 2009/08/12. doi:10.1038/nmeth.1363
- 44. Wang J, Mullighan CG, Easton J, Roberts S, Heatley SL, Ma J, et al. CREST maps somatic structural variation in cancer genomes with base-pair resolution. Nature methods. 2011;8(8):652–4. Epub 2011/06/15. doi:10.1038/nmeth.1628
- 45. Iyer MK, Chinnaiyan AM, Maher CA. ChimeraScan: a tool for identifying chimeric transcription in sequencing data. Bioinformatics. 2011;27(20):2903–4. Epub 2011/08/16. doi:10.1093/bioinformatics/btr467
- 46. Saunders CT, Wong WS, Swamy S, Becq J, Murray LJ, Cheetham RK. Strelka: accurate somatic small-variant calling from sequenced tumor-normal sample pairs. Bioinformatics. 2012;28(14):1811–7. Epub 2012/05/15. doi:10.1093/bioinformatics/bts271
- 47. Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD, Lin L, et al. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome research. 2012;22(3):568–76. Epub 2012/02/04. doi:10.1101/gr.129684.111
- 48. Larson DE, Harris CC, Chen K, Koboldt DC, Abbott TE, Dooling DJ, et al. SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics. 2012;28(3):311–7. Epub 2011/12/14. doi:10.1093/bioinformatics/btr665
- 49. Ye K, Schulz MH, Long Q, Apweiler R, Ning Z. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics. 2009;25(21):2865–71. Epub 2009/06/30. doi:10.1093/bioinformatics/btp394
- 50. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome research. 2010;20(9):1297–303. Epub 2010/07/21. doi:10.1101/gr.107524.110
- 51. Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature protocols. 2012;7(3):562–78. Epub 2012/03/03. doi:10.1038/nprot.2012.016
- 52. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–40. Epub 2009/11/17. doi:10.1093/bioinformatics/btp616
- 53. Krzywinski M, Schein J, Birol I, Connors J, Gascoyne R, Horsman D, et al. Circos: an information aesthetic for comparative genomics. Genome research. 2009;19(9):1639–45. Epub 2009/06/23. doi:10.1101/gr.092759.109