MSeqDR: A Centralized Knowledge Repository and Bioinformatics Web Resource to Facilitate Genomic Investigations in Mitochondrial Disease

Lishuang Shen; Maria Angela Diroma; Michael Gonzalez; Daniel Navarro-Gomez; Jeremy Leipzig; Marie T Lott; Mannis van Oven; Douglas C Wallace; Colleen Clarke Muraresku; Zarazuela Zolkipli-Cunningham; Patrick F Chinnery; Marcella Attimonelli; Stephan Zuchner; Marni J Falk; Xiaowu Gai

doi:10.1002/humu.22974

Author manuscript; available in PMC: 2017 Jun 1.

Published in final edited form as: Hum Mutat. 2016 Mar 21;37(6):540–548. doi: 10.1002/humu.22974

PMCID: PMC4846568 NIHMSID: NIHMS763101 PMID: 26919060

Abstract

MSeqDR is the Mitochondrial Disease Sequence Data Resource, a centralized and comprehensive genome and phenome bioinformatics resource built by the mitochondrial disease community to facilitate clinical diagnosis and research investigations of individual patient phenotypes, genomes, genes, and variants. A central Web portal (https://mseqdr.org) integrates community knowledge from expert-curated databases with genomic and phenotype data shared by clinicians and researchers. MSeqDR also functions as a centralized application server for Web-based tools to analyze data across both mitochondrial and nuclear DNA, including investigator-driven whole exome or genome dataset analyses through MSeqDR-Genesis. MSeqDR-GBrowse supports interactive genomic data exploration and visualization with custom tracks relevant to mtDNA variation and disease. MSeqDR-LSDB is a locus-specific database that currently manages 178 mitochondrial diseases, 1,363 genes associated with mitochondrial biology or disease, and 3,711 pathogenic variants in those genes. The MSeqDR Disease Portal allows hierarchical tree-style exploration of diseases to evaluate their unique descriptions, phenotypes, and causative variants. Automated genomic data submission tools are provided that capture ClinVar-compliant variant annotations. PhenoTips is used for phenotypic data submission on de-identified patients using Human Phenotype Ontology terminology. Development of a dynamic informed patient consent process to guide data access is underway to realize the full potential of these resources.

Keywords: mitochondria, genetics, informatics, database

Introduction

MSeqDR is a centralized, expert-curated, comprehensive genomic and phenotype data resource built by and for the mitochondrial disease community to facilitate the diagnosis and improved understanding of individual mitochondrial diseases [Falk et al., 2015]. The MSeqDR website (https://mseqdr.org) is a secure Web portal that meets privacy and data security requirements. MSeqDR also provides a suite of Web-based bioinformatics tools to support diverse end-user analyses of their own genomic datasets from clinical patients and/or study subjects. Such a centralized compilation of custom and public bioinformatics resources enables clinicians and researchers to directly contribute and analyze a wide range of genomic data at the aggregate level. Focused efforts are also underway to enable MSeqDR investigators to directly contribute de-identified phenotype data on individual patients, following the full implementation of a dynamic process to document their informed consent for data usage. MSeqDR supports account authentication and data exchange via OAuth2 and APIs so that existing Web-based genomic data analysis tools such as Genesis 2.0 (http://www.viagenetics.com/genesis-2.0.html) can be seamlessly integrated.

The MSeqDR system is organized into 3 major local components. First, the central MSeqDR website supports heterogeneous data capture, curation, and flexible mining and analysis by end-users. Second, the MSeqDR Locus Specific Database (MSeqDR-LSDB) provides centralized disease, gene, and variant level information relevant to known or candidate mitochondrial disease genes. Third, a custom MSeqDR genome browser (MSeqDR-GBrowse) supports genomic data visualization of both mtDNA and nuclear genomes at aggregated cohort or individual patient levels through public, community, and private data tracks. MSeqDR also enables focused exploration and mining of de-identified genomic data on individual patients or patient cohorts by intuitive Web-based data filtering in MSeqDR-Genesis (formerly GEM.app) [Gonzalez et al., 2015]. Through MSeqDR-Genesis, genomic and phenotype data deposited within MSeqDR are seamlessly integrated with an external data resource established by the Genesis Project, which supports direct exome or genome data set mining on individual patients or cohorts by clinicians and researchers, and also provides matchmaking capabilities with the broader genomics community.

Building the MSeqDR knowledge and data repository

Organizing a genomic data reference for mitochondrial disease

MSeqDR has a data back-end whose goal is to support the seamless integration and analysis of genomic and phenotype data from individual patients and patient cohorts with known or suspected mitochondrial disease, as well as to provide a centralized compilation of highly heterogeneous sets of data on mitochondrial diseases and their causative genes across both nuclear and mtDNA genomes. Initial effort in creating this data resource was devoted to engaging the community to determine its specific needs, and to supporting facile data mining at different levels of resolution through Web interfaces specifically designed for non-bioinformatics experts. The strategy employed to realize this goal involved a sequential series of steps necessary to build the MSeqDR Web infrastructure, identify and prepare relevant reference data sets, and solicit the mitochondrial disease community for data contributions in the form of genomic datasets and aggregated variation data. Following achievement of this goal, we have now turned to supporting the inclusion of genomic data on a smaller scale from individual investigators, clinicians, and diagnostic laboratories.

MSeqDR-LSDB currently focuses on the curation of disease, gene, and variant level information on 1,363 unique genes that represent known and candidate mitochondrial disease genes (Supp. Table S1). Specifically, this aggregated gene set includes all 37 mtDNA genes, nuclear genes known to cause mitochondrial disease, nuclear genes known to encode mitochondrial proteins, and genes included on clinical panels or research investigations of mitochondrial biology and disease. Additional genes are added over time when new evidence supports their having a causative role in mitochondrial disease or encoding a protein localized in mitochondria. Indeed, when updating data on pathogenic variants for mitochondrial disease, newly recognized disease genes will be automatically added into MSeqDR-LSDB. Ensembl gene models and the HGNC RefSeq nomenclature are used for general whole genome reference, as HGNC gene symbols are required for inclusion in an LSDB database. The variation reference consists of dbSNP 138 annotations from Ensembl v75 (GRCh37), including variant consequence predictions. Other reference variants are from the 1000 Genomes Project data [1000 Genomes Project Consortium, 2013] and the UK10K project whole genome variant data from 4,000 healthy individuals (http://www.uk10k.org/). mtDNA genome variant references include MITOMAP curation [Brandon et al., 2005; Lott et al., 2013], HmtDB, the Human Mitochondrial Database [Rubino et al., 2012], and GeneDx, which respectively provide variant allele frequencies from 30,500, 25,991, and 6,397 mtDNA genomes.

For the MSeqDR whole exome data reference, four major data sets (M1-M4) were used. (M1): MSeqDR collected whole exome data for approximately 1,700 exomes from the mitochondrial disease community and collaborators, notably including 324 exomes from Genome England. (M2): Genesis (formerly GEM.app) from the University of Miami provided 6,300 whole exome data sets that are particularly enriched for patients with neurologic disease [Gonzalez et al., 2015]. (M3): The NHLBI GO Exome Sequencing Project (ESP) publicly released ESP6500 exome variant data from 6,500 unrelated samples (citation per its website: Exome Variant Server, NHLBI GO Exome Sequencing Project (ESP), Seattle, WA [URL: http://evs.gs.washington.edu/EVS, accessed 08/2014]). (M4): Variant data from the Exome Aggregation Consortium [ExAC Consortium, 2015], from 60,706 unrelated individuals who were sequenced as part of various disease-specific and population genetic studies. Datasets (M1) and (M2) serve as more patient-enriched population references than (M3) and (M4) in MSeqDR for purposes of determining allele and genotype frequencies in different ethnic groups. There were 3 major reasons we used both the M3 and M4 data sets. First, EVS (M3) was included in 2013, while ExAC (M4) was added in 2015; because the two were added at different times, we did not retire the M3 dataset. Second, keeping both M3 and M4 preserves backward compatibility with published results based on EVS. Third, EVS is only partially included in ExAC, per the ExAC documentation (http://exac.broadinstitute.org/faq). In addition, prioritized rare variants in individual patient datasets can be identified and retrieved from analysis in the MSeqDR-Genesis system.
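As an illustration of how allele and genotype frequencies can be derived from a reference cohort and compared across groups, the following minimal Python sketch may help. The function name, the group labels, and the genotype values are hypothetical and are not taken from MSeqDR; the sketch assumes biallelic sites at diploid nuclear loci.

```python
from collections import Counter


def allele_and_genotype_frequencies(genotypes):
    """Summarize one biallelic site in one group of samples.

    `genotypes` holds per-sample alternate-allele counts for a diploid
    nuclear locus: 0 (hom-ref), 1 (het), 2 (hom-alt). None is a missing call.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return None  # no usable calls in this group
    n = len(called)
    counts = Counter(called)
    return {
        "n_called": n,
        # Each called diploid sample contributes two alleles.
        "alt_af": sum(called) / (2 * n),
        "genotype_freqs": {g: counts.get(g, 0) / n for g in (0, 1, 2)},
    }


# Hypothetical genotype calls for one variant, split by (made-up) group label.
calls_by_group = {
    "group_A": [0, 0, 1, 0, 2, 0, None, 1, 0, 0],
    "group_B": [0, 0, 0, 0, 1, 0, 0, 0],
}

summary = {
    group: allele_and_genotype_frequencies(calls)
    for group, calls in calls_by_group.items()
}
for group, stats in summary.items():
    print(group, stats)
```

Keeping the per-group summaries separate, rather than pooling all samples, is what allows a variant that is rare overall but common in one population to be recognized.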

Organizing a phenotype data reference in mitochondrial disease: Cross-referencing and standardizing definitions across dictionaries

Systematically collected and readily interpretable representations of mitochondrial disease features are crucial to mitochondrial disease investigations, given the highly complex nature of these diseases and associated symptoms. MSeqDR's goal is to link mitochondrial disease patient genotypes with specific phenotypic information to facilitate clinical diagnostic practice and research data integration. Therefore, MSeqDR focuses on developing robust tools and systems for phenotype data compilation, standardization, and annotation. We elected to use both the Human Phenotype Ontology (HPO) [Robinson et al., 2008, Köhler et al., 2013] and MeSH/MEDIC [Humphrey, 1984; Davis et al., 2009, 2015] vocabularies for different purposes within MSeqDR. HPO terminology was used to list the phenotypic abnormalities and describe the term relationships as an ontology hierarchy. However, recognizing that HPO is not designed as a “disease ontology”, MeSH/MEDIC was utilized to help to organize the disease terms into a hierarchical structure, which is a combination of MeSH hierarchy and the flat OMIM disease list.

Considering the historical preferences of clinicians and biologists, the need to unify the various components of MSeqDR, and the comprehensiveness and quality of different data resources, the resources selected to provide the general phenotype backbone included HPO [Robinson et al., 2008, Köhler et al., 2013], OMIM [Hamosh et al., 2005], ClinVar [Landrum et al., 2014], MedGen (http://www.ncbi.nlm.nih.gov/medgen), and Ensembl Diseases and Phenotype (http://www.ensembl.org/info/website/tutorials/phenotype.html) [Cunningham et al., 2015], as further described below. For disease and phenotype, reference data are largely drawn from the OMIM disease dictionary, HPO, and MedGen databases. In particular, OMIM is the cornerstone of the LOVD systems used by MSeqDR-LSDB [Fokkema et al., 2011]. NCBI ClinVar phenome fields are used to provide the MSeqDR annotation for sequence variants and their pathogenicity. Indeed, MSeqDR uses ClinVar as the most comprehensive global pathogenic variant data source, which is then complemented by phenotype data derived from the mitochondrial disease community through disease reports.

The HPO project provides a structured, comprehensive, and well-defined set of 10,088 classes (terms) that describe human phenotypic abnormalities, and annotations to HPO classes of 7,278 human hereditary syndromes. MSeqDR adopted HPO to provide structured and standardized phenotype descriptions of mitochondrial diseases. Disease-phenotype-gene associations are curated in MSeqDR based on annotations from HPO, ClinVar, MedGen, and OMIM. This structure will help to enable machine-actionable data integration across labs, institutions, and projects, so that rare disease cohorts can be precisely joined by their phenotype category.

A total of 178 distinct mitochondrial disease entities were identified from curation of OMIM [Hamosh et al., 2005], United Mitochondrial Disease Foundation (UMDF, 45 diseases, http://umdf.org), North American Mitochondrial Disease Consortium (NAMDC, 23 diseases, http://www.rarediseasesnetwork.org/namdc/), and literature review through keyword match with manual review. Disease names were taken from OMIM, as is required by LOVD for MSeqDR-LSDB. Some disease names were mapped to multiple OMIM terms upon manual review. For example, ‘Q10 Coenzyme Deficiency’ was mapped to 6 OMIM IDs. All diseases not described directly in OMIM terms were reviewed by the MSeqDR curator and assigned the most likely OMIM names.

MSeqDR organizes diseases as a poly-hierarchical tree for visualization, where a term may appear as a node in more than one branch, and a disease may have different descendants in each of its subbranches. Hierarchical information came from the National Library of Medicine's Medical Subject Headings (MeSH®) hierarchical vocabulary [Humphrey, 1984], which was captured in the Comparative Toxicogenomics Database (CTD) MEDIC disease vocabulary [Davis et al., 2009, 2015]. MEDIC is a modified subset of descriptors from the “Diseases” [C] branch of MeSH. We further combined these MeSH terms with genetic disorder annotations from OMIM and with synonym MedGen IDs. When a mitochondrial disease name is not included in CTD MEDIC, we internally represent it as a direct descendant of the top mitochondrial disease tree node. We further extracted the term-to-term relationships and the graph path from the CTD OBO file, which are then used in the ‘MSeqDR Disease Portal’, described later, to visualize the cross-referenced phenome data as a dynamic tree.
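The idea of a poly-hierarchical tree, in which one term is reachable along more than one branch, can be sketched in Python as follows. This is only an illustration under stated assumptions: the OBO-style stanzas, the `EX:` identifiers, and the parent-child links are invented for the example and do not reproduce the real MeSH or CTD MEDIC hierarchy, and the parser handles only the `id`, `name`, and `is_a` tags.

```python
# Toy OBO-style text; identifiers and relationships are illustrative only.
OBO_TEXT = """
[Term]
id: EX:ROOT
name: Mitochondrial Diseases

[Term]
id: EX:A
name: Example branch A
is_a: EX:ROOT ! Mitochondrial Diseases

[Term]
id: EX:B
name: Example branch B
is_a: EX:ROOT ! Mitochondrial Diseases

[Term]
id: EX:C
name: Example disease with two parents
is_a: EX:A ! Example branch A
is_a: EX:B ! Example branch B
"""


def parse_obo(text):
    """Return (names, parents) dictionaries from minimal OBO [Term] stanzas."""
    names, parents = {}, {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            current = None  # a new stanza starts; wait for its id line
        elif line.startswith("id: "):
            current = line[4:]
            parents.setdefault(current, [])
        elif current and line.startswith("name: "):
            names[current] = line[6:]
        elif current and line.startswith("is_a: "):
            # Drop the trailing "! comment" that OBO files often carry.
            parents[current].append(line[6:].split("!")[0].strip())
    return names, parents


def paths_to_root(term, parents):
    """List every root-to-term path; a term with two parents yields two paths."""
    ups = parents.get(term, [])
    if not ups:
        return [[term]]
    return [path + [term] for up in ups for path in paths_to_root(up, parents)]


names, parents = parse_obo(OBO_TEXT)
for path in paths_to_root("EX:C", parents):
    print(" > ".join(names[t] for t in path))
```

A term absent from the vocabulary could be handled in the same structure by giving it the top node as its only parent, which mirrors the fallback described above.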

MSeqDR obtained diagnostic criteria data and a symptom manifestation dictionary from NAMDC. The 160 NAMDC disease manifestation terms were mapped to HPO terms: the NAMDC terms were reformatted, and candidate HPO terms were retrieved manually by reviewing keyword matches and their context in the HPO hierarchical tree. Some NAMDC terms were divided into multiple HPO child terms. Over 90% of NAMDC clinical manifestations were definitively mapped to 123 HPO terms, including 53 terms that were exactly matched by keyword. We also identified several NAMDC manifestation terms that were not included in the HPO dictionary and are likely candidates for new HPO terms. For example, no HPO match was identified for a major NAMDC criterion, “thymidine phosphorylase deficiency”, or for other complex terms such as “spontaneous improvement by age 3 years” or “muscle biopsy showing COX-negative fibers”. An expert-curated condensed term list was manually created from these 123 NAMDC-HPO terms, which is used by MSeqDR and within MSeqDR-Genesis for patient classification. NAMDC diagnostic manifestations are used to annotate disease manifestations side-by-side with HPO terms, when applicable.
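The exact-keyword step of such a mapping, with everything else set aside for manual review, could look roughly like the Python sketch below. The function names and the tiny label table are hypothetical; the HPO identifiers shown are included for illustration and should be verified against a current HPO release. The actual MSeqDR curation was manual and is not reproduced here.

```python
import re


def normalize(term):
    """Lowercase a term and collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", term.lower()).strip()


def map_terms(source_terms, hpo_labels):
    """Split source terms into exact keyword matches and manual-review leftovers.

    `hpo_labels` maps an HPO ID to its label plus any synonyms.
    """
    index = {}
    for hpo_id, labels in hpo_labels.items():
        for label in labels:
            index[normalize(label)] = hpo_id

    mapped, unmatched = {}, []
    for term in source_terms:
        hit = index.get(normalize(term))
        if hit is not None:
            mapped[term] = hit
        else:
            unmatched.append(term)  # needs a curator, or a new HPO term request
    return mapped, unmatched


# Illustrative label table (IDs to be checked against the current HPO release).
hpo_labels = {
    "HP:0001250": ["Seizure", "Seizures"],
    "HP:0003128": ["Lactic acidosis"],
    "HP:0000508": ["Ptosis"],
}
namdc_like_terms = ["Seizures", "lactic-acidosis", "thymidine phosphorylase deficiency"]

mapped, unmatched = map_terms(namdc_like_terms, hpo_labels)
print(mapped)
print(unmatched)
```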

MSeqDR provides a centralized community web portal for genomic and phenome resources

MSeqDR aims to serve as the central Web portal to bridge the often scattered resources that exist across the basic mitochondrial biology and mitochondrial disease communities. Multiple richly annotated mitochondrial gene and/or disease repositories are maintained by domain experts or individual organizations but are not readily accessible or widely known. With a global network of over 100 mitochondrial disease community experts participating in the MSeqDR Consortium [Falk et al., 2015], community outreach was pursued to request the sharing and dissemination of data in MSeqDR that will ultimately benefit the whole community. The nature of the contributed data could be genomic or phenotypic, data sizes could range from whole genome or exome to targeted sequencing results on one or many individuals, and individual level data could be contributed on single variant pathogenicity or de-identified patient-specific phenotype entry. Data access privileges can be set at the discretion of the contributor as public mode access, protected mode access such as to a specified group, or private mode access. All shared data sources hosted by or linked to MSeqDR are appropriately credited, with data links to original data repositories provided and routinely updated.
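The three contributor-selected access modes can be pictured with a small Python sketch. The class, field, and function names below are hypothetical and the rules are an assumed reading of "public", "protected", and "private"; this is not MSeqDR's actual authorization code.

```python
from dataclasses import dataclass, field
from enum import Enum


class Access(Enum):
    PUBLIC = "public"        # visible to everyone, including anonymous visitors
    PROTECTED = "protected"  # visible to the owner and to named groups
    PRIVATE = "private"      # visible to the owner only


@dataclass
class Dataset:
    name: str
    owner: str
    access: Access
    groups: set = field(default_factory=set)  # groups allowed in protected mode


def can_view(dataset, user=None, user_groups=()):
    """Decide whether `user` (None for an anonymous visitor) may view `dataset`."""
    if dataset.access is Access.PUBLIC:
        return True
    if user is None:
        return False
    if user == dataset.owner:
        return True
    if dataset.access is Access.PROTECTED:
        return bool(dataset.groups & set(user_groups))
    return False


cohort = Dataset("example_cohort", owner="alice", access=Access.PROTECTED,
                 groups={"example_consortium"})
print(can_view(cohort, user="bob", user_groups=["example_consortium"]))
print(can_view(cohort, user="carol"))
```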

MSeqDR is creating its own reference whole exome sequencing populations M1 and M2 [Gonzalez et al., 2015], with approximately 1,700 and 6,300 exomes respectively. Variants and raw sequence data from these datasets are stored in MSeqDR and Genesis, and are accessible for collaboration at various access levels in accordance with evolving MSeqDR membership, regulatory, and ethical guidelines [Thorogood et al., 2015; Shabani et al., 2015].

MSeqDR-Genesis: Investigator-driven whole exome or genome dataset analyses

Genesis 2.0 (formerly named GEM.app) is available to not-for-profit researchers at no cost and is accessed from within the MSeqDR Web portal. Genesis 2.0 enables primary investigators to manage, analyze, and share their genomic datasets in real time. MSeqDR-Genesis integration supports gene panel, exome, genome, and phenotype data archiving as well as phenotype-based mining of patient-, family-, and cohort-level genomic and phenotypic data. Complete usage details of the Genesis platform were recently described [Gonzalez et al., 2015].

MSeqDR-GBrowse: Interactive Data Exploration and Visualization

MSeqDR-GBrowse supports integrated visualization of multiple types of variants in the context of other genomic annotations, as well as data sharing across other domains within MSeqDR. For this purpose, the Generic Genome Browser (GBrowse) [Stein LD, 2002] was used with minor modifications. Specifically, MSeqDR-GBrowse integrates data from various pathogenic and population sequence variation resources, particularly including mitochondrial disease resources. The key data tracks in GBrowse are organized into 5 major categories (Supp. Figure S1): (1) Genes and Transcripts, as based upon Ensembl release 75, build 37 and UCSC hg19, along with genes that encode mitochondrial proteins or mitochondrial disease candidate genes; (2) Variation (reference), including population-level variation from dbSNP and specific projects such as the 1000 Genomes Project and the NHLBI ESP Project; (3) Variant Phenotype, including variants with pathogenicity and phenotypic annotations, as extracted from ClinVar and Ensembl phenotype, MITOMAP mtDNA variants, and non-mitochondrial variants including shared sets of POLG and TAZ mutations; (4) Mitochondrial Variation, mtDNA variants from various sources, primarily including the approximately 10,500 variants from MITOMAP, the approximately 14,700 variants from healthy genomes and 4,600 from patients deposited in HmtDB, the 4,280 haplogroup defining variants from PhyloTree, as well as the 804 large scale deletions and 44 duplications from MitoBreak; and (5) Custom Tracks and Data Sharing, where MSeqDR-GBrowse allows users to upload their own track data in gff3, BED, and other supported data formats. The owner can assign one of 3 access privilege types to such custom tracks: Public, Group, or Private. External data can also be shared by including the URL from third-party GBrowse instances as custom tracks. MSeqDR tracks can be used as data sources for external genome browsers via URL, or downloaded in gff3 format and uploaded to third-party databases. These features allow cross-website integration with external data sources.
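For readers unfamiliar with gff3, a custom track is a tab-delimited text file with nine columns per feature. The Python sketch below writes two well-known mtDNA variants as gff3 lines. The sequence name `chrM`, the source label, and the helper names are assumptions made for the example; in practice the sequence name has to match the reference naming used by the target browser.

```python
from urllib.parse import quote

GFF3_HEADER = "##gff-version 3"


def escape_attr(value):
    """Percent-encode characters reserved in the gff3 attributes column."""
    # Letters, digits, and "_.-~" are left alone by quote(); additionally keep
    # spaces, ":" and ">" readable. ";", "=", "&", ",", "%" and tabs get encoded.
    return quote(str(value), safe=" :>")


def gff3_line(seqid, source, feature_type, start, end, attributes, strand="+"):
    """Build one gff3 feature line (score and phase left as '.')."""
    attr_text = ";".join(f"{key}={escape_attr(val)}" for key, val in attributes.items())
    columns = [seqid, source, feature_type, str(start), str(end), ".", strand, ".", attr_text]
    return "\t".join(columns)


# Two widely reported mtDNA variants, used here only as sample rows.
variants = [
    {"id": "var1", "pos": 3243, "ref": "A", "alt": "G", "note": "MT-TL1; reported in MELAS"},
    {"id": "var2", "pos": 11778, "ref": "G", "alt": "A", "note": "MT-ND4; reported in LHON"},
]

lines = [GFF3_HEADER]
for v in variants:
    attrs = {
        "ID": v["id"],
        "Name": f"m.{v['pos']}{v['ref']}>{v['alt']}",
        "Note": v["note"],
    }
    lines.append(gff3_line("chrM", "example_lab", "SNV", v["pos"], v["pos"], attrs))

track_text = "\n".join(lines) + "\n"
print(track_text)
```

Escaping the semicolon inside the note matters because an unescaped `;` would be read as the start of a new attribute.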

Within these 5 categories of key data tracks, MSeqDR-GBrowse hosts custom data visualization tracks developed for the mitochondrial disease community. Several tracks are available with data from MITOMAP, which is primarily a human mitochondrial genome variation database with expert manual curation. MITOMAP has shared with MSeqDR its full list of approximately 10,500 known mtDNA variants from 30,500 genomes, which is enhanced with population level allele frequencies and literature citations. This serves as the most complete reference dataset for mtDNA mutations and variation data. The MITOMAP team has manually curated 580 mtDNA mutations having reported disease associations, including 300 rRNA or tRNA mutations plus 280 coding and control region mutations. These data therefore represent the largest de novo mtDNA mutation resource in terms of expert pathogenicity annotation.

The HmtDB tracks are a comprehensive collection of all the variants annotated in the Human Mitochondrial Database, HmtDB [Rubino et al., 2012], which contains more than 25,000 complete human mitochondrial genomes from 22,691 healthy people and 3,300 patients, derived from GenBank or submitted by end users. HmtDB tracks are generated with respect to both mitochondrial reference sequences, rCRS (revised Cambridge Reference Sequence) and RSRS (Reconstructed Sapiens Reference Sequence) [Behar et al., 2012]. RSRS differs from the preferred rCRS reference genome in 52 positions. For the pathogenicity prediction track (“MT-patho.CDS”), all 24,202 possible non-synonymous mtDNA variants were identified in the 13 mitochondrial protein coding genes, and pathogenicity predictions and related scores were obtained. Similarly, the “MT-patho.STOP” track reports all the 1,740 possible stop-gain and 77 stop-loss mutations within the mtDNA genome. Specific tracks were produced for mtDNA variants within rRNA and tRNA genes (“MT-RNA” and “MT-patho.RNA” tracks), and another mtDNA variant track called “1KG/WES Mito.Variants” was made for mtDNA variants detected within the 1000 Genomes Project dataset of 2,368 samples, obtained from off-target exome data [Diroma et al., 2014; Calabrese et al., 2014].
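Enumerating every possible single-nucleotide substitution in a coding sequence and classifying its consequence can be sketched in Python as below. The sketch uses the vertebrate mitochondrial genetic code (NCBI translation table 2), in which TGA encodes tryptophan, ATA encodes methionine, and AGA/AGG are stop codons. It is a simplified illustration, not the pipeline behind the tracks described above: the function name is hypothetical, the input is assumed to be an in-frame coding sequence given 5' to 3' on its coding strand, and incomplete stop codons and overlapping genes are ignored.

```python
BASES = "TCAG"
# NCBI translation table 2, amino acids listed for codons in the order
# TTT, TTC, TTA, TTG, TCT, ... GGG ("*" marks a stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"
MITO_CODE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_substitutions(cds):
    """Count consequences of all single-base substitutions in an in-frame CDS."""
    counts = {"synonymous": 0, "missense": 0, "stop_gain": 0, "stop_loss": 0}
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    for start in range(0, usable, 3):
        codon = cds[start:start + 3].upper()
        ref_aa = MITO_CODE[codon]
        for offset in range(3):
            for alt in BASES:
                if alt == codon[offset]:
                    continue
                mutant = codon[:offset] + alt + codon[offset + 1:]
                alt_aa = MITO_CODE[mutant]
                if alt_aa == ref_aa:
                    counts["synonymous"] += 1
                elif alt_aa == "*":
                    counts["stop_gain"] += 1
                elif ref_aa == "*":
                    counts["stop_loss"] += 1
                else:
                    counts["missense"] += 1
    return counts


# Toy two-codon sequence (Met, Trp under the mitochondrial code).
toy_counts = classify_substitutions("ATGTGA")
print(toy_counts)
```

Each codon contributes nine possible substitutions, so the four counts for a sequence of n complete codons always sum to 9n.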

PhyloTree (mtDNA tree Build 16, rCRS reference) is the comprehensive phylogenetic tree of global human mtDNA variation [van Oven M and Kayser M, 2009]. The PhyloTree dataset consists of 4,806 haplogroups, with 4,228 unique haplogroup-defining variants at 3,940 mtDNA positions. All of these haplogroups, variants, and haplogroup-defining variant annotations were parsed into MSeqDR to generate a gff3 PhyloTree track file for visualization and data mining within or outside MSeqDR-GBrowse.

A MitoBreak track provides information on mtDNA copy number alterations obtained from MitoBreak, covering 805 mtDNA deletions and 44 mtDNA duplications. MitoBreak is a comprehensive database of mtDNA breakpoints for three classes of somatic mtDNA rearrangements: circular deleted (deletions), circular partially duplicated (duplications), and linear mtDNAs [Damas et al., 2014]. The database was constructed using mtDNA genome rearrangement information that was collected from nearly 400 publications, as well as information gathered from the MITOMAP and MitoTool databases.

Gene-specific tracks are available in MSeqDR for a few locus specific databases that collaborate with MSeqDR. This includes a POLG track with data from the Human DNA Polymerase Gamma (POLG) Mutation Database Project [Longley et al., 2005], which lists all known pathogenic mutations in the coding region of the POLG gene. In addition, the Barth Syndrome Foundation (BSF) human Tafazzin (TAZ) mutation database includes mutations and variants curated from the literature, direct submission by laboratories, and direct submission by affected families (https://www.barthsyndrome.org/science--medicine/human-taz-gene-variants-database).

In addition to collaboration with academic and non-profit organizations, MSeqDR also seeks to collaborate with clinical diagnostic laboratories and to share knowledge gained from their sequencing. The Transgenomic track provides information shared by Transgenomic, Inc.: aggregated variation data containing 108,000 variants from 151 patients that were detected by ‘NuclearMitome Comprehensive Sequence Analysis’ of 448 nuclear genes having relevance to normal mitochondrial function or to conditions that mimic mitochondrial disease. GeneDx, Inc., a genetic diagnostic company that specializes in genetic testing for rare hereditary disorders, shared a GeneDx custom track that provides aggregated mtDNA genome data from 6,397 healthy individuals, thereby enhancing the mtDNA genome curation efforts of MSeqDR. Indeed, clinical sequencing datasets remain valuable, although they are often left unused after the primary target mutations are analyzed and initial diagnostic review is completed. Encouraging commercial entities to share such data in the aggregate within standardized searchable formats will greatly assist the community effort to diagnose and perform research in rare disorders, including mitochondrial diseases.

MSeqDR-LSDB, a Customized Mitochondrial Disease Locus Specific Database

MSeqDR-LSDB manages the submission of and access to data for diseases, patients, and known pathogenic variants. MSeqDR-LSDB is a highly customized instance of the Leiden Open Variant Database (LOVD) [Fokkema et al., 2011]. LOVD has been adopted by pathogenic variant databases of many human diseases and is also supported by major databases such as NCBI, Ensembl, and the UCSC Genome Browser because of its sharing of heterogeneous variant data via API. However, the native LOVD system was a single entry system that did not fully support the mitochondrial genome at the time when the MSeqDR project started. We therefore performed extensive customizations and enhancements in both its back and front ends to add support for mitochondrial genome data entry and mining with links to other heterogeneous data. First, this involved creation of a quick prototyping strategy, where we developed a batch loading solution with a seeding set of genes, transcripts, diseases, and their associations. This capability represents a major improvement over the native LOVD single entry submission strategy. Indeed, this revision allowed us to quickly establish the prototype with 178 mitochondrial diseases and 1,363 nuclear and mtDNA genes. Second, we developed support for the mitochondrial genome through custom programming and rigorous curation. At the time of MSeqDR-LSDB creation (LOVD v3.08, November 2013), the native LOVD system did not support the mitochondrial genome, partially due to the fact that NCBI was not fully supporting the mitochondrial chromosome in RefSeq. We therefore integrated data from HGNC, dbSNP, pathogenic variant data sources, and Ensembl genes. We also manually annotated mitochondrial DNA genes and transcripts, with special considerations for tRNA and rRNA genes. Third, we incorporated a “plug-in” strategy to enhance report pages with heterogeneous information while requiring minimal changes in LOVD scripts and the database schema. As the native LOVD captures limited details for each data type and is rigid in data structure, it is not very flexible in reporting heterogeneous genomic data. We developed enhanced gene/transcript/variant report pages in MSeqDR-LSDB, which flexibly integrate information and graphics from GBrowse and other data sources external to the LOVD database. The ‘Quick Comment’ and ‘ClinVar Style’ annotation plug-ins, for example, provide flexible and semi-automatic variant annotation tools for end users. Lastly, we integrated MSeqDR-LSDB with other MSeqDR components through unified account and authorization management, and further customized two-way interlinking of LSDB pages with MSeqDR and GBrowse components. Thus, data integration is achieved at the gene, variant, and disease levels.

MSeqDR-LSDB currently hosts 178 mitochondrial diseases and 1,363 genes (Figure 1). Disease-variation information was drawn from multiple sources: MSeqDR literature mining, MITOMAP, HmtDB, ClinVar, Ensembl phenotype data, and other community resources including the POLG mutation database and the TAZ mutation database from the Barth Syndrome Foundation. There are now 3,975 public variants, among which 3,711 are unique (Table 1). Overall, MSeqDR-LSDB is among the largest LSDB instances of all LOVD installations.

Figure 1. MSeqDR-LSDB Current Status Panel and MSeqDR Annual Usage by City. **Left Panel**: MSeqDR-LSDB Current Status Panel (as of November 19, 2015) details the numbers of genes, variants per gene, diseases associated per gene, and the numbers of variants per variant type. **Right Panel**: MSeqDR annual usage statistics generated by Google Analytics from November 2014 through October 2015. *Top*, unique MSeqDR visit sessions per month over the last 12 months. *Bottom*, a global view of MSeqDR site visits by city, where the visit number by city is highlighted by gradients.

Table 1.

Summary of pathogenic or likely pathogenic variants in MSeqDR-LSDB (2015-11-22).

Variant Sources	Variant Entries	Unique Variants
All	3,979	3,711
ClinVar	1,825	1,821
Ensembl	1,073	1,059
MitoMap	555	555
POLG	208	208
BSF_TAZ	337	199
mtDNA	1,143	1,143
Manuscript review	1	1

mtDNA-specific genome mining tools

A number of other mtDNA-specific genome mining tools have been either developed as part of the MSeqDR project or incorporated from other sources. These include mvTool, a universal mtDNA variant converter and one-stop annotation resource. mvTool converts dozens of existing mtDNA variant formats into a standard rCRS-based (NC_012920.1) HGVS format. It also converts African Yoruba (AF347015) based positions into rCRS-based reference positions. Like VariantOneStop, it is a one-stop variant annotation tool, but one specific to mtDNA, with additional multiple-population frequencies from major mtDNA resources including MITOMAP, HmtDB, GeneDx, and the community.
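To make the idea of format conversion concrete, the Python sketch below normalizes two common shorthand notations for mtDNA substitutions into an rCRS-based HGVS string. It is a minimal illustration and not mvTool's implementation: the function name and the accepted patterns are assumptions, only simple substitutions are handled, the stated reference base is not checked against the rCRS sequence, and conversion of AF347015-based positions is not attempted.

```python
import re

RCRS_ACCESSION = "NC_012920.1"
RCRS_LENGTH = 16569  # length of the rCRS in base pairs

# Accepted shorthand styles (case-insensitive):
#   "m.3243A>G" or "3243A>G"   position, reference base, ">", alternate base
#   "A3243G"                   reference base, position, alternate base
_PATTERNS = [
    re.compile(r"^(?:m\.)?(?P<pos>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$", re.IGNORECASE),
    re.compile(r"^(?P<ref>[ACGT])(?P<pos>\d+)(?P<alt>[ACGT])$", re.IGNORECASE),
]


def to_rcrs_hgvs(variant):
    """Convert a simple mtDNA substitution to 'NC_012920.1:m.<pos><ref>><alt>'."""
    text = variant.strip()
    for pattern in _PATTERNS:
        match = pattern.match(text)
        if match is None:
            continue
        pos = int(match.group("pos"))
        ref = match.group("ref").upper()
        alt = match.group("alt").upper()
        if not 1 <= pos <= RCRS_LENGTH:
            raise ValueError(f"position {pos} is outside the rCRS (1-{RCRS_LENGTH})")
        if ref == alt:
            raise ValueError(f"reference and alternate bases are identical in {variant!r}")
        return f"{RCRS_ACCESSION}:m.{pos}{ref}>{alt}"
    raise ValueError(f"unrecognized variant format: {variant!r}")


for raw in ["A3243G", "m.11778G>A", "3460g>a"]:
    print(raw, "->", to_rcrs_hgvs(raw))
```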

In addition, Phy-Mer is a novel alignment-free and reference-independent mitochondrial haplogroup classifier, developed as part of the MSeqDR project [Navarro-Gomez et al., 2015]. It supports input in single- or multiple-sequence fasta, single-sample fastq, bam, and csv formats. MToolBox implements an effective computational strategy for the assembly and analysis of human mitochondrial genomes from mitochondria-targeted and off-target sequencing data [Calabrese et al., 2014]. The outputs include reconstructed mitochondrial genomes (for NGS data), haplogroup prediction, and functional annotation and prioritization of detected variants.

Description of MSeqDR in-house central component and customized tools

The in-house central component for the MSeqDR website includes core website components that convey MSeqDR project information, education, and updates. Additional components detailed below include (A) account and data access management, (B) genomic and phenotypic data search and mining interfaces, and (C) variant and genomic study data submission and curation tools.

MSeqDR account and data access management

MSeqDR values data security and patient privacy. MSeqDR is currently hosted at Children's Hospital Los Angeles (CHLA) as a secure Web portal (https://MSeqDR.org) that uses the SSL protocol; the server is protected by the hospital's firewall and by policies conforming to HIPAA and PHI protection guidelines. MSeqDR adopted OAuth, an open-standard authorization protocol that allows users to approve an application to act on their behalf without sharing their password, to enable a cross-institution, unified authentication account system between the locally managed server and the Genesis cloud-based server. The distinct account management databases of the four MSeqDR components (MSeqDR portal, GBrowse, LSDB, Genesis) are being unified and synchronized. Already, users only need to register and log on in a single portal to gain access to data across the whole site. Most of the MSeqDR data and tools are open to all general users regardless of whether they register. However, registration is required to access protected data deposited in MSeqDR, to fully utilize the system's functionality, or for users to contribute data and control the access policy for their data. Registration is free and open to all academic users. Commercial entities including clinical diagnostic laboratories and pharmaceutical companies are expected to register for fee-based use to support the ongoing activities of the MSeqDR Consortium. A data access and use oversight committee is being established to collectively determine data utilization and download access rights to assure maximal protection of data privacy.

Interactive and programmatic data search and mining capabilities within MSeqDR

MSeqDR supports multiple data search and analysis strategies to utilize the rich set of curated genome, exome, and phenotype data that has been compiled for mitochondrial diseases. Most users are likely to utilize the MSeqDR Web portal for data deposition, curation, sharing, searching, and visualization. However, more computer-savvy users and other databases or groups may use the GBrowse Distributed Annotation System (DAS) protocol (http://www.biodas.org) to exchange genomic annotations across the Internet, as well as Application Programming Interfaces (APIs) being developed at MSeqDR or provided by LOVD to programmatically retrieve data for genes, regions, or variants. Occasionally, non-standard data sharing requests can be processed off the website. To facilitate single-entry searches, a Google-style ‘search box’ at the top of every page offers auto-completion driven by real-time AJAX/JSON searches of the disease and phenotype databases. From this central search box, MSeqDR users can quickly survey and match queries to the most appropriate data types, including genomic, gene, transcript, and variant names, HGVS terms, or chromosomal regions. Similarly, a phenotype-based search combines fuzzy matching with a ranking method to search the cross-referenced databases for phenotype keywords, including OMIM, HPO, and MedGen dictionaries, or ClinVar IDs. Results are ranked by similarity and weight of matches, and each matching entry is reported with all relevant context, including genomic or phenotypic annotations that are linked to internal and external gene, variant, disease, or HPO browser report pages such as MSeqDR-GBrowse and MSeqDR-LSDB. In this fashion, the three major data access components have been fully integrated within the unified data backend of MSeqDR.
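The routing of a single search-box entry to a data type can be illustrated with a minimal Python sketch. The regular expressions, the category labels, and the function name `classify_query` are assumptions made for illustration only; they are not the rules used by the MSeqDR search box.

```python
import re

# Hypothetical patterns, checked in order; the first match wins.
PATTERNS = [
    # HGVS-like names, optionally prefixed by a reference sequence (e.g. NM_...:)
    ("hgvs", re.compile(r"^(?:[A-Za-z0-9_.]+:)?[cgmnp]\.\S+$")),
    # Chromosomal positions or regions such as chrM:3200-3300 or 17:7579472
    ("region", re.compile(r"^(?:chr)?(?:[0-9]{1,2}|X|Y|M|MT):\d+(?:-\d+)?$", re.IGNORECASE)),
    # Six-digit OMIM numbers, with or without an "OMIM:" prefix
    ("omim_id", re.compile(r"^(?:OMIM:)?\d{6}$", re.IGNORECASE)),
    # HPO identifiers such as HP:0003198
    ("hpo_id", re.compile(r"^HP:\d{7}$", re.IGNORECASE)),
    # Upper-case gene symbols such as POLG or MT-ND1
    ("gene_symbol", re.compile(r"^[A-Z][A-Z0-9-]{1,14}$")),
]


def classify_query(query: str) -> str:
    """Return a coarse data-type label for one search-box entry."""
    text = query.strip()
    for label, pattern in PATTERNS:
        if pattern.match(text):
            return label
    # Anything else is treated as free text for disease/phenotype search.
    return "keyword"


if __name__ == "__main__":
    for example in ["m.3243A>G", "chrM:3200-3300", "POLG", "Leigh syndrome"]:
        print(example, "->", classify_query(example))
```

In a sketch like this, ordering the patterns from most to least specific keeps an identifier such as an HPO ID from being mistaken for a gene symbol; free-text entries fall through to the keyword search.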

For multiple-entry searches, variant lists can be searched and annotated in VariantOneStop (Supp. Figure S2) or on the Human BP Codon Resource Variant Annotation Pipeline (HBCR) exome dataset annotation page. Phenotype term lists can be searched in the phenotype data search page, where matches per term are ranked by full-text search score. VariantOneStop is a complete custom variant annotation solution that supports variant input in various and mixed formats and nomenclatures, and integrates functional and pathogenicity information with multiple-population allele frequencies. VariantOneStop annotations are derived from MSeqDR's internal data, Ensembl Variant Effect Predictor (VEP) [McLaren et al., 2010] for novel SNP annotation, Mutalyzer [Wildeman et al., 2008] name checking, ClinVar, and multiple pathogenicity prediction algorithms including PolyPhen [Adzhubei et al., 2010], SIFT [Ng and Henikoff, 2001], dbNSFP [Liu et al., 2013], and the Combined Annotation Dependent Depletion (CADD) scores [Kircher et al., 2014]. It provides flexible conversion between variant naming styles (e.g., cDNA, genomic DNA, and various names for chromosomes or mtDNA variants), as well as conversion to the standard genomic coordinate-based HGVS_g form using web service APIs from Mutalyzer and Ensembl. It then conducts de novo prediction by calling the Ensembl VEP RESTful service for novel SNP annotation, and checks each variant against the existing core pathogenic variant data sets in MSeqDR. Variant input can be genomic or cDNA based and organized in VCF, HBCR text, or HGVS format. Enhanced annotations include: (1) diseases, phenotypes, and variants in MSeqDR, ClinVar, and Ensembl, hyperlinked to the MSeqDR internal pathogenic variant database, ClinVar, NCBI, and Ensembl; (2) population-specific allele frequencies, including those within the MSeqDR exome reference population (currently 1,700 exomes), the Genesis platform (6,300 patient exome samples), the 1000 Genomes Project (1KG), UK10K, ExAC, and EVS; (3) key information from dbNSFP, a comprehensive annotation resource for all potential non-synonymous SNPs; and (4) GERP score [Cooper et al., 2005], CADD score, and SIFT and PolyPhen scores for almost any variant.
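The kind of naming-style conversion described above can be sketched in Python for the simplest case, single-nucleotide substitutions read from VCF-style fields. This is an illustrative assumption, not VariantOneStop's code: the function names are hypothetical, and the output is a simplified HGVS-like string keyed by chromosome name rather than by a reference sequence accession, which full HGVS nomenclature would require.

```python
from typing import List


def normalize_chromosome(name: str) -> str:
    """Collapse common chromosome spellings (chr1, 1, chrM, MT, ...) to one form."""
    chrom = name.strip()
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    chrom = chrom.upper()
    # Several spellings are in circulation for the mitochondrial genome.
    if chrom in {"M", "MT", "MTDNA"}:
        return "MT"
    return chrom


def snv_to_hgvs_like(chrom: str, pos: int, ref: str, alt: str) -> str:
    """Build a simplified genomic (g.) or mitochondrial (m.) substitution name."""
    chrom = normalize_chromosome(chrom)
    prefix = "m" if chrom == "MT" else "g"
    return f"{chrom}:{prefix}.{pos}{ref.upper()}>{alt.upper()}"


def parse_vcf_line(line: str) -> List[str]:
    """Convert one VCF data line; indels and other non-SNV alleles are skipped."""
    chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
    return [
        snv_to_hgvs_like(chrom, int(pos), ref, allele)
        for allele in alt.split(",")
        if len(ref) == 1 and len(allele) == 1
    ]


if __name__ == "__main__":
    print(snv_to_hgvs_like("chrM", 3243, "A", "G"))
    print(parse_vcf_line("1\t12345\t.\tC\tT,G"))
```

Normalizing every input to one internal spelling before lookup is what allows mixed-format input lists to be matched against a single reference table; a production pipeline would delegate the actual HGVS validation to a service such as Mutalyzer, as the text describes.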

To analyze exome-level variant data, users can upload the data file directly to the custom HBCR annotation Web tool that is supported by MSeqDR. HBCR is a suite of database-driven (MySQL) Perl programs that functionally annotate variants. HBCR has been used extensively in clinical diagnosis [Consugar et al., 2015] and to discover a number of novel disease genes [Falk et al., 2012; Gai et al., 2013]. Predicted sequence changes are based on Ensembl gene models. Pathogenicity data is extracted from the MSeqDR curated database, which includes proprietary reference exome data and the MSeqDR-LSDB disease information associated with each variant, as well as multiple annotations from dbSNP, PolyPhen, SIFT, the dbNSFP resource, and CADD scores. Input is supported in HGVS, VCF, or custom HBCR text formats. A local Exomiser [Smedley et al., 2015] instance is supported to prioritize variants identified in individual exome data based on patient phenotype information coded in HPO and on OMIM disease data.

In addition, the MSeqDR Disease Portal enables unified exploration of all MSeqDR disease annotations. The Disease Portal highlights MSeqDR's data integration product, which unifies phenotypic and genomic data obtained from ClinVar, CTDBase, HPO, OMIM, and MSeqDR-LSDB. It lists all 178 known mitochondrial diseases that have been identified in MSeqDR, along with their detailed clinical manifestations as described in the HPO and NAMDC dictionaries. The Disease Portal organizes mitochondrial diseases in a dynamic tree based on MeSH and OMIM so that users can pivot between a global view and the details of a single disease. The tree is coupled with a drop-down disease list. Each disease can be accessed from the complete drop-down list, by traversing the disease tree, or by keyword search. Nodes are highlighted in the tree if they are mitochondrial diseases. Disease annotation is organized into four distinct sections: (1) the full OMIM and MedGen definition and synonyms, (2) the disease-associated genes and variants from MSeqDR-LSDB, with gene and variant data integrated in the LOVD system, (3) the disease-associated ClinVar variants, and (4) the disease-associated phenotypes in HPO and NAMDC, including the frequency of each HPO term within each disease. An HPO browser tool has been created in PHP and JavaScript to browse and traverse each HPO node for terms describing its phenotypic anomalies and to potentially match diseases, genes, and MSeqDR genomic features.
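A hierarchical disease tree with keyword search and highlighting of mitochondrial disease nodes can be sketched as follows. The Disease Portal itself is a Web application and its internal data model is not described here, so the `DiseaseNode` class, the helper functions, and the three-level toy tree are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class DiseaseNode:
    """One node of a hypothetical disease hierarchy."""
    name: str
    is_mitochondrial: bool = False
    children: List["DiseaseNode"] = field(default_factory=list)


def walk(node: DiseaseNode, path: Tuple[str, ...] = ()) -> Iterator[Tuple[Tuple[str, ...], DiseaseNode]]:
    """Depth-first traversal yielding (path from root, node) pairs."""
    current = path + (node.name,)
    yield current, node
    for child in node.children:
        yield from walk(child, current)


def search(root: DiseaseNode, keyword: str) -> List[str]:
    """Return the full tree path of every node whose name contains the keyword."""
    needle = keyword.lower()
    return [" > ".join(path) for path, node in walk(root) if needle in node.name.lower()]


def highlighted(root: DiseaseNode) -> List[str]:
    """Names of nodes that would be highlighted as mitochondrial diseases."""
    return [node.name for _, node in walk(root) if node.is_mitochondrial]


# Toy tree for demonstration only; not the MeSH/OMIM hierarchy used by MSeqDR.
EXAMPLE_TREE = DiseaseNode("Metabolic diseases", children=[
    DiseaseNode("Mitochondrial diseases", True, [
        DiseaseNode("Leigh syndrome", True),
        DiseaseNode("MELAS syndrome", True),
    ]),
])

if __name__ == "__main__":
    print(search(EXAMPLE_TREE, "leigh"))
    print(highlighted(EXAMPLE_TREE))
```

Returning the full path for each hit is one way to let a user jump from a keyword match back to the global view of where that disease sits in the tree.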

Data submission to MSeqDR

MSeqDR supports either quick submission of a single pathogenic variant (Supp. Figure S2) or full study and variant data submission. For a full submission, users submit a new study with meta-description data in a Web form that is compatible with ClinVar specifications. VCF, HGVS, and other supporting files can then be uploaded. Users can use an existing study as a template and then follow the step-by-step workflow. Users can also opt to submit de-identified patient information in LOVD or PhenoTips. The pathogenic variant submission tool captures data compatible with ClinVar specifications, and disease experts can further refine the annotations of pathogenic variants in this semi-automatic tool. The submission tool runs the same annotator as VariantOneStop, which compiles the complete functional, historical, and population annotation to aid in pathogenicity evaluation. Results can be exported as Excel, CSV, or JSON files or retrieved through API calls. This step has been automated so that a user can simply paste in variants in VCF or HGVS format and click “annotate”; the VariantOneStop tool then automatically completes up to 24 fields, and one more click stores the data permanently in MSeqDR-LSDB. Further refinement of the draft annotation provided for each variant is supported in MSeqDR through either quick add-on annotation or ClinVar-style full annotation. In the future, ClinVar-authorized users may use the tool to submit variants to ClinVar directly. MSeqDR thereby serves as a fully compatible community stopping point where expert disease panels can evaluate and review variant information before authorizing its submission to the global ClinVar data repository. After submission to MSeqDR, variants will be published according to the access policy following MSeqDR expert panel review.
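Exporting an annotated variant record to JSON or CSV can be illustrated with the Python standard library. The four field names below are a hypothetical subset chosen for the example; they are not the up to 24 fields that the submission tool fills in, and this is not MSeqDR's export code.

```python
import csv
import io
import json
from typing import Dict, List

# Hypothetical column subset; the real tool completes many more fields.
FIELDS = ["hgvs", "gene", "disease", "clinical_significance"]


def to_json(records: List[Dict[str, str]]) -> str:
    """Serialize annotation records as a JSON array."""
    return json.dumps(records, indent=2, sort_keys=True)


def to_csv(records: List[Dict[str, str]]) -> str:
    """Serialize annotation records as CSV with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Missing fields become empty cells; unknown keys are dropped.
        writer.writerow({key: record.get(key, "") for key in FIELDS})
    return buffer.getvalue()


EXAMPLE_RECORDS = [
    {
        "hgvs": "m.3243A>G",
        "gene": "MT-TL1",
        "disease": "MELAS",
        "clinical_significance": "Pathogenic",
    }
]

if __name__ == "__main__":
    print(to_json(EXAMPLE_RECORDS))
    print(to_csv(EXAMPLE_RECORDS))
```

Fixing the column order in one place keeps CSV exports stable across records that carry different subsets of annotations.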

PhenoTips is a software tool designed for efficient collection and analysis of phenotypic data [Girdea et al., 2013]. PhenoTips is set up for patient data input within MSeqDR. MSeqDR-PhenoTips fully utilizes the HPO dictionary and the phenotype-disease associations that are curated by the HPO Consortium. MSeqDR has implemented tools that translate raw clinical descriptions to the HPO dictionary in batch and allow users to review term mappings, push terms into PhenoTips, retrieve genes and diseases associated with their set of input terms, and profile phenotypes with OWLSim, which has been implemented by the Monarch Initiative (https://monarchinitiative.org/). The resulting HPO term list, along with a patient-specific variant list, can be fed to phenotype-aware exome variant analysis tools such as Exomiser, which runs locally within MSeqDR.
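Batch translation of raw clinical descriptions into HPO terms can be sketched with simple string similarity. The matching method (Python's `difflib`), the 0.6 score threshold, and the four-term dictionary excerpt are assumptions for illustration; the MSeqDR tools work against the full HPO dictionary, and the term IDs shown should be checked against a current HPO release before any real use.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional, Tuple

# Tiny hand-typed excerpt for demonstration; not the full HPO dictionary.
HPO_EXCERPT = {
    "HP:0001250": "Seizure",
    "HP:0001251": "Ataxia",
    "HP:0000508": "Ptosis",
    "HP:0003128": "Lactic acidosis",
}

Match = Tuple[str, str, float]


def best_match(description: str, min_score: float = 0.6) -> Optional[Match]:
    """Return (HPO ID, label, score) for the closest label, or None if too weak."""
    text = description.strip().lower()
    scored = [
        (SequenceMatcher(None, text, label.lower()).ratio(), hpo_id, label)
        for hpo_id, label in HPO_EXCERPT.items()
    ]
    score, hpo_id, label = max(scored)
    if score < min_score:
        return None  # leave the description for manual review
    return hpo_id, label, round(score, 2)


def map_batch(descriptions: List[str]) -> Dict[str, Optional[Match]]:
    """Map each raw description to its best candidate term for user review."""
    return {text: best_match(text) for text in descriptions}


if __name__ == "__main__":
    for raw, match in map_batch(["seizures", "lactic acidosis", "broken arm"]).items():
        print(raw, "->", match)
```

Keeping the similarity score alongside each candidate supports the review step described above, since low-scoring or unmatched descriptions can be flagged for a curator rather than pushed into PhenoTips automatically.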

Conclusions and Future Directions

MSeqDR provides the essential data back-end through integration of secure database-driven websites that support flexible data input and search functions via multiple bioinformatics tools, both developed in-house and modified from third parties. MSeqDR has already collected and processed genomic data related to mitochondrial diseases from dozens of investigators through collaboration with both non-profit institutions and commercial companies (Figure 1). MSeqDR currently supports the integrated analysis of compiled genomic data for variants, genes, and phenotypes, and organizes all relevant information at the level of individual mitochondrial diseases, defined clinical terminology, and well-organized descriptions in a hierarchical tree structure of disease manifestations. Dedicated tools that allow investigators to analyze their own datasets are centrally organized for investigations of both the nuclear and mitochondrial genomes relevant to mitochondrial disease. Finally, MSeqDR has developed and incorporated a user-friendly data submission tool that is compatible with the broader NCBI ClinVar format but largely automated, to save submitters the effort of compiling all of the variant data annotations needed for ClinVar upload. Programmatic data sharing between the MSeqDR domains of LSDB, GBrowse, VariantOneStop, and other interfaces is supported through APIs and web services.

MSeqDR development is underway to enable phenotype-exome data integration and to develop mining tools for phenotype- and pedigree-guided variant prioritization and diagnosis in molecular testing. To enable inclusion and sharing of individual patient-level data while respecting patient privacy and directives, we are building a centralized Web portal to streamline and automate the process of obtaining informed patient consent before genomic and phenotype data are uploaded to or deposited directly into MSeqDR. This will require Institutional Review Board (IRB) approval, as well as implementation of a data access and use oversight committee of all interested parties. All data will be de-identified and labeled only with a global universal identifier (“GUID”) to prevent data duplication and to enable matching of disparate data types entered at different times or locations for the same individual [Falk et al., 2015]. Efforts are underway to build a Web-based phenotype data entry system within MSeqDR that utilizes community-determined common data elements defined by the hierarchical Human Phenotype Ontology. Expert panels are being organized to curate variant and gene-level data on all known mitochondrial disease genes supported in LOVD, according to the expert panel curation recommendations of the ClinGen project [Landrum et al., 2014]. Finally, ongoing education efforts are central to assuring proper community engagement and utilization of existing and emerging MSeqDR component features. In this way, MSeqDR will continue to support clinicians and researchers in their efforts to better diagnose and understand the widely heterogeneous group of mitochondrial diseases.

Supplementary Material

Supp info

NIHMS763101-supplement-Supp_info.pdf (1,009.9 KB, pdf)

Acknowledgments

We are grateful to the late Dr. Richard Cotton for his guidance and support of the MSeqDR community effort from its inception. We are also grateful to the outstanding leadership and staff of the United Mitochondrial Disease Foundation, including Chuck Mohan, Dan Wright, Patrick Kelley, Philip Yeske, Cliff Gorski, and Janet Owens, for their tireless efforts to organize and provide ongoing financial support for the MSeqDR Consortium activities. The server has been hosted at CHLA since August 2015 with support from the CHLA IT department, and was previously located at MEEI with help from the OCI MBC group and the MEEI IT department. This work was also supported in part by the National Institutes of Health (U54-NS078059 and U41-HG006834) and the Netherlands Genomic Initiative (NGI)/Netherlands Organization for Scientific Research (NWO) within the framework of the Forensic Genomics Consortium Netherlands (FGCN) to MvO. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

MSeqDR Consortium Participants: Renkui Bai, Sherri Bale, Jirair Bedoyan, Doron Behar, Richard G. Boles, Penelope Bonnen, Virginia Brilhante, Lisa Brooks, Michael Brudno, Claudia Calabrese, Sarah Calvo, John Christodoulou, Deanna Church, Rosanna Clima, Bruce Cohen, IFM de Coo, William C. Copeland, Jeana T. DaRe, Olga Derbenevoa, Johan T. den Dunnen, David Dimmock, Gregory Enns, Giuseppe Gasparre, Rebecca Ganetzky, Amy Goldstein, Katrina Gwinn, Sihoun Hahn, Richard Haas, Hakon Hakonarson, Michio Hirano, Douglas Kerr, Danuta Krotoski, Austin Larson, Dong Li, Maria Lvova, Finley Macrae, Donna Maglott, Elizabeth McCormick, Grant Mitchell, Vamsi Mootha, Iris Gonzalez, Yasushi Okazaki, Melissa Parisi, Juan Carlos Perin, Eric Pierce, Vincent Procaccio, Holger Prokisch, Aurora Pujol, Shamima Rahman, David Ralph, Honey Reddi, Heidi Rehm, Erin Riggs, Richard Rodenburg, Yaffa Rubinstein, Russell Saneto, Mariangela Santorsola, Curt Scharfe, Claire Sheldon, Eric Shoubridge, Domenico Simone, Bert Smeets, Jan Smeitink, Christine Stanley, Fons Stassen, Anu Suomalainen-Waartiovaara, Mark Tarnopolsky, Isabelle Thiffault, David Thorburn, Johan Van Hove, Lynne Wolfe, Lee-Jun Wong, Philip Yeske, Zhe Zhang

References

1000 Genomes Project Consortium. Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491:56–65. doi: 10.1038/nature11632.
Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, Kondrashov AS, Sunyaev SR. A method and server for predicting damaging missense mutations. Nat Methods. 2010;7:248–249. doi: 10.1038/nmeth0410-248.
Behar DM, van Oven M, Rosset S, Metspalu M, Loogväli EL, Silva NM, Kivisild T, Torroni A, Villems R. A “Copernican” reassessment of the human mitochondrial DNA tree from its root. Am J Hum Genet. 2012;90(4):675–684. doi: 10.1016/j.ajhg.2012.03.002.
Calabrese C, Simone D, Diroma MA, Santorsola M, Guttà C, Gasparre G, Picardi E, Pesole G, Attimonelli M. MToolBox: a highly automated pipeline for heteroplasmy annotation and prioritization analysis of human mitochondrial variants in high-throughput sequencing. Bioinformatics. 2014;30(21):3115–3117. doi: 10.1093/bioinformatics/btu483.
Consugar MB, Navarro-Gomez D, Place EM, Bujakowska KM, Sousa ME, Fonseca-Kelly ZD, Taub DG, Janessian M, Wang DY, Au ED, Sims KB, Sweetser DA, Fulton AB, Liu Q, Wiggs JL, Gai X, Pierce EA. Panel-based genetic diagnostic testing for inherited eye diseases is highly accurate and reproducible, and more sensitive for variant detection, than exome sequencing. Genet Med. 2015;17(4):253–261. doi: 10.1038/gim.2014.172.
Cunningham F, Ridwan Amode M, Barrell D, Beal K, Billis K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fitzgerald S, Gil L, Garcín Girón C, et al. Ensembl 2015. Nucleic Acids Res. 2015;43(Database issue):D662–D669. doi: 10.1093/nar/gku1010.
Damas J, Carneiro J, Amorim A, Pereira F. MitoBreak: the mitochondrial DNA breakpoints database. Nucleic Acids Res. 2014;42:D1261–1268. doi: 10.1093/nar/gkt982.
Davis AP, Grondin CJ, Lennon-Hopkins K, Saraceni-Richards C, Sciaky D, King BL, Wiegers TC, Mattingly CJ. The Comparative Toxicogenomics Database's 10th year anniversary: update 2015. Nucleic Acids Res. 2015;43(Database issue):D914–920. doi: 10.1093/nar/gku935.
Diroma MA, Calabrese C, Simone D, Santorsola M, Calabrese FM, Gasparre G, Attimonelli M. Extraction and annotation of human mitochondrial genomes from 1000 Genomes Whole Exome Sequencing data. BMC Genomics. 2014;15(Suppl 3):S2. doi: 10.1186/1471-2164-15-S3-S2.
European Federation of Neurological Sciences. Finsterer J, Harbo HF, Baets J, Van Broeckhoven C, Di Donato S, Fontaine B, De Jonghe P, Lossos A, Lynch T, Mariotti C, Schöls L, Spinazzola A, et al. EFNS guidelines on the molecular diagnosis of mitochondrial disorders. Eur J Neurol. 2009;16(12):1255–64. doi: 10.1111/j.1468-1331.2009.02811.x.
Exome Aggregation Consortium. Lek M, Karczewski K, Minikel E, Samocha K, Banks E, Fennell F, O'Donnell-Luria A, Ware J, Hill A, Cummings B, Tukiainen T, Birnbaum D, et al. Analysis of protein-coding genetic variation in 60,706 humans. bioRxiv. 2015:030338. doi: 10.1101/030338.
Falk MJ, Zhang Q, Nakamaru-Ogiso E, Kannabiran C, Fonseca-Kelly Z, Chakarova C, Audo I, Mackay DS, Zeitz C, Borman AD, Staniszewska M, Shukla R, et al. NMNAT1 mutations cause Leber congenital amaurosis. Nat Genet. 2012;44(9):1040–1045. doi: 10.1038/ng.2361.
Falk MJ, Shen L, Gonzalez M, Leipzig J, Lott MT, Stassen AP, Diroma MA, Navarro-Gomez D, Yeske P, Bai R, Boles RG, Brilhante V, et al. Mitochondrial Disease Sequence Data Resource (MSeqDR): a global grass-roots consortium to facilitate deposition, curation, annotation, and integrated analysis of genomic data for the mitochondrial disease clinical and research communities. Mol Genet Metab. 2015;114(3):388–396. doi: 10.1016/j.ymgme.2014.11.016.
Fokkema IF, Taschner PE, Schaafsma GC, Celli J, Laros JF, den Dunnen JT. LOVD v.2.0: the next generation in gene variant databases. Hum Mutat. 2011;32(5):557–563. doi: 10.1002/humu.21438.
Gai X, Ghezzi D, Johnson MA, Biagosch CA, Shamseldin HE, Haack TB, Reyes A, Tsukikawa M, Sheldon CA, Srinivasan S, Gorza M, Kremer LS, et al. Mutations in FBXL4, encoding a mitochondrial protein, cause early-onset mitochondrial encephalomyopathy. Am J Hum Genet. 2013;93(3):482–495. doi: 10.1016/j.ajhg.2013.07.016.
Girdea M, Dumitriu S, Fiume M, Bowdin S, Boycott KM, Chénier S, Chitayat D, Faghfoury H, Meyn MS, Ray PN, So J, Stavropoulos DJ, et al. PhenoTips: patient phenotyping software for clinical and research use. Hum Mutat. 2013;34(8):1057–65. doi: 10.1002/humu.22347.
Gonzalez M, Falk MJ, Gai X, Postrel R, Schüle R, Zuchner S. Innovative Genomic Collaboration Using the GENESIS (GEM.app) Platform. Hum Mutat. 2015;36:950–956. doi: 10.1002/humu.22836.
Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 2015;33(DI):D514–517. doi: 10.1093/nar/gki033.
Kircher M, Witten DM, Jain P, O'Roak BJ, Cooper GM, Shendure J. A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet. 2014;46(3):310–5. doi: 10.1038/ng.2892.
Köhler S, Doelken SC, Mungall CJ, Bauer S, Firth HV, Bailleul-Forestier I, Black GC, Brown DL, Brudno M, Campbell J, FitzPatrick DR, Eppig JT, et al. The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data. Nucleic Acids Res. 2014;42(DI):D966–974. doi: 10.1093/nar/gkt1026.
Landrum MJ, Lee JM, Riley GR, Jang W, Rubinstein WS, Church DM, Maglott DR. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res. 2014;42(DI):D980–985. doi: 10.1093/nar/gkt1113.
Liu X, Jian X, Boerwinkle E. dbNSFP v2.0: A Database of Human Non-synonymous SNVs and Their Functional Predictions and Annotations. Hum Mutat. 2013;34:E2393–E2402. doi: 10.1002/humu.22376.
Longley MJ, Graziewicz MA, Bienstock RJ, Copeland WC. Consequences of mutations in human DNA polymerase gamma. Gene. 2005;354:125–131. doi: 10.1016/j.gene.2005.03.029.
Lott MT, Leipzig JN, Derbeneva O, Xie HM, Chalkia D, Sarmady M, Procaccio V, Wallace DC. mtDNA variation and analysis using MITOMAP and MITOMASTER. Curr Protoc Bioinformatics. 2013;1(123):1.23.1–1.23.26. doi: 10.1002/0471250953.bi0123s44.
Navarro-Gomez D, Leipzig J, Shen L, Lott M, Stassen AP, Wallace DC, Wiggs JL, Falk MJ, van Oven M, Gai X. Phy-Mer: a novel alignment-free and reference-independent mitochondrial haplogroup classifier. Bioinformatics. 2015;31(8):1310–2. doi: 10.1093/bioinformatics/btu825.
Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, Shaffer T, Wong M, Bhattacharjee A, Eichler EE, Bamshad M, Nickerson DA, et al. Targeted capture and massively parallel sequencing of 12 human exomes. Nature. 2009;461:272–276. doi: 10.1038/nature08250.
Pagliarini DJ, Calvo SE, Chang B, Sheth SA, Vafai SB, Ong SE, Walford GA, Sugiana C, Boneh A, Chen WK, Hill DE, Vidal M, Evans JG, Thorburn DR, Carr SA, Mootha VK. A mitochondrial protein compendium elucidates complex I disease biology. Cell. 2008;134(1):112–23. doi: 10.1016/j.cell.2008.06.016.
Rubino F, Piredda R, Calabrese FM, Simone D, Lang M, Calabrese C, Petruzzella V, Tommaseo-Ponzetta M, Gasparre G, Attimonelli M. HmtDB, a genomic resource for mitochondrion-based human variability studies. Nucleic Acids Res. 2012;40:D1150–1159. doi: 10.1093/nar/gkr1086.
Scharfe C, Lu HH, Neuenburg JK, Allen EA, Li GC, Klopstock T, Cowan TM, Enns GM, Davis RW. Mapping gene associations in human mitochondria using clinical disease phenotypes. PLoS Comput Biol. 2009;5(4):e1000374. doi: 10.1371/journal.pcbi.1000374.
Shabani M, Dyke SO, Joly Y, Borry P. Controlled Access under Review: Improving the Governance of Genomic Data Access. PLoS Biol. 2015;13(12):e1002339. doi: 10.1371/journal.pbio.1002339.
Smedley D, Jacobsen JOB, Jager M, Kohler S, Holtgrewe M, Schubach M, Siragusa E, Zemojtel T, Buske OJ, Washington NL, Bone WP, Haendel MA, Robinson PN. Next-generation diagnostics and disease-gene discovery with the Exomiser. Nat Prot. 2015;10(12):2004. doi: 10.1038/nprot.2015.124.
Thorogood A, Zawati MH. International Guidelines for Privacy in Genomic Biobanking (or the Unexpected Virtue of Pluralism). J Law Med Ethics. 2015;43(4):690–702. doi: 10.1111/jlme.12312.
van Oven M, Kayser M. Updated comprehensive phylogenetic tree of global human mitochondrial DNA variation. Hum Mutat. 2009;30(2):E386–E394. doi: 10.1002/humu.20921.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supp info

NIHMS763101-supplement-Supp_info.pdf^{(1,009.9KB, pdf)}

[R1] 1000 Genomes Project Consortium. Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491:56–65. doi: 10.1038/nature11632. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R2] Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, Kondrashov AS, Sunyaev SR. A method and server for predicting damaging missense mutations. Nat. Methods. 2010;7:248–249. doi: 10.1038/nmeth0410-248. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R3] Behar DM, van Oven M, Rosset S, Metspalu M, Loogväli EL, Silva NM, Kivisild T, Torroni A, Villems R. A “Copernican” reassessment of the human mitochondrial DNA tree from its root. Am J Hum Genet. 2012;90(4):675–684. doi: 10.1016/j.ajhg.2012.03.002. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R4] Calabrese C, Simone D, Diroma MA, Santorsola M, Guttà C, Gasparre G, Picardi E, Pesole G, Attimonelli M. MToolBox: a highly automated pipeline for heteroplasmy annotation and prioritization analysis of human mitochondrial variants in high-throughput sequencing. Bioinformatics. 2014;30(21):3115–3117. doi: 10.1093/bioinformatics/btu483. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R5] Consugar MB, Navarro-Gomez D, Place EM, Bujakowska KM, Sousa ME, Fonseca-Kelly ZD, Taub DG, Janessian M, Wang DY, Au ED, Sims KB, Sweetser DA, Fulton AB, Liu Q, Wiggs JL, Gai X, Pierce EA. Panel-based genetic diagnostic testing for inherited eye diseases is highly accurate and reproducible, and more sensitive for variant detection, than exome sequencing. Genet Med. 2015;17(4):253–261. doi: 10.1038/gim.2014.172. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R6] Cunningham F, Ridwan Amode M, Barrell D, Beal K, Billis K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fitzgerald S, Gil L, Garcín Girón C, et al. Ensembl 2015. Nucleic Acids Research 43 Database issue. 2015:D662–D669. doi: 10.1093/nar/gku1010. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R7] Damas J, Carneiro J, Amorim A, Pereira F. MitoBreak: the mitochondrial DNA breakpoints database. Nucleic acids Res. 2014;42:D1261–1268. doi: 10.1093/nar/gkt982. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R8] Davis AP, Grondin CJ, Lennon-Hopkins K, Saraceni-Richards C, Sciaky D, King BL, Wiegers TC, Mattingly CJ. The Comparative Toxicogenomics Database's 10th year anniversary: update 2015. Nucleic Acids Res. 2015;43(Database issue):D914–920. doi: 10.1093/nar/gku935. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R9] Diroma MA, Calabrese C, Simone D, Santorsola M, Calabrese FM, Gasparre G, Attimonelli M. Extraction and annotation of human mitochondrial genomes from 1000 Genomes Whole Exome Sequencing data. BMC Genomics 15 Suppl. 2014;3:S2, 25077682. doi: 10.1186/1471-2164-15-S3-S2. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R10] European Federation of Neurological Sciences. Finsterer J, Harbo HF, Baets J, Van Broeckhoven C, Di Donato S, Fontaine B, De Jonghe P, Lossos A, Lynch T, Mariotti C, Schöls L, Spinazzola A, et al. EFNS guidelines on the molecular diagnosis of mitochondrial disorders. Eur J Neurol. 2009;16(12):1255–64. doi: 10.1111/j.1468-1331.2009.02811.x. [DOI] [PubMed] [Google Scholar]

[R11] Exome Aggregation Consortium. Lek M, Karczewski K, Minikel E, Samocha K, Banks E, Fennell F, O'Donnell-Luria A, Ware J, Hill A, Cummings B, Tukiainen T, Birnbaum D, et al. Analysis of protein-coding genetic variation in 60,706 humans. bioRxiv. 2015:030338. doi: 10.1038/nature19057. doi:10.1101/030338. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R12] Falk MJ, Zhang Q, Nakamaru-Ogiso E, Kannabiran C, Fonseca-Kelly Z, Chakarova C, Audo I, Mackay DS, Zeitz C, Borman AD, Staniszewska M, Shukla R, et al. NMNAT1 mutations cause Leber congenital amaurosis. Nat Genet. 2012;44(9):1040–1045. doi: 10.1038/ng.2361. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R13] Falk MJ, Shen L, Gonzalez M, Leipzig J, Lott MT, Stassen AP, Diroma MA, Navarro-Gomez D, Yeske P, Bai R, Boles RG, Brilhante V, et al. Mitochondrial Disease Sequence Data Resource (MSeqDR): a global grass-roots consortium to facilitate deposition, curation, annotation, and integrated analysis of genomic data for the mitochondrial disease clinical and research communities. Mol Genet Metab. 2015;114(3):388–396. doi: 10.1016/j.ymgme.2014.11.016. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R14] Fokkema IF, Taschner PE, Schaafsma GC, Celli J, Laros JF, den Dunnen JT. LOVD v.2.0: the next generation in gene variant databases. Hum Mutat. 2011;32(5):557–563. doi: 10.1002/humu.21438. [DOI] [PubMed] [Google Scholar]

[R15] Girdea M, Dumitriu S, Fiume M, Bowdin S, Boycott KM, Chénier S, Chitayat D, Faghfoury H, Meyn MS, Ray PN, So J, Stavropoulos DJ, et al. PhenoTips: patient phenotyping software for clinical and research use. Hum Mutat. 2013;34(8):1057–65. doi: 10.1002/humu.22347. doi: 10.1002/humu.22347. Epub 2013 May 24. [DOI] [PubMed] [Google Scholar]

[R16] Gai X, Ghezzi D, Johnson MA, Biagosch CA, Shamseldin HE, Haack TB, Reyes A, Tsukikawa M, Sheldon CA, Srinivasan S, Gorza M, Kremer LS, et al. Mutations in FBXL4, encoding a mitochondrial protein, cause early-onset mitochondrial encephalomyopathy. Am J Hum Genet. 2013;93(3):482–495. doi: 10.1016/j.ajhg.2013.07.016. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R17] Gonzalez M, Falk MJ, Gai X, Postrel R, Schüle R, Zuchner S. Innovative Genomic Collaboration Using the GENESIS (GEM.app) Platform. Hum. Mutat. 2015;36:950–956. doi: 10.1002/humu.22836. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R18] Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 2015;33(DI):D514–517. doi: 10.1093/nar/gki033. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R19] Kircher M, Witten DM, Jain P, O'Roak BJ, Cooper GM, Shendure J. A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet. 2014 Mar;46(3):310–5. doi: 10.1038/ng.2892. doi: 10.1038/ng.2892. Epub 2014 Feb 2. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R20] Köhler S, Doelken SC, Mungall CJ, Bauer S, Firth HV, Bailleul-Forestier I, Black GC, Brown DL, Brudno M, Campbell J, FitzPatrick DR, Eppig JT, et al. The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data. Nucleic Acids Res. 2014;42(DI):D966–974. doi: 10.1093/nar/gkt1026. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R21] Landrum MJ, Lee JM, Riley GR, Jang W, Rubinstein WS, Church DM, Maglott DR. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res. 2014;42(DI):D980–985. doi: 10.1093/nar/gkt1113. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R22] Liu X, Jian X, Boerwinkle E. dbNSFP v2.0: A Database of Human Non-synonymous SNVs and Their Functional Predictions and Annotations. Hum Mutat. 2013;34:E2393–E2402. doi: 10.1002/humu.22376.

[R23] Longley MJ, Graziewicz MA, Bienstock RJ, Copeland WC. Consequences of mutations in human DNA polymerase gamma. Gene. 2005;354:125–131. doi: 10.1016/j.gene.2005.03.029.

[R24] Lott MT, Leipzig JN, Derbeneva O, Xie HM, Chalkia D, Sarmady M, Procaccio V, Wallace DC. mtDNA variation and analysis using MITOMAP and MITOMASTER. Curr Protoc Bioinformatics. 2013;1(123):1.23.1–1.23.26. doi: 10.1002/0471250953.bi0123s44.

[R25] Navarro-Gomez D, Leipzig J, Shen L, Lott M, Stassen AP, Wallace DC, Wiggs JL, Falk MJ, van Oven M, Gai X. Phy-Mer: a novel alignment-free and reference-independent mitochondrial haplogroup classifier. Bioinformatics. 2015;31(8):1310–2. doi: 10.1093/bioinformatics/btu825.

[R26] Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, Shaffer T, Wong M, Bhattacharjee A, Eichler EE, Bamshad M, Nickerson DA, et al. Targeted capture and massively parallel sequencing of 12 human exomes. Nature. 2009;461:272–276. doi: 10.1038/nature08250.

[R27] Pagliarini DJ, Calvo SE, Chang B, Sheth SA, Vafai SB, Ong SE, Walford GA, Sugiana C, Boneh A, Chen WK, Hill DE, Vidal M, Evans JG, Thorburn DR, Carr SA, Mootha VK. A mitochondrial protein compendium elucidates complex I disease biology. Cell. 2008 Jul 11;134(1):112–23. doi: 10.1016/j.cell.2008.06.016.

[R28] Rubino F, Piredda R, Calabrese FM, Simone D, Lang M, Calabrese C, Petruzzella V, Tommaseo-Ponzetta M, Gasparre G, Attimonelli M. HmtDB, a genomic resource for mitochondrion-based human variability studies. Nucleic Acids Res. 2012;40:D1150–1159. doi: 10.1093/nar/gkr1086.

[R29] Scharfe C, Lu HH, Neuenburg JK, Allen EA, Li GC, Klopstock T, Cowan TM, Enns GM, Davis RW. Mapping gene associations in human mitochondria using clinical disease phenotypes. PLoS Comput Biol. 2009 Apr;5(4):e1000374. doi: 10.1371/journal.pcbi.1000374.

[R30] Shabani M, Dyke SO, Joly Y, Borry P. Controlled Access under Review: Improving the Governance of Genomic Data Access. PLoS Biol. 2015 Dec 31;13(12):e1002339. doi: 10.1371/journal.pbio.1002339.

[R31] Smedley D, Jacobsen JOB, Jager M, Köhler S, Holtgrewe M, Schubach M, Siragusa E, Zemojtel T, Buske OJ, Washington NL, Bone WP, Haendel MA, Robinson PN. Next-generation diagnostics and disease-gene discovery with the Exomiser. Nat Protoc. 2015;10(12):2004. doi: 10.1038/nprot.2015.124.

[R32] Thorogood A, Zawati MH. International Guidelines for Privacy in Genomic Biobanking (or the Unexpected Virtue of Pluralism). J Law Med Ethics. 2015 Dec;43(4):690–702. doi: 10.1111/jlme.12312.

[R33] van Oven M, Kayser M. Updated comprehensive phylogenetic tree of global human mitochondrial DNA variation. Hum Mutat. 2009;30(2):E386–E394. doi: 10.1002/humu.20921.
