Genetics. 2016 Aug 5;203(4):1491–1495. doi: 10.1534/genetics.116.188870

Navigating the Phenotype Frontier: The Monarch Initiative

Julie A McMurry, Sebastian Köhler, Nicole L Washington, James P Balhoff, Charles Borromeo, Matthew Brush, Seth Carbon, Tom Conlin, Nathan Dunn, Mark Engelstad, Erin Foster, Jean-Philippe Gourdine, Julius OB Jacobsen, Daniel Keith, Bryan Laraway, Jeremy Nguyen Xuan, Kent Shefchek, Nicole A Vasilevsky, Zhou Yuan, Suzanna E Lewis, Harry Hochheiser, Tudor Groza, Damian Smedley, Peter N Robinson, Christopher J Mungall, Melissa A Haendel
PMCID: PMC4981258  PMID: 27516611

Abstract

The principles of genetics apply across the entire tree of life. At the cellular level we share biological mechanisms with species from which we diverged millions, even billions, of years ago. We can exploit this common ancestry to learn about health and disease, by analyzing DNA and protein sequences, but also through the observable outcomes of genetic differences, i.e., phenotypes. To solve challenging disease problems we need to unify the heterogeneous data that relates genomics to disease traits. Without a big-picture view of phenotypic data, many questions in genetics are difficult or impossible to answer. The Monarch Initiative (https://monarchinitiative.org) provides tools for genotype-phenotype analysis, genomic diagnostics, and precision medicine across broad areas of disease.

Keywords: comparative medicine, data integration, disease diagnosis, disease discovery, phenotype ontologies


To solve challenging disease problems, we need to unify the heterogeneous data that relates genomics to disease traits. Most databases tend to focus either on a single data type across species, or on a single species across data types. Although each database may provide rich, high-quality information, none is unified and comprehensive across species, over biological scales, and throughout data types (Figure 1A).

Figure 1.

Role of phenotypes in data integration. Computable phenotypes make it possible to deeply integrate databases and infer new insights.

Without a big-picture view of phenotypic data, many questions in genetics are difficult or impossible to answer. Use of computable phenotypes—which can be analyzed efficiently with algorithms—is a crucial strategy for gaining this broader view. When a disease has an unknown genetic basis, or is associated with mutations in multiple genes, computable phenotypes can provide valuable clues to the underlying complexity. Aggregating the data in one place is necessary for search and retrieval (Figure 1B), but aggregation often results in a loss of data richness and meaning. Connecting the dots enables the bigger picture to emerge: computable phenotypes (Figure 1C) provide key links across sources, species, and data types.

The Monarch Initiative (https://monarchinitiative.org) provides tools for genotype-phenotype analysis, genomic diagnostics, and precision medicine across broad areas of disease. These tools depend on the data integrated through computable phenotypes for cross-species comparisons.

To fully exploit the power of computable phenotypes, several obstacles must be overcome. These obstacles are illustrated for a specific example related to a family of diseases in Figure 2:

Figure 2.

Challenges associated with integration of data using phenotypes. Published relationships are shown in solid lines. Dashed lines show relationships that require computation and/or data integration. Around the perimeter of the figure are examples of the types of questions that are difficult to answer using traditional (nonintegrative) methods. These questions are divided into “clinical,” “basic,” and “translational” research categories. Each challenge is explained in the main text of the article.

  1. Different communities: Different communities use different language to describe the same phenotypes, even for the same species (e.g., clinicians use the term “micrognathia,” where patients would use “small jaw”). Similarly, a mouse researcher might describe the orthologous phenotype as “small mandible.”

  2. Phenotype profile matching: Many diseases closely resemble each other, and the constellation of phenotypes associated with any given disease rarely, if ever, manifests in the same way in every affected patient. Identifying the hallmark features as well as key differentiating phenotypes is an essential part of a differential diagnosis. In the above example, “proptosis” is common in Shprintzen–Goldberg syndrome, but not in Loeys–Dietz or Marfan syndromes. Any observed phenotype profile varies depending on the location and nature of the gene variation. This degree of variability means that fuzzy matching between sets of phenotypes can play a key role, both in differential diagnosis and mechanistic inquiry.

  3. Relevance of pathways and co-involvement: Knowing which genes/proteins are co-implicated in similar phenotypes provides important clues. Some evidence in mice shows that mutations in FBN1 as well as TGFBR1 and TGFBR2 lead to an overlapping set of abnormal phenotypes including those of the skeletal system (Arslan-Kirchner et al. 2016). Alteration of the murine ortholog of SKI (Ski) is associated with skeletal craniofacial anomalies in mouse models, and the SKI gene plays a role in TGF-β metabolism.

  4. Orthology: Gene orthology between species is not black and white: some sequences are more closely related than others. Moreover, describing orthology is complicated by factors such as gene duplication and splice variation. Sources differ as to whether given genes are orthologs and if so, what type (e.g., least diverged ortholog, paralog, one-to-one, many-to-many, etc.) (O’Brien et al. 2005; Altenhoff et al. 2016). In the above example, the zebrafish has two copies of the SKI ortholog: skia and skib.

  5. Relevant models across species: A single animal model rarely recapitulates all of the phenotypes exhibited in human disease. It often takes a combination of models to help form a complete picture. In this case, the Marfan mouse model exhibits “arachnodactyly,” whereas the zebrafish exhibits the craniofacial abnormalities (Doyle et al. 2012).

  6. Atomic phenotype similarity: Unlike genes, phenotypes are not discrete entities; this makes querying databases for phenotypes a difficult problem related to granularity. For instance, a query for any term (e.g., hyperkeratosis) should return all results associated with more specific variations of the underlying concept (e.g., palmoplantar hyperkeratosis). It is not only hierarchical relationships that matter, but basic similarity. For instance, micrognathia is similar to small mandible but neither of these is a parent of the other; rather, both terms descend from “abnormal jaw morphology.”

  7. Anatomy and biological scales: Similar phenotypes are recorded in different species for analogous anatomical regions (e.g., hand vs. paw). They also apply to different scales (e.g., “neurological phenotype” is related to more specific concepts such as “dopaminergic cell loss”). Structuring these concepts into networks allows both machines and humans to navigate complex interlinked data.

  8. Hypothesis generation: Simultaneously overcoming challenges 1–7 enables us to generate new hypotheses. For instance, we could speculate that mutation in the human SKI gene might lead to a disease with similarities to Marfan syndrome and Loeys–Dietz syndrome. In fact, these considerations supported the discovery of mutations in the SKI gene as the cause of Shprintzen–Goldberg syndrome (Schepers et al. 2015).

Additional biological complexities make it even harder to build a complete and accurate picture with the available phenotypic information. To name a few not illustrated above:

  9. Inference: Phenotypes are often associated with diseases, and diseases with genes; thus the relationship between a specific phenotype and a specific gene may need to be inferred. For instance, if we know that FBN1 is implicated in Marfan syndrome and that skeletal anomalies such as arachnodactyly are associated with Marfan syndrome, then we can infer that FBN1 is likely to play some role in skeletal development and homeostasis.

  10. Staging and severity: Interpretations are affected by the stage of an organism or the stage of disease at which the phenotype is observed, in combination with phenotypic severity.

  11. Time: Biological processes occurring at different developmental times are hard to compare across organisms.

  12. Phylogenetic distance: A model organism may not present the same spectrum of phenotypes when faced with an orthologous genetic variation, and the similarities between phenotypes become subtler, and thus harder to find and quantify, as phylogenetic distance increases (e.g., pleiotropic phenotypes in Bardet–Biedl syndrome are similar to effects seen in the cilia and basal bodies of single-celled eukaryotes).

  13. Noise: Observations may be incomplete or artifactual (noise).
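Two of the challenges above lend themselves to a small illustration: hierarchical query expansion (challenge 6) and inference through an intermediate disease (challenge 9). The following is a minimal sketch, not Monarch's implementation. The term labels and the FBN1/Marfan/arachnodactyly example come from the text; the toy hierarchy, the record identifiers, and all function names are assumptions made for illustration.

```python
from collections import defaultdict

# Toy ontology fragment: child term -> parent terms. Labels come from the
# text; the structure is a simplification, not the real HPO/MP hierarchy.
PARENTS = {
    "palmoplantar hyperkeratosis": ["hyperkeratosis"],
    "micrognathia": ["abnormal jaw morphology"],
    "small mandible": ["abnormal jaw morphology"],
}

# Hypothetical annotation records: (record id, phenotype term).
ANNOTATIONS = [
    ("patient-1", "palmoplantar hyperkeratosis"),
    ("patient-2", "hyperkeratosis"),
    ("mouse-7", "small mandible"),
]


def descendants(term, parents=PARENTS):
    """Return the term itself plus every more specific term beneath it."""
    children = defaultdict(set)
    for child, term_parents in parents.items():
        for parent in term_parents:
            children[parent].add(child)
    found, stack = {term}, [term]
    while stack:
        for child in children[stack.pop()]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def query(term, annotations=ANNOTATIONS):
    """Challenge 6: return records annotated to the term or any subclass."""
    wanted = descendants(term)
    return sorted(record for record, t in annotations if t in wanted)


# Challenge 9: the gene-phenotype link is not stated directly, so it is
# inferred by joining gene -> disease with disease -> phenotype.
GENE_TO_DISEASE = {"FBN1": ["Marfan syndrome"]}
DISEASE_TO_PHENOTYPE = {"Marfan syndrome": ["arachnodactyly"]}


def inferred_phenotypes(gene):
    """Return phenotypes reachable from a gene via its associated diseases."""
    return sorted(
        {
            phenotype
            for disease in GENE_TO_DISEASE.get(gene, [])
            for phenotype in DISEASE_TO_PHENOTYPE.get(disease, [])
        }
    )


print(query("hyperkeratosis"))
print(inferred_phenotypes("FBN1"))
```

A query for the general term picks up the record annotated to the more specific one, and the inferred gene-phenotype link is only as reliable as the two associations it is built from, which is why such links are best treated as hypotheses.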

A Common Conceptual Framework

Data scientists often apply ontologies to organize heterogeneous data. Ontologies are collections of concepts logically organized and linked. Most anatomy, phenotype, and disease ontologies describe the biology of one particular species. Examples are the Human Phenotype Ontology (Köhler et al. 2014) and the Mouse Anatomy Ontology (Hayamizu et al. 2015). The Monarch Initiative has developed four species-agnostic ontologies designed to unify their species-specific counterparts: GENO for genotypes (Brush et al. 2013), Uberpheno for phenotypes (Köhler et al. 2013), UBERON for anatomy (Haendel et al. 2014), and MONDO for diseases (Mungall et al. 2016). These ontologies provide a bridge between species-/domain-specific ontologies, allowing unified analysis of disparate data sources (Figure 3). Monarch also contributes to the Gene Ontology, which unifies gene function and subcellular anatomy across species (Ashburner et al. 2000).

Figure 3.

The Monarch Initiative’s ontology-driven data integration pipeline. Diverse data from disparate sources and annotated to disparate species-specific ontologies is integrated with unifying ontologies. The unified data corpus is used by analysis tools and interfaces.
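The bridging role of a unifying ontology can be pictured as a lookup from community-specific terms to a shared class, after which records from different sources can be grouped together. The sketch below is illustrative only: the vocabulary names, record identifiers, the `UNIFIED:small_lower_jaw` class, and the `unify` function are hypothetical and do not reflect actual Uberpheno identifiers or Monarch code. The three jaw terms are the examples given earlier in the text.

```python
from collections import defaultdict

# Hypothetical bridge: (vocabulary, term label) -> shared species-agnostic class.
BRIDGE = {
    ("clinical", "micrognathia"): "UNIFIED:small_lower_jaw",
    ("patient", "small jaw"): "UNIFIED:small_lower_jaw",
    ("mouse", "small mandible"): "UNIFIED:small_lower_jaw",
}

# Hypothetical source records: (record id, vocabulary, term label).
RECORDS = [
    ("clinic-note-1", "clinical", "micrognathia"),
    ("patient-report-2", "patient", "small jaw"),
    ("mouse-line-3", "mouse", "small mandible"),
    ("fish-line-4", "zebrafish", "short fin"),
]


def unify(records, bridge=BRIDGE):
    """Group record ids by shared class; keep unmapped records separately."""
    grouped = defaultdict(list)
    unmapped = []
    for record_id, vocabulary, label in records:
        shared = bridge.get((vocabulary, label))
        if shared is None:
            # Nothing is silently dropped: unmapped terms are reported so a
            # curator can decide whether the bridge needs a new entry.
            unmapped.append((record_id, vocabulary, label))
        else:
            grouped[shared].append(record_id)
    return dict(grouped), unmapped


grouped, unmapped = unify(RECORDS)
print(grouped)
print(unmapped)
```

Keeping unmapped records visible rather than discarding them mirrors the concern raised earlier that aggregation can lose data richness.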

Monarch tools leverage this conceptual framework to help users understand and diagnose disease. Statistical similarity calculations enable comparison across species (Figure 2.5), biological scales (Figure 2.7), and community-specific vocabularies (Figure 2.1) (Smedley et al. 2013). Monarch supports researchers and clinicians using this data with visualization tools, application programming interfaces, and a rich web site (https://monarchinitiative.org). These approaches make it possible to overcome limitations in the data for many applications, including disease diagnostics (Bone et al. 2016), drug repurposing, and improved phenotyping, both clinically and in model organisms (e.g., helping identify candidate phenotyping assays based on preliminary phenotyping). Indeed, Monarch’s unified data corpus and tools have been applied to diagnosing real patients, and plans are underway to scale up their use with larger efforts, including the Undiagnosed Diseases Network (Brownstein et al. 2015) and the 100,000 Genomes Project (http://www.genomicsengland.co.uk/the-100000-genomes-project/).
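To make the idea of a statistical similarity calculation concrete, the sketch below scores phenotype profiles by the information content (IC) of their most specific shared ancestor, a widely used approach in ontology-based matching. It is an assumption that this resembles what Monarch tools do; the article does not specify the algorithm. The hierarchy, the profile names, and every function name here are invented for illustration.

```python
import math

# Toy hierarchy (child -> parents); a simplification, not a real ontology.
PARENTS = {
    "micrognathia": ["abnormal jaw morphology"],
    "small mandible": ["abnormal jaw morphology"],
    "abnormal jaw morphology": ["abnormal craniofacial morphology"],
    "proptosis": ["abnormal eye morphology"],
    "abnormal eye morphology": ["abnormal craniofacial morphology"],
    "abnormal craniofacial morphology": ["phenotype"],
    "arachnodactyly": ["abnormal skeletal morphology"],
    "abnormal skeletal morphology": ["phenotype"],
}

# Hypothetical annotated profiles (not real disease or model annotations).
PROFILES = {
    "profile A": ["micrognathia", "proptosis"],
    "profile B": ["arachnodactyly"],
    "profile C": ["small mandible", "arachnodactyly"],
}


def ancestors(term):
    """Return the term plus all of its ancestors."""
    seen, stack = {term}, [term]
    while stack:
        for parent in PARENTS.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(profiles):
    """IC(t) = -log(fraction of profiles annotated to t or a descendant)."""
    counts = {}
    for terms in profiles.values():
        closure = set().union(*(ancestors(t) for t in terms))
        for term in closure:
            counts[term] = counts.get(term, 0) + 1
    total = len(profiles)
    return {term: -math.log(n / total) for term, n in counts.items()}


def term_similarity(a, b, ic):
    """IC of the most informative ancestor shared by both terms."""
    shared = ancestors(a) & ancestors(b)
    return max((ic.get(t, 0.0) for t in shared), default=0.0)


def profile_similarity(query, target, ic):
    """Average, over query terms, of the best match in the target profile."""
    if not query or not target:
        return 0.0
    best = [max(term_similarity(q, t, ic) for t in target) for q in query]
    return sum(best) / len(query)


def rank(query, profiles, ic):
    """Order profile names from most to least similar to the query."""
    scores = {n: profile_similarity(query, t, ic) for n, t in profiles.items()}
    return sorted(scores, key=scores.get, reverse=True)


IC = information_content(PROFILES)
print(rank(["micrognathia"], PROFILES, IC))
```

Because rare, specific terms carry more information than common, general ones, an exact match on "micrognathia" outranks a partial match through "abnormal jaw morphology", yet the partial match with "small mandible" still scores above zero. That nonzero score is what allows matching across vocabularies and species.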

To achieve this vision, we need technological advances and collaborative processes beyond the common conceptual framework. Existing descriptions of phenotypes and their relationships to genomic variations are all too frequently provided in community-specific formats, which lack the details and computational meaning needed for integration. Although we have made some progress with natural language approaches to extracting key details (Groza et al. 2015), expensive manual curation is still necessary.

To increase the portability and computability of phenotype descriptions, data providers and journals should use common phenotype information models. Such models require proper identification of the organisms being phenotyped. We have shown that ∼33% of mouse strains and 13% of fish strains were not uniquely identifiable in the literature, causing the associated phenotype data to be lost to public repositories (Vasilevsky et al. 2013). Increased use of organism-specific nomenclature and identifiers, as supported by the Model Organism Databases, will be necessary for more effective sharing of phenotype data.

A standard data exchange format is needed to ensure that phenotypic knowledge is computable and accessible across a variety of sources. Toward this end, we are developing an exchange format (http://phenopackets.org) that will do for phenotype data what existing formats [e.g., FASTA, Variant Call Format (VCF), and Browser Extensible Data (BED)] have done for sequence data. Phenopackets can be used in a variety of settings, such as for submission to journals, in public databases, for biodiversity collections, and for clinical data sharing. They can apply to one organism or to groups of organisms, and for qualitative or quantitative data.
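As a rough picture of what a computable phenotype record needs to carry, the sketch below defines a small structure with an explicit organism identifier, ontology-coded observations, and optional quantitative values, and round-trips it through JSON. This is not the phenopackets schema: every class name, field name, and identifier here is a hypothetical stand-in chosen only to illustrate the requirements stated above.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class PhenotypeObservation:
    term_id: str                   # ontology term identifier (placeholder below)
    label: str                     # human-readable term label
    absent: bool = False           # True if looked for and explicitly not found
    value: Optional[float] = None  # optional quantitative measurement
    unit: Optional[str] = None     # unit for the measurement, if any


@dataclass
class PhenotypeRecord:
    subject_id: str                # unique identifier for the organism or strain
    taxon: str                     # species identifier
    observations: List[PhenotypeObservation] = field(default_factory=list)


def to_json(record):
    """Serialize a record to a JSON string with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


def from_json(text):
    """Rebuild a record from the JSON produced by to_json."""
    data = json.loads(text)
    observations = [PhenotypeObservation(**o) for o in data.pop("observations")]
    return PhenotypeRecord(observations=observations, **data)


# Placeholder identifiers: "EX:" terms and the strain id are invented.
record = PhenotypeRecord(
    subject_id="EXAMPLE-STRAIN-001",
    taxon="NCBITaxon:10090",
    observations=[
        PhenotypeObservation("EX:0000001", "small mandible"),
        PhenotypeObservation("EX:0000002", "body weight", value=21.5, unit="g"),
    ],
)

print(to_json(record))
```

Making the subject identifier a required field reflects the point made above that phenotype data are lost when the organism cannot be uniquely identified.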

We invite the community to aid in the sharing, aggregation, and integration of cross-species phenotype data. By using, testing, and contributing to the phenopacket standard, you will be connecting the very dots that maximize mechanistic discovery of the genetic bases of health and disease.

Footnotes

Communicating editor: M. Johnston

Literature Cited

  1. Altenhoff A. M., Boeckmann B., Capella-Gutierrez S., Dalquen D. A., DeLuca T., et al., 2016. Standardized benchmarking in the quest for orthologs. Nat. Methods 13: 425–430.
  2. Arslan-Kirchner M., Arbustini E., Boileau C., Charron P., Child A. H., et al., 2016. Clinical utility gene card for: Hereditary thoracic aortic aneurysm and dissection including next-generation sequencing-based approaches. Eur. J. Hum. Genet. 24: 146–150.
  3. Ashburner M., Ball C. A., Blake J. A., Botstein D., Butler H., et al., 2000. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25: 25–29.
  4. Bone W. P., Washington N. L., Buske O. J., Adams D. R., Davis J., et al., 2016. Computational evaluation of exome sequence data using human and model organism phenotypes improves diagnostic efficiency. Genet. Med. 18: 608–617.
  5. Brownstein C. A., Holm I. A., Ramoni R., Goldstein D. B., 2015. Data sharing in the undiagnosed diseases network. Hum. Mutat. 36: 985–988.
  6. Brush M. H., Mungall C. J., Washington N., Haendel M. A., 2013. What’s in a Genotype?: An Ontological Characterization for Integration of Genetic Variation Data, pp. 105–108 in ICBO 2013 - Proceedings of the 4th International Conference on Biomedical Ontology 2013. CEUR-WS, Montreal, Canada.
  7. Doyle A. J., Doyle J. J., Bessling S. L., Maragh S., Lindsay M. E., et al., 2012. Mutations in the TGF-beta repressor SKI cause Shprintzen-Goldberg syndrome with aortic aneurysm. Nat. Genet. 44: 1249–1254.
  8. Groza T., Köhler S., Moldenhauer D., Vasilevsky N., Baynam G., et al., 2015. The Human Phenotype Ontology: Semantic Unification of Common and Rare Disease. Am. J. Hum. Genet. 97: 111–124.
  9. Haendel M. A., Balhoff J. P., Bastian F. B., Blackburn D. C., Blake J. A., et al., 2014. Unification of multi-species vertebrate anatomy ontologies for comparative biology in Uberon. J. Biomed. Semantics 5: 21.
  10. Hayamizu T. F., Baldock R. A., Ringwald M., 2015. Mouse anatomy ontologies: enhancements and tools for exploring and integrating biomedical data. Mamm. Genome 26: 422–430.
  11. Köhler S., Doelken S. C., Ruef B. J., Bauer S., Washington N., et al., 2013. Construction and accessibility of a cross-species phenotype ontology along with gene annotations for biomedical research. F1000Res 2: 30.
  12. Köhler S., Doelken S. C., Mungall C. J., Bauer S., Firth H. V., et al., 2014. The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data. Nucleic Acids Res. 42: D966–D974.
  13. Mungall C. J., Koehler S., Robinson P., Holmes I., Haendel M. A., 2016. k-BOOM: A Bayesian approach to ontology structure inference, with applications in disease ontology construction. bioRxiv DOI: http://dx.doi.org/10.1101/048843.
  14. O’Brien K. P., Remm M., Sonnhammer E. L., 2005. Inparanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Res. 33: D476–D480.
  15. Schepers D., Doyle A. J., Oswald G., Sparks E., Myers L., et al., 2015. The SMAD-binding domain of SKI: a hotspot for de novo mutations causing Shprintzen-Goldberg syndrome. Eur. J. Hum. Genet. 23: 224–228.
  16. Smedley D., Oellrich A., Köhler S., Ruef B., Westerfield M., et al., 2013. PhenoDigm: analyzing curated annotations to associate animal models with human diseases. Database (Oxford) 2013: bat025.
  17. Vasilevsky N. A., Brush M. H., Paddock H., Ponting L., Tripathy S. J., et al., 2013. On the reproducibility of science: unique identification of research resources in the biomedical literature. PeerJ 1: e148.

Articles from Genetics are provided here courtesy of Oxford University Press
