Using MARRVEL v1.2 for bioinformatics analysis of human genes and variant pathogenicity

Julia Wang; Dongxue Mao; Fatima Fazal; Seon-Young Kim; Shinya Yamamoto; Hugo Bellen; Zhandong Liu

doi:10.1002/cpbi.85

Author manuscript; available in PMC: 2020 Sep 1.

Published in final edited form as: Curr Protoc Bioinformatics. 2019 Sep;67(1):e85. doi: 10.1002/cpbi.85

PMCID: PMC6750039 NIHMSID: NIHMS1037046 PMID: 31524990

Abstract

One of the greatest challenges for bioinformatic analysis of human sequencing data is identifying which variants are pathogenic. To solve this problem, numerous databases and tools have been generated. However, many of these useful data and tools are dispersed, requiring users to search for their variants of interest across human genetics databases, variant function prediction tools, and model organism databases. To address this, we collected data from and observed human geneticists, clinicians, and model organism researchers to carefully select and display valuable information that facilitates the evaluation of whether or not a variant is likely pathogenic. This program, Model organism Aggregated Resources for Rare Variant ExpLoration (MARRVEL) v1.2, allows users to collect relevant data from 27 public sources for efficient bioinformatic analysis and prioritization of human variants.

Keywords: Genetics, Genomics databases, Variants of unknown significance, Genes of unknown significance, Model organisms

INTRODUCTION

As sequencing technology evolves, more genes and variants of unknown significance come to the attention of clinicians and human geneticists who try to understand the cause of genetic disorders. Many databases and prediction algorithms currently exist that are useful for assessing the significance of genes and variants and for establishing a hypothesis that can be experimentally tested. However, compiling these dispersed data usually demands either a lot of time or bioinformatics skills. Here, Model organism Aggregated Resources for Rare Variant ExpLoration (MARRVEL) (Wang et al., 2017) provides a simple, easy-to-use tool for non-computational users interested in gathering data dispersed throughout dozens of tools and databases across the World Wide Web.

This protocol includes methods to search MARRVEL v1.2 starting with a human gene and variant or human gene only. In addition, methods to start the search with a gene of interest in key model organisms are discussed. Finally, the support protocol describes how to use the MARRVEL API.

STRATEGIC PLANNING

The information required to initiate a MARRVEL search is very simple: a gene name and/or variant information. The most likely source of error is either an outdated gene symbol or incorrect variant nomenclature. If you are unsure, please refer to the HUGO Gene Nomenclature Committee (HGNC) (Povey et al., 2001) for the correct human gene symbol, Mutalyzer (den Dunnen, 2016; Wildeman, van Ophuizen, den Dunnen, & Taschner, 2008) for the correct variant (nucleic acid) nomenclature, and TransVar (Zhou et al., 2015) for the correct variant (amino acid) nomenclature.

Please also note that MARRVEL v1.2 uses hg19/GRCh37, so variant information based on hg38/GRCh38 must be converted to hg19/GRCh37 using UCSC LiftOver (https://genome.sph.umich.edu/wiki/LiftOver) before initiating a variant-based search.

BASIC PROTOCOL 1

HUMAN GENE AND VARIANT INITIATED SEARCH

The purpose of this method is to query human genetics databases (OMIM, ExAC, gnomAD, Geno2MP, ClinVar, DGV, and DECIPHER), key eukaryotic genetic model organism and gene ontology databases (SGD, PomBase, WormBase, FlyBase, ZFIN, MGI, RGD, GO Central), and others (DIOPT, dbNSFP, Mutalyzer, TransVar, GTEx, Human Protein Atlas) using MARRVEL v1.2 (updated February 2019). Please see Table 1 for a comprehensive list and descriptions of each database/tool.

Table 1:

List of all databases and tools curated by MARRVEL v1.2

Name of Database	URL/Link to Database	Rationale for Inclusion into MARRVEL	Reference (PMID)
Human Genetics Databases
OMIM (Online Mendelian Inheritance in Man)	https://omim.org/	The three main pieces of information that we draw from OMIM are: gene function, associated phenotypes, and reported alleles. It is helpful to know if a gene is associated with a known Mendelian phenotype (# entries) whose molecular basis is known. Genes without this knowledge are candidates for novel gene discovery. For genes that are in this category, if the patient’s phenotype does not match the reported disease and phenotype as well as those of the patients in the literature, then this increases the opportunity to provide a phenotypic expansion for the gene of interest.	PMID: 28654725
gnomAD	http://gnomad.broadinstitute.org/	gnomAD contains a total of 123,136 exome sequences and 15,496 whole-genome sequences from unrelated individuals sequenced as part of various disease-specific and population genetic studies. A significant portion of ExAC data is integrated into gnomAD. In MARRVEL we currently display the population frequencies that pertain to the specific variant.	PMID: 27535533
ExAC	http://exac.broadinstitute.org/	ExAC contains more than 60,000 exomes and is, other than gnomAD (http://gnomad.broadinstitute.org/), the largest public collection of exomes that have been selected against individuals with severe early-onset Mendelian phenotypes. For MARRVEL’s purposes, ExAC and gnomAD serve as the best control population datasets to calculate minor allele frequency. We provide two sets of outputs from ExAC. The first output is the gene-centric overview of the expected versus observed number of missense and loss of function (LOF) alleles. A metric called pLI (probability of LOF Intolerance) ranges between 0.00 and 1.00 and reflects the selective pressure on certain variants before reproductive age. A pLI score of 1.00 means that this gene is very intolerant of any LOF variants and that haploinsufficiency of this gene may cause disease in humans. The second output is data from ExAC that pertains to the specific variant. If an identical variant is seen in ExAC, MARRVEL will display the minor allele frequency.	PMID: 27535533
ClinVar	https://www.ncbi.nlm.nih.gov/clinvar/	ClinVar is a public archive of reports of the relationships among human variations and phenotypes, with supporting evidence. Variants with interpretations reported by researchers and clinicians are valuable for analyzing how likely a variant is pathogenic.	PMID: 29165669
Geno2MP (Genotype to Mendelian Phenotype)	http://geno2mp.gs.washington.edu/Geno2MP/	Geno2MP is a collection of samples from the University of Washington Center for Mendelian Genomics. It contains ~9,650 exomes of affected individuals and unaffected relatives. This database links phenotypic as well as mode of inheritance information to specific alleles. For phenotype, by comparing the affected organ system of the patient of interest to the affected individuals in Geno2MP, one may find potential matches. A match in allele, mode of inheritance, and phenotype provides an increased probability that the variant is likely pathogenic. However, due to the small sample size, a negative association does not necessarily decrease a variant’s pathogenic priority. A mechanism to contact the primary physician of a patient of interest is provided in the original source.	N/A
DECIPHER	https://decipher.sanger.ac.uk/	The DECIPHER data displayed on MARRVEL includes common variants from the control population. The data displayed includes structural variants that cover the genomic location of the input variant. DECIPHER also contains variant and phenotypic information for affected individuals but can only be accessed directly through their website.	PMID: 19344873
DGV (Database of Genomic Variants)	http://dgv.tcag.ca/dgv/app/home	To our knowledge, DGV is the largest public-access collection of structural variants from more than 54,000 individuals. The database includes samples of reportedly healthy individuals, at the time of ascertainment, from up to 72 different studies. Possible limitations of this data include variation in the source and method of data acquisition, the lack of information regarding incomplete penetrance of pathogenic CNVs, and whether individuals will develop associated diseases subsequent to data collection.	PMID: 24174537
Model Organisms Databases
FlyBase (Drosophila)	http://flybase.org	MARRVEL collects and displays data from multiple model organism databases. We provide a summary of the molecular, cellular and biological function of the gene using GO terms. The most likely ortholog is derived by DIOPT.	PMID:26467478
IMPC (International Mouse Phenotyping Consortium) (mouse)	http://www.mousephenotype.org/	MARRVEL provides a hyperlink to the corresponding mouse gene pages on the IMPC website. If a knock-out mouse has been made by the IMPC, an exhaustive list of assays and their results is made publicly available and can provide insight into the phenotype when a gene is lost. Some information is curated in MGI, but there may be a time lag.	PMID: 27626380
MGI (Mouse Genome Informatics) (mouse)	http://www.informatics.jax.org/	MARRVEL collects and displays data from multiple model organism databases. We provide a summary of the molecular, cellular and biological function of the gene using GO terms. The most likely ortholog is derived by DIOPT.	PMID:25348401
Monarch Initiative	https://monarchinitiative.org/	MARRVEL provides a link to the Phenogrid of a human gene on Monarch Initiative. This grid provides comparisons between the phenotype of model organisms and known human diseases.	PMID: 27899636
PomBase (fission yeast)	https://www.pombase.org/	MARRVEL collects and displays data from multiple model organism databases. We provide a summary of the molecular, cellular and biological function of the gene using GO terms. The most likely ortholog is derived by DIOPT.	PMID:22039153
WormBase (C. elegans)	http://wormbase.org	MARRVEL collects and displays data from multiple model organism databases. We provide a summary of the molecular, cellular and biological function of the gene using GO terms. The most likely ortholog is derived by DIOPT.	PMID:26578572
ZFIN (zebrafish)	https://zfin.org/	MARRVEL collects and displays data from multiple model organism databases. We provide a summary of the molecular, cellular and biological function of the gene using GO terms. The most likely ortholog is derived by DIOPT.	PMID:26097180
RGD (Rat Genome Database) (rat)	https://rgd.mcw.edu/	MARRVEL collects and displays data from multiple model organism databases. We provide a summary of the molecular, cellular and biological function of the gene using GO terms. The most likely ortholog is derived by DIOPT.	PMID:25355511
GTEx (The Genotype-Tissue Expression Project)	https://gtexportal.org/home/	MARRVEL displays both mRNA and protein expression pattern in human tissues of each gene. The expression pattern can add insight into the phenotypes observed in patients and/or model organisms.	PMID: 29019975, 23715323
The Human Protein Atlas	https://www.proteinatlas.org/	MARRVEL displays both mRNA and protein expression pattern in human tissues of each gene. The expression pattern can add insight into the phenotypes observed in patients and/or model organisms.	PMID: 21752111
GO (Gene Ontology) Central	http://www.geneontology.org/	MARRVEL displays only Gene Ontology (GO) terms (Molecular Function, Cellular Component, and Biological Process) derived from experimental evidence for each gene. They are filtered by “experimental evidence codes” and GO terms based on “computational analysis evidence codes” and “electronic annotation evidence codes” (predictions) are avoided.	PMID: 10802651, 25428369
SGD (Saccharomyces Genome Database) (budding yeast)	https://www.yeastgenome.org/	MARRVEL collects and displays data from multiple model organism databases. We provide a summary of the molecular, cellular and biological function of the gene using GO terms. The most likely ortholog is derived by DIOPT.	PMID: 22110037
Nomenclature/Identifier Databases
HGNC (HUGO Gene Nomenclature Committee)	https://www.genenames.org/	HGNC official gene symbols are used for MARRVEL searches.	PMID: 27799471
DIOPT (DRSC Integrative Ortholog Prediction Tool)	https://www.flyrnai.org/cgi-bin/DRSC_orthologs.pl	DIOPT provides a multiple protein sequence alignment of the best predicted orthologs in six model organisms against the protein sequence of the human gene of interest. The alignment provides information on the conservation of specific amino acids as well as functional protein domains.	PMID: 21880147
Mutalyzer	https://mutalyzer.nl/	MARRVEL uses Mutalyzer’s API to convert different variant nomenclatures to genomic location.	PMID: 18000842
TransVar	https://bioinformatics.mdanderson.org/transvar/	MARRVEL uses TransVar to convert protein (amino acid) variant nomenclature to genomic location nomenclature, which most databases that MARRVEL links to use.	PMID: 26513549
Gene2Function	http://www.gene2function.org/search/	MARRVEL collaborates with DIOPT and Gene2Function to provide the “Model Organism Search” feature. A hyperlink is provided for users to access the Gene2Function website, which integrates a number of MO databases and displays them in a different style from how MARRVEL does.	PMID: 28663344
Ensembl	https://useast.ensembl.org/	Ensembl gene IDs are used to link the different databases.	PMID: 29155950
Miscellaneous Databases
PubMed	https://www.ncbi.nlm.nih.gov/pubmed/	MARRVEL provides a hyperlink to a “Gene”-based PubMed search. Clicking this link allows one to search for biomedical papers that refer to the gene of interest based on previous gene names and symbols.	N/A
dbNSFP	http://varianttools.sourceforge.net/Annotation/dbNSFP	MARRVEL uses dbNSFP to provide pathogenicity prediction scores.	PMID: 21520341

A case example will be used to illustrate how each of the elements may be useful to the user. The entire background of the case can be found in Ansar et al. (2018). In summary, homozygous loss-of-function variants in the gene Dynamin-binding protein (DNMBP) cause bilateral infantile cataracts. DNMBP:p.Arg271* is one of the variants identified as pathogenic in this autosomal recessive disease. In addition to the pathogenic variant reported in Ansar et al. (2018), a benign variant, DNMBP:p.Cys1413Trp, will be used to illustrate what to expect when a benign variant is queried. When these variants lead to negative data, additional examples will be used to illustrate alternative results that are possible.

Necessary Resources

Hardware

A computer with internet access.

Software

Web browser (Chrome, Firefox, etc).

Steps and Annotations

1
Navigate to http://marrvel.org in your web browser.
2
Click on either “Gene” or “Protein Variant” according to the nomenclature of your variant (Figure 1).

For “Gene”, enter HGNC (Povey et al., 2001) gene name and Human Genome Variation Society (HGVS) standard variant notation (den Dunnen, 2016).

The search bar for “Gene” is compatible with two types of variant nomenclature: genome location and transcript-based nomenclature. For genomic location nomenclature, use the coordinates according to hg19/GRCh37. Then, click on “SEARCH.” MARRVEL uses Mutalyzer for this function (den Dunnen, 2016; Wildeman et al., 2008).

For example, 6:99365567 T>C / FBXL4 or 6:99365567 T>C or FBXL4 or NM_012160.3:c.541A>G

For “Protein Variant”, enter your variant in the following format: GENE NAME:p.[Reference Amino Acid][Amino Acid location][Variant Amino Acid].

For example, TK2:p.Leu178Met, IRF2BPL:p.P372R, IRF2BPL:p.Gln126*. Then, click on “SEARCH.”

When the amino acid change is ambiguous and can be matched to multiple transcripts and genomic locations, MARRVEL will provide the options for you to select the appropriate transcript. Then, click “MARRVEL IT.”

Note that for Protein Variant, you can only search for missense or stop-gain variants. For all other variants, please use the nucleotide change nomenclature and the “Gene” search.
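The input formats described in this step can be checked locally before a search is submitted. The following Python sketch is only an illustration of the formats quoted above; the regular expressions and the function names (`parse_protein_variant`, `classify_gene_search`) are assumptions made for this example and are not part of MARRVEL, Mutalyzer, or TransVar.

```python
import re

# Three-letter to one-letter amino acid codes; "Ter" and "*" both denote a stop.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    "Ter": "*",
}
ONE_LETTER = set(THREE_TO_ONE.values())

# Hypothetical patterns modeled on the examples in the text.
_AA = r"[A-Z][a-z]{2}|[A-Z*]"
PROTEIN_RE = re.compile(rf"^([A-Za-z0-9-]+):p\.({_AA})(\d+)({_AA})$")
GENOMIC_RE = re.compile(r"^(\d{1,2}|X|Y|MT?):(\d+)\s*([ACGT]+)>([ACGT]+)$")
TRANSCRIPT_RE = re.compile(r"^(N[MR]_\d+\.\d+):([cn]\..+)$")


def _to_one_letter(code):
    """Normalize a one- or three-letter amino acid code to one letter."""
    if len(code) == 1 and code in ONE_LETTER:
        return code
    if code in THREE_TO_ONE:
        return THREE_TO_ONE[code]
    raise ValueError(f"unknown amino acid code: {code!r}")


def parse_protein_variant(text):
    """Parse 'GENE:p.[Ref][Position][Alt]' (missense or stop-gain only)."""
    match = PROTEIN_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a protein variant: {text!r}")
    gene, ref, position, alt = match.groups()
    return {
        "gene": gene,
        "ref": _to_one_letter(ref),
        "position": int(position),
        "alt": _to_one_letter(alt),
    }


def classify_gene_search(text):
    """Label a 'Gene' search variant as genomic, transcript, or unrecognized."""
    text = text.strip()
    if GENOMIC_RE.match(text):
        return "genomic"
    if TRANSCRIPT_RE.match(text):
        return "transcript"
    return "unrecognized"
```

For instance, `parse_protein_variant("IRF2BPL:p.Gln126*")` normalizes the three-letter code and the stop symbol, while `classify_gene_search("NM_012160.3:c.541A>G")` identifies transcript-based nomenclature. A check like this does not replace the validation that MARRVEL performs through Mutalyzer and TransVar.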

Figure 1: A) Click on “Model Organism Search” to start a search based on a gene in a model organism. The user will be directed to a page to select the model organism of interest and then enter the gene symbol. Suggestions of gene symbols will be available as the user types.

B) For variants with amino acid/protein nomenclature, click on “Protein Variant”. Users will see a single search bar where both the gene and variant can be entered. The search bar accepts either three letter amino acid code or single letter amino acid code. When there is ambiguity for which isoform the user is referring to, all options will be listed for the user to select.

C) For “About MARRVEL”, “FAQ”, “Feedback”, and “API”, click on “About” on the top menu. “About MARRVEL” includes information on the development team, acknowledgements of data sources, and more. “FAQ” includes the answers to frequently asked questions. “Feedback” page allows users to report errors or provide suggestions for new features in future updates. “API” page provides instructions on how to access MARRVEL API.

D) After searching for DNMBP:p.Cys1413Trp from the “Protein Variant” search bar, a box titled “Reverse Annotation Candidates” displays the gene name, genomic coordinate of the variant, variant type, transcripts, and a link to proceed to the MARRVEL results page.

Case example:

For searching the two variants, DNMBP:p.Arg271* and DNMBP:p.Cys1413Trp, click on “Protein Variant” and enter one of the variants. An intermediate page will display the genomic coordinates of the variant and the corresponding isoform. Click on “MARRVEL IT”.

The results page will contain data from all the databases that MARRVEL queries.

3
Click on OMIM (Online Mendelian Inheritance in Man)(Amberger, Bocchini, Scott, & Hamosh, 2019) from the menu on the left to navigate to OMIM data boxes as a starting point to determine if your gene of interest is associated with a human disease.

Locate the “Human Gene Description” box from OMIM for a short summary of what is known about the gene and gene product.

Locate the “Gene-Phenotype Relationships” box to determine if this gene is a known phenotype-associated gene or not.

Locate the “Reported Alleles From OMIM” box to get a list of pathogenic variants curated by OMIM.

Note that OMIM may still miss recently reported disease associations. Thus, we recommend users to conduct PubMed searches as well.

Case example:

For both DNMBP:p.Arg271* and DNMBP:p.Cys1413Trp, the results from OMIM will be the same. Since this is a new disease only recently reported in 2018, OMIM has not yet curated it as a Gene-Phenotype Relationship. In the left box titled “Human Gene Description,” DNMBP is described as a protein involved in the regulation of cell junctions (Figure 2).

On the other hand, searching for the gene IRF2BPL (Figure 2) shows that it is associated with an autosomal dominant disease.

4
Click on gnomAD/ExAC on the left menu to access ExAC (Exome Aggregation Consortium) (Lek et al., 2016) and gnomAD (genome Aggregation Database) (Karczewski et al., 2019) to determine the prevalence of your variant of interest in a large population database.

ExAC and gnomAD are large population genomics databases based on Whole Exome Sequencing (WES) and Whole Genome Sequencing (WGS) of people who are selected to exclude severe pediatric diseases. MARRVEL displays data from both sources. Both ExAC and gnomAD can be used as a “control population database”, especially for severe pediatric disorders, but their interpretation requires some degree of caution.

Note that gnomAD includes most of the data in ExAC since gnomAD is an updated release of ExAC.

Use the “Control Population Gene Summary” box to obtain gene-level statistics such as the probability of finding loss of function (LOF) alleles in the general population. Two important metrics are the upper bound of the confidence interval for the LOF observed/expected (o/e) ratio and the pLI (probability of LOF Intolerance) score (Karczewski et al., 2019). These metrics measure how intolerant a gene is to LOF variants.

Locate the next two boxes titled “Population Allele Frequencies (ExAC Database)” and “Population Allele Frequencies (gnomAD Database)” to obtain the allele frequencies of the variant of interest in ExAC (Lek et al., 2016) and gnomAD (Karczewski et al., 2019), respectively. These boxes will only be displayed when the user inputs variant information when performing a MARRVEL search.

Case example:

DNMBP:p.Arg271* is not found in gnomAD, indicating that it is a very rare variant, which increases the likelihood that it is pathogenic.

For DNMBP:p.Cys1413Trp, the Population Allele Frequencies (gnomAD Database) box shows 16,868 homozygous individuals. This is critical information indicating that this variant is unlikely to cause disease (Figure 2).

When searching for the gene IRF2BPL, an important metric to note is the upper bound of the confidence interval for the loss-of-function o/e score. When this number is below 0.35 (here it is 0.33), the gene is considered highly intolerant of loss-of-function variants.
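The reasoning in this case example can be written down as two small checks. This is a minimal sketch under stated assumptions: the 0.35 cut-off is the one quoted above, the function names are hypothetical, and treating any observed homozygote as evidence against pathogenicity is a deliberate simplification that suits severe recessive disorders only.

```python
LOEUF_CUTOFF = 0.35  # upper-bound o/e threshold quoted in the text


def is_lof_intolerant(oe_upper_bound, cutoff=LOEUF_CUTOFF):
    """Return True when the upper bound of the LOF o/e interval is below the cut-off."""
    return oe_upper_bound < cutoff


def summarize_variant_frequency(found_in_gnomad, homozygote_count=0):
    """Give a coarse, human-readable reading of the gnomAD variant box.

    Simplification (assumption): any homozygote in the control population is
    treated as evidence against a severe recessive disorder.
    """
    if not found_in_gnomad:
        return "absent from gnomAD: very rare, consistent with pathogenicity"
    if homozygote_count > 0:
        return "homozygotes observed in controls: unlikely to cause severe disease"
    return "present without homozygotes: weigh the allele frequency"


# Values taken from the case example above.
print(is_lof_intolerant(0.33))
print(summarize_variant_frequency(False))
print(summarize_variant_frequency(True, homozygote_count=16868))
```

These checks only restate the interpretation given in the protocol; they are not a substitute for reviewing the full gnomAD record.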

5
Navigate to the Pathogenicity Prediction Scores box to determine how likely it is that a variant is pathogenic (Figure 2).
MARRVEL obtains variant function prediction scores from the database for nonsynonymous SNPs’ functional predictions (dbNSFP) (Liu, Jian, & Boerwinkle, 2011). The scores we display include:
1. Combined Annotation Dependent Depletion (CADD) (Rentzsch, Witten, Cooper, Shendure, & Kircher, 2019) uses 63 annotations (949 features) to predict how damaging a variant is likely to be. CADD_phred is displayed on MARRVEL, where the range is from 0 (least damaging) to 50 (most damaging).
2. Rare Exome Variant Ensemble Learner (REVEL) (Ioannidis et al., 2016) is an ensemble method of 13 tools that predicts the pathogenicity of missense variants. The range of the output is 0 (least damaging) to 1 (most damaging).
3. Mendelian Clinically Applicable Pathogenicity score (M-CAP) (Jagadeesh et al., 2016) uses conservation data and is trained on data from mutations linked to Mendelian diseases. The possible outputs are: “Tolerated” or “Damaging.”
4. PolyPhen-2 (Adzhubei, Jordan, & Sunyaev, 2013) has two different scores that MARRVEL displays:
  1. PolyPhen-2 HumDiv uses eight sequence-based and three structure-based predictive features. It is trained with Mendelian disease mutation data and single nucleotide variation (SNV) data from close mammalian homolog proteins. The possible outputs are: “Benign”, “Possibly Damaging”, and “Probably Damaging”.
  2. PolyPhen-2 HumVar uses eight sequence-based and three structure-based predictive features. It is trained with disease-associated and common SNVs. The possible outputs are: “Benign”, “Possibly Damaging”, and “Probably Damaging”.
5. Genomic Evolutionary Rate Profiling (GERP++) (Davydov et al., 2010) uses multiple protein sequence alignments and a phylogenetic tree of 34 mammalian species to report how conserved the amino acid of interest is. The range of the score is −12.3 (least conserved) to 6.17 (most conserved).
6. phyloP 100way Vertebrate (Ramani, Krumholz, Huang, & Siepel, 2018) uses multiple alignments and a phylogenetic tree of 100 vertebrate species to report how conserved the amino acid of interest is. The range of the score is −20.0 (least conserved) to 10.003 (most conserved).
For each prediction/conservation output, a Rank Score is provided by dbNSFP to allow comparison between different scores. It is between 0 and 1 and a score of 0.9 means it is more likely to be damaging than 90% of all potential non-synonymous single nucleotide variations (nsSNVs) predicted by that method.

Case example:

Please refer to Figure 3 for a screenshot of the results.

The Single-Nucleotide Variant Functional Prediction box lists the scores that each prediction tool provides for the variants. CADD indicates that DNMBP:p.Arg271* is likely highly damaging, with a score of 37 (range: 0–50) and a rank score of 0.97, indicating that it is in the 97th percentile of pathogenicity out of all missense variants. GERP and phyloP both indicate that this residue is moderately well-conserved throughout evolution.

For another variant, DNMBP:p.Cys1413Trp, the Rank Scores indicate that the six tools predict this variant to be in the 4.7–34.5 percentile of pathogenicity of all variants, indicating that this variant is likely benign.
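Because every dbNSFP Rank Score lies between 0 and 1, converting the scores to percentiles and summarizing their spread is a simple calculation. The sketch below illustrates that conversion; the function names and the tool labels are hypothetical, and only the two endpoint values (0.047 and 0.345) correspond to the range quoted in the case example, while the intermediate value is invented for illustration.

```python
def rank_scores_to_percentiles(rank_scores):
    """Convert dbNSFP-style rank scores (0 to 1) into percentiles (0 to 100)."""
    for tool, score in rank_scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"rank score for {tool} is outside [0, 1]: {score}")
    return {tool: round(score * 100, 1) for tool, score in rank_scores.items()}


def percentile_range(rank_scores):
    """Return the (lowest, highest) percentile across all tools."""
    percentiles = rank_scores_to_percentiles(rank_scores).values()
    return min(percentiles), max(percentiles)


# Illustrative input: tool labels and the middle value are placeholders.
example_scores = {"tool_a": 0.047, "tool_b": 0.20, "tool_c": 0.345}
low, high = percentile_range(example_scores)
print(f"Predicted pathogenicity percentile range: {low}-{high}")
```

A narrow range at the low end, as in this example, matches the likely-benign reading described above, whereas a rank score of 0.97 corresponds to the 97th percentile.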

6
Refer to data from the Genotype to Mendelian Phenotype Browser (Geno2MP) (http://geno2mp.gs.washington.edu) in the “Disease population” and “Gene-Phenotype Relationships” boxes to check if there are other individuals with rare variants in the gene of interest and to review their phenotypic descriptions.

Geno2MP contains about 9,600 exomes of individuals with a rare disease and their unaffected relatives enrolled in the University of Washington Center for Mendelian Genomics study (Chong et al., 2015). Some crude phenotypic descriptions are also provided.

Locate the “Disease population” box to obtain the allele frequency of the variant of interest.

Locate the “Gene-Phenotype Relationships” box to obtain HPO (human phenotype ontology) terms (Kohler et al., 2017) for the individuals with the variant of interest.

Case example:

No matches with DNMBP:p.Arg271* or DNMBP:p.Cys1413Trp are found in Geno2MP.

However, if the user searches for variant 6:99365567 T>C in the gene FBXL4 (Figure 4), there are indeed individuals with this variant of interest and the respective HPO (Human Phenotype Ontology) profiles are listed in the “Gene-Phenotype Relationships (Geno2MP)” box.

7
Refer to data from ClinVar (Landrum & Kattman, 2018) in the “Reported Alleles From ClinVar” box to check for the clinical significance of the variant of interest.

ClinVar is a database supported by the National Institutes of Health (NIH) where researchers and clinicians submit variants with or without determination of pathogenicity. Information on both SNVs and CNVs is collected in ClinVar.

Locate the top row of colorful boxes (in green, blue, yellow, and orange) to review a summary of the number of each type of variant reported in ClinVar.

Check the list of variants below in the box “Reported Alleles From ClinVar.” Note that if a variant was included in the initial search, the highlighted variants in teal are all variants that include the genomic location of the variant of interest. For many CNVs, the region will likely include many more genes than your gene of interest. The columns “Clinical Significance” and “Review Status” can inform you of the significance of the variant and how certain that designation is. Under “Clinical Significance” look for pathogenic and under “Review Status” look for multiple reviewers and no conflict.
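The filtering rule described above (look for pathogenic entries whose review status indicates multiple reviewers and no conflict) can be expressed as a short filter. This is a sketch under assumptions: the dictionary keys are hypothetical names modeled on the “Clinical Significance” and “Review Status” column headers, the records would have to be transcribed or exported by the user, and the substring match on the review status assumes ClinVar's wording “criteria provided, multiple submitters, no conflicts”.

```python
def strong_pathogenic_entries(entries):
    """Keep entries called pathogenic by multiple submitters without conflicts.

    Each entry is a dict with the hypothetical keys 'clinical_significance'
    and 'review_status'. 'Likely pathogenic' alone is excluded on purpose.
    """
    kept = []
    for entry in entries:
        significance = entry["clinical_significance"].strip().lower()
        status = entry["review_status"].strip().lower()
        is_pathogenic = significance.startswith("pathogenic")
        well_reviewed = "multiple submitters" in status and "no conflicts" in status
        if is_pathogenic and well_reviewed:
            kept.append(entry)
    return kept


# Placeholder records for illustration only; not real ClinVar data.
records = [
    {"variant": "variant_1", "clinical_significance": "Pathogenic",
     "review_status": "criteria provided, multiple submitters, no conflicts"},
    {"variant": "variant_2", "clinical_significance": "Pathogenic",
     "review_status": "criteria provided, single submitter"},
    {"variant": "variant_3", "clinical_significance": "Uncertain significance",
     "review_status": "criteria provided, multiple submitters, no conflicts"},
]
for record in strong_pathogenic_entries(records):
    print(record["variant"])
```

Entries that fail this filter are not necessarily benign; the filter only highlights the designations that carry the most certainty.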

Case example:

Only copy number variants are reported in ClinVar for DNMBP, so there are no matches with DNMBP:p.Arg271* or DNMBP:p.Cys1413Trp.

However, if the user searches for variant 6:99365567 T>C in the gene FBXL4 (Figure 4) and scrolls further down in the ClinVar table, there are missense variants that are reported to be pathogenic and associated with Mitochondrial DNA depletion syndrome 13, with the criteria and submission status provided for each variant.

8
Click on DGV/DECIPHER on the left menu to access data from the Database of Genomic Variants (DGV) (MacDonald, Ziman, Yuen, Feuk, & Scherer, 2014) and DECIPHER_CONTROL (Firth et al., 2009) to check if your gene of interest is included in copy number variations present in the control population.

DGV (Database of Genomic Variants, http://dgv.tcag.ca/dgv/app/home) and DECIPHER (DatabasE of genomiC varIation and Phenotype in Humans using Ensembl Resources, https://decipher.sanger.ac.uk/) are both collections of CNVs. DGV includes samples of reportedly healthy individuals, at the time of ascertainment, from up to 72 different studies.

For DECIPHER, MARRVEL v1.2 displays population copy number variants (DECIPHER_CONTROL). In future updates, MARRVEL will display patient-derived data from DECIPHER (DECIPHER_DISEASE). For the time being, we recommend that users of MARRVEL v1.2 visit the original DECIPHER website to access potentially pathogenic CNV information.

Case example:

By clicking on the “Loss” column twice, control individuals with loss of DNMBP will appear on the top of the list (Figure 5). Only two entries have more than one allele with the loss of DNMBP. Users can select the respective references to read about the original study. In this case, even the studies with three alleles of DNMBP deletion have only a 0.00148075 allele frequency. This indicates that loss of DNMBP may cause disease.

Searching for another variant, IRF2BPL:p.Pro372Arg, DECIPHER’s control population (Common Copy Number Variants) does contain one individual with a duplication of that location (Figure 5).
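The DNMBP deletion frequency quoted in this case example can be sanity-checked with simple arithmetic. The sketch below assumes that the displayed frequency is the observed count divided by the study sample size; the source does not state how the denominator is defined, so the implied sample size is only what that assumption yields, and both function names are hypothetical.

```python
def cnv_frequency(observed, sample_size):
    """Frequency of a CNV, assuming frequency = observed count / sample size."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return observed / sample_size


def implied_sample_size(observed, frequency):
    """Back-calculate the denominator implied by a reported frequency."""
    if frequency <= 0:
        raise ValueError("frequency must be positive")
    return round(observed / frequency)


# Values from the case example: three deletion alleles at frequency 0.00148075.
denominator = implied_sample_size(3, 0.00148075)
print(denominator)
print(cnv_frequency(3, denominator))
```

Back-calculating the denominator in this way is useful mainly for judging how large the underlying study was before weighing its evidence.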

9
Refer to data from “Gene Function Table” to obtain information on gene or protein expression patterns and Gene Ontology (GO) terms associated with the gene of interest in human (Carithers & Moore, 2015; The Gene Ontology, 2019; Thul & Lindskog, 2018), rat (Shimoyama et al., 2015), mouse (Law & Shaw, 2018), zebrafish (Ruzicka et al., 2018), fruit fly (Thurmond et al., 2019), nematode worm (Grove et al., 2018), budding yeast (Cherry, 2015) and fission yeast (Lock, Rutherford, Harris, & Wood, 2018).
1. Gene name: Each gene name is hyperlinked to gene pages on respective model organism databases which provide further phenotypic information and the resources available for each model organism.
2. PubMed link: Click on the PubMed link for a list of publications that relates to the gene of interest in each organism.
3. DRSC Integrative Ortholog Prediction Tool (DIOPT) score (Hu et al., 2011): Check this column for a score of the number of ortholog prediction algorithms that predict the specific model organism gene as an ortholog of the human gene of interest. A DIOPT score of 3 or above can be used as a reasonable cut-off to identify solid ortholog candidates. However, there are cases where genuine orthologs only have a DIOPT score of 1 due to limited homology. Users should keep in mind that one human gene may correspond to multiple model organism genes and vice versa. At the top of the “Gene Function Table”, un-check the “Show only best DIOPT score gene” box to display all potential candidates.
4. Expression: Locate this column for a list of tissues where the gene or protein of interest has been reported to be expressed.
5. GO terms: All GO terms are filtered by “experimental evidence codes.”
6. Other links such as Monarch Initiative and IMPC:
  1. The “Monarch Initiative” (Mungall et al., 2017) hyperlink brings the user to the Phenogrid page for the specific human gene, a chart that provides phenotypic comparisons between human conditions linked to the gene and model organism genes (not necessarily the orthologs) that have similar conditions. For more information on Monarch Phenogrid, please visit https://monarchinitiative.org/.
  2. If a mouse gene has a knockout made and phenotypically characterized by the International Mouse Phenotyping Consortium (IMPC) (Munoz-Fuentes et al., 2018), the “IMPC” links to the page that details the phenotype of the knockout mouse and its availability from public stock centers. For more information on IMPC, please visit http://www.mousephenotype.org/.
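The DIOPT guidance in item 3 (keep the best-scoring gene per organism and treat a score of 3 or above as a reasonable cut-off) can be sketched as follows. The record keys, gene labels, and function name are hypothetical and chosen for this illustration; they do not reflect the DIOPT or MARRVEL data format.

```python
def best_orthologs(candidates, min_score=3):
    """Pick the highest-scoring candidate per species, then apply a cut-off.

    candidates: list of dicts with the hypothetical keys 'species', 'gene',
    and 'diopt_score'. Set min_score=1 to keep low-homology candidates,
    since genuine orthologs can have a DIOPT score as low as 1.
    """
    best = {}
    for candidate in candidates:
        current = best.get(candidate["species"])
        if current is None or candidate["diopt_score"] > current["diopt_score"]:
            best[candidate["species"]] = candidate
    return {
        species: candidate
        for species, candidate in best.items()
        if candidate["diopt_score"] >= min_score
    }


# Placeholder candidates for illustration only.
candidates = [
    {"species": "fly", "gene": "gene_f1", "diopt_score": 9},
    {"species": "fly", "gene": "gene_f2", "diopt_score": 2},
    {"species": "worm", "gene": "gene_w1", "diopt_score": 1},
]
for species, hit in best_orthologs(candidates).items():
    print(species, hit["gene"], hit["diopt_score"])
```

Lowering `min_score` mirrors un-checking the “Show only best DIOPT score gene” box in spirit: it trades specificity for the chance of recovering orthologs with limited homology.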

Case example:

In the Gene Function Table, homologs of DNMBP in model organisms are listed (Figure 6). Notably, very little is known about this gene in humans and mice, but in zebrafish, fruit flies, worms, and yeast, there are several molecular functions and biological processes that DNMBP is reported to play a role in. This provides users with a starting point if they would like to pursue animal models for further study.

In addition, users can check where the gene is expressed by checking the Expression column and clicking the buttons in the column. For example, GTEx data will be displayed after the user clicks on the “Show all/GTEx” button. Then, data from the Protein Atlas and GTEx can be browsed. GTEx shows that this gene, DNMBP, is broadly expressed in all human tissues (Figure 6).

10
Locate the “Human Protein Domains” table to identify annotated domains in the gene of interest.

Refer to the “human gene protein domains” box to obtain predicted protein domains of the human gene from Ensembl (Letunic & Bork, 2018) and Uniprot (Mitchell et al., 2019).

Case example:

From the “Human Gene Protein Domains” box, DNMBP has four Src homology 3 domains of Dynamin Binding Protein and a RhoGEF domain (Figure 7). These annotations provide clues to the molecular function of the protein and help to answer the question: do the variants of interest fall within protein domains? In this case, DNMBP:p.Arg271* is within the fourth Src homology domain, whereas DNMBP:p.Cys1413Trp is not within any protein domain.
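Asking whether a variant falls within an annotated domain amounts to an interval lookup. The sketch below shows that lookup on toy data; the domain names and coordinates are placeholders, not the DNMBP annotations, and the function name is hypothetical. Real coordinates would be read from the “Human Gene Protein Domains” box.

```python
def domains_containing(position, domains):
    """Return the names of all domains whose inclusive range covers position.

    domains: iterable of (name, start, end) tuples using 1-based amino acid
    coordinates. Overlapping annotations are all reported.
    """
    return [name for name, start, end in domains if start <= position <= end]


# Toy annotations for illustration only.
toy_domains = [
    ("domain_A", 10, 60),
    ("domain_B", 55, 120),
    ("domain_C", 300, 350),
]
print(domains_containing(58, toy_domains))
print(domains_containing(200, toy_domains))
```

An empty result corresponds to the situation described for DNMBP:p.Cys1413Trp, where the residue lies outside every annotated domain.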

11
Refer to the “Multiple Protein Alignment” box to assess the conservation level of your variant of interest (Figure 3).

The amino acid multiple alignment is generated by DIOPT (Hu et al., 2011) using the MAFFT FFT-NS-2 (v7.305b) aligner (Nakamura, Yamada, Tomii, & Katoh, 2018) which includes human (hs), rat (rn), mouse (mm), zebrafish (dr), fruit fly (dm), worm (ce), and yeasts (sc and sp). To highlight the amino acid of interest, enter the amino acid numbers and select the organism(s) of interest in this box.

Note that the alignment generally uses the longest transcript of each gene (usually but not always the canonical transcript), which is not necessarily the transcript of interest.

Case example:

By searching for amino acid 271 in the multiple alignment, we see that DNMBP:p.Arg271 is conserved from humans to rat, zebrafish, and worm but not in mice, fruit flies, or yeast (Figure 7). This helps the user decide which model organisms may be useful for further study of this variant.

In comparison, DNMBP:p.Cys1413 is conserved from humans to mouse, rat, and fly but not other model organisms.
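Checking conservation at a given residue requires mapping the amino acid number in the human protein to a column of the gapped alignment and then comparing the other sequences at that column. The sketch below demonstrates this on a toy alignment; the sequences are invented, the function names are hypothetical, and only the species abbreviations (hs, mm, dr) follow the labels used in the text.

```python
def alignment_column(aligned_sequence, residue_number):
    """Map a 1-based residue number to a 0-based column of a gapped sequence."""
    seen = 0
    for column, character in enumerate(aligned_sequence):
        if character != "-":
            seen += 1
            if seen == residue_number:
                return column
    raise ValueError(f"sequence has fewer than {residue_number} residues")


def conserved_species(alignment, reference, residue_number):
    """List species whose aligned residue matches the reference residue."""
    column = alignment_column(alignment[reference], residue_number)
    reference_residue = alignment[reference][column]
    return [
        species
        for species, sequence in alignment.items()
        if species != reference and sequence[column] == reference_residue
    ]


# Toy alignment for illustration only; all rows have equal length.
toy_alignment = {
    "hs": "MK-RLA",
    "mm": "MKTKLA",
    "dr": "M--RLA",
}
print(conserved_species(toy_alignment, "hs", 3))
```

Exact identity is the strictest possible criterion; a conservative substitution would be reported as not conserved by this sketch, so the result should be read alongside the alignment itself.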

ALTERNATE PROTOCOL 1 (optional)