American Journal of Human Genetics. 2019 Oct 24;105(5):974–986. doi: 10.1016/j.ajhg.2019.09.027

A Genocentric Approach to Discovery of Mendelian Disorders

Adam W Hansen 1,2, Mullai Murugan 2, He Li 2, Michael M Khayat 1,2, Liwen Wang 2, Jill Rosenfeld 1, B Kim Andrews 2, Shalini N Jhangiani 2, Zeynep H Coban Akdemir 1, Fritz J Sedlazeck 2, Allison E Ashley-Koch 3,4, Pengfei Liu 1, Donna M Muzny 1,2; Task Force for Neonatal Genomics, Erica E Davis 5,6, Nicholas Katsanis 5,6, Aniko Sabo 1,2, Jennifer E Posey 1, Yaping Yang 1, Michael F Wangler 1, Christine M Eng 1, V Reid Sutton 1,7, James R Lupski 1,2,7,8, Eric Boerwinkle 2,9, Richard A Gibbs 1,2
PMCID: PMC6849092  PMID: 31668702

Abstract

The advent of inexpensive, clinical exome sequencing (ES) has led to the accumulation of genetic data from thousands of samples from individuals affected with a wide range of diseases, but for whom the underlying genetic and molecular etiology of their clinical phenotype remains unknown. In many cases, detailed phenotypes are unavailable or poorly recorded and there is little family history to guide study. To accelerate discovery, we integrated ES data from 18,696 individuals referred for suspected Mendelian disease, together with relatives, in an Apache Hadoop data lake (Hadoop Architecture Lake of Exomes [HARLEE]) and implemented a genocentric analysis that rapidly identified 154 genes harboring variants suspected to cause Mendelian disorders. The approach did not rely on case-specific phenotypic classifications but was driven by optimization of gene- and variant-level filter parameters utilizing historical Mendelian disease-gene association discovery data. Variants in 19 of the 154 candidate genes were subsequently reported as causative of a Mendelian trait and additional data support the association of all other candidate genes with disease endpoints.

Keywords: genotype-first, whole-exome sequencing, clan genomics, Mendelian disease, big data, Hadoop, data lake, developmental disorder, HARLEE, ultra-rare

Introduction

The foundation of Mendelian disease research is the observation of a direct association between variant alleles affecting the expression of the same gene or perturbing the biological function of its encoded protein product and defined clinical phenotypes in a large enough sample set to satisfy predetermined statistical thresholds.1, 2, 3, 4 For example, cosegregation of specific alleles at a locus with phenotypes in multiple families, or repeated independent occurrences of de novo heterozygous (or hemizygous) variants in the same genes, consistent with autosomal-dominant (AD) and X-linked (XL) disease traits, can be the basis of proof establishing association between a Mendelian disorder and a gene. Moreover, bi-allelic pathogenic variants at a locus inherited in trans from carrier parents can support an autosomal-recessive (AR) disease trait model. In each case allele segregation with phenotypes can be considered alongside the biological role of the indicated gene/protein together with any other in silico prediction or empirical functional data. Although a precise algorithm for “Mendelian causation” has proven elusive, these study components have supported thousands of Mendelian disease-gene associations that have survived the test of time by independent replication.5

Mendelian studies often begin with selection of phenotypically homogeneous sets of individuals, followed by systematic genotyping and analysis. This “phenocentric” paradigm, as exemplified by clinical phenotype data aggregated in OMIM6 (see Web Resources), has contributed most of our current understanding of the genetic basis of human disease. However, its sensitivity is limited by incomplete penetrance, variable expressivity, pleiotropy, locus heterogeneity, ubiquity and non-specificity of certain phenotypic traits, and “granularity” of the semantics of clinical phenotypic descriptions. It generally assumes that enriching a group of individuals for phenotypic homogeneity will also enrich for genetic homogeneity, an assumption that is not always reflected by supporting data.4, 7, 8, 9, 10

The availability of next-generation sequencing (NGS) methods has accelerated the accumulation of gene sequence data from families suspected to harbor Mendelian conditions,1, 3, 5, 11, 12 while at the same time, there have been few advances in high-throughput methods for the study of variant allele function in model systems.13 Hence, there is an increased utilization of genomic DNA sequencing (ES and whole-genome sequencing [WGS]) of affected and healthy human research subjects and associated clinical samples as a driver for Mendelian disease discovery. This DNA sequence-driven, “genocentric” paradigm has been accelerated by the advent of generalizable in silico tools and datasets, such as effective likelihood-based statistical methods to predict potential deleteriousness of missense alleles to protein function14, 15, 16 and nonsense/frameshift alleles to mRNA stability17 and aggregate databases that report the observed number and class (e.g., missense or loss-of-function) of variant alleles in large reference datasets (e.g., ExAC, gnomAD, CHARGE, 1000 Genomes).18, 19, 20 Scores derived to reflect the population frequencies of the variant classes observed in such databases have proven of great value in Mendelian disease research because, in general, mutations in genes that exhibit less variability in the population are more likely to result in pathogenic effects. This is the basic tenet of the Clan Genomics hypothesis21 and the rare variant family-based genomics approach. Furthermore, recent efforts have applied deep learning to primate-human comparative genomic data to predict variant pathogenicity.22

To date, genocentric approaches utilizing these scores have focused on the relatively straightforward interpretation of homozygous predicted loss-of-function (LoF) variation23 or de novo mutation.7, 24 Less effort has been applied to the more challenging exploration of missense variants. Missense variants are challenging because of their abundance, especially when there are no samples available from related individuals to allow exploration of the family-based genomics approach and testing of patterns of segregation (AD, AR, XL) of disease phenotypic traits.

Accumulation of large DNA sequence datasets has been matched by the emergence of sophisticated distributed computing methods that offer high levels of capacity combined with the ability to manipulate large, unstructured datasets. These methods can be deployed locally, or on the cloud, with high dexterity. Apache Hadoop is one such tool, which is becoming increasingly popular in NGS pipeline analysis,25, 26, 27, 28, 29, 30, 31, 32, 33 but has not yet been applied, to our knowledge, to the task of discovering Mendelian disorders and their associated genes. Together with the large amount of sequence data from case subjects and families with suspected Mendelian disorders, these developments provide great opportunity for discovery.

We have accumulated ES data from 18,696 individuals from both gene discovery-focused Mendelian disease research efforts and clinical molecular diagnostic programs at Baylor College of Medicine (see Material and Methods), including 14,755 probands (approximately 30% solved), each with a suspected Mendelian condition, and 3,941 control subjects (either unaffected or part of a Wolff-Parkinson-White syndrome cohort) or family members (affected status unknown). These heterogeneous phenotype and genomics data sample sets had varying amounts of clinical and phenotypic annotation, different representation of information and availability of DNA samples from relatives, and variable consent for use as research subjects or as clinical case subjects where the aim for further analysis was to improve the diagnostic yield (see Material and Methods).2

To enable a genocentric analysis, we recorded the variant data from these samples in a single, HIPAA-compliant, secure Hadoop-structured environment, together with appropriate public datasets and computational predictions of variant impact. Data access permissions were carefully managed so as not to inappropriately reveal data from single samples that would compromise privacy agreements. The study revealed an efficient, genocentric pathway to Mendelian discovery and illustrated the power of tools such as Hadoop to enable consolidation of heterogeneously structured genetic data in a single, secure interrogatory environment. Through empirical optimization of search parameters, we identified 154 candidate Mendelian disease-gene associations, 19 of which were reported to OMIM as causative in the months following our initial analysis and discovery. The remaining 135 candidate disease-gene associations are supported by ACMG sequence variant interpretation guidelines, including population and computational predictive data, functional data, and in some instances, de novo inheritance.14, 34

Material and Methods

Samples

Samples were obtained through long-standing research and clinical collaborations. Research samples were collected after written informed consent in conjunction with either the Baylor Hopkins Center for Mendelian Genomics (BHCMG) (H-29697) study, with approval by the institutional review board at Baylor College of Medicine, or the Task Force for Neonatal Genomics study, with approval by the institutional review board at Duke University. Data were also obtained from the Baylor College of Medicine clinical testing laboratories, now incorporated as the Baylor Genetics Laboratories (BG). These data were studied in aggregate for the purpose of improving the diagnostic yield (protocol H-41191). All genomic studies were performed on DNA extracted from blood or saliva samples. Principal-component analysis (PCA) of exome data revealed no significant difference in distribution of ethnicities between case and control samples (Figure S1). Self-reported ethnicity was not tracked.

DNA Sequencing

DNA capture and sequencing of exomes was carried out as previously described by Yang et al.1 at either the Baylor Genetics (BG) laboratories or at the Baylor College of Medicine Human Genome Sequencing Center (HGSC), through the Baylor-Hopkins Center for Mendelian Genomics initiative. Briefly, using 500 μg of DNA, an Illumina paired-end pre-capture library was constructed according to the manufacturer’s protocol (Illumina Multiplexing_SamplePrep_Guide_1005361_D) with modifications as described in the BCM-HGSC Illumina Barcoded Paired-End Capture Library Preparation protocol. Pre-capture libraries were pooled into 4-plex library pools and then hybridized in solution to the HGSC-designed Core capture reagent1 (52 Mb, NimbleGen) or 6-plex library pools used the custom VCRome 2.1 capture reagent1 (42 Mb, NimbleGen) according to the manufacturer’s protocol (NimbleGen SeqCap EZ Exome Library SR User’s Guide) with minor revisions. The sequencing run was performed in paired-end mode using the Illumina HiSeq 2000 platform, with sequencing-by-synthesis reactions extended for 101 cycles from each end and an additional 7 cycles for the index read. With a sequencing yield averaging 8.5 Gb, the sample achieved 93% of the targeted exome bases covered to a depth of 20× or greater. Illumina sequence analysis was performed using the HGSC Mercury analysis pipeline2, 3 (see Web Resources) which moves data through various analysis tools from the initial sequence generation on the instrument to annotated variant calls (SNPs and intra-read indels). In parallel to the exome workflow, an Illumina Infinium Human Exome v1-2 array was generated for a final quality assessment. This included orthogonal confirmation of sample identity and purity using the Error Rate In Sequencing (ERIS) pipeline developed at the BCM-HGSC. Using an “e-GenoTyping” approach, ERIS screens all sequence reads for exact matches to probe sequences defined by the variant and position of interest. 
A successfully sequenced sample must meet quality-control metrics of ERIS SNP array concordance (>90%) and ERIS average contamination rate (<5%).

Phenotyping

BG

Unstructured phenotypic data are available for all BG samples. Most of these free-text clinical summaries were based on clinical notes and the test requisition and were written by clinical scientists, fellows, and laboratory directors. Test requisitions have evolved over time, but typically consisted of a checklist of symptoms with the ability to write in additional details, and may have been filled out by MDs, genetic counselors, or nurses. Structured phenotypic data are also available for 9,434 samples, with a mean of 7.76 distinct phenotypic descriptors entered per sample; these data were generated by Codified Genomics (see Web Resources) or other tools, typically mapping test requisition symptoms directly to HPO terms, and subsequently reviewed by clinical scientists, fellows, or laboratory directors.

BHCMG

The BHCMG has developed PhenoDB,4 a freely available web-based portal for entry of phenotypic and clinical information. The 3K features use the preferred term from the Elements of Morphology and are mapped to the Human Phenotype Ontology. A submitter can enter data by family or cohort, including phenotypic features, diagnosis, mode of inheritance, and clinical history, and can upload previous genetic testing results. PhenoDB has several modules for data storage and analysis, as well as GeneMatcher, a tool used to link investigators sharing the same gene of interest.5

Computer Infrastructure

HARLEE is a 10-node 280TB Cloudera Hadoop cluster. VCF files for all samples were first annotated with VEP—flagging a most-important transcript per-gene per-variant—then subsequently ingested into HARLEE, together with annotation tables from a variety of sources (Supplemental Material and Methods). Data access is strictly controlled with robust encryption and authentication layers, creating an environment ready to comply with FISMA, HIPAA, Texas Medical Records Privacy Act, and other industry regulations.

Genocentric Query Approach

We performed a series of Impala queries in HARLEE, where each query results in a gene list. Scripts were written and executed with R to handle automation of querying and subsequent statistical analysis and visualization. The commonality across all queries is a search for genes harboring ultra-rare variants—with additional quality control filtering—across at least five case subjects (n = 14,755), absent from all control samples (n = 3,941). With the intent of minimizing false-positive candidate gene volume, controls were broadly defined to include parental samples (n = 3,587) in addition to internal healthy control subjects (n = 42) and a Wolff-Parkinson-White syndrome (WPW) cohort (n = 319).
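The gene-level logic shared by all queries can be sketched in a few lines of Python. This is an illustrative sketch, not the Impala SQL used in the study: the `(sample_id, gene, variant_key)` tuple layout and the function name are hypothetical, and it assumes that "absent from all control samples" applies to each variant rather than to the gene as a whole.

```python
from collections import defaultdict

MIN_CASES = 5  # from the text: variants across at least five case subjects


def candidate_genes(variant_calls, control_ids, min_cases=MIN_CASES):
    """Return genes with qualifying variants in >= min_cases distinct cases.

    variant_calls: list of (sample_id, gene, variant_key) tuples that have
    already passed the rarity and quality filters (hypothetical layout).
    control_ids: sample IDs treated as controls (parents, healthy, WPW).
    """
    control_ids = set(control_ids)
    # Assumption: a variant seen in any control is discarded everywhere.
    seen_in_controls = {v for s, _, v in variant_calls if s in control_ids}
    cases_per_gene = defaultdict(set)
    for sample, gene, variant in variant_calls:
        if sample not in control_ids and variant not in seen_in_controls:
            cases_per_gene[gene].add(sample)
    return sorted(g for g, cases in cases_per_gene.items()
                  if len(cases) >= min_cases)
```

Counting distinct samples rather than variant rows keeps a single individual with several variants in one gene from satisfying the five-case requirement alone.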

Specifically, all queries shared the following filter parameters: HARLEE internal allele frequency < 0.01; 1000 Genomes allele frequency < 0.001 or is null; CHARGE consortium (large-scale adult cardiovascular cohort sequenced internally) allele frequency ≤ 0.0001 or is null; gnomAD allele frequency ≤ 0.0001 or is null; variant not cited in PubMed; variant has no dbSNP ID; chromosome name does not start with “GL”; domains field, if not empty, does not start with “low_complexity”; ExAC mu_syn is not null (ExAC did not exclude this gene for constraint score analysis); variant read count ≥ 4; variant allele frequency (VAF) ≥ 0.25. Next, we categorized queries as those looking for “loss-of-function” variants (VEP impact = HIGH) versus those looking for missense variants (VEP impact = MODERATE). Finally, queries were further categorized based on one additional gene-specific or variant-specific bioinformatic score or filtering parameter: for loss-of-function variants, ExAC pLI and loss-of-function intolerance z-score; for missense variants, ExAC loss-of-function z-score and missense intolerance z-score, REVEL, MTR (missense tolerance ratio, a region-specific missense tolerance score), SIFT, and PolyPhen. For each of these scores, we implemented a high-pass (or low-pass, for SIFT and MTR) parameter sweep consisting of up to 1,000 queries, measuring the impact of score-based cutoff filtering on resulting gene list size and OMIM disease annotation over time. For each respective parameter sweep query series, the variable parameter was incremented or decremented by 0.01 across the following score ranges, holding all other filtering criteria constant: 0–1 for pLI, REVEL, SIFT, and PolyPhen; 0–10 for loss-of-function and missense intolerance z-scores; and 0–1.6 for MTR.
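As an illustration, the shared variant-level filters listed above can be written as a single predicate. The dictionary keys below are hypothetical stand-ins for the annotation columns, not the actual HARLEE schema; a missing (null) frequency is represented as `None`.

```python
def _null_or_at_most(value, limit):
    """True when a frequency is missing (None) or <= limit."""
    return value is None or value <= limit


def passes_shared_filters(v):
    """Apply the filters common to all queries to one variant record.

    v is a dict whose keys are hypothetical names for the annotations
    described in the text.
    """
    return (
        v["harlee_af"] < 0.01
        and (v["kg_af"] is None or v["kg_af"] < 0.001)   # 1000 Genomes
        and _null_or_at_most(v["charge_af"], 0.0001)
        and _null_or_at_most(v["gnomad_af"], 0.0001)
        and not v["in_pubmed"]
        and v["dbsnp_id"] is None
        and not v["chrom"].startswith("GL")
        and not (v["domains"] or "").startswith("low_complexity")
        and v["exac_mu_syn"] is not None   # gene has ExAC constraint scores
        and v["alt_reads"] >= 4
        and v["vaf"] >= 0.25
    )
```

The variant-consequence filter (VEP impact HIGH or MODERATE) and the swept score cutoff would be applied on top of this predicate, one query per cutoff value.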

To enable validation of our approach and parameter optimization, we annotated genes with OMIM Mendelian disease association data from freezes of OMIM taken at four time points: early 2013, late 2014, mid 2016, and early 2018. For any given set of genes and a fixed duration of time, we define “discovery” as the number of genes in the set with a disease annotation added to OMIM during the given time span.

Candidate gene lists were identified by selecting a hard-cutoff filter value for each respective variable annotation parameter. The cutoff value was selected for each parameter as the value which optimized “discovery density”—calculated as 2013–2018 discovery divided by the number of genes without OMIM annotations in 2018—with a required minimum output of 20 genes without OMIM annotations in 2018. All genes resulting from a query with a cutoff value maximizing discovery density are considered candidates.
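The cutoff selection described above can be sketched as follows. This is a simplified illustration rather than the R scripts used in the study: it assumes a high-pass sweep over a single gene-level score, represents each OMIM freeze as a set of gene symbols, and returns the genes still lacking a 2018 annotation as the candidate list. All function and parameter names are hypothetical.

```python
def best_cutoff(gene_scores, omim_2013, omim_2018, cutoffs, min_unannotated=20):
    """Pick the cutoff that maximizes 2013-2018 discovery density.

    gene_scores: {gene: score} for genes that passed the shared filters.
    omim_2013, omim_2018: sets of genes with an OMIM disease annotation.
    Returns (cutoff, density, unannotated_genes), or None if no cutoff
    leaves at least min_unannotated genes without a 2018 annotation.
    """
    best = None
    for cutoff in cutoffs:
        genes = {g for g, s in gene_scores.items() if s > cutoff}
        unannotated = genes - omim_2018
        if len(unannotated) < min_unannotated:
            continue  # gene list too small; density would be unstable
        discovered = (genes & omim_2018) - omim_2013
        density = len(discovered) / len(unannotated)
        if best is None or density > best[1]:
            best = (cutoff, density, unannotated)
    return best
```

For a low-pass score such as SIFT or MTR, the comparison `s > cutoff` would be reversed. Ties are resolved here in favor of the first cutoff encountered, which is an arbitrary choice.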

Discovery Density Simulations

To demonstrate the sensitivity of the optimum discovery density metric to input OMIM annotation data—which changes over time as associations between genes and Mendelian phenotypes are published and eventually curated by OMIM—discovery density for all five possible nonoverlapping time intervals was plotted for all queries. These data points were supplemented by a distribution of 1,000 instances of removing 20% of all disease-gene annotations from OMIM 2013, calculating discovery as the number of genes within a given query without an OMIM annotation in a given simulation, with an OMIM annotation in the real 2013 database. Discovery density for each query across this simulated interval was then plotted against 2013–2018 discovery. The average discovery density across all 1,000 simulations was also calculated, with a linear regression model fitted against average simulated discovery density versus 2013–2018 discovery density.
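One way to read the masking simulation is sketched below. It assumes that each simulation removes a uniformly random 20% of annotated genes, that a masked gene in the query counts as a "discovery", and that the denominator is the number of query genes with no annotation in the real 2013 data. The names and the fixed seed are illustrative only.

```python
import random


def simulated_discovery_densities(query_genes, omim_2013, n_sim=1000,
                                  frac_removed=0.2, seed=0):
    """Return one simulated discovery density per masking replicate."""
    rng = random.Random(seed)
    query_genes = set(query_genes)
    annotated = sorted(omim_2013)          # sorted for reproducible sampling
    n_masked = round(frac_removed * len(annotated))
    # Query genes that are unannotated even in the real 2013 database.
    remaining = len(query_genes - set(omim_2013))
    if remaining == 0:
        raise ValueError("query has no unannotated genes; density undefined")
    densities = []
    for _ in range(n_sim):
        masked = set(rng.sample(annotated, n_masked))
        # Unannotated in the simulation, annotated in the real database.
        discovered = len(query_genes & masked)
        densities.append(discovered / remaining)
    return densities
```

Averaging the returned list gives the mean simulated discovery density for a query, which can then be compared against that query's 2013–2018 value.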

Phenocentric Query Approach

We also established a methodology for rapidly conducting a large-scale phenocentric analysis for discovery of variation in genes associating with a specific phenotype, assuming a dominant inheritance model. To conduct this analysis, we first counted the number of distinct samples per gene with variants meeting specific criteria across all Mendelian exomes in HARLEE. Variant filtering criteria were as follows: single-nucleotide variant; MAF < 0.0001 (gnomAD, CHARGE); MAF < 0.001 (1000 Genomes); MAF < 0.01 (HARLEE); if REVEL score is available, REVEL score ≥ 0.25; remove chromosome names beginning with “GL”; remove variants where VEP domain annotation begins with “Low_complexity”; remove genes where ExAC does not calculate gene-level scores; VEP existing_variation annotation must be empty or null; VEP impact annotation must be “MODERATE” (indicative of missense variation) or “HIGH” (frameshift, start-loss, stop-gain, stop-loss, or canonical splice site disrupting). We then normalized the mutation rate by cohort size, and additionally by both cohort size and gene cDNA length.

We then replicated this analysis on a phenocentric sub-cohort filtered out of the overall cohort, only including samples annotated with at least one phenotypic term matching a list of provided terms. We then calculated phenotypic enrichment for each gene by dividing the normalized mutation rates by the respective normalized mutation rates from the overall cohort. Focusing on hearing disorders, phenotypic search terms included “middle ear,” “hearing,” and “deaf.” Fisher’s exact test was utilized to test for significant enrichment, and p values were false discovery rate-corrected.
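The enrichment and significance calculations can be illustrated with the standard library alone. The text does not specify the contingency table, the sidedness of the test, or the FDR procedure, so this sketch assumes a one-sided Fisher's exact test (sub-cohort carriers versus carriers in the whole cohort) and Benjamini-Hochberg correction. Function names are hypothetical.

```python
from math import comb


def enrichment(sub_hits, sub_size, all_hits, all_size):
    """Sub-cohort mutation rate divided by overall-cohort mutation rate.

    Dividing both rates by the same gene's cDNA length cancels here,
    so the length-normalized ratio is identical.
    """
    return (sub_hits / sub_size) / (all_hits / all_size)


def fisher_greater(sub_hits, sub_size, all_hits, all_size):
    """One-sided Fisher p-value: P(X >= sub_hits) under the hypergeometric
    null, drawing sub_size samples from all_size with all_hits carriers."""
    upper = min(all_hits, sub_size)
    tail = sum(comb(all_hits, k) * comb(all_size - all_hits, sub_size - k)
               for k in range(sub_hits, upper + 1))
    return tail / comb(all_size, sub_size)


def benjamini_hochberg(pvals):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):           # largest p-value first
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

In use, `fisher_greater` would be evaluated once per gene and the resulting p-values passed together to `benjamini_hochberg`.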

Results

Computational Infrastructure

The Hadoop Architecture Lake of Exomes (HARLEE) is a data lake created in a Hadoop environment (powered by Cloudera) for housing and facilitating analysis of next-generation sequencing data. This resource provides a flexible environment for simultaneously housing structured, semi-structured, unstructured and heterogeneous data; SQL-on-Hadoop solutions to perform high-speed simple queries and complex comparison queries of the data; a cost-effective solution that uses commodity hardware; the ability to scale-as-required by adding more nodes; fault tolerance achieved by storing the data in triplicate across the nodes; a secure, compliant-ready environment; and granular control of data access privilege. In the current study, anonymized sample-level data were appropriately protected by master-level password access in order to allow only qualified individuals to access specific data components. Instantiation of a more elaborate tiered access system can easily be imposed upon the current HARLEE and be applied to more outward-facing activities.

Benchmark and stress tests via multiple tools, including TeraGen, TeraSort, TestDFSIO, NNBench, and MRBench showed that the performance of the cluster during data ingestion and querying (Table S1), even with the overhead for encryption/decryption (Table S2), far surpassed the capability of programmatic approaches that provide the same results by parsing and interpreting flat files. The architecture allowed warehousing large volumes (i.e., 30,000+ samples, 6 TB) of heterogeneous data while providing rapid sample-level access on the order of a few seconds.

HARLEE Facilitates Genocentric Mendelian Discovery

HARLEE was loaded with data from VCF files from ES samples that were annotated with transcript effect information (VEP) (Figure 1). Utilizing HARLEE, data from all samples were further annotated with known minor allele frequencies (ExAC, gnomAD, CHARGE, 1000 Genomes), functional predictions from multiple bioinformatic algorithms (SIFT, PolyPhen, REVEL, Missense Tolerance Ratio [MTR]), clinical variant information (ClinVar), and other sources.15, 16, 18, 35

Figure 1. HARLEE Workflow

ES VCF files are first annotated with Variant Effect Predictor (VEP), where one transcript is flagged per variant per gene. Consequence, SIFT, PolyPhen, variant allele frequency from multiple sources, domain information, and other annotations are additionally ascertained by VEP. VEP output is loaded into a Hadoop architecture data lake. Finally, population-, variant- and gene-level annotations from a variety of sources are loaded, allowing for modular, on-demand annotation. After samples and annotations are separately loaded into HARLEE, a series of SQL-like queries generate distinct gene lists. Bioinformatic filtering parameters based on loaded annotations are tuned to optimize discovery density, which takes into account the volume of genes reported to OMIM as disease-associated over time normalized against the number of remaining genes without OMIM disease annotations.

To identify annotation features that would facilitate discovery of genes with associated Mendelian disorders, a series of empirical tests were performed on the accrued data, structured on genotypic parameters, without regard for the underlying phenotypic or disease trait inheritance/segregation patterns. As lists of genes associated with known Mendelian phenotypes curated by OMIM were available from different years (2013–2018), a comparison of the yields of genes discovered at different time points was informative. Throughout, the ratio of known disease-associated genes to all genes identified by different parameters and cutoffs was used to optimize those parameters and to maximize the likelihood of enrichment for genes associated with undiscovered Mendelian disorders.

This approach distinguished groups of genes with varying levels of enrichment for known Mendelian phenotypic associations. Figure 2 illustrates testing of a single variable, titrating scores that predict intolerance of genes to missense variants (ExAC’s missense intolerance z-score). As anticipated, when the z-score increased, the total number of genes that were identified decreased, while the fraction that were already known to associate with Mendelian disease in 2016 increased (Figure 2). This reflects a trend where LoF mutations in highly constrained genes are more likely to be pathogenic.18 The majority of genes with a missense intolerance z-score higher than 8 were already reported to OMIM as disease causing in 2016, with two exceptions. In bin 8–8.5, CLTC (Clathrin, Heavy Chain; MIM: 118955) has since been reported to associate with multiple malformation and developmental delay (MIM: 617854).36 In bin 8.5–9, POLR2A (RNA Polymerase II, subunit A; MIM: 180660) was recently reported to associate with a neurodevelopmental syndrome with infantile-onset hypotonia.37 Hence, this straightforward threshold of a high missense z-score (>8.5) provides high enrichment for genes known to associate with Mendelian disorders, but yields few discoveries.

Figure 2. Missense Intolerance Z-Score Pilot Query Series

Query input (missense intolerance z-score range) is plotted against query output (number of resulting genes with and without OMIM disease annotations in 2016). Except for the variable missense intolerance z-score range, all queries were identical, outputting genes within a given z-score range (bin width = 0.5) where at least five case exomes harbored high-quality ultra-rare missense variants (absent from controls). One bin with an intermediate constraint score range (red) had a lower-than-expected proportion of disease-associated genes; 2/9 of these genes were reported as associating to an OMIM phenotype between 2016 and 2018. Outlier genes with extreme constraint scores not known to associate with Mendelian disorders in 2016 (yellow) were all recently reported as disease associated: CLTC, POLR2A, and TRRAP.
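The binning behind a plot of this kind can be sketched as follows. The gene names and scores in any usage are toy values, not study data, and the half-open bin convention `[start, start + width)` is an assumption, since the text does not state how bin edges were handled.

```python
from collections import Counter
from math import floor


def bin_counts(gene_scores, omim_genes, width=0.5):
    """Count genes with and without an OMIM annotation per z-score bin.

    Returns {bin_start: (n_annotated, n_unannotated)} using half-open
    bins [bin_start, bin_start + width).
    """
    annotated, unannotated = Counter(), Counter()
    for gene, z in gene_scores.items():
        start = floor(z / width) * width
        if gene in omim_genes:
            annotated[start] += 1
        else:
            unannotated[start] += 1
    bins = sorted(set(annotated) | set(unannotated))
    return {b: (annotated[b], unannotated[b]) for b in bins}
```

Bins where the unannotated count is unexpectedly high relative to neighboring bins are the ones the text singles out as enriched for future discoveries.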

Interestingly, the change in proportion of genes known to associate with Mendelian phenotypes was not smooth toward the upper end of constraint: at a missense z-score between 6.5 and 7, a bin with a much lower disease gene enrichment than the surrounding bins was observed. Owing to ongoing efforts by OMIM to curate the literature for newly reported Mendelian disease-gene associations, when the analysis was repeated for the list of known Mendelian genes in 2018, 2/9 of those genes (DHX30 [MIM: 616423], SMC1A [MIM: 300040]) were revealed to have associated Mendelian phenotypes by 2018 (MIM: 617804 and 300590, respectively).38, 39 Thus, this empirical strategy also showed that a “parameter sweep” could identify bins containing sets of genes that were enriched at an intermediate level for known Mendelian phenotypes, as well as many strong candidates for future discovery (Figures 2 and S2).

For further analyses, we defined “discovery density” for a given set of genes as the change in number of genes with associated Mendelian phenotypes in OMIM, over time, normalized (divided) by the number of remaining genes without reported Mendelian disease trait associations. For example, if 10 genes out of a set of 20 had a reported disease association in 2013 and 18/20 were reported to associate with a Mendelian trait by 2018, the discovery density would be (18 − 10)/(20 − 18) = 4.
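Written as a formula, with notation introduced here for illustration only: let S be the gene set returned by a query and A_t the set of genes carrying an OMIM disease annotation at time t, and assume annotations are only added over time (A_{t_1} ⊆ A_{t_2}). The definition and the worked example above then read:

```latex
D(S; t_1, t_2)
  = \frac{\lvert S \cap A_{t_2} \rvert - \lvert S \cap A_{t_1} \rvert}
         {\lvert S \rvert - \lvert S \cap A_{t_2} \rvert},
\qquad
\text{e.g.}\quad \frac{18 - 10}{20 - 18} = 4 .
```

The denominator shrinks as a gene set becomes saturated with known disease genes, which is why very small or nearly fully annotated sets can produce inflated values.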

The suitability of the use of discovery density to optimize filter parameters for discovery of disease-associated genes was separately tested to ensure that the correlations were robust. Overall, we found that use of this measure based upon almost any combination of available data from different years of OMIM was effective, provided years with low absolute discovery rates were avoided (Figures S4–S7).

Subsequent analyses aimed to identify sets of candidate “Mendelian disease-associated genes”—or genes which can be disrupted by variants pathogenic for Mendelian phenotypes—for each annotation score by identifying parameters that yielded the highest discovery density. To reduce the impact of discovery density outlier values inflated by small gene set sizes, candidate disease-associated gene sets were constrained to a minimum size of 20. In total, eight parameter-sweep query series were performed: two testing putative loss-of-function variants (via pLI and loss-of-function [LoF] intolerance z-score) and six testing missense variants (via LoF intolerance z-score, missense intolerance z-score, REVEL, MTR, SIFT, and PolyPhen2).

Apart from variant consequence and the variable parameters cited above, the tests to identify groups of genes rich in undiscovered Mendelian disease-associated genes maintained constant filter parameters across all queries. We limited analyses to include high-quality, ultra-rare (MAF < 1/10,000) variants. Genes were limited to those harboring such variants across at least five different ES entries in HARLEE, substantiating a minimum potential cohort of individuals for each proposed candidate disease-associated gene. No specific filters to remove particular classes of genotype (i.e., heterozygous or homozygous) were included; however, the stringent allele frequency filter greatly enriches for heterozygous variants (often de novo mutations), and thus the subset of true pathogenic variants identified are mostly expected to be dominant-acting alleles for an AD disease trait. Stringent allele frequency filtering appropriately provided a bias toward specificity, rather than sensitivity, as minimizing false positives was an important goal. All variants detected in research samples (see Material and Methods) within candidate disease-associated genes are publicly available (Table S3).

Predicted Loss-of-Function Variants

HARLEE facilitated the identification of 33 candidate disease-associated genes from two distinct loss-of-function variant annotation parameter sweep queries. First, from the loss-of-function intolerance z-score query series, an optimum cutoff value of 7.37 yielded a discovery density of 0.65, while the gene set size was constrained at a minimum of 20 (Table 1). Higher cutoff values could yield a higher discovery density value, but only with a small gene set. Second, the pLI parameter sweep query series identified 29 candidates, including 9 with a pLI score of >0.9999999 that almost certainly constitute Mendelian disease-associating genes.20 The optimum cutoff value of 0.998 (calculated excluding genes where pLI = 1) yielded an optimum discovery density of 0.6.

Table 1. Summary of Mendelian Discovery Analysis

| Variant Category | Parameter Name | Parameter Category | Optimum Value | Discovery Density | Gene List Size | Accepted as Candidates |
|---|---|---|---|---|---|---|
| LoF | pLI | gene-level | >0.998 | 0.6 | 29 | true |
| LoF | LoF z-score | gene-level | >7.37 | 0.65 | 20 | true |
| Missense | LoF z-score | gene-level | >9.28 | 0.55 | 20 | true |
| Missense | Mis z-score | gene-level | >6.23 | 0.65 | 20 | true |
| Missense | MTR | variant-level | <0.42 | 0.372 | 43 | true |
| Missense | PolyPhen | variant-level | ≥1 | 0.109 | 247 | false |
| Missense | REVEL | variant-level | >0.91 | 0.2 | 60 | true |
| Missense | SIFT | variant-level | ≤0 | 0.086 | 4,837 | false |

The cutoff value for each query series was selected to optimize discovery density (with a minimum constrained candidate disease-associating gene list size of 20). Results from PolyPhen and SIFT were not considered candidate disease-associating genes as the scores saturated (a maximum level of constraint yielded a maximum cutoff value) with a relatively large remaining gene list size. For the pLI analysis, genes with pLI = 1.0 were excluded from the discovery density optimization calculations, and those genes without OMIM disease annotations were automatically considered candidate disease-associated genes.

In combination, the two methods yielded 33 candidate disease-associated genes with 16 identified by both (Figure 3). Among candidates from these sets, CHD3 (MIM: 602120), DOCK3 (MIM: 603123), KIAA1109 (MIM: 611565), MYO9A (MIM: 604875), and VPS13D (MIM: 608877) have since been reported to have associated Mendelian phenotypes in OMIM (MIM: 618205, 618292, 617822, 618198, and 607317, respectively), providing evidence in support of our approach.40, 41, 42, 43, 44, 45

Figure 3.

Summary of Candidate Disease-Associated Genes by Category

154 genes flagged as candidate Mendelian disease-associated genes, grouped by constraint-metric query series. Shown are (A and B) variant annotation parameter sweep candidate gene list overlaps: loss-of-function (A) and missense (B); (C) high-priority candidates at the intersection of loss-of-function and missense variant parameter sweep candidate genes.

Missense Variants

HARLEE identified 130 candidate disease-associated genes from six distinct missense variant annotation parameter sweeps. From the missense variant loss-of-function intolerance z-score query series, a cutoff value of 9.28 yielded a discovery density of 0.55 with 20 candidate disease-associated genes. From the missense intolerance z-score query series, a cutoff value of 6.23 yielded a discovery density of 0.65, again with a minimal candidate list size of 20 genes. From the MTR query series, an optimum cutoff value of MTR less than 0.42 yielded a discovery density of 0.372 with 43 candidate genes. From the REVEL query series, an optimum cutoff value of REVEL greater than 0.91 yielded a discovery density of 0.2 with 60 candidate genes.

The query series based upon PolyPhen and SIFT produced results that contrasted with those of the four other methods described above. The PolyPhen query series yielded a discovery density of just 0.109 and an overly large set of 247 candidate genes. Likewise, the SIFT analysis yielded a discovery density of 0.086 and an unreasonably large candidate gene set of 4,837 genes. For both series, the maximum discovery density value occurred when setting the cutoff value to the highest level of constraint for the respective scores (0 for SIFT, 1 for PolyPhen). Because of the excessively large and intractable gene lists resulting from the PolyPhen and SIFT analyses, combined with their lower discovery density values compared to the six other query series, we did not further utilize these metrics.

The final set of 130 candidate disease-associated genes from our missense variant query series were therefore the union of the gene sets resulting from the loss-of-function intolerance z-score, missense intolerance z-score, MTR, and REVEL parameter sweeps. This included three genes identified by both loss-of-function and missense intolerance z-scores, three genes identified by both loss-of-function z-score and MTR, two genes identified by both loss-of-function z-score and REVEL, five genes identified by both missense intolerance z-score and MTR, and one gene identified by both missense intolerance z-score and REVEL (Figure 3). Of note, TRRAP (loss-of-function z-score, missense z-score, and MTR) (MIM: 603015) and CACNA1I (missense z-score, MTR, and REVEL) (MIM: 608230) were each identified by three scores. Subsequently, TRRAP was reported as a Mendelian disease-associated gene by our collaborators at BHCMG, independent of this analysis (MIM: 618454).46 Furthermore, CACNA1A (MIM: 601011) and CACNA1E (MIM: 601013) have both been reported to be associated with neurodevelopmental disorders (MIM: 617106, 108500, 141500, 183086 for CACNA1A; MIM: 618285 for CACNA1E), establishing a relationship between voltage-dependent calcium channel dysfunction and neurodevelopmental disease, serving as evidence in support of CACNA1I as a candidate Mendelian disease-associated gene.47, 48, 49

Combined Set of Candidate Disease-Associating Genes

In total, we identified 154 distinct candidate disease-associated genes between the loss-of-function and missense variant analyses. On average, these candidates (mean length = 5,147 bp; median length = 3,525 bp) are longer than the average coding gene (mean = 1,649 bp; median = 1,227 bp). This is a shared property of all known Mendelian disease-associated genes and does not reflect a systematic bias that would inherently increase false positives; the set of all coding genes with OMIM disease annotations in 2018 (mean = 2,213 bp; median = 1,554 bp) is also significantly longer than the set of all genes (unpaired t test, p < 0.0001).

Multiple lines of qualitative and quantitative evidence support the merits of HARLEE disease-gene association discovery. First, comparing these genes against the current set of OMIM annotations at the time of preparing this manuscript (May 2019) revealed 19 candidates have since been reported to associate with Mendelian phenotypes: ATP1A1 (MIM: 182310),50, 51 CACNA1E,49, 52 CHD3,40 CLTC (MIM: 118955),36 DOCK3,41 FBXO11 (MIM: 607871),53, 54 IRF2BPL (MIM: 611720),55 KDM5B (MIM: 605393),56 KIAA1109,42 LINGO1 (MIM: 609791),57 MACF1 (MIM: 608271),58 MAST1 (MIM: 612256),59 MYO9A,43 PDE1C (MIM: 602987),60 SCN3A (MIM: 182391),61 SET (MIM: 600960),62 TBX2 (MIM: 600747),63 TCF20 (MIM: 603107),64 and VPS13D.44, 45 Permutation analysis sampling 154 random genes without OMIM annotations (as of January 2018) revealed this to be a highly significant enrichment of recently reported disease-associated genes (expected = 2.29; p < 0.00001; n = 100,000 permutations). Furthermore, 9 of the 33 loss-of-function candidate genes intersected with the 130 missense candidates: CHD3, CSMD3 (MIM: 608399), KIAA1109, LRP1B (MIM: 608766), MDN1 (MIM: 618200), MYCBP2 (MIM: 610392), MYO9A, RYR3 (MIM: 180903), and VPS13D. Notably, four of these have since had associated Mendelian phenotypes reported in OMIM: CHD3 (MIM: 618205),40 KIAA1109 (MIM: 617822),42 MYO9A (MIM: 618198),43 and VPS13D (MIM: 607317).44, 45 In addition, although not yet reported in OMIM, RYR3 was recently reported to associate with arthrogryposis.65
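A permutation analysis of the kind reported above can be sketched as follows. This is a generic illustration rather than the authors' implementation: the function name, the toy gene universe, and the add-one p-value estimator are assumptions of the sketch.

```python
import random


def permutation_enrichment(candidates, background, positives, n_perm=10_000, seed=0):
    """Empirical enrichment of `positives` among `candidates`.

    Draws random gene sets of the same size as `candidates` from `background`
    and counts how often they contain at least as many `positives`.
    Returns (observed, expected, p_value).
    """
    rng = random.Random(seed)
    background = sorted(background)  # fixed order keeps runs reproducible
    observed = len(set(candidates) & positives)
    k = len(candidates)
    hits_total = 0
    at_least = 0
    for _ in range(n_perm):
        hits = sum(g in positives for g in rng.sample(background, k))
        hits_total += hits
        at_least += hits >= observed
    expected = hits_total / n_perm
    # Add-one estimator (an assumption of this sketch) avoids reporting p = 0.
    p_value = (at_least + 1) / (n_perm + 1)
    return observed, expected, p_value


# Toy universe: 1,000 genes, 20 of which were "recently reported".
background = [f"g{i}" for i in range(1000)]
positives = set(background[:20])
candidates = background[:10] + background[500:540]  # 50 genes, 10 positives
observed, expected, p_value = permutation_enrichment(
    candidates, background, positives, n_perm=2000
)
print(observed, round(expected, 2), p_value)
```

The background passed to the function should be the same universe from which candidates could have been drawn (in the text, genes without OMIM annotations), otherwise the expected count is not comparable with the observed one.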

Similarly, manual curation of this gene set, in which we searched the OMIM website for reported phenotypic associations, revealed two additional genes with associated phenotypes reported in OMIM: CTD-307407.11 and SMO. These genes were not initially recognized by our pipeline as having Mendelian disease associations in OMIM because of a discrepancy in gene symbol: OMIM uses the symbols BBS1 (MIM: 209901) and SMOH (MIM: 601500), respectively, for these genes. Thus, we report a total of 133 candidate Mendelian disease-associated genes without a Mendelian phenotype yet reported in OMIM (Table S4).
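The gene counts reported in this section follow from simple set arithmetic, written out here as a consistency check using only the numbers stated in the text (LoF and Mis denote the loss-of-function and missense candidate sets):

```latex
\begin{align*}
|\mathrm{LoF} \cup \mathrm{Mis}| &= |\mathrm{LoF}| + |\mathrm{Mis}| - |\mathrm{LoF} \cap \mathrm{Mis}| = 33 + 130 - 9 = 154 \\
154 - 19 &= 135 \quad \text{(candidates without an OMIM phenotype by automated matching)} \\
135 - 2 &= 133 \quad \text{(after manual curation of the two gene-symbol discrepancies)}
\end{align*}
```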

Next, we compared these remaining candidate disease-associating genes without OMIM annotations with an alternate set of reported disease-gene associations (UniProt),66 demonstrating enrichment of UniProt disease associations for genes in this set. Out of 20,382 genes captured in the ES design of the de novo enrichment analysis described in this manuscript, 16,568 genes did not have a disease association in OMIM as of May 2019. Of these 16,568 genes, 240 (1.45%) had a disease association in UniProt as of February 2019. When all UniProt gene symbols (standard and non-standard) were included in the comparison, four (2.96%) of the 135 candidate disease-associated genes (before manual curation) were part of this UniProt disease-associated set: CELSR1 (MIM: 604523), MEIS1 (MIM: 601739), SF1 (MIM: 601516), and SMO. (Notably, the “SF1” gene in UniProt [MIM: 184757] was different from the “SF1” candidate from our analysis.) Thus, 132 candidates remain without any reported disease association in OMIM or UniProt at the time of preparing this manuscript. Permutation analysis sampling 135 random genes without OMIM annotations revealed our candidate set to be significantly enriched for genes with disease associations in UniProt, but not in OMIM (again allowing for matching of non-standard gene symbols) (expected = 1.95; p = 0.04482; n = 100,000 permutations).
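The percentages and the expected count quoted above can be checked with a few lines of arithmetic. The sketch below assumes that the expected number of hits under random sampling is simply the sample size multiplied by the background rate (the mean of a hypergeometric draw); the variable names are illustrative, and the sketch makes no attempt to reproduce the permutation p value.

```python
universe = 16_568    # genes without an OMIM disease association (from the text)
uniprot_hits = 240   # of those, genes with a UniProt disease association
n_candidates = 135   # candidate genes before manual curation
observed = 4         # candidates found in the UniProt disease-associated set

background_rate = uniprot_hits / universe     # fraction of the universe in UniProt
expected = n_candidates * background_rate     # expected hits for a random 135-gene set
observed_rate = observed / n_candidates       # fraction of candidates in UniProt
fold_enrichment = observed / expected

print(f"background rate: {100 * background_rate:.2f}%")
print(f"observed rate:   {100 * observed_rate:.2f}%")
print(f"expected hits:   {expected:.3f}")
print(f"fold enrichment: {fold_enrichment:.2f}")
```

The expected value computed this way is about 1.96, in line with the expectation of 1.95 reported from the permutations.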

Finally, we sought replication by intersecting the 154 candidate disease-associated genes with a set of 309 genes harboring 344 de novo mutations across a set of 242 ES trios with a wide range of congenital anomalies.67, 68 Of these 309 genes, 216 harbored de novo nonsynonymous (missense or stopgain) variants; 78 harbored only synonymous de novo variants; 15 genes only carried variants in the 3′ or 5′ UTRs or intronic (including splice-site) variants. Six genes overlapped between the set of 154 candidate disease-associated genes and the 216 genes harboring de novo nonsynonymous variants: AATK (MIM: 605276), CELSR1, IRF2BPL, MYO5C (MIM: 610022), ROCK1 (MIM: 601702), and UBC (MIM: 191340). Permutation analysis revealed that the set of genes harboring de novo nonsynonymous mutations is significantly enriched for genes in our set of 154 candidate disease-associated genes (expected = 1.68; p = 0.005; n = 10,000 permutations). Significantly, variants in IRF2BPL were also reported to cause a Mendelian disorder in August 2018, further validating our approach.55 No genes overlapped between the set of 154 candidate disease-associated genes and the synonymous or noncoding de novo variant sets, supporting the model that de novo nonsynonymous variants are much more likely to be pathogenic than other de novo variants.

HARLEE Facilitates Reverse Genetic Screen Prioritization

HARLEE is also a powerful tool for phenocentric approaches to Mendelian genetic discovery. To illustrate this capability, we sought to identify genes enriched with potentially deleterious genetic variation in individuals with apparent auditory system dysfunction. We first counted the number of samples with ultra-rare variants (see Material and Methods) in each gene across all Mendelian samples in HARLEE, filtering out likely benign variants with a REVEL score less than 0.25, normalizing variant-per-gene count by cohort size. We then repeated this analysis on the subset of all samples whose phenotypic descriptions contain auditory system-related phenotypic keywords. For each gene harboring ultra-rare variants across at least two samples in the auditory phenotype cohort, we then measured cohort-specific enrichment by dividing the ultra-rare variant occurrence rate in the phenocentric cohort by the same rate across all samples in HARLEE.
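The enrichment measure described above can be sketched in Python. The function and argument names, and the toy counts, are hypothetical; the sketch assumes that the per-gene counts of samples carrying qualifying ultra-rare variants (after the REVEL filter) have already been computed for both the phenotype cohort and the full sample set.

```python
def enrichment(cohort_counts, cohort_size, all_counts, all_size, min_cohort_samples=2):
    """Per-gene ratio of the carrier rate in a phenotype cohort to the overall rate.

    cohort_counts / all_counts: dict mapping gene -> number of carrier samples.
    The cohort is a subset of all samples, so all_counts[gene] >= cohort_counts[gene].
    """
    result = {}
    for gene, n_cohort in cohort_counts.items():
        if n_cohort < min_cohort_samples:
            continue  # require at least two carriers in the phenotype cohort
        cohort_rate = n_cohort / cohort_size
        overall_rate = all_counts[gene] / all_size
        result[gene] = cohort_rate / overall_rate
    return result


# Toy example: 200 samples with the phenotype out of 10,000 in total.
cohort_counts = {"GENE_X": 3, "GENE_Y": 1, "GENE_Z": 4}
all_counts = {"GENE_X": 10, "GENE_Y": 5, "GENE_Z": 200}
ratios = enrichment(cohort_counts, 200, all_counts, 10_000)
ranked = sorted(ratios, key=ratios.get, reverse=True)
print(ranked, ratios)
```

Normalizing both counts by cohort size is what makes the ratio interpretable as a fold enrichment, so a gene carried at the background rate scores close to 1 regardless of how many samples are in the phenotype cohort.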

The top three enriched genes in the auditory phenotype cohort with reported Mendelian phenotypes in OMIM are all directly or indirectly related to hearing loss: (1) variants in the second-most enriched gene overall, GRHL2 (MIM: 608576), which harbors 23× more ultra-rare variants than the background rate in HARLEE, are known to cause autosomal-dominant deafness (MIM: 608641);69 (2) deficiency of the fifth-most enriched gene, ECHS1 (MIM: 602292), is reported to cause deafness in the context of mitochondrial encephalopathy (MIM: 616277);70 (3) mutations in the thirteenth-most enriched gene, KDM6A (MIM: 300128), cause Kabuki syndrome (MIM: 300867), which leads to hearing loss in approximately 40% of case subjects.71 These preliminary findings therefore support this strategy of prioritization of genes with phenocentric enrichment for potentially pathogenic variation in HARLEE.72

Discussion

We report the application of a Hadoop data lake to Mendelian discovery. Furthermore, we report a large-scale aggregation of 18,696 research and clinical ES datasets for subjects with suspected Mendelian disease traits. The data were used to discover 132 candidate Mendelian disease-associating genes through an optimization-based genocentric approach. In addition, a phenocentric approach utilized HARLEE to prioritize genes for an ongoing reverse genetic screen for hearing-related genes. These candidates are now available for further study to establish proof of Mendelian association.

The methods for identifying the candidates are agnostic to presumed zygosity at a locus and it is likely that the vast majority will display a clinical phenotype with a dominant mode of inheritance—i.e., an AD disease trait. Indeed, of the 19 original candidates recently reported by OMIM to associate with a Mendelian phenotype, 13 have been reported to demonstrate AD inheritance (p = 0.0835; binomial probability), and AD inheritance for high-impact variants cannot be ruled out for 2 additional genes (LINGO1, MYO9A) based on reported cases in OMIM. It also can be anticipated that a significant subset of these genes will eventually reveal more complex architectures such as recessive inheritance or even compound inheritance of coding and non-coding common variant alleles.73, 74, 75, 76 Many case subjects may require extensive follow up, including WGS and scrutiny of databases for genome-wide variant data.
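The binomial probability quoted above can be reproduced under one plausible reading. The text does not state the null hypothesis; the sketch below assumes a one-sided test of 13 or more AD genes out of 19 against a null proportion of 0.5, which is an assumption of this example rather than a documented detail of the analysis.

```python
from math import comb


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# 13 of the 19 recently reported candidates show AD inheritance.
p_value = binom_upper_tail(13, 19, 0.5)
print(round(p_value, 4))  # 0.0835
```

Under this assumption the result matches the reported p = 0.0835; a different null proportion or a two-sided test would give a different value.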

Our approach does not intend to diagnose or solve individual clinical or research case subjects, but rather aims to discover candidate disease-associated genes, constituting cohorts of individuals harboring variants that may or may not be pathogenic in a shared gene. Each of these candidates will ultimately be revealed to either associate with one or more Mendelian disorders or not. For false-positive genes, without a true disease association, none of the individual variants detected in our analysis can be pathogenic. For true-positive genes, only some, but not all, of the variants in the gene need be pathogenic, notwithstanding the possibility of incidental discovery of a true-positive disease-associated gene where each of the detected variants is actually benign. We anticipate the ratio of pathogenic to benign variants, as well as the ratio of true-positive to false-positive disease-gene associations, to vary across filtering parameters. There may be a negative correlation between maximum discovery density value for a given parameter cutoff value and the associated false-positive rate or benign variant rate. For instance, SIFT and PolyPhen analyses were excluded on the basis of yielding subjectively large candidate disease-associated gene set sizes. Indeed, the discovery density values for SIFT (0.086) and PolyPhen (0.109) are much lower than those of MTR (0.372) or pLI (0.6). However, so long as discovery of Mendelian disease-gene associations and pathogenic variants continues, it is impossible to define true false-positive gene discovery or benign variant rates.

HARLEE is well suited for applications other than the discovery of Mendelian disease-gene associations, including the discovery of previously unrecognized pathogenic variants within genes known to associate with Mendelian disorders and the study of the molecular and genetic models underlying phenotypic expansion.77, 78 In one application, HARLEE was utilized for sample re-analysis and recruitment, identifying three additional individuals with de novo variants in DDX3X (MIM: 300160), previously missed by experienced geneticists searching the same ES data for the exhaustive set of individuals indicated for the study.79 The robust yet flexible nature of a Hadoop data lake such as HARLEE is a powerful tool for genocentric reanalysis of individuals sequenced through clinical labs.

An important advantage of the structured HARLEE data management system is the ability to interrogate specific alleles—meeting any possible bioinformatic filtering criteria—in key samples without exposure of individual identifiers. Within the scope of this study, this is achieved by grouping query results into higher-level categories (i.e., domain, gene, or pathway), reporting aggregating variant counts across selected samples. For more outward-facing activities, our approach can be replicated by a front-end web application with permission to access and query full variant-level information, enabling users without sample-level access permission to query genes for recurrence of variants meeting variable bioinformatic filtering criteria in samples with filterable phenotypic descriptors. For example, a query for recurrent loss-of-function variants in RB1 (MIM: 614041), filtering to include only samples with retinoblastoma, might report a count of two individuals in publicly available ingested datasets. Users who wish to pursue a detailed study involving the resulting individuals could then request to access sample-level information or to be connected with the referring physicians.80 We anticipate that the flexibility and power allowed by such a framework will accelerate disease-gene association discovery.
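The aggregate-only reporting described above can be illustrated with a short Python sketch. The record layout, function name, and sample data are hypothetical and are not HARLEE's interface; the point is only that the caller receives per-gene counts, never sample identifiers.

```python
from collections import defaultdict


def aggregate_counts(records, phenotype_keyword=None):
    """Count distinct carrier samples per gene without returning identifiers.

    records: iterable of (sample_id, gene, phenotypes) tuples, assumed to be
             already restricted to variants passing the chosen filters.
    """
    carriers = defaultdict(set)
    for sample_id, gene, phenotypes in records:
        if phenotype_keyword is None or phenotype_keyword in phenotypes:
            carriers[gene].add(sample_id)
    # Only the aggregate counts leave this function.
    return {gene: len(ids) for gene, ids in carriers.items()}


# Toy records standing in for loss-of-function calls in RB1.
records = [
    ("s1", "RB1", {"retinoblastoma"}),
    ("s2", "RB1", {"retinoblastoma"}),
    ("s2", "RB1", {"retinoblastoma"}),  # second variant in the same sample
    ("s3", "RB1", {"hearing loss"}),
]
counts = aggregate_counts(records, phenotype_keyword="retinoblastoma")
print(counts)
```

A production system would likely also need a minimum-count threshold before releasing a result, since very small counts combined with detailed phenotype filters could still narrow down individuals; that safeguard is not described in the text and is mentioned here only as a design consideration.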

The discovery and functional annotation of all ∼20,000–22,000 genes in the human genome will likely be increasingly dependent upon genocentric analysis. Nevertheless, phenocentric analysis will continue to play an indispensable role in Mendelian disease discovery and clinical re-analysis of extant data. The two approaches are complementary, in the same way that classical reverse genetic approaches complement forward genetic approaches. Ultimately, perhaps the most effective approach for solving the genotype-phenotype puzzle that underlies biological discovery in human genetics and clinical genomics81 will consist of iterating between genocentric and phenocentric analyses and perspectives. In one hypothetical instance, a disease-gene association may be initially identified through a phenocentric cohort analysis, where multiple individuals in a cohort with highly similar phenotypes share the same type of suspect genetic lesion. Iterating to a genocentric analysis of the candidate gene across large cohorts, utilizing tools such as HARLEE, may reveal additional individuals who harbor identical genetic lesions with the same or different phenotypes. Re-visiting the phenotypic data of these individuals may reveal molecular explanations of pleiotropy, variable expressivity, or incomplete penetrance75 unlikely to be ascertained solely through phenocentric approaches.

Other large-scale genocentric projects have been recently reported, each with a unique scope or angle. The ongoing Deciphering Developmental Disorders (DDD) project is an important resource investigating the genotype-phenotype relationship in the context of de novo mutations in developmental disorders.7 Similarly, the ongoing Human Knockout Project is intended to characterize the extent and impact of homozygous loss-of-function (LoF) variation in populations with elevated rates of consanguinity, such as Pakistani and Finnish populations.23 Our efforts have built upon this foundation of large-scale genocentric analysis, expanding the paradigm into the area of broadly defined suspected Mendelian disease traits, introducing a high-yield methodology initially agnostic to genotype or de novo inheritance status, relevant to both missense and loss-of-function variation.

Declaration of Interests

J.R.L. has stock ownership in 23andMe and Lasergen, is a paid consultant for Regeneron Pharmaceuticals, and is a coinventor on multiple US and European patents related to molecular diagnostics for inherited neuropathies, eye diseases, and bacterial genomic fingerprinting. The Department of Molecular and Human Genetics at Baylor College of Medicine derives revenue from the chromosomal microarray analysis and clinical exome sequencing offered in the Baylor Genetics Laboratory (http://baylorgenetics.com).

Acknowledgments

This work was supported in part by grants UM1 HG008898 from the National Human Genome Research Institute (NHGRI) to the Baylor College of Medicine Center for Common Disease Genetics; UM1 HG006542 from the NHGRI/National Heart, Lung, and Blood Institute (NHLBI) to the Baylor Hopkins Center for Mendelian Genomics; R01 NS058529 and R35 NS105078 (J.R.L.) from the National Institute of Neurological Disorders and Stroke (NINDS); and P50 DK096415 (N.K.) from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK). This work was also supported in part by the Baylor College of Medicine President’s Circle Precision Medicine/Population Health Initiative. A.W.H. was supported in part by NIH T32 GM08307-26 and The Cullen Foundation. J.E.P. was supported by NHGRI K08 HG008986.

We thank Huda Y. Zoghbi and Joshua M. Shulman for their insight and feedback as related to genocentric and phenocentric studies of human disease. We thank Jeremy Easton-Marks, Simon White, Joshua Traynelis, Piyushkumar Panchel, and Brian Palazzo for assistance with data architecture, data wrangling, and systems administration. We thank Stephen Wilson for sharing archived OMIM database downloads.

Published: October 24, 2019

Footnotes

Supplemental Data can be found online at https://doi.org/10.1016/j.ajhg.2019.09.027.

Contributor Information

Richard A. Gibbs, Email: agibbs@bcm.edu.

Task Force for Neonatal Genomics:

Alexander Allori, Misha Angrist, Patricia Ashley, Margarita Bidegain, Brita Boyd, Eileen Chambers, Heidi Cope, C. Michael Cotten, Theresa Curington, Erica E. Davis, Sarah Ellestad, Kimberley Fisher, Amanda French, William Gallentine, Ronald Goldberg, Kevin Hill, Sujay Kansagra, Nicholas Katsanis, Sara Katsanis, Joanne Kurtzberg, Jeffrey Marcus, Marie McDonald, Mohammed Mikati, Stephen Miller, Amy Murtha, Yezmin Perilla, Carolyn Pizoli, Todd Purves, Sherry Ross, Azita Sadeghpour, Edward Smith, and John Wiener

Supplemental Information

Document S1. Figures S1–S9, Tables S1 and S2, Supplemental Material and Methods, and the Task Force for Neonatal Genomics Consortium List (mmc1.pdf, 2.3 MB)
Table S3. Variants in Research Samples Passing Filtering Criteria (mmc2.xlsx, 882.9 KB)
Table S4. 154 Candidate Mendelian Disease-Associating Genes (mmc3.xlsx, 14.4 KB)
Table S5. Phenocentric Analysis of Genes Enriched for Ultra-rare Variants in Individuals with Hearing-Related Phenotypes (mmc4.xlsx, 25.5 KB)
Document S2. Article plus Supplemental Information (mmc5.pdf, 3.1 MB)

References

  • 1. Yang Y., Muzny D.M., Reid J.G., Bainbridge M.N., Willis A., Ward P.A., Braxton A., Beuten J., Xia F., Niu Z. Clinical whole-exome sequencing for the diagnosis of mendelian disorders. N. Engl. J. Med. 2013;369:1502–1511. doi: 10.1056/NEJMoa1306555.
  • 2. Bamshad M.J., Shendure J.A., Valle D., Hamosh A., Lupski J.R., Gibbs R.A., Boerwinkle E., Lifton R.P., Gerstein M., Gunel M., Centers for Mendelian Genomics. The Centers for Mendelian Genomics: a new large-scale initiative to identify the genes underlying rare Mendelian conditions. Am. J. Med. Genet. A. 2012;158A:1523–1525. doi: 10.1002/ajmg.a.35470.
  • 3. Chong J.X., Buckingham K.J., Jhangiani S.N., Boehm C., Sobreira N., Smith J.D., Harrell T.M., McMillin M.J., Wiszniewski W., Gambin T., Centers for Mendelian Genomics. The Genetic Basis of Mendelian Phenotypes: Discoveries, Challenges, and Opportunities. Am. J. Hum. Genet. 2015;97:199–215. doi: 10.1016/j.ajhg.2015.06.009.
  • 4. Posey J.E., Rosenfeld J.A., James R.A., Bainbridge M., Niu Z., Wang X., Dhar S., Wiszniewski W., Akdemir Z.H.C., Gambin T. Molecular diagnostic experience of whole-exome sequencing in adult patients. Genet. Med. 2016;18:678–685. doi: 10.1038/gim.2015.142.
  • 5. Posey J.E., O’Donnell-Luria A.H., Chong J.X., Harel T., Jhangiani S.N., Coban Akdemir Z.H., Buyske S., Pehlivan D., Carvalho C.M.B., Baxter S., Centers for Mendelian Genomics. Insights into genetics, human biology and disease gleaned from family based genomic studies. Genet. Med. 2019;21:798–812. doi: 10.1038/s41436-018-0408-7.
  • 6. McKusick V.A. Mendelian Inheritance in Man and its online version, OMIM. Am. J. Hum. Genet. 2007;80:588–604. doi: 10.1086/514346.
  • 7. McRae J.F., Clayton S., Fitzgerald T.W., Kaplanis J., Prigmore E., Rajan D., Sifrim A., Aitken S., Akawi N., Alvi M., Deciphering Developmental Disorders Study. Prevalence and architecture of de novo mutations in developmental disorders. Nature. 2017;542:433–438. doi: 10.1038/nature21062.
  • 8. White J., Beck C.R., Harel T., Posey J.E., Jhangiani S.N., Tang S., Farwell K.D., Powis Z., Mendelsohn N.J., Baker J.A. POGZ truncating alleles cause syndromic intellectual disability. Genome Med. 2016;8:3. doi: 10.1186/s13073-015-0253-0.
  • 9. Stessman H.A.F., Willemsen M.H., Fenckova M., Penn O., Hoischen A., Xiong B., Wang T., Hoekzema K., Vives L., Vogel I. Disruption of POGZ Is Associated with Intellectual Disability and Autism Spectrum Disorders. Am. J. Hum. Genet. 2016;98:541–552. doi: 10.1016/j.ajhg.2016.02.004.
  • 10. Dentici M.L., Niceta M., Pantaleoni F., Barresi S., Bencivenga P., Dallapiccola B., Digilio M.C., Tartaglia M. Expanding the phenotypic spectrum of truncating POGZ mutations: Association with CNS malformations, skeletal abnormalities, and distinctive facial dysmorphism. Am. J. Med. Genet. A. 2017;173:1965–1969. doi: 10.1002/ajmg.a.38255.
  • 11. Liu L., Li Y., Li S., Hu N., He Y., Pong R., Lin D., Lu L., Law M. Comparison of next-generation sequencing systems. J. Biomed. Biotechnol. 2012;2012:251364. doi: 10.1155/2012/251364.
  • 12. Mardis E.R. The impact of next-generation sequencing technology on genetics. Trends Genet. 2008;24:133–141. doi: 10.1016/j.tig.2007.12.007.
  • 13. Austin C.P., Battey J.F., Bradley A., Bucan M., Capecchi M., Collins F.S., Dove W.F., Duyk G., Dymecki S., Eppig J.T. The knockout mouse project. Nat. Genet. 2004;36:921–924. doi: 10.1038/ng0904-921.
  • 14. Ghosh R., Oak N., Plon S.E. Evaluation of in silico algorithms for use with ACMG/AMP clinical variant interpretation guidelines. Genome Biol. 2017;18:225. doi: 10.1186/s13059-017-1353-5.
  • 15. Ioannidis N.M., Rothstein J.H., Pejaver V., Middha S., McDonnell S.K., Baheti S., Musolf A., Li Q., Holzinger E., Karyadi D. REVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants. Am. J. Hum. Genet. 2016;99:877–885. doi: 10.1016/j.ajhg.2016.08.016.
  • 16. Traynelis J., Silk M., Wang Q., Berkovic S.F., Liu L., Ascher D.B., Balding D.J., Petrovski S. Optimizing genomic medicine in epilepsy through a gene-customized approach to missense variant interpretation. Genome Res. 2017;27:1715–1729. doi: 10.1101/gr.226589.117.
  • 17. Coban-Akdemir Z., White J.J., Song X., Jhangiani S.N., Fatih J.M., Gambin T., Bayram Y., Chinn I.K., Karaca E., Punetha J., Baylor-Hopkins Center for Mendelian Genomics. Identifying Genes Whose Mutant Transcripts Cause Dominant Disease Traits by Potential Gain-of-Function Alleles. Am. J. Hum. Genet. 2018;103:171–187. doi: 10.1016/j.ajhg.2018.06.009.
  • 18. Lek M., Karczewski K.J., Minikel E.V., Samocha K.E., Banks E., Fennell T., O’Donnell-Luria A.H., Ware J.S., Hill A.J., Cummings B.B., Exome Aggregation Consortium. Analysis of protein-coding genetic variation in 60,706 humans. Nature. 2016;536:285–291. doi: 10.1038/nature19057.
  • 19. Abecasis G.R., Auton A., Brooks L.D., DePristo M.A., Durbin R.M., Handsaker R.E., Kang H.M., Marth G.T., McVean G.A., 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491:56–65. doi: 10.1038/nature11632.
  • 20. Kosmicki J.A., Samocha K.E., Howrigan D.P., Sanders S.J., Slowikowski K., Lek M., Karczewski K.J., Cutler D.J., Devlin B., Roeder K. Refining the role of de novo protein-truncating variants in neurodevelopmental disorders by using population reference samples. Nat. Genet. 2017;49:504–510. doi: 10.1038/ng.3789.
  • 21. Lupski J.R., Belmont J.W., Boerwinkle E., Gibbs R.A. Clan Genomics and the Complex Architecture of Human Disease. Cell. 2011;147:32–43. doi: 10.1016/j.cell.2011.09.008.
  • 22. Sundaram L., Gao H., Padigepati S.R., McRae J.F., Li Y., Kosmicki J.A., Fritzilas N., Hakenberg J., Dutta A., Shon J. Predicting the clinical impact of human mutation with deep neural networks. Nat. Genet. 2018;50:1161–1170. doi: 10.1038/s41588-018-0167-z.
  • 23. Saleheen D., Natarajan P., Armean I.M., Zhao W., Rasheed A., Khetarpal S.A., Won H.-H., Karczewski K.J., O’Donnell-Luria A.H., Samocha K.E. Human knockouts and phenotypic analysis in a cohort with a high rate of consanguinity. Nature. 2017;544:235–239. doi: 10.1038/nature22034.
  • 24. Samocha K.E., Robinson E.B., Sanders S.J., Stevens C., Sabo A., McGrath L.M., Kosmicki J.A., Rehnström K., Mallick S., Kirby A. A framework for the interpretation of de novo mutation in human disease. Nat. Genet. 2014;46:944–950. doi: 10.1038/ng.3050.
  • 25. Taylor R.C. An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC Bioinformatics. 2010;11(Suppl 12):S1. doi: 10.1186/1471-2105-11-S12-S1.
  • 26. Niemenmaa M., Kallio A., Schumacher A., Klemelä P., Korpelainen E., Heljanko K. Hadoop-BAM: directly manipulating next generation sequencing data in the cloud. Bioinformatics. 2012;28:876–877. doi: 10.1093/bioinformatics/bts054.
  • 27. O’Driscoll A., Daugelaite J., Sleator R.D. ‘Big data’, Hadoop and cloud computing in genomics. J. Biomed. Inform. 2013;46:774–781. doi: 10.1016/j.jbi.2013.07.001.
  • 28. Zou Q., Li X.B., Jiang W.R., Lin Z.Y., Li G.L., Chen K. Survey of MapReduce frame operation in bioinformatics. Brief. Bioinform. 2014;15:637–647. doi: 10.1093/bib/bbs088.
  • 29. Siretskiy A., Sundqvist T., Voznesenskiy M., Spjuth O. A quantitative assessment of the Hadoop framework for analyzing massively parallel DNA sequencing data. Gigascience. 2015;4:26. doi: 10.1186/s13742-015-0058-5.
  • 30. Hodor P., Chawla A., Clark A., Neal L. cl-dash: rapid configuration and deployment of Hadoop clusters for bioinformatics research in the cloud. Bioinformatics. 2016;32:301–303. doi: 10.1093/bioinformatics/btv553.
  • 31. O’Driscoll A., Belogrudov V., Carroll J., Kropp K., Walsh P., Ghazal P., Sleator R.D. HBLAST: Parallelised sequence similarity--A Hadoop MapReducable basic local alignment search tool. J. Biomed. Inform. 2015;54:58–64. doi: 10.1016/j.jbi.2015.01.008.
  • 32. de Castro M.R., Tostes C.D.S., Dávila A.M.R., Senger H., da Silva F.A.B. SparkBLAST: scalable BLAST processing using in-memory operations. BMC Bioinformatics. 2017;18:318. doi: 10.1186/s12859-017-1723-8.
  • 33. Yin Z., Lan H., Tan G., Lu M., Vasilakos A.V., Liu W. Computing Platforms for Big Biological Data Analytics: Perspectives and Challenges. Comput. Struct. Biotechnol. J. 2017;15:403–411. doi: 10.1016/j.csbj.2017.07.004.
  • 34. Richards S., Aziz N., Bale S., Bick D., Das S., Gastier-Foster J., Grody W.W., Hegde M., Lyon E., Spector E., ACMG Laboratory Quality Assurance Committee. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet. Med. 2015;17:405–424. doi: 10.1038/gim.2015.30.
  • 35. Ng P.C., Henikoff S. SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res. 2003;31:3812–3814. doi: 10.1093/nar/gkg509.
  • 36. DeMari J., Mroske C., Tang S., Nimeh J., Miller R., Lebel R.R. CLTC as a clinically novel gene associated with multiple malformations and developmental delay. Am. J. Med. Genet. A. 2016;170A:958–966. doi: 10.1002/ajmg.a.37506.
  • 37. Haijes H.A., Koster M.J.E., Rehmann H., Li D., Hakonarson H., Cappuccio G., Hancarova M., Lehalle D., Reardon W., Schaefer G.B. De Novo Heterozygous POLR2A Variants Cause a Neurodevelopmental Syndrome with Profound Infantile-Onset Hypotonia. Am. J. Hum. Genet. 2019;105:283–301. doi: 10.1016/j.ajhg.2019.06.016.
  • 38. Lessel D., Schob C., Küry S., Reijnders M.R.F., Harel T., Eldomery M.K., Coban-Akdemir Z., Denecke J., Edvardson S., Colin E., DDD study, C4RCD Research Group. De Novo Missense Mutations in DHX30 Impair Global Translation and Cause a Neurodevelopmental Disorder. Am. J. Hum. Genet. 2017;101:716–724. doi: 10.1016/j.ajhg.2017.09.014.
  • 39. Deardorff M.A., Kaur M., Yaeger D., Rampuria A., Korolev S., Pie J., Gil-Rodríguez C., Arnedo M., Loeys B., Kline A.D. Mutations in cohesin complex members SMC3 and SMC1A cause a mild variant of cornelia de Lange syndrome with predominant mental retardation. Am. J. Hum. Genet. 2007;80:485–494. doi: 10.1086/511888.
  • 40. Snijders Blok L., Rousseau J., Twist J., Ehresmann S., Takaku M., Venselaar H., Rodan L.H., Nowak C.B., Douglas J., Swoboda K.J., DDD study. CHD3 helicase domain mutations cause a neurodevelopmental syndrome with macrocephaly and impaired speech and language. Nat. Commun. 2018;9:4619. doi: 10.1038/s41467-018-06014-6.
  • 41. Helbig K.L., Mroske C., Moorthy D., Sajan S.A., Velinov M. Biallelic loss-of-function variants in DOCK3 cause muscle hypotonia, ataxia, and intellectual disability. Clin. Genet. 2017;92:430–433. doi: 10.1111/cge.12995.
  • 42.Gueneau L., Fish R.J., Shamseldin H.E., Voisin N., Tran Mau-Them F., Preiksaitiene E., Monroe G.R., Lai A., Putoux A., Allias F., DDD Study KIAA1109 Variants Are Associated with a Severe Disorder of Brain Development and Arthrogryposis. Am. J. Hum. Genet. 2018;102:116–132. doi: 10.1016/j.ajhg.2017.12.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.O’Connor E., Töpf A., Müller J.S., Cox D., Evangelista T., Colomer J., Abicht A., Senderek J., Hasselmann O., Yaramis A. Identification of mutations in the MYO9A gene in patients with congenital myasthenic syndrome. Brain. 2016;139:2143–2153. doi: 10.1093/brain/aww130. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Seong E., Insolera R., Dulovic M., Kamsteeg E.-J., Trinh J., Brüggemann N., Sandford E., Li S., Ozel A.B., Li J.Z. Mutations in VPS13D lead to a new recessive ataxia with spasticity and mitochondrial defects. Ann. Neurol. 2018;83:1075–1088. doi: 10.1002/ana.25220. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Gauthier J., Meijer I.A., Lessel D., Mencacci N.E., Krainc D., Hempel M., Tsiakas K., Prokisch H., Rossignol E., Helm M.H. Recessive mutations in >VPS13D cause childhood onset movement disorders. Ann. Neurol. 2018;83:1089–1095. doi: 10.1002/ana.25204. [DOI] [PubMed] [Google Scholar]
  • 46.Cogné B., Ehresmann S., Beauregard-Lacroix E., Rousseau J., Besnard T., Garcia T., Petrovski S., Avni S., McWalter K., Blackburn P.R., CAUSES Study. Deciphering Developmental Disorders study Missense variants in the histone acetyltransferase complex component gene TRRAP cause autism and syndromic intellectual disability. Am. J. Hum. Genet. 2019;104:530–541. doi: 10.1016/j.ajhg.2019.01.010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Damaj L., Lupien-Meilleur A., Lortie A., Riou É., Ospina L.H., Gagnon L., Vanasse C., Rossignol E. CACNA1A haploinsufficiency causes cognitive impairment, autism and epileptic encephalopathy with mild cerebellar symptoms. Eur. J. Hum. Genet. 2015;23:1505–1512. doi: 10.1038/ejhg.2015.21. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Travaglini L., Nardella M., Bellacchio E., D’Amico A., Capuano A., Frusciante R., Di Capua M., Cusmai R., Barresi S., Morlino S. Missense mutations of CACNA1A are a frequent cause of autosomal dominant nonprogressive congenital ataxia. Eur. J. Paediatr. Neurol. 2017;21:450–456. doi: 10.1016/j.ejpn.2016.11.005. [DOI] [PubMed] [Google Scholar]
  • 49.Heyne H.O., Singh T., Stamberger H., Abou Jamra R., Caglayan H., Craiu D., De Jonghe P., Guerrini R., Helbig K.L., Koeleman B.P.C., EuroEPINOMICS RES Consortium De novo variants in neurodevelopmental disorders with epilepsy. Nat. Genet. 2018;50:1048–1053. doi: 10.1038/s41588-018-0143-7. [DOI] [PubMed] [Google Scholar]
  • 50.Schlingmann K.P., Bandulik S., Mammen C., Tarailo-Graovac M., Holm R., Baumann M., König J., Lee J.J.Y., Drögemöller B., Imminger K. Germline De Novo Mutations in ATP1A1 Cause Renal Hypomagnesemia, Refractory Seizures, and Intellectual Disability. Am. J. Hum. Genet. 2018;103:808–816. doi: 10.1016/j.ajhg.2018.10.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Lassuthova P., Rebelo A.P., Ravenscroft G., Lamont P.J., Davis M.R., Manganelli F., Feely S.M., Bacon C., Brožková D.Š., Haberlova J. Mutations in ATP1A1 Cause Dominant Charcot-Marie-Tooth Type 2. Am. J. Hum. Genet. 2018;102:505–514. doi: 10.1016/j.ajhg.2018.01.023. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Helbig K.L., Lauerer R.J., Bahr J.C., Souza I.A., Myers C.T., Uysal B., Schwarz N., Gandini M.A., Huang S., Keren B., Task Force for Neonatal Genomics. Deciphering Developmental Disorders Study De Novo Pathogenic Variants in CACNA1E Cause Developmental and Epileptic Encephalopathy with Contractures, Macrocephaly, and Dyskinesias. Am. J. Hum. Genet. 2018;103:666–678. doi: 10.1016/j.ajhg.2018.09.006. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Gregor A., Sadleir L.G., Asadollahi R., Azzarello-Burri S., Battaglia A., Ousager L.B., Boonsawat P., Bruel A.-L., Buchert R., Calpena E., University of Washington Center for Mendelian Genomics. DDD Study De Novo Variants in the F-Box Protein FBXO11 in 20 Individuals with a Variable Neurodevelopmental Disorder. Am. J. Hum. Genet. 2018;103:305–316. doi: 10.1016/j.ajhg.2018.07.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54.Fritzen D., Kuechler A., Grimmel M., Becker J., Peters S., Sturm M., Hundertmark H., Schmidt A., Kreiß M., Strom T.M. De novo FBXO11 mutations are associated with intellectual disability and behavioural anomalies. Hum. Genet. 2018;137:401–411. doi: 10.1007/s00439-018-1892-1. [DOI] [PubMed] [Google Scholar]
  • 55.Marcogliese P.C., Shashi V., Spillmann R.C., Stong N., Rosenfeld J.A., Koenig M.K., Martínez-Agosto J.A., Herzog M., Chen A.H., Dickson P.I., Program for Undiagnosed Diseases (UD-PrOZA) Undiagnosed Diseases Network IRF2BPL Is Associated with Neurological Phenotypes. Am. J. Hum. Genet. 2018;103:245–260. doi: 10.1016/j.ajhg.2018.07.006. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56.Faundes V., Newman W.G., Bernardini L., Canham N., Clayton-Smith J., Dallapiccola B., Davies S.J., Demos M.K., Goldman A., Gill H., Clinical Assessment of the Utility of Sequencing and Evaluation as a Service (CAUSES) Study. Deciphering Developmental Disorders (DDD) Study Histone Lysine Methylases and Demethylases in the Landscape of Human Developmental Disorders. Am. J. Hum. Genet. 2018;102:175–187. doi: 10.1016/j.ajhg.2017.11.013. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57.Ansar M., Riazuddin S., Sarwar M.T., Makrythanasis P., Paracha S.A., Iqbal Z., Khan J., Assir M.Z., Hussain M., Razzaq A. Biallelic variants in LINGO1 are associated with autosomal recessive intellectual disability, microcephaly, speech and motor delay. Genet. Med. 2018;20:778–784. doi: 10.1038/gim.2017.113. [DOI] [PubMed] [Google Scholar]
  • 58.Dobyns W.B., Aldinger K.A., Ishak G.E., Mirzaa G.M., Timms A.E., Grout M.E., Dremmen M.H.G., Schot R., Vandervore L., van Slegtenhorst M.A., University of Washington Center for Mendelian Genomics. Center for Mendelian Genomics at the Broad Institute of MIT and Harvard MACF1 Mutations Encoding Highly Conserved Zinc-Binding Residues of the GAR Domain Cause Defects in Neuronal Migration and Axon Guidance. Am. J. Hum. Genet. 2018;103:1009–1021. doi: 10.1016/j.ajhg.2018.10.019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59.Tripathy R., Leca I., van Dijk T., Weiss J., van Bon B.W., Sergaki M.C., Gstrein T., Breuss M., Tian G., Bahi-Buisson N. Mutations in MAST1 Cause Mega-Corpus-Callosum Syndrome with Cerebellar Hypoplasia and Cortical Malformations. Neuron. 2018;100:1354–1368.e5. doi: 10.1016/j.neuron.2018.10.044. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60.Wang L., Feng Y., Yan D., Qin L., Grati M., Mittal R., Li T., Sundhari A.K., Liu Y., Chapagain P. A dominant variant in the PDE1C gene is associated with nonsyndromic hearing loss. Hum. Genet. 2018;137:437–446. doi: 10.1007/s00439-018-1895-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61.Zaman T., Helbig I., Božović I.B., DeBrosse S.D., Bergqvist A.C., Wallis K., Medne L., Maver A., Peterlin B., Helbig K.L. Mutations in SCN3A cause early infantile epileptic encephalopathy. Ann. Neurol. 2018;83:703–717. doi: 10.1002/ana.25188. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62.Stevens S.J.C., van der Schoot V., Leduc M.S., Rinne T., Lalani S.R., Weiss M.M., van Hagen J.M., Lachmeijer A.M.A., Stockler-Ipsiroglu S.G., Lehman A., Brunner H.G., CAUSES Study De novo mutations in the SET nuclear proto-oncogene, encoding a component of the inhibitor of histone acetyltransferases (INHAT) complex in patients with nonsyndromic intellectual disability. Hum. Mutat. 2018;39:1014–1023. doi: 10.1002/humu.23541. [DOI] [PubMed] [Google Scholar]
  • 63.Liu N., Schoch K., Luo X., Pena L.D.M., Bhavana V.H., Kukolich M.K., Stringer S., Powis Z., Radtke K., Mroske C., Undiagnosed Diseases Network (UDN) Functional variants in TBX2 are associated with a syndromic cardiovascular and skeletal developmental disorder. Hum. Mol. Genet. 2018;27:2454–2465. doi: 10.1093/hmg/ddy146. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 64.Vetrini F., McKee S., Rosenfeld J.A., Suri M., Lewis A.M., Nugent K.M., Roeder E., Littlejohn R.O., Holder S., Zhu W., DDD study De novo and inherited TCF20 pathogenic variants are associated with intellectual disability, dysmorphic features, hypotonia, and neurological impairments with similarities to Smith-Magenis syndrome. Genome Med. 2019;11:12. doi: 10.1186/s13073-019-0623-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 65.Pehlivan D., Bayram Y., Gunes N., Coban Akdemir Z., Shukla A., Bierhals T., Tabakci B., Sahin Y., Gezdirici A., Faith J.M. The Genomics of Arthrogryposis, a Complex Trait: Candidate Genes and Further Evidence for Oligogenic Inheritance. Am. J. Hum. Genet. 2019;105:132–150. doi: 10.1016/j.ajhg.2019.05.015. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66.Bateman A., Martin M.J., O’Donovan C., Magrane M., Alpi E., Antunes R., Bely B., Bingley M., Bonilla C., Britto R., The UniProt Consortium UniProt: the universal protein knowledgebase. Nucleic Acids Res. 2017;45(D1):D158–D169. doi: 10.1093/nar/gkw1099. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 67.Katsanis N., Cotten M., Angrist M. Exome and genome sequencing of neonates with neurodevelopmental disorders. Future Neurol. 2012;7:655–658. [Google Scholar]
  • 68.Katsanis S.H., Minear M.A., Sadeghpour A., Cope H., Perilla Y., Cook-Deegan R., Duke Task Force for Neonatal Genomics, Katsanis N., Davis E.E., Angrist M. Participant-Partners in Genetic Research: An Exome Study with Families of Children with Unexplained Medical Conditions. J. Participat. Med. 2018;10:e2. doi: 10.2196/jopm.8958. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 69.Vona B., Nanda I., Neuner C., Müller T., Haaf T. Confirmation of GRHL2 as the gene for the DFNA28 locus. Am. J. Med. Genet. A. 2013;161A:2060–2065. doi: 10.1002/ajmg.a.36017. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70.Haack T.B., Jackson C.B., Murayama K., Kremer L.S., Schaller A., Kotzaeridou U., de Vries M.C., Schottmann G., Santra S., Büchner B. Deficiency of ECHS1 causes mitochondrial encephalopathy with cardiac involvement. Ann. Clin. Transl. Neurol. 2015;2:492–509. doi: 10.1002/acn3.189. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 71.Tekin M., Fitoz S., Arici S., Cetinkaya E., Incesulu A. Niikawa-Kuroki (Kabuki) syndrome with congenital sensorineural deafness: evidence for a wide spectrum of inner ear abnormalities. Int. J. Pediatr. Otorhinolaryngol. 2006;70:885–889. doi: 10.1016/j.ijporl.2005.09.025. [DOI] [PubMed] [Google Scholar]
  • 72.Gonzaga-Jauregui C., Harel T., Gambin T., Kousi M., Griffin L.B., Francescatto L., Ozes B., Karaca E., Jhangiani S.N., Bainbridge M.N., Baylor-Hopkins Center for Mendelian Genomics Exome Sequence Analysis Suggests that Genetic Burden Contributes to Phenotypic Variability and Complex Neuropathy. Cell Rep. 2015;12:1169–1183. doi: 10.1016/j.celrep.2015.07.023. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 73.Wu N., Ming X., Xiao J., Wu Z., Chen X., Shinawi M., Shen Y., Yu G., Liu J., Xie H. TBX6 null variants and a common hypomorphic allele in congenital scoliosis. N. Engl. J. Med. 2015;372:341–350. doi: 10.1056/NEJMoa1406829. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 74.Liu J., Zhou Y., Liu S., Song X., Yang X.Z., Fan Y., Chen W., Akdemir Z.C., Yan Z., Zuo Y., DISCO (Deciphering disorders Involving Scoliosis and COmorbidities) Study The coexistence of copy number variations (CNVs) and single nucleotide polymorphisms (SNPs) at a locus can result in distorted calculations of the significance in associating SNPs to disease. Hum. Genet. 2018;137:553–567. doi: 10.1007/s00439-018-1910-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 75.Liu J., Wu N., Yang N., Takeda K., Chen W., Li W., Du R., Liu S., Zhou Y., Zhang L., Deciphering Disorders Involving Scoliosis and COmorbidities (DISCO) study. Japan Early Onset Scoliosis Research Group. Baylor-Hopkins Center for Mendelian Genomics TBX6-associated congenital scoliosis (TACS) as a clinically distinguishable subtype of congenital scoliosis: further evidence supporting the compound inheritance and TBX6 gene dosage model. Genet. Med. 2019;21:1548–1558. doi: 10.1038/s41436-018-0377-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 76.Yang N., Wu N., Zhang L., Zhao Y., Liu J., Liang X., Ren X., Li W., Chen W., Dong S. TBX6 compound inheritance leads to congenital vertebral malformations in humans and mice. Hum. Mol. Genet. 2019;28:539–547. doi: 10.1093/hmg/ddy358. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 77.Posey J.E., Harel T., Liu P., Rosenfeld J.A., James R.A., Coban Akdemir Z.H., Walkiewicz M., Bi W., Xiao R., Ding Y. Resolution of Disease Phenotypes Resulting from Multilocus Genomic Variation. N. Engl. J. Med. 2017;376:21–31. doi: 10.1056/NEJMoa1516767. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 78.Karaca E., Posey J.E., Coban Akdemir Z., Pehlivan D., Harel T., Jhangiani S.N., Bayram Y., Song X., Bahrambeigi V., Yuregir O.O. Phenotypic expansion illuminates multilocus pathogenic variation. Genet. Med. 2018;20:1528–1537. doi: 10.1038/gim.2018.33. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 79.Wang X., Posey J.E., Rosenfeld J.A., Bacino C.A., Scaglia F., Immken L., Harris J.M., Hickey S.E., Mosher T.M., Slavotinek A., Undiagnosed Diseases Network Phenotypic expansion in DDX3X - a common cause of intellectual disability in females. Ann. Clin. Transl. Neurol. 2018;5:1277–1285. doi: 10.1002/acn3.622. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 80.Philippakis A.A., Azzariti D.R., Beltran S., Brookes A.J., Brownstein C.A., Brudno M., Brunner H.G., Buske O.J., Carey K., Doll C. The Matchmaker Exchange: a platform for rare disease gene discovery. Hum. Mutat. 2015;36:915–921. doi: 10.1002/humu.22858. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 81.Veltman J.A., Lupski J.R. From genes to genomes in the clinic. Genome Med. 2015;7:78. doi: 10.1186/s13073-015-0200-0. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

Supplementary Materials

Document S1. Figures S1–S9, Tables S1 and S2, Supplemental Material and Methods, and the Task Force for Neonatal Genomics Consortium List (mmc1.pdf, 2.3 MB)
Table S3. Variants in Research Samples Passing Filtering Criteria (mmc2.xlsx, 882.9 KB)
Table S4. 154 Candidate Mendelian Disease-Associating Genes (mmc3.xlsx, 14.4 KB)
Table S5. Phenocentric Analysis of Genes Enriched for Ultra-rare Variants in Individuals with Hearing-Related Phenotypes (mmc4.xlsx, 25.5 KB)
Document S2. Article plus Supplemental Information (mmc5.pdf, 3.1 MB)

Articles from American Journal of Human Genetics are provided here courtesy of American Society of Human Genetics
