Joint, multifaceted genomic analysis enables diagnosis of diverse, ultra-rare monogenic presentations

Shilpa Nadimpalli Kobren; Mikhail A Moldovan; Rebecca Reimers; Daniel Traviglia; Xinyun Li; Danielle Barnum; Alexander Veit; Rosario I Corona; George de V Carvalho Neto; Julian Willett; Michele Berselli; William Ronchetti; Stanley F Nelson; Julian A Martinez-Agosto; Richard Sherwood; Joel Krier; Isaac S Kohane; Undiagnosed Diseases Network; Shamil R Sunyaev

doi:10.1101/2024.02.13.580158

This is a preprint.

It has not yet been peer reviewed by a journal.

[Preprint]. 2024 Aug 13:2024.02.13.580158. Originally published 2024 Feb 16. [Version 2] doi: 10.1101/2024.02.13.580158

PMCID: PMC10888768 PMID: 38405764

Abstract

Genomics for rare disease diagnosis has advanced at a rapid pace due to our ability to perform “N-of-1” analyses on individual patients with ultra-rare diseases. The increasing size of ultra-rare disease cohorts internationally now enables cohort-wide analyses for new discoveries, but well-calibrated statistical genetics approaches for jointly analyzing these patients are still under development.^1,2 The Undiagnosed Diseases Network (UDN) brings multiple clinical, research and experimental centers under the same umbrella across the United States to facilitate and scale N-of-1 analyses. Here, we present the first joint analysis of whole genome sequencing data of UDN patients across the network. We introduce new, well-calibrated statistical methods for prioritizing disease genes with de novo recurrence and compound heterozygosity. We also detect pathways enriched with candidate and known diagnostic genes. Our computational analysis, coupled with a systematic clinical review, recapitulated known diagnoses and revealed new disease associations. We further release a software package, RaMeDiES, enabling automated cross-analysis of deidentified sequenced cohorts for new diagnostic and research discoveries. Gene-level findings and variant-level information across the cohort are available in a public-facing browser (https://dbmi-bgm.github.io/udn-browser/). These results show that N-of-1 efforts should be supplemented by a joint genomic analysis across cohorts.

Introduction

For decades preceding the widespread application of DNA sequencing, identifying the genetic etiology of rare monogenic phenotypes including human diseases relied on segregation in pedigrees.³ DNA sequencing enabled the analysis of sporadic cases with no segregation data.⁴ Early studies analyzed small cohorts of phenotypically similar cases,^5,6 a highly successful approach that is, however, limited to diseases with multiple known patients with fairly homogeneous presentations. In the absence of such phenotypically matched case cohorts, N-of-1 studies of undiagnosed patients are gaining popularity.^7–10 By design, these studies cannot attain statistical power from the shared genotypes of unrelated patients and require extensive clinical and biological inquiry to prove the causal involvement of the genotype in disease.^11–13 The most recent phase of human Mendelian genetics employs a data science approach to gene discovery propelled by the joint genomic analysis of phenotypically broad cohorts. Recent studies by the Deciphering Developmental Disorders and 100,000 Genomes consortia have demonstrated the power of this approach to identify new diagnoses and disease genes.^1,14 This opens the prospect of international cross-cohort analyses, leveraging parallel efforts in many countries, and appreciating that rare diseases know no borders.

Undiagnosed Diseases Network dataset

Here, we apply existing and newly developed statistical genetics methods to the Undiagnosed Diseases Network (UDN) cohort that includes extremely difficult-to-solve, likely genetic cases (Figure 1a–e). The unique, diagnostically elusive presentation is the only criterion for inclusion, and patients have varied presentations including neurological, musculoskeletal, immune, endocrine, cardiac, and other disorders. Symptom onset ranges from neonatal through late adulthood. In contrast to most existing rare disease cohorts, individuals accepted to the UDN have already undergone lengthy but ultimately unfruitful diagnostic odysseys prior to enrollment. These patients subsequently undergo extensive phenotypic characterization at UDN clinical sites.¹⁵ Both broad Human Phenotype Ontology (HPO) terms and highly detailed clinical notes are collected and made available for all UDN researchers. Phenotypic information includes laboratory evaluations, dysmorphology examinations, specialist assessments, surgical records, and imaging (Figure 1f).

Figure 1. — **(a)** Map of clinical and research sites within the Undiagnosed Diseases Network (UDN) for evaluating patients and candidate variant functionality. **(b)** Genetic ancestry across the sequenced patient cohort. **(c)** Clinician-recorded primary symptom categories of patients. “Multiple” indicates 2+ categories could be considered primary and “other” indicates an unlisted category. Categories marked with an asterisk (*) are neurological subtypes (Supplementary Note S1). **(d)** Patient-reported age of first symptom onset. **(e)** Patient sex. **(f)** Categories and quantity of phenotype information collected for patients and made available to all UDN researchers (icons are from Microsoft PowerPoint). **(g)** Intronic variants detectable from genome sequencing (orange star) with a predicted splice-altering impact are considered alongside exonic variants in our statistical framework; these variants may result in retained introns or excised exons in processed transcripts. **(h)** We consider genes and gene pathways harboring *de novo* and compound heterozygous variants in sequenced trios (72% of cases). Complete case count by family structure (e.g., proband-only, duo) is in Supplementary Figure S2. Other inheritance modes (e.g., homozygous, uniparental disomy) are not considered in our cohort-based framework. **(i)** Depiction of clinical framework to uniformly evaluate how well a patient’s phenotypes are concordant with a candidate gene or variant.

There is a similar emphasis on collecting sequencing data, with whole genomes sequenced for probands and their immediate or otherwise relevant family members. Although smaller than some other rare disease cohorts,² the UDN—with a design bridging clinical, research and functional validation teams and a focus on extreme patient presentations—was thought to be optimized for “N-of-1” analyses, where probands are evaluated on a per-case basis. Patients’ detailed phenotypic information, ongoing confirmation of new diagnoses, and the potential enrichment for novel genetic disorders make for an ideal data space to validate and develop statistical approaches. We harmonized and jointly called single nucleotide (SNV) and insertion/deletion (indel) variants across 4,236 individuals with whole genome sequencing in the UDN dataset and additionally called de novo mutations from aligned reads across complete trios (Methods, Supplementary Figure S1).¹⁶

Clinical Evaluation of Computational Findings

Here, we generate candidate gene–patient matches via a series of statistical genomic analyses implemented in our software suite, Rare Mendelian Disease Enrichment Statistics (RaMeDiES, Figure 1g,h). We focus on the model of monogenic, autosomal inheritance in de novo and compound heterozygous cases to prioritize candidates via a genotype-first approach, with no clinical input or phenotypic information used. Each candidate is then evaluated with respect to the patient’s clinical presentation and the gene and variant’s putative role in disease—based on known disease associations, functionality in model organisms, tissue expression, molecular function, evolutionary constraint, and in silico predicted pathogenicity—to assess phenotypic match (Figure 1i). For genes or gene pathways harboring deleterious variants across multiple individuals, phenotypic similarity between patients is also assessed. To scale clinical evaluation to the cohort level, we developed a semi-quantitative protocol guided by the ClinGen framework¹⁷ that uses hierarchical decision models to increase efficiency and enables consistent and comparable evaluations of a gene–patient diagnostic fit by independent experts (Supplementary Note S2, Supplementary Figure S3). We calibrated the protocol during development by testing whether the resulting clinical scores assigned by different experts on the clinical team were in agreement. We validated the protocol in a blind test using non-causative candidate genes as controls. Specifically, non-causative genes were selected with identical criteria to true candidate genes except biallelic variants were in cis rather than in trans or had low predicted pathogenicity scores. The clinical team applying the protocol consistently scored true candidate genes higher than control genes (Wilcoxon one-sided rank-sum p-value = 0.0171, Methods, Supplementary Table S1), suggesting that the scores generated by the clinicians’ protocol may be used to prioritize candidates.
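To make the blind validation concrete, the following is a minimal sketch of an exact one-sided rank-sum test applied to clinical scores. The scores are hypothetical, the helper names `midranks` and `rank_sum_greater` are illustrative and not part of RaMeDiES, and exact enumeration is only practical for small groups; the study's own computation may treat ties or approximations differently.

```python
from itertools import combinations

def midranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks

def rank_sum_greater(case_scores, control_scores):
    """Exact one-sided p-value that case scores tend to exceed control scores."""
    pooled = list(case_scores) + list(control_scores)
    ranks = midranks(pooled)
    n_case = len(case_scores)
    observed = sum(ranks[:n_case])
    # Enumerate every way of labelling n_case of the pooled scores as "case".
    total = 0
    as_extreme = 0
    for subset in combinations(ranks, n_case):
        total += 1
        if sum(subset) >= observed - 1e-12:
            as_extreme += 1
    return as_extreme / total

# Hypothetical clinical scores, for illustration only.
true_candidates = [7, 8, 6, 9, 8]
control_genes = [3, 5, 4, 6, 2]
print(rank_sum_greater(true_candidates, control_genes))
```

A small p-value here would indicate that reviewers rank true candidates above controls more often than chance labelling would.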

Results

De novo analysis

Several highly penetrant, extreme phenotypic presentations underlying Mendelian and other congenital, complex human diseases have been linked to de novo mutations.^1,18,19 We began by evaluating all independent, sporadic trios with complete sequencing data for de novo mutation etiologies. We detected 78.3 de novo point mutations and 9.5 de novo indels on average per proband genome, concordant with expectation.²⁰ Mutation counts showed the expected dependency on parental ages, with Poisson-distributed adjusted counts, attesting to the quality of de novo calling (Figure 2a, Supplementary Figure S4).
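One simple way to probe the Poisson claim is to compare the variance of the age-adjusted counts to their mean, since the two are equal for a Poisson variable. The sketch below assumes counts have already been adjusted for parental age; the counts and helper names are hypothetical, and this is not necessarily the check performed in the study.

```python
import math
from statistics import mean, variance

def dispersion_index(counts):
    """Sample variance divided by sample mean; close to 1 for Poisson data."""
    return variance(counts) / mean(counts)

def poisson_pmf(k, lam):
    """Poisson probability of k events at rate lam, computed in log space."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

# Hypothetical age-adjusted de novo counts for a handful of probands.
adjusted_counts = [71, 80, 77, 86, 69, 82, 90, 75, 78, 74]
fitted_rate = mean(adjusted_counts)
print(f"mean = {fitted_rate:.1f}")
print(f"dispersion index = {dispersion_index(adjusted_counts):.2f}")
print(f"P(count = 78 | Poisson fit) = {poisson_pmf(78, fitted_rate):.4f}")
```

A dispersion index well above 1 would point to overdispersion, for example from calling artifacts or unmodelled covariates.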

Figure 2. — **(a)** *De novo* mutation counts per proband adjusted for parental ages. Blue vertical lines show the mean values of the distributions, and curves represent the Poisson fits. **(b)** Schematic of analytical test for the recurrence of *de novos* that considers distal splice-altering and exonic SNV and indel variants, their variant functionality scores, a genome-wide mutation rate model Roulette, and per-gene GeneBayes constraint values. “Like” variants refer to those of the same variant class (i.e., coding SNVs [CS], coding indels [CI], intronic SNVs [IS], intronic indels [II]) and within the same functionality score and minor allele frequency thresholds. **(c)** Genes with highest significance values for *de novo* recurrence across the cohort when focusing on missense variants with AlphaMissense and PrimateAI-3D scores; patients are represented as colored circles. Complete gene list can be found in Supplementary Table S2. **(d)** AlphaFold-predicted human *LRRC7* protein structure (AF-Q96NW7-F1) covering the leucine-rich repeat region with high predicted structural confidence (amino acid positions 86–463). The fifth and eighth LRR domains where missense *de novos* were found are highlighted in blue. Reference alleles for missense *de novo* variants observed across two UDN patients (red) are shown in circles. A depiction of *LRRC7*’s linear protein sequence (Ensembl ID ENSP00000498937) with InterPro predicted domains shown in colored boxes is below. **(e)** Overlap of phenotype terms annotated to each patient.

We then sought to identify genes enriched for deleterious de novo mutations across our patient cohort. The power of this enrichment calculation increases with better models of underlying mutation rates and estimates of variant deleteriousness. Recently, the rate of de novo emergence has been estimated at basepair resolution with a high degree of accuracy.²¹ Newly developed deep learning models for predicting the pathogenicity of de novo and other variants also now exhibit unprecedented accuracy in distinguishing disease-relevant variants.^22,23 We leverage these recent advances to build an accurate, unbiased statistical procedure called RaMeDiES-DN to detect genes enriched for deleterious de novos.

Unlike the earliest generation of de novo recurrence approaches, which leveraged Poisson approximations for runtime efficiency but could not take advantage of improved deleteriousness scores and mutation rate models,¹⁸ RaMeDiES methods seamlessly incorporate per-variant deleteriousness scores and mutation rates without sacrificing runtime. Briefly, for a given observed variant in a gene, we define its “mutational target” as the sum of per-variant de novo mutation rates for all possible variants with an equal or higher deleteriousness score. By construction, this per-variant mutational target is expected to be a uniformly distributed statistic (Supplementary Note S3). Our framework naturally combines different variant types, including SNVs and indels, each with a distinct mutation rate model, and can interchangeably utilize various deleteriousness scores (Figure 2b, Methods). Although current state-of-the-art de novo recurrence approaches also incorporate relevant variant-level information, they rely on a complex permutation procedure.¹ RaMeDiES’ analytical approach eliminates the need for permutation-based significance calculations and can process large datasets in mere seconds while maintaining well-calibrated p-values (Supplementary Figure S5). Furthermore, RaMeDiES’ operation at the level of mutational targets enables sharing of intermediate statistics across cohorts without revealing patients’ individual variants.
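The mutational target defined above can be sketched in a few lines. This is an illustration under stated assumptions, not the RaMeDiES implementation: the gene, its mutation rates, and its scores are hypothetical, and dividing by the gene's total mutation rate is an assumed normalization chosen so that the result lies in (0, 1].

```python
def mutational_target(possible_variants, observed_score):
    """Sum mutation rates of all possible variants scoring >= observed_score."""
    return sum(rate for rate, score in possible_variants if score >= observed_score)

def normalized_target(possible_variants, observed_score):
    """Mutational target as a fraction of the gene's total mutation rate."""
    total_rate = sum(rate for rate, _ in possible_variants)
    return mutational_target(possible_variants, observed_score) / total_rate

# Hypothetical (mutation_rate, deleteriousness_score) pairs for one toy gene.
toy_gene = [(2e-8, 0.10), (1e-8, 0.35), (3e-8, 0.60), (1e-8, 0.85), (3e-8, 0.95)]

for _, score in toy_gene:
    print(score, round(normalized_target(toy_gene, score), 3))
```

If mutations land in proportion to their rate, the chance of observing a normalized target of at most t equals t at every attainable value, which is the sense in which the statistic behaves like a uniform variable. Because only the target is needed downstream, it can be shared without disclosing which variant produced it.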

We first focus on the subset of missense variants, which comprise a sizable proportion of known Mendelian disease-causing variants and for which new, specialized pathogenicity predictions exist (e.g., PrimateAI-3D and AlphaMissense).^22–24 We find one significant gene, KIF21A, corresponding to the correct, complete diagnosis in one patient and a strong partial diagnosis in one other (Bonferroni-adjusted Cauchy-combined p-value < 0.05, Figure 2c). Notably, disease genes with a de novo mode of inheritance are expected to be under strong selection against heterozygous loss-of-function variants. We further refine our method to incorporate this intuition by prioritizing genes by their GeneBayes values, which indicate selection against heterozygous protein-truncating variants, using a weighted false discovery rate (FDR) procedure.^25–27 With this correction, we obtain three gene findings at an equivalent significance threshold (Q-value < 3e-6) and eight gene findings at FDR 5% (Supplementary Table S2). Our second and third gene hits, BAP1 and RHOA, correspond to a known correct diagnosis in one patient and strong clinical matches in two other patients. Among the five remaining genes at FDR 5%, three genes (CACNA1C, COL4A1 and NOTCH1) correspond to known diagnoses in five patients and the top clinical candidate in one patient. Two impacted patients with de novo missense variants in the leucine-rich repeat region of LRRC7, a gene not yet known to be disease-associated, had phenotypic overlap of hypotonia and developmental delay; one patient additionally experienced nystagmus, staring spells, and balance problems and the second had ataxic gait (Figure 2d–e). These findings and LRRC7’s expression in the brain further support its link to an emerging neurodevelopmental disorder.¹⁴ Another gene, NRBP1, remains a strong candidate in two patients due to their neurological phenotype overlap and NRBP1’s expression in the brain. 
An initial functional study in fly through the UDN Model Organism Screening Core was inconclusive. This gene has been submitted to Matchmaker Exchange.
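The "Bonferroni-adjusted Cauchy-combined p-value" used above can be illustrated with the standard Cauchy combination test, which merges p-values obtained from different deleteriousness scores for the same gene. The sketch assumes equal weights, and the per-score p-values and the number of tested genes are hypothetical; RaMeDiES may weight or adjust differently.

```python
import math

def cauchy_combine(p_values, weights=None):
    """Combine p-values with the Cauchy combination test (weights sum to 1)."""
    if weights is None:
        weights = [1.0 / len(p_values)] * len(p_values)
    statistic = sum(
        w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, p_values)
    )
    return 0.5 - math.atan(statistic) / math.pi

def bonferroni(p_value, n_tests):
    """Bonferroni adjustment, capped at 1."""
    return min(1.0, p_value * n_tests)

# Hypothetical per-score p-values for one gene.
per_score_p = {"AlphaMissense": 2.0e-8, "PrimateAI-3D": 5.0e-7}
combined = cauchy_combine(list(per_score_p.values()))
adjusted = bonferroni(combined, n_tests=18000)  # hypothetical gene count
print(combined, adjusted)
```

The combination remains valid when the underlying scores are correlated, which is the usual reason to prefer it over multiplying or summing p-values from overlapping predictors.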

We next consider all exonic variants, including nonsense variants and indels, and further incorporate additional well-established deleteriousness predictors, CADD and REVEL.^28,29 Different mutagenesis processes lead to indel mutations, so SNV mutation rate models can be inappropriate for modeling this mutation type for some genes.³⁰ We therefore constructed a separate per-gene mutation rate approximation for indels (see Methods for details). When we reran RaMeDiES-DN on all exonic variants using four deleteriousness predictors, we additionally identified KMT2B (Bonferroni-adjusted Cauchy-combined p-value < 0.05), corresponding to a correct diagnosis in four patients due to de novo indel variants (Supplementary Table S3, Supplementary Figure S6a). The next seven gene findings at FDR 5% were all identified when assessing recurrence of missense variants. At FDR 10%, we identify five new putative diagnoses. For instance, two patients had high impact missense de novo variants impacting H4C5, a histone gene that was not detected with significance in our missense-only enrichment test due to its lack of precomputed AlphaMissense scores. Both patients had infantile-onset gross motor developmental delays, dysmorphic facial features, and speech difficulties (Supplementary Figure S6b,c). These and other phenotypes exhibited by each patient were recently found to be linked to missense variants in histone H4 genes.³¹ For one of the patients, the de novo variant was contemporaneously interpreted by UDN clinical experts to be causal.³² The second patient’s de novo variant has now been reclassified as “pathogenic” and resulted in a new diagnosis for this participant. Two other patients with sporadic neurodevelopmental delay each harbor truncating de novo variants in ZNF865. 
Both patients have phenotypic overlap with a series of 10+ other patients with ZNF865 mutations, which makes a compelling case for pathogenicity.³³ Subsequent to the publication of the case series, we anticipate that this gene–disease relationship will be established as causal and that both variants will be reclassified as likely pathogenic.
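The FDR thresholds quoted in these results come from a weighted procedure in which constrained genes are prioritized. One common form is the weighted Benjamini-Hochberg procedure, sketched below: each p-value is divided by a gene weight (rescaled to average 1) before the usual step-up adjustment. That this matches the cited procedure exactly is an assumption, and the p-values and weights are hypothetical stand-ins for values derived from gene constraint.

```python
def weighted_bh(p_values, weights):
    """Return weighted Benjamini-Hochberg q-values, in input order."""
    m = len(p_values)
    mean_weight = sum(weights) / m
    # Rescale weights to average 1, then down-weight p-values of favored genes.
    scaled = [
        min(1.0, p * mean_weight / w) if w > 0 else 1.0
        for p, w in zip(p_values, weights)
    ]
    order = sorted(range(m), key=lambda i: scaled[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Step-up pass from the largest scaled p-value to the smallest.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, scaled[index] * m / rank)
        q_values[index] = running_min
    return q_values

# Hypothetical genes: equal p-values, different constraint-derived weights.
p_values = [0.01, 0.01, 0.20, 0.60]
weights = [1.8, 0.6, 1.0, 0.6]
print(weighted_bh(p_values, weights))
```

With equal weights the function reduces to the ordinary Benjamini-Hochberg adjustment; with unequal weights, a highly constrained gene reaches a smaller q-value than an unconstrained gene with the same p-value.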

Inclusion of deep intronic splice variants

Next, we demonstrate how RaMeDiES-DN can be extended to additionally consider non-exonic variants uncovered uniquely from whole genome sequencing using the same methodological infrastructure. On the one hand, it remains challenging to identify non-coding regulatory variants involved in rare Mendelian diseases,³⁴ and the overall role of such variants in congenital disorders is still a subject of debate.³⁵ On the other hand, distal gain-of-splice site mutations creating new acceptor or donor splicing sites deep in the intronic sequences of genes are now a well-recognized cause of monogenic disease.³⁶ Identification of splice-altering variants directly from genome sequencing data has recently become possible using newly developed in silico predictive scores without relying on RNA sequencing. RNA sequencing has limitations for diagnosis because it depends on the availability of relevant tissue material that is especially challenging to obtain for neurodevelopmental patients, and it may miss lowly expressed isoforms and those targeted by nonsense-mediated decay.³⁷ Moreover, identifying disease-causal intronic splice variants is especially appealing due to their potential targetability using antisense oligonucleotide therapies.³⁸

Unlike functional predictions for exonic variants, which have been extensively validated for consistency and accuracy via decades of experimental in vitro and in vivo studies, functional predictions of splice-altering intronic variants are relatively new and still require experimental confirmation. We used a combined computational–experimental approach to prioritize distal splice variants using in silico predicted scores and an in vitro massively parallel splicing reporter assay (Methods, Supplementary Figure S7).^39,40 We found the per-variant in silico predictions to be mostly concordant with the in vitro assay readouts. Variants assigned higher in silico scores are more frequently supported by the experimental in vitro assay, and those with relatively lower in silico scores (SpliceAI < 0.5) have a non-negligible validation rate as well (Supplementary Figure S8). This prompted us to incorporate the full range of continuous SpliceAI scores, disregarding only the lowest scoring variants, in our statistics. We found this approach to considering distal splice-site variants attractive because it lends itself to a statistical analysis alongside exonic variants. Once genome-wide functionality score tracks are released for the next generation of splice predictors as well (e.g., Pangolin),⁴¹ they can be integrated into RaMeDiES using the same methodology leveraged for exonic variant predictors.

No new candidate genes with a significant recurrence of intronic de novos were found in the UDN dataset. However, by seamlessly incorporating non-exonic variants within the same statistical test, our approach enables a more complete, automated analysis of the growing volume of whole genome sequencing data across rare disease consortia.

We also ran the state-of-the-art de novo enrichment approach, DeNovoWEST.¹ Unlike our approach, DeNovoWEST incorporates a gain-of-function model alongside a loss-of-function model, which has the potential to yield additional findings. We equipped the DeNovoWEST algorithm with the Roulette mutation rate model, up-to-date CADD variant deleteriousness and s_het gene constraint scores,²⁶ and further incorporated deep intronic variants with predicted splice-altering impact (Supplementary Figure S9). This approach yielded two Bonferroni-significant genes, one of which was also uncovered by RaMeDiES-DN at Bonferroni significance and the second at an FDR of 6% (KMT2B and H4C5, Supplementary Figure S10). We did not apply an FDR-based approach to DeNovoWEST’s results to consider additional gene findings, because DeNovoWEST p-values are a construct over three sometimes dependent tests, rendering an FDR adjustment inappropriate. We also find CSMD1, a highly indel-prone gene, within DeNovoWEST’s top-ranked five genes, likely because indels and SNVs are not distinguished in the mutation rate model.⁴²

Compound heterozygous variant analysis

We next evaluate compound heterozygous (comphet) variants, which are the most likely cause of rare recessive disorders in populations with low degrees of consanguinity, as is largely the case in the United States.⁴³ Comphet variants are defined as a pair of distinct alleles landing within the same gene and inherited in trans from unaffected parents who are also heterozygous at these loci. These inherited disease-causing variants tend to be rare in the population, due to the effect of selection against biallelic variant occurrences or against slightly deleterious phenotypes of heterozygous variants.⁴⁴ Despite the expected low frequency of individual alleles comprising a comphet pair, directly selecting for highly deleterious comphet variants still results in numerous false positive findings at the cohort level, motivating a statistical approach for cohort-level comphet prioritization. Developing a statistical framework analogous to de novo recurrence requires modeling the distribution of rare inherited alleles per individual. De novo mutations arise through the universal process of mutagenesis and are therefore straightforward to model. Similarly, the distribution of the total number of all derived alleles per haploid genome (i.e., all non-ancestral variants inherited from one parent without any imposed frequency constraints) is also not dependent on the demographic history of the population and therefore is straightforward to model.^45,46 In contrast, however, the distribution of the total number of rare alleles per individual is highly dependent on population structure, which is notoriously difficult to account for. Some previous approaches for determining cohort-level significance of comphet variants ignore population structure when modeling the number of rare variants. Although this may be an accurate statistical test in controlled model organism cross experiments, it is inappropriate for natural human populations, where population structure is present even at a very fine scale.⁴⁷ In the Genome of the Netherlands (GoNL) dataset for instance, the number of synonymous singletons across unrelated individuals still reflects geographic structure along a south-north cline.⁴⁸

In our framework, we sidestep directly modeling the distribution of rare variant counts per individual and instead condition on the observed number of rare variants inherited from each parent using trio-level data. Given the number of rare variants inherited from each parent per individual, we then compute the probabilities of comphet variants landing in high-scoring positions in the same gene across the cohort. Although the positions where inherited variants land are influenced in part by direct and background selection and biased gene conversion, for very rare variants, the effect of these factors is negligible compared to the effect of the variation in mutation rate along the genome and the overall gene target size.^21,49 We therefore model the positional distribution of rare inherited variants using the same Roulette basepair-resolution de novo mutation rate model leveraged in our de novo recurrence model. Our comphet recurrence model, called RaMeDiES-CH, relies on the comphet mutational target, computed for each comphet variant pair and defined similarly to the de novo mutational target previously introduced. Specifically, the comphet mutational target is computed as the total squared mutation rate of all possible variants with higher functionality scores (Figure 3a). RaMeDiES-CH applies the Cauchy p-value combination approach as before to leverage multiple variant-level functionality scores while considering exonic and intronic variants, but does not incorporate gene constraint scores, which do not exist for recessive selection (Methods, Supplementary Figure S11).⁵⁰ RaMeDiES-CH computes well-calibrated per-gene p-values for comphet variants in a cohort (Supplementary Figure S5).

Figure 3. — **(a)** Illustration of the unnormalized squared mutational target computed for each observed comphet variant in a gene across the cohort (RaMeDiES-CH, Supplementary Figure S11) or in an individual across the genome (RaMeDiES-IND, Supplementary Figure S12). “Like” variants refer to those of the same variant class (i.e., coding SNVs [CS], coding indels [CI], intronic SNVs [IS], intronic indels [II]) and within the same functionality score and minor allele frequency thresholds. **(b)** Top ranked genes resulting in the best enrichment statistic computed for RaMeDiES-IND. Putative candidates refer to genes that remain candidates for pathogenicity due to their phenotypically-relevant tissue expression, but where there is not enough functional evidence or published gene–disease relationships to establish causality at this time. **(c)** Overlap between phenotypes associated with *MED11* and those exhibited by the affected patient. **(d)** RNA-Seq reads from whole blood samples aligned to first two exons and first intron of *MED11* for proband (black), dad (blue), mom (purple) and two tissue-matched control samples (gray). Thin green line represents the intron, solid boxes represent protein-coding exonic regions, and the dotted box represents the 5’ untranslated region of *MED11*. **(e)** Proband exhibits significant retention of the first intron relative to parents and fifty-three tissue-matched control samples. Intron retention ratio is calculated as the (median read depth of first intron) / (number of reads spanning first and second exons + median read depth of first intron).
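The intron retention ratio defined in panel (e) translates directly into code. The sketch below assumes per-base read depths across the intron and a count of reads spanning the flanking exons are already available; the numbers and the function name are hypothetical.

```python
from statistics import median

def intron_retention_ratio(intron_depths, exon_spanning_reads):
    """Median intron depth / (exon-spanning reads + median intron depth)."""
    intron_median = median(intron_depths)
    denominator = exon_spanning_reads + intron_median
    if denominator == 0:
        return float("nan")  # no coverage at all, so the ratio is undefined
    return intron_median / denominator

# Hypothetical per-base depths over the first intron and spliced read counts.
proband_ratio = intron_retention_ratio([18, 22, 20, 19, 21], exon_spanning_reads=30)
control_ratio = intron_retention_ratio([1, 0, 2, 1, 1], exon_spanning_reads=60)
print(proband_ratio, control_ratio)
```

A ratio near 0 means almost all transcripts are spliced across the intron, while values approaching 1 indicate that retained-intron reads dominate.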

Across the set of non-consanguineous UDN families, we did not find significant recurrent comphet occurrences across genes. This result is unsurprising, as previous estimates suggest that in panmictic disease populations, only one deleterious comphet variant is expected for every five dominant de novos.⁴⁷ Nevertheless, RaMeDiES-CH represents an accurate and unbiased statistical test for the recurrence of comphet variants in human populations, which can be applied to reveal new diagnoses as sequenced rare disease datasets expand.

We suspected that singleton disease-causing comphet variants were still present in the cohort. We adapted our statistical framework to compute an individual-based statistic, RaMeDiES-IND, that normalizes each observed comphet variant mutational target across all genes in the genome rather than across all individuals in a cohort (Supplementary Figure S12). This approach yielded a ranked list of patient–gene pairs across the UDN cohort, where each patient–gene pair could be annotated as corresponding to a correct diagnosis or otherwise (Supplementary Table S4). We computed a single enrichment statistic for this overall patient–gene ranking, which simultaneously suggested a threshold for clinical consideration of findings, as the best Fisher’s exact test P achieved across all positions in the list. This enrichment statistic was significant when compared to the distribution of the same statistic computed across 10,000 random shuffles of the patient–gene list (permutation p-value = 0.001, Methods, Supplementary Figure S13). Among the top thirteen hits yielding this best enrichment statistic, we recapitulated five known diagnoses (i.e., NEUROG3, PAH, COX20, NDUFAF8, PRDX3)^51,52 and newly identified the genomic cause of a known biochemical diagnosis (i.e., ACADM in a patient with MCAD deficiency). We also identified comphet variants in MED11 which are now leading diagnostic candidates in an undiagnosed patient experiencing neurodegeneration, developmental delay, brain abnormalities, chorea, and hypotonia (Figure 3c). MED11 is associated with epilepsy and intellectual disability, and this patient’s presentation could represent a phenotypic expansion of this known disorder.⁵³ Both inherited variants occur deep in the first intron of MED11, a region that would be missed by exome-only sequencing or analysis, and are predicted to cause cryptic splice donor gains. 
Transcriptome (RNA) sequencing of blood samples from the affected patient and both parents highlighted a significantly higher rate of first intron retention in the affected patient relative to both parents and to fifty unrelated blood control samples (Figure 3d–e, Supplementary Figure S14).⁵⁴
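The enrichment statistic for the ranked patient–gene list (the best Fisher's exact test p-value over all cutoffs, compared against shuffled lists) can be sketched as follows. The labels are hypothetical, the one-sided form of the test and the add-one permutation estimate are assumptions, and the function names are illustrative rather than taken from RaMeDiES.

```python
import math
import random

def hypergeom_tail(hits, population, successes, draws):
    """P(X >= hits) for a hypergeometric X; equals one-sided Fisher's exact p."""
    upper = min(successes, draws)
    numerator = sum(
        math.comb(successes, i) * math.comb(population - successes, draws - i)
        for i in range(hits, upper + 1)
    )
    return numerator / math.comb(population, draws)

def best_enrichment(labels):
    """Smallest tail p-value over all top-k cutoffs, with the cutoff k."""
    n_total, n_known = len(labels), sum(labels)
    best_p, best_k, hits = 1.0, 0, 0
    for k in range(1, n_total):
        hits += labels[k - 1]
        p = hypergeom_tail(hits, n_total, n_known, k)
        if p < best_p:
            best_p, best_k = p, k
    return best_p, best_k

def permutation_p(labels, n_shuffles=1000, seed=0):
    """Share of shuffled lists whose best p-value is at least as small."""
    rng = random.Random(seed)
    observed, _ = best_enrichment(labels)
    shuffled = list(labels)
    as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if best_enrichment(shuffled)[0] <= observed:
            as_extreme += 1
    return (as_extreme + 1) / (n_shuffles + 1)

# Hypothetical ranking: 1 marks a known diagnosis, 0 anything else.
ranked_labels = [1, 1, 0, 1, 1, 0, 1] + [0] * 33
print(best_enrichment(ranked_labels), permutation_p(ranked_labels))
```

Because the cutoff is chosen after looking at every position, the minimum p-value is not itself a valid p-value; the shuffling step is what calibrates it. The selected cutoff doubles as a suggested depth for clinical review.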

Our comphet models do not generalize to rare homozygous variants (Supplementary Note S4). However, due to low levels of consanguinity in the UDN cohort, we do not expect homozygous recessive variants to underlie a substantial portion of diagnoses in this dataset.⁴⁷

Pathway analysis

Genes involved in the same pathway are frequently involved in similar phenotypic presentations.^55–58 This provides an enticing possibility of drawing statistical power from multiple independent occurrences of deleterious variants in the same functional units, rather than just in the same genes. Moreover, therapeutics for disorders of the same functional unit that are individually too rare to meet minimal participant requirements for clinical trials may be evaluated together within the same umbrella or basket trial for more efficient approval.⁵⁹ However, such an approach should be pursued with caution, as the phenotypes stemming from perturbations of different genes in the same functional unit may vary to a great extent. Such differences in patient presentations may render the clinical evaluations and therapeutic potential of statistically significant findings virtually impossible. To mitigate this issue, we first consider groups of patients with similar phenotypes, and then within each of these groups, assess the overrepresentation of deleterious mutations across established biological pathways (Figure 4a).

Figure 4. — **(a)** Schematic illustrating the two-step process of first clustering patients according to the semantic similarity of their phenotype terms and second finding enriched biological pathways among the genes within each patient cluster. **(b)** The most significant pathways per cluster (adjusted p-value < 0.01) with 1+ genes from 1+ undiagnosed patients; complete list in Supplementary Table S6. **(c)** Two patients with primarily immune-related symptoms each harbored a compelling *de novo* variant in genes involved in immunoproteasome assembly (*POMP*) and structure (*PSMB8*). Their symptoms strongly overlap, and a subset of these symptoms were also known to be associated with either gene in OMIM. **(d)** Three neurological patients had variants in transmembrane genes involved in the same pathway. These patients had substantial phenotypic overlap with each other, as expected, and with the phenotypes associated with each of their genes (depicted as star shapes in the upset plot).

We start by clustering 2,662 affected patients—with or without sequencing data—into 120 groups (median = 17, min = 2, max = 97 patients per cluster) based on the semantic similarity of their phenotype terms. Within each cluster, we then combine our de novo candidates, compound heterozygous candidates and known UDN diagnoses and perform gene set enrichment analysis (Methods, Supplementary Table S5). We focus our attention on undiagnosed cases with de novo or compound heterozygous candidates within enriched pathways in each cluster (Figure 4b). We also report all enriched pathways including those with only diagnosed patients for potential therapeutic grouping (Supplementary Table S6).
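The per-cluster enrichment step can be sketched as a hypergeometric over-representation test followed by Benjamini-Hochberg adjustment. This is a generic illustration under stated assumptions: the gene set enrichment tool actually used is specified in the Methods and may differ, the background size of 20,000 genes is assumed, and `GENE_X` and the `PATHWAY_MEMBER_*` identifiers are hypothetical placeholders. Only the pathway size (46) and the two named genes come from the text.

```python
from math import comb


def overrepresentation_p(cluster_genes, pathway_genes, background_size):
    """Upper-tail hypergeometric P of observing at least the seen overlap
    between a cluster's candidate genes and one pathway."""
    cluster, pathway = set(cluster_genes), set(pathway_genes)
    overlap = len(cluster & pathway)
    n_draw, n_path = len(cluster), len(pathway)
    numerator = sum(
        comb(n_path, x) * comb(background_size - n_path, n_draw - x)
        for x in range(overlap, min(n_draw, n_path) + 1)
    )
    return overlap, numerator / comb(background_size, n_draw)


def benjamini_hochberg(pvals):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical 46-gene pathway containing the two genes named in the text.
    pathway = {"POMP", "PSMB8"} | {f"PATHWAY_MEMBER_{i}" for i in range(44)}
    cluster_candidates = {"POMP", "PSMB8", "GENE_X"}
    overlap, p = overrepresentation_p(cluster_candidates, pathway, 20_000)
    print(overlap, p)
```

The raw P value from this sketch is not expected to reproduce the adjusted values reported in the text, since those depend on the real background gene set and on how many pathways and clusters were tested.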

Two of the three candidate genes in one cluster of 19 patients with immunological disorders are involved in the immunoproteasome complex (KEGG:03050, n = 46, adjusted p-value = 4.42e-3). One patient’s genome contained a known diagnostic, de novo frameshift variant in POMP, an immunoproteasome chaperone protein.⁶⁰ An undiagnosed patient with evidence of chronic inflammation, recurrent infections, and skin lesions had a de novo missense variant in PSMB8, a component of the immunoproteasome β-ring with overlapping phenotypic associations (OMIM:256040). Both patients had similar combined immunodeficiency beyond what was captured in their standardized phenotype terms, including decreased global antibodies, decreased B cells and natural killer cells, and retained T cell functionality (Figure 4c). Disruptions to immunoproteasome assembly and structure have been shown to lead to an accumulation of precursor intermediates, impaired proteolytic activity and subsequent uncontrolled inflammation.⁶¹

In another cluster of 15 similarly presenting neurological patients, three candidate transmembrane genes were represented in the same functional pathway named for some genes’ known involvement in taste transduction (KEGG:04742, n = 85, adjusted p-value = 7.45e-3). Two of these genes, CACNA1C and GABRA3, harbored high impact de novo and hemizygous missense variants respectively, corresponding to known patient diagnoses.^62,63 The genome of another, undiagnosed, now deceased patient from this cluster with no prior candidate variants contained a synonymous de novo variant predicted to alter splicing in another gene in the same functional pathway, HCN4 (Figure 4d). All three patients exhibited seizures at a young age, speech delays, severe hypotonia, spasticity and visual impairment. Mouse knockouts of HCN4 demonstrate neurological phenotypes.^64,65 In humans, HCN4 is expressed in the visual and nervous systems and has recently been associated with infantile epilepsy, suggesting that this patient’s undiagnosed disorder plausibly represents a phenotypic expansion of this gene.^64,65

Discussion

In total, we analyze 886 sporadic or suspected recessive cases with complete trio or quad genome sequencing alongside an additional 463 phenotyped, diagnosed individuals using computational methods to identify de novo recurrence, compound heterozygosity, and pathway enrichment. We establish five new diagnoses and three new putative diagnoses in known disease-causing genes or genes previously unlinked to these patients’ exact presentations. Our prioritization framework for pathway analysis further recapitulates 70 known de novo and 10 known comphet diagnoses and suggests 82 de novo and eight comphet candidates for follow-up (Methods, Supplementary Table S5).

In the field of common disease genetics, statistical inference of disease-associated genomic loci is confidently regarded as primary evidence for their causality. Rare disease genetics, in contrast, is in a transition state. Due to a lack of large disease-matched cohorts, N-of-1 analyses relying heavily on detailed patient phenotyping and clinical intuition have typically been used to generate candidate variant hypotheses. Evidence required to shift these variants from uncertain significance to known pathogenic status comes from experimental, functional studies and by identifying additional, unrelated, genotype-matched individuals with similar phenotypes through variant matchmaking services such as MatchMaker Exchange.^66,67 Recently, analyses of large, broadly-phenotyped cohorts of N-of-1 patients have demonstrated the potential for statistical approaches to reveal diagnoses and generate new gene discoveries in the rare disease space as well.^1,2,68

Although the genome is a big place, it is also a finite space with respect to gene regions impacted by simple variants such as SNVs and short (<10 basepairs) indels. This suggests that, in theory, recurrence-based statistical methods applied to sufficiently large sequenced cohorts of rare disease patients, even those with diverse phenotypic presentations like the UDN, will enable the eventual discovery of all causes of prenatally viable monogenic disease stemming from these variant types. In order to take statistical discoveries as primary evidence, as is the case for common diseases, we need accurate, well-calibrated statistical methods.⁶⁹ Even slight model misspecification may propagate and exacerbate the rate of false discoveries. The rapid growth of genomic datasets on which these models may be applied, coupled with an ongoing difficulty in phenotyping patients at scale to confirm findings,⁷⁰ further increases the urgency for more rigorous models.

Here we show that well-calibrated statistical models can be built for both de novo and compound heterozygous modes of inheritance. Although novel disease–gene discovery from large, phenotypically and genetically homogeneous cohorts has been demonstrated, we show here that rigorous analysis of a diverse, moderately sized disease cohort at the gene and pathway levels also holds promise.

We also acknowledge the limitations of our models and of statistical approaches in general for comprehensive rare disease diagnosis using short-read sequencing data. First, although our models integrate non-coding variants with predicted splice-altering impacts, they do not consider potentially functional variants within whole genome data that fall into untranslated gene regions, RNA-coding genes or between genes, as genome-wide tracks of verifiable deleteriousness scores do not exist for these variant types. Improved, precomputed deleteriousness scores for these variants will be beneficial for interpretation efforts in general and can be leveraged in future iterations of RaMeDiES. Our statistical analysis also does not consider structural, large indel, copy number, or tandem repeat variants, as their identification from short-read sequencing data is computationally expensive and often inaccurate. Investing in the detection of these variants from available data is difficult to justify given the advent of affordable long-read sequencing technologies and ongoing efforts to generate these data within the UDN and elsewhere, which should enable improved identification and analysis of pathogenic complex variants.^71,72 Developing a statistical model for these variants will still require accurate mutation rate estimates for these variant types, which are lacking. GnomAD-SV represents a promising iteration toward this goal, but is still highly dependent on its specific variant calling pipeline and data rather than on biological mutagenic processes.⁷³

The presented method considers only autosomal de novo and compound heterozygous inheritance patterns due to complications in modeling other disease-relevant inheritance patterns. First, it is difficult to propose a statistical model for biallelic variant counts in consanguineous and founder populations, including homozygous variants, because these counts strongly depend on the ancestral population history and inbreeding patterns. A more appropriate statistical approach for assessing recurrence of these variants would be the extension of parametric linkage applied to very large cohorts.⁷⁴ Second, inclusion of hemizygous or other X-chromosome variants requires accurate sex-chromosome variant calling, which is notoriously error prone, as well as an accurate mutational model of the X chromosome, which is complicated due to sex-dependent selection and random X-inactivation. Finally, although we do not model parental mosaicism or uniparental disomy in our recurrence statistics, these inheritance patterns and events are regularly assessed via complementary, traditional “N-of-1” case-based approaches.¹²

Even though genomic sequencing has been liberalized, currently many analyses are still restricted to individual programs, and regulatory and technical barriers prevent sharing individual-level variant data broadly. In contrast, there are avenues for sharing some variant-level data in a way that is easily accessible to clinical geneticists. MatchMaker Exchange, for instance, enables the sharing of specific variants prioritized through N-of-1 analyses with the goal of finding new genotype- and phenotype-matched patients. Broadening the success of MatchMaker Exchange to include variants that may not have risen to the level of strong candidates in N-of-1 analyses is desirable. We developed a browser containing our gene-level findings and variant-level information about rare genetic variation in UDN patients (https://dbmi-bgm.github.io/udn-browser/). In addition, we provide an open-source software package, RaMeDiES, implementing the efficient and well-calibrated statistics for de novo recurrence and deleterious compound heterozygous inference proposed here. RaMeDiES’ operation on shareable summary statistics rather than on variant-level data enables automated, deidentified cross-analysis of substantial existing yet siloed sequenced cohorts for new diagnostic discoveries. As the Mendelian genomics field continues the transition to this new data science phase, the methods we present here should facilitate the exciting prospect of international cross-cohort analyses, resulting in new findings and a vastly improved rare disease diagnostic rate globally.

Supplementary Material

Supplement 1

media-1.pdf (8.2 MB, PDF)

Supplement 2

media-2.xlsx (335.3 KB, XLSX)

Supplement 3

NIHPP2024.02.13.580158v2-supplement-3.pdf (1.2 MB, PDF)

Acknowledgments

The authors would like to thank the following individuals and organizations: Feruza Abraamyan for her contribution in the initial stages of developing the clinical evaluation protocol, Tian Yu for formatting RNA library reads for the MPSA analysis, the Undiagnosed Diseases Network Tool Building Coalition working group for advice on variant calling and sequencing quality metrics, Cecilia Esteves for sequencing file management, Amazon Web Services for complimentary data processing cloud credits, Rafael Aldana and members of the Harvard University Research Computing team for advice in optimizing joint variant calling, Logan Blaine for the initial local run of DeNovoWEST, Vladimir Seplyarskiy and Ryan McGinty for advice regarding mutation rate models, Kaitlin Samocha for critically reviewing our manuscript, and members of the Sunyaev and Kohane research groups for helpful feedback on the manuscript text and figures. This work was funded by the National Institutes of Health (NIH) Common Fund grant U01HG007530, NIH National Institute of Neurological Disorders and Stroke (NINDS) grant U2CNS132415, NIH National Human Genome Research Institute (NHGRI) grants U01HG012009, R01HG012286 and R21HG010391, NIH National Institute of General Medical Sciences (NIGMS) grants R35GM127131 and 5T32GM007748, NIH National Institute of Mental Health (NIMH) grant R01MH101244, NIH National Center for Advancing Translational Sciences (NCATS) grant KL2TR002552, and NIH National Heart Lung and Blood Institute (NHLBI) grant 1R01HL164409-01.

Footnotes

Code Availability

Our software package RaMeDiES is available at https://github.com/hms-dbmi/RaMeDiES.

Ethics & Inclusion Statement

The authors declare no competing interests. This work was performed in accordance with all ethical guidelines outlined in the NIH IRB #15HG0130. The study proposal and manuscript were approved by the UDN Publications and Research Committee. Local researchers from each UDN site supplying data to the UDN, including clinical team members and bioinformaticians participating in the UDN Tool Building Coalition working group, were included. This research is relevant to individual patients and their clinical teams.

Data Availability

Deidentified genome data, transcriptome data, and corresponding phenotype data in the form of HPO terms are regularly deposited in dbGaP (accession phs001232.v5.p2). Genome-wide, rare SNV and indel variants and HPO codes for UDN participants included in this study are queryable in our public-facing browser. Standardized phenotype data and candidate genes and variants are submitted to MatchMaker Exchange. Variant-level data, clinical significance and supporting evidence, demographic information, and phenotype information for all diagnostic variants are regularly submitted to ClinVar. Identifiable patient data are under controlled access to protect patient privacy. Other relevant, deidentified patient-specific clinical information may be shared on a case-by-case basis at the discretion of the corresponding clinical team if it is directly related to diagnosing or potentially treating the patient.

References

1. Kaplanis J. et al. Evidence for 28 genetic disorders discovered by combining healthcare and research data. Nature 586, 757–762 (2020).
2. 100,000 Genomes Project Pilot Investigators et al. 100,000 Genomes Pilot on Rare-Disease Diagnosis in Health Care - Preliminary Report. N. Engl. J. Med. 385, 1868–1880 (2021).
3. Marx J. L. The cystic fibrosis gene is found. Science 245, 923–925 (1989).
4. Roach J. C. et al. Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science 328, 636–639 (2010).
5. O’Roak B. J. et al. Exome sequencing in sporadic autism spectrum disorders identifies severe de novo mutations. Nat. Genet. 43, 585–589 (2011).
6. Jin S. C. et al. Contribution of rare inherited and de novo variants in 2,871 congenital heart disease probands. Nat. Genet. 49, 1593–1601 (2017).
7. Vissers L. E. L. M. et al. A de novo paradigm for mental retardation. Nat. Genet. 42, 1109–1112 (2010).
8. Chong J. X. et al. The Genetic Basis of Mendelian Phenotypes: Discoveries, Challenges, and Opportunities. Am. J. Hum. Genet. 97, 199–215 (2015).
9. Zurek B. et al. Solve-RD: systematic pan-European data sharing and collaborative analysis to solve rare diseases. Eur. J. Hum. Genet. 29, 1325–1331 (2021).
10. Boycott K. M. et al. Care4Rare Canada: Outcomes from a decade of network science for rare disease gene discovery. Am. J. Hum. Genet. 109, 1947–1959 (2022).
11. Hartley T. et al. New Diagnostic Approaches for Undiagnosed Rare Genetic Diseases. Annu. Rev. Genomics Hum. Genet. 21, 351–372 (2020).
12. Kobren S. N. et al. Commonalities across computational workflows for uncovering explanatory variants in undiagnosed cases. Genet. Med. 23, 1075–1085 (2021).
13. Chung H.-L. et al. Loss- or Gain-of-Function Mutations in ACOX1 Cause Axonal Loss via Different Mechanisms. Neuron 106, 589–606.e6 (2020).
14. Greene D. et al. Genetic association analysis of 77,539 genomes reveals rare disease etiologies. Nat. Med. 29, 679–688 (2023).
15. Splinter K. et al. Effect of Genetic Diagnosis on Patients with Previously Undiagnosed Disease. N. Engl. J. Med. 379, 2131–2139 (2018).
16. Mohanty A. K. et al. novoCaller: a Bayesian network approach for de novo variant calling from pedigree and population sequence data. Bioinformatics 35, 1174–1180 (2019).
17. Strande N. T. et al. Evaluating the Clinical Validity of Gene-Disease Associations: An Evidence-Based Framework Developed by the Clinical Genome Resource. Am. J. Hum. Genet. 100, 895–906 (2017).
18. Neale B. M. et al. Patterns and rates of exonic de novo mutations in autism spectrum disorders. Nature 485, 242–245 (2012).
19. Samocha K. E. et al. A framework for the interpretation of de novo mutation in human disease. Nat. Genet. 46, 944–950 (2014).
20. Veltman J. A. & Brunner H. G. De novo mutations in human genetic disease. Nat. Rev. Genet. 13, 565–575 (2012).
21. Seplyarskiy V. et al. A mutation rate model at the basepair resolution identifies the mutagenic effect of polymerase III transcription. Nat. Genet. 55, 2235–2242 (2023).
22. Gao H. et al. The landscape of tolerated genetic variation in humans and primates. Science 380, eabn8153 (2023).
23. Cheng J. et al. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science 381, eadg7492 (2023).
24. Deciphering Developmental Disorders Study. Prevalence and architecture of de novo mutations in developmental disorders. Nature 542, 433–438 (2017).
25. Genovese C. R., Roeder K. & Wasserman L. False Discovery Control with p-Value Weighting. Biometrika 93, 509–524 (2006).
26. Cassa C. A. et al. Estimating the selective effects of heterozygous protein-truncating variants from human exome data. Nat. Genet. 49, 806–810 (2017).
27. Zeng T., Spence J. P., Mostafavi H. & Pritchard J. K. Bayesian estimation of gene constraint from an evolutionary model with gene features. Res Sq (2023) doi: 10.21203/rs.3.rs-3012879/v1.
28. Ioannidis N. M. et al. REVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants. Am. J. Hum. Genet. 99, 877–885 (2016).
29. Rentzsch P., Schubach M., Shendure J. & Kircher M. CADD-Splice-improving genome-wide variant effect prediction using deep learning-derived splice scores. Genome Med. 13, 31 (2021).
30. Bethune J., Kleppe A. & Besenbacher S. A method to build extended sequence context models of point mutations and indels. Nat. Commun. 13, 7884 (2022).
31. Tessadori F. et al. Recurrent de novo missense variants across multiple histone H4 genes underlie a neurodevelopmental syndrome. Am. J. Hum. Genet. 109, 750–758 (2022).
32. Borja N. et al. H4C5 missense variant leads to a neurodevelopmental phenotype overlapping with Angelman syndrome. Am. J. Med. Genet. A 191, 1911–1916 (2023).
33. Planner Program. https://www.abstractsonline.com/pp8/#!/9070/presentation/2575.
34. Albert F. W. & Kruglyak L. The role of regulatory variation in complex traits and disease. Nat. Rev. Genet. 16, 197–212 (2015).
35. Umans B. D., Battle A. & Gilad Y. Where Are the Disease-Associated eQTLs? Trends Genet. 37, 109–124 (2021).
36. Rowlands C. et al. Comparison of in silico strategies to prioritize rare genomic variants impacting RNA splicing for the diagnosis of genomic disorders. Sci. Rep. 11, 20607 (2021).
37. Lindeboom R. G. H., Supek F. & Lehner B. The rules and impact of nonsense-mediated mRNA decay in human cancers. Nat. Genet. 48, 1112–1118 (2016).
38. Crooke S. T., Baker B. F., Crooke R. M. & Liang X.-H. Antisense technology: an overview and prospectus. Nat. Rev. Drug Discov. 20, 427–453 (2021).
39. Jaganathan K. et al. Predicting Splicing from Primary Sequence with Deep Learning. Cell 176, 535–548.e24 (2019).
40. Rhine C. L. et al. Massively parallel reporter assays discover de novo exonic splicing mutants in paralogs of Autism genes. PLoS Genet. 18, e1009884 (2022).
41. Zeng T. & Li Y. I. Predicting RNA splicing from DNA sequence using Pangolin. Genome Biol. 23, 103 (2022).
42. Gao G. et al. Common fragile sites (CFS) and extremely large CFS genes are targets for human papillomavirus integrations and chromosome rearrangements in oropharyngeal squamous cell carcinoma. Genes Chromosomes Cancer 56, 59–74 (2017).
43. Temaj G., Nuhii N. & Sayer J. A. The impact of consanguinity on human health and disease with an emphasis on rare diseases. Orphanet J. Rare Dis. 1, 2 (2022).
44. Loreau M. & Hector A. Partitioning selection and complementarity in biodiversity experiments. Nature 412, 72–76 (2001).
45. Simons Y. B., Turchin M. C., Pritchard J. K. & Sella G. The deleterious mutation load is insensitive to recent population history. Nat. Genet. 46, 220–224 (2014).
46. Do R. et al. No evidence that selection has been less effective at removing deleterious mutations in Europeans than in Africans. Nat. Genet. 47, 126–131 (2015).
47. Martin H. C. et al. Quantifying the contribution of recessive coding variation to developmental disorders. Science 362, 1161–1164 (2018).
48. Sohail M. et al. Negative selection in humans and fruit flies involves synergistic epistasis. Science 356, 539–542 (2017).
49. Tennessen J. A. et al. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science 337, 64–69 (2012).
50. Balick D. J., Jordan D. M., Sunyaev S. & Do R. Overcoming constraints on the detection of recessive selection in human genes from population frequency data. Am. J. Hum. Genet. 109, 33–49 (2022).
51. Wang J. et al. Mutant neurogenin-3 in congenital malabsorptive diarrhea. N. Engl. J. Med. 355, 270–280 (2006).
52. Hillert A. et al. The Genetic Landscape and Epidemiology of Phenylketonuria. Am. J. Hum. Genet. 107, 234–250 (2020).
53. Calì E. et al. A homozygous MED11 C-terminal variant causes a lethal neurodegenerative disease. Genet. Med. 24, 2194–2203 (2022).
54. Middleton R. et al. IRFinder: assessing the impact of intron retention on mammalian gene expression. Genome Biol. 18, 51 (2017).
55. Rual J.-F. et al. Towards a proteome-scale map of the human protein-protein interaction network. Nature 437, 1173–1178 (2005).
56. Costanzo M. et al. The genetic landscape of a cell. Science 327, 425–431 (2010).
57. Rolland T. et al. A proteome-scale map of the human interactome network. Cell 159, 1212–1226 (2014).
58. Ferrari S. et al. Retinitis pigmentosa: genes and disease mechanisms. Curr. Genomics 12, 238–249 (2011).
59. Park J. J. H. et al. Systematic review of basket trials, umbrella trials, and platform trials: a landscape analysis of master protocols. Trials 20, 572 (2019).
60. Poli M. C. et al. Heterozygous Truncating Variants in POMP Escape Nonsense-Mediated Decay and Cause a Unique Immune Dysregulatory Syndrome. Am. J. Hum. Genet. 102, 1126–1142 (2018).
61. Brehm A. et al. Additive loss-of-function proteasome subunit mutations in CANDLE/PRAAS patients promote type I IFN production. J. Clin. Invest. 125, 4196–4211 (2015).
62. Rodan L. H. et al. Phenotypic expansion of CACNA1C-associated disorders to include isolated neurological manifestations. Genet. Med. 23, 1922–1932 (2021).
63. Syed P., Durisic N., Harvey R. J., Sah P. & Lynch J. W. Effects of GABAA Receptor α3 Subunit Epilepsy Mutations on Inhibitory Synaptic Signaling. Front. Mol. Neurosci. 13, 602559 (2020).
64. Campostrini G. et al. A Loss-of-Function HCN4 Mutation Associated With Familial Benign Myoclonic Epilepsy in Infancy Causes Increased Neuronal Excitability. Front. Mol. Neurosci. 11, 269 (2018).
65. Blake J. A. et al. Mouse Genome Database (MGD): Knowledgebase for mouse-human comparative biology. Nucleic Acids Res. 49, D981–D987 (2021).
66. Richards S. et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet. Med. 17, 405–424 (2015).
67. Osmond M. et al. Outcome of over 1500 matches through the Matchmaker Exchange for rare disease gene discovery: The 2-year experience of Care4Rare Canada. Genet. Med. 24, 100–108 (2022).
68. Coe B. P. et al. Neurodevelopmental disease genes implicated by de novo mutation and copy number variation morbidity. Nat. Genet. 51, 106–116 (2019).
69. Uffelmann E. et al. Genome-wide association studies. Nature Reviews Methods Primers 1, 1–21 (2021).
70. Chopra M. & Duan T. Rare genetic disease in China: a call to improve clinical services. Orphanet J. Rare Dis. 10, 140 (2015).
71. Steyaert W. et al. Unravelling undiagnosed rare disease cases by HiFi long-read genome sequencing. medRxiv (2024) doi: 10.1101/2024.05.03.24305331.
72. Gorzynski J. E. et al. Clinical application of Complete Long Read genome sequencing identifies a 16kb intragenic duplication in EHMT1 in a patient with suspected Kleefstra syndrome. bioRxiv (2024) doi: 10.1101/2024.03.28.24304304.
73. Collins R. L. et al. A structural variation reference for medical and population genetics. Nature 581, 444–451 (2020).
74. Andres E. M. et al. A genome-wide analysis in consanguineous families reveals new chromosomal loci in specific language impairment (SLI). Eur. J. Hum. Genet. 27, 1274–1285 (2019).

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplement 1

media-1.pdf^{(8.2MB, pdf)}

Supplement 2

media-2.xlsx^{(335.3KB, xlsx)}

Supplement 3

NIHPP2024.02.13.580158v2-supplement-3.pdf^{(1.2MB, pdf)}

Data Availability Statement

[R1] 1.Kaplanis J. et al. Evidence for 28 genetic disorders discovered by combining healthcare and research data. Nature 586, 757–762 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R2] 2.100,000 Genomes Project Pilot Investigators et al. 100,000 Genomes Pilot on Rare-Disease Diagnosis in Health Care - Preliminary Report. N. Engl. J. Med. 385, 1868–1880 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R3] 3.Marx J. L. The cystic fibrosis gene is found. Science 245, 923–925 (1989). [DOI] [PubMed] [Google Scholar]

[R4] 4.Roach J. C. et al. Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science 328, 636–639 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R5] 5.O’Roak B. J. et al. Exome sequencing in sporadic autism spectrum disorders identifies severe de novo mutations. Nat. Genet. 43, 585–589 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R6] 6.Jin S. C. et al. Contribution of rare inherited and de novo variants in 2,871 congenital heart disease probands. Nat. Genet. 49, 1593–1601 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R7] 7.Vissers L. E. L. M. et al. A de novo paradigm for mental retardation. Nat. Genet. 42, 1109–1112 (2010). [DOI] [PubMed] [Google Scholar]

[R8] 8.Chong J. X. et al. The Genetic Basis of Mendelian Phenotypes: Discoveries, Challenges, and Opportunities. Am. J. Hum. Genet. 97, 199–215 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R9] 9.Zurek B. et al. Solve-RD: systematic pan-European data sharing and collaborative analysis to solve rare diseases. Eur. J. Hum. Genet. 29, 1325–1331 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R10] 10.Boycott K. M. et al. Care4Rare Canada: Outcomes from a decade of network science for rare disease gene discovery. Am. J. Hum. Genet. 109, 1947–1959 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R11] 11.Hartley T. et al. New Diagnostic Approaches for Undiagnosed Rare Genetic Diseases. Annu. Rev. Genomics Hum. Genet. 21, 351–372 (2020). [DOI] [PubMed] [Google Scholar]

[R12] 12.Kobren S. N. et al. Commonalities across computational workflows for uncovering explanatory variants in undiagnosed cases. Genet. Med. 23, 1075–1085 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R13] 13.Chung H.-L. et al. Loss- or Gain-of-Function Mutations in ACOX1 Cause Axonal Loss via Different Mechanisms. Neuron 106, 589–606.e6 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R14] 14.Greene D. et al. Genetic association analysis of 77,539 genomes reveals rare disease etiologies. Nat. Med. 29, 679–688 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R15] 15. Splinter K. et al. Effect of Genetic Diagnosis on Patients with Previously Undiagnosed Disease. N. Engl. J. Med. 379, 2131–2139 (2018).

[R16] 16. Mohanty A. K. et al. novoCaller: a Bayesian network approach for de novo variant calling from pedigree and population sequence data. Bioinformatics 35, 1174–1180 (2019).

[R17] 17. Strande N. T. et al. Evaluating the Clinical Validity of Gene-Disease Associations: An Evidence-Based Framework Developed by the Clinical Genome Resource. Am. J. Hum. Genet. 100, 895–906 (2017).

[R18] 18. Neale B. M. et al. Patterns and rates of exonic de novo mutations in autism spectrum disorders. Nature 485, 242–245 (2012).

[R19] 19. Samocha K. E. et al. A framework for the interpretation of de novo mutation in human disease. Nat. Genet. 46, 944–950 (2014).

[R20] 20. Veltman J. A. & Brunner H. G. De novo mutations in human genetic disease. Nat. Rev. Genet. 13, 565–575 (2012).

[R21] 21. Seplyarskiy V. et al. A mutation rate model at the basepair resolution identifies the mutagenic effect of polymerase III transcription. Nat. Genet. 55, 2235–2242 (2023).

[R22] 22. Gao H. et al. The landscape of tolerated genetic variation in humans and primates. Science 380, eabn8153 (2023).

[R23] 23. Cheng J. et al. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science 381, eadg7492 (2023).

[R24] 24. Deciphering Developmental Disorders Study. Prevalence and architecture of de novo mutations in developmental disorders. Nature 542, 433–438 (2017).

[R25] 25. Genovese C. R., Roeder K. & Wasserman L. False Discovery Control with p-Value Weighting. Biometrika 93, 509–524 (2006).

[R26] 26. Cassa C. A. et al. Estimating the selective effects of heterozygous protein-truncating variants from human exome data. Nat. Genet. 49, 806–810 (2017).

[R27] 27. Zeng T., Spence J. P., Mostafavi H. & Pritchard J. K. Bayesian estimation of gene constraint from an evolutionary model with gene features. Res Sq (2023) doi: 10.21203/rs.3.rs-3012879/v1.

[R28] 28. Ioannidis N. M. et al. REVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants. Am. J. Hum. Genet. 99, 877–885 (2016).

[R29] 29. Rentzsch P., Schubach M., Shendure J. & Kircher M. CADD-Splice-improving genome-wide variant effect prediction using deep learning-derived splice scores. Genome Med. 13, 31 (2021).

[R30] 30. Bethune J., Kleppe A. & Besenbacher S. A method to build extended sequence context models of point mutations and indels. Nat. Commun. 13, 7884 (2022).

[R31] 31. Tessadori F. et al. Recurrent de novo missense variants across multiple histone H4 genes underlie a neurodevelopmental syndrome. Am. J. Hum. Genet. 109, 750–758 (2022).

[R32] 32. Borja N. et al. H4C5 missense variant leads to a neurodevelopmental phenotype overlapping with Angelman syndrome. Am. J. Med. Genet. A 191, 1911–1916 (2023).

[R33] 33. Planner Program. https://www.abstractsonline.com/pp8/#!/9070/presentation/2575.

[R34] 34. Albert F. W. & Kruglyak L. The role of regulatory variation in complex traits and disease. Nat. Rev. Genet. 16, 197–212 (2015).

[R35] 35. Umans B. D., Battle A. & Gilad Y. Where Are the Disease-Associated eQTLs? Trends Genet. 37, 109–124 (2021).

[R36] 36. Rowlands C. et al. Comparison of in silico strategies to prioritize rare genomic variants impacting RNA splicing for the diagnosis of genomic disorders. Sci. Rep. 11, 20607 (2021).

[R37] 37. Lindeboom R. G. H., Supek F. & Lehner B. The rules and impact of nonsense-mediated mRNA decay in human cancers. Nat. Genet. 48, 1112–1118 (2016).

[R38] 38. Crooke S. T., Baker B. F., Crooke R. M. & Liang X.-H. Antisense technology: an overview and prospectus. Nat. Rev. Drug Discov. 20, 427–453 (2021).

[R39] 39. Jaganathan K. et al. Predicting Splicing from Primary Sequence with Deep Learning. Cell 176, 535–548.e24 (2019).

[R40] 40. Rhine C. L. et al. Massively parallel reporter assays discover de novo exonic splicing mutants in paralogs of Autism genes. PLoS Genet. 18, e1009884 (2022).

[R41] 41. Zeng T. & Li Y. I. Predicting RNA splicing from DNA sequence using Pangolin. Genome Biol. 23, 103 (2022).

[R42] 42. Gao G. et al. Common fragile sites (CFS) and extremely large CFS genes are targets for human papillomavirus integrations and chromosome rearrangements in oropharyngeal squamous cell carcinoma. Genes Chromosomes Cancer 56, 59–74 (2017).

[R43] 43. Temaj G., Nuhii N. & Sayer J. A. The impact of consanguinity on human health and disease with an emphasis on rare diseases. Orphanet J. Rare Dis. 1, 2 (2022).

[R44] 44. Loreau M. & Hector A. Partitioning selection and complementarity in biodiversity experiments. Nature 412, 72–76 (2001).

[R45] 45. Simons Y. B., Turchin M. C., Pritchard J. K. & Sella G. The deleterious mutation load is insensitive to recent population history. Nat. Genet. 46, 220–224 (2014).

[R46] 46. Do R. et al. No evidence that selection has been less effective at removing deleterious mutations in Europeans than in Africans. Nat. Genet. 47, 126–131 (2015).

[R47] 47. Martin H. C. et al. Quantifying the contribution of recessive coding variation to developmental disorders. Science 362, 1161–1164 (2018).

[R48] 48. Sohail M. et al. Negative selection in humans and fruit flies involves synergistic epistasis. Science 356, 539–542 (2017).

[R49] 49. Tennessen J. A. et al. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science 337, 64–69 (2012).

[R50] 50. Balick D. J., Jordan D. M., Sunyaev S. & Do R. Overcoming constraints on the detection of recessive selection in human genes from population frequency data. Am. J. Hum. Genet. 109, 33–49 (2022).

[R51] 51. Wang J. et al. Mutant neurogenin-3 in congenital malabsorptive diarrhea. N. Engl. J. Med. 355, 270–280 (2006).

[R52] 52. Hillert A. et al. The Genetic Landscape and Epidemiology of Phenylketonuria. Am. J. Hum. Genet. 107, 234–250 (2020).

[R53] 53. Calì E. et al. A homozygous MED11 C-terminal variant causes a lethal neurodegenerative disease. Genet. Med. 24, 2194–2203 (2022).

[R54] 54. Middleton R. et al. IRFinder: assessing the impact of intron retention on mammalian gene expression. Genome Biol. 18, 51 (2017).

[R55] 55. Rual J.-F. et al. Towards a proteome-scale map of the human protein-protein interaction network. Nature 437, 1173–1178 (2005).

[R56] 56. Costanzo M. et al. The genetic landscape of a cell. Science 327, 425–431 (2010).

[R57] 57. Rolland T. et al. A proteome-scale map of the human interactome network. Cell 159, 1212–1226 (2014).

[R58] 58. Ferrari S. et al. Retinitis pigmentosa: genes and disease mechanisms. Curr. Genomics 12, 238–249 (2011).

[R59] 59. Park J. J. H. et al. Systematic review of basket trials, umbrella trials, and platform trials: a landscape analysis of master protocols. Trials 20, 572 (2019).

[R60] 60. Poli M. C. et al. Heterozygous Truncating Variants in POMP Escape Nonsense-Mediated Decay and Cause a Unique Immune Dysregulatory Syndrome. Am. J. Hum. Genet. 102, 1126–1142 (2018).

[R61] 61. Brehm A. et al. Additive loss-of-function proteasome subunit mutations in CANDLE/PRAAS patients promote type I IFN production. J. Clin. Invest. 125, 4196–4211 (2015).

[R62] 62. Rodan L. H. et al. Phenotypic expansion of CACNA1C-associated disorders to include isolated neurological manifestations. Genet. Med. 23, 1922–1932 (2021).

[R63] 63. Syed P., Durisic N., Harvey R. J., Sah P. & Lynch J. W. Effects of GABAA Receptor α3 Subunit Epilepsy Mutations on Inhibitory Synaptic Signaling. Front. Mol. Neurosci. 13, 602559 (2020).

[R64] 64. Campostrini G. et al. A Loss-of-Function HCN4 Mutation Associated With Familial Benign Myoclonic Epilepsy in Infancy Causes Increased Neuronal Excitability. Front. Mol. Neurosci. 11, 269 (2018).

[R65] 65. Blake J. A. et al. Mouse Genome Database (MGD): Knowledgebase for mouse-human comparative biology. Nucleic Acids Res. 49, D981–D987 (2021).

[R66] 66. Richards S. et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet. Med. 17, 405–424 (2015).

[R67] 67. Osmond M. et al. Outcome of over 1500 matches through the Matchmaker Exchange for rare disease gene discovery: The 2-year experience of Care4Rare Canada. Genet. Med. 24, 100–108 (2022).

[R68] 68. Coe B. P. et al. Neurodevelopmental disease genes implicated by de novo mutation and copy number variation morbidity. Nat. Genet. 51, 106–116 (2019).

[R69] 69. Uffelmann E. et al. Genome-wide association studies. Nature Reviews Methods Primers 1, 1–21 (2021).

[R70] 70. Chopra M. & Duan T. Rare genetic disease in China: a call to improve clinical services. Orphanet J. Rare Dis. 10, 140 (2015).

[R71] 71. Steyaert W. et al. Unravelling undiagnosed rare disease cases by HiFi long-read genome sequencing. medRxiv (2024) doi: 10.1101/2024.05.03.24305331.

[R72] 72. Gorzynski J. E. et al. Clinical application of Complete Long Read genome sequencing identifies a 16kb intragenic duplication in EHMT1 in a patient with suspected Kleefstra syndrome. bioRxiv (2024) doi: 10.1101/2024.03.28.24304304.

[R73] 73. Collins R. L. et al. A structural variation reference for medical and population genetics. Nature 581, 444–451 (2020).

[R74] 74. Andres E. M. et al. A genome-wide analysis in consanguineous families reveals new chromosomal loci in specific language impairment (SLI). Eur. J. Hum. Genet. 27, 1274–1285 (2019).
