Abstract
Around 60% of individuals with neurodevelopmental disorders (NDD) remain undiagnosed after comprehensive genetic testing, primarily of protein-coding genes1. Large genome-sequenced cohorts are improving our ability to discover new diagnoses in the non-coding genome. Here we identify the non-coding RNA RNU4-2 as a syndromic NDD gene. RNU4-2 encodes the U4 small nuclear RNA (snRNA), which is a critical component of the U4/U6.U5 tri-snRNP complex of the major spliceosome2. We identify an 18 base pair region of RNU4-2 mapping to two structural elements in the U4/U6 snRNA duplex (the T-loop and stem III) that is severely depleted of variation in the general population, but in which we identify heterozygous variants in 115 individuals with NDD. Most individuals (77.4%) have the same highly recurrent single base insertion (n.64_65insT). In 54 individuals in whom it could be determined, the de novo variants were all on the maternal allele. We demonstrate that RNU4-2 is highly expressed in the developing human brain, in contrast to RNU4-1 and other U4 homologues. Using RNA sequencing, we show how 5′ splice-site use is systematically disrupted in individuals with RNU4-2 variants, consistent with the known role of this region during spliceosome activation. Finally, we estimate that variants in this 18 base pair region explain 0.4% of individuals with NDD. This work underscores the importance of non-coding genes in rare disorders and will provide a diagnosis to thousands of individuals with NDD worldwide.
Subject terms: Disease genetics, Non-coding RNAs, Genome informatics, Genetics of the nervous system, Neurodevelopmental disorders
The non-coding RNA RNU4-2, which is highly expressed in the developing human brain, is identified as a syndromic neurodevelopmental disorder gene, and, using RNA sequencing, 5′ splice-site use is shown to be systematically disrupted in individuals with RNU4-2 variants.
Main
Despite increasingly powerful genomic and analytic approaches for the diagnosis of rare developmental disorders, around 60% of individuals with NDD remain without an identified genetic diagnosis after genomic testing with current methods1. So far, most known disease-causing variants are in the roughly 1.5% of the genome that directly encodes proteins3. By contrast, the non-coding genome (which makes up the remaining 98.5%) has been relatively unexplored, especially regions far from protein-coding genes. Large-scale systematic application of genome sequencing to clinical populations has increasingly enabled investigation of the contribution of variants in non-coding regions to genetic disorders4.
Non-coding RNAs, which comprise 37.4% of processed exonic RNA sequence in humans5, include important regulators of biological processes with diverse roles across cells and tissues6. snRNAs are a subcategory of non-coding RNAs that are key components of the spliceosome7. snRNAs complex with a multitude of proteins and other snRNA species in small nuclear ribonucleoprotein (snRNP) complexes to mediate the removal of introns from pre-messenger RNA transcripts8. Many spliceosome components have demonstrated roles in human disorders, including two snRNA components of the minor spliceosome: RNU12 variants cause autosomal recessive early-onset cerebellar ataxia9, whereas RNU4ATAC variants cause an autosomal recessive multisystem congenital disorder including microcephaly, growth retardation and developmental delay (eponyms include Taybi Linder10, Lowry–Wood11 and Roifman syndromes12).
Here we identify variants in RNU4-2, which encodes the U4 snRNA component of the major spliceosome, in an autosomal dominant disorder. Using a cohort of 8,841 probands with genetically undiagnosed NDD in the Genomics England 100,000 genomes project (GEL)4, we identify variants in a critical 18 base pair (bp) region in the centre of RNU4-2 associated with a severe neurodevelopmental phenotype and estimate that variants in this region explain around 0.4% of individuals with neurodevelopmental disorders (NDD). We demonstrate that variants in this region are severely depleted from large population datasets. We show that NDD variants map to critical structural elements in the U4/U6 complex that are important to correctly position U6 ACAGAGA to receive the 5′ splice site during initial spliceosome activation, and detail the expression of RNU4-2 through brain development.
A highly recurrent insertion in NDD
We identified a highly recurrent single base insertion (GRCh38:chr. 12: 120291839:T:TA; n.64_65insT) in RNU4-2 in GEL1. This variant was initially identified as arising de novo in 38 probands recruited for genome sequencing with their unaffected parents13. Extending the search to include probands without data for both parents in the full GEL cohort, we identified an extra eight individuals with the n.64_65insT variant; in all eight, the detectable inheritance is consistent with the variant having arisen de novo (that is, where a single parent sample was available the variant was not detected in it). All of the 46 individuals with the variant have undiagnosed NDD (categorized as global developmental delay, intellectual disability and/or autism spectrum disorder), corresponding to 0.52% of 8,841 probands with so far undiagnosed NDD in GEL. The n.64_65insT variant was not found in any of 3,408 NDD probands with an existing genetic diagnosis, 21,817 probands with non-NDD phenotypes or 33,122 unaffected individuals. Individuals with the variant are significantly enriched for global developmental delay (n = 37; odds ratio (OR) = 3.56; Fisher’s P = 2.75 × 10−4), delayed gross motor development (n = 26; OR = 2.55; P = 1.64 × 10−3), microcephaly (n = 26; OR = 6.62; P = 7.87 × 10−10), delayed fine motor development (n = 24; OR = 2.61; P = 1.69 × 10−3), hypotonia (n = 18; OR = 3.60; P = 7.09 × 10−5), short stature (n = 15; OR = 3.54; P = 2.17 × 10−4), drooling (n = 7; OR = 19.2; P = 2.83 × 10−7) and absent speech (n = 6; OR = 6.23; P = 7.45 × 10−4) compared to all other probands with NDD in GEL (n = 12,203; diagnosed and undiagnosed) (Extended Data Fig. 1).
The n.64_65insT variant is not found in 76,215 genome-sequenced individuals in gnomAD v.4.0 (ref. 14), or in 245,400 individuals in the All of Us dataset15. It is seen in a single individual in the UK Biobank16 (allele frequency of 1.02 × 10−6) with a variant allele balance consistent with a true variant (23 reference and 18 (44%) alternate reads). This individual has an International Classification of Diseases 10th Revision (ICD-10) code for ‘personal history of disease of the nervous system and sense organs’ but no further phenotype data to assess a potential NDD diagnosis (Supplementary Table 1).
Given the high occurrence rate of this recurrent insertion, we wanted to rule out the possibility that it is a sequencing or mapping error, despite the overwhelming evidence of phenotype enrichment. Notably, the variant is a single A insertion after a run of four Ts, ruling out the most common cause of sequencing error for indels, polymerase slippage in homopolymer repeats. The variant calls were all high quality according to both analysis of quality metrics (Supplementary Fig. 1) and manual inspection on the Integrative Genomics Viewer (IGV) (Supplementary Fig. 2). The genomic region surrounding the insertion and RNU4-2 maps uniquely to a single region of the genome with short-read sequencing in GRCh38 and T2T CHM13v2.0/hs1. Finally, sequencing reads aligned to RNU4-2 map with good quality (average 96 reads with mapping quality scores greater than 20; Supplementary Fig. 3).
n.64_65insT is in a highly constrained region
The recurrent n.64_65insT variant resides within the central region of RNU4-2, towards the 5′ end of an 18 bp region that is depleted of variants in population datasets compared with the rest of the gene (26% of all possible single nucleotide variants (SNVs) observed in UK Biobank compared to a median of 78% across the rest of the gene; Fig. 1a and Extended Data Fig. 2a). On the basis of the population variant data, we defined a critical, highly constrained region as chr. 12: 120291825–120291842. We refer to this as the ‘critical region’ throughout the rest of the paper.
We searched for variants across this region in GEL, and also in further cohorts containing undiagnosed individuals with NDD (Methods). In total, we identified 115 individuals with variants across this region, including 61 individuals in GEL (60 probands and one more sibling) and 54 from extra cohorts (Fig. 1b and Extended Data Table 1). For 86 of the 115 individuals, sequencing data for both parents were available to confirm that the variants had arisen de novo. Where possible, we used nearby variants to determine the parental allele of origin of the variants. For 54 individuals in whom this could be confidently resolved (46 with n.64_65insT; three with other insertion variants; five with SNVs), all 54 were present on the maternal allele. In one individual the n.65A>G variant appeared to be mosaic in the mother (54 reference and eight alternate reads) and in another an SNV was maternally inherited (n.76C>T). This analysis also enabled us to determine the likely de novo occurrence for five more individuals in whom only one parent was sequenced. Sanger sequencing was used to confirm the presence of the variant in eight individuals with the n.64_65insT variant. For seven of the eight, absence from both parents was also confirmed. In the eighth, the variant was confirmed as absent from the single available parent. In three families, the n.64_65insT variant was identified as occurring de novo in both short- and long-read trio sequencing.
Extended Data Table 1.
Numbers in brackets in NDD count columns correspond to individuals with detailed clinical information in Table 1. *variant found in an additional sibling. **NHS GMS (n = 19); MSSNG (n = 2); SSC (n = 1); GREGoR (n = 10); Undiagnosed Diseases Network (UDN; n = 6); from personal communication/Matchmaker Exchange (n = 16).
Most of the 115 individuals have the initial n.64_65insT variant (n = 89; 77.4%). Five of the 11 extra variants are also single base insertions, including n.77_78insT (GRCh38:chr. 12: 120291826:T:TA), which is seen in six individuals, two of whom are affected siblings. Single base insertion variants in this region are strongly enriched in individuals with NDD: 54 out of 8,841 (0.61%) GEL undiagnosed NDD probands (55 out of 10,388 individuals) have single base insertions compared to two out of 490,132 individuals in the UK Biobank (OR = 1,531; 95% confidence interval (CI): 404–16,384; Fisher’s P = 3.3 × 10−92).
Aside from insertions, there is also a modest enrichment of SNVs in GEL NDD probands across the critical region (undiagnosed NDD: six out of 8,841; UK Biobank: 35 out of 490,132; OR = 9.51; 95%CI: 3.27–22.8; Fisher’s P = 8.16 × 10−5). We identified 15 individuals across cohorts with SNVs in this region (Extended Data Table 1; 10 confirmed de novo), all with phenotypes consistent with individuals with insertion variants. The identified SNVs cluster with the two regions harbouring insertion variants at the extreme ends of the 18 bp critical region (Fig. 1). Conversely, SNVs in the central portion (particularly at nucleotides 71–74) are observed in both non-NDD individuals in GEL (n = 2) and population controls, although all at low frequencies (Extended Data Table 1). Across the remainder of RNU4-2 there is no significant enrichment of variants in undiagnosed NDD probands when compared with non-NDD probands (194 out of 7,519 undiagnosed NDD; 521 out of 19,428 non-NDD in GEL aggregated variant dataset17; OR = 0.96; 95%CI: 0.81–1.14; Fisher’s P = 0.67).
In total, we identify variants in this 18 bp region in 115 individuals with NDD. This includes 60 out of 8,841, or 0.68%, of all genetically undiagnosed NDD probands in GEL. By contrast, variants in this region are observed in 39 out of 490,132 (0.008%) individuals in the UK Biobank (OR = 85.8; 95%CI: 56.4–131.6; Fisher’s P = 1.84 × 10−78). As most individuals in GEL have had genetic testing before recruitment, we cannot estimate the overall prevalence of RNU4-2 variants in all cause NDD from this cohort. Instead, if we assume a diagnostic yield of 40% before defining our GEL undiagnosed NDD cohort, consistent with recent reports1, we estimate that variants in RNU4-2 could explain 0.4% of all NDD (calculated as 60 from an effective cohort size of 14,735 (8,841 × 1/0.6)).
U4 snRNA binds to U6 snRNA through extensive complementary base pairing in the U4/U6.U5 tri-snRNP complex of the major spliceosome. Unwinding of U6 from U4 is essential to generate the catalytically active spliceosome2. The 18 bp critical region in RNU4-2 maps to a region of U4 between the stem I region of complementary base pairing to U6 and the 3′ stem–loop structures (nucleotides 62–79; Fig. 1c). This region is known to be loaded into the active site of the SNRNP200-encoded BRR2 helicase, which mediates spliceosome activation by unwinding the U4/U6 duplex2. The highly recurrent n.64_65insT variant is within a previously described ‘quasi pseudoknot’, or T-loop, structure18 (Fig. 1d). The region spanning nucleotides 76 to 78, where the recurrent n.77_78insT variant resides, is involved in base pairing with U6 in stem III (ref. 19) (Fig. 1d). Both regions are thought to stabilize U4/U6 pairing and accurately position the U6 ACAGAGA sequence to receive the 5′ splice site during initial spliceosome assembly, whereas U4 nucleotides in stem III are part of the loading site for BRR2. Nearby regions that are predicted to have important roles, such as the U4/U6 stem I binding region, are not enriched for variants in NDD probands.
RNU4-2 variants disrupt 5′ splice-site use
Given the importance of U4 snRNA in the spliceosome and previous observations of global disruption to splicing observed in other spliceosomopathies20, we analysed RNA sequencing (RNA-seq) data from blood samples for five individuals from GEL. Three of these individuals have the highly recurrent n.64_65insT variant, another has the other recurrent insertion, n.77_78insT and the final individual has an SNV (n.78A>C). The five individuals with RNU4-2 variants had more abnormal splicing events than 378 control individuals with non-NDD phenotypes (mean 21.6 versus 4.5; Wilcoxon P = 0.0126), but this was not significant after correcting for multiple testing. There was no difference in the number of genes that were significant outliers for expression (mean 1.8 versus 5.7; Wilcoxon P = 0.94; Extended Data Table 2).
Extended Data Table 2.
Outliers predicted by OUTRIDER and FRASER2 in RNA-seq data for five individuals with RNU4-2 variants compared to ten matched controls and 378 unmatched controls. A P-value threshold of 0.005 was used to assess statistical significance (Bonferroni adjusted for 10 tests). All statistical tests are one-sided Wilcoxon rank-sum tests and the P-values are unadjusted.
Consistent with the importance of the critical region in 5′ splice-site recognition, the most pronounced difference was observed for abnormal splicing events corresponding to increased use of unannotated 5′ splice sites (mean 8.8 events in individuals with RNU4-2 variants compared with 0.7 in both 378 unmatched controls and ten controls matched on genetic ancestry, sex and age at consent; Wilcoxon P = 4.0 × 10−5 and P = 5.7 × 10−3, respectively; Fig. 2a and Extended Data Table 2). The individual with the SNV was not notably different from the four individuals with single base insertions (three significant events). Sequence motif analysis showed an increase in T at the +3 position and an increase in C at the +4 and +5 positions in the unannotated 5′ splice sites that were significantly increased in individuals with RNU4-2 variants compared to decreased canonical sites (Fig. 2c). These three positions of the 5′ splice site (+3, +4 and +5), which shift away from consensus in individuals with RNU4-2 variants, pair directly with the U6 ACAGAGA region during spliceosome activation (Fig. 2d).
Of all the detected abnormal splicing events, 12 of these were shared by two or more individuals with RNU4-2 variants (Extended Data Table 3). Eleven of these 12 events (91.6%) corresponded to an increase in unannotated 5′ splice-site use. None of these shared events were identified in any of the 378 controls. By contrast, when randomly sampling five control individuals across 10,000 permutations, the mean number of events shared by two or more individuals was 0.007, significantly fewer than the 12 in RNU4-2 individuals (permutation P < 1 × 10−4; Fig. 2b). Five of the genes implicated in the 12 shared events are in the Developmental Disorders Genotype-to-Phenotype (DDG2P) database21 and/or were associated with NDD in a previous large-scale analysis22 (NDUFV1, H2AC6, JMJD1C, MAP4K4 and SF1). Ten of the 12 shared events affect the protein-coding sequence, with four predicted to cause a frameshift (Extended Data Table 3). Collectively, these results indicate a systematic shift in 5′ splice-site use in individuals with RNU4-2 variants compared to controls. Future work should assess these patterns in a more disease-relevant tissue (for example, brain) or in induced pluripotent stem cell-derived neuronal cells or organoid models. At present, RNA from more tissues from affected individuals is not available.
Extended Data Table 3.
Splicing outlier events detected by FRASER2 that are shared between two or more individuals with RNU4-2 variants. None of the events were observed in any of the 378 control individuals. *Genes identified as associated with NDD in Fu et al. Nature Genetics 202322 (PubMed ID 35982160). DDG2P: Developmental disorders gene 2 phenotype database.
Characterizing the RNU4-2 NDD syndrome
To characterize the phenotypic spectrum associated with variants in RNU4-2, we collected detailed phenotypic information for a subset of 49 individuals (42 with n.64_65insT, three with other single base insertions, and four with SNVs; Table 1 and Supplementary Table 2). Using these data, we find the RNU4-2 syndromic NDD to be characterized by moderate to severe global developmental delay (four children with SNVs with moderate delay) and intellectual disability in all individuals. Most (83%) achieved ambulation but at a delayed age (average 3.4 years, range 17 months to 7.5 years) with some noted to have a wide-based or ataxic gait. Only three individuals (two with an SNV) had fluent speech, some had a few words and most were non-verbal. All but three were reported to have dysmorphic facial features. These facial features varied but consisted of a myopathic face with deep set eyes (some widely spaced and some narrowly spaced), epicanthus, wide nasal bridge, anteverted nares or underdeveloped ala nasi, large cupped ears (some posteriorly rotated), full cheeks, a distinctive mouth with full lips with downturned corners, high arched palate and a large or protruding tongue (Fig. 3). In comparison to the single base insertions, children with SNVs had fewer reports of severe global developmental delay (zero out of four versus 34 out of 40, Fisher’s P = 0.0015).
Table 1.
Clinical feature | |||
---|---|---|---|
Individuals (n) | 49 | ||
Sex | 21 F, 28 M | ||
Median | Range | ||
Age at last evaluation (years) | 10 | 2–38 | |
Maternal age at birth (years)a | 32 | 22–41 | |
Paternal age at birth (years)b | 33 | 26–45 | |
Countc | Percentage (%) | ||
Growth | IUGR | 8 out of 45 | 18 |
Short stature | 37 out of 49 | 76 | |
Microcephaly | 37 out of 48 | 77 | |
- congenital | 19 out of 37 | ||
- acquired | 9 out of 37 | ||
- not specified | 9 out of 37 | ||
Neurodevelopmental | GDD | 49 out of 49 | 100 |
- severe | 34 out of 49 | ||
- moderate | 10 out of 49 | ||
- not specified | 5 out of 49 | ||
Ambulatory (>5 years old) | 30 out of 36 | 83 | |
- abnormal gait | 7 out of 30 | ||
- not specified | 23 out of 30 | ||
Speech abnormality | 45 out of 48 | 94 | |
- non-verbal | 35 out of 45 | ||
- few words | 10 out of 45 | ||
ID | 45 out of 45 | 100 | |
Behavioural issues | 30 out of 45 | 67 | |
ASD | 21 out of 44 | 48 | |
Sleep issues | 15 out of 32 | 47 | |
Hypotonia | 39 out of 45 | 87 | |
Seizures | 37 out of 48 | 77 | |
Abnormal brain MRI | 41 out of 45 | 91 | |
Hearing | Hearing loss | 10 out of 46d | 22 |
Vision | Vision issues | 38 out of 48 | 79 |
- optic nerve hypoplasia | 8 out of 37 | 22 | |
- strabismus | 23 out of 45 | 51 | |
- nystagmus | 18 out of 40 | 45 | |
Gastrointestinal | Constipation | 29 out of 44 | 66 |
GORD | 21 out of 43 | 49 | |
Feeding difficulties | 32 out of 42 | 76 | |
G-tube | 13 out of 41 | 32 | |
Growth problems | 30 out of 43 | 70 | |
Endocrine | 17 out of 39 | 44 | |
Bone and/or skeletal | 27 out of 42 | 64 | |
Limb | 23 out of 42 | 55 | |
Genitourinary | 15 out of 43 | 35 | |
Dental | 17 out of 43 | 40 | |
Cardiac | 5 out of 43 | 12 | |
Cutaneous | 25 out of 44 | 57 | |
Dysmorphic facial features | 45 out of 48 | 94 |
F, female; M, male; IUGR, intrauterine growth restriction; GDD, global developmental delay; ID, intellectual disability; ASD, autism spectrum disorder; MRI, magnetic resonance imaging; GORD, gastro-oesophageal reflux disease; G-tube, gastrostomy tube.
aMaternal age available for 43 out of 49 individuals.
bPaternal age available for 41 out of 49 individuals.
cDenominator indicates the number of individuals for whom data were available.
dOne individual has a dual diagnosis in GJB2 that would account for the hearing loss.
Associated growth and neurodevelopmental phenotypes present in more than or equal to 75% of individuals include short stature, microcephaly (mostly congenital), speech abnormalities (mostly non-verbal), hypotonia and seizures. Seizures had variable onset with a median of 3 years and ranging from the first year of life to 10 years of age (spanning infantile spasms, focal seizures and generalized tonic–clonic seizures, febrile seizures and status epilepticus). Brain magnetic resonance imaging showed a spectrum of abnormalities in most individuals, most frequently reduced white matter volume, hypoplasia of the corpus callosum, ventriculomegaly, delayed myelination and other non-specific abnormalities of the white matter. Involvement of several organ systems was reported for all individuals (with fewer systems reported as involved in individuals with SNVs), often including visual (optic nerve hypoplasia, cortical blindness, strabismus, nystagmus), gastrointestinal (constipation, reflux, feeding issues with need for a gastrostomy tube, poor growth) and bone and/or skeletal abnormalities (osteopenia, recurrent fractures, scoliosis, kyphosis, hip dysplasia) and in fewer individuals, hearing, endocrine (hypothyroidism, growth hormone deficiency, pituitary abnormalities), limb, sleep, genitourinary, dental, cardiac and cutaneous concerns (Table 1 and Supplementary Table 2). No significant differences were noted in the presentation of male versus female individuals.
Exome sequencing rarely finds n.64_65insT
Most individuals with NDD who undergo genetic testing at present have exome rather than genome sequencing. Although RNU4-2 is not directly captured by exome sequencing panels, there is a chance that off-target reads may map to the 18 bp critical region of RNU4-2 and enable detection of variants in this region. To investigate this, we analysed individuals who are included in GEL and also have exome sequencing data in the Deciphering Developmental Disorders (DDD) cohort1. Across the DDD cohort, 3,408 out of 13,450 individuals (25.3%) have at least one read mapping to the position of the n.64_65insT variant (Extended Data Fig. 3). The maximum number of mapping reads in any individual was five, which is below standard thresholds used to identify heterozygous variants. Of 1,755 individuals in both GEL and DDD, 22 have the n.64_65insT variant (1.3%). Two of the 22 individuals (9.1%) each have a single read at the variant position in the exome sequencing data from DDD, but in each case it is identical to the reference sequence. The other 20 individuals have no reads mapping to RNU4-2. Nevertheless, others have reported successful identification and subsequent experimental validation of the n.64_65insT variant identified initially only on one or two sequencing reads (public communication on X/Twitter with S. Laurie and K. Platzer). These analyses suggest that although it is possible to identify individuals who may have variants in RNU4-2 through exome sequencing data, the sensitivity of this approach is very low. Any variants identified through this approach will also need independent confirmation.
Evaluating other snRNAs in NDD
Given the identified importance of RNU4-2 in NDD, we sought to determine whether other snRNA genes with no known association to NDD could also harbour new diagnoses. We investigated 28 snRNA genes that are expressed in the brain, using multiple approaches (Extended Data Table 4). First, we tested for overall enrichment of de novo variants in undiagnosed NDD probands compared to non-NDD probands across each snRNA with at least two identified de novo variants in probands with undiagnosed NDD (n = 14) using the high-confidence de novo callset in GEL. Of the 13 genes other than RNU4-2, none showed a significant enrichment of de novo variants in undiagnosed NDD probands (all Fisher’s P > 0.15).
Extended Data Table 4.
Genomic coordinates of, and burden testing results for snRNA genes in 5,426 undiagnosed NDD probands against 4,776 non-NDD probands.
Second, proposing that the burden of pathogenic variants in other snRNAs may be restricted to specific critical regions, as we see for RNU4-2, we used an 18 bp sliding window to identify snRNA regions that are depleted of variation in the UK Biobank compared to the overall variant burden across each gene. Notably, the regions with the highest depletion in RNU4ATAC correspond to two hotspots of pathogenic variants in ClinVar (chr. 2: 121530923–121530946, chr. 2: 121530984–121531007), however, the strength of the depletion in these regions is lower than in RNU4-2 (minimum normalized proportion of observed −0.11 and −0.2 versus −0.5 for the depleted region in RNU4-2), consistent with lower selection acting on variants in RNU4ATAC that cause recessive disorders. We identified 14 regions in 13 unique snRNAs with a deviation from the median number of SNVs across the full gene of at least 20% (Fig. 4 and Extended Data Table 5). We repeated our de novo variant enrichment test in regions with at least two de novo variants in undiagnosed NDD probands (n = 3). Only the conserved region in RNU4-2 was significant (Fisher’s P = 9.31 × 10−11; undiagnosed NDD probands n = 37, non-NDD probands n = 0; all other tests Fisher’s P > 0.25).
Extended Data Table 5.
Sub-regions of snRNA genes identified as depleted of variation and burden testing results in these regions of variants in 5,426 undiagnosed NDD probands against 4,776 non-NDD probands.
Third, we looked for recurrent de novo variants in undiagnosed GEL NDD probands that were absent from diagnosed NDD probands, non-NDD probands and population controls. There are three de novo variants with an allele count greater than or equal to three in the GEL undiagnosed NDD cohort, two in RNU1-2 (chr. 1: 16895992:C:T and chr. 1: 16896002:A:G) and one in RNVU1-7 (chr. 1: 148038767:G:A). However, all three variants are observed at comparable frequencies in non-NDD probands and are also found at relatively high frequencies in population controls (all variants’ allele frequency greater than 0.5% in gnomAD v.4.0).
Finally, given that variants in RNU12 and RNU4ATAC are associated with recessive disease, we also tested for enrichment of homozygous and compound heterozygous variants in undiagnosed NDD probands compared to non-NDD probands. We observed a nominal enrichment of variants in RNU12 (11 probands with NDD versus two non-NDD probands; Fisher’s P = 0.026), but this was not significant after correcting for multiple testing. We did not identify any significant associations across any other snRNA or when restricted to variants in our identified depleted regions (Extended Data Tables 4 and 5).
RNU4-2 is highly expressed in the brain
Humans have many genes that encode the U4 snRNA, although only two of these, RNU4-2 and RNU4-1, are highly expressed in the human brain (Supplementary Table 3). RNU4-2 and RNU4-1 are contiguous on chr. 12, both 141 bp long and highly homologous, differing by four nucleotides (97.2% similarity). RNU4-1 has a similar depletion of variants in population cohorts in the centre of the RNA (Fig. 4), however, we do not observe an enrichment of variants in GEL in this central region (Extended Data Fig. 2b). There is a variant equivalent to our highly recurrent variant in RNU4-1 that is observed in six individuals in the UK Biobank dataset. There are no consistent phenotypes recorded in these six individuals (Supplementary Table 1).
To investigate the reason for variants in RNU4-2, but not RNU4-1, causing NDD, we analysed the expression of both RNU4-1 and RNU4-2 in the brain. First, we analysed the expression patterns of both genes across many developmental stages using bulk RNA-seq data from 176 human prefrontal cortex samples in BrainVar23. The expression of RNU4-1 and RNU4-2 is tightly correlated (Supplementary Fig. 5), however, RNU4-2 is consistently expressed at a significantly higher level than RNU4-1 (Fig. 5a). Second, we assessed chromatin accessibility in the chromosome 12 locus containing both RNU4-1 and RNU4-2 using assay for transposase-accessible chromatin with sequencing (ATAC-seq) data from two human prenatal prefrontal cortex samples. These data show a notable chromatin accessibility signal around RNU4-2 and a much lower signal surrounding RNU4-1, again consistent with much higher expression of RNU4-2 in the brain (Fig. 5b). Overall, these data support the role of RNU4-2 as the main U4 transcript in the brain.
Discussion
Here we identified a highly constrained 18 bp region of RNU4-2 in which variants cause a severe neurodevelopmental phenotype. We estimated that variants in this region could explain 0.4% of all NDD. As a comparison, the largest proportion of NDD explained by a single gene in 13,449 individuals in the DDD cohort1 was 0.68% for ANKRD11, although we acknowledge that some genes and recognizable syndromes with longstanding associations (for example, MECP2, SCN1A, UBE3A) will be depleted from this cohort. The proportion of NDD explained by variants in RNU4-2 would be even higher if restricted to individuals with severe, syndromic NDD. This is consistent with the much lower rate of RNU4-2 variants in cohorts recruited primarily for autism spectrum disorder (for example, three out of 7,149; 0.042% across the Simons Simplex Collection (SSC)24, SPARK25 and MSSNG26).
Our findings underscore the value of large-scale genome sequencing datasets and the importance of considering variants outside protein-coding regions. This region, despite being within a highly conserved non-coding exon, is not included in commercially available clinical exome sequencing capture, which primarily targets protein-coding exons5. This discovery was empowered by the availability of increasingly large genome sequencing datasets from families affected by genetic disease around the world. Indeed, the scale and accessibility of the Genomics England dataset facilitated both the work reported here and a parallel discovery by an independent group27. The detailed phenotypic characterization included here will help prioritize individuals for targeted sequencing of RNU4-2.
For all individuals in whom we were able to confidently determine the parent of origin of the identified RNU4-2 variants (n = 54), the variants were observed to be on the maternal allele. This is in contrast to the well-established paternal bias observed for de novo small mutations28. The absence of any paternally derived variants in our cohort may be a consequence of negative selection in the male germline if RNU4-2 plays an important role during spermatogenesis. It may also be a consequence of imprinting, for example if variants on a highly expressed paternal allele are embryonic lethal, whereas those on a weakly expressed maternal allele are survivable but result in NDD. Further work is needed to test these hypotheses.
Most individuals in our cohort have the highly recurrent n.64_65insT variant. It is observed in 46 of 8,841 undiagnosed NDD probands in GEL. By contrast, the most recurrent protein-coding variant in a dataset of 31,058 individuals with developmental disorders29 is observed in 36 individuals (0.12%; GRCh38:chr. 11: 66211206:C:T; PACS1:p.Arg203Trp). The reasons for this high recurrence are unclear, but it could be driven by either a high endogenous mutation rate or positive selection in the germline. The latter has previously been described for so-called selfish mutations associated with paternal age effects30. One hypothesis is that germline selection is acting to increase the apparent frequency of the n.64_65insT variant, for example through meiotic drive effects or by accelerating oocyte maturation31. We do not see an association with maternal age for individuals with n.64_65insT in GEL (mean 30.2 compared to 29.7 across other NDD probands; Extended Data Fig. 4).
Alternatively, recurrence may be driven by a high mutation rate. This is consistent with the observed open chromatin state and very high expression of RNU4-2 (Fig. 5), as high levels of transcription are known to be correlated with increased mutation rate32. Hypermutability of short non-coding RNA genes, including snRNAs, has previously been documented33,34. Consistent with this, a high variant density is observed across RNU4-2 in the UK Biobank (Extended Data Fig. 5). Despite the high number of variants in RNU4-2 in the UK Biobank, there are no individuals with homozygous variants and all observed variants are very rare (maximum allele frequency 0.025%), consistent with strong negative selection acting on variants across RNU4-2. A high overall mutational burden does not, however, explain the high recurrence of this specific single base insertion. Local formation of secondary structure and base stacking is a known driver of biased small insertion mutations35. The high propensity of this region to form secondary structure when single-stranded may drive creation of this specific insertion. It is also possible that this variant is more compatible with live birth relative to other comparably recurrent mutations in the critical region.
The n.64_65insT variant is one of six single base insertions that we observe in the 18 bp critical region in individuals with NDD, in a total of 100 individuals across cohorts. By contrast, single base insertions are very rare in population cohorts. Although we do also observe some SNVs in this region in individuals with NDD, our initial data suggest these SNVs may result in a milder phenotype. However, given this observation is based on only four fully phenotyped individuals, it needs to be confirmed in larger cohorts. Saturation mutagenesis experiments that test the impact of different length insertions and deletions as well as SNVs across the length of RNU4-2 will be important to understand the spectrum of deleterious mutations. The high proportion of single base insertion variants in individuals with NDD may indicate that steric conformational changes are needed to disrupt RNU4-2 function. Specifically, insertion of a single base into the T-loop or stem III regions may destabilize the U4/U6 interaction and/or alter the positioning of the U6 ACAGAGA sequence and potentially disrupt the correct loading of the 5′ splice site into the fully assembled spliceosome. This proposed effect is supported by the observed systematic disruption to 5′ splice-site use observed in RNA-seq data from five individuals with RNU4-2 variants. In particular, our observation that the +3, +4 and +5 positions of the 5′ splice site, which directly pair with the U6 ACAGAGA sequence, shift away from consensus at sites with increased use in individuals with RNU4-2 variants provides functional evidence that these variants disrupt accurate splice-site recognition during spliceosome activation. Further, variants in U6 snRNA and protein components of the spliceosome situated in the proximity of our RNU4-2 variants have recently been shown to alter 5′-splice-site selection by changing the preference for nucleotides that pair with the U6 snRNA ACAGAGA, consistent with this region being involved in subtle regulation of alternative splicing36,37.
Whereas two other snRNA genes, RNU12 and RNU4ATAC, have been linked to different phenotypes, both are components of the minor spliceosome and are associated with recessive disorders. By contrast, here we implicate variants in a major spliceosome snRNA in a dominant disorder. We further explored whether other snRNA genes could explain undiagnosed cases. We did not find any other snRNAs, or constrained subregions of snRNAs, that were significantly enriched for either de novo variants or recessively inherited variants in NDD cases when compared with non-NDD probands. We note, however, that these tests have low power given the small size of the genes and regions (mean 139.5 and 28.1 bp, respectively). Variants in the regions we delineated should also be investigated in other disease cohorts.
In summary, we identify RNU4-2 as a syndromic NDD gene, explaining roughly 0.4% of all individuals with NDD. Including RNU4-2 in standard clinical workflows will end the diagnostic odyssey for thousands of patients with NDD worldwide, and knowledge of the gene responsible for disease will enable investigation of potential treatments for these individuals.
Methods
Categorizing participants in Genomics England
We defined four groups of individuals in the Genomics England 100,000 genomes project v.18 dataset. Individuals with NDD (n = 13,812) were defined as those with Human Phenotype Ontology (HPO)43 and/or ICD-10 codes44 for global developmental delay (HP:0001263, HP:0012736, HP:0011344, HP:0011343, HP:0011342; ICD-10: R62, F80, F81, F82, F83, F88, F89), intellectual disability (HPO: HP:0001249, HP:0002187, HP:0010864, HP:0002342, HP:0001256, HP:0006887, HP:0006889; ICD-10: F70, F71, F72, F73, F78, F79) and/or autism (HPO: HP:0000717, HP:0000729, HP:0000753; ICD-10: F84), or who were recruited to GEL with a normalized specific disease of intellectual disability. NDD individuals were classified as diagnosed (n = 3,424) if they were marked as solved or partially solved in the gmc_exit_questionnaire table or had an entry in the submitted_diagnostic_discovery table in GEL Labkey. The remaining 10,388 NDD individuals formed our undiagnosed NDD cohort. Of these, 8,841 are probands. We also identified 21,817 probands without NDD phenotypes (that is, without the HPO and ICD-10 codes detailed above) and 33,122 individuals reported to be unaffected. Our defined cohorts exclude anyone who has subsequently removed consent.
For most of our analyses, we used two previously defined datasets within GEL. First, a high-confidence set of de novo variants from 13,949 trios13. As of 13 March 2024, this dataset includes 12,554 probands with consent: 5,426 probands with undiagnosed NDD, 2,352 with diagnosed NDD and 4,776 non-NDD probands. De novo variants were filtered to those that pass the stringent_filter. Second, an aggregated variant call set (aggV2)17 that contains 29,850 probands: 7,519 undiagnosed NDD, 2,903 diagnosed NDD and 19,428 non-NDD.
Identifying variants in population datasets
We used data from gnomAD v.4.0 (76,215 genome-sequenced individuals)14, All of Us15 (accessed through the publicly available data browser https://databrowser.researchallofus.org/; 245,400 genomes as of 28 March 2023) and the UK Biobank (490,640 genome-sequenced individuals)16.
Expanded NDD cohort and clinical data collection
Clinical data were collected from research participants after obtaining written informed consent from their parents or legal guardians, with the study approved by the local regulatory authority. Samples were collected largely through personal communications (N.W., A.O.’D.-L., D.G.M.), as variants in this gene have not been prioritized in analysis. On entry into Matchmaker Exchange using the seqr45 node, one match was made with GeneMatcher46 (C.D.). N.W. reviewed the National Health Service Genome Medicine Service (NHS GMS; v.3) dataset. Samples from NHS GMS were manually checked to remove duplicates with GEL. A.O.’D.-L. and S.S. reviewed the Broad Centre for Mendelian Genomics and the GREGoR consortium datasets and contacted the Undiagnosed Disease Network (UDN) through M.T.W. D.G.M. contacted extra local collaborators. Clinical collaboration requests were submitted to GEL to contact recruiting clinicans and collect extra phenotypic information. Clinical data were collected and summarized for features seen across the cohort. Written consent was obtained to publish all photographs included in Fig. 3.
We also searched 7,149 trios with autism spectrum disorder and 4,180 sibling control trios from three cohorts: SSC (2,383 cases; 1,938 controls)24, SPARK (3,144 cases; 2,190 controls)25 and MSSNG (1,622 cases; 52 controls)26.
Generating 1,000 random intergenic sequences
Using the bedtools (v.2.31.0) subtractBed function47 we retrieved regions on chromosome 12 that do not overlap with RefSeq transcripts aligned by the National Center for Biotechnology Information. We further removed regions within 10 kbp of an annotated transcript and restricted the remaining regions to those at least 141 bp in length (n = 611). We further removed regions overlapping the centromere. We then generated a set of 1,000 random sequences from each intergenic region and then randomly selected 1,000 non-overlapping regions from these.
Identifying human snRNA genes
We extracted genes with snRNA biotypes from Ensembl genome annotation v.111. We filtered out known pseudogenes (that is, with gene names marked with ‘P’ or identified through manual curation). For each remaining gene, we used BrainVar23 RNA-seq expression data to calculate the mean counts per million (CPM) value across the gene. We selected only genes with mean CPM value across all BrainVar samples greater than five, resulting in a dataset of 28 snRNA genes.
Assessing variant depletion
Given the high mutability of RNU4-2 and other snRNA genes, coupled with strong selection pressures on variants, we did not think that conventional mutational models would be well calibrated to assess variant depletion. Instead, we devised a sliding window-based strategy to identify regions within snRNA genes that are relatively depleted of SNVs. We split genes into 18 bp sliding windows (chosen as it is the size of the region defined in RNU4-2) and tallied the number of SNVs observed in UK Biobank 500k genome sequencing data within that window, divided by the total number of possible SNVs (that is, 18 × 3). The proportion of possible SNVs observed in each window was normalized to the median across all sliding windows in that gene (that is, the per-gene median proportion observed was subtracted from each value). Depleted regions were defined as those spanning windows with a deviation from the per-gene median of at least 20%, that is, normalized observed proportion of possible SNVs less than −0.2. The same calculation was performed on 1,000 randomly selected 141 bp intergenic regions on chr. 12 (above). A one-way approximative (Monte Carlo) Fisher–Pitman test was conducted to show the median observed proportion of possible SNVs was significantly higher for RNU4-1 and RNU4-2 compared to the distribution in the 1,000 random regions.
RNA-seq of individuals with RNU4-2 variants
Blood was collected from a subset of 100,000 Genomes Project probands in PaxGene tubes to preserve RNA at the time of recruitment. RNA was extracted, depleted of globin and ribosomal RNAs, and subjected to sequencing by Illumina using 100 bp paired-end reads, with a mean of 102 million mapped reads per individual. Alignment was performed using Illumina’s DRAGEN pipeline (v.3.8.4). FRASER2 (ref. 40) and OUTRIDER48 were used to detect abnormal splicing events and expression differences with samples run in batches of 500, both run using the DROP pipeline49 (v.1.3.3). Significant outlier events were identified as those with a false discovery rate adjusted P < 0.05. The number of outlier events detected in five individuals with RNU4-2 variants was compared to two different control sets: (1) ten individuals matched (two per RNU4-2 individual) on genetic ancestry, sex and approximate age at consent and that did not have any NDD phenotypes; (2) 378 individuals with more than 60 million mapped reads, age below 18 years and with no NDD phenotypes. Sequence logo plots in Fig. 2 and Supplementary Fig. 6 were created in R (v.4.0.2) using the ggseqlogo package.
Assessing the sensitivity to detect the n.64_65insT variant in exome sequencing data
We used a Python script that uses samtools mpileup to retrieve the coverage and base change at the n.64_65 critical locus to identify putative carriers of the insertion (https://github.com/francois-lecoquierre/genomics_shortcuts/blob/main/find_RNU4-2_recurrent_variant.py). This was applied to exome sequencing data (32,681 CRAM files from probands and parents) from the DDD cohort.
Analysing RNU4-2 and RNU4-1 expression
We used the BrainVar23 dataset to assess patterns of whole-gene expression of RNU4-2 and RNU4-1 in the human cortex across prenatal and postnatal development. This dataset includes bulk-tissue RNA-seq data from 176 de-identified postmortem samples of the dorsolateral prefrontal cortex (DLPFC, n = 167 older than ten postconception weeks) or frontal cerebral wall (n = 9 younger than ten postconception weeks), ranging from six postconception weeks to 20 years of age. The 100 bp paired-read RNA-seq data from BrainVar were aligned to the GRCh38.p12 human genome using STAR aligner50 (v.2.4.2a), and gene-level read counts for GENCODE v.31 human gene definitions were calculated with DEXSeq51 (v.1.50.0) and normalized to CPM52.
Prenatal prefrontal cortex ATAC-seq data
Methods of generating ATAC-seq have been described previously53, which is the source of the data shown here. Briefly, fresh prenatal (18 and 19 gestational weeks) brain samples were dissected within 2 h of elective termination to extract the entire telencephalic wall, from the ventricular zone to the meninges. Intact nuclei were isolated by manually douncing the tissue on ice using a loose pestle douncer then lysed on ice for 10 min by adding a solution with 0.1% NP-40. Nuclei were spun down by centrifugation then resuspended and exposed to Tagmentation Enzyme for 30 min at 37 °C. The ATAC-seq library was generated using Illumina barcode oligos, amplified by high-fidelity PCR and sequenced on the Illumina HiSeq 2500 using paired-end sequencing. Reads were aligned to GRCh38 using the ENCODE ATAC-seq pipeline with default parameters54. A UCSC Browser track of per nucleotide ATAC-seq counts was used to assess the region around RNU4-2 and RNU4-1.
Burden testing and statistical analysis
The enrichment of both de novo variants and homozygous and/or compound heterozygous variants across each of 28 snRNA genes and 14 constrained subregions was assessed in undiagnosed NDD probands compared to non-NDD probands. De novo variants were identified from GEL’s high-confidence de novo callset. Homozygous and compound heterozygous variants were identified from genotyping data in individual participants’ VCF files. Homozygous variants were identified as variants that are heterozygous in both parents and homozygous in offspring. To identify compound heterozygous variants in a gene or region, we assessed whether there are greater than or equal to 1 paternally inherited heterozygous variant and greater than or equal to one maternally inherited heterozygous variant in the offspring. Multiallelic sites were excluded from this analysis. Homozygous variants and compound heterozygous variants were grouped together for burden testing. ORs and associated P values were calculated using a two-sided Fisher’s exact test in R. A P value threshold of 0.0014 was used to assess statistical significance as a Bonferroni correction accounting for 35 tests.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at 10.1038/s41586-024-07773-7.
Supplementary information
Acknowledgements
We thank all the incredible patients and families who contributed to our study. We thank P. O’Donovan, M. Sato, M. Hoti, J. Yang, C. Smith and N. Elkhateeb from Genomics England for their help with clinician collaboration and Airlock requests. We thank J. Barrett for his statistical advice. Y.C. is supported by a studentship from Novo Nordisk. N.W. is supported by a Sir Henry Dale Fellowship jointly funded by the Wellcome Trust and the Royal Society (grant no. 220134/Z/20/Z). G.L. was supported by the Fonds de recherche en santé du Québec (FRQS), V.S.G. by grant no. NIAMS K23AR083505 and S.L.S. by a fellowship from Manton Center for Orphan Disease Research at Boston Children’s Hospital. The research was supported by grant funding from Novo Nordisk (to N.W.), the Rosetrees Trust (grant no. PGL19-2/10025 to N.W.), the Simons Foundation Autism Research Initiative (grant no. SFARI 736613 to S.J.S.), the National Institute of Mental Health (grant no. R01 MH129751 to S.J.S.), the HDR-UK Molecules to Health Records Driver Programme (S.J.S.), the Australian National Health and Medical Research Council (grant nos. 1164479, 1155244, GNT2001513), the Australian National Health and Medical Research Council Centre for Research Excellence in Neurocognitive Disorders (grant no. NHMRC-RG172296), the Australian Medical Research Future Fund (grant nos. MRF2007677 and GHFM76747), NHGRI (grant nos. U01HG011762, U01HG011745, U24HG011746, UM1HG008900, U01HG011755, R21HG012397 and R01HG009141), NINDS (grant nos. U01NS134358, U01NS106845, U54NS115052 and 1U24NS131172), the Chan Zuckerberg Initiative Donor-Advised Fund at the Silicon Valley Community Foundation (funder 10.13039/100014989) grant nos. 2019-199278, 2020-224274 (10.37921/236582yuakxy), the US Department of Defense Congressionally Directed Medical Research Programs (grant no. PR170396), the National Institute of Neurological Disorders and Stroke of the National Institutes of Health (grant nos. U01HG007709, U01HG007942 and U01HG010217), the National Institute for Health Research Moorfields Biomedical Research Centre and the Clinical Translational Core of the Baylor College of Medicine IDDRC (grant no. P50HD103555) from the Eunice Kennedy Shriver National Institute of Child Health and Human Development. The content is solely the responsibility of the authors and does not necessarily represent the official views of the Eunice Kennedy Shriver National Institute of Child Health and Human Development or the National Institutes of Health. The Rare Disease Flagship acknowledges financial support from the Royal Children’s Hospital Foundation, the Murdoch Children’s Research Institute and the Harbig Foundation. Massimo’s Mission acknowledges funding support from the Australian Government Department of Health and Aged Care (grant no. EPCD000034). Sequencing and analysis were supported by the Deutsche Forschungsgemeinschaft (DFG) Research Infrastructure West German Genome Center (project no. 407493903) as part of the Next Generation Sequencing Competence Network (project no. 423957469). Short-read genome sequencing was carried out at the production site Cologne (Cologne Center for Genomics). CD received the DFG grant no. 458099954 as part of the DFG Sequencing call no. 3. We also thank S. Kaya for technical assistance and C. Schröder for bioinformatic analysis. S.M.S. is supported by the Epilepsy Society. This research was made possible through access to data in the National Genomic Research Library, which is managed by Genomics England Limited (a wholly owned company of the Department of Health and Social Care). The National Genomic Research Library holds data provided by patients and collected by the NHS as part of their care and data collected as part of their participation in research. The National Genomic Research Library is funded by the National Institute for Health Research and NHS England. The Wellcome Trust, Cancer Research UK and the Medical Research Council have also funded research infrastructure.
Extended data figures and tables
Author contributions
Y.C., R.D., H.C.K., A.L., S.L.S., S.W., J.L., G.L., A.C.M.-G., V.S.G., J.M., J.M.E., E.D., E.N.D.’S. and S.D. analysed data and contributed to the figures and tables in the paper. J.L.R., E.M.-P., S.M.F., D.B., C.D., D.G.M., J.M.M.H., S.J.S., A.O.’D.-L. and N.W. collected data, provided funding and supervised the work. All other authors provided clinical and/or genomic data and are listed alphabetically. Y.C. and N.W. wrote the paper with input from all the authors.
Peer review
Peer review information
Nature thanks Maurice Swanson, Henrike Heyne and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Data availability
Research on the de-identified patient data used in this publication from the Genomics England 100,000 Genomes Project and the NHS GMS dataset can be carried out in the Genomics England Research Environment subject to a collaborative agreement that adheres to patient-led governance. All interested readers will be able to access the data in the same manner that the authors accessed the data. For more information about accessing the data, interested readers may contact research-network@genomicsengland.co.uk or access the relevant information on the Genomics England website: https://www.genomicsengland.co.uk/research. Genomic and phenotypic data from the GREGoR consortium (including the RGP cohort) and the UDN are available through the dbGaP accession numbers phs003047.v1.p1 and phs001232.v5.p2, respectively, with at least annual data releases. Access is managed by a data access committee designated by dbGaP and is based on intended use of the requester and allowed use of the data submitter as defined by consent codes. The BrainVar data are available through the PsychENCODE Knowledge Portal: syn21557948 on Synapse.org (https://www.synapse.org/#!Synapse:syn4921369). Raw ATAC-seq and chromatin immunoprecipitation with seqencing data are available on dbGAP: accession phs002033.v1.p1.
Code availability
Analysis of the 100,000 genomes project and NHS GMS data was performed inside the Genomics England Research Environment. We are happy to share the location of all code to registered users. Code used for analyses outside Genomics England is available at GitHub: https://github.com/Computational-Rare-Disease-Genomics-WHG/RNU4-2 and https://github.com/francois-lecoquierre/genomics_shortcuts/blob/main/find_RNU4-2_recurrent_variant.py.
Competing interests
N.W. receives research funding from Novo Nordisk and has consulted for ArgoBio studio. S.J.S. receives research funding from BioMarin Pharmaceutical. A.O.’D.-L. is on the scientific advisory board for Congenica, was a paid consultant for Tome Biosciences, Ono Pharma USA Inc. and at present for Addition Therapeutics, and received reagents from PacBio to support rare disease research. H.L.R. has received support from Illumina and Microsoft to support rare disease gene discovery and diagnosis. M.W. has consulted for Illumina and Sanofi and received speaking honoraria from Illumina and GeneDx. S.B.M. is an advisor for BioMarin, Myome and Tenaya Therapeutics. S.M.S. has received honoraria for educational events or advisory boards from Angelini Pharma, Biocodex, Eisai, Zogenix/UCB and institutional contributions for advisory boards, educational events or consultancy work from Eisai, Jazz/GW Pharma, Stoke Therapeutics, Takeda, UCB and Zogenix. The Department of Molecular and Human Genetics at Baylor College of Medicine receives revenue from clinical genetic testing completed at Baylor Genetics Laboratories. J.M.M.H. is a full-time employee of Novo Nordisk and holds shares in Novo Nordisk A/S. D.G.M. is a paid consultant for GlaxoSmithKline, Insitro and Overtone Therapeutics and receives research support from Microsoft. All other authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally: Ruebena Dawes, Hyung Chul Kim, Alicia Ljungdahl, Sarah L. Stenton, Susan Walker
Extended data
is available for this paper at 10.1038/s41586-024-07773-7.
Supplementary information
The online version contains supplementary material available at 10.1038/s41586-024-07773-7.
References
- 1.Wright, C. F. et al. Genomic diagnosis of rare pediatric disease in the United Kingdom and Ireland. N. Engl. J. Med.388, 1559–1571 (2023). 10.1056/NEJMoa2209046 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Nguyen, T. H. D. et al. The architecture of the spliceosomal U4/U6.U5 tri-snRNP. Nature523, 47–52 (2015). 10.1038/nature14548 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Ellingford, J. M. et al. Recommendations for clinical interpretation of variants found in non-coding regions of the genome. Genome Med.14, 73 (2022). 10.1186/s13073-022-01073-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.100,000 Genomes Project Pilot Investigators. 100,000 Genomes Pilot on rare-disease diagnosis in health care—preliminary report. N. Engl. J. Med.385, 1868–1880 (2021). 10.1056/NEJMoa2035790 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Aspden, J. L., Wallace, E. W. J. & Whiffin, N. Not all exons are protein coding: addressing a common misconception. Cell Genom.3, 100296 (2023). 10.1016/j.xgen.2023.100296 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Nemeth, K., Bayraktar, R., Ferracin, M. & Calin, G. A. Non-coding RNAs in disease: from mechanisms to therapeutics. Nat. Rev. Genet.25, 211–232 (2024). 10.1038/s41576-023-00662-1 [DOI] [PubMed] [Google Scholar]
- 7.Will, C. L. & Lührmann, R. Spliceosome structure and function. Cold Spring Harb. Perspect. Biol.3, a003707 (2011). 10.1101/cshperspect.a003707 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Guthrie, C. & Patterson, B. Spliceosomal snRNAs. Annu. Rev. Genet.22, 387–419 (1988). 10.1146/annurev.ge.22.120188.002131 [DOI] [PubMed] [Google Scholar]
- 9.Elsaid, M. F. et al. Mutation in noncoding RNA RNU12 causes early onset cerebellar ataxia. Ann. Neurol.81, 68–78 (2017). 10.1002/ana.24826 [DOI] [PubMed] [Google Scholar]
- 10.Edery, P. et al. Association of TALS developmental disorder with defect in minor splicing component U4atac snRNA. Science332, 240–243 (2011). 10.1126/science.1202205 [DOI] [PubMed] [Google Scholar]
- 11.Farach, L. S. et al. The expanding phenotype of RNU4ATAC pathogenic variants to Lowry Wood syndrome. Am. J. Med. Genet. A176, 465–469 (2018). 10.1002/ajmg.a.38581 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Merico, D. et al. Compound heterozygous mutations in the noncoding RNU4ATAC cause Roifman Syndrome by disrupting minor intron splicing. Nat. Commun.6, 8718 (2015). 10.1038/ncomms9718 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.De Novo Variant Research Dataset—Research Environment User Guide (Genomics England, 2023); https://re-docs.genomicsengland.co.uk/de_novo_data/.
- 14.Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature581, 434–443 (2020). 10.1038/s41586-020-2308-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.All of Us Research Program Investigators. The ‘All of Us’ Research Program. N. Engl. J. Med.381, 668–676 (2019). 10.1056/NEJMsr1809937 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Sudlow, C. et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med.12, e1001779 (2015). 10.1371/journal.pmed.1001779 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Aggregated Variant Calls (AggV2)—Research Environment User Guide (Genomics England, 2024); https://re-docs.genomicsengland.co.uk/aggv2/.
- 18.Charenton, C., Wilkinson, M. E. & Nagai, K. Mechanism of 5′ splice site transfer for human spliceosome activation. Science364, 362–367 (2019). 10.1126/science.aax3289 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Wilkinson, M. E., Charenton, C. & Nagai, K. RNA splicing by the spliceosome. Annu. Rev. Biochem.89, 359–388 (2020). 10.1146/annurev-biochem-091719-064225 [DOI] [PubMed] [Google Scholar]
- 20.Griffin, C. & Saint-Jeannet, J.-P. Spliceosomopathies: diseases and mechanisms. Dev. Dyn.249, 1038–1046 (2020). 10.1002/dvdy.214 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Thormann, A. et al. Flexible and scalable diagnostic filtering of genomic variants using G2P with Ensembl VEP. Nat. Commun.10, 2373 (2019). 10.1038/s41467-019-10016-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Fu, J. M. et al. Rare coding variation provides insight into the genetic architecture and phenotypic context of autism. Nat. Genet.54, 1320–1331 (2022). 10.1038/s41588-022-01104-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Werling, D. M. et al. Whole-genome and RNA sequencing reveal variation and transcriptomic coordination in the developing human prefrontal cortex. Cell Rep.31, 107489 (2020). 10.1016/j.celrep.2020.03.053 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Fischbach, G. D. & Lord, C. The Simons Simplex Collection: a resource for identification of autism genetic risk factors. Neuron68, 192–195 (2010). 10.1016/j.neuron.2010.10.006 [DOI] [PubMed] [Google Scholar]
- 25.SPARK Consortium. SPARK: a US cohort of 50,000 families to accelerate autism research. Neuron97, 488–493 (2018). 10.1016/j.neuron.2018.01.015 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.C Yuen, R. K. et al. Whole genome sequencing resource identifies 18 new candidate genes for autism spectrum disorder. Nat. Neurosci.20, 602–611 (2017). 10.1038/nn.4524 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Greene, D. et al. Mutations in the U4 snRNA gene RNU4-2 cause one of the most prevalent monogenic neurodevelopmental disorders. Nat. Med.10.1038/s41591-024-03085-5 (2024). 10.1038/s41591-024-03085-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Jónsson, H. et al. Parental influence on human germline de novo mutations in 1,548 trios from Iceland. Nature549, 519–522 (2017). 10.1038/nature24018 [DOI] [PubMed] [Google Scholar]
- 29.Kaplanis, J. et al. Evidence for 28 genetic disorders discovered by combining healthcare and research data. Nature586, 757–762 (2020). 10.1038/s41586-020-2832-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Wood, K. A. & Goriely, A. The impact of paternal age on new mutations and disease in the next generation. Fertil. Steril.118, 1001–1012 (2022). 10.1016/j.fertnstert.2022.10.017 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Clark, F. E. & Akera, T. Unravelling the mystery of female meiotic drive: where we are. Open Biol.11, 210074 (2021). 10.1098/rsob.210074 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Jinks-Robertson, S. & Bhagwat, A. S. Transcription-associated mutagenesis. Annu. Rev. Genet.48, 341–359 (2014). 10.1146/annurev-genet-120213-092015 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Seplyarskiy, V. et al. A mutation rate model at the basepair resolution identifies the mutagenic effect of polymerase III transcription. Nat. Genet.55, 2235–2242 (2023). 10.1038/s41588-023-01562-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Thornlow, B. P. et al. Transfer RNA genes experience exceptionally elevated mutation rates. Proc. Natl Acad. Sci. USA115, 8996–9001 (2018). 10.1073/pnas.1801240115 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Pray, L. DNA replication and causes of mutation. Nat. Education1, 214 (2008). [Google Scholar]
- 36.Zahler, A. M. et al. SNRP-27, the C. elegans homolog of the tri-snRNP 27 K protein, has a role in 5′ splice site positioning in the spliceosome. RNA24, 1314–1325 (2018). 10.1261/rna.066878.118 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Parker, M. T. et al. m6A modification of U6 snRNA modulates usage of two major classes of pre-mRNA 5′ splice site. eLife11, e78808 (2022). 10.7554/eLife.78808 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Wersig, C. & Bindereif, A. Conserved domains of human U4 snRNA required for snRNP and spliceosome assembly. Nucleic Acids Res.18, 6223–6229 (1990). 10.1093/nar/18.21.6223 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Burley, S. K. et al. RCSB Protein Data Bank (RCSB.org): delivery of experimentally-determined PDB structures alongside one million computed structure models of proteins from artificial intelligence/machine learning. Nucleic Acids Res.51, D488–D508 (2023). 10.1093/nar/gkac1077 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Scheller, I. F., Lutz, K., Mertes, C., Yépez, V. A. & Gagneur, J. Improved detection of aberrant splicing with FRASER 2.0 and the intron Jaccard index. Am. J. Hum. Genet.110, 2056–2067 (2023). 10.1016/j.ajhg.2023.10.014 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Morales, J. et al. A joint NCBI and EMBL-EBI transcript set for clinical genomics and research. Nature604, 310–315 (2022). 10.1038/s41586-022-04558-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Kang, H. J. et al. Spatio-temporal transcriptome of the human brain. Nature478, 483–489 (2011). 10.1038/nature10523 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Gargano, M. A. et al. The Human Phenotype Ontology in 2024: phenotypes around the world. Nucleic Acids Res.52, D1333–D1346 (2024). 10.1093/nar/gkad1005 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.World Health Organization. International Statistical Classification of Diseases and Health-Related Problems 10th Revision (ICD-10); https://icd.who.int/browse10/2019/en (WHO, 2019).
- 45.Pais, L. S. et al. seqr: a web-based analysis and collaboration tool for rare disease genomics. Hum. Mutat.43, 698–707 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Sobreira, N., Schiettecatte, F., Valle, D. & Hamosh, A. GeneMatcher: a matching tool for connecting investigators with an interest in the same gene. Hum. Mutat.36, 928–930 (2015). 10.1002/humu.22844 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics26, 841–842 (2010). 10.1093/bioinformatics/btq033 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Brechtmann, F. et al. OUTRIDER: a statistical method for detecting aberrantly expressed genes in RNA sequencing data. Am. J. Hum. Genet.103, 907–917 (2018). 10.1016/j.ajhg.2018.10.025 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Yépez, V. A. et al. Detection of aberrant gene expression events in RNA sequencing data. Nat. Protoc.16, 1276–1296 (2021). 10.1038/s41596-020-00462-5 [DOI] [PubMed] [Google Scholar]
- 50.Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics29, 15–21 (2013). 10.1093/bioinformatics/bts635 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Anders, S., Reyes, A. & Huber, W. Detecting differential usage of exons from RNA-seq data. Genome Res.22, 2008–2017 (2012). 10.1101/gr.133744.111 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Bedre, R. Reneshbedre/bioinfokit: Bioinformatics data analysis and visualization toolkit. Zenodo10.5281/zenodo.3965241 (2020).
- 53.Markenscoff-Papadimitriou, E. et al. A chromatin accessibility atlas of the developing human telencephalon. Cell182, 754–769.e18 (2020). 10.1016/j.cell.2020.06.002 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Lee, J. et al. Kundajelab/atac_dnase_pipelines: v.0.3.0. Zenodo10.5281/zenodo.156534 (2016).
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
Research on the de-identified patient data used in this publication from the Genomics England 100,000 Genomes Project and the NHS GMS dataset can be carried out in the Genomics England Research Environment subject to a collaborative agreement that adheres to patient-led governance. All interested readers will be able to access the data in the same manner that the authors accessed the data. For more information about accessing the data, interested readers may contact research-network@genomicsengland.co.uk or access the relevant information on the Genomics England website: https://www.genomicsengland.co.uk/research. Genomic and phenotypic data from the GREGoR consortium (including the RGP cohort) and the UDN are available through the dbGaP accession numbers phs003047.v1.p1 and phs001232.v5.p2, respectively, with at least annual data releases. Access is managed by a data access committee designated by dbGaP and is based on intended use of the requester and allowed use of the data submitter as defined by consent codes. The BrainVar data are available through the PsychENCODE Knowledge Portal: syn21557948 on Synapse.org (https://www.synapse.org/#!Synapse:syn4921369). Raw ATAC-seq and chromatin immunoprecipitation with seqencing data are available on dbGAP: accession phs002033.v1.p1.
Analysis of the 100,000 genomes project and NHS GMS data was performed inside the Genomics England Research Environment. We are happy to share the location of all code to registered users. Code used for analyses outside Genomics England is available at GitHub: https://github.com/Computational-Rare-Disease-Genomics-WHG/RNU4-2 and https://github.com/francois-lecoquierre/genomics_shortcuts/blob/main/find_RNU4-2_recurrent_variant.py.