Diagnosing missed cases of spinal muscular atrophy in genome, exome, and panel sequencing datasets

Ben Weisburd; Rakshya Sharma; Villem Pata; Tiia Reimand; Vijay S Ganesh; Christina Austin-Tse; Ikeoluwa Osei-Owusu; Emily O’Heir; Melanie O’Leary; Lynn Pais; Seth A Stafki; Audrey L Daugherty; Chiara Folland; Stojan Perić; Nagia Fahmy; Bjarne Udd; Magda Horakova; Anna Łusakowska; Rajanna Manoj; Atchayaram Nalini; Veronika Karcagi; Kiran Polavarapu; Hanns Lochmüller; Rita Horvath; Carsten G Bönnemann; Sandra Donkervoort; Göknur Haliloğlu; Ozlem Herguner; Peter B Kang; Gianina Ravenscroft; Nigel Laing; Hamish S Scott; Ana Töpf; Volker Straub; Sander Pajusalu; Katrin Õunap; Grace Tiao; Heidi L Rehm; Anne O’Donnell-Luria

doi:10.1101/2024.02.11.24302646

This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2024 Jun 29:2024.02.11.24302646. Originally published 2024 Feb 13. [Version 3] doi: 10.1101/2024.02.11.24302646

Diagnosing missed cases of spinal muscular atrophy in genome, exome, and panel sequencing datasets

Ben Weisburd ^1,^2,^*, Rakshya Sharma ^1,³, Villem Pata ^4,⁵, Tiia Reimand ^4,⁶, Vijay S Ganesh ^1,^2,^7,⁸, Christina Austin-Tse ^1,², Ikeoluwa Osei-Owusu ^1,⁸, Emily O’Heir ^1,⁸, Melanie O’Leary ¹, Lynn Pais ^1,⁸, Seth A Stafki ⁹, Audrey L Daugherty ⁹, Chiara Folland ²⁶, Stojan Perić ^10,¹¹, Nagia Fahmy ¹², Bjarne Udd ¹³, Magda Horakova ^14,¹⁵, Anna Łusakowska ¹⁶, Rajanna Manoj ¹⁷, Atchayaram Nalini ¹⁷, Veronika Karcagi ¹⁸, Kiran Polavarapu ¹⁹, Hanns Lochmüller ^19,^20,²¹, Rita Horvath ²², Carsten G Bönnemann ²³, Sandra Donkervoort ²³, Göknur Haliloğlu ^23,²⁴, Ozlem Herguner ²⁵, Peter B Kang ⁹, Gianina Ravenscroft ^26,²⁷, Nigel Laing ^26,²⁷, Hamish S Scott ²⁸, Ana Töpf ²⁹, Volker Straub ²⁹, Sander Pajusalu ^4,⁶, Katrin Õunap ^4,⁶, Grace Tiao ¹, Heidi L Rehm ^1,², Anne O’Donnell-Luria ^1,^2,^8,^*

^1.Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA

^2.Center for Genomic Medicine, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA

^3.UC Santa Cruz Genomics Institute, UCSC, Santa Cruz, CA, USA

^4.Department of Clinical Genetics, Institute of Clinical Medicine, University of Tartu, Tartu, Estonia

^5.Anesthesiology and Intensive Care Clinic, Tartu University Hospital, Tartu, Estonia

^6.Genetics and Personalized Medicine Clinic, Tartu University Hospital, Tartu, Estonia

^7.Department of Neurology, Brigham & Women’s Hospital,Boston, MA, USA

^8.Division of Genetics and Genomics, Boston Children’s Hospital, Harvard Medical School, Boston, MA, USA

^9.Greg Marzolf Jr. Muscular Dystrophy Center, Department of Neurology, and Institute for Translational Neuroscience, University of Minnesota, Minneapolis, MN, USA

^10.University of Belgrade, Faculty of Medicine, Belgrade, Serbia

^11.University Clinical Centre of Serbia, Neurology Clinic, Belgrade, Serbia

^12.Neuromuscular Center, Ain Shams University, Cairo, Egypt

^13.Tampere Neuromuscular Center and Folkhalsan Research Center, Helsinki, Finland

^14.Department of Neurology, Neuromuscular Center ERN, University Hospital Brno, Brno, Czech Republic

^15.Faculty of Medicine, Masaryk University, Brno, Czech Republic

^16.Department of Neurology, Medical University of Warsaw, Warsaw, Poland

^17.National Institute of Mental Health and Neuro Sciences, Bengaluru, India

^18.Istenhegyi Genetic Diagnostic Centre, Molecular Genetic Laboratory, Budapest, Hungary

^19.Children’s Hospital of Eastern Ontario Research Institute, Ottawa, ON, Canada

^20.Division of Neurology, Department of Medicine, The Ottawa Hospital, Ottawa, ON, Canada

^21.Brain and Mind Research Institute, University of Ottawa, Ottawa, ON, Canada

^22.Department of Clinical Neurosciences, University of Cambridge, Cambridge, UK

^23.Neuromuscular and Neurogenetic Disorders of Childhood Section, Neurogenetics Branch, National Institute of Neurological Disorders and Stroke, NIH, Bethesda, MD, USA

^24.Division of Pediatric Neurology, Department of Pediatrics, Hacettepe University Faculty of Medicine, Ankara, Turkey

^25.Çukurova University Faculty of Medicine, Department of Pediatrics, Division of Pediatric Neurology, Adana, Turkey

^26.Centre of Medical Research, The University of Western Australia, Perth, Western Australia, Australia

^27.Harry Perkins Institute for Medical Research, Perth, Western Australia, Australia

^28.Centre for Cancer Biology, An SA Pathology & UniSA Alliance, Adelaide, SA, Australia

^29.John Walton Muscular Dystrophy Research Centre, Newcastle University and Newcastle Hospitals NHS Foundation Trust, Newcastle upon Tyne, UK

Corresponding authors: Ben Weisburd, weisburd@broadinstitute.org, Anne O’Donnell-Luria, odonnell@broadinstitute.org

PMCID: PMC10889006 PMID: 38405995

Abstract

Spinal muscular atrophy (SMA) is a genetic disorder that causes progressive degeneration of lower motor neurons and the subsequent loss of muscle function throughout the body. It is the second most common recessive disorder in individuals of European descent and is present in all populations. Accurate tools exist for diagnosing SMA from genome sequencing data. However, there are no publicly available tools for GRCh38-aligned data from panel or exome sequencing assays which continue to be used as first line tests for neuromuscular disorders. This deficiency creates a critical gap in our ability to diagnose SMA in large existing rare disease cohorts, as well as newly sequenced exome and panel datasets. We therefore developed and extensively validated a new tool - SMA Finder - that can diagnose SMA not only in genome, but also exome and panel sequencing samples aligned to GRCh37, GRCh38, or T2T-CHM13. It works by evaluating aligned reads that overlap the c.840 position of SMN1 and SMN2 in order to detect the most common molecular causes of SMA. We applied SMA Finder to 16,626 exomes and 3,911 genomes from heterogeneous rare disease cohorts sequenced at the Broad Institute Center for Mendelian Genomics as well as 1,157 exomes and 8,762 panel sequencing samples from Tartu University Hospital. SMA Finder correctly identified all 16 known SMA cases and reported nine novel diagnoses which have since been confirmed by clinical testing, with another four novel diagnoses undergoing validation. Notably, out of the 29 total SMA positive cases, 23 had an initial clinical diagnosis of muscular dystrophy, congenital myasthenic syndrome, or myopathy. This underscored the frequency with which SMA can be misdiagnosed as other neuromuscular disorders and confirmed the utility of using SMA Finder to reanalyze phenotypically diverse neuromuscular disease cohorts. Finally, we evaluated SMA Finder on 198,868 individuals that had both exome and genome sequencing data within the UK Biobank (UKBB) and found that SMA Finder’s overall false positive rate was less than 1 / 200,000 exome samples, and its positive predictive value (PPV) was 97%. We also observed 100% concordance between UKBB exome and genome calls. This analysis showed that, even though it is located within a segmental duplication, the most common causal variant for SMA can be detected with comparable accuracy to monogenic disease variants in non-repetitive regions. Additionally, the high PPV demonstrated by SMA Finder, the existence of treatment options for SMA in which early diagnosis is imperative for therapeutic benefit, as well as widespread availability of clinical confirmatory testing for SMA, warrants the addition of SMN1 to the ACMG list of genes with reportable secondary findings after genome and exome sequencing.

Introduction

Spinal muscular atrophy (SMA) is a rare genetic condition characterized by progressive loss of muscle function due to the death of lower motor neurons. It is one of the most common recessive disorders, particularly in individuals of European descent where it affects between 1 in 6,000–12,000,^1,2 while in other populations it occurs at lower but appreciable rates.³ In 2016, the US Food and Drug Administration (FDA) approved nusinersen, the first drug to slow disease progression, followed by the approval of onasemnogene abeparvovec-xioi in 2019, and risdiplam in 2020.⁴ The American Health Resources & Services Administration (HRSA) added SMA testing to the recommended uniform newborn screening panel (RUSP) in 2018, and it is now included in all 50 states. An increasing number of countries have also included SMA within their national screening programs. However, affected individuals may remain undiagnosed if they were born before testing became available in their region, if their parents declined newborn screening, or due to false-negative test reports.

The molecular etiology of SMA is linked to disruption of the SMN locus,⁵ most commonly due to deletions or gene conversions resulting in SMN1 deficiency. The SMN locus consists of two nearly identical paralogs, SMN1 and SMN2. The ‘C’ nucleotide at the c.840 position of SMN1 facilitates proper splicing, while the ‘T’ at that position in SMN2 causes exon 7 skipping in 85% of SMN2 transcripts (Figure 1A). As a result, SMN2 alone does not produce a sufficient amount of functional SMN protein, causing neurons that have zero functional copies of SMN1 to die prematurely. In the general population, individuals typically inherit two intact copies of each paralog, though genomes with between zero and five copies of each have been observed.³ Since SMN2 can still produce a small amount of SMN protein, disease severity and age of onset in affected individuals is influenced by the copy number of SMN2.

Figure 1. — A. The *SMN1* and *SMN2* paralogs are 99.9% identical. One of the few differences between them occurs at their c.840 position. The ‘C’ at this position in *SMN1* leads to proper splicing, while the ‘T’ in *SMN2* leads to skipping of exon 7 in most *SMN2* transcripts. Individuals that have zero functional copies of *SMN1* develop spinal muscular atrophy (SMA), and the severity of their disease is inversely proportional to the number of copies of *SMN2* in their genome since each copy of *SMN2* can produce a small amount of SMN protein.

B. SMA Finder works by counting all aligned reads that overlap the c.840 position in both *SMN1* and *SMN2* and then computing the fraction of reads that have a ‘C’ at that position. This fraction is interpreted as the fraction of intact *SMN1* copies in the individual’s genome. When it is near zero, it implies the absence of any functional copies of *SMN1*, and therefore suggests that the sample is positive and the individual has a diagnosis of SMA.

The near-perfect sequence homology between SMN1 and SMN2 causes read alignment algorithms like BWA⁶ to either mismap or ambiguously map reads at this locus, often resulting in reads having a mapping quality of 0. This, in turn, confounds standard variant calling pipelines such as GATK⁷, and prevents them from being directly useful for SMA diagnosis. To address these issues, multiple specialized tools have been developed to diagnose SMA from genome and long read sequencing data, including SMNCopyNumberCaller³ and Paraphase.⁸ Additionally, the MYO-SEQ study⁹ published an SMA calling pipeline for exome data on GRCh37 and used it to identify 5 novel SMA diagnoses in a cohort of 1001 cases with limb-girdle weakness. Most recently, the Chameleolyser tool was described in a study involving GRCh37-aligned exome samples from a cohort of 17,650 undiagnosed patients.¹⁰ There, Chameleolyser successfully identified 15 novel SMA cases. However, to our knowledge, no publicly available tool exists for detecting SMA in exome or panel sequencing samples on GRCh38 or the new telomere-to-telomere (T2T-CHM13)¹¹ reference. We therefore created SMA Finder for use with exome, genome, or panel sequencing data aligned to GRCh37, GRCh38, or T2T-CHM13, and applied it to diagnose unsolved cases within large rare disease cohorts from the Broad Institute Center for Mendelian Genomics (CMG) and Tartu University Hospital.

Materials and Methods

SMA Finder algorithm

SMA Finder works by computing two numbers from the read data, r and N, where N is the total number of aligned reads that overlap the c.840 position of SMN1 and SMN2, and r is the subset of these reads that have a ‘C’ at the c.840 position. When there is sufficient read coverage (N ≥ 14), SMA Finder interprets the lack of reads with a ‘C’ at c.840 as evidence that the individual has zero functional copies of SMN1 (Figure 1B, Supplementary Methods), and so reports a positive call. Figure 3 shows how r and N are used to distinguish between true positive and true negative samples.

Fig 3. — Read counts measured by SMA Finder in exome (A) and genome (B) samples from CMG cohorts, as well exomes (C) and panel sequencing samples (D) from Tartu University Hospital. Each dot represents a sample. The red line represents the decision boundary used by SMA Finder which reports samples to the left of the boundary as SMA-positive. Samples in the gray box where y ≤ 14 are reported as having insufficient read coverage to make a call. The red dots represent previously known SMA diagnoses, the gray dots are rare disease cases (including the new SMA diagnoses), and the blue dots are unaffected relatives. To clearly show points across a large range of read count values, the x and y axes use a symmetrical log scale that is linear in the range 0 ≤ x ≤ 14 and 0 ≤ y ≤ 14 before switching to a logarithmic scale for x or y > 14. This choice of scale causes part of the decision boundary to appear curved even though it is linear in standard Cartesian coordinates. E and F show SMA Finder read counts for 198,868 UKBB exomes and genomes respectively. The red dot represents UKBB sample i1 which had phenotype records consistent with an SMA diagnosis and was called positive by both SMA Finder and SMNCopyNumberCaller. The yellow dot represents i2 which was only called positive by SMA Finder and was a no-call from SMNCopyNumberCaller. Marginal histograms show the density of scatter plot points along each axis, with the histogram along the vertical axis showing a distribution of read counts overlapping the c.840 position in *SMN1* + *SMN2*, while the histogram along the horizontal axis shows the number of reads with a ‘C’ at the c.840 position. **NOTE:** The exome, genome, and panel sequencing samples in A and B as well as in C and D are largely from non-overlapping sets of individuals, while the exomes and genomes in E and F are alternative samples from the same set of 198,868 individuals in UKBB.

Cohort summary

The Broad Institute Center for Mendelian Genomics (CMG) cohort consisted of 20,537 individuals from 10,754 families with a wide spectrum of medical conditions. Families sequenced through the CMG were enrolled at the Broad Institute by the Rare Genomes Project or enrolled in research studies with local regulatory approval through the participating collaborators, including for sharing de-identified samples for sequencing and analysis. This project was approved by the Mass General Brigham IRB (protocols #2016P001422 and #2013P001477). In this cohort, 12,045 (59%) individuals were affected with a variety of suspected monogenic disorders, while 8,401 (41%) were unaffected parents or relatives (Figure 2A). We performed ancestry inference for 20,205 individuals by computing principal components for high-quality bi-allelic autosomal SNVs using the gnomAD v2 method.¹² We found that 13,240 (66%) individuals were European and the remaining 34% came from various other populations (Figure 2B). Among the 12,045 affected individuals, 10,125 (84%) had phenotype descriptions encoded in Human Phenotype Ontology (HPO) terms.¹³ Of these, 5,138 (51%) had at least one HPO term in the “Abnormality of the nervous system” category, and 4,035 (40%) had at least one HPO term in the “Abnormality of the musculoskeletal system” category (Figure 2C).

Figure 2. — A. The affected status of individuals in the CMG cohort is shown on the y-axis. 12,045 individuals are in the Affected category, 8,401 are Not Affected, and 91 individuals have unknown affected status. Here “Affected” means that the individual was enrolled in a rare disease cohort due to having a disease considered to be rare and most likely genetic in origin.

B. Inferred ancestry of individuals within the CMG cohort is shown on the x-axis: NFE (Non-Finnish Europeans), MDE (Middle Eastern), SAS (South Asian), AMR (Admixed American), AFR (African/African American), EAS (East Asian), ASJ (Ashkenazi Jewish), and UNC (unclassified).

C. The top-level categories from the Human Phenotype Ontology (HPO) are shown on the y-axis. Any individual with multiple HPO terms was counted only once in each category but may be counted more than once across categories.

Genome sequencing used PCR-free library preparation and Illumina HiSeq X Ten v2 chemistry to generate 150bp paired-end reads. The mean target coverage was >30x. Exome sequencing used Illumina Nextera or Twist exome capture (~38 Mb target) and similarly generated 150bp paired-end reads. It aimed to cover >80% of targets at 20x and a mean target coverage of >60x. Both genome and exome data were processed using GATK best practices, starting with BWA alignment to GRCh38 and GATK base quality score recalibration (BQSR). We used Hail Batch¹⁴ to run SMA Finder on all samples in parallel. The average number of reads overlapping the c.840 position of SMN1 + SMN2 was 77.9 across CMG genomes and 110.7 across CMG exomes. These numbers reflect the higher coverage in exomes compared to genomes at this locus, as well as the total copy number of SMN paralogs per individual.

The Tartu University Hospital cohort consisted of 9,919 pseudonymised samples sent for molecular diagnostics for various suspected medical conditions to Tartu University Hospital between 2015 – 2022. The cohort analysis was approved by the University of Tartu Research Ethics Committee (protocol # 374M-6). Out of the 9,919 samples, 1,157 (11.7%) were exomes and 8,762 (88.3%) were panel sequencing panels generated using the Illumina TruSight^™ One or the Illumina TruSight^™ One Expanded panel of 4,811 to 6,794 genes. Guaranteed mean coverage for both panels and exomes was >100x for panel regions. The data was aligned to reference genome build GRCh37 using BWA. Samples were annotated with summary clinical characteristics, and de-pseudonymized data was accessed for positive SMA Finder results only.

The UK Biobank (UKBB)¹⁵ dataset included de-identified genetic and medical records from 198,868 individuals as well as both exome and genome samples for each individual. These samples were aligned to GRCh38 using BWA. The UKBB is a large population-based prospective study that recruited UK residents between 40 and 69 years of age during the years 2006–2010. UKBB genotypes had a mean depth of coverage of 32.5x and a minimum of 23.5x.¹⁶ For UKBB exomes, the depth of coverage, on average, exceeded 20x at 95% of panel bases.¹⁷ The average number of reads overlapping the c.840 position of SMN1 + SMN2 was 56.8 in UKBB genomes and 75.1 in UKBB exomes, again reflecting the higher coverage in exomes at this locus, as well as the total copy number of SMN paralogs per individual.

Our cohorts did not include any samples aligned to the T2T-CHM13 reference. Therefore, to test SMA Finder’s performance on T2T-CHM13, we realigned the 21 confirmed-positive exome samples from the CMG cohort to the T2T-CHM13 reference using BWA MEM and ran SMA Finder on the realigned samples. We then repeated these steps on 21 randomly selected exomes from the CMG cohort that SMA Finder had previously called as SMA-negative. After realignment to T2T-CHM13, SMA Finder still reported the 21 positive samples as SMA-positive, and the 21 negative samples as SMA-negative, confirming its utility for T2T-aligned samples.

Results

We applied SMA Finder to 16,626 exomes and 3,911 genomes from phenotypically heterogeneous rare disease cohorts within the Broad CMG,¹⁸ part of the GREGoR Consortium. SMA Finder identified all 13 known SMA-positive samples (10 exomes and 3 genomes), and flagged 10 previously undiagnosed exome samples as candidate SMA cases, of which 8 have now been validated by gold standard methods such as MLPA. For the remaining novel diagnoses (n=2), confirmatory testing is pending in one case, and is not possible in the other due to loss of contact with the patient (Table 1). SMA Finder was negative for 8,401 unaffected samples (6,500 exomes and 1,901 genomes). SMA Finder reported insufficient read coverage to make a call in 112 exomes (0.7%) and 6 genomes (0.2%) due to these samples having fewer than 14 total reads overlapping c.840 in SMN1 + SMN2. Strikingly, out of the 23 total SMA-positive cases in this heterogeneous rare disease cohort, 20 had an initial clinical diagnosis of muscular dystrophy or myopathy while one had a clinical diagnosis of congenital myasthenic syndrome.

Table 1:

Positive cases identified by SMA Finder in the rare disease cohorts

graphic file with name nihpp-2024.02.11.24302646v3-t0001.jpg

Open in a new tab

Table 1 lists all samples called as SMA-positive by SMA Finder within the Broad Institute Center for Mendelian Genomics (CMG) and Tartu University Hospital cohorts. The Original Clinical Diagnosis column lists the clinical diagnosis prior to SMA testing. The SMA Confirmation Status column shows whether the diagnosis was already known at the time of the SMA Finder result as well as the status of confirmatory testing for novel diagnoses. Age of Onset and SMA Type columns are based on physician reports, while the SMN2 Copies column is based on the SMN2 copy number as measured by MLPA genetic testing. The next three columns display the confidence score, r, and N counts computed by SMA Finder (see Figure1, Methods), while the last column shows the imputed ancestry as NFE (Non-Finnish European), MDE (Middle Eastern), AFR (African/African American), SAS (South Asian), or UNC (unclassified).

To evaluate its performance beyond the CMG dataset, we also ran SMA Finder on 9,919 samples (1,157 exomes and 8,762 panel sequencing samples) from another heterogeneous rare disease cohort at the Tartu University Hospital. SMA Finder detected three previously known cases and three new cases which have since been confirmed by clinical testing (Table 1). Of the three new cases, two presented as SMA type III, and one had symptoms that did not include obvious SMA features and thus may be a presymptomatic case of SMA type IV.

Finally, to measure SMA Finder’s positive predictive value (PPV) in a large population cohort as well as compare its performance to other tools and across sample types, we applied SMA Finder to 198,868 individuals from the UK Biobank (UKBB).¹⁵ Given the availability of an exome and genome sample for each individual, we ran SMA Finder on all exomes and genomes, while also running SMNCopyNumberCaller³ on the genomes. We found SMA Finder calls to be identical between exomes and genomes, with the only difference being that SMA Finder reported 85 out of 198,868 (0.042%) genome samples and 14 out of 198,868 (0.007%) exome samples (Fisher’s exact p=1.4×10⁻¹³) as having insufficient read coverage to make a call (Table S3). SMA Finder flagged two individuals (i1 and i2) as SMA-positive, with their exome and genome samples yielding the same result. It reported all other exomes and genomes that had sufficient coverage as SMA-negative (Figure 3E, F). Concordance between SMA Finder and SMNCopyNumberCaller was also nearly 100%. Like SMA Finder, SMNCopyNumberCaller (a genome-only tool) reported individual i1 to be SMA-positive, while reporting no-call for 35 genomes including i2 (Table S2).

The UKBB phenotype records for i1 included ICD-10 code G121 “Other inherited spinal muscular atrophy”, suggesting this to be a true positive call for both SMA Finder and SMNCopyNumberCaller. Additionally, i1’s records indicated that they died between the age of 65 and 70, with the cause of death listed as “Spinal muscular atrophy”. In contrast, i2 had few phenotype records, making it difficult to tell whether this was a false positive call by SMA Finder, or a missed call by SMNCopyNumberCaller. The fact that SMA Finder made the same call for both the exome and genome of i2 increases the likelihood that this is a true positive. However, even if we conservatively count i2 as a potential false positive call by SMA Finder based on the absence of phenotype records and the no-call result from SMNCopyNumberCaller, SMA Finder’s overall false positive rate would still be < 1 / 200,000 exomes. Additionally, this would yield a PPV of 97% based on 28 true positive samples (21 from CMG cohorts, 6 from Tartu University Hospital, and 1 from UKBB) and at most 1 hypothetical false positive (counting sample i2 from UKBB).

Discussion

We developed and validated SMA Finder, a new tool for detecting SMA-positive samples within short read exome, genome, and panel sequencing data. After testing this tool on multiple heterogeneous rare disease cohorts, as well as nearly 200,000 individuals from the UKBB, we found its false positive rate to be less than 1 in 200,000. Our results showed that SMA Finder is a robust and accurate tool for detecting the most common molecular causes of SMA using exome, genome, and panel sequencing samples.

However, the analysis had several limitations related to these cohorts. It was predominantly based on samples of European ancestry. A precise estimate of SMA Finder’s accuracy in other populations will require testing on cohorts with larger numbers of non-European samples. Similarly, since most samples used DNA extracted from blood, further testing is needed to measure SMA Finder’s accuracy for other sample sources such as saliva or dried blood spots.

The SMA Finder algorithm in its current form also has several important limitations. First, it only determines whether an individual has zero or more than zero functional copies of SMN1, and thus does not provide information on carrier status or SMN2 copy number. In the future, we may extend SMA Finder to estimate the exact copy numbers of SMN1 and SMN2. Second, SMA Finder only tests for variants that disrupt the c.840 position of SMN1. Studies have shown that, depending on ancestry, between 6% and 49% of SMA cases¹⁹ may be caused by other variants in regions of the SMN1 gene that don’t overlap the c.840 position. SMA Finder cannot currently detect these variants, and so would yield false negative results in those cases. Furthermore, SMA Finder was unable to make a call in 112 exomes (0.7%) and 6 genomes (0.2%) from the CMG as well as 14 exomes (0.007%) and 85 genomes (0.04%) from UKBB due to insufficient read coverage at the c.840 position. These no-call percentages may be useful for confirming sequencing data quality when applying SMA Finder to new cohorts where a significantly higher proportion of no-calls would indicate problems with target capture or sequencing at the SMN locus. Finally, SMA Finder was designed and tested on samples generated using Illumina sequencers and aligned using BWA.⁶ Samples generated using a different sequencing technology, or aligned using a different, non-functionally-equivalent aligner may not work as expected, particularly when aligning to the GRCh38 reference where the ALT contigs include additional copies of SMN1 and SMN2 (Supplementary Methods).

Despite these limitations, SMA Finder’s exceptionally high accuracy in detecting homozygous loss of SMN1 due to deletions or gene conversions at the c.840 position allowed us to shed new light on open questions about the genetic architecture of SMA. Over the years, reports have described cases of unaffected individuals with biallelic loss of SMN1 due to deletions that would be detectable by SMA Finder.^20,21,22 However, the frequency of such cases in the general population has not been well-characterized. Our analysis of UKBB exomes and genomes using SMA Finder and SMNCopyNumberCaller shows that such cases are extremely rare in the European population, occurring at a frequency below 1 in 200,000.

Rare disease diagnostic pipelines based on short read sequencing data have, until recently, needed to exclude highly repetitive regions such as segmental duplications due to technical limitations of existing variant calling algorithms. However, our analysis adds to the growing body of research^3,10,23 which demonstrates that, with focused tool development, it is possible to detect variants in these regions with sufficient accuracy for rare disease diagnosis. Furthermore, at some loci such as the SMN c.840 position, our ability to accurately detect the causal variant from exome or panel sequencing data can be comparable to the accuracy of variant calling in non-repetitive regions.

A striking observation from our cohorts was that 11 out of the 13 new SMA diagnoses were made in individuals carrying a clinical diagnosis of muscular dystrophy, congenital myasthenia, or myopathy, and 12 of the 16 known SMA cases carried an initial clinical misdiagnosis of muscular dystrophy, reinforcing similar observations by other groups.²⁴ Although it is known that the differential diagnosis of an infant or adolescent with progressive weakness includes SMA in addition to muscular dystrophies, myopathies, and congenital myasthenic syndromes²⁵, clinical gene panels for muscular dystrophy and myopathy may not include SMA testing. Moreover, clinicians and researchers may be misled by the presence of variants of unknown significance (VUS) in other neuromuscular disease genes, by complementary diagnostic tools such as electromyography (EMG) and muscle MRI, and by the highly elevated creatine kinase (CK) levels sometimes observed in ambulant patients with SMA type III. Prior to the inclusion of SMA testing in newborn screening, there was commonly a diagnostic delay in SMA, particularly for the milder types II and III which can present with clinical features that are similar to later-onset Mendelian myopathies.²⁶ Now, with newly-available SMA treatments where early detection is imperative for the best therapeutic effect,²⁹ there is increased urgency to address diagnostic delays. Our analysis demonstrates that the inclusion of an SMA detection tool within existing rare disease analysis pipelines - particularly for cohorts that include neuromuscular phenotypes - can successfully identify missed cases.

In 2021, the American College of Medical Genetics (ACMG) published an updated set of recommendations for reporting of secondary findings (SF) in clinical exome and genome sequencing.²⁸ In the section titled “Poor candidates for secondary findings, due to concerns about analytical validity”, the authors listed reasons why genes with homologous sequences should be excluded from the SF list. These included concerns about the accuracy of variant detection in these regions as well as challenges in orthogonal validation. Given the clinically overlooked diagnoses presented in this paper, the demonstrated PPV of SMA Finder, as well as the widespread availability of confirmatory molecular testing for SMA, we propose that SMN1 should now be considered for inclusion in the ACMG SF list.

Supplementary Material

NIHPP2024.02.11.24302646V3-supplement-1.pdf^{(55.6KB, pdf)}

Acknowledgments

This paper is dedicated to Clara and Dmitry Bogomolny. We thank the many families who participate in these rare disease studies and the Broad CMG collaborators for sharing the samples and medical data. The Broad CMG sequencing and analysis was funded by the National Human Genome Research Institute (NHGRI), the National Eye Institute, the National Heart, Lung and Blood Institute grant UM1HG008900 and NHGRI grants U01HG011755 and R01HG009141. GH is supported by the GREGoR Consortium, and research in this publication was supported by the National Human Genome Research Institute of the National Institutes of Health under Award Number U24HG011746. This publication has also been made possible in part by CZI grants 2019-19927, 2020-224274, and 2022-309464 https://doi.org/10.37921/236582yuakxy, from the Chan Zuckerberg Initiative DAF, an advised fund of Silicon Valley Community Foundation (funder DOI 10.13039/100014989). The Tartu University Hospital cohort analysis was funded by the Estonian Research Council grants PSG774 and PRG471. VSG was supported by NIH NHGRI grant T32HG010464. Work in CGB’s laboratory is supported by intramural funds of NINDS/NIH. This research has been conducted using the UK Biobank Resource under application number 48511. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or the other organizations that have funded the work. HL receives support from the Canadian Institutes of Health Research (CIHR) for Foundation Grant FDN-167281 (Precision Health for Neuromuscular Diseases), Transnational Team Grant ERT-174211 (ProDGNE) and Network Grant OR2-189333 (NMD4C), from the Canada Foundation for Innovation (CFI-JELF 38412), the Canada Research Chairs program (Canada Research Chair in Neuromuscular Genomics and Health, 950-232279), the European Commission (Grant # 101080249) and the Canada Research Coordinating Committee New Frontiers in Research Fund (NFRFG-2022-00033) for SIMPATHIC, and from the Government of Canada Canada First Research Excellence Fund (CFREF) for the Brain-Heart Interconnectome (CFREF-2022-00007). KP is a recipient of CIHR Postdoctoral fellowship (202210MFE-491707-404816). RH is supported by the Wellcome Discovery Award (226653/Z/22/Z), the Medical Research Council (UK) (MR/V009346/1), Ataxia UK, Action for AT, the Muscular Dystrophy UK, the LifeArc Centre to Treat Mitochondrial Diseases (LAC-TreatMito), the UKRI/Horizon Europe MSCA Doctoral Network Programme (Project 101120256 — MMM) and an MRC strategic award to establish an International Centre for Genomic Medicine in Neuromuscular Diseases (ICGNMD) MR/S005021/1. This research was supported by the NIHR Cambridge Biomedical Research Centre (BRC-1215-20014). The views expressed are those of the authors and not necessarily those of the NIHR or the Department of Health and Social Care.

Declaration of interests

HLR receives research funding from Microsoft and previously received funding from Illumina to support rare disease gene discovery and diagnosis. AODL has consulted for Tome Biosciences, Ono Pharma USA Inc, and Addition Therapeutics, and is member of the scientific advisory board for Congenica Inc and the Simons Foundation SPARK for Autism study. AL received honoraria for speaking at educational events for Biogen, PTC and Roche, is a subinvestigator in clinical trials by Roche and PTC, and is involved in a project supported by Biogen (POL-SMA-17-11166). PBK has received research support from ML Bio and Sarepta Therapeutics, and has consulted for Lupin, Neurogene, NS Pharma, and Teneofour.

Data and code availability

SMA Finder is publicly available as a stand-alone tool at https://github.com/broadinstitute/sma_finder under the open-source MIT license.

Also, it is available as a WDL workflow on Terra at https://portal.firecloud.org/?return=terra#methods/translational-genomics-group/sma-finder/4

References

1.Sarv S. et al. The Birth Prevalence of Spinal Muscular Atrophy: A Population Specific Approach in Estonia. Front. Genet. 12, 796862 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
2.Verhaart I. E. C. et al. Prevalence, incidence and carrier frequency of 5q-linked spinal muscular atrophy - a literature review. Orphanet J. Rare Dis. 12, 124 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
3.Chen X. et al. Spinal muscular atrophy diagnosis and carrier screening from genome sequencing data. Genet. Med. 22, 945–953 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
4.Schorling D. C., Pechmann A. & Kirschner J. Advances in Treatment of Spinal Muscular Atrophy - New Phenotypes, New Challenges, New Implications for Care. J Neuromuscul Dis 7, 1–13 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
5.Lefebvre S. et al. Identification and characterization of a spinal muscular atrophy-determining gene. Cell 80, 155–165 (1995). [DOI] [PubMed] [Google Scholar]
6.Li H. & Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
7.Poplin R. et al. Scaling accurate genetic variant discovery to tens of thousands of samples. bioRxiv 201178 (2018) doi: 10.1101/201178. [DOI] [Google Scholar]
8.Chen X. et al. Comprehensive SMN1 and SMN2 profiling for spinal muscular atrophy analysis using long-read PacBio HiFi sequencing. Am. J. Hum. Genet. 110, 240–250 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
9.Töpf A. et al. Sequential targeted exome sequencing of 1001 patients affected by unexplained limb-girdle weakness. Genet. Med. 22, 1478–1488 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
10.Steyaert W. et al. Systematic analysis of paralogous regions in 41,755 exomes uncovers clinically relevant variation. Nat. Commun. 14, 6845 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
11.Nurk S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
12.Karczewski K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
13.Robinson P. N. et al. The Human Phenotype Ontology: a tool for annotating and analyzing human hereditary disease. Am. J. Hum. Genet. 83, 610–615 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]
14.Poterba T. et al. The Scalable Variant Call Representation: Enabling Genetic Analysis Beyond One Million Genomes. bioRxiv (2024) doi: 10.1101/2024.01.09.574205. [DOI] [PubMed] [Google Scholar]
15.Sudlow C. et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
16.Halldorsson B. V. et al. The sequences of 150,119 genomes in the UK Biobank. Nature 607, 732–740 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
17.Van Hout C. V. et al. Exome sequencing and characterization of 49,960 individuals in the UK Biobank. Nature 586, 749–756 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
18.Baxter S. M. et al. Centers for Mendelian Genomics: A decade of facilitating gene discovery. Genet. Med. 24, 784–797 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
19.Vorster E., Essop F. B., Rodda J. L. & Krause A. Spinal Muscular Atrophy in the Black South African Population: A Matter of Rearrangement? Front. Genet. 11, 54 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
20.Cobben J. M. et al. Deletions of the survival motor neuron gene in unaffected siblings of patients with spinal muscular atrophy. Am. J. Hum. Genet. 57, (1995). [PMC free article] [PubMed] [Google Scholar]
21.Oprea G. E. et al. Plastin 3 Is a Protective Modifier of Autosomal Recessive Spinal Muscular Atrophy. Science 320, 524 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]
22.Chong J. X. et al. A common spinal muscular atrophy deletion mutation is present on a single founder haplotype in the US Hutterites. Eur. J. Hum. Genet. 19, 1045–1051 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
23.Peng X. et al. Overcoming the Pitfalls of Next-Generation Sequencing-Based Molecular Diagnosis of Shwachman-Diamond Syndrome. J. Mol. Diagn. 24, 1240–1253 (2022). [DOI] [PubMed] [Google Scholar]
24.Krenn M., Jengojan S. & Grisold W. Spinal muscular atrophy presenting with mild limb-girdle weakness in adulthood: Diagnostic pitfalls in the era of disease-modifying therapies. J. Neurol. Sci. 440, 120347 (2022). [DOI] [PubMed] [Google Scholar]
25.D’Amico A., Mercuri E., Tiziano F. D. & Bertini E. Spinal muscular atrophy. Orphanet J. Rare Dis. 6, 71 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
26.Lin C.-W., Kalb S. J. & Yeh W.-S. Delay in Diagnosis of Spinal Muscular Atrophy: A Systematic Literature Review. Pediatr. Neurol. 53, 293–300 (2015). [DOI] [PubMed] [Google Scholar]
27.Schwartz O. et al. Clinical Effectiveness of Newborn Screening for Spinal Muscular Atrophy: A Nonrandomized Controlled Trial. JAMA pediatrics vol. 178 540–547 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
28.Miller D. T. et al. Recommendations for reporting of secondary findings in clinical exome and genome sequencing, 2021 update: a policy statement of the American College of Medical Genetics and Genomics (ACMG). Genet. Med. 23, 1391–1398 (2021). [DOI] [PubMed] [Google Scholar]
29.Hail Team. Hail 0.2. https://github.com/hail-is/hail

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

NIHPP2024.02.11.24302646V3-supplement-1.pdf^{(55.6KB, pdf)}

Data Availability Statement

SMA Finder is publicly available as a stand-alone tool at https://github.com/broadinstitute/sma_finder under the open-source MIT license.

Also, it is available as a WDL workflow on Terra at https://portal.firecloud.org/?return=terra#methods/translational-genomics-group/sma-finder/4

[R1] 1.Sarv S. et al. The Birth Prevalence of Spinal Muscular Atrophy: A Population Specific Approach in Estonia. Front. Genet. 12, 796862 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R2] 2.Verhaart I. E. C. et al. Prevalence, incidence and carrier frequency of 5q-linked spinal muscular atrophy - a literature review. Orphanet J. Rare Dis. 12, 124 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R3] 3.Chen X. et al. Spinal muscular atrophy diagnosis and carrier screening from genome sequencing data. Genet. Med. 22, 945–953 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R4] 4.Schorling D. C., Pechmann A. & Kirschner J. Advances in Treatment of Spinal Muscular Atrophy - New Phenotypes, New Challenges, New Implications for Care. J Neuromuscul Dis 7, 1–13 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R5] 5.Lefebvre S. et al. Identification and characterization of a spinal muscular atrophy-determining gene. Cell 80, 155–165 (1995). [DOI] [PubMed] [Google Scholar]

[R6] 6.Li H. & Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R7] 7.Poplin R. et al. Scaling accurate genetic variant discovery to tens of thousands of samples. bioRxiv 201178 (2018) doi: 10.1101/201178. [DOI] [Google Scholar]

[R8] 8.Chen X. et al. Comprehensive SMN1 and SMN2 profiling for spinal muscular atrophy analysis using long-read PacBio HiFi sequencing. Am. J. Hum. Genet. 110, 240–250 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R9] 9.Töpf A. et al. Sequential targeted exome sequencing of 1001 patients affected by unexplained limb-girdle weakness. Genet. Med. 22, 1478–1488 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R10] 10.Steyaert W. et al. Systematic analysis of paralogous regions in 41,755 exomes uncovers clinically relevant variation. Nat. Commun. 14, 6845 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R11] 11.Nurk S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R12] 12.Karczewski K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R13] 13.Robinson P. N. et al. The Human Phenotype Ontology: a tool for annotating and analyzing human hereditary disease. Am. J. Hum. Genet. 83, 610–615 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R14] 14.Poterba T. et al. The Scalable Variant Call Representation: Enabling Genetic Analysis Beyond One Million Genomes. bioRxiv (2024) doi: 10.1101/2024.01.09.574205. [DOI] [PubMed] [Google Scholar]

[R15] 15.Sudlow C. et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R16] 16.Halldorsson B. V. et al. The sequences of 150,119 genomes in the UK Biobank. Nature 607, 732–740 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R17] 17.Van Hout C. V. et al. Exome sequencing and characterization of 49,960 individuals in the UK Biobank. Nature 586, 749–756 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R18] 18.Baxter S. M. et al. Centers for Mendelian Genomics: A decade of facilitating gene discovery. Genet. Med. 24, 784–797 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R19] 19.Vorster E., Essop F. B., Rodda J. L. & Krause A. Spinal Muscular Atrophy in the Black South African Population: A Matter of Rearrangement? Front. Genet. 11, 54 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R20] 20.Cobben J. M. et al. Deletions of the survival motor neuron gene in unaffected siblings of patients with spinal muscular atrophy. Am. J. Hum. Genet. 57, (1995). [PMC free article] [PubMed] [Google Scholar]

[R21] 21.Oprea G. E. et al. Plastin 3 Is a Protective Modifier of Autosomal Recessive Spinal Muscular Atrophy. Science 320, 524 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R22] 22.Chong J. X. et al. A common spinal muscular atrophy deletion mutation is present on a single founder haplotype in the US Hutterites. Eur. J. Hum. Genet. 19, 1045–1051 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R23] 23.Peng X. et al. Overcoming the Pitfalls of Next-Generation Sequencing-Based Molecular Diagnosis of Shwachman-Diamond Syndrome. J. Mol. Diagn. 24, 1240–1253 (2022). [DOI] [PubMed] [Google Scholar]

[R24] 24.Krenn M., Jengojan S. & Grisold W. Spinal muscular atrophy presenting with mild limb-girdle weakness in adulthood: Diagnostic pitfalls in the era of disease-modifying therapies. J. Neurol. Sci. 440, 120347 (2022). [DOI] [PubMed] [Google Scholar]

[R25] 25.D’Amico A., Mercuri E., Tiziano F. D. & Bertini E. Spinal muscular atrophy. Orphanet J. Rare Dis. 6, 71 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R26] 26.Lin C.-W., Kalb S. J. & Yeh W.-S. Delay in Diagnosis of Spinal Muscular Atrophy: A Systematic Literature Review. Pediatr. Neurol. 53, 293–300 (2015). [DOI] [PubMed] [Google Scholar]

[R27] 27.Schwartz O. et al. Clinical Effectiveness of Newborn Screening for Spinal Muscular Atrophy: A Nonrandomized Controlled Trial. JAMA pediatrics vol. 178 540–547 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R28] 28.Miller D. T. et al. Recommendations for reporting of secondary findings in clinical exome and genome sequencing, 2021 update: a policy statement of the American College of Medical Genetics and Genomics (ACMG). Genet. Med. 23, 1391–1398 (2021). [DOI] [PubMed] [Google Scholar]

[R29] 29.Hail Team. Hail 0.2. https://github.com/hail-is/hail

PERMALINK

This is a preprint.

Diagnosing missed cases of spinal muscular atrophy in genome, exome, and panel sequencing datasets

Ben Weisburd

Rakshya Sharma

Villem Pata

Tiia Reimand

Vijay S Ganesh

Christina Austin-Tse

Ikeoluwa Osei-Owusu

Emily O’Heir

Melanie O’Leary

Lynn Pais

Seth A Stafki

Audrey L Daugherty

Chiara Folland

Stojan Perić

Nagia Fahmy

Bjarne Udd

Magda Horakova

Anna Łusakowska

Rajanna Manoj

Atchayaram Nalini

Veronika Karcagi

Kiran Polavarapu

Hanns Lochmüller

Rita Horvath

Carsten G Bönnemann

Sandra Donkervoort

Göknur Haliloğlu

Ozlem Herguner

Peter B Kang

Gianina Ravenscroft

Nigel Laing

Hamish S Scott

Ana Töpf

Volker Straub

Sander Pajusalu

Katrin Õunap

Grace Tiao

Heidi L Rehm

Anne O’Donnell-Luria

Abstract

Introduction

Figure 1. Detecting SMA using reads aligned to the SMN1 and SMN2 paralogs.

Materials and Methods

SMA Finder algorithm

Fig 3. SMA Finder results.

Cohort summary

Figure 2. Overview of the CMG rare disease cohort.

Results

Table 1:

Discussion

Supplementary Material

Acknowledgments

Declaration of interests

Data and code availability

References

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases