American Journal of Human Genetics. 2025 Jan 24;112(2):428–449. doi: 10.1016/j.ajhg.2025.01.002

Advancing long-read nanopore genome assembly and accurate variant calling for rare disease detection

Shloka Negi 1, Sarah L Stenton 2,3, Seth I Berger 4, Paolo Canigiula 4, Brandy McNulty 1, Ivo Violich 1, Joshua Gardner 1, Todd Hillaker 1, Sara M O’Rourke 1, Melanie C O’Leary 2, Elizabeth Carbonell 2, Christina Austin-Tse 2,5, Gabrielle Lemire 2,3, Jillian Serrano 2,3, Brian Mangilog 2, Grace VanNoy 2, Mikhail Kolmogorov 6, Eric Vilain 7, Anne O’Donnell-Luria 2,3,5,9, Emmanuèle Délot 7,9, Karen H Miga 1,9, Jean Monlong 8,9,∗, Benedict Paten 1,9,∗∗
PMCID: PMC11866955  PMID: 39862869

Summary

More than 50% of families with suspected rare monogenic diseases remain unsolved after whole-genome analysis by short-read sequencing (SRS). Long-read sequencing (LRS) could help bridge this diagnostic gap by capturing variants inaccessible to SRS, facilitating long-range mapping and phasing and providing haplotype-resolved methylation profiling. To evaluate LRS’s additional diagnostic yield, we sequenced a rare-disease cohort of 98 samples from 41 families, using nanopore sequencing, achieving per sample ∼36× average coverage and 32-kb read N50 from a single flow cell. Our Napu pipeline generated assemblies, phased variants, and methylation calls. LRS covered, on average, coding exons in ∼280 genes and ∼5 known Mendelian disease-associated genes that were not covered by SRS. In comparison to SRS, LRS detected additional rare, functionally annotated variants, including structural variants (SVs) and tandem repeats, and completely phased 87% of protein-coding genes. LRS detected additional de novo variants and could be used to distinguish postzygotic mosaic variants from prezygotic de novos. Diagnostic variants were established by LRS in 11 probands, with diverse underlying genetic causes including de novo and compound heterozygous variants, large-scale SVs, and epigenetic modifications. Our study demonstrates LRS’s potential to enhance diagnostic yield for rare monogenic diseases, implying utility in future clinical genomics workflows.

Keywords: long-read sequencing, structural variants, rare-disease diagnosis, gene conversion, haplotype phasing, methylation, clinical testing, variant annotation


Long-read sequencing (LRS) enhanced diagnostic yield in rare monogenic diseases, uncovering missed coding regions and rare variants while providing phasing and methylation. Out of 42 affected individuals, LRS established causal variants in 11 (3 previously undiagnosed), highlighting its potential as a single, rapid, and cost-effective alternative to complex genetic testing.

Introduction

Despite extensive efforts to identify pathogenic variants using whole-genome protocols, approximately 50% of individuals with suspected rare genetic diseases remain undiagnosed.1,2,3 This diagnostic gap is not solely due to interpretative and clinical challenges—such as incomplete understanding of gene functions, genomic regions, and disease phenotypes that hinder accurate genotype-phenotype inference—but also partly due to technical limitations in the short-read sequencing (SRS) that is widely used for genome sequencing.4 For example, pathogenic variants may reside in parts of the genome that are “inaccessible” to SRS techniques. These regions consist of combinations of highly similar or highly repetitive sequences to which short reads (150–250 bp of typically paired-end sequences) are frequently unable to unambiguously map, thereby hindering variant detection. There are more than 1,000 genes associated with these regions, some of which are known to be clinically relevant.5,6

Similarly, structural variants (SVs) have been shown to contribute to many rare diseases,7,8 but the complexity of assembling and detecting SVs with conventional SRS approaches is well documented.9,10 Numerous studies highlight high incidence of false positives and failure to detect SVs in low-complexity regions, such as those consisting of GC- or AT-rich repeats, tandem repeats, and genomic loci characterized by segmental duplications (frequently long, and sometimes highly similar sequence copies), which are hotspots for SVs and experience some of the highest mutation rates, in both germline and soma.11,12,13,14,15

In contrast, long-read sequencing (LRS) produces reads that are frequently 2–3 orders of magnitude longer than SRS. Such reads can accurately map across many large repetitive regions, thereby improving the detection of single-nucleotide variants (SNVs) in homologous gene copies16 and facilitating de novo assemblies for more effective detection and evaluation of SVs.17 Beyond improving variant resolution and detection, LRS offers long-range read-based haplotype phasing of variants, which can be critical for detecting compound heterozygous diagnoses, particularly when parental samples are unavailable to determine variant phase.18 Furthermore, most LRS uses single-molecule sequencing, facilitating the simultaneous detection of base modifications natively from the sequencing data and allowing for the study of genomic variation in conjunction with the methylation status of CpG sites.19

Despite the reported advantages of LRS, short-read protocols are the standard-of-care diagnostic tests for genome sequencing due to well-established analysis protocols and pipelines and the availability of large reference population databases for the prioritization of rare variants.20 Though not yet widely clinically available, recent studies have demonstrated that LRS can identify causal variants in both known and previously uncharacterized disease-associated genes that had eluded detection by SRS, such as complex SVs and low-complexity repeat expansions, in addition to facilitating phasing of variants also detected by SRS.21,22,23,24,25,26,27,28 However, many of these studies applied targeted LRS (adaptive sequencing) to loci of high phenotypic interest or included only a small number of samples. As a result, the additional diagnostic yield of genome-wide LRS over SRS is not clearly quantified at this time.

The two widely used contemporary LRS technologies, Pacific Biosciences’ (PacBio) HiFi sequencing and Oxford Nanopore Technologies’ (ONT) nanopore sequencing, differ fundamentally in their data-generation methods, leading to differences in read lengths and error rates.29 PacBio HiFi reads range from 15 to 20 kb,30 while ONT long reads range from 10 to 100 kb, with ultra-long reads reaching 100–300 kb.29 ONT sequencing currently has lower read accuracy than PacBio HiFi and is subject to more frequent base-calling errors in homopolymers and short tandem repeats, which lead to larger numbers of short insertion or deletion (indel) errors involving these sequences.

With continuous updates to pore chemistries and base-calling algorithms, ONT is improving in accuracy, making LRS on the ONT platform increasingly viable for clinical research.31 Compared with other sequencing technologies, ONT sequencing can also be integrated into very rapid sequencing protocols to enable variant detection and characterization in time-critical applications, such as acute care.32,33 In a recent study, Kolmogorov et al.34 demonstrated that state-of-the-art small-variant and SV calling performance is achievable using ONT reads from a single flow cell at high throughput, including a higher overall F1 score than Illumina for detecting SNVs according to Genome in a Bottle (GIAB) benchmarks.35 This protocol is potentially useful in diagnostic testing, as it enables accurate de novo assembly and the combined evaluation of different alteration types (i.e., methylation, SNVs, small indels, and SVs) in a single analysis, avoiding the sequential application of multiple tests and offering a potentially significant advantage in terms of cost and simplicity relative to less comprehensive approaches.

Here, we demonstrate the capabilities of this protocol to sequence and analyze rare disease samples, generating de novo assembly, haplotype-resolved small and SV calls, and haplotype-specific CpG methylation calls in a single workflow run. We applied the accompanying computational protocol, Napu (Nanopore Analysis Pipeline for U),34 to a cohort that included trios, comprising healthy parents of children with undiagnosed rare diseases and several singletons for whom parental data were unavailable. Five clinically diagnosed individuals were also included to demonstrate the ability of LRS to act as a single diagnostic test, providing improved breakpoint resolution, phasing, and ability to distinguish variants in highly homologous genes within segmental duplications. We evaluated the clinical utility of our one-flow-cell LRS-derived measurements by conducting a thorough comparison with those derived from SRS to identify the additional information yield LRS can provide. Out of the 42 affected individuals in the cohort (41 families), LRS identified diagnostic variants in 11 individuals. LRS confirmed all five previously known diagnoses, often providing additional details beyond what was clinically known. In addition to three individuals that could also have been diagnosed by SRS reanalysis, LRS enabled the diagnosis of three previously undiagnosed individuals by providing crucial information such as methylation, phasing, and accurate alignment in highly homologous segmental duplications.

Methods

Overview of sequencing protocol

Isolated DNA, white blood cells, or whole blood samples were received from Broad Institute and Children’s National Hospital (CNH). CNH samples were either from the Disorders of Sex/Development—Translational Research Network (DSD-TRN) biobank or from the Pediatric Mendelian Genomics Research Center (PMGRC)/University of California, Irvine (UCI) site of the Genomics Research to Elucidate the Genetics of Rare diseases (GREGoR) Consortium. Some had high-molecular-weight (HMW) DNA previously extracted for Optical Genome Mapping on the same samples, using the Bionano recommended protocol. Twenty probands with neurodevelopmental phenotypes, one affected sibling, and 40 unaffected parents (n = 61) were consented to the Broad Institute Rare Genomes Project (RGP), which includes the use and sharing of data for research purposes (Mass General Brigham Institutional Review Board [IRB] protocol 2016P001422) and had previous short-read whole-genome sequencing and analysis performed. Two EDTA tubes (4 mL) and one PAXgene RNA tube of whole blood were collected from participants from across the United States and sent at room temperature by overnight courier to the Broad Institute or CNH. Upon receipt, samples were frozen, and an EDTA tube from each participant was later shipped in bulk on dry ice to University of California, Santa Cruz (UCSC). The study was approved by the IRB of Children’s National Hospital (Pro00015852 for UCI-GREGoR samples and Pro10127 for DSD-TRN samples).

Of the 105 samples received at UCSC, six samples with pre-extracted DNA were excluded from the cohort for not meeting our quality standards for HMW DNA size. Early in protocol development, five whole blood samples failed DNA extraction due to low yield. Four of these were replaced and successfully sequenced, resulting in a total of 98 sequenced samples. HMW DNA was extracted using Circulomics Nanobind CBB Big DNA Kit (NB-900-001-01) or NEB Monarch HMW DNA extraction kit for cells and blood (NEB T3050). Approximately 5 μg of isolated DNA was sheared using Diagenode Megaruptor 3, DNA fluid+ kit (E07020001). The size of sheared DNA fragments was analyzed on the Agilent Femto Pulse System using genomic DNA 165 kb kit (FP-1002-0275). Fragment size distribution of post-sheared DNA had peak at approximately 50 kb. Small DNA fragments were removed from the sample using PacBio SRE (Short Read Eliminator) kit (SKU 102-208-300). Library preparation was carried out using ONT ligation sequencing kit V14 (SQK-LSK114). Sequencing was performed on the PromethION 48 sequencer using R10.4.1 flow cells. Each sample was used to prepare four libraries per flow cell. Flow cells were washed using the ONT wash kit (EXP-WSH004) and reloaded with a fresh library every 24 h for a total sequencing runtime of 96 h. Human bulk transcriptome sequencing was performed by the Genomics Platform at the Broad Institute of MIT and Harvard. The transcriptome product combines poly(A)-selection of mRNA transcripts with a strand-specific cDNA library preparation, with 150-bp reads and a mean insert size of 550 bp. Libraries were sequenced on the HiSeq 2500 platform to a minimum depth of 75 million STAR-aligned reads.

We then ran the Napu end-to-end pipeline on 98 samples, generating diploid de novo phased assemblies, harmonized variant calls against the GRCh38 reference genome (merging reference-based small variant calls and assembly-based SV calls), and haplotype-specific methylation calls.

Genome completeness analysis

To identify genomic regions uniquely callable by LRS and SRS, we used BAM files aligned to the GRCh38 and T2T-CHM13v2.0 reference genomes. Assembly gaps and simulated centromeric regions in the GRCh38 reference were excluded from the analysis to avoid false coverage estimates. Since raw data for SRS was not available to us, we reverted GRCh38-aligned SRS CRAMs to unaligned reads using Picard RevertSam (v.1.141) and remapped them to T2T-CHM13. We then analyzed per-base depth using mosdepth (v.0.3.4). Only aligned reads with a mapping quality of at least 10 (90% probability of correct mapping) were considered, and coverage bins were defined as 0:1:10:80, using the “--quantize” option to merge adjacent bases within the same coverage bins. Bases with depth 0 were binned as NO_COVERAGE, 1–9 as LOW_COVERAGE, 10–79 as CALLABLE, and ≥80 as HIGH_COVERAGE. The Gencode Comprehensive gene set release 45 for GRCh38 was used to define protein-coding genes (includes exons and introns) and coding exons. Mendelian disease-associated genes were defined as genes reported in OMIM with an associated phenotype and mode of inheritance36 as well as genes from ClinGen37 and GenCC.38 Ideograms were plotted using karyoploteR.39
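The mapping-quality and binning arithmetic described above can be illustrated with a minimal Python sketch. The function names are hypothetical and this is not how mosdepth is implemented; the sketch only mirrors the 0:1:10:80 thresholds and the Phred-scaled interpretation of mapping quality.

```python
def mapq_to_prob_correct(mapq: int) -> float:
    """Convert a Phred-scaled mapping quality to the probability that the mapping is correct."""
    return 1.0 - 10 ** (-mapq / 10)


def coverage_bin(depth: int) -> str:
    """Assign a per-base depth to the bin implied by the 0:1:10:80 quantization."""
    if depth < 0:
        raise ValueError("depth must be non-negative")
    if depth == 0:
        return "NO_COVERAGE"
    if depth < 10:
        return "LOW_COVERAGE"
    if depth < 80:
        return "CALLABLE"
    return "HIGH_COVERAGE"
```

Under these definitions, a mapping quality of 10 corresponds to a 90% probability of correct mapping, and depths of 10 and 79 both fall in the CALLABLE bin.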

SV-aware SnpEff-based clinical annotation workflow

We developed a clinical annotation workflow called LRGenotate to annotate SNVs, indels, and SVs from Napu harmonized variant call format (VCF). After splitting multi-allelic variants with bcftools,40 they were annotated with SnpEff41 v.5.1 using the GRCh38.105 pre-built database. Small variants (30 bp or less) and SVs were then annotated separately.

SVs were annotated using the sveval package42 to add frequency estimates and flag SVs that overlapped with known clinical SVs (nstd102 in dbVar) or any DGV SVs (GRCh38_hg38_variants_2020-02-25.txt). To annotate their frequency, each SV was matched with SVs in seven public SV databases (DGV common SVs/nstd186, gnomAD-SV/nstd166,43 HPRC v.1.0, HGSVC2, 1000GP ONT Miller, 1000GP ONT Vienna, 1000GP ONT SV imputation panel) if their reciprocal overlap was at least 50% and they were located less than 100 bp apart. Of note, the simple repeat track from the UCSC Genome Browser was used to add wiggle room to help match SVs placed differently around repeats. SVs were considered rare if the frequency of all matched variants in all databases was below 1%. SVs were also annotated with the number of enhancers of known disease-associated genes that they overlapped, using GeneHancer and ENCODE ELS lists as enhancer catalogs. In parallel, AnnotSV44 was used to annotate SVs using the GRCh38 human pre-built database and GeneHancer v.5.9.45 When human phenotype ontology terms were available, they were provided as input to AnnotSV. Finally, SVs were regenotyped from the raw long reads using vg42 to count the number and proportion of reads supporting the alternate allele. SV calls with no supporting reads were filtered out.
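The matching and rarity rules above can be sketched in Python as follows. This is an illustrative simplification, not the sveval implementation: the `SV` class and function names are hypothetical, SVs are treated as simple intervals (insertions and the simple-repeat wiggle room are not modeled), and "distance" is assumed to mean the distance between start coordinates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SV:
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    svtype: str  # e.g. "DEL", "DUP"


def reciprocal_overlap(a: SV, b: SV) -> float:
    """Smallest fraction of either SV that is covered by the other."""
    if a.chrom != b.chrom:
        return 0.0
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a.end - a.start), overlap / (b.end - b.start))


def is_match(a: SV, b: SV, min_ro: float = 0.5, max_dist: int = 100) -> bool:
    """Same type, reciprocal overlap >= min_ro, and starts less than max_dist bp apart (assumed)."""
    return (
        a.svtype == b.svtype
        and reciprocal_overlap(a, b) >= min_ro
        and abs(a.start - b.start) < max_dist
    )


def is_rare(query: SV, catalog: list, max_af: float = 0.01) -> bool:
    """Rare if every matched catalog SV has an allele frequency below max_af.

    catalog is a list of (SV, allele_frequency) pairs pooled from all databases.
    """
    return all(af < max_af for sv, af in catalog if is_match(query, sv))
```

Note that under this rule an SV with no match in any database is also considered rare.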

After the first SnpEff annotation, small variants were further annotated with SnpSift v.5.1 to flag ClinVar variants (2023-03-18 version) and annotate their frequency in gnomAD v.3.0 as well as GERP++, CADD, MetaRNN, and ALFA provided through dbNSFP v.4.4.46 SnpSift was also used to keep only small variants with “HIGH” or “MODERATE” impact or a percent loss-of-function transcript higher than 0.9, or those that are in ClinVar.

For variant curation, we also integrated gene-level information such as pLI scores from gnomAD v.2.1.1, the dosage sensitivity map from Collins et al.,47 genes in OMIM, and genes from ClinGen and GenCC with moderate-definitive strength disease assertions. LRGenotate is available in Snakemake and WDL formats at https://github.com/jmonlong/variant_annotation_wf.

Family-based analysis of LRS in seqr

SNVs and indels called by DeepVariant for the 20 RGP families (19 trios, 1 quad) were loaded to the seqr genomic analysis platform for family-based monogenic disease analysis (seqr.broadinstitute.org/). In brief, “De Novo/Dominant” and “Recessive” variant searches with both “Restrictive” and “Permissive” thresholds for reports of pathogenicity, functional consequence and predicted deleteriousness, allele frequency (in the gnomAD reference population48), and call quality were applied as reported in detail by Pais et al.49 The returned variants were assessed for relevance to the reported phenotype and disease mechanism (e.g., loss of function, gain of function) using the ClinVar,50 OMIM,51 and DECIPHER52 databases, as well as their potential involvement in uncharacterized disease associations using external resources linked to seqr, spanning gene-level data (e.g., gnomAD constraint metrics53), tissue expression data from the Genotype-tissue Expression (GTEx) portal,54 and functional data (e.g., mouse models), among others.

De novo pipeline for LRS small variants

To identify putative de novo small variants (SNVs and indels <30 bp) from LRS, individual harmonized VCFs from Napu for a trio were preprocessed to retain PASS calls and non-homozygous reference variants. This was followed by running rtgtools vcfeval (v.3.12.1) in a pairwise manner to identify variants that were called in the child but not in either parent. Variants with GQ < 20, DP < 10, or homozygous alternate calls were removed. For male samples, hemizygous alternate variants in sex chromosomes with DP ≥ 5 were retained. Next, rare variants (AF < 0.001 as per gnomAD v.3.0) were selected. These constituted the “whole-genome” callset. The “annotated” callset included non-intergenic variants, predicted by SnpEff to have some impact on overlapping or nearby genes. More annotations including GERP++, CADD, and MetaRNN provided through dbNSFP (v.4.4) were added. Further, to generate a stringent set of DNVs, we utilized an in-house script for in silico read validation using parental reads. Only variants with no read support in either parent were retained. Finally, DNVs called in more than one individual from the entire cohort were removed. The workflow described is available in a WDL format at https://github.com/shlokanegi/denovo_smallvars.

De novo pipeline for SRS small variants

SRS was performed on DNA purified from blood by the Broad Institute Genomics Platform on an Illumina sequencer to 30× average depth. Raw sequence reads were aligned to the GRCh38 reference genome. Variants were called with GATK version 4.1.8.07 in the form of SNVs and indels <30 bp to generate a joint called VCF file. To identify putative de novo variants, the VCF file was loaded with Hail (https://github.com/hail-is/hail), and Hail’s de_novo() function was called, with variant allele frequencies from gnomAD GRCh38 exomes v.2 and genomes v.3 used as priors. After calling, putative de novo variants were excluded from the dataset if present in gnomAD or within their own call set at an allele frequency ≥0.1%, containing a GATK Variant Quality Score Recalibration (VQSR) flag in the filters field of the VCF, falling within a reported problematic region of the genome (downloadable from the UCSC browser: ucscUnusualRegions.bed, encBlacklist.bed, grcExclusions.bed), falling within close proximity to other de novo variants in the same individual (within 1 kb) or having a call rate of ≤0.99 in the call set. Variants were also excluded if the proband had <5 alternative or <5 reference reads or an allele balance <0.2 (<0.3 for indels) or if any of the proband, mother, or father had a depth <10 (<15 for indels) or a genotype quality <20 (<25 for indels).

Validation categories for SRS and LRS de novo SNVs

To assess the efficacy of LRS in detecting additional rare de novo variants (DNVs) compared to SRS, we applied the LRS de novo calling pipeline for small variants to 21 trios generated from 20 families (19 trios and 1 quad, using both affected siblings in the quad as separate trios) with available data. SRS rare DNVs were generated with a separate SRS de novo pipeline. For ease and strictness of comparison, analysis was restricted to SNVs. These were filtered out if called in >1 family or if even a single read supported them in parental samples (a high level of stringency that removes recurrent sequencing errors but potentially also excludes instances of parental mosaicism).

The criteria for assigning a validation category to technology-specific exclusive de novo SNVs (DN-SNVs) are as follows. (1) “Called by other technology as a non-DNV”: the variant was called by the other technology’s variant caller but classified as non-DNV, because it was either present in one or both parents or was not unique within the cohort. (2) “Likely false-positive”: the call showed one or more of the following: AB < 0.2, a multi-allelic site, a noisy region with multiple indels on the affected haplotype, or low-quality bases. (3) “Likely postzygotic mosaic”: discordant DN-SNVs (SRS-only DN-SNVs excluding categories 1 and 2) were classified as likely postzygotic mosaic if (a) LRS phasing showed the presence of at least three reference alleles on the alt-haplotype; (b) phasing was accurate, confirmed by at least two surrounding heterozygous SNVs on both sides; (c) no third allele was present when mapped to T2T-CHM13 or GRCh38; (d) no indels were present on reads phased to the alternate haplotype, which could be homopolymer or short tandem-repeat errors on long reads; and (e) AB on both SRS and LRS was <0.5. (4) “Likely true-positive”: all the remaining exclusive DN-SNVs with AB ≥ 0.2.

De novo pipeline for LRS SVs

A first list of de novo SV candidates was produced by comparing the SV LRS calls in the proband and their parents using the sveval package. As above, SVs are matched based on their location, type, and size, with some added wiggle room in annotated simple repeats (UCSC Genome Browser track). Looser matching thresholds (5% reciprocal overlap, 500 bp distance) were used to produce a stringent list of de novo candidates by ensuring that no similar SVs were called in the parents. These de novo SV candidates were then regenotyped in the proband and the parents using vg and the raw long reads. We kept variants that were supported by at least 25% of the reads in the proband and less than 10% in both parents. Of note, variant calls overlapping an assembly gap in GRCh38 (gap track from UCSC Genome Browser) were removed.
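The read-support filter applied after regenotyping can be sketched as follows. The function names and the (alternate reads, total reads) input format are assumptions made for illustration; the sketch does not reproduce vg's output or the authors' scripts.

```python
def alt_fraction(alt_reads: int, total_reads: int) -> float:
    """Proportion of reads supporting the alternate allele (0.0 when there is no coverage)."""
    return alt_reads / total_reads if total_reads > 0 else 0.0


def is_de_novo_sv(proband, mother, father,
                  min_proband: float = 0.25, max_parent: float = 0.10) -> bool:
    """Keep an SV supported by >= 25% of proband reads and < 10% of reads in each parent.

    Each argument is an (alt_reads, total_reads) tuple from regenotyping.
    """
    return (
        alt_fraction(*proband) >= min_proband
        and alt_fraction(*mother) < max_parent
        and alt_fraction(*father) < max_parent
    )
```

For example, an SV with 12 of 30 supporting reads in the proband and none in either parent passes, whereas the same SV with 4 of 30 supporting reads in one parent does not.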

Short tandem-repeat expansion

Phased SV calls by Hapdiff were clustered based on the simple repeat annotation from the UCSC Genome Browser (simpleRepeat track). For each annotated site, we retrieved how many bases were deleted or inserted within the repeat region separately for each haplotype.

We ran the same analysis on two public datasets of healthy individuals to use as controls when looking for outliers (see below). First, phased SVs derived from the Human Pangenome Reference Consortium (HPRC) v.1.0 pangenome made with Minigraph-Cactus were extracted from hprc-v1.0-mc.grch38.vcfbub.a100k.wave.vcf.gz VCF file, using only variants larger than 30 bp. Second, phased SVs derived from the Human Genome Structural Variant Consortium (HGSVC) assemblies, freeze 3, were extracted from the hgsvc.freeze3.sv.alt.vcf.gz VCF file.

For each simple repeat site, we compared the size delta (number of bases deleted/added) in each proband to two control sets: the full cohort and the HPRC/HGSVC controls. For each control set, we computed a Z score as the value for the proband minus the average across controls, divided by the standard deviation across the controls. We then selected the less extreme of the two Z scores. Hence, a Z score higher than 5 means that the site is expanded by more than 5 standard deviations compared to both the cohort and the HPRC/HGSVC controls.
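As a worked illustration of this outlier score, the Python sketch below computes both Z scores and keeps the one closest to zero. The function names are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not specify which estimator was used.

```python
from statistics import mean, stdev


def z_score(value: float, controls: list) -> float:
    """(value - mean of controls) / standard deviation of controls (sample SD assumed)."""
    sd = stdev(controls)
    if sd == 0:
        raise ValueError("controls have zero variance; Z score is undefined")
    return (value - mean(controls)) / sd


def least_extreme_z(value: float, cohort: list, public_controls: list) -> float:
    """Return whichever of the two Z scores has the smaller absolute value."""
    z_cohort = z_score(value, cohort)
    z_public = z_score(value, public_controls)
    return min((z_cohort, z_public), key=abs)
```

With a size delta of 30 bp, a tightly distributed cohort, and more variable public controls, the score is driven by the public controls, so the site would only be flagged if it is an outlier against both sets.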

Each region was annotated with the distance to the nearest coding exon defined in Gencode v.41. We annotated the nearest protein-coding gene and the nearest gene known to be associated with a disease.

Identification of sex chromosomal translocations

The Napu pipeline produced reads aligned to the GRCh38 reference genome. We first ran mosdepth (v.0.3.4) on these BAMs to identify abnormal copy number in two probands. Specifically, we observed coverage suggesting two copies for most of chromosome X (chrX) but also evidence for one copy of a small part of chromosome Y (chrY), including SRY (MIM: 480000). For those two samples with suspected translocations, we remapped the reads to the T2T-CHM13 genome and found evidence of the translocation in the form of long split-mapped reads where each part mapped on each side of the approximate breakpoints suggested by the read coverage analysis. Those supporting reads were collected using an in-house script that identified the read alignments breaking around the location of copy-number changes. Split-mapped reads allowed both breakpoints of the translocation to be located at base-level resolution. Of note, we also tested calling the translocation using Sniffles2 on reads mapped to the T2T-CHM13 genome using minimap2 or NGMLR.

Detecting gene fusion and conversions of CYP21A2

We developed a pangenome-based approach, called Parakit, to characterize the complex locus hosting CYP21A2 (MIM: 613815; unpublished data). A local pangenome is first built, collapsing the module with the pseudogene and the module with CYP21A2 in the GRCh38 reference. High-quality assemblies produced by the HPRC are also integrated in this pangenome, which is then used as a reference to re-align reads from this region. Because the similar copies are collapsed in the pangenome, the reads now align to only one position. Furthermore, they traverse the pangenome through nodes that are informative for inference because they are specific to the CYP21A1P (pseudogene) or CYP21A2 modules. Using this information, Parakit analyzes the read alignments on the pangenome graph to identify and visualize reads supporting fusion/gene-conversion events and the corresponding changes in read coverage or allelic balance. Parakit also infers the most likely haplotypes by enumerating walks through the pangenome graph that are consistent with the aligned reads. Parakit is under development and available on GitHub at https://github.com/jmonlong/parakit (unpublished data).

Annotation, segmentation, and average regional methylation calculation within CpG islands and cis-regulatory elements

Within Napu, modkit was run with the default --filter-threshold of 10th percentile, with the --cpg, --collapse-strands, and --partition-by HP parameters. The generated haplotype-resolved bedMethyl files were used to calculate regional methylation in predefined regions of interest. We used CpG Islands (CGI) (available from the UCSC browser) and human cis-regulatory elements (CCREs) from ENCODE Registry V3 as target regions. Large target regions can often represent variable methylation, which will be incorrectly captured by regional average methylation calculation. To tackle this, we developed SegMeth (https://github.com/shlokanegi/SegMeth), which segments target regions into windows that exhibit statistically different methylation patterns. SegMeth was run with relaxed parameters to avoid oversegmentation (-p 1e-7, -ut 30, -mt 70, -minCG 5). Segments were generated for all 98 samples and collapsed to generate intersected segments using an in-house script. Gencode’s comprehensive gene set, release 45, was used to annotate all target regions (CGI segments and CCREs) using bedtools closest and bedtools groupby (v.2.29.1). Each target region was annotated with the nearest coding exon(s) and gene(s) (within ±1 Mbp for CCREs and ±1 kbp for CGIs). A higher distance threshold for CCREs was considered due to the presence of enhancers in the database, some of which are known to exert distal effects on target genes by DNA looping in 3D chromatin space.55

Average methylation across annotated target regions was calculated for all samples using both haplotype-specific bedMethyl pileups. For each target region, an average of %methylation was taken across all individual CpGs with at least 5× valid coverage. Regions in samples with <10 CpGs or >50% of low-coverage CpGs were flagged to aid with downstream processing and differentially methylated region (DMR) calculations. The workflow described above is available as a WDL at https://github.com/nanoporegenomics/Napu_wf/blob/r10/wdl/workflows/bedtoolsMap.wdl.
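The regional averaging and flagging rules can be sketched in Python as below. This is not the bedtoolsMap workflow itself: the function name and input format are hypothetical, and the "<10 CpGs" rule is assumed to refer to the total number of CpGs in the region rather than the number passing the coverage filter.

```python
def regional_methylation(cpgs, min_cov: int = 5, min_cpgs: int = 10,
                         max_low_cov_frac: float = 0.5):
    """Average percent methylation over well-covered CpGs in one region and haplotype.

    cpgs: list of (valid_coverage, percent_methylated) tuples, one per CpG.
    Returns (average or None, flagged).
    """
    kept = [pct for cov, pct in cpgs if cov >= min_cov]
    n_total = len(cpgs)
    n_low = n_total - len(kept)
    # Flag regions with too few CpGs (assumed: total count) or mostly low-coverage CpGs.
    flagged = n_total < min_cpgs or n_low / n_total > max_low_cov_frac
    average = sum(kept) / len(kept) if kept else None
    return average, flagged
```

Returning the flag alongside the average, rather than dropping the region, keeps the decision about exclusion in the downstream DMR step.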

Detection of differentially methylated regions

For each regional target (CGI segments and CCREs), the average methylation values of proband haplotypes were systematically compared to those of two control sets (control set 1: all healthy parents; control set 2: remaining probands) using paired t tests. Multiple testing corrections were applied using the Benjamini-Hochberg (BH) method. Cohen’s d was reported as a measure of effect size, and the methylation difference between the proband and the control means was used to filter out statistically significant but biologically irrelevant DMRs. Only target regions with regional methylation measurements in at least half the samples in the cohort were tested. Significant DMRs were identified using the following criteria: effect size ≥3, BH-corrected p values <0.001, and mean methylation difference ≥40% for both proband vs. control set 1 and proband vs. control set 2 comparisons. To identify potential disease-associated DMRs per proband, we inferred their transmission status (de novo across all samples or inherited from one or both parents). This allowed for further investigation into rare variants within or near DMRs in the context of haplotype specificity as well as transmission pattern (inherited or de novo).
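The multiple-testing correction and the significance thresholds can be sketched as follows. The p values, Cohen's d, and mean differences are assumed to be computed upstream; the function names are hypothetical, and applying the effect-size and difference thresholds to absolute values (so that both hyper- and hypomethylated regions qualify) is an assumption.

```python
def bh_adjust(pvals: list) -> list:
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def passes_thresholds(cohens_d: float, adj_p: float, mean_diff_pct: float) -> bool:
    """Effect size >= 3, BH-corrected p < 0.001, and mean methylation difference >= 40%."""
    return abs(cohens_d) >= 3 and adj_p < 0.001 and abs(mean_diff_pct) >= 40


def is_significant_dmr(vs_parents: tuple, vs_other_probands: tuple) -> bool:
    """A region must pass in both comparisons (control set 1 and control set 2)."""
    return passes_thresholds(*vs_parents) and passes_thresholds(*vs_other_probands)
```

Requiring both comparisons to pass guards against regions that differ from the parents only because of age-related or cohort-wide methylation patterns.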

Episignature analysis for Coffin-Siris syndrome 1

CpG dinucleotides differentially methylated in Coffin-Siris syndrome 1 (CSS1) (MIM: 135900) were extracted from supplemental data S4 in Aref-Eshghi et al.,56 which met their study-wide significance threshold of 0.01. Sites were lifted over from hg19 to hg38 using the UCSC Genome Browser liftover tool.57 Modkit-generated CpG bedMethyl files and custom python scripts were used to generate heatmaps for the methylation at the CSS1 episignature sites for the entire cohort including the proband under investigation (PMGRC-146-146-0). Permutation testing generated 1,000,000 random profiles through selection of a random value from each row. A total methylation score was calculated by counting sites with above mean methylation at sites hypermethylated in CSS1 and sites with below mean methylation at sites hypomethylated in the CSS1 episignature.
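The scoring and permutation scheme can be sketched in Python as below. This is a simplified illustration rather than the custom scripts used in the study: the function names, the "hyper"/"hypo" labels, and the matrix layout (one row per episignature site, one column per sample) are assumptions.

```python
import random


def methylation_score(profile, row_means, directions) -> int:
    """Count sites that deviate from the per-site mean in the direction of the episignature."""
    score = 0
    for value, site_mean, direction in zip(profile, row_means, directions):
        if direction == "hyper" and value > site_mean:
            score += 1
        elif direction == "hypo" and value < site_mean:
            score += 1
    return score


def permutation_scores(matrix, directions, n_profiles: int, seed: int = 0) -> list:
    """Score random profiles built by drawing one random sample value from each row.

    matrix[i][j] is the methylation at site i in sample j.
    """
    rng = random.Random(seed)
    row_means = [sum(row) / len(row) for row in matrix]
    scores = []
    for _ in range(n_profiles):
        profile = [rng.choice(row) for row in matrix]
        scores.append(methylation_score(profile, row_means, directions))
    return scores
```

The score observed for the proband can then be compared against the distribution returned by `permutation_scores`; how that comparison was summarized in the study is not specified here.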

Results

Scalable ONT sequencing of blood samples yields high throughput and long read lengths

Several recent studies utilized ultra-long (≥100 kb) ONT sequencing to produce high-quality de novo assemblies of human genomes. However, multiple flow cells were used to achieve sufficient genomic coverage, as ultra-long DNA preparation protocols typically see lower sequencing yields. In our work, we further optimized the scalable Napu DNA processing and library preparation protocol for the ONT R10 chemistry34 (methods), aiming to achieve sufficient coverage and read length to be useful for clinical sequencing. For 21 samples that were sequenced while we were still optimizing the protocol, we generated data from two R10.4.1 PromethION flow cells to reach the target throughput of ≥30× coverage and ≥30 kb read N50. For the remainder, each sample was sequenced with a single flow cell (average ∼110 Gb output per flow cell, corresponding to ∼36× coverage; average read length N50 of ∼32 kb; Figures 1 and S1; Table S1). In total, we successfully sequenced 98 blood samples comprising 26 trios (affected child plus unaffected parents), one quad (two affected siblings plus unaffected parents), one duo (affected child plus single unaffected parent), a single unaffected mother of a proband (pre-extracted DNA from the other two family members was too degraded for sequencing, as indicated by a significant DNA smear), and 13 affected singletons. The majority of these samples were obtained from whole blood, while some were sourced from white blood cells and HMW DNA. The median read identity aligned to the GRCh38 reference genome was 99.2%.

Figure 1.

Single-flow-cell scalable sequencing protocol

(A) Cost-efficient, scalable, one-flow-cell nanopore sequencing protocol.

(B) From top to bottom: read length N50, that is, the read length (y axis) such that reads of this length or longer represent 50% of the total sequence; total sequenced bases/haploid human genome coverage (assuming a 3.1-Gbp genome) from total reads for each sample; distribution of read identities (percentage of matching bases in reads when aligned to the reference genome) when aligned to T2T-CHM13 v.2.0. Supporting data are available in Table S1.
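The two summary statistics in this legend, read N50 and haploid coverage, can be computed as in the sketch below. The function names are illustrative, and the 3.1-Gbp genome size is the value assumed in the legend.

```python
def read_n50(read_lengths):
    """Length L such that reads of length >= L contain at least half of all sequenced bases."""
    total = sum(read_lengths)
    running = 0
    for length in sorted(read_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


def haploid_coverage(total_bases, genome_size=3.1e9):
    """Average depth over a haploid genome of the given size."""
    return total_bases / genome_size
```

For example, an output of 110 Gb over a 3.1-Gbp genome gives roughly 35.5×, consistent with the ∼36× reported per flow cell.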

Genome completeness analysis reveals complex Mendelian disease-associated genes well covered by LRS only

Using GRCh38-aligned reads with a mapping quality of at least 10 (estimated 90% probability of correct mapping), the autosomal coverage across 21 affected individuals (20 probands and 1 affected sibling) for whom both SRS and LRS data were available was similar for LRS (median 35.21, minimum 24.26, maximum 46.91) and SRS (median 33.52, minimum 23.33, maximum 45.25) (Table S2). We quantified the percentage of bases in the reference genome that were callable (coverage between 10× and 80×), that is, bases against which we could call variants (methods). LRS showed a higher proportion of callable bases (average 0.82% more) than SRS, covering 92.18% of the GRCh38 reference genome; LRS also had more high-coverage and fewer low- or no-coverage bases than SRS (Figure 2A and Table S3). Mapping to T2T-CHM13, the autosomal coverage increased for all of these samples (LRS median 37.17×, SRS median 36.8×). Reinforcing the study by Aganezov et al.,58 relative to GRCh38, a larger additional fraction (5.72%) of the T2T-CHM13 genome was callable by LRS (93.99%) than by SRS (88.27%) (Figure 2A; Tables S4 and S5), indicating that the benefits of reference-based LRS analysis will grow with a transition to more complete telomere-to-telomere (T2T) and pangenome references.59 With GRCh38-aligned reads, we found many >1-kb regions exclusively callable by LRS consistently across all chromosomes (Figure 2B, left, showing proband RGP_696_3 as a representative example). Interestingly, the relatively few SRS-only callable regions all disappeared on mapping to the T2T-CHM13 reference (Figure 2B, right), with manual analysis confirming they were the result of mapping artifacts caused by the incompleteness of GRCh38.
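The "estimated 90% probability" attached to a mapping quality of 10 follows from the Phred-scaled definition of MAPQ, in which the probability of a wrong placement is 10^(−MAPQ/10). A minimal sketch (the function name is illustrative):

```python
def mapq_to_prob_correct(mapq):
    """Convert a Phred-scaled mapping quality into the estimated probability of correct mapping."""
    return 1.0 - 10 ** (-mapq / 10)
```

A MAPQ of 10 therefore corresponds to 0.9, and a MAPQ of 20 to 0.99.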

Figure 2.

Genome completeness analysis reveals complex Mendelian disease-associated genes callable by long-read sequencing only

(A) Genome-wide coverage distribution across all probands for GRCh38 and T2T-CHM13, calculated using reads with a mapping quality (MAPQ) greater than 10. Assembly gaps and simulated centromeric regions were excluded for GRCh38-aligned reads. Bases are categorized into four coverage levels: CALLABLE (10–80×), HIGH_COVERAGE (>80×), LOW_COVERAGE (0–10×), and NO_COVERAGE (0×).

(B) Ideogram for GRCh38 (left) and T2T-CHM13 (right) for proband RGP_696_3, showing genomic regions >1 kb callable by LRS with no coverage in SRS (blue) and regions >1 kb callable by SRS with no coverage in LRS (magenta). Red cytoband represents the centromere.

(C) Cumulative counts of genomic features (coding exons, Mendelian disease-associated genes, and all protein-coding genes) based on overlap fraction. The x axis shows the fraction of each feature’s length. The y axis shows the number of genomic features with LRS-only/SRS-only callable coverage over at least a fraction (x) of their length (y axis limit is set to 700 for clarity).

(D) Ideogram highlighting seven Mendelian disease-associated genes that have most of their length callable by LRS only. Red cytoband represents the centromere.

(B), (C), and (D) are shown for one proband, and supporting data for other probands are available in Table S2.
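The four coverage categories used in (A) can be illustrated with a small per-base classifier. This is a sketch only: the exact handling of the boundaries (whether depths of exactly 10× and 80× count as callable) is an assumption, as are the function names.

```python
from collections import Counter

CATEGORIES = ("CALLABLE", "HIGH_COVERAGE", "LOW_COVERAGE", "NO_COVERAGE")


def coverage_class(depth, low=10, high=80):
    """Assign one base to a coverage category (boundary handling is assumed)."""
    if depth == 0:
        return "NO_COVERAGE"
    if depth < low:
        return "LOW_COVERAGE"
    if depth <= high:
        return "CALLABLE"
    return "HIGH_COVERAGE"


def class_fractions(depths):
    """Fraction of bases in each category, given one depth value per base."""
    counts = Counter(coverage_class(d) for d in depths)
    return {name: counts[name] / len(depths) for name in CATEGORIES}
```

Applied to per-base depths from reads passing the MAPQ filter, the `CALLABLE` fraction corresponds to the percentage of callable bases reported in the text.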

A median of 111 genes per affected individual had at least half their length covered by LRS and not by SRS, including a median of five Mendelian disease-associated genes (Table S6); 38 genes had their entire length callable by LRS alone, and 280 genes had at least one coding exon per gene that was entirely covered by LRS alone, including the first coding exon in ∼134 of these genes, where SRS is often known to exhibit reduced coverage due to high GC content.60,61,62,63 Reversing the analysis, SRS exclusively covered a median of two coding exons, all of which appeared to result from mismapping in segmental duplications, and did not fully cover any genes that LRS did not. Examining one affected individual as an example, seven Mendelian disease-associated genes—SMN1 (MIM: 600354), C4A (MIM: 120810), C4B (MIM: 120820), CBS (MIM: 613381), CRYAA (MIM: 123580), ICOSLG (MIM: 605717), and PRODH (MIM: 606810)—had at least half their length covered by LRS and not by SRS (Figures 2C and 2D); C4A and C4B, associated with complement component 4a and 4b deficiency, are located within the highly polymorphic major histocompatibility complex locus and overlap regions of high-identity segmental duplications64 (Figure S2). Similarly, SMN1, with >99.9% sequence identity to its paralog SMN2 (MIM: 601627), resides in a large and complex segmental duplication region.27 These results highlight the added value of LRS in capturing genetic information that may be missed by traditional short-read methods.

LRS detects additional rare functionally annotated small variants

To carry out an unbiased comparison of the number of functionally annotated variants (FAVs), we applied a clinical annotation pipeline (methods) to LRS and SRS small variant call sets (variants <30 bp) for 21 affected individuals with data from both sequencing technologies. FAVs were defined as non-homozygous reference exonic variants, with predicted HIGH/MODERATE impact or loss of function. With genotype quality (GQ) ≥20, we observed a high concordance between functionally annotated SNVs (FA-SNVs) identified by SRS and LRS, with 91.6% of LRS FA-SNVs being called by SRS and 97.5% of SRS FA-SNVs being called by LRS (Figure 3A). LRS detected a median of 1,437 FAVs not detected by SRS per sample, whereas SRS called a median of 494 FAVs not detected by LRS (Figure 3B).
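The concordance percentages above are set comparisons between two call sets. The sketch below shows the arithmetic, assuming each variant is keyed by a normalized (chromosome, position, reference allele, alternate allele) tuple and matched exactly. The key format and function name are assumptions; real comparisons also need consistent variant normalization, which is not shown.

```python
def concordance(lrs_calls, srs_calls):
    """Summarize overlap between two variant call sets keyed by (chrom, pos, ref, alt)."""
    lrs, srs = set(lrs_calls), set(srs_calls)
    shared = lrs & srs
    return {
        "shared": len(shared),
        "lrs_only": len(lrs - srs),
        "srs_only": len(srs - lrs),
        "pct_lrs_in_srs": 100 * len(shared) / len(lrs) if lrs else 0.0,
        "pct_srs_in_lrs": 100 * len(shared) / len(srs) if srs else 0.0,
    }


# Toy call sets (hypothetical variants).
toy_lrs = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
           ("chr2", 50, "G", "A"), ("chr3", 10, "T", "C")]
toy_srs = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
           ("chr2", 75, "G", "C")]
toy_summary = concordance(toy_lrs, toy_srs)
```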

Figure 3.

LRS detects additional rare functionally annotated small variants

(A) Comparison of HIGH and MODERATE impact functionally annotated small variants (top, SNVs; bottom, indels) between SRS and LRS.

(B) Linear breakdown of LRS-only FAVs reveals additional rare variants in Mendelian disease-associated genes.

(C) Example from sample RGP_1081_3 showing a rare, heterozygous, MODERATE impact missense variant in KRT86, a gene associated with autosomal dominant Monilethrix, located in a region unmappable with short reads. This was not found to be clinically relevant in the proband.

Stratification revealed that per sample, a median of 41 LRS-exclusive FAVs were rare (allele frequency <0.001 in gnomAD v.3 and unique to a single family; 38 FA-SNVs, 4 FA-indels per sample), of which a median of 4 were located in Mendelian disease-associated genes. Further, 80% of LRS-exclusive rare FAVs overlapped segmental duplications, and 66% were in low short-read mappable regions (sourced from GIAB GRCh38 stratifications). Manual investigation confirmed that SRS maps poorly to these sites (Figure 3C provides an example), highlighting the additional yield of LRS in SRS-inaccessible regions.

Conversely, of SRS-exclusive FAVs, a median per sample of 24 were rare (20 FA-SNVs, 4 FA-indels), four of which affected Mendelian disease-associated genes (Figure S3A). More than 200 of the SRS exclusive FAVs per sample were indels; however, notably, LRS and SRS called similar median numbers (n = 4) of rare exclusive FA-indels, despite indel calling being a challenge with ONT in homopolymers and tandem repeats. A median of 46% of SRS-exclusive rare FAVs overlapped segmental duplications or were in low short-read mappable regions (Figure S3B). A manual investigation of 21 variants in these regions (one randomly picked from each sample) revealed 16 likely false positives (multi-allelic or allele balance <0.2). However, most SRS-only FAVs were called by LRS but with slightly lower GQ. Since GQ calibration between SRS and LRS variant callers is different, we compared unfiltered call sets and found median per sample LRS-only FAVs doubled to 3,046 (with 99 rare [90 FA-SNVs, 13 FA-indels], of which 15 were in Mendelian disease-associated genes), while SRS-only FAVs dropped to 334 (with 22 rare [16 FA-SNVs, 2 FA-indels], of which five were in Mendelian disease-associated genes), suggesting that future technology development leading to more confident, higher GQ LRS variant call sets could enhance the benefits of LRS over SRS for FAV detection (Figure S4).

Comparison of de novo SNVs between SRS and LRS identifies postzygotic mosaicism

Across 21 affected individuals with both LRS and SRS data, after strict filtering (methods) and restricting to de novo SNVs (DN-SNVs), LRS detected a genome-wide median of 65 DN-SNVs per sample, while SRS detected 62 (Figure 4A and Table S7). Annotated DN-SNVs (methods) were also similar for LRS (median 47) and SRS (median 45). Aggregating across samples, 972 whole-genome DN-SNVs (74.6% of SRS DN-SNVs and 74.26% of LRS DN-SNVs) and 689 annotated DN-SNVs (74.25% of SRS DN-SNVs and 76.73% of LRS DN-SNVs) were called by both technologies (Figure 4B).

Figure 4.

De novo SNV comparison between SRS and LRS

(A) Counts of rare DN-SNVs (genome wide and annotated) called exclusively by each technology (LRS or SRS).

(B) Comparison between LRS and SRS DN-SNV callsets. Bar charts in the center represent concordance. Pie charts on each side (left for LRS-only and right for SRS-only) show the proportion of exclusive DN-SNVs that upon IGV inspection are found to be likely false positive, called by the other technology as non-DN-SNV, likely true positive, or likely postzygotic mosaic.

(C) Allele balance of likely postzygotic mosaic DN-SNVs in SRS (left) and LRS (right) reads, mapped to both GRCh38 and T2T-CHM13.

(D) Correlation of allele balance between SRS and LRS for potential mosaic DN-SNVs compared to the concordant set (DN-SNVs called by both technologies). Allele balance is consistent across GRCh38 and T2T-CHM13 mapped reads.

(E) (Top) Likely true-positive LRS-only DN-SNVs in GIAB low short-read mappable regions. (Bottom) Likely true-positive LRS-only and SRS-only DN-SNVs stratified by overlap with GIAB low-complexity regions. Supporting data are available in Tables S7–S14.

Sequencing-technology-specific DN-SNVs were categorized into four validation categories: “A, called as non-DNV by other technology’s variant caller”; “B, likely false-positive”; “C, likely postzygotic mosaic”; and “D, likely true-positive” (Figure 4B; Tables S8 and S9; methods). Most sequencing-technology-specific DN-SNVs fell into category A, as they were called by the other technology’s variant caller, but were subsequently filtered out as a potential de novo variant due to being either present in one or both parents or being common within the cohort (Tables S10–S13).

Excluding the variants in categories A and B, we examined the remaining 54 SRS-only DN-SNVs and found LRS read evidence for 50, although they were not genotyped by DeepVariant. Upon investigation, we consistently found the alternate allele phased to one haplotype in LRS, but many reads on that haplotype showed the reference allele (Figure S5). Possible reasons included incorrect phasing, copy-number variation, and mosaicism. Normal coverage and consistent phasing with neighboring SNVs ruled out the first two. The allele balance (AB) was consistent when mapping LRS and SRS to either GRCh38 or T2T-CHM13, suggesting it was not caused by an obvious read-reference mapping artifact (Figure 4C). LRS AB showed overall better consistency than SRS, potentially due to better mapping to both GRCh38 and T2T-CHM13. Using stringent filters (methods), we classified 35/54 of these discordant DN-SNVs (and 25/38 annotated DN-SNVs) as likely postzygotic mosaics, confirmed by SRS and LRS (Table S14). Convincingly, these likely postzygotic mosaics had lower AB compared to the concordant DN-SNVs (i.e., DN-SNVs called by both SRS and LRS) (Figure 4D). As future work, integration of a somatic long-read variant caller into our pipeline should make it possible to confidently call these variants with LRS and identify them as postzygotic mosaic, which is often difficult with SRS due to the absence of read-based phasing information for many variants.
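The allele balance (AB) reasoning above can be made concrete with a short sketch. AB is the fraction of reads supporting the alternate allele; a germline heterozygous variant is expected near 0.5, whereas a postzygotic mosaic variant is expected lower. The one-sided binomial tail below is an illustrative way to quantify a deficit of alternate reads. It is an assumption for this example and not the stringent filter set described in the methods.

```python
from math import comb


def allele_balance(alt_reads, ref_reads):
    """Fraction of reads at a site that support the alternate allele."""
    total = alt_reads + ref_reads
    return alt_reads / total if total else 0.0


def binom_lower_tail(alt_reads, depth, p=0.5):
    """P(X <= alt_reads) for X ~ Binomial(depth, p); small values suggest AB below p."""
    return sum(
        comb(depth, k) * p**k * (1 - p) ** (depth - k)
        for k in range(alt_reads + 1)
    )
```

For instance, 8 alternate reads out of 40 gives an AB of 0.2, and the corresponding lower-tail probability under a 0.5 expectation is very small, which is the kind of signal that would prompt a closer look for mosaicism.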

After this analysis, there were only 19 whole-genome and 13 annotated likely true-positive, non-postzygotic mosaic SRS-exclusive DN-SNVs (5.7% and 5.4%, respectively, of the total SRS-exclusive DN-SNV callset). Several of these SRS-only true-positive DN-SNVs were observed on LRS but were either multi-allelic or had very low base quality, in part due to underlying homopolymers and dinucleotide repeats. However, some missed variants on LRS lack a clear explanation (Figure S6).

Conversely, all remaining LRS-only DN-SNVs were categorized as likely true positives (74 genome-wide and 34 annotated), since none showed evidence of potential postzygotic mosaicism (AB evenly distributed around 0.45), as DeepVariant uses phasing information and is not trained to detect somatic variants; very few LRS-only DN-SNVs showed any SRS read evidence. LRS-only likely true-positive DN-SNVs were stratified by mappability and complexity using GIAB GRCh38 stratifications. 23/34 (68%) annotated LRS-only likely true-positive DN-SNVs (and 51/74 [69%] genome-wide) were in low short-read mappable regions (Figures 4E and S7). Additionally, 6/34 LRS-only true-positive DN-SNVs were in low-complexity regions, compared to none of the 19 SRS-only true-positive DN-SNVs (Figure 4E). Notably, we observed the presence of LRS-only DN-SNV clusters on specific chromosomes in a few samples (Figure S8). Some of these clusters overlapped segmental duplications but were retained in the callset, as they were individually high-quality DN-SNVs with a lack of evidence on well-covered parental reads and unique within cohort. These could be representative of gene-conversion events or the mapping to copy-balanced alternative duplicon copies, but further analysis of surrounding structural variation is needed to determine the likelihood of these scenarios.

Reviewing the de novo variants in the 20 families analyzed above, we were able to make new diagnoses for three. In each instance, the variants were detected by both SRS and LRS, and there had been insufficient evidence at the time of SRS analysis to make the diagnosis, evidencing the benefits of regular data reanalysis. In two families (RGP_607 and RGP_696) with a neurodevelopmental phenotype, a highly recurrent de novo single base insertion (GenBank: NR_003137.3, n.64_65insT) in non-coding RNU4-2 (MIM: 620823) was detected. Our findings in these individuals contributed to the recent report of RNU4-2 as a disease-associated gene and frequent cause of syndromic neurodevelopmental delay (ReNU syndrome [MIM: 620851]).65 LRS enabled phasing of the de novo variant to the maternal allele in both probands (only possible by SRS in one). In family RGP_123 with a neurodevelopmental phenotype, a de novo 5′ UTR splice variant (GenBank: NM_001033044.4, c.−13-2A>G) in GLUL (MIM: 138290) was prioritized in the LRS analysis. The variant was noted but not considered high priority at the time of SRS analysis due to lack of a compound heterozygous variant and limited phenotype overlap, given that the only Mendelian condition reported at the time was recessive (glutamine deficiency, congenital [MIM: 610015]). However, a previously uncharacterized de novo mechanism of disease and new phenotype association of developmental and epileptic encephalopathy 116 (MIM: 620806), in keeping with our proband, has recently been reported for GLUL.66 As the report included functional validation of this proband’s GLUL variant in a second unrelated proband, we now consider the variant causal. LRS exclusively enabled the phasing of the de novo variant to the paternal allele.

Accurate characterization of SVs and tandem-repeat expansions with LRS

Comparing raw LRS SV calls (called by Hapdiff; similar results were found using Sniffles [Tables S15 and S16]) to raw SRS SV calls (called by GATK-SV, which includes Wham and depth-based algorithms47) revealed that LRS detected a median of 22,561 SVs (≥50 bp) per individual, compared to 16,496 SVs by SRS (Table S17). Since the LRS variant caller reports duplications as long insertions, we combined LRS and SRS duplications as insertions for comparison. Per proband, LRS detected a median of 13,738 insertions and 41 inversions, approximately 2- and 4-times the medians identified by SRS (insertions 7,131, inversions 10) (Figure 5A). Deletion numbers were comparable; LRS detected a median of 8,944 deletions, while SRS detected 9,337.

Figure 5.

Characterization of structural variants and tandem-repeat expansions with LRS

(A) Counts of LRS and SRS SVs (deletions, insertions, and inversions) per proband.

(B) Comparison of LRS and SRS SVs using a fuzzy-matching approach implemented by sveval (for supporting data see Table S18).

(C) Number of rare structural variants (allele frequency of 0.01 or less) in each individual, stratified by annotation profile. The violin plots represent the distribution across probands, and the dots highlight the median values. The “high” and “modifier” impact prediction came from SnpEff (HIGH/MODIFIER impact classes). Regulatory regions are candidate cis-regulatory elements from ENCODE with enhancer-like signature.

(D) Expansion scores at annotated simple repeat sites across all probands. The vertical dotted line highlights repeats that are significantly expanded (adjusted p value <0.01 and fold change >2) compared to the controls. Repeats located less than 10 kbp from coding exons are highlighted in green (for known disease-associated genes) and orange (for other protein-coding genes).

Stratifying by length, an assembly-based and a reference-based LRS variant caller (Hapdiff and Sniffles, respectively) showed consistent insertion counts across the length spectrum, detecting more insertions than SRS throughout (Figure S9). For deletions, LRS and SRS counts were comparable up to 100 kb. Beyond this, more deletions were observed in SRS data, possibly influenced by false positives or alignment-based LRS methods not being well calibrated for detecting split-read mappings.

Intersecting the SV callsets, LRS called a median of 13,714 SVs per affected individual not called by SRS (Figure 5B; Tables S18 and S19). Inversely, SRS called a median of 8,003 SVs not called by LRS. Manual investigation of a set of ten randomly chosen SRS-only SVs confirmed at least nine to be false positives, reinforcing earlier findings of high error rates (Table S20).11,12,13,14,15 For example, for SRS-only deletions, we observed many instances without paired-end reads spanning the deletion and showing uniform coverage throughout with LRS (Figure S10). We also found instances involving multiple overlapping fragmentary SRS SVs in regions where LRS had called a single sequence-resolved SV (Figure S11). The high false-positive rate of the SRS SVs was also evident from a very high heterozygous-to-homozygous ratio (median 4.84, maximum 6.88) compared to median 1.67 for LRS SVs (for comparison, median het/hom for small variants; LRS 1.65 and SRS 1.63; Tables S21 and S22). Conversely, examining a random set of ten LRS-only SVs, none were false calls (Table S23 and Figure S12).
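The heterozygous-to-homozygous ratio used above as a quality indicator can be computed from genotype strings as in this sketch. It assumes diploid VCF-style genotypes such as `0/1` or `1|1`; missing and homozygous-reference genotypes are skipped. The function name is illustrative.

```python
import re


def het_hom_ratio(genotypes):
    """Ratio of heterozygous to homozygous-alternate calls among diploid genotypes."""
    het = hom_alt = 0
    for gt in genotypes:
        alleles = re.split(r"[/|]", gt)
        if len(alleles) != 2 or "." in alleles:
            continue  # skip missing or non-diploid genotypes
        if alleles[0] != alleles[1]:
            het += 1
        elif alleles[0] != "0":
            hom_alt += 1
    return het / hom_alt if hom_alt else float("inf")
```

An excess of heterozygous calls inflates this ratio, which is why a value far above that seen for small variants points to false positives.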

After annotating the LRS SVs with frequencies and functional impact predictions (methods), we found, on average per affected individual, 208 rare SVs (allele frequency below 0.01), four rare SVs with a high predicted impact (typically disrupting exonic sequences), and a rare SV with a high predicted impact on a known disease-associated gene in 29% of affected individuals (Figure 5C), with one of these contributing to the proband’s phenotype and now considered diagnostic (proband DSDTRN17 described below). We also identified, on average, 35 rare SVs overlapping a regulatory element of a known disease-associated gene. SVs were also annotated with AnnotSV,44 and we found at least one rare deletion with a ranking score higher than 0.9 (score adapted from ACMG/AMP/ClinGen’s recommendations and equivalent to “likely pathogenic”) in six affected individuals (∼15%). In trios, we also identified, on average, 3.2 candidate de novo SVs by comparing the probands with their parents (methods), three of which overlapped coding sequences although not from a known disease-associated gene.

We used the phased LRS SV calls within annotated simple repeats to measure the size variation at these sites for each haplotype. We then selected repeat expansions for further investigation if they were outliers in this cohort and compared them to a set of control high-quality phased de novo assemblies from the HPRC59 and the HGSVC67 (Figure 5D). On average, per proband, 84 simple repeat sites were significantly expanded compared to controls (adjusted p value <0.01 and fold change >2). Of these, an average of 29.5 sites were located less than 10 kbp from a protein-coding gene, 9.2 of which were near a known disease-associated gene. Of note, 28 expanded repeat sites in 26 probands overlapped directly with coding sequences. A clinical review of these variants did not allow us to confidently implicate any in a participant’s phenotype.

Base-level characterization of sex chromosomal translocations

Translocations between sex chromosomes are challenging to detect because of their representation in the GRCh38 reference genome and sequence similarity that may confuse read mapping and SV detection tools. Two probands in our cohort were diagnosed with 46,XX testicular disorder of sex development (MIM: 400045), caused by a translocation between the chrX and chrY p-arms. Karyotype testing revealed an XX karyotype, and fluorescence in situ hybridization confirmed the presence of SRY; however, neither method could detect the size or the accurate breakpoints of the translocation. Although some evidence could be found manually, these variants could not be detected by SV detection tools using SRS data. Bionano optical mapping had detected some pieces of the translocations but still did not provide accurate breakpoint resolution.68 In our LRS dataset, we were able to identify the breakpoints of each translocation supported by multiple split-mapped reads when mapped to the T2T-CHM13 reference genome (Figures S13A and S13B). In one proband (DSDTRN10), LRS detected additional deletions on chrX, which were not identifiable through clinical testing. The breakpoints were consistent with the copy-number changes inferred from read coverage. Of note, one translocation was detected by Sniffles2 from reads mapped with NGMLR, a more sensitive and SV-oriented long-read mapper. Therefore, compared to other technologies, long reads from ONT enable us to both detect the presence of the translocations and precisely predict their breakpoints’ locations.

Detecting gene fusion and conversion of CYP21A2

Three individuals in our cohort were diagnosed with CAH (adrenal hyperplasia, congenital, due to 21-hydroxylase deficiency, MIM: 201910), while one individual was undiagnosed despite a high clinical suspicion of CAH. Variants in CYP21A2 are responsible for CAH. CYP21A2 is located in a tandemly duplicated segmental duplication that is about 30 kbp long. The most common pathogenic variants arise from a gene-conversion event with CYP21A1P (pseudogene) located in the upstream module or from deletions that create CYP21A1P-CYP21A2 fusions.69 Clinical testing for CAH includes long-range PCR, Sanger sequencing, and multiplex ligation-dependent probe amplification; however, these methods are labor intensive and provide low variant resolution, with no phasing possible in the absence of parental samples. Although the Napu pipeline detected some pathogenic variants, we developed a specialized tool to fully characterize this region at the haplotype level (see methods; unpublished data). Using this new approach, we identified compound heterozygous pathogenic variants for all four probands with CAH, including one proband (DSDTRN09) who was previously undiagnosed (Figure S13C). Three of the diagnoses involved a pathogenic SNV in trans with a CYP21A1P-CYP21A2 fusion. Of note, two probands were particularly challenging to analyze because they carried a haplotype with three modules (two with the pseudogene and one with a gene-converted gene). We also confirmed that the unaffected mother (DSDTRN19) of one CAH proband carried only one CYP21A2 pathogenic fusion allele (Figure S14). Overall, our approach identified and phased the CYP21A2 pathogenic alleles in all probands and showed Mendelian consistency with their parents.

Phasing with long reads reveals compound heterozygous variants

Napu generates harmonized and phased structural and small variant calls, providing a comprehensive representation of sample variants. Across all samples, the median phase block NG50 was 2.16 Mb (Figure S15). This phased view helps characterize complex regions containing multiple variants on both haplotypes to identify compound heterozygous variants in genes associated with autosomal recessive disorders. Among protein-coding genes, a median of 17,365 (87%) out of 20,048 per individual were entirely within a single phase block (Figures 6A and S16).
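The two phasing metrics in this paragraph can be sketched as follows. Phase block NG50 is taken here to be the block length such that blocks of that length or longer sum to at least half of the genome size, and a gene counts as completely phased when its full span lies inside one phase block on the same chromosome. The function names and the (chromosome, start, end) tuple layout are assumptions for illustration.

```python
def phase_block_ng50(block_lengths, genome_size):
    """Block length at which the longest blocks together reach half the genome size."""
    running = 0
    for length in sorted(block_lengths, reverse=True):
        running += length
        if running * 2 >= genome_size:
            return length
    return 0  # blocks cover less than half of the genome


def fully_phased(gene, blocks):
    """True if the gene interval lies entirely within a single phase block."""
    chrom, start, end = gene
    return any(
        b_chrom == chrom and b_start <= start and end <= b_end
        for b_chrom, b_start, b_end in blocks
    )
```

Counting the genes for which `fully_phased` is true and dividing by the total gives the reported proportion; with the medians above, 17,365 of 20,048 is about 87%.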

Figure 6.

Phasing with long reads reveals compound heterozygous variants in protein-coding genes

(A) (Left) Plot showing cumulative counts of protein-coding genes overlapping a single phase block with varying overlap fractions. The x axis shows the fraction of each gene’s length. The y axis represents the number of genes that are at least x fraction phased by a single phase block. Each line corresponds to a sample and is colored by its phase block NG50. (Right) Plot showing the number of genes phased by a single phase block across different phasing percentage categories (0%, 0%–25%, 25%–50%, 50%–75%, 75%–100%, and 100%) on the x axis, with the y axis showing the count of genes per individual within each phasing category.

(B) In proband DSDTRN17, LRS resolved pathogenic compound heterozygous variants in LHCGR, encoding the luteinizing hormone/choriogonadotropin receptor, causing Leydig cell hypoplasia.

Using our clinical annotation and prioritization pipeline on these long-range phased variants, we identified clinically relevant compound heterozygous variants in four probands (one diagnostic and three candidates requiring additional evidence). In proband DSDTRN17, who was undiagnosed, we found compound heterozygous pathogenic variants in the luteinizing hormone/choriogonadotropin receptor (LHCGR, MIM: 152790) associated with recessive Leydig cell hypoplasia (MIM: 238320) (Figure 6B). The first variant was a ClinVar pathogenic/likely pathogenic missense SNV located on exon 11 (NM_000233.4, c.1847C>A [p.Ser616Tyr]), also identified by SRS. LRS additionally detected a 6,694-bp heterozygous deletion, which deleted exon 9 of LHCGR. A manual investigation later revealed that the exon deletion was seen on the SRS data but had not been called. Nevertheless, to confirm compound heterozygosity, these variants needed to be phased in the absence of parental DNA, requiring LRS to reach a definitive diagnosis. The exon 9 deletion truncates the extracellular domain of LHCGR, while the exon 11 missense variant is situated in a transmembrane region with other known pathogenic variants.

Haplotype-specific methylation profiling with LRS can prioritize rare intronic SVs and variants of uncertain significance

Single-molecule LRS captures both DNA sequences and modifications simultaneously.29 Napu runs modkit (https://github.com/nanoporetech/modkit), which generates haplotype-specific and combined CpG methylation calls. We used these calls to identify methylation outlier regulatory regions in probands, potentially linked to rare but previously unprioritized variants in a haplotype-specific context. We analyzed regional methylation (methods) in CpG Islands (CGIs) and ENCODE cis-regulatory elements across individual haplotypes. To account for variability in large CGIs, CpGs were segmented by methylation patterns, with segment consistency maintained across samples (methods).

Differentially methylated CGIs (DM-CGIs) were identified per proband haplotype using all unaffected, unrelated parents (control set 1) and other probands (control set 2) as controls (methods). A median of 11 DM-CGIs per proband were found, including two near (±10 kbp) disease-associated genes, three near other protein-coding genes, and five without any protein-coding gene nearby (Figure S17A). Similarly, out of a median of 22.5 DM-CCREs called, ten were near (±1 Mbp) disease-associated genes, 11 were near other protein-coding genes, and one was without any protein-coding gene nearby (Figure S17B). For a subset of these differentially methylated regions (DMRs) within ≤10 kb of genes expressed in blood, we found a significant association between methylation and gene-expression changes (Figure S18).

A clinical review of the DMRs did not allow us to directly implicate any in a participant’s phenotype, although it did identify interesting correlations. For example, in a family with two affected siblings, we identified a haplotype-specific hypomethylated CGI in FTCD (MIM: 606806) (Figure S19A). Although FTCD is typically expressed at very low levels in blood, it was found to be overexpressed in both siblings, correlating with hypomethylation (Figure S19B). The hypomethylation and corresponding overexpression are likely caused by a 365-bp rare intronic LRS-only SV insertion at the same position in the DM-CGI (Figure S19C). This region is a variable number tandem repeat (VNTR) with a repeat unit size of 73 bp, exhibiting a five-copy expansion in the affected siblings. Shorter insertions of 146 bp (two additional copies) were detected at the same position in 11 unrelated samples with a normal degree of methylation and expression. The unaffected father also carried the insertion (albeit with a slightly higher methylation signal), but no RNA-sequencing data were available to assess the impact on expression in the father. While the variant is unlikely to be contributing to the phenotype in this family, this example highlights that methylation alterations can identify regions with DNA variants and altered expression that should be studied further.

Validation of causal variant with an independent evaluation of known episignatures using ONT methylation

In PMGRC-146-146-0, a proband with a complex neurodevelopmental phenotype including minimal expressive language, autistic features, and dysmorphic features, analysis from SRS had identified a de novo deep intronic SNV of uncertain significance in ARID1B (MIM: 614556, GenBank: NM_001374828.1, c.3235+700C>G) (Figure 7A). This variant, which was 700 bases from the nearest exon, had a spliceAI prediction score of 0.31 for an acceptor gain 143 bases downstream and a spliceAI score of 0.19 for a donor gain 5 bases downstream of the variant. Since the phenotype was not highly specific and the splice predictions were relatively weak, this variant was considered a variant of uncertain clinical significance (VUS) based on ACMG/AMP variant classification guidelines. LRS analysis also confirmed the de novo variant (Figure S20). Interrogation of whole blood RNA-sequencing data on the proband and the parents demonstrated that the variant leads to inclusion of a 138-bp cryptic exon consistent with the spliceAI predicted donor and acceptor gains (Figure 7B). This new exon includes a premature stop codon and is expected to lead to a loss-of-function effect. Moreover, using methylation data from ONT whole-genome data from the proband and all other unrelated samples in our cohort, we interrogated the CpG methylation status of 106 sites known to be differentially methylated in CSS1 (MIM: 135900) (Figures 7C and S21). This proband’s sample showed a methylation pattern more similar to the CSS1 episignature than all the other samples (p < 0.001, permutation test). Using these data for phenotypic specificity and demonstrated splice effect, the variant was reclassified as pathogenic using the ACMG/AMP criteria.

Figure 7.

CSS1 diagnosis with concurrent detection of known episignature

(A) De novo deep intronic VUS in ARID1B on short-read whole-genome sequencing data.

(B) Sashimi plot shows that the variant leads to inclusion of a cryptic exon in the proband.

(C) Using ONT whole-genome CpG methylation status of 106 sites known to be differentially methylated in CSS1, PMGRC-146-146-0 shows a CpG methylation pattern more similar to the CSS1 episignature than all the other samples in the cohort (p < 0.001, permutation test).
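
To illustrate how an episignature comparison of this kind can be assessed with a permutation test, the sketch below scores each sample by its mean signed deviation from the cohort average across the signature CpGs and then shuffles the expected direction of change across sites to build a null distribution. This is not necessarily the procedure used in this study: the scoring function, the permutation scheme, and all data are assumptions made for illustration, and the methylation values are synthetic.

```python
import random


def episignature_score(sample, cohort_mean, direction):
    """Mean signed deviation of one sample from the cohort average across signature CpGs."""
    total = sum(d * (m - c) for d, m, c in zip(direction, sample, cohort_mean))
    return total / len(direction)


def permutation_pvalue(sample, cohort_mean, direction, n_perm=999, seed=0):
    """One-sided p value: how often a shuffled signature scores at least as high."""
    rng = random.Random(seed)
    observed = episignature_score(sample, cohort_mean, direction)
    shuffled = list(direction)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the link between site and expected direction
        if episignature_score(sample, cohort_mean, shuffled) >= observed:
            hits += 1
    return (1 + hits) / (1 + n_perm)


# Synthetic data (assumption): 40 samples x 106 CpG sites, methylation near 0.5.
rng = random.Random(42)
N_SITES, N_SAMPLES = 106, 40
# +1 = expected hypermethylation, -1 = expected hypomethylation (hypothetical signature).
direction = [rng.choice((-1, 1)) for _ in range(N_SITES)]
cohort = [[rng.gauss(0.5, 0.05) for _ in range(N_SITES)] for _ in range(N_SAMPLES)]
# Sample 0 plays the role of the proband and is shifted toward the signature.
cohort[0] = [m + 0.2 * d for m, d in zip(cohort[0], direction)]

cohort_mean = [sum(col) / N_SAMPLES for col in zip(*cohort)]
scores = [episignature_score(s, cohort_mean, direction) for s in cohort]
p_value = permutation_pvalue(cohort[0], cohort_mean, direction)
print(f"top-scoring sample index: {scores.index(max(scores))}, p = {p_value:.3f}")
```

With 999 permutations, the smallest attainable p value is 0.001, so the number of permutations bounds how small a reported p value can be.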

Summary of diagnostic findings

Out of 41 families (1 quad, 26 trios, 1 duo, 13 proband-only, including 42 affected individuals), diagnostic variants were found for 11 (Table S24). As reported in detail in earlier sections, these include four probands with de novo variants (one validated by a matching episignature), one proband with compound heterozygous variants, two probands with sex-chromosomal translocations, and four probands with complex compound heterozygous variants (including gene fusions) in a complex segmental duplication containing CYP21A2. Candidates that require additional evidence were identified for a further four probands (one de novo variant, three compound heterozygous variants) (Table S25). The remaining 26 families (27 affected individuals) remained unsolved with ongoing strong suspicion of a monogenic cause of disease.

Our cohort consisted of three distinct subsets of affected individuals. The first subset consisted of six probands with disorders of sex development or CAH. Five of these probands had known diagnostic SVs and were included in our study to demonstrate the ability of LRS to act as a single diagnostic test, providing improved breakpoint resolution, phasing in the absence of parental data, and the ability to distinguish variants in highly homologous genes within segmental duplications. In all five, LRS was able to characterize the causal variant precisely (two with sex-chromosomal translocations and three with complex compound heterozygous variants in CYP21A2-CYP21A1P). One CAH proband was genetically undiagnosed, and our LRS pangenome method was able to detect and phase causal biallelic variants in CYP21A2, providing a definitive diagnosis. The second subset consisted of 26 genetically undiagnosed families (27 affected individuals) with a high clinical suspicion of monogenic disease and inconclusive SRS analysis and clinical genetic testing. These included mainly, but not exclusively, individuals with neurodevelopmental phenotypes, who were specifically selected for inclusion to enrich for likely de novo causal variants70 (Tables S26 and S27). The third subset included nine probands with varying phenotypes, all with inconclusive clinical genetic testing (Table S28). For subsets 2 and 3, genome-wide analysis of rare (inherited and de novo) small variants, SVs, tandem-repeat expansions, and methylation outliers identified causal de novo variants in a total of four probands, three of which would have also been detectable by reanalysis of the SRS data (RNU4-2 in two probands, GLUL in one proband) and one of which had already been detected by SRS yet was only validated by confirming a known epigenetic signature for ARID1B that was accessible only by LRS. Biallelic variants in LHCGR were detected in one proband by LRS only. In addition, candidates were identified in a further four probands (BLOC1S1 [MIM: 601444], SLC6A3 [MIM: 126455], SRSF2 [MIM: 600813], and LHCGR), in two of which the variant of interest was detectable by LRS only (55-bp intronic deletion in SLC6A3 and 300-bp upstream insertion in LHCGR).
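
Phasing in the absence of parental data matters for biallelic findings because two heterozygous variants in the same gene are only compound heterozygous if they sit on different haplotypes. The sketch below shows one way to check this from read-based phasing, using the phased genotype (`GT` with a `|` separator) and phase-set (`PS`) fields defined in the VCF specification. The helper names and the example records are hypothetical and are not taken from the Napu pipeline.

```python
def alt_haplotype(gt: str):
    """Return 0 or 1 for the haplotype carrying the alt allele of a phased het call, else None."""
    if "|" not in gt:
        return None  # unphased genotype such as "0/1"
    first, second = gt.split("|")
    if {first, second} != {"0", "1"}:
        return None  # not a simple heterozygous call
    return 0 if first == "1" else 1


def in_trans(var1: dict, var2: dict):
    """True if the two variants are in trans, False if in cis, None if phase is not comparable."""
    if var1["ps"] != var2["ps"]:
        return None  # different phase sets: relative phase is unknown
    hap1, hap2 = alt_haplotype(var1["gt"]), alt_haplotype(var2["gt"])
    if hap1 is None or hap2 is None:
        return None
    return hap1 != hap2


# Hypothetical records: two heterozygous variants in one gene, same phase set.
variant_a = {"gt": "0|1", "ps": 1000}
variant_b = {"gt": "1|0", "ps": 1000}
print("in trans:", in_trans(variant_a, variant_b))
```

Returning `None` when the phase sets differ keeps "unknown" distinct from "in cis", which is the case where parental data or longer reads would still be needed.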

Discussion

In this study, we sequenced an undiagnosed rare-disease cohort using a time- and cost-efficient, one-flow-cell nanopore sequencing protocol, yielding on average ∼110 Gb of ONT reads (corresponding to ∼36× coverage). The Napu pipeline was used to process the sequencing data to generate assemblies, variants, phasing, and methylation calls in a single run. Both our wet-lab and computational protocols are openly available (see methods), with the latter being reproducible on both cloud and local infrastructure quickly and inexpensively.
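
As a rough check on the correspondence between sequencing yield and coverage, mean coverage can be approximated as total bases divided by genome size. The sketch below assumes a haploid human genome size of about 3.1 Gb, a value that is not stated in the text, and ignores unmapped or filtered reads.

```python
# Assumed haploid human genome size in gigabases (not stated in the text).
GENOME_SIZE_GB = 3.1


def mean_coverage(yield_gb: float, genome_size_gb: float = GENOME_SIZE_GB) -> float:
    """Approximate mean depth as sequenced bases divided by genome size."""
    return yield_gb / genome_size_gb


coverage = mean_coverage(110)  # average per-sample yield quoted in the text
print(f"~{coverage:.1f}x")
```

Under this assumption, 110 Gb corresponds to roughly 35× to 36×, in line with the reported average.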

We evaluated the additional yield of LRS by systematically comparing it with SRS. Our results reinforce earlier findings that LRS can address more of the genome34,58 and broadly demonstrate that this translates into additional clinically relevant information. LRS accurately mapped many genomic regions without coverage in SRS, providing resolution in coding exons of a median of 280 protein-coding genes per individual missed by SRS. Some of these genes overlapped highly identical segmental duplications known to be implicated in Mendelian disorders but which have been refractory to SRS detection.6,64,71,72 Indeed, most LRS-exclusive SN-DNVs overlapped low-complexity regions and regions with low short-read mappability. Also supporting the idea that LRS provides additional clinical value, despite higher base-level error rates, we find that LRS identifies many additional rare functionally annotated small variants relative to SRS. Interestingly, the additional number of such variants relative to SRS increased substantially when reducing the threshold for genotype quality filtering, and, simultaneously, the number of SRS-exclusive functionally annotated variants decreased. While reducing the quality filter will introduce false positives, it is likely that further technology developments to improve the base-level accuracy of ONT long reads will result in still larger gains in the number of confidently ascertained functionally annotated variants relative to short reads. Finally, and also reinforcing many earlier studies, we find high numbers of additional SVs relative to SRS, including rare tandem-repeat expansions in protein-coding genes.73,74 The quality of these SV calls is superior to what has been possible with SRS. Overall, our comparisons reinforce the strengths of LRS in accurate mapping to new and previously inaccessible regions, with the biggest gains evident in accurate, base-level SV characterization.
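
The effect of the genotype quality (GQ) threshold on technology-exclusive counts can be illustrated with a toy comparison. In the sketch below, the variant keys, GQ values, and function name are invented for illustration, and the threshold is assumed to apply to the LRS calls only. Lowering it lets more LRS calls through, which can both add LRS-exclusive variants and convert SRS-exclusive variants into shared ones.

```python
def exclusive_counts(lrs_calls: dict, srs_calls: set, min_gq: int):
    """Count LRS-only and SRS-only variants after filtering LRS calls at min_gq."""
    lrs_pass = {variant for variant, gq in lrs_calls.items() if gq >= min_gq}
    return len(lrs_pass - srs_calls), len(srs_calls - lrs_pass)


# Toy data (assumption): LRS variant -> genotype quality, and a set of SRS variants.
lrs = {
    "chr1:100:A>G": 35,
    "chr1:200:C>T": 12,  # also called by SRS, but low GQ in LRS
    "chr2:50:G>A": 25,
    "chr3:75:T>C": 12,   # LRS-only, low GQ
}
srs = {"chr1:100:A>G", "chr1:200:C>T", "chr4:10:G>T"}

for threshold in (20, 10):
    lrs_only, srs_only = exclusive_counts(lrs, srs, threshold)
    print(f"GQ >= {threshold}: LRS-exclusive = {lrs_only}, SRS-exclusive = {srs_only}")
```

In this toy set, relaxing the threshold from 20 to 10 raises the LRS-exclusive count from 1 to 2 and lowers the SRS-exclusive count from 2 to 1. Some of the newly admitted calls would be false positives in real data, as noted above.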

Our clinical diagnostic results show the value of LRS both in the discovery of LRS-exclusive candidates and in providing additional evidence for variants detected by SRS, such as phasing and methylation. Our study also highlights the considerable value of regular data reanalysis, consistent with prior reports,75,76,77 as previously uncharacterized pathogenic variants, disease-associated genes, and additional disease mechanisms or inheritance modes are continuously being discovered (as seen for GLUL in one of our probands). We suspect that many of the remaining undiagnosed probands in our cohort also have a disease caused by highly penetrant monogenic variation that we have not yet been able to confidently interpret and that can hopefully be addressed with future reanalyses.

Related to the ongoing value of reanalysis, we found many variants with LRS that we could not usefully interpret. To more fully utilize the additional information provided by LRS, more comprehensive population and clinical databases based on long-read data are needed. Currently, large reference population databases are short-read based and lack population-based allele frequencies for LRS-derived variants.47,53 Similarly, clinical databases such as ClinVar50 contain variants that have almost exclusively been detected by short reads (plus other clinical genetic tests) and do not yet contain the classifications of many “long-read exclusive” variants to help with their identification and prioritization. For this reason, most of the LRS-exclusive rare SVs, tandem-repeat expansions, and small FAVs are difficult to interpret. This somewhat frustrating situation will be alleviated by the creation of LRS-derived databases of variants. There is a need for diverse, population-scale sequencing of healthy cohorts by LRS to create repositories of variation and their frequencies. Projects such as the HPRC,59 the HGSVC,78 and larger-scale biobank projects like those from the All of Us project are starting to generate these much-needed sequencing data. In tandem, and likely over a much longer timescale, the expansion of LRS in clinical genome sequencing should progressively lead to the discovery and contribution of LRS-exclusive disease-causing variants to clinical databases.

LRS facilitated the prioritization of rare, non-coding SVs by providing complementary methylation information and enabled the detection of methylation outliers. Some were correlated with gene expression, although none were definitively diagnostic in this cohort. Methylation information did, however, provide orthogonal validation to a VUS by using a known episignature in one of our probands, where the pathogenicity of a de novo deep intronic variant in ARID1B was confirmed by the presence of a reported episignature in a single assay.79,80,81,82 Interpreting rare methylation outliers is challenging, due to the complex interplay of environmental and genetic factors and the lack of good reference datasets. However, more LRS-based clinical studies analyzing methylation in conjunction with genetic variation in rare diseases will help establish the increased methylation-based diagnostic yield.79

Our de novo comparison revealed a potential application of LRS for distinguishing postzygotic mosaic variants from prezygotic de novos, which may in the future contribute to further understanding of rare genetic diseases83,84 and the penetrance (or lack thereof) of such variants. Related to this, our future work will focus on developing accurate characterization methods for comparing de novo indels between sequencing technologies to avoid discrepancies and prevent missed or falsely identified variants.

Overall, LRS enabled more comprehensive genome analysis. Decreasing costs of LRS, leveraging cost-efficient sequencing and computational protocols, coupled with clinical analysis pipelines for analyzing all alteration types together, should lead to accurate diagnoses for individuals with suspected Mendelian conditions who today remain unsolved after a comprehensive evaluation. More precisely estimating the benefits of whole-genome LRS relative to SRS will require larger cohorts and nuance in appreciating the phenotypic makeup of the cohort, which will likely change over time with continued data sharing to progressively overcome the interpretation gaps that these new data reveal.

Data and code availability

Genomic and phenotypic data from the Rare Genomes Project are available via dbGaP accession number phs003047 (GREGoR). Access is managed by a data access committee designated by dbGaP. Additional information on accessing the data is available on the GREGoR website at https://gregorconsortium.org/data. The Pediatric Mendelian Genomics Research Center (PMGRC) samples have been submitted to AnVIL and will be available pending processing by the AnVIL team. The DSD-TRN data will be available upon request. Scripts for performing all genomic analyses are available on GitHub (https://github.com/shlokanegi/Long_reads_rare_disease_Paper).

Acknowledgments

The Chan Zuckerberg Initiative (CZI) provided funding for sample collection, sequencing, and analyses. We thank Dr. Jonah Cool, Dr. Sara Simmonds, and Mr. Bruce Martin for their valuable feedback throughout this project. We thank Dr. Paolo Carnevali for his assistance and support on the use of the Shasta genome assembler. B.P. and S.N. were also supported by National Institutes of Health (NIH) grants R01HG010485, U41HG010972, U24HG011853, and OT2OD033761. M.K. was supported in part by the Intramural Research Program of NIH. A.O.-L. and RGP SRS analyses were supported in part by NIH grants UM1HG008900, U01HG011755, and R01HG009141 and grants 2019-199278, 2020-224274, and 2022-316726 from the CZI Donor-Advised Fund at the Silicon Valley Community Foundation (https://doi.org/10.37921/236582yuakxy, https://doi.org/10.13039/100014989) and in part by research funding from Illumina. S.L.S. was supported by a fellowship from the Manton Center for Orphan Disease Research at Boston Children’s Hospital and G.L. by a fellowship from Fonds de recherche en santé du Québec. Samples were collected from the DSD-TRN Biobank (supported by R01HD068138 and R01HD093450 to E.V. and E.D.) and the GREGoR-UCI/CNH site (supported by U01 HG011745 to E.V., E.D., and S.B.). For the DSD-TRN samples, whole-genome sequencing was funded in part by the Gabriella Miller Kids First Initiative grant X01HL132384. We gratefully acknowledge the help of DSD-TRN investigators Meilan Rutter, Phyllis Speiser, Natalie Nokoff, Courtney Finlayson, and Jodie Johnson, who contributed the diagnosed samples used here, and Miguel Almalvez from UCI for management of DSD-TRN and UCI-GREGoR biobanks and sample extraction.

Author contributions

B.P., J.M., K.H.M., A.O.-L., and E.D. helped conceive and direct the study. S.N. and J.M. performed data analysis. S.L.S., E.C., and C.A.-T. contributed to RGP cohort analysis. S.I.B. and P.C. contributed to the PMGRC cohort clinical analysis. B.M., J.G., T.H., and S.M.O. contributed to ONT data sequencing. I.V. contributed to base calling. M.C.O., G.V., J.S., B.M., and G.L. were involved in recruitment and selection of RGP clinical cases for sequencing. S.N., B.P., J.M., and S.L.S. drafted the manuscript.

Declaration of interests

A.O.-L. was a paid consultant for Tome Biosciences, Ono Pharma USA, and Addition Therapeutics. The Rare Genomes Project received support in the form of reagents from Illumina Inc. and Pacific Biosciences.

Published: January 24, 2025

Footnotes

Supplemental information can be found online at https://doi.org/10.1016/j.ajhg.2025.01.002.

Contributor Information

Jean Monlong, Email: jean.monlong@inserm.fr.

Benedict Paten, Email: bpaten@ucsc.edu.

Supplemental information

Document S1. Figures S1–S21
mmc1.pdf (47.4MB, pdf)
Table S1. Sequencing data statistics
mmc2.xlsx (39.4KB, xlsx)
Data S1. Tables S2–S6
mmc3.xlsx (160.7KB, xlsx)
Data S2. Tables S7 and S8
mmc4.xlsx (16.8KB, xlsx)
Data S3. Tables S9–S14
mmc5.xlsx (204KB, xlsx)
Data S4. Tables S15–S23
mmc6.xlsx (588.9KB, xlsx)
Data S5. Tables S24 and S25
mmc7.xlsx (104.8KB, xlsx)
Data S6. Tables S26–S28
mmc8.xlsx (19.6KB, xlsx)
Document S2. Article plus supplemental information
mmc9.pdf (53.8MB, pdf)

References

  • 1. Graessner H., Zurek B., Hoischen A., Beltran S. Solving the unsolved rare diseases in Europe. Eur. J. Hum. Genet. 2021;29:1319–1320. doi: 10.1038/s41431-021-00924-8.
  • 2. Kingsmore S.F., Cakici J.A., Clark M.M., Gaughran M., Feddock M., Batalov S., Bainbridge M.N., Carroll J., Caylor S.A., Clarke C., et al. A Randomized, Controlled Trial of the Analytic and Diagnostic Performance of Singleton and Trio, Rapid Genome and Exome Sequencing in Ill Infants. Am. J. Hum. Genet. 2019;105:719–733. doi: 10.1016/j.ajhg.2019.08.009.
  • 3. Costain G., Walker S., Marano M., Veenma D., Snell M., Curtis M., Luca S., Buera J., Arje D., Reuter M.S., et al. Genome Sequencing as a Diagnostic Test in Children With Unexplained Medical Complexity. JAMA Netw. Open. 2020;3. doi: 10.1001/jamanetworkopen.2020.18109.
  • 4. Wojcik M.H., Reuter C.M., Marwaha S., Mahmoud M., Duyzend M.H., Barseghyan H., Yuan B., Boone P.M., Groopman E.E., Délot E.C., et al. Beyond the exome: What’s next in diagnostic testing for Mendelian conditions. Am. J. Hum. Genet. 2023;110:1229–1248. doi: 10.1016/j.ajhg.2023.06.009.
  • 5. Nurk S., Koren S., Rhie A., Rautiainen M., Bzikadze A.V., Mikheenko A., Vollger M.R., Altemose N., Uralsky L., Gershman A., et al. The complete sequence of a human genome. Science. 2022;376:44–53. doi: 10.1126/science.abj6987.
  • 6. Wagner J., Olson N.D., Harris L., McDaniel J., Cheng H., Fungtammasan A., Hwang Y.-C., Gupta R., Wenger A.M., Rowell W.J., et al. Curated variation benchmarks for challenging medically relevant autosomal genes. Nat. Biotechnol. 2022;40:672–680. doi: 10.1038/s41587-021-01158-1.
  • 7. Groza C., Schwendinger-Schreck C., Cheung W.A., Farrow E.G., Thiffault I., Lake J., Rizzo W.B., Evrony G., Curran T., Bourque G., Pastinen T. Pangenome graphs improve the analysis of structural variants in rare genetic diseases. Nat. Commun. 2024;15:657. doi: 10.1038/s41467-024-44980-2.
  • 8. Merker J.D., Wenger A.M., Sneddon T., Grove M., Zappala Z., Fresard L., Waggott D., Utiramerur S., Hou Y., Smith K.S., et al. Long-read genome sequencing identifies causal structural variation in a Mendelian disease. Genet. Med. 2018;20:159–163. doi: 10.1038/gim.2017.86.
  • 9. Treangen T.J., Salzberg S.L. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat. Rev. Genet. 2011;13:36–46. doi: 10.1038/nrg3117.
  • 10. Olson N.D., Wagner J., Dwarshuis N., Miga K.H., Sedlazeck F.J., Salit M., Zook J.M. Variant calling and benchmarking in an era of complete human genome sequences. Nat. Rev. Genet. 2023;24:464–483. doi: 10.1038/s41576-023-00590-0.
  • 11. Alkan C., Sajjadian S., Eichler E.E. Limitations of next-generation genome sequence assembly. Nat. Methods. 2011;8:61–65. doi: 10.1038/nmeth.1527.
  • 12. Mills R.E., Walter K., Stewart C., Handsaker R.E., Chen K., Alkan C., Abyzov A., Yoon S.C., Ye K., Cheetham R.K., et al. Mapping copy number variation by population-scale genome sequencing. Nature. 2011;470:59–65. doi: 10.1038/nature09708.
  • 13. Zook J.M., Hansen N.F., Olson N.D., Chapman L., Mullikin J.C., Xiao C., Sherry S., Koren S., Phillippy A.M., Boutros P.C., et al. A robust benchmark for detection of germline large deletions and insertions. Nat. Biotechnol. 2020;38:1347–1355. doi: 10.1038/s41587-020-0538-8.
  • 14. Sudmant P.H., Rausch T., Gardner E.J., Handsaker R.E., Abyzov A., Huddleston J., Zhang Y., Ye K., Jun G., Fritz M.H.Y., et al. An integrated map of structural variation in 2,504 human genomes. Nature. 2015;526:75–81. doi: 10.1038/nature15394.
  • 15. Hodgkinson A., Chen Y., Eyre-Walker A. The large-scale distribution of somatic mutations in cancer genomes. Hum. Mutat. 2012;33:136–143. doi: 10.1002/humu.21616.
  • 16. Vollger M.R., Guitart X., Dishuck P.C., Mercuri L., Harvey W.T., Gershman A., Diekhans M., Sulovari A., Munson K.M., Lewis A.P., et al. Segmental duplications and their variation in a complete human genome. Science. 2022;376. doi: 10.1126/science.abj6965.
  • 17. Cheng H., Concepcion G.T., Feng X., Zhang H., Li H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods. 2021;18:170–175. doi: 10.1038/s41592-020-01056-5.
  • 18. Olivucci G., Iovino E., Innella G., Turchetti D., Pippucci T., Magini P. Long read sequencing on its way to the routine diagnostics of genetic diseases. Front. Genet. 2024;15. doi: 10.3389/fgene.2024.1374860.
  • 19. Wang Y., Zhao Y., Bollas A., Wang Y., Au K.F. Nanopore sequencing technology, bioinformatics and applications. Nat. Biotechnol. 2021;39:1348–1365. doi: 10.1038/s41587-021-01108-x.
  • 20. Conlin L.K., Aref-Eshghi E., McEldrew D.A., Luo M., Rajagopalan R. Long-read sequencing for molecular diagnostics in constitutional genetic disorders. Hum. Mutat. 2022;43:1531–1544. doi: 10.1002/humu.24465.
  • 21. Miller D.E., Sulovari A., Wang T., Loucks H., Hoekzema K., Munson K.M., Lewis A.P., Fuerte E.P.A., Paschal C.R., Walsh T., et al. Targeted long-read sequencing identifies missing disease-causing variation. Am. J. Hum. Genet. 2021;108:1436–1449. doi: 10.1016/j.ajhg.2021.06.006.
  • 22. Cohen A.S.A., Farrow E.G., Abdelmoity A.T., Alaimo J.T., Amudhavalli S.M., Anderson J.T., Bansal L., Bartik L., Baybayan P., Belden B., et al. Genomic answers for children: Dynamic analyses of >1000 pediatric rare disease genomes. Genet. Med. 2022;24:1336–1348. doi: 10.1016/j.gim.2022.02.007.
  • 23. Lecoquierre F., Quenez O., Fourneaux S., Coutant S., Vezain M., Rolain M., Drouot N., Boland A., Olaso R., Meyer V., et al. High diagnostic potential of short and long read genome sequencing with transcriptome analysis in exome-negative developmental disorders. Hum. Genet. 2023;142:773–783. doi: 10.1007/s00439-023-02553-1.
  • 24. Ohori S., Miyauchi A., Osaka H., Lourenco C.M., Arakaki N., Sengoku T., Ogata K., Honjo R.S., Kim C.A., Mitsuhashi S., et al. Biallelic structural variations within FGF12 detected by long-read sequencing in epilepsy. Life Sci. Alliance. 2023;6. doi: 10.26508/lsa.202302025.
  • 25. Tayoun A.A., Sinha S., Rabea F., Ramaswamy S., Chekroun I., Naofal M.E., Jain R., Alfalasi R., Halabi N., Yaslam S., et al. Long read sequencing enhances pathogenic and novel variation discovery in patients with rare diseases. Preprint at Research Gate. 2024. doi: 10.21203/rs.3.rs-4235049/v1.
  • 26. Steyaert W., Sagath L., Demidov G., Yépez V.A., Esteve-Codina A., Gagneur J., Ellwanger K., Derks R., Weiss M., den Ouden A., et al. Unravelling undiagnosed rare disease cases by HiFi long-read genome sequencing. Preprint at medRxiv. 2024. doi: 10.1101/2024.05.03.24305331.
  • 27. Chen X., Harting J., Farrow E., Thiffault I., Kasperaviciute D., Genomics England Research Consortium, Hoischen A., Gilissen C., Pastinen T., Eberle M.A. Comprehensive SMN1 and SMN2 profiling for spinal muscular atrophy analysis using long-read PacBio HiFi sequencing. Am. J. Hum. Genet. 2023;110:240–250. doi: 10.1016/j.ajhg.2023.01.001.
  • 28. Hiatt S.M., Lawlor J.M.J., Handley L.H., Latner D.R., Bonnstetter Z.T., Finnila C.R., Thompson M.L., Boston L.B., Williams M., Rodriguez Nunez I., et al. Long-read genome sequencing and variant reanalysis increase diagnostic yield in neurodevelopmental disorders. Genome Res. 2024;34:1747–1762. doi: 10.1101/gr.279227.124.
  • 29. Logsdon G.A., Vollger M.R., Eichler E.E. Long-read human genome sequencing and its applications. Nat. Rev. Genet. 2020;21:597–614. doi: 10.1038/s41576-020-0236-x.
  • 30. Hon T., Mars K., Young G., Tsai Y.-C., Karalius J.W., Landolin J.M., Maurer N., Kudrna D., Hardigan M.A., Steiner C.C., et al. Highly accurate long-read HiFi sequencing data for five complex genomes. Sci. Data. 2020;7:399. doi: 10.1038/s41597-020-00743-4.
  • 31. Mastrorosa F.K., Miller D.E., Eichler E.E. Applications of long-read sequencing to Mendelian genetics. Genome Med. 2023;15:42. doi: 10.1186/s13073-023-01194-3.
  • 32. Goenka S.D., Gorzynski J.E., Shafin K., Fisk D.G., Pesout T., Jensen T.D., Monlong J., Chang P.-C., Baid G., Bernstein J.A., et al. Accelerated identification of disease-causing variants with ultra-rapid nanopore genome sequencing. Nat. Biotechnol. 2022;40:1035–1041. doi: 10.1038/s41587-022-01221-5.
  • 33. Gorzynski J.E., Goenka S.D., Shafin K., Jensen T.D., Fisk D.G., Grove M.E., Spiteri E., Pesout T., Monlong J., Baid G., et al. Ultrarapid Nanopore Genome Sequencing in a Critical Care Setting. N. Engl. J. Med. 2022;386:700–702. doi: 10.1056/NEJMc2112090.
  • 34. Kolmogorov M., Billingsley K.J., Mastoras M., Meredith M., Monlong J., Lorig-Roach R., Asri M., Alvarez Jerez P., Malik L., Dewan R., et al. Scalable Nanopore sequencing of human genomes provides a comprehensive view of haplotype-resolved variation and methylation. Nat. Methods. 2023;20:1483–1492. doi: 10.1038/s41592-023-01993-x.
  • 35. Olson N.D., Wagner J., McDaniel J., Stephens S.H., Westreich S.T., Prasanna A.G., Johanson E., Boja E., Maier E.J., Serang O., et al. PrecisionFDA Truth Challenge V2: Calling variants from short and long reads in difficult-to-map regions. Cell Genom. 2022;2. doi: 10.1016/j.xgen.2022.100129.
  • 36. Hamosh A., Scott A.F., Amberger J.S., Bocchini C.A., McKusick V.A. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 2005;33:D514–D517. doi: 10.1093/nar/gki033.
  • 37. Rehm H.L., Berg J.S., Brooks L.D., Bustamante C.D., Evans J.P., Landrum M.J., Ledbetter D.H., Maglott D.R., Martin C.L., Nussbaum R.L., et al. ClinGen — The Clinical Genome Resource. N. Engl. J. Med. 2015;372:2235–2242. doi: 10.1056/NEJMsr1406261.
  • 38. DiStefano M.T., Goehringer S., Babb L., Alkuraya F.S., Amberger J., Amin M., Austin-Tse C., Balzotti M., Berg J.S., Birney E., et al. The Gene Curation Coalition: A global effort to harmonize gene–disease evidence resources. Genet. Med. 2022;24:1732–1742. doi: 10.1016/j.gim.2022.04.017.
  • 39. Gel B., Serra E. karyoploteR: an R/Bioconductor package to plot customizable genomes displaying arbitrary data. Bioinformatics. 2017;33:3088–3090. doi: 10.1093/bioinformatics/btx346.
  • 40. Danecek P., Bonfield J.K., Liddle J., Marshall J., Ohan V., Pollard M.O., Whitwham A., Keane T., McCarthy S.A., Davies R.M., Li H. Twelve years of SAMtools and BCFtools. GigaScience. 2021;10. doi: 10.1093/gigascience/giab008.
  • 41. Cingolani P., Platts A., Wang L.L., Coon M., Nguyen T., Wang L., Land S.J., Lu X., Ruden D.M. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin) 2012;6:80–92. doi: 10.4161/fly.19695.
  • 42. Hickey G., Heller D., Monlong J., Sibbesen J.A., Sirén J., Eizenga J., Dawson E.T., Garrison E., Novak A.M., Paten B. Genotyping structural variants in pangenome graphs using the vg toolkit. Genome Biol. 2020;21:35. doi: 10.1186/s13059-020-1941-7.
  • 43. Sirén J., Monlong J., Chang X., Novak A.M., Eizenga J.M., Markello C., Sibbesen J.A., Hickey G., Chang P.-C., Carroll A., et al. Pangenomics enables genotyping of known structural variants in 5202 diverse genomes. Science. 2021;374. doi: 10.1126/science.abg8871.
  • 44. Geoffroy V., Herenger Y., Kress A., Stoetzel C., Piton A., Dollfus H., Muller J. AnnotSV: an integrated tool for structural variations annotation. Bioinformatics. 2018;34:3572–3574. doi: 10.1093/bioinformatics/bty304.
  • 45. Fishilevich S., Nudel R., Rappaport N., Hadar R., Plaschkes I., Iny Stein T., Rosen N., Kohn A., Twik M., Safran M., et al. GeneHancer: genome-wide integration of enhancers and target genes in GeneCards. Database. 2017;2017. doi: 10.1093/database/bax028.
  • 46. Liu X., Li C., Mou C., Dong Y., Tu Y. dbNSFP v4: a comprehensive database of transcript-specific functional predictions and annotations for human nonsynonymous and splice-site SNVs. Genome Med. 2020;12:103. doi: 10.1186/s13073-020-00803-9.
  • 47. Collins R.L., Brand H., Karczewski K.J., Zhao X., Alföldi J., Francioli L.C., Khera A.V., Lowther C., Gauthier L.D., Wang H., et al. A structural variation reference for medical and population genetics. Nature. 2020;581:444–451. doi: 10.1038/s41586-020-2287-8.
  • 48. Chen S., Francioli L.C., Goodrich J.K., Collins R.L., Kanai M., Wang Q., Alföldi J., Watts N.A., Vittal C., Gauthier L.D., et al. A genomic mutational constraint map using variation in 76,156 human genomes. Nature. 2024;625:92–100. doi: 10.1038/s41586-023-06045-0.
  • 49. Pais L.S., Snow H., Weisburd B., Zhang S., Baxter S.M., DiTroia S., O’Heir E., England E., Chao K.R., Lemire G., et al. seqr: A web-based analysis and collaboration tool for rare disease genomics. Hum. Mutat. 2022;43:698–707. doi: 10.1002/humu.24366.
  • 50. Landrum M.J., Lee J.M., Riley G.R., Jang W., Rubinstein W.S., Church D.M., Maglott D.R. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res. 2014;42:D980–D985. doi: 10.1093/nar/gkt1113.
  • 51. Amberger J.S., Bocchini C.A., Schiettecatte F., Scott A.F., Hamosh A. OMIM.org: Online Mendelian Inheritance in Man (OMIM®), an online catalog of human genes and genetic disorders. Nucleic Acids Res. 2015;43:D789–D798. doi: 10.1093/nar/gku1205.
  • 52. Firth H.V., Richards S.M., Bevan A.P., Clayton S., Corpas M., Rajan D., Van Vooren S., Moreau Y., Pettett R.M., Carter N.P. DECIPHER: Database of Chromosomal Imbalance and Phenotype in Humans Using Ensembl Resources. Am. J. Hum. Genet. 2009;84:524–533. doi: 10.1016/j.ajhg.2009.03.010.
  • 53. Karczewski K.J., Francioli L.C., Tiao G., Cummings B.B., Alföldi J., Wang Q., Collins R.L., Laricchia K.M., Ganna A., Birnbaum D.P., et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020;581:434–443. doi: 10.1038/s41586-020-2308-7.
  • 54. Lonsdale J., Thomas J., Salvatore M., Phillips R., Lo E., Shad S., Hasz R., Walters G., Garcia F., Young N., et al. The Genotype-Tissue Expression (GTEx) project. Nat. Genet. 2013;45:580–585. doi: 10.1038/ng.2653.
  • 55. Pennacchio L.A., Bickmore W., Dean A., Nobrega M.A., Bejerano G. Enhancers: five essential questions. Nat. Rev. Genet. 2013;14:288–295. doi: 10.1038/nrg3458.
  • 56. Aref-Eshghi E., Bend E.G., Hood R.L., Schenkel L.C., Carere D.A., Chakrabarti R., Nagamani S.C.S., Cheung S.W., Campeau P.M., Prasad C., et al. BAFopathies’ DNA methylation epi-signatures demonstrate diagnostic utility and functional continuum of Coffin–Siris and Nicolaides–Baraitser syndromes. Nat. Commun. 2018;9:4885. doi: 10.1038/s41467-018-07193-y.
  • 57. Raney B.J., Barber G.P., Benet-Pagès A., Casper J., Clawson H., Cline M.S., Diekhans M., Fischer C., Navarro Gonzalez J., Hickey G., et al. The UCSC Genome Browser database: 2024 update. Nucleic Acids Res. 2024;52:D1082–D1088. doi: 10.1093/nar/gkad987.
  • 58. Aganezov S., Yan S.M., Soto D.C., Kirsche M., Zarate S., Avdeyev P., Taylor D.J., Shafin K., Shumate A., Xiao C., et al. A complete reference genome improves analysis of human genetic variation. Science. 2022;376. doi: 10.1126/science.abl3533.
  • 59. Liao W.-W., Asri M., Ebler J., Doerr D., Haukness M., Hickey G., Lu S., Lucas J.K., Monlong J., Abel H.J., et al. A draft human pangenome reference. Nature. 2023;617:312–324. doi: 10.1038/s41586-023-05896-x.
  • 60. Porreca G.J., Zhang K., Li J.B., Xie B., Austin D., Vassallo S.L., LeProust E.M., Peck B.J., Emig C.J., Dahl F., et al. Multiplex amplification of large sets of human exons. Nat. Methods. 2007;4:931–936. doi: 10.1038/nmeth1110.
  • 61. Hoppman-Chaney N., Peterson L.M., Klee E.W., Middha S., Courteau L.K., Ferber M.J. Evaluation of Oligonucleotide Sequence Capture Arrays and Comparison of Next-Generation Sequencing Platforms for Use in Molecular Diagnostics. Clin. Chem. 2010;56:1297–1306. doi: 10.1373/clinchem.2010.145441.
  • 62. Hu H., Wrogemann K., Kalscheuer V., Tzschach A., Richard H., Haas S.A., Menzel C., Bienek M., Froyen G., Raynaud M., et al. Mutation screening in 86 known X-linked mental retardation genes by droplet-based multiplex PCR and massive parallel sequencing. HUGO J. 2009;3:41–49. doi: 10.1007/s11568-010-9137-y.
  • 63. Valencia C.A., Rhodenizer D., Bhide S., Chin E., Littlejohn M.R., Keong L.M., Rutkowski A., Bonnemann C., Hegde M. Assessment of Target Enrichment Platforms Using Massively Parallel Sequencing for the Mutation Detection for Congenital Muscular Dystrophy. J. Mol. Diagn. 2012;14:233–246. doi: 10.1016/j.jmoldx.2012.01.009.
  • 64. Awdeh Z.L., Alper C.A. Inherited structural polymorphism of the fourth component of human complement. Proc. Natl. Acad. Sci. USA. 1980;77:3576–3580. doi: 10.1073/pnas.77.6.3576.
  • 65. Chen Y., Dawes R., Kim H.C., Ljungdahl A., Stenton S.L., Walker S., Lord J., Lemire G., Martin-Geary A.C., Ganesh V.S., et al. De novo variants in the RNU4-2 snRNA cause a frequent neurodevelopmental syndrome. Nature. 2024;632:832–840. doi: 10.1038/s41586-024-07773-7.
  • 66.Jones A.G., Aquilino M., Tinker R.J., Duncan L., Jenkins Z., Carvill G.L., DeWard S.J., Grange D.K., Hajianpour M.J., Halliday B.J., et al. Clustered de novo start-loss variants in GLUL result in a developmental and epileptic encephalopathy via stabilization of glutamine synthetase. Am. J. Hum. Genet. 2024;111:729–741. doi: 10.1016/j.ajhg.2024.03.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 67.Ebert P., Audano P.A., Zhu Q., Rodriguez-Martin B., Porubsky D., Bonder M.J., Sulovari A., Ebler J., Zhou W., Serra Mari R., et al. Haplotype-resolved diverse human genomes and integrated analysis of structural variation. Science. 2021;372 doi: 10.1126/science.abf7117. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 68.Sahajpal N.S., Barseghyan H., Kolhe R., Hastie A., Chaubey A. Optical Genome Mapping as a Next-Generation Cytogenomic Tool for Detection of Structural and Copy Number Variations for Prenatal Genomic Analyses. Genes. 2021;12:398. doi: 10.3390/genes12030398. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 69.Merke D.P., Auchus R.J. Congenital Adrenal Hyperplasia Due to 21-Hydroxylase Deficiency. N. Engl. J. Med. 2020;383:1248–1261. doi: 10.1056/NEJMra1909786. [DOI] [PubMed] [Google Scholar]
  • 70.McRae J.F., Clayton S., Fitzgerald T.W., Kaplanis J., Prigmore E., Rajan D., Sifrim A., Aitken S., Akawi N., Alvi M., et al. Prevalence and architecture of de novo mutations in developmental disorders. Nature. 2017;542:433–438. doi: 10.1038/nature21062. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 71.Trier C., Fournous G., Strand J.M., Stray-Pedersen A., Pettersen R.D., Rowe A.D. Next-generation sequencing of newborn screening genes: the accuracy of short-read mapping. NPJ Genom. Med. 2020;5:36. doi: 10.1038/s41525-020-00142-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 72.Chen X., Sanchis-Juan A., French C.E., Connell A.J., Delon I., Kingsbury Z., Chawla A., Halpern A.L., Taft R.J., et al. NIHR BioResource Spinal muscular atrophy diagnosis and carrier screening from genome sequencing data. Genet. Med. 2020;22:945–953. doi: 10.1038/s41436-020-0754-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 73.Jensen T.D., Ni B., Reuter C., Gorzynski J.E., Fazal S., Bonner D.E., Ungar R.A., Goddard P.C., Natarajan Raja A., Ashley E.A., et al. Integration of transcriptomics and long-read genomics prioritizes structural variants in rare disease. medRxiv. 2024 doi: 10.1101/2024.03.22.24304565. Preprint at. [DOI] [Google Scholar]
  • 74.Mahmoud M., Gobet N., Cruz-Dávalos D.I., Mounier N., Dessimoz C., Sedlazeck F.J. Structural variant calling: the long and the short of it. Genome Biol. 2019;20:246. doi: 10.1186/s13059-019-1828-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 75.Wijngaard R., Demidov G., O’Gorman L., Corominas-Galbany J., Yaldiz B., Steyaert W., de Boer E., Vissers L.E.L.M., Kamsteeg E.-J., Pfundt R., et al. Mobile element insertions in rare diseases: a comparative benchmark and reanalysis of 60,000 exome samples. Eur. J. Hum. Genet. 2024;32:200–208. doi: 10.1038/s41431-023-01478-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 76.Schobers G., Schieving J.H., Yntema H.G., Pennings M., Pfundt R., Derks R., Hofste T., de Wijs I., Wieskamp N., van den Heuvel S., et al. Reanalysis of exome negative patients with rare disease: a pragmatic workflow for diagnostic applications. Genome Med. 2022;14:66. doi: 10.1186/s13073-022-01069-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 77.van Slobbe M., van Haeringen A., Vissers L.E.L.M., Bijlsma E.K., Rutten J.W., Suerink M., Nibbeling E.A.R., Ruivenkamp C.A.L., Koene S. Reanalysis of whole-exome sequencing (WES) data of children with neurodevelopmental disorders in a standard patient care context. Eur. J. Pediatr. 2024;183:345–355. doi: 10.1007/s00431-023-05279-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 78.Ebler J., Ebert P., Clarke W.E., Rausch T., Audano P.A., Houwaart T., Mao Y., Korbel J.O., Eichler E.E., Zody M.C., et al. Pangenome-based genome inference allows efficient and accurate genotyping across a wide spectrum of variant classes. Nat. Genet. 2022;54:518–525. doi: 10.1038/s41588-022-01043-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 79.Cheung W.A., Johnson A.F., Rowell W.J., Farrow E., Hall R., Cohen A.S.A., Means J.C., Zion T.N., Portik D.M., Saunders C.T., et al. Direct haplotype-resolved 5-base HiFi sequencing for genome-wide profiling of hypermethylation outliers in a rare disease cohort. Nat. Commun. 2023;14:3090. doi: 10.1038/s41467-023-38782-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 80.Li T., Ferraro N., Strober B.J., Aguet F., Kasela S., Arvanitis M., Ni B., Wiel L., Hershberg E., Ardlie K., et al. The functional impact of rare variation across the regulatory cascade. Cell Genom. 2023;3 doi: 10.1016/j.xgen.2023.100401. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 81.Martin-Trujillo A., Patel N., Richter F., Jadhav B., Garg P., Morton S.U., McKean D.M., DePalma S.R., Goldmuntz E., Gruber D., et al. Rare genetic variation at transcription factor binding sites modulates local DNA methylation profiles. PLoS Genet. 2020;16 doi: 10.1371/journal.pgen.1009189. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 82.Chundru V.K., Marioni R.E., Prendergast J.G.D., Lin T., Beveridge A.J., Martin N.G., Montgomery G.W., Hume D.A., Deary I.J., Visscher P.M., et al. Rare genetic variants underlie outlying levels of DNA methylation and gene-expression. Hum. Mol. Genet. 2023;32:1912–1921. doi: 10.1093/hmg/ddad028. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 83.Tinker R.J., Bastarache L., Ezell K., Kobren S.N., Esteves C., Rosenfeld J.A., Macnamara E.F., Hamid R., Cogan J.D., Rinker D., et al. The contribution of mosaicism to genetic diseases and de novo pathogenic variants. Am. J. Med. Genet. A. 2023;191:2482–2492. doi: 10.1002/ajmg.a.63309. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 84.Noyes M.D., Harvey W.T., Porubsky D., Sulovari A., Li R., Rose N.R., Audano P.A., Munson K.M., Lewis A.P., Hoekzema K., et al. Familial long-read sequencing increases yield of de novo mutations. Am. J. Hum. Genet. 2022;109:631–646. doi: 10.1016/j.ajhg.2022.02.014. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

Supplementary Materials

Document S1. Figures S1–S21
mmc1.pdf (47.4MB, pdf)
Table S1. Sequencing data statistics
mmc2.xlsx (39.4KB, xlsx)
Data S1. Tables S2–S6
mmc3.xlsx (160.7KB, xlsx)
Data S2. Tables S7 and S8
mmc4.xlsx (16.8KB, xlsx)
Data S3. Tables S9–S14
mmc5.xlsx (204KB, xlsx)
Data S4. Tables S15–S23
mmc6.xlsx (588.9KB, xlsx)
Data S5. Tables S24 and S25
mmc7.xlsx (104.8KB, xlsx)
Data S6. Tables S26–S28
mmc8.xlsx (19.6KB, xlsx)
Document S2. Article plus supplemental information
mmc9.pdf (53.8MB, pdf)

Data Availability Statement

Genomic and phenotypic data from the Rare Genomes Project are available via dbGaP under accession number phs003047 (GREGoR). Access is managed by a data access committee designated by dbGaP. Additional information on accessing the data is available on the GREGoR website at https://gregorconsortium.org/data. The Pediatric Mendelian Genomics Research Center (PMGRC) samples have been submitted to AnVIL and will be available pending processing by the AnVIL team. The DSD-TRN data will be available upon request. Scripts for performing all genomic analyses are available on GitHub (https://github.com/shlokanegi/Long_reads_rare_disease_Paper).


Articles from American Journal of Human Genetics are provided here courtesy of American Society of Human Genetics
