Abstract
Background
Next-Generation short-read sequencing has limited diagnostic utility in phasing distantly separated variants and analysing genomic regions with high homology. Determining the phase of variants from parental chromosomes is critical for accurate identification of compound heterozygosity. Long-read sequencing technology is able to overcome these limitations through the analysis of long haplotypes of separated variants. This study has developed and validated a robust, end-to-end workflow for phasing and localising variants using long-range PCR (LR-PCR) and targeted Nanopore sequencing for clinical implementation.
Methods
NA24385 (HG002) reference DNA was used for all tests. Four PCR kits were tested to optimise LR-PCR for targets between 1 and 20 kb. Amplicons were barcoded and sequenced on Flongle flow cells, with up to eight amplicons on each flow cell. An in-house bioinformatic pipeline was developed to analyse the amplicons. This pipeline is capable of detecting chimeric reads (a known PCR artefact), and incorporating Clair3 for variant calling, and WhatsHap and HapCUT2 for phasing.
Results
The UltraRun LongRange PCR Kit performed with a 90% success rate for DNA amplification up to 22 kb. All 15 tested heterozygous Single Nucleotide Variant (SNV) pairs, and 10 small InDels, with inter-variant distances from 5.8 to 21.4 kb, were phased with 100% concordance to known phase. Furthermore, SNV calling within six low-mappability genes demonstrated precision and sensitivity of 1 against benchmark data. The median proportion of chimeric reads was maintained at 2.80% (range 1.79–16.12%) under optimised conditions.
Conclusions
This study establishes a reliable and affordable clinical diagnostic workflow for accurate phasing of variants separated by up to ~ 20 kb and for variant localisation in genomic regions not able to be sequenced by short-read sequencing. This integrated approach enables implementation in diagnostic settings to resolve complex genetic findings and improve variant interpretation. The bioinformatic pipeline and documentation are available at https://github.com/j-jamshidi/ONT_amp_phase.
Supplementary Information
The online version contains supplementary material available at 10.1186/s12920-025-02251-z.
Keywords: Nanopore long-read sequencing, Phasing, Long-range PCR
Introduction
Next generation sequencing (NGS) has revolutionised diagnostic genomics, with whole exome (WES) and whole genome (WGS) sequencing having become integral to clinical diagnosis. NGS has several inherent limitations related to short read sequencing (SRS), such as poor alignment in regions of high sequence similarity (paralogous regions) of the human genome and the inability to phase variants more than a few hundred bases apart [1].
Phasing refers to determining whether two variant alleles are located on the same parental chromosome (in cis) or on different parental chromosomes (in trans). Phasing can determine whether two variants sequenced in a single individual are compound heterozygous (in trans), which is crucial for resolving the inheritance of variants in autosomal recessive conditions, particularly when parental segregation is not possible [2]. In addition, when one parent is heterozygous for both variants or when one variant is de novo, parental sequencing cannot determine the phase.
Poor alignment occurs in SRS within regions of the genome with large stretches of identical or nearly identical sequences that exist at multiple locations such as tandem repeats, transposable elements, paralogous genes and pseudogenes [3]. In such regions short reads cannot be uniquely aligned to any one location, causing misalignment and poor mapping [4]. Variants detected in regions with low mapping usually require confirmation using an alternative method, such as Sanger sequencing. However, designing specific Sanger primers may not be feasible due to high sequence similarity and sequencing length limitations, leaving limited diagnostic tools available to confirm the variants [3]. Long-read sequencing (LRS) can overcome this issue as primers may be designed in unique sequence regions at a distance from the variant of interest.
Recent advances in basecalling accuracy have enabled LRS to address the shortcomings of SRS, and for it to be implemented in diagnostic genomic testing [5]. Oxford Nanopore Technologies (ONT) offers high accuracy, affordable and scalable LRS [6] using the smallest flow cell (Flongle Flow cells) for targeted sequencing, enabling financially feasible small-scale diagnostic assays.
Although long-range PCR can amplify long DNA sequences, challenges may arise with primer design and with increased levels of chimeric reads, a PCR artefact derived from two different biological sequences that affects sequencing accuracy and phasing [7, 8]. Therefore, optimising long-range PCR while detecting and minimising chimeric reads is crucial for successful sequencing applications for diagnostic purposes.
This study has developed a robust long-range PCR protocol by comparing different PCR kits to maximise successful simultaneous amplification of long DNA targets (~ 1–20 kb), a protocol to barcode and sequence multiple long amplicons on Flongle flow cells, and a bioinformatic pipeline to streamline and automate quality control, variant calling, localisation, and phasing from barcoded ONT amplicons. Combining these three steps has resulted in a reliable and affordable workflow for localising or phasing clinically relevant variants up to 20 kb apart, suitable for implementation in clinical diagnostics.
Methods
Samples and ethical considerations
NA24385 DNA (HG002 from GIAB) (Coriell Institute) was used to set up the PCRs, sequencing and evaluation of variant calling and phasing from amplicons. This project was reviewed by the South Eastern Sydney Local Health District HREC at a meeting of its Low and Negligible Risk Committee, which determined that the project did not involve any ethical risks in accordance with NSW Health Policy.
Long-range PCR kit comparison
Four PCR kits with the ability to amplify long targets were selected for assessment: Platinum SuperFi II PCR Master Mix (2X) (Invitrogen), LongAmp® Taq 2X Master Mix (New England Biolabs), Q5® Hot Start High-Fidelity 2X Master Mix (New England Biolabs), and UltraRun LongRange PCR Kit (Qiagen). Ten regions of the human genome between 1 and 22 kb in size were chosen for amplification; each contained at least one gene, and each amplicon spanned a minimum of two exons. This experiment focused solely on evaluating the ability of different kits to amplify targets of varying lengths (1–22 kb), independent of phasing analysis. Primers for these regions were designed using NCBI Primer-BLAST (Primer IDs: A1-A10). PCR programs were configured according to the manufacturer’s recommendations for each kit. A single PCR program using a single annealing temperature and extension time was established for each kit to enable running samples simultaneously on a thermal cycler. All PCRs were run for 26 cycles to minimise chimeric reads. All PCRs were conducted in a final volume of 20 µl, using 150 ng of the same DNA sample (NA24385) and 0.5 µM of each forward and reverse primer. After PCR, the amplicons were analysed using the Agilent 4200 TapeStation System. A successful amplicon was defined as the presence of a clear band in the gel image (concentration > 2 ng/µl). Non-specific bands were defined as any additional bands with a concentration > 2 ng/µl. A detailed description of PCR kit selection, including PCR protocols, amplicons and primer sequences, is presented in the supplementary document.
Nanopore library preparation and sequencing
The Flongle protocol for barcoding genomic DNA (Ligation Sequencing gDNA - Native Barcoding Kit 24 V14, SQK-NBD114.24) was adapted for amplicon library preparation. The DNA repair and end-prep step was modified to accommodate the use of amplicons. More detail is available in the supplementary document. Nanopore libraries were prepared using the Ligation Sequencing Kit V14 (SQK-LSK114) and barcoded using the Native Barcoding Kit 24 V14 (SQK-NBD114.24). Briefly, up to eight amplicons were multiplexed in each sequencing run with equimolar amounts of each amplicon. For end-prep, 0.75 µl Ultra II End-prep Enzyme Mix and 0.875 µl Ultra II End-prep Reaction Buffer (New England Biolabs #E7546) were added to 5–10 femtomoles of each PCR product adjusted to 18.4 µl with nuclease-free water. Reactions were incubated at 20 °C for 5 min and 65 °C for 5 min and then purified using AMPure XP beads (AXP). Native barcodes were ligated in a reaction containing 7.5 µl end-prepped DNA, 2.5 µl Native Barcode and 10 µl Blunt/TA Ligase Master Mix (New England Biolabs #M0367), incubated at room temperature for 20 min. EDTA was then added to each reaction, the samples were pooled, and another AXP purification was performed. Sequencing adaptors were then ligated to the barcoded end-prepped amplicons in a reaction containing 30 µl pooled barcoded sample, 5 µl Native Adapter (NA), 10 µl NEBNext Quick Ligation Reaction Buffer (5X) and 5 µl Quick T4 DNA Ligase (New England Biolabs #E6056), incubated for 20 min at room temperature, followed by washing with Long Fragment Buffer (LFB) and AXP clean-up. The library was eluted in 7 µl of Elution Buffer (EB) and incubated at 37 °C for 10 min. Ten femtomoles of the library were sequenced using a Flongle flow cell (R10.4.1, FLO-FLG114) on a GridION (ONT) device.
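Because inputs are specified in femtomoles while amplicon concentrations are usually measured in ng/µl, equimolar pooling requires a length-dependent conversion. The following is a minimal sketch of that conversion, not part of the published protocol. It assumes the common rule of thumb of about 650 g/mol per base pair of double-stranded DNA; the function names and the 20 ng/µl stock concentration are hypothetical.

```python
# Approximate average molar mass of one base pair of dsDNA (g/mol).
# Rule-of-thumb value; an assumption, not taken from the protocol.
AVG_BP_MASS_G_PER_MOL = 650.0


def fmol_to_ng(fmol, length_bp):
    """Mass in ng of `fmol` femtomoles of a dsDNA amplicon of `length_bp` bp."""
    # fmol * 1e-15 mol/fmol * (length_bp * 650 g/mol) * 1e9 ng/g
    return fmol * length_bp * AVG_BP_MASS_G_PER_MOL * 1e-6


def volume_for_fmol(fmol, length_bp, conc_ng_per_ul):
    """Volume in microlitres of a stock at `conc_ng_per_ul` containing `fmol`."""
    return fmol_to_ng(fmol, length_bp) / conc_ng_per_ul


# Amplicon sizes from Table 1; the stock concentration is hypothetical.
for name, size_bp in [("B1", 8743), ("B7", 17828), ("B10", 22409)]:
    mass = fmol_to_ng(10, size_bp)
    vol = volume_for_fmol(10, size_bp, 20.0)
    print(f"{name}: 10 fmol = {mass:.1f} ng = {vol:.2f} uL at 20 ng/uL")
```

The sketch makes explicit that, for the same molar input, a 22 kb amplicon needs roughly 2.5 times the mass of a 9 kb amplicon.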
Basecalling and alignment
Super accuracy basecalling (SUP) (dna_r10.4.1_e8.2_400bps_sup@v4.3.0) was performed during sequencing using MinKNOW software (GridION Release 24.06.15), which uses dorado (https://github.com/nanoporetech/dorado) for basecalling and Minimap2 version 2.28 [9] to align reads to the human reference genome (build hg38) and output BAM files. The minimum Phred quality score was set to ten.
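For orientation, a Phred score Q corresponds to an error probability of 10^(-Q/10), so a minimum score of ten admits reads with an estimated error rate of up to 10%. The sketch below illustrates this relationship. Summarising a read by the Phred score of its mean per-base error probability is an assumption made here for illustration; the source does not describe how the read-level score is computed, and the function names are hypothetical.

```python
import math
from typing import Sequence


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def mean_read_quality(base_qualities: Sequence[float]) -> float:
    """Read-level quality: Phred score of the mean per-base error probability.

    Averaging error probabilities (rather than Phred scores) is an
    assumption of this sketch.
    """
    mean_err = sum(phred_to_error_prob(q) for q in base_qualities) / len(base_qualities)
    return -10 * math.log10(mean_err)


def passes_min_quality(base_qualities: Sequence[float], min_q: float = 10.0) -> bool:
    """True if the read-level quality meets the minimum threshold."""
    return mean_read_quality(base_qualities) >= min_q
```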
Bioinformatic analysis
Samtools v1.21 [10] was employed for BAM file manipulation and quality control. For phasing analysis, reads with mapping quality (MAPQ) < 20 and those shorter than the distance between the two variants were excluded. For variant localisation analysis, reads shorter than 1000 bp, those with read identity < 80%, and those with MAPQ < 20 were filtered out. Variant calling from amplicons was performed using Clair3 v1.0.4 [11], while phasing was conducted using WhatsHap v2.3 [12] and HapCUT2 v1.3.4 [13]. WhatsHap was additionally used to haplotag the BAM files for visualisation. A custom Python script was developed to identify variants, reads, and their corresponding phases. The chimeric read proportion was calculated as the percentage of reads classified as the opposite phase to that determined by WhatsHap and HapCUT2, representing the smaller fraction of discordant reads. All BAM files were visualised and manually inspected using the Integrative Genomics Viewer (IGV) [14] to confirm variant localisation and phase assignment. A bioinformatics pipeline was developed to automate these analyses and generate reports containing both quality control metrics and phase/localisation details. A minimum threshold of 50 high-quality reads (as defined above) was set as the requirement for reliable phasing. All BAM files were down-sampled to 50 reads and re-evaluated to assess whether this threshold was sufficient for accurate phasing.
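The chimeric read estimate described above can be illustrated with a minimal sketch. This is not the authors' script: it assumes the allele observed at each of the two variant positions has already been extracted for every high-quality spanning read (for example from a BAM file, not shown here) and encoded as "ref" or "alt". The function name and input format are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple

MIN_HQ_READS = 50  # minimum number of high-quality reads stated in the text


def call_phase(read_alleles: Iterable[Tuple[str, str]]):
    """Return (phase, chimeric_percent, n_reads) for one variant pair.

    Each element is the pair of alleles ("ref" or "alt") that one read
    carries at variant 1 and variant 2. Reads carrying ref/ref or alt/alt
    support a cis configuration; ref/alt or alt/ref support trans.
    """
    counts = Counter()
    for a1, a2 in read_alleles:
        if a1 not in ("ref", "alt") or a2 not in ("ref", "alt"):
            continue  # skip reads with another base or a gap at either site
        counts["cis" if a1 == a2 else "trans"] += 1

    n = counts["cis"] + counts["trans"]
    if n < MIN_HQ_READS:
        raise ValueError(f"only {n} informative reads; need {MIN_HQ_READS}")

    # The majority configuration is taken as the phase. A near-even split
    # would be unphaseable in practice; ties default to cis in this sketch.
    phase = "cis" if counts["cis"] >= counts["trans"] else "trans"
    # Reads supporting the minority configuration are counted as chimeric.
    chimeric_percent = 100.0 * min(counts["cis"], counts["trans"]) / n
    return phase, chimeric_percent, n
```

In the pipeline described here the phase itself comes from WhatsHap and HapCUT2; the majority vote in this sketch only stands in for that step so that the minority fraction can be computed.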
Complete documentation of the bioinformatics pipeline and associated code are available in the corresponding GitHub repository (https://github.com/j-jamshidi/ONT_amp_phase).
Phasing samples
In a separate experiment designed to assess phasing performance, ten heterozygous variant pairs with distances ranging from 6.8 to 21.4 kb were selected from the NA24385 sample. Primers were designed to span the two variants in each case (Primer IDs: B1-B10). All amplicons were sequenced, and the phases of the variants were determined and compared with the known phase. Details of the variants, primers and amplicons are provided in supplementary Table S10.
Chimeric read additional samples
Eight targets were amplified using (a) the Platinum SuperFi II kit (26 cycles) and (b) the UltraRun LongRange PCR Kit (28 cycles) to assess the impact of different PCR conditions on chimeric read formation. The percentage of chimeric reads under each condition was then compared against the UltraRun LongRange PCR Kit (26 cycles) baseline condition using the Wilcoxon signed-rank test. A p-value < 0.05 was considered statistically significant.
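The paired comparison can be sketched as follows. This is an illustrative, dependency-free implementation of a two-sided Wilcoxon signed-rank test that computes an exact p-value by enumerating all sign assignments, which is feasible for eight pairs. The source does not state which software or which variant of the test (exact or normal approximation) was used, so results from this sketch may differ slightly from reported values. The example percentages are hypothetical, not data from the study.

```python
from itertools import product


def wilcoxon_signed_rank_exact(baseline, treatment):
    """Two-sided exact Wilcoxon signed-rank test for paired samples.

    Returns (W, p), where W is the smaller of the positive and negative
    rank sums. Zero differences are dropped; tied absolute differences
    receive average ranks.
    """
    diffs = [t - b for b, t in zip(baseline, treatment) if t != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w = min(w_plus, total - w_plus)

    # Under the null hypothesis every sign pattern is equally likely.
    hits = 0
    for signs in product((0, 1), repeat=n):
        wp = sum(r for r, s in zip(ranks, signs) if s)
        if min(wp, total - wp) <= w:
            hits += 1
    return w, hits / 2 ** n


# Hypothetical chimeric read percentages for eight paired amplicons.
baseline_pct = [2.0, 3.0, 3.5, 4.0, 4.5, 2.5, 5.0, 3.0]
treatment_pct = [6.0, 8.5, 10.0, 3.5, 12.5, 7.0, 12.0, 10.5]
w_stat, p_value = wilcoxon_signed_rank_exact(baseline_pct, treatment_pct)
print(f"W = {w_stat}, p = {p_value:.4f}")
```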
Low mappability genes
To evaluate the suitability of the method to call and localise variants in low mappability regions of the human genome, eight genes (TUBB2A, TUBB2B, CYP11B1, SBDS, HBA1, HBA2, RBMX, and CEL) were selected. These genes are associated with Mendelian disorders and contain regions of low mappability. The details of these genes and their associated phenotypes are summarised in Table S11. Primers were designed to amplify the entire length of each gene. HBA1 and HBA2 are in close proximity in the human genome; therefore, a single primer pair was designed to span the region covering both genes. Details of the primers and amplicons are provided in supplementary Table S12. Long-range PCR and sequencing were performed as for the phasing cases. The SNVs and small indels called from amplicon sequencing of these genes (excluding RBMX, which is located on the X chromosome and not included in the benchmarking VCF file) were compared with GIAB ground-truth variants (benchmark version v4.2.1) [15], using the vcfeval command from RTG Tools v3.13 (https://github.com/RealTimeGenomics/rtg-tools).
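As a rough illustration of what the benchmarking comparison measures, the sketch below computes precision and sensitivity from two sets of variants. It is a simplification: variants are matched by exact (chromosome, position, ref, alt) tuples, whereas vcfeval performs a more sophisticated haplotype-aware comparison. The function name and the example variants are hypothetical.

```python
def precision_sensitivity(called, truth):
    """Return (precision, sensitivity) for called variants against a truth set.

    Variants are hashable tuples such as (chrom, pos, ref, alt) and are
    matched exactly; this is a simplification of what vcfeval does.
    """
    called, truth = set(called), set(truth)
    tp = len(called & truth)   # called and present in the benchmark
    fp = len(called - truth)   # called but absent from the benchmark
    fn = len(truth - called)   # in the benchmark but not called
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, sensitivity


# Hypothetical variants, for illustration only.
truth_set = [("chrN", 100, "A", "G"), ("chrN", 250, "C", "T"), ("chrN", 900, "G", "A")]
call_set = [("chrN", 100, "A", "G"), ("chrN", 250, "C", "T"), ("chrN", 400, "T", "C")]
print(precision_sensitivity(call_set, truth_set))
```

A precision and sensitivity of 1 correspond to the case where the two sets are identical within the regions compared.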
Results
Long-range PCR
Among the four PCR kits tested, the UltraRun LongRange PCR Kit demonstrated the best performance, successfully amplifying 9 out of 10 amplicons (90% success rate). Platinum SuperFi II and LongAmp® Taq showed the same success rate of 70% (7 out of 10). The Q5® Hot Start High-Fidelity kit had the lowest success rate, successfully amplifying only four amplicons. Therefore, the UltraRun LongRange PCR Kit was selected for further amplifications. Details on the performance of each kit are available in the supplementary material (Tables S4-S7). All ten primer pairs designed for phasing (B1-B10) amplified their target regions successfully, although one amplicon (B1) required the addition of Q Solution, a PCR additive included with the UltraRun LongRange PCR Kit to enhance amplification.
Six of the seven PCR primers designed for the amplification of low-mappability genes produced a product (TUBB2A and HBA1/HBA2 required Q solution), whereas CEL primers did not yield a product.
Nanopore sequencing
Fifteen amplicons amplified with the UltraRun LongRange PCR Kit in 26 cycles (5 amplicons used for testing different PCR kits and 10 designed specifically for phasing) were sequenced on Flongle flow cells. All amplicons had a high depth of high-quality (MAPQ > 20) reads covering both variants of interest (mean = 2921, median = 3248, standard deviation = 1716.8, range = 286–5483). Table 1 summarises the sequencing details.
Table 1.
List of amplicons sequenced for variant phasing
| Amplicon ID | Amplicon Size | Variant 1 (hg38) | Variant 2 (hg38) | Distance between variants (bp) | HQ spanning reads* | Chimeric reads % | Known phase |
|---|---|---|---|---|---|---|---|
| B1 | 8743 | chr16:88,643,329 C>T | chr16:88,650,126 A>G | 6797 | 5483 | 13.97 | cis |
| A3 | 9625 | chr1:154,579,450 C>T | chr1:154,585,209 C>T | 5759 | 5385 | 4.25 | cis |
| B2 | 10,026 | chr12:8,927,591 T>C | chr12:8,936,676 A>C | 9085 | 4413 | 16.12 | trans |
| B3 | 11,521 | chr12:6,936,914 G>A | chr12:6,944,269 C>T | 7355 | 1748 | 3.03 | cis |
| B4 | 13,789 | chr4:41,638,861 C>T | chr4:41,651,881 G>A | 13,020 | 673 | 1.93 | cis |
| B5 | 14,391 | chr3:49,412,205 G>A | chr3:49,425,854 G>A | 13,649 | 3165 | 2.62 | cis |
| B6 | 14,466 | chr21:42,375,588 G>A | chr21:42,388,818 A>G | 13,230 | 4450 | 3.33 | cis |
| A4 | 15,163 | chr6:152,389,867 T>C | chr6:152,403,191 G>T | 13,324 | 3349 | 2.51 | cis |
| A5 | 16,084 | chr1:236,544,737 T>C | chr1:236,559,878 C>A | 15,141 | 3248 | 5.48 | trans |
| A6 | 16,837 | chr1:43,658,540 G>T | chr1:43,667,233 A>G | 8693 | 3792 | 2.24 | cis |
| B7 | 17,828 | chr21:42,372,760 C>T | chr21:42,387,895 A>C | 15,135 | 906 | 2.32 | cis |
| A7 | 18,097 | chr1:6,629,764 A>G | chr1:6,645,884 C>T | 16,120 | 3633 | 1.79 | cis |
| B8 | 21,530 | chr3:48,566,437 G>A | chr3:48,584,160 T>C | 17,723 | 709 | 6.77 | cis |
| B9 | 22,097 | chr9:130,407,156 A>G | chr9:130,426,008 C>T | 18,852 | 2570 | 2.80 | trans |
| B10 | 22,409 | chr1:236,011,853 T>C | chr1:236,033,263 A>G | 21,410 | 286 | 2.10 | cis |
* HQ Spanning reads: High quality reads (MAPQ > 20) that span both variants of interest (i.e., variants to be phased)
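The summary statistics for Table 1 can be recomputed directly from its columns. The sketch below uses Python's standard `statistics` module and assumes the sample (n - 1) standard deviation; the source does not state which estimator was used. The values are transcribed from the "HQ spanning reads" and "Chimeric reads %" columns in table order.

```python
import statistics

# Values transcribed from Table 1, in row order (B1 ... B10).
hq_spanning_reads = [5483, 5385, 4413, 1748, 673, 3165, 4450, 3349,
                     3248, 3792, 906, 3633, 709, 2570, 286]
chimeric_pct = [13.97, 4.25, 16.12, 3.03, 1.93, 2.62, 3.33, 2.51,
                5.48, 2.24, 2.32, 1.79, 6.77, 2.80, 2.10]


def summarise(values):
    """Mean, median, sample standard deviation and range of a column."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),  # sample SD (n - 1); an assumption
        "min": min(values),
        "max": max(values),
    }


for label, column in [("HQ spanning reads", hq_spanning_reads),
                      ("Chimeric reads %", chimeric_pct)]:
    stats = summarise(column)
    print(label, {k: round(v, 2) for k, v in stats.items()})
```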
Additionally, all amplicons amplified using Platinum SuperFi II PCR Master Mix and those amplified with 28 cycles using the UltraRun LongRange PCR Kit were also sequenced. The sequencing details are available in supplementary Tables S13 and S14. The depth of sequencing for each amplicon targeting low mappability genes was > 500x after QC (supplementary Table S15).
Phasing
Fifteen variant pairs in amplicons generated using the UltraRun LongRange PCR Kit were phased and compared with the known phase. All variants were SNVs, with inter-variant distances ranging from 5,759 to 21,410 bp. All 15 variant pairs were phased correctly and were concordant with the known phase (Table 1). Furthermore, to test the robustness of this pipeline for phasing small indels, ten indels were identified and phased in the existing amplicons. All the indels were also phased correctly (Table S16). Moreover, re-evaluating all cases with only 50 high-quality reads per sample did not compromise phasing accuracy. Figure 1 shows an IGV screenshot of the phased reads for the B7 amplicon, highlighting the positions of the selected variants used for phasing as well as the variants called within the amplicon and the benchmarking variants.
Fig. 1.
IGV screenshot of phased reads for the B7 amplicon. This IGV screenshot illustrates phased long reads, coloured and grouped by haplotype (phase 1 and phase 2). From top to bottom, the tracks display: the reference gene, TMPRSS3; high-confidence variants for benchmarking (HG002); variants called from the amplicon using Clair3; and the coverage and alignments from long-read sequencing of the B7 amplicon, grouped by phase. Two key variants, chr21:42,372,760 C>T and chr21:42,387,895 A>C, are highlighted by red boxes and arrows, demonstrating their cis configuration (on the same phase). The reads are down-sampled and small indels (<2 bp) are hidden to provide a clearer view
Chimeric reads
The estimated proportion of chimeric reads in the fifteen phased amplicons ranged from 1.79% to 16.12% (mean = 4.75, median = 2.80, SD = 4.41). The percentage of chimeric reads for each amplicon is presented in Table 1.
Chimeric reads in different PCR conditions
The UltraRun LongRange (26 cycles) baseline condition yielded a median chimeric read percentage of 3.64% across the eight samples. Compared to this baseline, amplification with Platinum SuperFi II (26 cycles) (median = 9.33%; W-statistic = 1.0, p = 0.017) and increasing the cycles to 28 (median = 6.56%; W-statistic = 1.0, p = 0.017) resulted in significantly higher chimeric read levels. See Fig. 2 and Table S17.
Fig. 2.
Chimeric reads across different PCR conditions. Box plots display the median and interquartile range (IQR) of chimeric reads for each PCR condition, with individual data points overlaid. The UltraRun LongRange PCR Kit amplified samples for 26 cycles (reference group, blue) and 28 cycles (green). The Platinum SuperFi II PCR Master Mix amplified samples for 26 cycles (orange)
Variant calling from low mappability genes amplicons
The SNVs called via the pipeline with Clair3 from the amplicons generated for low mappability genes (TUBB2A, TUBB2B, CYP11B1, SBDS, HBA1 and HBA2; 64 variants in total) showed 100% concordance with the benchmark (high confidence) variants (precision = 1, sensitivity = 1; see Supplementary Material for definitions of precision and sensitivity). Figure 3 illustrates short-read WGS and long-read amplicon sequencing of the TUBB2A gene.
Fig. 3.
IGV screenshot of short-read WGS and long-read amplicon sequencing of the TUBB2A gene. From top to bottom, the tracks display: the reference gene, TUBB2A; high-confidence variants for benchmarking (HG002); coverage and alignments from short-read WGS of HG002 (from GIAB repository); coverage and alignments from long-read sequencing of the TUBB2A amplicon (down-sampled to 176 reads). The red arrow highlights a variant that appears in a low-mappability region in the short-read WGS. This variant is not present in the high-confidence call set and not a real variant. The variant is not present in the long-read amplicon sequencing, demonstrating the utility of this method for accurate variant localisation in low-mappability regions of the genome
Discussion
This study developed and validated an integrated, end-to-end clinical diagnostic workflow combining optimised LR-PCR, targeted ONT sequencing, and an in-house developed bioinformatic pipeline for variant phasing and localisation. The workflow achieved accurate phasing of variants up to 21.4 kb apart and precise variant localisation in low-mappability genomic regions, addressing critical limitations of current short-read approaches. Methodologically, the study presents a comprehensively optimised workflow. This rigorous approach involved comparing LR-PCR kits systematically, optimising PCR conditions, developing a modified Nanopore library preparation protocol for barcoding on Flongle flow cells, and adjusting the multiplexing number and molarity of amplicons. Complementing these laboratory improvements is an automated bioinformatic pipeline integrating established tools like Clair3 for variant calling, and WhatsHap and HapCUT2 for phasing, alongside custom scripts for quality control and specific analytical logic for phasing versus localisation tasks. This end-to-end development, tailored for a cost-constrained yet high-performance application, represents a significant advancement beyond existing tools.
LR-PCR is a recognised versatile technique for amplifying large genomic segments [16], but its success is highly dependent on careful optimisation, limiting its applications in targeted long-read sequencing [17]. In the current study, different LR-PCR kits were compared to optimise the workflow, with the UltraRun LongRange PCR Kit demonstrating the best performance by successfully amplifying nine out of ten targets. Both Platinum SuperFi II and LongAmp® Taq also performed well, each amplifying seven targets successfully. Notably, the three amplicons that failed with both kits were the longest in the panel (all >18 kb), consistent with the difficulty of amplifying ultra-long fragments. The Q5® Hot Start High-Fidelity kit demonstrated the weakest performance, with only four successful amplifications, likely due to the use of a fixed annealing temperature despite the kit’s recommendation for primer-specific annealing conditions. A universal annealing temperature and extension time are necessary to allow simultaneous amplification of samples, facilitate batching and streamline the workflow, which are key considerations for implementation in a diagnostic laboratory.
Although some regions slightly over 20 kb were amplified, limiting amplicons to 20 kb is recommended due to the technical challenges of amplifying longer targets [18]. An internal analysis of a curated panel of 5,678 genes associated with Mendelian disorders used in our laboratory showed a median gene size of approximately 38 kb. In this dataset, for any two random variants within a gene, there is a ~68% chance that they lie within 20 kb of each other (Figure S1; see Supplementary Data for the detailed analysis). A 20 kb limit should therefore be sufficient for the majority of variants requiring phasing in a clinical context.
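One simple way to reason about such an estimate is sketched below. It assumes that the two variants fall independently and uniformly at random along a gene of length L, in which case the probability that they lie within a distance d of each other is 1 - (1 - d/L)^2 for L > d, and 1 otherwise. This uniform model is an assumption of the sketch; the authors' actual analysis is described in their Supplementary Data and may differ. The gene lengths in the example are hypothetical.

```python
def prob_within(distance_bp, gene_length_bp):
    """P(|X - Y| <= distance) for X, Y independent and uniform on a gene."""
    if gene_length_bp <= distance_bp:
        return 1.0
    return 1.0 - (1.0 - distance_bp / gene_length_bp) ** 2


def panel_probability(gene_lengths_bp, distance_bp=20_000):
    """Average of the per-gene probability over a panel of gene lengths."""
    return sum(prob_within(distance_bp, length) for length in gene_lengths_bp) / len(gene_lengths_bp)


# Hypothetical gene lengths (bp), for illustration only.
example_lengths = [8_000, 25_000, 38_000, 90_000, 250_000]
print(f"{panel_probability(example_lengths):.2f}")
```

Under this model the panel-wide figure depends on the whole distribution of gene lengths rather than on the median alone, because very long genes contribute low per-gene probabilities.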
Variants >20 kb apart could potentially be phased using overlapping amplicons; however, this approach adds complexity and may not always be feasible [19]. While this strategy can be effective for targeted phasing of specific loci [20], it is less practical for general diagnostic testing. Given that this method is intended to be general and focused on clinical implementation, we chose not to pursue this strategy.
An important technical consideration in amplicon-based phasing is the formation of chimeric reads during PCR, which can falsely link variants on different haplotypes [7]. Unlike microbial diversity studies, which benefit from established tools for filtering out chimeric reads [21], phasing applications lack robust methods for chimera removal, largely because the parental haplotypes may be nearly identical and sometimes differing only by the variants being phased [19]. Furthermore, approximately 1.7% of reads in nanopore-based amplicon sequencing may contain post-amplification chimeric elements [22]. Therefore, detecting and minimising chimeric reads is critical for reliable phasing using amplicon-based approaches.
Previous studies have demonstrated that failure to account for chimeric reads can lead to incorrect phasing interpretations [7]. The number of PCR cycles was reduced to 26 to reduce chimera formation, a strategy supported by earlier studies [7, 19, 23] and confirmed by this study, where increasing the cycle number to 28 led to higher chimeric reads. While reducing the number of PCR cycles can lower chimera formation, decreasing below 26 cycles may lead to inadequate yields, specifically for longer amplicons. Thus, 26 cycles represented a balance between minimising chimera formation and ensuring sufficient amplicon material for reliable sequencing across a range of targets. The UltraRun LongRange PCR Kit was observed to generate fewer chimeric reads than Platinum SuperFi II on the same samples, further supporting its selection. This difference may arise from the polymerases used or protocol variations; notably, UltraRun uses a two-step PCR protocol, which has been associated with lower chimera formation rates [24].
Beyond PCR optimisation, the developed bioinformatic pipeline included a method for quantifying chimeric reads based solely on the two variants under phasing. This approach contrasts with other studies that relied on external software or manual review, requiring extra steps for chimeric detection [19, 25]. In the primary sample, chimeric read levels were between 1.79% and 16.12% (median 2.80%), consistent with previously reported figures [7, 19]. The variation in chimeric read proportions observed across amplicons did not show a relationship with amplicon length, suggesting that local sequence context such as repetitive or low-complexity regions may play a role in promoting template switching [26]. While a median rate of 2.80% allows for accurate phasing, the broad range highlights the potential for substantial chimera presence in specific amplicons. This variability underscores the importance of careful, amplicon-specific assessment of chimeric read rates, especially in diagnostic contexts where precision is critical. Notably, the developed pipeline maintained robust and accurate phasing performance even in samples with chimeric read levels as high as 27%, observed in Platinum SuperFi II amplicons (see Table S17).
LR-PCR and nanopore sequencing have been employed for phasing variants [8, 19, 20, 27, 28] and localising those identified in low-mappability regions [29, 30] in previous studies. However, these studies focused on specific genomic regions, such as haplotyping the ABCA4 locus [28], or the localisation of variants within the highly repetitive domains of TTN [29]. This study has developed a robust and reliable workflow adaptable to any genomic region that can be feasibly amplified using LR-PCR.
Notably, previous similar studies have typically used MinION flow cells for multiple amplicons in nanopore-based assays [8, 31, 32], while Flongle flow cells have often been used for single-assay sequencing [27, 33, 34]. Sequencing multiple barcoded amplicons on a single Flongle flow cell was a key feature of our assay design, offering an affordable and practical solution, particularly for smaller diagnostic laboratories.
While the workflow described here is adaptable to larger flow cells such as the MinION, achieving similar cost efficiency to Flongle would require a larger number of samples per run. In a diagnostic setting, accumulating sufficient samples for high-level multiplexing could lead to elevated turnaround times [35]. Flongle-based multiplexing therefore currently offers the most efficient and affordable solution for clinically relevant applications and supports the broader adoption of targeted nanopore sequencing in clinical diagnostics.
The median sequencing depth per amplicon in this study exceeded 3,000x, consistent with the multiplexing of more than eight amplicons per Flongle flow cell being technically feasible. However, variability in ONT flow cell performance relating to active pore count, and significant variation in depth, despite using equimolar quantities of PCR products [36] can affect stability and reliability, which are critical in diagnostic settings. Limiting the number of amplicons per run therefore ensures consistent performance, even when sequencing output is suboptimal. Additionally, as with MinION flow cells, higher multiplexing may delay turnaround time due to the need to batch more samples. Thus, using eight amplicons per run balances cost-efficiency, test robustness, and clinical practicality.
The developed workflow has significant potential to enhance molecular diagnostics. Accurate phasing of variants is crucial for the interpretation of results in autosomal recessive Mendelian conditions, where it is necessary to determine if two heterozygous variants in a gene are in cis or in trans to confirm compound heterozygosity [2, 37]. Misinterpretation of phase can lead to incorrect diagnostic conclusions. By providing direct phasing from proband DNA, this workflow can clarify the clinical significance of co-occurring variants. While long-read WGS or adaptive sampling via nanopore sequencing can be used for variant phasing, these approaches are costly and impractical when the goal is to phase only two variants. Similarly, mRNA or cDNA sequencing for variant phasing has several limitations: it is restricted to coding variants and requires high-quality, full-length cDNA to capture distant variants on the same transcript [38]. Tissue-specific expression and alternative splicing further limit its clinical applicability, particularly when the relevant tissue is inaccessible [39]. Therefore, although mRNA and cDNA sequencing are powerful tools for studying expression and splicing, they are not the most practical strategy for variant phasing.
The ability to reliably call variants in low-mappability regions is of high clinical importance. Many disease-associated genes contain segments with high sequence similarity to other genomic locations (e.g., pseudogenes) or repetitive sequences, which can lead to variants being missed or incorrectly called by standard NGS approaches [40]. The successful validation of variant calling in genes like TUBB2A/TUBB2B, CYP11B1, and HBA1/HBA2 demonstrates the workflow’s utility in these challenging but clinically important regions. The cost-effectiveness and targeted nature of this workflow position it as an ideal “Tier 2” diagnostic test. It can be employed to resolve ambiguous findings from initial WES or WGS, such as phasing variants of uncertain significance (VUS) to aid in their classification, or to confirm variants detected in regions where NGS data quality is suboptimal. The routine availability of such assays could lead to faster, more accurate diagnoses, reduce the number of unresolved cases, and ultimately improve patient care.
While the study demonstrates considerable strengths, certain limitations should be acknowledged. The LR-PCR protocol, despite optimisation and selection of the best-performing kit, did not achieve universal amplification for all attempted targets. For example, primers designed for the CEL gene failed to yield an amplicon, and several other amplicons (CYB, TUBB2A, HBA) required the addition of a PCR additive (Q Solution) for successful amplification. This is consistent with certain genomic regions remaining challenging for LR-PCR amplification, potentially due to extreme GC content, secondary structures, or other sequence-specific features [16]. Such targets might require further bespoke optimisation, alternative primer designs, or different amplification strategies.
The proportion of chimeric reads, while generally low with the optimised protocol, showed variability across amplicons, with some exhibiting higher levels. Although not observed in this study, amplicons that consistently produce a high percentage of chimeric reads could compromise the accuracy of phasing or variant detection if not controlled stringently and bioinformatically flagged. This highlights the importance of establishing clear quality control thresholds for chimeric read proportions for each diagnostic target.
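A per-target quality control threshold of the kind suggested above could be sketched as follows. The threshold values, the target name, and the function `chimeric_qc` are hypothetical and are not taken from the study or its pipeline; each laboratory would need to set its own limits during validation.

```python
# Hypothetical limits; real values would come from assay validation.
DEFAULT_MAX_CHIMERIC = 0.20
TARGET_MAX_CHIMERIC = {"example_amplicon": 0.10}  # hypothetical target name

def chimeric_qc(target, chimeric_reads, total_reads):
    """Return the chimeric read fraction and whether it passes the target's limit."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    limit = TARGET_MAX_CHIMERIC.get(target, DEFAULT_MAX_CHIMERIC)
    fraction = chimeric_reads / total_reads
    return {"target": target, "chimeric_fraction": fraction,
            "limit": limit, "passed": fraction <= limit}

# Example: 28 chimeric reads out of 1000 for a target using the default limit.
print(chimeric_qc("another_amplicon", 28, 1000))
```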
This workflow proved robust for detecting SNVs and small indels; however, it was not designed for other variant types, such as structural variants or short tandem repeats (STRs), which remain outside its scope. Moreover, ONT sequencing has a higher error rate for indels than for SNVs [41], warranting extra caution when interpreting such variants. Detection of variants located within homopolymer regions is also challenging using ONT sequencing [42], and such variants are likewise outside the scope of this workflow. Furthermore, although the workflow achieved a precision and sensitivity of 1 for variant calling in the selected genes containing low-mappability regions, it is important to note that this evaluation was based on partial genomic regions covering only 64 variants. In larger datasets involving more genes and samples, a decrease in precision and sensitivity is expected [43].
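To illustrate why a perfect score on a small benchmark should be read cautiously, the sketch below computes an exact (Clopper-Pearson) lower confidence bound for the case where every trial succeeds. It assumes, for illustration only, that the 64 variants can be treated as 64 independent benchmark variants that were all detected; the function name is hypothetical and this calculation is not part of the study.

```python
def lower_bound_all_successes(n, confidence=0.95):
    """Two-sided Clopper-Pearson lower bound when all n trials succeed.

    With x = n successes the exact lower limit has the closed form
    (alpha / 2) ** (1 / n), where alpha = 1 - confidence.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    alpha = 1.0 - confidence
    return (alpha / 2.0) ** (1.0 / n)

# Example: 64 of 64 benchmark variants detected (assumed interpretation).
print(round(lower_bound_all_successes(64), 3))
```

Under these assumptions the lower bound is roughly 0.94, so an observed sensitivity of 1 on 64 variants remains compatible with a true sensitivity several percentage points lower.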
The variable performance of ONT flow cells can introduce variability across sequencing batches, potentially impacting the stability and reliability of the assay. Although our workflow was designed to mitigate this through specific wet-lab (e.g., limiting amplicon numbers) and bioinformatic strategies, this intrinsic limitation remains. The bioinformatic pipeline assists in managing this by flagging suboptimal results, such as instances where fewer than 50 high-quality reads are available to support phase interpretation or variant localisation.
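The read-count flag mentioned above might look like the following minimal sketch. The threshold of 50 reads comes from the text; the use of mapping quality to define a "high-quality" read, the cut-off of 20, and the function name `support_flag` are assumptions for illustration and may differ from the pipeline's actual criteria.

```python
MIN_SUPPORTING_READS = 50  # threshold stated in the text
MIN_MAPQ = 20              # hypothetical definition of a high-quality read

def support_flag(read_mapqs, min_reads=MIN_SUPPORTING_READS, min_mapq=MIN_MAPQ):
    """Count reads meeting the quality cut-off and flag results with too few."""
    n_high_quality = sum(1 for q in read_mapqs if q >= min_mapq)
    return {"high_quality_reads": n_high_quality,
            "flagged": n_high_quality < min_reads}

# Example: 49 well-mapped reads plus 100 poorly mapped reads is flagged.
print(support_flag([60] * 49 + [5] * 100))
```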
Conclusions
This study has developed and validated a robust, accurate, and cost-effective targeted long-read sequencing workflow using LR-PCR and ONT Flongle technology. By addressing the challenges of variant phasing and analysis of low-mappability genomic regions effectively, this workflow offers a significant advancement over conventional short-read NGS methods for specific, complex diagnostic questions. The comprehensive optimisation from sample preparation through to bioinformatic analysis provides a practical solution with the potential to enhance molecular diagnostics, clarify ambiguous genetic findings, and ultimately improve patient care by enabling more precise genetic diagnoses where current methodologies may fall short.
Supplementary Information
Authors’ contributions
JJ, TR and FH conceptualised and designed the study. JJ and CR performed the PCR and sequencing experiments. JJ, FZ, and YZ analysed the data and developed the bioinformatic pipeline. JJ wrote the draft of the manuscript, and TR, MB, SF, FH, FZ, YZ, and CR reviewed and edited the main text. TR and MB provided the funding and resources. All authors read and approved the final version of the manuscript.
Funding
This study was supported by the University of New South Wales Sydney and the Medical Research Future Fund PreGen program (GHFMPACI000006).
Data availability
The sequencing data used in this study were generated from the HG002 reference sample, which is publicly available through the Genome in a Bottle (GIAB) consortium and can be accessed at https://www.nist.gov/programs-projects/genome-bottle.
All documentation, source code, and example data for the bioinformatics pipeline developed in this study are available in the associated GitHub repository: https://github.com/j-jamshidi/ONT_amp_phase.
Declarations
Ethics approval and consent to participate
This project was reviewed by the South Eastern Sydney Local Health District Human Research Ethics Committee (HREC) at a meeting of its Low and Negligible Risk (LNR) Committee, which determined that the project did not involve any ethical risks and was exempt from full ethical review, in accordance with the NHMRC National Statement (updated 2023), Sect. 5.1.22, the NHMRC Ethical Considerations in Quality Assurance and Evaluation Activities (2014) guidance and NSW Health Guideline GL2007_020 Human Research Ethics Committees - Quality Improvement and Ethics Review: A Practice Guide for NSW. The study was conducted in accordance with the Declaration of Helsinki. This study did not include human participants; only publicly available, de-identified human genomic data (HG002 sample from the Genome in a Bottle Consortium) were used, and no informed consent was required.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Javad Jamshidi, Email: j.jamshidi@neura.edu.au.
Tony Roscioli, Email: tony.roscioli@health.nsw.gov.au.
References
- 1. Bonfiglio F, Legati A, Lasorsa VA, Palombo F, De Riso G, Isidori F, et al. Best practices for germline variant and DNA methylation analysis of second- and third-generation sequencing data. Hum Genomics. 2024;18(1):120.
- 2. Tewhey R, Bansal V, Torkamani A, Topol EJ, Schork NJ. The importance of phase information for human genomics. Nat Rev Genet. 2011;12(3):215–23.
- 3. Rojahn S, Hambuch T, Adrian J, Gafni E, Gileta A, Hatchell H, et al. Scalable detection of technically challenging variants through modified next-generation sequencing. Mol Genet Genomic Med. 2022;10(12):e2072.
- 4. Mandelker D, Schmidt RJ, Ankala A, McDonald Gibson K, Bowser M, Sharma H, et al. Navigating highly homologous genes in a molecular diagnostic setting: a resource for clinical next-generation sequencing. Genet Med. 2016;18(12):1282–9.
- 5. Warburton PE, Sebra RP. Long-read DNA sequencing: recent advances and remaining challenges. Annu Rev Genomics Hum Genet. 2023;24:109–32.
- 6. Espinosa E, Bautista R, Larrosa R, Plata O. Advancements in long-read genome sequencing technologies and algorithms. Genomics. 2024;116(3):110842.
- 7. Laver TW, Caswell RC, Moore KA, Poschmann J, Johnson MB, Owens MM, et al. Pitfalls of haplotype phasing from amplicon-based long-read sequencing. Sci Rep. 2016;6(1):21746.
- 8. Maestri S, Maturo MG, Cosentino E, Marcolungo L, Iadarola B, Fortunati E, et al. A long-read sequencing approach for direct haplotype phasing in clinical settings. Int J Mol Sci. 2020;21(23):9177.
- 9. Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinforma Oxf Engl. 2018;34(18):3094–100.
- 10. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The Sequence Alignment/Map format and SAMtools. Bioinforma Oxf Engl. 2009;25(16):2078–9.
- 11. Zheng Z, Li S, Su J, Leung AWS, Lam TW, Luo R. Symphonizing pileup and full-alignment for deep learning-based long-read variant calling. Nat Comput Sci. 2022;2(12):797–803.
- 12. Martin M, Ebert P, Marschall T. Read-based phasing and analysis of phased variants with WhatsHap. Methods Mol Biol Clifton NJ. 2023;2590:127–38.
- 13. Edge P, Bafna V, Bansal V. HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies. Genome Res. 2016;gr.213462.116.
- 14. Robinson JT, Thorvaldsdóttir H, Wenger AM, Zehir A, Mesirov JP. Variant review with the Integrative Genomics Viewer. Cancer Res. 2017;77(21):e31–4.
- 15. Krusche P, Trigg L, Boutros PC, Mason CE, De La Vega FM, Moore BL, et al. Best practices for benchmarking germline small variant calls in human genomes. Nat Biotechnol. 2019;37(5):555–60.
- 16. Chowdhary MR, Gupta N. Long-range PCR in next generation sequencing: a low cost approach for large & complex genes. Indian J Med Res. 2023;157(6):591–2.
- 17. Kee PS, Karunanathie H, Maggo SDS, Kennedy MA, Chua EW. Long-range polymerase chain reaction. In: Domingues L, editor. PCR: Methods and Protocols. New York, NY: Springer US; 2023 [cited 2025 July 1]. pp. 181–92. Available from: 10.1007/978-1-0716-3358-8_15.
- 18. Maggi J, Koller S, Bähr L, Feil S, Kivrak Pfiffner F, Hanson JVM, et al. Long-range PCR-based NGS applications to diagnose Mendelian retinal diseases. Int J Mol Sci. 2021;22(4):1508.
- 19. McClinton B, Watson CM, Crinnion LA, McKibbin M, Ali M, Inglehearn CF, et al. Haplotyping using long-range PCR and nanopore sequencing to phase variants: lessons learned from the ABCA4 locus. Lab Invest. 2023;103(8):100160.
- 20. Gueuning M, Thun GA, Wittig M, Galati AL, Meyer S, Trost N, et al. Haplotype sequence collection of ABO blood group alleles by long-read sequencing reveals putative A1-diagnostic variants. Blood Adv. 2023;7(6):878–92.
- 21. Callahan BJ, Grinevich D, Thakur S, Balamotis MA, Yehezkel TB. Ultra-accurate microbial amplicon sequencing with synthetic long reads. Microbiome. 2021;9(1):130.
- 22. White R, Pellefigues C, Ronchese F, Lamiable O, Eccles D. Investigation of chimeric reads using the MinION. F1000Research. 2017;6:631.
- 23. Namias A, Sahlin K, Makoundou P, Bonnici I, Sicard M, Belkhir K, et al. Nanopore sequencing of PCR products enables multicopy gene family reconstruction. Comput Struct Biotechnol J. 2023;21:3656–64.
- 24. Qin Y, Wu L, Zhang Q, Wen C, Van Nostrand JD, Ning D, et al. Effects of error, chimera, bias, and GC content on the accuracy of amplicon sequencing. mSystems. 2023;8(6):e01025–23.
- 25. Kroll F, Dimitriadis A, Campbell T, Darwent L, Collinge J, Mead S, et al. Prion protein gene mutation detection using long-read nanopore sequencing. Sci Rep. 2022;12(1):8284.
- 26. Potapov V, Ong JL. Examining sources of error in PCR by single-molecule sequencing. PLoS ONE. 2017;12(1):e0169774.
- 27. Durkie M, Watson CM, Winship P, Hogg AC, Nyanhete R, Cooley S, et al. The common PKD1 p.(Ile3167Phe) variant is hypomorphic and associated with very early onset, biallelic polycystic kidney disease. Hum Mutat. 2023;2023(1):5597005.
- 28. Mc Clinton B, Crinnion LA, Ali M, Inglehearn C, Watson CM, Toomes C. Using Oxford Nanopore long-range sequencing to phase ABCA4. Invest Ophthalmol Vis Sci. 2021;62(8):1541.
- 29. Perrin A, Van Goethem C, Thèze C, Puechberty J, Guignard T, Lecardonnel B, et al. Long-reads sequencing strategy to localize variants in TTN repeated domains. J Mol Diagn. 2022;24(7):719–26.
- 30. Watson CM, Dean P, Camm N, Bates J, Carr IM, Gardiner CA, et al. Long-read nanopore sequencing resolves a TMEM231 gene conversion event causing Meckel–Gruber syndrome. Hum Mutat. 2020;41(2):525–31.
- 31. Holt GS, Batty LE, Alobaidi BKS, Smith HE, Oud MS, Ramos L, et al. Phasing of de novo mutations using a scaled-up multiple amplicon long-read sequencing approach. Hum Mutat. 2022;43(11):1545–56.
- 32. Stockton JD, Nieto T, Wroe E, Poles A, Inston N, Briggs D, et al. Rapid, highly accurate and cost-effective open-source simultaneous complete HLA typing and phasing of class I and II alleles using nanopore sequencing. HLA. 2020;96(2):163–78.
- 33. Jeck WR, Iafrate AJ, Nardi V. Nanopore Flongle sequencing as a rapid, single-specimen clinical test for fusion detection. J Mol Diagn. 2021;23(5):630–6.
- 34. Watson CM, Holliday DL, Crinnion LA, Bonthron DT. Long-read nanopore DNA sequencing can resolve complex intragenic duplication/deletion variants, providing information to enable preimplantation genetic diagnosis. Prenat Diagn. 2022;42(2):226–32.
- 35. Tafess K, Ng TTL, Lao HY, Leung KSS, Tam KKG, Rajwani R, et al. Targeted-sequencing workflows for comprehensive drug resistance profiling of Mycobacterium tuberculosis cultures using two commercial sequencing platforms: comparison of analytical and diagnostic performance, turnaround time, and cost. Clin Chem. 2020;66(6):809–20.
- 36. Whitford W, Hawkins V, Moodley KS, Grant MJ, Lehnert K, Snell RG, et al. Proof of concept for multiplex amplicon sequencing for mutation identification using the MinION nanopore sequencer. Sci Rep. 2022;12(1):8572.
- 37. Guo MH, Francioli LC, Stenton SL, Goodrich JK, Watts NA, Singer-Berk M, et al. Inferring compound heterozygosity from large-scale exome sequencing data. Nat Genet. 2024;56(1):152–61.
- 38. Ura H, Togi S, Niida Y. Targeted double-stranded cDNA sequencing-based phase analysis to identify compound heterozygous mutations and differential allelic expression. Biology. 2021;10(4):256.
- 39. Zhao S, Macakova K, Sinson JC, Dai H, Rosenfeld J, Zapata GE, et al. Clinical validation of RNA sequencing for Mendelian disorder diagnostics. Am J Hum Genet. 2025;112(4):779–92.
- 40. Claes KBM, Rosseel T, De Leeneer K. Dealing with pseudogenes in molecular diagnostics in the next generation sequencing era. Methods Mol Biol Clifton NJ. 2021;2324:363–81.
- 41. Santos R, Lee H, Williams A, Baffour-Kyei A, Lee SH, Troakes C, et al. Investigating the performance of Oxford Nanopore long-read sequencing with respect to Illumina microarrays and short-read sequencing. Int J Mol Sci. 2025;26(10):4492.
- 42. Olson ND, Wagner J, Dwarshuis N, Miga KH, Sedlazeck FJ, Salit M, et al. Variant calling and benchmarking in an era of complete human genome sequences. Nat Rev Genet. 2023;24(7):464–83.
- 43. Nyaga DM, Tsai P, Gebbie C, Phua HH, Yap P, Le Quesne Stabej P, et al. Benchmarking nanopore sequencing and rapid genomics feasibility: validation at a quaternary hospital in New Zealand. NPJ Genomic Med. 2024;9:57.