Skip to main content
PeerJ logoLink to PeerJ
. 2023 Dec 18;11:e16515. doi: 10.7717/peerj.16515

Benchmarking long-read genome sequence alignment tools for human genomics applications

Jonathan LoTempio 1,2, Emmanuele Delot 3,4, Eric Vilain 1,2,
Editor: Xin Zhang
PMCID: PMC10734412  PMID: 38130927

Abstract

Background

The utility of long-read genome sequencing platforms has been shown in many fields including whole genome assembly, metagenomics, and amplicon sequencing. Less clear is the applicability of long reads to reference-guided human genomics, which is the foundation of genomic medicine. Here, we benchmark available platform-agnostic alignment tools on datasets from nanopore and single-molecule real-time platforms to understand their suitability in producing a genome representation.

Results

For this study, we leveraged publicly-available data from sample NA12878 generated on Oxford Nanopore and sample NA24385 on Pacific Biosciences platforms. We employed state of the art sequence alignment tools including GraphMap2, long-read aligner (LRA), Minimap2, CoNvex Gap-cost alignMents for Long Reads (NGMLR), and Winnowmap2. Minimap2 and Winnowmap2 were computationally lightweight enough for use at scale, while GraphMap2 was not. NGMLR took a long time and required many resources, but produced alignments each time. LRA was fast, but only worked on Pacific Biosciences data. Each tool widely disagreed on which reads to leave unaligned, affecting the end genome coverage and the number of discoverable breakpoints. No alignment tool independently resolved all large structural variants (1,001–100,000 base pairs) present in the Database of Genome Variants (DGV) for sample NA12878 or the truthset for NA24385.

Conclusions

These results suggest a combined approach is needed for LRS alignments for human genomics. Specifically, leveraging alignments from three tools will be more effective in generating a complete picture of genomic variability. It should be best practice to use an analysis pipeline that generates alignments with both Minimap2 and Winnowmap2 as they are lightweight and yield different views of the genome. Depending on the question at hand, the data available, and the time constraints, NGMLR and LRA are good options for a third tool. If computational resources and time are not a factor for a given case or experiment, NGMLR will provide another view, and another chance to resolve a case. LRA, while fast, did not work on the nanopore data for our cluster, but PacBio results were promising in that those computations completed faster than Minimap2. Due to its significant burden on computational resources and slow run time, Graphmap2 is not an ideal tool for exploration of a whole human genome generated on a long-read sequencing platform.

Keywords: Long-read sequence, Genome alignment, Benchmark, Genomic medicine, Short-read sequence

Introduction

The diverse ecosystem of DNA preparation, sequencing, and mapping technologies is capable of generating computational representations of genome biology through different chemistries and processes on different sequencing and mapping platforms (Sanger, Nicklen & Coulson, 1977; Fuller et al., 2009; Branton et al., 2008; Ardui et al., 2018; Levy-Sakin et al., 2019). This has resulted in many different approaches for making use of the data of different sources, with algorithms and the tools built upon them used across platforms to differing success. Here we briefly outline the differences between these platforms and analysis strategies so that we may consider an important gap in applying the newest technologies to the problem of human genomics.

Sequencing-by-synthesis platforms produce highly accurate sequence data in the form of short-reads (<300 base pairs, bp) (Modi et al., 2021) from high molecular weight DNA inputs. In contrast, single-molecule real-time platforms can produce highly accurate reads through circular consensus sequencing (CCS) on molecules >10 kilobase pairs (kbp) (Wenger et al., 2019) while nanopore-based sequencing (Jain et al., 2018a; Jain et al., 2018b) and nanochannel-based mapping platforms (Formenti et al., 2019) can sequence or visualize megabase-length DNA molecules (Pollard et al., 2018). Each of these platforms has different utilities and available tools, which contributes to the diversity of projects enabled by these technologies (Ho, Urban & Mills, 2020; Mahmoud et al., 2019).

Optical genome mapping (developed by Bionano Genomics, San Diego, CA, USA) excels at detecting structural variants (SV), such as balanced translocations and deletions/insertions in the 1 kb to 1 Mb range, and its clinical utility was demonstrated in Duchenne (DMD) or facioscapulohumeral (FSHD) muscular dystrophies (Barseghyan et al., 2017; Sharim et al., 2019), and cancer (Neveling et al., 2021; Talsania et al., 2022; Bornhorst et al., 2023). But long-read sequence (LRS), developed by Pacific Biosciences (Menlo Park, CA, USA) and Oxford Nanopore Technologies (Oxford, UK) among others, has been shown to be the most appropriate technology to detect smaller variants in the 50 bp-1 kb range (Chaisson et al., 2019). This is especially true in repetitive regions of the genome and should therefore bring new diagnosis potential for diseases of trinucleotide repeat expansion and genome instability such as Huntington’s disease and myotonic dystrophies (Kantartzis et al., 2012; Santoro et al., 2017; Liu, Gao & Wang, 2017; Mitsuhashi et al., 2019; Mantere, Kersten & Hoischen, 2019), as well as for discovery of SVs affecting regulatory regions of known genes. Fulfilling the promise of LRS for medical genetics requires understanding the tools that power the technology and their respective strengths.

LRS platforms have been further shown to be capable of generating single, phaseable reads spanning repetitive or complex genomic regions that remain unresolved in all (Wenger et al., 2019; Jain et al., 2018a; Jain et al., 2018b; Huddleston et al., 2017; Ebbert et al., 2019) but one current human genome assembly (Nurk et al., 2022). This has implications for resolution of highly similar pseudogenes and large SV, even in low-complexity genomic regions, or from otherwise healthy diploid samples. Read lengths in the tens of kbp have further allowed for end-to-end viral genome sequencing (Walker et al., 2020), while read lengths in the millions of bp have the potential to span whole mammalian chromosomes (Jain et al., 2018a; Jain et al., 2018b; Miga et al., 2020). This potential for more contiguous de novo human assembly has led to studies to specifically improve and benchmark basecalling and polishing tools (Wick, Judd & Holt, 2019), as well as assembly tools (Koren et al., 2018; Cheng et al., 2021; Kolmogorov et al., 2019) that can be applied to human genomics.

For reference genome-guided experiments, LRS has proven useful for amplicon sequencing in cancer detection (Norris et al., 2016) and for metagenomics (Deshpande et al., 2019; Cuscó et al., 2019; Sanderson et al., 2018), but the field is still in early days for assessment of variation in high-coverage human whole genome sequence. There has been a demonstration of the added value of LRS in resolving SV has been demonstrated by efforts aggregating callsets across platforms and technologies to deeply characterize a genome (Chaisson et al., 2019). These experiments are concerned mainly with understanding the full scope of the architecture of a genome, rather than contrasting differences between results of one alignment tool or another. Each published tool showcases its own utility and strengths in their initial publications, especially in terms of the mapping quality of reads to the reference genome and precision and sensitivity to preserve known variants in synthetic or downsampled genomic data. However, there have been no studies that specifically benchmark LRS alignment platforms and tools for reference-guided experiments.

To address this gap, we benchmarked the most recent LRS alignment tools with the datasets generated from the Joint Initiative for Metrology in Biology’s Genome in a Bottle Initiative (GIAB), specifically samples, NA12878 sequenced with nanopore technology and NA24385 sequenced with CCS (branded HiFi) technology. These were generated with Oxford 9.4 pore chemistry and with Guppy5 basecalling performed by the Whole Genome Sequencing Consortium (Jain et al., 2018a; Jain et al., 2018b) and with Pacific Biosciences RSII SMRT CCS technology by PacBio and the National Institute of Standards and Technology (Zook et al., 2019). All alignments were performed on GRCh38, rather than the more complete CHM13 or more diverse pangenome graphs because there are not yet high-quality GIAB-like reference materials or variant truthsets that allow for this sort of benchmarking against these cutting-edge reference materials.

We compared computational performance (peak memory utilization, central processing unit (CPU) time, file size/storage requirements), genome depth and basepair coverage, and quantified the reads left unaligned in any given experiment. We have limited this study to tools that are platform agnostic and function on our cluster. Since the resolution of large SVs is a key application of this technology, and also allows comparison of the differences in genome alignments in an aggregate way, we ran the SV-calling tool Sniffles to highlight differences in breakpoint location in each binary alignment map (BAM) file (Sedlazeck et al., 2018). Taken together, these experiments present a comprehensive view of differences in the products of LRS whole-genome alignment pipelines.

Materials & Methods

Portions of this text were previously published as part of a preprint (https://www.biorxiv.org/content/10.1101/2021.07.09.451840v3.full) (LoTempio, Délot & Vilain, 2023).

Tool selection criteria

Starting with tools that were recommended by the developers of each platform, we examined the tools that were cited or benchmarked against new, platform-specific tools. We also used the website Long-Read Tools (https://long-read-tools.org/), searching their database with the filters “nanopore”, “pacbio”, and “alignment”. Since this yielded many software tools, we dove deeper and excluded any tools not designed for whole genome experiments. Tools which passed this test were then assessed for their suitability for use with nanopore and SMRT data and whether they produced a SAM/BAM file for analysis. Since tools can be regularly updated or out-versioned, we wanted to use only the most up to date software at the time of analysis.

In our examination of structural variant calling pipelines, we elected to limit our study to the variant caller sniffles due to the fact that the output files were most descriptive. Specifically, Sniffles outputs VCFs which call SVs as indels, translocations, duplications, inversions, and more classifications. This increases its utility for comparison to reference truthsets.

Data

Main experiments

Reference genome: GRCh38 was accessed on March 22, 2022 from NCBI (GRCh38_no_alt_analysis_set).

Sequence data

No new or original sequence data from new or existing samples were generated here. Instead, we leveraged data generated by third party consortia on well-characterized samples. These data include nanopore sequence data generated on NA12878 from the WGS Consortium, rel7 guppy 5 basecalls was accessed on March 22, 2022 (NA12878 rel7).

We also used the 15kb insert size SMRT CCS data from sample NA24385 was accessed from the human pangenomics consortium GitHub on March 22, 2022 (NA24385 15kb insert).

While no DOI exist for these files, the citations provide the most stable URLs available (PacBio, 2021).

SV truthsets

Structural variant truthsets for NA12878, updated February 25, 2020, from DGV were accessed on March 31, 2022 (NA12878 DGV truthset).

Structural variant truthsets from Pacific Biosciences and the GIAB consortium were accessed on May 20, 2022 (PacBio GIAB SV Truthset).

Absolute coordinate systems within a reference genome have allowed for very accurate genome mapping and analysis within that reference. However, reference sets are tied to these absolute coordinate systems and require liftover and curation. Our study is subject to the limitation that these variants were called with alignments to GRCh37, but we used the more advanced GRCh38. Accordingly, the absolute coordinates of the location of breakpoints differs across builds. By focusing our study on the size of variants within 1,000 and 10,000 bp windows, rather than their relative position within a set of reference genome coordinates, we are confident that this is a minor limitation.

Pilot experiments

Initial pilot experiments were performed on older reference builds and datasets. These can be found as Supplement S4 and may be useful to investigators who have LRS data from the previous decade.

Reference genome

GRCh38.p12 was accessed and downloaded on April 8, 2019 (GRCh38.p12).

Sequence data

Nanopore sequence data were accessed on April 8, 2019, version rel5-guppy-0.3.0-chunk10k.fastq, from AWS Open Data as made available by the Whole Genome Consortium. SMRT sequence data were accessed on May 28, 2019, version sorted_final_merged.bam from the NCBI FTP (NA12878 rel 5; NA12878 SMRT Mt Sinai).

Database of genome variants

DGV data were accessed on January 30, 2020 (Database of Genomic Variants, DGV). The full set of variants was reduced to 11,042 variants confirmed to be present in NA12878. A total of 14 of these variants were excluded as they were called on contigs present out of the main reference assembly contigs.

Hardware configuration

Computations were performed on the George Washington University High Performance Computer Center’s Pegasus Cluster on SLURM-managed default queue compute nodes with the following configuration: Dell PowerEdge R740 server with Dual 20-Core 3.70 GHz Intel Xeon Gold 6148 processors, 192 GB of 2666 MHz DDR4 ECC Register DRAM, 800 GB SSD onboard storage (used for boot and local scratch space), and Mellanox EDR InfiniBand controller to 100 GB fabric.

Whole genome alignment tool benchmarking and analysis

The following tool versions were used in the work presented in the main tables and figures.

1. Long-read aligner (LRA) v1.3.3. (Ren & Chaisson, 2021)

2. GraphMap2 v0.6.3 (Marić et al., 2019)

3. Minimap2 v2.24 (Li, 2018)

4. CoNvex Gap-cost alignMents for Long Reads (NGMLR) v0.2.7 (Sedlazeck et al., 2018)

5. Winnowmap v2.0.3 (Jain et al., 2022)

The versions of alignment tools used in the preliminary experiments of this project available in S4 included:

1. GraphMap v0.5.2 (Sović et al., 2016)

2. Minimap2 v2.16 (Li, 2018)

3. NGMLR v0.2.7 (Sedlazeck et al., 2018)

All tools were run with recommended parameters and were flagged for nanopore or SMRT data based on the requirements of that experiment. Where multithreading was available, -t, or equivalent was set to 40, the max for our cluster. LRA, Minimap2, NGMLR, and Winnowmap2 all have specific presets for SMRT and nanopore-based data while Graphmap2 uses the same parameterization for all LRS platforms, with an option for alternative defaults to handle Illumina data. Its default parameters are optimized for accuracy and through graph-based handling of reads can address complex genomic regions and users have the option of modifying parameters to decrease accuracy.

LRA uses the same minimizers as seeds for both sequencing platforms, but decreases the density of minimizers in the genome by nearly 10-fold, which are then compared to minimizers from the reference, resulting in a set of query-reference anchors. Anchors are clustered based on read accuracy, with lower accuracy nanopore reads given a minimum of three clusters per read, and higher accuracy SMRT-CCS reads given a minimum of 10 clusters per read.

Minimap2, upon which Winnowmap2 builds, uses homopolymer-compressed minimizers in the seeding step of its algorithm. This is to account for errors in SMRT movie data which can struggle with homopolymer runs. For example, GATTAACCA becomes GATACA through homopolymer minimization. The nanopore default parameterization uses ordinary minimizers as seeds because in development, they saw no increased benefit to homopolymer compression, even though the technology has similar limitations to SMRT. Winnowmap2 builds on this by down-weighting frequently occurring minimizers to minimize masking seeds in complex genomic regions including tandem repeats.

For NGMLR generally, reads are handled in a way which is SV-aware and accounts for small (10 bp) indels or split reads over larger indels. Reads which can be mapped linearly are, and the remaining reads are handled through splitting them over SV. The nanopore preset further customizes parameters to address the characteristic high error rates, including substitutions, insertions, and deletions, inherent to nanopore sequencing. It achieves this by lowering match/mismatch scores, increasing gap penalties, and emphasizing sensitivity, making it well-suited for applications like structural variant analysis. Conversely, the SMRT preset retains higher match/mismatch scores to prioritize base accuracy, suitable for PacBio data known for lower substitution errors but higher indel error rates.

Computational metrics were printed from SLURM job records with seff and sacct with flags–format JobID, JobName, Elapsed, NCPUs, TotalCPU, CPUTime, ReqMem, MaxRSS, MaxDiskRead, MaxDiskWrite, State, ExitCode. Samtools 1.15.1 was used for all alignment manipulations, alignment read depth coverage calculations, and to extract unmapped reads. Samtools view -f 4 was used to generate bamstats files and Python Venn was used to compare the readnames across unmapped read files to assess the degree of overlap of unmapped reads across these subsetted alignment files (Li et al., 2009; Danecek et al., 2021). R 3.5.2 version Eggshell Igloo with the tidyverse packages were used for preliminary experiments on depreciated data shown in S4. Genome coverage was calculated with samtools coverage. Binary conversion values were used for bytes (1,073,741,824) and kilobytes (1,048,576) to gigabytes.

Data reshaping and visualization

As data were integrated across multiple sources and formats, they needed to be reshaped for comparison and visualization. For example, the multiple separators found in VCFs (commas and semicolons) are not directly usable in python pandas. Relevant dataframes from genome alignment files were reshaped with shell scripts and Python scripts whose methodology and key intermediate files are available on our GitHub.

Comparison of breakpoints

Structural variants (SV) were called with sniffles/1.0.11 with default parameters (Sedlazeck et al., 2018). Only variants called on the main GRCh38 assembly were included. Python was used to bin and graph structural variants by SV Type. Intermediary files and scripts are available at our GitHub, archived stably here: https://zenodo.org/badge/latestdoi/429362081.

Data from Sniffles VCFs and from DGV were subsetted by comparable fields including SV Type, SV Length, and Chromosome since a common format was unavailable for direct comparison. Figures were made with seaborn and matplotlib as well as Microsoft Excel (Waskom, 2021; Hunter, 2007).

Results

Rigorously annotated variants were drawn from the Database of Genomic Variants (DGV) (MacDonald et al., 2014) for NA12878 and from a curated public repository for NA24385 (PacBio, 2021).

We present our results in four sections:

(1) the tools that were included,

(2) computational performance and benchmarking,

(3) an analysis of aligned and unaligned reads,

(4) an analysis of structural variation present in each alignment compared to a baseline.

Tools that passed the inclusion/exclusion criteria

Following a literature review of available alignment tools and search of Long-Read-Tools (Amarasinghe, Ritchie & Gouil, 2021), we established a set of inclusion criteria (see Material and Methods), which accounted for both types of LRS data (a tool must be able to handle both nanopore and SMRT reads), as well as the state of the field in terms of software updates (must not have been superseded by another tool). All relevant tools and their reason for exclusion are outlined in Table 1, with all tools annotated as platform agnostic and relevant to genome alignment included in Supplemental S1. Five alignment tools, GraphMap2, LRA, minimap2, NGMLR, and Winnowmap2 were included in this study (Marić et al., 2019; Li, 2018; Sedlazeck et al., 2018; Jain et al., 2022; Ren & Chaisson, 2021).

Table 1. Long-read genome sequence alignment tools.

The tools included in the study are highlighted in this study. Further tools from Long-read-tools.org are available in S1.

Tool Year PMID DOI Inclusion If no, why?
BLASR 2012 22988817 10.1186/1471-2105-13-238 No Designed only for SMRT seq
BWA-MEM 2011 NA arXiv:1303.3997 No Superseded by Minimap2
GraphMap 2016 27079541 10.1038/ncomms11307 No Superseded by GraphMap2
GraphMap2 2019 NA 10.1101/720458 Yes  
Kart 2019 31057068 10.1142/S0219720019500082 No Designed only for SMRT, SBS
LAMSA 2016 27667793 10.1093/bioinformatics/btw594 No Depreciated dependencies resulting in seg fault
LAST 2011 21209072 10.1101/gr.113985.110 No Not parameterized for the nuances of current platforms
lordFAST 2018 30561550 10.1093/bioinformatics/bty544 No Runs resulted in partial completion and a seg fault
LRA 2021 34153026 10.1371/journal.pcbi.1009078 Yes  
Mashmap2 2017 30423094 10.1093/bioinformatics/bty597 No Does not use SAM format needed for variant caller
MECAT 2017 28945707 10.1038/nmeth.4432 No Superseded by Mecat2
MECAT2 2017 28945707 10.1038/nmeth.4432 No Designed only for SMRT seq
Meta-aligner 2017 28231760 10.1186/s12859-017-1518-y No Designed only for SMRT seq
Minimap2 2018 29750242 10.1093/bioinformatics/bty191 Yes  
NGMLR 2018 29713083 10.1038/s41592-018-0001-7 Yes  
rHAT 2015 26568628 10.1093/bioinformatics/btv662 No Designed only for SMRT seq
S-ConLSH 2021 33573603 10.1186/s12859-020-03918-3 No Designed only for SMRT seq
smsMap 2020 32753028 10.1186/s12859-020-03698-w No Does not use SAM format needed for variant caller
Winnowmap 2020 32657365 10.1093/bioinformatics/btaa435 No Superseded by Winnowmap2
Winnowmap2 2022 35365778 10.1101/2020.11.01.363887 Yes  
YAHA 2012 22829624 10.1093/bioinformatics/bts456 No Not parameterized for the nuances of current platforms

Sixteen tools were excluded because they were superseded by other tools, produced an alignment file in a non-SAM format, were designed for only one type of input data, or did not work on our cluster (Kiełbasa et al., 2011; Li, 2013; Chaisson & Tesler, 2012; Faust & Hall, 2012; Liu et al., 2016; Sović et al., 2016; Liu, Gao & Wang, 2017; Jain et al., 2018a; Jain et al., 2018b; Nashta-Ali et al., 2017; Xiao et al., 2017; Haghshenas, Sahinalp & Hach, 2019; Li, 2018; Sedlazeck et al., 2018; Marić et al., 2019; Kumar, Agarwal & Ranvijay, 2019; Wei, Zhang & Liu, 2020; Jain et al., 2020; Jain et al., 2022; Ren & Chaisson, 2021; Chakraborty, Morgenstern & Bandyopadhyay, 2021).

We excluded LAST and YAHA as they were not parameterized for the nuances of current sequencing platforms, i.e., they did not have default settings that took into account long reads and low error rates because their success would require more careful consideration of research questions, rather than exploration. smsMap produced an output that would require post-processing to put it into a SAM compatible format, and since we did not aim to create new software here, we excluded it. LAMSA, lordFAST and were run, but resulted in segmentation faults (Haghshenas, Sahinalp & Hach, 2019; Liu, Gao & Wang, 2017). LAMSA was built to call precompiled GEM3 libraries, with which our current compute architecture was incompatible, while lordFast produced an undiagnosable error log (Marco-Sola et al., 2020). The other 11 tools, outlined in Table 1, were excluded because they were only designed to handle data from one LRS platform.

Computational performance and benchmarking

Computations were run three times on a single node allocated fully on our university’s high-performance compute cluster (HPC), which includes 40 CPUs with a configuration described in the methods. All tools were run with default parameters to reflect typical use in exploratory studies. High-level computational benchmarks can be found in Table 2. Full reports on each run can be examined in Supplemental S2, while a composite table of the output of samtools coverage is available as Supplemental S3.

Table 2. Benchmarking metrics.

Each of the three runs for each tool is shown in this table. All values are taken directly from slurm’s sacct and seff features. All metrics are intuitive except for CPU efficiency, which is a measure of idle CPUs to CPU time over wall clock time.

Platform Tool Wall clock time
(D-H:M:S)
CPU time
(D-H:M:S)
CPU
efficiency
Memory utilized
(Gb)
Nanopore Graphmap2 14-00:00:04 560-00:02:40 65.25% 90.19
14-00:00:17 560-00:11:20 63.34% 94.58
14-00:00:15 560-00:10:00 64.23% 89.68
Minimap2 1-11:00:08 58-08:05:20 7.51% 12.43
1-10:48:18 58-00:12:00 7.51% 12.80
1-11:07:00 58-12:40:00 7.51% 12.78
NGMLR 10-00:00:03 400-00:02:00 97.70% 35.86
10-00:00:10 400-00:06:40 99.56% 34.38
9-14:55:56 384-21:17:20 93.43% 36.34
Winnowmap2 15:25:58 25-17:18:40 83.82% 29.48
15:19:14 25-12:49:20 83.61% 30.02
15:15:24 25-10:16:00 83.06% 28.74
SMRT-CCS Graphmap2 5-01:33:22 202-14:14:40 98.88% 45.07
5-02:54:53 204-20:35:20 98.81% 44.90
5-06:49:25 211-08:56:40 98.52% 48.96
LRA 6:53:00 11-11:20:00 33.20% 20.63
6:53:04 11-11:22:40 33.11% 21.10
6:32:33 10-21:42:00 33.22% 27.91
Minimap2 11:02:04 18-09:22:40 7.51% 9.10
11:06:03 18-12:02:00 7.51% 9.13
10:55:09 18-04:46:00 7.52% 9.09
NGMLR 21:50:29 36-09:39:20 99.20% 25.53
22:24:38 37-08:25:20 99.19% 25.69
21:58:53 36-15:15:20 99.53% 25.82
Winnowmap2 3:39:42 6-02:28:00 88.10% 18.57
3:43:42 6-05:08:00 88.32% 18.71
3:40:32 6-03:01:20 87.91% 19.14

Globally successful tools

Minimap2 was the least memory-demanding tool

Minimap2 successfully aligned nanopore data every time with unrestricted and restricted resources. Unrestricted runs used around 12.6  gigabytes (Gb) and the jobs took ∼34 wall clock hours to complete.

Minimap2 also successfully aligned SMRT data every time with unrestricted or restricted resources. The runs used around 9 Gb of memory and the jobs took ∼11 wall clock hours to complete, considerably faster than the (admittedly larger) nanopore dataset.

Memory usage and runtime were consistent across triplicate runs with unrestricted resources and did not change with restriction of resources when the tool was used with either dataset. The consistency of results, as well as the speed and relatively low computational demands of Minimap2 make it a strong candidate for inclusion in clinical analysis pipelines.

Winnowmap2 was the fastest tool

Winnowmap2 was the fastest to run on any dataset. Specifically, it aligned the entire nanopore dataset in around 15 wall clock hours, using only a little more than two times more memory than minimap2 ( ∼29 Gb). For SMRT, it was even faster at just over 3.5 wall clock hours and almost exactly twice as much memory ( ∼18 Gb). These numbers were consistent over all runs.

Additionally, the high CPU efficiency of the jobs on our cluster were positive to note: over 80% efficiency, relative to Minimap2’s ∼7% efficiency. This is a measure of the ratio of CPU time used to the wall clock time times the number of CPUs. In the case of our cluster, CPUs are fully allocated to a node and not shared once allocated, but it does point the way towards either choosing more efficient tools, or optimizing jobs.

NGMLR completed 4/6 tasks and performance demanded great resources

NGMLR successfully aligned SMRT data every time but was the second slowest. The runs used around 25 Gb and the jobs took around 22 wall clock hours to complete, nearly twice as long as the next tool, Minimap2.

NGMLR was much more inconsistent on nanopore data. It successfully aligned the data on our first range finding experiment in approximately ∼230 wall clock hours. Following this run, two jobs were set with a time of 10 days, and both resulted in timeouts. The single successful run used 36.34 Gb of memory, the most of all tools which produced an alignment.

Tools that presented a challenge

LRA did not work on nanopore-generated data

While LRA did show promise on SMRT data, aligning a genome in ∼11 CPU days or ∼6 wall clock hours, we could not run it successfully with nanopore. Each run resulted in unresolvable (undescriptive) segmentation faults that produced no partial alignment. In the case of SMRT data, it did fall in the front of the pack, aligning genomes quickly with typical memory usage relative to its peers ( ∼20–27 GB).

GraphMap2 was the most resource-intensive tool

GraphMap2 took the longest, used the most memory, and had the most failures of all of the tools we examined. It did not produce alignment files on nanopore data in the time limit of 14 days imposed by our HPC. On the SMRT dataset which it was successful at aligning, it took more than 5 times longer to produce an alignment file than the next slowest tool, NGMLR. It had the highest memory usage, in excess of 89 GB nanopore on 44 GB on SMRT.

Whole genome alignment

Aligned reads show disagreement in coverage of each chromosome

In preliminary experiments that were completed successfully, alignments of the same genome with the same tool had the same genome coverage, whether resource-restricted or not (S4, column O). For this reason, one file generated from the first run was used for subsequent analysis, shown in Table 3, a summary of the coverage depth of each chromosome alignment and the percentage of the basepairs of GRCh38 covered. Global genome coverage of ∼30x was reported for nanopore and ∼34x, 20x of which is present in the 15 kbp insert set, was reported for SMRT data sets. Breaking down the summary into chromosomes allows for more precision than referring to a genome by its coverage as a uniform metric, since there is discrepancy across the chromosomes.

Table 3. Sequencing depth and reference coverage.

Chromosome level descriptions of the average number of reads covering each basepair, as well as the percentage of basepairs of the reference covered. The highest value for a chromosome is accented with bold text, while the lowest value is accented with italic text. Winnowmap2 abbreviated as WM2.

SMRT-CCS Nanopore
  Mean depth (x-depth) Coverage of basepairs (%)   Mean depth (x depth) Coverage of basepairs (%)
  LRA Minimap2 NGMLR WM2 LRA Minimap2 NGMLR WM2   Minimap2 NGMLR WM2 Minimap2 NGMLR WM2
chr1 19.0627 20.0049 17.9711 20.002 91.7893 92.3244 91.1358 92.3622 chr1 37.7867 33.3715 37.4783 92.4802 91.4793 92.4861
chr2 20.0613 20.1343 19.9393 20.1463 98.7397 98.8041 98.707 98.6286 chr2 40.1079 36.3557 39.497 99.0601 98.8241 98.8272
chr3 20.1303 20.5557 20.0446 20.3575 98.984 99.5552 98.7438 99.5645 chr3 37.7894 36.5499 39.9339 99.773 98.8066 99.767
chr4 20.4284 20.7247 20.1254 21.8975 99.1528 99.0963 98.6843 99.2809 chr4 39.8297 36.1401 39.6729 99.4841 98.7817 99.5176
chr5 20.0292 20.3439 19.7821 20.0655 98.2323 98.1787 97.6187 98.2961 chr5 37.1564 35.9384 37.17 98.5218 98.0472 98.5096
chr6 19.9858 20.4916 19.9745 20.4963 98.795 99.0781 98.674 99.2314 chr6 37.4983 36.4317 38.3408 99.4746 98.9018 99.4822
chr7 19.5124 19.7661 19.3123 19.7686 98.1498 99.2092 97.7858 98.9308 chr7 37.5093 35.7748 37.3072 99.6968 98.4965 99.6698
chr8 20.0067 20.279 19.7645 20.272 98.5302 99.2526 98.1774 99.0982 chr8 37.2117 35.8164 37.4528 99.6024 98.4431 99.6027
chr9 17.085 17.2704 16.5087 17.2736 86.2352 86.2731 85.4122 86.3347 chr9 33.7857 31.5634 33.62 87.3625 86.6206 87.4715
chr10 19.8087 20.0826 19.4999 20.5247 98.2398 98.8789 98.1124 98.8153 chr10 37.9547 35.509 37.9365 99.3546 98.2053 99.3873
chr11 19.5361 19.8058 19.4349 19.8105 98.0161 99.1255 98.1379 98.264 chr11 37.5146 35.9223 37.282 99.562 98.5893 99.0318
chr12 19.6294 19.9493 19.6137 19.9618 98.3953 99.548 98.151 99.3563 chr12 37.9345 36.4544 37.6064 99.8008 98.397 99.7298
chr13 17.5257 18.1764 16.9585 17.6178 85.2628 85.3114 84.1626 85.3793 chr13 32.6205 30.5633 32.35 85.6486 84.3648 85.6546
chr14 16.815 16.8162 16.6225 16.9593 82.5962 82.5668 82.5949 82.5858 chr14 31.2147 30.4015 31.9599 82.5971 82.5928 82.5975
chr15 16.0919 16.3256 15.7467 16.3303 80.7849 81.7561 80.0048 81.9674 chr15 32.9232 30.434 32.7613 82.7815 80.8381 82.8274
chr16 18.4585 19.5281 18.1335 19.4345 89.1223 90.0939 87.9862 90.1285 chr16 36.9247 34.5339 37.2781 90.5049 88.4413 90.5183
chr17 18.0451 18.8278 18.0519 18.7767 95.1484 97.5541 94.7803 95.9119 chr17 37.2793 35.2194 36.9045 97.3704 95.1519 98.046
chr18 19.4328 19.7389 18.8198 19.7815 96.004 98.3045 94.3649 97.8677 chr18 36.8051 34.2153 36.5148 99.3802 94.1824 99.0731
chr19 16.8558 16.8674 16.5895 16.8926 95.0649 95.0588 94.9148 95.0692 chr19 36.3404 33.9213 39.4614 95.1037 95.0137 95.1095
chr20 19.34 20.0554 19.0343 20.1369 96.0126 98.1195 95.5606 98.0231 chr20 39.3238 35.3079 39.1318 99.1143 96.3336 99.1551
chr21 18.0549 18.4984 16.8995 18.2805 80.4792 80.4144 78.9093 80.16 chr21 35.4168 30.0951 34.298 81.3939 80.763 81.2955
chr22 13.9854 14.0173 13.3346 17.4726 72.7139 72.6378 72.5114 72.7444 chr22 32.7008 25.9485 31.1753 72.9147 72.9092 72.9018
chrX 9.8603 9.97983 9.73401 9.98727 97.639 97.9196 96.4952 98.0195 chrX 36.1372 33.9833 35.6273 99.1734 97.0137 98.9417
chrY 7.94848 8.21933 3.85465 8.37289 40.9697 40.9213 37.9947 40.8261 chrY 1.8305 0.501775 1.3745 3.1548 4.9006 2.37513
chrM 1010.03 1203.88 1203.67 1202.19   100 100 100 100 chrM 11954.1 11413.1 11685.6   100 100 100

For nanopore data, three alignment files were measured, while for SMRT data, four files could be included. Table 3 shows cover of each chromosome with high values represented in bold text, and low values represented in italic text. It becomes immediately apparent that Minimap2 and Winnowmap2 both retain the most reads (highest x coverage) and cover the most basepairs of the reference. NGMLR excludes the most reads and covering the fewest basepairs of the reference genome, which may impact downstream analysis. The final tool, LRA, which only worked on SMRT data, is intermediate with some lower values of coverage depth, and some higher values of coverage of basepairs.

Unaligned reads reveal differential exclusion of reads

The readname assigned to each read in a fastq retained in the BAM allowed us to directly compare the lists of reads that were not included in the alignment by each alignment tool with Python Venn (Fig. 1) (Grigorev, 2018).

Figure 1. Unmapped reads.

Figure 1

Venn diagrams show the total number of reads flagged as unmapped in the sorted BAM files. The left panel presents unmapped reads from the SMRT dataset. The right panel presents unmapped reads from the nanopore dataset.

All tools agreed to leave ∼1 million of the same nanopore reads unaligned, but differed in their overall totals of unaligned reads. NGMLR left the highest number of reads unaligned, ∼3.1 million, which explains why it produced the lowest coverage genome. It agreed with the other tools on approximately half of its discarded reads. Minimap2 and NGMLR agreed to leave a further ∼0.5 million reads unaligned, while NGMLR and Winnowmap2 agreed on a separate ∼0.2 million unaligned reads and Winnowmap2 and Minimap2 on ∼0.03 million further unaligned reads.

All tools agreed to leave 682 of the same SMRT reads unaligned. This is because Winnowmap2 excludes less than 1,000 reads in total, where the other tools excluded more reads over roughly three orders of magnitude. NGMLR left the highest number of reads unaligned, ∼200 thousand, which explains why it produced the lowest coverage genome. LRA excluded around 90 thousand reads in all, of which around 70 thousand are common to NGMLR. This also likely contributes to its lower-coverage status shown in Table 3. Minimap2 leaves out just shy of 12 thousand reads in all, most of which are shared with NGMLR and LRA.

In all, not enough information is available at this level of granularity to pass judgment on whether or not tools include or exclude the right reads. For this, an examination of breakpoints is required.

Breakpoints reveal differences between competing alignments of the same genome

Based on the above results, it became clear that we needed to examine the differences in alignment. To compare the alignments to each other, and assess their usability for variant calling across the whole genome, we looked to the SVs known to be present in the NA12878 genome as curated by the DGV resource. We ran a pilot study on now depreciated datasets for linear SMRT data before undertaking a study with current technology (NIST, 2015). The results of these pilot experiments are available as S4 and point to a gap between the largest SV curated in DGV and what is resolvable by long-read sequencing.

To expand upon these initial, promising findings, we accessed the most up-to-date data generated on nanopore and SMRT platforms. Presently, this includes the WGS consortium rel7 on NA12878 for nanopore and a new release of SMRT-CCS on NA24385. NA24385 is not present in DGV, so a high quality callset released by PacBio in collaboration with the Genome in a Bottle Consortium was used as a truthset. These differences are reflected in their labels within Figs. 2 and 3.

Figure 2. Nanopore SV 1,001–10,000 bp.

Figure 2

Variants called by sniffles on each alignment file. (A) Deletions. (B) Insertions. (C) Inversions. (D) Duplications.

Figure 3. SMRT SVs 1001–10,000 bp.

Figure 3

Variants called by sniffles on each alignment file. (A) Deletions. (B) Insertions. (C) Duplications.

We leveraged the sniffles SV caller because of its highly-detailed output files. Our truth sets were the curated SV present in DGV and the set released by PacBio and GIAB for NA24385. Due to the different nomenclature for SV type in the sniffles output and the NA24385 calls, which follow VCF specification (GA4GH, 2023) but differ from DGV annotations, comparisons were limited to the four classes of variant that were most unambiguously labeled in VCFs and the truth sets: insertions, deletions, duplications, and inversions.

The variants available for NA24385 were coded as all insertions or deletions, with subtypes within including duplications. Therefore, the callsets from SMRT do not include inversions as there was no baseline. Sniffles variants were graphed by SV length on the X-axis in shades of blue, grey, and yellow, contrasted with DGV variants in red, organized by platform and SV type. Figures are organized by platform and size of SV, specifically between 1,001 and 100,000 bp, the range at which LRS platforms are known to excel.

Nanopore

For the variants between 1,001 and 10,000 bp (Fig. 2), alignment sets perform largely well on calling deletions that are curated in DGV while inversions were largely missed in all alignments. NGMLR alignments contained the highest number of breakpoints called as duplications. Above the 2001–3000 bp bin, the total duplications in DGV eclipses the number of breakpoints called in any of the alignment files. In contrast, there are very few insertions of this size range present in DGV, but hundreds in our callsets. This could be due to a differential interpretation of the breakpoints, where origin of inserted material has been ascertained (as a duplication) in DGV but remains unattributed (as an insertion) in call sets.

In the 10,001–100,000 bp size range (Supplement S5), DGV largely contains more variants, with the exception of inversions. Inversions present an interesting case here because DGV contains more very large inversions than are reported (>80,001), but various callsets do well at smaller ranges. Across deletions, insertions, and duplications, all alignment tools fall short of the curated reference. It is most interesting that DGV had few insertions smaller than 10,000 bp, but hundreds greater than that threshold. The trend of VCFs from our alignment files overperforming relative to DGV breaks down, and they highly underperform, calling only a few 10s of variants.

SMRT

The truthset for the SMRT dataset from NA24385 does not contain variants annotated as inversions, so Fig. 3 (SV 1001–10,000 bp) and S6 (SV 10,001–100,000 bp) only contain three panels each. We can see immediately in Fig. 3 that the bars generated from the reference set are much closer to the height of the bars from the VCFs, and this is not affected by the use of only the 15 kbp insert set. There is one notable exception with regard to duplications. As with the nanopore dataset, an NGMLR alignment contains the most duplications.

However, these duplications are far in excess of what is present in the reference set. This is notable, as the reference set has been carefully validated. What was missed, versus what is a false positive, is not resolvable in this experiment and likely not resolvable without wet lab bench work or orthogonal validation.

The largest SV show a complex picture (S6). NGMLR calls the most and the largest deletions and no tools do well for large insertions in absolute terms or relative to the truthset. However, Minimap2 and NGMLR preserve no breakpoints called as insertions. For duplications 10,001–20,000 bp, all tools except LRA resolve many more than are present in the reference. NGMLR continues to include many more duplications than the other alignments, and more than the reference.

The disparity between the number of large variants greater 10,001 bp in the NA24385 truthset versus the NA12878 is striking. This, along with the general lack of reference standards for all publicly-available genomes, highlights the need for better, more comprehensive reference sets for all publicly-available resources. Inability to resolve large duplications may yield false negative results for conditions where structural variation is a recognized etiology, such as DMD or FSHD, where variants range between the tens and hundreds of thousands of kbp (Barseghyan et al., 2017; Sharim et al., 2019), beyond the size of variants accurately detected in this study. It is also a critical issue when the technology is used for broad exploratory surveys seeking to identify new etiology in low-complexity regions of the genome where long-read sequencing should shine.

Discussion

In this study we have highlighted the key differences between alignment files generated by different tools. By using well-characterized genome standards, NA12878 and NA24385, we were able to directly compare the performance of each tool on the sequencing datasets obtained on different platforms. We further analyzed the reads that were included or excluded from the alignment, and how these read alignments revealed breakpoints that could be resolved as structural variants in genome architecture. We then compared those variants to sets of previously published variants discovered through multiple platforms.

Reassuringly, each alignment tool was internally consistent: when an alignment tool was given the same fastq and the same reference genome, it produced the same result as judged by bamstats and Sniffles variant callsets. However, when looking across the alignment files produced by different tools on the same sequence data, the representations of the genome diverged in terms of which reads were included or excluded and the numbers and types of variants that were present in VCFs. This is impactful because of the potential high value of LRS data in terms of genome phasing and identification of epigenetic DNA modification (Ni et al., 2019). Since the majority of experiments that leverage large scale population surveys can be expected to rely on reference-guided alignment rather than de novo assembly because of both the cost and speed of analysis (Mardis, 2010), it is key to understand the idiosyncrasies of each type of alignment files prior to generalizing clinical use for these technologies. Furthermore, clinical experiments in genomic medicine face human time constraints—speedier analyses will have higher appeal and adoption.

At this time, GraphMap2 does not show utility for producing whole genome alignments that include structural variations when run with default parameters. The resource usage was large and the time to complete the computation was long and when it did work, Sniffles could not call variants on its product BAM. This is not unexpected, as the tool was designed in large part to increase single nucleotide variant sensitivity in noisy nanopore-sequenced reads.

Minimap2 used the least memory while Winnowmap2 ran successfully in the fastest time, an important point, should these platforms be tied closer to bedside applications. On data from both platforms, they both allowed calling of the most insertions and deletions, but fell short on inversions and duplications.

NGMLR was the most discerning aligner, in that it left the highest number of reads unmapped. It used more compute power than Minimap2 and took much longer (3–10 times) than the next fastest tool. While it was designed specifically to resolve structural variation, it calls a high number of SV that have not been validated with other methodologies or curated in DGV.

There is a great divergence in Sniffles-called variants from alignment files generated by all tools from the variants present in DGV. This is a concerning expansion of seminal findings in a previous study (Chaisson et al., 2019) as none of the Sniffles VCFs mirror the SVs present in the high-quality curated DGV database. Furthermore, there are many genomes released across multiple consortia including GIAB and the Human Pangenome Reference Consortium (Miga & Wang, 2021). But not all of these genomes are sequenced on multiple platforms, and comprehensive reference sets which are harmonious across samples, genomes, and datasets are still limited.

This resulted in us using two reference sets for this project to accommodate the latest releases of data. The callsets generated in our experiments used annotations which were different from the NA12878 variants in DGV and the curated variant set for NA24385, which was surprising. Finally, the most up to date variant benchmark for NA24385 is called on GRCh37-aligned genomes, which is from 2009 and not as comprehensive as current builds (Nurk et al., 2022). Taken together, this underscores the need for orthogonal approaches and collaboration between wet and dry labs to solve this problem.

The points above are critical in designing pipelines for genome analysis and structural variant discovery. In short, it is not a high burden to generate two alignment files per genome with each Minimap2 and Winnowmap2. In computationally unlimited research settings, there is value added in the generation of alignment files from NGMLR as well, although the time to complete these experiments makes them less appealing. These three perspectives on the same genome will account for some of the inherent differences of each tool and the algorithms they use to handle read alignment (Bizjan et al., 2020). If computational resources are limited, Minimap2 is the best choice to move the greatest number of genomes through the pipeline quickly with small memory needs; however, the loss of comprehensiveness must be considered in cases where a suspected variant is not found.

This is impactful in genomic medicine. For example, as variants range from tens of kbp in FSHD and hundreds of kbp to mbp in DMD, diagnosis of these disorders will likely not benefit from data generated on LRS platforms at present, underscoring the need for optical mapping or array-based technologies. However, disorders resulting from smaller SVs such as Huntington’s ( ∼18–540 bp) (Moncke-Buchner, 2002), myotonic dystrophy 1 (∼15–153 bp) and myotonic dystrophy 2 (∼338–143,000 bp) (Yum, Wang & Kalsotra, 2017) could be good candidates for deep study with LRS platforms based on the variants present in alignment files from Minimap2, Winnowmap2 and NGMLR. Accordingly, LRS has been used to identify variants in many such disorders (Mitsuhashi & Matsumoto, 2020).

If pathogenic loci are known, a high diagnostic yield may be obtained by generating maps with each available alignment tool, and use of a structural variant caller such as NanoSV (Cretu Stancu et al., 2017). Unlike sniffles, which provides a call of type of SV (deletion, insertion etc.), NanoSV only identifies breakpoints in the alignment, without assigning those breakpoints an SV type. A robust comparison of SV callers on nanopore datasets highlighted the relative strengths of variant calling pipelines and may help users determine the best caller for their experiments. NanoSV may be suitable for identifying breakpoints missed by Sniffles, but comes with the further caveat that it is resource-intensive and may not scale in a clinical setting without vast computational resources (Zhou, Lin & Xing, 2019).

The discrepancies between the VCFs generated from alignment files starkly show the need to design experiments with the appreciation that the genomics ecosystem cannot yet be dominated by one platform or pipeline and requires a multifaceted approach to discovery. We are at a position where simply because the breakpoint is missing from the VCF from an LRS genome, we cannot say that it is not present. We must therefore look across platforms and data types for comprehensive genome representations (Chaisson et al., 2019). However, this requires more data from the community-generation of NA24385 data on new nanopore chemistry would allow for direct comparison to the current CCS standard, or alternatively new benchmark CCS data could be generated on NA12878. This would address a critical gap in our ability to benchmark new tools in a platform-agnostic manner.

Conclusions

As the cost of long-read sequencing catches up to that of inexpensive short-read sequencing, the inevitable boom in data production will require well thought-out analysis pipelines. Pipeline design always involves a set of tradeoffs. To accurately assess these tradeoffs, we must have a rigorously benchmarked view into the tools available to create the analytic product. Here, we looked at the differences in reference-guided human genome alignments to understand the difference in each tool’s alignment of the same genome, and how it affects a structural variant callset.

This informs our conclusion that, regardless of the sequencing platform, when computational resources are not a limiting factor, it should be best practice to align an LRS human genome with three alignment tools. Minimap2, Winnowmap2, and NGMLR will provide a strong foundation to gain better insight into the architecture of a genome of interest, but there are circumstances where use of LRA for SMRT data may make sense in lieu of NGMLR. When compute resources are limited, minimap2 is a strong choice, and when time is a limiting factor, Winnowmap2 is the best choice.

As we enter the era of multiple references, pangenomics, and graph genomes, analytic substrate from the unmapped reads of a genome gains higher value. Through finding consensus reads unaligned by different aligners, teams interested in data excluded from the reference genome will yield more fertile ground for the discovery of novel genomic material.

Supplemental Information

Supplemental Information 1. LRS Tools.

Survey of LRS tools available as of 12 June 2023

DOI: 10.7717/peerj.16515/supp-1
Supplemental Information 2. Coverage statistics.

The coverage metrics for each run included in Table 3.

DOI: 10.7717/peerj.16515/supp-2
Supplemental Information 3. Pilot benchmarking.

The values for computational benchmarking during pilot experiments on data which have been superseded by new technologies.

DOI: 10.7717/peerj.16515/supp-3
Supplemental Information 4. Benchmarking summary and raw output.

The summarized and raw values for computations performed in this article.

DOI: 10.7717/peerj.16515/supp-4
Supplemental Information 5. SV 10,001–100,000 bp.

These figures represent the largest structural variants present in callsets from manuscript analyses.

DOI: 10.7717/peerj.16515/supp-5

Acknowledgments

Adam Kai Leung Wong, PhD, High Performance Computing Specialist for Genomics at the GWU Computational Biology Institute provided critical support in preparing the HPC for this study. The contents of this manuscript are solely the responsibility of the authors and do not necessarily represent the official views of the National Institutes of Health.

Funding Statement

Eric Vilain was supported by the A. James Clark Distinguished Professorship in Molecular Genetics at Children’s National Hospital. Eric Vilain, Jonathan LoTempio, Emmanuele Delot are part of the GREGoR consortium, supported by NIH Award Number U01HG011745 from the National Human Genome Research Institute. The University of California, Irvine’s Institute for Clinical and Translational Science, and the National Center for Advancing Translational Sciences (UL1TR001414) supported the publication fee for this article. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Additional Information and Declarations

Competing Interests

The authors declare there are no competing interests.

Author Contributions

Jonathan LoTempio conceived and designed the experiments, performed the experiments, analyzed the data, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Emmanuele Delot conceived and designed the experiments, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Eric Vilain conceived and designed the experiments, authored or reviewed drafts of the article, and approved the final draft.

Data Availability

The following information was supplied regarding data availability:

The code is available at GitHub and Zenodo:

- https://github.com/jlotempiojr/longread_alignment_benchmarking

- jlotempiojr. (2023). jlotempiojr/longread_alignment_benchmarking: Benchmarking long-read genome sequence alignment tools for human genomics applications (1.1). Zenodo. https://doi.org/10.5281/zenodo.8416228

No new sequence data was generated in this study. The data used is available at GitHub:

- https://github.com/nanopore-wgs-consortium/NA12878/blob/master/Genome.md

- https://github.com/human-pangenomics/HG002_Data_Freeze_v1.0

References

  • Amarasinghe, Ritchie & Gouil (2021).Amarasinghe SL, Ritchie ME, Gouil Q. long-read-tools.org: an interactive catalogue of analysis methods for long-read sequencing data. GigaScience. 2021;10(2):giab003. doi: 10.1093/gigascience/giab003. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Ardui et al. (2018).Ardui S, Ameur A, Vermeesch JR, Hestand MS. Single molecule real-time (SMRT) sequencing comes of age: applications and utilities for medical diagnostics. Nucleic Acids Research. 2018;46:2159–2168. doi: 10.1093/nar/gky066. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Barseghyan et al. (2017).Barseghyan H, Tang W, Wang RT, Almalvez M, Segura E, Bramble MS, Lipson A, Douine ED, Lee H, Délot EC, Nelson SF, Vilain E. Next-generation mapping: a novel approach for detection of pathogenic structural variants with a potential utility in clinical diagnosis. Genome Medicine. 2017;9:90. doi: 10.1186/s13073-017-0479-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Bizjan et al. (2020).Bizjan BJ, Katsila T, Tesovnik T, Šket R, Debeljak M, Matsoukas MT, Kovač J. Challenges in identifying large germline structural variants for clinical use by long read sequencing. Computational and Structural Biotechnology Journal. 2020;18:83–92. doi: 10.1016/j.csbj.2019.11.008. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Bornhorst et al. (2023).Bornhorst M, Eze Jr A, Bhattacharya S, Putnam E, Almira-Suarez MI, Rossi C, Kambhampati M, Almalvez M, Barseghyan M, Del Risco N, Dotson D, Turner J, Myseros JS, Vilain E, Packer RJ, Nazarian J, Rood B, Barseghyan H. Optical genome mapping identifies a novel pediatric embryonal tumor with a ZNF532::NUTM1 fusion. Journal of Pathology. 2023;260(3):329–338. doi: 10.1002/path.6085. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Branton et al. (2008).Branton D, Deamer DW, Marziali A, Bayley H, Benner SA, Butler T, Di Ventra M, Garaj S, Hibbs A, Huang X, Jovanovich SB, Krstic PS, Lindsay S, Ling XS, Mastrangelo CH, Meller A, Oliver JS, Pershin YV, Ramsey JM, Riehn R, Soni GV, Tabard-Cossa V, Wanunu M, Wiggin M, Schloss JA. The potential and challenges of nanopore sequencing. Nature Biotechnology. 2008;26:1146–1153. doi: 10.1038/nbt.1495. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Chaisson et al. (2019).Chaisson MJP, Sanders AD, Zhao X, Malhotra A, Porubsky D, Rausch T, Gardner EJ, Rodriguez OL, Guo L, Collins RL, Fan X, Wen J, Handsaker RE, Fairley S, Kronenberg ZN, Kong X, Hormozdiari F, Lee D, Wenger AM, Hastie AR, Antaki D, Anantharaman T, Audano PA, Brand H, Cantsilieris S, Cao H, Cerveira E, Chen C, Chen X, Chin C-S, Chong Z, Chuang NT, Lambert CC, Church DM, Clarke L, Farrell A, Flores J, Galeev T, Gorkin DU, Gujral M, Guryev V, Heaton WH, Korlach J, Kumar S, Kwon JY, Lam ET, Lee JE, Lee J, Lee W-P, Lee SP, Li S, Marks P, Viaud-Martinez K, Meiers S, Munson KM, Navarro FCP, Nelson BJ, Nodzak C, Noor A, Kyriazopoulou-Panagiotopoulou S, Pang AWC, Qiu Y, Rosanio G, Ryan M, Stütz A, Spierings DCJ, Ward A, Welch AE, Xiao M, Xu W, Zhang C, Zhu Q, Zheng-Bradley X, Lowy E, Yakneen S, McCarroll S, Jun G, Ding L, Koh CL, Ren B, Flicek P, Chen K, Gerstein MB, Kwok P-Y, Lansdorp PM, Marth GT, Sebat J, Shi X, Bashir A, Ye K, Devine SE, Talkowski ME, Mills RE, Marschall T, Korbel JO, Eichler EE, Lee C. Multi-platform discovery of haplotype-resolved structural variation in human genomes. Nature Communications. 2019;10:1784. doi: 10.1038/s41467-018-08148-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Chaisson & Tesler (2012).Chaisson MJ, Tesler G. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics. 2012;13:238. doi: 10.1186/1471-2105-13-238. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Chakraborty, Morgenstern & Bandyopadhyay (2021).Chakraborty A, Morgenstern B, Bandyopadhyay S. S-conLSH: alignment-free gapped mapping of noisy long reads. BMC Bioinformatics. 2021;22:64. doi: 10.1186/s12859-020-03918-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Cheng et al. (2021).Cheng H, Concepcion GT, Feng X, Zhang H, Li H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nature Methods. 2021;18:170–175. doi: 10.1038/s41592-020-01056-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Cretu Stancu et al. (2017).Cretu Stancu M, van Roosmalen MJ, Renkens I, Nieboer MM, Middelkamp S, de Ligt J, Pregno G, Giachino D, Mandrile G, Espejo Valle-Inclan J, Korzelius J, de Bruijn E, Cuppen E, Talkowski ME, Marschall T, de Ridder J, Kloosterman WP. Mapping and phasing of structural variation in patient genomes using nanopore sequencing. Nature Communications. 2017;8:1326. doi: 10.1038/s41467-017-01343-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Cuscó et al. (2019).Cuscó A, Catozzi C, Viñes J, Sanchez A, Francino O. Microbiota profiling with long amplicons using Nanopore sequencing: full-length 16S rRNA gene and the 16S-ITS-23S of the rrn operon. F1000Research. 2019;7:1755. doi: 10.12688/f1000research.16817.2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Danecek et al. (2021).Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T, McCarthy SA, Davies RM, Li H. Twelve years of SAMtools and BCFtools. Gigascience. 2021;10(2):giab008. doi: 10.1093/gigascience/giab008. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Deshpande et al. (2019).Deshpande SV, Reed TM, Sullivan RF, Kerkhof LJ, Beigel KM, Wade MM. Offline next generation metagenomics sequence analysis using MinION detection software (MINDS) Gene. 2019;10(8):578. doi: 10.3390/genes10080578. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Ebbert et al. (2019).Ebbert MTW, Jensen TD, Jansen-West K, Sens JP, Reddy JS, Ridge PG, Kauwe JSK, Belzil V, Pregent L, Carrasquillo MM, Keene D, Larson E, Crane P, Asmann YW, Ertekin-Taner N, Younkin SG, Ross OA, Rademakers R, Petrucelli L, Fryer JD. Systematic analysis of dark and camouflaged genes reveals disease-relevant genes hiding in plain sight. Genome Biology. 2019;20:97. doi: 10.1186/s13059-019-1707-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Faust & Hall (2012).Faust GG, Hall IM. YAHA: fast and flexible long-read alignment with optimal breakpoint detection. Bioinformatics. 2012;28:2417–2424. doi: 10.1093/bioinformatics/bts456. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Formenti et al. (2019).Formenti G, Chiara M, Poveda L, Francoijs K-J, Bonisoli-Alquati A, Canova L, Gianfranceschi L, Horner DS, Saino N. SMRT long reads and direct label and stain optical maps allow the generation of a high-quality genome assembly for the European barn swallow (Hirundo rustica rustica) GigaScience. 2019;8(1):giy142. doi: 10.1093/gigascience/giy142. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Fuller et al. (2009).Fuller CW, Middendorf LR, Benner SA, Church GM, Harris T, Huang X, Jovanovich SB, Nelson JR, Schloss JA, Schwartz DC, Vezenov DV. The challenges of sequencing by synthesis. Nature Biotechnology. 2009;27:1013–1023. doi: 10.1038/nbt.1585. [DOI] [PubMed] [Google Scholar]
  • GA4GH (2023).GA4GH hts-specs: specifications of SAM/BAM and related high-throughput sequencing file formats. Github 2023
  • Grigorev (2018).Grigorev K. Venn. https://pypi.org/project/venn/ [11 July 2023];2018
  • Haghshenas, Sahinalp & Hach (2019).Haghshenas E, Sahinalp SC, Hach F. lordFAST: sensitive and fast alignment search tool for long noisy read sequencing data. Bioinformatics. 2019;35:20–27. doi: 10.1093/bioinformatics/bty544. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Ho, Urban & Mills (2020).Ho SS, Urban AE, Mills RE. Structural variation in the sequencing era. Nature Reviews. Genetics. 2020;21:171–189. doi: 10.1038/s41576-019-0180-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Huddleston et al. (2017).Huddleston J, Chaisson MJP, Steinberg KM, Warren W, Hoekzema K, Gordon D, Graves-Lindsay TA, Munson KM, Kronenberg ZN, Vives L, Peluso P, Boitano M, Chin C-S, Korlach J, Wilson RK, Eichler EE. Discovery and genotyping of structural variation from long-read haploid genome sequence data. Genome Research. 2017;27:677–685. doi: 10.1101/gr.214007.116. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Hunter (2007).Hunter Matplotlib: a 2D graphics environment. 2007. 9:90–95.
  • Jain et al. (2018a).Jain C, Koren S, Dilthey A, Phillippy AM, Aluru S. A fast adaptive algorithm for computing whole-genome homology maps. Bioinformatics. 2018a;34:i748–i756. doi: 10.1093/bioinformatics/bty597. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Jain et al. (2022).Jain C, Rhie A, Hansen NF, Koren S, Phillippy AM. Long-read mapping to repetitive reference sequences using Winnowmap2. Nature Methods. 2022 doi: 10.1038/s41592-022-01457-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Jain et al. (2020).Jain C, Rhie A, Zhang H, Chu C, Walenz BP, Koren S, Phillippy AM. Weighted minimizer sampling improves long read mapping. Bioinformatics. 2020;36:i111–i118. doi: 10.1093/bioinformatics/btaa435. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Jain et al. (2018b).Jain M, Koren S, Miga KH, Quick J, Rand AC, Sasani TA, Tyson JR, Beggs AD, Dilthey AT, Fiddes IT, Malla S, Marriott H, Nieto T, O’Grady J, Olsen HE, Pedersen BS, Rhie A, Richardson H, Quinlan AR, Snutch TP, Tee L, Paten B, Phillippy AM, Simpson JT, Loman NJ, Loose M. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nature Biotechnology. 2018b;36:338–345. doi: 10.1038/nbt.4060. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Kantartzis et al. (2012).Kantartzis A, Williams GM, Balakrishnan L, Roberts RL, Surtees JA, Bambara RA. Msh2-Msh3 interferes with Okazaki fragment processing to promote trinucleotide repeat expansions. Cell Reports. 2012;2:216–222. doi: 10.1016/j.celrep.2012.06.020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Kiełbasa et al. (2011).Kiełbasa SM, Wan R, Sato K, Horton P, Frith MC. Adaptive seeds tame genomic sequence comparison. Genome Research. 2011;21:487–493. doi: 10.1101/gr.113985.110. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Kolmogorov et al. (2019).Kolmogorov M, Yuan J, Lin Y, Pevzner PA. Assembly of long, error-prone reads using repeat graphs. Nature Biotechnology. 2019;37:540–546. doi: 10.1038/s41587-019-0072-8. [DOI] [PubMed] [Google Scholar]
  • Koren et al. (2018).Koren S, Rhie A, Walenz BP, Dilthey AT, Bickhart DM, Kingan SB, Hiendleder S, Williams JL, Smith TPL, Phillippy AM. De novo assembly of haplotype-resolved genomes with trio binning. Nature Biotechnology. 2018;36:1174–1182. doi: 10.1038/nbt.4277. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Kumar, Agarwal & Ranvijay (2019).Kumar S, Agarwal S, Ranvijay Fast and memory efficient approach for mapping NGS reads to a reference genome. Journal of Bioinformatics and Computational Biology. 2019;17:1950008. doi: 10.1142/S0219720019500082. [DOI] [PubMed] [Google Scholar]
  • Levy-Sakin et al. (2019).Levy-Sakin M, Pastor S, Mostovoy Y, Li L, Leung AKY, McCaffrey J, Young E, Lam ET, Hastie AR, Wong KHY, Chung CYL, Ma W, Sibert J, Rajagopalan R, Jin N, Chow EYC, Chu C, Poon A, Lin C, Naguib A, Wang W-P, Cao H, Chan T-F, Yip KY, Xiao M, Kwok P-Y. Genome maps across 26 human populations reveal population-specific patterns of structural variation. Nature Communications. 2019;10:1025. doi: 10.1038/s41467-019-08992-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Li (2013).Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. 2013 arXiv [q-bio.GN]
  • Li (2018).Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34:3094–3100. doi: 10.1093/bioinformatics/bty191. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Li et al. (2009).Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25(16):2078–2079. doi: 10.1093/bioinformatics/btp352. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Liu, Gao & Wang (2017).Liu B, Gao Y, Wang Y. LAMSA: fast split read alignment with long approximate matches. Bioinformatics. 2017;33:192–201. doi: 10.1093/bioinformatics/btw594. [DOI] [PubMed] [Google Scholar]
  • Liu et al. (2016).Liu B, Guan D, Teng M, Wang Y. rHAT: fast alignment of noisy long reads with regional hashing. Bioinformatics. 2016;32:1625–1631. doi: 10.1093/bioinformatics/btv662. [DOI] [PubMed] [Google Scholar]
  • Liu et al. (2017).Liu Q, Zhang P, Wang D, Gu W, Wang K. Interrogating the unsequenceable genomic trinucleotide repeat disorders by long-read sequencing. Genome Medicine. 2017;9:65. doi: 10.1186/s13073-017-0456-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • LoTempio, Délot & Vilain (2023).LoTempio J, Délot E, Vilain E. Benchmarking long-read genome sequence alignment tools for human genomics applications. bioRxiv. 2023 doi: 10.1101/2021.07.09.451840. [DOI] [PMC free article] [PubMed]
  • MacDonald et al. (2014).MacDonald JR, Ziman R, Yuen RKC, Feuk L, Scherer SW. The database of genomic variants: a curated collection of structural variation in the human genome. Nucleic Acids Research. 2014;42:D986–D992. doi: 10.1093/nar/gkt958. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Mahmoud et al. (2019).Mahmoud M, Gobet N, Cruz-Dávalos DI, Mounier N, Dessimoz C, Sedlazeck FJ. Structural variant calling: the long and the short of it. Genome Biology. 2019;20:246. doi: 10.1186/s13059-019-1828-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Mantere, Kersten & Hoischen (2019).Mantere T, Kersten S, Hoischen A. Long-read sequencing emerging in medical genetics. Frontiers in Genetics. 2019;10:426. doi: 10.3389/fgene.2019.00426. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Marco-Sola et al. (2020).Marco-Sola S, Moure JC, Moreto M, Espinosa A. Fast gap-affine pairwise alignment using the wavefront algorithm. Bioinformatics. 2020;37(4):456–463. doi: 10.1093/bioinformatics/btaa777. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Mardis (2010).Mardis ER. The 1,000 genome, the 100, 000 analysis? Genome Medicine. 2010;2:84. doi: 10.1186/gm205. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Marić et al. (2019).Marić J, Sović I, Križanović K, Nagarajan N, Šikić M. Graphmap2—splice-aware RNA-seq mapper for long reads. BioRxiv. 2019 doi: 10.1101/720458. 720458. [DOI]
  • Miga et al. (2020).Miga KH, Koren S, Rhie A, Vollger MR, Gershman A, Bzikadze A, Brooks S, Howe E, Porubsky D, Logsdon GA, Schneider VA, Potapova T, Wood J, Chow W, Armstrong J, Fredrickson J, Pak E, Tigyi K, Kremitzki M, Markovic C, Maduro V, Dutra A, Bouffard GG, Chang AM, Hansen NF, Wilfert AB, Thibaud-Nissen F, Schmitt AD, Belton J-M, Selvaraj S, Dennis MY, Soto DC, Sahasrabudhe R, Kaya G, Quick J, Loman NJ, Holmes N, Loose M, Surti U, Risques RA, Graves Lindsay TA, Fulton R, Hall I, Paten B, Howe K, Timp W, Young A, Mullikin JC, Pevzner PA, Gerton JL, Sullivan BA, Eichler EE, Phillippy AM. Telomere-to-telomere assembly of a complete human X chromosome. Nature. 2020;585:79–84. doi: 10.1038/s41586-020-2547-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Miga & Wang (2021).Miga KH, Wang T. The need for a human pangenome reference sequence. Annual Review of Genomics and Human Genetics. 2021;22:81–102. doi: 10.1146/annurev-genom-120120-081921. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Mitsuhashi et al. (2019).Mitsuhashi S, Frith MC, Mizuguchi T, Miyatake S, Toyota T, Adachi H, Oma Y, Kino Y, Mitsuhashi H, Matsumoto N. Tandem-genotypes: robust detection of tandem repeat expansions from long DNA reads. Genome Biology. 2019;20:58. doi: 10.1186/s13059-019-1667-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Mitsuhashi & Matsumoto (2020).Mitsuhashi S, Matsumoto N. Long-read sequencing for rare human genetic diseases. Journal of Human Genetics. 2020;65:11–19. doi: 10.1038/s10038-019-0671-8. [DOI] [PubMed] [Google Scholar]
  • Modi et al. (2021).Modi A, Vai S, Caramelli D, Lari M. The illumina sequencing protocol and the NovaSeq 6000 system. Methods in Molecular Biology. 2021;2242:15–42. doi: 10.1007/978-1-0716-1099-2_2. [DOI] [PubMed] [Google Scholar]
  • Moncke-Buchner (2002).Moncke-Buchner E. Counting CAG repeats in the Huntington’s disease gene by restriction endonuclease EcoP15I cleavage. Nucleic Acids Research. 2002;30:83e–83. doi: 10.1093/nar/gnf082. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Nashta-Ali et al. (2017).Nashta-Ali D, Aliyari A, Ahmadian Moghadam A, Edrisi MA, Motahari SA, Hossein Khalaj B. Meta-aligner: long-read alignment based on genome statistics. BMC Bioinformatics. 2017;18:126. doi: 10.1186/s12859-017-1518-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Neveling et al. (2021).Neveling K, Mantere T, Vermeulen S, Oorsprong M, van Beek R, Kater-Baats E, Pauper M, van der Zande G, Smeets D, Weghuis DO, Stevens-Kroef MJPL, Hoischen A. Next-generation cytogenetics: Comprehensive assessment of 52 hematological malignancy genomes by optical genome mapping. American Journal of Human Genetics. 2021;108(8):1423–1435. doi: 10.1016/j.ajhg.2021.06.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Ni et al. (2019).Ni P, Huang N, Zhang Z, Wang D-P, Liang F, Miao Y, Xiao C-L, Luo F, Wang J. DeepSignal: detecting DNA methylation state from Nanopore sequencing reads using deep-learning. Bioinformatics. 2019;35:4586–4595. doi: 10.1093/bioinformatics/btz276. [DOI] [PubMed] [Google Scholar]
  • NIST (2015).NIST Index of /giab/ftp/data/NA12878/NA12878_PacBio_MtSinai. 2015. https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NA12878_PacBio_MtSinai/ [11 July 2023]. https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NA12878_PacBio_MtSinai/
  • Norris et al. (2016).Norris AL, Workman RE, Fan Y, Eshleman JR, Timp W. Nanopore sequencing detects structural variants in cancer. Cancer Biology & Therapy. 2016;17:246–253. doi: 10.1080/15384047.2016.1139236. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Nurk et al. (2022).Nurk S, Koren S, Rhie A, Rautiainen M, Bzikadze AV, Mikheenko A, Vollger MR, Altemose N, Uralsky L, Gershman A, Aganezov S, Hoyt SJ, Diekhans M, Logsdon GA, Alonge M, Antonarakis SE, Borchers M, Bouffard GG, Brooks SY, Caldas GV, Chen N-C, Cheng H, Chin C-S, Chow W, de Lima LG, Dishuck PC, Durbin R, Dvorkina T, Fiddes IT, Formenti G, Fulton RS, Fungtammasan A, Garrison E, Grady PGS, Graves-Lindsay TA, Hall IM, Hansen NF, Hartley GA, Haukness M, Howe K, Hunkapiller MW, Jain C, Jain M, Jarvis ED, Kerpedjiev P, Kirsche M, Kolmogorov M, Korlach J, Kremitzki M, Li H, Maduro VV, Marschall T, McCartney AM, McDaniel J, Miller DE, Mullikin JC, Myers EW, Olson ND, Paten B, Peluso P, Pevzner PA, Porubsky D, Potapova T, Rogaev EI, Rosenfeld JA, Salzberg SL, Schneider VA, Sedlazeck FJ, Shafin K, Shew CJ, Shumate A, Sims Y, Smit AFA, Soto DC, Sović I, Storer JM, Streets A, Sullivan BA, Thibaud-Nissen F, Torrance J, Wagner J, Walenz BP, Wenger A, Wood JMD, Xiao C, Yan SM, Young AC, Zarate S, Surti U, McCoy RC, Dennis MY, Alexandrov IA, Gerton JL, O’Neill RJ, Timp W, Zook JM, Schatz MC, Eichler EE, Miga KH, Phillippy AM. The complete sequence of a human genome. Science. 2022;376:44–53. doi: 10.1126/science.abj6987. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • PacBio (2021).PacBio NIST. sv-benchmark: public benchmark of long-read structural variant caller on PacBio CCS HG002 data. Github 2021
  • Pollard et al. (2018).Pollard MO, Gurdasani D, Mentzer AJ, Porter T, Sandhu MS. Long reads: their purpose and place. Human Molecular Genetics. 2018;27:R234–R241. doi: 10.1093/hmg/ddy177. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Ren & Chaisson (2021).Ren J, Chaisson MJP. lra: a long read aligner for sequences and contigs. PLOS Computational Biology. 2021;17:e1009078. doi: 10.1371/journal.pcbi.1009078. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Sanderson et al. (2018).Sanderson ND, Street TL, Foster D, Swann J, Atkins BL, Brent AJ, McNally MA, Oakley S, Taylor A, Peto TEA, Crook DW, Eyre DW. Real-time analysis of nanopore-based metagenomic sequencing from infected orthopaedic devices. BMC Genomics. 2018;19:714. doi: 10.1186/s12864-018-5094-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Sanger, Nicklen & Coulson (1977).Sanger F, Nicklen S, Coulson AR. DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences of the United States of America. 1977;74:5463–5467. doi: 10.1073/pnas.74.12.5463. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Santoro et al. (2017).Santoro M, Masciullo M, Silvestri G, Novelli G, Botta A. Myotonic dystrophy type 1: role of CCG, CTC and CGG interruptions within DMPK alleles in the pathogenesis and molecular diagnosis. Clinical Genetics. 2017;92:355–364. doi: 10.1111/cge.12954. [DOI] [PubMed] [Google Scholar]
  • Sedlazeck et al. (2018).Sedlazeck FJ, Rescheneder P, Smolka M, Fang H, Nattestad M, Haeseler Avon, Schatz MC. Accurate detection of complex structural variations using single-molecule sequencing. Nature Methods. 2018;15:461–468. doi: 10.1038/s41592-018-0001-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Sharim et al. (2019).Sharim H, Grunwald A, Gabrieli T, Michaeli Y, Margalit S, Torchinsky D, Arielly R, Nifker G, Juhasz M, Gularek F, Almalvez M, Dufault B, Chandra SS, Liu A, Bhattacharya S, Chen Y-W, Vilain E, Wagner KR, Pevsner J, Reifenberger J, Lam ET, Hastie AR, Cao H, Barseghyan H, Weinhold E, Ebenstein Y. Long-read single-molecule maps of the functional methylome. Genome Research. 2019;29:646–656. doi: 10.1101/gr.240739.118. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Sović et al. (2016).Sović I, Šikić M, Wilm A, Fenlon SN, Chen S, Nagarajan N. Fast and sensitive mapping of nanopore sequencing reads with GraphMap. Nature Communications. 2016;7:11307. doi: 10.1038/ncomms11307. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Talsania et al. (2022).Talsania K, Shen TW, Chen X, Jaeger E, Li Z, Chen Z, Chen W, Tran B, Kusko R, Wang L, Pang AWC, Yang Z, Choudhari S, Colgan M, Fang LT, Carroll A, Shetty J, Kriga Y, German O, Smirnova T, Liu T, Li J, Kellman B, Hong K, Hastie AR, Natarajan A, Moshrefi A, Granat A, Truong T, Bombardi R, Mankinen V, Meerzaman D, Mason CE, Collins J, Stahlberg E, Xiao C, Wang C, Xiao W, Zhao Y. Structural variant analysis of a cancer reference cell line sample using multiple sequencing technologies. Genome Biology. 2022;23(1):255. doi: 10.1186/s13059-022-02816-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Walker et al. (2020).Walker A, Houwaart T, Wienemann T, Vasconcelos MK, Strelow D, Senff T, Hülse L, Adams O, Andree M, Hauka S, Feldt T, Jensen B-E, Keitel V, Kindgen-Milles D, Timm J, Pfeffer K, Dilthey AT. Genetic structure of SARS-CoV-2 reflects clonal superspreading and multiple independent introduction events, North-Rhine Westphalia, Germany, February and 2020. Euro Surveillance: Bulletin Europeen sur Les Maladies Transmissibles = European Communicable Disease Bulletin. 2020;25(22):2000746. doi: 10.2807/1560-7917.ES.2020.25.22.2000746. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Waskom (2021).Waskom M. seaborn: statistical data visualization. Journal of Open Source Software. 2021;6:3021. doi: 10.21105/joss.03021. [DOI] [Google Scholar]
  • Wei, Zhang & Liu (2020).Wei Z-G, Zhang S-W, Liu F. smsMap: mapping single molecule sequencing reads by locating the alignment starting positions. BMC Bioinformatics. 2020;21:341. doi: 10.1186/s12859-020-03698-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Wenger et al. (2019).Wenger AM, Peluso P, Rowell WJ, Chang P-C, Hall RJ, Concepcion GT, Ebler J, Fungtammasan A, Kolesnikov A, Olson ND, Töpfer A, Alonge M, Mahmoud M, Qian Y, Chin C-S, Phillippy AM, Schatz MC, Myers G, De Pristo MA, Ruan J, Marschall T, Sedlazeck FJ, Zook JM, Li H, Koren S, Carroll A, Rank DR, Hunkapiller MW. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nature Biotechnology. 2019;37:1155–1162. doi: 10.1038/s41587-019-0217-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Wick, Judd & Holt (2019).Wick RR, Judd LM, Holt KE. Performance of neural network basecalling tools for Oxford Nanopore sequencing. Genome Biology. 2019;20:129. doi: 10.1186/s13059-019-1727-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Xiao et al. (2017).Xiao C-L, Chen Y, Xie S-Q, Chen K-N, Wang Y, Han Y, Luo F, Xie Z. MECAT: fast mapping, error correction, and de novo assembly for single-molecule sequencing reads. Nature Methods. 2017;14:1072–1074. doi: 10.1038/nmeth.4432. [DOI] [PubMed] [Google Scholar]
  • Yum, Wang & Kalsotra (2017).Yum K, Wang ET, Kalsotra A. Myotonic dystrophy: disease repeat range, penetrance, age of onset, and relationship between repeat size and phenotypes. Current Opinion in Genetics & Development. 2017;44:30–37. doi: 10.1016/j.gde.2017.01.007. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Zhou, Lin & Xing (2019).Zhou A, Lin T, Xing J. Evaluating nanopore sequencing data processing pipelines for structural variation identification. Genome Biology. 2019;20(1):237. doi: 10.1186/s13059-019-1858-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • Zook et al. (2019).Zook JM, McDaniel J, Olson ND, Wagner J, Parikh H, Heaton H, Irvine SA, Trigg L, Truty R, McLean CY, De La Vega FM, Xiao C, Sherry S, Salit M. An open resource for accurately benchmarking small variant and reference calls. Nature Biotechnology. 2019;37:561–566. doi: 10.1038/s41587-019-0074-6. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplemental Information 1. LRS Tools.

Survey of LRS tools available as of 12 June 2023

DOI: 10.7717/peerj.16515/supp-1
Supplemental Information 2. Coverage statistics.

The coverage metrics for each run included in Table 3.

DOI: 10.7717/peerj.16515/supp-2
Supplemental Information 3. Pilot benchmarking.

The values for computational benchmarking during pilot experiments on data which have been superseded by new technologies.

DOI: 10.7717/peerj.16515/supp-3
Supplemental Information 4. Benchmarking summary and raw output.

The summarized and raw values for computations performed in this article.

DOI: 10.7717/peerj.16515/supp-4
Supplemental Information 5. SV 10,001–100,000 bp.

These figures represent the largest structural variants present in callsets from manuscript analyses.

DOI: 10.7717/peerj.16515/supp-5

Data Availability Statement

The following information was supplied regarding data availability:

The code is available at GitHub and Zenodo:

- https://github.com/jlotempiojr/longread_alignment_benchmarking

- jlotempiojr. (2023). jlotempiojr/longread_alignment_benchmarking: Benchmarking long-read genome sequence alignment tools for human genomics applications (1.1). Zenodo. https://doi.org/10.5281/zenodo.8416228

No new sequence data was generated in this study. The data used is available at GitHub:

- https://github.com/nanopore-wgs-consortium/NA12878/blob/master/Genome.md

- https://github.com/human-pangenomics/HG002_Data_Freeze_v1.0


Articles from PeerJ are provided here courtesy of PeerJ, Inc

RESOURCES