RawHash: enabling fast and accurate real-time analysis of raw nanopore signals for large genomes

Can Firtina; Nika Mansouri Ghiasi; Joel Lindegger; Gagandeep Singh; Meryem Banu Cavlak; Haiyu Mao; Onur Mutlu

doi:10.1093/bioinformatics/btad272

Bioinformatics. 2023 Jun 30;39(Suppl 1):i297–i307. doi: 10.1093/bioinformatics/btad272

PMCID: PMC10311405 PMID: 37387139

Abstract

Summary: Nanopore sequencers generate electrical raw signals in real-time while sequencing long genomic strands. These raw signals can be analyzed as they are generated, providing an opportunity for real-time genome analysis. An important feature of nanopore sequencing, Read Until, can eject strands from sequencers without fully sequencing them, which provides opportunities to computationally reduce the sequencing time and cost. However, existing works utilizing Read Until either (i) require powerful computational resources that may not be available for portable sequencers or (ii) lack scalability for large genomes, rendering them inaccurate or ineffective. We propose RawHash, the first mechanism that can accurately and efficiently perform real-time analysis of nanopore raw signals for large genomes using a hash-based similarity search. To enable this, RawHash ensures the signals corresponding to the same DNA content lead to the same hash value, regardless of the slight variations in these signals. RawHash achieves an accurate hash-based similarity search via an effective quantization of the raw signals such that signals corresponding to the same DNA content have the same quantized value and, subsequently, the same hash value. We evaluate RawHash on three applications: (i) read mapping, (ii) relative abundance estimation, and (iii) contamination analysis. Our evaluations show that RawHash is the only tool that can provide high accuracy and high throughput for analyzing large genomes in real-time. When compared to the state-of-the-art techniques, UNCALLED and Sigmap, RawHash provides (i) $25.8 \times$ and $3.4 \times$ better average throughput and (ii) significantly better accuracy for large genomes, respectively. Source code is available at https://github.com/CMU-SAFARI/RawHash.

1 Introduction

High-throughput sequencing (HTS) devices can generate a large amount of genomic data at a relatively low cost. HTS can be used to analyze a wide range of samples, from small amounts of DNA or RNA to entire genomes. Oxford Nanopore Technologies (ONT) is one of the most widely used HTS technologies that can sequence long genomic regions, called reads, with up to a few million bases. ONT devices use the nanopore sequencing technique, which involves passing a single DNA or RNA strand through a tiny pore (called a nanopore or channel) at an average speed of 450 bases per second (Kovaka et al. 2021) and measuring the electrical current as the strand passes through. Nanopore sequencing enables two key features. First, nanopores provide the electrical raw signals in real-time as the DNA strand passes through a nanopore. Second, nanopore sequencing provides a functionality, known as Read Until (Loose et al. 2016), that can eject a DNA strand from a nanopore after sequencing only part of it. These two features of nanopores provide opportunities for (i) real-time genome analysis and (ii) significantly reducing sequencing time and cost.

Real-time analysis of nanopore raw signals using Read Until can reduce the sequencing time and cost per read by terminating the sequencing of a read whenever sequencing the full read is not necessary. The freed-up nanopore can then be used to sequence a different read. A purely computational mechanism can send a signal to eject a read from a nanopore by reversing the voltage if the partial sequencing of a read meets certain conditions for particular genome analysis, such as (i) reaching a desired coverage for a species in a sample (Payne et al. 2021) or (ii) identifying that a read does not originate from a certain genome of interest (i.e. a target region; Kovaka et al. 2021; Zhang et al. 2021) and hence, does not need to be fully sequenced. By terminating the sequencing of reads that do not correspond to the target region, the sequencer can spend time and resources on higher coverage sequencing of the reads that correspond to the target. This process is referred to as nanopore adaptive sampling. By providing high coverage at target regions and avoiding unessential sequencing of reads outside those regions, this approach can improve the quality of sequencing and the downstream analysis utilizing the obtained data.

To effectively utilize adaptive sampling in nanopore sequencing, it is crucial to have computational methods that can accurately analyze the raw output signals from nanopores in real-time. These methods must provide (i) low latency and (ii) throughput matching or exceeding that of the sequencer (Dunn et al. 2021; Kovaka et al. 2021; Zhang et al. 2021). Several works propose adaptive sampling methods for real-time analysis of raw nanopore signals (Edwards et al. 2019; Bao et al. 2021; Dunn et al. 2021; Kovaka et al. 2021; Payne et al. 2021; Zhang et al. 2021; Shih et al. 2022; Ulrich et al. 2022; Sadasivan et al. 2023; Senanayake et al. 2023). However, these works have three key limitations. First, most techniques rely on powerful computational resources, such as GPUs (Bao et al. 2021; Payne et al. 2021), or specialized hardware (Dunn et al. 2021; Shih et al. 2022) due to the use of computationally intensive algorithms such as basecalling, as we explain in detail in Supplementary Section S1. This can make real-time genome analysis challenging for portable and low-cost nanopore-based sequencers, such as the ONT Flongle or MinION, which are not typically equipped with such resources, and thus limits the use of these techniques in resource-constrained environments. Second, the sheer size of genomic data at the scale of large genomes (e.g. human genome) makes it challenging to process the data in real-time. This is because such large genomes require efficient and accurate similarity identification across a large number of regions. This renders many current methods (Kovaka et al. 2021; Zhang et al. 2021) inaccurate or ineffective for large genomes as they either cannot provide accurate results or cannot match the throughput of nanopores for these genomes. Third, machine learning models used in past works (Edwards et al. 2019; Payne et al. 2021; Bao et al. 2021; Ulrich et al. 2022; Senanayake et al. 2023) to analyze raw nanopore signals often require retraining or reconfiguring the model to improve accuracy for a certain experiment, which can be a barrier to flexibly and easily performing real-time analysis. To our knowledge, there is no work that can efficiently and accurately perform real-time analysis of raw nanopore signals on a large scale (e.g. whole-genome analysis for human) without requiring powerful computational resources, and that can easily and flexibly be applied to a wide range of applications that could benefit from real-time nanopore raw signal analysis.

Our goal is to enable efficient and accurate real-time genome analysis for large genomes. To this end, we propose RawHash, the first mechanism that can efficiently and accurately perform real-time analysis of raw nanopore signals for large genomes in resource-constrained environments. Unlike all the past works, RawHash is the only mechanism that can efficiently scale to large genomes and perform accurate real-time genomic analysis without requiring computationally intensive algorithms such as basecalling. Our key idea is to encode regions of the raw nanopore signal into hash values such that similar signal regions can efficiently be identified by matching their hash values. However, enabling accurate hashing-based similarity identification in the raw signal domain is challenging because raw signals corresponding to the same DNA content are unlikely to have exactly the same signal amplitudes. This is because the raw signals generated by nanopores can vary each time the same DNA fragment is sequenced due to several factors impacting nanopores during sequencing, such as variations in the properties of the nanopores or the conditions in which the sequencing is performed (David et al. 2017). Although the similarity identification of raw signals is possible via calculating the Euclidean distance between a sequence of signals in a multi-dimensional space (Zhang et al. 2021), such an approach can become impractical when dealing with larger sequences as the number of dimensions increases with the length of the sequences. This increase in dimensionality raises the computational cost and leads to the curse of dimensionality, making the approach expensive for large regions.

To address these challenges, RawHash provides three key mechanisms for efficient signal encoding and similarity identification. First, RawHash encodes signal values that have a wider range of values into a smaller set of values using a quantization technique, such that signal values within a certain range are assigned to the same encoded value. This helps to alleviate the probability of having varying signal values for the same DNA content and enables RawHash to directly match these values using a hashing technique. Second, RawHash concatenates the quantized values of multiple consecutive signals and generates a single hash value for them. The hashing mechanism enables RawHash to efficiently identify similar signal regions of these consecutive signal values by directly matching their corresponding hash values. Representing many consecutive signals with a single hash value increases the size of the regions examined during similarity identification without suffering from the curse of dimensionality. Using larger regions can substantially reduce the number of possible matching regions that need to be examined. RawHash is the first work that can accurately use hash values in the raw signal domain, which enables using efficient data structures commonly used in the sequence domain (e.g. hash tables in minimap2; Li 2018). Third, RawHash uses an existing algorithm, known as chaining (Li 2018), to find the colinear matches of hash values between signals to identify similar signal regions. These efficient and accurate mechanisms enable RawHash to perform real-time genome analysis for large genomes.

While our proposed three key mechanisms have the potential to be used for various purposes in raw signal similarity identification, we design RawHash as a tool for mapping nanopore raw signals to their corresponding reference genomes in real-time. RawHash performs the mapping in two steps: (i) indexing and (ii) mapping. First, in the indexing step, RawHash (i) converts the reference genome sequence into expected signal values by simulating the expected behavior of nanopores based on a previously known model, (ii) generates the hash values from these signals, and (iii) stores the hash values in a hash table for efficient matching. Second, in the mapping step, RawHash (i) generates the hash values from the raw signals in a streaming fashion, (ii) queries the hash table from the indexing step with these hash values to find the matching regions in the reference genome with the same hash value, and (iii) performs chaining to find the similar region between the reference genome and the raw signal of a read.

RawHash can utilize the unique functionalities of nanopore sequencing to reduce the sequencing time and cost in two ways. First, to avoid redundant sequencing and processing of each read, RawHash can use Read Until to eject a read before it is fully sequenced if RawHash identifies that the sequenced portion of the read can already be mapped to a reference genome. Second, to perform a cost- and time-efficient relative abundance estimation, RawHash can utilize Run Until to stop the entire sequencing run after sequencing a number of reads that is sufficient to make an accurate relative abundance estimation. We refer to such usage during abundance estimation as Sequence Until. Avoiding the redundant sequencing of further reads that are unlikely to substantially change the relative abundance estimation has the potential to significantly reduce the sequencing time and cost. To utilize Sequence Until, RawHash integrates a confidence calculation mechanism that evaluates the relative abundance estimations in real-time and, via Run Until, stops the entire sequencing run if using more reads does not change its estimation, which can enable the better utilization of nanopores. We find that Sequence Until can be applied to other mechanisms (e.g. UNCALLED) that can perform real-time relative abundance estimations. Prior work (Weilguny et al. 2023) proposes a technique to terminate the sequencing process when species in the sample reach a certain coverage depth. The key difference of Sequence Until is that it reduces the cost of sequencing for relative abundance estimation and is based on our adaptive, accurate, and low-cost confidence calculation during real-time abundance estimation.

We evaluate RawHash on three important applications that can benefit from real-time genome analysis: (i) read mapping, (ii) relative abundance estimation, and (iii) contamination analysis. We compare RawHash with the state-of-the-art approaches, UNCALLED and Sigmap, which can be used with nanopore sequencers that may not be equipped with GPUs, such as the MinION devices. We evaluate RawHash, UNCALLED, and Sigmap in terms of their performance, accuracy, and their estimated benefits in reducing the sequencing time and cost.

This article provides the following key contributions and major results:

We propose RawHash, the first mechanism that can efficiently and accurately find the similarities between raw nanopore signals and a reference genome for large genomes without requiring powerful computational resources such as GPUs.
We propose the first sampling mechanism that can stop the entire sequencing run for certain applications when an accurate decision can be made without sequencing the entire sample, which we call Sequence Until.
We extensively evaluate RawHash by comparing it with state-of-the-art approaches, UNCALLED and Sigmap, on various datasets ranging from small genomes (i.e. genomes with up to 100 million bases) to large genomes (e.g. human genome). Our results show that RawHash provides (i) comparable accuracy to UNCALLED and Sigmap for small genomes and (ii) significantly better accuracy for large genomes than UNCALLED and Sigmap.
We show that Sigmap cannot perform real-time genome analysis for large genomes as it cannot match the throughput of nanopores.
We provide the open source implementation of RawHash and the complete set of scripts to reproduce the results shown in this paper at https://github.com/CMU-SAFARI/RawHash.

2 Methods

We propose RawHash, a mechanism that can efficiently and accurately identify similarities between raw nanopore signals of a read and a large reference genome in real-time (i.e. while the read is sequenced). The raw nanopore signal of each read is a series of electrical current measurements as a strand of DNA passes through a nanopore. The reference genome is a set of strings over the alphabet A, C, G, T. RawHash provides the mechanisms for generating hash values from both a raw nanopore signal and a reference genome such that similar regions between the two can be efficiently and accurately found by matching their hash values.

2.1 Overview

Figure 1 shows the overview of how RawHash identifies similarities between raw nanopore signals of a read and a reference genome in four steps. First, RawHash pre-processes both (i) the raw nanopore signal and (ii) the reference genome into values that are comparable to each other. For raw signals, RawHash segments the raw signal into non-overlapping regions such that each region is expected to contain the signal values that are generated from reading a fixed number k of DNA bases. Each such region is called an event (David et al. 2017). Each event is usually represented with a single value derived from the signal values in the segment. For the reference genome, RawHash translates each substring of length k (called a k-mer) into its expected event value based on the nanopore model.

The event values from the reference genome are not directly comparable to the event values from raw nanopore signals due to variability in the current measurements in nanopores generating slightly different event values for the same k-mer (David et al. 2017). To generate the same values from slightly different events that may contain the same k-mer information, the second step of RawHash quantizes the event values from a larger set of values into a smaller set. The quantization technique ensures that the event values within a certain range are likely to be assigned to the same quantized value such that the effect of signal variation is alleviated, i.e. the same k-mer is likely assigned the same quantized value.

Due to the nature of nanopores, each event usually represents a very small k-mer of length around k = 6 bases, depending on the nanopore model (Zhang et al. 2021). Such a short k-mer is likely to exist in a large number of locations in the reference genome, making it challenging to efficiently identify the correct one. To make the events more unique (i.e. such that they exist only in a small number of locations in the reference genome), the third step of RawHash combines multiple consecutive quantized events into a single hash value. These hash values can then be used to efficiently identify similar regions between raw signals and the reference genome by matching the hash values generated from their events using efficient data structures such as hash tables.
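The uniqueness argument above can be illustrated with a back-of-the-envelope calculation. The sketch below assumes uniformly random bases (an idealization, since real genomes are repetitive) and an approximate human genome size of 3.1 billion bases; the function names are hypothetical and are not part of RawHash. It uses the fact that n consecutive events, each covering k bases and shifted by one base, span k + n - 1 bases.

```python
# Rough estimate, ASSUMING uniformly random bases (an idealization; real
# genomes are repetitive, so true match counts are typically higher).

def span_in_bases(k: int, n: int) -> int:
    """Bases covered by n consecutive events when each event covers k bases
    and consecutive events are shifted by one base."""
    return k + n - 1

def expected_occurrences(genome_length: int, span: int) -> float:
    """Expected number of positions where one specific sequence of `span` bases occurs."""
    positions = genome_length - span + 1
    return positions / 4 ** span

GENOME_LENGTH = 3_100_000_000  # approximate human genome size (assumption)

for n in (1, 4, 7, 10):
    span = span_in_bases(6, n)
    estimate = expected_occurrences(GENOME_LENGTH, span)
    print(f"n={n:2d} span={span:2d} expected occurrences={estimate:.3g}")
```

Under these assumptions, a single 6-mer is expected at several hundred thousand positions, while a span of 15 or more bases is expected only a handful of times. Quantization merges similar events, so the real number of matches per hash value is likely higher than this estimate.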

Fourth, to map a raw nanopore signal of a read to a reference genome, RawHash uses a chaining algorithm (Li 2018; Zhang et al. 2021) that finds colinear matching hash values generated from regions that are close to each other both in the reference genome and the raw nanopore signal.

2.2 Event generation

Our goal is to translate a reference genome sequence and a raw nanopore signal into comparable values. To this end, RawHash converts (i) each k-mer of the reference genome and (ii) each segmented region of the raw signal into its corresponding event.

Sequence-to-event conversion: To convert a reference genome sequence into a form that can be compared with raw nanopore signals, RawHash converts the reference genome sequence into event values in three steps, as shown in Fig. 2.

Figure 2. — Converting sequences to event values based on the k-mer model of a nanopore.

First, RawHash extracts all k-mers from the reference genome sequence, where k depends on the nanopore. The k-mer model of a nanopore includes the information about the expected k-mer length of an event and the expected average event value for each k-mer based on certain variables affecting the signal outcome of the nanopore’s current measurements. ONT provides k-mer models for many nanopore versions, including the recent R10 and R10.4. These models can also be generated by users (Simpson et al. 2017).

Second, RawHash queries the k-mer model for each k-mer of the reference genome to convert k-mers into their expected event values. Although the k-mer model of a nanopore provides an extensive set of information for each possible k-mer, RawHash uses only the mean values of events that provide an average value for the signals in the same event since these mean values provide a sufficient level of meaningful information for comparison with the raw nanopore signals.

Third, RawHash normalizes the event values from the same reference genome sequence (e.g. entire chromosome sequence or a contig) by calculating the standard scores (i.e. z-scores) of these events. RawHash uses these normalized values as event values since the same normalization step is taken for raw signals to avoid certain variables that may affect the range of raw signal amplitudes during sequencing (Kovaka et al. 2021; Zhang et al. 2021).
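The three sequence-to-event steps can be sketched as follows. This is a minimal illustration, not RawHash's implementation: the function name, the toy model with k = 2, and its current levels are made up for the example, and the use of the population standard deviation for the z-score is an assumption.

```python
import statistics
from itertools import product

def sequence_to_events(sequence, kmer_model, k):
    """Convert a reference sequence into normalized expected event values."""
    # Steps 1 and 2: slide over all k-mers and look up their expected mean value.
    means = [kmer_model[sequence[i:i + k]] for i in range(len(sequence) - k + 1)]
    # Step 3: z-score normalization over the whole sequence.
    mu = statistics.fmean(means)
    sigma = statistics.pstdev(means)  # population std is an assumption
    return [(m - mu) / sigma if sigma > 0 else 0.0 for m in means]

# Toy k-mer model with k = 2 and made-up current levels (NOT a real ONT model).
TOY_K = 2
TOY_MODEL = {"".join(kmer): 80.0 + 5.0 * i
             for i, kmer in enumerate(product("ACGT", repeat=TOY_K))}

events = sequence_to_events("ACGTTGCA", TOY_MODEL, TOY_K)
print(events)
```

A sequence of length L yields L - k + 1 events, one per k-mer, and consecutive events differ by one base.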

Signal-to-event conversion: Our goal is to accurately convert the series of raw nanopore signals into a set of values where each value corresponds to a DNA sequence of fixed length k (i.e. a k-mer) and consecutive values differ by one base. To achieve this, RawHash converts the raw signals into their corresponding values in three steps, as shown in Fig. 3. First, to accurately identify the distinct regions in the raw signal that correspond to a certain k-mer from DNA, RawHash performs a segmentation step as described in a basecalling tool, Scrappie, and used by earlier works UNCALLED and Sigmap. The segmentation step aims to eliminate the factors that affect the speed of the DNA molecules passing through a nanopore, as the speed affects the number of signal measurements taken for a certain number of bases in DNA. To perform the segmentation step, RawHash identifies the boundaries in the signal where the signal value changes significantly compared to a certain number of previously measured signal values, which indicates a base change in the nanopore. Such boundaries are computed using a statistical test, known as Welch’s t-test (Ruxton 2006), over a rolling window of consecutive signals. RawHash performs this t-test for multiple windows of different lengths to avoid the variables that cause a change in the number of current measurements due to the varying speed of DNA through a nanopore, known as skip and stay errors (David et al. 2017). Signals that fall within the same segment (i.e. between the same measured boundaries) are usually called events since each event contains the signals from a reading of a fixed number of DNA bases (i.e. a k-mer).

Figure 3. — Detecting events from raw signals.

Second, since the number of signals that each event includes is not constant across different events due to the stay and skip errors, RawHash generates a single value for each event to quickly avoid these potential errors and other factors that cause variations from reading the same amount of DNA bases. To this end, RawHash measures the mean value of the signals that fall within the same segment and uses this mean value for an event.

Third, since the amplitudes of the signal measurements may significantly vary when reading k-mers at different times, RawHash normalizes the mean event values using the event values generated from the nanopore within the same certain time interval in a streaming fashion. Although this time interval parameter can be modified in our tool, the default configuration of RawHash processes the events of signals generated by the nanopore within one second. For normalization, RawHash uses the same z-score calculation that it uses for normalizing the event values generated from reference sequences as described earlier. RawHash uses these normalized values as event values when comparing with the event values from reference sequences.
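The signal-to-event steps can be sketched as below. This is a simplified illustration under stated assumptions, not the Scrappie or RawHash segmentation: it uses a single window length instead of multiple windows, a hypothetical threshold on the absolute Welch t statistic, a simple local-peak rule, and a z-score over all events of the toy signal rather than over a one-second interval.

```python
import math
import statistics

def welch_t(left, right):
    """Welch's t statistic for two samples with possibly unequal variances."""
    m1, m2 = statistics.fmean(left), statistics.fmean(right)
    v1, v2 = statistics.variance(left), statistics.variance(right)
    denom = math.sqrt(v1 / len(left) + v2 / len(right)) + 1e-9  # avoid division by zero
    return (m1 - m2) / denom

def find_boundaries(signal, window=4, threshold=4.0):
    """Positions where the signal level changes (single-window simplification)."""
    scores = [0.0] * len(signal)
    for i in range(window, len(signal) - window + 1):
        scores[i] = abs(welch_t(signal[i - window:i], signal[i:i + window]))
    # Keep only local peaks of the statistic that exceed the threshold.
    return [i for i in range(1, len(signal) - 1)
            if scores[i] > threshold
            and scores[i] >= scores[i - 1] and scores[i] > scores[i + 1]]

def signal_to_events(signal, window=4, threshold=4.0):
    """Segment the signal, take the mean of each segment, then z-score the means."""
    cuts = [0] + find_boundaries(signal, window, threshold) + [len(signal)]
    means = [statistics.fmean(signal[a:b]) for a, b in zip(cuts, cuts[1:])]
    mu, sigma = statistics.fmean(means), statistics.pstdev(means)
    return [(m - mu) / sigma if sigma > 0 else 0.0 for m in means]

# Toy signal: three current levels with small deterministic jitter (made-up values).
toy_signal = ([100.0 + 0.1 * (j % 3) for j in range(10)]
              + [120.0 + 0.1 * (j % 3) for j in range(10)]
              + [90.0 + 0.1 * (j % 3) for j in range(10)])

toy_boundaries = find_boundaries(toy_signal)
toy_events = signal_to_events(toy_signal)
print(toy_boundaries, toy_events)
```

Taking one mean per segment makes the event value independent of how many samples the segment contains, which is what makes the representation robust to changes in translocation speed.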

2.3 Quantization of events

Our goal is to avoid the effects of generating different event values when reading the same k-mer content from nanopores so that we can identify k-mer matches by directly matching events. Although the segmentation and normalization steps explained in Section 2.2 can avoid the potential sequencing errors, such as stay and skip errors and significant changes in the current readings at different times, these approaches still do not guarantee to generate exactly the same event values when reading the same k-mer content. This is because slight changes in the normalized event values may occur when reading the same DNA content due to the high sensitivity and stochasticity of nanopores (David et al. 2017). Thus, it is challenging to generate the same event value for the same k-mer content after the segmentation and normalization steps. Since these event values generated from reading the same k-mer content are expected to be close to each other (Zhang et al. 2021), we propose a quantization mechanism that encodes event values so that events with close mean values can have the same quantized value in two steps as shown in Fig. 4.

Figure 4. — Quantization of two event values.

First, to increase the probability of assigning the same value for similar event values, RawHash trims the least significant fractional part of mean values by using only the most significant Q bits of these mean event values from their binary format, which we represent as E[1, Q] for simplicity where E is the event value and E[1, Q] gives the most significant Q bits of E. We assume that the mean event values are represented by the standard single-precision floating-point format with the sign, exponent, and fraction bits. This enables RawHash to reduce the wide range of floating-point numbers into a smaller range without a significant loss of accuracy, such that event values closer to each other can be represented by the same value in the smaller range of values. We can perform this trimming technique without significant sensitivity loss because we observe that these normalized event values mostly use at most six digits from the fractional part of their values, leaving a large number of fractional bits uninformative.

Second, to avoid using redundant bits that may carry little or no information in the most significant Q bits of an event value, RawHash prunes p bits after the most significant two bits of E[1, Q] such that $2 + p < Q$ and the resulting quantized value is $E[1, 2] E[3 + p, Q]$. For simplicity, we show the quantized value of E as $E_{Q, p}$. By ignoring these p bits, we effectively pack Q bits into $Q - p$ bits without losing significant information from event values. We can perform such a pruning operation because we observe that the normalized event values are usually in the range $[-3, 3]$ such that these p bits provide little information in distinguishing different event values due to the small range of values. We note that these Q and p values are parameters to RawHash and can empirically be adjusted based on the required sensitivity and quantization efficiency. This quantization technique enables RawHash to assign the same quantized values for a pair of close event values, E and F, that may be generated from reading the same k-mer such that $E_{Q, p} = F_{Q, p}$ where $|E - F| < \epsilon$ and $\epsilon$ is small enough for two events to represent the same k-mer content. RawHash always uses the most significant two bits as these two bits consistently carry the most significant information of the normalized event values, including the sign bit.
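The two quantization steps can be sketched on the bit pattern of a single-precision float. This is a minimal illustration, not RawHash's code: the function names are hypothetical, and the values Q = 12 and p = 4 are chosen only for the example and are not claimed to be the tool's defaults.

```python
import struct

def float32_bits(x: float) -> int:
    """Bit pattern of x as an IEEE 754 single-precision float (32-bit integer)."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def quantize(event: float, Q: int = 12, p: int = 4) -> int:
    """Return E[1, 2] concatenated with E[3 + p, Q], packed into Q - p bits."""
    assert 2 + p < Q <= 32
    top_q = float32_bits(event) >> (32 - Q)  # E[1, Q]: most significant Q bits
    kept_low = Q - 2 - p                     # number of bits in E[3 + p, Q]
    head = top_q >> (Q - 2)                  # E[1, 2]: sign bit and top exponent bit
    tail = top_q & ((1 << kept_low) - 1)     # E[3 + p, Q]: p bits after the head are pruned
    return (head << kept_low) | tail

# Close event values share a quantized value; distant values do not.
print(quantize(1.01), quantize(1.05), quantize(1.9), quantize(-1.01))
```

With these example parameters, 1.01 and 1.05 fall into the same bucket while 1.9 and -1.01 do not. Two close values that straddle a bucket boundary still receive different quantized values, which is why the text describes matching as likely rather than guaranteed.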

2.4 Generating the Hash values

Our goal is to generate values for large regions of raw nanopore signals and reference sequences such that these values can be used to efficiently and accurately identify similarities between raw signals and a reference genome. To this end, RawHash generates hash values using quantized values of events in two steps, as shown in Fig. 5. First, to avoid finding a large number of matches, RawHash uses the quantized values of n consecutive events to pack them in $n \times (Q - p)$ bits while preserving the order information of these consecutive events. RawHash uses several consecutive events in a single hash value because matching a single event is likely to generate a larger number of matches for larger genomes as a single event usually corresponds to a k-mer of 6–9 bases depending on the nanopore model (David et al. 2017). It is essential to use several consecutive events to reduce the number of matching regions between raw signals and the reference genome by increasing the region that these consecutive events span.

Figure 5. — Generating a hash value from n consecutive quantized event values.

Second, to efficiently and accurately find matches between large regions of raw signals and a reference genome using a constrained space, RawHash uses a low collision hash function to generate a 32-bit hash value from $n \times (Q - p)$ bits of n consecutive quantized event values. Since $n \times (Q - p)$ can be larger than 32, using such a hash function is likely to increase the collision rate for dissimilar regions. To avoid inaccurate similarity identifications due to these incorrect collisions, RawHash requires several matches of hash values within close proximity for similarity identification, which we explain next.
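The packing and hashing steps can be sketched as follows. The text does not name the hash function, so this sketch substitutes BLAKE2b truncated to 4 bytes from Python's standard library as a stand-in; the function names and the example parameters (n = 4 events of 8 bits each) are assumptions made for illustration.

```python
import hashlib

def pack_events(quantized, bits_per_event):
    """Concatenate quantized events into one integer, preserving their order."""
    packed = 0
    for q in quantized:
        assert 0 <= q < (1 << bits_per_event)
        packed = (packed << bits_per_event) | q
    return packed

def hash32(packed, total_bits):
    """Map the packed bits to a 32-bit value (stand-in hash, not RawHash's)."""
    data = packed.to_bytes((total_bits + 7) // 8, "big")
    return int.from_bytes(hashlib.blake2b(data, digest_size=4).digest(), "big")

def seed_hashes(quantized_events, n, bits_per_event):
    """One 32-bit hash per window of n consecutive quantized events."""
    total_bits = n * bits_per_event
    return [hash32(pack_events(quantized_events[i:i + n], bits_per_event), total_bits)
            for i in range(len(quantized_events) - n + 1)]

hashes = seed_hashes([56, 57, 184, 60, 56, 57, 184, 60, 12], n=4, bits_per_event=8)
print(hashes)
```

Because the order of events is preserved during packing, two windows produce the same hash only if they contain the same quantized values in the same order, apart from rare 32-bit collisions.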

2.5 Seeding and mapping

To efficiently identify similarities, RawHash uses hash values generated from raw nanopore signals and the reference genome in two steps. First, RawHash efficiently identifies matching regions between raw nanopore signals and a reference genome by matching their hash values. These hash values used for matching are usually known as seeds. Matching seeds enable efficiently finding similar regions between raw nanopore signals and a reference genome. Second, RawHash uses the chaining algorithm proposed in Sigmap (Zhang et al. 2021) to identify the best colinear matching seeds that are close to each other in both the raw nanopore signal and the reference genome. The region that the best chain of seed matches covers is the mapping position that RawHash identifies as a similar region.

The chaining algorithm is useful for two reasons. First, the chaining algorithm can tolerate mismatches and indels as it allows including gaps between seed matches, which enables finding similar regions with many seed matches without requiring the entire region to match exactly, as shown in Supplementary Table S2. Second, incorrect seed matches due to collisions or our quantization mechanism that may generate the same quantized value for distinctly dissimilar events are likely to be filtered in the chaining step due to the difficulty of finding colinear seed matches in highly dissimilar regions. We note that we modify the original chaining algorithm in Sigmap by disabling the distance coefficient as RawHash does not calculate the distance between seed matches.
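The idea of colinear chaining can be illustrated with a small dynamic program. This is a deliberately simplified stand-in and not the Sigmap or minimap2 chaining algorithm: it scores a chain by its number of seed matches only, uses a hypothetical `max_gap` limit in both coordinates, and runs in quadratic time.

```python
def best_chain(anchors, max_gap=50):
    """Longest chain of (ref_pos, read_pos) seed matches that increase in both
    coordinates with gaps of at most max_gap (simplified scoring)."""
    anchors = sorted(anchors)
    if not anchors:
        return []
    score = [1] * len(anchors)   # best chain length ending at each anchor
    prev = [-1] * len(anchors)   # predecessor in that chain
    for i, (ri, qi) in enumerate(anchors):
        for j in range(i):
            rj, qj = anchors[j]
            colinear = 0 < ri - rj <= max_gap and 0 < qi - qj <= max_gap
            if colinear and score[j] + 1 > score[i]:
                score[i], prev[i] = score[j] + 1, j
    # Backtrack from the anchor with the highest score.
    i = max(range(len(anchors)), key=score.__getitem__)
    chain = []
    while i != -1:
        chain.append(anchors[i])
        i = prev[i]
    return chain[::-1]

# Four colinear matches plus two spurious ones (e.g. from hash collisions).
toy_anchors = [(100, 0), (103, 3), (500, 4), (107, 6), (110, 10), (20, 11)]
print(best_chain(toy_anchors))
```

In this toy input, the spurious matches at reference positions 500 and 20 cannot join the chain because they are not colinear with the other matches, while the gaps between the chained matches are tolerated.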

To efficiently map raw signals to a reference genome, RawHash provides efficient data structures. To this end, RawHash uses hash tables to store the hash values generated from reference genomes (i.e. the indexing step) and efficiently query the same hash table with the hash values generated from the raw signal as the read is sequenced from a nanopore to find positions in the reference genome with matching hash values. RawHash uses the events in chunks (i.e. collection of events generated within a certain time interval) to find seed matches and perform chaining in a streaming fashion such that the chaining computation from previous chunks (i.e. seed matches) is transferred to the next chunk if the mapping is unsuccessful for the current chunk.
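The indexing and chunk-wise querying described above can be sketched with a dictionary as the hash table. This is a minimal sketch under stated assumptions: the function names are hypothetical, and a simple count of accumulated seed matches stands in for the chaining score that decides whether a mapping is successful.

```python
from collections import defaultdict

def build_index(ref_hashes):
    """Indexing step: map each hash value to its positions in the reference."""
    index = defaultdict(list)
    for pos, h in enumerate(ref_hashes):
        index[h].append(pos)
    return index

def seed_matches(index, read_hashes, offset=0):
    """Mapping step: (ref_pos, read_pos) pairs for hashes found in the index."""
    return [(ref_pos, offset + i)
            for i, h in enumerate(read_hashes)
            for ref_pos in index.get(h, ())]

def map_in_chunks(index, chunks, min_matches=3):
    """Process chunks as they arrive, carrying seed matches over to the next
    chunk until the (stand-in) success criterion is met."""
    anchors, offset = [], 0
    for chunk in chunks:
        anchors.extend(seed_matches(index, chunk, offset))
        offset += len(chunk)
        if len(anchors) >= min_matches:  # stand-in for a chaining score threshold
            return anchors, offset       # the read could be ejected at this point
    return None, offset                  # no mapping found from the available chunks

toy_index = build_index([5, 7, 9, 7])
print(map_in_chunks(toy_index, [[1, 2], [5, 7], [9]]))
```

In the toy run, the first chunk yields no matches, and the decision is made after the second chunk, so the third chunk never needs to be processed.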

3 Results

3.1 Evaluation methodology

We implement RawHash as a tool for mapping raw nanopore signals to a reference genome. Similar to regular read mapping tools, RawHash has two steps to complete the mapping process: (i) indexing the reference genome and (ii) mapping raw signals. Although indexing is usually a one-time task that can be performed prior to the mapping step, the indexing of RawHash can be performed relatively quickly within a few minutes for large genomes (Supplementary Table S3). RawHash provides the mapping information using a standard pairwise mapping format. In our implementation, we provide an extensive set of parameters that allow configuring several options to fit RawHash for many other applications and nanopore models that we do not evaluate, such as details about the nanopore model (e.g. number of bases per second), the number of events that can be included in a single hash value, the range of bits to quantize, and seeding techniques such as minimizers and fuzzy seed matching. We also provide a default set of parameters that we empirically choose for each common application of real-time genome analysis. These default parameters are set to accurately and efficiently analyze (i) very small (e.g. viral) genomes, (ii) small and mid-sized genomes (i.e. genomes with less than a few hundred million bases), and (iii) large genomes (e.g. genomes with a few billion bases such as a human genome). We show the details regarding these parameter selections and the versions of tools in Supplementary Tables S5–S7.

We evaluate RawHash in terms of its performance, peak memory usage, accuracy, and estimated benefits in sequencing time and cost compared to two state-of-the-art tools, UNCALLED and Sigmap. For performance, we evaluate the throughput and overall runtime of each tool. We measure throughput as the number of bases a tool can process per second, which determines if the tool is at least as fast as the speed of DNA passing through a nanopore. For many nanopore models (e.g. R9.4), a DNA strand passes through a pore at around 450 bases per second (Kovaka et al. 2021; Zhang et al. 2021). It is essential to provide a throughput higher than the throughput of the nanopore to enable real-time genome analysis. To calculate the throughput, we use the tool that UNCALLED provides, UNCALLED pafstats, which measures the throughput of a tool from the number of bases that the tool processes and the time it takes to process those bases. Although it is theoretically not possible to exceed the throughput of a nanopore due to the speed of raw signal generation, UNCALLED pafstats ignores this limitation for comparison purposes. For overall runtime, we calculate CPU time and real-time using 32 threads. CPU time shows the overall amount of CPU seconds spent running a tool, while real-time shows the overall elapsed (i.e. wall clock) time. All of these tools support multi-threading, where multiple reads can be mapped simultaneously using a single thread for each read. For all of these tools, assigning a larger number of threads enables processing a larger number of reads in parallel, similar to the behavior of nanopore sequencers with hundreds to thousands of pores (i.e. channels). We note that the throughput and mapping time per read values are not affected by the thread counts as (i) these are measured per read and (ii) a single thread performs the mapping of a single read.
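The throughput comparison described above reduces to a simple ratio. The sketch below is a minimal illustration that assumes the approximate speed of 450 bases per second quoted in the text; the function names and the input numbers are hypothetical, and the code does not reproduce how UNCALLED pafstats computes its statistics internally.

```python
NANOPORE_BASES_PER_SEC = 450.0  # approximate R9.4 speed quoted in the text


def throughput(bases_processed, seconds):
    """Bases processed per second by a mapping tool."""
    return bases_processed / seconds


def relative_throughput(bases_processed, seconds,
                        pore_speed=NANOPORE_BASES_PER_SEC):
    """Tool throughput as a multiple of the nanopore speed (e.g. 10.0 means 10x)."""
    return throughput(bases_processed, seconds) / pore_speed


def is_real_time(bases_processed, seconds,
                 pore_speed=NANOPORE_BASES_PER_SEC):
    """A tool keeps up with the sequencer only if its ratio is at least 1."""
    return relative_throughput(bases_processed, seconds, pore_speed) >= 1.0


# Hypothetical measurement: 9000 bases processed in 2 seconds.
example_ratio = relative_throughput(9000, 2.0)
```

Under this definition, a ratio below 1 means that the tool falls behind the rate at which a pore produces signal, so it cannot make Read Until decisions in time.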

For accuracy, we evaluate the correctness of the mapping positions that each tool provides when compared to the ground truth mapping positions. To generate the ground truth mapping, we use a read mapping tool, minimap2 (Li 2018), to map the basecalled sequences of raw nanopore signals to their corresponding whole-genome references. We use UNCALLED pafstats to compare the mapping output of a tool with the ground truth mapping to find the number of true positives or TP (i.e. correct mappings), false positives or FP (i.e. incorrect mappings), and false negatives or FN (i.e. unmapped reads that are mapped in ground truth). Correct and incorrect mappings are identified based on the distance of the mapping positions between ground truth and the tool. To evaluate the accuracy, we calculate the precision ( $P = T P / (T P + F P)$ ), recall ( $R = T P / (T P + F N)$ ) and the $F_{1}$ ( $F_{1} = 2 \times (P \times R) / (P + R)$ ) values.
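The three accuracy metrics defined above can be computed directly from the TP, FP, and FN counts. The following sketch is a minimal illustration; the counts in the example are hypothetical, and the zero-denominator guards are an added assumption that is not described in the text.

```python
def precision(tp, fp):
    """P = TP / (TP + FP); returns 0.0 when nothing is mapped."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """R = TP / (TP + FN); returns 0.0 when there are no ground truth mappings."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """F1 = 2 * (P * R) / (P + R); returns 0.0 when both P and R are zero."""
    return 2 * (p * r) / (p + r) if (p + r) else 0.0


# Hypothetical counts: 90 correct, 10 incorrect, and 30 missed mappings.
p = precision(90, 10)
r = recall(90, 30)
f1 = f1_score(p, r)
```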

For estimating the benefits in sequencing time and cost of each tool, we calculate the average length of sequenced bases per read when using UNCALLED and RawHash and the average number of sequenced chunks of signals per read for Sigmap and RawHash. We compare RawHash with Sigmap in terms of the number of chunks because Sigmap does not provide the number of bases when a read is unmapped, while both tools provide the number of chunks used when a read is mapped or unmapped. These chunks include a portion of the signal produced by a nanopore within a certain time interval, which is by default set as one second of data for both RawHash and Sigmap. The average length of bases and the number of chunks indicate how quickly each tool can make a mapping decision to activate Read Until before sequencing the remaining portion of a read, which in turn indicates the potential savings in overall sequencing time and cost.

We evaluate RawHash, UNCALLED, and Sigmap for three applications: (i) read mapping, (ii) relative abundance estimation, and (iii) contamination analysis. Read mapping aims to map the raw signals to their corresponding reference genomes. Relative abundance estimation measures the abundance of each genome relative to other genomes in the same sample by mapping raw signals to a given set of reference genomes. Contamination analysis aims to identify if a sample is contaminated with a certain genome (e.g. a viral genome) by mapping raw signals to the reference genome that the sample may be contaminated with. For each tool, we use its default parameter settings in our evaluation.

To evaluate each of these applications, we use real datasets that we list in Table 1. These datasets include both raw nanopore signals in the FAST5 format and their corresponding basecalled sequences in the FASTA format. We note that RawHash can also use POD5 files. For relative abundance estimation, we create a mock community using all the read sets from datasets D1 to D5, and the reference genome is the combination of reference genomes used in these datasets. We slightly modify the reference genome we use in the relative abundance estimation such that the sequence IDs in the reference genome provide additional information about the species (e.g. taxonomy IDs) to enable calculating relative abundance in real-time. For contamination analysis, we combine the SARS-CoV-2 read sets (D1) with human read sets (D5) to identify if the combined sample is contaminated with the SARS-CoV-2 sample by mapping raw signals in the combined set to the SARS-CoV-2 reference genome. For all evaluations, we use the AMD EPYC 7742 processor at 2.26 GHz to run the tools.

Table 1.

Details of datasets used in our evaluation.

	Organism	Reads (#)	Bases (#)	SRA accession	Reference genome	Genome size
Read mapping
D1	SARS-CoV-2	1 382 016	594M^b	CADDE Centre	GCF_009858895.2	29 903
D2	E.coli	353 317	2365M	ERR9127551	GCA_000007445.1	5M
D3	Yeast	49 989	380M	SRR8648503	GCA_000146045.2	12M
D4	Green algae	29 933	609M	ERR3237140	GCF_000002595.2	111M
D5	Human HG001	269 507	1584M	FAB42260 Nanopore WGS	T2T-CHM13 (v2)	3117M
Relative abundance estimation
	D1–D5^a	2 084 762	5531M	D1–D5	D1–D5	3246M
Contamination analysis
	D1, D5	1 651 523	2178M	D1, D5	D1	29 903

^a Dataset numbers (e.g. D1–D5) show the combined datasets.

^b Base counts in millions (M).

Evaluating Sequence Until: Our goal is to avoid redundant sequencing to reduce sequencing time and cost for relative abundance estimation. We find that the Run Until mechanism can be utilized to fully stop the sequencing run when the real-time relative abundance estimation reaches a certain confidence level to achieve accurate estimations, which we call Sequence Until. While a similar mechanism is evaluated to enrich the coverage depth of low-abundance species (Weilguny et al. 2023) using Read Until, we evaluate the potential benefits of Run Until for low-cost relative abundance estimations. We integrate a real-time confidence calculation mechanism in RawHash to activate the Sequence Until mechanism in three steps. First, RawHash updates the relative abundance estimation in real-time after every n reads that can be mapped to a reference genome. Second, to identify if the recently mapped reads provide substantial changes in the abundance estimations, RawHash performs a cross-correlation calculation between the last w estimations. Cross-correlation can identify an outlier in a set of estimations, that is, an estimation that is substantially different from the others. The presence of an outlier indicates that recent reads can still change the relative abundance estimation and that more reads should be sequenced from the sample. Third, RawHash activates Sequence Until by fully stopping the sequencing using Run Until when there are no outliers in the last w estimations, which indicates a convergence to a certain relative abundance estimation such that further sequencing is unlikely to change this estimation. RawHash provides a set of parameters to configure Sequence Until.
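The three steps above can be sketched as a small streaming loop. This is a simplified illustration under stated assumptions, not RawHash's implementation: it replaces the cross-correlation-based outlier detection with a simpler test (the Euclidean distance of each of the last w estimations to their mean must not exceed a hypothetical `threshold`), and the function names and the input format (one species label per mapped read) are invented for the example.

```python
import math
from collections import Counter, deque


def abundance(counts, species):
    """Relative abundance of each species among the reads mapped so far."""
    total = sum(counts.values())
    if total == 0:
        return [0.0 for _ in species]
    return [counts[s] / total for s in species]


def has_converged(window, threshold):
    """Assumed outlier test: no estimation lies farther than `threshold`
    from the mean of the estimations in the window."""
    dims = len(window[0])
    mean = [sum(est[i] for est in window) / len(window) for i in range(dims)]
    return all(math.dist(est, mean) <= threshold for est in window)


def sequence_until(mapped_species, species, n, w, threshold):
    """Return (mapped reads used, final estimation).

    Step 1: re-estimate the abundance after every n mapped reads.
    Step 2: check the last w estimations for outliers.
    Step 3: stop as soon as the last w estimations contain no outlier.
    """
    counts = Counter()
    window = deque(maxlen=w)
    used = 0
    for used, label in enumerate(mapped_species, start=1):
        counts[label] += 1
        if used % n == 0:
            window.append(abundance(counts, species))
            if len(window) == w and has_converged(window, threshold):
                break  # in a real run, Run Until would be triggered here
    return used, abundance(counts, species)


# Hypothetical stream of mapped reads from two species in equal proportion.
stream = ["a", "b"] * 500
reads_used, estimate = sequence_until(stream, ["a", "b"], n=10, w=5, threshold=0.01)
```

In this toy stream the estimation is identical after every 10 reads, so the loop stops as soon as the window holds 5 estimations, that is, after 50 of the 1000 reads.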

We evaluate the benefits of Sequence Until by comparing (i) RawHash without Sequence Until and (ii) RawHash with Sequence Until in terms of (i) the difference in the relative abundance estimations and (ii) the estimated benefits in sequencing time and cost. To evaluate Sequence Until in a realistic sequencing environment where reads from different species can be sequenced in a random order, we randomly shuffle the reads in the relative abundance dataset and generate a set of 50 000 reads with a random order of species so that we can simulate this random behavior. We also find that Sequence Until can be applied to other mechanisms. To evaluate the potential benefits of Sequence Until, we simulate the benefits when using UNCALLED with Sequence Until and compare it with RawHash.

3.2 Performance and peak memory

Figure 6 shows the throughput of regular nanopores that we use as a baseline and the throughput of the tools when mapping raw nanopore signals to each dataset for read mapping, contamination analysis, and relative abundance estimation. Supplementary Fig. S1 and Supplementary Tables S3 and S4 show the mapping time per read, and the computational resources required for indexing and mapping, respectively. We make six key observations. First, RawHash and UNCALLED are the only tools that can perform real-time genome analysis for large genomes, as they can provide higher throughputs than nanopores for all datasets. Sigmap cannot perform real-time genome analysis for large genomes as it can provide $0.7 \times$ and $0.6 \times$ throughput of a nanopore for human genome mapping and relative abundance estimations, respectively. RawHash can achieve high throughput as its seeding mechanism is based on efficiently matching hash values compared to the costly distance calculations that Sigmap performs for matching seeds, which shows poor scalability for larger genomes. Second, the throughput of UNCALLED is not affected by the genome size as it provides a near-constant throughput of around $16 \times$ for all applications. This is because UNCALLED uses FM-index (Ferragina and Manzini 2000) and a branching algorithm that provides robust scaling with respect to the reference genome size (Kovaka et al. 2021). Third, the throughput of RawHash decreases with larger genomes as the seeding and chaining steps start taking up a larger fraction of the entire runtime of RawHash as shown in Supplementary Table S1. Fourth, RawHash provides an average throughput $25.8 \times$ and $3.4 \times$ better than UNCALLED and Sigmap, while providing an average mapping speedup of $32.1 \times$ and $2.1 \times$ per read, respectively. 
Higher throughput with faster mapping times suggests that the mapping time improvements of RawHash are mainly due to its computational efficiency rather than the ability to sequence shorter prefixes of reads than UNCALLED and Sigmap. Fifth, for indexing, Sigmap usually requires a larger amount of computational resources in terms of both runtime and peak memory usage. Sixth, for mapping, UNCALLED is the most efficient tool in terms of the peak memory usage as it requires at most 10 GB of peak memory while (i) RawHash requires <12 GB of memory for almost all the datasets and (ii) Sigmap requires significantly larger memory space than both tools. RawHash has a larger memory footprint, $\sim 52$ GB, than UNCALLED for large genomes. Although such large memory requirements for larger genomes can lead to challenges in using RawHash for mobile devices with limited computational resources, such a requirement can be mitigated by using more efficient seeding techniques such as minimizers, which we leave as future work. We conclude that RawHash provides significant benefits in improving the throughput and performance for the real-time analysis of large genomes while matching the throughput of nanopores.

3.3 Accuracy

Table 2 shows the accuracy results of tools for each dataset and application. We make four key observations. First, RawHash provides the best accuracy in terms of precision, recall, and $F_{1}$ values compared to UNCALLED and Sigmap when mapping reads to large genomes (i.e. the human genome and the relative abundance estimation). RawHash can efficiently match several events using hash values, which is specifically beneficial in reducing the number of matching regions in large genomes and increasing the specificity due to finding longer matches compared to UNCALLED and Sigmap.

Table 2.

Mapping accuracy.

Dataset		UNCALLED	Sigmap	RawHash
Read mapping
D1	Precision	0.9547	0.9929 ^a	0.9868
SARS-CoV-2	Recall	0.9910	0.5540	0.8735
	$F_{1}$	0.9725	0.7112	0.9267
D2	Precision	0.9816	0.9842	0.9573
E.coli	Recall	0.9647	0.9504	0.9009
	$F_{1}$	0.9731	0.9670	0.9282
D3	Precision	0.9459	0.9856	0.9862
Yeast	Recall	0.9366	0.9123	0.8412
	$F_{1}$	0.9412	0.9475	0.9079
D4	Precision	0.8836	0.9741	0.9691
Green algae	Recall	0.7778	0.8987	0.7015
	$F_{1}$	0.8273	0.9349	0.8139
D5	Precision	0.4867	0.4287	0.8959
Human HG001	Recall	0.2379	0.2641	0.4054
	$F_{1}$	0.3196	0.3268	0.5582
Relative abundance estimation
	Precision	0.7683	0.7928	0.9484
D1–D5	Recall	0.1273	0.2739	0.3076
	$F_{1}$	0.2184	0.4072	0.4645
Contamination analysis
	Precision	0.9378	0.7856	0.8733
D1, D5	Recall	0.9910	0.5540	0.8735
	$F_{1}$	0.9637	0.6498	0.8734

^a Best results are highlighted with bold text.

Second, RawHash and UNCALLED can accurately perform contamination analysis while Sigmap suffers from significantly lower precision and recall values. Due to the nature of a contamination analysis, it is essential to correctly eliminate the genomes other than the contaminating genome (precision) without missing the correct mappings of reads from the contaminating genome (recall). Unfortunately, Sigmap cannot provide high values in either of these categories, making it unreliable for contamination detection.

Third, the precision of RawHash does not drop with the increased length in the reference genome due to the benefits of finding long matches, which provides a higher confidence in read mapping.

Fourth, although RawHash does not provide the best accuracy when mapping reads to genomes smaller than the human genome, its accuracy is on par with UNCALLED and Sigmap for these genomes. UNCALLED and Sigmap can achieve high recall values as their mechanisms are best optimized for accurately handling matches in relatively smaller genomes with fewer repeats and ambiguous mappings (Kovaka et al. 2021; Zhang et al. 2021). We conclude that RawHash is the only tool that can accurately scale to performing real-time genome analysis for large genomes, especially with significantly high precision rates.

Relative abundance estimations: Table 3 shows the relative abundance estimations that each tool makes and the Euclidean distance of their estimation to the ground truth estimation. We make two key observations. First, we find that RawHash provides the most accurate relative abundance estimations in terms of the estimation distance to the ground truth compared to UNCALLED and Sigmap. This observation correlates with the accuracy results we show in Table 2, where RawHash provides the best overall accuracy for relative abundance estimation, which results in generating the most accurate relative abundance estimations. Second, although Sigmap cannot perform real-time relative abundance estimation due to its throughput being lower than that of a nanopore (Fig. 6), Sigmap provides accurate estimations that are on par with RawHash. This observation shows that while Sigmap provides mappings with more incorrect positions due to lower precision than RawHash (Table 2), these reads with incorrect mapping positions are mostly mapped to their correct species. We conclude that RawHash is the only tool that can accurately be applied to analyze relative abundance estimations while matching the throughput of nanopores at a large scale based on the prior knowledge of the set of reference genomes to map the reads.

Table 3.

Relative abundance estimations.

	Estimated relative abundance ratios
Tool	SARS-CoV-2	E.coli	Yeast	Green algae	Human	Distance
Ground Truth	0.0929	0.4365	0.0698	0.1179	0.2828	N/A
UNCALLED	0.0026	0.5884	0.0615	0.1313	0.2161	0.1895
Sigmap	0.0419	0.4191	0.1038	0.0962	0.3390	0.0877
RawHash	0.1249	0.4701	0.0957	0.0629	0.2464	0.0847 ^a

^a Best results are highlighted with bold text.
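The Distance column of Table 3 can be recomputed from the abundance ratios in the same table. The short sketch below does this with the Euclidean distance named in the text; the variable and function names are chosen for illustration only.

```python
import math

# Abundance ratios copied from Table 3, in the order
# SARS-CoV-2, E.coli, Yeast, Green algae, Human.
ground_truth = [0.0929, 0.4365, 0.0698, 0.1179, 0.2828]
estimates = {
    "UNCALLED": [0.0026, 0.5884, 0.0615, 0.1313, 0.2161],
    "Sigmap": [0.0419, 0.4191, 0.1038, 0.0962, 0.3390],
    "RawHash": [0.1249, 0.4701, 0.0957, 0.0629, 0.2464],
}


def euclidean_distance(a, b):
    """Square root of the summed squared differences between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


distances = {
    tool: round(euclidean_distance(ratios, ground_truth), 4)
    for tool, ratios in estimates.items()
}
# distances == {"UNCALLED": 0.1895, "Sigmap": 0.0877, "RawHash": 0.0847}
```

The rounded values agree with the Distance column of Table 3.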

3.4 Sequencing time and cost

Our goal is to estimate the benefits that each tool provides in reducing the sequencing time and cost. To this end, we measure the average length of sequenced bases and the average number of sequenced chunks per read as shown in Table 4. We make two key observations. First, RawHash provides significant benefits in reducing the sequencing time and cost for large genomes (e.g. Green Algae and Human) compared to UNCALLED, as RawHash can complete the mapping process per read by using smaller prefixes of reads. Second, RawHash uses on average $1.58 \times$ more chunks compared to Sigmap when mapping reads, which can proportionally lead to worse sequencing time and cost for RawHash compared to Sigmap. We conclude that although UNCALLED and Sigmap provide better advantages in reducing sequencing time and cost for smaller genomes, RawHash can provide significant reductions in sequencing time and cost for larger genomes compared to UNCALLED.

Table 4.

The average sequenced length of bases and the number of chunks.

Tool	SARS-CoV-2	E.coli	Yeast	Green algae	Human
Average sequenced base length per read
UNCALLED	184.51 ^a	580.52	1233.20	5300.15	6060.23
RawHash	513.95	1376.14	2565.09	4760.59	4773.58
Average sequenced number of chunks per read
Sigmap	1.01	2.11	4.14	5.76	10.40
RawHash	1.24	3.20	5.83	10.72	10.70

^a Best results are highlighted with bold text.

3.5 Benefits of Sequence Until

Simulated Sequence Until: Our goal is to estimate the benefits of implementing the Sequence Until mechanism in UNCALLED and compare it with RawHash when they both use Sequence Until under the same conditions. To this end, we use shuf in Linux to randomly shuffle the mapping files that both RawHash and UNCALLED generate for relative abundance and extract a certain portion of the randomly shuffled file to identify their relative abundance estimations after 0.01%, 0.1%, 1%, 10%, and 25% of the overall reads in the sample are randomly sequenced from nanopores.
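The simulation above can be approximated in a few lines of Python. This sketch assumes that the mapping output has already been reduced to one species label per mapped read, which is a hypothetical simplification of the mapping files; it uses `random.Random.shuffle` in place of `shuf`, and the function name and the fixed seed are illustrative choices.

```python
import random
from collections import Counter


def subsample_abundance(mapped_species, fraction, seed=0):
    """Shuffle the mapped reads, keep the first `fraction` of them, and
    return the relative abundance of each species in the kept portion."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    shuffled = list(mapped_species)
    rng.shuffle(shuffled)
    keep = max(1, int(len(shuffled) * fraction))
    counts = Counter(shuffled[:keep])
    return {species: count / keep for species, count in counts.items()}


# Hypothetical mapping output: 60 reads mapped to one species, 40 to another.
mapped = ["ecoli"] * 60 + ["human"] * 40
full = subsample_abundance(mapped, 1.0)
partial = subsample_abundance(mapped, 0.25)
```

Comparing `partial` with `full` using the Euclidean distance corresponds to the comparison between a subsampled estimation and the estimation from the entire read set.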

Table 5 shows the distance of relative abundance estimations after a certain portion of the reads is randomly sequenced from nanopores. We make two key observations. First, both RawHash and UNCALLED can significantly benefit from Sequence Until by stopping sequencing after processing a smaller portion of the entire sample since their estimations using smaller portions are close to those using the entire set of reads (Table 3) in terms of their distance to the ground truth. This suggests that many other tools can benefit from Sequence Until as the accuracy of their relative abundance estimations may not significantly change while providing opportunities for reducing the sequencing time and cost up to a certain threshold based on the tool.

Table 5.

Relative abundance with simulated Sequence Until.

	Estimated relative abundance ratios
Tool	SARS-CoV-2	E.coli	Yeast	Green algae	Human	Distance
Ground Truth	0.0929	0.4365	0.0698	0.1179	0.2828	N/A
UNCALLED (25%)^a	0.0026	0.5890	0.0613	0.1332	0.2139	0.1910
RawHash (25%)	0.0271	0.4853	0.0920	0.0786	0.3170	0.0995 ^b
UNCALLED (10%)	0.0026	0.5906	0.0611	0.1316	0.2141	0.1920
RawHash (10%)	0.0273	0.4869	0.0963	0.0772	0.3124	0.1004
UNCALLED (1%)	0.0026	0.5750	0.0616	0.1506	0.2103	0.1836
RawHash (1%)	0.0259	0.4783	0.0987	0.0882	0.3088	0.0928
UNCALLED (0.1%)	0.0040	0.4565	0.0380	0.1910	0.3105	0.1242
RawHash (0.1%)	0.0212	0.5045	0.1120	0.0810	0.2814	0.1136
UNCALLED (0.01%)	0.0000	0.5551	0.0000	0.0000	0.4449	0.2602
RawHash (0.01%)	0.0906	0.6122	0.0000	0.0000	0.2972	0.2232

^a Percentages show the portion of the overall reads used.

^b Best results are highlighted with bold text.

Second, RawHash can provide more accurate relative abundance estimations when using only 0.1% of the reads than the estimation that UNCALLED provides using the entire set of reads (Table 3). We conclude that Sequence Until provides significant opportunities in reducing sequencing time and cost, while more accurate tools such as RawHash can benefit further from Sequence Until by using smaller portions of the entire read set than the portions that less accurate tools would need to achieve similar accuracy.

Sequence Until with RawHash: Our goal is to evaluate Sequence Until when used in real-time with RawHash for relative abundance estimation. Table 6 shows the relative abundance estimations that RawHash makes with and without Sequence Until. We note that the estimations we show for RawHash in Table 6 are different from the estimations in Table 3 since we randomly subsample the reads in the relative abundance estimation dataset, as explained in Section 3.1. We make two key observations. First, we observe that the distance between the relative abundance estimations of these two configurations of RawHash is very small (0.0118). This indicates that our outlier detection mechanism can accurately detect the convergence to the relative abundance estimations without using the full set of reads. Second, Sequence Until enables accurately stopping the entire sequencing after processing 7% of the reads in the entire set without substantially sacrificing accuracy. We conclude that Sequence Until has the potential to significantly reduce the sequencing time and cost by using fewer reads from a sample while producing accurate results.

Table 6.

Relative abundance with Sequence Until.

	Estimated relative abundance ratios in 50 000 random reads
Tool	SARS-CoV-2	E.coli	Yeast	Green algae	Human	Distance
RawHash (100%)^a	0.0270	0.3636	0.3062	0.1951	0.1081	N/A
RawHash + Sequence Until (7%)	0.0283	0.3539	0.3100	0.1946	0.1133	0.0118

^a Percentages show the portion of the overall reads used.

4 Discussion

We discuss the benefits we expect RawHash can immediately provide, the limitations of RawHash, and future work. We envision that RawHash can be useful mainly in two directions. First, RawHash provides a low-cost solution for analyzing large genomes in real-time. Such an analysis can be particularly useful when using nanopore sequencers with limited computational resources to enable portable real-time genome analysis at a large scale.

Second, we expect that RawHash can also be useful for genome analysis that does not require real-time solutions by reducing the time and energy that further steps in genome analysis may require. One of the immediate steps after generating raw nanopore signals is basecalling, a computationally intensive step that translates the signals to their corresponding DNA bases as sequences of characters. Basecalling approaches are usually computationally costly and consume significant energy as they use complex deep learning models (Mao et al. 2022; Singh et al. 2022). Although we do not evaluate it in this work, we expect that RawHash can be used as a low-cost filter (Cavlak et al. 2022) to eliminate the reads that are unlikely to be useful in downstream analysis, which can reduce the overall workload of basecallers and downstream analysis.

Future work: We find three key directions for future work. First, we find that our efficient hash-based similarity identification mechanism can be used to efficiently find overlaps between signals as the reads are sequenced in real-time. Although we observe that our indexing technique is efficient in terms of the resources it requires to construct an index even for large genomes, such an overlapping technique requires substantially more optimized indexing methods and techniques that can efficiently find overlaps as more reads are sequenced and the index evolves. Finding overlaps between signals can be beneficial in (i) providing enriched information to basecallers to increase their accuracy and (ii) identifying redundant signals that fully overlap with already sequenced reads in an effort to generate assemblies from signals.

Second, since RawHash generates hash values for matching similar regions, it provides opportunities to use the hash-based seeding techniques that are optimized for identifying sequence similarities accurately without requiring large memory space, such as minimizers (Roberts et al. 2004; Li 2018), spaced seeds (Ma et al. 2002), syncmers (Edgar 2021), strobemers (Sahlin 2021), and fuzzy seed matching as in BLEND (Firtina et al. 2023). Although we do not evaluate it in this work, we implement the minimizer seeding technique in RawHash. Our initial observations suggest that future work can exploit these seeding techniques with slight modifications in their seeding mechanisms to significantly improve the performance of certain applications without reducing the accuracy.

Third, we find that RawHash can also benefit from a GPU implementation as its low-cost and accurate implementation can effectively be scaled to nanopore sequencers that include thousands of nanopores such that these pores can be analyzed in parallel with an efficient GPU implementation, which we leave as future work.

5 Conclusion

We propose RawHash, a novel mechanism that provides a low-cost and accurate approach for real-time genome analysis for large genomes. RawHash can efficiently and accurately perform real-time analysis of raw nanopore signals to identify similarities between the signals and a reference genome at a large scale (e.g. whole-genome analysis for human or communities with multiple samples). To efficiently and accurately identify similarities, RawHash (i) generates events from both raw signals and the reference genome, (ii) quantizes the events into values such that slightly different events that correspond to the same DNA content can have the same value, and (iii) generates hash values from multiple events to efficiently find matching regions between raw signals and a reference genome using hash values with efficient data structures such as hash tables. We compare RawHash with the state-of-the-art approaches, UNCALLED and Sigmap, on three important applications in terms of their performance, accuracy, and estimated benefits in reducing sequencing time and cost. Our results show that RawHash (i) is the only tool that can be accurately applied to analyze raw nanopore signals at a large scale, (ii) provides $25.8 \times$ and $3.4 \times$ better average throughput, and (iii) can map reads $32.1 \times$ and $2.1 \times$ faster than UNCALLED and Sigmap, respectively.

Acknowledgements

We thank the SAFARI Research Group members for their valuable feedback and the stimulating intellectual and scholarly environment they provide. We thank the anonymous reviewers of ISMB/ECCB 2023.

Contributor Information

Can Firtina, Department of Information Technology and Electrical Engineering, ETH Zurich, 8092 Zurich, Switzerland.

Nika Mansouri Ghiasi, Department of Information Technology and Electrical Engineering, ETH Zurich, 8092 Zurich, Switzerland.

Joel Lindegger, Department of Information Technology and Electrical Engineering, ETH Zurich, 8092 Zurich, Switzerland.

Gagandeep Singh, Department of Information Technology and Electrical Engineering, ETH Zurich, 8092 Zurich, Switzerland.

Meryem Banu Cavlak, Department of Information Technology and Electrical Engineering, ETH Zurich, 8092 Zurich, Switzerland.

Haiyu Mao, Department of Information Technology and Electrical Engineering, ETH Zurich, 8092 Zurich, Switzerland.

Onur Mutlu, Department of Information Technology and Electrical Engineering, ETH Zurich, 8092 Zurich, Switzerland.

Supplementary data

Supplementary data are available at Bioinformatics online.

Conflict of interest

None declared.

Funding

We acknowledge the generous gifts of our industrial partners, including Intel and VMware. This work is also partially supported by the European Union’s Horizon programme for research and innovation [101047160 - BioPIM] and the Swiss National Science Foundation (SNSF) [200021_213084].

Data availability

We provide the accession numbers of all the available public datasets we use in Table 1. We provide the scripts to download all the datasets and to fully reproduce our results at https://github.com/CMU-SAFARI/RawHash/tree/main/test. The source code of RawHash is available at https://github.com/CMU-SAFARI/RawHash.

References

Bao Y, Wadden J, Erb-Downward JR. et al. SquiggleNet: real-time, direct classification of nanopore signals. Genome Biol 2021;22:298.
Cavlak MB, Singh G, Alser M. et al. Targetcall: eliminating the wasted computation in basecalling via pre-basecalling filtering. bioRxiv, 2022, preprint: not peer reviewed.
David M, Dursi LJ, Yao D. et al. Nanocall: an open source basecaller for oxford nanopore sequencing data. Bioinformatics 2017;33:49–55.
Dunn T, Sadasivan H, Wadden J. et al. SquiggleFilter: an accelerator for portable virus detection. In: MICRO, New York, NY, USA. 2021.
Edgar R. Syncmers are more sensitive than minimizers for selecting conserved k-mers in biological sequences. PeerJ 2021;9:e10805.
Edwards HS, Krishnakumar R, Sinha A. et al. Real-time selective sequencing with RUBRIC: read until with basecall and reference-informed criteria. Sci Rep 2019;9:11475.
Ferragina P, Manzini G. Opportunistic data structures with applications. In: Proceedings 41st Annual Symposium on Foundations of Computer Science, Redondo Beach, CA, USA. 2000, 390–98.
Firtina C, Park J, Alser M. et al. BLEND: a fast, memory-efficient and accurate mechanism to find fuzzy seed matches in genome analysis. NAR Genom Bioinform 2023;5:lqad004.
Kovaka S, Fan Y, Ni B. et al. Targeted nanopore sequencing by real-time mapping of raw electrical signal with UNCALLED. Nat Biotechnol 2021;39:431–41.
Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 2018;34:3094–100.
Loose M, Malla S, Stout M. Real-time selective sequencing using nanopore technology. Nat Methods 2016;13:751–4.
Ma B, Tromp J, Li M. PatternHunter: faster and more sensitive homology search. Bioinformatics 2002;18:440–5.
Mao H, Alser M, Sadrosadati M. et al. Genpip: in-memory acceleration of genome analysis via tight integration of basecalling and read mapping. In: 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), Chicago, IL, USA. IEEE, 2022, 710–26.
Payne A, Holmes N, Clarke T. et al. Readfish enables targeted nanopore sequencing of gigabase-sized genomes. Nat Biotechnol 2021;39:442–50.
Roberts M, Hayes W, Hunt BR. et al. Reducing storage requirements for biological sequence comparison. Bioinformatics 2004;20:3363–9.
Ruxton GD. The unequal variance t-test is an underused alternative to student’s t-test and the Mann–Whitney U test. Behav Ecol 2006;17:688–90.
Sadasivan H, Wadden J, Goliya K. et al. Rapid Real-time Squiggle Classification for Read until using RawMap. In: Archives of Clinical and Biomedical Research, 2023;7:45–57. doi:10.26502/acbr.50170318
Sahlin K. Effective sequence similarity detection with strobemers. Genome Res 2021;31:2080–94.
Senanayake A, Gamaarachchi H, Herath D. et al. DeepSelectNet: deep neural network based selective sequencing for oxford nanopore sequencing. BMC Bioinformatics 2023;24:31.
Shih PJ, Saadat H, Parameswaran S. et al. Efficient real-time selective genome sequencing on resource-constrained devices. arXiv, 2022, preprint: not peer reviewed.
Simpson JT, Workman RE, Zuzarte PC. et al. Detecting DNA cytosine methylation using nanopore sequencing. Nat Methods 2017;14:407–10.
Singh G, Alser M, Khodamoradi A. et al. A framework for designing efficient deep learning-based genomic basecallers. bioRxiv, 2022, preprint: not peer reviewed.
Ulrich JU, Lutfi A, Rutzen K. et al. ReadBouncer: precise and scalable adaptive sampling for nanopore sequencing. Bioinformatics 2022;38:i153–60.
Weilguny L, De Maio N, Munro R. et al. Dynamic, adaptive sampling during nanopore sequencing using Bayesian experimental design. Nat Biotechnol 2023:1–8.
Zhang H, Li H, Jain C. et al. Real-time mapping of nanopore raw signals. Bioinformatics 2021;37:i477–83.
