Briefings in Bioinformatics. 2024 Mar 19;25(2):bbae107. doi: 10.1093/bib/bbae107

pathMap: a path-based mapping tool for long noisy reads with high sensitivity

Ze-Gang Wei 1,2, Xiao-Dan Zhang 3, Xing-Guo Fan 4, Yu Qian 5, Fei Liu 6, Fang-Xiang Wu 7
PMCID: PMC10959152  PMID: 38517696

Abstract

With the rapid development of single-molecule sequencing (SMS) technologies, the output read length is continuously increasing. Mapping such reads onto a reference genome is one of the most fundamental tasks in sequence analysis. Mapping sensitivity is becoming a major concern since high sensitivity can detect more aligned regions on the reference and obtain more aligned bases, which are useful for downstream analysis. In this study, we present pathMap, a novel k-mer graph-based mapper that is specifically designed for mapping SMS reads with high sensitivity. By viewing the alignment chain as a path containing as many anchors as possible in the matched k-mer graph, pathMap treats chaining as a path selection problem in the directed graph. Because pathMap iteratively searches for the longest path among the remaining nodes, more high-quality candidate chains can be effectively detected and aligned. Compared to other state-of-the-art mapping methods such as minimap2 and Winnowmap2, experimental results on simulated and real-life datasets demonstrate that pathMap obtains at least 11.50% more mapped chains than its closest competitor and increases the mapping sensitivity by 17.28% and 13.84% of bases over the next-best mapper for Pacific Biosciences and Oxford Nanopore sequencing data, respectively. In addition, pathMap is more robust to sequence errors and more sensitive to species- and strain-specific identification of pathogens using MinION reads.

Keywords: read mapping, sequence alignment, long-read sequencing, long noisy reads

INTRODUCTION

Since the emergence of sequencing technologies in bioinformatics [1], read alignment has become one of the hot research topics in sequence analysis [2, 3]. With the rapid development of single-molecule sequencing (SMS) technologies such as Pacific Biosciences (PacBio) sequencing and Oxford Nanopore technologies (ONT) [4], the output read length is continuously increasing. SMS technologies can generate reads with an average length of 10–60 kbp [5], which are considerably longer than reads generated by next-generation sequencing (NGS). However, reads produced by SMS technologies have higher error rates (~15%) than NGS reads (1%) [6]. These characteristics of SMS reads (long read length and high error rate) call for the development of advanced algorithms for efficient sequence alignment.

Read alignment (also called mapping) is used to determine potential locations for each read on a given reference genome [7, 8]. As an essential procedure for the downstream analysis pipelines of SMS reads, efficient read mapping methods are in great demand [9, 10]. Consequently, a number of mapping algorithms have been designed in the past decade. According to the indexing technique used in the seed searching phase, existing mapping methods can be classified into two categories: Burrows–Wheeler Transform (BWT)-based [11] and hash table–based [12] methods.

Mappers based on BWT include BLASR [13], BWA-MEM [14], LAMSA [15], lordFAST [16] and smsMap [17]. BLASR [13] finds all matches (also known as seeds) between a query read and the reference sequence. Then, a rough alignment is generated with sparse dynamic programming. Alignments with a high score are realigned to obtain base-to-base alignments. BWA-MEM [14] collects the super-maximal exact matches and then greedily chains them and filters out short chains. Finally, it extends the seed to reach the whole read alignment. LAMSA [15] extracts the approximate matches by employing GEM [18]. These matches are processed to obtain a series of chains. Finally, LAMSA separately fills the gaps within chains to get the whole read alignment. lordFAST [16] partitions the reference into overlapping windows, each of which is selected as a candidate alignment region if the number of matches in that window reaches a pre-defined threshold. Next, a set of co-linear, non-overlapping matches is identified. Lastly, the detailed alignment is completed between neighboring matches in the selected chain. smsMap [17] formulates a credibility model to locate two starting alignment positions in the query read and reference. The final alignment is formed by the column reduction banded dynamic alignment method.

Mappers based on the hash table include rHAT [19], GraphMap [20], minimap2 [21], Winnowmap2 [22], NGMLR [23] and kngMap [24]. rHAT [19] splits the reference into overlapping windows and constructs a regional hash table for k-mer retrieval within each window. Windows with the most occurrences of matches are selected as candidate regions for further extension [25]. GraphMap [20] uses gapped seeds to find seed hits, clusters them as the coarse alignment and chains seed hits based on the longest common subsequence construction. At the end, it refines alignments to complete the final alignment. minimap2 [21] collects minimizers of the reference genomes. For each read, it finds exactly matched minimizers and recognizes a set of co-linear seeds as chains. The detailed alignment is formed by extending from the seeds to the unseeded regions. In order to deal with repeats in the reference genome, Winnowmap2 [22] introduces the minimal confidently alignable substrings (MCASs) and identifies MCASs for each read to the reference. NGMLR [23] starts by aligning subsegments of a read through NextGenMap [26]. It then adopts a convex gap–cost scoring strategy to compute pairwise sequence alignment. The set of linear alignments with the highest joint score is selected as the final alignment result. kngMap [24] identifies matched k-mers to generate a high-quality alignment chain for each query read; then, unaligned regions in the chain are categorized and aligned with a specific strategy.

Among the above tools, minimap2, LAMSA, NGMLR, lordFAST and rHAT are faster but not sensitive enough to detect more mapping locations (regions) and obtain more aligned bases for many reads. Thus, these methods gain speed at the expense of mapping sensitivity. For instance, minimap2 selects the minimizers in a surrounding window for the references and the reads, which can reduce the k-mer space. LAMSA splits each query read into evenly spaced fragments. Like LAMSA, NGMLR partitions the long reads into 256 bp non-overlapping fragments to reduce the number of fragments for seeding. Analogously, lordFAST extracts 1000 evenly spaced k-mers for each query read. Different from minimap2, LAMSA, NGMLR and lordFAST, rHAT only extracts seeds from long substrings (e.g. >100 bp) of each read. All these conditional choices of seeds could be problematic, especially if a selected substring contains more sequencing errors or comes from repetitive regions. Since the emergence of the SMS technologies, the throughput and read length have been continuously increasing. Mapping sensitivity is becoming a major issue since higher sensitivity can detect more aligned regions and produce more aligned bases, which are useful for downstream analysis [20, 27].

To address the above issues, in this study, we propose a novel path-based long-read mapper, referred to as pathMap, to map SMS reads onto a reference with better mapping sensitivity. By viewing the alignment chain as a path containing as many anchors as possible in the matched k-mer graph, we treat chaining as a path selection problem in the directed graph. pathMap finds the chains by searching the paths in the directed matched k-mer graph; its key feature is that it iteratively searches for the longest path among the remaining nodes. Specifically, pathMap contains three main modules: (i) reference indexing; (ii) k-mer graph construction and path selection; and (iii) read alignment for each selected path.

pathMap was tested and compared with other state-of-the-art long read mappers, including Winnowmap2, minimap2, NGMLR and GraphMap, on simulated reads with different error rates, as well as several real-life datasets generated by PacBio, ONT and Illumina Moleculo platforms. The experiments showed that pathMap could effectively map long noisy reads with a considerable improvement on detecting more candidate chains and obtaining more mapped bases. In addition, pathMap is more robust to sequence errors and more sensitive to species- and strain-specific identification of pathogens with MinION reads.

METHODS

pathMap is a path-based algorithm for mapping SMS reads onto a reference. Compared to most existing mapping methods, one unique feature of pathMap is that it recursively finds chains by searching the longest path with the maximum number of nodes in the directed matched k-mer graph; it can detect more candidate chains than other methods and then produce more aligned bases. Figure 1 illustrates the workflow of pathMap, while Table S1 (in the supplementary file) describes the pseudo-code of pathMap.

Figure 1.


A schematic illustration of the pathMap algorithm. (A) Building a searching index of the reference genome for fast look-ups. (B) Searching the matched k-mers (2-mers here) between the query sequence and the reference. (C) Constructing a k-mer l-neighborhood (l is the query length, here l = 11) graph where matched k-mers are viewed as vertices and each pair of matched k-mers is connected by a directed unweighted edge based on the positions in the genome and the query read. (D) Selecting the longest path after filtering out the single nodes and the subgraphs that are smaller than a given threshold. In this case, there are one single node (CT) and one small subgraph (GA → AA) in (C); they are directly removed before finding the path. (E) Partition of the read and local reference sequence by the selected path. (F) Each pair of segments is carefully aligned to compose the final detailed base-to-base alignment.

Stage 1: reference genome indexing

Theoretically, any efficient indexing method can be applied to quickly locate matched positions in a reference for a given k-mer. Currently, two index techniques are employed by most existing methods: the BWT-FM index [28] and the hash table [29, 30]. The BWT-FM technique allows long reference genomes to be retrieved within a small memory space [31], while the hash table can search matched positions in a reference with linear time for a given k-mer [32]. Therefore, to improve the runtime efficiency, pathMap utilizes the hash table (implemented in minimizers [32]) to index the reference genomes.
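To illustrate the idea of a hash-table index, the following is a minimal Python sketch, not pathMap's actual C/C++ implementation. It assumes a plain dictionary from every k-mer to its start positions; the names `build_kmer_index` and `lookup` are hypothetical, and strand handling and minimizer sampling are deliberately omitted.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int) -> dict[str, list[int]]:
    """Map every k-mer of the reference to the list of its start positions.

    Positions are appended in increasing order, so each list is sorted.
    """
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return dict(index)


def lookup(index: dict[str, list[int]], kmer: str) -> list[int]:
    """Return all reference positions of a k-mer (empty list if absent)."""
    return index.get(kmer, [])


if __name__ == "__main__":
    idx = build_kmer_index("ACGTACGT", k=2)
    print(lookup(idx, "AC"))  # positions where "AC" starts in the reference
```

A dictionary keeps the sketch short; a production index would typically pack k-mers into integers to save memory.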

Stage 2: alignment path selection

Given a query long read, pathMap generates all alignment paths as follows:

Step 1: match extraction

For each query read, pathMap extracts all k-mers and finds their matched positions on the reference. We denote each match as a tuple matchi(pr, pg, dr, dg), where matchi(pr) and matchi(pg) are, respectively, the matching positions on the read and reference genome, and matchi(dr) and matchi(dg) represent the match direction (1 or 0 for forward or reverse strand) on the read and reference genome, respectively. Additionally, we use a modified triplet mi(pr, pg, r) to represent each matchi for addressing co-linear and non-co-linear events:

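$$
m_i(p_r, p_g, r) =
\begin{cases}
\big(match_i(p_r),\ match_i(p_g),\ +\big), & \text{if } match_i(d_r) = match_i(d_g) \\
\big(len(r) - match_i(p_r),\ match_i(p_g),\ -\big), & \text{otherwise}
\end{cases}
\tag{1}
$$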

From Equation (1), one can observe that mi differs from matchi mainly in the position on the query sequence and in the direction representation. That is, if matchi(dr) and matchi(dg) have the same direction, then mi(r) = ‘+’ and mi(pr) = matchi(pr); otherwise, mi(r) = ‘−’ and mi(pr) = len(r) − matchi(pr), where len(r) denotes the length of the query read. Here mi(r) indicates the strand consistency: ‘+’ indicates that this k-mer from the query read is matched to the forward strand of the reference, while ‘−’ indicates the reverse strand of the reference. The modified mi converts a non-co-linear chain to the co-linear case (as described in Figure S1), which significantly facilitates the detection of reverse alignments or inversion events. At the end, pathMap obtains a list of matched k-mers (anchors) M = {m1, m2, …, mNm} sorted by their positions on the reference and then on the query read, where Nm = |M|.
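The conversion from matches to anchors described above can be sketched in Python as follows. This is an illustrative sketch only: the `Match` and `Anchor` types and the function names are hypothetical, and the assumption that the reference position is carried over unchanged follows from the text stating that only the read position and the direction representation differ.

```python
from typing import NamedTuple


class Match(NamedTuple):
    pr: int  # position on the read
    pg: int  # position on the reference genome
    dr: int  # direction on the read (1 forward, 0 reverse)
    dg: int  # direction on the reference (1 forward, 0 reverse)


class Anchor(NamedTuple):
    pr: int  # (possibly flipped) position on the read
    pg: int  # position on the reference genome
    r: str   # strand consistency: "+" or "-"


def to_anchor(match: Match, read_len: int) -> Anchor:
    """Apply the case distinction of Equation (1) to one match."""
    if match.dr == match.dg:
        return Anchor(match.pr, match.pg, "+")
    # Opposite directions: flip the read coordinate so the chain is co-linear.
    return Anchor(read_len - match.pr, match.pg, "-")


def build_anchor_list(matches: list[Match], read_len: int) -> list[Anchor]:
    """Convert all matches and sort by reference position, then read position."""
    anchors = [to_anchor(m, read_len) for m in matches]
    return sorted(anchors, key=lambda a: (a.pg, a.pr))
```

After the flip, anchors on the reverse strand increase in both coordinates along a chain, so the same chaining logic can serve both strands.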

Step 2: directed k-mer graph construction

The best chain for an alignment is a high-scoring subset of concordant anchors in M. To find it, pathMap views the chain as a path or walk that connects these anchors. All the anchors in M are used to build a directed unweighted graph, where anchors are viewed as vertices and each pair of anchors is connected by a directed edge. For every vertex vi (vi = mi), its l-neighborhood contains a certain number of directed outbound edges connecting vi to vj (Figure 1C), where vj belongs to the l-neighborhood of vi. The l-neighborhood of node vi is the set of nodes whose distance to vi satisfies |vj(pr) − vi(pr)| < l. The parameter l represents the maximum allowed distance between two nodes. The rationale for such a design is illustrated in Figure S2. With this graph, pathMap allows a linear walk to be searched not only by following consecutive k-mers in the graph but also by jumping over regions that do not contain anchors because of sequencing errors. Figure 1C depicts such an example.
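A minimal Python sketch of the l-neighborhood graph is given below. The source only states the read-distance condition, so the exact edge rule here is an assumption: an edge vi → vj is added when both anchors have the same strand label, vj lies after vi on both the read and the reference, and the read distance is below l. The function name `build_graph` is hypothetical, and the quadratic loop is for clarity rather than speed.

```python
def build_graph(anchors: list[tuple[int, int, str]], l: int) -> list[list[int]]:
    """Build adjacency lists for the directed, unweighted anchor graph.

    anchors: (pr, pg, strand) tuples sorted by (pg, pr).
    Returns edges[i] = indices j > i that anchor i points to.
    """
    edges: list[list[int]] = [[] for _ in anchors]
    for i, (pr_i, pg_i, strand_i) in enumerate(anchors):
        for j in range(i + 1, len(anchors)):
            pr_j, pg_j, strand_j = anchors[j]
            same_strand = strand_i == strand_j
            colinear = pg_j > pg_i and pr_j > pr_i   # assumed ordering rule
            in_neighborhood = pr_j - pr_i < l        # |vj(pr) - vi(pr)| < l
            if same_strand and colinear and in_neighborhood:
                edges[i].append(j)
    return edges
```

Because edges only go from a lower to a higher index in the sorted anchor list, the resulting graph is acyclic, which is what makes a longest-path search tractable.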

Step 3: path selection

In the directed graph, paths are selected as follows: (i) all single nodes are first discarded. (ii) Subgraphs that are too small (by default, those containing fewer than eight nodes) are also excluded to avoid a large search space. (iii) Among the remaining nodes, the longest path, i.e. the one that contains the maximum number of nodes, is chosen as the best alignment chain. Considering sequencing errors or SVs in the reads, pathMap iteratively searches for the longest path among the remaining nodes. That is, after the longest path is obtained, its nodes are removed from the graph and steps (i), (ii) and (iii) are repeated to find other candidate paths. Finally, the top N (N = 10 by default) longest, non-overlapping paths are selected as the chains for performing the pairwise alignment.
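The iterative selection in steps (i) to (iii) can be sketched in Python as follows. This is a sketch under stated assumptions, not the published implementation: "subgraphs" are taken to be weakly connected components, edges are assumed to run from lower to higher node index (so index order is a topological order), and "non-overlapping" is interpreted as node-disjoint. All function names are hypothetical.

```python
def components(nodes: set[int], edges: list[list[int]]) -> list[set[int]]:
    """Weakly connected components of the graph restricted to `nodes`."""
    undirected = {v: set() for v in nodes}
    for u in nodes:
        for v in edges[u]:
            if v in nodes:
                undirected[u].add(v)
                undirected[v].add(u)
    seen, comps = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, comp = [start], set()
        seen.add(start)
        while stack:
            u = stack.pop()
            comp.add(u)
            for v in undirected[u] - seen:
                seen.add(v)
                stack.append(v)
        comps.append(comp)
    return comps


def longest_path(nodes: set[int], edges: list[list[int]]) -> list[int]:
    """Path with the most nodes in a DAG whose edges go from low to high index."""
    best_len, next_node = {}, {}
    for u in sorted(nodes, reverse=True):      # successors are processed first
        best_len[u], next_node[u] = 1, None
        for v in edges[u]:
            if v in nodes and best_len[v] + 1 > best_len[u]:
                best_len[u], next_node[u] = best_len[v] + 1, v
    if not best_len:
        return []
    u = max(best_len, key=best_len.get)
    path = []
    while u is not None:
        path.append(u)
        u = next_node[u]
    return path


def select_paths(edges: list[list[int]], min_nodes: int = 8, top_n: int = 10) -> list[list[int]]:
    """Repeatedly extract the longest path, removing its nodes each round."""
    remaining = set(range(len(edges)))
    paths = []
    while len(paths) < top_n:
        # Steps (i) and (ii): drop single nodes and components below the threshold.
        kept = set()
        for comp in components(remaining, edges):
            if len(comp) >= min_nodes:
                kept |= comp
        if not kept:
            break
        # Step (iii): longest path among the surviving nodes.
        path = longest_path(kept, edges)
        paths.append(path)
        remaining = kept - set(path)
    return paths
```

Since each round removes the nodes of the extracted path, later paths are found in progressively smaller graphs, and the loop stops as soon as no component reaches the size threshold.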

Stage 3: alignment

Each path reported at the previous stage partitions the query read and the reference into a series of paired segments, as described in Figure 1E. pathMap directly applies the classical Needleman–Wunsch (NW) method [33] to align each pair of segments. For the ending boundaries of the chain, a modified global alignment is performed. Detailed operations for the ending boundaries of the chain are shown in Figure S3. The end-to-end alignment for the entire query read is obtained by integrating the anchors and the alignments of the paired segments.
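For readers unfamiliar with the NW method, a minimal Python sketch of global alignment between one pair of segments is shown below. The scoring values (+1 match, −1 mismatch, −1 gap) are assumptions for illustration, since the scoring scheme is not given here, and the modified global alignment used at chain ends is not covered.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Traceback from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

The full dynamic-programming table costs time and memory proportional to the product of the two segment lengths, which is why the anchors are used to keep each pair of segments short.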

Measuring alignment accuracy

For simulated reads, the simulator software reports the ground truth of the starting and ending locations on the reference and the detailed base-to-base alignment to the reference; thus, the number of correctly located reads and mapped bases can be calculated. A correctly located read should satisfy the following: (i) it is mapped to the true genome and strand direction, and (ii) the mapped region on the reference overlaps the simulated region by ≥90%. A base in a read is correctly aligned if (i) the read is correctly located and (ii) the mapped position in the reference is within 5 bp of the ground truth. Based on the evaluation in GraphMap [20], precision is defined as the number of correctly aligned reads (bases) divided by the total number of aligned reads (bases), while recall is defined as the number of correctly aligned reads (bases) divided by the number of all simulated reads (bases). All the metrics were calculated based on the primary alignment for each method.
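These definitions can be sketched in Python as follows. The sketch assumes half-open reference intervals and measures the ≥90% overlap relative to the simulated (true) region, which the text does not specify; the function names and tuple layout are hypothetical.

```python
def overlap_fraction(mapped_start: int, mapped_end: int, true_start: int, true_end: int) -> float:
    """Fraction of the true interval [true_start, true_end) covered by the mapped one."""
    overlap = max(0, min(mapped_end, true_end) - max(mapped_start, true_start))
    return overlap / (true_end - true_start)


def is_correctly_located(mapped, truth, min_overlap: float = 0.90) -> bool:
    """mapped/truth are (genome, strand, start, end) tuples."""
    same_target = mapped[0] == truth[0] and mapped[1] == truth[1]
    return same_target and overlap_fraction(mapped[2], mapped[3], truth[2], truth[3]) >= min_overlap


def is_base_correct(read_located: bool, mapped_pos: int, true_pos: int, tol: int = 5) -> bool:
    """A base counts as correct if its read is located and it lies within tol bp."""
    return read_located and abs(mapped_pos - true_pos) <= tol


def precision_recall_f1(n_correct: int, n_aligned: int, n_simulated: int):
    """Precision = correct / aligned; recall = correct / simulated; F1 = harmonic mean."""
    precision = n_correct / n_aligned if n_aligned else 0.0
    recall = n_correct / n_simulated if n_simulated else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

The same `precision_recall_f1` helper applies to both read-level and base-level counts, since only the unit being counted changes.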

RESULTS

pathMap is implemented in C/C++ for 64-bit Linux. The performance of pathMap was benchmarked against five other mappers: Winnowmap2, minimap2, NGMLR, GraphMap and kngMap. We compared all methods on both simulated and real-life datasets. All experiments were run on a workstation with a 20-core (2 threads per core) Intel(R) Xeon(R) Gold 5218R at 2.10 GHz and 256 gigabytes of random-access memory (RAM), running CentOS 7.5. Table S2 lists the running parameters and commands for each method.

Experiment 1: evaluation on simulated datasets

To assess the performance of pathMap, eight simulated PacBio datasets with 1%, 2%, 5%, 10%, 15%, 20%, 25% and 30% error rates were generated by the PBSIM2 simulator [34]. The average read length of all simulated datasets is about 10 kbp; the detailed running commands for PBSIM2 and the simulated datasets with different error rates can be found in Tables S3 and S4. The mapping results for each method are presented in Figure 2; the detailed values of the correct mapping location and bases are listed in Tables S5 and S6. On the low–error rate (1%, 2% and 5%) datasets, as shown in Figure 2, pathMap reached the highest mapping precision for detecting the correct read location and obtaining the correct base-to-base alignment to the reference. pathMap’s precision for read location and base-wise alignment consistently stayed as high as 99% for low error rates. Winnowmap2, minimap2 and kngMap achieved similar results to pathMap, while the other aligners, NGMLR and GraphMap, obtained lower mapping precision for aligned reads and bases.

Figure 2.


Precision of six methods on the simulated PacBio datasets (average length is 10 kbp) with different error rates. The figure on the top shows performance for determining the correct mapping location and the one on the bottom for the correct alignment of bases. The precision for read location is defined as the number of correctly mapped reads divided by the number of aligned reads. The base-wise precision is defined as the number of correctly aligned bases divided by the total aligned bases.

From Figure 2, both the read location precision and the base-wise mapping precision of pathMap are consistently higher than those of all competing methods. Furthermore, as the error rate increased, both the read location precision and the base-wise mapping precision of Winnowmap2, minimap2 and NGMLR degraded dramatically, while those of pathMap dropped only slightly. For instance, at the error rate of 30%, pathMap could still correctly map about 99.42% of the total number of aligned bases, while Winnowmap2, minimap2, NGMLR, GraphMap and kngMap only correctly mapped about 75.28%, 97.56%, 79.94%, 98.70% and 97.23%, respectively, which was 2.25–32.06% lower than pathMap. In addition, the base-wise precision of pathMap at the error rate of 30% was even higher than those of NGMLR and GraphMap at the error rate of 1%. In summary, Figure 2 and Tables S5 and S6 demonstrate that pathMap can produce more accurate alignments for both low– and high–error rate reads, which indicates that pathMap is more robust to sequencing errors than the competing methods. Similar results can be found for recall and F1-score, which are described in Figures S4 and S5 and Tables S7–S10.

Additionally, simulated PacBio datasets with four other average read lengths (ranging from 20 to 50 kbp with a step size of 10 kbp) were generated to assess the impact of read length on the mapping performance. Figures S6–S11 depict the precision, recall and F1-score for read location and base with different lengths and error rates. It can be observed that the read length had little influence on pathMap over all error rates, indicating that pathMap was also robust to read length at the same error rate, whereas the performance of other methods, such as Winnowmap2, minimap2, NGMLR and GraphMap, varied considerably as the average read length increased at the same error rate. Furthermore, a series of simulated ONT datasets (with the same read lengths and error rates as the simulated PacBio datasets) were generated to test the performance of each mapper for the ONT sequencing platform. The simulated ONT reads gave results similar to those on the simulated PacBio datasets, as shown in Figures S12–S20.

The above evaluations demonstrated that pathMap can correctly map simulated SMS reads onto the reference in terms of the read location and the base-to-base alignment. In order to test the improvement of pathMap in mapping sensitivity, we compared the total number of mapped chains (each alignment record in the sequence alignment map (SAM) file can be viewed as a mapped chain) and mapped bases for all mappers. Table S11 reports the mapping results on the simulated PacBio datasets (the average length is 10 kbp) with various error rates; pathMap produced the most mapped chains and mapped the most bases to the reference over various error rates. For instance, at the error rate of 1%, pathMap obtained 36 251 mapped chains and 260 814 035 mapped bases, which were, respectively, 54.76% and 10.65% more than the next-best method (minimap2). Similar results can be found for the simulated ONT datasets, which are summarized in Table S12. These results demonstrated that the mapping sensitivity of pathMap was substantially higher than that of other methods in finding more candidate chains and mapped bases, even at high error rates.

Experiment 2: evaluation on real-life datasets

We also used three real-life datasets produced by PacBio, Oxford Nanopore MinION and Illumina Moleculo technologies to evaluate the above mapping methods. The PacBio and MinION datasets were from the Caenorhabditis elegans strain genome and the Oikopleura dioica male (KUM-M3) organism, respectively. The Illumina Moleculo dataset came from the CEU HapMap individual NA12878 with 40X coverage. The average read lengths of the datasets were, respectively, 12 087, 10 099 and 5225 bp. Since the original sequenced datasets were too large, a random subset of each dataset was selected for comparison; more detailed information about these datasets can be found in Tables S13 and S14. We excluded kngMap because it always crashed (segmentation fault) on these real datasets.

For real datasets, it became difficult to assess the mapping correctness of alignments because the true derived genome and mapping coordinate on the genome for each read were unknown. Therefore, all algorithms were compared based on the quality of their results, that is, the number of mapped reads, mapped chains and mapped bases; the alignment score; and the number of matched bases. A base is defined as the matched base if it is mapped to the identical one on the reference. The alignment score can be evaluated by adding up +1 for each matched base and −1 for other cases (including deletion, insertion and substitution after removing clipped bases). The sum of alignment scores of all aligned reads was calculated. The alignment score can be viewed as a complementary metric to the number of matched bases since one mapper could map all the bases in the query read without focusing on the alignment gaps created in the reference. Additionally, the average mapping identity, deletion, insertion and substitution were also computed to reflect the ability of detecting how similar regions are.
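The alignment score and matched-base count described above can be sketched in Python for a single alignment. The sketch assumes the alignment is available as an extended CIGAR string that distinguishes matches (`=`) from substitutions (`X`); the ambiguous `M` operation is rejected because it does not say whether a base matches. The function name is hypothetical.

```python
import re


def alignment_score(cigar: str) -> tuple[int, int]:
    """Return (score, matched_bases) for an extended CIGAR string.

    +1 for each matched base ('='); -1 for each substituted ('X'),
    inserted ('I') or deleted ('D') base; clipped bases ('S', 'H') are ignored.
    """
    score = matched = 0
    for length, op in re.findall(r"(\d+)(\D)", cigar):
        n = int(length)
        if op == "=":
            matched += n
            score += n
        elif op in "XID":
            score -= n
        elif op in "SH":
            continue  # clipped bases are removed before scoring
        else:
            raise ValueError(f"unsupported CIGAR operation: {op!r}")
    return score, matched
```

Summing the first component over all aligned reads gives the total alignment score, and summing the second gives the number of matched bases.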

Table 1 shows the mapping results on the three real-life datasets. We can see that for the PacBio dataset, pathMap aligned the most reads, chains and bases to the references, achieved 11.50–16.0% more mapped chains and improved base-wise sensitivity by 4.44–17.28% over the next-best method (minimap2). More precisely, for the PacBio dataset, pathMap generated 311 more aligned reads, 145 935 more mapped chains and 147.38 million more mapped bases compared to the second-best method (minimap2). pathMap also achieved the highest total alignment score, which was 90.07 million higher than that of minimap2, indicating the high quality of its alignments. Additionally, pathMap outperformed other methods in terms of the matched bases across all sequencing platforms; on the PacBio dataset, its number of matched bases was, respectively, 17.84%, 15.84%, 23.53% and 22.65% higher than those of Winnowmap2, minimap2, NGMLR and GraphMap, confirming that pathMap’s high sensitivity did not come at the expense of mapping quality. The higher number of matched bases is also the main reason for pathMap’s higher alignment score. Similar results can be found for the MinION and Illumina Moleculo datasets in Table 1. These mapping results demonstrated that pathMap can not only achieve higher read-level and base-level sensitivity than other approaches by detecting more candidate chains but also provide high-quality alignments. Furthermore, in order to assess the performance of the methods for mapping the same amount of sequencing data against the same reference genome, three human sequencing datasets were used and 20 000 reads of each dataset were randomly selected. More detailed information about these datasets can be found in Table S15. The mapping results are summarized in Table S16. pathMap still achieved higher mapping sensitivity, detecting more chains and mapping more reads and bases to the reference genome.

Table 1.

The number of mapped reads, mapped chains, mapped bases, the alignment score and matched bases of five methods on the three real-life datasets. Bold values denote the best results for each metric

| Methods | Mapped reads (chains) | Mapped bases | Score | Matched bases |
|---|---|---|---|---|
| PacBio SMRT dataset | | | | |
| pathMap | 68 593 (236 830) | 1 000 253 819 | 768 286 770 | 905 726 634 |
| Winnowmap2 | 66 174 (83 568) | 831 350 841 | 674 865 884 | 768 605 907 |
| minimap2 | 68 282 (90 895) | 852 872 450 | 678 210 481 | 781 822 396 |
| NGMLR | 66 521 (79 590) | 786 494 718 | 648 444 501 | 733 172 860 |
| GraphMap | 67 365 (67 365) | 812 169 226 | 628 122 482 | 738 412 470 |
| Oxford Nanopore MinION dataset | | | | |
| pathMap | 209 490 (891 839) | 2 487 393 700 | 1 644 615 044 | 2 200 673 950 |
| Winnowmap2 | 202 058 (439 059) | 2 050 909 965 | 1 515 566 773 | 1 857 786 220 |
| minimap2 | 204 695 (503 952) | 2 184 840 399 | 1 471 422 310 | 1 943 322 296 |
| NGMLR | 190 031 (390 949) | 1 848 563 454 | 1 426 946 697 | 1 697 404 181 |
| GraphMap | 194 053 (194 053) | 1 933 609 040 | 1 178 876 202 | 1 660 551 880 |
| Illumina Moleculo dataset | | | | |
| pathMap | 198 521 (251 106) | 1 189 692 663 | 1 137 811 873 | 1 172 281 520 |
| Winnowmap2 | 198 413 (218 679) | 1 103 568 303 | 1 080 480 612 | 1 093 869 664 |
| minimap2 | 198 474 (225 203) | 1 139 079 388 | 1 108 886 625 | 1 127 360 146 |
| NGMLR | 193 689 (195 643) | 1 013 059 700 | 1 007 890 978 | 1 011 088 313 |
| GraphMap | 196 216 (196 261) | 1 024 677 416 | 1 008 416 638 | 1 019 869 040 |
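As a worked check of how the percentages quoted in the text follow from Table 1, the short Python sketch below computes the relative gain of pathMap over another mapper for the PacBio dataset. The helper name `relative_gain_percent` is hypothetical; the input values are copied from the table.

```python
def relative_gain_percent(value: int, baseline: int) -> float:
    """Relative improvement of `value` over `baseline`, in percent."""
    return 100.0 * (value - baseline) / baseline


# PacBio SMRT dataset, values from Table 1.
mapped_bases_pathmap = 1_000_253_819
mapped_bases_minimap2 = 852_872_450
matched_bases_pathmap = 905_726_634
matched_bases_winnowmap2 = 768_605_907

# Base-wise sensitivity gain over minimap2 (about 17.28%).
gain_mapped = relative_gain_percent(mapped_bases_pathmap, mapped_bases_minimap2)

# Matched-base gain over Winnowmap2 (about 17.84%).
gain_matched = relative_gain_percent(matched_bases_pathmap, matched_bases_winnowmap2)
```

The absolute difference in mapped bases, 1 000 253 819 − 852 872 450 = 147 381 369, corresponds to the 147.38 million figure quoted for the comparison with minimap2.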

Table S17 reports the average mapping identity, deletion, insertion and substitution for each method. The alignment statistics varied noticeably across different algorithms even for the same dataset, partly due to the differences in alignment chain selection and in the pairwise alignment strategy for the unseeded regions. Not surprisingly, all methods obtained an average identity higher than 80% for the PacBio and MinION datasets and higher than 90% for the Illumina Moleculo dataset. In the absence of ground truth for these real datasets, the results in Table S17 show that the quality of the base-to-base alignments generated by pathMap was as reliable as that of other methods.

Next, the aligned consistency between each pair of mapping algorithms based on their reported alignments was also compared. For the same query sequence, two alignments x and y can be produced by two mappers; x is consistent with y if and only if the overlap was ≥90% between the two aligned regions covered by x and y on the reference genome. Figure S21 illustrates some examples of consistent and inconsistent alignments. Table 2 describes how the best alignments among different algorithms cover each other for the MinION dataset. More specifically, each row lists the percentage of mapping results (reported by the corresponding method in this row) that cover results of other methods. For instance, among all reads for which both pathMap and Winnowmap2 generated an alignment, 91.35% of alignments obtained by Winnowmap2 were covered by pathMap, while only 79.34% of alignments produced by pathMap were covered by Winnowmap2. We can find that pathMap provided a high coverage of alignments reported by alternative tools. Similar results can be found for PacBio and Illumina Moleculo datasets, which are reported in Tables S18 and S19, respectively.
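The coverage criterion can be sketched in Python as follows. The text does not state which alignment the 90% overlap is measured against; since the agreement values are not symmetric, this sketch assumes the overlap is taken relative to the reference span of the alignment being covered. Function names and the `(chromosome, start, end)` layout are hypothetical.

```python
def covers(x, y, min_overlap: float = 0.90) -> bool:
    """True if alignment x covers at least min_overlap of y's reference span.

    x and y are (chromosome, start, end) tuples with half-open intervals.
    """
    if x[0] != y[0]:
        return False
    overlap = max(0, min(x[2], y[2]) - max(x[1], y[1]))
    return overlap / (y[2] - y[1]) >= min_overlap


def agreement(row_alignments: dict, col_alignments: dict, min_overlap: float = 0.90) -> float:
    """Percentage of the row mapper's alignments covered by the column mapper.

    Only reads aligned by both mappers are considered.
    """
    shared = row_alignments.keys() & col_alignments.keys()
    if not shared:
        return 0.0
    covered = sum(covers(col_alignments[r], row_alignments[r], min_overlap) for r in shared)
    return 100.0 * covered / len(shared)
```

Under this assumption a long alignment can cover a short one without the reverse being true, which is one way the two directions of a mapper pair can disagree.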

Table 2.

Agreement (%) of different methods for real Oxford Nanopore MinION O. dioica dataset

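| | pathMap | minimap2 | Winnowmap2 | NGMLR | GraphMap |
|---|---|---|---|---|---|
| pathMap | | 93.26 | 79.34 | 63.96 | 86.21 |
| minimap2 | 96.95 | | 82.53 | 66.61 | 86.67 |
| Winnowmap2 | 91.35 | 91.41 | | 73.41 | 86.09 |
| NGMLR | 90.32 | 90.52 | 89.87 | | 88.15 |
| GraphMap | 67.57 | 65.19 | 59.58 | 50.52 | |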

Note: The value in each cell represents the percentage of the best alignments from the mapper of the row that are covered by the corresponding mappers of the column. This table is not symmetric.

Moreover, we also analyzed the reads that were only mapped by pathMap and compared them with those that were mapped by both pathMap and other methods; the results for the PacBio dataset are listed in Table 3. For example, compared to the reads also mapped by minimap2, the reads mapped only by pathMap had a lower average identity (78.47% versus 90.06%) and were significantly shorter (1896 versus 11 449 bp) on average. This further confirmed that pathMap can map more reads and bases with a higher error rate, highlighting its sensitivity and tolerance to sequencing errors. Similar results can be found for the MinION and Illumina Moleculo datasets, which are reported in Tables S20 and S21.

Table 3.

Analysis of the reads only mapped by pathMap compared to those that are mapped by both pathMap and other methods on the PacBio C. elegans dataset

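| | | minimap2 | Winnowmap2 | NGMLR | GraphMap |
|---|---|---|---|---|---|
| Average identity (%) | Both | 90.06 | 90.31 | 90.39 | 90.27 |
| | Only | 78.47 | 81.80 | 77.86 | 75.79 |
| Average length (bp) | Both | 11 449 | 11 695 | 11 618 | 11 502 |
| | Only | 1896 | 3507 | 4587 | 6227 |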

Note: ‘Both’ denotes the alignments that are mapped by both pathMap and the other method; ‘Only’ denotes the alignments that are mapped only by pathMap.

Experiment 3: evaluation on the Genome in a Bottle benchmark

In order to assess the capability of pathMap for predicting structural variants (SVs) [35, 36], another real-life dataset, the Genome in a Bottle (GIAB) Tier1 v.0.6 benchmark set [37], was used in this experiment. The GIAB dataset provides a high-quality description of SVs in the Ashkenazi cell line HG002 with respect to the GRCh37 human reference. One PacBio Sequel release dataset (with 10-fold coverage) from HG002 was used for evaluation. The information about the raw dataset and reference genomes can be found in Table S22. Here, we compared the results of SV calling using Sniffles (version 2.0.7) [23]. A call is defined as ‘exact’ if (i) it is predicted on the same chromosome as the true SV and (ii) its start coordinate is within 50 bp of the true breakpoint. Table 4 lists the number of exact and total predicted SVs for pathMap, Winnowmap2, minimap2, NGMLR and GraphMap. We can see that, with the mappings provided by pathMap, Sniffles finds not only more SVs but also more ‘exact’ calls than with the mappings provided by other methods. Table S23 reports the precision, recall and F1-score of SVs called by Sniffles; pathMap had higher recall than other methods. Furthermore, Table S24 lists the results of SV calling using another SV detection approach, cuteSV (version 2.0.3) [38]; pathMap achieved the best recall and F1-score.
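The ‘exact’ criterion can be expressed as a short Python sketch. It assumes each SV is reduced to a `(chromosome, start)` pair and that "within 50 bp" is inclusive; SV type and length are not compared, as the definition above does not mention them. The function names are hypothetical.

```python
def is_exact_call(call: tuple[str, int], truth: list[tuple[str, int]], max_dist: int = 50) -> bool:
    """True if some true SV lies on the same chromosome within max_dist bp."""
    chrom, start = call
    return any(chrom == t_chrom and abs(start - t_start) <= max_dist
               for t_chrom, t_start in truth)


def count_exact(calls: list[tuple[str, int]], truth: list[tuple[str, int]], max_dist: int = 50) -> int:
    """Number of predicted SVs that satisfy the 'exact' criterion."""
    return sum(is_exact_call(c, truth, max_dist) for c in calls)
```

The linear scan over the truth set keeps the sketch simple; for a genome-wide benchmark one would normally index the true breakpoints by chromosome and sort them.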

Table 4.

SVs called by Sniffles based on mapping results from different methods

| Called SVs | pathMap | Winnowmap2 | minimap2 | NGMLR | GraphMap |
|---|---|---|---|---|---|
| Exact | 12 452 | 12 104 | 12 399 | 12 165 | 6107 |
| Total | 65 051 | 51 948 | 60 078 | 48 430 | 51 277 |

Experiment 4: evaluation on strain identification

The real-time and form-factor characteristics of the MinION sequencing technology have attracted a lot of attention for the identification of closely related pathogens from clinical samples. The choice of read mapper can greatly influence the alignments and may lead to misdiagnosis. As seen in the previous experiments, pathMap’s high sensitivity in read mapping suggested that it could be useful in this situation. Thus, the MinION-sequenced datasets of Escherichia coli K-12 MG1655 R7.3 (NCBI run accessions: ERR1147229 and ERR1147230) [39] and Salmonella enterica Typhi strain ISP2825 (NCBI run accession: SRR15411315) were downloaded and mapped to several closely related genomes of different strains (18 E. coli strains and 10 S. enterica strains) to test whether each method could help identify the correct species or strain. Figure 3 shows the number of aligned reads against different species/strain genomes for all approaches on the E. coli K-12 MG1655 R7.3 dataset, while Table S25 lists the number of aligned reads and the genome strain names. One can observe that pathMap mapped more reads to the correct genome of strain E. coli K-12 MG1655 (U00096.2) than the other methods, providing an improvement of 2–18% over Winnowmap2 and minimap2, and even more over NGMLR and GraphMap, in terms of the number of aligned reads. In addition, pathMap assigned the most reads to a handful of strains that were very similar to the correct strain genome (e.g. E. coli K-12 MG1655 and BW2952 share 99.99% identity). pathMap, Winnowmap2 and minimap2 also aligned dramatically more reads than NGMLR and GraphMap across all strain genomes. Moreover, pathMap aligned 9 reads to E. coli UTI89 plasmid pUTI89, whereas Winnowmap2, minimap2 and GraphMap aligned none; this further demonstrated pathMap’s better ability to detect closely related species or strains. Table S26 reports the precision, recall and F1-score obtained with MinION reads on the E. coli K-12 MG1655 R7.3 dataset; pathMap had the highest precision, recall and F1-score for species identification. Therefore, this experiment indicated that pathMap could be used to systematically identify pathogens at the strain level. Similar results were obtained when mapping the S. enterica Typhi strain ISP2825 dataset to different strain genomes, as presented in Figure S22 and Tables S27 and S28.
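
As a reminder of how the three scores in Table S26 relate to each other, the following is a minimal sketch, assuming that reads assigned to the correct genome count as true positives, reads assigned to a wrong genome as false positives, and reads that should have mapped to the correct genome but did not as false negatives. The exact definitions used for Table S26 are given in the supplementary material and may differ; the function name and the counts in the usage note are hypothetical.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple:
    """Compute precision, recall and F1-score from read counts.

    tp: reads assigned to the correct genome (assumed definition)
    fp: reads assigned to a wrong genome (assumed definition)
    fn: reads from the correct genome that were not assigned to it
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

For instance, with the made-up counts `precision_recall_f1(90, 10, 30)`, precision is 0.9 and recall is 0.75, so a mapper that assigns more reads to the correct genome without adding wrong assignments raises recall and, through it, the F1-score.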

Figure 3.

Strain identification using the E. coli K-12 MG1655 R7.3 sequencing dataset produced by Oxford Nanopore MinION for different methods. The horizontal axis shows the different E. coli genomes at strain level; the vertical axis denotes the number of aligned reads. Detailed values are reported in Table S25.

DISCUSSION

Mapping long noisy reads to a reference is computation- and memory-intensive; it can be tackled by mappers based on the seed-and-extend strategy. Chain selection lays the foundation for the mapping as a whole, since the quality of the chain affects both the determination of the correct genomic location on the reference and the reconstruction of the correct base-to-base alignment. Unlike most existing methods, which select the chain with a chaining algorithm, pathMap views matched k-mers as nodes and constructs a directed graph in which the longest path is selected as the chain.
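
The idea of selecting chains as longest paths, repeated on the remaining nodes, can be illustrated with a minimal sketch. It is not pathMap's implementation: the anchor representation (read position, reference position), the edge rule (strictly increasing on both sequences and within a hypothetical `max_gap`), the `min_anchors` cut-off and all function names are assumptions made for illustration only.

```python
from typing import List, Tuple

Anchor = Tuple[int, int]  # (read position, reference position) of a matched k-mer


def longest_path(anchors: List[Anchor], max_gap: int = 50) -> List[Anchor]:
    """Return the path with the most anchors in the directed anchor graph."""
    if not anchors:
        return []
    nodes = sorted(anchors)          # sorting by read position gives a topological order
    best = [1] * len(nodes)          # best[j]: anchors on the longest path ending at j
    prev = [-1] * len(nodes)         # back-pointers for path reconstruction
    for j, (qj, rj) in enumerate(nodes):
        for i in range(j):
            qi, ri = nodes[i]
            # Assumed edge rule: j lies downstream of i on both sequences, within max_gap.
            if qi < qj and ri < rj and qj - qi <= max_gap and rj - ri <= max_gap:
                if best[i] + 1 > best[j]:
                    best[j] = best[i] + 1
                    prev[j] = i
    end = max(range(len(nodes)), key=best.__getitem__)
    path = []
    while end != -1:
        path.append(nodes[end])
        end = prev[end]
    return path[::-1]


def iterative_chains(anchors: List[Anchor], min_anchors: int = 2,
                     max_gap: int = 50) -> List[List[Anchor]]:
    """Repeatedly take the longest path, then search again in the remaining nodes."""
    remaining = list(anchors)
    chains = []
    while remaining:
        path = longest_path(remaining, max_gap)
        if len(path) < min_anchors:
            break
        chains.append(path)
        used = set(path)
        remaining = [a for a in remaining if a not in used]
    return chains
```

With the toy anchors `[(0, 100), (10, 110), (20, 120), (5, 500), (15, 510), (40, 900)]`, the first pass returns the three-anchor chain near reference position 100 and the second pass recovers the two-anchor chain near position 500, which is how an iterative search can report more than one candidate chain per read.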

The massive and growing quantities of data generated by SMS technologies pose serious challenges to existing mapping algorithms. Apart from mapping accuracy, speed and memory usage are two other critical aspects that need to be addressed. Figure 4 depicts the time cost and memory consumption of different algorithms for the PacBio dataset of C. elegans used in experiment 2. pathMap’s runtime was close to that of Winnowmap2; both were slower than minimap2 yet faster than NGMLR and GraphMap. For memory usage, pathMap remains competitive with Winnowmap2. Overall, compared to minimap2 and Winnowmap2, which are the two fastest mappers as far as we know, pathMap is more computationally demanding, which could be attributed to the following two aspects: (i) pathMap keeps all k-mers in reference sequences and query reads, while other methods, such as Winnowmap2, only select the weighted minimizers of the references to avoid masking frequently occurring k-mers during the seeding stage, and (ii) pathMap selects more chains and performs more detailed base-to-base alignments. In the current implementation of pathMap, the parameter w (−w in the command line) controls how many k-mers are extracted from references and reads. The default value of w is 1 in this study, which means that all k-mers at every 1 bp window are extracted. To explore how the parameter w affects the mapping accuracy, we performed a test on the simulated dataset with 15% error rate used in experiment 1. Running time and accuracy, the latter evaluated by correct read location and correct base-to-base alignment, are reported in Table 5. It can be seen that with w = 5 the running time was more than 2-fold faster than with w = 1, whereas the read location accuracy and base-wise accuracy were, respectively, about 5% and 1.81% lower than those of w = 1. Therefore, users can increase the value of w to quickly generate mapping results for massive SMS datasets.
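
To make the effect of w concrete, here is a minimal sketch under a stated assumption: that w acts as a minimizer window, so that one k-mer is kept per window of w consecutive k-mers and w = 1 keeps every k-mer. The text above only states that w controls how many k-mers are extracted, so this interpretation, the lexicographic ordering (real mappers typically order k-mers by a hash) and the function name `select_seeds` are all hypothetical.

```python
from typing import List, Tuple


def select_seeds(seq: str, k: int, w: int) -> List[Tuple[int, str]]:
    """Keep one k-mer (the smallest) per window of w consecutive k-mers.

    With w = 1 every k-mer is kept; larger w keeps fewer seeds.
    Returns (position, k-mer) pairs sorted by position.
    """
    kmers = [(seq[i:i + k], i) for i in range(len(seq) - k + 1)]
    if not kmers:
        return []
    if w <= 1:
        return [(pos, kmer) for kmer, pos in kmers]
    chosen = set()
    # Slide a window over the k-mer list; ties are broken by leftmost position.
    for start in range(max(1, len(kmers) - w + 1)):
        chosen.add(min(kmers[start:start + w]))
    return sorted((pos, kmer) for kmer, pos in chosen)
```

On the toy sequence `ACGTACGTAC` with k = 3, w = 1 yields all 8 k-mers, while w = 3 keeps only the 4 k-mers at positions 0, 1, 4 and 5. Fewer seeds mean fewer nodes in the matched k-mer graph, which is consistent with the shorter runtimes and slightly lower accuracies reported for larger w.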

Figure 4.

Wall-clock time and memory usage of five mapping methods on the real-life C. elegans dataset generated by PacBio used in experiment 2. Note that the y axis of each plot is log-scaled.

Table 5.

Influence of the value of parameter w on the mapping results for the simulated dataset with 15% error rate used in experiment 1

Value of w | Runtime (s) | Read location accuracy (%) | Base-wise accuracy (%)
1 | 46 | 99.89 | 99.87
2 | 26 | 99.58 | 99.75
3 | 23 | 98.79 | 99.34
4 | 20 | 97.40 | 98.57
5 | 19 | 94.94 | 98.06
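
The speed and accuracy trade-off in Table 5 can be recomputed directly from its rows. The short sketch below copies the values from the table; the dictionary and function names are illustrative only.

```python
# Rows copied from Table 5: w -> (runtime in s, read location accuracy %, base-wise accuracy %)
TABLE5 = {
    1: (46, 99.89, 99.87),
    2: (26, 99.58, 99.75),
    3: (23, 98.79, 99.34),
    4: (20, 97.40, 98.57),
    5: (19, 94.94, 98.06),
}


def tradeoff(w: int, baseline: int = 1) -> tuple:
    """Return (speed-up, location accuracy loss, base-wise accuracy loss) versus the baseline w."""
    t0, loc0, base0 = TABLE5[baseline]
    t, loc, base = TABLE5[w]
    return round(t0 / t, 2), round(loc0 - loc, 2), round(base0 - base, 2)
```

For w = 5 this gives a 2.42-fold speed-up at the cost of 4.95 percentage points of read location accuracy and 1.81 points of base-wise accuracy, while w = 2 already gives a 1.77-fold speed-up for losses of only 0.31 and 0.12 points.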

CONCLUSIONS

With the rapid development of SMS technologies, throughput and read lengths continue to increase. One of the most basic procedures in the analysis of massive long-read datasets is mapping reads against a known reference genome. Mapping sensitivity is an important requirement for mapping tools and mapping-based analysis. Therefore, there is still an urgent demand for novel algorithms with higher mapping sensitivity.

In this study, we designed pathMap, a novel path-based long noisy read mapper that views the alignment chain as a path containing as many anchors as possible in the matched k-mer graph. pathMap was benchmarked on simulated and real datasets from various genomes and sequencing platforms. The mapping accuracy, quality and sensitivity were measured and compared with four mainstream mappers: minimap2, Winnowmap2, NGMLR and GraphMap. The experimental results showed that pathMap could effectively align long reads, with a convincing improvement in detecting more candidate chains and obtaining more mapped bases. In addition, pathMap is more robust to sequence errors and more sensitive in species- and strain-level identification of pathogens with MinION reads. We believe that integrating pathMap into sequence analysis workflows could be a good choice to facilitate applications of SMS data.

Key Points

  • To improve the mapping sensitivity for long reads with high sequencing error rates, a novel path-based long noisy read mapper, termed pathMap, is developed.

  • pathMap treats chaining as a path selection problem in the directed matched k-mer graph and iteratively searches the longest path in the remaining nodes, and thus, more candidate chains with high quality can be effectively detected and aligned.

  • The path-based selection strategy helps pathMap outperform state-of-the-art methods.

  • pathMap is more sensitive to species- and strain-specific identification of pathogens with MinION reads.

Supplementary Material

Final_supplementary_file_20240224_bbae107

Author Biographies

Ze-Gang Wei is an associate professor at Baoji University of Arts and Sciences. He was a visiting scholar at the University of Saskatchewan in 2023. His research mainly focuses on bioinformatics and the development of tools and algorithms for long noisy reads.

Xiao-Dan Zhang is currently a lecturer at Baoji University of Arts and Sciences. Her research mainly focuses on sequence alignment and clustering.

Xing-Guo Fan is a graduate student at Baoji University of Arts and Sciences. His research mainly focuses on software development for biological sequence analysis.

Yu Qian is currently a professor at Baoji University of Arts and Sciences. His research mainly focuses on sequence analysis for third-generation sequencing.

Fei Liu is currently a professor at Baoji University of Arts and Sciences. His research mainly focuses on graph theory and systems biology.

Fang-Xiang Wu is currently a full professor at the University of Saskatchewan. His research interests include computational biology, machine/deep learning, artificial intelligence, medical image analytics, pattern recognition and complex network analytics.

Contributor Information

Ze-Gang Wei, School of Physics and Opto-Electronics Technology, Baoji University of Arts and Sciences, Baoji, 721016, China; Division of Biomedical Engineering, Department of Computer Science and Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, SK S7N 5A9, Canada.

Xiao-Dan Zhang, School of Physics and Opto-Electronics Technology, Baoji University of Arts and Sciences, Baoji, 721016, China.

Xing-Guo Fan, School of Physics and Opto-Electronics Technology, Baoji University of Arts and Sciences, Baoji, 721016, China.

Yu Qian, School of Physics and Opto-Electronics Technology, Baoji University of Arts and Sciences, Baoji, 721016, China.

Fei Liu, School of Physics and Opto-Electronics Technology, Baoji University of Arts and Sciences, Baoji, 721016, China.

Fang-Xiang Wu, Division of Biomedical Engineering, Department of Computer Science and Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, SK S7N 5A9, Canada.

FUNDING

This study was funded by the China Scholarship Council (202108615064), the Scientific Research Program of Shaanxi Provincial Education Department (23JK0287), the Natural Science Basic Research Plan of Shaanxi Province (2024SF-YBXM-134, 2022GD-TSLD-27, 2021JQ-811 and 2022JZ-03), the Shaanxi Fundamental Science Research Project for Mathematics and Physics (22JSY021), the Teaching Reform Project of Baoji University of Arts and Sciences (22JGYB37) and the Ministry of Education Industry-University Cooperation and Collaborative Education Project (230705211175618).

AUTHOR CONTRIBUTIONS

Z.-G.W. prepared the manuscript draft, implemented pathMap and analyzed the comparison results. X.-D.Z. and X.-G.F. wrote the code for pathMap. F.-X.W. conceived the project and improved the manuscript. F.L. and Y.Q. performed the experimental analysis, with guidance from F.-X.W.

DATA AVAILABILITY

The datasets used in all experiments are available in their article or Supplementary File. The code of pathMap is publicly available at https://github.com/zhang134/pathmap.git.

References

  • 1. Zhang Y, Zhang Q, Zhou J, Zou Q. A survey on the algorithm and development of multiple sequence alignment. Brief Bioinform 2022;23(3):bbac069.
  • 2. Wei Z-G, Zhang S-W. DBH: a de Bruijn graph-based heuristic method for clustering large-scale 16S rRNA sequences into OTUs. J Theor Biol 2017;425:80–7.
  • 3. Wei, Chen, Zhang, et al. Comparison of methods for biological sequence clustering. IEEE/ACM Trans Comput Biol Bioinform 2023;20(5):2874–88.
  • 4. Sahlin K, Baudeau T, Cazaux B, Marchet C. A survey of mapping algorithms in the long-reads era. Genome Biol 2022;24(1):133.
  • 5. Wang, Bao, Lv, et al. Genome sequence resource of Phytophthora colocasiae from China using nanopore sequencing technology. Plant Dis 2021;105(12):4141–5.
  • 6. Riaz, Leung, Barton, et al. Adaptation of Oxford Nanopore technology for hepatitis C whole genome sequencing and identification of within-host viral variants. BMC Genomics 2021;22(1):1–12.
  • 7. Lu, Giordano, Ning Z. Oxford Nanopore MinION sequencing and genome assembly. Genomics Proteomics Bioinformatics 2016;14(5):265–79.
  • 8. Wei, Zhang, Cao, et al. Comparison of methods for picking the operational taxonomic units from amplicon sequences. Front Microbiol 2021;12(474):644012.
  • 9. Lin HN, Hsu WL. Kart: a divide-and-conquer algorithm for NGS read alignment. Bioinformatics 2017;33(15):2281–7.
  • 10. Wei Z-G, Zhang S-W. NPBSS: a new PacBio sequencing simulator for generating the continuous long reads with an empirical model. BMC Bioinformatics 2018;19(1):177.
  • 11. Mantaci S, Restivo A, Rosone G, Sciortino M. An extension of the Burrows–Wheeler transform. Theor Comput Sci 2007;387(3):298–312.
  • 12. Liu H, Zou Q, Xu Y. A novel fast multiple nucleotide sequence alignment method based on FM-index. Brief Bioinform 2022;23(1):bbab519.
  • 13. Chaisson MJ, Tesler G. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics 2012;13(1):238.
  • 14. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv 2013;1303.3997.
  • 15. Liu B, Gao Y, Wang Y. LAMSA: fast split read alignment with long approximate matches. Bioinformatics 2016;33(2):192–201.
  • 16. Haghshenas E, Sahinalp SC, Hach F. lordFAST: sensitive and fast alignment search tool for LOng noisy read sequencing data. Bioinformatics 2018;35:20–7.
  • 17. Wei Z-G, Zhang S-W, Liu F. smsMap: mapping single molecule sequencing reads by locating the alignment starting positions. BMC Bioinformatics 2020;21(1):341.
  • 18. Marco-Sola S, Sammeth M, Guigó R, et al. The GEM mapper: fast, accurate and versatile alignment by filtration. Nat Methods 2012;9(12):1185–8.
  • 19. Liu B, Guan D, Teng M, Wang Y. rHAT: fast alignment of noisy long reads with regional hashing. Bioinformatics 2015;32(11):1625–31.
  • 20. Sović I, Šikić M, Wilm A, et al. Fast and sensitive mapping of nanopore sequencing reads with GraphMap. Nat Commun 2016;7:11307.
  • 21. Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 2018;34(18):3094–100.
  • 22. Jain C, Rhie A, Hansen NF, et al. Long-read mapping to repetitive reference sequences using Winnowmap2. Nat Methods 2022;19(6):705–10.
  • 23. Sedlazeck FJ, Rescheneder P, Smolka M, et al. Accurate detection of complex structural variations using single-molecule sequencing. Nat Methods 2018;15(6):461–8.
  • 24. Wei Z-G, Fan XG, Zhang H, et al. kngMap: sensitive and fast mapping algorithm for noisy long reads based on the k-mer neighborhood graph. Front Genet 2022;13:890651.
  • 25. Ding Y, Fu M, Luo P, Wu FX. Network learning for biomarker discovery. Int J Netw Dyn Intell 2023;2(1):51–65.
  • 26. Sedlazeck FJ, Rescheneder P, Von Haeseler A. NextGenMap: fast and accurate read mapping in highly polymorphic genomes. Bioinformatics 2013;29(21):2790–1.
  • 27. Ashton PM, Nair S, Dallman T, et al. MinION nanopore sequencing identifies the position and structure of a bacterial antibiotic resistance island. Nat Biotechnol 2015;33(3):296–300.
  • 28. Alser M, Rotman J, Deshpande D, et al. Technology dictates algorithms: recent developments in read alignment. Genome Biol 2021;22(1):1–34.
  • 29. Zhang Y, Tai He J, Zhang Y, Zuo K. A comprehensive analysis of sequence alignment algorithms for LongRead sequencing. Curr Bioinform 2016;11(3):375–81.
  • 30. Esmat AM, Amin N, Sima E, Reza GM. A parallel hash-based method for local sequence alignment. Concurr Comput Pract Exp 2022;34(3):e6568.
  • 31. Fu M, Wang M, Wu Y, et al. A two-branch neural network for short-axis PET image quality enhancement. IEEE J Biomed Health Inform 2023;27:2864–75.
  • 32. Li H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 2016;32(14):2103–10.
  • 33. Wei Z, Zhang S-W. DMSC: a dynamic multi-seeds method for clustering 16S rRNA sequences into OTUs. Front Microbiol 2019;10:428.
  • 34. Ono Y, Asai K, Hamada M. PBSIM2: a simulator for long-read sequencers with a novel generative model of quality scores. Bioinformatics 2020;37:589–95.
  • 35. Tham CY, Tirado-Magallanes R, Goh Y, et al. NanoVar: accurate characterization of patients’ genomic structural variants using low-depth nanopore sequencing. Genome Biol 2020;21(1):1–15.
  • 36. Rausch T, Zichner T, Schlattl A, et al. DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics 2012;28(18):i333–9.
  • 37. Zook JM, Hansen NF, Olson ND, et al. A robust benchmark for detection of germline large deletions and insertions. Nat Biotechnol 2020;38(11):1347–55.
  • 38. Jiang T, Liu Y, Jiang Y, et al. Long-read-based human genomic structural variation detection with cuteSV. Genome Biol 2020;21(1):1–24.
  • 39. Quick J, Quinlan AR, Loman NJ. A reference bacterial genome dataset generated on the MinION™ portable single-molecule nanopore sequencer. Gigascience 2014;3(1):22.


Articles from Briefings in Bioinformatics are provided here courtesy of Oxford University Press