S-conLSH: alignment-free gapped mapping of noisy long reads

Angana Chakraborty; Burkhard Morgenstern; Sanghamitra Bandyopadhyay

doi:10.1186/s12859-020-03918-3

2021 Feb 11;22:64. doi: 10.1186/s12859-020-03918-3

PMCID: PMC7879691 PMID: 33573603

Abstract

Background

The advancement of SMRT technology has unfolded new opportunities of genome analysis with its longer read length and low GC bias. Alignment of the reads to their appropriate positions in the respective reference genome is the first but costliest step of any analysis pipeline based on SMRT sequencing. However, the state-of-the-art aligners often fail to identify distant homologies due to lack of conserved regions, caused by frequent genetic duplication and recombination. Therefore, we developed a novel alignment-free method of sequence mapping that is fast and accurate.

Results

We present a new mapper called S-conLSH that uses Spaced context based Locality Sensitive Hashing. With multiple spaced patterns, S-conLSH facilitates a gapped mapping of noisy long reads to the corresponding target locations of a reference genome. We have examined the performance of the proposed method on 5 different real and simulated datasets. S-conLSH is at least 2 times faster than the recently developed method lordFAST. It achieves a sensitivity of 99%, without using any traditional base-to-base alignment, on human simulated sequence data. By default, S-conLSH provides an alignment-free mapping in PAF format. However, it has an option of generating aligned output as SAM-file, if it is required for any downstream processing.

Conclusions

S-conLSH is one of the first alignment-free reference genome mapping tools achieving a high level of sensitivity. The spaced-context is especially suitable for extracting distant similarities. The variable-length spaced-seeds or patterns add flexibility to the proposed algorithm by introducing gapped mapping of the noisy long reads. Therefore, S-conLSH may be considered as a prominent direction towards alignment-free sequence analysis.

Keywords: Sequence analysis, Alignment-free sequence comparison, Noisy long SMRT reads, Locality sensitive hashing

Background

Single molecule real time (SMRT) sequencing technologies developed by Pacific Biosciences [1] and Oxford Nanopore Technologies [2] have started to replace previous short-read next generation sequencing (NGS) technologies. These new technologies have enabled us to address many unsolved problems regarding genetic variations. With the increase in read length to around 20 kb [3], SMRT reads can be used to resolve ambiguities in read mapping caused by repetitive regions. Low GC bias and the ability to detect DNA methylation [1] from native DNA have made SMRT data appealing for many real-life applications. However, the high sequencing error rate of 13-15% per base [3] poses a real challenge in sequence analysis. Specialized methods like BWA-MEM [4], BLASR [5], rHAT [6], Minimap2 [7], lordFAST [8], etc., have been designed to align noisy long reads back to the respective reference genomes. BLASR [5] clusters the matched words from the reads and genome after indexing using suffix arrays or BWT-FM [9]. It uses a probability-based error optimization technique to find the alignment. BWA-MEM [4], originally designed for short read mapping, has been extended for PacBio and Oxford Nanopore reads (with options -x pacbio and -x ont2d, respectively) by efficient seeding and chaining of short exact matches. However, both methods are too slow to achieve a desired level of sensitivity [6]. This issue was addressed by rHAT [6] using a regional hash table, where windows from the reference genome with the highest number of k-mer matches are chosen as candidate sites for further extension using a directed acyclic graph. Unfortunately, this method has a large memory footprint if used with the default word length of $k = 13$, and it fails to accommodate longer k-mers to resolve repeats. Minimap2 [7], a recently developed method, uses concave gap cost, efficient chaining and a fast implementation using SSE or NEON instructions to align reads with high sensitivity and speed. Another new method, lordFAST [8], has been introduced to align PacBio’s continuous long reads with improved accuracy. MUMmer4 [10], a versatile genome alignment system, also has an option for PacBio read alignment (-l 15 -c 31), although it is less sensitive and accurate than the specialized aligners.

However, all the above-mentioned methods come with large computational costs. Here, time and memory consumption are dominated by the alignment overhead. On top of that, alignment algorithms are often unable to correctly align distant homologs in the “twilight zone” with 20–35% sequence identity, as such weak similarities are difficult to distinguish from random similarities. For these reasons, alignment-free methods have become popular in recent years. See [11–13] for recent review papers and [14] for a systematic evaluation of these approaches. An alignment-free method, Minimap [15], was developed in 2016 for mapping reads to the appropriate positions in the reference genome. Minimap groups approximately colinear hits using single linkage clustering to map the reads. However, Minimap suffers from low specificity. In this article, a new alignment-free method called S-conLSH is proposed to overcome the above-mentioned problems. Being suitable for weakly conserved regions and computationally less expensive, S-conLSH is both sensitive and very fast.

A large proportion of the sequencing errors in SMRT data are indels rather than mismatches [3]. This makes it even more complicated to differentiate genomic variations from sequence errors. To resolve this issue, a concept of ‘context-based’ Locality Sensitive Hashing (conLSH) has been introduced by Chakraborty and Bandyopadhyay [16]. Locality Sensitive Hashing (LSH) [17, 18] has been successfully applied in many real life-science applications, ranging from genome comparison [19, 20] to large scale genome assembly [21]. In LSH, points close to each other in the feature space are hashed into localized slots of the hash table. However, in practice, the neighborhood or context of an object plays a key role in measuring its similarity to another object. Chakraborty and Bandyopadhyay [16] have shown that contexts of symbols (a base in reference to DNA) are important to decide the closeness of strings. They proposed conLSH to group sequences in localized slots of the hash table if they share a common context. However, a match for the entire context is a stringent criterion, considering the error profile of SMRT data. Even a mismatch or indel of length one, caused by a sequencing error, may mislead the aligner.

Therefore, to address this problem, an idea of spaced-context is introduced in this article. Unlike conLSH [16] which produces base-level alignments of the sequences using Sparse Dynamic Programming (SDP) based algorithm, the proposed method (S-conLSH) is an alignment-free tool. It employs multiple spaced-seeds or patterns to find gapped mappings of noisy SMRT reads to reference genomes. The spaced-seeds are strings of 0’s and 1’s where ‘1’ represents the match position and ‘0’ denotes don’t care position where matching in the symbols is not mandatory. The substring formed by extracting the symbols corresponding to the ‘1’ positions in the pattern is defined as the spaced-context of a sequence. Therefore, a spaced-context can minimize the effect of erroneous bases and, thereby, enhances the quality of mapping because it does not check all the bases for a match. This differentiates the proposed method from conLSH which looks into the entire context to compute the hash values.
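The extraction of a spaced-context can be illustrated with a minimal Python sketch. The function name `spaced_context`, the toy pattern, and the toy sequences are assumptions made for illustration; they are not taken from the S-conLSH implementation.

```python
def spaced_context(seq: str, pattern: str, start: int = 0) -> str:
    """Return the symbols of seq that lie under the '1' positions of pattern.

    The pattern is laid over seq beginning at index `start` (0-based).
    Symbols under '0' (don't care) positions are ignored.
    """
    window = seq[start:start + len(pattern)]
    if len(window) < len(pattern):
        raise ValueError("pattern runs past the end of the sequence")
    return "".join(base for base, flag in zip(window, pattern) if flag == "1")


# Toy pattern for illustration only (not one of the default patterns).
pattern = "1110011"
reference = "ACGTTGCA"
noisy_read = "ACGAAGCA"  # differs from reference only under the '0' positions

# Both sequences yield the same spaced-context, so they would be
# treated as a match despite the two erroneous bases.
print(spaced_context(reference, pattern))
print(spaced_context(noisy_read, pattern))
```

Because the two differing bases fall under don't care positions, both sequences produce the same spaced-context, whereas a contiguous context of the same span would differ.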

A pattern-based approach was originally proposed by [22] when they developed PatternHunter, a fast and sensitive homology search tool. Later, multiple patterns or “spaced seeds” were proposed by the same authors [23]. Efficient algorithms to find optimal sets of patterns have been introduced by [24] and [25]. A fast alignment-free sequence comparison method using multiple spaced seeds has been described in [26], see also [27] and [28].

The algorithm, S-conLSH, described in this article is an alignment-free tool designed for mapping of noisy and long reads to the reference genome. The concept of Spaced-context has been elaborated in the Methods section along with a description of the proposed algorithm.

Results

Six different real and simulated datasets of E.coli, A.thaliana, O.sativa, S.cerevisiae and H.sapiens have been used to benchmark the performance of S-conLSH in comparison to other state-of-the-art aligners, viz., Minimap2 [7], lordFAST [8], Minimap [15], conLSH [16] and MUMmer4 [10]. All these methods are executed in a setting designed for PacBio read alignment (see Table 1). The default parameter settings used for S-conLSH are $K = 2$, context size $(2 λ + 1) = 7$, $L = 2$ and $z = 5$ (refer to the Methods section of this article for details of the S-conLSH parameters). The two patterns used in our experiment in the default setup of $L = 2$ are 11111110011111110000 and 111111100000111111100. The datasets used in the experiment are summarized in Table 2. For the sake of simplicity, the results have been obtained by executing the different aligners in a single thread. Please refer to Tables [S-1] to [S-5] of Additional file 1: Note 1 for a detailed review of their performance in multi-threaded mode.

Table 1.

Parameter settings and commands used by different methods for mapping of PacBio SMRT reads

Mapper	Command line settings
Minimap [15]	./minimap -d in.mmi ref_file
Minimap [15]	./minimap -l in.mmi read_file > output_file
Minimap2 [7]	./minimap2 -Hk19 -d in.mmi ref_file
Minimap2 [7]	./minimap2 -a -Y -x map-pb -t 1 in.mmi read_file > output_file
lordFAST [8]	./lordfast --index ref_file
lordFAST [8]	./lordfast --search ref_file --seq read_file --thread 1 > output_file
MUMmer4 [10]	./nucmer --sam-long=output_file ref_file read_file -l 15 -c 31
conLSH [16]	./conLSH-indexer $PATH ref_file -K 2 -L 1 > output_file
conLSH [16]	./conLSH-aligner $PATH read_file ref_file -K 2 -L 1 -l 8 -w 1000 -m 5 --lambda 2 > output_file
S-conLSH	./S-conLSH $PATH ref_file read_file -K 2 --lambda 3 -L 2 -z 5 > output_file

Table 2.

The summary of real and simulated datasets used in the experiment along with the corresponding reference genome links

Dataset	Type	Platform	# of reads	Reference genome
H. sapiens-real	Real	PacBio RS II P5/C3 release	290,992	hg38
E. coli-real	Real	PacBio RS II P5/C3 release	300	Escherichia coli str. K-12 substr. MG1655
A. thaliana-real	Real	PacBio RS II P5/C3 release	3,448,228	TAIR10
O. sativa-real	Real	PacBio RS II P5/C3 release	590,268	Build 4.0
S. cerevisiae-real	Real	PacBio RS II P5/C3 release	594,243	S288C (assembly R64)
H. sapiens-sim	Simulated	PBSIM	146,932	hg38

The aligner, rHAT [6] has been excluded from the study, as it has been reported to malfunction in certain scenarios [7]. The PacBio read alignment module of BWA-MEM [4] has been replaced by Minimap2, as it retains all the main features of BWA-MEM, while being 50 $\times$ faster and more accurate. Therefore, the results of BWA-MEM are not shown separately in the tables. Moreover, BLASR [5] has also not been used in the comparative study, as Minimap2 and lordFAST have been found to outperform it in all respects.

By default, S-conLSH produces output in pairwise read mapping format (PAF) [15]. There are scripts available to convert the popular SAM [29] alignment format to PAF [15]. If a base-to-base alignment is requested, S-conLSH provides an option (--align 1), where the target locations are aligned using the ksw alignment library (https://github.com/attractivechaos/klib) to produce the SAM file. The entire experiment has been conducted on an Intel Core i7-6200U CPU @ 2.30 GHz $\times$ 16 (cores), 64-bit machine with 32 GB RAM.

The results presented in this article are organized into three categories: (1) performance on simulated datasets, (2) study on real PacBio reads, and (3) robustness of S-conLSH for different parameter settings.

Experiment on simulated dataset

To study the accuracy of SMRT read mapping, a total of 146,932 noisy long reads have been simulated from the hg38 human genome using the PBSIM [30] command “pbsim --data-type CLR --depth 1 --length-min 1 --length-max 200000 --seed 0 --sample-fastq real.fastq hg38.fa”. The error profile has been sampled from the three real human PacBio RS II P5/C3 read files listed below, concatenated as real.fastq.

m130929_024849_42213_c100518541*_s1_p0.1.subreads.fastq
m130929_024849_42213_c100518541*_s1_p0.2.subreads.fastq
m130929_024849_42213_c100518541*_s1_p0.3.subreads.fastq

The simulated reads from 5 different human chromosomes are used to test the performance of S-conLSH in comparison to the other standard aligners. The sensitivity and precision have been computed based on the ground truth obtained from the .maf files of PBSIM. A read is considered to be mapped correctly (as defined by [8]) if (1) it is mapped to the correct chromosome and strand; and (2) the target subsequence of the reference genome to which the read maps overlaps with the true mapping by at least 1 bp. The sensitivity is measured as the fraction of correctly mapped reads out of the total number of reads. Precision is defined, in the same way, as the fraction of correctly mapped reads out of the total number of mapped reads.
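The correctness criterion and the two metrics defined above can be sketched in Python as follows. The `Mapping` record, its field names, and the half-open coordinate convention are assumptions for illustration; they are not the format of the PBSIM .maf files or of any mapper's output. The counts in the usage lines are those reported for S-conLSH on Chr1 in Table 3, and the computed percentages should agree with the tabulated values up to rounding.

```python
from dataclasses import dataclass


@dataclass
class Mapping:
    chrom: str
    strand: str
    start: int  # 0-based, inclusive (assumed convention)
    end: int    # exclusive (assumed convention)


def is_correct(pred: Mapping, truth: Mapping) -> bool:
    """Correct if chromosome and strand agree and the intervals share >= 1 bp."""
    same_place = pred.chrom == truth.chrom and pred.strand == truth.strand
    overlap = min(pred.end, truth.end) - max(pred.start, truth.start)
    return same_place and overlap >= 1


def sensitivity_precision(n_reads: int, n_mapped: int, n_correct: int):
    """Return (sensitivity, precision) in percent."""
    sensitivity = 100.0 * n_correct / n_reads
    precision = 100.0 * n_correct / n_mapped
    return sensitivity, precision


# Counts reported for S-conLSH on Chr1 in Table 3.
sens, prec = sensitivity_precision(n_reads=32290, n_mapped=32111, n_correct=31964)
print(f"sensitivity = {sens:.2f}%, precision = {prec:.2f}%")
```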

Table 3 summarizes the number of correct mappings, sensitivity, precision, and running time of the different methods, Minimap, Minimap2, lordFAST, MUMmer4, conLSH, and S-conLSH, for a total of 146,932 reads simulated from five different human chromosomes. The number of reads extracted from each chromosome is listed in Table 3. The results show that S-conLSH produces the highest number of correct mappings among all six methods for the different chromosomes of the H. sapiens-sim dataset. S-conLSH maps 32,111 reads out of a total of 32,290 reads of Chr1, among which 31,964 mappings are found to be correct when compared with the ground truth. Minimap2 produces the second highest number of correct mappings in this case. A similar scenario has been generally observed for the four other chromosomes as well. Minimap2 always aligns all the reads to some location in the reference genome, but produces more incorrect mappings than S-conLSH. S-conLSH provides the highest sensitivity for all the chromosomes considered. Minimap, on the other hand, exhibits higher precision but lower sensitivity, as it leaves a large number of reads unaligned. The number of reads left unaligned by Minimap increases for large and complicated real datasets.

Table 3.

Comparative study of the number of correct mappings, sensitivity, precision, and running time by different methods, Minimap, Minimap2, lordFAST, MUMmer4, conLSH, and S-conLSH, for a total of 146,932 reads simulated from five different human chromosomes

Chr#	#Reads	Mapper	#Mapped reads	#Correct mapping	Sensitivity (%)	Precision (%)	Indexing time (s)	Mapping time (s)
Chr1	32,290	Minimap	31,591	31,585	97.82	99.99	15	30
		Minimap2	32,290	31,863	98.69	98.69	10	61
		lordFAST	32,290	29,313	90.79	90.79	192	206
		MUMmer4	31,940	31,645	98.01	99.08	-	310
		conLSH	31,945	29,620	91.73	92.72	8	235
		S-conLSH	32,111	31,964	99	99.55	51	38
Chr2	34,309	Minimap	33,623	33,613	97.98	99.98	16	33
		Minimap2	34,309	33,864	98.71	98.71	10	64
		lordFAST	34,309	31,173	90.87	90.87	170	216
		MUMmer4	34,056	33,914	98.85	99.59	-	312
		conLSH	34,082	31,230	91.03	91.63	10	229
		S-conLSH	34,153	34,008	99.13	99.58	54	44
Chr3	28,109	Minimap	27,481	27,477	97.76	99.99	15	25
		Minimap2	28,109	27,698	98.55	98.55	8	52
		lordFAST	28,109	25,513	90.77	90.77	135	167
		MUMmer4	27,894	27,791	98.87	99.64	-	253
		conLSH	27,899	25,603	91.08	91.77	7	198
		S-conLSH	27,957	27,863	99.13	99.67	45	30
Chr4	26,871	Minimap	26,307	26,301	97.88	99.98	16	23
		Minimap2	26,871	26,501	98.63	98.63	8	51
		lordFAST	26,871	24,403	90.82	90.82	129	158
		MUMmer4	26,638	26,533	98.75	99.61	-	248
		conLSH	26,650	24,449	90.98	91.74	7	180
		S-conLSH	26,748	26,650	99.18	99.64	41	29
Chr5	25,353	Minimap	24,859	24,849	98.02	99.96	14	21
		Minimap2	25,353	25,056	98.84	98.84	7	48
		lordFAST	25,353	23,069	91	91	123	149
		MUMmer4	25,126	24,951	98.42	99.31	-	234
		conLSH	25,149	23,106	91.14	91.87	6	167
		S-conLSH	25,242	25,155	99.22	99.66	39	23

Italic values are the best results in each category

As can be seen, S-conLSH takes 38 CPU seconds to map the reads of chromosome 1, which is slightly slower than Minimap. Minimap owes its speed to the fact that it maps fewer reads than the other aligners. Interestingly, S-conLSH has a smaller mapping time than all the remaining algorithms, while having the maximum number of correctly mapped reads. As no separate indexing and aligning times were available for MUMmer4, the total time is reported as “Mapping time”. MUMmer4 has been found to consume a large amount of time to achieve a desired level of sensitivity. It is evident from Table 3 that the indexing time is quite low for both Minimap and Minimap2. The indexing time of S-conLSH is higher, though still much smaller than that of lordFAST. Here, it may be noted that indexing is performed only once for a given reference genome, while the read mapping needs to be performed every time a different individual is sequenced. The compressed and memory-efficient B-tree indexing of conLSH makes it the fastest in processing of reference genomes. However, the mapping time of conLSH is large, as it performs base-to-base alignments using Sparse Dynamic Programming. The stringent ungapped matching requirement of conLSH over the entire context of the sequences results in the second-lowest sensitivity, above only lordFAST. The proposed alignment-free tool, S-conLSH, has been found to be useful in such cases, as it obtains the gapped mapping of the noisy reads using spaced-contexts.

While this section reports results of single-threaded execution, Tables [S-1]–[S-5] of Additional file 1: Note 1 show the performance gain of S-conLSH on multi-threaded systems. S-conLSH achieves more than 50% reduction in mapping time when run in 4 concurrent threads compared with its single-threaded version. The indexing time of S-conLSH also improves with a higher degree of parallelism and becomes comparable with that of Minimap2 when the number of threads is equal to 8. Moreover, this performance gain comes with almost no additional memory requirement. Please refer to Additional file 1: Note 1 for a detailed report.

Experiment on real PacBio datasets

This section demonstrates the performance of S-conLSH in comparison to other state-of-the-art aligners on five different SMRT datasets of E.coli-real, A.thaliana-real, O.sativa-real, S.cerevisiae-real and H.sapiens-real (refer to Table 2 for details). A comparative study of running time, percentage of reads aligned and coverage by different aligners is detailed in Table 4 for the real human SMRT subread file m130929_024849_42213_c100518541*_s1_p0.1.subreads.fastq, consisting of 23,235 reads. Results for MUMmer4 are excluded since it takes inordinately long.

Table 4.

Comparative study of running time, percentage of reads aligned and coverage by different aligners for H.sapiens-real SMRT dataset of 23,235 reads

Mapper	Indexing time (s)	Mapping time (s)	% of reads aligned	Mean coverage
Minimap	140	30	94.8	NA
Minimap2	138	106	100	0.0473
lordFAST	2286	327	100	0.0566
conLSH	47	404	99	0.0579
S-conLSH	794	99	99.9	0.078

It can be seen that S-conLSH provides the highest coverage value among the five standard methods used in the experiment. Minimap does not have any coverage statistics, as it is unable to produce alignments as a SAM file. The performance in terms of indexing and mapping time, as shown in Table 4, is similar to what has already been observed for the simulated datasets. The percentage of reads aligned is the highest for Minimap2 and lordFAST. This is similar to the scenario observed on the simulated datasets, where Minimap2 and lordFAST align all the reads against the reference genome, even though the result may contain some incorrect mappings. S-conLSH, on the other hand, has a mapping ratio of 99.9%, which is lower than that of Minimap2 and lordFAST. This is due to the fact that S-conLSH gives higher priority to the mapping accuracy and leaves a few reads unaligned if potential target locations are not found. S-conLSH has a higher memory footprint of about 13 GB for indexing the entire human genome.

Similar results are observed for the E.coli-real, A.thaliana-real, O.sativa-real, and S.cerevisiae-real PacBio datasets, as can be seen in Table 5. S-conLSH is among the fastest in terms of mapping time, second only to Minimap. However, Minimap fails to align a good portion of the reads for large datasets like A.thaliana and S.cerevisiae. The aligner conLSH, on the other hand, requires lower indexing time but higher mapping time to align a reasonable number of reads to the reference genome. However, the alignment quality of conLSH is often compromised, as seen in the previous subsection. The percentage of reads mapped by S-conLSH is generally lower than that of Minimap2 and lordFAST, as it tries to ensure the best mapping quality. It is, however, difficult to measure the quality of read mappings for real datasets, as there is no ground truth available for such cases.

Table 5.

Comparative study of running time, percentage of reads aligned by different aligners for four datasets of E. coli-real, A. thaliana-real, O. sativa-real and S. cerevisiae-real

Dataset	#Reads	Mapper	Index time (s)	Mapping time (s)	% of reads aligned
E. coli-real	300	Minimap	1	1	91.6
		Minimap2	2	2	100
		lordFAST	3	10	100
		conLSH	1	2	100
		S-conLSH	2	1	98.3
A. thaliana-real	3,448,228	Minimap	7	798	88
		Minimap2	6	10,562	100
		lordFAST	78	24,277	100
		conLSH	1	40,028	99.4
		S-conLSH	27	2073	93.7
O. sativa-real	590,268	Minimap	20	487	94.3
		Minimap2	20	6898	100
		lordFAST	287	10,462	100
		conLSH	6	18,024	97
		S-conLSH	88	962	98.2
S. cerevisiae-real	594,243	Minimap	1	104	41.8
		Minimap2	1	881	100
		lordFAST	7	2130	100
		conLSH	1	7890	99.8
		S-conLSH	3	299	90.3

Robustness of S-conLSH for different parameter settings

An exhaustive experiment with different values of K, $λ$ , L, and z has been carried out to study the robustness of the proposed method S-conLSH.

Table 6 summarizes the indexing and mapping time along with the percentage of reads aligned for different values of the S-conLSH parameters on the real human SMRT dataset m130929_024849_42213_c100518541*_s1_p0.1.subreads.fastq. As can be observed, the best performance (highest percentage of read mapping in minimum time) is achieved with the settings $K = 2$, $(2 λ + 1) = 7$, $L = 2$ and $z = 5$. The mapping time increases with L, as it directly corresponds to the number of hash tables (one for each of the L different spaced-seeds) used to retrieve the target locations. Therefore, the search becomes more rigorous as it considers all spaced-contexts obtained from L different patterns. This is, however, useful for highly sensitive applications at the expense of a few more seconds of mapping time. An efficient solution could be the use of S-conLSH with higher values of L, distributed over multiple concurrent threads. Please refer to Additional file 1: Note 2 regarding the performance of S-conLSH for different values of L in a multi-threaded system.

Table 6.

Performance of S-conLSH with change of K, L, z, and $λ$ for real human SMRT dataset

Concatenation factor (K)	Context size ($2 λ + 1$)	L	z	Indexing time (s)	Mapping time (s)	% of reads mapped
2	7	1	5	790	80	97.7
2	7	2	5	794	99	99.9
2	7	3	5	797	122	99.9
2	7	1	7	791	82	97.7
2	7	1	11	794	80	97.7
2	5	1	5	67	11	73.4
4	3	1	5	302	107	97.2
3	5	1	5	849	85	96.8

The default setting is $K = 2$, context size 7, $L = 2$ and $z = 5$ (second row)

Indexing time, on the other hand, is proportional to the product $(2 λ + 1) K$ . It seems that z has little effect on the performance of S-conLSH as the running time and the percentage of reads aligned mostly stay invariant with z. However, the parameter z is important to enhance the sensitivity of the method. This is reflected in Table 7 when studied on simulated reads. The highest number of correct mappings is obtained with the default settings (shown in italic) when $z = 5$ . The zeros in the spaced-seed help to find the distant similarities as it encompasses a larger portion of the sequence while the weight ( $(2 λ + 1) K$ ) of the pattern remains the same. However, a very large value of z may degrade the accuracy as it joins unrelated contexts together. It is evident from Tables 6 and 7 that the performance of S-conLSH remains reasonably good irrespective of the variation of the parameter values. Therefore, it can be concluded that the algorithm S-conLSH is quite robust even though it requires some tuning of different parameters for the best performance.
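The relation between the weight and the span of a pattern can be checked with a short Python sketch. The helper names `weight` and `span` are assumptions for illustration; the two pattern strings are the default patterns quoted in the Results section, and `K` and `lam` are the default values of $K$ and $λ$.

```python
def weight(pattern: str) -> int:
    """Number of match ('1') positions in the pattern."""
    return pattern.count("1")


def span(pattern: str) -> int:
    """Total number of sequence positions covered by the pattern."""
    return len(pattern)


K, lam = 2, 3  # default concatenation factor and context factor

# The two default patterns for L = 2 quoted in the Results section.
patterns = ["11111110011111110000", "111111100000111111100"]

for p in patterns:
    # The weight stays at (2*lam + 1) * K; the zeros only widen the span.
    print(p, "weight =", weight(p), "span =", span(p),
          "expected weight =", (2 * lam + 1) * K)
```

Both patterns have the same weight, so the number of bases that must match is unchanged, while the don't care positions let each pattern cover a longer stretch of the sequence.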

Table 7.

Performance of S-conLSH with change of z for chromosome 1 of H. sapiens-sim dataset consisting of 32,290 reads

Concatenation factor (K)	Context size ($2 λ + 1$)	L	z	Indexing time (s)	Mapping time (s)	# of correct mappings
2	7	2	3	51	37	31,960
2	7	2	5	51	38	31,964
2	7	2	11	48	122	31,963
2	7	2	20	48	122	31,950

The default parameter setting is $z = 5$ (second row)

Discussion and conclusions

S-conLSH is one of the first alignment-free reference genome mapping tools achieving a high level of sensitivity. Earlier, Minimap was designed to map reads against the reference genome without performing an actual base-to-base alignment. However, the low sensitivity of Minimap precluded its applications in real-life domains. Minimap2 is one of the best performing state-of-the-art alignment-based methods which provides an excellent balance of running time and sensitivity. The method described in this article, S-conLSH, has been observed to outperform Minimap2 in respect of sensitivity, precision, and mapping time. However, it has a longer indexing time and a higher memory footprint. Nevertheless, sequence indexing is a one-time affair, and memory is inexpensive nowadays.

The spaced-context in S-conLSH is especially suitable for extracting distant similarities. The variable-length spaced-seeds or patterns add flexibility to the proposed algorithm. Multiple patterns (with higher values of L) increase the sensitivity, but at the cost of more time. Moreover, with the introduction of don’t care positions, the patterns become longer, thus providing better performance in resolving conflicts that occur due to repetitive regions. The provision of rehashing for chimeric read alignment and reverse strand mapping makes S-conLSH well suited for applications in real-life sequence analysis pipelines.

A memory-efficient version of S-conLSH can be developed in the future. The algorithm, at its current stage, cannot determine the optimal selection of the patterns. A study on finding the optimal set of spaced-seeds can be carried out in the future to improve the performance of the algorithm. Though the experiments presented in this article are confined to the noisy long reads of PacBio datasets, they can be further extended to ONT reads as well. Finally, we would like to conclude with a strong expectation that the proposed method S-conLSH will draw the attention of the peers as one of the best performing reference mapping tools designed so far.

Methods

The algorithm S-conLSH for mapping noisy long reads to the reference genome essentially consists of two steps, reference genome indexing and read mapping. The complete workflow of S-conLSH is provided in Fig. 1 and the entire procedure is detailed below.

Reference Genome Indexing

The reference genome is sliced into overlapping windows, and these windows are hashed into hash tables using suitably designed S-conLSH functions (see Definition 5) as shown in Fig. 1. S-conLSH uses two hash tables, ‘h_index’ and ‘Hashtab’. An entry in h_index has two fields (f, n): f stores an offset to the table Hashtab, where sequences are clustered according to their hash values, and n is the total number of sequences hashed at a particular value. Therefore, Hashtab[h_index[x].f] to Hashtab[h_index[x].f + h_index[x].n] are the sequences hashed at value x.

Read mapping

For each noisy long read, S-conLSH utilizes the same hash function for computing the hash values and retrieves sequences of the reference genome that are hashed in the same position as the read. Finally, the locations of the sequences with the highest hits are chained and reported as an alignment-free mapping of the query read (see Fig. 1).
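The two-table layout and the lookup step can be sketched in Python as below. This is a simplified illustration under several assumptions, not the S-conLSH implementation: the hash key is a single spaced-context rather than the S-conLSH function of Definition 5, the final chaining step is replaced by a plain vote over implied start positions, a bucket is taken as the half-open slice `hashtab[f:f+n]`, and all names other than `h_index` and `hashtab` are hypothetical.

```python
from collections import defaultdict


def spaced_context(window: str, pattern: str) -> str:
    """Symbols of window under the '1' positions of pattern."""
    return "".join(b for b, flag in zip(window, pattern) if flag == "1")


def build_index(reference: str, pattern: str, step: int = 1):
    """Hash overlapping reference windows into (h_index, hashtab).

    h_index[key] = (f, n): the bucket for key is hashtab[f:f+n],
    a list of window start positions on the reference.
    """
    buckets = defaultdict(list)
    for pos in range(0, len(reference) - len(pattern) + 1, step):
        key = spaced_context(reference[pos:pos + len(pattern)], pattern)
        buckets[key].append(pos)
    h_index, hashtab = {}, []
    for key, positions in buckets.items():
        h_index[key] = (len(hashtab), len(positions))  # (offset f, count n)
        hashtab.extend(positions)
    return h_index, hashtab


def lookup(h_index, hashtab, key):
    """Reference positions whose window hashed to key (empty if none)."""
    f, n = h_index.get(key, (0, 0))
    return hashtab[f:f + n]


def candidate_positions(read, h_index, hashtab, pattern):
    """Vote for the implied start of the read on the reference.

    A simple stand-in for chaining: every hit (read offset r, reference
    position g) votes for start g - r; candidates are sorted by votes.
    """
    votes = defaultdict(int)
    for rpos in range(len(read) - len(pattern) + 1):
        key = spaced_context(read[rpos:rpos + len(pattern)], pattern)
        for gpos in lookup(h_index, hashtab, key):
            votes[gpos - rpos] += 1
    return sorted(votes.items(), key=lambda kv: -kv[1])


# Toy data for illustration only.
reference = "TTGACCGTAGGCTAACGTTAGC"
pattern = "1101"
h_index, hashtab = build_index(reference, pattern)

noisy_read = "CGAAGGCTAA"  # reference[5:15] with one substituted base
print(candidate_positions(noisy_read, h_index, hashtab, pattern)[0])
```

Storing only an offset and a count per hash value keeps all bucket contents in one contiguous array, which is the design the text describes for h_index and Hashtab.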

By default, S-conLSH provides alignment-free mappings of the SMRT reads to the reference genome. If a base level alignment is required, S-conLSH provides an option (--align 1) to generate alignment in SAM format using ksw library (https://github.com/attractivechaos/klib). Some key aspects of S-conLSH are detailed in the following subsections.

Fig. 1 — A schematic workflow of indexing and mapping using S-conLSH

Context based locality sensitive hashing

Locality Sensitive Hashing [17, 18] is an approximate near-neighbor search algorithm, where the points having a smaller distance in the feature space, will have a higher probability of making a collision. Under this assumption, a query is compared only to the objects having the same hash value, rather than to all the items in the database. This makes the algorithm work in sublinear time. In the definitions below, we use the following notations:

For a string x of length d over some set $Σ$ of symbols and $1 \leq i \leq j \leq d$, x[i] denotes the ith symbol of x, and x[i..j] denotes the (contiguous) substring of x from position i to position j. If $H$ is a finite set of functions defined on some set X, for any $h \in H$, randomly drawn with uniform probability, and $x, y \in X$, $\Pr_{H} [h (x) = h (y)]$ denotes the probability that $h (x) = h (y)$.

The definition of Locality Sensitive Hashing as introduced in [17, 18] is given below:

Definition 1

(Locality Sensitive Hashing) [17, 18] Let (X, D) be a metric space, let $H$ be a family of hash functions mapping X to some set U, and let $R, c, P_{1}, P_{2}$ be real numbers with $c > 1$ and $0 \leq P_{2} < P_{1} \leq 1$. $H$ is said to be $(R, c R, P_{1}, P_{2})$-sensitive if for any $x, y \in X$ and $h \in H$

$\Pr_{H} [h (x) = h (y)] \geq P_{1}$ whenever $D (x, y) \leq R$, and
$\Pr_{H} [h (x) = h (y)] \leq P_{2}$ whenever $D (x, y) \geq c R$.

To illustrate the concept of locality sensitive hashing for DNA sequences, let us consider a finite set $Σ = {A, T, C, G}$ called the alphabet, together with an integer $d > 0$ . Let X be the set of all length-d words over $Σ$ , endowed with the Hamming distance, and let U be the alphabet $Σ$ . For $1 \leq i \leq d$ , let the function $h_{i} : X \to U$ be defined by $h_{i} (x) = x [i]$ , $\forall x \in X$ . Next, let R and cR be real numbers with $c > 1$ and $0 \leq R < c R \leq d$ , and define $P_{1} = \frac{d - R}{d}$ and $P_{2} = \frac{d - c R}{d}$ . Then the set $H = {h_{i} : 1 \leq i \leq d$ } is $(R, c R, P_{1}, P_{2})$ -sensitive. To see this, observe that for any two words $p, q \in X$ , the probability $P r_{H} [h (p) = h (q)]$ is same as the fraction of positions i with $p [i] = q [i]$ . Therefore,

$$\Pr_{H}[h(p) = h(q)] = \frac{d - D(p, q)}{d} \geq \frac{d - R}{d} = P_{1}$$

if $D(p, q) \leq R$, and

$$\Pr_{H}[h(p) = h(q)] = \frac{d - D(p, q)}{d} \leq \frac{d - cR}{d} = P_{2}$$

if $D(p, q) \geq cR$.

Moreover, $P_{1} > P_{2}$ because $cR > R$. This proves that the family of hash functions $H = \{h_{i} : 1 \leq i \leq d\}$ is locality sensitive.
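The argument above can be checked numerically. The following is a minimal Python sketch, not part of the S-conLSH software; the function names `hamming` and `collision_probability` and the second example word are illustrative assumptions. It enumerates every $h_{i}$ in $H$ and confirms that the collision probability equals $(d - D(p, q))/d$.

```python
from fractions import Fraction


def hamming(p: str, q: str) -> int:
    """Hamming distance D(p, q) between two words of equal length."""
    if len(p) != len(q):
        raise ValueError("words must have the same length")
    return sum(a != b for a, b in zip(p, q))


def collision_probability(p: str, q: str) -> Fraction:
    """Pr_H[h(p) = h(q)] for H = {h_i}, where h_i(x) = x[i] and i is uniform."""
    if len(p) != len(q):
        raise ValueError("words must have the same length")
    d = len(p)
    # Each h_i collides exactly when p and q agree at position i.
    matches = sum(p[i] == q[i] for i in range(d))
    return Fraction(matches, d)


# Illustrative words (the second one is a made-up variant of the first).
p, q = "ATTCGGTAA", "ATTCGGTCC"
d = len(p)
assert collision_probability(p, q) == Fraction(d - hamming(p, q), d)
```

For these two words the Hamming distance is 2, so the collision probability is 7/9.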

In biological applications, it is often useful to consider the local context of sequence positions and to consider matching subwords, as shown in conLSH [16]. conLSH groups similar sequences into localized slots of the hash tables by taking the neighborhoods, or contexts, of the data points into account. A context in connection with sequence analysis can be formally defined as:

Definition 2

(Context) Let $x = (x_{1} x_{2} \dots x_{d})$ be a sequence of length d. A context at the i-th position of x, for $i \in \{λ + 1, \dots, d - λ\}$, is a subsequence $x[i - λ \dots i \dots i + λ]$ of length $2λ + 1$, formed by taking $λ$ characters from each of the right and left sides of x[i]. Here, $λ$ is a positive constant termed the context factor.

To define context based locality sensitive hashing, the above example is generalized such that, for a given subword length $(2 λ + 1) < d$ , each hash function in $H$ will map words containing the same length- $(2 λ + 1)$ subwords at some position to the same bucket in U. The subword length $(2 λ + 1)$ is called the context size, where $λ$ is the context factor.
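As an illustration of Definition 2, the context at a position can be extracted as sketched below. This is a minimal Python example under stated assumptions: the function name `context` is hypothetical, positions are 1-based as in the text, and the code is not taken from the conLSH or S-conLSH implementation.

```python
def context(x: str, i: int, lam: int) -> str:
    """Return the context x[i-lam .. i+lam] of length 2*lam + 1.

    `i` is a 1-based position, as in Definition 2, and `lam` is the
    context factor (a positive integer).
    """
    d = len(x)
    if lam < 1:
        raise ValueError("the context factor must be positive")
    if not (lam + 1 <= i <= d - lam):
        raise ValueError("position i must satisfy lam + 1 <= i <= d - lam")
    # Convert the 1-based inclusive range [i - lam, i + lam] to a Python slice.
    return x[i - lam - 1 : i + lam]


# With context factor 1 the context size is 3.
assert context("ATTCGGTAA", 3, 1) == "TTC"
```

A function that maps a word to its context at one fixed position sends all words sharing that subword at that position to the same bucket, which is the behaviour described above.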

Definition 3

(Context based Locality Sensitive Hashing (conLSH)) Let $Σ$ be a set called the alphabet. Let $λ$ and d be integers with $(2 λ + 1) < d$ . Let X be the set of all length-d words over $Σ$ and U be the set of all length- $(2 λ + 1)$ words over $Σ$ . For $R, c R, P_{1}$ , and $P_{2}$ as above, a $(R, c R, P_{1}, P_{2})$ -sensitive family $H$ of functions mapping X to U is called $(R, c R, λ, P_{1}, P_{2})$ -sensitive, if for each $h \in H$ , there are positions $i_{h}$ and $j_{h}$ with $λ + 1 \leq i_{h}, j_{h} \leq d - λ$ such that for all $p, q \in X$ one has $h (p) = h (q)$ whenever

$$p[i_{h} - λ \dots i_{h} \dots i_{h} + λ] = q[j_{h} - λ \dots j_{h} \dots j_{h} + λ]$$

holds.

Gapped read mapping using spaced-context based locality sensitive hashing

The proposed method, S-conLSH, uses spaced seeds, i.e., patterns of 0’s and 1’s, in connection with the S-conLSH function. For a pattern $P$, the spaced-context of a DNA sequence can be defined as:

Definition 4

(Spaced-context) Let $P$ be a binary string or pattern of length $ℓ$, where ‘1’ represents a match position and ‘0’ represents a don’t-care position. Let $ℓ_{w}$ denote the weight of $P$, which is equal to the number of ‘1’s in the pattern. Evidently, $ℓ_{w} \leq ℓ$. Let x be a sequence of length d over the alphabet $\{A, T, G, C\}$ such that $ℓ \leq d$. Then, a string $sw$ over $\{A, T, G, C\}$ of length $ℓ_{w}$ is called a spaced-context of x with respect to $P$ if $sw[i] = x[j]$, where j is the position of the i-th ‘1’ in $P$, so that $i \leq j$, $1 \leq i \leq ℓ_{w}$ and $1 \leq j \leq ℓ$. In other words, $sw$ is obtained by reading off, in order, the symbols of x at the match positions of $P$.

Sequences sharing a similar spaced-context with respect to a pre-defined pattern $P$ are hashed together in S-conLSH.
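Definition 4 can be made concrete with a short Python sketch. The function name `spaced_context` is an illustrative assumption, and the pattern is assumed to be applied at the first position of the sequence; the actual S-conLSH implementation is written in C++ and may differ.

```python
def spaced_context(x: str, pattern: str) -> str:
    """Spaced-context of x with respect to a binary pattern.

    The pattern is aligned with the start of x; symbols of x under a '1'
    are kept, symbols under a '0' (don't-care positions) are skipped.
    """
    if set(pattern) - {"0", "1"}:
        raise ValueError("the pattern must consist of '0' and '1' only")
    if len(pattern) > len(x):
        raise ValueError("the pattern must not be longer than the sequence")
    return "".join(sym for sym, bit in zip(x, pattern) if bit == "1")


sw = spaced_context("ATTCGGTAA", "011100111")
assert sw == "TTCTAA"
# The length of the spaced-context equals the weight of the pattern.
assert len(sw) == "011100111".count("1")
```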

The concept of gap amplification is used in locality sensitive hashing to ensure that dissimilar items are well separated from similar ones. To do this, the gap between the probability values $P_{1}$ and $P_{2}$ needs to be increased. This is achieved by choosing L different hash functions, $g_{1}, g_{2}, \dots, g_{L}$, such that $g_{j}$ is the concatenation of K randomly chosen hash functions from $H$, i.e., $g_{j} = (h_{1, j}, h_{2, j}, \dots, h_{K, j})$, for $1 \leq j \leq L$. This procedure is known as “gap amplification” and K is called the “concatenation factor” [18]. For every hash function $g_{j}$, $1 \leq j \leq L$, there is a pattern $P_{j}$ associated with it. The spaced-context based Locality Sensitive Hashing is now defined as follows:
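The effect of gap amplification can be illustrated with the standard LSH calculation. The sketch below assumes that the K functions in each $g_{j}$ are drawn independently (with replacement) from $H$, so that a single $g_{j}$ collides with probability $P^{K}$ and at least one of the L functions collides with probability $1 - (1 - P^{K})^{L}$. The function name `amplified` and the numeric values are illustrative assumptions and are not parameters reported for S-conLSH.

```python
def amplified(p: float, K: int, L: int) -> float:
    """Probability that at least one of L concatenated functions collides.

    p is the collision probability of a single function from H; each g_j
    concatenates K independently drawn functions (an assumption here).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    if K < 1 or L < 1:
        raise ValueError("K and L must be positive integers")
    return 1.0 - (1.0 - p ** K) ** L


# Illustrative values only.
P1, P2 = 0.9, 0.5
K, L = 8, 4

# Concatenation widens the ratio between similar and dissimilar items:
assert (P1 ** K) / (P2 ** K) > P1 / P2
# Using L tables restores a high collision chance for similar items,
# while dissimilar items remain unlikely to collide.
assert amplified(P1, K, L) > amplified(P2, K, L)
```

Increasing K suppresses collisions of dissimilar items, and increasing L compensates for the similar items that are lost by the stricter concatenated functions.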

Definition 5

(Spaced-context based Locality Sensitive Hashing (S-conLSH)) Let $sw_{j}(x)$ be the spaced-context of sequence x with respect to the binary pattern $P_{j}$ of length $ℓ$, $1 \leq j \leq L$. Let $P_{j}$ be defined by the regular expression $(0^{*} 1^{(2λ + 1)})^{K} 0^{*}$. Therefore, the weight of $P_{j}$ is $ℓ_{w} = (2λ + 1)K$. The maximum value of $ℓ$ is $(2λ + 1)K + z(K + 1)$, assuming that at most z zeros are present between two successive contexts of 1’s in $P_{j}$, where $z \geq 0$ is an integer parameter. Let d be an integer with $ℓ \leq d$, X be the set of all length-d words over $Σ$ and U be the set of all length-$ℓ_{w}$ words over $Σ$. For $R, cR, P_{1}$, $P_{2}$, and $λ$ as introduced in Definition 3, a $(R, cR, λ, P_{1}, P_{2})$-sensitive hash function $g_{j} = (h_{1, j}, h_{2, j}, \dots, h_{K, j})$, where $h_{i, j} \in H, 1 \leq i \leq K$, mapping X to U is called $(R, cR, λ, z, P_{1}, P_{2})$-sensitive if for any $p, q \in X$ one has $g_{j}(p) = g_{j}(q)$ whenever $sw_{j}(p) = sw_{j}(q)$ holds with respect to the pattern $P_{j}$.
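The pattern structure in Definition 5 can be checked mechanically. The Python sketch below is an illustration only: the helper names `is_valid_pattern` and `max_pattern_length` are hypothetical, and the bound of at most z zeros is assumed to apply to every run of zeros, including the leading and trailing ones, which is what the length bound $(2λ + 1)K + z(K + 1)$ suggests.

```python
import re


def is_valid_pattern(pattern: str, lam: int, K: int, z: int) -> bool:
    """Check that a pattern consists of K blocks of (2*lam + 1) ones,
    with at most z zeros before, between and after the blocks."""
    block = "1" * (2 * lam + 1)
    # (0{0,z} 1^(2*lam+1)){K} 0{0,z}, written as a Python regular expression.
    regex = "(?:0{0,%d}%s){%d}0{0,%d}" % (z, block, K, z)
    return re.fullmatch(regex, pattern) is not None


def max_pattern_length(lam: int, K: int, z: int) -> int:
    """Upper bound on the pattern length: (2*lam + 1)*K + z*(K + 1)."""
    return (2 * lam + 1) * K + z * (K + 1)


# Context size 3 (lam = 1), K = 2 and z = 2.
assert is_valid_pattern("011100111", lam=1, K=2, z=2)
assert is_valid_pattern("111111", lam=1, K=2, z=2)
assert not is_valid_pattern("0001110111", lam=1, K=2, z=2)  # three leading zeros
assert max_pattern_length(lam=1, K=2, z=2) == 12
```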

Therefore, instead of restricting similarity to $(2λ + 1)K$ consecutive bases, as was done for conLSH [16], S-conLSH incorporates greater flexibility by checking only the positions which correspond to a 1 in the pattern. For example, the binary string “011100111” is a pattern for a system having $K = 2$, $z = 2$ and context size $(2λ + 1) = 3$. The hash value, or spaced-context, of the string “ATTCGGTAA” for the above pattern is “TTCTAA” (see Fig. 2b). In S-conLSH, noisy long reads are hashed using L functions corresponding to L different patterns generated using Algorithm 1. Multiple pattern based functions enable gapped mapping of the reads, as illustrated in Fig. 2. Consider a scenario of two patterns $P_{1} =$ “011100111” and $P_{2} =$ “111111” having context size $= 3$, $L = 2$ and $K = 2$ (here $P_{1}$ and $P_{2}$ denote patterns, not the collision probabilities of Definition 1). The string $p =$ “ATTCGGTAA” generates two hash values, $sw_{1}(p) =$ “TTCTAA” and $sw_{2}(p) =$ “ATTCGG”, for the patterns $P_{1}$ and $P_{2}$ respectively (see Fig. 2b). Similarly, $sw_{1}(q) =$ “TCTGTA” and $sw_{2}(q) =$ “TTCTAA” are the hash values for the string $q =$ “TTCTAAGTA” (Fig. 2c). As shown in the hash table of Fig. 2d, the two strings collide in the same bucket of the hash table due to the common hash value “TTCTAA”. This results in a mapping with three gaps or indels in the second string, corresponding to the three 0’s of “011100111”. This gapped mapping is a powerful feature of S-conLSH which is quite uncommon in standard spaced-seed based methods (see Additional file 1: Note 3 for details).

Fig. 2 — A schematic illustration of gapped-mapping using S-conLSH. a Multiple patterns having context size $= 3$ and $K = 2$ . b, c Hashing of the strings “ATTCGGTAA” and “TTCTAAGTA” respectively using different patterns. d Final hash table and gapped-mapping of the two strings due to the collision at “TTCTAA”
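The collision illustrated in Fig. 2 can be reproduced with a few lines of Python. This is a sketch under stated assumptions, not the S-conLSH implementation: the helper `spaced_context` is a hypothetical name, patterns are applied at the first position of each string, and a single hash table shared by all patterns is assumed, as drawn in Fig. 2d.

```python
from collections import defaultdict


def spaced_context(x: str, pattern: str) -> str:
    """Keep the symbols of x that lie under a '1' of the pattern."""
    return "".join(sym for sym, bit in zip(x, pattern) if bit == "1")


patterns = ["011100111", "111111"]            # the two patterns of the example
strings = {"p": "ATTCGGTAA", "q": "TTCTAAGTA"}

# Hash table: spaced-context -> list of (string name, pattern index).
table = defaultdict(list)
for name, seq in strings.items():
    for j, pattern in enumerate(patterns, start=1):
        table[spaced_context(seq, pattern)].append((name, j))

# Buckets that contain entries from more than one string are collisions.
collisions = {
    key: entries
    for key, entries in table.items()
    if len({name for name, _ in entries}) > 1
}
assert collisions == {"TTCTAA": [("p", 1), ("q", 2)]}
```

The only shared bucket is “TTCTAA”, reached by p through the first pattern and by q through the second, which is the gapped mapping described above.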

To obtain an integer hash value from the spaced-context, an encoding function $f : S \to \{0, 1, \dots, 4^{K(2λ + 1)} - 1\}$ is defined by $f(sw) = \sum_{i = 1}^{(2λ + 1)K} f(sw[i]) \times 4^{(2λ + 1)K - i}$, $\forall sw \in S$, with $f(A) = 0$, $f(C) = 1$, $f(G) = 2$ and $f(T) = 3$, where S is the set of all spaced-contexts of length $(2λ + 1)K$ over the alphabet $\{A, T, C, G\}$.

A pattern produces hash values of length equal to its weight. In S-conLSH, the pattern length is increased while keeping the weight the same, by introducing don’t-care positions (zeros). This allows S-conLSH to look at a larger portion of the sequences without increasing the computational overhead. Consequently, S-conLSH is able to find distant homologs that might otherwise be overlooked. In addition, it provides better sensitivity in resolving repeats, because the neighborhoods (contexts) are taken into account when measuring the similarity between the sequences.

S-conLSH also provides split mapping for chimeric reads. If a read cannot be mapped end-to-end, it is split into a series of non-overlapping segments, which are re-hashed to find target location(s) for each segment.
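The encoding function f described above reads a spaced-context as a base-4 number. A minimal Python sketch follows; the name `encode` and the use of Horner's scheme are illustrative choices, not taken from the S-conLSH source code.

```python
# Symbol codes as given in the text: f(A) = 0, f(C) = 1, f(G) = 2, f(T) = 3.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(sw: str) -> int:
    """Integer hash value of a spaced-context, read as a base-4 number.

    Horner's scheme gives the same value as the sum of
    f(sw[i]) * 4**(n - i) over i = 1..n, where n = len(sw).
    """
    value = 0
    for symbol in sw:
        value = value * 4 + CODE[symbol]
    return value


# For a spaced-context of length 6 the values range over 0 .. 4**6 - 1.
assert encode("AAAAAA") == 0
assert encode("TTTTTT") == 4 ** 6 - 1
assert encode("TTCTAA") == 3952
```

For “TTCTAA”, the sum is $3 \cdot 4^{5} + 3 \cdot 4^{4} + 1 \cdot 4^{3} + 3 \cdot 4^{2} + 0 + 0 = 3952$.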

Supplementary Information

12859_2020_3918_MOESM1_ESM.pdf (415.8 KB, PDF)

Additional file 1. Supplementary material which contains additional results and instruction manual of the software proposed in the main article.

Acknowledgements

Not applicable.

Abbreviations

SMRT: Single molecule real time
S-conLSH: Spaced-context based locality sensitive hashing
LSH: Locality sensitive hashing
conLSH: Context based locality sensitive hashing
NGS: Next generation sequencing
DNA: Deoxyribonucleic acid
PAF: Pairwise read mapping format
SAM: Sequence alignment/map
PacBio: Pacific biosciences
ONT: Oxford nanopore technologies

Authors’ contributions

AC conceived the idea, developed the software and designed the experiment. AC, BM and SB evaluated the results and wrote the manuscript. All authors read and approved the final manuscript.

Funding

We acknowledge support by the Open Access Publication Funds of the Göttingen University. SB acknowledges the grant from the J. C. Bose Fellowship (SB/S1/JCB-033/2016) awarded by the Dept. of Sci. and Tech., Govt. of India. The funding body did not play any role in the design of the methodology, creation of the algorithms, analysis, and interpretation of data, or in writing the manuscript.

Availability of data and materials

The datasets used in the experiment are publicly available with the accession links mentioned in Table 1.

Software availability

Project Name: S-conLSH. https://github.com/anganachakraborty/S-conLSH-2.0.git. Operating System: LINUX/WINDOWS. Programming Language: C++. License: GNU GENERAL PUBLIC LICENSE.

Compliance with ethical standards

Ethics approval and consent to participate

No ethics approval was required for the study.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Footnotes

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Angana Chakraborty, Email: angana_r@isical.ac.in.

Burkhard Morgenstern, Email: bmorgen@gwdg.de.

Sanghamitra Bandyopadhyay, Email: sanghami@isical.ac.in.

Supplementary information

The online version contains supplementary material available at 10.1186/s12859-020-03918-3.

References

1. Rhoads A, Au KF. PacBio sequencing and its applications. Genom Proteom Bioinform. 2015;13(5):278–289. doi: 10.1016/j.gpb.2015.08.002.
2. Loman NJ, Quick J, Simpson JT. A complete bacterial genome assembled de novo using only nanopore sequencing data. Nat Methods. 2015;12(8):733. doi: 10.1038/nmeth.3444.
3. Ardui S, Ameur A, Vermeesch JR, Hestand MS. Single molecule real-time (SMRT) sequencing comes of age: applications and utilities for medical diagnostics. Nucleic Acids Res. 2018;46(5):2159–2168. doi: 10.1093/nar/gky066.
4. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997 (2013).
5. Chaisson MJ, Tesler G. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinform. 2012;13(1):238. doi: 10.1186/1471-2105-13-238.
6. Liu B, Guan D, Teng M, Wang Y. rHAT: fast alignment of noisy long reads with regional hashing. Bioinformatics. 2015;32(11):1625–1631. doi: 10.1093/bioinformatics/btv662.
7. Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34(18):3094–3100. doi: 10.1093/bioinformatics/bty191.
8. Haghshenas E, Sahinalp SC, Hach F. lordFAST: sensitive and fast alignment search tool for long noisy read sequencing data. Bioinformatics. 2018;35(1):20–27. doi: 10.1093/bioinformatics/bty544.
9. Simpson JT, Durbin R. Efficient construction of an assembly string graph using the FM-index. Bioinformatics. 2010;26(12):367–373. doi: 10.1093/bioinformatics/btq217.
10. Marçais G, Delcher AL, Phillippy AM, Coston R, Salzberg SL, Zimin A. MUMmer4: a fast and versatile genome alignment system. PLoS Comput Biol. 2018;14(1):1005944. doi: 10.1371/journal.pcbi.1005944.
11. Zielezinski A, Vinga S, Almeida J, Karlowski WM. Alignment-free sequence comparison: benefits, applications, and tools. Genome Biol. 2017;18(1):186. doi: 10.1186/s13059-017-1319-7.
12. Ren J, Bai X, Lu YY, Tang K, Wang Y, Reinert G, Sun F. Alignment-free sequence analysis and applications. Annu Rev Biomed Data Sci. 2018;1:93–114. doi: 10.1146/annurev-biodatasci-080917-013431.
13. Bernard G, Chan CX, Chan Y-B, Chua X-Y, Cong Y, Hogan JM, Maetschke SR, Ragan MA. Alignment-free inference of hierarchical and reticulate phylogenomic relationships. Brief Bioinform. 2019;22:426–435. doi: 10.1093/bib/bbx067.
14. Zielezinski A, Girgis HZ, Bernard G, Leimeister C-A, Tang K, Dencker T, Lau AK, Röhling S, Choi J, Waterman MS, Comin M, Kim S-H, Vinga S, Almeida JS, Chan CX, James B, Sun F, Morgenstern B, Karlowski WM. Benchmarking of alignment-free sequence comparison methods. Genome Biol. 2019;20:144. doi: 10.1186/s13059-019-1755-7.
15. Li H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics. 2016;32(14):2103–2110. doi: 10.1093/bioinformatics/btw152.
16. Chakraborty A, Bandyopadhyay S. conLSH: context based locality sensitive hashing for mapping of noisy SMRT reads. Comput Biol Chem. 2020;85:107206. doi: 10.1016/j.compbiolchem.2020.107206.
17. Indyk P, Motwani R. Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the thirtieth annual ACM symposium on theory of computing. ACM; 1998. p. 604–13.
18. Andoni A, Indyk P. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In: Communications of the ACM—50th anniversary issue. ACM; 2008. p. 117–22.
19. Buhler J. Efficient large-scale sequence comparison by locality-sensitive hashing. Bioinformatics. 2001;17(5):419–428. doi: 10.1093/bioinformatics/17.5.419.
20. Chakraborty A, Bandyopadhyay S. Ultrafast genomic database search using layered locality sensitive hashing. In: Fifth international conference on emerging applications of information technology. IEEE; 2018. p. 1–4.
21. Berlin K, Koren S, Chin C-S, Drake JP, Landolin JM, Phillippy AM. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat Biotechnol. 2015;33(6):623–630. doi: 10.1038/nbt.3238.
22. Ma B, Tromp J, Li M. Patternhunter: faster and more sensitive homology search. Bioinformatics. 2002;18(3):440–445. doi: 10.1093/bioinformatics/18.3.440.
23. Li M, Ma B, Kisman D, Tromp J. PatternHunter II: Highly sensitive and fast homology search. Genome Inform. 2003;14:164–175.
24. Ilie L, Ilie S, Mansouri Bigvand A. Speed: fast computation of sensitive spaced seeds. Bioinformatics. 2011;27(17):2433–2434. doi: 10.1093/bioinformatics/btr368.
25. Hahn L, Leimeister C-A, Ounit R, Lonardi S, Morgenstern B. rasbhari: optimizing spaced seeds for database searching, read mapping and alignment-free sequence comparison. PLoS Comput Biol. 2016;12:1005107. doi: 10.1371/journal.pcbi.1005107.
26. Leimeister C-A, Boden M, Horwege S, Lindner S, Morgenstern B. Fast alignment-free sequence comparison using spaced-word frequencies. Bioinformatics. 2014;30(14):1991–1999. doi: 10.1093/bioinformatics/btu177.
27. Morgenstern B, Zhu B, Horwege S, Leimeister C-A. Estimating evolutionary distances between genomic sequences from spaced-word matches. Algorithms Mol Biol. 2015;10:5. doi: 10.1186/s13015-015-0032-x.
28. Leimeister C-A, Sohrabi-Jahromi S, Morgenstern B. Fast and accurate phylogeny reconstruction using filtered spaced-word matches. Bioinformatics. 2017;33:971–979. doi: 10.1093/bioinformatics/btw776.
29. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R. The sequence alignment/map format and samtools. Bioinformatics. 2009;25(16):2078–2079. doi: 10.1093/bioinformatics/btp352.
30. Ono Y, Asai K, Hamada M. Pbsim: Pacbio reads simulator-toward accurate genome assembly. Bioinformatics. 2012;29(1):119–121. doi: 10.1093/bioinformatics/bts649.
