Author manuscript; available in PMC: 2024 Aug 16.
Published in final edited form as: Leibniz Int Proc Inform. 2022 Jul 11;233:16. doi: 10.4230/LIPIcs.SEA.2022.16

RLBWT Tricks

Nathaniel K Brown 1, Travis Gagie 2, Massimiliano Rossi 3
PMCID: PMC11327918  NIHMSID: NIHMS2015461  PMID: 39157646

Abstract

Until recently, most experts would probably have agreed we cannot backwards-step in constant time with a run-length compressed Burrows-Wheeler Transform (RLBWT), since doing so relies on rank queries on sparse bitvectors and those inherit lower bounds from predecessor queries. At ICALP '21, however, Nishimoto and Tabei described a new, simple and constant-time implementation. For a permutation $\pi$, it stores an $O(r)$-space table, where $r$ is the number of positions $i$ where either $i = 0$ or $\pi(i+1) \neq \pi(i) + 1$, that enables the computation of successive values of $\pi(i)$ by table look-ups and linear scans. Nishimoto and Tabei showed how to increase the number of rows in the table to bound the length of the linear scans such that the query time for computing $\pi(i)$ is constant while maintaining $O(r)$-space.

In this paper we refine Nishimoto and Tabei’s approach, including a time-space tradeoff, and experimentally evaluate different implementations demonstrating the practicality of part of their result. We show that even without adding rows to the table, in practice we almost always scan only a few entries during queries. We propose a decomposition scheme of the permutation π corresponding to the LF-mapping that allows an improved compression of the data structure, while limiting the query time. We tested our implementation on real-world genomic datasets and found that without compression of the table, backward-stepping is drastically faster than with sparse bitvector implementations but, unfortunately, also uses drastically more space. After compression, backward-stepping is competitive both in time and space with the best existing implementations.

Keywords and phrases: Compressed String Indexes, Repetitive Text Collections, Burrows-Wheeler Transform

2012 ACM Subject Classification: Theory of computation → Data compression

1. Introduction

The FM-index [5] is the basis for key tools in computational genomics, such as the popular short-read aligners BWA [12] and Bowtie [11], and is probably now the most important application of the Burrows-Wheeler Transform (BWT) [4]. As genomic databases have grown and researchers and clinicians have realized that using only one or a few reference genomes biases their results and diagnoses, interest in computational pan-genomics has surged and versions of the FM-index based on the run-length compressed BWT (RLBWT) [13] have been developed that can index thousands of genomes in reasonable space [6, 10, 16]. Those versions have all relied heavily on compressed sparse bitvectors, however, which are inherently slower than the bitvectors used in regular FM-indexes (see [14] for details). Experts would probably have guessed that sparse bitvectors were an essential component for RLBWT-based pan-genomic indexes — until Nishimoto and Tabei [15] recently showed how to replace them with theoretically more efficient alternatives.

In particular, Nishimoto and Tabei’s result gives an approach which achieves constant time LF-mapping in $O(r)$-space [15]. Speeding up LF can reduce the time for basic queries over the RLBWT and other applications. For example, Ahmed et al.’s SPUMONI [1] tool allows rapid targeted nanopore sequencing over compressed pan-genome indexes using approximate matching statistics; “nontarget” DNA molecules are ejected from the sequencer with an emphasis on speed. Their method depends on LF-mapping to extend matches, and otherwise “jumping” forwards or backwards in the BWT based on threshold computation. Thresholds over the BWT is a rather new approach, introduced by Bannai et al. in 2020 [2], suggesting further improvements may be developed; however, avoiding the lower bounds inherited from predecessor queries from rank on sparse bitvectors (footnote 1) is a more surprising result. For tools that heavily depend on LF, experiments showing practical results provide an opportunity for speed improvements that otherwise would not have been expected to be attainable.

In this paper we focus on the first part of Nishimoto and Tabei’s result: we demonstrate experimentally that we can reduce the time for basic queries on an RLBWT by replacing queries on sparse bitvectors by table lookups, sequential scans, and queries on relatively short uncompressed bitvectors. We implement LF-mapping over the RLBWT using table lookup; preliminary results showed this could be made practical even without theoretical worst case time guarantees. Although their result also applies to the $\phi$ function over the RLBWT [15], we focus on LF since it allows backward-stepping (performed before locating, which requires $\phi$) and the table seems more compressible for LF; we leverage the unique structure of LF to partition columns of the table into non-decreasing subsequences.

With this motivation, we present various techniques and optimizations towards a practical implementation. To demonstrate its practicality, we use real-world genomic datasets to perform count queries using haplotypes of chromosome 19 and SARS-CoV2 genomes. We find that our implementations are competitive in time/space with the best existing methods: in the average case without row insertions, and exploring a run splitting approach to loosely bound sequential scanning in the worst case. Further analysis shows in practice, sequential scans are quite rare, but can become more common as n/r grows, motivating our run splitting and further approaches.

The rest of this paper is laid out as follows: in Section 2 we present the two parts of Nishimoto and Tabei’s result and explain how they relate to RLBWT-based pan-genomic indexes; Section 3 describes methods used to make the result practical for implementation; in Section 4 we present our experimental results; and in Section 5 we analyse its practicality and summarize findings.

2. Nishimoto and Tabei’s Result

Suppose we want to compactly store a permutation $\pi$ on $\{0, \ldots, n-1\}$ such that we can evaluate $\pi(i)$ quickly when given $i$. If $\pi$ is chosen arbitrarily then $\Theta(n)$ space is necessary to store it in the worst case, and sufficient to allow constant-time evaluation. If the sequence $\pi(0), \pi(1), \pi(2), \ldots, \pi(n-1)$ consists of a relatively small number $b$ of unbroken incrementing subsequences, however (meaning $\pi(i+1) = \pi(i) + 1$ whenever $\pi(i)$ and $\pi(i+1)$ are in the same subsequence), then we can store $\pi$ in $O(b)$ space and evaluate it in $O(\log \log n)$ time. To do this, we simply store in an $O(b)$-space predecessor data structure with $O(\log \log n)$ query time (such as a compressed sparse bitvector) each value $i$ such that $\pi(i)$ is the head of one of those subsequences, with $\pi(i)$ as satellite data; we evaluate any $\pi(i)$ in $O(\log \log n)$ time as

$$\pi(i) = \pi(\mathrm{pred}(i)) + i - \mathrm{pred}(i).$$
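The predecessor-based evaluation can be sketched in a few lines. This is only an illustration: the function names `build_heads` and `evaluate` are ours, and a binary search over a sorted Python list stands in for the $O(\log \log n)$-time predecessor structure (so a query here costs $O(\log b)$ rather than $O(\log \log n)$).

```python
import bisect

def build_heads(pi):
    """Return (heads, images): the sorted subsequence heads and pi at each head."""
    heads = [i for i in range(len(pi)) if i == 0 or pi[i] != pi[i - 1] + 1]
    return heads, [pi[h] for h in heads]

def evaluate(heads, images, i):
    """Compute pi(i) = pi(pred(i)) + i - pred(i)."""
    # Binary search is a stand-in for the predecessor data structure.
    k = bisect.bisect_right(heads, i) - 1
    return images[k] + i - heads[k]

pi = [4, 5, 6, 0, 1, 7, 2, 3]          # four incrementing subsequences
heads, images = build_heads(pi)
recovered = [evaluate(heads, images, i) for i in range(len(pi))]
```

Only the heads and their images are stored, so the space is proportional to $b$ rather than $n$.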

Nishimoto and Tabei first proposed a simple alternative $O(b)$-space representation (footnote 2): we store a sorted table in which, for each subsequence head $p$, there is a quadruple: $p$; the length of the subsequence starting with $p$; $\pi(p)$; and the index of the subsequence containing $\pi(p)$.

If we know the index of the subsequence containing $i$ then we can look up the quadruple for that subsequence and find both its head $p$ and $\pi(p)$, then compute $\pi(i) = \pi(p) + i - p$ in constant time. If we want to compute $\pi^2(i)$ the same way, however, we should compute the index of the subsequence containing $\pi(i)$, since $\pi(i)$ may be in a later subsequence than $\pi(p)$. To do this, we look up the quadruple for the subsequence containing $\pi(p)$ (which takes constant time since we have its index) and find its head and length, from which we can tell if $\pi(i)$ is in the subsequence. If it is not, we continue reading and checking the quadruples for the following subsequences (which takes constant time for each one, since they are next in the table) until we find the one that does contain $\pi(i)$.
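A minimal sketch of the quadruple table and of one evaluation step with its sequential scan follows. The names `build_table` and `step` are illustrative, and the table is built naively (quadratic time) purely for readability.

```python
def build_table(pi):
    """One row per subsequence: (head p, length, pi(p), index of the row containing pi(p))."""
    n = len(pi)
    heads = [i for i in range(n) if i == 0 or pi[i] != pi[i - 1] + 1]
    bounds = heads + [n]
    rows = []
    for k, p in enumerate(heads):
        # The row containing pi(p) is the one with the largest head <= pi(p).
        dest = max(j for j, h in enumerate(heads) if h <= pi[p])
        rows.append((p, bounds[k + 1] - p, pi[p], dest))
    return rows

def step(rows, i, k):
    """Given i and the index k of its row, return (pi(i), row of pi(i), rows scanned)."""
    p, _, image, dest = rows[k]
    target = image + i - p
    scanned = 0
    # Scan forward until the row [head, head + length) contains the target.
    while target >= rows[dest][0] + rows[dest][1]:
        dest += 1
        scanned += 1
    return target, dest, scanned

pi = [4, 5, 6, 0, 1, 7, 2, 3]
rows = build_table(pi)
```

Because `step` returns the row index of $\pi(i)$, the result can be fed straight back in to compute $\pi^2(i)$ without any predecessor query.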

Sequentially scanning the table to find the quadruple for the subsequence containing $\pi(i)$ could take $\Omega(b)$ time in the worst case, so Nishimoto and Tabei then proved the following result, which implies we can artificially divide some of the subsequences before building the table, such that all the sequential scans are short. We still find their proof surprising, so we have included a summary of it below which introduces our parameter $d$. This refinement of the original theorem allows for a time/space tradeoff.

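Theorem 1 (Nishimoto and Tabei [15]). Let $\pi$ be a permutation on $\{0, \ldots, n-1\}$,

$$P = \{0\} \cup \{i : 0 < i \le n-1,\ \pi(i) \neq \pi(i-1) + 1\},$$

and $Q = \{\pi(i) : i \in P\}$. For any integer $d \ge 2$, we can construct $P'$ with $P \subseteq P' \subseteq \{0, \ldots, n-1\}$ and $Q' = \{\pi(i) : i \in P'\}$ such that

  • if $q, q' \in Q'$ and $q$ is the predecessor of $q'$ in $Q'$, then $|[q, q') \cap P'| < 2d$,

  • $|P'| \le \frac{d|P|}{d-1}$.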

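Proof. We start by setting $P_0 = P$ and $Q_0 = Q$. Suppose at some point we have $P_i$ and $Q_i = \{\pi(i) : i \in P_i\}$. If there do not exist $q, q' \in Q_i$ such that $q$ is the predecessor of $q'$ in $Q_i$ and $|[q, q') \cap P_i| \ge 2d$, then we stop and return $P' = P_i$ and $Q' = Q_i$; otherwise, we choose some such $q$ and $q'$.

We choose the $(d+1)$st largest element $p$ in $[q, q') \cap P_i$ and set $P_{i+1} = P_i \cup \{\pi^{-1}(p)\}$ and $Q_{i+1} = Q_i \cup \{p\} = \{\pi(i) : i \in P_{i+1}\}$. Since $q < p < q'$ we have $p \notin Q_i$ and so $\pi^{-1}(p) \notin P_i$. Therefore, $|P_{i+1}| = |P_i| + 1$ and so, by induction, $|P_{i+1}| = |P| + i + 1$.

Let $E_i$ be the set of intervals $[u, u')$ such that $u, u' \in Q_i$ and $u$ is the predecessor of $u'$ in $Q_i$ and $|[u, u') \cap P_i| \ge d$, and let $E_{i+1}$ be the set of intervals $[u, u')$ such that $u, u' \in Q_{i+1}$ and $u$ is the predecessor of $u'$ in $Q_{i+1}$ and $|[u, u') \cap P_{i+1}| \ge d$. Since $E_{i+1} = (E_i \setminus \{[q, q')\}) \cup \{[q, p), [p, q')\}$, we have $|E_{i+1}| = |E_i| + 1$ and, by induction, $|E_{i+1}| \ge i + 1$.

Since the intervals in $E_{i+1}$ are disjoint and each contain at least $d$ elements of $P_{i+1}$, we have $|P_{i+1}| \ge d|E_{i+1}| \ge d(i+1)$. Since $|P_{i+1}| = |P| + i + 1$ and $|P_{i+1}| \ge d(i+1)$, we have $|P| + i + 1 \ge d(i+1)$ and thus $i + 1 \le \frac{|P|}{d-1}$ and $|P_{i+1}| = |P| + i + 1 \le \frac{d|P|}{d-1}$. It follows that we find $P'$ and $Q'$ after at most $\frac{|P|}{d-1}$ steps.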
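The construction in the proof can be sketched as a short loop. The code below is an illustration under stated assumptions, not Nishimoto and Tabei's implementation: the names `subsequence_heads` and `balance` are ours; we pick the $(d+1)$st element of $[q, q') \cap P_i$ in increasing order, so that both halves of a split interval keep at least $d$ elements of $P_i$ as the counting argument requires; and we treat $n$ as a sentinel closing the last interval. Intervals are rescanned from scratch after every split, which is simple but not efficient.

```python
import bisect

def subsequence_heads(pi):
    """P: position 0 plus every i with pi(i) != pi(i-1) + 1."""
    return [i for i in range(len(pi)) if i == 0 or pi[i] != pi[i - 1] + 1]

def balance(pi, d):
    """Return sorted (P', Q') with fewer than 2d elements of P' per interval of Q'."""
    n = len(pi)
    inv = [0] * n
    for i, v in enumerate(pi):
        inv[v] = i
    P = subsequence_heads(pi)
    Q = sorted(pi[i] for i in P)
    while True:
        split = None
        bounds = Q + [n]                      # assumption: n closes the last interval
        for q, q_next in zip(bounds, bounds[1:]):
            lo = bisect.bisect_left(P, q)
            hi = bisect.bisect_left(P, q_next)
            if hi - lo >= 2 * d:              # too many heads map into [q, q_next)
                split = P[lo + d]             # (d+1)st element in increasing order
                break
        if split is None:
            return P, Q
        bisect.insort(P, inv[split])          # new head pi^{-1}(p)
        bisect.insort(Q, split)               # new interval boundary p

# Twelve single-position subsequences all map into one long run.
pi = [13, 12, 15, 14, 17, 16, 19, 18, 21, 20, 23, 22] + list(range(12))
P2, Q2 = balance(pi, 3)
```

A larger $d$ adds fewer rows to the table but allows longer scans, which is the time/space tradeoff mentioned above.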

To discuss how Theorem 1 relates to RLBWTs, we first recall the definitions of the suffix array (SA), the BWT, the LF mapping and $\phi$ for a text $T[0..n-1]$:

  • $\mathrm{SA}[i]$ is the starting position of the lexicographically $i$th suffix of $T$;

  • $\mathrm{BWT}[i]$ is the character immediately preceding that suffix;

  • $\mathrm{LF}(i)$ is the position of $\mathrm{SA}[i] - 1$ in SA;

  • $\phi(i)$ is the value that precedes $i$ in SA.

Let $T$ be defined over an alphabet $\Sigma$ of size $\sigma$. For convenience we assume $T$ ends with a special symbol $T[n-1] = \$$ that occurs nowhere else, we consider strings and arrays as cyclic and we work modulo $n$.
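These definitions can be made concrete with a naive construction. The sketch below sorts suffixes directly, so it is only suitable for small examples; the function name `bwt_lf_phi` is ours, and the text is assumed to end with a unique `$` that is smaller than every other character.

```python
def bwt_lf_phi(text):
    """Return (SA, BWT, LF, phi) for a text ending with a unique '$'."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])     # naive suffix sorting
    isa = [0] * n
    for rank, pos in enumerate(sa):
        isa[pos] = rank
    bwt = "".join(text[(p - 1) % n] for p in sa)      # character preceding each suffix
    lf = [isa[(p - 1) % n] for p in sa]               # position of SA[i] - 1 in SA
    phi = [0] * n
    for k in range(n):
        phi[sa[k]] = sa[(k - 1) % n]                  # value preceding SA[k] in SA (cyclic)
    return sa, bwt, lf, phi

text = "GATTAGATACAT$"
sa, bwt, lf, phi = bwt_lf_phi(text)
runs = sum(1 for i in range(len(bwt)) if i == 0 or bwt[i] != bwt[i - 1])
```

Here `runs` is the quantity $r$ used throughout the rest of the paper.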

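It is not difficult to see that LF and $\phi$ (and thus also $\phi^{-1}$) are permutations that can be divided into at most $r$ unbroken incrementing subsequences, where $r$ is the number of runs in the BWT (footnote 3). First, if $\mathrm{BWT}[i] = \mathrm{BWT}[i+1]$ then $\mathrm{LF}(i+1) = \mathrm{LF}(i) + 1$, so there are at most $r$ values for which $\mathrm{LF}(i+1) \neq \mathrm{LF}(i) + 1$. Second, if $\mathrm{BWT}[i] = \mathrm{BWT}[i+1]$ so $\mathrm{LF}(i+1) = \mathrm{LF}(i) + 1$ then

$$\mathrm{SA}[\mathrm{LF}(i)] = \phi(\mathrm{SA}[\mathrm{LF}(i+1)])$$

and, as illustrated in Figure 1,

$$\phi(\mathrm{SA}[i+1]) = \mathrm{SA}[i] = \mathrm{SA}[\mathrm{LF}(i)] + 1 = \phi(\mathrm{SA}[\mathrm{LF}(i+1)]) + 1 = \phi(\mathrm{SA}[i+1] - 1) + 1$$

or, choosing $i' = \mathrm{SA}[i+1] - 1$, we have $\phi(i'+1) = \phi(i') + 1$. It follows that there are at most $r$ values for which $\phi(i+1) \neq \phi(i) + 1$. Nishimoto and Tabei’s result therefore gives us $O(r)$-space data structures supporting LF, $\phi$ and $\phi^{-1}$ in constant time.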

Figure 1.

An illustration of why $\mathrm{BWT}[i] = \mathrm{BWT}[i+1]$ implies $\phi(\mathrm{SA}[i+1]) = \phi(\mathrm{SA}[i+1] - 1) + 1$.

As a practical aside we note that, although applying Theorem 1 means we store quadruples for sub-runs in the BWT, we can store with them the indexes of the maximal runs containing them and thus, for example, store SA samples in an r-index only at the boundaries of maximal runs and not sub-runs.

The queries needed for most RLBWT-based pan-genomic indexes (footnote 4) can be implemented using LF, $\phi$, $\phi^{-1}$ and access, rank and select queries on the string $R[0..r-1]$ in which $R[i]$ is the distinct character appearing in the $i$th run in BWT, which can be supported with a wavelet tree on $R$. Of course that wavelet tree uses bitvectors, but even with uncompressed bitvectors it takes only $r \lg \sigma + o(r \log \sigma)$ bits, where $\sigma$ is the size of the alphabet (usually 4 for genomics and pan-genomics), and supports those queries in $O(\log \sigma)$ time (or constant time when $\sigma = \log^{O(1)} n$).

3. Practical Approach

To provide a practical implementation of Nishimoto and Tabei’s first result, we slightly modify the structure of the table. Consider the permutation to be $\mathrm{LF}(i)$ over the BWT, with runs being unary substrings of the BWT. In Section 2 we presented the quadruples using absolute indexes over the permutation, but we can instead perform access using the run index itself: let positions of run heads in the BWT be the array $I[0..r-1]$ storing the sorted values $i$ such that $i = 0$ or $\mathrm{LF}(i-1) \neq \mathrm{LF}(i) - 1$. For all $k \in \{0, 1, \ldots, |I| - 1\}$ we store a triple containing: the length of the run, i.e. $I[k+1] - I[k]$, where $I[r] = n$; the index of the run containing $\mathrm{LF}(I[k])$, i.e., $\max\{j : I[j] \le \mathrm{LF}(I[k])\}$; and the offset $d$ of $\mathrm{LF}(I[k])$ in that run. If $j$ is a position in the $k$-th run, the offset between $\mathrm{LF}(j)$ and $\mathrm{LF}(I[k])$ is $j - I[k]$, hence we can find the correct run containing $\mathrm{LF}(j)$ and its offset in that run using a sequential scan as described in Section 2. With this approach, we can represent positions in the BWT as run/offset pairs and implement LF accordingly, i.e. $(k', d') \leftarrow \mathrm{LF}(k, d)$. This change removes the need for the $p$ column of the table, with successive LF steps performed using the returned run/offset pair: access row $k'$ with offset $d'$ and perform $\mathrm{LF}(k', d')$.
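The run/offset formulation can be sketched as follows. This is an uncompressed illustration with names of our choosing (`build_lf_table`, `lf_step`); it uses the maximal BWT runs as rows, stores each run's character alongside the triple, and computes LF by a stable sort of the BWT characters, which assumes `$` sorts before every other character.

```python
def build_lf_table(bwt):
    """Rows (character, run length, destination run, destination offset), one per run."""
    n = len(bwt)
    # LF as the rank of each BWT position in a stable sort by character.
    order = sorted(range(n), key=lambda i: bwt[i])
    lf = [0] * n
    for rank, i in enumerate(order):
        lf[i] = rank
    heads = [i for i in range(n) if i == 0 or bwt[i] != bwt[i - 1]]
    bounds = heads + [n]
    rows = []
    for k, h in enumerate(heads):
        dest = max(j for j, s in enumerate(heads) if s <= lf[h])
        rows.append((bwt[h], bounds[k + 1] - h, dest, lf[h] - heads[dest]))
    return rows, heads, lf

def lf_step(rows, k, d):
    """Map the run/offset pair (k, d) to the run/offset pair of its LF image."""
    _, _, dest, offset = rows[k]
    offset += d
    # Sequential scan: walk forward while the offset overflows the current run.
    while offset >= rows[dest][1]:
        offset -= rows[dest][1]
        dest += 1
    return dest, offset

rows, heads, lf = build_lf_table("TTC$GGAAATTAAC")   # an arbitrary example string
```

No absolute BWT position appears in `lf_step`, so consecutive LF steps only touch the table.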

3.1. Block Compression

For each row of the previous representation of the BWT, we store the character of the run corresponding to the row to enable support of count and inversion queries. Figure 2 shows an example of this uncompressed table. Preliminary results showed that left uncompressed, LF-mapping could be made drastically faster than a sparse bitvector implementation (seen in Section 4 as rle-string) for inversion or LF queries. However, the result is also drastically larger; this formulation is not practical because it requires storing three integers and one character for each run, and to perform count operations, it requires scanning the run heads to find the preceding and following run of the character we are seeking. One first improvement is to store the array $R[0..r-1]$ in a wavelet tree as described in Section 2, which supports rank and select queries to efficiently find the preceding and following run of a given character.

Figure 2.

For an example text $T = \mathrm{GATTAGATACAT}$, the LF mapping and subsequent uncompressed table are built (with appended terminal character \$). The run/offset columns show positions with respect to the L column used to find a mapping’s predecessor. Notice that highlighted stored mappings (destinations) for any run of As form a non-decreasing subsequence.

The tabular approach exploits spatial locality of the entries, which facilitates the linear scans required by the algorithm when accessing rows sequentially; however, there is no apparent relationship which makes row-wise compression easy. To mitigate locality concerns, we partition the table into blocks of size $B$ which are loaded in a cache friendly manner. Using a fixed $B$, we can easily perform modular arithmetic to map positions within the blocks. For each block we store the corresponding character of each run in a wavelet tree that allows fast rank and select queries inside the block (using uncompressed bitvectors). For each character $c$ of the alphabet, the position of the first run of $c$’s preceding the beginning of the block and following the end of the block is stored, allowing efficient retrieval of these characters’ correct rows when they are not stored in the wavelet tree and occur in another block. For example, we may need to look to another block if some character has no occurrences in the current block, or has no occurrences before/after some position.

To improve compression inside the block, we compress the lists of lengths and offsets using directly-addressable codes (DACs, see [14]); we divide the list of run indices into $\sigma$ sub-lists, each containing the indices from rows corresponding to runs of a distinct character $c \in \Sigma$. Compressing the lengths and offsets in DACs is naive compression (footnote 5) leveraging the length of a value’s bit representation while also supporting random access. For mapping destinations, it follows from LF that the mapping indices across a common character $c$ form a non-decreasing sub-sequence [4] as highlighted in Figure 2. If we store in a block, for each of the $\sigma$ sub-lists, the mapping index of the first occurrence in the list, then the rest of the list can be truncated as a difference from the base mapping. We can also choose to represent the sub-lists by partial differences; for $m$ occurrences of a character $c$ let $M[0..m-1]$ be such a sub-list where we explicitly store the first mapping $M[0]$, and represent the list as $D[0] = 0$, $D[i] = M[i] - M[i-1]$. Storing only partial differences allows us to recover the mapping using prefix sum, which we expand upon in Section 3.2 alongside an approach over absolute offsets from the base. To manoeuvre around our positional change to run indices, we also store a sparse bitvector marking sampled run head positions in the BWT, which is used after backwards-stepping to recover the absolute index from a run/offset pair (footnote 6).

3.2. Optimizations

Compressing the mapping column as “difference lists” gives several ways to exploit the $\sigma$ non-decreasing sub-sequences:

DAC Sampling

By storing the partial differences space efficiently and sampling the absolute difference from the base, the number of random accesses needed to recover the correct value is bounded when computing the prefix sum. Implementing the approach using DACs to store the partial differences, we have a first method to retrieve mappings in compressed space while avoiding a costly traversal of the entire list. Although basic, this method is a simple choice to illustrate how we can leverage these sequences being non-decreasing.
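A minimal sketch of this sampling scheme is given below. Plain Python lists stand in for the DACs, and the names `encode` and `decode` as well as the example values are ours; the point is only that a lookup adds at most $s - 1$ partial differences to the nearest preceding sample.

```python
def encode(mappings, s):
    """Store the base, the partial differences, and every s-th offset from the base."""
    base = mappings[0]
    diffs = [0] + [b - a for a, b in zip(mappings, mappings[1:])]
    samples = [m - base for m in mappings[::s]]
    return base, diffs, samples

def decode(base, diffs, samples, s, i):
    """Recover mappings[i] with a prefix sum that starts at the preceding sample."""
    block = i // s
    total = samples[block]
    for j in range(block * s + 1, i + 1):     # at most s - 1 additions
        total += diffs[j]
    return base + total

M = [11, 16, 19, 21, 21, 30, 34]              # a non-decreasing sub-list of mappings
encoded = encode(M, 3)
recovered = [decode(*encoded, 3, i) for i in range(len(M))]
```

A smaller sample rate shortens the prefix sums at the cost of storing more absolute offsets.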

Linear Interpolation

We perform linear interpolation between sampled offsets (as opposed to partial differences), with a sample rate $s$, prior sample $x$, next sample $z$, and unsampled difference $y$ at position $i$. For each $y$, we then store its difference $\Delta = y - \epsilon$ from a weighted average defined by

$$\epsilon = x + \frac{(z - x)\,(i - s \lfloor i/s \rfloor)}{s}$$

into a DAC (footnote 7). Given $i$ and $s$, we look up $x$ and $z$ to compute $\epsilon$, after which we compute $(y - \epsilon) + \epsilon = y$ from our stored value $\Delta = y - \epsilon$ to recover the mapping. At worst the stored value can only be the difference between the sampled values themselves, and we expect each value to tend towards the interpolated average obtained by assuming a linear increase between samples.
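The interpolation scheme can be sketched as below, under several assumptions of ours: $\epsilon$ is rounded down with integer division so that the stored $\Delta$ is an integer; the final value is kept as an extra sample so that the last, possibly partial, interval also has a "next sample"; signed values are kept in a plain list instead of a DAC with a separate sign bitvector; and zeros are stored at sampled positions to keep the indexing simple.

```python
def interpolate(x, z, i, s):
    """Weighted average between samples x and z at position i (integer arithmetic)."""
    return x + (z - x) * (i - s * (i // s)) // s

def encode_interp(offsets, s):
    """Return the samples and the per-position differences from the interpolation."""
    samples = offsets[::s] + [offsets[-1]]    # assumption: last value closes the final interval
    deltas = [y - interpolate(samples[i // s], samples[i // s + 1], i, s)
              for i, y in enumerate(offsets)]
    return samples, deltas

def decode_interp(samples, deltas, s, i):
    """Recover offsets[i] as epsilon + delta."""
    eps = interpolate(samples[i // s], samples[i // s + 1], i, s)
    return eps + deltas[i]

offsets = [0, 5, 8, 10, 10, 19, 23]           # non-decreasing offsets from the base
samples, deltas = encode_interp(offsets, 4)
recovered = [decode_interp(samples, deltas, 4, i) for i in range(len(offsets))]
```

When the offsets grow roughly linearly between samples the stored differences stay close to zero, which is what makes them cheap to encode.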

Bitvector

Construct an uncompressed bitvector in which the number of 0s before the $k$th 1 is the offset from the first pointer (which is stored explicitly) to the $k$th. For example, given a sequence $M = 11, 16, 19, 21$ and corresponding partial differences $D = 0, 5, 3, 2$, we store the first pointer $M[0] = 11$ alongside the bitvector

10000010001001

constructed as described above. Performing $\mathrm{select}(k) - k$ over this bitvector returns the number of 0s prior to the $k$th 1 and recovers the difference; in essence, a prefix sum over the partial differences where we remove the $k$ 1s from our calculation. Adding the stored $M[0]$ to this difference restores the original value. Given our example and $k = 3$, we have

$$M[0] + \mathrm{select}(3) - 3 = 11 + 11 - 3 = 19 = M[2]$$

and we recover the correct value at $M[2]$ ($i = 2$ corresponds to the $k = 3$ bit due to 0-based array indexing).
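The worked example above can be reproduced with a short sketch. The function names are ours, and `select1` is a linear scan standing in for a constant-time select structure over an uncompressed bitvector; positions are 1-based, matching the arithmetic in the example.

```python
def build_bitvector(mappings):
    """Unary-code the partial differences: D[i] zeros followed by a one."""
    bits = []
    prev = mappings[0]
    for m in mappings:
        bits.extend([0] * (m - prev))
        bits.append(1)
        prev = m
    return mappings[0], bits

def select1(bits, k):
    """1-based position of the k-th 1 (k >= 1); a linear-scan stand-in for select."""
    seen = 0
    for pos, bit in enumerate(bits, start=1):
        seen += bit
        if bit == 1 and seen == k:
            return pos
    raise ValueError("fewer than k ones in the bitvector")

def recover(base, bits, i):
    """M[i] = M[0] + select(k) - k with k = i + 1."""
    k = i + 1
    return base + select1(bits, k) - k

M = [11, 16, 19, 21]
base, bits = build_bitvector(M)
bit_string = "".join(map(str, bits))
```

The bitvector has one bit per list entry plus one bit per unit of total increase, so it is compact when the mappings within a block do not spread far apart.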

To further optimize for practical input, consider an alternative to the wavelet tree suitable for small alphabets or when query support is needed for only a subset of characters. Where the wavelet tree performs rank and select over multiple tree levels, we could instead store full length uncompressed bitvectors in our blocks, one for each chosen character $c$ marking positions $i$ where $\mathrm{BWT}[i] = c$. For large alphabets, this approach is much larger than a wavelet tree representation; however, for genomic datasets which in practice support queries on few characters such as the nucleobases $\{A, C, G, T\}$, this alternative may be preferred. As this is the case in our experiments, we use this restricted alphabet trick to trade off space for increased speed in performing rank and select operations. Figure 3 summarizes the structure of our proposed practical approach: it gives an overview of the hierarchy of the proposed optimizations with respect to components of the data structure, and of the options which we have implemented.

Figure 3.

Shows hierarchy of implementation, outlining different approaches and optimizations. Solid lines show required components of parents given our work, where dotted lines denote multiple options being available. For example, the various methods to recover the mapping of a run head are shown as children of difference list. Shaded nodes show paths that are implemented for experiments in Section 4.

3.3. Scanning Complexity

We have not yet implemented the second part of Nishimoto and Tabei’s result because we correctly expected their idea of table lookup (perhaps modified slightly) to be interesting and practical by itself. Over real world datasets (as discussed in Section 5), our typical sequential scan is very small; however, theoretically we use $\Omega(r)$ time in the worst case for such a scan for LF. In fact, there are strings for which the average time for a scan is $\Omega(r)$. Suppose a string has $\mathrm{BWT}[0..n-1] = (bc)^{n/10} a^{4n/5}$ with $r = n/5 + 1$ runs. By LF properties we have $3n/5$ LF steps which require scanning $r - 1$ rows, as described in Figure 4. Similarly, we encounter $\Omega(nr)$ time for inversion, as we perform exactly $n$ possible LF steps during a full retrieval of the original string.
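The counting in this example can be checked numerically. The sketch below (the function name `scan_lengths` is ours) treats the given string purely as a BWT, computes LF by a stable sort of its characters, and reports for every position how many rows a sequential scan would cross, i.e. the difference between the run containing $\mathrm{LF}(j)$ and the run containing the LF image of the head of $j$'s run.

```python
def scan_lengths(bwt):
    """Return (rows scanned for the LF step at each position, number of runs)."""
    n = len(bwt)
    order = sorted(range(n), key=lambda i: bwt[i])    # stable sort gives LF
    lf = [0] * n
    for rank, i in enumerate(order):
        lf[i] = rank
    heads, run_of = [], []
    for i in range(n):
        if i == 0 or bwt[i] != bwt[i - 1]:
            heads.append(i)
        run_of.append(len(heads) - 1)
    scans = [run_of[lf[j]] - run_of[lf[heads[run_of[j]]]] for j in range(n)]
    return scans, len(heads)

n = 50
scans, r = scan_lengths("bc" * (n // 10) + "a" * (4 * n // 5))
worst = sum(1 for s in scans if s == r - 1)           # steps that cross every b/c run
```

With these parameters `r` is $n/5 + 1 = 11$ and `worst` is $3n/5 = 30$, in line with the analysis above.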

Figure 4.

Visual representation of amortized analysis in Section 3.3. Notice that given a BWT of this form, any character $a$ corresponding to run $k$ with $I[k] = n/5$ stores $\mathrm{LF}(n/5) = 0$ as its mapping. If the offset $d$ is greater than $n/5$, then the sequential scan must cross the boundaries of each run of $b$ or $c$, of which there are $n/5$ in total; since there is only one run of $a$, we scan $r - 1$ entries, and perform this operation for $3n/5$ possible steps. Amortized over all possible LF steps, we cannot avoid $\Omega(r)$ scans in the worst case.

In practice, a very similar string can be produced which preserves a similar worst case. Consider a randomly generated binary string, for our purposes over the alphabet $\Sigma = \{b, c\}$. We then interleave the sequence with four consecutive $a$ characters between each of the original characters (resulting in $4n/5$ $a$ characters). The number of runs we expect in its BWT cannot be much less than the number of runs in the original sequence, since the introduced $a$ characters are easily run-length compressed and such a technique would improve compression of any random sequence otherwise. The expected number of runs in a random binary string is half its length, here $n/10$, as also observed in practical experiments, and thus mapping to $a$ characters results in almost the same case as Figure 4. To perform some practical bounding against scans without theoretical guarantees, we allow splitting of large runs by specifying the maximum acceptable run length to provide an alternative construction.

3.4. Count Queries

Standard FM-indexes are particularly good at counting queries, both in theory and in practice, and counting was also the first query supported quickly and in small space by RLBWT-based versions of the FM-index [13] (time- and space-efficient reporting was developed much later [6]). It seems appropriate, therefore, to test with counting queries our implementation of the first part of Nishimoto and Tabei’s result. A counting query for pattern $S[0..m-1]$ in text $T[0..n-1]$ returns the number of occurrences of $S$ in $T$, by backward searching for $S$ and returning the length of the BWT interval containing the characters preceding occurrences of $S$ in $T$. We can implement a backward step using access to the string $R$ described in Section 2, up to 2 rank queries and 2 select queries on $R$, and 2 LF queries.

Suppose the interval $\mathrm{BWT}[s..e]$ contains the characters preceding occurrences of $S[i+1..m-1]$ in $T$ and we know both the indices $j_s$ and $j_e$ of the runs containing $\mathrm{BWT}[s]$ and $\mathrm{BWT}[e]$, and the offsets of those characters in those runs. We need not assume we know $s$ and $e$ themselves. If $R[j_s] = S[i]$ then the BWT interval containing the characters preceding occurrences of $S[i..m-1]$ in $T$ starts at $\mathrm{BWT}[\mathrm{LF}(s)]$. Otherwise, it starts at $\mathrm{BWT}[\mathrm{LF}(s')]$, where $s'$ is the position of the first character in run

$$j_{s'} = R.\mathrm{select}_{S[i]}(R.\mathrm{rank}_{S[i]}(j_s) + 1).$$

Symmetrically, if $R[j_e] = S[i]$ then the interval ends at $\mathrm{BWT}[\mathrm{LF}(e)]$; otherwise, it ends at $\mathrm{BWT}[\mathrm{LF}(e')]$, where $e'$ is the position of the last character in run

$$j_{e'} = R.\mathrm{select}_{S[i]}(R.\mathrm{rank}_{S[i]}(j_e)).$$

(If $j_{s'} > j_{e'}$ then $S[i..m-1]$ does not occur in $T$.) These operations are all supported across our block compressed table, and the final interval positions in the BWT can be computed using sampled run head positions to return the final count.
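The backward step described above can be sketched end to end on a small text. This illustration simplifies in ways that are ours, not the paper's: it tracks absolute positions $s$ and $e$ instead of run/offset pairs, it answers rank and select on $R$ by linear scans rather than with a wavelet tree, and it builds everything naively. The names `build_index` and `count` are hypothetical.

```python
def build_index(text):
    """Naive construction of LF, run boundaries and the run-head string R."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])
    bwt = [text[p - 1] for p in sa]                   # text[-1] is '$' when p == 0
    order = sorted(range(n), key=lambda i: bwt[i])
    lf = [0] * n
    for rank, i in enumerate(order):
        lf[i] = rank
    heads = [i for i in range(n) if i == 0 or bwt[i] != bwt[i - 1]]
    ends = [h - 1 for h in heads[1:]] + [n - 1]
    run_of = [0] * n
    for k, (h, t) in enumerate(zip(heads, ends)):
        for i in range(h, t + 1):
            run_of[i] = k
    return lf, heads, ends, run_of, [bwt[h] for h in heads]

def count(index, pattern):
    """Number of occurrences of pattern, by backward search over the runs."""
    lf, heads, ends, run_of, R = index
    c_runs = lambda c: [j for j, x in enumerate(R) if x == c]
    s, e = 0, len(lf) - 1
    for c in reversed(pattern):
        js, je = run_of[s], run_of[e]
        if R[js] != c:                                # jump to the next run of c
            later = [j for j in c_runs(c) if j > js]  # select(rank(js) + 1)
            if not later:
                return 0
            js, s = later[0], heads[later[0]]
        if R[je] != c:                                # jump to the previous run of c
            earlier = [j for j in c_runs(c) if j < je]  # select(rank(je))
            if not earlier:
                return 0
            je, e = earlier[-1], ends[earlier[-1]]
        if js > je:
            return 0
        s, e = lf[s], lf[e]
    return e - s + 1

index = build_index("GATTAGATACAT$")
occurrences = count(index, "AT")
```

Each backward step uses at most two jumps on $R$ and two LF evaluations, mirroring the operation counts given in the text.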

4. Experiments

Our code was written in C++ and compiled with flags -O3 -DNDEBUG -funroll-loops -msse4.2 using data structures from sdsl-lite [7]. We performed our experiments on a server with an Intel® Xeon® Silver 4214 CPU running at 2.20GHz with 32 cores and 100 GB of memory. Our code is available at https://github.com/drnatebrown/r-index-f.git. Count query times were measured using Google Benchmark, and construction with the Unix /usr/bin/time command.

4.1. Data Structures

For our table lookup implementations, we partition into blocks of size $B = 2^{20}$ and sample every 16th run position in the BWT. We compared the following data structures:

lookup-bv: table lookup with bitvector marking differences with 0s, recovered with select described in Section 3.2.

lookup-int: table lookup with linear interpolation between sampled values described in Section 3.2 with sample rate 16.

lookup-dac: table lookup with DAC sampling of differences described in Section 3.2 with sample rate 5.

lookup-split2: table lookup with naive run splitting using lookup-bv data structure described in Sections 3.2, 3.3. Runs larger than twice the average length $n/r$ are split.

lookup-split5: table lookup identical to lookup-split2, except runs larger than five times the average length $n/r$ are split.

wt-fbb: fixed-block boosting wavelet tree of [8] using default parameters; implementation at https://github.com/dominikkempa/faster-minuter.

rle-string: run-length encoded string of the r-index [6]; implementation based off https://github.com/nicolaprezza/r-index.

RLCSA: the BWT component (footnote 8) of the run-length encoded compressed suffix array of [13] using default parameters; implementation at https://github.com/adamnovak/rlcsa.

4.2. Datasets

We tested our data structures for construction and query on 4 collections of 128, 256, 512 and 1000 haplotypes of chromosome 19 from the 1000 Genomes Project [17] (chr19) and 4 collections of 100k, 200k, 300k, 400k SARS-CoV2 genomes from the EBI’s COVID-19 data portal [9] (Sars-CoV2; see footnote 9). Each set is a superset of the previous one. Table 1 describes the lengths $n$ and ratio $n/r$ of the datasets.

Table 1.

Table of the different datasets. In columns 1 and 2 we report the name and description of the datasets, in column 3 we report the number of sequences in the collection, in column 4 we report the length of the file, and in column 5 the ratio of the length to the number of runs in the BWT.

| Name | Description | N | $n/10^6$ | $n/r$ |
|---|---|---|---|---|
| chr19 | Human chromosome 19 | 128 | 7568.01 | 222.24 |
| chr19 | Human chromosome 19 | 256 | 15136.04 | 424.93 |
| chr19 | Human chromosome 19 | 512 | 30272.08 | 771.54 |
| chr19 | Human chromosome 19 | 1,000 | 59125.12 | 1287.38 |
| Sars-CoV2 | Sars-CoV2 genomes database | 100,000 | 2979.01 | 881.16 |
| Sars-CoV2 | Sars-CoV2 genomes database | 200,000 | 5958.35 | 977.19 |
| Sars-CoV2 | Sars-CoV2 genomes database | 300,000 | 8944.37 | 1178.00 |
| Sars-CoV2 | Sars-CoV2 genomes database | 400,000 | 11931.17 | 1328.92 |

4.3. Construction

In Figure 5 we report the time and memory for construction of the data structures for the chr19 and Sars-CoV2 datasets. RLCSA is omitted, since it is the only data structure not built using prefix free parsing (PFP) [3], and its construction time far exceeded the other methods.

Figure 5.

Construction for chr19 of 128, 256, 512 and 1000 copies (left) and Sars-CoV2 of 100k, 200k, 300k and 400k copies (right). Copies increase for an instance plotted left to right. For chr19 we partially omit wt-fbb for being magnitudes larger than other values (approximately 4 times slower and larger than lookup-bv for 512 copies and similarly 5 times slower and 7 times larger for 1000).

4.4. Query

To query the data structures we performed counting queries for 10000 randomly chosen substrings each of length 10, 100, 1000 and 10000. In Figures 6 and 7 we report the time and memory for querying of the data structures for the chr19 and Sars-CoV2 datasets respectively.

Figure 6.

The time per query to count the occurrences of 128, 256, 512 and 1000 copies of chr19 for 10000 randomly-chosen substrings of length 10, 100, 1000 and 10000 each. Copies for a single line are read from largest number of copies to smallest, left to right. The x axis is logarithmically scaled, motivated by doubling the number of copies across examples.

Figure 7.

The time per query to count the occurrences of 100k, 200k, 300k and 400k Sars-CoV2 copies for 10000 randomly-chosen substrings. Results are given for queries of length 10, 100, 1000 and 10000. Copies for a single line are read from largest number of copies to smallest, left to right.

5. Discussion

With respect to our table lookup implementations, lookup-bv and its variants (lookup-split2, lookup-split5) perform better than the alternatives (lookup-int, lookup-dac) a majority of the time across all queries, while being smaller in space. For query lengths greater than 10 on chr19, these approaches are faster than rle-string but slightly larger, while slower than RLCSA but smaller in size; we occupy a time/space trade-off position between these values. This is while also being much smaller than wt-fbb whose space makes it an outlier despite best speeds for various queries.

On Sars-CoV2, our implementations perform well on queries of length 10, with lookup-split2 the fastest implementation and other approaches competitive in both time/space. For query lengths greater than 10, the non-splitting approaches (lookup-bv, lookup-int, lookup-dac) perform the worst across data structures with respect to speed. With splitting approaches, we are comparable to rle-string in time but worse in space. Although again an outlier in space, wt-fbb performs fastest, with RLCSA occupying the least space with comparable speed to wt-fbb.

In terms of size/construction, we perform worse than rle-string across all data, but are highly competitive for lookup-bv’s space despite slower construction. For our implementations, lookup-bv is the definitive choice across results in regard to both space and construction time. When compared to RLCSA, despite being more space-efficient on chr19 across lookup-bv approaches, we cannot compete on Sars-CoV2 where it is a clear winner across all data structures. This motivates applying table lookup to also speed up RLCSA; however, we note adding support for $\phi$ and $\phi^{-1}$ (thus, supporting locate) to RLCSA is still an open problem.

With regard to our splitting approaches, they are superior to lookup-bv for long query lengths and as n/r rises. To examine the cause in terms of n/r and growing text collections, we examine the number of sequential scans required across LF steps during count queries of length 100 for chr19 in Figure 8. Although the distribution is similar across all copies near zero, with a majority requiring no sequential scan and most of the rest scanning very few, worst cases become both more prevalent and longer as the number of copies and n/r grows. This gives further insight into the success of the splitting approaches in these instances, as bounding the maximum runs also bounds worst case sequential scans. We find this result intriguing with respect to Theorem 1 when n/r or the worst case number of scans is high. Concentrating on Nishimoto and Tabei’s first result, lookup-bv performs competitively in space/time for low n/r with naive run splitting as a practical alternative otherwise in our observed experiments.

Figure 8.

Frequencies in percentage of runs scanned for any LF step across 10000 count queries of length 100 for 100, 200, 512 and 1000 copies of chr19. Plot on left is restricted only to steps scanning 0 to 9 runs; plot on right shows all scans, log scaled since the frequency of scans decreases quickly for large values.

Acknowledgements

Many thanks to Omar Ahmed, Christina Boucher and Ben Langmead for discussions and assistance during our research, and to the anonymous reviewers for their insightful feedback.

Funding

This work was funded by NIH R01AI141810 and R01HG011392, NSERC Discovery Grant RGPIN-07185-2020, and NSF IIBR 2029552 and IIS 1618814.

Footnotes

1

Conventionally, LF-mapping in runs bounded space relies on rank queries over sparse bitvectors.

2

We may have taken some artistic license with their format.

3

Realizing this about ϕ, however, led directly to Gagie, Navarro and Prezza’s r-index [6].

4

For example, for the recent pan-genomic index MONI [16], we need LF, ϕ, ϕ⁻¹ and access to so-called thresholds. A threshold for a consecutive pair of runs of the same character in the BWT is a position of a minimum LCP value in the interval between those runs. If we know the index of the run containing a particular character BWT[i] and its offset in that run, and we want to know whether it is before or after the threshold for the pair of runs of another character c bracketing BWT[i], then we can find in O(log σ) time the index of the preceding run of c’s; if we have the index of the run containing the threshold and its offset in that run stored with that preceding run of c’s, then we can tell immediately if BWT[i] is before or after the threshold.

5

DACs are a simple method to allow both random access alongside compression; however, more specific techniques would be preferred if these columns have exploitable properties that we could not uncover.

6

Although we introduce a sparse bitvector into our data structure, it is not used during sequential LF stepping, but rather as an “exit” or “entrance” from the table’s run/offset pairs.

7

We store a bitvector denoting the sign of the stored component, allowing us to compress unsigned integers using the DAC.

8

We build the data structure without suffix-array sampling.

9

The complete list of accession numbers is reported in the repository.

Supplementary Material

Source code available from https://github.com/drnatebrown/r-index-f

Contributor Information

Nathaniel K. Brown, Faculty of Computer Science, Dalhousie University, NS, Canada.

Travis Gagie, Faculty of Computer Science, Dalhousie University, NS, Canada.

Massimiliano Rossi, Department of Computer and Information Science and Engineering, University of Florida, FL, USA.

References

  • 1. Ahmed Omar, Rossi Massimiliano, Kovaka Sam, Schatz Michael C., Gagie Travis, Boucher Christina, and Langmead Ben. Pan-genomic matching statistics for targeted nanopore sequencing. iScience, 24(6):102696, 2021.
  • 2. Bannai Hideo, Gagie Travis, and Tomohiro I. Refining the r-index. Theor. Comput. Sci, 812:96–108, 2020.
  • 3. Boucher Christina, Gagie Travis, Kuhnle Alan, Langmead Ben, Manzini Giovanni, and Mun Taher. Prefix-free parsing for building big BWTs. Algorithms Mol. Biol, 14(1):13:1–13:15, 2019.
  • 4. Burrows Michael and Wheeler David J. A block-sorting lossless data compression algorithm. Technical Report 124, DEC, 1994.
  • 5. Ferragina Paolo and Manzini Giovanni. Indexing compressed text. J. ACM, 52(4):552–581, 2005.
  • 6. Gagie Travis, Navarro Gonzalo, and Prezza Nicola. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. J. ACM, 67(1):2:1–2:54, 2020.
  • 7. Gog Simon, Beller Timo, Moffat Alistair, and Petri Matthias. From theory to practice: Plug and play with succinct data structures. In 13th International Symposium on Experimental Algorithms (SEA), pages 326–337, 2014.
  • 8. Gog Simon, Kärkkäinen Juha, Kempa Dominik, Petri Matthias, and Puglisi Simon J. Fixed block compression boosting in FM-indexes: Theory and practice. Algorithmica, 81(4):1370–1391, 2019.
  • 9. Harrison Peter W et al. The COVID-19 Data Portal: accelerating SARS-CoV-2 and COVID-19 research through rapid open access data sharing. Nucleic Acids Research, 49(W1):W619–W623, 2021.
  • 10. Kuhnle Alan, Mun Taher, Boucher Christina, Gagie Travis, Langmead Ben, and Manzini Giovanni. Efficient construction of a complete index for pan-genomics read alignment. J. Comput. Biol, 27(4):500–513, 2020.
  • 11. Langmead Ben, Trapnell Cole, Pop Mihai, and Salzberg Steven L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome biology, 10(3):1–10, 2009.
  • 12. Li Heng and Durbin Richard. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinform., 25(14):1754–1760, 2009.
  • 13. Mäkinen Veli, Navarro Gonzalo, Sirén Jouni, and Välimäki Niko. Storage and retrieval of highly repetitive sequence collections. J. Comput. Biol, 17(3):281–308, 2010.
  • 14. Navarro Gonzalo. Compact Data Structures - A Practical Approach. Cambridge University Press, 2016.
  • 15. Nishimoto Takaaki and Tabei Yasuo. Optimal-time queries on bwt-runs compressed indexes. In 48th International Colloquium on Automata, Languages, and Programming (ICALP), pages 101:1–101:15, 2021.
  • 16. Rossi Massimiliano, Oliva Marco, Langmead Ben, Gagie Travis, and Boucher Christina. Moni: A pangenomic index for finding maximal exact matches. J. Comput. Biol, 29(2):169–187, 2022.
  • 17. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature, 526(7571):68–74, 2015.
