Abstract
The extended Burrows Wheeler Transform (eBWT) was introduced by Mantaci et al. [TCS 2007] to extend the definition of the BWT to a collection of strings. In our prior work [SPIRE 2021], we give a linear-time algorithm for the eBWT that preserves the fundamental property of the original definition (i.e., the independence from the input order). The algorithm combines a modification of the Suffix Array Induced Sorting (SAIS) algorithm [IEEE Trans Comput 2011] with Prefix Free Parsing [AMB 2019; JCB 2020]. In this paper, we show how this construction algorithm leads to r-indexing the eBWT, i.e., run-length encoded eBWT and SA samples of Gagie et al. [SODA 2018] can be constructed efficiently from the components of the PFP. Moreover, we show that finding maximal exact matches (MEMs) between a query string and the r-index of the eBWT can be efficiently supported.
1. Introduction
A number of large sequencing projects aim to identify the biological variation among individuals of a given species, for example the 100K Human Genome Project [29], the 1001 Arabidopsis Project [28], and the 3,000 Rice Genomes Project (3K RGP) [27]. Although biological variation is common and necessary among cultivars or individuals of these sequencing projects, a large portion of sequencing data is shared, leading to repetition within the dataset. Given the thousands of individuals within these sequencing projects, indexing them in a manner that allows variation to be identified and compared is challenging. The FM-index [18] has been the cornerstone of this indexing, as most standard read alignment algorithms (e.g., BWA [16] and Bowtie [15]) build and use the FM-index of a set of one or more reference genomes. More specifically, these read alignment algorithms build the Burrows Wheeler Transform (BWT) and suffix array (SA) or SA samples to find alignments between sequence reads and the database of genomes. However, as the number of genomes increases, there has been an aim to build the BWT and SA samples in space that is linear in the number of runs in the BWT, which is typically denoted as r.
Mäkinen and Navarro may have been the first to pose the search from an FM-index that can be constructed in O(r)-space [17]. They introduced the run-length Burrows Wheeler Transform (RL BWT) and showed how to locate all occurrences of a query string P[1..p] in a string T[1..n] in time that depends on p, on the number of occurrences occ, on the alphabet size σ, and on a parameter s. The downside is that they require O(n/s)-space for the SA samples. Given a query string P and the RL BWT of a string T, Policriti and Prezza [25] showed how to find a single SA sample in the interval in RL BWT containing P in O(r)-space. Then in 2018, Gagie et al. [8] showed how to fully support locate queries, i.e., locate all occ SA samples in O(r)-space. The resulting data structure is referred to as the r-index. We note that this result defined the r-index but did not give an algorithm to construct it. The algorithm to construct the r-index was described later by Boucher et al. [5] and Kuhnle et al. [14]; both are based on a preprocessing technique called Prefix Free Parsing (PFP). PFP produces two temporary structures called the dictionary and the parse. From these components the r-index can be built. This was a significant achievement but was not fully set up to accomplish read alignment, since finding short exact matches between a read and an index was not defined for the r-index. To accomplish this latter task, Rossi et al. [26] augmented PFP to construct an auxiliary data structure, called thresholds, in addition to the r-index. The addition of thresholds allows for finding maximal exact matches (MEMs) between a query string (e.g., a sequence read) and an index (e.g., genomes), where a MEM is defined as an exact match that cannot be extended to the left or to the right.
In this paper, we consider the extended Burrows Wheeler Transform (eBWT), which extends the definition of the BWT to a collection of strings. Previously, we showed that the eBWT can be constructed by combining a modified version of the Suffix Array Induced Sorting (SAIS) algorithm of Nong et al. [24] with PFP. Here, we show that it follows from this construction that the run-length encoding of the eBWT and the SA samples of Gagie et al. [8] can be constructed in linear time in the size of the input, and linear space in the size of the dictionary and parse. Similarly, the thresholds of Rossi et al. [26] can be constructed for the eBWT and thus, be used to find MEMs.
2. Preliminaries
Basic definitions.
A string T = T[1..n] is a sequence of characters T[1] · · · T[n] drawn from an ordered alphabet Σ of size σ. We denote by |T| the length n of T. Given a multiset of m strings ℳ = {T_1, . . . , T_m}, we denote the total length of the strings in ℳ as ‖ℳ‖, i.e., ‖ℳ‖ = |T_1| + · · · + |T_m|.
Given two integers 1 ≤ i, j ≤ n where i ≤ j, the substring T[i] · · · T[j] is denoted by T[i..j], the j-th prefix T[1..j] is denoted by T[..j], and the i-th suffix T[i..n] by T[i..]. A substring S of T is called proper if S ≠ T.
Given two strings S and T, we denote by lcp(S, T) the length of the longest common prefix (LCP) of S and T, i.e., lcp(S, T) = max{i | S[1..i] = T[1..i]}.
Given a string T = T[1..n] and an integer k, we denote by T^k the kn-length string TT · · · T (k-fold concatenation of T), and by T^ω the infinite string TT · · · obtained by concatenating an infinite number of copies of T. A string T is called primitive if T = S^k implies T = S and k = 1. For any string T, there exists a unique primitive word S and a unique integer k such that T = S^k. We refer to S as root(T) and to k as exp(T). Thus, T = root(T)^exp(T).
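As a small illustration of root(T) and exp(T), the following Python sketch searches for the shortest prefix whose repetition yields T. The function name `root_and_exp` is an assumption made for this example only, and the quadratic scan is chosen for readability rather than efficiency.

```python
def root_and_exp(t):
    """Return (root(t), exp(t)) for a non-empty string t.

    Tries every divisor d of len(t) in increasing order; the first prefix
    t[:d] whose repetition equals t is the primitive root.
    """
    n = len(t)
    for d in range(1, n + 1):
        if n % d == 0 and t[:d] * (n // d) == t:
            return t[:d], n // d


# "ABAB" = ("AB")^2 is not primitive, while "ABA" is its own root.
print(root_and_exp("ABAB"))
print(root_and_exp("ABA"))
```

A string is primitive exactly when the returned exponent is 1.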
Suffix array.
We denote by <_lex the lexicographic order: for two strings S[1..m] and T[1..n], S <_lex T if S is a proper prefix of T, or there exists an index 1 ≤ i ≤ min(n, m) such that S[1..i − 1] = T[1..i − 1] and S[i] < T[i]. Given a string T[1..n], the suffix array [20], denoted by SA = SA_T, is the permutation of {1, . . . , n} such that T[SA[i]..] is the i-th lexicographically smallest suffix of T.
Given a string T[1..n] and the SA of T, we denote the inverse suffix array as ISA, and define it as ISA[SA[i]] = i for all i = 1, . . . , n.
Definition of ϕ.
Kärkkäinen et al. [12] introduced the permutation ϕ. It is defined as follows: ϕ(i) = SA[ISA[i] − 1] if ISA[i] > 1; and ϕ(i) = SA[n] otherwise. We can rewrite this as: ϕ(SA[j]) = SA[j − 1], for all j > 1.
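The definitions of SA, ISA and ϕ can be checked on a small input with the sketch below. It is an illustration under stated assumptions: indices are 0-based (the text uses 1-based indices), the suffix array is built by direct sorting, and the helper names are chosen for this example.

```python
def suffix_array(t):
    """0-based suffix array by direct sorting (illustration only)."""
    return sorted(range(len(t)), key=lambda i: t[i:])


def inverse_suffix_array(sa):
    """ISA[SA[k]] = k for every rank k."""
    isa = [0] * len(sa)
    for rank, pos in enumerate(sa):
        isa[pos] = rank
    return isa


def phi(i, sa, isa):
    """Text position of the suffix that precedes suffix i in SA.

    For the smallest suffix (rank 0), Python's index -1 wraps around to the
    last SA entry, matching the 'SA[n] otherwise' case of the definition.
    """
    return sa[isa[i] - 1]


sa = suffix_array("banana")
isa = inverse_suffix_array(sa)
print(sa, isa, phi(3, sa, isa))
```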
ω-order.
We denote by ≺_ω the ω-order [10,21], defined as follows: for two strings S and T, S ≺_ω T if root(S) = root(T) and exp(S) < exp(T), or S^ω <_lex T^ω (this implies root(S) ≠ root(T)). One can verify that the ω-order relation is different from the lexicographic one. For instance, CG <_lex CGA but CGA ≺_ω CG, since (CGA)^ω = CGACGA · · · and (CG)^ω = CGCGCG · · · first differ at the third character, where A < C.
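The ω-order can be tested on finite prefixes. The sketch below relies on the Fine and Wilf periodicity theorem: if S^ω and T^ω agree on their first |S| + |T| characters, then S and T are powers of the same word, so they share a root and the order is decided by the exponent, i.e., by the length. The function name `omega_less` is an assumption for this example.

```python
def omega_less(s, t):
    """Return True iff s precedes t in the omega-order (s, t non-empty)."""
    k = len(s) + len(t)
    # Prefixes of length |s| + |t| of the infinite powers s^omega and t^omega.
    ps, pt = (s * k)[:k], (t * k)[:k]
    if ps != pt:
        return ps < pt          # different roots: compare the infinite words
    return len(s) < len(t)      # same root: the smaller exponent comes first


print("CG" < "CGA", omega_less("CGA", "CG"))
```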
Conjugate array.
The string S is a conjugate of the string T if S = T[i..n]T[1..i−1], for some i ∈ {1, . . . , n} (also called the i-th rotation of T). The conjugate S is also denoted conj_i(T). For a string T, the conjugate array CA = CA_T of T is the permutation of {1, . . . , n} such that CA[i] = j if conj_j(T) is the i-th conjugate of T with respect to the lexicographic order, with ties broken according to string order, i.e., if CA[i] = j and CA[i′] = j′ for some i < i′, then either conj_j(T) <_lex conj_j′(T), or conj_j(T) = conj_j′(T) and j < j′.
Given a string T, U is a circular or cyclic substring of T if it is a substring of TT of length at most |T|, or equivalently, if it is the prefix of some conjugate of T. For instance, ATA is a cyclic substring of AGCAT. It is sometimes also convenient to regard a given string T[1..n] itself as circular (or cyclic); in this case we set T[0] = T[n] and T[n + 1] = T[1].
Burrows Wheeler Transform.
Given a string T, BWT(T) [6] is a permutation of the letters of T which equals the last column of the matrix of the lexicographically sorted conjugates of T. The construction is reversible, allowing the original string T to be computed in linear time [6]. The BWT itself can be computed from the conjugate array, since for all i = 1, . . . , n, BWT(T)[i] = T[CA[i] − 1], where T is considered to be cyclic.
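The relation between the conjugate array and the BWT can be sketched as follows. This is a naive illustration, not a linear-time construction: indices are 0-based, rotations are sorted explicitly, and the function names are assumptions for this example.

```python
def conjugate_array(t):
    """0-based conjugate array: rotation start positions in sorted order.

    Equal rotations are ordered by start position, as in the definition.
    """
    return sorted(range(len(t)), key=lambda j: (t[j:] + t[:j], j))


def bwt(t):
    """BWT(t)[i] = t[CA[i] - 1], with t read cyclically (t[-1] for CA[i] = 0)."""
    return "".join(t[j - 1] for j in conjugate_array(t))


print(conjugate_array("banana"), bwt("banana"))
```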
It should be noted that in many applications, it is assumed that an end-of-string character (usually denoted $), which is not an element of Σ, is appended to the string; this character is assumed to be smaller than all characters from Σ. Computing the conjugate array then becomes equivalent to computing the suffix array, since CA_{T$}[i] = SA_{T$}[i]. Thus, applying one of the linear-time suffix array computation algorithms [22] leads to linear-time computation of the BWT.
LF-mapping.
Last-to-first-mapping (LF-mapping) is the mapping of the lexicographical rank of a conjugate of T, conj_i(T), to the lexicographical rank of the conjugate conj_{i−1}(T). In particular, the LF-mapping maps characters from the last column of the matrix of the lexicographically sorted conjugates of T (commonly referred to as L) to the corresponding occurrence of the character in the first column of the same matrix (commonly referred to as F); hence the name last-to-first. It is the fundamental operation behind backward search, and what allows the BWT to be reversed.
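To make the role of the LF-mapping concrete, the sketch below computes it from L alone and uses it to invert the BWT. It assumes that an end-of-string character `$`, absent from the input and smaller than every input character, is appended before the transform; all function names are chosen for this example.

```python
def bwt_with_sentinel(t):
    """BWT of t + '$' via explicitly sorted rotations (illustration only)."""
    s = t + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def lf_mapping(last):
    """lf[k] = row of F holding the same character occurrence as L[k].

    A stable sort of the positions of L by character lists them in F order,
    so the i-th occurrence of c in L is matched with the i-th c in F.
    """
    order = sorted(range(len(last)), key=lambda k: last[k])
    lf = [0] * len(last)
    for f_row, l_row in enumerate(order):
        lf[l_row] = f_row
    return lf


def invert(last):
    """Recover t from the BWT of t + '$' by following LF from row 0."""
    lf = lf_mapping(last)
    k, out = 0, []
    for _ in range(len(last) - 1):   # row 0 starts with '$'; L[0] is t's last char
        out.append(last[k])
        k = lf[k]
    return "".join(reversed(out))


L = bwt_with_sentinel("banana")
print(L, lf_mapping(L), invert(L))
```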
r-index.
Given a text T[1..n] whose BWT has r runs, and a pattern P[1..p], the r-index [9] is a data structure that supports count queries, i.e., computing the number occ of occurrences of P in T, in O(p log log_w(σ + n/r)) time and O(r) words of space, where w is the machine word size. It also supports locate queries, i.e., returning the occ positions in T where P occurs, in O(p log log_w(σ + n/r) + occ log log_w(n/r)) time and O(r) space [9, Theorem 3.6]. Recently, Nishimoto and Tabei [23] improved the previous running times for counting to O(p log log_w σ) and locating to O(p log log_w σ + occ).
The r-index is made of three main components: (1) a data structure that stores the run-length encoded BWT supporting LF-mapping queries, (2) a SA sample for each of the r runs, and (3) a data structure supporting ϕ operations. In particular, in [9], (1) builds on the RLFM-index of Mäkinen et al. [19] combined with the data structures of Belazzougui and Navarro [3], while (2) is an array storing the SA samples at the end of each run. Lastly, (3) is implemented as a predecessor data structure built on the SA samples at the beginning of each run, with the corresponding SA sample at the end of the previous run as satellite information.
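The three components can be illustrated on a toy text. The sketch below is a simplification under stated assumptions: the text ends with a unique smallest character `$`, indices are 0-based, the suffix array is built by sorting, and a sorted list with binary search stands in for the predecessor data structure. It evaluates ϕ(i) as the stored value for the predecessor j of i among the run-start samples, plus the offset i − j.

```python
from bisect import bisect_right
from itertools import groupby


def build_components(t):
    """Return (SA, run-length BWT, run-end SA samples, run-start phi samples)."""
    n = len(t)
    sa = sorted(range(n), key=lambda i: t[i:])
    bwt = [t[i - 1] for i in sa]                      # cyclic for i == 0
    rlbwt = [(c, len(list(g))) for c, g in groupby(bwt)]
    run_end_samples, phi_samples = [], {}
    for k in range(n):
        if k == 0 or bwt[k] != bwt[k - 1]:            # k starts a run
            phi_samples[sa[k]] = sa[k - 1]            # SA value just before it
        if k == n - 1 or bwt[k] != bwt[k + 1]:        # k ends a run
            run_end_samples.append(sa[k])
    return sa, rlbwt, run_end_samples, phi_samples


def phi(i, keys, phi_samples):
    """phi(i) from O(r) samples; keys is sorted(phi_samples)."""
    j = keys[bisect_right(keys, i) - 1]               # predecessor of i
    return phi_samples[j] + (i - j)


sa, rlbwt, ends, samples = build_components("banana$")
keys = sorted(samples)
print(rlbwt)
print([phi(i, keys, samples) for i in range(len(sa))])
```

Only one sample per run boundary is stored, which is why the space is proportional to r rather than to n.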
3. r-indexing the eBWT
In this section, we show how to construct and use the r-index for the eBWT. The eBWT [21] (extended BWT) is a generalization of the BWT to a multiset of strings ℳ that is independent of the order in which the strings appear in the multiset. Similarly to the BWT, eBWT(ℳ) is a permutation of the characters of the strings in ℳ which is, however, based on the ω-order between the conjugates of the strings in ℳ, rather than on the lexicographic order. It uses the generalized conjugate array GCA(ℳ), which is an array whose k-th entry equals (j, d) if conj_j(T_d) is the k-th conjugate in ω-order. In [4], we showed how to compute the eBWT and the generalized conjugate array in time linear in the total size of ℳ.
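For intuition, the eBWT and the generalized conjugate array of a tiny collection can be computed by sorting all conjugates explicitly. The sketch below is a naive illustration and not the linear-time algorithm of [4]. It assumes primitive input strings, 0-based indices, entries of the form (position, string index), and ties between equal conjugates broken by string index. Conjugates are compared in ω-order through prefixes of length 2L of their infinite powers, where L is the maximum string length; this suffices by the Fine and Wilf periodicity theorem.

```python
def ebwt_and_gca(strings):
    """Return (eBWT, GCA) of a list of primitive strings (naive sketch)."""
    max_len = max(len(t) for t in strings)
    entries = []
    for d, t in enumerate(strings):
        for j in range(len(t)):
            conj = t[j:] + t[:j]
            reps = (2 * max_len) // len(conj) + 1
            key = (conj * reps)[: 2 * max_len]   # prefix of conj^omega
            entries.append((key, d, j))
    entries.sort()                                # omega-order, then string index
    gca = [(j, d) for _, d, j in entries]
    ebwt = "".join(strings[d][j - 1] for j, d in gca)   # cyclic predecessor
    return ebwt, gca


print(ebwt_and_gca(["CGA", "CG"]))
print(ebwt_and_gca(["CG", "CGA"])[0])   # same eBWT for the other input order
```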
Remark 1.
In [4], we showed that given a multiset of strings ℳ, we can compute the eBWT(ℳ) from the GCA of the roots of the strings in ℳ. Therefore, we assume that the multiset of strings ℳ consists of m primitive strings. Moreover, we assume that ties between same roots in ℳ in the eBWT are broken by string index.
Given a pattern P[1..p], we say that P occurs in ℳ if P occurs as a substring of any of T_1, . . . , T_m. Formally, we define an occurrence of P in ℳ as a pair (j, d) such that for all i = 1, . . . , p, T_d[j + i − 1] = P[i], where T_d is considered as circular.
It follows from the definition of the r-index that we need three main components to build an r-index on the eBWT, namely (1) a data structure that stores the run-length encoded eBWT supporting LF-mapping queries, (2) the samples of the GCA at the end of each run, and (3) a data structure supporting ϕ operations, extended to the GCA. Here, we note that we make the assumption throughout this paper that no two strings have the same set of conjugates. For the eBWT, we need to store an auxiliary data structure marking the first conjugate of all the strings in ℳ.
We now show that all the components can be stored in O(r)-space, building upon the results of Gagie et al. [9].
For (1), the data structure storing the eBWT supporting LF-mapping, we can use the same data structure used by the r-index. We summarize this in the next corollary, which follows directly from Gagie et al. [9].
Corollary 1.
Given a multiset of strings ℳ of total length N and a pattern P[1..p], we can build an index of O(r) words such that we can count all occurrences of P in ℳ in O(p log log_w(σ + N/r))-time.
In the following proposition, we describe an analogue of a known property of the BWT for the eBWT. It states that the indices corresponding to characters within the same run are mapped contiguously by the LF-mapping. It follows from the fact that the LF-property holds for the eBWT [21].
Proposition 1.
Let ℳ be a multiset of strings of total length N and GCA its conjugate array, i.e., GCA[k] = (j_k, i_k) for all k = 1, . . . , N. If, for some integers k and h, eBWT[k] = eBWT[k−1] and GCA[h] = (j_k − 1, i_k), then GCA[h − 1] = (j_{k−1} − 1, i_{k−1}), where the strings are considered cyclic, i.e., T[0] = T[|T|] and T[|T| + 1] = T[1].
Next, we show that for a multiset of strings where no two strings have the same set of conjugates, at least one of the characters of each string in the set appears at the beginning of a run, and at least one of the characters of each string appears at the end of a run.
Proposition 2.
Let ℳ be a multiset of strings of total length N, where no two strings have the same set of conjugates, and let GCA[h] = (j_h, i_h) for h = 1, . . . , N. Then, for all 1 ≤ k ≤ m, there exist two integers h and h′ such that i_h = i_h′ = k, and the following are true: (1) either h = 1 or eBWT[h − 1] ≠ eBWT[h], and (2) either h′ = N or eBWT[h′] ≠ eBWT[h′ + 1].
The following corollary follows directly from the previous proposition.
Corollary 2.
Given a multiset of strings of total length N and where no two strings have the same set of conjugates, we have m ≤ r.
From the previous corollary, it follows that we can store the positions of the starting rotations of each string in O(r)-space. Moreover, the following result immediately follows from the previous corollary. It says that, for example, the compression ratio for a set of reads of length 150 cannot be better than 150, when using either the BWT or the eBWT.
Corollary 3.
Given a multiset of ℓ-length strings such that no two strings have the same set of conjugates, we have that N/r ≤ ℓ.
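The bound can be spelled out in one step. With m strings of length ℓ each, the total length is N = mℓ, and Corollary 2 gives m ≤ r, so:

```latex
\frac{N}{r} \;=\; \frac{m\,\ell}{r} \;\le\; \frac{m\,\ell}{m} \;=\; \ell ,
\qquad\text{e.g. } \ell = 150 \;\Longrightarrow\; \frac{N}{r} \le 150 .
```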
We can extend the definition of ϕ to the GCA, such that for each h > 1, ϕ associates to the h-th pair (j_h, i_h) in the GCA the pair (j_{h−1}, i_{h−1}). However, we have one additional constraint when we extend the description of ϕ to the eBWT: we need to ensure that each string in the set has at least one conjugate appearing at the beginning and at least one conjugate appearing at the end of a run. This follows from Proposition 2. Therefore, by storing the samples at the beginning and the end of each run, we are guaranteed to cover all strings. Hence, the number of samples is O(r). We can then build a predecessor data structure for each string T_i in ℳ with the GCA samples at the beginning of each run corresponding to samples of T_i, and associating the GCA sample at the end of the preceding run. Note that we need the predecessor search to be circular, i.e., the predecessor of the first element is the last element. Therefore, using Proposition 1, we can prove the following result.
Proposition 3.
Given a multiset of strings ℳ of total length N, where no two strings have the same set of conjugates, we can evaluate ϕ in O(r) words of space and O(log log_w(N/r))-time, where w is the machine word size.
We can summarize our results in the following theorem.
Theorem 1.
Given a multiset of strings ℳ of total length N, where no two strings have the same set of conjugates, we can build an index of O(r) words such that, given a pattern P[1..p], we can count the occ occurrences of P in ℳ in O(p log log_w(σ + N/r))-time and we can locate the occ positions in ℳ in additional O(occ log log_w(N/r))-time.
4. Constructing the r-index of the eBWT
In [4], we demonstrated how to construct the eBWT efficiently and in a manner that preserves the original definition of Mantaci et al. [21]. In particular, we combined a modified version of the Suffix Array Induced Sorting (SAIS) algorithm of Nong et al. [24] with PFP to develop a novel construction algorithm. Hence, we produce a dictionary and parse of the eBWT as a result of this algorithm. It follows from Kuhnle et al. [14] that we can produce the run-length encoding of the eBWT and the GCA samples from the dictionary and parse in linear time and space in their size.
Corollary 4.
Given a multiset ℳ of input strings, we can build the run-length encoding of the eBWT and the GCA samples in O(|D| + |𝒫|)-space, where D and 𝒫 are the dictionary and parse defined by the PFP.
As previously noted, the main components of the r-index, namely the RL BWT and the SA samples, are not enough to support efficient MEM-finding. To accomplish this, Bannai et al. [1] showed that MEM-finding can be supported by computing the matching statistics for a query string P, which are defined for each position in P as the length of the longest substring starting at that position that occurs in the indexed string T, and the starting position in T of one of its occurrences. From the matching statistics for P, one can compute the occurrence of a MEM using a two-pass process: first, working right to left, for each suffix of P it finds a suffix of the text that matches for as long as possible; then, working left to right, it uses random access to T to determine the length of those matches. Thus, to compute the matching statistics, Bannai et al. described the addition of a small data structure to the r-index that is referred to as thresholds. A threshold between a consecutive pair of runs can be defined as the position of the minimum LCP value in the interval between them. Rossi et al. [26] showed how to compute all thresholds efficiently by modifying the PFP construction of the r-index. Here, we see that such a modification can also be made for the eBWT construction, allowing the thresholds for the eBWT to be constructed along with the GCA samples.
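The definitions of matching statistics and MEMs can be illustrated with a brute-force sketch. It does not use thresholds or the r-index: it simply searches a single, non-circular text with `str.find`, which is a simplification of the setting above. The text `ABAABA` and the function names are assumptions made for this example; a match starting at position i is reported as a MEM when it is non-empty and the match starting at i − 1 is not longer than it, so that it cannot be extended to the left.

```python
def matching_statistics(text, pattern):
    """ms[i] = (length, pos): longest prefix of pattern[i:] that occurs in
    text, together with the start of one occurrence (pos is -1 if length 0)."""
    ms = []
    for i in range(len(pattern)):
        length, pos = 0, -1
        while i + length < len(pattern):
            hit = text.find(pattern[i:i + length + 1])
            if hit < 0:
                break
            length, pos = length + 1, hit
        ms.append((length, pos))
    return ms


def mems(ms):
    """Triples (start in pattern, length, start in text) of the MEMs."""
    out = []
    for i, (length, pos) in enumerate(ms):
        if length > 0 and (i == 0 or ms[i - 1][0] <= length):
            out.append((i, length, pos))
    return out


ms = matching_statistics("ABAABA", "CABAA")
print(ms)
print(mems(ms))
```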
Corollary 5.
Given a set of input strings ℳ, we can construct the thresholds in addition to the run-length encoding of the eBWT and the GCA samples in O(|D| + |𝒫|)-space, where D and 𝒫 are the dictionary and parse defined by the PFP.
It follows directly from Bannai et al. [2] and Rossi et al. [26] that given a query string P[1..p] that has occ occurrences of a MEM in , we can find a single MEM in -time and words of space, where tRA and sRA are the time and space of any data structure that is able to provide random access to the string. Moreover, we can extend this search to find all occ occurrences in additional -time.
Figure 1 depicts an example of a matching statistics query of the pattern P = CABAA against a collection of strings ℳ.
Fig. 1.

An illustration of the thresholds for calculating the matching statistics of a query string P[1..p] in a set of strings ℳ. Shown on the left is P, the longest prefix of the suffix of P that occurs in ℳ, and the position of the corresponding prefix in the text. Shown on the right, continuing from left to right, is the GCA of ℳ, the thresholds for the characters A, B, and C, the eBWT of ℳ, and all conjugates of all strings in ℳ, with the samples of the GCA highlighted in red. The arrows illustrate the position in the eBWT which corresponds to the prefix on the left.
5. Conclusions
In this paper, we describe how the fundamentals of the r-index can be transferred to the context of the eBWT. We note that the eBWT has an advantage over other BWT-based data structures for string collections, which is that it is independent of the order of the input strings. An r-index based on the eBWT inherits this important property. Yet, we note that the applicability of this data structure has not been fully explored. Thus, we think that implementing this r-index of the eBWT and evaluating the efficiency of its construction on large datasets is warranted. From a more theoretical point of view, recasting some of the more recent results—including the results of Nishimoto and Tabei [23], Bannai et al. [1], and Cobas et al. [7]—of the r-index to the context of eBWT merits attention.
Acknowledgements.
We thank Travis Gagie for comments on preliminary versions of this manuscript. CB and MR are funded by National Science Foundation NSF IIBR (Grant No. 2029552), NSF SCH (Grant No. 2013998), National Institutes of Health (NIH) NIAID (Grant No. HG011392) and NIH NIAID (Grant No. R01AI141810).
References
- 1. Bannai H, Gagie T, and I T. Refining the r-index. Theor Comput Sci, 812:96–108, 2020.
- 2. Bannai H, Kärkkäinen J, Köppl D, and Piatkowski M. Constructing the Bijective and the extended Burrows-Wheeler-Transform in linear time. In Proc. of the 32nd Annual Symposium on Combinatorial Pattern Matching, CPM 2021, volume 191 of LIPIcs, pages 7:1–7:16.
- 3. Belazzougui D and Navarro G. Optimal lower and upper bounds for representing sequences. ACM Trans Algorithms, 11(4):31:1–31:21, 2015.
- 4. Boucher C, Cenzato D, Lipták Zs, Rossi M, and Sciortino M. Computing the original eBWT faster, simpler, and with less memory. In Proc. of the 28th International Symposium on String Processing and Information Retrieval, SPIRE 2021, LNCS, 2021.
- 5. Boucher C, Gagie T, Kuhnle A, Langmead B, Manzini G, and Mun T. Prefix-free parsing for building big BWTs. Algorithms Mol Biol, 14(1):13:1–13:15, 2019.
- 6. Burrows M and Wheeler DJ. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994.
- 7. Cobas D, Gagie T, and Navarro G. A Fast and Small Subsampled R-Index. In Proc. of the 32nd Annual Symposium on Combinatorial Pattern Matching, CPM 2021, volume 191 of LIPIcs, pages 13:1–13:16, 2021.
- 8. Gagie T, Navarro G, and Prezza N. Optimal-time text indexing in BWT-runs bounded space. In Proc. of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, pages 1459–1477, 2018.
- 9. Gagie T, Navarro G, and Prezza N. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. J ACM, 67(1):2:1–2:54, 2020.
- 10. Gessel IM and Reutenauer C. Counting permutations with given cycle structure and descent set. J Combin Theory Ser A, 64(2):189–215, 1993.
- 11. Hon W, Ku T, Lu C, Shah R, and Thankachan SV. Efficient Algorithm for Circular Burrows-Wheeler Transform. In Proc. of the 23rd Annual Symposium on Combinatorial Pattern Matching, CPM 2012, volume 7354 of LNCS, pages 257–268, 2012.
- 12. Kärkkäinen J, Manzini G, and Puglisi SJ. Permuted Longest-Common-Prefix Array. In Proc. of the 20th Annual Symposium on Combinatorial Pattern Matching, CPM 2009, volume 5577 of LNCS, pages 181–192, 2009.
- 13. Kucherov G, Tóthmérész L, and Vialette S. On the combinatorics of suffix arrays. Inf Process Lett, 113(22–24):915–920, 2013.
- 14. Kuhnle A, Mun T, Boucher C, Gagie T, Langmead B, and Manzini G. Efficient construction of a complete index for pan-genomics read alignment. J Comput Biol, 27(4):500–513, 2020.
- 15. Langmead B, Trapnell C, Pop M, and Salzberg S. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol, 10:R25, 2009.
- 16. Li H and Durbin R. Fast and accurate short read alignment with Burrows–Wheeler Transform. Bioinformatics, 25(14):1754–1760, 2009.
- 17. Mäkinen V and Navarro G. Succinct suffix arrays based on run-length encoding. Nord J Comput, 12:40–66, 2005.
- 18. Mäkinen V, Välimäki N, Laaksonen A, and Katainen A. Algorithms and Applications. Springer-Verlag, Berlin, Heidelberg, 2010.
- 19. Mäkinen V, Navarro G, Sirén J, and Välimäki N. Storage and retrieval of highly repetitive sequence collections. J Comput Biol, 17(3):281–308, 2010.
- 20. Manber U and Myers GW. Suffix arrays: a new method for on-line string searches. SIAM J Comput, 22(5):935–948, 1993.
- 21. Mantaci S, Restivo A, Rosone G, and Sciortino M. An extension of the Burrows-Wheeler Transform. Theor Comput Sci, 387(3):298–312, 2007.
- 22. Navarro G. Compact Data Structures: A Practical Approach. Cambridge University Press, 2016.
- 23. Nishimoto T and Tabei Y. Optimal-time queries on BWT-runs compressed indexes. In Proc. of the 48th International Colloquium on Automata, Languages, and Programming, ICALP 2021, volume 198 of LIPIcs, pages 101:1–101:15.
- 24. Nong G, Zhang S, and Chan WH. Two efficient algorithms for linear time suffix array construction. IEEE Trans Comput, 60(10):1471–1484, 2011.
- 25. Policriti A and Prezza N. LZ77 computation based on the run-length encoded BWT. Algorithmica, 80:1986–2011, 2017.
- 26. Rossi M, Oliva M, Langmead B, Gagie T, and Boucher C. MONI: A pangenomics index for finding MEMs. In Proc. of the 25th Annual International Conference on Research in Computational Molecular Biology, RECOMB 2021, 2021.
- 27. Sun C et al. RPAN: rice pan-genome browser for 3000 rice genomes. Nucleic Acids Res, 45(2):597–605, 2017.
- 28. The 1001 Genomes Consortium. Epigenomic Diversity in a Global Collection of Arabidopsis thaliana Accessions. Cell, 166(2):492–505, 2016.
- 29. Turnbull C et al. The 100,000 genomes project: bringing whole genome sequencing to the NHS. Br Med J, 361, 2018.
