Published in final edited form as: Res Comput Mol Biol. 2020 Apr 21;12074:37–53. doi: 10.1007/978-3-030-45257-5_3

A Randomized Parallel Algorithm for Efficiently Finding Near-Optimal Universal Hitting Sets

Barış Ekim 1,2, Bonnie Berger 1,2, Yaron Orenstein 3

Abstract

As the volume of next-generation sequencing data increases, an urgent need for algorithms to efficiently process the data arises. Universal hitting sets (UHS) were recently introduced as an alternative to the central idea of minimizers in sequence analysis, with the hope that they could more efficiently address common tasks such as computing hash functions for read overlap, sparse suffix arrays, and Bloom filters. A UHS is a set of k-mers that hits every sequence of length L, and it can thus serve as indices to L-long sequences. Unfortunately, methods for computing small UHSs are not yet practical for real-world sequencing instances due to their serial and deterministic nature, which leads to long runtimes and high memory demands when handling typical values of k (e.g. k > 13). To address this bottleneck, we present two algorithmic innovations to significantly decrease runtime while keeping memory usage low: (i) we leverage advanced theoretical and architectural techniques to parallelize and decrease memory usage in calculating k-mer hitting numbers; and (ii) we build upon techniques from randomized Set Cover to select universal k-mers much faster. We implemented these innovations in PASHA, the first randomized parallel algorithm for generating near-optimal UHSs, which newly handles k > 13. We demonstrate empirically that PASHA produces sets only slightly larger than those of serial deterministic algorithms; moreover, the set size is provably guaranteed to be within a small constant factor of the optimal size. PASHA improves runtime and memory usage over the current best algorithms by orders of magnitude. We expect our newly practical construction of UHSs to be adopted in many high-throughput sequence analysis pipelines.

Keywords: Universal hitting sets, Parallelization, Randomization

1. Introduction

The NIH Sequence Read Archive [8] currently contains over 26 petabases of sequence data. Increased use of sequence-based assays in research and clinical settings creates a high computational processing burden; metagenomics studies generate even larger sequencing datasets [17, 19]. New computational ideas are essential to manage and analyze these data. To this end, researchers have turned to k-mer-based approaches to more efficiently index datasets [7].

Minimizer techniques were introduced to select k-mers from a sequence to allow efficient binning of sequences such that some information about the sequence’s identity is preserved [18]. Formally, given a sequence of length L and an integer k, its minimizer is the lexicographically smallest k-mer in it. The method has two key advantages: selected k-mers are close; and similar k-mers are selected from similar sequences. Minimizers were adopted for biological sequence analysis to design more efficient algorithms, both in terms of memory usage and runtime, by reducing the amount of information processed, while not losing much or any information [12]. The minimizer method has been applied in a large number of settings [4, 6, 20].

Orenstein and Pellow et al. [14, 15] generalized and improved upon the minimizer idea by introducing the notion of a universal hitting set (UHS). For integers k and L, a set U_{k,L} is called a universal hitting set of k-mers if every possible sequence of length L contains at least one k-mer from U_{k,L}. Note that a UHS for any given k and L only needs to be computed once. Their heuristic DOCKS finds a small UHS in two steps: (i) remove a minimum-size set of vertices from a complete de Bruijn graph of order k to make it acyclic; and (ii) remove additional vertices to eliminate all (L−k)-long paths. The removed vertices comprise the UHS. The first step was solved optimally, while the second required a heuristic. The method is limited by runtime to k ≤ 13, and thus applicable to only a small subset of minimizer scenarios. Recently, Marçais et al. [10] showed that there exists an algorithm to compute a set of k-mers that covers every path of length L in a de Bruijn graph of order k. This algorithm gives an asymptotically optimal solution for a value of k approaching L. Yet this condition is rarely the case for real applications, where 10 ≤ k ≤ 30 and 100 ≤ L ≤ 300. The results of Marçais et al. show that for k ≤ 30, the results are far from optimal for fixed L. A more recent method by DeBlasio et al. [3] can handle larger values of k, but with L ≤ 21, which is impractical for real applications. Thus, it is still desirable to devise faster algorithms to generate small UHSs.

Here, we present PASHA (Parallel Algorithm for Small Hitting set Approximation), the first randomized parallel algorithm to efficiently generate near-optimal UHSs. Our novel algorithmic contributions are twofold. First, we improve upon the process of calculating vertex hitting numbers, i.e. the number of (L−k)-long paths they go through. Second, we build upon a randomized parallel algorithm for Set Cover to substantially speed up removal of k-mers for the UHS—the major time-limiting step—with a guaranteed approximation ratio on the k-mer set size. PASHA performs substantially better than current algorithms at finding a UHS in terms of runtime, with only a small increase in set size; it is consequently applicable to much larger values of k. Software and computed sets are available at: pasha.csail.mit.edu and github.com/ekimb/pasha.

2. Background and Preliminaries

Preliminary Definitions

For k ≥ 1 and a finite alphabet Σ, a directed graph B_k = (V, E) is a de Bruijn graph of order k if V and E represent k- and (k+1)-long strings over Σ, respectively. An edge may exist from vertex u to vertex v if the (k−1)-suffix of u is the (k−1)-prefix of v. For any edge (u,v) ∈ E with label 𝓛, the labels of vertices u and v are the prefix and suffix of length k of 𝓛, respectively. If a de Bruijn graph contains all possible edges, it is complete, and the set of edges represents all possible (k+1)-mers. An 𝓁 = (L−k)-long path in the graph, i.e. a path of 𝓁 edges, represents an L-long sequence over Σ (for further details, see [1]).

For any L-long string s over Σ, k-mer set M hits s if there exists a k-mer in M that is a contiguous substring of s. Consequently, a universal hitting set (UHS) U_{k,L} is a set of k-mers that hits any L-long string over Σ. A trivial UHS is the set of all k-mers, but due to its size (|Σ|^k), it does not reduce the computational expense for practical use. Note that a UHS for any given k and L does not depend on a dataset, but rather needs to be computed only once.
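To make the definition concrete, the hitting test itself is straightforward; the following minimal Python sketch (ours, with arbitrary example values, not code from the paper) checks whether a k-mer set hits a given string:

```python
def hits(M, s, k):
    """Return True if k-mer set M hits string s, i.e. some k-mer of s is in M."""
    return any(s[i:i + k] in M for i in range(len(s) - k + 1))

# Toy example over the DNA alphabet (illustrative values only):
M = {"ACG", "TTT"}
print(hits(M, "GGACGTA", 3))  # True: "ACG" occurs at position 2
print(hits(M, "GGGGGGG", 3))  # False: no 3-mer of the string is in M
```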

Although the problem of computing a universal hitting set has no known hardness results, there are several NP-hard problems related to it. In particular, the problem of computing a universal hitting set is highly similar, although not identical, to the (k,L)-hitting set problem, which is the problem of finding a minimum-size k-mer set that hits an input set of L-long sequences. Orenstein and Pellow et al. [14, 15] proved that the (k,L)-hitting set problem is NP-hard, and consequently developed the near-optimal DOCKS heuristic. DOCKS relies on the Set Cover problem, which is the problem of finding a minimum-size collection of subsets S_1, …, S_k of a finite set U whose union is U.

The DOCKS Heuristic

DOCKS first removes from a complete de Bruijn graph of order k a decycling set, turning the graph into a directed acyclic graph (DAG). This set of vertices represents a set of k-mers that hits all sequences of infinite length. A minimum-size decycling set can be found by Mykkeltveit's algorithm [13] in O(|Σ|^k) time. Even after all cycles, which represent sequences of infinite length, are removed from the graph, there may still be paths representing sequences of length L, which also need to be hit by the UHS. DOCKS removes an additional set of k-mers that hits all remaining sequences of length L, so that no path representing an L-long sequence, i.e. a path of length 𝓁 = L − k, remains in the graph.

However, finding a minimum-size set of vertices to cover all paths of length 𝓁 in a directed acyclic graph (DAG) is NP-hard [16]. In order to find a small, but not necessarily minimum-size, set of vertices to cover all 𝓁-long paths, Orenstein and Pellow et al. [14, 15] introduced the notion of a hitting number, the number of 𝓁-long paths containing vertex v, denoted by T(v,𝓁). DOCKS uses the hitting number to prioritize removal of vertices that are likely to cover a large number of paths in the graph. This, in fact, is an application of the greedy method for the Set Cover problem, thus guaranteeing an approximation ratio of O(1 + log(max_v T(v,𝓁))) on the removal of additional k-mers.

The hitting numbers for all vertices can be computed efficiently by dynamic programming: For any vertex v and 0 ≤ i ≤ 𝓁, DOCKS calculates the number of i-long paths starting at v, D(v,i), and the number of i-long paths ending at v, F(v,i). Then, the hitting number is directly computable by

T(v,𝓁) = Σ_{i=0}^{𝓁} F(v,i) · D(v,𝓁−i)    (1)

and the dynamic programming calculation in graph G=(V,E) is given by

∀v ∈ V: D(v,0) = F(v,0) = 1;  D(v,i) = Σ_{(v,u)∈E} D(u,i−1);  F(v,i) = Σ_{(u,v)∈E} F(u,i−1)    (2)

Overall, DOCKS performs two main steps: First, it finds and removes a minimum-size decycling set, turning the graph into a DAG. Then, it iteratively removes vertex v with the largest hitting number T(v,𝓁) until there are no 𝓁-long paths in the graph. DOCKS is sequential: In each iteration, one vertex with the largest hitting number is removed and added to the UHS output, and the hitting numbers are recalculated. Since the first phase of DOCKS is solved optimally in polynomial time, the bottleneck of the heuristic lies in the removal of the remaining set of k-mers to cover all paths of length 𝓁 = L − k in the graph, which represent all remaining sequences of length L.

As an additional heuristic, Orenstein and Pellow et al. [14, 15] developed DOCKSany, with a similar structure to DOCKS, but instead of removing the vertex that hits the most (L−k)-long paths, it removes a vertex that hits the most paths of any length in each iteration. This reduces the runtime by a factor of L, as calculating the hitting number T(v) for each vertex can be done in linear time with respect to the size of the graph. DOCKSanyX extends DOCKSany by removing X vertices with the largest hitting numbers in each iteration. DOCKSany and DOCKSanyX run faster compared to DOCKS, but the resulting hitting sets are larger.

3. Methods

Overview of the Algorithm.

Similar to DOCKS, PASHA is run in two phases: First, a minimum-size decycling set is found and removed; then, an additional set of k-mers that hits remaining L-long sequences is removed. The removal of the decycling set is identical to that of DOCKS; however, in PASHA we introduce randomization and parallelization to efficiently remove the additional set of k-mers. We present two novel contributions to efficiently parallelize and randomize the second phase of DOCKS. The first contribution leads to a faster calculation of hitting numbers, thus reducing the runtime of each iteration. The second contribution leads to selecting multiple vertices for removal at each iteration, thus reducing the number of iterations to obtain a graph with no (L−k)-long paths. Together, the two contributions provide orthogonal improvements in runtime.

Improved Hitting Number Calculation

Memory Usage Improvements.

We reduce memory usage through algorithmic and technical advances. Instead of storing the number of i-long paths for 0 ≤ i ≤ 𝓁 in both F and D, we apply the following approach (Algorithm 1): We compute D for all v ∈ V and 0 ≤ i ≤ 𝓁. Then, while computing the hitting numbers, we calculate F one iteration at a time. For this aim, we define two arrays, F_curr and F_prev, to store only two instances of the i-long path counts for each vertex: those of the current and previous iterations. For some j, we compute F_curr based on F_prev, add F_curr(v) · D(v,𝓁−j) to the hitting number sum, and set F_prev = F_curr. Lastly, we increase j and repeat the procedure, accumulating the hitting numbers iteratively. This approach allows the replacement of matrix F, which consists of 𝓁+1 arrays, by just two arrays, F_curr and F_prev. Therefore, we are able to reduce memory usage by close to half, with no change in runtime.

To further reduce memory usage, we use the float variable type (of size 4 bytes) instead of the double variable type (of size 8 bytes). The numbers of paths kept in F and D increase exponentially with i, the length of the paths. To be able to use the 8-bit exponent field, we initialize F and D to the minimum positive value representable by a float. This does not disturb the algorithm's correctness, as path counts are only scaled by some arbitrary unit value, which may be 2^−149, the smallest positive value representable by a float. This is done in order to account for the large values that path counts can reach. The remaining main memory bottleneck is matrix D, whose size is 4·4^k·(𝓁+1) bytes.

Lastly, we utilize the fact that a complete de Bruijn graph of order k is the line graph of the de Bruijn graph of order k−1. While all k-mers are represented as the set of vertices in the graph of order k, they are represented as edges in the graph of order k−1. If we remove edges of a de Bruijn graph of order k−1, instead of vertices of a graph of order k, we reduce memory usage by another factor of |Σ|. In our implementation we compute D and F for all vertices of a graph of order k−1, and calculate hitting numbers for the edges. Thus, the memory bottleneck is reduced to 4·4^(k−1)·(𝓁+1) bytes.
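The index arithmetic behind this edge representation is simple; here is a minimal sketch (our illustration, assuming the usual 2-bit DNA encoding A=0, C=1, G=2, T=3):

```python
SIGMA = 4  # DNA alphabet size

def kmer_as_edge(kmer_index, k):
    """Map a k-mer, encoded as an integer in base 4, to the edge (u, v) of the
    de Bruijn graph of order k-1: u is its (k-1)-prefix, v its (k-1)-suffix."""
    u = kmer_index // SIGMA               # drop the last character
    v = kmer_index % (SIGMA ** (k - 1))   # drop the first character
    return u, v

# Example: "ACG" = 0*16 + 1*4 + 2 = 6 maps to the edge ("AC", "CG") = (1, 6).
print(kmer_as_edge(6, 3))  # (1, 6)
```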

Runtime Reduction by Parallelization.

We parallelize the calculation of the hitting numbers to achieve a constant factor reduction in runtime. The calculation of the number of i-long paths through vertex v depends only on the previously calculated values for the (i−1)-long paths through the vertices adjacent to v (Eq. 2). Therefore, for some i, we can compute D(v,i) and F(v,i) for all vertices in V in parallel, where V is the set of vertices left after the removal of the decycling set. In addition, we can calculate the hitting number T(v,𝓁) for all vertices in V in parallel (similar to computing D and F), since the calculation does not depend on the hitting number of any other vertex (we call this parallel variant PDOCKS for the purpose of comparison with PASHA). We note that for DOCKSany and DOCKSanyX, the calculations of hitting numbers for each vertex cannot be computed in parallel, since the numbers of paths starting and ending at each vertex both depend on those of the previous vertex in topological order.
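The dependency structure behind this parallelization can be shown schematically (an illustration only, not PASHA's multithreaded implementation): the level-i counts are a single sparse matrix-vector product of the level-(i−1) counts, so all vertices of a level can be processed simultaneously; F is computed analogously with the transposed adjacency matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix

def path_counts(adj, ell):
    """D[:, i] = number of i-long paths starting at each vertex (Eq. 2).
    Each column depends only on the previous one, so every level is one
    matrix-vector product that can be evaluated for all vertices at once."""
    n = adj.shape[0]
    D = np.zeros((n, ell + 1), dtype=np.float32)
    D[:, 0] = 1.0
    for i in range(1, ell + 1):
        D[:, i] = adj @ D[:, i - 1]  # D(v, i) = sum over out-edges (v, u) of D(u, i-1)
    return D

# Tiny DAG 0 -> 1 -> 2: exactly one 2-long path, starting at vertex 0.
adj = csr_matrix(np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]], dtype=np.float32))
print(path_counts(adj, 2))
```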

Algorithm 1.

Improved hitting number calculation. Input: G=(V,E)

1: D[|V|][𝓁+1], with D[v][0] initialized to 1 for all v
2: F_curr[|V|]
3: F_prev[|V|], initialized to 1
4: T[|V|], initialized to 0
5: for 1 ≤ i ≤ 𝓁 do:
6:   for v ∈ V do:
7:     for (v,u) ∈ E do:
8:       D[v][i] += D[u][i−1]
9: for 1 ≤ i ≤ 𝓁+1 do:
10:   for v ∈ V do:
11:     F_curr[v] = 0
12:     for (u,v) ∈ E do:
13:       F_curr[v] += F_prev[u]
14:     T[v] += F_prev[v] · D[v][𝓁−i+1]
15:   F_prev = F_curr
16: return T
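A direct Python rendering of Algorithm 1 may help; this is an illustrative reimplementation on adjacency lists (our own toy input, not the PASHA source), in which the full F matrix is replaced by the two rolling arrays F_prev and F_curr:

```python
def hitting_numbers(vertices, out_edges, in_edges, ell):
    """Compute T(v, ell) as in Algorithm 1, keeping only two F arrays in memory."""
    # D[v][i]: number of i-long paths starting at v (full matrix, the memory bottleneck).
    D = {v: [1] + [0] * ell for v in vertices}
    for i in range(1, ell + 1):
        for v in vertices:
            D[v][i] = sum(D[u][i - 1] for u in out_edges[v])
    F_prev = {v: 1 for v in vertices}  # F(v, 0) = 1
    T = {v: 0 for v in vertices}
    for i in range(1, ell + 2):        # i = 1 .. ell + 1
        F_curr = {v: sum(F_prev[u] for u in in_edges[v]) for v in vertices}
        for v in vertices:
            T[v] += F_prev[v] * D[v][ell - i + 1]  # adds F(v, i-1) * D(v, ell-(i-1))
        F_prev = F_curr
    return T

# Tiny DAG a -> b -> c with ell = 2: each vertex lies on the single 2-long path.
out_e = {"a": ["b"], "b": ["c"], "c": []}
in_e = {"a": [], "b": ["a"], "c": ["b"]}
print(hitting_numbers(["a", "b", "c"], out_e, in_e, 2))  # {'a': 1, 'b': 1, 'c': 1}
```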

Parallel Randomized k-mer Selection

Our goal is to find a minimum-size set of vertices that covers all 𝓁-long paths. We can represent the remaining graph as an instance of the Set Cover problem. While the greedy algorithm for the second phase of DOCKS is serial, we will show that we can devise a parallel algorithm, which is close to the greedy algorithm in terms of performance guarantees, by picking a large set of vertices that cover nearly as many paths as the vertices that the greedy algorithm picks one by one.

In PASHA, instead of removing the vertex with the maximum hitting number in each iteration, we consider a set of vertices for removal with hitting numbers within an interval, and pick vertices in this set independently with constant probability. Considering vertices within an interval allows us to efficiently introduce randomization while still emulating the deterministic algorithm. Picking vertices independently in each iteration enables parallelization of the procedure. Our randomized parallel algorithm for the second phase of the UHS problem adapts that of Berger et al. [2] for the original Set Cover problem.

The UHS Selection Procedure.

The input includes graph G = (V,E) and randomization parameters 0 < ε ≤ 1/4 and 0 < δ ≤ 1/𝓁 (Algorithm 2). Let function calcHit() calculate the hitting numbers for all vertices and return the maximum hitting number (line 2). We set t = log_{1+ε} T_max (line 3), and run a series of steps from t, iteratively decreasing t by 1. In step t, we first calculate the hitting numbers of all vertices (line 5); then, we define vertex set S to contain the vertices with a hitting number between (1+ε)^(t−1) and (1+ε)^t for potential removal (lines 8–9).

Let P_S be the sum of all hitting numbers of the vertices in S, i.e. P_S = Σ_{v∈S} T(v,𝓁) (line 10). In each step, if the hitting number of vertex v is at least a δ^3 fraction of P_S, i.e. T(v,𝓁) ≥ δ^3·P_S, we add v to the picked vertex set V_t (lines 11–13). Vertices with a hitting number smaller than δ^3·P_S are picked pairwise independently with probability δ/𝓁. We test the vertices in pairs to impose pairwise independence: If an unpicked vertex u satisfies the probability threshold δ/𝓁, we choose another unpicked vertex v and test the same threshold δ/𝓁. If both are satisfied, we add both vertices to the picked vertex set V_t; if not, neither of them is added to the set (lines 14–16). This serves as a bound on the probability of picking a vertex. If the sum of hitting numbers of the vertices in set V_t is at least |V_t|·(1+ε)^t·(1−4δ−2ε), we add the vertices to the output set, remove them from the graph, and decrease t by 1 (lines 17–20); the next iteration runs with the decreased t. Otherwise, we rerun the selection procedure without decreasing t.

Algorithm 2.

The selection procedure. Input: G = (V,E), 0 < ε ≤ 1/4, 0 < δ ≤ 1/𝓁

1: R ← {}
2: T_max ← calcHit()
3: t ← log_{1+ε} T_max
4: while t > 0 do
5:   if calcHit() == 0 then break
6:   S ← {}
7:   V_t ← {}
8:   for v ∈ V do:
9:     if (1+ε)^(t−1) ≤ T(v,𝓁) ≤ (1+ε)^t then S ← S ∪ {v}
10:   P_S ← Σ_{v∈S} T(v,𝓁)
11:   for v ∈ S do:
12:     if T(v,𝓁) ≥ δ^3·P_S then
13:       V_t ← V_t ∪ {v}
14:   for u,v ∈ S do:
15:     if u ∉ V_t and unirand(0,1) ≤ δ/𝓁 and v ∉ V_t and unirand(0,1) ≤ δ/𝓁 then
16:       V_t ← V_t ∪ {u,v}
17:   if Σ_{v∈V_t} T(v,𝓁) ≥ |V_t|·(1+ε)^t·(1−4δ−2ε) then
18:     R ← R ∪ V_t
19:     G ← G(V ∖ V_t, E)
20:     t ← t−1
21: return R
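The body of one selection step can be sketched in Python as follows; this is a simplified illustration of Algorithm 2 (the hitting numbers, parameter values, and the way remaining vertices are paired are ours for the example), whereas the full algorithm recomputes hitting numbers between steps and iterates t downward:

```python
import random

def selection_step(T, t, eps, delta, ell):
    """One PASHA-style selection step at level t. T maps each remaining vertex to
    its hitting number T(v, ell). Returns (accepted, V_t)."""
    S = [v for v, h in T.items() if (1 + eps) ** (t - 1) <= h <= (1 + eps) ** t]
    P_S = sum(T[v] for v in S)
    V_t = {v for v in S if T[v] >= delta ** 3 * P_S}   # large hitters are always picked
    rest = [v for v in S if v not in V_t]
    for u, v in zip(rest[::2], rest[1::2]):            # pairwise random picks
        if random.random() <= delta / ell and random.random() <= delta / ell:
            V_t.update((u, v))
    threshold = len(V_t) * (1 + eps) ** t * (1 - 4 * delta - 2 * eps)
    accepted = sum(T[v] for v in V_t) >= threshold
    return accepted, V_t

# Toy usage with made-up hitting numbers (not a real de Bruijn graph):
random.seed(0)
T = {"v%d" % i: h for i, h in enumerate([9, 8, 8, 7, 2, 1])}
print(selection_step(T, t=12, eps=0.2, delta=0.1, ell=10))  # e.g. (True, {'v1', 'v2'})
```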

Performance Guarantees.

At step t, we add the selected vertex set V_t to the output set if Σ_{v∈V_t} T(v,𝓁) ≥ |V_t|·(1+ε)^t·(1−4δ−2ε). Otherwise, we rerun the selection procedure with the same value of t. We show in Appendix A that with high probability, Σ_{v∈V_t} T(v,𝓁) ≥ |V_t|·(1+ε)^t·(1−4δ−2ε). We also show that PASHA produces a cover at most α·(1 + log T_max) times the optimal size, where α = 1/(1−4δ−2ε). In Appendix B, we give the asymptotic number of selection steps and prove the average runtime complexity of the algorithm. Performance summaries in terms of theoretical runtime and approximation ratio are in Table 1.

Table 1.

Summary of theoretical results for the second phase of different algorithms for generating a set of k-mers hitting all L-long sequences. PDOCKS is DOCKS with the improved hitting number calculation, i.e. greedy removal of one vertex at each iteration. p_D and p_DA denote the total number of picked vertices for DOCKS/PDOCKS and DOCKSany, respectively. m denotes the number of parallel threads used, T_max the maximum vertex hitting number, and ε and δ PASHA's randomization parameters.

Algorithm: DOCKS | PDOCKS | DOCKSany | PASHA
Theoretical runtime: O((1+p_D)·|Σ|^(k+1)·L) | O((1+p_D)·|Σ|^(k+1)·L/m) | O((1+p_DA)·|Σ|^(k+1)) | O(L^2·|Σ|^(k+1)·log^2(|Σ|^k)/(ε·δ^3·m))
Approximation ratio: 1 + log T_max | 1 + log T_max | N/A | (1 + log T_max)/(1 − 4δ − 2ε)

4. Results

PASHA Outperforms Extant Algorithms for k ≤ 13

We compared PASHA and PDOCKS to extant methods on several combinations of k and L. We ran DOCKS, DOCKSany, PDOCKS, and PASHA for 5 ≤ k ≤ 10; DOCKSanyX, PDOCKS, and PASHA for k = 11 with X = 10; and PASHA and DOCKSanyX for k = 12, 13 with X = 100 and 1000, respectively; all for 20 ≤ L ≤ 200. We say that an algorithm is limited by runtime if for some value of k ≤ 13 and for L = 100 its runtime exceeds 1 day (86,400 s), in which case we stopped the run and excluded the method from the results for the corresponding value of k. While running PASHA, we set δ = 1/𝓁 and 1 − 4δ − 2ε = 1/2 to set an emulation ratio α = 2 (see Sect. 3 and Appendix A). The methods were benchmarked on a 24-CPU Intel Xeon Gold (2.10 GHz) with 754 GB of RAM. We ran all tests using all available cores (m = 24 in Table 1).
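For concreteness, the settings δ = 1/𝓁 and 1 − 4δ − 2ε = 1/2 determine ε = 1/4 − 2/𝓁. For example (our arithmetic, not values stated above), at k = 13 and L = 100 we have 𝓁 = L − k = 87, so δ ≈ 0.0115 and ε ≈ 0.227, and the emulation ratio is α = 1/(1 − 4δ − 2ε) = 2.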

Comparing Runtimes and UHS Sizes.

We ran DOCKS, PDOCKS, DOCKSany, and PASHA for k = 10 and 20 ≤ L ≤ 200. As seen in Fig. 1A, DOCKS has a significantly higher runtime than the parallel variant PDOCKS, while producing identical sets (Fig. 1B). For small values of L, DOCKSany produces the largest UHSs compared to the other methods, and as L increases, the differences in both runtime and UHS size among all methods decrease, since there are fewer k-mers to add to the removed decycling set to produce a UHS.

Fig. 1. Runtimes (left) and UHS sizes (divided by 10^4, right) for k = 10 (A, B), 11 (C, D), 12 (E, F), and 13 (G, H) and 20 ≤ L ≤ 200 for the different methods. Note that the y-axes for runtimes are in logarithmic scale.

We ran PDOCKS, DOCKSany10, and PASHA for k = 11 and 20 ≤ L ≤ 200. As seen in Fig. 1C, for small values of L, both PDOCKS and DOCKSany10 have significantly higher runtimes than PASHA, while for larger L, DOCKSany10 and PASHA are comparable in their runtimes (with PASHA being negligibly slower). In Fig. 1D, we observe that PDOCKS computes the smallest sets for all values of L. Indeed, its guaranteed approximation ratio is the smallest among the three benchmarked methods. While the set sizes for all methods converge to the same value for larger L, DOCKSany10 produces the largest UHSs for small values of L, in which case PASHA and PDOCKS are preferable.

PASHA's runtime behaves differently from that of the other methods. For all methods but PASHA, runtime decreases as L increases. PASHA's runtime gradually decreases up to L = 70, at which point it starts to increase at a much slower rate. This is explained by the asymptotic complexity of PASHA (Table 1). Since computing a UHS for small L requires a larger number of vertices to be removed, the decrease in runtime with increasing L up to L = 70 is significant; however, because PASHA's asymptotic complexity is quadratic in L, we see a small increase from L = 70 to L = 200. All other methods depend linearly on the number of removed vertices, which decreases as L increases.

Despite the significant decrease in runtime of PDOCKS compared to DOCKS, PDOCKS was still limited by runtime to k ≤ 12. Therefore, we ran DOCKSany100 and PASHA for k = 12 and 20 ≤ L ≤ 200. As seen in Figs. 1E and F, both methods follow a similar trend as for k = 11, with DOCKSany100 being significantly slower and generating significantly larger UHSs for small values of L. For larger values of L, DOCKSany100 is slightly faster, while PASHA produces sets that are slightly smaller.

At k=13 we observed the superior performance of PASHA over DOCKSany1000 in both runtime and set size for all values of L. We ran DOCKSany1000 and PASHA for k=13 and 20L200. As seen in Figs. 1G and H, DOCKSany1000 produces larger sets and is significantly slower compared to PASHA for all values of L. This result demonstrates that the slow increase in runtime for PASHA compared to other algorithms for k<13 does not have a significant effect on runtime for larger values of k.

PASHA Enables UHS for k=14, 15, 16

Since all existing algorithms and PDOCKS are limited by runtime to k ≤ 13, we report the first UHSs for 14 ≤ k ≤ 16 and L = 100, computed using PASHA run on a 24-CPU Intel Xeon Gold (2.10 GHz) with 754 GB of RAM using all 24 cores. Figure 2 shows runtimes and sizes of the sets computed by PASHA.

Fig. 2. Runtimes (A) and UHS sizes (divided by 10^6) (B) for 14 ≤ k ≤ 16 and L = 100 for PASHA. Note that the y-axis for runtime is in logarithmic scale.

Density Comparisons for the Different Methods

In addition to runtimes and UHS sizes, we report values of another measure of UHS performance known as density. The density of a k-mer selection scheme M over sequence S, d(M,S,k), is the fraction of selected k-mer positions over the number of k-mers in the sequence. Formally, it is defined as

d(M,S,k) = |M(S,k)| / (|S| − k + 1)    (3)

where M(S,k) is the set of positions of the k-mers selected over sequence S.

We calculate densities for a UHS by selecting the lexicographically smallest k-mer that is in the UHS within each window of L−k+1 consecutive k-mers, since at least one k-mer is guaranteed to be in each such window. Marçais et al. [11] showed that using UHSs for k-mer selection in this manner yields smaller densities than lexicographic or random minimizer selection schemes. Therefore, we do not report comparisons between UHSs and minimizer schemes, but rather comparisons among UHSs constructed by different methods.
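As an illustration of how such densities are computed (a simplified sketch of this selection rule, not the evaluation code used for the reported results), the following selects, in every window of L−k+1 consecutive k-mers, the lexicographically smallest k-mer belonging to the UHS and reports the fraction of selected positions as in Eq. (3):

```python
def uhs_density(uhs, seq, k, L):
    """Density of the UHS-based selection scheme on seq; assumes every window of
    L-k+1 consecutive k-mers contains at least one UHS k-mer (the UHS property)."""
    w = L - k + 1
    n_kmers = len(seq) - k + 1
    selected = set()
    for start in range(n_kmers - w + 1):
        window = [(seq[i:i + k], i) for i in range(start, start + w)]
        in_uhs = [(kmer, i) for kmer, i in window if kmer in uhs]
        selected.add(min(in_uhs)[1])   # position of the smallest UHS k-mer in the window
    return len(selected) / n_kmers

# Toy example with a trivially universal set (all 1-mers), just to exercise the code:
print(uhs_density({"A", "C", "G", "T"}, "ACGTACGTACGT", k=1, L=4))  # 0.25
```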

Marçais et al. [11] also showed that the expected density of a minimizers scheme for any k and window size L−k+1 is equal to the density of the minimizers scheme on a de Bruijn sequence of order L. This allows for exact calculation of the expected density of any k-mer selection procedure. However, for 14 ≤ k ≤ 16 we calculated UHSs only for L = 100, and iterating over a de Bruijn sequence of order 100 is infeasible. Therefore, we computed the approximate expected density on long random sequences, since the density computed on such sequences converges to the expected density [11]. In addition, we computed the density of the different methods on the entire human reference genome (GRCh38).

We computed the density values of UHSs generated by PDOCKS, DOCKSany, and PASHA over 10 random sequences of length 10^6, and over the entire human reference genome (GRCh38), for 5 ≤ k ≤ 16 and L = 100, whenever a UHS was available for the (k,L) combination.

As seen in Fig. 3, the differences in both approximate expected density and density computed on the human reference genome are negligible when comparing UHSs generated by the different methods. For most values of k, DOCKS yields the smallest approximate expected density and human genome density values, while DOCKSany generally yields lower human genome density values, but higher expected density values, than PASHA. For k ≤ 6, the UHS consists only of the decycling set; therefore, density values for these values of k are identical across the different methods.

Fig. 3. Mean approximate expected density (A) and density on the human reference genome (B) for the different methods, for 5 ≤ k ≤ 16 and L = 100. Error bars represent one standard deviation from the mean across 10 random sequences of length 10^6. Density is the fraction of selected k-mer positions over the number of k-mers in the sequence.

Since there is no significant difference in the density of the UHSs generated by the different methods, other criteria, such as runtime and set size, are relevant when evaluating the performance of the methods: As k increases, PASHA produces sets whose density is only slightly smaller or larger, but that are significantly smaller in size and computed significantly faster than those of extant methods.

5. Discussion

We presented an efficient randomized parallel algorithm for generating a small set of k-mers that hits every possible sequence of length L and produces a set that is a small guaranteed factor away from the optimal set size. Since the runtimes of the DOCKS variants and PASHA depend exponentially on k, these greedy heuristics are eventually limited by runtime. However, using these heuristics in conjunction with parallelization, we are newly able to compute UHSs for values of k and L large enough for most biological applications.

The improvements in runtime for the hitting number calculation are due to parallelization of the dynamic programming phase, which is the bottleneck in sequential DOCKS variants. A minimum-size set that hits all infinite-length sequences is optimally and rapidly removed; however, the remaining sequences of length L are calculated and removed in time polynomial in the output size. We show that a constant factor reduction is beneficial in mitigating this bottleneck for practical use. In addition, we reduce the memory usage of this phase by theoretical and technical advancements. Last, we build on a randomized parallel algorithm for Set Cover to significantly speed up vertex selection. The randomized algorithm can be derandomized, while preserving the same approximation ratio, since it requires only pairwise independence of the random variables [2].

One main open problem remains from this work. Although the randomized approximation algorithm enables us to generate a UHS more efficiently, the hitting numbers still need to be calculated at each iteration, and this calculation remains the bottleneck in computing a UHS. Is there a more efficient way of calculating hitting numbers than the dynamic programming used in DOCKS and PASHA? A more efficient calculation of hitting numbers would enable PASHA to run for k > 16 in a reasonable time.

As for long reads, which are becoming more popular for genome assembly tasks, a k-mer set that hits all infinitely long sequences, as computed optimally by Mykkeltveit's algorithm [13], is enough due to the length of these long-read sequences. Still, due to the inaccuracies and high cost of long-read sequencing compared to short-read sequencing, the latter is still the prevailing method for producing sequencing data, and is expected to remain so for the near future.

We expect the efficient calculation of UHSs to lead to improvements in sequence analysis and construction of space-efficient data structures. Unfortunately, previous methods were limited to small values of k, thus allowing application to only a small subset of sequence analysis tasks. As there is an inherent exponential dependency on k in terms of both runtime and memory, efficiency in calculating these sets is crucial. We expect that the UHSs newly-enabled by PASHA for k>13 will be useful in improving various applications in genomics.

6. Conclusion

We developed PASHA, a novel randomized parallel algorithm to compute a small set of k-mers that together hit every sequence of length L. It is based on two algorithmic innovations: (i) improved calculation of hitting numbers through parallelization and memory reduction; and (ii) randomized parallel selection of additional k-mers to remove. We demonstrated the scalability of PASHA to larger values of k, up to 16. Notably, universal hitting sets need to be computed only once, and can then be used in many sequence analysis applications. We expect our algorithms to become an essential part of the sequence analysis toolkit.

Supplementary Material

Supp_material

Acknowledgments.

This work was supported by NIH grant R01GM081871 to B.B. B.E. was supported by the MISTI MIT-Israel program at MIT and Ben-Gurion University of the Negev. We gratefully acknowledge the support of Intel Corporation for giving access to the Intel® AI DevCloud platform used for part of this work.

A. Emulating the Greedy Algorithm

The greedy Set Cover algorithm was developed independently by Johnson and Lovász for unweighted vertices [5, 9]. Lovász [9] proved:

Theorem 1.

The greedy algorithm for Set Cover outputs a cover R with |R| ≤ (1 + log T_max)·|OPT|, where T_max is the maximum cardinality of a set.

We adapt a definition for an algorithm emulating the greedy algorithm for the Set Cover problem to the second phase of DOCKS [2]. We say that an algorithm for the second phase of DOCKS α-emulates the greedy algorithm if it outputs a set of vertices serially, during which it selects a vertex set A such that

|A| / |P_A| ≤ α / T_max,

where P_A is the set of 𝓁-long paths covered by A. Using this definition, we obtain a near-optimal approximation via the following theorem:

Theorem 2.

An algorithm for the second phase of DOCKS that α-emulates the greedy algorithm produces a cover R ⊆ V with |R| ≤ α·(1 + log T_max)·|OPT|, where OPT is the optimal cover.

Proof. We define the cost of covering path p as 𝒞(p) = |S| / |P_S|, where S is the set of vertices selected in the selection step in which p was covered, and P_S is the set of 𝓁-long paths covered by S. Then, Σ_{p∈P_S} 𝒞(p) = |S|.

Let P_𝓁 be the set of all 𝓁-long paths in G. A fractional cover of graph G = (V,E) is a function 𝓕: V → [0,1] s.t. for all p ∈ P_𝓁, Σ_{v∈p} 𝓕(v) ≥ 1. The optimal fractional cover 𝓕_OPT minimizes Σ_{v∈V} 𝓕_OPT(v).

Let 𝓕 be such an optimal fractional cover. The size of the cover produced is

|R| = Σ_{p∈P_𝓁} 𝒞(p) ≤ Σ_{v∈V} ( 𝓕(v) · Σ_{p∈P_v} 𝒞(p) ),

where P_v is the set of all 𝓁-long paths through vertex v.

Lemma 1.

For any v and k, there are at most αk paths p ∈ P_v such that 𝒞(p) ≥ 1/k.

Proof. Assume the contrary: then before such a path p is covered, T(v,𝓁) > αk. Thus,

|S| / |P_S| = 𝒞(p) ≥ 1/k > α / T(v,𝓁) ≥ α / T_max,

contradicting the definition of α-emulation.

Suppose we rank the T(v,𝓁) paths p ∈ P_v in decreasing order of 𝒞(p). By the lemma above, if the i-th path has cost at least 1/k, then i ≤ αk; equivalently, the i-th path has cost at most α/i. Then, we can write

Σ_{p∈P_v} 𝒞(p) ≤ Σ_{i=1}^{T(v,𝓁)} α/i = α · Σ_{i=1}^{T(v,𝓁)} 1/i ≤ α·(1 + log T(v,𝓁)) ≤ α·(1 + log T_max).

Then,

Σ_{p∈P_𝓁} 𝒞(p) ≤ Σ_{v∈V} 𝓕(v) · α·(1 + log T_max)

and finally

|R| ≤ α·(1 + log T_max)·|OPT|.

In PASHA, we ensure that in step t, the sum of vertex hitting numbers of the selected vertex set V_t is at least |V_t|·(1+ε)^t·(1−4δ−2ε). We now show that this is satisfied with high probability in each step.

Theorem 3.

With probability at least 1/2, the sum of vertex hitting numbers of the selected vertex set V_t at step t is at least |V_t|·(1+ε)^t·(1−4δ−2ε).

Proof. For any vertex v in the selected vertex set V_t at step t, let X_v be an indicator variable for the random event that vertex v is picked, and let f(X) = Σ_{v∈V_t} X_v.

Note that Var[f(X)] ≤ |V_t|·δ/𝓁, and |V_t| ≥ 𝓁/δ^3, since we are given that no vertex covers a δ^3 fraction of the 𝓁-long paths covered by the vertices in V_t. By Chebyshev's inequality, for any k ≥ 0,

Pr[ |f(X) − E[f(X)]| ≥ k·√(|V_t|·δ/𝓁) ] ≤ 1/k^2

and with probability 3/4,

(f(X) − E[f(X)])^2 ≤ 4·|V_t|^2·δ^4/𝓁^2

and

|f(X) − E[f(X)]| ≤ 2·|V_t|·δ^2/𝓁.

Let P_{V_t} denote the set of 𝓁-long paths covered by vertex set V_t. Then,

|P_{V_t}| ≥ Σ_{u∈V_t} T(u,𝓁)·X_u − Σ_{p∈P_{V_t}} Σ_{u,v∈p} X_u·X_v

We know that Σ_{u∈V_t} T(u,𝓁)·X_u ≥ f(X)·(1+ε)^(t−1), which is bounded below by ((δ − 2δ^2)·|V_t|·(1+ε)^(t−1))/𝓁. Let g(X) = Σ_{p∈P_{V_t}} Σ_{u,v∈p} X_u·X_v. Then,

E[g(X)] = Σ_{p∈P_{V_t}} E[Σ_{u,v∈p} X_u·X_v] = Σ_{p∈P_{V_t}} (𝓁 choose 2)·(δ/𝓁)^2 = Σ_{p∈P_{V_t}} (𝓁−1)·δ^2/(2𝓁) ≤ Σ_{p∈P_{V_t}} δ^2/2.

Hence, with probability at least 3/4,

g(X) ≤ 4·E[g(X)] ≤ 2·δ^2·|V_t|·(1+ε)^t

Both events hold with probability at least 1/2, and the sum of vertex hitting numbers is at least

((δ − 2δ^2)·|V_t|·(1+ε)^(t−1))/𝓁 − 2·δ^2·|V_t|·(1+ε)^t ≥ |V_t|·(1+ε)^(t−1)·(δ/𝓁 − 2δ^2/𝓁 − 2δ^2 − 2δ^2·ε) = |V_t|·(1+ε)^t·(δ/𝓁 − 2δ^2/𝓁 − 2δ^2 − 2δ^2·ε)/(1+ε) ≥ |V_t|·(1+ε)^t·(1 − 4δ − 2ε).

B. Runtime Analysis

Here, we show the number of selection steps and the average-case asymptotic runtime of PASHA.

Lemma 2.

The number of selection steps is O(log|V| · log|P_𝓁| / (ε·δ^3·m)).

Proof. The number of values of t is O(log|V|/ε), and for each value there are O(log|P_S| / (δ^3·m)) selection steps (where P_S is the sum of vertex hitting numbers of the vertex set S for that step and m is the number of threads used), since we are guaranteed to remove at least a δ^3 fraction of the paths during that step. Overall, there are O(log|V| · log|P_𝓁| / (ε·δ^3·m)) selection steps.

Theorem 4.

For φ < 1, there is an approximation algorithm for the second phase of DOCKS that runs in O(L^2·|Σ|^(k+1)·log^2(|Σ|^k) / (ε·δ^3·m)) average time, where m is the number of threads used, and produces a cover of size at most (1+φ)·(1 + log T_max) times the optimal size, where 1 + φ = 1/(1 − 4δ − 2ε).

Proof. Follows immediately from Theorem 2 and Lemma 2.

References

1. Berger B, Peng J, Singh M: Computational solutions for omics data. Nat. Rev. Genet. 14(5), 333 (2013)
2. Berger B, Rompel J, Shor PW: Efficient NC algorithms for set cover with applications to learning and geometry. J. Comput. Syst. Sci. 49(3), 454–477 (1994)
3. DeBlasio D, Gbosibo F, Kingsford C, Marçais G: Practical universal k-mer sets for minimizer schemes. In: Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, pp. 167–176. ACM (2019)
4. Deorowicz S, Kokot M, Grabowski S, Debudaj-Grabysz A: KMC 2: fast and resource-frugal k-mer counting. Bioinformatics 31(10), 1569–1576 (2015)
5. Johnson DS: Approximation algorithms for combinatorial problems. J. Comput. Syst. Sci. 9(3), 256–278 (1974)
6. Kawulok J, Deorowicz S: CoMeta: classification of metagenomes using k-mers. PLoS ONE 10(4), e0121453 (2015)
7. Kucherov G: Evolution of biosequence search algorithms: a brief survey. Bioinformatics 35(19), 3547–3552 (2019)
8. Leinonen R, Sugawara H, Shumway M, International Nucleotide Sequence Database Collaboration: The sequence read archive. Nucleic Acids Res. 39, D19–D21 (2010)
9. Lovász L: On the ratio of optimal integral and fractional covers. Discret. Math. 13(4), 383–390 (1975)
10. Marçais G, DeBlasio D, Kingsford C: Asymptotically optimal minimizers schemes. Bioinformatics 34(13), i13–i22 (2018)
11. Marçais G, Pellow D, Bork D, Orenstein Y, Shamir R, Kingsford C: Improving the performance of minimizers and winnowing schemes. Bioinformatics 33(14), i110–i117 (2017)
12. Marçais G, Solomon B, Patro R, Kingsford C: Sketching and sublinear data structures in genomics. Annu. Rev. Biomed. Data Sci. 2, 93–118 (2019)
13. Mykkeltveit J: A proof of Golomb's conjecture for the de Bruijn graph. J. Comb. Theory 13(1), 40–45 (1972)
14. Orenstein Y, Pellow D, Marçais G, Shamir R, Kingsford C: Compact universal k-mer hitting sets. In: Frith M, Storm Pedersen CN (eds.) WABI 2016. LNCS, vol. 9838, pp. 257–268. Springer, Cham (2016). doi: 10.1007/978-3-319-43681-4_21
15. Orenstein Y, Pellow D, Marçais G, Shamir R, Kingsford C: Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing. PLoS Comput. Biol. 13(10), e1005777 (2017)
16. Paindavoine M, Vialla B: Minimizing the number of bootstrappings in fully homomorphic encryption. In: Dunkelman O, Keliher L (eds.) SAC 2015. LNCS, vol. 9566, pp. 25–43. Springer, Cham (2016). doi: 10.1007/978-3-319-31301-6_2
17. Qin J, et al.: A human gut microbial gene catalogue established by metagenomic sequencing. Nature 464(7285), 59 (2010)
18. Roberts M, Hayes W, Hunt BR, Mount SM, Yorke JA: Reducing storage requirements for biological sequence comparison. Bioinformatics 20(18), 3363–3369 (2004)
19. Turnbaugh PJ, Ley RE, Hamady M, Fraser-Liggett CM, Knight R, Gordon JI: The human microbiome project. Nature 449(7164), 804 (2007)
20. Ye C, Ma ZS, Cannon CH, Pop M, Douglas WY: Exploiting sparseness in de novo genome assembly. BMC Bioinform. 13(6), S1 (2012)
