Abstract
Advancements in Next-Generation Sequencing (NGS) have significantly reduced the cost of generating DNA sequence data and increased the speed of data production. However, such high-throughput data production has increased the need for efficient data analysis programs. One of the most computationally demanding steps in analyzing sequencing data is mapping the short reads produced by NGS to a reference DNA sequence, such as a human genome. The mapping program BWA-MEM and its newer version BWA-MEM2, both optimized for CPUs, are among the most popular choices for this task. In this study, we discuss the implementation of BWA-MEM on GPUs. This is a challenging task because many algorithms and data structures in BWA-MEM do not execute efficiently on the GPU architecture. This paper identifies the major challenges in developing efficient GPU code for all major stages of the BWA-MEM program, including seeding, seed chaining, Smith-Waterman alignment, memory management, and I/O handling. We conduct comparison experiments against BWA-MEM and BWA-MEM2 running on a 64-thread CPU. The results show that our implementation achieved up to 3.2x speedup over BWA-MEM2 and up to 5.8x over BWA-MEM when using an NVIDIA A40. Using an NVIDIA A6000 and an NVIDIA A100, we achieved wall-time speedups of up to 3.4x/3.8x over BWA-MEM2 and up to 6.1x/6.8x over BWA-MEM, respectively. In a stage-wise comparison, the A40/A6000/A100 GPUs respectively achieved up to 3.7x/3.8x/4x, 2x/2.3x/2.5x, and 3.1x/5x/7.9x speedup on the three major stages of BWA-MEM: seeding and seed chaining, Smith-Waterman, and SAM output generation. To the best of our knowledge, this is the first study that attempts to implement the entire BWA-MEM program on GPUs.
Keywords: NGS alignment, GPU, Massively parallel algorithms, BWA
1. INTRODUCTION
Next-Generation Sequencing (NGS), also known as Massively Parallel Sequencing, refers to technologies that read nucleotide sequences from biological samples at high throughput. It has led to many innovations and discoveries in the biomedical field.
In the process of generating and analyzing NGS data, sequencing incurs the highest monetary cost, while data interpretation is the most time-consuming. A typical NGS data analysis pipeline consists of four processes: Alignment, Analysis, Annotation, and Report Generation [13]. Raw NGS data contains many short reads produced by randomly dividing the original sequence into fragments. Aligning these short reads to a reference sequence to determine their original positions is a crucial and time-consuming task in the analysis process. Various algorithms and data structures have been developed to solve the alignment problem since 2005. Benchmarks and comparative studies of these mapping programs have also been reported. There is a consensus that algorithmic improvements have plateaued [26] and a prediction that further enhancements will come from the use of parallel and distributed computing and hardware acceleration, such as graphics processing units (GPUs), Field Programmable Gate Arrays (FPGAs), and many-core co-processors [19].
GPUs are seemingly suitable for accelerating read mapping thanks to their parallel processing capability and high memory bandwidth. However, due to their unique hardware architecture, there are significant challenges in implementing a read-mapping program on GPUs. In our first attempt to implement BWA-MEM, one of the most popular read mappers, on GPUs, we translated its source code into CUDA code with minor modifications. Executing such code on an NVIDIA A40 GPU is more than three times slower than the original BWA-MEM running on a 64-core AMD EPYC 7662 CPU. GPU-specific issues such as code divergence, warp inefficiency, and inefficient memory accesses are the main factors leading to poor performance. Therefore, an efficient GPU implementation requires a good understanding of the BWA-MEM program and a careful redesign of its algorithms and data structures.
This paper presents what we believe to be the first systematic study toward an efficient design and implementation of the entire BWA-MEM program on GPUs. Previous studies on this topic focused only on specific components of the BWA-MEM workflow (e.g., Smith-Waterman alignment). This study identifies the main bottlenecks in all stages of the BWA-MEM computational pipeline on GPUs and presents novel approaches to address them.
It is important to note that this study focuses on accelerating the BWA-MEM program without changing the actual computation performed by the original program. As a result, our GPU code yields the same results as the original BWA-MEM code. Although there are many NGS aligners, BWA-MEM stands out as one of the most successful due to its high performance and accuracy [29]. Moreover, many programs use similar (if not identical) algorithms and data structures to those found in BWA-MEM; they can all benefit from the lessons learned in this study.
The results of our experiments on NGS data demonstrate that our GPU implementation provides significant speedup over BWA-MEM2, the state-of-the-art CPU-optimized version of BWA-MEM. An NVIDIA A40 GPU, whose cost is similar to that of the AMD EPYC 7662 CPU and which was installed in the same machine, achieved up to 3.2x speedup in wall time over BWA-MEM2 running on that CPU. Similarly, an NVIDIA A6000 and an NVIDIA A100 GPU achieved up to 3.4x and 3.8x speedup in wall time, respectively, compared to BWA-MEM2. We evaluated the speedup on the three major stages of BWA-MEM: seeding and seed chaining, Smith-Waterman, and output creation. With an NVIDIA A40 GPU, we achieved up to 3.7x, 2.1x, and 3.1x speedup, respectively. Using an NVIDIA A6000 and an NVIDIA A100 GPU, we obtained up to 3.8x/2.3x/5x and 4x/2.5x/7.9x speedup, respectively, on the three stages. Notably, we compared our implementation against Clara Parabricks, NVIDIA’s proprietary implementation of BWA-MEM on GPUs, and found that its throughput was slightly lower than BWA-MEM2’s across all datasets. Hence, we still consider BWA-MEM2 the state-of-the-art implementation for mapping short reads produced by NGS to a reference sequence. Our study highlights the great potential of GPUs for efficient data analysis.
The remainder of this paper is organized as follows. Section 2 discusses the history and background of BWA-MEM and previous studies on its hardware acceleration; Section 3 presents the major challenges in implementing BWA-MEM on GPUs and how we addressed them; Section 4 presents experimental comparisons among our implementation, BWA-MEM, BWA-MEM2, and Clara Parabricks, as well as the effects of our optimization techniques; we conclude the paper in Section 5.
2. BACKGROUND AND PREVIOUS WORK
Next-Generation Sequencing (NGS) has given rise to the read mapping problem: aligning billions of short DNA reads (ranging from 50 to 400 bases each) to their original positions on a reference genome. This process is critical for detecting genetic variations and sequencing errors and for extracting useful information from the sequencing data. The challenge arises because the miniaturized, highly parallel platforms used in NGS generate data at enormous throughput, making it essential to align these short reads efficiently to a previously assembled reference genome. Read mapping is thus a fundamental step in NGS data analysis.
The Smith-Waterman algorithm [30] can provide the exact solution to the mapping problem. However, this approach is highly inefficient because it has a time complexity of O(N × M × L), where N is the length of the reference, M is the number of short reads, and L is the read length. The alignment problem involves the popular subproblem of searching a large text for exact matches of a string. Two classic index systems for solving this problem are the Prefix Tree [33] and the Suffix Array [22]. While the Prefix Tree provides a fast O(L) search time, where L is the length of the queried string, it requires space quadratic in the length of the reference. The Suffix Array, on the other hand, requires space linear in the reference length but takes O(L log N) time for searching. Ferragina and Manzini [5] developed the FM index, based on mathematical results on the Burrows-Wheeler Transform (BWT), to address these limitations. This powerful index achieves O(L) search time while its space scales almost linearly with the reference length. The main result can be summarized as follows: given that S is a substring of an indexed reference, along with all of its match positions, then by prepending a letter b to S we can compute in constant time O(1) whether the resulting string bS is also a substring of the indexed reference, along with all of its match positions. The FM index is the foundation of many popular read-alignment software packages such as BWA, BWA-MEM, SOAPv2, and Bowtie.
There are now more than 90 read-alignment programs available, among which the most well-known and trusted are BWA, BWA-MEM, Bowtie, NovoAlign, and SOAPv2 [29]. BWA was the first alignment software to utilize the FM index. BWA-MEM was proposed to address the challenges of longer read lengths by introducing more efficient seed-and-extend heuristics [17]. Bowtie is another program that claims a small memory footprint and provides user options to adjust the trade-off between speed and accuracy [15]. NovoAlign is a proprietary product of Novocraft that uses a hash-based index system [7]. SOAPv2 also utilizes the BWT and is designed for analyzing single nucleotide polymorphisms [20]. Benchmarking studies have found that NovoAlign, SOAPv2, and Bowtie consistently show the highest computational efficiency, whereas BWA-MEM shows outstanding mapping sensitivity [29].
Researchers have spent considerable effort optimizing these read-alignment programs for high-performance computing environments and accelerator hardware. In this study, we focus on BWA-MEM because its high accuracy and high computational demand make hardware acceleration more impactful. The BWA-MEM program consists of three parts: finding potential locations (called seeds) using the FM index, extending and scoring seeds using the Smith-Waterman algorithm, and producing SAM output. Several studies have discussed FPGA implementations [3, 8, 12, 14, 24, 25] and GPU implementations [9–11, 16]. However, all of these studies targeted only the Smith-Waterman part. In 2019, Vasimuddin et al. [32] presented a new version of BWA-MEM, named BWA-MEM2, that optimized all parts of the original program in terms of memory and SIMD instruction utilization. Running BWA-MEM2 and BWA-MEM on a single core, they reported up to 3.5x speedup in end-to-end compute time. NVIDIA also developed a proprietary software package named Clara Parabricks [1] that implements BWA-MEM and other genomic data analysis tools on GPUs. Its implementation details have yet to be published. A recent study [28] showed that it achieved 8x, 16x, and 28x speedup over the original BWA-MEM running on an 18-core Intel Xeon Gold 6154 when using 2, 4, and 8 Tesla P100 GPUs, respectively.
BWA-MEM2 and Clara Parabricks are the latest hardware-acceleration efforts for BWA-MEM, but we consider BWA-MEM2 the state of the art for two reasons. First, its implementation details and source code are public, allowing easy verification that it closely follows the original BWA-MEM design. Second, our experiments in Section 4 show that BWA-MEM2 has slightly higher throughput than Clara Parabricks across all datasets. We follow the same approach as BWA-MEM2 to design a GPU-based version that optimizes BWA-MEM’s performance while maintaining the same results. However, the challenges discussed in the BWA-MEM2 study [32], such as code divergence, memory management, and I/O handling, become significantly more complex on the GPU architecture. We discuss these challenges in detail and propose solutions in Section 3.
3. IMPLEMENTING BWA-MEM ON GPUS
This section discusses the challenges of implementing BWA-MEM on GPUs and how we addressed those issues. The BWA-MEM program consists of the following stages:
Seeding: Find seeds that are exact matches between the read and the reference sequence. BWA-MEM searches for seeds using the FM index, which allows finding exact sequence matches in time linear in the length of the sequence.
Suffix array lookup: Translate the FM-index results into coordinates on the reference sequence.
Seed chaining: Seeds that are close and colinear to each other on both the read and the reference are chained. The chains are used for filtering seeds and optimizing the next stage.
Smith-Waterman alignment: Extend the seeds in both directions and calculate the matching score based on the Smith-Waterman algorithm. The best alignments, which specify insertions, deletions, and mismatches, are produced during this stage.
Output creation: Format the alignment information from Smith-Waterman, such as chromosome, coordinate, and matching score, into a string that follows the SAM specification [18].
Among these stages, suffix array lookup and output creation require the least modification to execute efficiently on GPUs. They are both memory-bound processes that benefit from modern GPUs’ high-bandwidth memory. Furthermore, they are both embarrassingly parallel tasks that can be easily implemented for each seed and each alignment. We skip the details on these two stages and focus on the seeding, seed chaining, and Smith-Waterman stages. We discuss these stages in subsections 3.1, 3.2, and 3.3, respectively. Furthermore, we address the system-level challenges of memory management and I/O handling in subsections 3.4 and 3.5.
3.1. Finding seeds with the FM Index
Ferragina and Manzini published the FM index, which allows compression of a large text with no significant slowdown in query performance [5]. They mathematically proved that the FM index can compute the index of a string bS, where b is a single character, in constant time O(1) given the index of the suffix string S. In BWA-MEM, the reference and its reverse complement are concatenated and indexed with the FM index. Searching for exact matches of a string on the FM index results in three integers (k, l, s), where k is the index of the first match, l is the index of the first match of the reverse complement, and s is the number of matches. If a string S has an FM index result of (k, l, s), then we can compute the index (k’, l’, s’) of the string bS and (k”, l”, s”) of the string Sb in constant time. We omit this computation due to space limitations; it can be found in [32], Algorithms 2 and 3. As a result, we can find all exact matches of a string X in time linear in X’s length by sequentially computing the indices of all of its suffixes or prefixes, from the shortest to the longest. After this step, we can look up the original locations of the matches by accessing an array called the Suffix Array [5] at the s consecutive positions from k to k+s-1.
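To make the search concrete, the following Python sketch (CPU-side and illustrative only; the helper names build_fm_index and backward_search are ours, not BWA-MEM’s, and a production index stores sampled occurrence counts rather than full tables) builds a toy FM index from a suffix array and performs the backward search, prepending one base per step in O(1):

```python
def build_fm_index(text):
    """Build the suffix array, BWT, C array, and occurrence table for text."""
    text = text + "$"                        # sentinel, smaller than all bases
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)   # char preceding each sorted suffix
    alphabet = sorted(set(text))
    c_arr, total = {}, 0                     # c_arr[c] = #chars smaller than c
    for ch in alphabet:
        c_arr[ch] = total
        total += text.count(ch)
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, ch in enumerate(bwt):             # occ[c][i] = #occurrences in bwt[:i]
        for a in alphabet:
            occ[a][i + 1] = occ[a][i] + (a == ch)
    return sa, c_arr, occ

def backward_search(pattern, sa, c_arr, occ):
    """Return the suffix-array interval [lo, hi) of suffixes prefixed by pattern."""
    lo, hi = 0, len(sa)
    for ch in reversed(pattern):             # prepend one base per step: O(1) each
        lo = c_arr[ch] + occ[ch][lo]
        hi = c_arr[ch] + occ[ch][hi]
        if lo >= hi:
            return 0, 0                      # no match
    return lo, hi

ref = "ACGTACGTGACG"
sa, c_arr, occ = build_fm_index(ref)
lo, hi = backward_search("ACG", sa, c_arr, occ)
positions = sorted(sa[i] for i in range(lo, hi))  # suffix array lookup step
# positions == [0, 4, 9]
```

The s consecutive suffix-array entries (here sa[lo:hi]) give the match positions, mirroring the suffix array lookup stage; BWA-MEM’s bidirectional index additionally tracks the reverse-complement interval l so that the forward extension Sb is also O(1).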
Unfortunately, the FM index does not allow mismatches. All alignment programs address this problem with heuristic processes. BWA-MEM takes an approach called seeding that finds potential locations called super-maximal exact matches (SMEMs). Maximal exact matches (MEMs) are exact matches between substrings of two sequences that cannot be stretched further in either direction. A MEM not contained in any other MEM on the short read is a SMEM. BWA-MEM finds all the SMEMs that cover each position on a read. For each position i on the read, it finds all the matches by extending forward, then extends them backward to find the SMEMs. Figure 1 illustrates this process starting at a specific location i0 in the middle of the read. Note that BWA-MEM repeats this process for all positions on a read.
Figure 1: An example of finding SMEMs at a position i0
Problem:
Implementing the FM index search on GPUs is exceptionally challenging because the algorithm causes severe warp divergence. In CUDA, the basic execution unit is a warp: a group of 32 threads scheduled and executed simultaneously by a streaming multiprocessor. The entire warp holds its computing resources until all of its threads have exited. In other words, the latency of a warp is the maximum latency among all 32 threads in the warp. The natural way to perform the FM index search is to assign one thread to execute the entire search algorithm for a read, but this approach causes severe warp divergence because the reads in a warp may have very different numbers of seeds and seed lengths. Consider a warp that computes the forward-extension phase for 32 reads, one thread per read, where 31 reads finish finding all matches after extending for only ten bases and one read can still find matches after extending for 200 bases. In this case, the entire warp has to wait for the single thread to compute all 200 extensions, wasting the resources of the other 31 threads.
Solution:
We propose three ideas to address the problem above. First, we use an entire warp to process a read as follows. We start one thread at each position on the read and extend forward only. Backward extensions are unnecessary because we already find all seeds starting at every position on the read. This approach performs redundant work because the seeds we find are not necessarily SMEMs, but the results are guaranteed to contain all the SMEMs. We can detect the non-SMEM seeds by comparing each seed to the one found at the adjacent previous position. An example is shown in Figure 2. In this example, threads 0, 1, and 2 find seeds that end at position 198 on the read, and thread 3 finds a seed spanning positions 3 to 200. The seeds found by threads 1 and 2 are discarded because they are contained within the seed found by thread 0 and thus are not SMEMs by definition. Comparisons between seeds can be performed very efficiently because threads can share seed information using CUDA shuffle instructions, which allow all threads in a warp to access data stored in registers. Such instructions are extremely efficient, with a latency of 7-12 cycles. This design created a significant improvement in the seeding stage. We measured the time spent on the seeding stage with batches of 40,000 reads, each 152 bases long. The natural one-thread-per-read design took 428 ms per batch on average with a warp efficiency of 14%, which means that 86% of a streaming multiprocessor is idle. The one-warp-per-read design took 104 ms per batch on average and achieved a warp efficiency of 49%. This is a 4x speedup, and the imperfect warp efficiency of 49% means that GPUs can achieve even greater performance with a better design.
Figure 2: A warp seeding for a read
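The forward-only design can be sketched sequentially as follows; a naive find_longest_match stands in for the FM-index forward extension, and the neighbor comparison in phase 2 is what the warp performs with shuffle instructions (all names here are illustrative, not taken from the BWA-MEM source):

```python
def find_longest_match(read, pos, reference):
    """Length of the longest prefix of read[pos:] occurring in reference
    (naive stand-in for FM-index forward extension)."""
    best = 0
    for length in range(1, len(read) - pos + 1):
        if read[pos:pos + length] in reference:
            best = length
        else:
            break
    return best

def forward_only_seeds(read, reference):
    # Phase 1: one "lane" per read position, each extending forward only.
    ends = [i + find_longest_match(read, i, reference) for i in range(len(read))]
    # Phase 2: drop seeds contained in the previous lane's seed; on the GPU
    # this neighbor comparison is done with warp shuffle instructions.
    seeds = []
    for i in range(len(read)):
        if ends[i] > i and (i == 0 or ends[i] > ends[i - 1]):
            seeds.append((i, ends[i]))   # half-open interval [start, end)
    return seeds

# "ACG" matches exactly at the start of the read and again at position 3
candidates = forward_only_seeds("ACGACG", "AAACGTTTTACGG")
# candidates == [(0, 3), (3, 6)]
```

The surviving candidates are a superset of the SMEMs: a seed starting at position i is kept only if it is not contained in the seed starting at position i−1, which is exactly the containment test described above.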
Second, we introduce an enhancement to the FM index data structure. We pre-calculate the FM index triplets (k, l, s) for all possible k-mers (biological sequences of length k) and store the results in a hash table. The seeding process then starts from the FM indices of the k-mers instead of from a single base. We choose the maximum value of k such that the hash table fits into the GPU memory remaining after accounting for the FM index, the reads, and the memory management buffer presented in subsection 3.4. For example, on the A40 GPU, this maximum value of k is 13, which reduces the total time spent on the seeding stage by a further 9% on top of the one-warp-per-read design. With more modern GPUs, whose memory capacities are much larger, this performance boost will be even more significant.
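A minimal sketch of the k-mer table idea, with plain position lists standing in for the pre-computed FM triplets and k = 3 instead of 13 for illustration (all names are ours):

```python
K = 3  # illustrative; the paper uses k = 13 on the A40

def encode(kmer):
    """Map a k-mer over {A, C, G, T} to an integer in [0, 4**k)."""
    code = 0
    for base in kmer:
        code = code * 4 + "ACGT".index(base)
    return code

def build_kmer_table(reference, k=K):
    """Precompute, for every k-mer, its match positions in the reference.
    The real table stores the FM triplet (k, l, s) instead of positions."""
    table = [[] for _ in range(4 ** k)]
    for i in range(len(reference) - k + 1):
        table[encode(reference[i:i + k])].append(i)
    return table

ref = "ACGTACGTGACG"
table = build_kmer_table(ref)
hits = table[encode("ACG")]   # seeding now starts k bases deep
```

The table has 4^k entries, which is why the feasible k is bounded by the GPU memory left over after the index, the reads, and the buffers.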
Third, we reorder reads to achieve a better cache hit rate and more efficient memory access patterns. A streaming multiprocessor on a GPU executes tens of concurrent warps that together contain tens of reads. By placing similar reads on the same streaming multiprocessor, these warps are more likely to follow similar search paths on the FM index. For example, after lexicographically sorting an entire data set of 71 million reads, we observed nearly 40% improvement on top of the first two optimization techniques. In practice, however, we cannot afford to sort 71 million reads. Instead, we load the reads in batches and sort them within each batch. Too few reads in a batch would be ineffective because neighboring reads are not similar enough, and too many reads would cause overhead. We experimented with different batch sizes and found that the effective range is between 1 and 5 million reads. In subsection 3.5, we describe a two-layer batching system in which sorting is performed on the CPU while the BWA-MEM algorithms are running on the GPU, so that sorting a batch has effectively zero extra cost.
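The within-batch sorting itself is straightforward; a simplified sketch (the batch size and read set are illustrative, and in our pipeline the sort overlaps with GPU work as described in subsection 3.5):

```python
def batched_sort(reads, batch_size):
    """Lexicographically sort reads within each batch so that similar reads
    land in nearby warps; the batch size trades locality against overhead."""
    out = []
    for start in range(0, len(reads), batch_size):
        out.extend(sorted(reads[start:start + batch_size]))
    return out

ordered = batched_sort(["TTGA", "ACGT", "GGCA", "ACGA"], batch_size=2)
# each batch of two reads is sorted independently
```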
3.2. Seed Chaining
A chain is a group of seeds that are colinear and close to each other on both the read and the reference [17]. The purpose of chaining is to achieve higher efficiency in later stages. Seeds in a chain likely belong to the same final alignment. Therefore, longer chains have higher chances of producing the best alignments, and seeds in the same chain can later be aligned together efficiently by the Smith-Waterman algorithm. Furthermore, chains that are too short and covered by another chain are unlikely to produce good alignments and are therefore discarded. An example output of the chaining process is illustrated in Figure 3. In this example, the first chain includes Seed 1 and Seed 2 because they are close to each other on both the read and the reference. Seed 3 does not belong to this chain because it is too far from Seed 2 on the reference. Similarly, Seed 5 does not belong to this chain because it is too far from Seed 2 on the read. Seed 3 and Seed 4 cannot be in a chain together because their relative positioning on the read differs from that on the reference.
Figure 3: Seed chaining example
Problem:
The implementation details of seed chaining have not been discussed in previous publications on BWA-MEM [17, 32]; the process was largely overlooked because it takes only about 6% of total processing time. However, it took approximately 17% of the wall time of our GPU implementation after we optimized the seeding stage and the Smith-Waterman stage. Like the seeding stage, the chaining algorithm suffers from severe warp divergence. As we optimize other stages of BWA-MEM, the chaining stage gains more weight in total time due to its inefficiency.
The chaining algorithm has severe warp divergence because reads in a warp may have vastly different numbers of seeds. Therefore, we must find a solution that allows threads in a warp to process all seeds from a read collaboratively. Since the chaining algorithm has not been published in detail, we explain it in Algorithm 1.
Algorithm 1: Original BWA-MEM seed chaining

input: seeds — array of seeds. Each seed has:
  • qb: starting position on read
  • qe: ending position on read
  • rb: starting position on reference
  • re: ending position on reference

output: chains of seeds such that within a chain c:
  (1) c.seeds are on the same chromosome
  (2) c.seeds[i+1].qb ≥ c.seeds[i].qb
  (3) c.seeds[i+1].rb ≥ c.seeds[i].rb
  (4) c.seeds[i+1].qb − c.seeds[i].qe ≤ threshold g
  (5) c.seeds[i+1].rb − c.seeds[i].re ≤ threshold g
  (6) abs((c.seeds[i+1].qb − c.seeds[i].qb) − (c.seeds[i+1].rb − c.seeds[i].rb)) < threshold d

1:  sort seeds by qb
2:  tree ← empty B-tree
3:  chains ← empty array
4:  procedure make_new_chain(seed)
5:      new_chain ← a new empty chain
6:      add seed to new_chain
7:      add new_chain to chains
8:      insert seed.rb into tree
9:  end procedure
10: for seed in seeds do
11:     if tree is empty then
12:         make_new_chain(seed)
13:     else
14:         search for key seed.rb on tree
15:         nearest_chain ← chain with nearest smaller key
16:         last_seed ← last seed on nearest_chain
17:         if seed and last_seed satisfy the six conditions then
18:             add seed to nearest_chain
19:         else
20:             make_new_chain(seed)
21:         end if
22:     end if
23: end for
24: return chains
In Algorithm 1, a chain is simply an array of seeds. A chain’s seeds must satisfy the six conditions outlined in Algorithm 1: (1) seeds in a chain must be from the same chromosome, (2) a seed is positioned after the preceding seed on the read, (3) likewise on the reference, (4) the gap between two consecutive seeds on the read is smaller than a threshold g, (5) likewise on the reference, and (6) the gap on the read differs from the gap on the reference by less than a threshold d. Algorithm 1 considers the seeds in order of their positions on the read (line 1). As chains are created, a B-tree is built whose keys are the chains’ positions on the reference, so that each seed can efficiently find the nearest chain (lines 14-15). The last seed on this nearest chain is then used to validate the six conditions and determine whether the seed can be added to it (lines 16-18). If such a chain does not exist or the seed cannot be appended to it, a new chain is created for this seed (lines 12 and 20). Dividing this work within a warp is challenging because seeds are processed sequentially in their order on the read and are compared against a B-tree that is updated as new chains are created. The loop on line 10 cannot be executed in parallel because each iteration depends on the state of the B-tree as updated by the previous iterations; processing seeds in parallel would create race conditions and produce nondeterministic results.
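For reference, the following sequential Python sketch mirrors Algorithm 1, with a sorted list plus binary search standing in for the B-tree (same lookup asymptotics, though insertion here is O(n) instead of O(log n)); the seed tuples and threshold values are illustrative:

```python
from bisect import bisect_right, insort

def chain_seeds_sequential(seeds, g, d):
    """Seeds are (chrom, qb, qe, rb, re) tuples; g and d are the thresholds."""
    seeds = sorted(seeds, key=lambda s: s[1])       # line 1: sort by qb
    keys, key_to_chain = [], {}                     # stand-in for the B-tree
    chains = []

    def make_new_chain(seed):
        chain = [seed]
        chains.append(chain)
        insort(keys, seed[3])                       # insert seed.rb into "tree"
        key_to_chain[seed[3]] = chain

    for seed in seeds:
        chrom, qb, qe, rb, re = seed
        pos = bisect_right(keys, rb) - 1            # nearest smaller-or-equal key
        if pos < 0:
            make_new_chain(seed)
            continue
        nearest_chain = key_to_chain[keys[pos]]
        lc, lqb, lqe, lrb, lre = nearest_chain[-1]  # last seed on nearest chain
        if (lc == chrom and qb >= lqb and rb >= lrb            # (1)-(3)
                and qb - lqe <= g and rb - lre <= g            # (4), (5)
                and abs((qb - lqb) - (rb - lrb)) < d):         # (6)
            nearest_chain.append(seed)
        else:
            make_new_chain(seed)
    return chains

s1 = ("chr1", 0, 10, 100, 110)
s2 = ("chr1", 12, 20, 112, 120)
s3 = ("chr1", 25, 35, 500, 510)   # too far away on the reference
chains = chain_seeds_sequential([s3, s1, s2], g=50, d=10)
# s1 and s2 chain together; s3 starts its own chain
```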
Solution:
We propose an alternative solution in Algorithm 2 that achieves the same result as Algorithm 1 but can be parallelized easily. The main idea of Algorithm 1 is to find, for each seed, the nearest seed it can be chained with; it achieves this by searching for the nearest chain on the B-tree and comparing the seed with the last seed on that chain. In contrast to Algorithm 1, which orders seeds by position on the read and checks condition (4) first, we sort seeds by their positions on the reference and check condition (5) first. The reason is that seed locations on the reference are much more sparse than on the read because of the reference’s size; therefore, very few seeds, if any, can pass condition (5). Given the threshold g on the distance between two seeds, we can compute the furthest reference position at which a seed can still pass condition (5), and then find all candidate seeds with a single binary search. This process can be performed independently for each seed, making our solution easily parallelizable.
In Algorithm 2, we introduce two arrays, predecessor and successor. predecessor[i] stores the index of the seed that precedes seed i on a chain, or the value i if seed i is the first seed on a chain. Similarly, successor[i] stores the index of the seed that succeeds seed i on a chain, or 0 if no such seed exists (the sentinel 0 is safe because seed 0, being first in the sorted order, is never assigned as a successor). Initially, we set predecessor[i] = i and successor[i] = 0 for all i, i.e., each seed is in a separate chain. For each seed, we perform the following tasks (lines 6-19). First, we identify the seeds that precede the current seed on the reference within the distance allowed by the threshold g. Note that we sort the seeds by rb instead of qb, so this search for potential predecessors (lines 7-8) can be a binary search. Then we check the current seed against the potential predecessors, from nearest to furthest, for conditions (2), (4), (5), and (6), stopping at the first candidate that satisfies all of them (lines 13-16). We also stop at the first failure of condition (1), because a candidate on a different chromosome means that all further candidates are also on a different chromosome (line 10). If no potential predecessor satisfies all the conditions, the current seed remains the first seed of a chain. After this process, the predecessor and successor arrays essentially form doubly linked lists of chains: predecessor[i] == i indicates the start of a chain, and successor allows us to traverse it. Lines 20-32 perform the traversal and create the chains. Figure 4 illustrates the results of searching for predecessors based on the example in Figure 3.
Figure 4: Our chaining algorithm: searching for predecessors
The tasks in Algorithm 2 can be parallelized by a warp as follows. The sorting in line 1 has been studied and optimized extensively and is available through the CUDA Thrust library [2]. The loop on line 6 can be performed by many threads independently because it does not depend on any shared data structure. Traversing the chains (lines 20-32) can be done in parallel using one thread per chain. This part will have some warp divergence because chains may have different lengths. In practice, however, chains are not long: they are delimited by mismatches or gaps between exact matches, and there cannot be many of those within a short read. Therefore, a warp can execute Algorithm 2 on the seeds of a read with minimal divergence.
Algorithm 2: Our solution for seed chaining

input: seeds — array of seeds. Each seed has:
  • qb: starting position on read
  • qe: ending position on read
  • rb: starting position on reference
  • re: ending position on reference

output: chains of seeds such that within a chain c:
  (1) c.seeds are on the same chromosome
  (2) c.seeds[i+1].qb ≥ c.seeds[i].qb
  (3) c.seeds[i+1].rb ≥ c.seeds[i].rb
  (4) c.seeds[i+1].qb − c.seeds[i].qe ≤ threshold g
  (5) c.seeds[i+1].rb − c.seeds[i].re ≤ threshold g
  (6) abs((c.seeds[i+1].qb − c.seeds[i].qb) − (c.seeds[i+1].rb − c.seeds[i].rb)) < threshold d

1:  sort seeds by rb
2:  predecessor ← array of length seeds.size()
3:  successor ← array of length seeds.size()
4:  set predecessor[i] = i for all i
5:  set successor[i] = 0 for all i
6:  for j from 1 to seeds.size() − 1 do
7:      rb_low_bound ← seeds[j].rb − read_length − g
8:      seedId_low_bound ← binary search for rb_low_bound on seeds
9:      for i from j−1 down to seedId_low_bound do
10:         if seeds[i] and seeds[j] are on different chromosomes then
11:             break
12:         end if
13:         if conditions 2, 4, 5, 6 are met then
14:             predecessor[j] ← i
15:             successor[i] ← j
16:             break
17:         end if
18:     end for
19: end for
20: chains ← empty array
21: for i from 0 to seeds.size() − 1 do
22:     if predecessor[i] == i then
23:         new_chain ← a new empty chain
24:         add seeds[i] to new_chain
25:         add new_chain to chains
26:         k ← i
27:         while successor[k] ≠ 0 do
28:             k ← successor[k]
29:             add seeds[k] to new_chain
30:         end while
31:     end if
32: end for
33: return chains
Both Algorithm 1 and Algorithm 2 run in O(n log n) time asymptotically, where n is the number of seeds. Therefore, Algorithm 2 does not introduce extra asymptotic work.
Proof of correctness:
Algorithm 2 guarantees that each seed is chained with the nearest preceding seed that satisfies all six conditions. The rb_low_bound computed in line 7 is the smallest reference position at which a seed can still satisfy condition (5), so only seeds between rb_low_bound and the current seed’s position need to be examined; condition (3) holds for all of them because the seeds are sorted by rb. Line 10 verifies condition (1), and line 13 verifies conditions (2), (4), (5), and (6). Checking potential predecessors in order of proximity (line 9) guarantees that we find the nearest seed that can be chained.
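The following sequential Python sketch mirrors Algorithm 2; on the GPU, the predecessor search (lines 6-19) runs one thread per seed and the traversal (lines 20-32) one thread per chain. The seed tuples and threshold values are illustrative:

```python
from bisect import bisect_left

def chain_seeds(seeds, read_length, g, d):
    """Seeds are (chrom, qb, qe, rb, re) tuples; g and d are the thresholds."""
    seeds = sorted(seeds, key=lambda s: s[3])       # line 1: sort by rb
    n = len(seeds)
    predecessor = list(range(n))                    # predecessor[i] == i: chain head
    successor = [0] * n                             # 0 = no successor (safe: seed 0
                                                    # is never assigned as successor)
    rbs = [s[3] for s in seeds]
    for j in range(1, n):                           # independent per seed on the GPU
        chrom_j, qb_j, qe_j, rb_j, re_j = seeds[j]
        lo = bisect_left(rbs, rb_j - read_length - g)   # lines 7-8: binary search
        for i in range(j - 1, lo - 1, -1):          # nearest candidate first
            chrom_i, qb_i, qe_i, rb_i, re_i = seeds[i]
            if chrom_i != chrom_j:                  # condition (1): stop early
                break
            if (qb_j >= qb_i                        # condition (2)
                    and qb_j - qe_i <= g            # condition (4)
                    and rb_j - re_i <= g            # condition (5)
                    and abs((qb_j - qb_i) - (rb_j - rb_i)) < d):  # condition (6)
                predecessor[j], successor[i] = i, j
                break
    chains = []                                     # lines 20-32: chain traversal
    for i in range(n):
        if predecessor[i] == i:                     # head of a chain
            chain, k = [seeds[i]], i
            while successor[k] != 0:
                k = successor[k]
                chain.append(seeds[k])
            chains.append(chain)
    return chains

s1 = ("chr1", 0, 10, 100, 110)
s2 = ("chr1", 12, 20, 112, 120)
s3 = ("chr1", 25, 35, 500, 510)   # fails condition (5) against s2
chains = chain_seeds([s1, s2, s3], read_length=150, g=50, d=10)
# s1 and s2 chain together; s3 starts its own chain
```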
3.3. Smith-Waterman
The next major computing task in BWA-MEM and many other seed-and-extend mappers is to find the best alignment beyond an exact-match seed. The Smith-Waterman algorithm [30] is a popular choice for this step. Given a scoring scheme, this dynamic programming algorithm finds the optimal alignment between two sequences. The algorithm computes a scoring matrix H whose sides are equal to the lengths of the two sequences, and each cell H[i, j] depends on the values of the three previous cells H[i − 1, j], H[i, j − 1], and H[i − 1, j − 1]. BWA-MEM and BWA-MEM2 implement a slight variation in which only the cells within a certain distance of the main diagonal are computed, because the final alignment is likely to lie near the main diagonal.
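A minimal sketch of this banded variant (with a linear gap penalty for brevity; BWA-MEM itself uses affine gap penalties and extends outward from a seed rather than aligning two whole sequences):

```python
def banded_sw(query, target, band=3, match=1, mismatch=-4, gap=-6):
    """Local alignment score, filling only cells within `band` of the diagonal."""
    rows, cols = len(query) + 1, len(target) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        lo = max(1, i - band)                 # band limits around the main diagonal
        hi = min(cols - 1, i + band)
        for j in range(lo, hi + 1):
            score = match if query[i - 1] == target[j - 1] else mismatch
            H[i][j] = max(0,                  # local alignment: never below zero
                          H[i - 1][j - 1] + score,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

The band reduces the work per cell row from O(cols) to O(band), which is what makes seed extension affordable when millions of seeds must be scored.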
Hardware acceleration for Smith-Waterman is a popular topic on both CPUs and GPUs. Notable works on CPUs include Rognes [27] and Farrar [4]; they took advantage of Intel’s SIMD instructions by packing the dynamic programming cells into 8-bit integers so that a full vector of cells can be computed simultaneously. However, 16-bit integers are often needed for longer sequences, which halves the degree of parallelism. Suzuki and Kasahara [31] are the latest to address this challenge by reformulating the Smith-Waterman recursion as a recursion of score differences; differences between neighboring cells are small enough to always fit in 8-bit integers.
The challenge on GPUs differs significantly from that on CPUs due to their architectures. NVIDIA GPUs do not have 8-bit SIMD instructions; instead, a group of 32 threads (a warp) always executes together regardless of the instruction. Therefore, the challenge on GPUs is not about fitting calculations into 8-bit integers but about data movement and resource utilization. Many GPU-based solutions have been proposed [9–12, 16, 23]. The state-of-the-art solution is Müller et al. (2022) [23], who presented an implementation that minimizes memory access using warp shuffle instructions and half-precision arithmetic.
Maleki et al. [21] also presented an interesting framework that breaks the data dependency. Suppose the scoring matrix is partitioned into k submatrices, where each submatrix requires the results of the previous ones. Maleki et al. proposed computing all submatrices in parallel by using arbitrary inputs in place of their prerequisites. After the first iteration, the top-left submatrix holds correct results, whereas the others may not. In subsequent iterations, the submatrices are recalculated using the results of their prerequisites from the previous iteration until they converge to the correct results. This approach performs up to k times the work of the traditional Smith-Waterman but is rewarded with massive parallelism. It suits large scoring matrices, which provide enough parallelism to offset the extra work, but it is not ideal for BWA-MEM because short-read alignments involve small sequences.
We introduce optimization techniques to Müller et al.’s solution [23] to further increase warp efficiency. Each warp executes in lockstep to compute the scoring matrix for aligning the short read Q and the reference sequence T, where Q is on the vertical side and T is on the horizontal side of the matrix. Computation proceeds along a wavefront passing through the characters in T. In iteration i, thread t computes cell H[t, i − t], which requires H[t, i − t − 1], H[t − 1, i − t], and H[t − 1, i − t − 1]. H[t, i − t − 1] was computed by the same thread in the previous iteration and is still in its registers. H[t − 1, i − t] was computed by the adjacent thread in the previous iteration and can be obtained through a warp shuffle instruction. Finally, H[t − 1, i − t − 1] was an input held by the adjacent thread in the previous iteration, so it can also be obtained through a warp shuffle instruction. This process is demonstrated in Figure 5(a), showing a scoring matrix computed by a warp. Cells on a diagonal line are computed together. Note that the first few diagonal lines, colored red, do not have enough cells for all 32 threads and thus create warp divergence. For example, we can only compute cell H[0, 0] in the first iteration, cells H[0, 1] and H[1, 0] in the second iteration, and so on. After that, there are enough cells for the entire warp to reach 100% efficiency. When the warp reaches the end of T, divergence happens again, as shown in Figure 5(b). After the warp passes through T, it moves on to the next tile of Q and repeats the same process. When starting the new tile, thread 0 needs the value of H[31, 0] from thread 31, but thread 31 no longer holds this value in its registers. Therefore, we allocate a shared memory array within the warp to store the results from thread 31; this array serves as input to thread 0 when it starts a new tile.
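The wavefront order described above can be sketched on the host as follows: cells on the same anti-diagonal are mutually independent, so a warp can compute them in lockstep, each thread needing only its own previous value and two values from the adjacent thread. This simplified model emulates the warp with ordinary loops and, for brevity, uses linear gap penalties rather than BWA-MEM's affine scheme.

```python
def sw_wavefront(q, t, match=1, mismatch=-4, gap=-6):
    """Anti-diagonal (wavefront) Smith-Waterman evaluation sketch.

    All cells with the same index sum i + j lie on one anti-diagonal and
    have no dependency on each other, so on a GPU thread t can compute
    H[t][d - t] in lockstep using its own register value plus two values
    shuffled from thread t - 1.  Here the warp is emulated with loops.
    """
    m, n = len(q), len(t)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    for d in range(2, m + n + 1):                    # anti-diagonal i + j
        for i in range(max(1, d - n), min(m, d - 1) + 1):
            j = d - i
            s = match if q[i - 1] == t[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,       # shuffle from t-1
                          H[i - 1][j] + gap,         # shuffle from t-1
                          H[i][j - 1] + gap)         # own previous value
    return max(max(row) for row in H)
```

Since the traversal order only reorders independent cells, the result matches a row-by-row evaluation; the short diagonals at the start and end are exactly the red regions of Figure 5 where some emulated lanes have no cell to compute.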
Figure 5:

Designing Smith-Waterman on GPUs - workload distribution within a warp
In Figure 5(c), we show an improvement that removes the regions where divergence happens. The idea is to let threads that have finished their tasks at the end of a tile start working on the next tile. For example, at the end of the first tile, thread 0 is the first to finish all of its tasks; thread 1 finishes in the next iteration while thread 0 would otherwise sit idle, and so on. To avoid idling thread 0, we let it compute the first cell H[32, 0] of the second tile, because that cell only requires H[31, 0], which has already been calculated by thread 31. Similarly, we let thread 1 compute cell H[33, 0] in the next iteration. This transition is colored yellow. With this improvement, we reach 100% warp efficiency on most of the scoring matrix, except for the top-left and bottom-right regions, raising overall warp efficiency from 89% to 99%.
3.4. Dynamic Memory Management
Challenges:
The BWA-MEM program makes many small and fragmented memory allocations through malloc calls during seeding, seed chaining, and output creation. BWA-MEM2 addressed this inefficiency by grouping small allocations into a few large contiguous allocations [32]. Small fragmented memory allocation is also a problem on GPU architectures. Making a few large contiguous allocations is likewise beneficial on GPUs, but we can achieve even more improvement with a low-level malloc redesign. Dynamic memory allocation and freeing during a GPU kernel execution is challenging due to high concurrency, thread contention, and synchronization overhead [34]. CUDA’s built-in malloc() has very high latency and is rarely used in CUDA applications [34]. A typical memory management practice in GPU programming is pre-allocating a certain amount of global memory before executing the kernel. However, in many applications, including BWA-MEM, the amount of memory needed is unknown before kernel execution. Performing the work twice is a typical strategy for resolving this problem [6]: the first pass calculates the amount of memory needed, and the second pass completes the work after that memory is allocated. However, this strategy essentially doubles the amount of computing work.
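The count-then-fill pattern can be sketched as follows; `compute` is a hypothetical stand-in for the kernel, and a prefix sum over the per-item sizes yields the write offsets for the second pass.

```python
def two_pass(items, compute):
    """Count-then-fill ("work twice") sketch for sizing output buffers.

    Pass 1 runs the computation only to measure each item's output size;
    a prefix sum turns the sizes into offsets so one exact-size buffer
    can be allocated; pass 2 recomputes and writes into reserved slots.
    """
    # Pass 1: determine how much memory each item needs.
    sizes = [len(compute(x)) for x in items]
    offsets = [0]
    for s in sizes:
        offsets.append(offsets[-1] + s)
    out = [None] * offsets[-1]          # single exact-size allocation
    # Pass 2: redo the work, writing into the reserved regions.
    for x, off, s in zip(items, offsets, sizes):
        out[off:off + s] = compute(x)
    return out
```

The doubled call to `compute` is exactly the overhead the allocator of Section 3.4 is designed to avoid.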
Algorithm 3:
superbatch_main
| input : infile: a file stream containing short reads | |
| 1: | N ← size of super-batch |
| 2: | Initialize SB1 and SB2 as empty arrays of N reads |
| 3: | while infile is not empty or SB2 is not empty do |
| 4: | f1 ← async call minibatch_main(SB2) |
| 5: | f2 ← async load N reads from disks to SB1 |
| 6: | wait f1 |
| 7: | f3 ← async reorder reads in SB1 |
| 8: | wait f2 and f3 |
| 9: | swap SB1 ↔ SB2 |
| 10: | end while |
Algorithm 4:
minibatch_main
| input : SB: super-batch containing reads for processing | |
| 1: | n ← size of minibatch |
| 2: | MB1 and MB2 ← empty arrays of n reads on host |
| 3: | MB1G and MB2G ← empty arrays of n reads on GPU |
| 4: | while end of SB not reached or MB2 is not empty do |
| 5: | f1 ← async copy SAM MB1G → MB1 → disks |
| 6: | f2 ← async call BWA_MEM_GPU_main(MB2G) |
| 7: | wait f1 |
| 8: | f3 ← async copy n reads SB → MB1 → MB1G |
| 9: | wait f2 and f3 |
| 10: | swap MB1 ↔ MB2 and MB1G ↔ MB2G |
| 11: | end while |
Solution:
We exploit a characteristic of BWA-MEM to design a simple yet efficient memory allocator: the application processes reads in independent batches and memory allocated for one batch is no longer needed after processing the batch. This means we can allocate a single big chunk of memory and reuse this memory for every batch. When a thread needs memory allocation, we shift the pointer to the free memory region on the big chunk. A thread does not need to free the memory allocated to it. Instead, all memory allocations are freed at once at the end of the batch processing by simply resetting the pointer. Figure 6 presents a small example with two threads.
Figure 6:

Memory allocation design
This approach is not a traditional memory allocator, but it is very efficient: we keep no metadata because we never free small memory fragments individually. It is a good fit for alignment programs because they process reads in batches, produce a lot of intermediate data, and then discard all of it. There are two significant drawbacks. First, the preallocated chunk must be large enough for a batch; this can be mitigated by reducing the batch size, and recent GPUs have significantly more memory. Second, threads must shift the free pointer with atomic operations or locks, and thousands of threads contending on a single pointer can become a major bottleneck. A simple solution is splitting the big memory chunk into several smaller chunks and dividing accesses to these chunks evenly among threads.
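A minimal host-side sketch of this design follows, assuming a lock in place of the GPU's atomic add and a hypothetical `(chunk, offset)` handle as the return value; the real allocator hands out raw device pointers.

```python
import threading

class BumpAllocator:
    """Pointer-shift ("bump") allocator sketch mirroring the design above.

    The pool stands in for the preallocated GPU chunk: `alloc` only
    advances an offset (an atomic add on a real GPU, a lock here), and
    `reset` frees every batch allocation at once.  Splitting the pool
    into several chunks spreads contention, as suggested in the text.
    """
    def __init__(self, capacity, num_chunks=4):
        self.chunk_size = capacity // num_chunks
        self.offsets = [0] * num_chunks          # one bump pointer per chunk
        self.locks = [threading.Lock() for _ in range(num_chunks)]

    def alloc(self, nbytes, thread_id):
        c = thread_id % len(self.offsets)        # spread threads over chunks
        with self.locks[c]:                      # atomicAdd on a real GPU
            start = self.offsets[c]
            if start + nbytes > self.chunk_size:
                raise MemoryError("chunk exhausted; shrink the batch")
            self.offsets[c] = start + nbytes
        return (c, start)                        # (chunk, offset) handle

    def reset(self):
        """Free everything at once by rewinding the bump pointers."""
        self.offsets = [0] * len(self.offsets)
```

Since no per-allocation metadata exists, `reset` is O(number of chunks) regardless of how many allocations the batch made.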
3.5. I/O Handling
At first glance, I/O on a GPU-based BWA-MEM is more costly than on a CPU-based one because it requires extra transfers to copy the short reads from host memory to GPU memory and to copy the output from GPU memory to host memory. The reordering strategy discussed in subsection 3.1 adds even more overhead. Furthermore, for the reordering strategy to be effective, the number of reads in a batch must be sufficiently large, far more than the number of reads a GPU can process at once given its memory limit. We address all of these challenges by designing a two-layer batching system.
The main idea of our system is to load a large number of reads from disk (a super-batch) for reordering and split each super-batch into small batches (mini-batches) sized to the number of reads the GPU can process at a time. While the GPU is processing all mini-batches in a super-batch, we perform the disk I/O and reordering for the next super-batch. While the GPU is processing a single mini-batch, we perform all the I/O for the output of the previous mini-batch. This two-layer system keeps the GPU busy at all times; because I/O and reordering take less time than processing a super-batch/mini-batch, they add zero overhead to our program.
The two-layer system is presented in detail in Algorithms 3 and 4. superbatch_main handles disk I/O, reorders reads, and calls minibatch_main asynchronously. minibatch_main handles host-GPU I/O and SAM output I/O and calls the main BWA-MEM process on the GPU asynchronously. Figure 7 illustrates the first three iterations of superbatch_main (a) and the first four iterations of minibatch_main (b). In superbatch_main, a whole minibatch_main call takes longer than the disk I/O and reordering combined. In minibatch_main, GPU processing takes longer than the host-GPU I/O and disk output I/O combined. Therefore, this system adds zero extra cost to GPU processing. Two CPU threads are utilized: one for superbatch_main and another for minibatch_main.
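The double-buffered mini-batch loop of Algorithm 4 can be sketched on the host as below, with caller-supplied stand-ins for the host-to-GPU copy (`load`), the GPU kernel (`process`), and the SAM output I/O (`store`); the real implementation uses CUDA streams and dedicated CPU threads rather than a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

def pipeline(batches, load, process, store):
    """Double-buffering sketch of the mini-batch loop (Algorithm 4).

    While buffer b2 is being processed, the next batch is loaded into
    b1 concurrently; the previous result is stored, and the buffers
    swap roles.  One extra iteration with a None batch drains b2.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        b1, b2 = None, None                              # the two buffers
        for batch in batches + [None]:
            f_proc = pool.submit(process, b2) if b2 is not None else None
            f_load = pool.submit(load, batch) if batch is not None else None
            if f_proc is not None:
                store(f_proc.result())       # wait for kernel, write output
            b1 = f_load.result() if f_load is not None else None
            b1, b2 = None, b1                # swap: loaded buffer is next
```

Because `process` and `load` are submitted before either result is awaited, the kernel for batch i overlaps the load of batch i+1, mirroring the asynchronous calls on lines 5–9 of Algorithm 4.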
Figure 7:

Asynchronous I/O actions
4. EXPERIMENTS
4.1. Experimental setup
In this section, we describe the experimental setup used to compare the performance of four tools: BWA-MEM-GPU, BWA-MEM (v0.7.17), BWA-MEM2 (v2.2.1), and Clara Parabricks (v4.0.1). The experiments were conducted on machines within a computing cluster. BWA-MEM and BWA-MEM2 ran on an AMD EPYC 7662 using all 64 cores. BWA-MEM-GPU and Clara Parabricks ran on three types of NVIDIA GPUs: A40, A6000, and A100. Because it is challenging to compare the amount of resources between CPUs and GPUs, we provide a simple means of evaluation by comparing their market prices. The AMD EPYC 7662 CPU and the A40 GPU have comparable retail prices, while the A6000 and the A100 are slightly more expensive.
We compiled BWA-MEM and BWA-MEM2 with GCC 8.3 on Red Hat Enterprise Linux 7.9, while BWA-MEM-GPU was compiled with GCC 8.3 and Nvidia CUDA Compiler 11.5. Clara Parabricks was downloaded as a container image and executed with the Singularity container platform. The FM index was built before running any experiments. Note that BWA-MEM-GPU uses the same index built by BWA-MEM, with no implementation of the index-building process on GPUs. We also use a single reference for all experiments, the GRCh38 for Homo sapiens, and public NGS read datasets obtained from Homo sapiens available on the National Center for Biotechnology Information (NCBI) database. The datasets are presented in Table 1 and are ordered by increasing read length.
Table 1:
Datasets from NCBI for performance evaluation
| Dataset | NCBI SRA | Read Count | Read Length |
|---|---|---|---|
| D1 | SRR16541116 | 100,280,519 | 75 |
| D2 | SRR622457 | 1,436,823,870 | 101 |
| D3 | SRR043348 | 17,862,821 | 152 |
| D4 | SRR21616113 | 10,959,922 | 251 |
| D5 | SRR726716 | 5,491,290 | 300 |
In our experiments, all machines access the reference genome, indices, and reads from a distributed storage system within the cluster through high-speed Remote Direct Memory Access (RDMA) technology, achieving a throughput of approximately 1.25 GB/s. Using network storage within a cluster for read-mapping programs is common practice: their computational demands are massive enough that I/O is not a bottleneck, and BWA-MEM, BWA-MEM2, and BWA-MEM-GPU all have mechanisms to perform I/O asynchronously. All reported results in this section are averages over 10 replicate runs to ensure statistical validity.
4.2. Wall Time Performance Evaluation
Figure 8 shows the processing throughput of the alignment programs. BWA-MEM2 is 2-3x faster than BWA-MEM. This result is consistent with the numbers reported in the previous study [32]. Clara Parabricks has slightly lower throughputs than BWA-MEM2 across all datasets. BWA-MEM-GPU outperforms BWA-MEM2 with up to 3.2x higher throughput.
Figure 8:

Throughput of BWA-MEM alignment programs
Figure 9 presents the speedup of BWA-MEM-GPU over BWA-MEM2 and BWA-MEM on all experiment runs. BWA-MEM2 is 1.8–2.5x as fast as BWA-MEM, consistent with the results presented in [32]. Note that [32] reported 2.4–3.5x on single-threaded experiments.
Figure 9:

Wall-time speedups of BWA-MEM-GPU over BWA-MEM2 and BWA-MEM.
Figure 9 shows that the A40 provides 3.6-5.8x speedup over BWA-MEM and 1.7-3.2x over BWA-MEM2; the A6000 provides 4-6.1x speedup over BWA-MEM and 1.9-3.4x over BWA-MEM2; and the A100 provides 4.5-6.8x speedup over BWA-MEM and 2.1-3.8x over BWA-MEM2. Since read length increases from D1 to D5, we can observe that the speedup of BWA-MEM-GPU is more significant for shorter reads. This is expected because longer reads create more code divergence in the SMEM seeding heuristics, which hurts GPUs more than CPUs.
4.3. Performance of key stages
We compared the time spent by BWA-MEM2 and BWA-MEM-GPU on the crucial stages of the program. To accomplish this, we incorporated checkpoints into the programs between kernels and summed the kernel timings across the batches. For instance, if a program took one second to execute the Smith-Waterman stage for each batch of reads, and there were 50 batches, then the program’s time spent on Smith-Waterman was 50 seconds. Due to the lack of synchronization points in the middle of a parallel kernel execution on CPU or GPU, we could not obtain finer-grained measurements. As a result, we could not measure the stages of BWA-MEM, which has a single large kernel that performs all three stages, and Clara Parabricks does not provide such information. BWA-MEM2 consists of three kernels: seeding and seed chaining (SEED), Smith-Waterman (SW), and producing SAM output (SAM). We introduced checkpoints between these three kernels and at corresponding positions in BWA-MEM-GPU. Figure 10 displays the time spent on SEED, SW, and SAM.
Figure 10:

Speedup of BWA-MEM-GPU over BWA-MEM2 in each stage: seeding and seed chaining (SEED), Smith-Waterman (SW), and making SAM output (SAM)
On seeding and seed chaining, BWA-MEM-GPU on the A40 GPU is 1.1 to 3.7x as fast as BWA-MEM2, the A6000 GPU is 1.1 to 3.8x as fast, and the A100 GPU is 1.2 to 4x as fast. As discussed in section 3.1, the seeding stage is very challenging for GPUs because differences in the number of seeds per read cause severe warp divergence. Similarly, the seed chaining stage is problematic because the algorithm is unsuitable for GPU execution. Nevertheless, the GPU implementation outperforms its CPU counterpart thanks to the optimization techniques we presented. The results also show that the BWA-MEM-GPU speedup in this stage is more significant with shorter reads.
On Smith-Waterman, BWA-MEM-GPU on the A40 GPU is 1.4 to 2x as fast as BWA-MEM2, the A6000 GPU is 1.5 to 2.3x as fast, and the A100 GPU is 1.9 to 2.5x as fast.
On making SAM output, BWA-MEM-GPU on the A40 GPU is 2.1 to 3.1x as fast as BWA-MEM2, the A6000 GPU is 2.3 to 5x as fast, and the A100 GPU is 5 to 7.9x as fast.
Table 2 displays the percentage of time spent on each stage, SEED, SW, and SAM, in BWA-MEM2 and BWA-MEM-GPU. We were unable to calculate these metrics for BWA-MEM and Clara Parabricks because BWA-MEM employs a single kernel for all three stages and Clara Parabricks does not provide such metrics. Based on our observations, BWA-MEM-GPU dedicates a minimal amount of time to generating SAM output since that process is well suited to GPU processing. For SEED and SW, BWA-MEM-GPU can spend a larger or smaller fraction of time than BWA-MEM2, depending on the dataset.
Table 2:
Average proportion of processing time spent in each stage
| Program | Stage | D1 | D2 | D3 | D4 | D5 |
|---|---|---|---|---|---|---|
| BWA-MEM2 | SEED | 68.9% | 69.0% | 53.2% | 46.6% | 42.5% |
| | SW | 23.3% | 23.2% | 42.4% | 49.0% | 51.4% |
| | SAM | 7.8% | 7.8% | 4.3% | 4.4% | 3.4% |
| BWA-MEM-GPU | SEED | 52.3% | 62.8% | 57.0% | 39.3% | 61.2% |
| | SW | 40.5% | 31.1% | 41.2% | 58.3% | 37.6% |
| | SAM | 7.3% | 6.1% | 1.8% | 2.4% | 1.2% |
4.4. Effect of asynchronous I/O
We measure the effect of the asynchronous I/O design in section 3.5 by running BWA-MEM-GPU with synchronous I/O and with asynchronous I/O, and with all other optimizations applied. The results are presented in Table 3.
Table 3:
Wall time (seconds) of BWA-MEM-GPU with and without asynchronous I/O
| Device | I/O method | D1 | D2 | D3 | D4 | D5 |
|---|---|---|---|---|---|---|
| A40 | Synchronous | 121 | 157 | 89 | 136 | 62 |
| | Asynchronous | 115 | 149 | 84 | 131 | 59 |
| | Improvement | 5.2% | 5.4% | 6% | 3.8% | 5.1% |
| A6000 | Synchronous | 116 | 153 | 86 | 126 | 59 |
| | Asynchronous | 110 | 144 | 80 | 120 | 56 |
| | Improvement | 5.5% | 6.3% | 7.5% | 5% | 5.4% |
| A100 | Synchronous | 104 | 133 | 74 | 105 | 53 |
| | Asynchronous | 98 | 124 | 69 | 99 | 49 |
| | Improvement | 6.1% | 7.3% | 9.4% | 6.1% | 8.2% |
The improvement in wall time is consistent across all runs, reaching up to 6% with the A40, 7.5% with the A6000, and 9.4% with the A100. The gain is more significant on more advanced GPUs: as faster processors reduce computing time, the relative cost of I/O grows, making asynchronous I/O more critical.
In addition, we compare the wall time with the time spent on the major computing stages SEED, SW, and SAM combined. Figure 11 illustrates the percentage of time spent on these three computing stages over the wall time. Our findings indicate that BWA-MEM-GPU spends 89–95% of its time on the major computing tasks, whereas BWA-MEM2 only spends 70–80%. This higher efficiency is achieved by performing all high-latency I/O tasks concurrently with the major computing tasks on GPUs. Moreover, we achieved this level of efficiency while sorting within each super-batch, indicating that we can enhance the efficiency of the seeding stage significantly without any sorting overhead.
Figure 11:

Amount of time spent on computing tasks with BWA-MEM2 and BWA-MEM-GPU
4.5. Effect of memory management
We measure the effect of the memory management method proposed in Section 3.4 by running two versions of BWA-MEM-GPU: one using CUDA’s built-in malloc and the other using the proposed method. The results, presented in Table 4, show more than an order of magnitude improvement over CUDA’s built-in malloc.
Table 4:
Wall time (seconds) of BWA-MEM-GPU when using CUDA’s built-in malloc and the proposed malloc
| Device | Memory Management | D1 | D2 | D3 | D4 | D5 |
|---|---|---|---|---|---|---|
| A40 | NVIDIA malloc | 3506 | 4249 | 2820 | 4120 | 1912 |
| | Proposed malloc | 115 | 149 | 84 | 131 | 59 |
| A6000 | NVIDIA malloc | 3029 | 4043 | 2705 | 3902 | 1801 |
| | Proposed malloc | 110 | 144 | 80 | 120 | 56 |
| A100 | NVIDIA malloc | 1940 | 2994 | 2184 | 3048 | 1259 |
| | Proposed malloc | 98 | 124 | 69 | 99 | 49 |
5. CONCLUSION AND FUTURE WORK
This paper introduces the successful implementation of BWA-MEM on GPUs and highlights the challenges that previous studies have not addressed. While previous research has only discussed implementing the Smith-Waterman stage on GPUs, we have implemented the entire program on GPUs and provided solutions and optimization techniques for the more challenging stages of the program, including seeding, seed chaining, memory management, and I/O handling. We achieved significant speedup in wall time, ranging from 3.2x to 3.8x when using an NVIDIA A40/A6000/A100 GPU compared to BWA-MEM2 running on a 64-thread AMD EPYC 7662 CPU. In stage-wise comparison, the A40, A6000, and A100 GPUs achieved up to 3.7x-4x, 2x-2.5x, and 3.1x-7.9x speedup on the three major stages of BWA-MEM, namely seeding and seed chaining, Smith-Waterman, and generating SAM output, respectively. The paper presents a promising hardware acceleration tool for BWA-MEM and proposes solutions to address the challenges associated with its implementation on GPUs.
As a future research direction, we aim to enhance the seeding stage in our current implementation, which has only achieved 49% warp efficiency. To achieve higher warp efficiency, we plan to explore improving the FM index search or utilizing a different index system more suitable for GPUs. Additionally, we aim to investigate more efficient methods to reorder the seeds for better similarity among reads in a multiprocessor instead of relying on sorting. These research directions can potentially further optimize BWA-MEM performance on GPUs and improve the accuracy of read mapping.
CCS CONCEPTS.
• Computing methodologies → Massively parallel algorithms.
ACKNOWLEDGEMENTS
X. Lv is supported by a Tianshan Talent-Young Science and Technology Talent Project (2022TSYCCX0060) of China; M. Pham and Y. Tu are supported by an award (1R01GM140316) from the National Institutes of Health (NIH), USA.
Contributor Information
Minh Pham, University of South Florida, Tampa, FL, USA.
Yicheng Tu, University of South Florida, Tampa, FL, USA.
Xiaoyi Lv, Xinjiang University, Ürümqi, China.
REFERENCES
- [1]. [n.d.]. NVIDIA Clara Parabricks. https://developer.nvidia.com/claraparabricks
- [2]. Bell Nathan and Hoberock Jared. 2012. Thrust: A productivity-oriented library for CUDA. In GPU Computing Gems Jade Edition. Elsevier, 359–371.
- [3]. Chen Yu-Ting, Cong Jason, Lei Jie, and Wei Peng. 2015. A novel high-throughput acceleration engine for read alignment. In 2015 IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines. IEEE, 199–202.
- [4]. Farrar Michael. 2007. Striped Smith–Waterman speeds database searches six times over other SIMD implementations. Bioinformatics 23, 2 (2007), 156–161.
- [5]. Ferragina Paolo and Manzini Giovanni. 2000. Opportunistic data structures with applications. In Proceedings 41st Annual Symposium on Foundations of Computer Science. IEEE, 390–398.
- [6]. He Bingsheng, Yang Ke, Fang Rui, Lu Mian, Govindaraju Naga, Luo Qiong, and Sander Pedro. 2008. Relational joins on graphics processors. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. 511–524.
- [7]. Hercus C and Albertyn Z. 2012. Novoalign. Selangor: Novocraft Technologies (2012).
- [8]. Houtgast Ernst Joachim, Sima Vlad-Mihai, Bertels Koen, and Al-Ars Zaid. 2015. An FPGA-based systolic array to accelerate the BWA-MEM genomic mapping algorithm. In 2015 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS). IEEE, 221–227.
- [9]. Houtgast Ernst Joachim, Sima Vlad-Mihai, Bertels Koen, and Al-Ars Zaid. 2016. GPU-accelerated BWA-MEM genomic mapping algorithm using adaptive load balancing. In International Conference on Architecture of Computing Systems. Springer, 130–142.
- [10]. Houtgast Ernst Joachim, Sima Vlad-Mihai, Bertels Koen, and Al-Ars Zaid. 2018. Comparative analysis of system-level acceleration techniques in bioinformatics: A case study of accelerating the Smith-Waterman algorithm for BWA-MEM. In 2018 IEEE 18th International Conference on Bioinformatics and Bioengineering (BIBE). IEEE, 243–246.
- [11]. Houtgast Ernst Joachim, Sima Vlad-Mihai, Bertels Koen, and Al-Ars Zaid. 2018. Hardware acceleration of BWA-MEM genomic short read mapping for longer read lengths. Computational Biology and Chemistry 75 (2018), 54–64.
- [12]. Houtgast Ernst Joachim, Sima Vlad-Mihai, Marchiori Giacomo, Bertels Koen, and Al-Ars Zaid. 2016. Power-efficiency analysis of accelerated BWA-MEM implementations on heterogeneous computing platforms. In 2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig). IEEE, 1–8.
- [13]. Iacoangeli Alfredo, Al Khleifat A, Sproviero William, Shatunov A, Jones AR, Morgan SL, Pittman A, Dobson RJ, Newhouse SJ, and Al-Chalabi A. 2019. DNAscan: personal computer compatible NGS analysis, annotation and visualisation. BMC Bioinformatics 20, 1 (2019), 1–10.
- [14]. Kieu-Do-Nguyen Binh, Pham-Quoc Cuong, and Pham Cong-Kha. 2021. High-Performance FPGA-Based BWA-MEM Accelerator. International Journal of Machine Learning and Computing 11, 3 (2021).
- [15]. Langmead Ben and Salzberg Steven L. 2012. Fast gapped-read alignment with Bowtie 2. Nature Methods 9, 4 (2012), 357–359.
- [16]. Lévy Jonathan. 2019. Acceleration of Seed Extension for BWA-MEM DNA Alignment Using GPUs. (2019).
- [17]. Li Heng. 2013. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997 (2013).
- [18]. Li Heng, Handsaker Bob, Wysoker Alec, Fennell Tim, Ruan Jue, Homer Nils, Marth Gabor, Abecasis Goncalo, and Durbin Richard. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 16 (2009), 2078–2079.
- [19]. Li Heng and Homer Nils. 2010. A survey of sequence alignment algorithms for next-generation sequencing. Briefings in Bioinformatics 11, 5 (2010), 473–483.
- [20]. Li Ruiqiang, Yu Chang, Li Yingrui, Lam Tak-Wah, Yiu Siu-Ming, Kristiansen Karsten, and Wang Jun. 2009. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25, 15 (2009), 1966–1967.
- [21]. Maleki Saeed, Musuvathi Madanlal, and Mytkowicz Todd. 2016. Efficient parallelization using rank convergence in dynamic programming algorithms. Commun. ACM 59, 10 (2016), 85–92.
- [22]. Manber Udi and Myers Gene. 1993. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing 22, 5 (1993), 935–948.
- [23]. Müller André, Schmidt Bertil, Membarth Richard, Leißa Roland, and Hack Sebastian. 2022. AnySeq/GPU: A Novel Approach for Faster Sequence Alignment on GPUs. In Proceedings of the 36th ACM International Conference on Supercomputing (ICS ’22). Association for Computing Machinery, New York, NY, USA, Article 20, 11 pages. 10.1145/3524059.3532376
- [24]. Pham-Quoc Cuong, Kieu-Do Binh, and Tran Ngoc Thinh. 2021. A high-performance FPGA-based BWA-MEM DNA sequence alignment. Concurrency and Computation: Practice and Experience 33, 2 (2021), e5328.
- [25]. Pham-Quoc Cuong, Kieu-Do-Nguyen Binh, and Tran Ngoc Thinh. 2018. An FPGA-based seed extension IP core for BWA-MEM DNA alignment. In 2018 International Conference on Advanced Computing and Applications (ACOMP). IEEE, 1–6.
- [26]. Reinert Knut, Langmead Ben, Weese David, and Evers Dirk J. 2015. Alignment of next-generation sequencing reads. Annual Review of Genomics and Human Genetics 16 (2015), 133–151.
- [27]. Rognes Torbjørn and Seeberg Erling. 2000. Six-fold speed-up of Smith–Waterman sequence database searches using parallel processing on common microprocessors. Bioinformatics 16, 8 (2000), 699–706.
- [28]. Rosati Stefano. 2020. Comparison of CPU and Parabricks GPU Enabled Bioinformatics Software for High Throughput Clinical Genomic Applications. Ph.D. Dissertation. Marquette University.
- [29]. Ruffalo Matthew, LaFramboise Thomas, and Koyutürk Mehmet. 2011. Comparative analysis of algorithms for next-generation sequencing read alignment. Bioinformatics 27, 20 (2011), 2790–2796.
- [30]. Smith Temple F, Waterman Michael S, et al. 1981. Identification of common molecular subsequences. Journal of Molecular Biology 147, 1 (1981), 195–197.
- [31]. Suzuki Hajime and Kasahara Masahiro. 2018. Introducing difference recurrence relations for faster semi-global alignment of long sequences. BMC Bioinformatics 19, 1 (2018), 33–47.
- [32]. Vasimuddin Md, Misra Sanchit, Li Heng, and Aluru Srinivas. 2019. Efficient architecture-aware acceleration of BWA-MEM for multicore systems. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 314–324.
- [33]. Willard Dan E. 1984. New trie data structures which support very fast search operations. J. Comput. System Sci. 28, 3 (1984), 379–394.
- [34]. Winter Martin, Parger Mathias, Mlakar Daniel, and Steinberger Markus. 2021. Are dynamic memory managers on GPUs slow? A survey and benchmarks. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 219–233.
