iScience. 2022 Oct 8;25(11):105273. doi: 10.1016/j.isci.2022.105273

BOA: A partitioned view of genome assembly

Xiaojing An 1,7,8, Priyanka Ghosh 2,7, Patrick Keppler 3, Sureyya Emre Kurt 4, Sriram Krishnamoorthy 6, Ponnuswamy Sadayappan 4, Aravind Sukumaran Rajam 3, Ümit V Çatalyürek 1,5, Ananth Kalyanaraman 3,
PMCID: PMC9593263  PMID: 36304115

Summary

De novo genome assembly is a fundamental problem in computational molecular biology that aims to reconstruct an unknown genome sequence from a set of short DNA sequences (or reads) obtained from the genome. The relative ordering of the reads along the target genome is not known a priori, which is one of the main contributors to the complexity of the assembly process. In this article, with the dual objective of improving assembly quality and exposing a high degree of parallelism, we present a partitioning-based approach. Our framework, BOA (bucket-order-assemble), uses bucketing alongside graph- and hypergraph-based partitioning techniques to produce a partial ordering of the reads. This partial ordering enables us to divide the read set into disjoint blocks that can be independently assembled in parallel using any state-of-the-art serial assembler of choice. Experimental results show that BOA improves both the overall assembly quality and performance.

Subject areas: Genomics, Bioinformatics, High-performance computing in bioinformatics, Algorithms


Highlights

  • A graph/hypergraph partitioning-based method to improve assembly quality and runtime

  • Bucketing and graph/hypergraph partitioning to partition reads into blocks

  • Each block is then independently assembled using any standalone assembler

  • Hypergraph variant produces more precise contigs and is faster than state-of-the-art assemblers



Introduction

In de novo genome assembly, the relative ordering and orientation of the input reads along the target genome is not known a priori. In fact, it can be argued that one of the primary contributors to the problem complexity is the lack of this information; i.e., if the ordering and orientation of the reads were known at input, then the genome assembly problem would reduce to the simpler (albeit less exciting) problem of performing a linear sequence of pairwise alignments between adjacent reads to produce the assembly. However, DNA sequencers preserve neither the genomic coordinates from which the reads were sequenced nor any significant relative ordering information between the reads (except for paired-end read information). Consequently, assembly algorithms are left to infer an ordering and orientation over the course of their respective computations.

Different assembly approaches vary in how much they rely on the read ordering and orientation (henceforth abbreviated as OO for simplicity) information, and at what stages of their algorithms they try to infer it. De Bruijn graph assemblers Compeau et al. (2011); Medvedev and Pop (2021); Pevzner et al. (2001), which now represent a dominant segment of modern-day short-read assemblers, use an approach that is largely oblivious to OO information. This is because these assemblers use de Bruijn graphs that break the reads into shorter, fixed-length k-mers in the early stages of the algorithm. Therefore, the information on how the reads are ordered/oriented along the target genome is typically not recoverable until the end of the assembly pipeline (i.e., until after contigs are generated). On the other hand, the more traditional overlap-layout-consensus (OLC) class of assemblers Li et al. (2012); Medvedev and Pop (2021); Pop (2009) is more explicit in trying to infer the OO information in its assembly pipeline, as the overlap phase aligns reads against one another with the intent of arriving at a read layout. And yet, because the overlap phase is also the most time-consuming step of the OLC assembly pipeline, the OO information is practically not available until the later stages of the assembly.

In this article, we ask a simple question: what if either a total (ideal but not practical) or at least a partial ordering of the reads could be generated earlier in the assembly computation? (Here, the notion of a total ordering implies that the relative ordering between every pair of reads is established, whereas in a partial order the relative ordering is established only for a subset of read pairs.) Could that help improve performance and/or assembly quality? If so, what are some of the ways to generate such OO information earlier in the assembly algorithm, and what are their assembly efficacies?

Contributions

To address the above questions, we present a parallel assembly framework that uses a graph partitioning-centric approach. Graph partitioning Garey et al. (1974) is a classical optimization problem in graph theory that aims to partition the set of vertices of an input graph into a pre-determined number of parts in a load-balanced manner. The problem has seen decades of research and application in numerous contexts, including the parallel processing of graph workloads Hendrickson and Kolda (2000), as well as the partitioning of assembly graphs Pell et al. (2012) and read datasets Al-Okaily (2016); Jammula et al. (2017).

In this article, we exploit graph partitioning and its properties to produce a partial ordering of reads and in the process also enable parallelization of the assembly workload. More specifically:

  • We cast the assembly problem in two forms: a) one that uses graph partitioning, and b) another that uses hypergraph partitioning.

  • To enable the application of different types of partitioning, we propose a lightweight bucketing algorithm that bins reads into buckets based on fixed-length exact matches and uses the bins to generate graph/hypergraph representations suitable for partitioning.

  • Once bucketed and partitioned, each individual part can be independently assembled. This strategy allows the user to use any standalone (off-the-shelf) assembler of choice. Consequently, we call our assembly framework BOA (which stands for bucket-order-assemble). An overview is shown in Figure 1. Two implementations (i.e., concrete instantiations) of this framework are presented and evaluated: one that uses a classical graph partitioner (ParMETIS Karypis et al. (1997)), Graph-BOA, and another that uses a hypergraph partitioner (Zoltan Devine et al. (2006)), Hyper-BOA.

  • To comparatively assess the assembly efficacy of the partitioning-based approach, we also construct a benchmark Oracle assembly workflow that uses the correct read ordering available from sequencing simulators.

Figure 1. Schematic illustration of the BOA framework

Experimental results on simulated and real-world datasets demonstrate that our partitioning-based implementations a) improve the parallel performance of assembly workloads; and b) consistently improve assembly quality under several qualitative measures. In fact, on the simulated datasets, the partitioning-based approaches yield results that come closest in quality to the Oracle assemblies produced.

Results

Experimental evaluation was performed on a range of genome inputs, covering model organisms as well as human and plant chromosomal DNA, downloaded from NCBI GenBank (last accessed: November 2021). All inputs used are listed in Table 1. Short reads were generated from these reference genomes using the ART sequencing simulator Huang et al. (2012), with an average read length of 100 bp, a coverage of 100×, and paired-end read information. For the Betta genome, the ART sequencing run resulted in 86× coverage. An experiment on real-world data for D. melanogaster is presented in Section real world experiment. The QUAST Gurevich et al. (2013) tool was used to assess the quality of the output assemblies.

Table 1.

The inputs used in our experiments

Genome Size (bp) No. reads (=V) No. buckets No. pins (=N) No. edges (=E)
C. elegans 100,286,401 100,286,100 409,957,423 6,389,329,498 9,342,286,308
D. melanogaster 143,726,002 142,426,015 555,183,926 8,250,921,240 11,757,427,193
Human chr 7 160,567,423 160,567,400 620,586,298 9,651,040,529 16,009,424,797
Human chr 8 146,259,322 146,259,300 574,127,869 8,923,132,914 13,977,225,241
Human chr 10 134,758,122 134,758,100 527,306,188 8,211,994,915 13,248,263,074
Maize chr 10 152,435,371 152,313,178 469,060,854 5,869,048,129 14,305,585,805
Betta splendens 456,232,186 394,258,510 1,610,294,923 25,105,195,932 36,509,423,159

All our experiments were conducted on the NERSC Cori machine (Cray XC40), where each node has 128GB DDR4 memory and is equipped with dual 16-core 2.3 GHz Intel Haswell processors. The nodes are interconnected with the Cray Aries network using a Dragonfly topology.

The BOA framework is a three-step pipeline: (1) parallel bucketing of the input reads; (2) parallel partitioning of the reads using either hypergraph partitioning (Hyper-BOA) or graph partitioning (Graph-BOA); and (3) subsequently running a standalone assembler on each part (in parallel). For hypergraph partitioning, we use Zoltan Devine et al. (2006), and for standard graph partitioning we use ParMETIS Karypis et al. (1997). By default, for all our experiments we used k = 31, l = 8, and paired-end read information (Hyper-BOA, Graph-BOA).

For the last step of BOA, any standalone assembler can be used. In our experiments, we used MEGAHIT Li et al. (2015), Minia Chikhi and Rizk (2013) and IDBA-UD Peng et al. (2012) as three different options for assembling each block partition in the last step with k = 31. Hyper-BOA (minia) refers to the version that uses Minia; Hyper-BOA (idba-ud) uses IDBA-UD; and Hyper-BOA (megahit) uses MEGAHIT.
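To make the final step concrete, the following minimal sketch (in Python; this is not the authors' driver script) shows how one standalone assembler instance can be launched per block. The blocks/ and asm/ paths are hypothetical, and the MEGAHIT flags shown (-r for a reads file, -o for an output directory) are illustrative only; any off-the-shelf assembler could be substituted in assemble_block.

```python
import subprocess
from concurrent.futures import ProcessPoolExecutor

def assemble_block(block_id: int) -> None:
    # Run one standalone assembler instance on one block's reads.
    subprocess.run(
        ["megahit", "-r", f"blocks/part_{block_id}.fa",
         "-o", f"asm/part_{block_id}"],
        check=True)

if __name__ == "__main__":
    # Assemble K = 400 blocks, with up to 32 concurrent assembler instances.
    with ProcessPoolExecutor(max_workers=32) as pool:
        list(pool.map(assemble_block, range(400)))
```

The final assembly is then simply the union of the contigs produced across all blocks.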

As baselines for comparing our BOA assemblies, we also generated two other assemblies: (1) The Oracle assembly was generated by: i) first recording the true and total read ordering along the genome (i.e., oracle ordering) using the read coordinate information from the ART simulator; ii) then trivially block partitioning the oracle ordering of the reads into roughly equal sized blocks (or parts), with the same block size (ρ) used in the partitioning-based approaches; and iii) subsequently running Minia and MEGAHIT on each individual block. (2) In addition, we ran Minia, IDBA-UD, and MEGAHIT on the entire read set to enable a direct comparison of our partitioning-based approach against a (partitioning-free, or K=1) standalone assembler.

Qualitative evaluation

We first present a qualitative evaluation of the BOA framework alongside comparisons to the Minia, IDBA-UD, and MEGAHIT standalone assemblies and the Oracle assembly. The MEGAHIT and IDBA-UD runs used paired-end reads; Minia does not support paired-end reads. Note that the Oracle assembly is not realizable in practice and is used only as a theoretical benchmark for comparison purposes. The Minia, IDBA-UD, and MEGAHIT assemblies are meant to be representative outputs from typical state-of-the-art standalone assemblers. Table 2 shows the results for various qualitative measures, including NGA50, N50, largest alignment (in bp), genome coverage (in %), number of misassemblies, and duplication ratio. To enable a fair comparison, we set the number of parts (K) to 400 for both the Zoltan and ParMETIS runs.

Table 2.

Quality metrics for our test inputs across multiple assemblers

Input Assembler NGA50 N50 Largest Alignment (bp) Genome Coverage % Misassemblies Duplication Ratio
C. elegans Oracle (minia) 11,162 14,172 153,394 91.65 10 1.002
Oracle (megahit) 11,979 14,189 157,192 91.49 – 1.005
Minia 4,155 5,924 75,229 83.26 37 1.002
IDBA-UD 4,387 6,026 75,229 83.14 0 1.002
MEGAHIT 4,464 6,276 108,538 83.71 1 1.002
Graph-BOA (minia) 7,829 9,028 143,663 85.83 49 1.013
Hyper-BOA (minia) 11,977 12,715 158,433 89.96 19 1.013
Hyper-BOA (idba-ud) 11,116 13,404 158,433 89.91 5 1.014
Hyper-BOA (megahit) (2.5×)11,246 (1.2×)12,673 (1.3×)143,817 92.10 11 1.026
D. melanogaster Oracle (minia) 41,283 55,104 356,760 88.81 41 1.005
Oracle (megahit) 46,516 57,037 356,561 88.51 13 1.006
Minia 13,229 19,551 162,262 78.79 37 1.002
MEGAHIT 16,397 24,312 190,107 78.97 0 1.001
Graph-BOA (minia) 19,421 24,136 201,618 83.78 328 1.106
Hyper-BOA (minia) 38,923 42,048 295,288 86.16 299 1.081
Hyper-BOA (megahit) (2.4×)40,101 (1.7×)41,729 (1.8×)343,434 87.81 225 1.124
Human chr 7 Oracle (minia) 3,350 4,564 39,858 84.26 40 1.003
Oracle (megahit) 3,558 4,569 39,858 84.21 40 1.124
Minia 1,544 2,793 36,845 68.10 88 1.002
IDBA-UD 1,599 2,834 24,503 67.98 0 1.002
MEGAHIT 1,638 2,904 36,845 68.95 0 1.002
Hyper-BOA (minia) 4,124 4,385 39,314 79.54 58 1.008
Hyper-BOA (idba-ud) 3,285 4,585 39,352 79.87 0 1.010
Hyper-BOA (megahit) (2.0×)3,331 (1.5×)4,316 (1.2×)43,498 83.30 10 1.018
Human chr 8 Oracle (minia) 3,944 4,869 42,828 88.44 34 1.003
Oracle (megahit) 4,194 4,883 56,943 88.40 1 1.005
Minia 1,877 2,784 27,427 74.28 76 1.002
MEGAHIT 1,987 2,893 31,115 75.27 0 1.002
Hyper-BOA (minia) 4,379 4,569 37,028 86.02 29 1.010
Hyper-BOA (megahit) (2.0×)4,044 (1.6×)4,604 (1.5×)46,122 88.92 4 1.020
Human chr 10 Oracle (minia) 3,462 4,392 37,537 87.12 28 1.003
Oracle (megahit) 3,685 4,395 37,429 87.10 1 1.005
Minia 1,672 2,654 33,773 71.73 78 1.002
MEGAHIT 1,766 2,755 33,773 72.59 0 1.002
Hyper-BOA (minia) 3,942 4,149 42,959 83.02 41 1.007
Hyper-BOA (megahit) (1.9×)3,428 (1.5×)4,125 (1.3×)44,604 86.46 1 1.017
Maize chr 10 Oracle (minia) 841 3,906 35,657 56.33 4 1.003
Oracle (megahit) 904 3,903 35,657 56.33 0 1.005
Minia – 2,058 15,644 17.08 29 1.003
MEGAHIT – 2,134 15,645 17.34 0 1.003
Hyper-BOA (minia) – 3,629 30,306 34.23 178 1.056
Hyper-BOA (megahit) – (1.2×)2,559 (2.0×)30,664 39.64 86 1.102

The target number of reads per part (ρ) for Graph-BOA and Hyper-BOA was set to 500K. Also shown in parentheses (×) are the factors of improvement achieved by Hyper-BOA (megahit) over the corresponding standalone MEGAHIT values. Boldface entries are best values.

The results show that the Hyper-BOA implementations consistently outperform all other assemblers tested by nearly all the qualitative measures, and for almost all inputs tested. Among the Hyper-BOA implementations, Hyper-BOA (megahit) is the best. Relative to the standalone MEGAHIT assembler, Hyper-BOA (megahit) consistently improves the NGA50 values by an average of 2× and up to 2.5×, and the N50 values by an average of 1.70× and up to 2.13×, whereas the largest alignment length improves by 1.47× on average and up to 1.94×. Hyper-BOA (minia) also improves the assembly quality of its standalone counterpart Minia by similar margins. Intuitively, partitioning can help reduce noise within blocks, but there is no guarantee of this, as the bucketing step still uses exact matches to group the reads. Repetitive k-mers could still confound the partitioning process. We see the effect of these possibly noisy k-mers in the misassemblies reported by the Hyper-BOA implementations. Yet, the choice of the standalone assembler at the end of the partitioning pipeline provides a certain degree of control over these misassemblies, with IDBA-UD typically resulting in fewer misassemblies than the other assemblers.

From Table 2, we also observe that the Hyper-BOA results consistently reach 90% or more of the quality values produced by the corresponding Oracle assembly. For instance, on average, Hyper-BOA (megahit) reaches within 93% of the corresponding Oracle (megahit) NGA50 values and matches the respective largest alignment values. The largest gap is seen in Human chr 8, where Hyper-BOA (megahit)'s largest alignment is only 81% of the Oracle's value. Even in this case, however, Hyper-BOA's largest alignment is considerably larger (1.48×) than the standalone MEGAHIT value.

Interestingly, we also note in Table 2 that for two inputs, Human chr 10 and C. elegans, the largest alignment values produced by Hyper-BOA (minia) are marginally better than the Oracle values. This can happen because assembly quality is ultimately a function of the block composition that is fed into the final stage of the BOA assembly, and the block composition for Hyper-BOA could have favored longer growth of the longest contig (relative to the Oracle). The NGA50 for Hyper-BOA (minia) is also consistently better than that of Oracle (minia). Overall, these results show that partitioning helps close the gap toward the theoretically achievable peaks of total read order-aware assemblies.

Hyper-BOA versus Graph-BOA

In our results, we observed that Hyper-BOA in general significantly outperforms Graph-BOA. For C. elegans and D. melanogaster, where both results are available, we see from Table 2 that the Hyper-BOA implementations outperform Graph-BOA by all qualitative measures. This is to be expected, as the input graphs to Graph-BOA are not weighted (see the related discussion in Section graph-BOA and hyper-BOA). Note that for the remaining four inputs tested, Graph-BOA could not complete due to lack of memory. As described in Section graph-BOA and hyper-BOA, graphs can have a higher memory complexity even with the duplicate-edge reductions illustrated in Figure 2.

Figure 2. An illustrative example of our pair generation algorithm

On the left are shown four reads and two maximal matches shared among them (underlined). Let k = 3. The right panel shows a selected subset of buckets relevant to the maximal matches (along each column), and the division of the respective read sets across the different left character sets Lchar (along each row). For instance, read r1 appears in the Lchar set for t under column acc because the k-mer acc in read r1 has t as its left character. The pairs generated from each bucket are shown in the bottom panel.

Runtime performance evaluation

Table 3 shows the runtime performance of the Hyper-BOA and Graph-BOA implementations, alongside standalone Minia and MEGAHIT. The bucketing and partitioning steps are parallel, and therefore we report their parallel runtimes. For the assembly step, we report the mean processing time per block partition.

Table 3.

Runtime performance of the different assemblers

Input Assembler Parallel Bucketing (sec): max Parallel Partitioning (sec): max Assembly (sec): avg Total time (sec)
C. elegans Graph-BOA (minia) 51 180 150 381
Hyper-BOA (minia) 33 536 39 608
Hyper-BOA (megahit) 33 536 13 582
Minia 1,364
MEGAHIT 2,000
D. melanogaster Graph-BOA (minia) 81 195 51 327
Hyper-BOA (minia) 57 867 39 963
Hyper-BOA (megahit) 57 867 18 942
Minia 2,444
MEGAHIT 2,845
Human chr 7 Hyper-BOA (minia) 70 967 86 1,123
Hyper-BOA (megahit) 70 967 16 1,053
Minia 2,569
MEGAHIT 3,377
Human chr 8 Hyper-BOA (minia) 67 826 61 954
Hyper-BOA (megahit) 67 826 26 919
Minia 2,518
MEGAHIT 3,134
Human chr 10 Hyper-BOA (minia) 61 844 115 1,020
Hyper-BOA (megahit) 61 844 18 923
Minia 2,027
MEGAHIT 2,970
Maize chr 10 Hyper-BOA (minia) 51 745 220 1,016
Hyper-BOA (megahit) 51 745 19 815
Minia 3,625
MEGAHIT 3,670

The BOA implementations were run on the NERSC Cori machine with 256 cores (i.e., on 32 nodes with 8 processes per node), while the standalone Minia and MEGAHIT baselines were run in multithreaded mode on a single node with 32 cores. All times reported are in seconds.

The results in Table 3 show that the BOA implementations are significantly faster than the standalone Minia and MEGAHIT executions. For instance, Hyper-BOA (megahit) delivers speedups consistently between 3× and 4× over standalone MEGAHIT. The speedups for the Minia runs are larger.

Large-scale experiment

As one large-scale experiment, we tested Hyper-BOA (megahit) on the full assembly of the 456 Mbp Betta splendens (Siamese fighting fish) genome. Table 4 shows the key results. Consistent with the results on the smaller genomes, the Hyper-BOA implementations outperform their respective standalone assemblers; e.g., Hyper-BOA (megahit) yields a 1.3× improvement in both NGA50 and N50, a 1.7× improvement in largest alignment, and a 1.1× improvement in genome coverage over standalone MEGAHIT. The Hyper-BOA implementations also significantly reduce the time to solution; e.g., it took 2 h 52 min for standalone MEGAHIT to assemble the Betta genome, whereas Hyper-BOA (megahit) took only 30 min (i.e., a 5.69× speedup).

Table 4.

Quality and runtime performance for the Betta splendens assembly

NGA50 N50 Largest Alignment (bp) Genome Coverage % Misassemblies Duplication Ratio Total time Avg. (sec) Total time Max. (sec)
Oracle (megahit) 5,551 7,830 84,290 89.58 1,132 1.005
Minia 3,425 5,571 59,787 81.85 878 1.002 5,415
MEGAHIT 4,253 5,765 59,789 82.05 676 1.002 10,313
Graph-BOA (megahit) 4,253 6,516 76,575 84.13 916 1.010 640 663
Hyper-BOA (minia) 5,362 7,458 96,553 88.75 1,254 1.012 2,159 3,017
Hyper-BOA (megahit) 5,427 7,474 101,570 89.88 1,140 1.016 1,791 1,812

Parallel bucketing and partitioning were performed across 512 cores of NERSC Cori (64 nodes × 8 cores per node) with 1024 partitions. The runs for the baseline (standalone) Minia and MEGAHIT were executed on a shared-memory node with 32 cores. (∗ indicates that these timings could not be collected in time on the same system.)

Real world experiment

We evaluated Hyper-BOA on real-world data. More specifically, we ran Hyper-BOA (megahit) and MEGAHIT on a D. melanogaster read set (SRA accession SRX13859210) and compared the results. This is an Illumina HiSeq 4000 dataset (average read length 150 bp) containing 40.4M paired-end reads, totaling 6.1 Gbp. Similar to previous studies with real-world datasets Li et al. (2015); Chikhi and Rizk (2013), we retained only the reads that align to the reference genome, using minimap2 Li (2018) for the alignment. Following this step, we were left with 31M reads totaling 4.6 Gbp. The settings for Hyper-BOA (megahit) were the same as for the simulated D. melanogaster dataset. The results in Table 5 show that Hyper-BOA (megahit) generated an assembly comparable to standalone MEGAHIT in N50 length and largest alignment length, while achieving a 1.4× improvement in NGA50 length and a 7× improvement in runtime performance.

Table 5.

Assembly quality and runtime performance for the real world read set SRA accession SRX13859210

NGA50 N50 Largest Alignment (bp) Genome Coverage % Misassemblies Duplication Ratio Total time Avg. (sec) Total time Max. (sec)
MEGAHIT 1,566 2,651 82,462 74.29 22 1.001 1,498
Hyper-BOA (megahit) 2,147 2,668 79,365 78.62 226 1.124 227 233

Parallel bucketing and partitioning were performed across 256 cores of NERSC Cori (32 nodes × 8 cores per node) with 400 partitions. The runs for the baseline (standalone) MEGAHIT were executed on a shared-memory node with 32 cores.

Discussion

We presented a parallel assembly framework named BOA that leverages a graph/hypergraph partitioning-based approach to enforce a partial ordering and orientation of the input reads. Our experiments using three different off-the-shelf assemblers on a variety of inputs demonstrate that our Hyper-BOA implementations consistently (and significantly) improve both the assembly quality and the performance of the standalone assemblers. This work has opened up further research avenues for future exploration, including: a) understanding the effect of varying the block (or partition) sizes and modeling that as a space-time-quality trade-off problem; b) scaling up to much larger inputs and metagenomic inputs; c) incorporation of long reads as a way to guide the partitioning step; d) extensions of the BOA framework for long-read assemblies or hybrid assembly workflows; e) extension of the partitioning-based assembly approach to generate contigs that span block boundaries; and f) exploration of alternative partitioning strategies that exploit auxiliary (e.g., sequence) information.

Limitations of the study

Long reads have become increasingly available and have been shown to significantly improve assembly quality. As a framework that uses partitioning, BOA can potentially be applied to different read lengths or technologies. However, the original design reported in this article was restricted to short reads, as it is important to first demonstrate the utility of the partitioning idea on the more mature problem of short-read assembly. In this regard, some non-trivial extensions have been planned, and we believe those extensions (for long reads) will have to be part of a future manuscript.

Another limitation of BOA is the larger memory footprint incurred during the partitioning phase. One of the primary motivations for developing a distributed-memory implementation was to be able to scale up the input size by scaling up the available memory in the distributed setting. However, we note that the space required by the graph/hypergraph partitioner also needs to be factored in when determining memory requirements. To scale to larger inputs on the current evaluation system (64 compute nodes), further memory-focused optimizations will be needed.

Our current implementation does not have the capability of extending contigs beyond the boundaries of a block, whereas doing so could potentially improve assembly quality even further. This limitation exists because traditional partitioning approaches, by default, generate a disjoint partitioning (of the reads, in this case). To grow a contig beyond block boundaries, it will be important to take into account potential overlaps between reads that fall into genomically adjacent partitioned blocks. For this, we would have to sort, or at least generate an approximate ordering among, the blocks to detect blocks that are potentially adjacent along the (unknown) genome. The challenge is to ensure that such an ordering is established without introducing a risk of misassembly. Hence, this is part of our planned future work.

STAR★Methods

Key resources table

REAGENT or RESOURCE SOURCE IDENTIFIER
Deposited data

C. elegans C. elegans Sequencing Consortium NCBI GenBank assembly accession GCA_000002985.3
D. melanogaster The FlyBase Consortium/Berkeley Drosophila Genome Project/Celera Genomics NCBI GenBank assembly accession GCA_000001215.4
Human chr 7 T2T Consortium NCBI GenBank assembly accession GCA_009914755.2, GenBank sequence CP068271.1
Human chr 8 T2T Consortium NCBI GenBank assembly accession GCA_009914755.2, GenBank sequence CP068270.1
Human chr 10 T2T Consortium NCBI GenBank assembly accession GCA_009914755.2, GenBank sequence CP068268.1
Maize chr 10 MaizeGDB NCBI GenBank assembly accession GCA_902167145.1, GenBank sequence LR618883.1
Betta splendens BGI NCBI GenBank assembly accession GCA_003650155.1
Real world read set Duke University NCBI SRA, accession number SRX13859210

Software and algorithms

ART_Illumina v 2.8.5 Huang et al. (2012) RRID:SCR_006538; https://www.niehs.nih.gov/research/resources/assets/docs/artbinmountrainier2016.06.05linux64.tgz
Megahit v 1.2.9 Li et al. (2015) RRID:SCR_018551; https://github.com/voutcn/megahit/releases/download/v1.2.9/MEGAHIT-1.2.9-Linux-x86_64-static.tar.gz
Minia v 0.0.102 Chikhi and Rizk (2013) RRID:SCR_004986; https://github.com/GATB/minia/releases/download/v0.0.102/minia-v0.0.102-bin-Linux.tar.gz
IDBA v 1.1.3 Peng et al. (2012) RRID:SCR_011912; https://github.com/loneknightpy/idba/releases/download/1.1.3/idba-1.1.3.tar.gz
ParMetis v 4.0.3 Karypis et al. (1997) http://glaros.dtc.umn.edu/gkhome/fetch/sw/parmetis/parmetis-4.0.3.tar.gz
Zoltan v 3.83 Devine et al. (2006) https://github.com/sandialabs/Zoltan/archive/refs/tags/v3.83.tar.gz
QUAST v 5.1.0rc1 Gurevich et al. (2013) RRID:SCR_001228; https://github.com/ablab/quast
Minimap2 v 2.24 Li (2018) RRID:SCR_018550; https://github.com/lh3/minimap2/releases/download/v2.24/minimap2-2.24_x64-linux.tar.bz2
BOA v0 This work https://github.com/GT-TDAlab/BOA

Resource availability

Lead contact

Further information and requests for resources should be directed to and will be fulfilled by the Lead Contact, Xiaojing An (anxiaojing@gatech.edu).

Materials availability

This study did not generate new unique reagents.

Method details

Preliminaries and notation

Strings and genome assembly

Let s denote an arbitrary string over a fixed alphabet Σ, and let |s| denote the length of the string. Let s[i,j] denote the substring of s starting at index i and ending at index j. As a convention, we index strings from 1, and the ith character of s is denoted by s[i]. A k-mer is a (sub)string of length k.

Given a substring s[i,j] of s, we refer to the character immediately preceding the substring in s as its "left character" or lchar (if one exists). More specifically, lchar_i = s[i−1] if 1 < i ≤ |s|, and if i = 1, then lchar_i = B, where B ∉ Σ is used to represent a blank symbol.

The input to genome assembly is a set of n reads (denoted by R). Each read is a string over the alphabet Σ = {a, c, g, t}. We denote the reverse complemented form of a read r as rc(r). If reads are generated with paired-end information, then the two reads of the same pair are assigned consecutive read IDs i and i+1, such that the odd read ID corresponds to the forward strand read and the even read ID corresponds to the reverse strand read. We denote the set of all forward (alternatively, reverse) reads as R_f (alternatively, R_r). Note that R = R_f ∪ R_r and |R_f| = |R_r| = n/2.
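As a small illustration of this notation, the following Python sketch (an illustrative aid, not part of BOA) implements reverse complementation and the paired-end ID convention just described.

```python
COMPLEMENT = str.maketrans("acgt", "tgca")

def rc(read: str) -> str:
    """Return the reverse complement rc(r) of a read over {a, c, g, t}."""
    return read.translate(COMPLEMENT)[::-1]

def mate_id(read_id: int) -> int:
    """Return the ID of a read's paired-end mate (odd = forward, even = reverse)."""
    return read_id + 1 if read_id % 2 == 1 else read_id - 1

assert rc("acg") == "cgt"
assert mate_id(5) == 6 and mate_id(6) == 5
```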

Graph partitioning

An undirected graph G = (V, E) is defined by a set of vertices V and a set of edges E. An edge e_ij is a pair of distinct vertices, i.e., e_ij = {v_i, v_j} with v_i, v_j ∈ V. The degree d_i of a vertex v_i is defined as the number of edges incident on that vertex. Weights and costs can be assigned to vertices and edges: W is used to represent the weight assignment for vertices, where w_i is the weight of vertex v_i ∈ V, and C is the cost assignment for edges, where c_ij represents the cost of edge e_ij ∈ E.

A K-way partition of G, Π = {P_1, …, P_K}, places each vertex of the graph into a part. More concretely, Π is a K-way partition if each part P_i is a non-empty subset of V, each pair of parts is disjoint, i.e., P_i ∩ P_j = ∅ for all 1 ≤ i ≠ j ≤ K, and the union of all parts recovers V, i.e., ⋃_{1 ≤ i ≤ K} P_i = V. For a K-way partition Π, an edge e_ij = {v_i, v_j} is called external (or cut) if v_i ∈ P_a and v_j ∈ P_b with a ≠ b, and internal (or uncut) otherwise. E_E ⊆ E is used to represent the set of all external edges. The cost (or cutsize) χ of Π is defined as the total cost of the external edges:

χ(Π) = Σ_{e_ij ∈ E_E} c_ij

A K-way partition, Π, is called balanced if the following holds:

∀ i ∈ {1, …, K}: Σ_{v_j ∈ P_i} w_j ≤ (1 + ε) W_avg (Equation 1)

where W_avg = (Σ_{v_j ∈ V} w_j)/K, and ε is a given maximum imbalance ratio.

The graph partitioning problem is defined as follows: given a graph G=(V,E), vertex weight and edge cost assignments W and C, a part number requirement K, and the maximum allowed imbalance ratio ε, find a balanced K-way partitioning that minimizes the cost. The graph partitioning problem is known to be NP-hard Garey and Johnson (1979), even for seemingly easier problems such as uniform weighted bipartitioning Garey et al. (1974).
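To make these definitions concrete, the following minimal Python sketch computes the cutsize χ(Π) and checks the balance condition of Equation 1 for a given vertex-to-part assignment (illustrative only; it is not a partitioner).

```python
def cutsize(edges, cost, part):
    """chi(Pi): sum of the costs of external (cut) edges."""
    return sum(cost[(u, v)] for (u, v) in edges if part[u] != part[v])

def is_balanced(weight, part, K, eps):
    """Equation 1: every part's total weight is at most (1 + eps) * W_avg."""
    totals = [0.0] * K
    for v, p in part.items():
        totals[p] += weight[v]
    w_avg = sum(totals) / K
    return all(t <= (1 + eps) * w_avg for t in totals)
```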

Hypergraph partitioning

A hypergraph H = (V, N) contains a set of vertices, V, and a set of nets (hyperedges), N. A hypergraph is a generalization of a graph in which each hyperedge can connect more than two vertices, i.e., a net n_i ∈ N is a subset of the vertices V. The vertices in a net are called its pins, represented by pins[n_i]; the size of a net is the number of its pins. The number of nets incident on v_i is the vertex degree d_i. As with graphs, we use W and C as the vertex weight and net cost assignments, with w_i representing the weight of a vertex v_i ∈ V and c_j representing the cost of a net n_j ∈ N.

The K-way partitioning of a hypergraph is similar to that of a standard graph. The main difference comes from the definition of the partitioning cost. A net is connected to a part if at least one of its pins is in that part. The connectivity set Λ_j of net n_j is the set of all parts that the net connects to. The size of Λ_j is denoted λ_j, i.e., λ_j = |Λ_j|. A net n_j is external (or cut) if it connects to more than one part, i.e., λ_j > 1; otherwise, the net is called internal (or uncut). The set of all external nets for a partition Π is represented as N_E. There are multiple definitions of the cost χ of a partitioning Π; in this work we use the connectivity−1 metric, defined as:

χ(Π) = Σ_{n_j ∈ N_E} c_j (λ_j − 1)

The hypergraph partitioning problem is known to be NP-hard as well Lengauer (2012).
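The connectivity−1 cost can likewise be computed directly from a vertex-to-part assignment, as in this minimal illustrative sketch:

```python
def connectivity_minus_one(nets, cost, part):
    """chi(Pi) = sum over nets of c_j * (lambda_j - 1)."""
    total = 0
    for net, pins in nets.items():
        lam = len({part[v] for v in pins})  # lambda_j: number of parts the net touches
        total += cost[net] * (lam - 1)      # internal nets (lambda_j = 1) add 0
    return total
```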

The BOA assembly framework overview

The BOA framework hinges on the key idea of block partitioning the reads so that each block is expected to contain reads from neighboring regions of the (unknown) target genome. This blocking mechanism is a proxy to obtaining a fully ordered sequence of reads. After block partitioning, each block can be assembled using any standalone assembler of choice, and the combined set of contigs generated across all the blocks represent the final output assembly. This partitioning-based strategy has several advantages:

  • The quality of the output assembly can potentially see improvements if the block partitioning of reads is faithful to the origins of the reads along the genome (i.e., reads mapping to neighboring genomic regions are assigned to the same block, while unrelated reads are kept separated across blocks).

  • From the performance standpoint, block partitioning can provide significant leverage in controlling the degree of parallelism as each block is independently processed.

  • Finally, the BOA framework is oblivious to the choice of the downstream assembler and allows the use of any standalone assembler. Instead, the framework shifts the focus to keeping related reads together, keeping unrelated reads separate, and keeping the block sizes reasonably small so as to enable fast parallel assemblies.

Figure 1 illustrates the BOA framework with its different components. In what follows, we describe these major components. In particular, we describe two instantiations of the framework: one using classical graph partitioning (Section the BOA framework using graph partitioning: graph-BOA) and another using hypergraph partitioning (Section the BOA framework using hypergraph partitioning: hyper-BOA). The initial bucketing step and the final assembly step are common to both instantiations.

Bucketing algorithm

Given the set of reads R, the bucketing algorithm computes a set of buckets B, where each bucket b ∈ B corresponds to a k-mer in R. The bucketing algorithm assigns the reads in R to at most |Σ|^k buckets, for a fixed length k > 0. We define a bucket for each distinct k-mer present in R. In particular, a read r is assigned to all buckets corresponding to the k-mers it contains. Therefore, a bucket is simply the set of read IDs sharing that k-mer. To account for the bidirectionality of reads, we take the lexicographically smaller variant of each k-mer and assign reads to that bucket. This ensures that a read is present in the bucket corresponding to the k-mer either in its direct form or in its reverse complemented form (but not both).

Let B denote the collection of all buckets generated in this process, and let b denote an arbitrary member of B. Note that each b ⊆ R. We use kmer(b) to denote the k-mer that defines bucket b. Note that it is possible for buckets to intersect in reads (given that the same read can contain multiple distinct k-mers).
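The following minimal Python sketch illustrates the bucketing step on a small scale (the distributed implementation, including the pruning of over-represented buckets, is described in Section parallelization):

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("acgt", "tgca")

def rc(s: str) -> str:
    return s.translate(COMPLEMENT)[::-1]

def bucket_reads(reads, k):
    """Map each canonical k-mer (lexicographically smaller of the k-mer and
    its reverse complement) to the set of read IDs containing it."""
    buckets = defaultdict(set)
    for rid, read in enumerate(reads, start=1):
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            buckets[min(kmer, rc(kmer))].add(rid)
    return buckets

# e.g., reads r1 = cagcca and r2 = tgagcc both contain the 3-mer agc:
assert bucket_reads(["cagcca", "tgagcc"], k=3)["agc"] == {1, 2}
```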

The BOA framework using hypergraph partitioning: Hyper-BOA

Hyper-BOA models the multi-way interaction between reads and buckets using a hypergraph. We describe this hypergraph-based model first because it naturally follows from the bucketing step.

The input to Hyper-BOA is the set of buckets B, and the output is the read-bucket hypergraph H = (V, N), where reads are represented as vertices and buckets as nets. This step produces a partitioning Π of H, which is a partitioning of the reads. Each bucket b ∈ B contains the subset of all reads in R that share the same k-mer (either in the direct or the reverse complemented form). With the hypothesis that this is a necessary (but not sufficient) condition for reads originating in the same region of the target genome, we construct the hypergraph H = (V, N) for two possible scenarios.

No paired-end information available

If the input R does not contain paired-end information, then we construct a hypergraph H = (V, N) such that V = R and N = B. In other words, we initialize a hypergraph where each read is represented by a vertex and each bucket by a net. The pins of a net correspond to all the reads that are part of the corresponding bucket. Since each vertex is a read and the subsequent assembly workload is not expected to vary across similar-sized reads, we assign unit weights to each vertex. One could use a cost function to represent the importance of a k-mer, but for this initial work we simply treat each k-mer equally and thus assign unit costs to the nets.

Paired-end information available

If the input read set R contains paired-end information, then we construct our read-bucket hypergraph H = (V, N) after post-processing the buckets as follows. Recall that for paired-end reads, the two reads of a given pair are assigned consecutive IDs i (odd) and i+1 (even). While these two reads of the pair can take part in different sets of buckets, it is desirable to assign them to the same block at the end of partitioning, so that the subsequent assembly step can use the paired-end information. To force this block assignment during partitioning, we fuse the two reads into a single vertex in the hypergraph; i.e., reads i and i+1 of a pair are both mapped to the same vertex in H, identified by vertex ⌈i/2⌉ (which equals ⌈(i+1)/2⌉). This can be achieved by simply scanning the list of read IDs in each bucket and renumbering each using the above ceiling function. (In our implementation, we actually renumber the read IDs as they are entered into their buckets, so that a second pass is unnecessary.) Consequently, the new hypergraph H contains exactly n/2 vertices. The set of nets N is the updated set of buckets B with the renumbered read IDs (as pins). Each vertex and each net is assigned unit weight.
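A minimal sketch of this fusing step (illustrative only): renumbering a bucket's read IDs with the ceiling function maps the two reads of each pair onto one vertex, producing the pins of the corresponding net.

```python
def fuse_pair_ids(bucket_read_ids):
    """Renumber paired read IDs i (odd) and i+1 (even) to vertex ceil(i/2)."""
    return sorted({(rid + 1) // 2 for rid in bucket_read_ids})

# Reads 7 and 8 form a pair and both map to vertex 4; read 3 maps to vertex 2.
assert fuse_pair_ids([7, 8, 3]) == [2, 4]
```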

Partitioning

Once the hypergraph H is constructed, we call the partition function on H (described in Section preliminaries and notation) using the Zoltan hypergraph partitioner Devine et al. (2006). Partitioning takes as an input parameter the number of output parts K. However, instead of fixing the number of parts (or, equivalently, output blocks) arbitrarily, we set a target for the output block size, i.e., for the number of reads per part, denoted by ρ. Intuitively, since each output block is input to a standalone assembler, it is important to keep related reads together so that contigs have a chance to grow long (and not fragment the assembly). However, if the block size becomes too large, then it may not only start including unrelated reads (from far-apart regions of the genome) but also have a negative impact on runtime performance. (Note that a single-block configuration (K = 1) is equivalent to running the standalone assembler on the entire input R.) Therefore, we set a target ρ for the number of reads per block and use ρ to determine K (= ⌈n/ρ⌉).

To determine an appropriate ρ, we can use the longest contigs produced by state-of-the-art assemblers as a lower bound. The idea is to set a target for ρ so that the contigs produced from each block have an opportunity to grow longer than this baseline length. For instance, a block with 100K reads can produce a contig at most 100 Kbp long (assuming a 100 bp read length and 100× genome sequencing coverage). So if our goal is to surpass this baseline, then the block size has to reflect that, e.g., by being a constant factor larger than the baseline. Setting a high target for ρ as described above is not a guarantee of qualitative improvement, but it provides a chance (to the per-block standalone assemblers). This approach enables empirically calibrating the block size for assembly quality.
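The calibration just described can be summarized in a short sketch. The factor of 5 and the 100 bp read length below are illustrative assumptions; the text above only prescribes a constant factor over the baseline.

```python
import math

def calibrate_rho(baseline_contig_bp, coverage, read_len=100, factor=5):
    """Reads per block so that a (contiguous) block can span factor x baseline bp."""
    return factor * baseline_contig_bp * coverage // read_len

def num_parts(n_reads, rho):
    """The number of blocks K follows from the target block size rho."""
    return max(1, math.ceil(n_reads / rho))

# e.g., a 100 Kbp baseline at 100x coverage gives rho = 500K reads per block:
assert calibrate_rho(100_000, 100) == 500_000
assert num_parts(10_000_000, 500_000) == 20
```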

One last parameter in partitioning is the balance criterion. To achieve a similar workload across all the individual block assembler runs, we prefer roughly similar-sized blocks. However, keeping this constraint very tight might unnecessarily separate related reads that need to go into the same part. To strike a balance between these two goals, we use a balance constraint of ε = 1% (see Equation 1).

The BOA framework using graph partitioning: Graph-BOA

Graph-BOA models the interaction among reads using a graph. The input to Graph-BOA is the set of buckets B, and the output is the read-overlap graph G = (V, E), where reads are represented as vertices and edges represent alignment-free, exact-match overlaps between pairs of reads identified in the bucketing phase. This phase produces a partitioning Π of G, which is a partitioning of the reads.

Given a set R of n input reads, we construct a read-overlap graph G = (V, E) where V = R and {r_i, r_j} ∈ E if the two reads r_i and r_j share at least one maximal match (a maximal match is a nonempty exact match between two strings that cannot be extended in either direction) of length ≥ k, for some integer constant k > 0. In other words, the set of edges E is generated by enumerating all pairs of reads sharing at least one maximal match α of length ≥ k. Let P denote the set of such pairs, given by:

P = { {r_i, r_j} | r_i, r_j ∈ R, i ≠ j, and there exists a maximal match of length ≥ k between r_i and r_j }

For example, two reads r1 = cagcca and r2 = tgagcc share the substring agcc as a maximal match, and if k = 3, there will be an edge between the nodes corresponding to r1 and r2 in G. The focus on maximal matches is due to the following performance consideration. While buckets are defined based on k-mers, two reads that share a longer exact match of length t could appear in up to t − k + 1 distinct buckets. Instead of detecting the same pair {r_i, r_j} multiple times in those many buckets, our algorithm detects it only once, due to the leftmost common k-mer in the maximal match. In the above example of r1 and r2, the pair is detected due to the leftmost k-mer agc in the maximal match.

Note that once all pairs are generated, each bucket b containing m reads would have effectively contributed m(m−1)/2 pairs to P, i.e., a clique of size m in G. The above maximal match trick is mainly to avoid duplicate detection of any edge in the clique.

Pair generation

To generate P using all the buckets, our algorithm deploys a two-step strategy, as described below. Intuitively, we use the two characters (if present) flanking the maximal match to its left and right. A maximal match is a substring that is both left-maximal (i.e., the left characters on the two reads mismatch) and right-maximal (i.e., the right characters on the two reads mismatch). The only exception is when there is no flanking character on either of the reads; in such a case, we use the blank character B for maximality. Note that two reads that have B as their respective left characters are still considered left-maximal. Our algorithm exploits the bucketing information for right maximality and explicitly checks only for left maximality (while still guaranteeing maximality). The details of the algorithm are described below.

  • a)

    Left-maximality: Consider the read collection covered by bucket b. For each read r ∈ b, let ψ(r, b) denote the set of suffix positions in read r that have kmer(b) as their prefix, and let Lchars(r, b) ⊆ Σ ∪ {B} denote the set of all characters that immediately precede those suffix positions in r. Using Lchars(r, b), we generate a bit vector L_r of length |Σ| + 1 as follows:

L_r[x] = 1 if x ∈ Lchars(r, b), and 0 otherwise

Example 1. Consider a read r = accttacc and the bucket b for the 2-mer ac. Then ψ(r, b) contains the suffix positions {1, 6}, and the corresponding left characters are B (for position 1) and r[5] = t (i.e., Lchars(r, b) = {B, t}). Therefore, the bit vector L_r for the bucket corresponding to the 2-mer ac is [1, 0, 0, 0, 1] for left characters [B, a, c, g, t], respectively.

Remark 1. As noted in Section bucketing algorithm, k-mers are indexed by their lexicographically smaller variant to account for bidirectionality. If a given k-mer in a read r is not in its lexicographically smaller form, we use the character following the k-mer, in its complemented form, for the purpose of the left character list (Lchars). This is to capture the reversal in direction.

Example 2. Consider a read r = atgcgttg and the bucket b for the 2-mer tg (or equivalently, its smaller form ca). Then Lchars(r, b) = {g, B}, as these are the corresponding left characters in the reverse complemented form of r. Therefore, the bit vector L_r for the bucket corresponding to the 2-mer tg (or equivalently, ca) is [1, 0, 0, 1, 0] for left characters [B, a, c, g, t], respectively.

  • b)

    Pairing: Subsequently, we use the L_r arrays of all reads r ∈ b to generate the pairs of reads from that bucket b. The set of pairs contributed by bucket b, denoted by P(b), is given by:

P(b) = { {r_i, r_j} | r_i, r_j ∈ b, ∃ x ∈ Σ ∪ {B} s.t. L_ri[x] ⊕ L_rj[x] = 1, or L_ri[B] ∨ L_rj[B] = 1 }

Here, ⊕ and ∨ are the bitwise XOR and OR operators, respectively. Intuitively, a pair of reads is generated at a bucket b only if there exists a pair of suffixes in those reads that differ in their left characters (thereby guaranteeing left-maximality of the detected match). Note that right-maximality of the detected match is implicitly guaranteed, as the suffixes in the two reads must eventually differ at some point past the k-mer prefix. Therefore, this algorithm reports only one instance of a read pair {r_i, r_j}, for the leftmost matching k-mer of a maximal match.

Example 3. Figure 2 presents an example to illustrate our pair generation algorithm. This example shows two maximal matches (accgc and aagg) appearing among four reads. As highlighted in orange, r1, r2, and r3 share the maximal match accgc. This match contains multiple 3-mers (acc, ccg, and cgc), and therefore the corresponding reads appear in all of those buckets (shown as the orange-colored buckets in the table). The rows show the left character lists (Lchar) in which each read appears within a given bucket. Our pair generation algorithm generates pairs from each bucket by performing a cross-product across the different Lchar lists. The only exception is the B list: reads appearing in that list are left-maximal with one another and so also yield pairs. The pairs generated from each bucket are shown in the bottom panel. The second maximal match in the example, aagg (in green), shows a case where the pairing happens between a read and the reverse complement of another read. Here, read r3 and the reverse complement of r4 share the maximal match aagg, and therefore the pair is generated from the bucket corresponding to aag, which is the lexicographically smaller of the two variants (aag, ctt).
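The pairing rule can be sketched compactly as follows (illustrative only; occurrences are assumed to be in the direct orientation, so the reverse complement handling of Remark 1 is omitted). Two reads in a bucket are paired when their Lchars sets differ (left-maximality) or when either set contains the blank symbol B, matching the definition of P(b) above.

```python
from itertools import combinations

BLANK = "B"

def lchars(read, kmer):
    """Left characters of all occurrences of kmer in read (B at position 1)."""
    chars, i = set(), read.find(kmer)
    while i != -1:
        chars.add(read[i - 1] if i > 0 else BLANK)
        i = read.find(kmer, i + 1)
    return chars

def pairs_from_bucket(bucket_ids, kmer, reads):
    """P(b), where reads maps read ID -> sequence and bucket_ids is b."""
    L = {rid: lchars(reads[rid], kmer) for rid in bucket_ids}
    return {(ri, rj) for ri, rj in combinations(sorted(bucket_ids), 2)
            if L[ri] != L[rj] or BLANK in L[ri] or BLANK in L[rj]}

# Reads r1 = cagcca and r2 = tgagcc in bucket agc have Lchars {c} and {g},
# which differ, so the pair (r1, r2) is generated exactly once.
assert pairs_from_bucket({1, 2}, "agc", {1: "cagcca", 2: "tgagcc"}) == {(1, 2)}
```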

BOA has an optional modification to deal with a specific boundary case in which edges between reads may be missed, at the cost of increased memory and runtime. Specifically, consider when:

  • The maximal match α has length > k;

  • The leftmost k-mer in α is lexicographically larger than its reverse complement; and

  • The rightmost k-mer in α is lexicographically smaller than its reverse complement.

In this case, for the k-mers at both ends, the leftmost character recorded for the corresponding k-mer in the read is a character within the maximal match α. BOA does not look outside the maximal match in the read, and so it cannot recognize the ends of the maximal match and generate the read pair.

Example 4. Consider the example of two reads r1 = r2 = ctac and k = 2. Read r1 will be in the buckets ag, ta, and ac, with leftmost characters t, c, and t, respectively. The assignment is the same for r2. Thus, the baseline algorithm would miss detecting the pair (r1, r2).

The algorithm can easily be modified to avoid this boundary case. More specifically, the method can store both the leftmost and rightmost characters for each k-mer in the following way: each bucket has two groups of sub-buckets, a−, t−, g−, c−, B− for the leftmost character and −a, −t, −g, −c, −B for the rightmost character. Edges are generated only within each group of sub-buckets and not across the groups. Note that this solution comes with a slight increase in cost: each maximal match of length k or greater between two reads will produce the same read pair twice, even when the length of the maximal match is exactly k. For this reason, and because BOA is a heuristic, we have not implemented this change in practice, but we note that in theory it can be addressed.

Similar to Hyper-BOA, we assign unit weights to vertices and edges, and we create one of two different variants of the read-overlap graph G depending on whether paired-end read information is available. More specifically, if paired-end information is available, then we follow the same fuse strategy described under Hyper-BOA, representing both reads of a pair by a single vertex in V. This is achieved by renumbering the read IDs within each input bucket b prior to generating pairs.

Partitioning

Once the read-overlap graph G is constructed, we call the partition function on G (described in Section preliminaries and notation) using the ParMETIS graph partitioner Karypis et al. (1997). Here again, we use the number of reads per output block (ρ) as our guide to determine the number of blocks K, and we set the balance constraint ε to 1%.

Graph-BOA and Hyper-BOA

There are a few important connections as well as differences between the graph-based approach (Graph-BOA) and hypergraph-based approach (Hyper-BOA) within our BOA framework that are worth noting.

First, from the standpoint of the assembly problem formulation, Graph-BOA is very similar to the OLC assembler model, with the key difference that the "overlaps" in the read-overlap graph are detected using lightweight, exact match-based criteria (as described in the bucketing step). Therefore, our approach is alignment-free. The read-bucket hypergraphs we construct under Hyper-BOA are also alignment-free. Furthermore, they can be viewed as a generalization of the read-overlap graphs (from edges to nets, i.e., from read pairs to read subsets).

Second, from a methods standpoint, both the graph and hypergraph approaches intuitively try to put reads that are strongly connected to each other into the same part. In the hypergraph model, each bucket (i.e., k-mer) is uniquely represented by a net. If two reads share multiple k-mers, they will be incident on multiple common nets, which captures how strong their connection is. In the graph model, each edge does not represent a unique relation: an edge between two reads might arise from different overlaps (or buckets). Hence, one would need an aggregation function to represent that accurately. In our current implementation of Graph-BOA, the edges established between any two reads are unweighted (or, equivalently, of unit weight). This is in contrast with alignment-based OLC assemblers, which typically use an alignment-based weight along an edge. While edge weights would help guide partitioning decisions, for Graph-BOA there is a tradeoff with performance. One approach to calculating an edge weight between a pair of reads is based on the length of the maximal matches that led to the detection of that edge. However, our pair generation algorithm only detects the presence of a maximal match for pairing two reads, without explicitly determining the match itself or its length (as it would become more expensive to compute the matches). An alternative strategy is to count the number of buckets in which a pair of reads co-occurs and use that as the corresponding edge weight. However, this implies detecting and storing a count for each pair, which could become expensive in both runtime and memory. As a compromise, we have used an unweighted representation for Graph-BOA.

Another point of difference between Hyper-BOA and Graph-BOA lies in their space and runtime complexities. For Hyper-BOA, the k-mer-based buckets are used directly to construct the hypergraph: every bucket with, say, m distinct reads induces a net with m pins. Under Graph-BOA, by contrast, extra computation is needed to establish pairwise connections between reads, as described in Section the BOA framework using graph partitioning: graph-BOA; i.e., every bucket with m reads contributes m(m−1)/2 edges. This leads to higher memory usage for Graph-BOA. For example, in the case of C. elegans, the peak memory usage per MPI rank for Graph-BOA in the bucketing phase is 8.3 GB, in comparison to 5.3 GB for Hyper-BOA. In the partitioning phase, however, Graph-BOA is much lighter in runtime than Hyper-BOA, as shown in Section runtime performance evaluation.

Parallelization

The BOA pipeline comprises three phases:

  • 1)

    Parallel Bucketing: In this step, the algorithm first loads the input FASTA file(s) in a distributed manner such that each process receives roughly the same amount of sequence data (≈ |R|/p, where p is the total number of processes). This is achieved by each process loading a chunk of reads using MPI-IO functions MPIForum (2020), such that no read is split among processes. Each read is assigned a distinct read ID; we use MPI_Scan to determine the read offset at each process. Next, we generate k-mers by sliding a window of length k (k = 31 in our experiments) over each read, as elaborated in the bucketing step (Section bucketing algorithm). For parallel bucketing, an owner process that collects the read IDs for each bucket is assigned. To identify the owner, we use an approach based on minimizers Chikhi et al. (2014) (see the sketch following this list). In particular, for each k-mer bucket, a minimizer of length l (l < k; l = 8 in our experiments) is identified; we use the least frequently occurring l-mer within that k-mer as the minimizer. Subsequently, a hash function is used to map the minimizer to its owner process. The idea of using minimizers for this assignment step is to increase the probability that adjacent k-mers in a read are assigned the same owning process for the corresponding buckets (thereby reducing communication latency). Collective aggregation of the read IDs corresponding to each bucket is carried out through an MPI_Alltoallv primitive MPIForum (2020). Any bucket with 200 or more distinct reads is pruned; this pruning step accounts for over-representation in the buckets corresponding to repetitive regions.

  • 2)

    Parallel Partitioning: In this step, we first generate the input read-overlap graph (G) or read-bucket hypergraph (H), for Graph-BOA or Hyper-BOA, respectively. For Hyper-BOA, we provide Zoltan's hypergraph generating function a list of all distinct, sorted read IDs for each k-mer bucket assigned to a process. For Graph-BOA, each process enumerates edges between pairs of reads sharing at least one maximal match (Section the BOA framework using graph partitioning: graph-BOA) in parallel, and then sends the edge lists to the owner processes of the vertices through MPI_Alltoallv. We provide ParMETIS the graph in CSR (Compressed Sparse Row) format. We then call the partitioning function, providing as input the generated hypergraph or graph, the number of block partitions K, and the balance constraint ε.

  • 3)

    Assembly: The final phase of the pipeline takes the K partitions generated by the partitioner and launches K concurrent assembly instances using a standalone assembler on each of the K parts (or equivalently, blocks).
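As a companion to the bucketing phase above, the following minimal Python sketch illustrates the minimizer-based owner assignment idea (it is not the MPI implementation). For brevity, the minimizer here is the lexicographically smallest l-mer; the actual implementation selects the least frequently occurring l-mer, whose global frequency table is not reproduced here. The hash function shown is likewise an arbitrary stand-in.

```python
import hashlib

def minimizer(kmer, l):
    # Lexicographically smallest length-l substring (a stand-in for the
    # paper's least-frequent-l-mer criterion).
    return min(kmer[i:i + l] for i in range(len(kmer) - l + 1))

def owner(kmer, l, num_procs):
    # Hash the minimizer to pick the process that owns this k-mer's bucket.
    digest = hashlib.sha1(minimizer(kmer, l).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_procs

# Adjacent k-mers of a read often share a minimizer and thus an owner, e.g.:
assert owner("acgtacgtacgt"[0:8], 4, 8) == owner("acgtacgtacgt"[1:9], 4, 8)
```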

Our BOA framework is available at https://github.com/GT-TDAlab/BOA.

Acknowledgments

This research is supported by the U.S. National Science Foundation (awards CCF 1946752, 1919122, and 1919021). This publication describes work performed at the Georgia Institute of Technology and is not associated with Amazon.

Author contributions

Conceptualization, U.V.C. and A.K.; Methodology, Validation, and Formal analysis, X.A., P.G., P.K., S.E.K., U.V.C., S.K., P.S., A.S.R., and A.K.; Software, X.A., P.G., P.K., U.V.C., and A.K.; Investigation, X.A., P.G., and P.K.; Resources, A.K.; Data Curation, X.A., P.G., and P.K.; Writing – Original Draft, X.A., P.G., and A.K.; Writing – Review and Editing, X.A., P.G., U.V.C., P.S., and A.K.; Visualization, X.A. and A.K.; Supervision, U.V.C., P.S., and A.K.; Project Administration, A.K.; Funding Acquisition, U.V.C., P.S., and A.K. One of the authors is currently at the National Center for Biotechnology Information (NCBI); their contributions to this work were made during their affiliation with Pacific Northwest National Laboratory and are not associated with the NCBI.

Declaration of interests

The authors declare no competing interests.

Published: November 18, 2022

Data and code availability

  • The paper analyzes existing, currently available data. The accession identifiers for the datasets are listed in the key resources table.

  • BOA is publicly available online from https://github.com/GT-TDAlab/BOA.

  • Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

References

  1. Al-Okaily A.A. Hga: de novo genome assembly method for bacterial genomes using high coverage short sequencing reads. BMC Genom. 2016;17:1–11. doi: 10.1186/s12864-016-2515-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Chikhi R., Limasset A., Jackman S., Simpson J.T., Medvedev P. International conference on Research in computational molecular biology. 2014. On the representation of de bruijn graphs; pp. 35–55. [DOI] [Google Scholar]
  3. Chikhi R., Rizk G. Space-efficient and exact de Bruijn graph representation based on a Bloom filter. Algorithms Mol. Biol. 2013;8:1–9. doi: 10.1186/1748-7188-8-22. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Compeau P.E.C., Pevzner P.A., Tesler G. How to apply de bruijn graphs to genome assembly. Nat. Biotechnol. 2011;29:987–991. doi: 10.1038/nbt.2023. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Devine K., Boman E.G., Heaphy R., Bisseling R., Çatalyürek U.V. Proceedings of 20th International Parallel and Distributed Processing Symposium (IPDPS) IEEE; 2006. Parallel hypergraph partitioning for scientific computing. [DOI] [Google Scholar]
  6. Duke University School of Medicine. NCBI GenBank. https://www.ncbi.nlm.nih.gov/genbank/ (last accessed: November 2021).
  7. Garey M.R., Johnson D.S. Vol. 174. Freeman, San Francisco; 1979. Computers and Intractability. [DOI] [Google Scholar]
  8. Garey M.R., Johnson D.S., Stockmeyer L. Proceedings of the sixth annual ACM symposium on Theory of computing. 1974. Some simplified NP-complete problems; pp. 47–63. [DOI] [Google Scholar]
  9. Gurevich A., Saveliev V., Vyahhi N., Tesler G. Quast: quality assessment tool for genome assemblies. Bioinformatics. 2013;29:1072–1075. doi: 10.1093/bioinformatics/btt086. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Hendrickson B., Kolda T.G. Graph partitioning models for parallel computing. Parallel Comput. 2000;26:1519–1534. doi: 10.1016/s0167-8191(00)00048-x. [DOI] [Google Scholar]
  11. Huang W., Li L., Myers J.R., Marth G.T. Art: a next-generation sequencing read simulator. Bioinformatics. 2012;28:593–594. doi: 10.1371/journal.pone.0090581. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Jammula N., Chockalingam S.P., Aluru S. Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. 2017. Distributed memory partitioning of high-throughput sequencing datasets for enabling parallel genomics analyses; pp. 417–424. [DOI] [Google Scholar]
  13. Karypis G., Schloegel K., Kumar V. Vol. 48. 1997. ParMETIS: Parallel Graph Partitioning and Sparse Matrix Ordering Library; pp. 71–95. [DOI] [Google Scholar]
  14. Lengauer T. volume 21. Springer Science & Business Media; 2012. Combinatorial Algorithms for Integrated Circuit Layout. [DOI] [Google Scholar]
  15. Li Z., Chen Y., Mu D., Yuan J., Shi Y., Zhang H., Gan J., Li N., Hu X., Liu B., et al. Comparison of the two major classes of assembly algorithms: overlap–layout–consensus and de-bruijn-graph. Brief. Funct. Genom. 2012;11:25–37. doi: 10.1108/aa-02-2019-0031. [DOI] [PubMed] [Google Scholar]
  16. Li D., Liu C.M., Luo R., Sadakane K., Lam T.W. Megahit: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de bruijn graph. Bioinformatics. 2015;31:1674–1676. doi: 10.1093/bioinformatics/btv033. [DOI] [PubMed] [Google Scholar]
  17. Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34:3094–3100. doi: 10.1093/bioinformatics/bty191. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Medvedev P., Pop M. What do Eulerian and Hamiltonian cycles have to do with genome assembly? PLoS Comput. Biol. 2021;17:e1008928. doi: 10.1371/journal.pcbi.1008928. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. MPI Forum. Univ. of Tennessee; 2020. MPI: A Message-Passing Interface Standard, 2020 Draft Specification. Technical Report. Note: this is an MPI-4 draft specification. [DOI] [Google Scholar]
  20. Pell J., Hintze A., Canino-Koning R., Howe A., Tiedje J.M., Brown C.T. Scaling metagenome sequence assembly with probabilistic de bruijn graphs. Proc. Natl. Acad. Sci. USA. 2012;109:13272–13277. doi: 10.1073/pnas.1121464109. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Peng Y., Leung H.C.M., Yiu S.M., Chin F.Y.L. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics. 2012;28:1420–1428. doi: 10.1093/bioinformatics/bts174. [DOI] [PubMed] [Google Scholar]
  22. Pevzner P.A., Tang H., Waterman M.S. An Eulerian path approach to dna fragment assembly. Proc. Natl. Acad. Sci. USA. 2001;98:9748–9753. doi: 10.1073/pnas.171285098. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Pop M. Genome assembly reborn: recent computational challenges. Briefings Bioinf. 2009;10:354–366. doi: 10.1093/bib/bbp026. [DOI] [PMC free article] [PubMed] [Google Scholar]
