A Comprehensive Study of De Novo Genome Assemblers: Current Challenges and Future Prospective

Abdul Rafay Khan; Muhammad Tariq Pervez; Masroor Ellahi Babar; Nasir Naveed; Muhammad Shoaib

doi:10.1177/1176934318758650

. 2018 Feb 20;14:1176934318758650. doi: 10.1177/1176934318758650

A Comprehensive Study of De Novo Genome Assemblers: Current Challenges and Future Prospective

Abdul Rafay Khan ¹, Muhammad Tariq Pervez ^1,^✉, Masroor Ellahi Babar ², Nasir Naveed ³, Muhammad Shoaib ⁴

PMCID: PMC5826002 PMID: 29511353

Abstract

Background:

Current advancements in next-generation sequencing technology have made possible to sequence whole genome but assembling a large number of short sequence reads is still a big challenge. In this article, we present the comparative study of seven assemblers, namely, ABySS, Velvet, Edena, SGA, Ray, SSAKE, and Perga, using prokaryotic and eukaryotic paired-end as well as single-end data sets from Illumina platform.

Results:

Results showed that in case of single-end data sets, Velvet and ABySS outperformed in all the seven assemblers with comparatively low assembling time and high genome fraction. Velvet consumed the least amount of memory than any other assembler. In case of paired-end data sets, Velvet consumed least amount of time and produced high genome fraction after ABySS and Ray. In terms of low memory usage, SGA and Edena outperformed in all the assemblers. Ray also showed good genome fraction; however, extremely high assembling time consumed by the Ray might make it prohibitively slow on larger data sets of single and paired-end data.

Conclusions:

Our comparison study will provide assistance to the scientists for selecting the suitable assembler according to their data sets and will also assist the developers to upgrade or develop a new assembler for de novo assembling.

Keywords: NGS (next-generation sequencing), DBG (de Bruijn graph), OLC (overlap layout consensus), ENA (European Nucleotide Archive), bps (base pairs)

Introduction

DNA sequencing has revolutionized the current advancements in the field of science and technology. It has been widely used in applied field of medicine, genetic engineering, food science, etc.¹ In current era, next-generation sequencing (NGS) is the most advanced technology of DNA sequencing, which provides more accuracy and speed than previously known Sanger sequencing.² Paired-end sequencing in NGS, which involves the sequencing of both forward and reverse fragments of DNA, has further increased the accuracy and ability to detect indels which otherwise was not possible in single-end sequencing.³

Next-generation sequencing technique produces millions of short sequence reads and assembling these short sequence reads without a reference genome is one of the challenging task for de novo assemblers.⁴ In the past few years, several de novo sequence assembling algorithms have been developed to handle and assemble the large amount of short sequence reads to form longer fragments called contigs but choosing the appropriate assembler for paired-end or single-end data is still a challenging job.⁵

The currently available assembling algorithms include de Bruijn graph (DBG), overlap layout consensus (OLC), string graph, greedy, and hybrid algorithm.⁶

De Bruijn graph is the graph algorithm based on k-mers approach, which splits the short reads into smaller k-mers, and these k-mers overlap by k − 1 which is the next k-mer. Dividing the sequences into smaller sizes also helps improving the crisis of different initial read lengths, whereas OLC is also the graph-based algorithm which builds overlap graph by overlapping the similar sequences.⁷

Finding overlapping sequences is usually the slowest part of the assembly and these overlapped sequences then pack fragments of the overlap graph into contigs. The DBG algorithm is faster and OLC algorithm executes better for longer sequence reads. String graph algorithm is the variant of OLC algorithm, which performs global overlap graph by eliminating unnecessary sequences.⁸

Greedy algorithms start by joining the short sequence reads that are best overlapped to produce contigs. Most greedy assemblers use heuristic techniques that are designed to eliminate misassembling of recurring sequences.⁹

Hybrid assembling algorithm refers to the mixing various assembling algorithms. It is used to reduce the number of contigs and errors produced by other algorithms.¹⁰

There are many de novo assemblers available online which have been developed by applying one of these five assembling algorithms. Our study evaluated the de novo sequence assemblers for Illumina-based paired-end and single-end short reads data sets. This study provides guidance to the biologists and bioinformaticians in selecting the appropriate assembler according to their data sets and it also assists developers to upgrade or develop a new assembler for de novo assembling.

Materials and Methods

Data sets

To compare the performance of each assembler, Illumina HiSeq 2000–based short sequence reads were downloaded from publicly available database European Nucleotide Archive (ENA)¹¹ (Tables 1 and 2). For the estimation of genome fraction, all the reference genomes were downloaded from National Center for Biotechnology Information (NCBI) genome database. Short sequence reads included 7 paired-end and 8 single-end prokaryotic data sets and also 5 paired-end and 5 single-end eukaryotic data sets. All the data sets have maximum read length of 100 bps.

Table 1.

Prokaryotic data sets used in this study.

S. no.	Data set	ENA run accession	Data set type	No. of reads
1	Staphylococcus aureus	ERR353143	Paired-end	137 022
2	Streptococcus pneumoniae	ERR490828	Paired-end	321 004
3	Escherichia coli	ERR490638	Paired-end	737 008
4	Mycobacterium tuberculosis	ERR495003	Paired-end	770 994
5	Neisseria flava	DRR015798	Paired-end	1 218 573
6	Aeromonas salmonicida	DRR015726	Paired-end	2 267 875
7	Rothia mucilaginosa	DRR015851	Paired-end	4 098 002
8	Streptococcus suis	DRR015872	Single-end	113 512
9	Streptococcus pyogenes	SRR1148216	Single-end	724 546
10	Salmonella enterica	ERR233905	Single-end	1 490 584
11	Neisseria gonorrhoeae	SRR969383	Single-end	1 840 438
12	Chlamydia muridarum	SRR1736648	Single-end	3 099 636
13	Clostridioides difficile	ERR465798	Single-end	5 094 314
14	Bacillus anthracis	ERR1596542	Single-end	7 466 661
15	Chlamydia trachomatis	SRR1038047	Single-end	9 129 274

Open in a new tab

Table 2.

Eukaryotic data sets used in this study.

S. no.	Data set	ENA run accession	Data set type	No. of reads
1	Homo sapiens	DRR002191	Paired-end	126 605 856
2	Drosophila melanogaster	DRR016722	Paired-end	95 461 377
3	Arabidopsis thaliana	ERR1224454	Paired-end	30 841 688
4	Saccharomyces cerevisiae	ERR052652	Paired-end	17 584 902
5	Fungi	SRR1614243	Paired-end	22 344 195
6	Homo sapiens	DRR002191	Single-end	126 605 856
7	Drosophila melanogaster	DRR002191	Single-end	95 461 377
8	Arabidopsis thaliana	ERR1224454	Single-end	30 841 688
9	Saccharomyces cerevisiae	ERR052652	Single-end	17 584 902
10	Fungi	SRR1614243	Single-end	22 344 195

Open in a new tab

Genome assemblers

Seven assemblers (Table 3), which represent 5 different assembly algorithm strategies, were selected to assemble paired-end and single-end data sets.

Table 3.

De novo assemblers selected for this study.

S. no.	ASSEMBLER	Programming LANGUAGE	ALGORITHM	Input reads
1	ABySS¹⁴	C++	De Bruijn graph (DBG)	Paired-end and single-end
2	Velvet¹⁵	C	De Bruijn graph (DBG)	Paired-end and single-end
3	Edena¹⁶	C++	Overlap/layout/consensus (OLC)	Paired-end and single-end
4	SGA¹⁷	C++	String graph	Paired-end
5	Ray¹⁸	C++	Hybrid	Paired-end and single-end
6	SSAKE¹⁹	Perl	Greedy	Paired-end and single-end
7	Perga²⁰	C	Greedy	Paired-end and single-end

Open in a new tab

All the selected assemblers were executed on the virtual machine, which was designed using Oracle VM VirtualBox with 2 VCPU, 4 GB of RAM memory and 64-bit Linux Ubuntu Server 14.04 operating system (supplementary file 1).

Efficiency evaluation

The efficiency of each assembler was evaluated using various parameters, which include assembling total time, maximum memory usage, and maximum CPU usage.

Accuracy evaluation

The output of assemblers was decomposed into contigs. All these contig information were stored in contig files which were produced as an end result of assembling by an assembler. Contig files were used for the accuracy evaluation of each assembler using different parameters including the total number of contigs and N50 contig length. These parameters were collected using Assemblathon 2 script¹² which is written in Perl language to calculate the metrics of each contig file. Genome fraction was calculated using QUAST tool¹³ to find the similarity between the contig sequences and the reference genome.

Statistical analyses

For data analysis, R (version 3.3.2) was used. The data were tested using Shapiro-Wilk normality to find whether data are normally distributed or not. To determine statistical significance, parametric and nonparametric tests were used according to the data. A 2-tailed P values less than .05 were considered as significant.

Results

Efficiency, as well as the accuracy of each assembler, was analyzed by generated contig files using various evaluation techniques. Our study involved evaluation of 7 different assemblers with alternative assembly algorithms such as ABySS and Velvet, the DBG-based assemblers; Edena which is an OLC-based assembler; SGA which uses string graph algorithm; SSAKE and Perga, the greedy-based assembler; and Ray which worked on hybrid algorithm (Table 3).

Total assembling time

The total assembling time in minutes was calculated using Linux time command, and median of each assembler was compared using Mann-Whitney test. The results showed that Ray, the hybrid assembler, consumed more time on paired-end data sets with a median time of 553.95 minutes and single-end data sets with a median time of 373.15 minutes than any other assembler and reached very high level of significance with P < .05 in prokaryotic data sets, whereas in eukaryotic data set, SSAKE, the greedy assembler, consumed more time on single-end data sets with a median time of 223.10 minutes and lowest in paired-end data sets with a median time of 13.85 minutes.

Velvet, the DBG assembler, consumed lowest median time of 1.49 minutes on paired-end data sets and 1.26 minutes on single-end data sets, whereas in eukaryotic data sets, Velvet showed lowest time of 1.26 minutes on single-end data sets with median of 4.90 minutes. ABySS, which is also the DBG assembler, was second in consuming the lowest median time of 1.93 minutes on single-end prokaryotic and eukaryotic data sets. SSAKE was second lowest time-consuming tool (on paired-end prokaryotic data sets) with median time of 1.93 minutes (Figure 1).

Figure 1. — The comparison of total median assembling time of each assembler for (A) paired-end and single-end prokaryotic data sets and (B) paired-end and single-end eukaryotic data sets.

Memory and CPU usage

The maximum assembling memory usage in megabytes (MBs) and CPU usage in percentage (%) were also calculated using Linux command and the assemblers were compared using independent samples test. On prokaryotic paired-end data sets, the results showed that ABySS, the DBG assembler, and Perga, the greedy assembler, consumed the highest amount of memory than any other assembler with a significance of P < .05, whereas SGA, the string graph assembler, and Edena, the OLC assembler, used the least amount of memory than other assemblers. On eukaryotic paired-end data sets, ABySS and Velvet consumed the highest amount of memory, whereas SSAKE used the least amount of memory.

On prokaryotic single-end data sets, SSAKE and Perge, the greedy assemblers, consumed the highest amount of memory, whereas Edena consumed the lowest memory among all.

On eukaryotic single-end data sets, Velvet and Edena consumed the lowest memory among all assembler, whereas Perga and SSAKE consumed the highest amount of memory (Figure 2).

Figure 2. — The mean comparison of memory usage and CPU usage of each assembler for (A) paired-end and single-end prokaryotic data sets and (B) paired-end and single-end eukaryotic data sets.

In terms of CPU usage, ABySS, Velvet, and SGA, the graph-based assemblers, consumed a huge amount of CPU, whereas Edena and SSAKE consumed least amount of CPU on prokaryotic paired-end data sets, whereas SSAKE also consumed least amount of CPU on eukaryotic paired-end data sets.

On prokaryotic and eukaryotic single-end data sets, Ray, Perge, and Velvet consumed huge amount of CPU as compared with SSAKE and Edena which consumed least amount of CPU (Figure 2).

Total number of contigs

For further analysis of assembled contigs, the number of contigs was calculated by running Assemblathon script. In an ideal condition, the minimum number of contigs that matches the whole genome sequence could be generated from each assembly procedure. The results showed that on prokaryotic and eukaryotic paired-end data sets, the Velvet, the DBG assembler, assembled short reads into relatively short contigs and achieved significance of P < .05, whereas in case of single-end data sets, ABySS produced the high number of contigs followed by Velvet and SSAKE. However, SSAKE and Perga produced the low number of contigs on paired-end data sets (Figure 3).

Figure 3. — The comparison of the total number of contigs by median of each assembler for (A) paired-end and single-end prokaryotic data sets and (B) paired-end and single-end eukaryotic data sets.

N50 contig length

N50 contig length was calculated by running Assemblathon script on contig files produced by various assemblers. On prokaryotic and eukaryotic paired-end data sets, ABySS produced high N50 contig length, whereas Velvet produced low N50 contig length.

On prokaryotic single-end data sets, Velvet produced high N50 contig length with a median length of 1530.00 bp followed by ABySS with a median length of 1054.00 bp, whereas SSAKE and Perga produced low N50 contig length with a median length of 260.50 and 348.00 bp. On eukaryotic single-end data sets, Edena produced high N50 contig length with a median length of 57 252.00 bp, whereas Perga produced low N50 contig length with a median length of 12 654.50 bp (Figure 4).

Figure 4. — The comparison of the N50 contig length by median of each assembler for (A) paired-end and single-end prokaryotic data sets and (B) paired-end and single-end eukaryotic data sets.

Genome fraction

By mapping all the contigs onto the reference genomes using QUAST tool, we calculated the genome fraction of all the contigs generated by each assemblers which showed the percentage of aligned contig bases in the reference genome (Table 4). ABySS showed the high number of genome fraction with a mean of 66.3% on paired-end data sets and 69.8% in prokaryotic and eukaryotic single-end data sets. Ray showed second highest genome fraction with a mean of 58.8% followed by Velvet with third highest genome fraction with a mean of 57.1% on prokaryotic paired-end data sets, whereas Perga, Edena, and SGA showed average accuracy with a mean genome fraction of 51.9%, 51.4%, and 50.4% and SSAKE showed worst accuracy with mean genome fraction of 13.2%.

Table 4.

List of all assemblers with their mean genome fraction.

Assembler	Prokaryotic single-end	Prokaryotic paired-end
ABySS	69.8	66.3
Velvet	59.6	57.1
Edena	43.8	51.4
SGA	—	50.4
Ray	48.7	58.8
SSAKE	44.3	13.2
Perga	57.6	51.9
Assembler	Eukaryotic single-end	Eukaryotic paired-end
ABySS	85.4	82.4
Velvet	82.6	85.6
Edena	62.2	90.4
Perga	82.0	83.2
SGA	—	52.4
SSAKE	49.2	74.0

Open in a new tab

On single-end prokaryotic data sets, Velvet showed second highest genome fraction with mean of 59.6% followed by Perga with third highest genome fraction with mean of 57.6%, whereas Ray, SSAKE, and Edena showed average accuracy with mean genome fraction of 48.7%, 44.3%, and 43.8%. On eukaryotic paired-end data sets, Edena showed highest genome fraction with a mean of 90.4% and the second highest Velvet with a mean of 85.6% whereas SSAKE showed lowest genome fraction with a mean of 49.2% and 74.0% in single-end and paired-end data sets (Figure 5). Practically, an assembler which produces the fewer number of contigs, with high N50 and high genome fraction, is considered to be ideal.

Figure 5. — The comparison of mean genome fraction of each assembler for (A) paired-end and single-end prokaryotic data sets and (B) paired-end and single-end eukaryotic data sets.

Discussion

We evaluated the selected assemblers with prokaryotic and eukaryotic paired-end and single-end Illumina-based short reads on a Linux-based server. Our results showed that Ray, the hybrid assembler, takes the highest time to complete the whole genome assembling on prokaryotic paired-end and single-end data sets²¹ but Ray was unable to run on eukaryotic paired-end and single-end data sets because Ray required huge RAM and multiple CPUs for assembling large number of reads. However, the DBG assemblers, Velvet and ABySS, are the best options for both types of data sets because of tremendous assembling speed by consuming the lowest assembling time among all other assemblers, whereas Velvet and ABySS are the best options only for eukaryotic single-end data sets.²² Edena, the OLC assembler, consumed lowest memory on both prokaryotic paired-end and single-end data sets; Velvet and Edena consumed lowest memory on eukaryotic single-end data sets; and SSAKE on eukaryotic paired-end data sets. SGA, the string graph assembler, was also a good choice to assemble paired-end data sets consuming low memory,²³ but in terms of assembler data transformation, SGA consumed more time in indexing, correction, duplication removal, and overlapping steps before assembling that made SGA more complex than Edena, which needs only overlapping step to be performed before assembling. Velvet also consumed less memory on single-end data sets after Edena. In terms of high memory usage, ABySS and SSAKE were on the top on paired-end and single-end data sets, respectively.

In summary, in case of paired-end and single-end prokaryotic genomes, ABySS efficiently produced genome assembly and consumed less amount of time but consumed high amount of memory,²⁴ whereas Velvet proved to be a time-efficient and memory-efficient program for only single-end data sets. Edena was a memory-efficient program for both types of data sets, and SGA was also a memory-efficient program, but it is only available for paired-end data.

ABySS and Velvet also provided high scalability to handle a large amount of data than rest of the assemblers.

In terms of total number of contigs, we found that on paired-end data set, the Velvet, produced the greater number of contigs but low N50 value, whereas ABySS produced the greater number of contigs on single-end data sets and showed high N50 value on both data sets.²⁵ This contrasted with the contigs produced from Edena, SGA, Ray, SSAKE, and Perga that produced the low number of contigs and low N50.

Ideally, contigs with high N50 and high genome fraction were our expectation but Velvet and ABySS worked more conservatively than others when it came to merging small contigs into larger contigs, which gave an assembly with a larger number of contigs.²⁶ There could be a number of different things that might have led to this result such as k-mer size for the assembly, quality of the single-end vs paired-end data, and a bunch of other parameters that could have been used to build the assemblies.

To check the accuracy of genome assembly, the contigs were aligned to their related reference genomes using QUAST tool. ABySS showed high number of genome fraction on both paired-end and single-end data sets followed by Ray on paired-end data sets, and Velvet showed second highest genome fraction on single-end data sets.

Velvet and ABySS could be the best choice for both paired-end and single-end prokaryotic data sets with highest genome fraction among all selected assemblers²⁷ but still there are some improvements needed to be incorporated into ABySS. There are several ways in which ABySS can be improved. ABySS consumption of memory and CPU on paired-end data sets is much higher than single-end data sets. ABySS mostly relies on mate pairs to assemble their contigs. This approach may perform poorly in case of lack of coverage and it has a known issue with deadlocking when using higher k values. So, tackling these issues and decreasing the memory and CPU usage make ABySS to be best in all other assemblers. Many research groups worldwide are working on building better genome assemblers. A group of researchers at the European Bioinformatics Institute²⁸ developed the DBG-based genome assembler Velvet. Canada’s Michael Smith Genome Sciences Centre²⁹ developed ABySS. These research groups are still working on improving their assemblers and they periodically release latest versions of their assemblers. De Bruijn graph–based genome assemblers are considered as the best genome assemblers.³⁰

Conclusions

Our study evaluated 7 de novo sequence assemblers in terms of memory, time, and accuracy. We found that each assembler is capable of assembling whole prokaryotic or whole eukaryotic genome but the hybrid assembler Ray is not capable of assembling whole eukaryotic genome if you have about 4 GB of RAM or less. The selection of the best assembler is dependent on the uniqueness of the data sets and the user requirements. On single-end data sets, Velvet and ABySS, produced generally the best results among all 7 assemblers with comparatively low assembling time and high prokaryotic and eukaryotic genome fractions. Velvet also consumed the lowest memory usage on both single-end data sets. Some improvements are needed in ABySS including reduction in memory and CPU usage. On paired-end data sets, when a large amount of memory is not available, SGA and Edena might be a good choice. The hybrid approach, Ray, also showed high genome fraction; however, extremely high assembling time used by the Ray might make it prohibitively slow on larger data sets.

Supplementary Material

Supplementary material

uomf_supplementary.pdf^{(188.4KB, pdf)}

Acknowledgments

The authors thank Higher Education Commission and Virtual University of Pakistan for their generous support for conducting this research work.

Footnotes

Funding:The author(s) received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests:The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Author Contributions: MTP conceived the idea and guided overall design of experiments and drafting the manuscript. ARK conducted experiments and drafted the manuscript. NN and MEB collected data and analyzed the results. All authors read and approved the final manuscript. MS helped in responding peer reviewers’ comments.

Availability of Data and Materials: All the raw data sets can be downloaded from ENA server; links are available at https://github.com/alrafaykhan/DeNovoGenome/blob/master/Datasets.md.

All the testing codes are available at https://github.com/alrafaykhan/DeNovoGenome/blob/master/Commands.md.

References

1. Sperber GH, Sperber SM. Thompson and Thompson Genetics in Medicine. Baltimore, MD: Elsevier; 2008. [Google Scholar]
2. Buermans HPJ, Den Dunnen JT. Next generation sequencing technology: advances and applications. Biochimica et Biophysica Acta (BBA). 2014;1842:1932–1941. [DOI] [PubMed] [Google Scholar]
3. Grimm D, Hagmann J, Koenig D, Weigel D, Borgwardt K. Accurate indel prediction using paired-end short reads. BMC Genom. 2013;14:132. [DOI] [PMC free article] [PubMed] [Google Scholar]
4. Shendure J, Ji H. Next-generation DNA sequencing. Nat Biotechnol. 2008;26:1135–1145. [DOI] [PubMed] [Google Scholar]
5. Baker M. De novo genome assembly: what every biologist should know. Nat Methods. 2012;9:333. [Google Scholar]
6. Miller JR, Koren S, Sutton G. Assembly algorithms for next-generation sequencing data. Genomics. 2010;95:315–327. [DOI] [PMC free article] [PubMed] [Google Scholar]
7. Kang X, Tang S, Ma Y, Liu R, Wang Y. De Bruijn graph-based genomic sequence assembly algorithms and applications. Paper presented at: Green Computing and Communications (GreenCom), 2013 IEEE and Internet of Things (iThings/CPSCom), IEEE International Conference on and IEEE Cyber, Physical and Social Computing; August 20-23, 2013; Beijing, China, pp. 2094–2097. New York, NY: IEEE. [Google Scholar]
8. Li Z, Chen Y, Mu D, et al. Comparison of the two major classes of assembly algorithms: overlap–layout–consensus and de-Bruijn-graph. Brief Func Genom. 2012;11:25–37. [DOI] [PubMed] [Google Scholar]
9. Pop M, Salzberg SL, Shumway M. Genome sequence assembly: algorithms and issues. Computer. 2002;35:47–54. [Google Scholar]
10. Koren S, Schatz MC, Walenz BP, et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat Biotechnol. 2012;30:693–700. [DOI] [PMC free article] [PubMed] [Google Scholar]
11. Leinonen R, Akhtar R, Birney E, et al. The European Nucleotide Archive. Nuc Acids Res. 2010;39:D28–D31. [DOI] [PMC free article] [PubMed] [Google Scholar]
12. Bradnam KR, Fass JN, Alexandrov A, et al. Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. GigaScience. 2013;2:10. [DOI] [PMC free article] [PubMed] [Google Scholar]
13. Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013;29:1072–1075. [DOI] [PMC free article] [PubMed] [Google Scholar]
14. Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, Birol I. ABySS: a parallel assembler for short read sequence data. Genome Res. 2009;19:1117–1123. [DOI] [PMC free article] [PubMed] [Google Scholar]
15. Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008;18:821–829. [DOI] [PMC free article] [PubMed] [Google Scholar]
16. Hernandez D, François P, Farinelli L, Østerås M, Schrenzel J. De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Res. 2008;18:802–809. [DOI] [PMC free article] [PubMed] [Google Scholar]
17. Simpson JT, Durbin R. Efficient de novo assembly of large genomes using compressed data structures. Genome Res. 2012;22:549–556. [DOI] [PMC free article] [PubMed] [Google Scholar]
18. Boisvert S, Laviolette F, Corbeil J. Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies. J Comput Biol. 2010;17:1519–1533. [DOI] [PMC free article] [PubMed] [Google Scholar]
19. Warren RL, Sutton GG, Jones SJ, Holt RA. Assembling millions of short DNA sequences using SSAKE. Bioinformatics. 2007;23:500–501. [DOI] [PMC free article] [PubMed] [Google Scholar]
20. Zhu X, Leung HC, Chin FY, et al. PERGA: a paired-end read guided de novo assembler for extending contigs using SVM and look ahead approach. PLoS ONE. 2014;9:e114253. [DOI] [PMC free article] [PubMed] [Google Scholar]
21. Sato K, Sakakibara Y. SL: an extension of the Velvet assembler to a de novo metagenomic assembler utilizing supervised learning. DNA Res. 2015;22:69–77. [DOI] [PMC free article] [PubMed] [Google Scholar]
22. Liu Y, Schmidt B, Maskell DL. Parallelized short read assembly of large genomes using de Bruijn graphs. BMC Bioinformatics. 2011;12:354. [DOI] [PMC free article] [PubMed] [Google Scholar]
23. Dinh H, Rajasekaran S. A memory-efficient data structure representing exact-match overlap graphs with application for next-generation DNA assembly. Bioinformatics. 2011;27:1901–1907. [DOI] [PMC free article] [PubMed] [Google Scholar]
24. Haiminen N, Kuhn DN, Parida L, Rigoutsos I. Evaluation of methods for de novo genome assembly from high-throughput sequencing reads reveals dependencies that affect the quality of the results. PLoS ONE. 2011;6:e24182. [DOI] [PMC free article] [PubMed] [Google Scholar]
25. Boetzer M, Henkel CV, Jansen HJ, Butler D, Pirovano W. Scaffolding pre-assembled contigs using SSPACE. Bioinformatics. 2011;27:578–579. [DOI] [PubMed] [Google Scholar]
26. Pell J, Hintze A, Canino-Koning R, Howe A, Tiedje JM, Brown CT. Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. Proc Nat Acad Sci. 2012;109:13272–13277. [DOI] [PMC free article] [PubMed] [Google Scholar]
27. Peng Y, Leung HC, Yiu SM, Chin FY. IDBA—a practical iterative de Bruijn graph de novo assembler. Paper presented at: Annual International Conference on Research in Computational Molecular Biology; April 25-28, 2010; Lisbon, Portugal, pp. 426–440. Berlin, Germany; Heidelberg, Germany: Springer. [Google Scholar]
28. European Bioinformatics Institute Cambridgeshire UK. http://www.ebi.ac.uk; September 1994.
29. Canada’s Michael Smith Genome Sciences Centre, Vancouver, BC, Canada. http://www.bcgsc.ca; 1999.
30. Compeau PE, Pevzner PA, Tesler G. How to apply de Bruijn graphs to genome assembly. Nat Biotechnol. 2011;29:987–991. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary material

uomf_supplementary.pdf^{(188.4KB, pdf)}

[bibr1-1176934318758650] 1. Sperber GH, Sperber SM. Thompson and Thompson Genetics in Medicine. Baltimore, MD: Elsevier; 2008. [Google Scholar]

[bibr2-1176934318758650] 2. Buermans HPJ, Den Dunnen JT. Next generation sequencing technology: advances and applications. Biochimica et Biophysica Acta (BBA). 2014;1842:1932–1941. [DOI] [PubMed] [Google Scholar]

[bibr3-1176934318758650] 3. Grimm D, Hagmann J, Koenig D, Weigel D, Borgwardt K. Accurate indel prediction using paired-end short reads. BMC Genom. 2013;14:132. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bibr4-1176934318758650] 4. Shendure J, Ji H. Next-generation DNA sequencing. Nat Biotechnol. 2008;26:1135–1145. [DOI] [PubMed] [Google Scholar]

[bibr5-1176934318758650] 5. Baker M. De novo genome assembly: what every biologist should know. Nat Methods. 2012;9:333. [Google Scholar]

[bibr6-1176934318758650] 6. Miller JR, Koren S, Sutton G. Assembly algorithms for next-generation sequencing data. Genomics. 2010;95:315–327. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bibr7-1176934318758650] 7. Kang X, Tang S, Ma Y, Liu R, Wang Y. De Bruijn graph-based genomic sequence assembly algorithms and applications. Paper presented at: Green Computing and Communications (GreenCom), 2013 IEEE and Internet of Things (iThings/CPSCom), IEEE International Conference on and IEEE Cyber, Physical and Social Computing; August 20-23, 2013; Beijing, China, pp. 2094–2097. New York, NY: IEEE. [Google Scholar]

[bibr8-1176934318758650] 8. Li Z, Chen Y, Mu D, et al. Comparison of the two major classes of assembly algorithms: overlap–layout–consensus and de-Bruijn-graph. Brief Func Genom. 2012;11:25–37. [DOI] [PubMed] [Google Scholar]

[bibr9-1176934318758650] 9. Pop M, Salzberg SL, Shumway M. Genome sequence assembly: algorithms and issues. Computer. 2002;35:47–54. [Google Scholar]

[bibr10-1176934318758650] 10. Koren S, Schatz MC, Walenz BP, et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat Biotechnol. 2012;30:693–700. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bibr11-1176934318758650] 11. Leinonen R, Akhtar R, Birney E, et al. The European Nucleotide Archive. Nuc Acids Res. 2010;39:D28–D31. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bibr12-1176934318758650] 12. Bradnam KR, Fass JN, Alexandrov A, et al. Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. GigaScience. 2013;2:10. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bibr13-1176934318758650] 13. Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013;29:1072–1075. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bibr14-1176934318758650] 14. Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, Birol I. ABySS: a parallel assembler for short read sequence data. Genome Res. 2009;19:1117–1123. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bibr15-1176934318758650] 15. Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008;18:821–829. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bibr16-1176934318758650] 16. Hernandez D, François P, Farinelli L, Østerås M, Schrenzel J. De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Res. 2008;18:802–809. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bibr17-1176934318758650] 17. Simpson JT, Durbin R. Efficient de novo assembly of large genomes using compressed data structures. Genome Res. 2012;22:549–556. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bibr18-1176934318758650] 18. Boisvert S, Laviolette F, Corbeil J. Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies. J Comput Biol. 2010;17:1519–1533. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bibr19-1176934318758650] 19. Warren RL, Sutton GG, Jones SJ, Holt RA. Assembling millions of short DNA sequences using SSAKE. Bioinformatics. 2007;23:500–501. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bibr20-1176934318758650] 20. Zhu X, Leung HC, Chin FY, et al. PERGA: a paired-end read guided de novo assembler for extending contigs using SVM and look ahead approach. PLoS ONE. 2014;9:e114253. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bibr21-1176934318758650] 21. Sato K, Sakakibara Y. SL: an extension of the Velvet assembler to a de novo metagenomic assembler utilizing supervised learning. DNA Res. 2015;22:69–77. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bibr22-1176934318758650] 22. Liu Y, Schmidt B, Maskell DL. Parallelized short read assembly of large genomes using de Bruijn graphs. BMC Bioinformatics. 2011;12:354. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bibr23-1176934318758650] 23. Dinh H, Rajasekaran S. A memory-efficient data structure representing exact-match overlap graphs with application for next-generation DNA assembly. Bioinformatics. 2011;27:1901–1907. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bibr24-1176934318758650] 24. Haiminen N, Kuhn DN, Parida L, Rigoutsos I. Evaluation of methods for de novo genome assembly from high-throughput sequencing reads reveals dependencies that affect the quality of the results. PLoS ONE. 2011;6:e24182. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bibr25-1176934318758650] 25. Boetzer M, Henkel CV, Jansen HJ, Butler D, Pirovano W. Scaffolding pre-assembled contigs using SSPACE. Bioinformatics. 2011;27:578–579. [DOI] [PubMed] [Google Scholar]

[bibr26-1176934318758650] 26. Pell J, Hintze A, Canino-Koning R, Howe A, Tiedje JM, Brown CT. Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. Proc Nat Acad Sci. 2012;109:13272–13277. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bibr27-1176934318758650] 27. Peng Y, Leung HC, Yiu SM, Chin FY. IDBA—a practical iterative de Bruijn graph de novo assembler. Paper presented at: Annual International Conference on Research in Computational Molecular Biology; April 25-28, 2010; Lisbon, Portugal, pp. 426–440. Berlin, Germany; Heidelberg, Germany: Springer. [Google Scholar]

[bibr28-1176934318758650] 28. European Bioinformatics Institute Cambridgeshire UK. http://www.ebi.ac.uk; September 1994.

[bibr29-1176934318758650] 29. Canada’s Michael Smith Genome Sciences Centre, Vancouver, BC, Canada. http://www.bcgsc.ca; 1999.

[bibr30-1176934318758650] 30. Compeau PE, Pevzner PA, Tesler G. How to apply de Bruijn graphs to genome assembly. Nat Biotechnol. 2011;29:987–991. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

A Comprehensive Study of De Novo Genome Assemblers: Current Challenges and Future Prospective

Abdul Rafay Khan

Muhammad Tariq Pervez

Masroor Ellahi Babar

Nasir Naveed

Muhammad Shoaib

Abstract

Background:

Results:

Conclusions:

Introduction

Materials and Methods

Data sets

Table 1.

Table 2.

Genome assemblers

Table 3.

Efficiency evaluation

Accuracy evaluation

Statistical analyses

Results

Total assembling time

Figure 1.

Memory and CPU usage

Figure 2.

Total number of contigs

Figure 3.

N50 contig length

Figure 4.

Genome fraction

Table 4.

Figure 5.

Discussion

Conclusions

Supplementary Material

Acknowledgments

Footnotes

References

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases