Contiguous and accurate de novo assembly of metazoan genomes with modest long read coverage

Mahul Chakraborty; James G Baldwin-Brown; Anthony D Long; J J Emerson

doi:10.1093/nar/gkw654

2016 Jul 25;44(19):e147. doi: 10.1093/nar/gkw654

PMCID: PMC5100563 PMID: 27458204

Abstract

Genome assemblies that are accurate, complete and contiguous are essential for identifying important structural and functional elements of genomes and for identifying genetic variation. Nevertheless, most recent genome assemblies remain incomplete and fragmented. While long molecule sequencing promises to deliver more complete genome assemblies with fewer gaps, concerns about error rates, low yields, stringent DNA requirements and uncertainty about best practices may discourage many investigators from adopting this technology. Here, in conjunction with the platinum standard Drosophila melanogaster reference genome, we analyze recently published long molecule sequencing data to identify what governs completeness and contiguity of genome assemblies. We also present a hybrid meta-assembly approach that achieves remarkable assembly contiguity for both Drosophila and human assemblies with only modest long molecule sequencing coverage. Our results motivate a set of preliminary best practices for obtaining accurate and contiguous assemblies, a ‘missing manual’ that guides key decisions in building high quality de novo genome assemblies, from DNA isolation to polishing the assembly.

INTRODUCTION

De novo genome assembly is the process of stitching DNA fragments together into contiguous segments (contigs) representing an organism's chromosomes (1). Until recently, genomes were often assembled using fragments shorter than 1000 bp. However, such assemblies tend to be highly fragmented when they are generated using sequencing reads shorter than common repeats (1–4). Paired end short reads from different sized longer inserts can improve contiguity, but uncertainty of fragment length and the lack of sequence between the insert ends make resolving many repetitive structures challenging (5). Longer reads can circumvent this problem, even when such reads exhibit error rates as high as 20% (5–8). Importantly, error-prone reads can be corrected, provided there is sufficient coverage and the errors are approximately uniformly distributed. Single molecule sequencing, like that offered by Pacific Biosciences (PacBio), meets these criteria with reads that are routinely tens of kilobases in length (5,9–11). While PacBio sequences have high error rates (∼15%), errors are nearly uniformly distributed across sequences (5). With sufficient coverage, these sequences can be used to correct themselves (12). Assemblies using such correction are referred to as PacBio only assemblies (13). Alternatively, hybrid assembly can be performed using a combination of noisy PacBio long molecules and high quality short reads (e.g. Illumina) (11,14).

Recently, the value of long molecule sequencing has been definitively demonstrated with the release of several high quality reference-grade genomes assembled from PacBio sequencing data (10,13,15). Indeed, the Drosophila PacBio assembly closed gaps in the reference genome assembly (13), which is often considered the most contiguous metazoan genome assembly. Despite these successes, shepherding a genome project through the process of DNA isolation, sequencing and assembly is still a challenge, especially for research groups for whom genomes are a means to another goal rather than the goal itself. For example, because high quality genome assembly relies upon long sequencing reads to bridge repetitive genomic regions (6,8,16,17) and high coverage to circumvent read errors (4,7,12), the stringent DNA isolation requirements (size, quantity and purity) for PacBio sequencing (10) intended for genome assembly are higher than those typically employed. Moreover, at present, the low average read quality produced by PacBio sequencing causes coverage requirements to be at least 50-fold (5,13,15). This requirement, combined with the comparatively expensive sequencing, makes striking the right balance between price and assembly quality important. Exacerbating the problem is the fact that rediscovering the optimal approach for a genome project is itself expensive and time consuming. As a consequence of these challenges and uncertainties, many groups may opt out of a long molecule approach, or worse, sink scarce resources into an approach ill-suited for their goals because the consequences of many decisions involved in long molecule sequencing projects have not been synthesized.

In order to optimize a strategy for genome assembly we investigated the consequences of sample preparation (i.e. DNA isolation, quality control, shearing, library loading, etc.), assembly strategies and properties of the data (i.e. read quality, length and read filtering). We first evaluate strategies for assembling PacBio reads, and how they perform with differing amounts of sequence coverage. Then, we assess the contribution of read length and read quality to assembly contiguity. We also introduce quickmerge, a simple, fast and general meta-assembler that merges assemblies to generate a more contiguous assembly. Additionally, we describe the protocols, quality-control practices and size selection strategies that consistently yield high quality DNA reads required for reference grade genome assemblies. Our strategy is flexible enough to yield high quality assemblies using as little as 25× long molecule coverage or as much as >100×.

MATERIALS AND METHODS

Preparing high quality DNA library for long reads

Obtaining high quality, high molecular weight (HMW) genomic DNA

We used Qiagen's Blood and Cell culture DNA Midi Kit for DNA extraction. As single molecule technologies (PacBio and Oxford Nanopore) do not require any sequence amplification step, a large amount of tissue is required to ensure enough DNA for library preparations that opt for no amplification (as is standard for genome assembly sequencing). For flies, 200 female or 250 male flies are sufficient for optimal yield (40–60 μg DNA) from a single anion-exchange column. For other organisms, the number of individuals needs to be adjusted based on the tissue mass. A good rule of thumb is to keep the total amount of input tissue at 100–150 mg for optimal yield from each column.

To extract genomic DNA, 0–2 day old flies were starved for 2 h, flash frozen in liquid nitrogen and then ground into fine powder using a mortar and pestle pre-chilled with liquid nitrogen. The tissue powder was directly transferred into 9.5 ml of buffer G2 premixed with 38 μl of RNaseA (100 mg/ml) and then 250 μl (0.75 AU) of protease (Qiagen) was added to the tissue homogenate. The volume of protease can be increased to 500 μl (1.5 AU) to reduce the time of proteolysis. The tissue powder was mixed with the buffer by inverting the tube several times, ensuring that there were no large tissue clumps present in the solution. The homogenate was then incubated at 50°C overnight with gentle shaking (with 500 μl protease, this incubation time can be reduced to 2 h or less).

The next day, the sample was taken out of the incubator shaker and centrifuged at 5000 × g for 10 min at 4°C to precipitate the tissue debris. The supernatant was decanted into a fresh 15 ml tube. The little remaining particulate debris in the tube was removed with a 1 ml pipette. The sample was then vortexed for 5 s to increase the flow rate of the sample inside the column and then poured into the anion-exchange column. The column was washed and the DNA was eluted following the manufacturer's protocol. Genomic DNA was precipitated with 0.7 volumes of isopropanol and resuspended in Tris buffer (pH 8.0). For storage of 1 week or less, we kept the DNA at 4°C to minimize freeze-thaw cycles; for longer storage, we kept the DNA at −20°C.

Shearing the DNA

1.5″ blunt end needles (Jensen Global, Santa Barbara, CA, USA) were used to shear the DNA. The needle size can be varied to obtain DNA of different length distributions: 24 gauge needles produce a size range of 24–50 kb. To obtain larger fragments, wider needles (<24 gauge) need to be used. For the DNA we have sequenced, up to 200 μg of high molecular weight raw genomic DNA was sheared using the 24 gauge needle (Figure 1). Additionally, we have also sheared DNA with 21, 22 and 23 gauge needles to demonstrate the size distribution they generate (Supplementary Figure S1). In brief, the entire DNA solution is drawn into a 1 ml Luer syringe and dispensed quickly through the needle. This step is repeated 20 times to obtain the desired distribution of fragment sizes.

Figure 1. — An example of correctly extracted and sheared DNA visualized using field inversion gel electrophoresis. The ladder is the NEB low range PFG marker (no longer produced). The lanes of the gel are as follows: (A) ladder, (B) unsheared DNA, (C) DNA sheared with a 24 gauge needle, (D) sheared DNA size selected with 15–50 kb cut-off, (E) SMRTbell template library after 15–50 kb size selection. From the gel, it is evident that there is a minimal ‘tail’ of DNA below ∼15 kb, the preferred size selection minimum.

Quality control using FIGE

We verified the size distribution of unsheared and sheared genomic DNA using field inversion gel electrophoresis (FIGE), which allows separation of high molecular weight DNA. The DNA is run on a 1% agarose gel (0.5 × Tris Borate Ethylenediaminetetraacetate, i.e. TBE) with a pulse field gel ladder (New England Biolabs, Ipswich, MA, USA). The gel is run at 4°C overnight in 0.5 × TBE. To avoid temperature or pH gradient buildup, a pump is used to circulate the buffer. The FIGE was run using a BioRad Pulsewave 760 and a standard power supply with the following run conditions: initial time A: 0.6 s, final time B: 2.5 s, ratio: 3, run time: 8 h, MODE: 10, initial time A: 2.5 s, final time B: 8 s, ratio: 3, run time: 8 h, MODE: 11, voltage: 135 V.

Library preparation

The needle sheared DNA is quantified with a Qubit fluorometer (Life Technologies, Grand Island, NY, USA) and a NanoDrop (Thermo Scientific, Wilmington, DE, USA). Following quantification, 20 μg of sheared DNA was optionally run in four lanes of the Blue Pippin size selection instrument (Sage Science, Beverly, MA, USA) using 15–50 kb as the cut-offs for size selection (Figure 1). This optional size selection step increases final library yield at the cost of requiring more input DNA. This size selected DNA is then used to prepare a SMRTbell template library following PacBio's protocol. A second round of size selection is performed on the SMRTbell template using a 15–50 kb cutoff to remove the smaller fragments generated during the SMRTbell library preparation step (Figure 1). The second step minimizes the number of DNA fragments less than 15 kb subjected to sequencing.

DNA sequencing

PacBio sequencing was conducted to establish length distributions (Drosophila simulans; Figure 2A) and evaluate the impact of library preparation on quality (Figure 3), and was performed at the UCI high-throughput core facility using DNA isolated using the protocol described above. We note that the D. simulans reads were not used for assemblies reported here; all of our assemblies are constructed with publicly available Drosophila melanogaster (10) and Homo sapiens data (11). We sequenced one SMRTcell of Drosophila genomic DNA with the following conditions to obtain sequences with standard quality and length distribution: 10:1 polymerase to template ratio, 250 pM template concentration and P6C4 chemistry. The movie time and other conditions were standard for RSII P6C4 chemistry. To demonstrate the tradeoff between yield and quality, we sequenced one SMRTcell each for polymerase:template ratios of 40:1, 80:1 and 100:1 with template concentration held constant at 200 pM and one SMRTcell each with 300 and 400 pM template concentration with the polymerase:template ratio being held constant at 10:1.

Figure 3. — The distribution of read quality in sequencing runs performed at the UCI genomics core using our DNA preparation technique. ‘P’ here refers to polymerase loading during sequencing (the proportion of polymerase to template, where 10 would indicate a 10:1 ratio of polymerase to template), while ‘T’ refers to template loading concentration during sequencing (in picomolarity).

PacBio only assembly

For PacBio sequences, the assembly pipeline is divided into three parts: correction, assembly and polishing. Correction reduces the error rate in the reads to 0.5–1% (13), and is necessary because reads with a high (∼15%) error rate are extremely difficult to assemble (17). Correction is facilitated by high PacBio coverage, which allows the error corrector to successfully ‘vote out’ errors in the PacBio reads. For self correction, we used the PBcR pipeline (13) as implemented in wgs8.3rc1 which, by default, corrects the longest 40× reads. The second step involves assembling the corrected reads into contigs. We used the Celera assembler (17), included in the same wgs package, for assembly. A third optional step involves polishing the contigs using Quiver and Pilon (18,19), which brings the error rate down to 0.01% or lower. All of the assemblies described in this paper were generated with the same PBcR command and spec file (commands and settings, Supplementary Data).

For PacBio only assembly of D. melanogaster ISO1 sequences, we used a publicly available PacBio sequence dataset which was generated using the standard P5C3 chemistry. A complete description of this data is available in Kim et al. (10). We chose the D. melanogaster dataset for our experiments and simulations because D. melanogaster is widely used in genetics and genomics research and its reference sequence (release 5.57, http://www.fruitfly.org) is one of the best, if not the best, eukaryotic multicellular genome assemblies in terms of assembly contiguity. This is true for both the PacBio generated assembly (21 Mb contig N50) (13) and the Sanger assembly (23 Mb scaffold N50) of ISO1. The remarkable contiguity of these assemblies becomes more tangible when the theoretical limits of D. melanogaster chromosome arms’ lengths are considered (20): the N50s of both assemblies lie very close to the theoretical maximum N50 (∼28 Mb). This high quality assembly serves as a reference for evaluating assemblies presented here.

We evaluated assembly qualities using the standard assembly statistics (average contig size, number of contigs, assembled genome size, N50, etc.) using the Quast and GAGE (21,22) packages.
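
The contiguity statistics named above can be illustrated with a minimal Python sketch. It is not the Quast or GAGE implementation; the function name `nx50` and the toy contig lengths are hypothetical. N50 is taken as the length of the contig at which the running sum of contig lengths, sorted from longest to shortest, first reaches half of the assembled size; NG50 uses half of an assumed genome size instead.

```python
def nx50(contig_lengths, total=None):
    """Return N50 when total is None, or NG50 when total is a genome size."""
    lengths = sorted(contig_lengths, reverse=True)
    # Half of the assembly size (N50) or half of the genome size (NG50).
    target = (total if total is not None else sum(lengths)) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= target:
            return length
    return 0  # the contigs cover less than half of `total`


toy_contigs = [50, 40, 30, 20, 10]       # hypothetical contig lengths
n50 = nx50(toy_contigs)                  # half of 150 is reached at the 40 contig
ng50 = nx50(toy_contigs, total=200)      # half of 200 is reached at the 30 contig
```

Because NG50 is normalized by a fixed genome size rather than by each assembly's own total length, it allows assemblies of different total size to be compared on the same scale.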

Hybrid assembly

PacBio only assembly of high error, long molecule sequences depends upon redundancy between the various low quality reads to ‘vote out’ errors and identify the true sequence in the sequenced individual. An alternative approach to this problem is to use known high quality sequencing reads to correctly call the bases in the sequence, and then to use PacBio reads to identify the connectivity of the genome. In order to achieve the best possible assembly results, we tested several different hybrid assembly pipelines before choosing DBG2OLC (https://arxiv.org/abs/1410.2801, https://sites.google.com/site/dbg2olc/) and Platanus (23). In our early tests, the next highest performing hybrid assembler, a combination of ECTools (https://github.com/jgurtowski/ectools) and Celera, achieved a highest N50 of 616 kb in Arabidopsis thaliana using 19 SMRT cells of data; in contrast, using 20 SMRT cells of the same data, the DBG2OLC and Platanus pipeline produced an N50 of 4.8 Mb. We also tested the alternative error corrector, LoRDEC (24), along with the Celera assembler, but found that the LoRDEC-corrected Celera assembly of our standard D. melanogaster dataset (26× of PacBio data and 67.4× of Illumina data (25)) produced an NG50 of only 109 kb. Consequently we adopted DBG2OLC as our choice for hybrid assembly. We were not able to exhaustively test all hybrid error correction approaches of PacBio reads followed by overlap assembly and acknowledge that other tools that may operate quite differently (e.g. LSC (26)) could potentially lead to further improvements in the assembly. Using the standard 67.4× of Illumina data discussed above and 26× of PacBio data, we compared DBG2OLC runs using three different De Bruijn graph assemblers: SOAP (27), ABySS (28) and Platanus. The NG50s for the three assemblies were, respectively, 2.43, 0.167 and 3.59 Mb. Based on this result, we chose to use Platanus for the remainder of the assemblies.

We used the pipeline recommended by DBG2OLC to perform hybrid assemblies. In this pipeline, we used Platanus to perform De Bruijn graph assembly on the Illumina reads. We used 8.36 Gb (67.4×) of Illumina sequence data of the ISO1 D. melanogaster inbred line generated by the DPGP project (25) to generate a De Bruijn graph assembly using Platanus. We used DBG2OLC to align our PacBio reads to the De Bruijn graph assembly to produce a ‘backbone’, then, according to the DBG2OLC standard pipeline, used the backbone to generate the consensus using the programs BLASR (29) and PBDagCon (https://github.com/PacificBiosciences/pbdagcon). As with the PacBio only assemblies above, we evaluated assembly quality using the Quast and GAGE packages.

Assembly merging

Hybrid assembly and PacBio assembly were merged using a custom C++ program called quickmerge (Figure 4A, available at https://github.com/mahulchak/quickmerge). The program takes two fasta files (containing contigs from a PacBio only assembly and contigs from a hybrid assembly) as inputs and splices contigs from the two assemblies together to produce an assembly with higher contiguity. As the two assemblies used for merging come from the same genome, gaps in one assembly can be bridged using corresponding sequences from the other assembly. The first stage of the assembly merging process involves correctly aligning the corresponding sequences (contigs), which in the second stage are exchanged at the sequence gaps so that the part of the sequence with the gap is replaced with a contiguous sequence from the other assembly. The program MUMmer (30) is used to find the correct alignment between the assemblies and assembly merging is handled by quickmerge.

First, the program MUMmer (30) is used to compute the unique alignments between the contigs from the two assemblies, one of which is used as the reference, or donor, assembly and the other is used as the query, or acceptor, assembly. Distinction between the two assemblies is important because, as described below, the user may choose the more reliable, i.e. with fewer errors, of the two assemblies to bridge gaps in the other assembly. Accurate merging occurs when true correspondence between two sequences is high; conversely, pairing between incorrectly matching regions leads to incorporation of incorrect sequences. Hence, identification of the correct pairing is necessary for error-free sequence merging. Presence of repeats may complicate the situation, but the problem can largely be overcome if the two aligned sequences containing repeats come from the same genome and only the unique best alignments are considered. To obtain the unique best alignment between the reference and the query assembly, spurious matches introduced by gene duplications and repeats are removed using the delta-filter utility (with -r and -q options) of the MUMmer package.
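
The alignment and filtering step can be sketched as a pair of MUMmer command lines, built here in Python. This is a sketch under assumptions, not the exact invocation used for the assemblies in this paper (those commands are in the Supplementary Data): the file names, the output prefix and the helper name `mummer_commands` are hypothetical, and only the nucmer prefix option and the delta-filter `-r` and `-q` options named in the text are set.

```python
import subprocess


def mummer_commands(reference_fasta, query_fasta, prefix="out"):
    """Build nucmer and delta-filter argument lists (nothing is executed here)."""
    # nucmer writes its alignments to <prefix>.delta
    nucmer = ["nucmer", "-p", prefix, reference_fasta, query_fasta]
    # -r and -q keep the best alignment per reference and per query position
    delta_filter = ["delta-filter", "-r", "-q", f"{prefix}.delta"]
    return nucmer, delta_filter


def run_alignment(reference_fasta, query_fasta, prefix="out"):
    """Run both steps; requires MUMmer to be installed and on PATH."""
    nucmer, delta_filter = mummer_commands(reference_fasta, query_fasta, prefix)
    subprocess.run(nucmer, check=True)
    # delta-filter prints the filtered alignments to standard output
    with open(f"{prefix}.rq.delta", "w") as handle:
        subprocess.run(delta_filter, check=True, stdout=handle)
    return f"{prefix}.rq.delta"


# Hypothetical file names, shown only to illustrate the argument order.
nucmer_cmd, filter_cmd = mummer_commands("reference.fasta", "query.fasta")
```

The filtered delta file is the kind of input a merging step would then consume; which assembly plays the reference (donor) role is the user's choice, as discussed above.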

Following the repeat filtering step, the alignments are partitioned using a scoring metric called high confidence overlaps (HCOs; Figure 4B). The program identifies HCOs by dividing the total alignment length between contigs by the length of unaligned but overlapping regions of the alignment partners (Figure 4B). The metric was chosen under the assumption that the length of the overlapping but unaligned portion between the two sequences relative to the length of the overlapping and aligned parts is high for two unrelated sequences. After the alignment partitioning is done based on an HCO cutoff, only the contig alignments above the HCO cutoff are kept for assembly merging. For fly assemblies, we found that an HCO value of 1.5 was an appropriate default for assembly merges. This cutoff can be increased further, as we did for merging human assemblies. The tradeoff is that increasing the HCO cutoff will gradually deplete the pool of matching alignments, thereby leading to a reduction in merging events. Thus, the ‘HCO’ parameter trades merging sensitivity against false positives: the higher the HCO parameter value, the more stringent the cutoff for HCO selection.
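
The HCO ratio described above can be illustrated with a small Python sketch. This is one reading of the metric, not the quickmerge source code: it assumes a single forward-strand alignment block per contig pair, 0-based half-open coordinates, and that the "unaligned but overlapping" length is the part on each side of the alignment where both contigs still extend. The function name `hco_score` and the example coordinates are hypothetical.

```python
import math


def hco_score(ref_len, ref_start, ref_end, qry_len, qry_start, qry_end):
    """Aligned length divided by the unaligned-but-overlapping length."""
    aligned = min(ref_end - ref_start, qry_end - qry_start)
    # On each side, both contigs overlap only as far as the shorter overhang.
    left_overhang = min(ref_start, qry_start)
    right_overhang = min(ref_len - ref_end, qry_len - qry_end)
    unaligned = left_overhang + right_overhang
    # A clean end-to-end overlap has no unaligned overhang at all.
    return math.inf if unaligned == 0 else aligned / unaligned


HCO_CUTOFF = 1.5  # default reported in the text for fly assemblies

# Hypothetical alignments: (ref_len, ref_start, ref_end, qry_len, qry_start, qry_end)
alignments = {
    "clean_overlap": (1000, 800, 1000, 900, 0, 200),
    "mostly_aligned": (1000, 700, 950, 900, 50, 300),
    "internal_match": (1000, 500, 700, 900, 300, 500),
}
scores = {name: hco_score(*coords) for name, coords in alignments.items()}
kept = [name for name, score in scores.items() if score > HCO_CUTOFF]
```

In this toy set, the short internal match flanked by long unaligned stretches on both contigs scores well below the cutoff and is discarded, which is the behaviour the metric is meant to capture for unrelated sequences.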

The next step involves searching and ordering the contigs that will be merged. To accomplish that, by default quickmerge assigns nodes in the HCO alignment graph with even higher HCO values (>5.0) and reference sequences exceeding a length cutoff (1 Mb) as anchor nodes. The high HCO and the length cutoff are used here to ensure that subsequent searches for contigs for merged contig extension do not begin at spurious alignment nodes. Following the assignment of the anchor nodes, a greedy search is initiated on both the left and the right sides (5′ and 3′ of the reference contig) of the anchor node, in order to find the longest unbroken path through the HCO nodes. In other words, quickmerge looks for contigs that connect two adjacent HCO nodes in the graph and this process is continued until no contig can be found to connect two HCO nodes (e.g. a genomic region where both assemblies are broken). For the search, each contig is used only once to connect two HCO nodes, so once a contig from the HCO alignment pool has been used, it is removed from the alignment pool. Query contigs that are completely contained within a reference contig are also removed from the final merged assembly to prevent sequence duplication in the merged assembly.

In the final step, the ordered chain of contigs found in the previous step is joined by swapping portions of the reference assembly into the query assembly in a manner that maximizes retention of sequences from the reference assembly (Figure 4A). Gap filling within the query assembly occurs as a byproduct of this replacement of sequences; in this way, the process resembles genome editing using homologous recombination.

For coverages of 40×, 53×, 62× and 77×, merged assemblies were generated using the PacBio only assembly and their corresponding hybrid assemblies. For the 99× and 121× (all reads) SMRTcells datasets, the PacBio only assemblies were merged with the hybrid assembly obtained from the 77× SMRTcells dataset. All hybrid assemblies used for merging were generated without downsampling by read length or quality. The time to merge was limited only by the time required to run MUMmer, as quickmerge runs in less than 30 s on Drosophila-sized genomes, and requires <2 GB of memory.

Downsampling

We used a number of different downsampling schemes on the D. melanogaster data: first, we randomly downsampled the data by drawing a random set of SMRTcells of data from the entire set of 42 SMRTcells; second, from those datasets, we downsampled the longest 50 and 75% of the reads. Next, we downsampled the D. melanogaster data to match the read length distributions of PacBio reads from a pilot Drosophila pseudoobscura genome project that was produced using a standard protocol without aggressive size selection (generously made available by Stephen Richards). Finally, we downsampled based on read quality to test the effect of read quality on assembly contiguity. Please see Supplementary Data for more details.

RESULTS

DNA isolation for long reads

As the remainder of the paper will show, read length is an important determinant of genome assembly contiguity. We identified a simple and consistent method for isolation of large genomic DNA fragments necessary for PacBio sequencing to achieve long reads. The existing alternative method used for DNA isolation to generate the published PacBio Drosophila assembly involved DNA extraction by CsCl density gradient centrifugation and g-Tube (Covaris, Woburn, MA, USA) based DNA shearing (10). CsCl gradient centrifugation is a time-consuming method that requires expensive equipment that is not routinely found in most labs. Additionally, g-Tubes are expensive, require specific centrifuges and are extremely sensitive to both the total mass of DNA input and to its length. We circumvented these problems by using a widely available DNA gravity flow anion exchange column extraction kit in concert with a blunt needle shearing method (31). Because the DNA fragment size distribution is so important, FIGE is an essential quality control step to validate the length distribution of the input DNA (Figure 1) (see ‘Materials and Methods’ section for details). Sequences generated from libraries constructed from this isolation method are comparable to or longer than the published Drosophila PacBio reads (10) (Figure 2A). The length distribution of the input DNA can potentially be improved further by using wider gauge needles that generate even longer DNA fragments (Supplementary Figure S1).

Long read assembly

PacBio self correction has been used to assemble the D. melanogaster reference strain (ISO1) genome so contiguously that most chromosome arms were represented by fewer than 10 contigs (13). This assembly was generated by using the PBcR pipeline (13) and 121× (15.8 Gb), or 42 SMRTcells’ worth, of PacBio long molecule sequences (13). However, currently, such high coverage may be too expensive for many projects, especially when the genome of the target organism is large. Consequently, we set out to determine how much sequence data is required to obtain assemblies of desired contiguity. We first selected reads from 15, 20, 25, 30 and 35 randomly chosen SMRTcells (40×, 53×, 62×, 77× and 99× assuming a genome size of 130 × 10⁶ bp; coverages were calculated by dividing total bases of sequence data by total bases in the genome) from the 42 SMRTcells of ISO1 PacBio reads (10). Our sampling method was inclusive and additive: for example, to obtain 20 SMRTcells, we took the 15 previously randomly chosen SMRTcells and then added 5 more randomly selected SMRTcells to them. We then assembled these datasets using the PBcR pipeline. As shown in Figure 5, the contig NG50 (NG50; G = 130 × 10⁶ bp) continues to improve across the entire range of coverage. At extremely high coverage (121×), the NG50 surges again, approaching the theoretical N50 limit of the D. melanogaster genome (20). Notably, despite the extreme contiguity of these sequences, we are still discussing complete contigs, not scaffolds with gaps.
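
The coverage calculation and the additive SMRTcell sampling described above can be sketched in Python. This is an illustration rather than the script used for the paper: the helper names, the integer cell identifiers and the random seed are hypothetical, and shuffling once and taking prefixes is simply one convenient way to make each larger subset contain every smaller one.

```python
import random

GENOME_SIZE = 130e6  # genome size assumed in the text (130 x 10^6 bp)


def coverage(total_bases, genome_size=GENOME_SIZE):
    """Fold coverage: total sequenced bases divided by genome size."""
    return total_bases / genome_size


def nested_cell_subsets(cell_ids, sizes, seed=0):
    """Draw nested random subsets of SMRTcells, smallest to largest."""
    rng = random.Random(seed)
    order = list(cell_ids)
    rng.shuffle(order)
    # Each prefix of one random order extends the previous, smaller prefix.
    return {n: order[:n] for n in sorted(sizes)}


full_coverage = coverage(15.8e9)  # 15.8 Gb from all 42 SMRTcells
subsets = nested_cell_subsets(range(42), [15, 20, 25, 30, 35])
```

With the 15.8 Gb total reported for all 42 SMRTcells, the computed coverage falls between 121× and 122×, in line with the 121× quoted in the text.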

Hybrid assembly

Because PacBio only assembly leads to relatively fragmented genomes at lower coverage (Figure 5), we investigated whether another assembly strategy could perform better with similar amounts of long molecule data. We chose DBG2OLC for its speed and its ability to assemble using less than 30× of long molecule coverage (cf. PacBio only methods, which typically require higher coverage (5)). DBG2OLC is a hybrid method, which uses both long read data and contigs obtained from a De Bruijn graph assembly. We used contigs from a single Illumina assembly generated using 67.4× of Illumina paired end reads (25). As shown in Figure 5, the assembly NG50 increased dramatically as PacBio coverage increased, plateauing near 26×. Beyond this point, NG50 remained relatively constant. Alignment of the test assemblies to the ISO1 reference genome showed that some of the contiguity in the 26× hybrid assembly without downsampling was due to chimeric contigs (i.e. contigs that possess non-syntenic misjoins), and that these errors are fixed as coverage increases (Supplementary Figures S2 and S3). Chimeras were also absent when only the longest 50 or 75% of reads from the 26× dataset were used.

To measure the impact of read length on hybrid assembly contiguity, we downsampled the datasets by discarding the shortest reads such that the resulting datasets contained 50 and 75% of initial total basepairs of data. We then ran the same assembly pipelines using these downsampled datasets and compared them to the assemblies constructed from their counterparts that were not downsampled. Our downsampling shows that with high levels of PacBio coverage (>50×), modest gains in assembly contiguity can be obtained by simply discarding the shortest reads (Figure 5, green lines). Our hybrid assembly results indicate that improvements in contiguity above 30× are modest, though hybrid assemblies remain more contiguous than PacBio only assemblies until coverage exceeds 60×. For projects limited by the cost of long molecule sequencing, a hybrid approach using ∼30× PacBio sequence coverage is an attractive target that minimizes sequencing in exchange for modest sacrifices in contiguity that are in any event available only at higher coverages.
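
The length-based downsampling described above can be sketched as follows. This is a minimal illustration, not the script used for the paper (see Supplementary Data for the actual procedure); the function name `keep_longest_by_bases` and the toy read lengths are hypothetical. Reads are taken from longest to shortest until the kept reads hold the requested fraction of the total basepairs.

```python
def keep_longest_by_bases(read_lengths, fraction):
    """Keep the longest reads until they hold `fraction` of all bases."""
    target = fraction * sum(read_lengths)
    kept, running = [], 0
    for length in sorted(read_lengths, reverse=True):
        if running >= target:
            break  # the shortest remaining reads are discarded
        kept.append(length)
        running += length
    return kept


toy_reads = [2, 10, 4, 8, 6]                       # hypothetical read lengths
half = keep_longest_by_bases(toy_reads, 0.50)      # longest reads holding 50% of bases
three_quarters = keep_longest_by_bases(toy_reads, 0.75)
```

Because whole reads are kept, the retained fraction can slightly exceed the target; in practice the same selection would be applied to read records rather than bare lengths.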

Assembly merging

With modest PacBio sequence coverage (≤50×), hybrid assemblies are less fragmented than their self corrected counterparts, but more fragmented than self corrected assemblies generated from higher read coverage (Figure 5). Despite this, for lower coverage, many contigs exhibit complementary contiguity, as observed in alignments (e.g. Supplementary Figure S4a) between a PacBio only assembly (53× reads; NG50 1.98 Mb) and a hybrid assembly (longest 30× from 53× reads; NG50 3.2 Mb; not featured in Figure 5). For example, the longest contig (16.8 Mb) in the PacBio only assembly, which aligns to the chromosome 3R of the reference sequence (Supplementary Figure S4c), is spanned by five contigs in the hybrid assembly (Supplementary Figure S4b). This complementarity suggests that merging might improve the overall assembly.

We first attempted to merge the hybrid assembly and the PacBio only assembly using the existing meta assembler minimus2 (32), but the program often failed to run to completion when merging a hybrid assembly and a PacBio only assembly, and when it did finish, the run times were measured in days. We therefore developed a program, quickmerge, that merges assemblies using the MUMmer (30) alignment between the assemblies. Assembly contiguity improved dramatically when we merged the above hybrid and PacBio only assemblies (assembly NG50 9.1 Mb; Supplementary Figure S5); however, assembly contiguity can also be increased with false contig joining. To investigate whether merging leads to false joins or introduces assembly errors at the splice junctions, we investigated the result of merging at base pair resolution for the longest merged contig in the aforementioned assembly.

The longest contig (27.9 Mb) in the merged assembly, which aligns to chromosome arm 3R of the reference sequence (Supplementary Figure S6), was longer than the longest 3R contig in the PacBio assembly based on 42 SMRTcells (25.4 Mb) (13) (Supplementary Figure S6). The increased length resulted from closing of gaps present in the published PacBio assembly (Supplementary Figure S6) (13). All joined contigs map to the chromosome arm 3R in the correct order; we take this as evidence that quickmerge does not incorporate spurious sequences or large scale misassemblies Nonetheless, small scale misassemblies could still be introduced at the splice junctions. To check for such errors, we manually inspected a high resolution dot plot between the merged contig and the 3R reference sequence. A total of 18 regions were found where the merged contig differed from the reference sequence (Supplementary Table S2). The affected regions ranged from 3 bp to 20 kb and involved sequence insertion, deletion and duplication. All identified misassemblies had a buried Pacbio coverage of 15 or higher, indicating that misassemblies were due not to lack of coverage, but some other factor (for example, repetitive regions of the genome). For buried coverage calculations, reads are mapped to the genome and only mapped regions supported by 2 kb contiguous read coverage on both sides are counted toward buried coverage, ensuring any feature exhibiting buried coverage is strongly supported by the reads overlapping it. That said, such discordance between the merged contigs and the reference could have been carryover from assembly errors from the hybrid and PacBio only assemblies that were used for merging. Indeed, 11 of the 18 errors in the merged contigs came from the PacBio only assembly, whereas the rest came from the hybrid assembly. 
Additionally, sequences 201 bp in length from each of the 29 splice junctions (the break point is the 101st bp; see Supplementary Data) from the aforementioned merged assembly were aligned to the reference sequence. None of the sequences revealed any misassemblies introduced by the merging process. Thus, for this dataset, the quickmerge approach splices and merges contigs accurately without introducing any new assembly errors. This indicates that the contiguity of even high coverage PacBio only assemblies can be increased by the addition of inexpensive Illumina reads, and gaps in hybrid assembly can be closed by PacBio only assembly even when the PacBio only assembly quality is suboptimal.
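The buried coverage definition given above can be illustrated with a minimal Python sketch. It assumes that read alignments are already available as half-open (start, end) reference coordinates; the function name `buried_coverage` and the `flank` parameter are illustrative and do not correspond to any script distributed with this study.

```python
from typing import Iterable, List, Tuple


def buried_coverage(
    alignments: Iterable[Tuple[int, int]],
    contig_length: int,
    flank: int = 2000,
) -> List[int]:
    """Per-base buried coverage for a single contig.

    A read aligned to [start, end) counts toward position p only if it
    extends at least `flank` bases on both sides of p, i.e. only for
    positions in [start + flank, end - flank).
    """
    # Difference array: +1 where a read starts counting, -1 where it stops.
    diff = [0] * (contig_length + 1)
    for start, end in alignments:
        lo = max(start, 0) + flank
        hi = min(end, contig_length) - flank
        if lo < hi:  # reads shorter than 2 * flank never contribute
            diff[lo] += 1
            diff[hi] -= 1

    coverage = []
    running = 0
    for delta in diff[:contig_length]:
        running += delta
        coverage.append(running)
    return coverage
```

With the default 2 kb flank, a read shorter than 4 kb contributes nothing, which is why a feature with buried coverage of 15 or higher is spanned by many reads that each anchor well into the flanking sequence.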

Assessment of assembly quality

We assessed assembly quality using the Quast software package (21) and the quality assessment scripts used in the GAGE study (22). We confined our assessment to assemblies related to application of the quickmerge meta assembler, leaving the assessment of PBcR and DBG2OLC assemblies to their respective publications (13). Quast quantifies assembly contiguity and additionally identifies misassemblies, indels, gaps and substitutions in an assembly when compared to a known reference. We found that, compared to the D. melanogaster reference, all assemblies had relatively few errors, with the primary difference among the assemblies being genome contiguity (NG50). Hybrid assemblies tended to have fewer assembly errors than PacBio only assemblies: the total number of misassemblies and the total number of contigs with misassemblies tended to be higher in PacBio only assemblies compared to hybrid assemblies. Still, PacBio only assemblies tended to have slightly fewer mismatched bases compared to the reference, and slightly fewer small indels. Merged assemblies, being a mix of PacBio only and hybrid assemblies, tended to have intermediate Quast statistics; however, the merged assemblies improved upon the source assemblies in terms of misassemblies and misassembled contigs (Supplementary Figure S8). Overall, the rate of mismatches was low at an average (across all assemblies) of 47 errors per 100 kb (Supplementary Table S1 and Figure S8). Mismatches and indels can be further reduced using existing programs, such as Quiver (18). We used Quiver to polish all non-downsampled hybrid, self and merged assemblies that used at least 40× of data. After Quiver, the average mismatch rate of the selected assemblies decreased from 24 per 100 kb to 15, while the average indel rate decreased from 180 per 100 kb to 32 (Supplementary Figure S9). We also performed post-Quiver polishing on these selected assemblies using Illumina data via the Pilon program (19). 
Pilon polishing further reduced the average indel rate per 100 kb from 32 to 16 (Supplementary Figure S10).

One concern generated by the pre-polished assemblies was that their N50s were high, but their corrected N50s (22) after accounting for errors were low; however, Quiver and Pilon polishing dramatically improved the corrected N50s of the assemblies, indicating that the low corrected N50 values were due to small local errors that were easily resolved by polishing. The average corrected N50 before polishing was 67 kb, while the average corrected N50 after polishing was 530 kb. It is evident from the corrected N50s that the first polishing step, Quiver, was responsible for most of the change in corrected N50 (Supplementary Figure S11). Moreover, Supplementary Figure S11 shows that, after correcting for misassemblies, polished quickmerge assemblies are almost always more contiguous than polished versions of the component assemblies.
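To make the contiguity metrics concrete, the following minimal Python sketch computes N50, NG50 and a corrected N50 in the spirit of the GAGE metric (22). It assumes that error positions on each contig are already known and that contigs are simply split at those positions before N50 is recomputed. The function names are illustrative and are not part of Quast or the GAGE scripts.

```python
from typing import Dict, List, Sequence


def nx50(lengths: Sequence[int], total: int) -> int:
    """Largest length L such that contigs of length >= L cover half of `total`."""
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # the contigs cover less than half of `total`


def n50(lengths: Sequence[int]) -> int:
    """N50: the half-way point is defined by the assembly size."""
    return nx50(lengths, sum(lengths))


def ng50(lengths: Sequence[int], genome_size: int) -> int:
    """NG50: the half-way point is defined by the genome size."""
    return nx50(lengths, genome_size)


def break_at_errors(
    contigs: Dict[str, int], error_positions: Dict[str, List[int]]
) -> List[int]:
    """Split each contig at its error positions and return the piece lengths."""
    pieces = []
    for name, length in contigs.items():
        cuts = sorted(p for p in set(error_positions.get(name, [])) if 0 < p < length)
        previous = 0
        for cut in cuts + [length]:
            pieces.append(cut - previous)
            previous = cut
    return pieces


def corrected_n50(
    contigs: Dict[str, int], error_positions: Dict[str, List[int]]
) -> int:
    """N50 of the pieces obtained after breaking contigs at known errors."""
    return n50(break_at_errors(contigs, error_positions))
```

Because every error splits a contig, many small local errors on long contigs can depress the corrected N50 sharply even when the uncorrected N50 is high, which is consistent with the improvement seen after polishing.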

Size selection and assembly contiguity

Long reads generated by library preparation with aggressive size selection (10) can generate extremely contiguous and accurate de novo assemblies (13). Unfortunately, some DNA libraries with less stringent size selection produce considerably shorter reads (Figure 2A). Longer reads are predicted to generate more contiguous genomes (6,7). We tested this hypothesis by assembling genomes using randomly sampled whole reads (see ‘Materials and Methods’ section) from the ISO1 dataset to simulate a read length distribution comparable to, but slightly longer than what is typical when size selection is not aggressive. Due to the long read length distribution of the ISO1 dataset relative to the shorter target distribution above, a maximum of 53× of ISO1 data could be sampled.

Consistent with the theoretical prediction that, all else being equal, shorter reads produce more fragmented assemblies (6,7), reads from the downsampled 53× ISO1 data produced a PacBio only assembly with an NG50 of 1.38 Mb, which is shorter than the NG50 (1.98 Mb) of the assembly from the same amount of ISO1 long read data (Figure 2C). In addition, nearly all long contigs present in the original 53× assembly are fragmented in the assembly from the shorter reads (Supplementary Figure S13), although the amount of sequence data (53×) used to build the assemblies is the same.

For hybrid assembly, the shorter dataset also produced significantly less contiguous assemblies, consistent with predictions from theory (7) (Figure 2B). The NG50 achieved with 26× coverage of the shorter dataset was 1.62 Mb, compared to an NG50 of 3.58 Mb with the original ISO1 data. This is consistent with the PacBio only result—longer read lengths lead to higher assembly contiguity. Thus, a library preparation procedure that aggressively size selects DNA is crucial in delivering long contigs.

The effects of read quality on assembly

As with reduction in read length, increased read errors are predicted to worsen assembly quality because noisier reads increase the required read length and coverage to attain a high quality assembly (12). When a PacBio sequencing experiment is pushed for high yield through either high polymerase or template concentration, the data exhibits lower quality scores (Figure 3). Thus, with equal coverage and read length distribution, reads with higher error rates should result in a more fragmented assembly. To measure this effect, we partitioned the ISO1 PacBio read data into three groups with equal amounts of sequence without changing the read length distribution (see ‘Materials and Methods’ section) (Supplementary Figure S14). For the first two groups, the data was split in half, with one half comprising the reads from the bottom 50% of phred scores and the other comprising the top 50%. The third dataset was generated by randomly selecting 50% of the reads in the full dataset. We then performed PacBio only and hybrid assemblies with these data.
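One way such a partition could be built is sketched below in Python. This is an assumption for illustration and not necessarily the procedure described in the ‘Materials and Methods’ section: reads are grouped into length bins, and within each bin the lower and upper halves by mean quality, plus a random half, are taken, so that all three groups share nearly the same length distribution. The `Read` record, the bin width and the function name are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Read(NamedTuple):
    name: str
    length: int
    mean_quality: float  # e.g. mean phred score of the read


def partition_by_quality(
    reads: Iterable[Read], bin_width: int = 1000, seed: int = 0
) -> Dict[str, List[Read]]:
    """Split reads into low quality, high quality and random halves per length bin."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for read in reads:
        bins[read.length // bin_width].append(read)

    groups: Dict[str, List[Read]] = {"low": [], "high": [], "random": []}
    for key in sorted(bins):
        members = sorted(bins[key], key=lambda r: r.mean_quality)
        half = len(members) // 2  # the middle read of an odd-sized bin is dropped
        groups["low"].extend(members[:half])
        groups["high"].extend(members[len(members) - half:])
        groups["random"].extend(rng.sample(members, half))
    return groups
```

Splitting by read count inside narrow length bins only approximates an equal split of total sequence; a narrower `bin_width` tightens the approximation.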

Low read quality had a particularly dramatic effect on assembly by self correction (Figure 6): the high quality and randomly sampled reads produced substantially better assemblies (NG50 of 6.23 and 6.15 Mb, respectively) than the assembly made from low quality reads (NG50 146 kb). Hybrid assembly contiguity was far more robust to low quality reads (Figure 6: NG50 of 3.1 Mb for the high quality reads, 2.5 Mb for the unfiltered reads and 2.2 Mb for the low quality reads), showing only moderate variation among different quality datasets. Throughout this study, we avoided altering the settings from their default states in the various assemblers used in order to make fair comparisons; however, in this case, we chose to also run PBcR in ‘sensitive’ mode to see if it would improve contiguity when data quality is low. We found that assembly contiguity was improved (NG50 = 4 Mb), but was still lower than that of the assembly generated from unselected reads without the sensitive parameters (NG50 = 6.15 Mb).

Merging of human assemblies of the CHM1 cell line

One challenge in a study of this type is determining whether merging performed on a very different genome, like that of H. sapiens, would perform as well as on D. melanogaster. To address this, we used publicly available sequence data and assemblies for the human hydatidiform mole (CHM1 (11)) to generate a merged assembly for H. sapiens, both to gauge the performance of quickmerge on a species other than the one it was developed on, and to observe its performance on a larger and more repetitive genome (the human genome is ∼3.2 Gb, ∼25× the size of the D. melanogaster genome).

Of the available CHM1 data, we chose to re-use the data used in Berlin et al. 2015 (13) (the P5C3 chemistry). We ran our genome assembly pipeline on the 30× longest reads of PacBio data from the 54× in the CHM1 dataset, plus 40.66× of publicly available human CHM1 Illumina data (NCBI accession: PRJNA176729). The hybrid assembly produced an NG50 of 2.4 Mb, which is in line with the results observed in Figure 5. Along with this, we used the PacBio assembly contigs produced by Berlin et al. (13), which had an NG50 of 4.1 Mb. We merged the two assemblies with stricter parameters because of the larger genome size: we set HCO to 15, c to 5 and l to 5 Mb. Merging the two assemblies produced a final assembly NG50 of 8.85 Mb, a substantial improvement upon the PacBio only assembly. This more than doubling of NG50 is in line with our expectations based on the D. melanogaster results; all available data indicate that this pipeline improves contiguity for CHM1 to the same extent that it does for the D. melanogaster ISO1 strain. We did not polish this assembly with Quiver and Pilon due to computational constraints, but it stands to reason that the gains vis-à-vis SNP and indel rates would be similar between human and D. melanogaster. In order to evaluate misassemblies, we produced a MUMmer dnadiff report by comparing the PacBio only, DBG2OLC and merged assemblies to the most recent and highest contiguity CHM1 PacBio only assembly available (GenBank accession number: GCA_001420765.1). The results show that the large increase in contiguity is not a consequence of merging induced misassembly, mirroring the results in Drosophila (Supplementary Figure S12). Additionally, we generated MUMmer dot plots that indicated that contig orientation and ordering were correct, with the exception of some inversions and translocations that were inherited from the component assemblies (Supplementary Figure S7).
While we attempted to run the Quast and GAGE assessment pipelines on the human assemblies, we found that, in all cases, the programs either crashed or failed to finish successfully in a reasonable time frame.
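Selecting the longest reads up to a target coverage, as done above for the 30× subset of the 54× CHM1 data, can be sketched as follows. This is a minimal Python illustration that assumes read lengths are already known; the function name and arguments are hypothetical and do not describe the actual pipeline scripts.

```python
from typing import List, Sequence


def longest_reads_for_coverage(
    read_lengths: Sequence[int], genome_size: int, target_coverage: float
) -> List[int]:
    """Indices of the longest reads whose summed length first reaches the target.

    Coverage is total read bases divided by genome size, so the target in
    bases is target_coverage * genome_size.
    """
    target_bases = target_coverage * genome_size
    order = sorted(
        range(len(read_lengths)), key=lambda i: read_lengths[i], reverse=True
    )
    chosen = []
    total = 0
    for index in order:
        if total >= target_bases:
            break
        chosen.append(index)
        total += read_lengths[index]
    return chosen
```

If the dataset holds less than the requested coverage, the function simply returns every read, so callers may want to compare the summed length of the selection against the target.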

DISCUSSION

Genome assembly projects must balance cost against genome contiguity and quality (4). Self correction and assembly using only long reads clearly produces complete and contiguous genomes (Figure 5 and Supplementary Table S1). However, it is often impractical to collect the quantity of PacBio sequence data (>50×) necessary for high quality self correction either because of price or because of scarcity of appropriate biological material, especially when assembling very large genomes. For example, at least 40 μg of high quality genomic DNA is required for us to generate 1.5 μg of PacBio library when we use two rounds of size selection in the library preparation protocol. A 1.5 μg library produces, on average, 15–20 Gb of long DNA molecules. This dramatic loss of DNA during library preparation limits the amount of PacBio data that can be obtained for a given quantity of source tissue. When a project is limited by cost or tissue availability, a hybrid approach using a mix of short and long read sequences is an alternative to self corrected long read sequences.
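The DNA requirement implied by these yields can be worked through with a short Python sketch. It assumes, as stated above, at least 40 μg of genomic DNA per 1.5 μg library and 15–20 Gb of sequence per library; the constant and function names are illustrative.

```python
import math

# From the text: at least 40 ug of genomic DNA is needed per 1.5 ug library.
DNA_INPUT_UG_PER_LIBRARY = 40.0


def libraries_needed(
    target_coverage: float, genome_size_gb: float, yield_gb_per_library: float
) -> int:
    """Number of libraries needed to reach the target coverage."""
    if target_coverage <= 0 or genome_size_gb <= 0 or yield_gb_per_library <= 0:
        raise ValueError("all arguments must be positive")
    required_gb = target_coverage * genome_size_gb
    return math.ceil(required_gb / yield_gb_per_library)


def dna_required_ug(n_libraries: int) -> float:
    """Minimum input DNA for the given number of libraries."""
    return n_libraries * DNA_INPUT_UG_PER_LIBRARY
```

Under these assumptions, 50× coverage of a ∼3.2 Gb genome requires 160 Gb of sequence, i.e. 11 libraries at 15 Gb per library or 8 libraries at 20 Gb per library, corresponding to at least 320–440 μg of input DNA.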

Our results show that when 67.4× of 100 bp paired end Illumina reads is used in combination with 10–30× of PacBio sequences, reasonably high quality hybrid assemblies can be obtained, with ∼30× of PacBio sequences yielding the best assembly. In fact, as our results show, a 30× hybrid assembly is less fragmented and of higher quality than even a 50× self-corrected assembly (Figure 5). However, our results also show that with the same long molecule data, PacBio only and hybrid assemblies often assemble complementary regions of the genome. The implication, that different assemblers join complementary contigs, suggests that future assemblers could generate higher quality assemblies with modest coverage data. The merging of a PacBio only and a hybrid assembly results in a better assembly than either of the two alone (Figure 5 and Supplementary Table S1), regardless of the total amount of long molecule sequences (≥30×) used. Thus, projects for which ≥30× of single molecule sequence can be generated are well-served by collecting an additional 50–100× of Illumina data. These data can then be used to generate both a self-corrected assembly and a hybrid assembly, which can then be merged to obtain an assembly of comparable contiguity to PacBio only assemblies using twice the amount of PacBio data (Figure 5). This merged assembly approach produced the highest NG50 of any assembly at all coverage levels at which it could be tested, with little or no tradeoff in base accuracy or misassemblies (Supplementary Figures S8–10).

Nonetheless, it is clear that the tools available for genomic assembly have inherent technical limitations: DBG2OLC assembly contiguity asymptotes as PacBio read coverage passes about 30×, and the PBcR pipeline produces the best assembly when the longest reads that make up 40× (of genome size) of data are corrected and only the longest 25× from the corrected sequences are assembled (13). Indeed, when coverage >25× is used for PacBio only assembly, there is a real loss of assembly quality as coverage increases (data not shown). This may be because an increase in coverage leads to the stochastic accumulation of contradictory reads that cannot be easily reconciled, a limitation of the overlap-layout-consensus algorithm used in assembling the long reads (2,33).

Long read sequencing technologies, such as those offered by PacBio, Oxford Nanopore (34) and Illumina TruSeq (35), promise to improve the quality of de novo genome assemblies substantially. However, as we have shown using PacBio sequences as an example, not all long read data is equally useful when assembling genomes. We provide empirical validation, perhaps for the first time, of the effects of read length and read quality on assembly contiguity. Additionally, our results provide a novel insight: high-throughput short reads can still be useful in improving contiguity of assemblies created with long reads, even when long read coverage is high. In light of our results, we have compiled a list of best practices for DNA isolation, sequencing and assembly (Supplementary Figures S15 and S16). Particularly important for DNA isolation is quality control of read length via pulsed field gel electrophoresis. Regarding assembly, we recommend that researchers obtain between 50× and 100× Illumina sequence. Next, researchers must determine how much long molecule coverage to obtain: between 25× and 35×, or >35×. With coverage below 35×, PacBio only methods often fail to assemble and produce low contiguity when they do assemble, and thus, we can only confidently recommend a hybrid assembly. Above 35×, we recommend meta assembly of a hybrid and a PacBio only assembly. In this case, we recommend downsampling to the 30× longest PacBio reads when generating the hybrid assembly because hybrid assembly contiguity decreases above this coverage level, but this has not been extensively tested. We show that this approach is effective both in Drosophila and human genomes, which differ in size and extent of repetitive regions.
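The assembly recommendations above can be summarized as a simple decision rule. The Python sketch below is only an illustration of that rule: the `Plan` record and function name are hypothetical, coverage of exactly 35× is assigned to the hybrid branch as an assumption, and coverage below 25× is flagged as lying outside the range for which a recommendation is given.

```python
from dataclasses import dataclass
from typing import Optional

MERGE_THRESHOLD = 35.0    # above this, merge hybrid and PacBio only assemblies
HYBRID_DOWNSAMPLE = 30.0  # longest-read subset used for the hybrid assembly
RECOMMENDED_MINIMUM = 25.0


@dataclass(frozen=True)
class Plan:
    strategy: str
    hybrid_pacbio_coverage: float
    pacbio_only_coverage: Optional[float]
    within_recommended_range: bool


def recommend_plan(pacbio_coverage: float) -> Plan:
    """Map available PacBio coverage to the assembly strategy suggested in the text."""
    if pacbio_coverage <= 0:
        raise ValueError("PacBio coverage must be positive")
    if pacbio_coverage > MERGE_THRESHOLD:
        return Plan(
            strategy="merge hybrid and PacBio only assemblies",
            hybrid_pacbio_coverage=HYBRID_DOWNSAMPLE,
            pacbio_only_coverage=pacbio_coverage,
            within_recommended_range=True,
        )
    return Plan(
        strategy="hybrid assembly only",
        hybrid_pacbio_coverage=pacbio_coverage,
        pacbio_only_coverage=None,
        within_recommended_range=pacbio_coverage >= RECOMMENDED_MINIMUM,
    )
```

In both branches, the rule presumes that 50–100× of Illumina data has been collected alongside the long reads.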

One challenge in assembly is posed by analyzing data from heterozygous individuals. Heterozygosity is known to make assembly more challenging (5). All of the data evaluated in this study were produced from either isogenic or highly inbred populations (Drosophila) or from a single haploid cell line (human CHM1). Because there is not a comparable dataset available that was produced using heterozygous individuals, we cannot test the effect of heterozygosity on assembly quality. That said, some assemblers (Platanus (23) and Falcon (https://github.com/PacificBiosciences/FALCON)) were designed to produce diploid assemblies from heterozygous sequence data (5). It stands to reason that substituting Falcon in the place of PBcR in this pipeline could improve assembly quality for highly heterozygous samples, but that claim will require further testing.

The recent rapid development of short read sequencing technology has fostered an explosion of genome sequencing. However, as a result of the cost effectiveness and concomitant popularity of short read technologies, the average quality and contiguity of published genomes has plummeted (36). Indeed, short read sequences are poorly suited to the task of assembly, especially when compared with long molecule alternatives. While long molecule sequencing has rekindled the promise of high quality reference genomes for any organism, it is currently substantially more expensive than short read alternatives. In order to mitigate uncertainties inherent in adopting this technology, we have outlined the most salient features to consider when planning a genome assembly project. We have recommended effective DNA isolation and preparation practices that result in long reads that take advantage of what the PacBio technology has to offer. We have also provided a guide for assembly that leads to extremely contiguous genomes even when circumstances prevent the collection of large quantities of long molecule sequence data recommended by current methods.

Acknowledgments

The authors would like to thank Stephen Richards for sharing the length distribution from Drosophila pseudoobscura Pacific Biosciences data and Sergey Koren and Brian Walenz for their assistance with wgs. We would also like to thank Melanie Oakes and Valentina Ciobanu for assistance with sequencing.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

Genomic High-Throughput Facility Shared Resource of the Cancer Center Support Grant at the University of California, Irvine [CA-62203, in part]; NIH Instrumentation Grants [1S10RR025496-01, 1S10OD010794-01]. Funding for open access charge: Corresponding author's startup funds.

Conflict of interest statement. None declared.

REFERENCES

1. Simpson J.T., Pop M. The theory and practice of genome sequence assembly. Annu. Rev. Genomics Hum. Genet. 2015;16:153–172. doi: 10.1146/annurev-genom-090314-050032.
2. Myers E.W. Toward simplifying and accurately formulating fragment assembly. J. Comput. Biol. 1995;2:275–290. doi: 10.1089/cmb.1995.2.275.
3. Bradnam K.R., Fass J.N., Alexandrov A., Baranay P., Bechner M., Birol I., Boisvert S., Chapman J.A., Chapuis G., Chikhi R., et al. Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. Gigascience. 2013;2:10. doi: 10.1186/2047-217X-2-10.
4. Baker M. De novo genome assembly: what every biologist should know. Nat. Methods. 2012;9:333–337.
5. Koren S., Phillippy A.M. One chromosome, one contig: complete microbial genomes from long-read sequencing and assembly. Curr. Opin. Microbiol. 2015;23:110–120. doi: 10.1016/j.mib.2014.11.014.
6. Lander E.S., Waterman M.S. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics. 1988;2:231–239. doi: 10.1016/0888-7543(88)90007-9.
7. Motahari A., Ramchandran K., Tse D., Ma N. Optimal DNA shotgun sequencing: noisy reads are as good as noiseless reads. IEEE International Symposium on Information Theory. 2013; pp. 1640–1644.
8. Lam K.-K., Khalak A., Tse D. Near-optimal assembly for shotgun sequencing with noisy reads. BMC Bioinformatics. 2014;15(Suppl. 9):S4. doi: 10.1186/1471-2105-15-S9-S4.
9. Koren S., Harhay G.P., Smith T.P., Bono J.L., Harhay D.M., McVey S.D., Radune D., Bergman N.H., Phillippy A.M. Reducing assembly complexity of microbial genomes with single-molecule sequencing. Genome Biol. 2013;14:R101. doi: 10.1186/gb-2013-14-9-r101.
10. Kim K.E., Peluso P., Babayan P., Yeadon P.J., Yu C., Fisher W.W., Chin C.S., Rapicavoli N.A., Rank D.R., Li J., et al. Long-read, whole-genome shotgun sequence data for five model organisms. Sci. Data. 2014;1:140045. doi: 10.1038/sdata.2014.45.
11. Pendleton M., Sebra R., Pang A.W., Ummat A., Franzen O., Rausch T., Stutz A.M., Stedman W., Anantharaman T., Hastie A., et al. Assembly and diploid architecture of an individual human genome via single-molecule technologies. Nat. Methods. 2015;12:780–786. doi: 10.1038/nmeth.3454.
12. Churchill G.A., Waterman M.S. The accuracy of DNA sequences: estimating sequence quality. Genomics. 1992;14:89–98. doi: 10.1016/s0888-7543(05)80288-5.
13. Berlin K., Koren S., Chin C.S., Drake J.P., Landolin J.M., Phillippy A.M. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat. Biotechnol. 2015;33:623–630. doi: 10.1038/nbt.3238.
14. Koren S., Schatz M.C., Walenz B.P., Martin J., Howard J.T., Ganapathy G., Wang Z., Rasko D.A., McCombie W.R., Jarvis E.D., et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat. Biotechnol. 2012;30:693–700. doi: 10.1038/nbt.2280.
15. Gordon D., Huddleston J., Chaisson M.J., Hill C.M., Kronenberg Z.N., Munson K.M., Malig M., Raja A., Fiddes I., Hillier L.W., et al. Long-read sequence assembly of the gorilla genome. Science. 2016;352:aae0344. doi: 10.1126/science.aae0344.
16. Bresler G., Bresler M., Tse D. Optimal assembly for high throughput shotgun sequencing. BMC Bioinformatics. 2013;14(Suppl. 5):S18. doi: 10.1186/1471-2105-14-S5-S18.
17. Myers E.W., Sutton G.G., Delcher A.L., Dew I.M., Fasulo D.P., Flanigan M.J., Kravitz S.A., Mobarry C.M., Reinert K.H., Remington K.A., et al. A whole-genome assembly of Drosophila. Science. 2000;287:2196–2204. doi: 10.1126/science.287.5461.2196.
18. Chin C.S., Alexander D.H., Marks P., Klammer A.A., Drake J., Heiner C., Clum A., Copeland A., Huddleston J., Eichler E.E., et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat. Methods. 2013;10:563–569. doi: 10.1038/nmeth.2474.
19. Walker B.J., Abeel T., Shea T., Priest M., Abouelliel A., Sakthikumar S., Cuomo C.A., Zeng Q., Wortman J., Young S.K., et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One. 2014;9:e112963. doi: 10.1371/journal.pone.0112963.
20. Hoskins R.A., Smith C.D., Carlson J.W., Carvalho A.B., Halpern A., Kaminker J.S., Kennedy C., Mungall C.J., Sullivan B.A., Sutton G.G., et al. Heterochromatic sequences in a Drosophila whole-genome shotgun assembly. Genome Biol. 2002;3:RESEARCH0085. doi: 10.1186/gb-2002-3-12-research0085.
21. Gurevich A., Saveliev V., Vyahhi N., Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013;29:1072–1075. doi: 10.1093/bioinformatics/btt086.
22. Salzberg S.L., Phillippy A.M., Zimin A., Puiu D., Magoc T., Koren S., Treangen T.J., Schatz M.C., Delcher A.L., Roberts M., et al. GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Res. 2012;22:557–567. doi: 10.1101/gr.131383.111.
23. Kajitani R., Toshimoto K., Noguchi H., Toyoda A., Ogura Y., Okuno M., Yabana M., Harada M., Nagayasu E., Maruyama H., et al. Efficient de novo assembly of highly heterozygous genomes from whole-genome shotgun short reads. Genome Res. 2014;24:1384–1395. doi: 10.1101/gr.170720.113.
24. Salmela L., Rivals E. LoRDEC: accurate and efficient long read error correction. Bioinformatics. 2014;30:3506–3514. doi: 10.1093/bioinformatics/btu538.
25. Langley C.H., Stevens K., Cardeno C., Lee Y.C., Schrider D.R., Pool J.E., Langley S.A., Suarez C., Corbett-Detig R.B., Kolaczkowski B., et al. Genomic variation in natural populations of Drosophila melanogaster. Genetics. 2012;192:533–598. doi: 10.1534/genetics.112.142018.
26. Au K.F., Underwood J.G., Lee L., Wong W.H. Improving PacBio long read accuracy by short read alignment. PLoS One. 2012;7:e46679. doi: 10.1371/journal.pone.0046679.
27. Luo R., Liu B., Xie Y., Li Z., Huang W., Yuan J., He G., Chen Y., Pan Q., Liu Y., et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience. 2012;1:18. doi: 10.1186/2047-217X-1-18.
28. Simpson J.T., Wong K., Jackman S.D., Schein J.E., Jones S.J., Birol I. ABySS: a parallel assembler for short read sequence data. Genome Res. 2009;19:1117–1123. doi: 10.1101/gr.089532.108.
29. Chaisson M.J., Tesler G. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics. 2012;13:238. doi: 10.1186/1471-2105-13-238.
30. Kurtz S., Phillippy A., Delcher A.L., Smoot M., Shumway M., Antonescu C., Salzberg S.L. Versatile and open software for comparing large genomes. Genome Biol. 2004;5:R12. doi: 10.1186/gb-2004-5-2-r12.
31. Graham C.A., Hill A.J. Introduction to DNA sequencing. Methods Mol. Biol. 2001;167:1–12. doi: 10.1385/1-59259-113-2:001.
32. Treangen T.J., Sommer D.D., Angly F.E., Koren S., Pop M. Next generation sequence assembly with AMOS. Curr. Protoc. Bioinformatics. 2011;33. doi: 10.1002/0471250953.bi1108s33.
33. Miller J.R., Koren S., Sutton G. Assembly algorithms for next-generation sequencing data. Genomics. 2010;95:315–327. doi: 10.1016/j.ygeno.2010.03.001.
34. Goodwin S., Gurtowski J., Ethe-Sayers S., Deshpande P., Schatz M.C., McCombie W.R. Oxford Nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome. Genome Res. 2015;25:1750–1756. doi: 10.1101/gr.191395.115.
35. McCoy R.C., Taylor R.W., Blauwkamp T.A., Kelley J.L., Kertesz M., Pushkarev D., Petrov D.A., Fiston-Lavier A.-S. Illumina TruSeq synthetic long-reads empower de novo assembly and resolve complex, highly-repetitive transposable elements. PLoS One. 2014;9:e106689. doi: 10.1371/journal.pone.0106689.
36. Alkan C., Coe B.P., Eichler E.E. Genome structural variation discovery and genotyping. Nat. Rev. Genet. 2011;12:363–376. doi: 10.1038/nrg2958.

[B1] 1.Simpson J.T., Pop M. The theory and practice of genome sequence assembly. Annu. Rev. Genomics Hum. Genet. 2015;16:153–172. doi: 10.1146/annurev-genom-090314-050032. [DOI] [PubMed] [Google Scholar]

[B2] 2.Myers E.W. Toward simplifying and accurately formulating fragment assembly. J. Comput. Biol. 1995;2:275–290. doi: 10.1089/cmb.1995.2.275. [DOI] [PubMed] [Google Scholar]

[B3] 3.Bradnam K.R., Fass J.N., Alexandrov A., Baranay P., Bechner M., Birol I., Boisvert S., Chapman J.A., Chapuis G., Chikhi R., et al. Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. Gigascience. 2013;2:10. doi: 10.1186/2047-217X-2-10. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B4] 4.Baker M. De novo genome assembly: what every biologist should know. Nat. Methods. 2012;9:333–337. [Google Scholar]

[B5] 5.Koren S., Phillippy A.M. One chromosome, one contig: complete microbial genomes from long-read sequencing and assembly. Curr. Opin. Microbiol. 2015;23:110–120. doi: 10.1016/j.mib.2014.11.014. [DOI] [PubMed] [Google Scholar]

[B6] 6.Lander E.S., Waterman M.S. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics. 1988;2:231–239. doi: 10.1016/0888-7543(88)90007-9. [DOI] [PubMed] [Google Scholar]

[B7] 7.Motahari A., Ramchandran K., Tse D., Ma N. IEEE International Symposium on Information Theory. 2013. Optimal DNA shotgun sequencing: noisy reads are as good as noiseless reads; pp. 1640–1644. [Google Scholar]

[B8] 8.Lam K.-K., Khalak A., Tse D. Near-optimal assembly for shotgun sequencing with noisy reads. BMC Bioinformatics. 2014;15(Suppl. 9):S4. doi: 10.1186/1471-2105-15-S9-S4. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B9] 9.Koren S., Harhay G.P., Smith T.P., Bono J.L., Harhay D.M., McVey S.D., Radune D., Bergman N.H., Phillippy A.M. Reducing assembly complexity of microbial genomes with single-molecule sequencing. Genome Biol. 2013;14:R101. doi: 10.1186/gb-2013-14-9-r101. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B10] 10.Kim K.E., Peluso P., Babayan P., Yeadon P.J., Yu C., Fisher W.W., Chin C.S., Rapicavoli N.A., Rank D.R., Li J., et al. Long-read, whole-genome shotgun sequence data for five model organisms. Sci. Data. 2014;1:140045. doi: 10.1038/sdata.2014.45. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B11] 11.Pendleton M., Sebra R., Pang A.W., Ummat A., Franzen O., Rausch T., Stutz A.M., Stedman W., Anantharaman T., Hastie A., et al. Assembly and diploid architecture of an individual human genome via single-molecule technologies. Nat. Methods. 2015;12:780–786. doi: 10.1038/nmeth.3454. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B12] 12.Churchill G.A., Waterman M.S. The accuracy of DNA sequences: estimating sequence quality. Genomics. 1992;14:89–98. doi: 10.1016/s0888-7543(05)80288-5. [DOI] [PubMed] [Google Scholar]

[B13] 13.Berlin K., Koren S., Chin C.S., Drake J.P., Landolin J.M., Phillippy A.M. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat. Biotechnol. 2015;33:623–630. doi: 10.1038/nbt.3238. [DOI] [PubMed] [Google Scholar]

[B14] 14.Koren S., Schatz M.C., Walenz B.P., Martin J., Howard J.T., Ganapathy G., Wang Z., Rasko D.A., McCombie W.R., Jarvis E.D., et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat. Biotechnol. 2012;30:693–700. doi: 10.1038/nbt.2280. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B15] 15.Gordon D., Huddleston J., Chaisson M.J., Hill C.M., Kronenberg Z.N., Munson K.M., Malig M., Raja A., Fiddes I., Hillier L.W., et al. Long-read sequence assembly of the gorilla genome. Science. 2016;352:aae0344. doi: 10.1126/science.aae0344. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B16] 16.Bresler G., Bresler M., Tse D. Optimal assembly for high throughput shotgun sequencing. BMC Bioinformatics. 2013;14(Suppl. 5):S18. doi: 10.1186/1471-2105-14-S5-S18. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B17] 17. Myers E.W., Sutton G.G., Delcher A.L., Dew I.M., Fasulo D.P., Flanigan M.J., Kravitz S.A., Mobarry C.M., Reinert K.H., Remington K.A., et al. A whole-genome assembly of Drosophila. Science. 2000;287:2196–2204. doi: 10.1126/science.287.5461.2196.

[B18] 18. Chin C.S., Alexander D.H., Marks P., Klammer A.A., Drake J., Heiner C., Clum A., Copeland A., Huddleston J., Eichler E.E., et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat. Methods. 2013;10:563–569. doi: 10.1038/nmeth.2474.

[B19] 19. Walker B.J., Abeel T., Shea T., Priest M., Abouelliel A., Sakthikumar S., Cuomo C.A., Zeng Q., Wortman J., Young S.K., et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One. 2014;9:e112963. doi: 10.1371/journal.pone.0112963.

[B20] 20. Hoskins R.A., Smith C.D., Carlson J.W., Carvalho A.B., Halpern A., Kaminker J.S., Kennedy C., Mungall C.J., Sullivan B.A., Sutton G.G., et al. Heterochromatic sequences in a Drosophila whole-genome shotgun assembly. Genome Biol. 2002;3:RESEARCH0085. doi: 10.1186/gb-2002-3-12-research0085.

[B21] 21. Gurevich A., Saveliev V., Vyahhi N., Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013;29:1072–1075. doi: 10.1093/bioinformatics/btt086.

[B22] 22. Salzberg S.L., Phillippy A.M., Zimin A., Puiu D., Magoc T., Koren S., Treangen T.J., Schatz M.C., Delcher A.L., Roberts M., et al. GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Res. 2012;22:557–567. doi: 10.1101/gr.131383.111.

[B23] 23. Kajitani R., Toshimoto K., Noguchi H., Toyoda A., Ogura Y., Okuno M., Yabana M., Harada M., Nagayasu E., Maruyama H., et al. Efficient de novo assembly of highly heterozygous genomes from whole-genome shotgun short reads. Genome Res. 2014;24:1384–1395. doi: 10.1101/gr.170720.113.

[B24] 24. Salmela L., Rivals E. LoRDEC: accurate and efficient long read error correction. Bioinformatics. 2014;30:3506–3514. doi: 10.1093/bioinformatics/btu538.

[B25] 25. Langley C.H., Stevens K., Cardeno C., Lee Y.C., Schrider D.R., Pool J.E., Langley S.A., Suarez C., Corbett-Detig R.B., Kolaczkowski B., et al. Genomic variation in natural populations of Drosophila melanogaster. Genetics. 2012;192:533–598. doi: 10.1534/genetics.112.142018.

[B26] 26. Au K.F., Underwood J.G., Lee L., Wong W.H. Improving PacBio long read accuracy by short read alignment. PLoS One. 2012;7:e46679. doi: 10.1371/journal.pone.0046679.

[B27] 27. Luo R., Liu B., Xie Y., Li Z., Huang W., Yuan J., He G., Chen Y., Pan Q., Liu Y., et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience. 2012;1:18. doi: 10.1186/2047-217X-1-18.

[B28] 28. Simpson J.T., Wong K., Jackman S.D., Schein J.E., Jones S.J., Birol I. ABySS: a parallel assembler for short read sequence data. Genome Res. 2009;19:1117–1123. doi: 10.1101/gr.089532.108.

[B29] 29. Chaisson M.J., Tesler G. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics. 2012;13:238. doi: 10.1186/1471-2105-13-238.

[B30] 30. Kurtz S., Phillippy A., Delcher A.L., Smoot M., Shumway M., Antonescu C., Salzberg S.L. Versatile and open software for comparing large genomes. Genome Biol. 2004;5:R12. doi: 10.1186/gb-2004-5-2-r12.

[B31] 31. Graham C.A., Hill A.J. Introduction to DNA sequencing. Methods Mol. Biol. 2001;167:1–12. doi: 10.1385/1-59259-113-2:001.

[B32] 32. Treangen T.J., Sommer D.D., Angly F.E., Koren S., Pop M. Next generation sequence assembly with AMOS. Curr. Protoc. Bioinformatics. 2011;33. doi: 10.1002/0471250953.bi1108s33.

[B33] 33. Miller J.R., Koren S., Sutton G. Assembly algorithms for next-generation sequencing data. Genomics. 2010;95:315–327. doi: 10.1016/j.ygeno.2010.03.001.

[B34] 34. Goodwin S., Gurtowski J., Ethe-Sayers S., Deshpande P., Schatz M.C., McCombie W.R. Oxford Nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome. Genome Res. 2015;25:1750–1756. doi: 10.1101/gr.191395.115.

[B35] 35. McCoy R.C., Taylor R.W., Blauwkamp T.A., Kelley J.L., Kertesz M., Pushkarev D., Petrov D.A., Fiston-Lavier A.-S. Illumina TruSeq synthetic long-reads empower de novo assembly and resolve complex, highly-repetitive transposable elements. PLoS One. 2014;9:e106689. doi: 10.1371/journal.pone.0106689.

[B36] 36. Alkan C., Coe B.P., Eichler E.E. Genome structural variation discovery and genotyping. Nat. Rev. Genet. 2011;12:363–376. doi: 10.1038/nrg2958.
