Nucleic Acids Research. 2021 Aug 20;49(20):e117. doi: 10.1093/nar/gkab717

SWALO: scaffolding with assembly likelihood optimization

Atif Rahman, Lior Pachter
PMCID: PMC8599790  PMID: 34417615

Abstract

Scaffolding, i.e. ordering and orienting contigs, is an important step in genome assembly. We present a method for scaffolding using second generation sequencing reads based on likelihoods of genome assemblies. A generative model for sequencing is used to obtain maximum likelihood estimates of gaps between contigs and to estimate whether linking contigs into scaffolds would lead to an increase in the likelihood of the assembly. We then link contigs if they can be unambiguously joined or if the corresponding increase in likelihood is substantially greater than that of other possible joins of those contigs. The method is implemented in a tool called Swalo with approximations to make it efficient and applicable to large datasets. Analysis on real and simulated datasets reveals that it consistently makes a similar or greater number of correct joins than other scaffolders while linking very few contigs incorrectly, thus outperforming other scaffolders and demonstrating that substantial improvement in genome assembly may be achieved through the use of statistical models. Swalo is freely available for download at https://atifrahman.github.io/SWALO/.

INTRODUCTION

The development of second generation sequencing technologies (1–4) has led to the development of various assays to probe many aspects of interest in molecular and cell biology, owing to their low cost and high throughput. However, genome assembly, a prerequisite for running many of these assays, is yet to be solved adequately using second generation sequencing reads. Long reads from the third generation sequencing technologies such as single molecule real time (SMRT) (5) and nanopore sequencing (6) are transforming genome assembly, and a number of tools have been developed to assemble genomes from long reads (7–10). However, the high cost, low sequencing coverage and high error rates of these technologies mean that many genomes are being assembled using reads from second generation technologies, or through a combination of second and third generation data. In addition, an extensive amount of second generation data exists already that can be better utilized. Furthermore, reference guided assembly tools such as RACA (11), Ragout (12) and MeDuSa (13) require an initial draft assembly, which is often generated using second generation reads. In this paper, we demonstrate that considerable improvement in genome assembly using second-generation sequencing can be achieved through the application of statistical models for sequencing. The statistical approach we introduce may also be applied to genome assembly using long reads.

Genome assembly typically consists of two major steps. The first step is to merge overlapping reads into contigs, which is commonly done using de Bruijn or overlap graphs. In the second step, known as ‘scaffolding’, described in (14–16), contigs are oriented and ordered using various approaches such as paired-end or mate-pair reads (we use the term read pair to refer to either). Recently, methods have been developed to perform scaffolding using linked reads (17–19), long reads (20–23), and chromosomal conformation data from Hi-C (24–29) that lead to scaffolds with substantially better contiguity compared to scaffolds generated using read-pairs, and hence are the recommended technologies for scaffolding. However, read-pair information from second generation technologies is still widely used for scaffolding (30) and remains a critical part of the genome assembly process. It is hence built into most assemblers (31–36), and a number of stand-alone scaffolders such as Bambus2 (37,38), MIP (39), Opera (40,41), SCARPA (42), SOPRA (43), SSPACE (44) and BESST (45) have also been developed to better utilize read-pair information and to resolve ambiguities due to repetitive regions. Most of the scaffolding algorithms rely on heuristics or user input to determine parameters such as the minimum number of read-pairs linking contigs required to join them, ignoring contig lengths, sequencing depth and sequencing errors. In an in-depth study, Hunt et al. evaluated scaffolding tools on real and simulated data and observed that although many of the scaffolders perform well on simulated datasets, they show inconsistent performance across real datasets and mapping tools (46). Their results demonstrate that SGA, SOPRA and ABySS are conservative and make very few scaffolding errors, while SOAPdenovo identified more joins at the expense of a greater number of errors, indicating that a scaffolding method achieving a better trade-off of the two may be possible.

Here, we present a scaffolding method called ‘scaffolding with assembly likelihood optimization (Swalo)’. Swalo learns parameters automatically from the data and is largely free of user parameters, making it more consistent than other scaffolders. It is also able to make use of multi-mapped read pairs, which most other scaffolding tools ignore, through probabilistic disambiguation. The method is grounded in rigorous probabilistic models, yet proper approximations make the implementation efficient and applicable to practical datasets. We analyze the performance of Swalo using datasets used by Hunt et al. and find that Swalo makes a similar or greater number of correct joins compared to other scaffolders while making very few incorrect joins. We also compare Swalo with scaffolding modules built into various assemblers using GAGE datasets (47) and observe that final results obtained by applying Swalo on contigs generated by assemblers are generally better than those obtained by applying the in-built scaffolding modules of those assemblers. Finally, we apply Swalo to a large dataset from the budgerigar genome and demonstrate that it scales to large datasets without compromising performance.

MATERIALS AND METHODS

Overview of Swalo

Our scaffolding method called Swalo is based on a generative model for sequencing (48). Figure 1 illustrates the main steps of Swalo. In the first step, reads are aligned to contigs, the insert size distribution and error parameters are learned using the reads that map uniquely and the likelihood of the set of contigs is computed using a generative model. We then construct the bi-directed scaffold graph which contains a vertex for each contig and there is an edge between contigs if joining them would result in an increase in the likelihood. It uses probabilistic models to estimate maximum likelihood gaps between contigs correcting for the issue that we may not observe inserts from the entire distribution of insert sizes due to gaps between contigs and lengths of contigs (Supplementary Figure S2) (49,50). It then approximates whether joining contigs would result in an increase in the genome assembly likelihood. We use the EM (expectation maximization) (51,52) algorithm to resolve multi-mapped read pairs. Contigs are then joined if the increase in the likelihood is significantly higher than that of all other conflicting joins as determined by a heuristic. Thus, Swalo takes a step towards maximum likelihood genome assembly (53). Moreover, we select multiple joins consistent with one another using the dynamic programming algorithm for the weighted interval scheduling problem. Each of these steps is described in more detail in the sections below and in Supplementary Notes 3.1-3.3. Our scaffolding method (i) learns parameters from the data making it largely parameter free (Supplementary Note 3.4), (ii) makes use of multi-mapped read pairs that are ignored in most scaffolders and (iii) is able to accurately estimate gaps between contigs facilitating gap-filling.

Figure 1.


Overview of SWALO. 1. Reads are aligned to contigs, uniquely mapped reads are used to learn the insert size distribution and error parameters, and then the likelihood of the set of contigs is computed (Supplementary Note 3.1, Supplementary Figure S1). 2. The scaffold graph is constructed by first estimating maximum likelihood gaps, g between contigs using the EM algorithm to resolve multimapped read-pairs taking into account that only inserts of sizes between lmin and lmax will be observed due to gaps between contigs and lengths of contigs (Supplementary Figure S2), and then approximating whether changes in number of possible start sites of reads (the regions shaded in grey) lead to an increase or decrease in the assembly likelihood (Supplementary Note 3.2). 3. Finally, we make the joins that are unambiguous or correspond to likelihood increase significantly higher than that of other conflicting joins. If there are multiple contigs (grey) that fit into the gap between contigs being joined we select from them using the dynamic programming algorithm for the weighted interval scheduling problem in following steps. i. Remove contigs (red) with inconsistent edges to other contigs. ii. Select consistent set of contigs (blue) that optimizes likelihood. iii. Remove selected contigs (red) with likelihood increase not significantly higher than conflicting ones not selected (purple). iv. Merge them into scaffolds (Supplementary Note 3.3).

We present here a brief description of the methods behind Swalo. More details are available in Supplementary Notes 3.1-3.4.

Learning parameters and computing likelihood of contigs

The first step in Swalo is to estimate parameters and compute the likelihood of contigs using the approach presented in (48) (Supplementary Note 3.1). The model incorporates the insert size distribution, sequencing errors and randomness in read generation. We map reads to contigs and learn the insert size distribution and error parameters using reads that map uniquely. The likelihood of the contigs is computed with respect to a generative model using the learned parameters.

Given an assembly $A$ and a set of read pairs $R = \{r_1, r_2, \ldots, r_N\}$, assuming reads are generated independently, the likelihood is given by

$$L(A; R) = p(R \mid A) = \prod_{i=1}^{N} p(r_i \mid A)$$

If $M_i$ is the number of times the read $r_i$ is mapped to the assembly by an aligner, the probability that the read $r_i$ is generated from the assembly is approximated by

$$p(r_i \mid A) = \sum_{j=1}^{M_i} p_F(l_{i,j})\, p_S(s_{i,j} \mid l_{i,j})\, p_E(r_i \mid a_{i,j})$$

where $l_{i,j}$, $s_{i,j}$ and $a_{i,j}$ are the fragment length, start site and assembly subsequence corresponding to the $j$th mapping of the $i$th read, respectively. Therefore, the log likelihood is given by

$$\log L(A; R) = \sum_{i=1}^{N} \log \sum_{j=1}^{M_i} p_F(l_{i,j})\, p_S(s_{i,j} \mid l_{i,j})\, p_E(r_i \mid a_{i,j}) \qquad (1)$$

Here, $p_F$, $p_S$ and $p_E$ are the fragment length, start site and error distributions respectively, as in (48). However, we use a smoothed and truncated version of the insert size distribution for scaffolding (Supplementary Figure S1, Supplementary Note 3.1). The probability that an insert of length $l$ starts at $s$ is

$$p_S(s \mid l) = \frac{1}{\tilde{L}_l}$$

where $\tilde{L}_l = \sum_{c:\, l_c \ge l} (l_c - l + 1)$ is the total effective length, i.e. the number of possible start sites for insert size $l$, and $l_c$ is the length of contig $c$.
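To make the structure of Equation (1) concrete, the following is a minimal sketch in Python rather than the Swalo implementation. It assumes a uniform start site distribution over the effective length, a toy per-base error model with a hypothetical error rate `eps`, and an insert size distribution supplied as a plain dictionary. All function names and the toy data are illustrative assumptions.

```python
import math


def effective_length(contig_lengths, insert_len):
    """Count possible start sites for an insert of this length over all contigs."""
    return sum(lc - insert_len + 1 for lc in contig_lengths if lc >= insert_len)


def error_prob(mismatches, read_len, eps=0.1):
    """Toy error model (assumption): independent per-base errors at rate eps."""
    return (eps ** mismatches) * ((1.0 - eps) ** (read_len - mismatches))


def log_likelihood(reads, contig_lengths, p_f):
    """Sum over reads of the log of the summed probabilities of their mappings.

    reads: one list per read of (insert_len, mismatches, read_len) tuples,
           one tuple per mapping reported by the aligner.
    p_f:   dict from insert length to its probability.
    """
    total = 0.0
    for mappings in reads:
        p_read = 0.0
        for insert_len, mismatches, read_len in mappings:
            eff = effective_length(contig_lengths, insert_len)
            if eff > 0:
                p_start = 1.0 / eff  # uniform start site (assumption)
                p_read += (p_f.get(insert_len, 0.0) * p_start
                           * error_prob(mismatches, read_len))
        if p_read == 0.0:
            return float("-inf")  # a read that cannot be generated at all
        total += math.log(p_read)
    return total


# Toy data: two contigs, one uniquely mapped read and one read with two mappings.
contigs = [10, 5]
insert_dist = {4: 0.5, 6: 0.5}
reads = [
    [(4, 0, 2)],
    [(4, 0, 2), (6, 1, 2)],
]
toy_ll = log_likelihood(reads, contigs, insert_dist)
print(toy_ll)
```

In this toy setting the second read contributes the sum of its two mapping probabilities before the logarithm is taken, which is how multi-mapped reads enter the likelihood.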

Scaffold graph construction

We then construct the scaffold graph which is a bi-directed graph with a vertex for each contig and there is an edge between contigs if joining the contigs would lead to an increase in the assembly likelihood (Supplementary Note 3.2). The edges are weighted using this increase in likelihood. Edge weights are computed for each pair of contigs such that there are read pairs with two ends mapping to different contigs in the pair. This is done in two steps.

  • First we estimate the maximum likelihood gap between the contigs using a generative model correcting for the issue that we may not observe inserts from the entire distribution (Supplementary Figure S2).

    Consider two contigs separated by a gap g. If the 5′ end of an insert is at s, then we will not observe inserts smaller than lmin and greater than lmax. The probability that a read r is generated is then given by
    graphic file with name M9.gif
    where Inline graphic and Inline graphic are the corrected insert size and start site distributions respectively. Inline graphic is given by
    graphic file with name M13.gif
    and Inline graphic is approximated by
    graphic file with name M15.gif

    We then find the gap $g$ that maximizes the likelihood of linking reads, $\max_g \prod_{r \in \{\text{linking}\}} p(r)$, for every pair of contigs with read pairs linking them.

    If there are read pairs that map to multiple pairs of contigs, we resolve them using the expectation-maximization (EM) (51) algorithm.

  • Then we check whether linking the contigs would result in an increase in the likelihood by computing probability of linking reads and adjusting probabilities of all other reads.

    Adjusting the probabilities because of the change in effective length using Equation (1) would require iterating over all multi-mapped reads. In order to make this step practical we make the following approximation.
    graphic file with name M16.gif
    where Inline graphic is insert size corresponding to most probable mapping of read i. This allows us to count the number of reads corresponding to a particular insert size and efficiently calculate the new likelihood. If Inline graphic is the number of reads with insert size Inline graphic, then
    graphic file with name M20.gif

    where Inline graphic is the new effective length if both contigs are sufficiently large compared to insert sizes and can be precomputed for various gap sizes. The adjustments for cases where contigs are small can also be precomputed.

    The likelihood of linking reads and likelihood adjustments of all other reads are combined to estimate the edge weight. We retain the edge and assign it the computed weight if it is positive and delete the edge otherwise.
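The gap estimation step can be illustrated with a simplified sketch. It is not the Swalo implementation: it assumes a normal insert size distribution with known mean `mu` and standard deviation `sigma`, ignores the start site correction, sequencing errors and negative gaps, and searches integer gaps on a grid. Each linking read pair is summarized by two hypothetical offsets: `d1`, the number of insert bases on the first contig, and `d2`, the number of insert bases on the second contig, so that the insert length is `d1 + g + d2`. The observable range is taken to be inserts whose mate lies entirely within the second contig, which is an assumption about how lmin and lmax arise.

```python
import math


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def log_normal_pdf(x, mu, sigma):
    """Log density of a normal distribution."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))


def gap_log_likelihood(gap, links, len2, read_len, mu, sigma):
    """Log likelihood of linking read pairs for a candidate gap.

    The insert size density is renormalized over the observable range
    [l_min, l_max] for each link, i.e. a truncated distribution.
    """
    total = 0.0
    for d1, d2 in links:
        insert_len = d1 + gap + d2
        l_min = d1 + gap + read_len  # mate must fit inside the second contig
        l_max = d1 + gap + len2      # insert cannot extend past the second contig
        mass = normal_cdf(l_max, mu, sigma) - normal_cdf(l_min, mu, sigma)
        if mass <= 0.0:
            return float("-inf")
        total += log_normal_pdf(insert_len, mu, sigma) - math.log(mass)
    return total


def estimate_gap(links, len2, read_len, mu, sigma, max_gap):
    """Grid search for the non-negative integer gap with the highest likelihood."""
    return max(
        range(max_gap + 1),
        key=lambda g: gap_log_likelihood(g, links, len2, read_len, mu, sigma),
    )


# Case A: truncation is negligible, so the estimate equals mu - (d1 + d2).
far_links = [(500, 2300), (600, 2200), (400, 2400)]
gap_a = estimate_gap(far_links, len2=100000, read_len=50, mu=3000, sigma=100, max_gap=1000)

# Case B: short inserts cannot span the gap, so the estimate moves above
# the naive value mu - (d1 + d2) = 200.
near_links = [(150, 150), (100, 200), (200, 100)]
gap_b = estimate_gap(near_links, len2=100000, read_len=50, mu=500, sigma=50, max_gap=1000)
print(gap_a, gap_b)
```

Case B shows why the correction matters: when only the longer inserts can link two contigs, subtracting the observed offsets from the library mean underestimates the gap, and renormalizing over the observable range compensates for this.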

Selecting joins

Once the scaffold graph is constructed, we first make unambiguous joins, i.e. join contigs connected by an edge with an increase in likelihood and such that one vertex has outdegree one and the other has indegree one. The other possible joins are sorted according to the estimated increase in the likelihood and contigs are joined if likelihood increase of the candidate join is significantly higher than other conflicting joins as determined by a heuristic (Supplementary Note 3.3). If there are other joins consistent with the candidate join, i.e. one or more contigs fit into the gap between the pair of contigs, we select from them using the dynamic programming algorithm for the weighted interval scheduling problem and remove conflicting ones. We choose a conservative approach during joining as unlinked contigs may later be merged using other datasets but incorrect joins would usually remain undetected for de novo assembly and might lead to errors in downstream analysis.
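The weighted interval scheduling step mentioned above is a standard dynamic programming problem, and the following sketch shows the textbook algorithm in Python. It is illustrative only: the representation of each candidate contig as a `(start, end, weight)` tuple, with positions measured inside the gap and the weight standing in for the estimated likelihood increase, is an assumption and not taken from the Swalo source.

```python
from bisect import bisect_right


def select_consistent_contigs(candidates):
    """Pick non-overlapping intervals with maximum total weight.

    candidates: list of (start, end, weight) with half-open intervals [start, end).
    Returns (best_total_weight, chosen_intervals_in_order).
    """
    items = sorted(candidates, key=lambda c: c[1])  # sort by end position
    ends = [c[1] for c in items]
    n = len(items)

    # prev[j]: how many of the earlier intervals end at or before items[j] starts.
    prev = [bisect_right(ends, items[j][0], 0, j) for j in range(n)]

    # best[j]: optimal weight using only the first j intervals.
    best = [0.0] * (n + 1)
    for j in range(1, n + 1):
        weight = items[j - 1][2]
        best[j] = max(best[j - 1], weight + best[prev[j - 1]])

    # Backtrack to recover one optimal selection.
    chosen = []
    j = n
    while j > 0:
        weight = items[j - 1][2]
        if weight + best[prev[j - 1]] >= best[j - 1]:
            chosen.append(items[j - 1])
            j = prev[j - 1]
        else:
            j -= 1
    chosen.reverse()
    return best[n], chosen


# Four hypothetical contigs that could fit into the same gap.
candidates = [(0, 100, 5.0), (50, 150, 9.0), (100, 200, 5.0), (160, 260, 3.0)]
total_weight, selected = select_consistent_contigs(candidates)
print(total_weight, selected)
```

Sorting by end position and using binary search to find compatible predecessors gives an O(n log n) algorithm, which keeps this step cheap even when many contigs compete for the same gap.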

Implementation

The methods have been implemented in a tool called ‘scaffolding with assembly likelihood optimization (Swalo)’ using C/C++. The read alignment and gap estimation phases are parallelized to speed up computation. Swalo is available for download freely at http://atifrahman.github.io/SWALO/.

Data availability

We use the data and analysis scripts used in (46,47) and (54). Scripts to install tools, download data and generate the results used in this paper are available at https://github.com/atifrahman/SWALO/tree/scripts.

The scaffolds generated are also available at the same url. There may be minor variations in results due to thread race during mapping of reads unaligned by Bowtie as random subsets of unaligned reads are mapped.

RESULTS AND DISCUSSION

Comparison with stand-alone scaffolders

To compare the performance of Swalo with other stand-alone scaffolders, we use the datasets used by Hunt et al. to evaluate scaffolding tools (46). In addition to the scaffolders considered in the study, we include subsequently published versions of Opera (Opera-LG (41)) and BESST (55). The datasets include four simulated datasets from S. aureus and six real datasets from Staphylococcus aureus, Rhodobacter sphaeroides, Plasmodium falciparum and human chromosome 14 (Supplementary Table S1). Among these, the S. aureus, R. sphaeroides and human chromosome 14 datasets were also part of the GAGE project (47). The contigs were generated using Velvet (31) and were then aligned to the reference and split at locations of error to ensure misassembly-free contigs. Please see (46) for more details on the datasets. We use Bowtie (56) and Bowtie 2 (57) for mapping reads, analyze results using the scripts provided in (46) and, when applicable, use the same parameter values for mapping and scaffolding as used in the paper by Hunt et al. (all parameters used are given in Supplementary Table S2). As in their paper, we use the number of correct joins and the number of incorrect joins for comparison. Contiguity statistics such as N50 scaffold length bias evaluation towards scaffolders that make more joins regardless of whether they are correct or incorrect, while corrected N50 scaffold length leads to a favorable assessment of scaffolders with more correct joins even if that comes at the expense of many more incorrect joins compared to other scaffolders.
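The bias of N50 towards scaffolders that make more joins can be seen in a small example. The sketch below uses the usual definition of N50 (the largest length such that sequences of at least that length cover half of the total) on made-up lengths; the function name and numbers are illustrative.

```python
def n50(lengths):
    """Return the N50 of a list of sequence lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Four hypothetical contigs, then the same set after joining the 40 and 10
# length contigs (gap ignored). N50 rises whether or not that join is correct.
before_join = n50([40, 30, 20, 10])
after_join = n50([50, 30, 20])
print(before_join, after_join)
```

Because the statistic depends only on lengths, an incorrect join raises it just as much as a correct one, which is why the numbers of correct and incorrect joins are reported instead.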

Supplementary Table S3 summarizes the performance of scaffolding tools on simulated datasets. We find that Swalo makes no incorrect joins for any of the datasets. For 100 kb contigs, Swalo was able to make 100% of the correct joins using either library and all aligners. When the insert size library of mean 500 bp was used to scaffold 3 kb contigs, Swalo made 99.0%, 99.3% and 99.0% correct joins using Bowtie 2, Bowtie with 0 (-v 0) and 3 (-v 3) mismatches respectively. The only scaffolder that makes more than 99.3% correct joins is Opera, at 99.8% when used in conjunction with BWA, but this is at the cost of making 0.2% incorrect joins. For 3 kb contigs and the 3 kb insert size library, Swalo made 99.6%, 99.8% and 99.6% correct joins for the three mapping modes. No other scaffolder made more than 99.6% correct joins. It is worth pointing out that Swalo was able to make more correct joins when used with Bowtie -v 0 compared to Bowtie -v 3 and Bowtie 2, which may be due to reads not being mapped to some regions for the latter two.

Performance of Swalo in comparison to other scaffolders for real datasets is illustrated in Figure 2, Supplementary Figure S2 and Tables S4–S7. For the S. aureus dataset from GAGE, we find that Swalo made more correct joins than all other scaffolders while making 1, 1 and 2 incorrect joins in the three runs corresponding to the three ways of mapping reads. However, closer inspection reveals that one join labeled incorrect in each case is in fact a join from the end to the start of a circular sequence and is actually correct. Similarly, for the R. sphaeroides dataset, more correct joins are made by Swalo than by all other scaffolders when used in conjunction with Bowtie 2. Again, three joins that are marked as incorrect are joins linking the ends to the starts of circular chromosomes or plasmids. We observe that the sequencing error rate for this dataset is high compared to the S. aureus dataset, so the number of reads mapped by Bowtie is quite low (46), resulting in fewer joins made by Swalo and other scaffolders when Bowtie is used compared to Bowtie 2.

Figure 2.


Performance of scaffolders. Scatter plots showing number of correct joins vs incorrect joins made by Swalo and other scaffolders on (A) S. aureus data, (B) R. sphaeroides data, (C) P. falciparum combined short and long insert data, and (D) human chromosome 14 combined long insert and fosmid library data. Up to 1 and 3 joins in (A) and (B), respectively made by Swalo (and possibly other scaffolders) labeled incorrect are joins from ends to starts of circular sequences and are therefore correct. Values for all scaffolders except Swalo, Opera-LG and BESST are from (46).

The P. falciparum genome is known to be hard to assemble due to its low GC content. In this case, although Swalo does not make more correct joins than all other scaffolders as in other cases, the numbers of correct joins made are only slightly lower than those of SOPRA, MIP and SCARPA, while the number of incorrect joins is less than or similar to what SOPRA made and much lower than the numbers for SCARPA and MIP. We observe that many of the contigs have strings of consecutive ‘A’s or ‘T’s where very few reads are mapped by aligners, leading to poor gap estimates, which may explain the comparatively smaller number of links made by Swalo.

Finally, for the combined human chromosome 14 dataset, Swalo makes more correct joins than all other scaffolders except SOAP2 and Opera-LG, both of which make more than three times as many incorrect joins as the highest number of incorrect links by Swalo and more than six times as many as the best result by Swalo. Supplementary Figure S1 shows that the long jumping library is in fact a mixture of inserts of two sizes. When they are mapped and used to estimate gaps separately before scaffolding, and the fosmid library is applied on the output, the results improve both in terms of an increase in the number of correct joins and a decrease in the number of incorrect joins.

Comparison with other scaffolding modules

While Hunt et al. performed a comprehensive evaluation of stand-alone scaffolding tools, scaffolding modules built into some assemblers such as ALLPATHS-LG (33,58), MaSuRCA (59), CABOG (60) were left out as they cannot be run independently. In order to assess performance of Swalo in comparison to scaffolding modules of these assemblers, we ran Swalo on the contigs generated by the assemblers obtained from the GAGE project and compared the results with final results of contig assembly and scaffolding by each of these assemblers. We also include two widely used assemblers SPAdes (36) and Megahit (61) published after the GAGE study.

The results are shown in Table 1. They reveal that Swalo makes the fewest incorrect joins in all cases except for SPAdes while making a similar or greater number of correct joins compared to ALLPATHS-LG and CABOG. For the human chromosome 14 dataset, there are 17 more correct joins in scaffolds generated by ALLPATHS-LG compared to the Swalo output. However, nine more incorrect joins are made by ALLPATHS-LG than by Swalo. Although MaSuRCA makes more correct joins than Swalo, this is at the expense of more incorrect joins, the number of which is drastically high for human chromosome 14. We observe that Swalo is able to make more correct joins than SPAdes while retaining a similar ratio of incorrect to correct joins overall. Since the Megahit assembler does not have a scaffolding module, its result cannot be compared with Swalo results. However, we find that the contigs generated by Megahit can be reliably scaffolded using Swalo.

Table 1.

Comparison of performance of Swalo with scaffolding modules built into assemblers using GAGE datasets. Comparison of results obtained by running Swalo on contigs generated by various assemblers with final results obtained by those assemblers after scaffolding

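| Dataset | Assembler | Contig stats: Number | Contig stats: Error | Original scaffold stats: Correct | Original scaffold stats: Incorrect | Swalo stats: Correct | Swalo stats: Incorrect |
|---|---|---|---|---|---|---|---|
| S. aureus | ALLPATHS-LG | 60 | 15 | 48 | 0 | 49 | 0 |
| S. aureus | MSR-CA | 94 | 25 | 74 | 3 | 64 | 2 |
| S. aureus | SPAdes | 106 | 11 | 50 | 2 | 69 | 3 |
| S. aureus | Megahit | 91 | 31 | - | - | 30 | 1 |
| R. sphaeroides | ALLPATHS-LG | 204 | 41 | 170 | 0 | 186 | 0 |
| R. sphaeroides | MSR-CA | 395 | 17 | 347 | 5 | 228 | 3 |
| R. sphaeroides | CABOG | 322 | 33 | 187 | 5 | 239 | 1 |
| R. sphaeroides | SPAdes | 592 | 33 | 334 | 2 | 444 | 3 |
| R. sphaeroides | Megahit | 605 | 45 | - | - | 309 | 2 |
| Human chromosome 14 | ALLPATHS-LG | 4529 | 2706 | 4259 | 45 | 4242 | 36 |
| Human chromosome 14 | MSR-CA | 30091 | 1901 | 27521 | 1145 | 20242 | 167 |
| Human chromosome 14 | CABOG | 3361 | 3076 | 2845 | 37 | 2980 | 32 |
| Human chromosome 14 | SPAdes | 27583 | 1876 | 9662 | 165 | 18465 | 253 |
| Human chromosome 14 | Megahit | 13150 | 3770 | - | - | 8326 | 193 |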

It may be noted that information such as actual position of reads in contigs and distance between contigs in assembly graph that were available to the assemblers could only be inferred by Swalo by mapping the reads back to the contigs using a short aligner such as Bowtie. We believe that if this information were made available by assemblers, the results could be further improved.

Time and memory requirements

Swalo uses statistical models to estimate gaps between contigs and the change in genome assembly likelihood achieved if contigs are joined. As a result it is more computationally intensive than some other scaffolders. However, we make necessary approximations to make Swalo fast, memory efficient and scalable to large datasets. Figure 3 and Supplementary Table S8 show running times and memory usage of Swalo using 32 cores on a machine with Intel Xeon E5 2.70 GHz processors to scaffold Hunt et al. datasets.

Figure 3.


Running time and memory usage of SWALO. Barplots showing (A) running times and (B) memory usage of Swalo using 32 cores on a machine with Intel Xeon E5 2.70 GHz processors to scaffold Hunt et al. datasets.

Although a comparison of running times is not appropriate since Swalo was run on a different machine from the other scaffolding tools, we note that Swalo took from approximately a minute for the S. aureus datasets to around 70 min for the combined human chromosome 14 dataset (excluding the time required for mapping). The memory usage ranges from around 40 MB for S. aureus to 437 MB for the combined human chromosome 14 dataset. In that largest case, Swalo scaffolds 19 936 contigs from human chromosome 14 using 25.1 million reads.

Application to a large dataset

Finally, we apply Swalo to a large dataset from a bird genome, the budgerigar (Melopsittacus undulatus) dataset from Assemblathon 2 (54). We selected the contigs generated by SOAPdenovo and scaffolded them using the mate-pair libraries. The SOAPdenovo contig file provided included 245 857 contigs with a total length of ∼1.1 Gb and the five mate-pair libraries had more than 730 million read pairs in total. We chose the SOAPdenovo contigs as it is widely used for assembling large genomes including the original assembly of the budgerigar genome by Ganapathy et al. (62). SOAPdenovo has been observed to make large numbers of joins. Although the aggressive scaffolding often leads to many incorrect joins, this allows us to compare the number of joins made by Swalo to that of SOAPdenovo along with their accuracy in absence of a complete reference genome.

Table 2 shows a summary of the mate-pair libraries and the performance of Swalo on this dataset as well as the time and memory usage. We observe that Swalo makes a total of 80 669 joins in comparison to 94 464 joins made by SOAPdenovo. We find that Swalo automatically switches to a conservative mode (Supplementary Note 3.4) on libraries with insert size means of 10k, 20k and 40k, to keep the number of incorrect joins low.

Table 2.

Performance of Swalo on a large dataset from a bird genome. Description of five mate-pair libraries from the budgerigar (Melopsittacus undulatus) dataset from Assemblathon 2 (54), time and memory requirements of Swalo, and the number of joins made by it. The first two libraries were used in the combined mode. For the last three libraries, Swalo switches to a conservative mode automatically and so the hierarchical approach is used as recommended

| Accessions | # read pairs | Orientation | Insert size | Mode | Scaffolding time (hh:mm:ss) | Peak memory (GB) | # joins |
|---|---|---|---|---|---|---|---|
| ERR244148-150 | 264 708 963 | Mate-pair | 2000 | Combined | 27:36:12 | 5.09 | 70 634 |
| ERR244151-152 | 194 240 419 | Mate-pair | 5000 | | | | |
| ERR244153 | 94 282 287 | Mate-pair | 10 000 | Hierarchical, Conservative (auto) | 5:02:57 | 2.09 | 5372 |
| ERR244154 | 89 722 180 | Mate-pair | 20 000 | Hierarchical, Conservative (auto) | 5:26:47 | 2.14 | 2957 |
| ERR244155 | 87 489 651 | Mate-pair | 40 000 | Hierarchical, Conservative (auto) | 5:48:02 | 2.41 | 1706 |
| Total | 730 443 500 | | | | 43:53:58 | 5.09 | 80 669 |

We also find that Swalo takes less than 44 hours using 32 threads and a peak memory of 5.09 GB to scaffold 245 857 contigs using more than 730 million read pairs. It is worth noting that the times shown in Table 2 exclude the time taken by Bowtie 2 to map the reads which takes substantially more time than the time needed by Swalo to scaffold. This shows that Swalo is efficient and scalable, and thus applicable to scaffolding large genomes.

To assess the correctness of the scaffolds generated by Swalo in comparison to those by SOAPdenovo, we used the chromosome-level assembly of the budgerigar (Melopsittacus undulatus) genome by O’Connor et al. (63) generated through computational and lab-based approaches. The quality of the scaffolds as well as the original SOAPdenovo contigs was assessed using the quality assessment tool QUAST (64). Table 3 summarizes the numbers of correct joins and mis-assemblies made by Swalo and SOAPdenovo. We observe that Swalo makes 78 501 correct and 2168 incorrect joins, in comparison to 90 389 and 4075 by SOAPdenovo, i.e. Swalo makes only 13% fewer correct joins while making 47% fewer mis-assemblies than SOAPdenovo.

Table 3.

Comparison of results of Swalo and SOAPdenovo on the bird genome dataset. Comparison of number of scaffolds, total number of joins, and number of mis-assemblies introduced while scaffolding by Swalo and SOAPdenovo using five mate-pair libraries from the budgerigar (Melopsittacus undulatus) dataset from Assemblathon 2 (54). In addition, the original numbers of contigs and mis-assemblies in the contigs generated by SOAPdenovo are also shown

| Assembly | Number of contigs/scaffolds | Number of correct joins | Number of mis-assemblies (additional) |
|---|---|---|---|
| SOAPdenovo contigs | 245 857 | - | 5658 |
| SOAPdenovo scaffolds | 151 393 | 90 389 | 4075 |
| Swalo scaffolds | 166 894 | 78 501 | 2168 |
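The relative differences quoted for Table 3 follow from simple arithmetic on the join counts. The short check below reproduces them; the helper name is illustrative and the counts are those reported in the table.

```python
def percent_fewer(value, reference):
    """Percentage by which value falls below reference."""
    return 100.0 * (reference - value) / reference


soap_correct, soap_misassemblies = 90389, 4075
swalo_correct, swalo_misassemblies = 78501, 2168

fewer_correct = percent_fewer(swalo_correct, soap_correct)
fewer_misassemblies = percent_fewer(swalo_misassemblies, soap_misassemblies)
print(round(fewer_correct, 1), round(fewer_misassemblies, 1))
```

Rounded to whole numbers, these are the 13% and 47% figures given in the text.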

CONCLUSION

The results show that Swalo performs well consistently and is able to identify many correct joins while keeping the number of incorrect joins very low. It also shows Pareto-optimal performance in the datasets we have analyzed, i.e. there is a run of Swalo such that no other scaffolder in any of its runs was able to make more correct joins while making fewer incorrect joins than Swalo. We observe that consistent results are achieved when Swalo is used with Bowtie 2. However, when reads are largely error free, results achieved using Bowtie with no mismatches can be better, possibly due to reads being mapped to more regions compared to Bowtie 2.

Overall, we find that Swalo outperforms all other scaffolders on real and simulated datasets. This indicates that genome assembly may be substantially improved through the use of statistical models. The method may be further improved by modifying the heuristic used to select among multiple candidate joins and by considering global properties of the scaffold graph. The methods may also be extended to scaffolding with long reads generated by SMRT and nanopore sequencing. The improvement in scaffolding achieved by a practical method based on assembly likelihoods opens up the possibility that other problems related to genome assembly, such as reference guided assembly, mis-assembly correction, copy number estimation and gap-filling, may also be amenable to this approach.


ACKNOWLEDGEMENTS

We thank Dan Rokhsar, Páll Melsted, Harold Pimentel, Shannon McCurdy and Nicolas Bray for helpful conversations during the development of SWALO.

Contributor Information

Atif Rahman, Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720, USA; Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1205, Bangladesh.

Lior Pachter, Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720, USA; Departments of Mathematics and Molecular & Cell Biology, University of California, Berkeley, CA 94720, USA; Departments of Biology and Computing & Mathematical Sciences, California Institute of Technology, Pasadena, CA 91103, USA.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

NIH [R01 HG006129 to L.P., in part]; Fulbright Science & Technology Fellowship [15093630 to A.R., in part]. Funding for open access charge: NIH [R01 HG006129].

Conflict of interest statement. None declared.

REFERENCES

  • 1. Margulies M., Egholm M., Altman W.E., Attiya S., Bader J.S., Bemben L.A., Berka J., Braverman M.S., Chen Y.-J., Chen Z. et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005; 437:376–380.
  • 2. Harris T.D., Buzby P.R., Babcock H., Beer E., Bowers J., Braslavsky I., Causey M., Colonell J., Dimeo J., Efcavitch J.W. et al. Single-molecule DNA sequencing of a viral genome. Science. 2008; 320:106–109.
  • 3. Valouev A., Ichikawa J., Tonthat T., Stuart J., Ranade S., Peckham H., Zeng K., Malek J.A., Costa G., McKernan K. et al. A high-resolution, nucleosome position map of C. elegans reveals a lack of universal sequence-dictated positioning. Genome Res. 2008; 18:1051–1063.
  • 4. Rothberg J.M., Hinz W., Rearick T.M., Schultz J., Mileski W., Davey M., Leamon J.H., Johnson K., Milgrew M.J., Edwards M. et al. An integrated semiconductor device enabling non-optical genome sequencing. Nature. 2011; 475:348–352.
  • 5. Eid J., Fehr A., Gray J., Luong K., Lyle J., Otto G., Peluso P., Rank D., Baybayan P., Bettman B. et al. Real-time DNA sequencing from single polymerase molecules. Science. 2009; 323:133–138.
  • 6. Branton D., Deamer D.W., Marziali A., Bayley H., Benner S.A., Butler T., Di Ventra M., Garaj S., Hibbs A., Huang X. et al. The potential and challenges of nanopore sequencing. Nat. Biotechnol. 2008; 26:1146–1153.
  • 7. Koren S., Walenz B.P., Berlin K., Miller J.R., Bergman N.H., Phillippy A.M. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 2017; 27:722–736.
  • 8. Chin C.-S., Peluso P., Sedlazeck F.J., Nattestad M., Concepcion G.T., Clum A., Dunn C., O’Malley R., Figueroa-Balderas R., Morales-Cruz A. et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods. 2016; 13:1050–1054.
  • 9. Li H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics. 2016; 32:2103–2110.
  • 10. Kolmogorov M., Yuan J., Lin Y., Pevzner P.A. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 2019; 37:540–546.
  • 11. Kim J., Larkin D.M., Cai Q., Zhang Y., Ge R.-L., Auvil L., Capitanu B., Zhang G., Lewin H.A., Ma J. et al. Reference-assisted chromosome assembly. Proc. Natl. Acad. Sci. U.S.A. 2013; 110:1785–1790.
  • 12. Kolmogorov M., Raney B., Paten B., Pham S. Ragout—a reference-assisted assembly tool for bacterial genomes. Bioinformatics. 2014; 30:i302–i309.
  • 13. Bosi E., Donati B., Galardini M., Brunetti S., Sagot M.-F., Lió P., Crescenzi P., Fani R., Fondi M. MeDuSa: a multi-draft based scaffolder. Bioinformatics. 2015; 31:2443–2451.
  • 14. Fleischmann R.D., Adams M.D., White O., Clayton R.A., Kirkness E.F., Kerlavage A.R., Bult C.J., Tomb J.-F., Dougherty B.A., Merrick J.M. et al. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science. 1995; 269:496–512.
  • 15. Weber J.L., Myers E.W. Human whole-genome shotgun sequencing. Genome Res. 1997; 7:401–409.
  • 16. Huson D.H., Reinert K., Myers E.W. The greedy path-merging algorithm for contig scaffolding. J. ACM. 2002; 49:603–615.
  • 17. Yeo S., Coombe L., Warren R.L., Chu J., Birol I. ARCS: scaffolding genome drafts with linked reads. Bioinformatics. 2018; 34:725–731.
  • 18. Coombe L., Zhang J., Vandervalk B.P., Chu J., Jackman S.D., Birol I., Warren R.L. ARKS: chromosome-scale scaffolding of human genome drafts with linked read kmers. BMC Bioinformatics. 2018; 19:234.
  • 19. Weisenfeld N.I., Kumar V., Shah P., Church D.M., Jaffe D.B. Direct determination of diploid genome sequences. Genome Res. 2017; 27:757–767.
  • 20. Boetzer M., Pirovano W. SSPACE-LongRead: scaffolding bacterial draft genomes using long read sequence information. BMC Bioinformatics. 2014; 15:211.
  • 21. Warren R.L., Yang C., Vandervalk B.P., Behsaz B., Lagman A., Jones S.J., Birol I. LINKS: scalable, alignment-free scaffolding of draft genomes with long reads. GigaScience. 2015; 4:35.
  • 22. Wick R.R., Judd L.M., Gorrie C.L., Holt K.E. Unicycler: resolving bacterial genome assemblies from short and long sequencing reads. PLoS Comput. Biol. 2017; 13:e1005595.
  • 23. Cao M.D., Nguyen S.H., Ganesamoorthy D., Elliott A.G., Cooper M.A., Coin L.J. Scaffolding and completing genome assemblies in real-time with nanopore sequencing. Nat. Commun. 2017; 8:14515.
  • 24. Dudchenko O., Batra S.S., Omer A.D., Nyquist S.K., Hoeger M., Durand N.C., Shamim M.S., Machol I., Lander E.S., Aiden A.P. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science. 2017; 356:92–95.
  • 25. Burton J.N., Adey A., Patwardhan R.P., Qiu R., Kitzman J.O., Shendure J. Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nat. Biotechnol. 2013; 31:1119–1125.
  • 26. Kaplan N., Dekker J. High-throughput genome scaffolding from in vivo DNA interaction frequency. Nat. Biotechnol. 2013; 31:1143–1147.
  • 27. Putnam N.H., O’Connell B.L., Stites J.C., Rice B.J., Blanchette M., Calef R., Troll C.J., Fields A., Hartley P.D., Sugnet C.W. et al. Chromosome-scale shotgun assembly using an in vitro method for long-range linkage. Genome Res. 2016; 26:342–350.
  • 28. Ghurye J., Pop M., Koren S., Bickhart D., Chin C.-S. Scaffolding of long read assemblies using long range contact information. BMC Genomics. 2017; 18:527.
  • 29. Ghurye J., Rhie A., Walenz B.P., Schmitt A., Selvaraj S., Pop M., Phillippy A.M., Koren S. Integrating Hi-C links with assembly graphs for chromosome-scale assembly. PLoS Comput. Biol. 2019; 15:e1007273.
  • 30. Ghurye J., Pop M. Modern technologies and algorithms for scaffolding assembled genomes. PLoS Comput. Biol. 2019; 15:e1006994.
  • 31. Zerbino D.R., Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008; 18:821–829.
  • 32. Simpson J.T., Wong K., Jackman S.D., Schein J.E., Jones S.J., Birol I. ABySS: a parallel assembler for short read sequence data. Genome Res. 2009; 19:1117–1123.
  • 33. Butler J., MacCallum I., Kleber M., Shlyakhter I.A., Belmonte M.K., Lander E.S., Nusbaum C., Jaffe D.B. ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome Res. 2008; 18:810–820.
  • 34. Luo R., Liu B., Xie Y., Li Z., Huang W., Yuan J., He G., Chen Y., Pan Q., Liu Y. et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience. 2012; 1:18.
  • 35. Simpson J., Durbin R. Efficient de novo assembly of large genomes using compressed data structures. Genome Res. 2012; 22:549–556.
  • 36. Bankevich A., Nurk S., Antipov D., Gurevich A.A., Dvorkin M., Kulikov A.S., Lesin V.M., Nikolenko S.I., Pham S., Prjibelski A.D. et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 2012; 19:455–477.
  • 37. Pop M., Kosack D., Salzberg S. Hierarchical scaffolding with Bambus. Genome Res. 2004; 14:149–159.
  • 38. Koren S., Treangen T., Pop M. Bambus 2: scaffolding metagenomes. Bioinformatics. 2011; 27:2964–2971.
  • 39. Salmela L., Makinen V., Valimaki N., Ylinen J., Ukkonen E. Fast scaffolding with small independent mixed integer programs. Bioinformatics. 2011; 27:3259–3265.
  • 40. Gao S., Sung W.-K., Nagarajan N. Opera: reconstructing optimal genomic scaffolds with high-throughput paired-end sequences. J. Comput. Biol. 2011; 18:1681–1691.
  • 41. Gao S., Bertrand D., Chia B.K., Nagarajan N. OPERA-LG: efficient and exact scaffolding of large, repeat-rich eukaryotic genomes with performance guarantees. Genome Biol. 2016; 17:1.
  • 42. Donmez N., Brudno M. SCARPA: scaffolding reads with practical algorithms. Bioinformatics. 2013; 29:428–434.
  • 43. Dayarian A., Michael T., Sengupta A. SOPRA: scaffolding algorithm for paired reads via statistical optimization. BMC Bioinformatics. 2010; 11:345.
  • 44. Boetzer M., Henkel C., Jansen H., Butler D., Pirovano W. Scaffolding pre-assembled contigs using SSPACE. Bioinformatics. 2011; 27:578–579.
  • 45. Sahlin K., Vezzi F., Nystedt B., Lundeberg J., Arvestad L. BESST-efficient scaffolding of large fragmented assemblies. BMC Bioinformatics. 2014; 15:1.
  • 46. Hunt M., Newbold C., Berriman M., Otto T. A comprehensive evaluation of assembly scaffolding tools. Genome Biol. 2014; 15:R42.
  • 47. Salzberg S.L., Phillippy A.M., Zimin A., Puiu D., Magoc T., Koren S., Treangen T.J., Schatz M.C., Delcher A.L., Roberts M. et al. GAGE: a critical evaluation of genome assemblies and assembly algorithms. Genome Res. 2011; 22:557–567.
  • 48. Rahman A., Pachter L. CGAL: computing genome assembly likelihoods. Genome Biol. 2013; 14:R8.
  • 49. Chapman J.A., Ho I., Sunkara S., Luo S., Schroth G.P., Rokhsar D.S. Meraculous: De Novo genome assembly with short paired-end reads. PLoS ONE. 2011; 6:e23501.
  • 50. Sahlin K., Street N., Lundeberg J., Arvestad L. Improved gap size estimation for scaffolding algorithms. Bioinformatics. 2012; 28:2215–2222.
  • 51. Dempster A.P., Laird N.M., Rubin D.B. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. B (Methodological). 1977; 39:1.
  • 52. Trapnell C., Williams B.A., Pertea G., Mortazavi A., Kwan G., Van Baren M.J., Salzberg S.L., Wold B.J., Pachter L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 2010; 28:511–515.
  • 53. Medvedev P., Brudno M. Maximum likelihood genome assembly. J. Comput. Biol. 2009; 16:1101–1116.
  • 54. Bradnam K., Fass J., Alexandrov A., Baranay P., Bechner M., Birol I., Boisvert S., Chapman J., Chapuis G., Chikhi R. et al. Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. GigaScience. 2013; 2:10.
  • 55. Sahlin K., Chikhi R., Arvestad L. Assembly scaffolding with PE-contaminated mate-pair libraries. Bioinformatics. 2016; 32:1925–1932.
  • 56. Langmead B., Trapnell C., Pop M., Salzberg S. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009; 10:R25.
  • 57. Langmead B., Salzberg S.L. Fast gapped-read alignment with Bowtie 2. Nat. Methods. 2012; 9:357–359.
  • 58. Gnerre S., MacCallum I., Przybylski D., Ribeiro F.J., Burton J.N., Walker B.J., Sharpe T., Hall G., Shea T.P., Sykes S. et al. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc. Natl. Acad. Sci. U.S.A. 2011; 108:1513–1518.
  • 59. Zimin A.V., Marçais G., Puiu D., Roberts M., Salzberg S.L., Yorke J.A. The MaSuRCA genome assembler. Bioinformatics. 2013; 29:2669–2677.
  • 60. Miller J.R., Delcher A.L., Koren S., Venter E., Walenz B.P., Brownley A., Johnson J., Li K., Mobarry C., Sutton G. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics. 2008; 24:2818–2824.
  • 61. Li D., Liu C.-M., Luo R., Sadakane K., Lam T.-W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics. 2015; 31:1674–1676.
  • 62. Ganapathy G., Howard J.T., Ward J.M., Li J., Li B., Li Y., Xiong Y., Zhang Y., Zhou S., Schwartz D.C. et al. High-coverage sequencing and annotated assemblies of the budgerigar genome. GigaScience. 2014; 3:11.
  • 63. O’Connor R.E., Farré M., Joseph S., Damas J., Kiazim L., Jennings R., Bennett S., Slack E.A., Allanson E., Larkin D.M. et al. Chromosome-level assembly reveals extensive rearrangement in saker falcon and budgerigar, but not ostrich, genomes. Genome Biol. 2018; 19:171.
  • 64. Gurevich A., Saveliev V., Vyahhi N., Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013; 29:1072–1075.

Associated Data

Data Availability Statement

We use the data and analysis scripts used in (46,47) and (54). Scripts to install tools, download data and generate the results used in this paper are available at https://github.com/atifrahman/SWALO/tree/scripts.

The scaffolds generated are also available at the same URL. Results may vary slightly between runs because of thread races during the mapping of reads left unaligned by Bowtie, as random subsets of the unaligned reads are mapped.


Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press
